060 Texture Mapping Optimization
Just for funsies I've been playing around with implementing fast perspective-correct texture mapping on my 060 (again, since I've forgotten most of what I knew about software rasterization). Got most things working following Chris Hecker's seminal series [0], and of course Kalms also joined in on the fun [1]. FWIW this mostly works, but I found that the 28.4 subpixel precision used in [0] isn't sufficient in practice to avoid going outside texture bounds (.10 bits seems to be enough, though).
Anyway, the optimized inner loop from [0] (final part) looks like this in 68k asm (my translation, assuming standard 16.16 at first): Code:
; D0 = Initial Texture Offset: (U>>16)+(V>>16)*TextureWidth
Code:
add.l d4,d2 ; Cycle 1 [D0 ready in 2]
[0]: http://www.chrishecker.com/Miscellan...nical_Articles
[1]: https://www.lysator.liu.se/~mikaelk/...ectivetexture/
[2]: http://ada.untergrund.net/?p=boardthread&id=19
Other good discussions (not referenced) |
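For anyone following along, here is a minimal C sketch of what the 16.16 setup above boils down to before any 060 scheduling: the texel offset (U>>16)+(V>>16)*TextureWidth and a plain affine span loop. The names texel_offset and draw_affine_span and the 8-bit texture layout are assumptions made for illustration, not code from [0] or from the asm.

```c
#include <stddef.h>
#include <stdint.h>

/* Texel offset for 16.16 fixed-point U/V: integer parts only.
   (hypothetical helper, row-major texture assumed) */
static size_t texel_offset(uint32_t u, uint32_t v, size_t texture_width)
{
    return (size_t)(u >> 16) + (size_t)(v >> 16) * texture_width;
}

/* Plain affine span: fetch one texel per pixel, then step U and V.
   No bounds checking; the caller must keep U/V inside the texture. */
static void draw_affine_span(uint8_t *dst, size_t count,
                             const uint8_t *texture, size_t texture_width,
                             uint32_t u, uint32_t v,
                             uint32_t du, uint32_t dv)
{
    for (size_t i = 0; i < count; ++i) {
        dst[i] = texture[texel_offset(u, v, texture_width)];
        u += du;   /* 16.16 add, fraction carries into the integer part */
        v += dv;
    }
}
```

The optimized loops discussed in this thread get rid of the shifts and the multiply per pixel; this version only shows what they have to compute.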
I'm not super familiar with 060 dual issue stuff, but you may want to take a look at Georg Steger's 060 DOOM floor mapping routines.
https://github.com/mheyer32/DoomAtta...gine.asm#L8128
I found it clever how the fractional parts of the texture coordinates always stay in the upper word, while the integer parts stay in the lower word. No swaps etc. necessary. He did it by "cross-packing" the X/Y parts like this:
deltas:
a = fracDY|intDX
b = fracDX|intDY
x,y:
c = fracY|intX
d = fracX|intY
Now if you do c = ADDX(c, a) and adding fracDY overflows, the following d = ADDX(d, b) will add 1 to intY. In turn, if fracX overflows in that same addition, the next c = ADDX(c, a) will increase the integer part of X, and so forth...
This might be old news to you, but I found it a pretty cool technique! |
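To make the carry chain easier to follow, here is a small C emulation of the cross-packed ADDX stepping described above. It is only a sketch: the struct and function names are made up, the X flag is modeled as a plain variable, and it assumes the integer parts never overflow their 16-bit word (otherwise that carry would spill into the fraction above it).

```c
#include <stdint.h>

/* Cross-packed texture coordinates plus an emulated 68k X (extend) flag. */
typedef struct {
    uint32_t c;   /* fracY in the upper word, intX in the lower word */
    uint32_t d;   /* fracX in the upper word, intY in the lower word */
    unsigned x;   /* carry left over from the previous 32-bit add    */
} PackedUV;

/* Builds (fraction of frac_src) | (integer of int_src) from 16.16 values. */
static uint32_t pack_words(uint32_t frac_src, uint32_t int_src)
{
    return (frac_src << 16) | (int_src >> 16);
}

static PackedUV packed_init(uint32_t u, uint32_t v)
{
    PackedUV p = { pack_words(v, u), pack_words(u, v), 0 };
    return p;
}

/* dst + src + X, storing the carry out of bit 31 back into X (like ADDX.L). */
static uint32_t addx32(uint32_t dst, uint32_t src, unsigned *x)
{
    uint64_t sum = (uint64_t)dst + src + *x;
    *x = (unsigned)(sum >> 32);
    return (uint32_t)sum;
}

/* One pixel step: a = fracDY|intDX, b = fracDX|intDY. */
static void packed_step(PackedUV *p, uint32_t a, uint32_t b)
{
    p->c = addx32(p->c, a, &p->x);  /* fracY carry goes into X         */
    p->d = addx32(p->d, b, &p->x);  /* X bumps intY, fracX carry -> X  */
}

/* The fracX carry is still pending in X after a step, so add it here. */
static uint32_t packed_int_x(const PackedUV *p)
{
    return ((p->c & 0xFFFFu) + p->x) & 0xFFFFu;
}

static uint32_t packed_int_y(const PackedUV *p)
{
    return p->d & 0xFFFFu;
}
```

With u = 1.5, v = 2.75, du = 0.75 and dv = 1.25 in 16.16, one call to packed_step gives intX = 2 and intY = 4, matching the ordinary 16.16 result of (2.25, 4.0).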
Thanks, sounds similar to what's described at https://amycoders.org/opt/innerloops.html but it's nice to see it put into practice with 060 annotations. Even if it's old news, the point is that I've forgotten most of it :)
Managed to squeeze a bit more performance out of it by doing more, but less complicated work: Code:
add.l d4,d2 ; VF+=DVF Cycle 0/5 [sOEP in subsequent ones]
Code:
Time (ns) Per pixel (cycles) |
Just for information: I came across this massive 3D coding page some time ago.
https://mikro.naprvyraz.sk/docs/ |
For 060 specific stuff it seems that reaching for the FPU should be preferred to (high-precision) fixedpoint math (sorry LC owners). Really hurts that they cut 32x32->64bit multiply and 64/32->32 division from HW. Lack of the former also means that you can't even turn most 32-bit integer divisions by a constant into multiplications to increase performance. |
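To illustrate the last point: the usual trick replaces x/10 with a multiply by a precomputed reciprocal and a shift, but it needs the upper half of the 32x32 product. A minimal sketch in C, assuming unsigned 32-bit input; the function name is made up, and 0xCCCCCCCD with a shift of 35 is the well-known reciprocal constant for dividing by 10.

```c
#include <stdint.h>

/* x / 10 without a divide instruction: multiply by ceil(2^35 / 10)
   and keep the bits above position 35 of the 64-bit product. */
static uint32_t div10_via_mulhi(uint32_t x)
{
    uint64_t wide = (uint64_t)x * 0xCCCCCCCDu;  /* needs 32x32->64 */
    return (uint32_t)(wide >> 35);
}
```

The 64-bit intermediate is exactly the part the 060 no longer provides in hardware, so this form only pays off where a fast 32x32->64 multiply exists.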
So, I will take the opportunity to remind you about the often forgotten 64-bit u/s multiply functions of utility.library. These functions were handled by the exception trap (ISP code) in older 68060.libraries. That's exactly why I put some extra work into both libraries here:
- Added optimized Mult64u/s ISP patch to utility.library functions (much faster than exception trap code)
https://eab.abime.net/showthread.php?t=96791
https://eab.abime.net/showthread.php?t=101115
Unfortunately, there were no Div64u/s functions in utility.library to patch. ;) |
That's not the case anymore. 3.2 and up provides a 64/32 division in utility. Or actually two of them. No need to patch them, the utility.library already provides an implementation of them that is 68060-friendly. |
http://aminet.net/package/util/boot/Mult64Patch
http://aminet.net/package/util/boot/UtilPatch
But thanks for the information anyway, for the next library release I should add an exec version test before the utility.library patch (so OS 3.14+ users can get their money's worth *Cough*). :D |
While at it, be aware that there are a couple of other issues with UMult64/SMult64 on some processors and some kickstarts, namely returning the result in the wrong order (hi and lo swapped). Later versions of SetPatch take care of that as well. The 3.1 SetPatch probably does not. It *may* be included in the 3.9 SetPatch, though I'm not sure at this moment. Note that the above means that potentially any patch you install there may be overridden by SetPatch once again if the returned order is incorrect. |
Code:
-------------------- AmigaOS 3.1.4.(1) project ----------------------- |
Since both vbcc and (newer) gcc versions call their own support libraries when compiling for 060 rather than relying on either the 68060.library fallback or utility.library, and I know to avoid those instructions in my asm code, it doesn't seem super relevant what some patches may or may not do.
My point was that for 060 it really seems like common optimizations that would apply to 486/Pentium (and maybe 040?), like favoring 16.16 fixed point over floating point in certain situations, rarely translate. For cases where you absolutely need e.g. 128-bit intermediate results things may be different, but for demo/game stuff that hardly ever applies. |
As far as the MULU/MULS utility.library code is concerned, there is not really much to optimize. Frankly, if your code is really time critical, then the switch between 68020-68040 and 68060 code needs to be made at a much coarser level as otherwise the additional instructions to move values to the source registers, calling utility and moving the results back to where they are needed will eat up any micro-optimization in the above function. That does not make it useless - it is just your average service function you need for example for large disk support, e.g. by the FFS, the HDToolBox and workbench. I believe the FFS use case triggered the necessity to have a 060 detection within the utility.library such that 060 systems without an F-Space "debug ROM" could safely boot from large disks.
The 64-bit divide is another issue, as its full case (the 64/32 full divide) requires a bit more care. Some optimizations have been performed there that go beyond the actual Mot ISP code, which is just a straight implementation of Knuth's "Algorithm D", but I believe that this requires a bit more thought and a more careful comparison of that and the classical "Egyptian" (binary) division algorithm if you want to be faster. Not that it matters often. |
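For reference, the classical binary (shift-and-subtract) 64/32 division mentioned above can be sketched in C as follows. This is a plain illustration of the textbook algorithm, not the utility.library or Motorola ISP code; the function name and the error convention (-1 on a zero divisor or a quotient that does not fit in 32 bits) are assumptions.

```c
#include <stdint.h>

/* Unsigned (hi:lo) / d -> 32-bit quotient and remainder, one bit per step.
   Returns 0 on success, -1 if d == 0 or the quotient would overflow. */
static int udiv64_32_binary(uint32_t hi, uint32_t lo, uint32_t d,
                            uint32_t *quotient, uint32_t *remainder)
{
    if (d == 0 || hi >= d)
        return -1;

    uint32_t rem = hi;
    uint32_t quo = 0;
    for (int i = 31; i >= 0; --i) {
        unsigned carry = rem >> 31;            /* bit 32 of the shifted remainder */
        rem = (rem << 1) | ((lo >> i) & 1u);   /* bring down the next dividend bit */
        quo <<= 1;
        if (carry || rem >= d) {               /* true remainder >= d: subtract    */
            rem -= d;                          /* wraps correctly when carry is set */
            quo |= 1u;
        }
    }
    *quotient = quo;
    *remainder = rem;
    return 0;
}
```

The loop always runs 32 iterations, which is what makes it a useful baseline when comparing against an Algorithm D style implementation.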
Here is a divide implementation that may be usable: https://ridiculousfish.com/blog/post...episode-v.html
Adjusted for the 64/32: http://franke.ms/cex/z/j78asa |
Otherwise doing a (poor) synthetic benchmark with VBCC I get: Code:
BuiltinMul 1626 ns [81 cycles]
Code:
volatile uint32_t x=0x20000000;
Not dunking on anyone involved here, and of course stuff can be optimized, but like I wrote earlier I think it's clear that for 060 (with a FPU) it's not a good idea to use e.g. 16.16 fixedpoint instead of floats/doubles if you need to do anything apart from adding and subtracting. |
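As a concrete picture of why 16.16 gets expensive beyond add/subtract, here is a minimal C sketch of fixed-point multiply and divide. The type and function names are hypothetical; the point is only that both operations go through a 64-bit intermediate, which is the 32x32->64 multiply and 64/32 divide the 060 lacks in hardware.

```c
#include <stdint.h>

typedef int32_t fix16;   /* 16.16 fixed point (hypothetical typedef) */

/* a * b: the full product is 32.32, so it needs 64 bits before the shift.
   Right-shifting a negative value is implementation-defined in C; the
   usual compilers do an arithmetic shift, which is assumed here. */
static fix16 fix16_mul(fix16 a, fix16 b)
{
    return (fix16)(((int64_t)a * b) >> 16);
}

/* a / b: pre-scale the dividend to 32.32, then do a 64/32 divide.
   b must not be zero. */
static fix16 fix16_div(fix16 a, fix16 b)
{
    return (fix16)(((int64_t)a * 65536) / b);
}
```

The float equivalents are a single multiply or divide, which is why the FPU tends to win here.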
Since I don't have access to a real 68060... what about this implementation?
Code:
#extern long sdiv64(long hi asm("d1"), long lo asm("d0"), long d asm("d2")); |
I also found what I forgot to change: __builtin_clzll must be replaced by __builtin_clz, otherwise the shift is way too big. => http://franke.ms/cex/z/Yd6E14 |
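A small sketch of why that matters, assuming GCC-style builtins and a 32-bit unsigned int (as on m68k): the normalization step of the division shifts the divisor left until its top bit is set, so the shift has to be counted against the width of the divisor. The helper names are hypothetical and this is not the code behind the link.

```c
#include <stdint.h>

/* Shift that moves the highest set bit of a 32-bit divisor to bit 31.
   d must be non-zero (__builtin_clz(0) is undefined). */
static unsigned norm_shift32(uint32_t d)
{
    return (unsigned)__builtin_clz(d);
}

/* The 64-bit count for the same value: 32 more leading zeros, which is
   the "way too big" shift when the divisor is only 32 bits wide. */
static unsigned norm_shift64(uint64_t d)
{
    return (unsigned)__builtin_clzll(d);
}
```

For d = 0x00010000, norm_shift32 gives 15 while norm_shift64 gives 47.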
Cool, I did consider trying to extract the result from the extended precision result, but hadn't had time. Doesn't seem to be a massive speed improvement, but if the conversion/rounding can be optimized it might have potential. Results:
smul64: 69 cycles
sdiv64: 138 cycles
divllu (returning just quotient): 175 cycles
I measured overhead to be ~37 cycles (just calling a dummy asm routine with 2 stack arguments and storing a result in "r").
P.S. isn't it a bit dangerous to use a stack "red zone" like that? Probably OK if not in supervisor mode, but seems sketchy :) |
Unfolded the sign handling and added proper stack handling, which reduced the object size by 16 bytes. Should be a tad faster now. Code:
#extern long sdiv64(long hi asm("d1"), long lo asm("d0"), long d asm("d2")); |