24 July 2022, 20:08 | #1 |
Registered User
Join Date: Feb 2017
Location: Denmark
Posts: 1,099
|
060 Texture Mapping Optimization
Just for funsies I've been playing around with implementing perspective correct texture mapping fast on my 060 (again, since I've forgotten most of what I knew about software rasterization). Got most things working following Chris Hecker's seminal series [0], and of course Kalms also joined in on the fun [1]. FWIW this mostly works, but I found that the 28.4 subpixel precision used in [0] isn't sufficient in practice to avoid going outside texture bounds (.10, i.e. 10 fractional bits, seems to be enough though).
Anyway, the optimized inner loop from [0] (final part) looks like this in 68k asm (my translation, assuming standard 16.16 at first): Code:
; D0 = Initial Texture Offset: (U>>16)+(V>>16)*TextureWidth
; D1 = UFrac  : U<<16 (0.32 fixpoint)
; D2 = VFrac  : V<<16 (0.32 fixpoint)
; D3 = DUFrac : DUDX<<16 (0.32 fixpoint)
; D4 = DVFrac : DVDX<<16 (0.32 fixpoint)
; A0 = Dest    : Where pixels will be drawn
; A1 = Texture
; A2 = UIntVintFrac : Points to second element of array containing:
;      { (DUDX>>16)+((DVDX>>16)+1)*TextureWidth,
;        (DUDX>>16)+(DVDX>>16)*TextureWidth }
;      i.e. element[0] is a normal increment and element[-1] is when there's a V carry

	move.b	(a1,d0.l),(a0)+    ; *Dest++ = Texture[Offset]
	add.l	d4,d2              ; VFrac += DVFrac
	subx.l	d5,d5              ; Save carry
	move.l	(a2,d5.l*4),d7     ; Get index from UIntVintFrac
	add.l	d3,d1              ; UFrac += DUFrac
	addx.l	d7,d0              ; Offset += UIntVint + carry

Scheduled for the 060's pOEP/sOEP pipelines it becomes: Code:
	add.l	d4,d2            ; Cycle 1 [D0 ready in 2] ; sOEP free
	subx.l	d5,d5            ; Cycle 2 [D0 ready in 1] ; sOEP unavailable
	move.b	(a1,d0.l),d6     ; Cycle 3 [D0 ready?, D5 ready in 2]
	add.l	d3,d1
	move.b	d6,(a0)+         ; Cycle 4 [D5 ready in 1] ; sOEP free
	move.l	(a2,d5.l*4),d7   ; Cycle 5 [D5 ready?] ; sOEP free
	addx.l	d7,d0            ; Cycle 6 ; sOEP unavailable

[0]: http://www.chrishecker.com/Miscellan...nical_Articles
[1]: https://www.lysator.liu.se/~mikaelk/...ectivetexture/
[2]: http://ada.untergrund.net/?p=boardthread&id=19 (other good discussions, not referenced above)

Last edited by paraj; 24 July 2022 at 20:13. |
24 July 2022, 23:20 | #2 |
Registered User
Join Date: Jul 2017
Location: San Jose
Posts: 652
|
I'm not super familiar with 060 dual issue stuff, but you may want to take a look at Georg Steger's 060 DOOM floor mapping routines.
https://github.com/mheyer32/DoomAtta...gine.asm#L8128

I found it clever how the fractional parts of the texture coordinates always stay in the upper word, while the integer parts stay in the lower word. No swaps etc. necessary. He did it by "cross-packing" the X/Y parts like this:

deltas:
a = fracDY|intDX
b = fracDX|intDY

x,y:
c = fracY|intX
d = fracX|intY

Now if you do c = ADDX(c, a) and fracDY overflows, d = ADDX(d, b) will add 1 to intY, and in turn, if in the same addition fracX overflows, the next c = ADDX(c, a) will increase the integer part of X, and so forth... This might be old news to you, but I found it a pretty cool technique! |
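If it helps, here is a rough portable C model of that cross-packing trick (the struct and field names are mine, not from the DOOM source); the 68k X flag is modelled explicitly so the carry chain between the two ADDXs is visible:

```c
#include <stdint.h>

typedef struct {
    uint32_t c;      /* fracY|intX */
    uint32_t d;      /* fracX|intY */
    uint32_t x_flag; /* models the 68k X (extend) flag between the addx's */
} PackedUV;

/* One step: c = ADDX(c, a); d = ADDX(d, b).
   Because each fraction sits in the HIGH word, a fraction overflow becomes
   the carry-out of the 32-bit add, and the following addx folds that carry
   into the OTHER word's low-half integer. */
static void step(PackedUV *p, uint32_t a, uint32_t b)
{
    uint64_t t = (uint64_t)p->c + a + p->x_flag;
    p->c = (uint32_t)t;
    p->x_flag = (uint32_t)(t >> 32);   /* fracY overflow -> bumps intY next */

    t = (uint64_t)p->d + b + p->x_flag;
    p->d = (uint32_t)t;
    p->x_flag = (uint32_t)(t >> 32);   /* fracX overflow -> bumps intX next */
}
```

After each step, intX is `p->c & 0xFFFF` and intY is `p->d & 0xFFFF`; e.g. with a = 0x80000001 (fracDY=0.5, intDX=1) and b = 0x40000002 (fracDX=0.25, intDY=2), five steps from zero give intX = 6 (five intDX plus one fracX carry) and intY = 12 (ten plus two fracY carries).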
25 July 2022, 12:45 | #3 |
Registered User
Join Date: Feb 2017
Location: Denmark
Posts: 1,099
|
Thanks, sounds similar to what's described at https://amycoders.org/opt/innerloops.html but it's nice to see it put into practice with 060 annotations. Even if it's old news, the point is that I've forgotten most of it
Managed to squeeze a bit more performance out of it by doing more, but less complicated work: Code:
	add.l	d4,d2           ; VF+=DVF                          Cycle 0/5 [sOEP in subsequent ones]
	subx.l	d7,d7           ; Save carry                       Cycle 1 pOEP-only
	move.b	(a1,d0.l),d6    ; Load pixel                       Cycle 2 pOEP
	add.l	a2,d0           ; Add Int(DU)+Int(DV)*TextureWidth         sOEP
	add.l	d3,d1           ; UF+=DUF                          Cycle 3 pOEP
	and.l	d5,d7           ; VCarry*TextureWidth                      sOEP
	addx.l	d7,d0           ; +=VCarry*TextureWidth            Cycle 4 pOEP-only
	move.b	d6,(a0)+        ; Store pixel                      Cycle 5 pOEP
Code:
            Time (ns)   Per pixel (cycles)
Overhead      651.6     N/A
Original     3468.8     8.80
Optimized    2512.2     5.81
New          2286.7     5.11
|
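For anyone following along, the branchless carry trick in that loop can be modelled per pixel in C like this (names are mine; `const_step` is the A2 value, i.e. Int(DU)+Int(DV)*TextureWidth, and `tex_width` is the D5 value):

```c
#include <stdint.h>

/* One pixel of the branchless 060 inner loop, modelled in C (sketch). */
static void pixel_step(uint8_t **dest, const uint8_t *texture,
                       uint32_t *offset, uint32_t *ufrac, uint32_t *vfrac,
                       uint32_t dufrac, uint32_t dvfrac,
                       uint32_t const_step, uint32_t tex_width)
{
    *(*dest)++ = texture[*offset];              /* move.b (a1,d0.l),d6 / store */

    uint64_t v = (uint64_t)*vfrac + dvfrac;     /* add.l d4,d2 */
    uint32_t vcarry_mask = 0u - (uint32_t)(v >> 32); /* subx.l d7,d7: 0 or ~0 */
    *vfrac = (uint32_t)v;

    uint64_t u = (uint64_t)*ufrac + dufrac;     /* add.l d3,d1, X = U carry */
    uint32_t ucarry = (uint32_t)(u >> 32);
    *ufrac = (uint32_t)u;

    /* add.l a2,d0 / and.l d5,d7 / addx.l d7,d0:
       constant step, plus TextureWidth only on a V carry, plus the U carry */
    *offset += const_step + (vcarry_mask & tex_width) + ucarry;
}
```

With a 4-byte-wide texture, du = 1.5 and dv = 0.5 (so const_step = 1), the offsets visited from zero are 0, 1, 7, 8, ..., matching what the masked add produces in the asm.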
31 July 2022, 21:08 | #4 |
German Translator
Join Date: Aug 2018
Location: Drübeck / Germany
Age: 49
Posts: 183
|
Just for information: I found this massive 3D coding page at some point.
https://mikro.naprvyraz.sk/docs/ |
02 August 2022, 17:50 | #5 | |
Registered User
Join Date: Feb 2017
Location: Denmark
Posts: 1,099
|
Quote:
For 060-specific stuff it seems that reaching for the FPU should be preferred over (high-precision) fixed-point math (sorry, LC owners). It really hurts that they cut the 32x32->64 bit multiply and 64/32->32 division from hardware. Lack of the former also means that you can't even turn most 32-bit integer divisions by a constant into multiplications to increase performance.
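For context on that last point, the division-by-constant trick in question looks like this in C; it hinges on the high half of a 32x32->64 product, which is precisely what the 060 dropped (the constant 10 and its magic number are just the textbook example):

```c
#include <stdint.h>

/* Divide by the constant 10 via multiplication: 0xCCCCCCCD is
   ceil(2^35 / 10), and (x * 0xCCCCCCCD) >> 35 equals x / 10 for every
   uint32_t x. The >> 35 only needs the high 32 bits of the product --
   the part of the multiply the 68060 no longer does in hardware. */
static uint32_t div10(uint32_t x)
{
    return (uint32_t)(((uint64_t)x * 0xCCCCCCCDu) >> 35);
}
```

On CPUs with a full 32x32->64 multiply a compiler emits exactly this shape for `x / 10`; without it, the high-half multiply itself has to be synthesized, which eats the win.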
|
02 August 2022, 23:16 | #6 | |
Moderator
Join Date: Dec 2010
Location: Wisconsin USA
Age: 60
Posts: 839
|
Quote:
So, I will take the opportunity to remind you about the often forgotten 64-bit u/s multiply functions of utility.library. These functions were handled by the exception trap (ISP code) in older 68060.libraries. That's exactly why I put some extra work into both libraries here:

- Added optimized Mult64u/s ISP patch to utility.library functions (much faster than exception trap code)

https://eab.abime.net/showthread.php?t=96791
https://eab.abime.net/showthread.php?t=101115

Unfortunately, there were no Div64u/s functions in utility.library to patch.
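For readers wondering what such a Mult64 implementation has to work with: the 68060 kept only the 32x32->32 `mulu.l`, so a software 64-bit product has to be pieced together from partial products, roughly like this C sketch (illustrative only; this is not the utility.library or patch source):

```c
#include <stdint.h>

/* Build a 64-bit product from 16x16->32 partial products, the classic
   long-multiplication shape a software UMult64 boils down to. */
static void umult64(uint32_t a, uint32_t b, uint32_t *hi, uint32_t *lo)
{
    uint32_t a_lo = a & 0xFFFF, a_hi = a >> 16;
    uint32_t b_lo = b & 0xFFFF, b_hi = b >> 16;

    uint32_t p0 = a_lo * b_lo;  /* contributes to bits  0..31 */
    uint32_t p1 = a_lo * b_hi;  /* contributes to bits 16..47 */
    uint32_t p2 = a_hi * b_lo;  /* contributes to bits 16..47 */
    uint32_t p3 = a_hi * b_hi;  /* contributes to bits 32..63 */

    /* sum the middle column; it cannot overflow 32 bits */
    uint32_t mid = (p0 >> 16) + (p1 & 0xFFFF) + (p2 & 0xFFFF);

    *lo = (p0 & 0xFFFF) | (mid << 16);
    *hi = p3 + (p1 >> 16) + (p2 >> 16) + (mid >> 16);
}
```

The thread's later discussion about hi/lo register order is exactly about which of these two halves ends up in d0 vs d1.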
|
03 August 2022, 13:47 | #7 | ||
Registered User
Join Date: Jan 2019
Location: Germany
Posts: 3,216
|
Quote:
Quote:
That's not the case anymore. 3.2 and up provides a 64/32 division in utility. Or actually two of them. No need to patch them, the utility.library already provides an implementation of them that is 68060-friendly. |
||
03 August 2022, 15:20 | #8 | ||
Moderator
Join Date: Dec 2010
Location: Wisconsin USA
Age: 60
Posts: 839
|
Quote:
http://aminet.net/package/util/boot/Mult64Patch http://aminet.net/package/util/boot/UtilPatch Quote:
But thanks for the information anyway, for the next library release I should add an exec version test before the utility.library patch (so OS 3.14+ users can get their money's worth *Cough*). Last edited by SpeedGeek; 03 August 2022 at 16:18. |
||
03 August 2022, 16:22 | #9 | ||
Registered User
Join Date: Jan 2019
Location: Germany
Posts: 3,216
|
Quote:
Quote:
While at it, be aware that there are a couple of other issues with UMult64/SMult64 on some processors and some kickstarts, namely returning the result in the wrong order (hi and lo swapped). Later versions of SetPatch take care of that as well. The 3.1 SetPatch probably does not. It *may* be included in the 3.9 SetPatch, though I'm not sure at this moment. Note that the above means that potentially any patch you install there may be overridden by SetPatch once again if the returned order is incorrect. |
||
03 August 2022, 17:19 | #10 | ||
Moderator
Join Date: Dec 2010
Location: Wisconsin USA
Age: 60
Posts: 839
|
Quote:
Quote:
Code:
-------------------- AmigaOS 3.1.4.(1) project -----------------------

Changes for release 45.1 (1.1.2018)

- The 64-bit math routines now return the results in proper order in the
  68000 version, namely with the low 32 bits in register d0 and the high
  32 bits in register d1.
- Added 68060 specialized versions of 64bit math. Note that these functions
  are currently never enabled as exec does not identify the 68060.
- Improved the 68000 version of SMult64 a tiny little bit.
- Retired the 68020-only version of utility.
||
03 August 2022, 18:29 | #11 |
Registered User
Join Date: Feb 2017
Location: Denmark
Posts: 1,099
|
Since both vbcc and (newer) gcc call their own support libraries when compiling for 060, rather than relying on either the 68060.library fallback or utility.library, and since I know to avoid those instructions in my asm code, it doesn't seem super relevant what some patches may or may not do.
My point was that for the 060, common optimizations that apply on the 486/Pentium (and maybe the 040?), like favoring 16.16 fixed-point over floating-point numbers in certain situations, rarely translate. For cases where you absolutely need e.g. 128-bit intermediary results things may be different, but for demo/game stuff that hardly ever applies. |
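The tradeoff in miniature, as a C sketch (names are mine; the point is where the 64-bit intermediate hides):

```c
#include <stdint.h>

typedef int32_t fix16;   /* 16.16 fixed point */

/* A 16.16 multiply needs the 64-bit intermediate product that the 68060
   cannot produce in hardware, so on the 060 this ends up in a trap or a
   compiler support routine. */
static fix16 fix_mul(fix16 a, fix16 b)
{
    return (fix16)(((int64_t)a * b) >> 16);
}

/* The FPU version is a single short fmul on the 060. */
static double flt_mul(double a, double b)
{
    return a * b;
}
```

Addition and subtraction stay cheap in fixed point either way; it's the multiply/divide-heavy inner loops where the FPU wins on this CPU.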
03 August 2022, 18:31 | #12 | |
Registered User
Join Date: Jan 2019
Location: Germany
Posts: 3,216
|
Quote:
You should probably read this stuff to the very end. That's all old news. Utility does have a 68060 detection function in it, no need to worry, even as of 3.1.4. |
|
03 August 2022, 19:36 | #13 | |||
Moderator
Join Date: Dec 2010
Location: Wisconsin USA
Age: 60
Posts: 839
|
Quote:
Quote:
Quote:
Last edited by SpeedGeek; 04 August 2022 at 14:50. |
|||
04 August 2022, 07:23 | #14 |
Registered User
Join Date: Jan 2019
Location: Germany
Posts: 3,216
|
As far as the MULU/MULS utility.library code is concerned, there is not really much to optimize. Frankly, if your code is really time critical, then the switch between 68020-68040 and 68060 code needs to be made at a much coarser level as otherwise the additional instructions to move values to the source registers, calling utility and moving the results back to where they are needed will eat up any micro-optimization in the above function. That does not make it useless - it is just your average service function you need for example for large disk support, e.g. by the FFS, the HDToolBox and workbench. I believe the FFS use case triggered the necessity to have a 060 detection within the utility.library such that 060 systems without an F-Space "debug ROM" could safely boot from large disks.
The 64-bit divide is another issue, as its full case (the 64/32 full divide) requires a bit more care. Some optimizations have been performed there that go beyond the actual Mot ISP code, which is just a straight implementation of Knuth's "Algorithm D", but I believe this requires a bit more thought, and a more careful comparison against the classical "egyptian" (binary) division algorithm, if you want to be faster. Not that it matters often.
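For reference, the classical binary ("egyptian") restoring division mentioned above looks like this for the 64/32 case (a sketch assuming hi < d, i.e. no quotient overflow; real implementations unroll the loop and handle the overflow case up front):

```c
#include <stdint.h>

/* 64/32 -> 32-bit quotient and remainder by shift-and-subtract:
   one quotient bit per iteration, 32 iterations. Valid when hi < d. */
static uint32_t udiv64_32(uint32_t hi, uint32_t lo, uint32_t d, uint32_t *rem)
{
    uint64_t r = hi;   /* partial remainder, kept < d between iterations */
    uint32_t q = 0;

    for (int i = 0; i < 32; i++) {
        r = (r << 1) | (lo >> 31);   /* bring down the next dividend bit */
        lo <<= 1;
        q <<= 1;
        if (r >= d) {                /* trial subtract succeeded */
            r -= d;
            q |= 1;
        }
    }
    *rem = (uint32_t)r;
    return q;
}
```

Algorithm D instead guesses whole 16- or 32-bit quotient digits via a hardware divide and corrects them, so which approach wins depends heavily on how expensive the per-bit loop is versus the digit-estimate machinery.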
04 August 2022, 14:25 | #15 |
bye
Join Date: Jun 2016
Location: Some / Where
Posts: 680
|
There is a possibly usable divide implementation: https://ridiculousfish.com/blog/post...episode-v.html
Adjusted for the 64/32: http://franke.ms/cex/z/j78asa |
04 August 2022, 20:01 | #16 | |
Registered User
Join Date: Feb 2017
Location: Denmark
Posts: 1,099
|
Quote:
Otherwise doing a (poor) synthetic benchmark with VBCC I get: Code:
BuiltinMul    1626 ns   [81 cycles]
BuiltinDiv    5490 ns   [275 cycles]
UMult64       1418 ns   [71 cycles]
EmuMul       62797 ns   [3140 cycles]
EmuDiv       73514 ns   [3676 cycles]
FMul          1966 ns   [98 cycles]
FDiv          2330 ns   [116 cycles]
Code:
volatile uint32_t x = 0x20000000;
volatile uint32_t y = 0x10000000;
volatile uint64_t r;

void BuiltinMul(void) { r = (uint64_t)x * y; }

uint64_t __EmuMul(__reg("d0") ULONG, __reg("d1") ULONG) = "\tmulu.l\td0,d0:d1";
void EmuMul(void) { r = __EmuMul(x, y); }

void FMul(void) { r = (uint64_t)((double)x * y); }

Not dunking on anyone involved here, and of course stuff can be optimized, but like I wrote earlier I think it's clear that for the 060 (with an FPU) it's not a good idea to use e.g. 16.16 fixed-point instead of floats/doubles if you need to do anything apart from adding and subtracting.
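One caveat worth flagging on the FMul numbers: a double carries only a 53-bit significand, so the float route is an exact 64-bit multiply only when the true product happens to be representable, as the power-of-two test values above are. A quick check (my own helper, not part of the benchmark):

```c
#include <stdint.h>

/* Returns 1 when computing the 64-bit product via double gives the exact
   integer result, 0 when rounding to 53 significand bits loses low bits. */
static int fmul_is_exact(uint32_t x, uint32_t y)
{
    uint64_t exact  = (uint64_t)x * y;
    uint64_t viaflt = (uint64_t)((double)x * y);
    return exact == viaflt;
}
```

So the FPU shortcut is fine for e.g. scaled coordinates that fit well under 2^53, but it is not a drop-in replacement for a true UMult64.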
|
05 August 2022, 11:25 | #17 |
bye
Join Date: Jun 2016
Location: Some / Where
Posts: 680
|
Since I don't have access to a real 68060... what about this implementation?
Code:
#extern long sdiv64(long hi asm("d1"), long lo asm("d0"), long d asm("d2"));
_sdiv64:
	.globl	_sdiv64
	tst.l	d1
	bne.s	.ldiv
	divs.l	d2,d0
	rts

.mi:
	move.l	#0x8000,a0
	fmove.d	#0e-1.25e-1,fp1
	neg.l	d0
	negx.l	d1
	tst.l	d1
	bne.s	.ldivs
	divs.l	d2,d0
	neg.l	d0
	rts

.ldiv:
	bmi	.mi
	sub.l	a0,a0
.ldivs:
	fmove.s	#0e1.25e-1,fp1
	exg	d3,a1
	bfffo	d1{#0,#0},d3
	sub.w	#32,d3
	neg.w	d3
	bfins	d1,(-8,a7){0,d3}
	bfins	d0,(-8,a7){d3,32}
	add.w	#16382+32,d3
	add.l	a0,d3
	swap	d3
	move.l	d3,(-12,a7)
	exg	d3,a1
	fmove.x	(-12,a7),fp0
	fdiv.l	d2,fp0
	fadd.x	fp1,fp0
#	fintrz.x fp0
.toInt32:
	moveq	#0,d1
	fmove.x	fp0,(-12,a7)
	move.w	(-12,a7),d1
	and.w	#0x7fff,d1
	sub.w	#16382,d1
	bfextu	(-8,a7){0:d1},d0
	btst	#7,(-12,a7)
	bne.s	.Neg32
	rts
.Neg32:
	neg.l	d0
	rts

# extern long long smul64(long a asm("d0"), long b asm("d1"));
_smul64:
	.globl	_smul64
	fmove.l	d0,fp0
	fmul.l	d1,fp0
.toInt64:
	moveq	#0,d1
	fmove.x	fp0,(-12,a7)
	move.w	(-12,a7),d1
	and.w	#0x7fff,d1
	sub.w	#16382,d1
	cmp.w	#32,d1
	ble.s	.L1
	sub.w	#32,d1
	bfextu	(-8,a7){0:d1},d0
	bfextu	(-8,a7){d1:32},d1
	btst	#7,(-12,a7)
	bne.s	.Neg64
	rts
.Neg64:
	neg.l	d1
	negx.l	d0
	rts
.L1:
	moveq	#0,d0
	bfextu	(-8,a7){0:d1},d1
	btst	#7,(-12,a7)
	bne.s	.Neg64
	rts

Last edited by bebbo; 05 August 2022 at 11:58. Reason: added a missing tst.l d1 after negx.l |
05 August 2022, 15:01 | #18 | |
bye
Join Date: Jun 2016
Location: Some / Where
Posts: 680
|
Quote:
I also found what I forgot to change: __builtin_clzll must be replaced by __builtin_clz, otherwise the shift is way too big. => http://franke.ms/cex/z/Yd6E14
|
05 August 2022, 17:26 | #19 |
Registered User
Join Date: Feb 2017
Location: Denmark
Posts: 1,099
|
Cool, I did consider trying to extract the result from the extended precision result, but hadn't had time. Doesn't seem to be a massive speed improvement, but if the conversion/rounding can be optimized it might have potential. Results:
smul64: 69 cycles
sdiv64: 138 cycles
divllu (returning just quotient): 175 cycles

I measured overhead to be ~37 cycles (just calling a dummy asm routine with 2 stack arguments and storing a result in "r").

P.S. Isn't it a bit dangerous to use a stack "red zone" like that? Probably OK if not in supervisor mode, but seems sketchy |
05 August 2022, 18:35 | #20 | |
bye
Join Date: Jun 2016
Location: Some / Where
Posts: 680
|
Quote:
unfolded the sign handling and added proper stack handling which reduced the object size by 16 bytes. Should be a tad faster now. Code:
#extern long sdiv64(long hi asm("d1"), long lo asm("d0"), long d asm("d2"));
_sdiv64:
	.globl	_sdiv64
	tst.l	d1
	bne.s	.ldiv
	divs.l	d2,d0
	rts

.mi:
	move.l	#0x8000,a0
	fmove.d	#0e-1.25e-1,fp1
	neg.l	d0
	negx.l	d1
	tst.l	d1
	bne.s	.ldivs
	divs.l	d2,d0
	neg.l	d0
	rts

.ldiv:
	bmi	.mi
	sub.l	a0,a0
.ldivs:
	fmove.s	#0e1.25e-1,fp1
	exg	d3,a1
	bfffo	d1{#0,#0},d3
	sub.w	#32,d3
	neg.w	d3
	subq.l	#8,a7
	bfins	d1,(a7){0,d3}
	bfins	d0,(a7){d3,32}
	add.w	#16382+32,d3
	add.l	a0,d3
	swap	d3
	move.l	d3,-(a7)
	exg	d3,a1
	fmove.x	(a7),fp0
	fdiv.l	d2,fp0
	moveq	#0,d1		| pulled up
	fadd.x	fp1,fp0
#	fintrz.x fp0
#.toInt32:
#	moveq	#0,d1		| pulled up
	fmove.x	fp0,(a7)
	move.w	(a7),d1
	addq.l	#4,a7
	bmi.s	.Neg32
	sub.w	#16382,d1
	bfextu	(a7){0:d1},d0
	addq.l	#8,a7
	rts
.Neg32:
	sub.w	#16382+0x8000,d1
	bfextu	(a7){0:d1},d0
	addq.l	#8,a7
	neg.l	d0
	rts

# extern long long smul64(long a asm("d0"), long b asm("d1"));
_smul64:
	.globl	_smul64
	fmove.l	d0,fp0
	fmul.l	d1,fp0
#.toInt64:
	moveq	#0,d1
	fmove.x	fp0,-(a7)
	move.w	(a7),d1
	addq.l	#4,a7
	bmi.s	.Neg64
	sub.w	#16382,d1
	cmp.w	#32,d1
	ble.s	.L1
	sub.w	#32,d1
	bfextu	(a7){0:d1},d0
	bfextu	(a7){d1:32},d1
	addq.l	#8,a7
	rts
.Neg64:
	sub.w	#16382+0x8000,d1
	cmp.w	#32,d1
	ble.s	.L1neg
	sub.w	#32,d1
	bfextu	(a7){0:d1},d0
	bfextu	(a7){d1:32},d1
	addq.l	#8,a7
	neg.l	d1
	negx.l	d0
	rts
.L1:
	moveq	#0,d0
	bfextu	(a7){0:d1},d1
	addq.l	#8,a7
	rts
.L1neg:
	moveq	#0,d0
	bfextu	(a7){0:d1},d1
	addq.l	#8,a7
	neg.l	d1
	rts

Last edited by bebbo; 05 August 2022 at 18:44. Reason: pulled up an insn after the fdiv |
|