English Amiga Board (https://eab.abime.net/index.php)
-   Coders. Asm / Hardware (https://eab.abime.net/forumdisplay.php?f=112)
-   -   060 Texture Mapping Optimization (https://eab.abime.net/showthread.php?t=111352)

paraj 24 July 2022 20:08

060 Texture Mapping Optimization
 
Just for funsies I've been playing around with implementing fast perspective-correct texture mapping on my 060 (again, since I've forgotten most of what I knew about software rasterization). Got most things working following Chris Hecker's seminal series [0], and of course Kalms also joined in on the fun [1]. FWIW this mostly works, but I found that the 28.4 subpixel precision used in [0] isn't sufficient in practice to avoid going outside texture bounds (10 fractional bits seem to be enough though).

Anyway, the optimized inner loop from [0] (final part) looks like this in 68k asm (my translation, assuming standard 16.16 at first):
Code:

; D0 = Initial Texture Offset: (U>>16)+(V>>16)*TextureWidth
; D1 = UFrac                : U<<16 (0.32 fixpoint)
; D2 = VFrac                : V<<16 (0.32 fixpoint)
; D3 = DUFrac                : DUDX<<16 (0.32 fixpoint)
; D4 = DVFrac                : DVDX<<16 (0.32 fixpoint)

; A0 = Dest                  : Where pixels will be drawn
; A1 = Texture
; A2 = UIntVintFrac          ; Points to second element of array containing: { (DUDX>>16)+((DVDX>>16)+1)*TextureWidth,
                            ;                                                (DUDX>>16)+(DVDX>>16)*TextureWidth }
                            ; i.e. element[0] is the normal increment and element[-1] is used when there's a V carry

        move.b (a1,d0.l),(a0)+      ; *Dest++ = Texture[Offset]
        add.l  d4,d2                ; VFrac += DVFrac
        subx.l d5,d5                ; Save carry
        move.l (a2,d5.l*4),d7      ; Get index from UIntVintFrac
        add.l  d3,d1                ; UFrac += DUFrac
        addx.l d7,d0                ; Offset += UintVint + carry
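
For reference, roughly the same thing in C (a sketch with made-up names; one byte-sized texel per pixel, and "step" points at the second element of the two-entry table described above):
Code:

#include <stdint.h>

/* step[0] is the normal offset increment, step[-1] the one used on a V carry */
static void DrawSpan(uint8_t *dest, const uint8_t *texture, int count,
                     uint32_t offset, uint32_t ufrac, uint32_t vfrac,
                     uint32_t dufrac, uint32_t dvfrac, const int32_t *step)
{
    while (count--) {
        *dest++ = texture[offset];                  /* move.b (a1,d0.l),(a0)+ */
        uint32_t v = vfrac + dvfrac;                /* add.l  d4,d2           */
        int vcarry = v < vfrac;                     /* subx.l d5,d5           */
        vfrac = v;
        uint32_t u = ufrac + dufrac;                /* add.l  d3,d1           */
        int ucarry = u < ufrac;
        ufrac = u;
        offset += step[vcarry ? -1 : 0] + ucarry;   /* move.l + addx.l        */
    }
}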

With that long prelude, my question is: can we do better than my optimized attempt below (assume the loop is unrolled 8/16 times), and are my annotations correct (assuming no cache misses)?

Code:

        add.l  d4,d2              ; Cycle 1 [D0 ready in 2]
        ; sOEP free

        subx.l  d5,d5              ; Cycle 2 [D0 ready in 1]
        ; sOEP unavailable

        move.b  (a1,d0.l),d6        ; Cycle 3 [D0 ready?, D5 ready in 2]
        add.l  d3,d1

        move.b  d6,(a0)+            ; Cycle 4 [D5 ready in 1]
        ; sOEP free

        move.l  (a2,d5.l*4),d7      ; Cycle 5 [D5 ready?]
        ; sOEP free

        addx.l  d7,d0              ; Cycle 6
        ; sOEP unavailable

(I can clean up my testbed if anyone wants to play along at home).

[0]: http://www.chrishecker.com/Miscellan...nical_Articles
[1]: https://www.lysator.liu.se/~mikaelk/...ectivetexture/
[2]: http://ada.untergrund.net/?p=boardthread&id=19 Other good discussions (not referenced)

pipper 24 July 2022 23:20

I'm not super familiar with 060 dual issue stuff, but you may want to take a look at Georg Steger's 060 DOOM floor mapping routines.
https://github.com/mheyer32/DoomAtta...gine.asm#L8128

I found it clever how the fractional parts of the texture coordinates always stay in the upper word, while the integer parts stay in the lower word. No swaps etc. necessary.
He did it by "cross-packing" the X/Y parts like this:

deltas:
a = fracDY|intDX
b = fracDX|intDY

x,y:
c = fracY|intX
d = fracX|intY

Now if you do
c = ADDX(c, a)
and fracDY overflows, the following
d = ADDX(d, b)
will add 1 to intY; in turn, if fracX overflows in that same addition, the next
c = ADDX(c, a) will increase the integer part of X,
and so forth...
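
In C terms, a minimal sketch (made-up names; the "x" member stands in for the 68k X flag that the back-to-back ADDXes chain through):
Code:

#include <stdint.h>

typedef struct {
    uint32_t c;   /* fracY | intX */
    uint32_t d;   /* fracX | intY */
    unsigned x;   /* carry left over from the previous step */
} XPacked;

static void Step(XPacked *p, uint32_t a /* fracDY | intDX */,
                             uint32_t b /* fracDX | intDY */)
{
    uint64_t sc = (uint64_t)p->c + a + p->x;       /* c = ADDX(c, a)                       */
    uint64_t sd = (uint64_t)p->d + b + (sc >> 32); /* d = ADDX(d, b): fracDY carry -> intY */
    p->c = (uint32_t)sc;
    p->d = (uint32_t)sd;
    p->x = (unsigned)(sd >> 32);                   /* fracX carry bumps intX next step     */
}

/* the texel offset is then (p->d & 0xFFFF) * TextureWidth + (p->c & 0xFFFF) */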

This might be old news to you, but I found it a pretty cool technique!

paraj 25 July 2022 12:45

Thanks, sounds similar to what's described at https://amycoders.org/opt/innerloops.html but it's nice to see it put into practice with 060 annotations. Even if it's old news, the point is that I've forgotten most of it :)

Managed to squeeze a bit more performance out of it by doing more, but less complicated work:
Code:
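; Assumed register setup (deduced from the comments below):
;   D5 = TextureWidth
;   A2 = Int(DUDX) + Int(DVDX)*TextureWidth   (combined per-pixel integer step)
;   remaining registers as in the first listing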

        add.l  d4,d2          ; VF+=DVF                  Cycle 0/5 [sOEP in subsequent ones]
        subx.l  d7,d7          ; Save carry              Cycle 1 pOEP-only
        move.b  (a1,d0.l),d6    ; Load pixel              Cycle 2 pOEP
        add.l  a2,d0          ; Add Int(DU)+Int(DV)*TextureWidth sOEP
        add.l  d3,d1          ; UF+=DUF                  Cycle 3 pOEP
        and.l  d5,d7          ; VCarry*TextureWidth              sOEP
        addx.l  d7,d0          ; +=VCarry*TextureWidth    Cycle 4 pOEP-only
        move.b  d6,(a0)+        ; Store pixel              Cycle 5 pOEP

And do some proper timing:
Code:

              Time (ns)    Per pixel (cycles)
Overhead        651.6      N/A
Original       3468.8      8.80
Optimized      2512.2      5.81
New            2286.7      5.11

The per pixel numbers were obtained by subtracting the measured overhead (looping/calling the function/pushing/popping registers/returning) and dividing by the cycle time (20 ns) and the number of pixels (16), e.g. for the original loop: (3468.8 - 651.6) / 20 / 16 = 8.80. Slightly noisy, but they line up nicely with the expected numbers.

Rock'n Roll 31 July 2022 21:08

Just for information: I found this massive 3D coding page some time ago.
https://mikro.naprvyraz.sk/docs/

paraj 02 August 2022 17:50

Quote:

Originally Posted by Rock'n Roll (Post 1557164)
Just for information: I found this massive 3D coding page some time ago.
https://mikro.naprvyraz.sk/docs/

Blast from the past with a lot of those documents. Think I first saw them while browsing https://hornet.org/code/ (which is apparently still up) back in the '90s.


For 060 specific stuff it seems that reaching for the FPU should be preferred to (high-precision) fixedpoint math (sorry LC owners). Really hurts that they cut 32x32->64bit multiply and 64/32->32 division from HW. Lack of the former also means that you can't even turn most 32-bit integer divisions by a constant into multiplications to increase performance.
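
As an illustration (just a sketch): the usual reciprocal trick for an unsigned divide by 10 needs exactly that missing 32x32->64 form.
Code:

#include <stdint.h>

/* x / 10 via multiplication: 0xCCCCCCCD = ceil(2^35 / 10), so the full
   64-bit product is needed before the shift. */
static uint32_t Div10(uint32_t x)
{
    return (uint32_t)(((uint64_t)x * 0xCCCCCCCDu) >> 35);
}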

SpeedGeek 02 August 2022 23:16

Quote:

Originally Posted by paraj (Post 1557476)
For 060 specific stuff it seems that reaching for the FPU should be preferred to (high-precision) fixedpoint math (sorry LC owners). Really hurts that they cut 32x32->64bit multiply and 64/32->32 division from HW. Lack of the former also means that you can't even turn most 32-bit integer divisions by a constant into multiplications to increase performance.

I was just about to question what specific code on the 060 would be optimized any differently than 020-040 code. But then I remembered that the unimplemented 060 instructions are probably the best qualified ones for optimization.

So, I will take the opportunity to remind you about the often forgotten 64 bit u/s multiply functions of utility.library. These functions were handled by the exception trap (ISP code) in older 68060.libraries.

That's exactly why I put some extra work into both libraries here:

- Added optimized Mult64u/s ISP patch to utility.library functions (much faster than the exception trap code)

https://eab.abime.net/showthread.php?t=96791
https://eab.abime.net/showthread.php?t=101115

Unfortunately, there were no Div64u/s functions in utility.library to patch. ;)

Thomas Richter 03 August 2022 13:47

Quote:

Originally Posted by SpeedGeek (Post 1557515)
So, I will take the opportunity to remind you about the often forgotten 64 bit u/s multiply functions of utility.library. These functions were handled by the exception trap (ISP code) in older 68060.libraries.

*Cough* Pretty much every 68060.library I'm aware of takes care of these vectors. Actually, since 3.1.4 and up, utility.library takes care of them.


Quote:

Originally Posted by SpeedGeek (Post 1557515)

Unfortunately, there were no Div64u/s functions in utility.library to patch. ;)


That's not the case anymore. 3.2 and up provides a 64/32 division in utility. Or actually two of them. No need to patch them, the utility.library already provides an implementation of them that is 68060-friendly.

SpeedGeek 03 August 2022 15:20

Quote:

Originally Posted by Thomas Richter (Post 1557578)
*Cough* Pretty much every 68060.library I'm aware of takes care of these vectors. Actually, since 3.1.4 and up, utility.library takes care of them.

*Cough* your awareness is obviously somewhat limited. The Carsten S. 68060 libraries didn't patch them at all. The Tekmagic060 and P5 68060 libraries patched but didn't optimize them:

http://aminet.net/package/util/boot/Mult64Patch

http://aminet.net/package/util/boot/UtilPatch


Quote:

Originally Posted by Thomas Richter (Post 1557578)
That's not the case anymore. 3.2 and up provides a 64/32 division in utility. Or actually two of them. No need to patch them, the utility.library already provides an implementation of them that is 68060-friendly.

Once again, your condescending attitude towards backwards compatibility shows its face. :(
But thanks for the information anyway; for the next library release I should add an exec version test before the utility.library patch (so OS 3.1.4+ users can get their money's worth *Cough*). :D

Thomas Richter 03 August 2022 16:22

Quote:

Originally Posted by SpeedGeek (Post 1557590)

What exactly does that show? That there are useless patches available on Aminet? I surely agree with that. It does not invalidate my statement.


Quote:

Originally Posted by SpeedGeek (Post 1557590)
Once again, your condescending attitude towards backwards compatibility shows it's face. :(

I beg your pardon. This is not a "backwards compatibility issue". It is an issue of "completeness of implementation", and that issue is fixed in almost any 68060.library that has been around. Not only mine. There is no compatibility issue here at all.



While at it, be aware that there are a couple of other issues with UMult64/SMult64 on some processors and some kickstarts, namely returning the result in the wrong order (hi and lo swapped). Later versions of SetPatch take care of that as well. The 3.1 SetPatch probably does not. It *may* be included in the 3.9 SetPatch, though I'm not sure at this moment.


Note that the above means that potentially any patch you install there may be overridden by SetPatch once again if the returned order is incorrect.

SpeedGeek 03 August 2022 17:19

Quote:

Originally Posted by Thomas Richter (Post 1557593)
What exactly does that show? That there are useless patches available on Aminet? I surely agree with that. It does not invalidate my statement.

"Useless" patches available on Aminet? Certainly not! You consider them useless because they don't serve any purpose which you think is important or beneficial.

Quote:

Originally Posted by Thomas Richter (Post 1557593)
I beg your pardon. This is not a "backwards compatibility issue". It is an issue of "completeness of implementation", and that issue is fixed in almost any 68060.library that has been around. Not only mine. There is no compatibility issue here at all.

While at it, be aware that there are a couple of other issues with UMult64/SMult64 on some processors and some kickstarts, namely returning the result in the wrong order (hi and lo swapped). Later versions of SetPatch take care of that as well. The 3.1 SetPatch probably does not. It *may* be included in the 3.9 SetPatch, though I'm not sure at this moment.

Note that the above means that potentially any patch you install there may be overridden by SetPatch once again if the returned order is incorrect.

I just downloaded the release notes for OS 3.1.4.1 and guess what?

Code:

-------------------- AmigaOS 3.1.4.(1) project -----------------------

Changes for release 45.1 (1.1.2018)

- The 64-bit math routines return now in the 68000 version
  the results in proper order, namely with the low 32 bits
  in register d0 and the high 32 bits in register d1.

- Added 68060 specialized versions of 64bit math. Note that
  these functions are currently never enabled as exec does
  not identify the 68060.

- Improved the 68000 version of SMult64 a tiny little bit.

- Retired the 68020-only version of utility.

Now, how exactly do users get their money's worth from new but "Never Enabled" functions? :rolleyes

paraj 03 August 2022 18:29

Since both vbcc and (newer) gcc call their own support libraries when compiling for 060 rather than relying on either the 68060.library fallback or utility.library, and I know to avoid the unimplemented instructions in my own asm code, it doesn't seem super relevant what some patches may or may not do.

My point was that for the 060, common optimizations that apply to the 486/Pentium (and maybe the 040?), like favoring 16.16 fixpoint over floating point in certain situations, rarely seem to translate. For cases where you absolutely need e.g. 128-bit intermediate results things may be different, but for demo/game stuff that hardly ever applies.

Thomas Richter 03 August 2022 18:31

Quote:

Originally Posted by SpeedGeek (Post 1557601)
"Useless" patches available on Aminet? Certainly not! You consider them useless because they don't serve any purpose which you think is important or beneficial.

They *are* useless because every 68060.library I'm aware of provides that as features. Sorry if yours was incomplete and required them.


Quote:

Originally Posted by SpeedGeek (Post 1557601)

Now, how exactly do users get their money's worth from new but "Never Enabled" functions? :rolleyes

You should probably read this stuff to the very end. That's all old news. Utility does have a 68060 detection function in it, no need to worry, even as of 3.1.4.

SpeedGeek 03 August 2022 19:36

Quote:

Originally Posted by Thomas Richter (Post 1557606)
They *are* useless because every 68060.library I'm aware of provides that as features. Sorry if yours was incomplete and required them.

Now, you didn't completely read what I wrote. The Aminet patches and the code I added to Carsten's 68060.library address both issues - patching and optimization. The only difference is that the Aminet patches use some other optimized 64-bit code, while I decided to optimize the Motorola/Freescale ISP code.

Quote:

Originally Posted by Thomas Richter (Post 1557606)
You should probably read this stuff to the very end. That's all old news. Utility does have a 68060 detection function in it, no need to worry, even as of 3.1.4.

I did read this stuff to the end. Now, how would I know that the Official OS 3.1.4.1 release notes don't agree with you to the end?


Quote:

Originally Posted by paraj (Post 1557605)
Since both vbcc and (newer) gcc call their own support libraries when compiling for 060 rather than relying on either the 68060.library fallback or utility.library, and I know to avoid the unimplemented instructions in my own asm code, it doesn't seem super relevant what some patches may or may not do.

My point was that for the 060, common optimizations that apply to the 486/Pentium (and maybe the 040?), like favoring 16.16 fixpoint over floating point in certain situations, rarely seem to translate. For cases where you absolutely need e.g. 128-bit intermediate results things may be different, but for demo/game stuff that hardly ever applies.

That's a valid point which could make much of the discussion here pointless. However, you might want to do some performance testing of your compiler's libraries and see how well they compare against the patched and/or library functions discussed here.

Thomas Richter 04 August 2022 07:23

As far as the MULU/MULS utility.library code is concerned, there is not really much to optimize. Frankly, if your code is really time critical, then the switch between 68020-68040 and 68060 code needs to be made at a much coarser level, as otherwise the additional instructions to move values into the argument registers, call utility.library and move the results back to where they are needed will eat up any micro-optimization in the function itself. That does not make it useless - it is just your average service function you need, for example, for large disk support, e.g. by the FFS, HDToolBox and Workbench. I believe the FFS use case triggered the necessity to have a 060 detection within the utility.library, such that 060 systems without an F-Space "debug ROM" could safely boot from large disks.

The 64-bit divide is another issue, as its full case (the full 64/32 divide) requires a bit more care. Some optimizations have been performed there that go beyond the actual Mot ISP code, which is just a straight implementation of Knuth's "Algorithm D", but I believe this needs a bit more thought and a more careful comparison against the classical "egyptian" (binary) division algorithm if you want to be faster. Not that it matters often.
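
For reference, a minimal C sketch of the classical shift/subtract variant for the 64/32 case (assuming the quotient fits in 32 bits, i.e. the high word of the dividend is smaller than the divisor):
Code:

#include <stdint.h>

/* Restoring division: 64-bit dividend hi:lo, 32-bit divisor d, 32 iterations. */
static uint32_t UDiv64_32(uint32_t hi, uint32_t lo, uint32_t d, uint32_t *rem)
{
    uint32_t q = 0;
    int i;
    for (i = 0; i < 32; i++) {
        uint32_t carry = hi >> 31;       /* bit shifted out of the partial remainder */
        hi = (hi << 1) | (lo >> 31);     /* shift the whole dividend left by one     */
        lo <<= 1;
        q <<= 1;
        if (carry || hi >= d) {          /* divisor fits: subtract and set the bit   */
            hi -= d;
            q |= 1;
        }
    }
    if (rem) *rem = hi;
    return q;
}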

bebbo 04 August 2022 14:25

There is a maybe usable divide implementation: https://ridiculousfish.com/blog/post...episode-v.html

Adjusted for the 64/32: http://franke.ms/cex/z/j78asa

paraj 04 August 2022 20:01

Quote:

Originally Posted by bebbo (Post 1557721)
There is a maybe usable divide implementation: https://ridiculousfish.com/blog/post...episode-v.html

Adjusted for the 64/32: http://franke.ms/cex/z/j78asa

Haven't debugged it, but it seems to do a division by zero when I try it. I probably need to recompile my cross compiler, but even taking the assembly code directly from Compiler Explorer gives the same result, and I still haven't actually gotten the support libraries properly compiled for 060 (meaning they rely on emulated instructions). Will have to look into that later...

Otherwise doing a (poor) synthetic benchmark with VBCC I get:
Code:

BuiltinMul     1626 ns    [81 cycles]
BuiltinDiv     5490 ns    [275 cycles]
UMult64        1418 ns    [71 cycles]
EmuMul        62797 ns    [3140 cycles]
EmuDiv        73514 ns    [3676 cycles]
FMul           1966 ns    [98 cycles]
FDiv           2330 ns    [116 cycles]

(The Mult64 patch seems to improve UMult64 to 68 cycles). This is on KS 39.106 (so no UDivMod64) with 68060.library 47.1. I timed 100000 function calls to something that looks like:
Code:

#include <stdint.h>
#include <exec/types.h>   /* ULONG */

volatile uint32_t x=0x20000000;
volatile uint32_t y=0x10000000;
volatile uint64_t r;
void BuiltinMul(void) { r = (uint64_t)x * y; }
uint64_t __EmuMul(__reg("d0") ULONG, __reg("d1") ULONG) = "\tmulu.l\td0,d0:d1";
void EmuMul(void) { r = __EmuMul(x, y); }
void FMul(void) { r = (uint64_t)((double)x * y); }

VBCC seems to use a full 64x64 multiplication (even though it isn't necessary), which is why UMult64 comes out ahead. FMul/FDiv incur a lot of overhead in converting to/from floating point ((u)int64<->double is especially costly), but they still win by a large margin for divisions.
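
To illustrate (a rough sketch, not what VBCC actually emits): a 16.16 multiply wants the full 64-bit product, and without the 64-bit MULU.L form that product has to be pieced together from 16x16->32 partial products.
Code:

#include <stdint.h>

/* 16.16 fixed-point multiply needs the 64-bit intermediate... */
static int32_t FixMul(int32_t a, int32_t b)
{
    return (int32_t)(((int64_t)a * b) >> 16);
}

/* ...which in software is built from 16x16->32 partial products: */
static uint64_t UMul32x32(uint32_t a, uint32_t b)
{
    uint32_t al = a & 0xFFFFu, ah = a >> 16;
    uint32_t bl = b & 0xFFFFu, bh = b >> 16;
    uint64_t lo  = (uint64_t)al * bl;
    uint64_t mid = (uint64_t)al * bh + (uint64_t)ah * bl;
    uint64_t hi  = (uint64_t)ah * bh;
    return lo + (mid << 16) + (hi << 32);
}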

Not dunking on anyone involved here, and of course stuff can be optimized, but like I wrote earlier I think it's clear that for the 060 (with an FPU) it's not a good idea to use e.g. 16.16 fixedpoint instead of floats/doubles if you need to do anything apart from adding and subtracting.

bebbo 05 August 2022 11:25

Since I don't have access to a real 68060... what about this implementation?
Code:

#extern long  sdiv64(long hi asm("d1"), long lo asm("d0"), long d asm("d2"));
_sdiv64: .globl _sdiv64
        tst.l        d1
        bne.s        .ldiv
        divs.l        d2,d0
        rts

.mi:
        move.l        #0x8000,a0
        fmove.d        #0e-1.25e-1,fp1
        neg.l        d0
        negx.l        d1
        tst.l        d1
        bne.s        .ldivs
        divs.l        d2,d0
        neg.l        d0
        rts

.ldiv:
        bmi        .mi
        sub.l        a0,a0
.ldivs:       
        fmove.s        #0e1.25e-1,fp1
        exg                d3,a1
        bfffo        d1{#0,#0},d3
        sub.w        #32,d3
        neg.w        d3
       
        bfins        d1,(-8,a7){0,d3}
        bfins        d0,(-8,a7){d3,32}
       
        add.w        #16382 + 32,d3
        add.l        a0,d3       
        swap        d3
        move.l        d3,(-12,a7)
        exg                d3,a1
       
        fmove.x        (-12,a7),fp0
        fdiv.l        d2,fp0
        fadd.x        fp1,fp0
#        fintrz.x        fp0
       
.toInt32:
        moveq        #0,d1
        fmove.x fp0,(-12,a7)
        move.w        (-12,a7),d1
        and.w        #0x7fff,d1
        sub.w        #16382,d1
        bfextu        (-8,a7){0:d1},d0
        btst        #7,(-12,a7)
        bne.s        .Neg32
        rts
.Neg32:
        neg.l        d0
        rts

# extern long long smul64(long a asm("d0"), long b asm("d1"));
_smul64: .globl _smul64
        fmove.l        d0,fp0
        fmul.l        d1,fp0

.toInt64:
        moveq        #0,d1
        fmove.x fp0,(-12,a7)
        move.w        (-12,a7),d1
        and.w        #0x7fff,d1
        sub.w        #16382,d1
        cmp.w        #32,d1
        ble.s                .L1
        sub.w        #32,d1
        bfextu        (-8,a7){0:d1},d0
        bfextu        (-8,a7){d1:32},d1
        btst        #7,(-12,a7)
        bne.s        .Neg64
        rts
.Neg64:
        neg.l        d1
        negx.l        d0
        rts
.L1:
        moveq        #0,d0
        bfextu        (-8,a7){0:d1},d1
        btst        #7,(-12,a7)
        bne.s        .Neg64
        rts

For the values I tested, it seems to work. Is this approach too slow? Or does the rounding mode kill it?

bebbo 05 August 2022 15:01

Quote:

Originally Posted by bebbo (Post 1557721)
There is a maybe usable divide implementation: https://ridiculousfish.com/blog/post...episode-v.html

Adjusted for the 64/32: http://franke.ms/cex/z/j78asa


I also found what I forgot to change:

__builtin_clzll must be replaced by __builtin_clz, otherwise the shift is way too big.

=> http://franke.ms/cex/z/Yd6E14

paraj 05 August 2022 17:26

Cool, I did consider trying to extract the result from the extended precision result, but hadn't had time. Doesn't seem to be a massive speed improvement, but if the conversion/rounding can be optimized it might have potential. Results:
smul64: 69 cycles
sdiv64: 138 cycles
divllu (returning just quotient): 175 cycles
I measured overhead to be ~37 cycles (just calling a dummy asm routine with 2 stack arguments and storing a result in "r").

P.S. isn't it a bit dangerous to use a stack "red zone" like that? Probably OK if not in supervisor mode, but seems sketchy :)

bebbo 05 August 2022 18:35

Quote:

Originally Posted by paraj (Post 1557949)
Cool, I did consider trying to extract the result from the extended precision result, but hadn't had time. Doesn't seem to be a massive speed improvement, but if the conversion/rounding can be optimized it might have potential. Results:
smul64: 69 cycles
sdiv64: 138 cycles
divllu (returning just quotient): 175 cycles
I measured overhead to be ~37 cycles (just calling a dummy asm routine with 2 stack arguments and storing a result in "r").

P.S. isn't it a bit dangerous to use a stack "red zone" like that? Probably OK if not in supervisor mode, but seems sketchy :)


I unfolded the sign handling and added proper stack handling, which reduced the object size by 16 bytes. Should be a tad faster now.

Code:

#extern long sdiv64(long hi asm("d1"), long lo asm("d0"), long d asm("d2"));
_sdiv64:        .globl        _sdiv64
        tst.l        d1
        bne.s        .ldiv
        divs.l        d2,d0
        rts

.mi:
        move.l        #0x8000,a0
        fmove.d        #0e-1.25e-1,fp1
        neg.l        d0
        negx.l        d1
        tst.l        d1
        bne.s        .ldivs
        divs.l        d2,d0
        neg.l        d0
        rts

.ldiv:
        bmi        .mi
        sub.l        a0,a0
.ldivs:       
        fmove.s        #0e1.25e-1,fp1
        exg                d3,a1
        bfffo        d1{#0,#0},d3
        sub.w        #32,d3
        neg.w        d3

        subq.l        #8,a7
        bfins        d1,(a7){0,d3}
        bfins        d0,(a7){d3,32}
       
        add.w        #16382+32,d3
        add.l        a0,d3       
        swap        d3
        move.l        d3,-(a7)
        exg                d3,a1
       
        fmove.x        (a7),fp0
        fdiv.l        d2,fp0
        moveq        #0,d1        | pulled up
        fadd.x        fp1,fp0
#        fintrz.x        fp0
       
#.toInt32:
#        moveq        #0,d1        | pulled up
        fmove.x        fp0,(a7)
        move.w        (a7),d1
        addq.l        #4,a7
        bmi.s        .Neg32
        sub.w        #16382,d1
        bfextu        (a7){0:d1},d0       
        addq.l        #8,a7
        rts
.Neg32:
        sub.w        #16382+0x8000,d1
        bfextu        (a7){0:d1},d0
        addq.l        #8,a7
        neg.l        d0
        rts

#        extern long long smul64(long a asm("d0"), long b asm("d1"));
_smul64:        .globl        _smul64
        fmove.l        d0,fp0
        fmul.l        d1,fp0

#.toInt64:
        moveq        #0,d1
        fmove.x        fp0,-(a7)
        move.w        (a7),d1
        addq.l        #4,a7
        bmi.s        .Neg64
        sub.w        #16382,d1
        cmp.w        #32,d1
        ble.s        .L1
        sub.w        #32,d1
        bfextu        (a7){0:d1},d0
        bfextu        (a7){d1:32},d1
        addq.l        #8,a7
        rts
       
.Neg64:       
        sub.w        #16382+0x8000,d1
        cmp.w        #32,d1
        ble.s        .L1neg
        sub.w        #32,d1
        bfextu        (a7){0:d1},d0
        bfextu        (a7){d1:32},d1
        addq.l        #8,a7
       
        neg.l        d1
        negx.l        d0
        rts
.L1:
        moveq        #0,d0
        bfextu        (a7){0:d1},d1
        addq.l        #8,a7
        rts

.L1neg:
        moveq        #0,d0
        bfextu        (a7){0:d1},d1
        addq.l        #8,a7

        neg.l        d1
        rts


