24 July 2022, 20:08 | #1 |
Registered User
Join Date: Feb 2017
Location: Denmark
Posts: 1,099
|
060 Texture Mapping Optimization
Just for funsies I've been playing around with implementing perspective correct texture mapping fast on my 060 (again, since I've forgotten most of what I knew about software rasterization). Got most things working following Chris Hecker's seminal series [0], and of course Kalms also joined in on the fun [1]. FWIW this mostly works, but I found that the 28.4 subpixel precision used in [0] isn't sufficient in practice to avoid going outside texture bounds (.10, i.e. 10 fractional bits, seems to be enough though).
Anyway, the optimized inner loop from [0] (final part) looks like this in 68k asm (my translation, assuming standard 16.16 at first): Code:
; D0 = Initial Texture Offset: (U>>16)+(V>>16)*TextureWidth
; D1 = UFrac  : U<<16 (0.32 fixpoint)
; D2 = VFrac  : V<<16 (0.32 fixpoint)
; D3 = DUFrac : DUDX<<16 (0.32 fixpoint)
; D4 = DVFrac : DVDX<<16 (0.32 fixpoint)
; A0 = Dest    : Where pixels will be drawn
; A1 = Texture
; A2 = UIntVintFrac : Points to second element of array containing:
;      { (DUDX>>16)+((DVDX>>16)+1)*TextureWidth,
;        (DUDX>>16)+(DVDX>>16)*TextureWidth }
;      i.e. element[0] is a normal increment and element[-1] is when there's a V carry

	move.b	(a1,d0.l),(a0)+    ; *Dest++ = Texture[Offset]
	add.l	d4,d2              ; VFrac += DVFrac
	subx.l	d5,d5              ; Save carry
	move.l	(a2,d5.l*4),d7     ; Get index from UIntVintFrac
	add.l	d3,d1              ; UFrac += DUFrac
	addx.l	d7,d0              ; Offset += UIntVint + carry

Scheduled for the 060's pOEP/sOEP pipelines it becomes: Code:
	add.l	d4,d2            ; Cycle 1 [D0 ready in 2] ; sOEP free
	subx.l	d5,d5            ; Cycle 2 [D0 ready in 1] ; sOEP unavailable
	move.b	(a1,d0.l),d6     ; Cycle 3 [D0 ready?, D5 ready in 2]
	add.l	d3,d1
	move.b	d6,(a0)+         ; Cycle 4 [D5 ready in 1] ; sOEP free
	move.l	(a2,d5.l*4),d7   ; Cycle 5 [D5 ready?] ; sOEP free
	addx.l	d7,d0            ; Cycle 6 ; sOEP unavailable

[0]: http://www.chrishecker.com/Miscellan...nical_Articles
[1]: https://www.lysator.liu.se/~mikaelk/...ectivetexture/
[2]: http://ada.untergrund.net/?p=boardthread&id=19 (other good discussions, not referenced above)

Last edited by paraj; 24 July 2022 at 20:13. |
24 July 2022, 23:20 | #2 |
Registered User
Join Date: Jul 2017
Location: San Jose
Posts: 652
|
I'm not super familiar with 060 dual issue stuff, but you may want to take a look at Georg Steger's 060 DOOM floor mapping routines.
https://github.com/mheyer32/DoomAtta...gine.asm#L8128

I found it clever how the fractional parts of the texture coordinates always stay in the upper word, while the integer parts stay in the lower word. No swaps etc. necessary. He did it by "cross-packing" the X/Y parts like this:

deltas:
a = fracDY|intDX
b = fracDX|intDY

x,y:
c = fracY|intX
d = fracX|intY

Now if you do c = ADDX(c, a) and fracDY overflows, d = ADDX(d, b) will add 1 to intY, and in turn, if in the same addition fracX overflows, the next c = ADDX(c, a) will increase the integer part of X, and so forth... This might be old news to you, but I found it a pretty cool technique! |
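If it helps, here is a rough portable C model of that cross-packing trick (the struct and field names are mine, not from the DOOM source); the 68k X flag is modelled explicitly so the carry chain between the two ADDXs is visible:

```c
#include <stdint.h>

typedef struct {
    uint32_t c;      /* fracY|intX */
    uint32_t d;      /* fracX|intY */
    uint32_t x_flag; /* models the 68k X (extend) flag between the addx's */
} PackedUV;

/* One step: c = ADDX(c, a); d = ADDX(d, b).
   Because each fraction sits in the HIGH word, a fraction overflow becomes
   the carry-out of the 32-bit add, and the following addx folds that carry
   into the OTHER word's low-half integer. */
static void step(PackedUV *p, uint32_t a, uint32_t b)
{
    uint64_t t = (uint64_t)p->c + a + p->x_flag;
    p->c = (uint32_t)t;
    p->x_flag = (uint32_t)(t >> 32);   /* fracY overflow -> bumps intY next */

    t = (uint64_t)p->d + b + p->x_flag;
    p->d = (uint32_t)t;
    p->x_flag = (uint32_t)(t >> 32);   /* fracX overflow -> bumps intX next */
}
```

After each step, intX is `p->c & 0xFFFF` and intY is `p->d & 0xFFFF`; e.g. with a = 0x80000001 (fracDY=0.5, intDX=1) and b = 0x40000002 (fracDX=0.25, intDY=2), five steps from zero give intX = 6 (five intDX plus one fracX carry) and intY = 12 (ten plus two fracY carries).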
25 July 2022, 12:45 | #3 |
Registered User
Join Date: Feb 2017
Location: Denmark
Posts: 1,099
|
Thanks, sounds similar to what's described at https://amycoders.org/opt/innerloops.html but it's nice to see it put into practice with 060 annotations. Even if it's old news, the point is that I've forgotten most of it
Managed to squeeze a bit more performance out of it by doing more, but less complicated work: Code:
	add.l	d4,d2           ; VF+=DVF                          Cycle 0/5 [sOEP in subsequent ones]
	subx.l	d7,d7           ; Save carry                       Cycle 1 pOEP-only
	move.b	(a1,d0.l),d6    ; Load pixel                       Cycle 2 pOEP
	add.l	a2,d0           ; Add Int(DU)+Int(DV)*TextureWidth         sOEP
	add.l	d3,d1           ; UF+=DUF                          Cycle 3 pOEP
	and.l	d5,d7           ; VCarry*TextureWidth                      sOEP
	addx.l	d7,d0           ; +=VCarry*TextureWidth            Cycle 4 pOEP-only
	move.b	d6,(a0)+        ; Store pixel                      Cycle 5 pOEP
Code:
            Time (ns)   Per pixel (cycles)
Overhead      651.6     N/A
Original     3468.8     8.80
Optimized    2512.2     5.81
New          2286.7     5.11
|
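For anyone following along, the branchless carry trick in that loop can be modelled per pixel in C like this (names are mine; `const_step` is the A2 value, i.e. Int(DU)+Int(DV)*TextureWidth, and `tex_width` is the D5 value):

```c
#include <stdint.h>

/* One pixel of the branchless 060 inner loop, modelled in C (sketch). */
static void pixel_step(uint8_t **dest, const uint8_t *texture,
                       uint32_t *offset, uint32_t *ufrac, uint32_t *vfrac,
                       uint32_t dufrac, uint32_t dvfrac,
                       uint32_t const_step, uint32_t tex_width)
{
    *(*dest)++ = texture[*offset];              /* move.b (a1,d0.l),d6 / store */

    uint64_t v = (uint64_t)*vfrac + dvfrac;     /* add.l d4,d2 */
    uint32_t vcarry_mask = 0u - (uint32_t)(v >> 32); /* subx.l d7,d7: 0 or ~0 */
    *vfrac = (uint32_t)v;

    uint64_t u = (uint64_t)*ufrac + dufrac;     /* add.l d3,d1, X = U carry */
    uint32_t ucarry = (uint32_t)(u >> 32);
    *ufrac = (uint32_t)u;

    /* add.l a2,d0 / and.l d5,d7 / addx.l d7,d0:
       constant step, plus TextureWidth only on a V carry, plus the U carry */
    *offset += const_step + (vcarry_mask & tex_width) + ucarry;
}
```

With a 4-byte-wide texture, du = 1.5 and dv = 0.5 (so const_step = 1), the offsets visited from zero are 0, 1, 7, 8, ..., matching what the masked add produces in the asm.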
31 July 2022, 21:08 | #4 |
German Translator
Join Date: Aug 2018
Location: Drübeck / Germany
Age: 49
Posts: 183
|
Just for information: I found this massive 3D coding page at some point.
https://mikro.naprvyraz.sk/docs/ |
02 August 2022, 17:50 | #5 | |
Registered User
Join Date: Feb 2017
Location: Denmark
Posts: 1,099
|
Quote:
For 060-specific stuff it seems that reaching for the FPU should be preferred over (high-precision) fixed-point math (sorry, LC owners). It really hurts that they cut the 32x32->64 bit multiply and 64/32->32 division from hardware. Lack of the former also means that you can't even turn most 32-bit integer divisions by a constant into multiplications to increase performance.
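For context on that last point, the division-by-constant trick in question looks like this in C; it hinges on the high half of a 32x32->64 product, which is precisely what the 060 dropped (the constant 10 and its magic number are just the textbook example):

```c
#include <stdint.h>

/* Divide by the constant 10 via multiplication: 0xCCCCCCCD is
   ceil(2^35 / 10), and (x * 0xCCCCCCCD) >> 35 equals x / 10 for every
   uint32_t x. The >> 35 only needs the high 32 bits of the product --
   the part of the multiply the 68060 no longer does in hardware. */
static uint32_t div10(uint32_t x)
{
    return (uint32_t)(((uint64_t)x * 0xCCCCCCCDu) >> 35);
}
```

On CPUs with a full 32x32->64 multiply a compiler emits exactly this shape for `x / 10`; without it, the high-half multiply itself has to be synthesized, which eats the win.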
|
02 August 2022, 23:16 | #6 | |
Moderator
Join Date: Dec 2010
Location: Wisconsin USA
Age: 60
Posts: 839
|
Quote:
So, I will take the opportunity to remind you about the often forgotten 64-bit u/s multiply functions of utility.library. These functions were handled by the exception trap (ISP code) in older 68060.libraries. That's exactly why I put some extra work into both libraries here:

- Added optimized Mult64u/s ISP patch to utility.library functions (much faster than exception trap code)

https://eab.abime.net/showthread.php?t=96791
https://eab.abime.net/showthread.php?t=101115

Unfortunately, there were no Div64u/s functions in utility.library to patch.
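For readers wondering what such a Mult64 implementation has to work with: the 68060 kept only the 32x32->32 `mulu.l`, so a software 64-bit product has to be pieced together from partial products, roughly like this C sketch (illustrative only; this is not the utility.library or patch source):

```c
#include <stdint.h>

/* Build a 64-bit product from 16x16->32 partial products, the classic
   long-multiplication shape a software UMult64 boils down to. */
static void umult64(uint32_t a, uint32_t b, uint32_t *hi, uint32_t *lo)
{
    uint32_t a_lo = a & 0xFFFF, a_hi = a >> 16;
    uint32_t b_lo = b & 0xFFFF, b_hi = b >> 16;

    uint32_t p0 = a_lo * b_lo;  /* contributes to bits  0..31 */
    uint32_t p1 = a_lo * b_hi;  /* contributes to bits 16..47 */
    uint32_t p2 = a_hi * b_lo;  /* contributes to bits 16..47 */
    uint32_t p3 = a_hi * b_hi;  /* contributes to bits 32..63 */

    /* sum the middle column; it cannot overflow 32 bits */
    uint32_t mid = (p0 >> 16) + (p1 & 0xFFFF) + (p2 & 0xFFFF);

    *lo = (p0 & 0xFFFF) | (mid << 16);
    *hi = p3 + (p1 >> 16) + (p2 >> 16) + (mid >> 16);
}
```

The thread's later discussion about hi/lo register order is exactly about which of these two halves ends up in d0 vs d1.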
|
03 August 2022, 13:47 | #7 | ||
Registered User
Join Date: Jan 2019
Location: Germany
Posts: 3,216
|
Quote:
Quote:
That's not the case anymore. 3.2 and up provides a 64/32 division in utility. Or actually two of them. No need to patch them, the utility.library already provides an implementation of them that is 68060-friendly. |
||
03 August 2022, 15:20 | #8 | ||
Moderator
Join Date: Dec 2010
Location: Wisconsin USA
Age: 60
Posts: 839
|
Quote:
http://aminet.net/package/util/boot/Mult64Patch http://aminet.net/package/util/boot/UtilPatch Quote:
But thanks for the information anyway, for the next library release I should add an exec version test before the utility.library patch (so OS 3.14+ users can get their money's worth *Cough*). Last edited by SpeedGeek; 03 August 2022 at 16:18. |
||
03 August 2022, 16:22 | #9 | ||
Registered User
Join Date: Jan 2019
Location: Germany
Posts: 3,216
|
Quote:
Quote:
While at it, be aware that there are a couple of other issues with UMult64/SMult64 on some processors and some kickstarts, namely returning the result in the wrong order (hi and lo swapped). Later versions of SetPatch take care of that as well. The 3.1 SetPatch probably does not. It *may* be included in the 3.9 SetPatch, though I'm not sure at this moment. Note that the above means that potentially any patch you install there may be overridden by SetPatch once again if the returned order is incorrect. |
||
03 August 2022, 17:19 | #10 | ||
Moderator
Join Date: Dec 2010
Location: Wisconsin USA
Age: 60
Posts: 839
|
Quote:
Quote:
Code:
-------------------- AmigaOS 3.1.4.(1) project -----------------------

Changes for release 45.1 (1.1.2018)

- The 64-bit math routines now return the results in proper order in the
  68000 version, namely with the low 32 bits in register d0 and the high
  32 bits in register d1.
- Added 68060 specialized versions of 64bit math. Note that these functions
  are currently never enabled as exec does not identify the 68060.
- Improved the 68000 version of SMult64 a tiny little bit.
- Retired the 68020-only version of utility.
||
03 August 2022, 18:29 | #11 |
Registered User
Join Date: Feb 2017
Location: Denmark
Posts: 1,099
|
Since both vbcc and (newer) gcc call their own support libraries when compiling for 060, rather than relying on either the 68060.library fallback or utility.library, and since I know to avoid those instructions in my asm code, it doesn't seem super relevant what some patches may or may not do.
My point was that for the 060, common optimizations that apply on the 486/Pentium (and maybe the 040?), like favoring 16.16 fixed-point over floating-point numbers in certain situations, rarely translate. For cases where you absolutely need e.g. 128-bit intermediary results things may be different, but for demo/game stuff that hardly ever applies. |
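The tradeoff in miniature, as a C sketch (names are mine; the point is where the 64-bit intermediate hides):

```c
#include <stdint.h>

typedef int32_t fix16;   /* 16.16 fixed point */

/* A 16.16 multiply needs the 64-bit intermediate product that the 68060
   cannot produce in hardware, so on the 060 this ends up in a trap or a
   compiler support routine. */
static fix16 fix_mul(fix16 a, fix16 b)
{
    return (fix16)(((int64_t)a * b) >> 16);
}

/* The FPU version is a single short fmul on the 060. */
static double flt_mul(double a, double b)
{
    return a * b;
}
```

Addition and subtraction stay cheap in fixed point either way; it's the multiply/divide-heavy inner loops where the FPU wins on this CPU.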
03 August 2022, 18:31 | #12 | |
Registered User
Join Date: Jan 2019
Location: Germany
Posts: 3,216
|
Quote:
You should probably read this stuff to the very end. That's all old news. Utility does have a 68060 detection function in it, no need to worry, even as of 3.1.4. |
|
03 August 2022, 19:36 | #13 | |||
Moderator
Join Date: Dec 2010
Location: Wisconsin USA
Age: 60
Posts: 839
|
Quote:
Quote:
Quote:
Last edited by SpeedGeek; 04 August 2022 at 14:50. |
|||
04 August 2022, 07:23 | #14 |
Registered User
Join Date: Jan 2019
Location: Germany
Posts: 3,216
|
As far as the MULU/MULS utility.library code is concerned, there is not really much to optimize. Frankly, if your code is really time critical, then the switch between 68020-68040 and 68060 code needs to be made at a much coarser level as otherwise the additional instructions to move values to the source registers, calling utility and moving the results back to where they are needed will eat up any micro-optimization in the above function. That does not make it useless - it is just your average service function you need for example for large disk support, e.g. by the FFS, the HDToolBox and workbench. I believe the FFS use case triggered the necessity to have a 060 detection within the utility.library such that 060 systems without an F-Space "debug ROM" could safely boot from large disks.
The 64-bit divide is another issue, as its full case (the 64/32 full divide) requires a bit more care. Some optimizations have been performed there that go beyond the actual Mot ISP code, which is just a straight implementation of Knuth's "Algorithm D", but I believe this requires a bit more thought, and a more careful comparison against the classical "egyptian" (binary) division algorithm, if you want to be faster. Not that it matters often.
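For reference, the classical binary ("egyptian") restoring division mentioned above looks like this for the 64/32 case (a sketch assuming hi < d, i.e. no quotient overflow; real implementations unroll the loop and handle the overflow case up front):

```c
#include <stdint.h>

/* 64/32 -> 32-bit quotient and remainder by shift-and-subtract:
   one quotient bit per iteration, 32 iterations. Valid when hi < d. */
static uint32_t udiv64_32(uint32_t hi, uint32_t lo, uint32_t d, uint32_t *rem)
{
    uint64_t r = hi;   /* partial remainder, kept < d between iterations */
    uint32_t q = 0;

    for (int i = 0; i < 32; i++) {
        r = (r << 1) | (lo >> 31);   /* bring down the next dividend bit */
        lo <<= 1;
        q <<= 1;
        if (r >= d) {                /* trial subtract succeeded */
            r -= d;
            q |= 1;
        }
    }
    *rem = (uint32_t)r;
    return q;
}
```

Algorithm D instead guesses whole 16- or 32-bit quotient digits via a hardware divide and corrects them, so which approach wins depends heavily on how expensive the per-bit loop is versus the digit-estimate machinery.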
04 August 2022, 14:25 | #15 |
bye
Join Date: Jun 2016
Location: Some / Where
Posts: 680
|
There is a possibly usable divide implementation: https://ridiculousfish.com/blog/post...episode-v.html
Adjusted for the 64/32: http://franke.ms/cex/z/j78asa |
04 August 2022, 20:01 | #16 | |
Registered User
Join Date: Feb 2017
Location: Denmark
Posts: 1,099
|
Quote:
Otherwise doing a (poor) synthetic benchmark with VBCC I get: Code:
BuiltinMul    1626 ns   [81 cycles]
BuiltinDiv    5490 ns   [275 cycles]
UMult64       1418 ns   [71 cycles]
EmuMul       62797 ns   [3140 cycles]
EmuDiv       73514 ns   [3676 cycles]
FMul          1966 ns   [98 cycles]
FDiv          2330 ns   [116 cycles]
Code:
volatile uint32_t x = 0x20000000;
volatile uint32_t y = 0x10000000;
volatile uint64_t r;

void BuiltinMul(void) { r = (uint64_t)x * y; }

uint64_t __EmuMul(__reg("d0") ULONG, __reg("d1") ULONG) = "\tmulu.l\td0,d0:d1";
void EmuMul(void) { r = __EmuMul(x, y); }

void FMul(void) { r = (uint64_t)((double)x * y); }

Not dunking on anyone involved here, and of course stuff can be optimized, but like I wrote earlier I think it's clear that for the 060 (with an FPU) it's not a good idea to use e.g. 16.16 fixed-point instead of floats/doubles if you need to do anything apart from adding and subtracting.
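One caveat worth flagging on the FMul numbers: a double carries only a 53-bit significand, so the float route is an exact 64-bit multiply only when the true product happens to be representable, as the power-of-two test values above are. A quick check (my own helper, not part of the benchmark):

```c
#include <stdint.h>

/* Returns 1 when computing the 64-bit product via double gives the exact
   integer result, 0 when rounding to 53 significand bits loses low bits. */
static int fmul_is_exact(uint32_t x, uint32_t y)
{
    uint64_t exact  = (uint64_t)x * y;
    uint64_t viaflt = (uint64_t)((double)x * y);
    return exact == viaflt;
}
```

So the FPU shortcut is fine for e.g. scaled coordinates that fit well under 2^53, but it is not a drop-in replacement for a true UMult64.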
|
05 August 2022, 11:25 | #17 |
bye
Join Date: Jun 2016
Location: Some / Where
Posts: 680
|
Since I don't have access to a real 68060... what about this implementation?
Code:
#extern long sdiv64(long hi asm("d1"), long lo asm("d0"), long d asm("d2"));
_sdiv64:
	.globl	_sdiv64
	tst.l	d1
	bne.s	.ldiv
	divs.l	d2,d0
	rts

.mi:
	move.l	#0x8000,a0
	fmove.d	#0e-1.25e-1,fp1
	neg.l	d0
	negx.l	d1
	tst.l	d1
	bne.s	.ldivs
	divs.l	d2,d0
	neg.l	d0
	rts

.ldiv:
	bmi	.mi
	sub.l	a0,a0
.ldivs:
	fmove.s	#0e1.25e-1,fp1
	exg	d3,a1
	bfffo	d1{#0,#0},d3
	sub.w	#32,d3
	neg.w	d3
	bfins	d1,(-8,a7){0,d3}
	bfins	d0,(-8,a7){d3,32}
	add.w	#16382+32,d3
	add.l	a0,d3
	swap	d3
	move.l	d3,(-12,a7)
	exg	d3,a1
	fmove.x	(-12,a7),fp0
	fdiv.l	d2,fp0
	fadd.x	fp1,fp0
#	fintrz.x fp0
.toInt32:
	moveq	#0,d1
	fmove.x	fp0,(-12,a7)
	move.w	(-12,a7),d1
	and.w	#0x7fff,d1
	sub.w	#16382,d1
	bfextu	(-8,a7){0:d1},d0
	btst	#7,(-12,a7)
	bne.s	.Neg32
	rts
.Neg32:
	neg.l	d0
	rts

# extern long long smul64(long a asm("d0"), long b asm("d1"));
_smul64:
	.globl	_smul64
	fmove.l	d0,fp0
	fmul.l	d1,fp0
.toInt64:
	moveq	#0,d1
	fmove.x	fp0,(-12,a7)
	move.w	(-12,a7),d1
	and.w	#0x7fff,d1
	sub.w	#16382,d1
	cmp.w	#32,d1
	ble.s	.L1
	sub.w	#32,d1
	bfextu	(-8,a7){0:d1},d0
	bfextu	(-8,a7){d1:32},d1
	btst	#7,(-12,a7)
	bne.s	.Neg64
	rts
.Neg64:
	neg.l	d1
	negx.l	d0
	rts
.L1:
	moveq	#0,d0
	bfextu	(-8,a7){0:d1},d1
	btst	#7,(-12,a7)
	bne.s	.Neg64
	rts

Last edited by bebbo; 05 August 2022 at 11:58. Reason: added a missing tst.l d1 after negx.l |
05 August 2022, 15:01 | #18 | |
bye
Join Date: Jun 2016
Location: Some / Where
Posts: 680
|
Quote:
I also found what I forgot to change: __builtin_clzll must be replaced by __builtin_clz, otherwise the shift is way too big. => http://franke.ms/cex/z/Yd6E14
|
05 August 2022, 17:26 | #19 |
Registered User
Join Date: Feb 2017
Location: Denmark
Posts: 1,099
|
Cool, I did consider trying to extract the result from the extended precision result, but hadn't had time. Doesn't seem to be a massive speed improvement, but if the conversion/rounding can be optimized it might have potential. Results:
smul64: 69 cycles
sdiv64: 138 cycles
divllu (returning just quotient): 175 cycles

I measured overhead to be ~37 cycles (just calling a dummy asm routine with 2 stack arguments and storing a result in "r").

P.S. Isn't it a bit dangerous to use a stack "red zone" like that? Probably OK if not in supervisor mode, but seems sketchy |
05 August 2022, 18:35 | #20 | |
bye
Join Date: Jun 2016
Location: Some / Where
Posts: 680
|
Quote:
unfolded the sign handling and added proper stack handling which reduced the object size by 16 bytes. Should be a tad faster now. Code:
#extern long sdiv64(long hi asm("d1"), long lo asm("d0"), long d asm("d2"));
_sdiv64:
	.globl	_sdiv64
	tst.l	d1
	bne.s	.ldiv
	divs.l	d2,d0
	rts

.mi:
	move.l	#0x8000,a0
	fmove.d	#0e-1.25e-1,fp1
	neg.l	d0
	negx.l	d1
	tst.l	d1
	bne.s	.ldivs
	divs.l	d2,d0
	neg.l	d0
	rts

.ldiv:
	bmi	.mi
	sub.l	a0,a0
.ldivs:
	fmove.s	#0e1.25e-1,fp1
	exg	d3,a1
	bfffo	d1{#0,#0},d3
	sub.w	#32,d3
	neg.w	d3
	subq.l	#8,a7
	bfins	d1,(a7){0,d3}
	bfins	d0,(a7){d3,32}
	add.w	#16382+32,d3
	add.l	a0,d3
	swap	d3
	move.l	d3,-(a7)
	exg	d3,a1
	fmove.x	(a7),fp0
	fdiv.l	d2,fp0
	moveq	#0,d1		| pulled up
	fadd.x	fp1,fp0
#	fintrz.x fp0
#.toInt32:
#	moveq	#0,d1		| pulled up
	fmove.x	fp0,(a7)
	move.w	(a7),d1
	addq.l	#4,a7
	bmi.s	.Neg32
	sub.w	#16382,d1
	bfextu	(a7){0:d1},d0
	addq.l	#8,a7
	rts
.Neg32:
	sub.w	#16382+0x8000,d1
	bfextu	(a7){0:d1},d0
	addq.l	#8,a7
	neg.l	d0
	rts

# extern long long smul64(long a asm("d0"), long b asm("d1"));
_smul64:
	.globl	_smul64
	fmove.l	d0,fp0
	fmul.l	d1,fp0
#.toInt64:
	moveq	#0,d1
	fmove.x	fp0,-(a7)
	move.w	(a7),d1
	addq.l	#4,a7
	bmi.s	.Neg64
	sub.w	#16382,d1
	cmp.w	#32,d1
	ble.s	.L1
	sub.w	#32,d1
	bfextu	(a7){0:d1},d0
	bfextu	(a7){d1:32},d1
	addq.l	#8,a7
	rts
.Neg64:
	sub.w	#16382+0x8000,d1
	cmp.w	#32,d1
	ble.s	.L1neg
	sub.w	#32,d1
	bfextu	(a7){0:d1},d0
	bfextu	(a7){d1:32},d1
	addq.l	#8,a7
	neg.l	d1
	negx.l	d0
	rts
.L1:
	moveq	#0,d0
	bfextu	(a7){0:d1},d1
	addq.l	#8,a7
	rts
.L1neg:
	moveq	#0,d0
	bfextu	(a7){0:d1},d1
	addq.l	#8,a7
	neg.l	d1
	rts

Last edited by bebbo; 05 August 2022 at 18:44. Reason: pulled up an insn after the fdiv |
|