English Amiga Board


Old 24 July 2022, 20:08   #1
paraj
Registered User
 
 
Join Date: Feb 2017
Location: Denmark
Posts: 1,099
060 Texture Mapping Optimization

Just for funsies I've been playing around with implementing fast perspective-correct texture mapping on my 060 (again, since I've forgotten most of what I knew about software rasterization). Got most things working following Chris Hecker's seminal series [0], and of course Kalms also joined in on the fun [1]. FWIW this mostly works, but I found that the 28.4 subpixel precision used in [0] isn't sufficient in practice to avoid going outside texture bounds (10 fractional bits seem to be enough, though).

Anyway, the optimized inner loop from [0] (final part) looks like this in 68k asm (my translation, assuming standard 16.16 at first):
Code:
; D0 = Initial Texture Offset: (U>>16)+(V>>16)*TextureWidth
; D1 = UFrac                 : U<<16 (0.32 fixpoint)
; D2 = VFrac                 : V<<16 (0.32 fixpoint)
; D3 = DUFrac                : DUDX<<16 (0.32 fixpoint)
; D4 = DVFrac                : DVDX<<16 (0.32 fixpoint)

; A0 = Dest                  : Where pixels will be drawn
; A1 = Texture
; A2 = UIntVintFrac          ; Points to second element of array containing: { (DUDX>>16)+((DVDX>>16)+1)*TextureWidth, 
                             ;                                                 (DUDX>>16)+(DVDX>>16)*TextureWidth }
                             ; i.e. element[0] is the normal step and element[-1] is the step when there's a V carry

        move.b (a1,d0.l),(a0)+      ; *Dest++ = Texture[Offset]
        add.l  d4,d2                ; VFrac += DVFrac
        subx.l d5,d5                ; Save carry
        move.l (a2,d5.l*4),d7       ; Get step from UIntVintFrac
        add.l  d3,d1                ; UFrac += DUFrac
        addx.l d7,d0                ; Offset += UintVint + carry
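For readers following along, here's a hypothetical C model of what that inner loop computes per pixel (my own names, not from [0]; assumes an 8-bit chunky texture, 16.16 U/V as in the register comments, and arithmetic right shift for negative values):

```c
#include <stdint.h>

/* One textured span: offset walks the texture while the 0.32 fractional
   parts of U and V generate carries, exactly as in the asm above. */
static void span(uint8_t *dest, const uint8_t *tex, int texw, int n,
                 int32_t u, int32_t v, int32_t dudx, int32_t dvdx)
{
    uint32_t offset = (uint32_t)((u >> 16) + (v >> 16) * texw);
    uint32_t ufrac  = (uint32_t)u << 16,    vfrac  = (uint32_t)v << 16;
    uint32_t dufrac = (uint32_t)dudx << 16, dvfrac = (uint32_t)dvdx << 16;
    int32_t step    = (dudx >> 16) + (dvdx >> 16) * texw; /* element[0]  */
    int32_t step_vc = step + texw;                        /* element[-1] */
    while (n--) {
        *dest++ = tex[offset];                 /* move.b (a1,d0.l),(a0)+ */
        uint32_t ov = vfrac; vfrac += dvfrac;  /* add.l d4,d2            */
        int vcarry = vfrac < ov;               /* subx.l d5,d5           */
        uint32_t ou = ufrac; ufrac += dufrac;  /* add.l d3,d1            */
        int ucarry = ufrac < ou;               /* X flag for addx        */
        offset += (uint32_t)(vcarry ? step_vc : step) + (uint32_t)ucarry;
    }
}
```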
With that long prelude, my question is: can we do better than my optimized attempt (assume the loop is unrolled 8/16 times), and are my annotations correct (assuming no cache misses)?

Code:
        add.l   d4,d2               ; Cycle 1 [D0 ready in 2]
        ; sOEP free

        subx.l  d5,d5               ; Cycle 2 [D0 ready in 1]
        ; sOEP unavailable

        move.b  (a1,d0.l),d6        ; Cycle 3 [D0 ready?, D5 ready in 2]
        add.l   d3,d1

        move.b  d6,(a0)+            ; Cycle 4 [D5 ready in 1]
        ; sOEP free

        move.l  (a2,d5.l*4),d7      ; Cycle 5 [D5 ready?]
        ; sOEP free

        addx.l  d7,d0               ; Cycle 6
        ; sOEP unavailable
(I can clean up my testbed if anyone wants to play along at home).

[0]: http://www.chrishecker.com/Miscellan...nical_Articles
[1]: https://www.lysator.liu.se/~mikaelk/...ectivetexture/
[2]: http://ada.untergrund.net/?p=boardthread&id=19 (other good discussions, not referenced above)

Last edited by paraj; 24 July 2022 at 20:13.
Old 24 July 2022, 23:20   #2
pipper
Registered User
 
Join Date: Jul 2017
Location: San Jose
Posts: 652
I'm not super familiar with 060 dual issue stuff, but you may want to take a look at Georg Steger's 060 DOOM floor mapping routines.
https://github.com/mheyer32/DoomAtta...gine.asm#L8128

I found it clever how the fractional parts of the texture coordinates always stay in the upper word, while the integer part stays in the lower word. No swaps etc necessary.
He did it by "cross-packing" the X/Y parts like this:

deltas:
a = fracDY|intDX
b = fracDX|intDY

x,y:
c = fracY|intX
d = fracX|intY

Now if you do
c = ADDX(c, a)
and fracDY overflows, then
d = ADDX(d, b)
will add 1 to intY, and if in that same addition fracX overflows, the next
c = ADDX(c, a) will increase the integer part of X,
and so forth...

This might be old news to you, but I found it a pretty cool technique!
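A little C model of that cross-packed stepping might look like this (hypothetical names; the point is that the ADDX carry chain alternates between the two packed registers, so each fractional overflow bumps the other register's integer half on the next add):

```c
#include <stdint.h>

/* c = fracY<<16 | intX, d = fracX<<16 | intY, stepped by
   a = fracDY<<16 | intDX, b = fracDX<<16 | intDY.
   x is the X (extend) flag carried between the two ADDX instructions;
   the carry out of d's add feeds the *next* call's add to c. */
static unsigned step_xpacked(uint32_t *c, uint32_t *d,
                             uint32_t a, uint32_t b, unsigned x)
{
    uint64_t t = (uint64_t)*c + a + x;           /* addx.l a,c */
    *c = (uint32_t)t;
    t = (uint64_t)*d + b + (unsigned)(t >> 32);  /* addx.l b,d */
    *d = (uint32_t)t;
    return (unsigned)(t >> 32);
}
```

For example, starting at (X,Y) = (0.75, 0.75) with steps (1.5, 0.5), two calls leave intX = 3 and intY = 1 in the low words, matching X = 3.75, Y = 1.75.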
Old 25 July 2022, 12:45   #3
paraj
Thanks, sounds similar to what's described at https://amycoders.org/opt/innerloops.html but it's nice to see it put into practice with 060 annotations. Even if it's old news, the point is that I've forgotten most of it.

Managed to squeeze a bit more performance out of it by doing more, but less complicated work:
Code:
        add.l   d4,d2           ; VF+=DVF                  Cycle 0/5 [sOEP in subsequent ones]
        subx.l  d7,d7           ; Save carry               Cycle 1 pOEP-only
        move.b  (a1,d0.l),d6    ; Load pixel               Cycle 2 pOEP
        add.l   a2,d0           ; Add Int(DU)+Int(DV)*TextureWidth sOEP
        add.l   d3,d1           ; UF+=DUF                  Cycle 3 pOEP
        and.l   d5,d7           ; VCarry*TextureWidth              sOEP
        addx.l  d7,d0           ; +=VCarry*TextureWidth    Cycle 4 pOEP-only
        move.b  d6,(a0)+        ; Store pixel              Cycle 5 pOEP
And do some proper timing:
Code:
            Time (ns)   Per pixel (cycles)
Overhead    651.6       N/A
Original    3468.8      8.80
Optimized   2512.2      5.81
New         2286.7      5.11
The per pixel numbers were obtained by subtracting the measured overhead (looping/calling the function/pushing/popping registers/returning) and dividing by cycle time (20ns) and number of pixels (16). Slightly noisy, but they line up nicely with the expected numbers.
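For anyone checking the table, the reduction is just (a sketch using the stated constants: 20 ns/cycle on a 50 MHz 060, 16-pixel span):

```c
/* Per-pixel cycles = (measured - overhead) / (ns per cycle) / pixels. */
static double per_pixel(double measured_ns, double overhead_ns)
{
    return (measured_ns - overhead_ns) / 20.0 / 16.0;
}
/* per_pixel(3468.8, 651.6) -> 8.80, per_pixel(2512.2, 651.6) -> 5.81,
   per_pixel(2286.7, 651.6) -> 5.11 (rounded to two decimals) */
```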
Old 31 July 2022, 21:08   #4
Rock'n Roll
German Translator
 
 
Join Date: Aug 2018
Location: Drübeck / Germany
Age: 49
Posts: 183
Just for information: I found this massive 3D coding page a while ago.
https://mikro.naprvyraz.sk/docs/
Old 02 August 2022, 17:50   #5
paraj
Quote:
Originally Posted by Rock'n Roll View Post
Just for information: I found this massive 3D coding page a while ago.
https://mikro.naprvyraz.sk/docs/
Blast from the past with a lot of those documents. Think I first saw them while browsing https://hornet.org/code/ (which is apparently still up) back in the '90s.


For 060-specific stuff it seems that reaching for the FPU should be preferred to (high-precision) fixed-point math (sorry LC owners). Really hurts that they cut the 32x32->64-bit multiply and 64/32->32 division from HW. Lack of the former also means that you can't even turn most 32-bit integer divisions by a constant into multiplications to increase performance.
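As an aside, this is the kind of strength reduction the missing multiply blocks: on a CPU with 32x32->64 (one mulu.l Dh:Dl on 020-040), a division by a constant becomes a multiply by a precomputed reciprocal. A sketch for x/10 using the well-known magic constant:

```c
#include <stdint.h>

/* 0xCCCCCCCD = ceil(2^35 / 10); (x * m) >> 35 equals x / 10 for every
   32-bit x, needing only the high half of a 32x32->64 multiply plus a
   shift. */
static uint32_t div10(uint32_t x)
{
    return (uint32_t)(((uint64_t)x * 0xCCCCCCCDu) >> 35);
}
```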
Old 02 August 2022, 23:16   #6
SpeedGeek
Moderator
 
 
Join Date: Dec 2010
Location: Wisconsin USA
Age: 60
Posts: 839
Quote:
Originally Posted by paraj View Post
For 060-specific stuff it seems that reaching for the FPU should be preferred to (high-precision) fixed-point math (sorry LC owners). Really hurts that they cut the 32x32->64-bit multiply and 64/32->32 division from HW. Lack of the former also means that you can't even turn most 32-bit integer divisions by a constant into multiplications to increase performance.
I was just about to question what specific code on the 060 would be optimized any differently than 020-040 code. But then I remembered that the unimplemented 060 instructions are probably the best qualified ones for optimization.

So, I will take the opportunity to remind you about the often forgotten 64 bit u/s multiply functions of utility.library. These functions were handled by the exception trap (ISP code) in older 68060.libraries.

That's exactly why I put some extra work into both libraries here:

- Added optimized Mult64u/s ISP patch to utility.library
functions (Much faster than exception trap code)

https://eab.abime.net/showthread.php?t=96791
https://eab.abime.net/showthread.php?t=101115

Unfortunately, there were no Div64u/s functions in utility.library to patch.
Old 03 August 2022, 13:47   #7
Thomas Richter
Registered User
 
Join Date: Jan 2019
Location: Germany
Posts: 3,216
Quote:
Originally Posted by SpeedGeek View Post
So, I will take the opportunity to remind you about the often forgotten 64 bit u/s multiply functions of utility.library. These functions were handled by the exception trap (ISP code) in older 68060.libraries.
*Cough* Pretty much every 68060.library I'm aware of takes care of these vectors. Actually, since 3.1.4 and up, utility.library takes care of them.


Quote:
Originally Posted by SpeedGeek View Post

Unfortunately, there were no Div64u/s functions in utility.library to patch.

That's not the case anymore. 3.2 and up provides a 64/32 division in utility. Or actually two of them. No need to patch them, the utility.library already provides an implementation of them that is 68060-friendly.
Thomas Richter is offline  
Old 03 August 2022, 15:20   #8
SpeedGeek
Quote:
Originally Posted by Thomas Richter View Post
*Cough* Pretty much every 68060.library I'm aware of takes care of these vectors. Actually, since 3.1.4 and up, utility.library takes care of them.
*Cough* your awareness is obviously somewhat limited. The Carsten S. 68060 libraries didn't patch them at all. The Tekmagic060 and P5 68060 libraries patched but didn't optimize them:

http://aminet.net/package/util/boot/Mult64Patch

http://aminet.net/package/util/boot/UtilPatch


Quote:
Originally Posted by Thomas Richter View Post
That's not the case anymore. 3.2 and up provides a 64/32 division in utility. Or actually two of them. No need to patch them, the utility.library already provides an implementation of them that is 68060-friendly.
Once again, your condescending attitude towards backwards compatibility shows its face.
But thanks for the information anyway; for the next library release I should add an exec version test before the utility.library patch (so OS 3.1.4+ users can get their money's worth *Cough*).

Last edited by SpeedGeek; 03 August 2022 at 16:18.
Old 03 August 2022, 16:22   #9
Thomas Richter
Quote:
Originally Posted by SpeedGeek View Post
[...]

What exactly does that show? That there are useless patches available on Aminet? I surely agree with that. It does not invalidate my statement.


Quote:
Originally Posted by SpeedGeek View Post
Once again, your condescending attitude towards backwards compatibility shows its face.
I beg your pardon. This is not a "backwards compatibility issue". It is an issue of "completeness of implementation", and that issue is fixed in almost any 68060.library that has been around. Not only mine. There is no compatibility issue here at all.



While at it, be aware that there are a couple of other issues with UMult64/SMult64 on some processors and some kickstarts, namely returning the result in the wrong order (hi and lo swapped). Later versions of SetPatch take care of that as well. The 3.1 SetPatch probably does not. It *may* be included in the 3.9 SetPatch, though I'm not sure at this moment.


Note that the above means that potentially any patch you install there may be overridden by SetPatch once again if the returned order is incorrect.
Old 03 August 2022, 17:19   #10
SpeedGeek
Quote:
Originally Posted by Thomas Richter View Post
What exactly does that show? That there are useless patches available on Aminet? I surely agree with that. It does not invalidate my statement.
"Useless" patches available on Aminet? Certainly not! You consider them useless because they don't serve any purpose which you think is important or beneficial.

Quote:
Originally Posted by Thomas Richter View Post
I beg your pardon. This is not a "backwards compatibility issue". It is an issue of "completeness of implementation", and that issue is fixed in almost any 68060.library that has been around. Not only mine. There is no compatibility issue here at all.

While at it, be aware that there are a couple of other issues with UMult64/SMult64 on some processors and some kickstarts, namely returning the result in the wrong order (hi and lo swapped). Later versions of SetPatch take care of that as well. The 3.1 SetPatch probably does not. It *may* be included in the 3.9 SetPatch, though I'm not sure at this moment.

Note that the above means that potentially any patch you install there may be overridden by SetPatch once again if the returned order is incorrect.
I just downloaded the release notes for OS 3.1.4.1 and guess what?

Code:
-------------------- AmigaOS 3.1.4.(1) project -----------------------

Changes for release 45.1 (1.1.2018)

- The 64-bit math routines return now in the 68000 version
  the results in proper order, namely with the low 32 bits
  in register d0 and the high 32 bits in register d1.

- Added 68060 specialized versions of 64bit math. Note that
  these functions are currently never enabled as exec does
  not identify the 68060.

- Improved the 68000 version of SMult64 a tiny little bit.

- Retired the 68020-only version of utility.
Now, how exactly do users get their money's worth from new but "Never Enabled" functions?
Old 03 August 2022, 18:29   #11
paraj
Since both vbcc and (newer) gcc call their own support libraries when compiling for 060 rather than relying on either the 68060.library fallback or utility.library, and I know to avoid those instructions in my asm code, it doesn't seem super relevant what some patches may or may not do.

My point was that common optimizations that apply on the 486/Pentium (and maybe 040?), like favoring 16.16 fixed point over floating point in certain situations, rarely translate to the 060. For cases where you absolutely need e.g. 128-bit intermediate results things may be different, but for demo/game stuff that hardly ever applies.
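To make that concrete: the bread-and-butter fixed-point operation is exactly the one that needs the dropped instruction. A hypothetical C version of a 16.16 multiply, which a compiler for a full 020-040 turns into a single muls.l Dh:Dl but which traps (or calls a support routine) on the 060:

```c
#include <stdint.h>

/* 16.16 * 16.16 -> 16.16: the result is the middle 32 bits of the
   signed 32x32->64 product. */
static int32_t fxmul(int32_t a, int32_t b)
{
    return (int32_t)(((int64_t)a * b) >> 16);
}
/* fxmul(0x18000, 0x20000): 1.5 * 2.0 -> 0x30000 (3.0) */
```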
Old 03 August 2022, 18:31   #12
Thomas Richter
Quote:
Originally Posted by SpeedGeek View Post
"Useless" patches available on Aminet? Certainly not! You consider them useless because they don't serve any purpose which you think is important or beneficial.
They *are* useless because every 68060.library I'm aware of provides those features. Sorry if yours was incomplete and required them.


Quote:
Originally Posted by SpeedGeek View Post

Now, how exactly do users get their money's worth from new but "Never Enabled" functions?
You should probably read this stuff to the very end. That's all old news. Utility does have a 68060 detection function in it, no need to worry, even as of 3.1.4.
Old 03 August 2022, 19:36   #13
SpeedGeek
Quote:
Originally Posted by Thomas Richter View Post
They *are* useless because every 68060.library I'm aware of provides those features. Sorry if yours was incomplete and required them.
Now, you didn't read what I wrote completely. The Aminet patches and the code I added to Carsten's 68060.library address both issues: patching and optimization. The only difference is that the Aminet patches use some other optimized 64-bit code, while I decided to optimize the Motorola/Freescale ISP code.

Quote:
Originally Posted by Thomas Richter View Post
You should probably read this stuff to the very end. That's all old news. Utility does have a 68060 detection function in it, no need to worry, even as of 3.1.4.
I did read this stuff to the end. Now, how would I know that the Official OS 3.1.4.1 release notes don't agree with you to the end?


Quote:
Originally Posted by paraj View Post
Since both vbcc and (newer) gcc call their own support libraries when compiling for 060 rather than relying on either the 68060.library fallback or utility.library, and I know to avoid those instructions in my asm code, it doesn't seem super relevant what some patches may or may not do.

My point was that common optimizations that apply on the 486/Pentium (and maybe 040?), like favoring 16.16 fixed point over floating point in certain situations, rarely translate to the 060. For cases where you absolutely need e.g. 128-bit intermediate results things may be different, but for demo/game stuff that hardly ever applies.
That's a valid point which could make much of the discussion here pointless. However, you might want to do some performance testing of your compiler's libraries and see how well they compare against the patched and/or library functions discussed here.

Last edited by SpeedGeek; 04 August 2022 at 14:50.
Old 04 August 2022, 07:23   #14
Thomas Richter
As far as the MULU/MULS utility.library code is concerned, there is not really much to optimize. Frankly, if your code is really time critical, then the switch between 68020-68040 and 68060 code needs to be made at a much coarser level, as otherwise the additional instructions to move values to the source registers, call utility and move the results back to where they are needed will eat up any micro-optimization in the above function. That does not make it useless; it is just your average service function you need, for example, for large disk support, e.g. by the FFS, HDToolBox and Workbench. I believe the FFS use case triggered the necessity of having 060 detection within utility.library, so that 060 systems without an F-space "debug ROM" could safely boot from large disks.

The 64-bit divide is another issue, as its full case (the full 64/32 divide) requires a bit more care. Some optimizations have been performed there that go beyond the actual Mot ISP code, which is just a straight implementation of Knuth's "Algorithm D", but I believe this requires a bit more thought and a more careful comparison against the classical "egyptian" (binary) division algorithm if you want to be faster. Not that it matters often.
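For reference, the "egyptian" (binary) alternative is plain shift-and-subtract long division. A minimal C sketch of a restoring 64/32->32 divide in that style (assumes d != 0 and hi < d so the quotient fits in 32 bits; 32 iterations, one conditional subtract each):

```c
#include <stdint.h>

/* Binary restoring division: (hi:lo) / d -> 32-bit quotient.
   Precondition: d != 0 and hi < d. */
static uint32_t div64_32(uint32_t hi, uint32_t lo, uint32_t d)
{
    uint32_t q = 0;
    for (int i = 0; i < 32; i++) {
        uint32_t carry = hi >> 31;        /* bit shifted out of hi */
        hi = (hi << 1) | (lo >> 31);      /* shift (hi:lo) left    */
        lo <<= 1;
        q <<= 1;
        if (carry || hi >= d) {           /* remainder >= divisor  */
            hi -= d;                      /* (wraps correctly when */
            q |= 1;                       /*  carry was set)       */
        }
    }
    return q;                             /* hi holds the remainder */
}
```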
Old 04 August 2022, 14:25   #15
bebbo
bye
 
Join Date: Jun 2016
Location: Some / Where
Posts: 680
There is a maybe usable divide implementation: https://ridiculousfish.com/blog/post...episode-v.html

Adjusted for the 64/32: http://franke.ms/cex/z/j78asa
Old 04 August 2022, 20:01   #16
paraj
Quote:
Originally Posted by bebbo View Post
There is a maybe usable divide implementation: https://ridiculousfish.com/blog/post...episode-v.html

Adjusted for the 64/32: http://franke.ms/cex/z/j78asa
Haven't debugged it, but it seems to do a division by zero when I try it. I probably need to recompile my cross-compiler, but even taking the assembly code directly from Compiler Explorer gives the same result, and I still haven't actually gotten the support libraries properly compiled for 060 (meaning they rely on emulated instructions). Will have to look into that later.

Otherwise doing a (poor) synthetic benchmark with VBCC I get:
Code:
BuiltinMul           1626 ns [81 cycles]
BuiltinDiv           5490 ns [275 cycles]
UMult64              1418 ns [71 cycles]
EmuMul               62797 ns [3140 cycles]
EmuDiv               73514 ns [3676 cycles]
FMul                 1966 ns [98 cycles]
FDiv                 2330 ns [116 cycles]
(The Mult64 patch seems to improve UMult64 to 68 cycles). This is on KS 39.106 (so no UDivMod64) with 68060.library 47.1. I timed 100000 function calls to something that looks like:
Code:
volatile uint32_t x=0x20000000;
volatile uint32_t y=0x10000000;
volatile uint64_t r;
void BuiltinMul(void) { r = (uint64_t)x * y; }
uint64_t __EmuMul(__reg("d0") ULONG, __reg("d1") ULONG) = "\tmulu.l\td0,d0:d1";
void EmuMul(void) { r = __EmuMul(x, y); }
void FMul(void) { r = (uint64_t)((double)x * y); }
VBCC seems to use a full 64x64 multiplication (even though it isn't necessary), which is why UMult64 comes out ahead. FMul/FDiv incur a lot of overhead in converting to/from floating point ((u)int64<->double is especially costly) but still win by a large margin for divisions.

Not dunking on anyone involved here, and of course stuff can be optimized, but like I wrote earlier I think it's clear that for the 060 (with an FPU) it's not a good idea to use e.g. 16.16 fixed point instead of floats/doubles if you need to do anything apart from adding and subtracting.
Old 05 August 2022, 11:25   #17
bebbo
Since I don't have access to a real 68060... what about this implementation?
Code:
#extern long  sdiv64(long hi asm("d1"), long lo asm("d0"), long d asm("d2"));
_sdiv64: .globl _sdiv64
	tst.l	d1
	bne.s	.ldiv
	divs.l	d2,d0
	rts

.mi:
	move.l	#0x8000,a0
	fmove.d	#0e-1.25e-1,fp1
	neg.l	d0
	negx.l	d1
	tst.l	d1
	bne.s	.ldivs
	divs.l	d2,d0
	neg.l	d0
	rts

.ldiv:
	bmi	.mi
	sub.l	a0,a0
.ldivs:	
	fmove.s	#0e1.25e-1,fp1
	exg		d3,a1
	bfffo	d1{#0,#0},d3
	sub.w	#32,d3
	neg.w	d3
	
	bfins	d1,(-8,a7){0,d3}
	bfins	d0,(-8,a7){d3,32}
	
	add.w	#16382 + 32,d3
	add.l	a0,d3	
	swap	d3
	move.l	d3,(-12,a7)
	exg		d3,a1
	
	fmove.x	(-12,a7),fp0
	fdiv.l	d2,fp0
	fadd.x	fp1,fp0
#	fintrz.x	fp0
	
.toInt32:
	moveq	#0,d1
	fmove.x fp0,(-12,a7)
	move.w	(-12,a7),d1
	and.w	#0x7fff,d1
	sub.w	#16382,d1
	bfextu	(-8,a7){0:d1},d0
	btst	#7,(-12,a7)
	bne.s	.Neg32
	rts
.Neg32:
	neg.l	d0
	rts

# extern long long smul64(long a asm("d0"), long b asm("d1"));
_smul64: .globl _smul64
	fmove.l	d0,fp0
	fmul.l	d1,fp0

.toInt64:
	moveq	#0,d1
	fmove.x fp0,(-12,a7)
	move.w	(-12,a7),d1
	and.w	#0x7fff,d1
	sub.w	#16382,d1
	cmp.w	#32,d1
	ble.s		.L1
	sub.w	#32,d1
	bfextu	(-8,a7){0:d1},d0
	bfextu	(-8,a7){d1:32},d1
	btst	#7,(-12,a7)
	bne.s	.Neg64
	rts
.Neg64:
	neg.l	d1
	negx.l	d0
	rts
.L1:
	moveq	#0,d0
	bfextu	(-8,a7){0:d1},d1
	btst	#7,(-12,a7)
	bne.s	.Neg64
	rts
For the values I tested it seems to work. Is this approach too slow? Or does the rounding mode kill it?

Last edited by bebbo; 05 August 2022 at 11:58. Reason: added a missing tst.l d1 after negx.l
Old 05 August 2022, 15:01   #18
bebbo
Quote:
Originally Posted by bebbo View Post
There is a maybe usable divide implementation: https://ridiculousfish.com/blog/post...episode-v.html

Adjusted for the 64/32: http://franke.ms/cex/z/j78asa

I also found what I forgot to change:


__builtin_clzll must be replaced by __builtin_clz, otherwise the shift is way too big.



=> http://franke.ms/cex/z/Yd6E14
Old 05 August 2022, 17:26   #19
paraj
Cool, I did consider trying to extract the result from the extended precision result, but hadn't had time. Doesn't seem to be a massive speed improvement, but if the conversion/rounding can be optimized it might have potential. Results:
smul64: 69 cycles
sdiv64: 138 cycles
divllu (returning just quotient): 175 cycles
I measured overhead to be ~37 cycles (just calling a dummy asm routine with 2 stack arguments and storing a result in "r").

P.S. Isn't it a bit dangerous to use a stack "red zone" like that? Probably OK if not in supervisor mode, but it seems sketchy.
Old 05 August 2022, 18:35   #20
bebbo
Quote:
Originally Posted by paraj View Post
Cool, I did consider trying to extract the result from the extended precision result, but hadn't had time. Doesn't seem to be a massive speed improvement, but if the conversion/rounding can be optimized it might have potential. Results:
smul64: 69 cycles
sdiv64: 138 cycles
divllu (returning just quotient): 175 cycles
I measured overhead to be ~37 cycles (just calling a dummy asm routine with 2 stack arguments and storing a result in "r").

P.S. isn't it a bit dangerous to use a stack "red zone" like that? Probably OK if not in supervisor mode, but seems sketchy

Unfolded the sign handling and added proper stack handling, which reduced the object size by 16 bytes. Should be a tad faster now.



Code:
#extern long sdiv64(long hi asm("d1"), long lo asm("d0"), long d asm("d2"));
_sdiv64:	.globl	_sdiv64
	tst.l	d1
	bne.s	.ldiv
	divs.l	d2,d0
	rts

.mi:
	move.l	#0x8000,a0
	fmove.d	#0e-1.25e-1,fp1
	neg.l	d0
	negx.l	d1
	tst.l	d1
	bne.s	.ldivs
	divs.l	d2,d0
	neg.l	d0
	rts

.ldiv:
	bmi	.mi
	sub.l	a0,a0
.ldivs:	
	fmove.s	#0e1.25e-1,fp1
	exg		d3,a1
	bfffo	d1{#0,#0},d3
	sub.w	#32,d3
	neg.w	d3

	subq.l	#8,a7
	bfins	d1,(a7){0,d3}
	bfins	d0,(a7){d3,32}
	
	add.w	#16382+32,d3
	add.l	a0,d3	
	swap	d3
	move.l	d3,-(a7)
	exg		d3,a1
	
	fmove.x	(a7),fp0
	fdiv.l	d2,fp0
	moveq	#0,d1	| pulled up
	fadd.x	fp1,fp0
#	fintrz.x	fp0
	
#.toInt32:
#	moveq	#0,d1	| pulled up
	fmove.x	fp0,(a7)
	move.w	(a7),d1
	addq.l	#4,a7
	bmi.s	.Neg32
	sub.w	#16382,d1
	bfextu	(a7){0:d1},d0	
	addq.l	#8,a7
	rts
.Neg32:
	sub.w	#16382+0x8000,d1
	bfextu	(a7){0:d1},d0
	addq.l	#8,a7
	neg.l	d0
	rts

#	extern long long smul64(long a asm("d0"), long b asm("d1"));
_smul64:	.globl	_smul64
	fmove.l	d0,fp0
	fmul.l	d1,fp0

#.toInt64:
	moveq	#0,d1
	fmove.x	fp0,-(a7)
	move.w	(a7),d1
	addq.l	#4,a7
	bmi.s	.Neg64
	sub.w	#16382,d1
	cmp.w	#32,d1
	ble.s	.L1
	sub.w	#32,d1
	bfextu	(a7){0:d1},d0
	bfextu	(a7){d1:32},d1
	addq.l	#8,a7
	rts
	
.Neg64:	
	sub.w	#16382+0x8000,d1
	cmp.w	#32,d1
	ble.s	.L1neg
	sub.w	#32,d1
	bfextu	(a7){0:d1},d0
	bfextu	(a7){d1:32},d1
	addq.l	#8,a7
	
	neg.l	d1
	negx.l	d0
	rts
.L1:
	moveq	#0,d0
	bfextu	(a7){0:d1},d1
	addq.l	#8,a7
	rts

.L1neg:
	moveq	#0,d0
	bfextu	(a7){0:d1},d1
	addq.l	#8,a7

	neg.l	d1
	rts

Last edited by bebbo; 05 August 2022 at 18:44. Reason: pulled up an insn after the fdiv