Optimizing the 68020+ 32-bit math - Page 6

litwr · 15 May 2021, 15:44

Quote:

Originally Posted by saimo

The trick exploits the sign-extension of suba.w to avoid clearing the upper word of a register and thus save two cycles. As a side effect, d7 is no longer needed.

suba.w d5,a5 works fine when bit #15 of d5 is 0, i.e. when d5 < 32768; given that d5 is the remainder of the division, the following condition applies:
* hence, it must be 7*D-1 < 32769 -> D < (32769+1)/7 = 4681.

Thank you very much. But you know the program must not reduce the maximum number of digits artificially.

Quote:

Originally Posted by saimo

The trick doesn't bring more speed on 68030 because suba.w takes 4 cycles instead of 2, unlike on the other 68020+ CPUs - actually, I'm starting to wonder if it's an erratum on the MC68020UM or MC68030UM, given the strong similarity of the two CPUs (my memory tells me that there actually is a penalty, but recently it failed me precisely regarding similar matters). Maybe I'll check it out later.

What crazy timings!

This reminds me of the 80386, which became much slower than the 80286 on rotations despite having a barrel shifter!

Quote:

Originally Posted by saimo

To avoid losing cycles for the calculation of d (the mulu part), the latter has to change from:
d = d/2 + r[i]*10000
to:
d = (d + r[i]*20000)/2
This works also when d is odd thanks to the fact that r[i]*20000 is always even.

There is no d = d/2 + r[i]*10000 in my code, so your substitution will probably not work.

Quote:

Originally Posted by saimo

Whether such a condition is acceptable is another story. If not, a clr.w d5 before swap.w d5 is enough to remove suba.w d5,a5 and move .l2 one line up: this will cancel the 2-cycle gain and bring the limit of D back to the original value (7*D-1 < 65537 -> D < (65537+1)/7 = 9362) - but, of course, this is useful only in the non-mulu-optimization case.

Thus we don't have any gain.

litwr · 15 May 2021, 15:50

Quote:

Originally Posted by meynaf

Use EQUR.

Thank you very much.

saimo · 15 May 2021, 18:34

Quote:

Originally Posted by litwr

What crazy timings!

This reminds me of the 80386, which became much slower than the 80286 on rotations despite having a barrel shifter!

Please check out the edit in my previous post: I have added actual measurements regarding those timings.

Quote:

There is no d = d/2 + r[i]*10000 in my code, so your substitution will probably not work.

What's this, then?

Code:

   lsr.l d5
   move -(a3),d0   ;r[i]
   mulu d1,d0      ;r[i]*10000
   add.l d0,d5     ;d += d + r[i]*10000
   move.l d5,d6

You are dividing d5 by 2 (at least, I guess that's the intention: lsr.l d5 does not even exist in M68k assembly, but lsr.w <memory> does exist for 1-bit shifts), and then adding r[i]*10000 to it (and I guess that d += d + r[i]*10000 is a wrong comment: it's either d += r[i]*10000 or d = d + r[i]*10000).
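As a sanity check on rewriting the update from d = d/2 + r[i]*10000 to d = (d + r[i]*20000)/2, here is a small Python sketch. It is not part of anyone's posted code, the function names are invented for illustration, and it only models the arithmetic (unsigned values, logical right shift by one), not the instruction timing:

```python
def update_shift_first(d, r):
    # Models: lsr.l #1,d5 / mulu #10000,d0 / add.l d0,d5
    return (d >> 1) + r * 10000


def update_shift_last(d, r):
    # Models the rewritten form: add r*20000 first, halve afterwards
    return (d + r * 20000) >> 1


def forms_agree(max_d, r_values):
    # True if both forms give the same result for every d in [0, max_d)
    return all(
        update_shift_first(d, r) == update_shift_last(d, r)
        for d in range(max_d)
        for r in r_values
    )
```

Because r[i]*20000 is always even, adding it before the shift never changes the bit that the shift discards, which is why the two forms agree for odd d as well.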

Quote:

Thus we don't have any gain.

It's still useful to free a data register in the mulu version.

Don_Adan · 15 May 2021, 19:10

You can use the following optimisations:

Code:

 lea cv(pc),a0
; add cv(pc),d5 ;c + d/10000
 add.w (A0),D5
; swap d5 ;c <- d%10000
; move d5,cv
 move.l D5,(A0)+ ; A0 is buf now
; clr d5
; swap d5
 and.l #$0000FFFF,D5 ; or ext.l D5, if D5.W cant be minus value
; bsr PR0000
 bsr.w PR0001
 endif
 sub.w #14,d6 ;kv
 bne .l0


PR0000 ;prints d5, uses a0,a1(scratch),d0,d1,d2,d3
 lea.l buf(pc),a0
PR0001
 move.l a0,d2
 bsr.s .l1
 moveq #4,d3
 move.l cout(pc),d1
 jmp Write(a6) ;call Write(stdout,buff,size)
.l1
 divu #1000,d5
 bsr .l0
 clr d5
 swap d5
 divu #100,d5
 bsr .l0
 clr d5
 swap d5
 divu #10,d5
 bsr .l0
 swap d5
.l0
 eori.b #'0',d5
 move.b d5,(a0)+
 rts

; cv dc.w 0
 time dc.l 0
 cout dc.l 0
 cnop 0,4 ; for fastest longword write
 cv dc.l 0
 buf ds.b 4

Not tested, but it should work, I think.
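For readers who want to follow the divu/swap sequence in PR0001 without an emulator, here is a rough Python model. It is only a sketch: the helper names are invented, divu overflow is not modelled, and it assumes the upper word of d5 is zero on entry (which is what the and.l or ext.l before the call is for):

```python
def divu_w(divisor, dn):
    # divu.w: 32-bit dividend, 16-bit divisor;
    # remainder ends up in the upper word, quotient in the lower word
    quotient, remainder = divmod(dn & 0xFFFFFFFF, divisor)
    return (remainder << 16) | (quotient & 0xFFFF)


def swap(dn):
    # swap: exchange the upper and lower words
    return ((dn << 16) | (dn >> 16)) & 0xFFFFFFFF


def clr_w(dn):
    # clr.w: clear the lower word only
    return dn & 0xFFFF0000


def pr0001(d5):
    buf = bytearray()

    def emit(reg):
        # .l0: eori.b #'0',d5 / move.b d5,(a0)+
        reg = (reg & 0xFFFFFF00) | ((reg ^ ord("0")) & 0xFF)
        buf.append(reg & 0xFF)
        return reg

    d5 = emit(divu_w(1000, d5))         # thousands digit
    d5 = emit(divu_w(100, swap(clr_w(d5))))  # hundreds digit
    d5 = emit(divu_w(10, swap(clr_w(d5))))   # tens digit
    emit(swap(d5))                      # units digit is the last remainder
    return bytes(buf)
```

Note that eori.b #'0' behaves like an addition here only because a digit 0..9 shares no set bits with $30.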

Don_Adan · 15 May 2021, 19:31

BTW, you can remove this line from the source code:

; clr cv

It is not necessary.

litwr · 16 May 2021, 11:27

Quote:

Originally Posted by saimo

Please check out the edit in my previous post: I have added actual measurements regarding those timings.

Thank you. So the timings are nearly identical for the 68020 and 68030. However, it is crazy that SUBA.W is slower than SUBA.L.

Quote:

Originally Posted by saimo

What's this, then?

Code:

   lsr.l d5
   move -(a3),d0   ;r[i]
   mulu d1,d0      ;r[i]*10000
   add.l d0,d5     ;d += d + r[i]*10000
   move.l d5,d6

You are dividing d5 by 2 (at least, I guess that's the intention: lsr.l d5 does not even exist in M68k assembly, but lsr.w <memory> does exist for 1-bit shifts), and then adding r[i]*10000 to it (and I guess that d += d + r[i]*10000 is a wrong comment: it's either d += r[i]*10000 or d = d + r[i]*10000).

Sorry, it seems I confused some things. So your code is correct. But why have you written that lsr.l d5 does not even exist in M68k assembly? It exists and is quite useful, it is a shorthand version for lsr.l #1,d5. Thank you for pointing out the typo in the comments.

Quote:

Originally Posted by saimo

It's still useful to free a data register in the mulu version.

The mulu variant is only useful for the 68000 and maybe the 68060. I have doubts about the latter.

Quote:

Originally Posted by Don_Adan

You can use the following optimisations:

I checked it and found out that your code became 4 bytes longer.

And any speed optimization outside the main loop is pointless unless you find a way to save hundreds of cycles.

Quote:

Originally Posted by Don_Adan

BTW, you can remove this line from the source code:
; clr cv
It is not necessary.

Thank you. This gives us 9280 digits for the Amiga 1200 variant!

Don_Adan · 16 May 2021, 14:27

Code:

 lea cv(pc),a0    4 bytes opt
; add cv(pc),d5 ;c + d/10000 4 bytes ori
 add.w (A0),D5  2 bytes opt
; swap d5 ;c <- d%10000 2 bytes ori
; move d5,cv 6 bytes ori
 move.l D5,(A0)+ ; A0 is buf now 2 bytes opt
; clr d5 2 bytes ori
; swap d5 2 bytes ori
 and.l #$0000FFFF,D5 ; or ext.l D5, if D5.W cant be minus value 6 bytes or 2 bytes opt
 original code 16 bytes, optimised code 10 or 14 bytes max
 

; cv dc.w 0
 time dc.l 0
 cout dc.l 0
 cnop 0,4 ; for fastest longword write
 cv dc.l 0
 buf ds.b 4

The code is shorter: if you place cv and buf correctly at the cnop 0,4 position without extra padding, the size will be the same, or 4 bytes shorter if you use ext.l D5.
And you wrote that speed is important, not size, when I asked.

saimo · 16 May 2021, 15:12

Quote:

Originally Posted by litwr

Thank you. So the timings are nearly identical for the 68020 and 68030. However, it is crazy that SUBA.W is slower than SUBA.L.

Not really: the word operand has first to be sign-extended.
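To illustrate the sign-extension point, and why the suba.w trick discussed earlier in the thread needs bit #15 of the source word to be 0, here is a hedged Python model. The helper names are made up, and max_digits simply takes the thread's condition 7*D-1 < limit as given rather than deriving it:

```python
def sign_extend16(word):
    # Interpret the low 16 bits as a signed value
    word &= 0xFFFF
    return word - 0x10000 if word & 0x8000 else word


def suba_w(src, an):
    # suba.w src,An: only the low word of src is used; it is
    # sign-extended to 32 bits and subtracted from all of An
    return (an - sign_extend16(src)) & 0xFFFFFFFF


def sub_zero_extended(src, an):
    # What a long subtract would give after clearing the upper word of src
    return (an - (src & 0xFFFF)) & 0xFFFFFFFF


def max_digits(limit):
    # Largest D with 7*D - 1 < limit, i.e. 7*D <= limit
    return limit // 7
```

With bit #15 clear, both subtractions give the same result whatever is left in the upper word of the source register; with bit #15 set they differ by $10000. Under the stated condition, max_digits(32769) gives 4681 and max_digits(65537) gives 9362.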

Quote:

But why have you written that lsr.l d5 does not even exist in M68k assembly? It exists and is quite useful, it is a shorthand version for lsr.l #1,d5.

It is not official syntax. In fact, I had never even seen it. To me it looks like an ill-advised assembler-specific shorthand.

Don_Adan · 16 May 2021, 17:42

Anyway, if you only need the shortest code, you can use:

Code:

 lea cv(PC),A0
 add.w (A0),D5
 move.l D5,(A0)
 ext.l D5           ; or "and.l #$0000FFFF,D5"
 bsr PR0000
 


 time dc.l 0
 cout dc.l 0
 cv dc.w 0
 buf ds.b 4

But this code is slower, because an extra instruction (the lea in PR0000) is executed too.

Don_Adan · 17 May 2021, 07:34

You can use the following version too; it is another 2 bytes shorter.

Code:

clr.l -(SP)  ; cv
.l0
....

 add.w (SP),D5 ; cv
 move.l D5,(SP) ; cv
 ext.l D5           ; or "and.l #$0000FFFF,D5"
 bsr PR0000
 endif
 sub.w #14,d6 ;kv
 bne .l0
 addq.l #4,SP

 time dc.l 0
 cout dc.l 0
; cv dc.w 0
 buf ds.b 4

litwr · 17 May 2021, 10:51

Quote:

Originally Posted by Don_Adan

[code]
The code is shorter: if you place cv and buf correctly at the cnop 0,4 position without extra padding, the size will be the same, or 4 bytes shorter if you use ext.l D5.
And you wrote that speed is important, not size, when I asked.

Thank you. However I dare to repeat that the speed can't be affected by tiny optimizations outside the main loop.

Quote:

Originally Posted by saimo

Not really: the word operand has first to be sign-extended.

IMHO such things must be hardwired and instant.

Quote:

Originally Posted by saimo

It is not official syntax. In fact, I had never even seen it. To me it looks like an ill-advised assembler-specific shorthand.

Sorry, you are wrong about this. Just look at the next figure from the official manual.

britelite · 17 May 2021, 10:59

Quote:

Originally Posted by litwr

Sorry, you are wrong about this. Just look at the next figure from the official manual.

No, he's correct. ROL <ea> refers to bit rotation in memory, not registers.

Don_Adan · 17 May 2021, 12:47

Maybe this is only a small optimisation compared to the current PR0000 routine, which is very slow and wastes many CPU cycles. I think that you can write a much faster version of this routine. And you can learn something more about 68k coding.
BTW, a small hint for you: use sub.w.

roondar · 17 May 2021, 14:26

Quote:

Originally Posted by britelite

No, he's correct. ROL <ea> refers to bit rotation in memory, not registers.

You are of course correct.

But to be fair here, that notation did catch me off guard a couple of times as well when I first started doing 68000 code. It took a little while for that to click. I guess it was because I was used to the 6502, which doesn't use this notation at all.

So for anyone else who, like me, might misread this: as long as you remember that EA means Effective Address, it'll all be fine.

litwr · 18 May 2021, 18:56

It seems that the code for the 68k pi-spigot variant has become almost perfect.

Thanks a million to everybody. However I must list our major achievements:
1) a/b helped to find out the BVS optimization, it saves us 2 cycles;
2) modrobert discovered the MULU optimization advantage, this saves 4 cycles on the 68020 - it will also be interesting to run pi-mulu and pi-opt on 68030 hardware to get the exact number of saved cycles - http://eab.abime.net/showpost.php?p=...2&postcount=92 ;
3) saimo could get an almost impossible thing and save 2 cycles in the main loop;
4) Don_Adan helped to minimize the size of the code;
5) Thomas Richter provided some pieces of interesting information.
I have prepared maybe the last code for this thread. It helps clarify whether 68020/30 depends on alignment of an instruction after a jump. Please run pi-align and pi-na for me for 3000 digits on the 68020/30 hardware. These programs print only timings after about 30s of calculation.

Quote:

Originally Posted by Don_Adan

Maybe this is only a small optimisation compared to the current PR0000 routine, which is very slow and wastes many CPU cycles. I think that you can write a much faster version of this routine. And you can learn something more about 68k coding.
BTW, a small hint for you: use sub.w.

Could you be a bit less cryptic? Is it a way to make the code 100 cycles faster? Sorry, I can't find a way to use your hint.

PR0000 was taken from 8-bit code, so it may be better optimized for a 16/32-bit processor. But it is outside the main loop. So its optimization is worth little.

Quote:

Originally Posted by saimo

But now I'm busy with other stuff (a game).

I am curious: what kind of game are you talking about? BTW, do you know about my project http://aminet.net/package/game/misc/xlife-8 ?

litwr · 18 May 2021, 20:02

Quote:

Originally Posted by roondar

You are of course correct.

But to be fair here, that notation did catch me off guard a couple of times as well when I first started doing 68000 code. It took a little while for that to click. I guess it was because I was used to the 6502, which doesn't use this notation at all.

So for anyone else who, like me, might misread this: as long as you remember that EA means Effective Address, it'll all be fine.

Thank you for the clarification. However I use VASM and this is a legal syntax there. I suspect that this is true for many other assemblers.

Don_Adan · 18 May 2021, 20:19

The Write code runs in the loop too. You can check this version. If I made no mistakes, it can be faster by more than 100 cycles.

Code:

 clr.l -(SP) ; cv
.l0
....
 add.w (SP),D5 ; cv
 move.l D5,(SP) ; cv
 bsr PR0000
 endif
 sub.w #14,d6 ;kv
 bne .l0
 addq.l #4,SP ; restore stack

.....


PR0000 ;prints d5, uses a0,a1(scratch),d0,d1,d2,d3

 lea $100.W,A0
 move.l #$303A3030,D2 ; units byte seeded with '0'+10: D5 holds units-10 at n1
 move.w #1000,D3
b1000
 sub.w D3,D5
 bcs.b n100
 add.w A0,D2
 bra.b b1000

n100
 add.w D3,D5
 moveq #100,D3
b100
 sub.w D3,D5
 bcs.b n10
 addq.b #1,D2
 bra.b b100

n10
 add.w D3,D5
 swap D2
 moveq #10,D3
b10
 sub.w D3,D5
 bcs.b n1
 add.w A0,D2
 bra.b b10
n1
 add.b D5,D2

 lea cout(PC),A0
 move.l (A0)+,D1 
 move.l D2,(A0)
 move.l A0,D2 ; buf
 moveq #4,D3
 jmp Write(A6) ;call Write(stdout,buff,size)



 time dc.l 0 
 cout dc.l 0
 buf dc.l 0
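The repeated-subtraction PR0000 above can be followed with a rough Python model. This is a sketch under stated assumptions, not a tested replacement: the helper names are invented, only D2 and the low word of D5 are modelled, and the Write call is replaced by returning the four bytes. The model seeds D2 with $303A3030, i.e. the units byte starts at '0'+10, because D5 holds units-10 when n1 is reached; seeding that byte with $40 instead would give a last character 6 too high.

```python
def sub_until_borrow(d5, step):
    # Repeat "sub.w step,D5" until it borrows (bcs taken).
    # Returns (number of successful subtractions, D5 after the borrowing sub).
    count = 0
    while d5 >= step:
        d5 -= step
        count += 1
    return count, (d5 - step) & 0xFFFF


def add_w(reg, value):
    # add.w: only the low word of the destination changes
    return (reg & 0xFFFF0000) | ((reg + value) & 0xFFFF)


def add_b(reg, value):
    # add.b / addq.b: only the low byte of the destination changes
    return (reg & 0xFFFFFF00) | ((reg + value) & 0xFF)


def swap(reg):
    return ((reg << 16) | (reg >> 16)) & 0xFFFFFFFF


def pr0000(d5, seed=0x303A3030):
    d2 = seed
    n, d5 = sub_until_borrow(d5 & 0xFFFF, 1000)
    d2 = add_w(d2, 0x100 * n)      # add.w A0,D2 once per thousand
    d5 = (d5 + 1000) & 0xFFFF      # n100: add.w D3,D5 undoes the last sub
    n, d5 = sub_until_borrow(d5, 100)
    d2 = add_b(d2, n)              # addq.b #1,D2 once per hundred
    d5 = (d5 + 100) & 0xFFFF       # n10: add.w D3,D5
    d2 = swap(d2)                  # thousands/hundreds move to the upper word
    n, d5 = sub_until_borrow(d5, 10)
    d2 = add_w(d2, 0x100 * n)      # add.w A0,D2 once per ten
    d2 = add_b(d2, d5)             # n1: add.b D5,D2 with D5 = units - 10
    return d2.to_bytes(4, "big")   # move.l D2,(A0) stores big-endian
```

Since only word and byte operations touch D5, whatever is left in its upper word does not matter, which is why the ext.l or and.l before the call can be dropped in this version.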

saimo · 18 May 2021, 21:53

Quote:

Originally Posted by litwr

Sorry, you are wrong about this. Just look at the next figure from the official manual.

That isn't the official manual. The official manual is this. Check out page 3-155 (218).

Quote:

Originally Posted by litwr

I am curious: what kind of game are you talking about?

It's a minor game _not_ written in assembly, but in AMOS Professional (the fun is precisely to make a game that implements original effects that would have been unthinkable before in such a language). Basically, all that's left to do is the music - I have the base ready, but I'm rather uninspired lately.

It's called Follix, and you can see the last (and a bit outdated) preview in this video. (For completeness, I've also had the update of this 100% assembly game in the works for a very long time, and I'm stuck because there's one last tiny showstopper that happens only on a very specific machine: the one betatester of mine who could run tests for me isn't able to help me at the moment, and I couldn't find anyone else with an appropriate machine.)

Quote:

BTW do you know about my project http://aminet.net/package/game/misc/xlife-8 ?

To be honest, no, I didn't know about it. It looks like a research/"science" game, right?

modrobert · 18 May 2021, 23:14

Quote:

Originally Posted by litwr

I have prepared maybe the last code for this thread. It helps clarify whether 68020/30 depends on alignment of an instruction after a jump. Please run pi-align and pi-na for me for 3000 digits on the 68020/30 hardware. These programs print only timings after about 30s of calculation.

Code:

> pi-align
number pi calculator v12 (68020)
number of digits (up to 9280)? 3000
 30.74

Code:

> pi-na
number pi calculator v12 (68020)
number of digits (up to 9280)? 3000
 30.50

PS: Will try xlife-8, assuming it is a port of "game of life"?

Bruce Abbott · 19 May 2021, 01:02

Quote:

Originally Posted by saimo

I've had also the update of this 100% assembly game in the works since a very long time, and I'm stuck because there's one last tiny showstopper that happens only on a very specific machine,

What is the 'tiny showstopper', and what very specific machine induces it?
