English Amiga Board Optimizing the 68020+ 32-bit math

15 May 2021, 15:44   #101
litwr
Senior Member

Join Date: Mar 2016
Location: Ozherele
Posts: 227
Quote:
 Originally Posted by saimo The trick exploits the sign-extension of suba.w to avoid clearing the upper word of a register and thus save two cycles. As a side effect, d7 is no longer needed. suba.w d5,a5 works fine when bit #15 of d5 is 0, i.e. when d5 < 32768; given that d5 is the remainder of the division, the following condition applies: * hence, it must be 7*D-1 < 32769 -> D < (32769+1)/7 = 4681.
Thank you very much. But you know the program must not reduce the maximum number of digits artificially.
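
The sign-extension constraint in the quoted trick can be illustrated with a small Python model. This is only a sketch of the arithmetic, not 68k code: the function names are hypothetical, and registers are modelled as plain integers masked to 32 bits.

```python
def sign_extend16(word):
    """Interpret the low 16 bits as a signed value, as a .w source operand
    is treated when the destination is an address register."""
    word &= 0xFFFF
    return word - 0x10000 if word & 0x8000 else word


def suba_w(a5, d5):
    """Model of suba.w d5,a5: subtract the sign-extended low word of d5."""
    return (a5 - sign_extend16(d5)) & 0xFFFFFFFF


def sub_unsigned_low_word(a5, d5):
    """What the trick relies on: subtract the low word of d5 as an unsigned
    value, as if the upper word had been cleared first."""
    return (a5 - (d5 & 0xFFFF)) & 0xFFFFFFFF


# Low word below 32768: both agree, whatever is left in the upper word of d5.
print(suba_w(100000, 0x12347FFF) == sub_unsigned_low_word(100000, 0x12347FFF))  # True
# Bit 15 set: the sign extension turns the subtraction into an addition.
print(suba_w(100000, 0x8000) == sub_unsigned_low_word(100000, 0x8000))  # False
```

This is why the trick needs the remainder in d5 to stay below 32768, which in turn limits the number of digits.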

Quote:
 Originally Posted by saimo The trick doesn't bring more speed on 68030 because suba.w takes 4 cycles instead of 2, unlike on the other 68020+ CPUs - actually, I'm starting to wonder if it's an erratum on the MC68020UM or MC68030UM, given the strong similarity of the two CPUs (my memory tells me that there actually is a penalty, but recently it failed me precisely regarding similar matters). Maybe I'll check it out later.
What crazy timings! This reminds me of the 80386, which became much slower than the 80286 on rotations despite having a barrel shifter!

Quote:
 Originally Posted by saimo To avoid losing cycles for the calculation of d (the mulu part), the latter has to change from: d = d/2 + r[i]*10000 to: d = (d + r[i]*20000)/2 This works also when d is odd thanks to the fact that r[i]*20000 is always even.
There is no d = d/2 + r[i]*10000 in my code, so your substitution probably will not work.

Quote:
 Originally Posted by saimo Whether such a condition is acceptable is another story. If not, a clr.w d5 before swap.w d5 is enough to remove suba.w d5,a5 and move .l2 one line up: this will cancel the 2-cycle gain and bring the limit of D back to the original value (7*D-1 < 65536 -> D < (65537+1)/7 = 9362) - but, of course, this is useful only in the non-mulu-optimization case.
Thus we don't have any gain.

15 May 2021, 15:50   #102
litwr
Senior Member

Join Date: Mar 2016
Location: Ozherele
Posts: 227
Quote:
 Originally Posted by meynaf Use EQUR.
Thank you very much.

15 May 2021, 18:34   #103
saimo
Registered User

Join Date: Aug 2010
Location: Italy
Posts: 407
Quote:
 Originally Posted by litwr What crazy timings! This reminds me of the 80386, which became much slower than the 80286 on rotations despite having a barrel shifter!
Please check out the edit in my previous post: I have added actual measurements regarding those timings.

Quote:
 There is no d = d/2 + r[i]*10000 in my code so your substitution rather will not work.
What's this, then?
Code:
```
   lsr.l d5
   move -(a3),d0   ;r[i]
   mulu d1,d0      ;r[i]*10000
   add.l d0,d5     ;d += d + r[i]*10000
   move.l d5,d6
```
You are dividing d5 by 2 (at least, I guess that's the intention: lsr.l d5 does not even exist in M68k assembly, but lsr.w <memory> does exist for 1-bit shifts), and then adding r[i]*10000 to it (and I guess that d += d + r[i]*10000 is a wrong comment: it's either d += r[i]*10000 or d = d + r[i]*10000).
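
The substitution under discussion, d = d/2 + r[i]*10000 becoming d = (d + r[i]*20000)/2, can be checked with a short Python sketch. The function names are hypothetical, and the model uses unbounded Python integers, so it says nothing about whether d + r[i]*20000 still fits in a 32-bit register.

```python
def halve_then_add(d, r_i):
    """Original order: shift d right by one bit, then add r[i]*10000."""
    return (d >> 1) + r_i * 10000


def add_then_halve(d, r_i):
    """Proposed order: add r[i]*20000 first, then shift right by one bit."""
    return (d + r_i * 20000) >> 1


# r[i]*20000 is always even, so the bit shifted out is the low bit of d
# in both versions and the results agree for odd d as well.
mismatches = [
    (d, r_i)
    for d in range(2000)
    for r_i in range(50)
    if halve_then_add(d, r_i) != add_then_halve(d, r_i)
]
print(mismatches)  # []
```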

Quote:
 Thus we don't have any gain.
It's still useful to free a data register in the mulu version.

Last edited by saimo; 15 May 2021 at 19:14.

15 May 2021, 19:10   #104
Don_Adan
Registered User

Join Date: Jan 2008
Location: Warsaw/Poland
Age: 53
Posts: 1,327
You can use the following optimisations:
Code:
```
    lea cv(pc),a0
;   add cv(pc),d5        ;c + d/10000
    add.w (A0),D5
;   swap d5              ;c <- d%10000
;   move d5,cv
    move.l D5,(A0)+      ; A0 is buf now
;   clr d5
;   swap d5
    and.l #$0000FFFF,D5  ; or ext.l D5, if D5.W can't be negative
;   bsr PR0000
    bsr.w PR0001
    endif
    sub.w #14,d6         ;kv
    bne .l0

PR0000 ;prints d5, uses a0,a1(scratch),d0,d1,d2,d3
    lea.l buf(pc),a0
PR0001
    move.l a0,d2
    bsr.s .l1
    moveq #4,d3
    move.l cout(pc),d1
    jmp Write(a6)        ;call Write(stdout,buff,size)

.l1 divu #1000,d5
    bsr .l0
    clr d5
    swap d5

    divu #100,d5
    bsr .l0
    clr d5
    swap d5

    divu #10,d5
    bsr .l0
    swap d5

.l0 eori.b #'0',d5
    move.b d5,(a0)+
    rts

; cv dc.w 0
time dc.l 0
cout dc.l 0
    cnop 0,4             ; for fastest longword write
cv  dc.l 0
buf ds.b 4
```
Not tested, but it may work, I think.

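The digit extraction in the PR0000 routine above relies on DIVU leaving the quotient in the lower word and the remainder in the upper word of the destination register. The following Python sketch models that register shuffling; the helper names are hypothetical, DIVU overflow is not modelled beyond an assertion, and the input is assumed to be in the range 0..9999.

```python
def divu(d5, divisor):
    """Model of DIVU.W: 32-bit dividend, 16-bit divisor.
    Result: remainder in the upper word, quotient in the lower word."""
    quotient, remainder = divmod(d5 & 0xFFFFFFFF, divisor)
    assert quotient <= 0xFFFF, "real DIVU would signal overflow here"
    return (remainder << 16) | quotient


def clr_w(d5):
    """clr d5 (word size): clear the lower word, keep the upper word."""
    return d5 & 0xFFFF0000


def swap(d5):
    """swap d5: exchange the upper and lower words."""
    return ((d5 << 16) | (d5 >> 16)) & 0xFFFFFFFF


def four_digits(value):
    """Return the four ASCII digits of value (assumed 0..9999)."""
    out = bytearray()
    d5 = value
    for divisor in (1000, 100, 10):
        d5 = divu(d5, divisor)
        out.append((d5 & 0xFF) ^ ord("0"))  # eori.b #'0',d5 ; move.b d5,(a0)+
        d5 = swap(clr_w(d5))                # remainder becomes the next dividend
    out.append((d5 & 0xFF) ^ ord("0"))      # last remainder is the units digit
    return bytes(out)


print(four_digits(1234))  # b'1234'
print(four_digits(7))     # b'0007'
```

The XOR with '0' works as an addition because each digit is below 16 and the low nibble of '0' (0x30) is zero.
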
15 May 2021, 19:31   #105
Don_Adan
Registered User

Join Date: Jan 2008
Location: Warsaw/Poland
Age: 53
Posts: 1,327
BTW, you can remove clr cv from the source code; it is not necessary.

16 May 2021, 11:27   #106
litwr
Senior Member

Join Date: Mar 2016
Location: Ozherele
Posts: 227
Quote:
 Originally Posted by saimo Please check out the edit in my previous post: I have added actual measurements regarding those timings.
Thank you. So the timings are practically identical for the 68020 and 68030. However, it is crazy that SUBA.W is slower than SUBA.L.

Quote:
 Originally Posted by saimo What's this, then?
Code:
```
   lsr.l d5
   move -(a3),d0   ;r[i]
   mulu d1,d0      ;r[i]*10000
   add.l d0,d5     ;d += d + r[i]*10000
   move.l d5,d6
```
You are dividing d5 by 2 (at least, I guess that's the intention: lsr.l d5 does not even exist in M68k assembly, but lsr.w <memory> does exist for 1-bit shifts), and then adding r[i]*10000 to it (and I guess that d += d + r[i]*10000 is a wrong comment: it's either d += r[i]*10000 or d = d + r[i]*10000).
Sorry, it seems I confused some things. So your code is correct. But why have you written that lsr.l d5 does not even exist in M68k assembly? It exists and is quite useful, it is a shorthand version for lsr.l #1,d5. Thank you for pointing out the typo in the comments.

Quote:
 Originally Posted by saimo It's still useful to free a data register in the mulu version.
The mulu variant is only useful for the 68000 and maybe the 68060. I have doubts about the latter.

Quote:
 Originally Posted by Don_Adan You can use next optimisations:
I checked it and found that your code is 4 bytes longer. And any speed optimization outside the main loop is pointless unless you find a way to save hundreds of cycles.

Quote:
 Originally Posted by Don_Adan btw. you can remove from source code ; clr cv is not necessary.
Thank you. This gives us 9280 digits for the Amiga 1200 variant!

16 May 2021, 14:27   #107
Don_Adan
Registered User

Join Date: Jan 2008
Location: Warsaw/Poland
Age: 53
Posts: 1,327
Code:
```
    lea cv(pc),a0        ; 4 bytes opt
;   add cv(pc),d5        ;c + d/10000     4 bytes ori
    add.w (A0),D5        ; 2 bytes opt
;   swap d5              ;c <- d%10000    2 bytes ori
;   move d5,cv           ; 6 bytes ori
    move.l D5,(A0)+      ; A0 is buf now  2 bytes opt
;   clr d5               ; 2 bytes ori
;   swap d5              ; 2 bytes ori
    and.l #$0000FFFF,D5  ; or ext.l D5, if D5.W can't be negative  6 bytes or 2 bytes opt

; original code 16 bytes, optimised code 10 or 14 bytes max

; cv dc.w 0
time dc.l 0
cout dc.l 0
    cnop 0,4             ; for fastest longword write
cv  dc.l 0
buf ds.b 4
```
The code is shorter: if you place cv and buf correctly at the cnop 0,4 position without extra padding, the size will be the same, or 4 bytes shorter if you use ext.l D5. And when I asked, you wrote that speed is important, not size.

16 May 2021, 15:12   #108
saimo
Registered User

Join Date: Aug 2010
Location: Italy
Posts: 407
Quote:
 Originally Posted by litwr Thank you. So the timings are practically identical for the 68020 and 68030. However, it is crazy that SUBA.W is slower than SUBA.L.
Not really: the word operand has first to be sign-extended.

Quote:
 But why have you written that lsr.l d5 does not even exist in M68k assembly? It exists and is quite useful, it is a shorthand version for lsr.l #1,d5.
It is not official syntax. In fact, I had never even seen it. To me it looks like an ill-advised assembler-specific shorthand.

16 May 2021, 17:42   #109
Don_Adan
Registered User

Join Date: Jan 2008
Location: Warsaw/Poland
Age: 53
Posts: 1,327
Anyway, if you only need the shortest code, you can use:
Code:
```
    lea cv(PC),A0
    add.w (A0),D5
    move.l D5,(A0)
    ext.l D5             ; or "and.l #$0000FFFF,D5"
    bsr PR0000

time dc.l 0
cout dc.l 0
cv  dc.w 0
buf ds.b 4
```
But this code is slower, because an extra instruction (the lea in PR0000) is executed too.

17 May 2021, 07:34   #110
Don_Adan
Registered User

Join Date: Jan 2008
Location: Warsaw/Poland
Age: 53
Posts: 1,327
You can use the following version too; it is shorter by another 2 bytes.
Code:
```
    clr.l -(SP)          ; cv
.l0 ....
    add.w (SP),D5        ; cv
    move.l D5,(SP)       ; cv
    ext.l D5             ; or "and.l #$0000FFFF,D5"
    bsr PR0000
    endif
    sub.w #14,d6         ;kv
    bne .l0
    addq.l #4,SP

time dc.l 0
cout dc.l 0
; cv dc.w 0
buf ds.b 4
```

17 May 2021, 10:51   #111
litwr
Senior Member

Join Date: Mar 2016
Location: Ozherele
Posts: 227
Quote:
 Originally Posted by Don_Adan [code] The code is shorter: if you place cv and buf correctly at the cnop 0,4 position without extra padding, the size will be the same, or 4 bytes shorter if you use ext.l D5. And when I asked, you wrote that speed is important, not size.
Thank you. However I dare to repeat that the speed can't be affected by tiny optimizations outside the main loop.

Quote:
 Originally Posted by saimo Not really: the word operand has first to be sign-extended.
IMHO such things must be hardwired and instant.

Quote:
 Originally Posted by saimo It is not official syntax. In fact, I had never even seen it. To me it looks like an ill-advised assembler-specific shorthand.

Last edited by BippyM; 01 June 2021 at 18:24.

17 May 2021, 10:59   #112
britelite
Registered User

Join Date: Feb 2010
Location: Espoo / Finland
Posts: 782
Quote:
 Originally Posted by litwr Sorry, you are wrong about this. Just look at the next figure from the official manual.
No, he's correct. ROL <ea> refers to bit rotation in memory, not registers.

17 May 2021, 12:47   #113
Don_Adan
Registered User

Join Date: Jan 2008
Location: Warsaw/Poland
Age: 53
Posts: 1,327
Maybe this is only a small optimisation compared to the current PR0000 routine, which is very slow and wastes many CPU cycles. I think that you can write a much faster version of this routine, and you can learn something more about 68k coding. BTW, a small hint for you: use sub.w.

17 May 2021, 14:26   #114
roondar
Registered User

Join Date: Jul 2015
Location: The Netherlands
Posts: 2,890
Quote:
 Originally Posted by britelite No, he's correct. ROL <ea> refers to bit rotation in memory, not registers.
You are of course correct.

But to be fair here, that notation did catch me off guard a couple of times as well when I first started doing 68000 code. It took a little while for that to click. I guess it was because I was used to the 6502, which doesn't use this notation at all.

So for anyone else who, like me, might misread this: as long as you remember that EA means Effective Address, it'll all be fine.

18 May 2021, 18:56   #115
litwr
Senior Member

Join Date: Mar 2016
Location: Ozherele
Posts: 227
It seems that the code for the 68k pi-spigot variant has become almost perfect. Thanks a million to everybody. However I must list our major achievements:
1) a/b helped to find out the BVS optimization, it saves us 2 cycles;
2) modrobert discovered the MULU optimization advantage, this saves 4 cycles on the 68020 - it will be also interesting to run pi-mulu and pi-opt on the 68030 hardware to get exact number of saved cycles - http://eab.abime.net/showpost.php?p=...2&postcount=92 ;
3) saimo could get an almost impossible thing and save 2 cycles in the main loop;
4) Don_Adan helped to minimize the size of the code;
5) Thomas Richter provided some pieces of interesting information.
I have prepared maybe the last code for this thread. It helps clarify whether 68020/30 depends on alignment of an instruction after a jump. Please run pi-align and pi-na for me for 3000 digits on the 68020/30 hardware. These programs print only timings after about 30s of calculation.
Quote:
 Originally Posted by Don_Adan Maybe this is only a small optimisation compared to the current PR0000 routine, which is very slow and wastes many CPU cycles. I think that you can write a much faster version of this routine, and you can learn something more about 68k coding. BTW, a small hint for you: use sub.w.
Could you be less cryptic, please? Is there a way to make the code 100 cycles faster? Sorry, I can't find a way to use your hint. PR0000 was taken from 8-bit code, so it could probably be optimized better for a 16/32-bit processor. But it is outside the main loop, so optimizing it is worth little.
Quote:
 Originally Posted by saimo But now I'm busy with other stuff (a game).
I am curious: what kind of game are you talking about? BTW, do you know about my project http://aminet.net/package/game/misc/xlife-8 ?

Last edited by BippyM; 01 June 2021 at 18:24.

18 May 2021, 20:02   #116
litwr
Senior Member

Join Date: Mar 2016
Location: Ozherele
Posts: 227
Quote:
 Originally Posted by roondar You are of course correct. But to be fair here, that notation did catch me off guard a couple of times as well when I first started doing 68000 code. It took a little while for that to click. I guess it was because I was used to the 6502, which doesn't use this notation at all. So for anyone else who, like me, might misread this: as long as you remember that EA means Effective Address, it'll all be fine.
Thank you for the clarification. However, I use VASM and this is legal syntax there. I suspect that this is true for many other assemblers.

Last edited by litwr; 18 May 2021 at 20:32.

18 May 2021, 20:19   #117
Don_Adan
Registered User

Join Date: Jan 2008
Location: Warsaw/Poland
Age: 53
Posts: 1,327
The Write code runs in the loop too. You can check this version. If I have made no mistakes, it can be faster by more than 100 cycles.
Code:
```
    clr.l -(SP)          ; cv
.l0 ....
    add.w (SP),D5        ; cv
    move.l D5,(SP)       ; cv
    bsr PR0000
    endif
    sub.w #14,d6         ;kv
    bne .l0
    addq.l #4,SP         ; restore stack
    .....
PR0000 ;prints d5, uses a0,a1(scratch),d0,d1,d2,d3
    lea $100.W,A0
    move.l #$303A3030,D2 ; was $30403030; $3A = '0'+10 offsets the final d5-10
    move.w #1000,D3
b1000
    sub.w D3,D5
    bcs.b n100
    add.w A0,D2          ; thousands digit +1
    bra.b b1000
n100
    add.w D3,D5
    moveq #100,D3
b100
    sub.w D3,D5
    bcs.b n10
    addq.b #1,D2         ; hundreds digit +1
    bra.b b100
n10
    add.w D3,D5
    swap D2
    moveq #10,D3
b10
    sub.w D3,D5
    bcs.b n1
    add.w A0,D2          ; tens digit +1
    bra.b b10
n1
    add.b D5,D2          ; d5 holds units-10 here
    lea cout(PC),A0
    move.l (A0)+,D1
    move.l D2,(A0)
    move.l A0,D2         ; buf
    moveq #4,D3
    jmp Write(A6)        ;call Write(stdout,buff,size)

time dc.l 0
cout dc.l 0
buf dc.l 0
```

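The idea behind this version of PR0000 is to replace the three DIVU instructions with repeated subtraction, which needs at most nine successful subtractions per digit for values up to 9999. A minimal Python sketch of that technique follows; the function name is hypothetical, it models only the arithmetic (not the register packing or the cycle counts), and the input is assumed to be in the range 0..9999.

```python
def four_digits_by_subtraction(value):
    """Return the four ASCII digits of value (assumed 0..9999),
    counting how many times each decimal weight can be subtracted."""
    digits = []
    for weight in (1000, 100, 10):
        count = 0
        while value >= weight:  # corresponds to the sub.w / bcs.b loop
            value -= weight
            count += 1
        digits.append(count)
    digits.append(value)        # what is left is the units digit
    return bytes(ord("0") + digit for digit in digits)


print(four_digits_by_subtraction(1234))  # b'1234'
print(four_digits_by_subtraction(9999))  # b'9999'
```

Whether this actually beats the DIVU version depends on the CPU and on the digit values, so it would have to be measured.
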
18 May 2021, 21:53   #118
saimo
Registered User

Join Date: Aug 2010
Location: Italy
Posts: 407
Quote:
 Originally Posted by litwr Sorry, you are wrong about this. Just look at the next figure from the official manual.
That isn't the official manual. The official manual is this. Check out page 3-155 (218).

Quote:
 Originally Posted by litwr I am curious: what kind of game are you talking about?
It's a minor game _not_ written in assembly, but in AMOS Professional (the fun is precisely to make a game that implements original effects that would have been unthinkable before in such a language). Basically, all that's left to do is the music - I have the base ready, but I'm rather uninspired lately. It's called Follix, and you can see the last (and a bit outdated) preview in this video. (For completeness, I've also had the update of this 100% assembly game in the works for a very long time, and I'm stuck because there's one last tiny showstopper that happens only on a very specific machine; the one betatester of mine who could run tests for me isn't able to help at the moment, and I couldn't find anyone else with an appropriate machine.)

Quote:
 BTW do you know about my project http://aminet.net/package/game/misc/xlife-8 ?
To be honest, no, I didn't know about it. It looks like a research/"science" game, right?

18 May 2021, 23:14   #119
modrobert
old bearded fool

Join Date: Jan 2010
Location: Bangkok
Age: 53
Posts: 572
Quote:
 Originally Posted by litwr I have prepared maybe the last code for this thread. It helps clarify whether 68020/30 depends on alignment of an instruction after a jump. Please run pi-align and pi-na for me for 3000 digits on the 68020/30 hardware. These programs print only timings after about 30s of calculation.
Code:
```
> pi-align
number pi calculator v12 (68020)
number of digits (up to 9280)? 3000
30.74
```
Code:
```
> pi-na
number pi calculator v12 (68020)
number of digits (up to 9280)? 3000
30.50
```
PS: Will try xlife-8, assuming it is a port of "game of life"?
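
For a sense of scale, the difference between the two timings above can be worked out with a few lines of Python. This is plain arithmetic on the two reported numbers; since each is a single run, the result says little about run-to-run variation.

```python
aligned = 30.74      # pi-align, seconds for 3000 digits
not_aligned = 30.50  # pi-na, seconds for 3000 digits

difference = aligned - not_aligned
relative = difference / not_aligned

print(f"{difference:.2f} s, {relative:.2%}")  # 0.24 s, 0.79%
```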

Last edited by modrobert; 18 May 2021 at 23:23.

19 May 2021, 01:02   #120
Bruce Abbott
Registered User

Join Date: Mar 2018
Location: Hastings, New Zealand
Posts: 668
Quote:
 Originally Posted by saimo I've also had the update of this 100% assembly game in the works for a very long time, and I'm stuck because there's one last tiny showstopper that happens only on a very specific machine,
What is the 'tiny showstopper', and what very specific machine induces it?
