English Amiga Board


Old 15 May 2021, 15:44   #101
litwr
Registered User
 
Join Date: Mar 2016
Location: Ozherele
Posts: 229
Quote:
Originally Posted by saimo View Post
The trick exploits the sign-extension of suba.w to avoid clearing the upper word of a register and thus save two cycles. As a side effect, d7 is no longer needed.

suba.w d5,a5 works fine when bit #15 of d5 is 0, i.e. when d5 < 32768; given that d5 is the remainder of the division, the following condition applies:
* hence, it must be 7*D-1 < 32769 -> D < (32769+1)/7 = 4681.
Thank you very much. But you know the program must not reduce the maximum number of digits artificially.
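The sign-extension condition quoted above can be checked with a small model. This is only a sketch in Python, not 68k code: `A5` and `GARBAGE` are made-up illustration values, and the functions model the documented behaviour of `suba.w` (the word source operand is sign-extended to 32 bits before the subtraction).

```python
MASK32 = 0xFFFFFFFF

def sign_extend16(value):
    """Sign-extend the low 16 bits of value to 32 bits."""
    word = value & 0xFFFF
    return (word | 0xFFFF0000) if word & 0x8000 else word

def suba_w(an, dn):
    """Model of suba.w dn,an: the low word of dn is sign-extended, then subtracted."""
    return (an - sign_extend16(dn)) & MASK32

def sub_unsigned_word(an, dn):
    """What the loop needs: subtract the low word of dn as an unsigned value."""
    return (an - (dn & 0xFFFF)) & MASK32

A5 = 0x00020000          # hypothetical address, for illustration only
GARBAGE = 0xABCD0000     # hypothetical stale upper word left in d5

# Below 32768 the shortcut is exact, whatever sits in the upper word of d5.
safe_below_32768 = all(
    suba_w(A5, GARBAGE | r) == sub_unsigned_word(A5, r) for r in range(0x8000)
)
# First remainder for which sign extension gives a different address.
first_bad = next(
    r for r in range(0x10000) if suba_w(A5, r) != sub_unsigned_word(A5, r)
)
print(safe_below_32768, first_bad)
```

In this model the two subtractions agree for every remainder below 32768 and first diverge at exactly 32768, which is where bit #15 becomes set.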

Quote:
Originally Posted by saimo View Post
The trick doesn't bring more speed on 68030 because suba.w takes 4 cycles instead of 2, unlike on the other 68020+ CPUs - actually, I'm starting to wonder if it's an erratum on the MC68020UM or MC68030UM, given the strong similarity of the two CPUs (my memory tells me that there actually is a penalty, but recently it failed me precisely regarding similar matters). Maybe I'll check it out later.
What crazy timings! This reminds me of the 80386, which became much slower than the 80286 on rotations despite having a barrel shifter!

Quote:
Originally Posted by saimo View Post
To avoid losing cycles for the calculation of d (the mulu part), the latter has to change from:
d = d/2 + r[i]*10000
to:
d = (d + r[i]*20000)/2
This works also when d is odd thanks to the fact that r[i]*20000 is always even.
There is no d = d/2 + r[i]*10000 in my code, so your substitution probably will not work.
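The quoted rewrite of the halving step can be checked numerically. The sketch below assumes that d/2 means an unsigned shift right by one (integer division that drops the low bit) and that r[i] is a 16-bit value. The function names are mine, not from the program.

```python
def halve_then_add(d, r):
    """d = d/2 + r[i]*10000, with d/2 as an unsigned shift right by one."""
    return d // 2 + r * 10000

def add_then_halve(d, r):
    """d = (d + r[i]*20000)/2, the rearranged form."""
    return (d + r * 20000) // 2

# r*20000 is always even, so adding it never changes the bit that the
# shift discards; odd and even d therefore behave the same way.
identical = all(
    halve_then_add(d, r) == add_then_halve(d, r)
    for d in range(100000)
    for r in (0, 1, 2345, 9999, 65535)
)
print(identical)
```

Note that the rearranged form needs one more bit of headroom for the intermediate sum. Python integers hide that, so whether the sum still fits in a 32-bit register is not something this check covers.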

Quote:
Originally Posted by saimo View Post
Whether such a condition is acceptable is another story. If not, a clr.w d5 before swap.w d5 is enough to remove suba.w d5,a5 and move .l2 one line up: this will cancel the 2-cycle gain and bring the limit of D back to the original value (7*D-1 < 65536 -> D < (65537+1)/7 = 9362) - but, of course, this is useful only in the non-mulu-optimization case.
Thus we don't have any gain.
Old 15 May 2021, 15:50   #102
litwr
Registered User
 
Join Date: Mar 2016
Location: Ozherele
Posts: 229
Quote:
Originally Posted by meynaf View Post
Use EQUR.
Thank you very much.
Old 15 May 2021, 18:34   #103
saimo
Registered User
 
 
Join Date: Aug 2010
Location: Italy
Posts: 787
Quote:
Originally Posted by litwr View Post
What crazy timings! This reminds me of the 80386, which became much slower than the 80286 on rotations despite having a barrel shifter!
Please check out the edit in my previous post: I have added actual measurements regarding those timings.

Quote:
There is no d = d/2 + r[i]*10000 in my code, so your substitution probably will not work.
What's this, then?
Code:
   lsr.l d5
   move -(a3),d0   ;r[i]
   mulu d1,d0      ;r[i]*10000
   add.l d0,d5     ;d += d + r[i]*10000
   move.l d5,d6
You are dividing d5 by 2 (at least, I guess that's the intention: lsr.l d5 does not even exist in M68k assembly, but lsr.w <memory> does exist for 1-bit shifts), and then adding r[i]*10000 to it (and I guess that d += d + r[i]*10000 is a wrong comment: it's either d += r[i]*10000 or d = d + r[i]*10000).

Quote:
Thus we don't have any gain.
It's still useful to free a data register in the mulu version.

Last edited by saimo; 15 May 2021 at 19:14.
Old 15 May 2021, 19:10   #104
Don_Adan
Registered User
 
Join Date: Jan 2008
Location: Warsaw/Poland
Age: 55
Posts: 1,959
You can use the following optimisations:

Code:
 lea cv(pc),a0
; add cv(pc),d5 ;c + d/10000
 add.w (A0),D5
; swap d5 ;c <- d%10000
; move d5,cv
 move.l D5,(A0)+ ; A0 is buf now
; clr d5
; swap d5
 and.l #$0000FFFF,D5 ; or ext.l D5, if D5.W can't be negative
; bsr PR0000
 bsr.w PR0001
 endif
 sub.w #14,d6 ;kv
 bne .l0


PR0000 ;prints d5, uses a0,a1(scratch),d0,d1,d2,d3
 lea.l buf(pc),a0
PR0001
 move.l a0,d2
 bsr.s .l1
 moveq #4,d3
 move.l cout(pc),d1
 jmp Write(a6) ;call Write(stdout,buff,size)
.l1
 divu #1000,d5
 bsr .l0
 clr d5
 swap d5
 divu #100,d5
 bsr .l0
 clr d5
 swap d5
 divu #10,d5
 bsr .l0
 swap d5
.l0
 eori.b #'0',d5
 move.b d5,(a0)+
 rts

; cv dc.w 0
 time dc.l 0
 cout dc.l 0
 cnop 0,4 ; for fastest longword write
 cv dc.l 0
 buf ds.b 4
Not tested, but it should work, I think.
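To see what the divu/swap chain in PR0000 produces, here is a rough Python model of the listing above. It is a sketch under two assumptions: the value to print is below 10000, and the upper word of d5 is already clear on entry. The helper names mirror the instructions but are otherwise my own.

```python
def divu(d5, divisor):
    """divu #divisor,d5: quotient in the low word, remainder in the high word."""
    quotient, remainder = divmod(d5 & 0xFFFFFFFF, divisor)
    if quotient > 0xFFFF:
        raise OverflowError("real divu would set V and leave d5 unchanged")
    return (remainder << 16) | quotient

def swap(d5):
    """swap d5: exchange the two 16-bit halves."""
    return ((d5 << 16) | (d5 >> 16)) & 0xFFFFFFFF

def clr_w(d5):
    """clr d5 (word size): clear the low word only."""
    return d5 & 0xFFFF0000

def emit(d5, buf):
    """.l0: eori.b #'0',d5 then move.b d5,(a0)+"""
    d5 = (d5 & 0xFFFFFF00) | ((d5 ^ 0x30) & 0xFF)
    buf.append(d5 & 0xFF)
    return d5

def pr0000(value):
    """Return the four bytes that the routine would write to buf."""
    buf = bytearray()
    d5 = value & 0xFFFF
    for divisor in (1000, 100, 10):
        d5 = emit(divu(d5, divisor), buf)   # digit = quotient
        if divisor != 10:
            d5 = swap(clr_w(d5))            # continue with the remainder
    emit(swap(d5), buf)                     # last digit = final remainder
    return bytes(buf)

print(pr0000(314), pr0000(9999))
```

The eori works as an addition here because every digit is below 16, so XOR with $30 never touches a bit that is already set.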
Old 15 May 2021, 19:31   #105
Don_Adan
Registered User
 
Join Date: Jan 2008
Location: Warsaw/Poland
Age: 55
Posts: 1,959
BTW, you can remove this line from the source code:

; clr cv

It is not necessary.
Old 16 May 2021, 11:27   #106
litwr
Registered User
 
Join Date: Mar 2016
Location: Ozherele
Posts: 229
Quote:
Originally Posted by saimo View Post
Please check out the edit in my previous post: I have added actual measurements regarding those timings.
Thank you. So the timings are practically identical for the 68020 and 68030. However, it is crazy that SUBA.W is slower than SUBA.L.

Quote:
Originally Posted by saimo View Post
What's this, then?
Code:
   lsr.l d5
   move -(a3),d0   ;r[i]
   mulu d1,d0      ;r[i]*10000
   add.l d0,d5     ;d += d + r[i]*10000
   move.l d5,d6
You are dividing d5 by 2 (at least, I guess that's the intention: lsr.l d5 does not even exist in M68k assembly, but lsr.w <memory> does exist for 1-bit shifts), and then adding r[i]*10000 to it (and I guess that d += d + r[i]*10000 is a wrong comment: it's either d += r[i]*10000 or d = d + r[i]*10000).
Sorry, it seems I confused some things. So your code is correct. But why have you written that lsr.l d5 does not even exist in M68k assembly? It exists and is quite useful, it is a shorthand version for lsr.l #1,d5. Thank you for pointing out the typo in the comments.

Quote:
Originally Posted by saimo View Post
It's still useful to free a data register in the mulu version.
The mulu variant is only useful for the 68000 and maybe the 68060. I have doubts about the latter.

Quote:
Originally Posted by Don_Adan View Post
You can use the following optimisations:
I checked it and found that your code became 4 bytes longer. And any speed optimization outside the main loop is pointless unless you find a way to save hundreds of cycles.

Quote:
Originally Posted by Don_Adan View Post
BTW, you can remove this line from the source code:
; clr cv
It is not necessary.
Thank you. This gives us 9280 digits for the Amiga 1200 variant!
Old 16 May 2021, 14:27   #107
Don_Adan
Registered User
 
Join Date: Jan 2008
Location: Warsaw/Poland
Age: 55
Posts: 1,959
Code:
 lea cv(pc),a0 ; 4 bytes (opt)
; add cv(pc),d5 ;c + d/10000 ; 4 bytes (ori)
 add.w (A0),D5 ; 2 bytes (opt)
; swap d5 ;c <- d%10000 ; 2 bytes (ori)
; move d5,cv ; 6 bytes (ori)
 move.l D5,(A0)+ ; A0 is buf now ; 2 bytes (opt)
; clr d5 ; 2 bytes (ori)
; swap d5 ; 2 bytes (ori)
 and.l #$0000FFFF,D5 ; or ext.l D5, if D5.W can't be negative ; 6 bytes, or 2 bytes with ext.l (opt)
; original code: 16 bytes; optimised code: 10 bytes (ext.l) or 14 bytes (and.l)
 

; cv dc.w 0
 time dc.l 0
 cout dc.l 0
 cnop 0,4 ; for fastest longword write
 cv dc.l 0
 buf ds.b 4
The code is shorter: if cv and buf are placed so that cnop 0,4 adds no extra padding, the size is the same with and.l, or 4 bytes shorter with ext.l D5.
And when I asked, you wrote that speed is important, not size.
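The choice between and.l #$0000FFFF and ext.l, and the byte totals, can be cross-checked with a short Python sketch. The instruction sizes are copied from the annotations in the listing above. `STALE` is a made-up value standing in for whatever is left in the upper word of D5.

```python
def and_l_ffff(d5):
    """and.l #$0000FFFF,d5: clear the upper word."""
    return d5 & 0x0000FFFF

def ext_l(d5):
    """ext.l d5: copy bit 15 of the low word into bits 16-31."""
    word = d5 & 0xFFFF
    return (word | 0xFFFF0000) if word & 0x8000 else word

STALE = 0x270F0000   # hypothetical leftover in the upper word

# While the low word is below 32768 both instructions give the same result.
same_when_positive = all(
    and_l_ffff(STALE | w) == ext_l(STALE | w) for w in range(0x8000)
)
first_difference = next(
    w for w in range(0x10000) if and_l_ffff(w) != ext_l(w)
)

# Instruction sizes in bytes, as annotated in the post.
original = [4, 2, 6, 2, 2]       # add cv(pc) / swap / move d5,cv / clr / swap
optimised_and = [4, 2, 2, 6]     # lea / add.w / move.l / and.l
optimised_ext = [4, 2, 2, 2]     # lea / add.w / move.l / ext.l
sizes = (sum(original), sum(optimised_and), sum(optimised_ext))
print(same_when_positive, first_difference, sizes)
```

These totals cover the instructions only. Changing cv from dc.w to dc.l and any cnop padding add data bytes on top, which is presumably where the reported extra 4 bytes came from.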
Old 16 May 2021, 15:12   #108
saimo
Registered User
 
 
Join Date: Aug 2010
Location: Italy
Posts: 787
Quote:
Originally Posted by litwr View Post
Thank you. So the timings are practically identical for the 68020 and 68030. However, it is crazy that SUBA.W is slower than SUBA.L.
Not really: the word operand has first to be sign-extended.

Quote:
But why have you written that lsr.l d5 does not even exist in M68k assembly? It exists and is quite useful, it is a shorthand version for lsr.l #1,d5.
It is not official syntax. In fact, I had never even seen it. To me it looks like an ill-advised assembler-specific shorthand.
Old 16 May 2021, 17:42   #109
Don_Adan
Registered User
 
Join Date: Jan 2008
Location: Warsaw/Poland
Age: 55
Posts: 1,959
Anyway, if you only need the shortest code, you can use:
Code:
 lea cv(PC),A0
 add.w (A0),D5
 move.l D5,(A0)
 ext.l D5           ; or "and.l #$0000FFFF,D5"
 bsr PR0000
 


 time dc.l 0
 cout dc.l 0
 cv dc.w 0
 buf ds.b 4
But this code is slower, because an extra instruction (the lea in PR0000) is executed too.
Old 17 May 2021, 07:34   #110
Don_Adan
Registered User
 
Join Date: Jan 2008
Location: Warsaw/Poland
Age: 55
Posts: 1,959
You can also use the following version, which is another 2 bytes shorter.
Code:
 clr.l -(SP) ; cv
.l0
....

 add.w (SP),D5 ; cv
 move.l D5,(SP) ; cv
 ext.l D5           ; or "and.l #$0000FFFF,D5"
 bsr PR0000
 endif
 sub.w #14,d6 ;kv
 bne .l0
 addq.l #4,SP

 time dc.l 0
 cout dc.l 0
; cv dc.w 0
 buf ds.b 4
Old 17 May 2021, 10:51   #111
litwr
Registered User
 
Join Date: Mar 2016
Location: Ozherele
Posts: 229
Quote:
Originally Posted by Don_Adan View Post
[code]
The code is shorter: if cv and buf are placed so that cnop 0,4 adds no extra padding, the size is the same with and.l, or 4 bytes shorter with ext.l D5.
And when I asked, you wrote that speed is important, not size.
Thank you. However, I must repeat that tiny optimizations outside the main loop cannot noticeably affect the speed.

Quote:
Originally Posted by saimo View Post
Not really: the word operand has first to be sign-extended.
IMHO such things must be hardwired and instant.

Quote:
Originally Posted by saimo View Post
It is not official syntax. In fact, I had never even seen it. To me it looks like an ill-advised assembler-specific shorthand.
Sorry, you are wrong about this. Just look at the next figure from the official manual.

Last edited by BippyM; 01 June 2021 at 18:24.
Old 17 May 2021, 10:59   #112
britelite
Registered User
 
Join Date: Feb 2010
Location: Espoo / Finland
Posts: 818
Quote:
Originally Posted by litwr View Post
Sorry, you are wrong about this. Just look at the next figure from the official manual.
No, he's correct. ROL <ea> refers to bit rotation in memory, not registers.
Old 17 May 2021, 12:47   #113
Don_Adan
Registered User
 
Join Date: Jan 2008
Location: Warsaw/Poland
Age: 55
Posts: 1,959
Maybe this is only a small optimisation compared to the current PR0000 routine, which is very slow and wastes many CPU cycles. I think that you can write a much faster version of this routine. And you can learn something more about 68k coding.
BTW. Small hint for you, use sub.w.
Old 17 May 2021, 14:26   #114
roondar
Registered User
 
Join Date: Jul 2015
Location: The Netherlands
Posts: 3,410
Quote:
Originally Posted by britelite View Post
No, he's correct. ROL <ea> refers to bit rotation in memory, not registers.
You are of course correct

But to be fair here, that notation did catch me off guard a couple of times as well when I first started doing 68000 code. It took a little while for that to click. I guess it was because I was used to the 6502, which doesn't use this notation at all.

So for anyone else who, like me, might misread this: as long as you remember that EA means Effective Address, it'll all be fine.
Old 18 May 2021, 18:56   #115
litwr
Registered User
 
Join Date: Mar 2016
Location: Ozherele
Posts: 229
It seems that the code for the 68k pi-spigot variant has become almost perfect. Thanks a million to everybody. However I must list our major achievements:
1) a/b helped to find out the BVS optimization, it saves us 2 cycles;
2) modrobert discovered the MULU optimization advantage, this saves 4 cycles on the 68020 - it will also be interesting to run pi-mulu and pi-opt on the 68030 hardware to get the exact number of saved cycles - http://eab.abime.net/showpost.php?p=...2&postcount=92 ;
3) saimo managed an almost impossible thing and saved 2 cycles in the main loop;
4) Don_Adan helped to minimize the size of the code;
5) Thomas Richter provided some pieces of interesting information.
I have prepared what may be the last code for this thread. It helps clarify whether 68020/30 performance depends on the alignment of an instruction after a jump. Please run pi-align and pi-na for me for 3000 digits on the 68020/30 hardware. These programs print only timings after about 30s of calculation.
Quote:
Originally Posted by Don_Adan View Post
Maybe this is only a small optimisation compared to the current PR0000 routine, which is very slow and wastes many CPU cycles. I think that you can write a much faster version of this routine. And you can learn something more about 68k coding.
BTW. Small hint for you, use sub.w.
Could you be less cryptic, please? Is it a way to make the code 100 cycles faster? Sorry, I can't find a way to use your hint. PR0000 was taken from 8-bit code, so it could probably be optimized better for a 16/32-bit processor. But it is outside the main loop, so its optimization is worth little.
Quote:
Originally Posted by saimo View Post
But now I'm busy with other stuff (a game).
I am curious: what kind of game are you talking about? BTW, do you know about my project http://aminet.net/package/game/misc/xlife-8 ?

Last edited by BippyM; 01 June 2021 at 18:24.
Old 18 May 2021, 20:02   #116
litwr
Registered User
 
Join Date: Mar 2016
Location: Ozherele
Posts: 229
Quote:
Originally Posted by roondar View Post
You are of course correct

But to be fair here, that notation did catch me off guard a couple of times as well when I first started doing 68000 code. It took a little while for that to click. I guess it was because I was used to the 6502, which doesn't use this notation at all.

So for anyone else who, like me, might misread this: as long as you remember that EA means Effective Address, it'll all be fine.
Thank you for the clarification. However, I use VASM and this is legal syntax there. I suspect that this is true for many other assemblers.

Last edited by litwr; 18 May 2021 at 20:32.
Old 18 May 2021, 20:19   #117
Don_Adan
Registered User
 
Join Date: Jan 2008
Location: Warsaw/Poland
Age: 55
Posts: 1,959
The Write code runs inside the loop too. You can check this version. If I have made no mistakes, it can be more than 100 cycles faster.

Code:
 clr.l -(SP) ; cv
.l0
....
 add.w (SP),D5 ; cv
 move.l D5,(SP) ; cv
 bsr PR0000
 endif
 sub.w #14,d6 ;kv
 bne .l0
 addq.l #4,SP ; restore stack

.....


PR0000 ;prints d5, uses a0,a1(scratch),d0,d1,d2,d3

 lea $100.W,A0
 move.l #$303A3030,D2 ; '0' in each slot; ones slot is '0'+10 ($3A) because D5 = ones-10 at n1
 move.w #1000,D3
b1000
 sub.w D3,D5
 bcs.b n100
 add.w A0,D2
 bra.b b1000

n100
 add.w D3,D5
 moveq #100,D3
b100
 sub.w D3,D5
 bcs.b n10
 addq.b #1,D2
 bra.b b100

n10
 add.w D3,D5
 swap D2
 moveq #10,D3
b10
 sub.w D3,D5
 bcs.b n1
 add.w A0,D2
 bra.b b10
n1
 add.b D5,D2

 lea cout(PC),A0
 move.l (A0)+,D1 
 move.l D2,(A0)
 move.l A0,D2 ; buf
 moveq #4,D3
 jmp Write(A6) ;call Write(stdout,buff,size)



 time dc.l 0 
 cout dc.l 0
 buf dc.l 0
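The repeated-subtraction routine above can be traced with a small Python model. This is a sketch, not the poster's code. It assumes a packed start value of $303A3030, which is the value for which my trace yields ASCII digits: the ones slot needs '0'+10 because D5 still holds ones-10 when n1 is reached. The start value is a parameter so other constants can be tried, and the helper names are hypothetical.

```python
def swap(reg):
    """swap: exchange the two 16-bit halves of a 32-bit register."""
    return ((reg << 16) | (reg >> 16)) & 0xFFFFFFFF

def add_w(reg, value):
    """add.w: only the low word of reg changes."""
    return (reg & 0xFFFF0000) | ((reg + value) & 0xFFFF)

def add_b(reg, value):
    """add.b / addq.b: only the low byte of reg changes."""
    return (reg & 0xFFFFFF00) | ((reg + value) & 0xFF)

def count_digit(d5, d2, step, bump):
    """One b1000/b100/b10 loop: sub.w step,d5 until it borrows.

    Returns d5 as left by the borrowing sub.w (not yet restored) and d2.
    """
    while True:
        borrow = d5 < step              # carry flag after sub.w
        d5 = (d5 - step) & 0xFFFF
        if borrow:                      # bcs leaves the loop
            return d5, d2
        d2 = bump(d2)

def pr0000_sub(value, packed=0x303A3030):
    """Return the 4 bytes that move.l D2,(A0) would store in buf."""
    a0 = 0x100                          # lea $100.W,A0
    d5, d2 = value & 0xFFFF, packed
    d5, d2 = count_digit(d5, d2, 1000, lambda r: add_w(r, a0))  # thousands
    d5 = (d5 + 1000) & 0xFFFF           # n100: add.w D3,D5
    d5, d2 = count_digit(d5, d2, 100, lambda r: add_b(r, 1))    # hundreds
    d5 = (d5 + 100) & 0xFFFF            # n10: add.w D3,D5
    d2 = swap(d2)
    d5, d2 = count_digit(d5, d2, 10, lambda r: add_w(r, a0))    # tens
    d2 = add_b(d2, d5)                  # n1: d5 still holds ones - 10
    return d2.to_bytes(4, "big")        # 68k stores longwords big-endian

print(pr0000_sub(1234), pr0000_sub(9))
```

The model only looks at the digit bytes. It says nothing about cycle counts, which is the actual point of the routine.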
Old 18 May 2021, 21:53   #118
saimo
Registered User
 
 
Join Date: Aug 2010
Location: Italy
Posts: 787
Quote:
Originally Posted by litwr View Post
Sorry, you are wrong about this. Just look at the next figure from the official manual.
That isn't the official manual. The official manual is this. Check out page 3-155 (218).

Quote:
Originally Posted by litwr View Post
I am curious: what kind of game are you talking about?
It's a minor game _not_ written in assembly, but in AMOS Professional (the fun is precisely to make a game that implements original effects that would have been unthinkable before in such a language). Basically, all that's left to do is the music - I have the base ready, but I'm rather uninspired lately. It's called Follix, and you can see the last (and a bit outdated) preview in this video. (For completeness, I've also had the update of this 100% assembly game in the works for a very long time, and I'm stuck because there's one last tiny showstopper that happens only on a very specific machine: the one betatester of mine who could run tests for me isn't able to help me at the moment, and I couldn't find someone else with an appropriate machine.)

Quote:
BTW do you know about my project http://aminet.net/package/game/misc/xlife-8 ?
To be honest, no, I didn't know about it. It looks like a research/"science" game, right?
Old 18 May 2021, 23:14   #119
modrobert
old bearded fool
 
 
Join Date: Jan 2010
Location: Bangkok
Age: 56
Posts: 775
Quote:
Originally Posted by litwr View Post
I have prepared what may be the last code for this thread. It helps clarify whether 68020/30 performance depends on the alignment of an instruction after a jump. Please run pi-align and pi-na for me for 3000 digits on the 68020/30 hardware. These programs print only timings after about 30s of calculation.
Code:
> pi-align
number pi calculator v12 (68020)
number of digits (up to 9280)? 3000
 30.74
Code:
> pi-na
number pi calculator v12 (68020)
number of digits (up to 9280)? 3000
 30.50
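For scale, the gap between the two timings above works out as follows. This is plain arithmetic on the two printed numbers, taken as seconds. Each is a single run, so a difference this small may be within measurement noise.

```python
aligned = 30.74      # pi-align, 3000 digits, as printed above
unaligned = 30.50    # pi-na, 3000 digits, as printed above

delta = aligned - unaligned
percent = 100 * delta / unaligned
print(f"{delta:.2f} s difference, {percent:.2f} % of the pi-na time")
```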
PS: Will try xlife-8, assuming it is a port of "game of life"?

Last edited by modrobert; 18 May 2021 at 23:23.
Old 19 May 2021, 01:02   #120
Bruce Abbott
Registered User
 
 
Join Date: Mar 2018
Location: Hastings, New Zealand
Posts: 2,546
Quote:
Originally Posted by saimo View Post
I've also had the update of this 100% assembly game in the works for a very long time, and I'm stuck because there's one last tiny showstopper that happens only on a very specific machine,
What is the 'tiny showstopper', and what very specific machine induces it?
 

