20 May 2021, 22:51 | #161 |
Registered User
Join Date: Mar 2016
Location: Ozherele
Posts: 229
|
IMHO another, larger troll has just confused everything. Long ago we discussed ways to make the code shorter, but now we are looking for ways to make it faster. RawDoFmt can make the code shorter but slower.
|
20 May 2021, 23:03 | #162 |
Registered User
Join Date: Jul 2015
Location: The Netherlands
Posts: 3,408
|
Just for the record, this is not correct for any model of 68K CPU.
DIVU.W on the 68000 takes 140 cycles, with a maximum difference of less than 10% between the slowest and fastest possible times*. It never takes as little as 78 cycles. DIVU on the 68020/030 never takes 140 cycles: the highest cost is 79 cycles for DIVU.L (DIVU.W takes up to 44 cycles)**. DIVU on the 040 and 060 takes fewer cycles still, but I don't have the numbers on hand.
*) See page 8-4 of the 68000 user manual.
**) See page 8-30 of the 68020 user manual. If you have the one which has the cycle counts in chapter 9 instead, then it's on page 9-22. |
20 May 2021, 23:08 | #163 | |
Registered User
Join Date: Mar 2016
Location: Ozherele
Posts: 229
|
Quote:
However, I agree that this program's benchmark results are very specific: only one algorithm is tested. My project is named Rosetta Pi Spigot, and I am sure you know what that means. It would also be interesting to compare the most optimized programs; we don't have many alternatives for such comparisons. |
|
20 May 2021, 23:16 | #164 | |
Registered User
Join Date: Jul 2015
Location: The Netherlands
Posts: 3,408
|
Quote:
|
|
20 May 2021, 23:22 | #165 | |
Registered User
Join Date: Mar 2016
Location: Ozherele
Posts: 229
|
Quote:
EDIT. More info is here. Last edited by litwr; 20 May 2021 at 23:31. |
|
20 May 2021, 23:28 | #166 |
Registered User
Join Date: Mar 2016
Location: Ozherele
Posts: 229
|
That is wrong. The CPU designer sets only the basic rules. You know that GCC usually doesn't use Intel syntax for assembly. Moreover, GCC was not able to use this syntax until maybe 2005. For the x86, GCC uses AT&T syntax instead, which is closer to Motorola's.
|
21 May 2021, 00:04 | #167 | |
Registered User
Join Date: Jan 2008
Location: Warsaw/Poland
Age: 55
Posts: 1,958
|
Quote:
And another thing. The average cycle count for this routine is NOT the same as the average cycle count for the Pi-printing routine. As you may know, Pi starts 31415...; my routine handles these digits very quickly and yours handles them very slowly. And where did you find that the best case for divu.w on the 68000 is 78 cycles? It is 140 cycles plus the EA calculation; my assembler book gives 140 plus 4 (EA). And again you compared 68000 cycles with 68020 cycles. On the 68020 my routine will be faster too. It is very funny when someone who doesn't know 68k coding tells me about 68k coding. "align 2" (aligning to a word) makes no sense for 68k code, because all code on the 68k is already aligned to 2 bytes. THIS IS NOT x86. You can try aligning to 4 bytes (68020/68030) or 16 bytes (68040/68060); maybe that will be faster. Just because some assemblers treat lsl D5 as lsl.w #1,D5 does not mean that you wrote READABLE code. Some assemblers accept swap.w Dx as swap Dx, but some reject it. Good 68k code MUST be easy to read. You used move instead of move.w, which is just lazy code; I don't like reading code like that. |
|
21 May 2021, 00:46 | #168 | |
Registered User
Join Date: Jan 2008
Location: Warsaw/Poland
Age: 55
Posts: 1,958
|
Quote:
And yes, divu.w D1,D5 (2 bytes) will be 4 cycles faster on the 68000 than divu.w #1000,D5 (4 bytes), but you used the second version. It seems your write routine takes about 3-4 seconds for 3000 iterations on the 68030. |
|
21 May 2021, 04:47 | #169 |
Registered User
Join Date: Mar 2018
Location: Hastings, New Zealand
Posts: 2,544
|
Except that it isn't precisely long - it's a signed byte extended to long - despite what the 68000 programmer's manual may say about it.
|
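For readers following the moveq discussion, here is a minimal Python sketch of the behaviour described above: the 8-bit immediate is sign-extended and the whole 32-bit register is written. The function name `moveq` is only an illustration, not part of any assembler or emulator API.

```python
def moveq(imm8):
    """Model of MOVEQ #imm8,Dn: return the 32-bit value written to Dn."""
    imm8 &= 0xFF                                        # only 8 bits are encoded
    value = imm8 - 0x100 if imm8 & 0x80 else imm8       # sign-extend the byte
    return value & 0xFFFFFFFF                           # all 32 bits are replaced


print(hex(moveq(0x7F)))   # 0x7f
print(hex(moveq(-1)))     # 0xffffffff
```

Whichever size name one prefers for the instruction, the model shows both halves of the argument: the operand is a byte, and the effect is on the full long word.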
21 May 2021, 04:55 | #170 | |
Registered User
Join Date: Jan 2008
Location: Warsaw/Poland
Age: 55
Posts: 1,958
|
Quote:
Anyway, in my view this can still be optimised. We have, or can have, free registers (1 data and 2 address). For later optimisations it would be best to know how many times the overflow occurs and in which cases. Perhaps longdiv can be removed later. Code:
.l0       clr.l d5            ;d <- 0
          clr.l d7
          move.l d6,d4        ;i <- kv, i <- i*2
          adda.l d4,a3
          subq.l #1,d4        ;b <- 2*i-1
          move.w #10000,d1
          bra.b .l4

.longdiv  swap d3
          move.w d3,d7
          divu.w d4,d7
          swap d7
          move.w d7,d3
          swap d3
          divu.w d4,d3
          move.w d3,d7
          exg d3,d7
          clr.w d7
          swap d7
          move.w d7,(a3)      ;r[i] <- d%b
          bra.b .enddiv

.l2       sub.l d3,d5
          sub.l d7,d5
          lsr.l #1,d5
.l4       move -(a3),d0       ; r[i]
          mulu.w d1,d0        ;r[i]*10000
          add.l d0,d5         ;d += r[i]*10000
          move.l d5,d3
          divu.w d4,d3
          bvs.s .longdiv
          move.w d3,d7
          clr.w d3
          swap d3
          move.w d3,(a3)      ;r[i] <- d%b
.enddiv   subq.w #2,d4        ;i <- i - 1
          bcc.b .l2           ;the main loop
          divu.w d1,d5        ;removed with MULU optimization
          sub.w #28,d6        ;kv
          bne.b .l0
|
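The .longdiv fallback in the post above appears to be the usual two-step schoolbook division: divide the high word first, then divide the remainder joined with the low word. The Python sketch below models that idea only. The names `divu_w` and `long_div` are hypothetical helpers, and the model ignores flags other than overflow.

```python
def divu_w(dividend, divisor):
    """Model of DIVU.W: 32-bit dividend, 16-bit divisor.

    Returns (quotient, remainder), or None when the quotient does not fit
    in 16 bits (the case where the real instruction sets V and leaves the
    destination unchanged).
    """
    quotient, remainder = divmod(dividend, divisor)
    if quotient > 0xFFFF:
        return None
    return quotient, remainder


def long_div(d, b):
    """32-bit / 16-bit division with a full 32-bit quotient, in two steps."""
    hi, lo = d >> 16, d & 0xFFFF
    # Step 1: the high word alone; the quotient is at most 0xFFFF.
    q_hi, r_hi = divu_w(hi, b)
    # Step 2: remainder:low word; r_hi < b, so this quotient also fits.
    q_lo, r = divu_w((r_hi << 16) | lo, b)
    return (q_hi << 16) | q_lo, r


print(divu_w(196609, 3))    # None: quotient 65536 needs 17 bits
print(long_div(196609, 3))  # (65536, 1)
```

Counting how often `divu_w` returns `None` for the dividends a given run produces is one way to gather the overflow statistics mentioned above.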
21 May 2021, 05:33 | #171 | |
old bearded fool
Join Date: Jan 2010
Location: Bangkok
Age: 56
Posts: 775
|
Quote:
EDIT: Confirmed, just checked, same output with 'vasm'. @ thread, Perhaps we should focus on the goal (thread topic) instead of bickering about details which have no actual impact on the result. Last edited by modrobert; 21 May 2021 at 06:34. |
|
21 May 2021, 07:53 | #172 | |
Registered User
Join Date: Mar 2018
Location: Hastings, New Zealand
Posts: 2,544
|
Quote:
As a benchmark this 'pi-spigot' is pretty silly, but then so are most synthetic benchmarks. So long as the rules are well defined and not too ridiculous I have no problem with them. This thread has turned out to be more interesting than I thought it would be. We should thank litwr for giving us an opportunity to deepen our understanding of 68k code and hone our programming skills, even if the task itself is a little silly. |
|
21 May 2021, 07:55 | #173 | ||||||
son of 68k
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,323
|
Quote:
It says 8->32. Quote:
Badly worded with "moveq has no size", I admit. Quote:
Quote:
Quote:
Quote:
The encoding does not allow it, but the syntax, by simply allowing a form without a count, does say that if no count is given then it is 1. The encoding also does not allow add.b #$12,(a0), but most assemblers will silently convert it to addi. Is that incorrect in your view? Motorola didn't really make the syntax explicit. Everywhere they show it with the size omitted. |
||||||
21 May 2021, 10:14 | #174 | |||
Registered User
Join Date: Jul 2015
Location: The Netherlands
Posts: 3,408
|
Quote:
Quote:
But in this particular case, I still think the manual got it right, because this instruction can never touch only a byte or a word; it always affects all 32 bits. This is different from add or move, because they can affect only bytes or words. Quote:
This is all I was trying to point out. |
|||
21 May 2021, 10:40 | #175 | |
son of 68k
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,323
|
Quote:
swap is word size. But swap affects the full 32 bits of the register. |
|
21 May 2021, 10:49 | #176 |
Registered User
Join Date: Jul 2015
Location: The Netherlands
Posts: 3,408
|
True, but I was pointing out why I agreed in the particular case of moveq. Swap is a different case and I may or may not agree with the manual on that one. Though I do kind of see what they're trying to say here.
|
21 May 2021, 13:07 | #177 |
Registered User
Join Date: Jan 2008
Location: Warsaw/Poland
Age: 55
Posts: 1,958
|
For me, moveq.l, exg.l, bxxx.l etc. only waste 2 bytes of source code (and slow down assembling). Anyone who knows 68k assembler knows the sizes of the operations they use. If someone doesn't, then even a correctly named instruction like "movem.w (SP)+,D0-D2" can cause problems; many coders don't know how this instruction works. Because litwr is a beginner he may use moveq.l, but since this is a GitHub repository it would be better if he cleaned moveq.l, lsr d5, move etc. out of his code. Someone may start learning 68k coding from this source and would learn bad 68k coding practices too.
|
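As an illustration of why "movem.w (SP)+,D0-D2" surprises people: when MOVEM.W loads registers from memory, each word is sign-extended and the full 32-bit register is overwritten. The Python sketch below models only that behaviour; `movem_w_load` and its arguments are hypothetical names chosen for the example.

```python
def sign_extend_16(word):
    """Sign-extend a 16-bit value to a Python int."""
    word &= 0xFFFF
    return word - 0x10000 if word & 0x8000 else word


def movem_w_load(memory, sp, count):
    """Model of MOVEM.W (SP)+,<count registers>.

    memory is a bytes-like object holding big-endian words. Returns the
    list of 32-bit register values and the updated stack pointer.
    """
    registers = []
    for _ in range(count):
        word = (memory[sp] << 8) | memory[sp + 1]       # big-endian fetch
        registers.append(sign_extend_16(word) & 0xFFFFFFFF)
        sp += 2                                          # postincrement by a word
    return registers, sp


stack = bytes([0x00, 0x01, 0xFF, 0xFE, 0x80, 0x00])
regs, new_sp = movem_w_load(stack, 0, 3)
print([hex(r) for r in regs], new_sp)
# ['0x1', '0xfffffffe', '0xffff8000'] 6
```

A coder who expects the upper halves of D0-D2 to survive, as they would after three move.w instructions, gets different register contents.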
21 May 2021, 14:13 | #178 |
Registered User
Join Date: Jan 2008
Location: Warsaw/Poland
Age: 55
Posts: 1,958
|
Seems that this:
Code:
.l0       clr.l d5            ;d <- 0
          clr.l d7
          move.l d6,d4        ;i <- kv, i <- i*2
          adda.l d4,a3
Code:
          moveq #0,D7
.l0       clr.l d5            ;d <- 0
          move.l d6,d4        ;i <- kv, i <- i*2
          adda.l d4,a3
|
21 May 2021, 14:48 | #179 |
Registered User
Join Date: Jan 2008
Location: Warsaw/Poland
Age: 55
Posts: 1,958
|
Can someone check whether this longdiv will work?
Code:
.longdiv  add.w D4,D4
          divu.w D4,D3
          lsr.w #1,D4
          move.w D3,D7
          clr.w D3
          swap D3
          lsr.w #1,D3
          addx.l D7,D7
          exg D3,D7
          move.w D7,(A3)      ;r[i] <- d%b
          bra.b .enddiv
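For anyone who wants to check a divisor-doubling routine like this one, the underlying arithmetic can be sketched in Python. If d = q2*(2b) + r2, then the true quotient is 2*q2 plus one when r2 >= b, and the true remainder is r2 minus b in that same case. This is a model of the identity, not a transcription of the 68k code, and it assumes 2*b still fits in 16 bits. `divu_w` and `div_by_doubling` are hypothetical names.

```python
def divu_w(dividend, divisor):
    """Model of DIVU.W; returns None when the quotient overflows 16 bits."""
    quotient, remainder = divmod(dividend, divisor)
    if quotient > 0xFFFF:
        return None
    return quotient, remainder


def div_by_doubling(d, b):
    """Divide d by b using a single DIVU.W-sized division by 2*b.

    Assumes 2*b <= 0xFFFF and that d // (2*b) fits in 16 bits.
    """
    result = divu_w(d, 2 * b)
    if result is None:
        raise OverflowError("quotient still does not fit in 16 bits")
    q2, r2 = result
    if r2 >= b:                      # the extra quotient bit depends on r2 vs b
        return 2 * q2 + 1, r2 - b
    return 2 * q2, r2


print(div_by_doubling(196609, 3))    # (65536, 1)
```

Comparing a candidate routine's quotient and remainder with `divmod(d, b)` for inputs such as d = 196609, b = 3 is one way to answer the question.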
21 May 2021, 22:24 | #180 | ||||||||
Registered User
Join Date: Mar 2016
Location: Ozherele
Posts: 229
|
Quote:
Quote:
Quote:
Quote:
Quote:
Quote:
Quote:
Quote:
Of course, if we wanted to test only 16+ bit systems, removing the 64 KB limit would give some systems an advantage. Consider calculating 10000 digits of pi. In that case the array elements must be larger than 16 bits, and we need more than 16 bits to address an element of the array. This favours 32-bit systems, so the ARM/80386+/IBM370/68000+/VAX/32016 gain something over the 8086/80286/PDP11. But the slowest operation is division, so those gains give only a small advantage in performance. To show this advantage we would have to remove most of the systems used for testing. The price is too high. Last edited by litwr; 21 May 2021 at 22:36. |
||||||||
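The sizing argument in the last post can be made concrete with a small Python sketch. It assumes the classic spigot layout that the assembly in this thread resembles: 14 array elements per 4 output digits, with the remainder for element i taken modulo 2*i-1. That layout is an assumption about the program, and `pi_spigot`, `terms_needed` and `element_bits` are illustrative names.

```python
def terms_needed(digits):
    """Array elements for the requested digit count (14 per 4 digits)."""
    return digits * 7 // 2


def element_bits(digits):
    """Bits needed to hold the largest remainder stored in the array."""
    max_divisor = 2 * terms_needed(digits) - 1
    return (max_divisor - 1).bit_length()    # remainders are 0..max_divisor-1


def pi_spigot(digits):
    """Spigot for pi, four decimal digits per outer pass."""
    n = terms_needed(digits)
    r = [2000] * n + [0]          # r[0..n-1] = 10000 / 5, r[n] = 0
    out, carry, k = [], 0, n
    while k > 0:
        d, i = 0, k
        while True:
            d += r[i] * 10000
            b = 2 * i - 1
            r[i] = d % b          # remainder kept for the next pass
            d //= b
            i -= 1
            if i == 0:
                break
            d *= i
        out.append("%04d" % (carry + d // 10000))
        carry = d % 10000
        k -= 14
    return "".join(out)


print(pi_spigot(800)[:20])
print(element_bits(800), element_bits(10000))   # 13 17
```

Under these assumptions 10000 digits need 35000 elements whose remainders can reach 69998, so 16-bit elements no longer suffice, and even at two bytes per element the array would already exceed 64 KB.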