03 May 2021, 17:55 | #41 |
Registered User
Join Date: Dec 2019
Location: North Dakota
Posts: 741
|
It should be criminally simple to create a benchmark (running, say, 1,000,000x) that compares execution time of various inputs to the DIV and create another variant that uses divu.w.
Shouldn't take more than 10 minutes to write. All you really need is just an actual HW to run it. A 20 MHz CPU should execute about 250,000 divisions in a second (at ~80 cycles). Feeding the two registers and looping should bring that to about ~175,000 (very rough estimate, don't have the cycle table at hand, so feel free to look it up and adjust accordingly). So, running a benchmark for 10,000,000x should provide sufficient benchmarking precision [from comparison standpoint]. In fact, what I don't understand, is that how come, if you are concerned about the performance, and you have an access to the HW, you actually haven't written such benchmark in the first place before even creating the thread ? Last edited by VladR; 03 May 2021 at 17:57. Reason: typos |
03 May 2021, 18:19 | #42 |
Registered User
Join Date: Aug 2010
Location: Italy
Posts: 854
|
@litwr
If you have a spare register, you could do this: Code:
move.l #$ffff,d5; somewhere in your initialization code ... move.w d6,d7 swap.w d6 and.l d5,d7 move.w d6,(a3) and.l d5,d6 Regarding the puzzling timing results: are interrupts off during the test? Edit: I just had a quick look at the inner loop of your code; I noticed that it reads from memory the item calculated at the previous step: why not exploiting the fact that you already have it in a register? Unless I'm misunderstanding something, I'd propose this (for clarity, I have removed the mulu optimization stuff): Code:
addq.l #2,a3 move.l #$ffff,d3 <initialize d6> ... bra.b .l4 .longdiv divul.l d4,d7:d6 move.w d7,-(a3) move.l d7,d6 subq.l #2,d4 bcs.b .enddiv .l2 sub.l d6,d5 sub.l d7,d5 lsr.l #1,d5 .l4 mulu.w d1,d6 add.l d5,d6 move.l d6,d5 divu.w d4,d6 bvs.b .longdiv move.w d6,d7 swap.w d6 and.l d3,d7 move.w d6,-(a3) and.l d3,d6 subq.l #2,d4 bcc.b .l2 .enddiv Additional note: -(a3) gives a better result (3 cycles) than (a3)+ (4 cycles) on 68020 only in the best case (i.e. when the instructions using it benefit from the maximum overlap conditions), but otherwise it gives a worse result (5 cycles); also, it always performs worse on 68030 (it makes no difference on 68040 and 68060, instead); you might experiment by adding a nop before the loop: by changing the alignment of the instructions, you might get better results (but first you must have a reliable way of making tests). I see room for optimization also outside of the loop, but I'm affected by a weird allergy to reading code (I'm serious), so that's all I can do Last edited by saimo; 03 May 2021 at 23:52. |
03 May 2021, 20:47 | #43 |
Registered User
Join Date: Jul 2015
Location: The Netherlands
Posts: 3,436
|
Right, just because I'm really rather interested in the Motorola 68K architecture, I did a few minor tests using extremely simple code I executed while all interrupts and the screen were disabled. I timed using the Amiga CIA timers and ran 4000 DIVU.L's and DIVUL.L's in a simple DBRA loop.
I found the following:
Last edited by roondar; 03 May 2021 at 20:54. |
04 May 2021, 08:35 | #44 |
old bearded fool
Join Date: Jan 2010
Location: Bangkok
Age: 56
Posts: 779
|
I only have 'pi-amiga-8.asm' and 'pi-amiga-9.asm' in 'pi-amiga-cmp.zip', so wanted to see exactly what the binaries you supplied did. The DIVU.W instruction is slow as well; 42-44 clocks (ten times slower than other average instructions).
|
04 May 2021, 08:41 | #45 | |
Registered User
Join Date: Jan 2019
Location: Germany
Posts: 3,307
|
Quote:
That depends on the processor. On my 68060, the time is (almost) constant, but the instruction is much quicker in the overflow case. The manual states so, it is 28 cycles, with a footnode saying that there are early-out conditions for the 32/16 division, which is exactly what I have found. I do not know for the 68020 or 68030. For the 68000, the manual gives timing information, and there it depends on the numbers. |
|
04 May 2021, 10:02 | #46 | |
Registered User
Join Date: Jul 2015
Location: The Netherlands
Posts: 3,436
|
Quote:
So for the record: my test was done on a stock Amiga 1200, no expansions. |
|
04 May 2021, 11:43 | #47 | |||||||||
Registered User
Join Date: Mar 2016
Location: Ozherele
Posts: 229
|
Quote:
Quote:
Quote:
Quote:
Quote:
Quote:
Quote:
1) mc68000 and MULUopt=0; 2) mc68020 and MULUopt=0; 3) mc68000 and MULUopt=1; 4) mc68020 and MULUopt=1. BTW the table is updated. I also made a size optimization, so v10 is now the newest but it has the same speed as v9. Quote:
Quote:
|
|||||||||
04 May 2021, 13:37 | #48 | |
Registered User
Join Date: Jul 2015
Location: The Netherlands
Posts: 3,436
|
Quote:
So compromises have to be made, where some instructions will end up being too fast and others too slow. |
|
04 May 2021, 17:52 | #49 |
WinUAE developer
Join Date: Aug 2001
Location: Hämeenlinna/Finland
Age: 49
Posts: 26,570
|
When I was coding and testing my CPU tester, I noticed that 68020 and 68030 have almost all undocumented features working identically, for example DIV undocumented flags work exactly the same.
I am almost sure execution core and microcode is practically identical but everything else has been updated (memory controller, MMU, caches etc. Updating execution core probably wasn't worth the required testing) Division algorithm improvements: 68000 is very slow, pure microcode without any "hardware acceleration", timing fully known (but was not fully documented) 68010 division is improved, UAE emulated timing should be fully correct. I did some experiments and noticed that timing is correct if 68000 algorithm is slightly modified and conditional microcode branches removed, for example 68000 does it similar to "skip following microcode block if condition false" and skipped block was "subtract x from y". 68010 does similar to "subtract [x if true, 0 if not true] from y". (Apparently 68010 hardware gained feature that can conditionally force input to zero when doing ALU operations) This improved performance and reduced timing variation depending on input values (68010 DIV timing still depends on input variables but not as much as 68000). 68010 MUL also uses same method to improve performance. 68020/030 almost certainly uses similar algoritm that 68010 uses. 020/030 is still fully(?) microcoded. Unfortunately Motorola documentation only talks about max cycle times (as was already mentioned) which are useless for emulation purposes. It needs to use best case timing because max times generally are much less common in real world. Everything would run far too slowly when using max timings or even if using average times!. Thats why "inaccurate emulation" has to use minimum possible timings. Which are unknown. Caches and long pipeliline makes it impossible to get any useful timing results. Yes, you can disable caches and all and get "correct" timing results but they are also totally useless in real world.. (and some 68020+ only instructions might lack "slow down" completely. no one should use them in speed critical parts anyway) |
04 May 2021, 18:39 | #50 | ||||
Registered User
Join Date: Aug 2010
Location: Italy
Posts: 854
|
Quote:
As for the speed improvement: * the removal of the registers conflicts does allow to exploit both the execution pipelines of the 68060 (the instructions used are suitable for both the primary and secondary pipeline), so the code would run faster on such CPU; * the and instruction after the write is for free also on 68020, as it executes while the write is being performed, whereas the same doesn't happen for bcc - forgot to mention earlier: if my memory doesn't fail me, branch instructions (I'm pretty such that this applies to dbcc at least) can't execute instead while writes happen, at least on 68020 and 68030 - it's something I had noticed experimentally a while back. Quote:
Quote:
Here's a proper version: Code:
move.l #$ffff,d3 ... bra.b .l4 .longdiv divul.l d4,d7:d6 move.w d7,(a3) subq.l #2,d4 bcs.b .enddiv .l2 sub.l d6,d5 sub.l d7,d5 .l4 move.w -(a3),d0 lsr.l #1,d5 mulu.w d1,d0 add.l d0,d5 move.l d5,d6 divu.w d4,d6 bvs.b .longdiv move.w d6,d7 swap.w d6 and.l d3,d7 move.w d6,(a3) and.l d3,d6 subq.l #2,d4 bcc.b .l2 .enddiv A new optimization here is that lsr.l has been moved to avoid 2 registers conflicts (this can be done safely given that the initialization clears d5 before jumping to .l4). For further speed improvements I suggest (sorry for repeating, but better safe than sorry): * try to put a nop before the loop code (a 2-byte different alignment might give a different result); * if it's possible to reverse the items order in the array, use (a3) in place of -(a3) and (a3)+ in place of (a3): this will give faster results on 68030 and maybe even on 68020 - this should also be checked with and without nop. |
||||
04 May 2021, 23:04 | #51 | |
old bearded fool
Join Date: Jan 2010
Location: Bangkok
Age: 56
Posts: 779
|
Quote:
Reverse order... Code:
pi-amiga1200-9mo number ? calculator v9 [MULUopt](68020) number of digits (up to 9248)? 9000 314... 295.12 Code:
pi-amiga-9mo number ? calculator v9 [MULUopt](68000) number of digits (up to 9248)? 9000 314... 292.28 |
|
05 May 2021, 04:55 | #52 |
Registered User
Join Date: Jan 2008
Location: Warsaw/Poland
Age: 56
Posts: 2,039
|
Ok.
What is this? Code:
or d5,d5 ; ??? beq .l20 move d5,d1 addq #3,d5 and #$fffc,d5 cmp.b #10,(a0) bne .l21 Code:
; or d5,d5 ; beq .l20 move d5,d1 beq.b .l20 addq #3,d5 and #$fffc,d5 cmp.b #10,(a0) bne .l21 Code:
move.l $6c,rasterie+2 move.l #rasteri,$6c Also if you know 68020 instructions timing then divul version is useless for 68020/68030. |
05 May 2021, 16:23 | #53 | |
Registered User
Join Date: Aug 2010
Location: Italy
Posts: 854
|
Quote:
So, divu.l (64 bits) seems faster than divul.l (32 bits)? Maybe you meant the opposite (which would fit the "reset additional register" in the following comment)? As for the overhead effects, you could countercheck by adding dummy operations of the same kind in the loop of the apparently faster instruction. The MC68020UM manual doesn't even mention timings for divul, so maybe the reason is that it performs exactly like divu. |
|
05 May 2021, 16:56 | #54 |
Registered User
Join Date: Jul 2015
Location: The Netherlands
Posts: 3,436
|
It may be my 68020 is still a bit weak (I don't know all the instructions by heart yet), but in my test I used the following two forms:
Code:
divu.l d0,d1 divul.l d0,d1:d2 |
05 May 2021, 17:10 | #55 | |
Registered User
Join Date: Aug 2010
Location: Italy
Posts: 854
|
Quote:
* divu.l <ea>,dq (32/32 -> 32q) * divu.l <ea>dr:dq (64/32 -> 32q, 32r) Indeed it makes sense that divu.l d0,d1 might be faster than divul.l d0,d1 (both 32 bits, but the latter also returns the remainder), but I'd be surprised if the 64 bit divu.l were faster - but now I see that you didn't test that one |
|
05 May 2021, 20:05 | #56 |
Registered User
Join Date: Aug 2010
Location: Italy
Posts: 854
|
I ended up making a few tests myself, and what already reported by roondar is totally confirmed: i.e. divu instructions all perform at the same speed and seem not to depend on the input (at least when overflow doesn't happen - I haven't checked that).
I have made the tests on a standard A1200 and on the same A1200 with an Blizzard 1230-IV turned on - i.e., on a 14 MHz 68020 and on a 50 MHz 68030. The test were based on this simple loop... Code:
move.w #499,d0 .l move.l #$7fffffff,d1 move.l #$12345678,d2 moveq.l #$ffffffff,d3 <divu instruction> dbf d0,.l These are the results (times expressed in CIA ticks): Code:
instruction 68020 68030 ------------------+-------+------- divu.l d1,d3 | 9ca | 2c6 divu.l d1,d2:d3 | 9ca | 2c6 divul.l d1,d2:d3 | 9ca | 2c6 EDIT: what follows is a brain fart! "divu instructions all perform at the same speed" kept lingering in my head and made me come to the conclusion stated below, by making me forget that the pi code uses divu.w (not divu.l!), which is faster. This means that the bvs "optimization" in the pi code is useless (Don Adan, is this that you were referring to, maybe?). Such code can then be reduced to: Code:
.l4 move.w -(a3),d0 lsr.l #1,d5 mulu.w d1,d0 add.l d0,d5 move.l d5,d6 divul.l d4,d7:d6 move.w d7,(a3) sub.l d6,d5 sub.l d7,d5 subq.l #2,d4 bcc.b .l4 Last edited by saimo; 08 May 2021 at 13:32. |
05 May 2021, 21:30 | #57 | ||
Registered User
Join Date: Mar 2018
Location: Hastings, New Zealand
Posts: 2,719
|
Quote:
Quote:
This is what get when you are an x86 coder who doesn't know that most 68k instructions affect the flags. |
||
06 May 2021, 00:25 | #58 |
old bearded fool
Join Date: Jan 2010
Location: Bangkok
Age: 56
Posts: 779
|
After reading Toni's post I got a bit curious about the microcode "state machine" for DIVU.W for 68020 which is the most used instruction of the slow ones in litwr's code.
I used TimeIt (from aminet) and wrote some code for the test. test.c Code:
/* stub for testing */ #include <stdio.h> #include <stdlib.h> unsigned long divuw(__reg("d0") unsigned long a, __reg("d1") unsigned long b, \ __reg("d2") unsigned long c); int main(int argc, char *argv[]) { unsigned long dividend, divisor, counter, result; if(argc < 4) { printf("Usage: %s <dividend> <divisor> <counter>\n", argv[0]); exit(1); } dividend = atol(argv[1]); divisor = atol(argv[2]); if(divisor == 0) { printf("Error: divisor is zero or not an integer.\n"); exit(2); } counter = atol(argv[3]); printf("Running division: %lu / %lu\n", dividend, divisor); result = divuw(dividend, divisor, counter); printf("Done with %lu divisions, result: 0x%08lx\n", counter, result); return 0; } Code:
section "CODE",code public _divuw cnop 0,4 _divuw move.l d3,-(sp) move.l d0,d3 ; save dividend .loop move.l d3,d0 ; restore dividend divu.w d1,d0 subq.l #1,d2 ; loop until counter reaches zero bne .loop move.l (sp)+,d3 ; result from last division in d0 rts Code:
Usage: test_div <dividend> <divisor> <counter> Here are the results... There was no check for division by one, which should be able to return rather fast, same goes for "dividend == divisor". Code:
> timeit test_div 1 1 4000000 Running division: 1 / 1 Done with 4000000 divisions, result: 0x00000001 Elapsed: 15.98s Code:
> timeit test_div 128 8 4000000 Running division: 128 / 8 Done with 4000000 divisions, result: 0x00000010 Elapsed: 15.96s Code:
> timeit test_div 0 1 4000000 Running with: 0 / 1 Done with 4000000 divisions, result: 0x00000000 Elapsed: 15.86s Code:
> timeit test_div 7 3 4000000 Running division: 7 / 3 Done with 4000000 divisions, result: 0x00010002 Elapsed: 15.95s Code:
> timeit test_div 64737 33223 4000000 Running division: 64737 / 33223 Done with 4000000 divisions, result: 0x7b1a0001 Elapsed: 15.98s Code:
> timeit test_div 0 1 4000000 Running with: 0 / 1 Done with 4000000 divisions, result: 0x00000000 Elapsed: 3.70s Conclusion: The microcode "state machine" is really dumb/simple for DIVU.W in the sense that there are no optimizations based on input, so either the hardware engineers were saving logic gates or had other restrictions due to the CPU design in general. On the bright side, this leaves room for improvement from a programmer perspective, avoid DIVxx whenever possible, there are plenty of useful instructions to choose from in the amazing 68k ISA. EDIT: Added a test where DIVU.W is commented out in 'div.s'. Code:
> timeit test_div 1 1 4000000 Running division: 1 / 1 Done with 4000000 divisions, result: 0x00000001 Elapsed: 3.00s Last edited by modrobert; 06 May 2021 at 11:41. |
06 May 2021, 00:52 | #59 |
Registered User
Join Date: Jul 2015
Location: The Netherlands
Posts: 3,436
|
On the Amiga, such expensive instructions can still be useful if you're using the chipset as well and aren't running on a fast CPU with Fast RAM. A simple trick would be to run one or more (big) blit(s) in the background, do your DIVxx stuff in the foreground. Essentially the DIVxx stuff becomes 'free' because you're still doing useful work regardless. And the Blitter would only barely be slowed down by the CPU because it's mostly calculating so it runs at near-full-speed too
I'm pretty sure some games/demos use tricks like that - certainly on A500. |
06 May 2021, 07:56 | #60 | |
Registered User
Join Date: Jan 2019
Location: Germany
Posts: 3,307
|
Quote:
Well, in your case, the division is just a very minor ingredient in the overall running time and other code parts dominate, thus I'm not sure you would be able to see much of a difference from this test code. It needs a tighter loop around the div. That said, I did not see an argument dependency on my 68060 except that it is much faster in the overflow case, but that's the only fast case I could discover. |
|
Currently Active Users Viewing This Thread: 1 (0 members and 1 guests) | |
Thread Tools | |
Similar Threads | ||||
Thread | Thread Starter | Forum | Replies | Last Post |
68020 Bit Field Instructions | mcgeezer | Coders. Asm / Hardware | 9 | 27 October 2023 23:21 |
68060 64-bit integer math | BSzili | Coders. Asm / Hardware | 7 | 25 January 2021 21:18 |
Discovery: Math | Audio Snow | request.Old Rare Games | 30 | 20 August 2018 12:17 |
Math apps | mtb | support.Apps | 1 | 08 September 2002 18:59 |
|
|