Code:
pi-amiga
Code:
pi-amiga
Code:
pi-amiga
Code:
pi-amiga1200
Code:
pi-amiga1200
Code:
pi-amiga1200
The time results vary a bit between runs, most likely due to multitasking and perhaps the accuracy of the frame counter: on the first "100" test I get results between 0.09 and 0.12 running the same program. I attached a screenshot; the quality kind of sucks, as it's hard to get good ones from CRT screens (it looks better in reality). |
Code:
divu.w d4,d6
Code:
divu.w d4,d6 |
That said, after an instruction that writes to memory, the CPU can enjoy many "free" cycles for cached non-memory instructions. For example, I measured that after a write to CHIP RAM, the 68030 on my Blizzard 1230-IV has 26 free cycles, which can be used for any instructions that fit in those cycles; during a write to FAST RAM (which the CPU has exclusive access to and which is no-wait-state) the CPU has 4 free cycles. (Exception: if I remember correctly - it's been a while - rotate and maybe also shift instructions, for some strange reason, can't benefit from the free cycles and cause the CPU to stall until the write finishes.) This is the reason why I proposed the longword write trick.
|
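To make the "free cycles" figures above a bit more concrete, here is a rough cost model in C. It is only a sketch: the 26 and 4 cycle windows are the measurements quoted above, while the function name and the simple rule "register work beyond the window costs extra" are assumptions for illustration, not a description of real 68030 bus behaviour (and it ignores the rotate/shift exception).

```c
#include <assert.h>

/* Window sizes as measured in the post above (Blizzard 1230-IV). */
enum { CHIP_WRITE_FREE = 26, FAST_WRITE_FREE = 4 };

/* Hypothetical model: after a memory write the CPU has `free_cycles`
   during which cached, register-only instructions cost nothing extra.
   Returns how many cycles of that register work are NOT hidden. */
static unsigned exposed_cycles(unsigned free_cycles, unsigned work_cycles)
{
    return work_cycles > free_cycles ? work_cycles - free_cycles : 0;
}

/* Total cost of `writes` stores, each followed by `work_cycles` of
   register-only work, given the bare cost of one write. */
static unsigned loop_cost(unsigned writes, unsigned write_cycles,
                          unsigned free_cycles, unsigned work_cycles)
{
    assert(free_cycles <= write_cycles); /* the window lies inside the write */
    return writes * (write_cycles + exposed_cycles(free_cycles, work_cycles));
}
```

In this model, 20 cycles of register work after a CHIP RAM write are completely hidden (`exposed_cycles(CHIP_WRITE_FREE, 20)` is 0), while the same work after a FAST RAM write leaves 16 cycles exposed. That is the motivation for doing fewer, wider writes and packing computation behind them.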
pi-amiga-8 - an old version (68000)
pi-amiga-8mo - an old version with MULUopt=1 (68000)
pi-amiga-9 - BVS optimization (68000)
pi-amiga-9mo - BVS optimization with MULUopt=1 (68000)
pi-amiga1200-8 - an old version (68020)
pi-amiga1200-8mo - an old version with MULUopt=1 (68020)
pi-amiga1200-9 - BVS optimization (68020)
pi-amiga1200-9mo - BVS optimization with MULUopt=1 (68020)
You have already run pi-amiga-9mo and pi-amiga1200-9mo, so if you rerun them the results should be the same. The results for pi-amiga-8, pi-amiga-8mo, pi-amiga-9, pi-amiga1200-8, pi-amiga1200-8mo and pi-amiga1200-9 are needed only to find out which optimization actually helps on the 68020. This time only the 3000-digit results are required.
|
While waiting for new electrolytic capacitors to ship for my original A1200 PSU (it needs recapping), I have temporarily connected a PC ATX 250W power supply, which gives me more options.
litwr, do you want me to run the new tests with the stock A1200 68020, same as before, or using the ACA-1232 68030 @ 33MHz (with 128 MB Fast RAM)?
EDIT: Never mind, the ACA-1232 seems to be broken, so I can't do 68030 for now.
Amiga 1200, stock 68020 @ 14MHz with 4 MB Fast RAM
Code:
pi-amiga-8
Code:
pi-amiga-8mo
Code:
pi-amiga-9
Code:
pi-amiga-9mo
Code:
pi-amiga1200-8
Code:
pi-amiga1200-8mo
Code:
pi-amiga1200-9
Code:
pi-amiga1200-9mo
EDIT 2: I played around with 'pi-amiga-9.asm' and modified it (edit/compile) to not print the digits via "Write(a6)" to stdout (under the PR0000 label), just to see how much it would gain. The result was 32.80 seconds, so that saved roughly 2 seconds.
BTW: I noticed you have "jmp Write(a6)" on line 307 in 'pi-amiga-9.asm'; shouldn't that be "jsr Write(a6)"? |
BTW, could you detach the Fast RAM and run PI-AMIGA-9MO for 3000 digits? It would also be nice to get results from your 68030 hardware sometime in the future.
The JMP instruction is OK - how could it work if it was wrong? :) |
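A possible explanation of why "jmp Write(a6)" works, sketched as a toy model in C. The assumption here (the source of 'pi-amiga-9.asm' is not shown in the thread) is that the JMP is the last instruction of a subroutine that was itself entered with JSR, so the RTS inside Write returns straight to the original caller. The addresses and helper names below are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model of the 68k return-address stack. */
enum { STACK_MAX = 8 };
typedef struct { int addr[STACK_MAX]; size_t depth; } ret_stack;

/* jsr: push the return address, then transfer control. */
static void do_jsr(ret_stack *s, int return_addr)
{
    assert(s->depth < STACK_MAX);
    s->addr[s->depth++] = return_addr;
}

/* jmp: transfer control only; the stack is left untouched. */
static void do_jmp(ret_stack *s) { (void)s; }

/* rts: pop the return address and continue there. */
static int do_rts(ret_stack *s)
{
    assert(s->depth > 0);
    return s->addr[--s->depth];
}

/* caller -> print routine -> Write. Returns how many RTS instructions
   run before control is back in the caller. */
static int rts_until_caller(int tail_uses_jmp)
{
    ret_stack s = { {0}, 0 };
    const int caller_addr = 100;  /* hypothetical address after the caller's jsr */
    const int after_write = 200;  /* hypothetical address after "jsr Write(a6)" */
    int count = 0;
    int pc;

    do_jsr(&s, caller_addr);      /* caller: jsr to the print routine */
    if (tail_uses_jmp)
        do_jmp(&s);               /* print routine ends with: jmp Write(a6) */
    else
        do_jsr(&s, after_write);  /* print routine: jsr Write(a6), then rts */

    do {
        pc = do_rts(&s);
        count++;
    } while (pc != caller_addr);
    return count;
}
```

With JMP, the single RTS at the end of Write lands back in the caller; with JSR followed by RTS, two returns are needed to reach the same place. Under the assumption above, both variants behave the same and the JMP form just saves one return.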
Code:
pi-amiga-9mo
Code:
pi-amiga1200-9mo
Without fastram...
Code:
pi-amiga-9mo
I noticed you don't select any type of RAM ("clr.l d1") when calling "AllocMem(a6)". According to the documentation it should pick Fast RAM first, but I'm not sure whether that can be overridden by compiler settings? |
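For reference, "clr.l d1" means the requirements argument of AllocMem is 0 (MEMF_ANY). The sketch below is a simplified C model of how such a request is resolved; it is not the exec.library implementation. The MEMF_* values match exec/memory.h, but the region structure, the sizes and the helper names are assumptions, and the list is simply assumed to be ordered with Fast RAM ahead of Chip RAM, as is typically the case.

```c
#include <assert.h>
#include <stddef.h>

#define MEMF_ANY  0UL
#define MEMF_CHIP (1UL << 1)
#define MEMF_FAST (1UL << 2)

/* Hypothetical, simplified stand-in for one entry of the system memory list. */
typedef struct {
    unsigned long attrs;       /* what kind of memory this region is */
    unsigned long free_bytes;  /* largest block assumed available */
} mem_region;

/* Walk the regions in priority order and take the first one that has all
   requested attribute bits and enough room. Returns NULL if none fits. */
static const mem_region *pick_region(const mem_region *list, size_t n,
                                     unsigned long size, unsigned long reqs)
{
    size_t i;
    for (i = 0; i < n; i++)
        if ((list[i].attrs & reqs) == reqs && list[i].free_bytes >= size)
            return &list[i];
    return NULL;
}

/* Example machine: an A1200 with or without a 4 MB Fast RAM board.
   Returns the attributes of the chosen region, or 0 if the request fails. */
static unsigned long a1200_pick(int have_fast, unsigned long reqs)
{
    static const mem_region with_fast[] = {
        { MEMF_FAST, 4UL << 20 },  /* listed first: higher priority */
        { MEMF_CHIP, 2UL << 20 }
    };
    const mem_region *list = have_fast ? with_fast : with_fast + 1;
    size_t n = have_fast ? 2 : 1;
    const mem_region *r;

    assert(reqs == MEMF_ANY || reqs == MEMF_CHIP || reqs == MEMF_FAST);
    r = pick_region(list, n, 64UL * 1024, reqs);
    return r ? r->attrs : 0;
}
```

In this model a MEMF_ANY request lands in Fast RAM when it is present and falls back to Chip RAM when it is not, which matches the "without fastram" run above. Since the requirements are passed in d1 at call time, it seems unlikely that an assembler or compiler option would change which pool is chosen.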
litwr,
Check the attached screenshot regarding number of CPU cycles for DIVU.L. Source: https://www.nxp.com/docs/en/data-sheet/MC68020UM.pdf |
Code:
divu d4,d6
The code sequences at .longdiv are different, but the branch to .longdiv is almost never taken: it is about 1 branch taken in 10,000,000 cases. I can only suggest one plausible explanation: maybe your results are caused by the order in which the programs were run. You ran PI-AMIGA-9 first. I can assume that this processing makes your hardware a bit hotter and slower. IMHO, if you run PI-AMIGA1200-9MO first, it will be faster. If that is not right, then I am completely baffled.
Thank you for your remark about the AllocMem invocation. But I can't understand which compiler setting could affect a memory allocation function call... |
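The fast path and the rarely taken .longdiv fallback mentioned above can be modelled in C as follows. This is a sketch under assumptions: the actual pi-amiga code at .longdiv is not shown in the thread, so the two-step fallback below is the classic 68000 idiom rather than litwr's exact sequence, and the type and function names are made up. The DIVU.W behaviour itself (32-bit dividend, 16-bit divisor, V flag set when the quotient does not fit in 16 bits) follows the 68000 instruction definition.

```c
#include <assert.h>
#include <stdint.h>

typedef struct { uint16_t quot, rem; int overflow; } divu_w_result;
typedef struct { uint32_t quot; uint16_t rem; } div_result;

/* Model of DIVU.W: 32-bit / 16-bit -> 16-bit quotient and remainder.
   overflow = 1 stands for the V flag; the real instruction then leaves
   the destination register unchanged. */
static divu_w_result divu_w(uint32_t n, uint16_t d)
{
    divu_w_result res = { 0, 0, 0 };
    assert(d != 0);              /* the real CPU takes a divide-by-zero trap */
    if ((n >> 16) >= d) {        /* quotient would exceed 0xFFFF */
        res.overflow = 1;
        return res;
    }
    res.quot = (uint16_t)(n / d);
    res.rem  = (uint16_t)(n % d);
    return res;
}

/* One DIVU.W on the fast path; two DIVU.W steps if V was set. */
static div_result div32by16(uint32_t n, uint16_t d)
{
    div_result out;
    divu_w_result hi, lo;
    divu_w_result fast = divu_w(n, d);       /* divu d4,d6 */

    if (!fast.overflow) {                    /* bvs .longdiv not taken */
        out.quot = fast.quot;
        out.rem  = fast.rem;
        return out;
    }
    /* .longdiv (assumed shape): high word first, then the low word with
       the remainder carried in. Neither step can overflow. */
    hi = divu_w(n >> 16, d);
    lo = divu_w(((uint32_t)hi.rem << 16) | (n & 0xFFFFu), d);
    out.quot = ((uint32_t)hi.quot << 16) | lo.quot;
    out.rem  = lo.rem;
    return out;
}
```

For example, `div32by16(100000, 7)` stays on the fast path and gives 14285 remainder 5, while `div32by16(0xFFFFFFFF, 3)` overflows the first DIVU.W and goes through the fallback. If the fallback is taken only about once in 10,000,000 divisions, its cost is negligible and the check after each DIVU.W is all that remains on the hot path.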
Regarding the timings of instructions, it should be pointed out that the timings in the Motorola manuals for the DIV and MUL instructions represent the maximum number of cycles they can take*, not the minimum. The actual timing varies depending on, amongst other things, the given input, though Motorola has not included information about how large this variation is**.
If I had to guess, I'd say that the 64 bit DIV instructions will probably perform worse than the 32 bit ones.
*) From the manual linked above, page 8-11: "This CC time is a maximum since the times given for the MULU.L and DIVS.L are maximums."
**) See the introduction to chapter 8: "This section describes the instruction execution and operations (table searches, etc.) of the MC68020/EC020 in terms of external clock cycles. It provides accurate execution and operation timing guidelines but not exact timings for every possible circumstance. This approach is used since exact execution time for an instruction or operation is highly dependent on memory speeds and other variables." |
https://modrobert.ddns.net/pics/mc68020_clocks.png
Time-wise, you can do roughly 20 instructions (at an "average speed" of ~4 clocks each) for the cost of one DIVU.L! |
Best case through worst case in those diagrams refer only to the timing with instruction overlap, with no overlap but in cache, and with no overlap and not in cache*. They do not refer to the differences in timing for DIV/MUL due to different values or operand sizes being passed (in the case of the 64 bit operations).
*) The last one is especially interesting because it assumes memory access is penalized by just 1 cycle per access, which is way better than what the A1200 without Fast RAM actually manages.
Edit: on a side note, it's very logical that DIV/MUL have different execution speeds based on differing inputs, as the amount of work needed for both does vary with the input. |
Attached are the disassembled files for the 'pi-amiga-9mo' and 'pi-amiga1200-9mo' binaries from 'pi-amiga-cmp.zip'. As you can see, 'pi-amiga1200-9mo' includes "divul.l d4,d7:d6", but according to litwr this part of the code is not called much. There are lots of DIVU.W instructions in both, and those are slow as well.
Code:
lab_fa: |
Not really, they have the same amount of work to do. Your average division algorithm creates the remainder as a by-product anyhow. The typical division implementation is a 2n-bit by n-bit division. In other words, the underlying algorithm is probably much the same for all division versions, just that some of the outputs are thrown away and/or some of the inputs are assumed to be zero. |
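The point about the remainder being a by-product of a 2n-bit by n-bit division can be illustrated with a textbook shift-and-subtract (restoring) divider in C, here with n = 16. This is only a sketch of the general technique; it is not claimed to be what Motorola's microcode actually does, and the names are made up.

```c
#include <assert.h>
#include <stdint.h>

typedef struct { uint16_t quot, rem; } qr16;

/* Restoring division: 32-bit dividend / 16-bit divisor.
   Precondition: the high word of n is smaller than d, so the quotient
   fits in 16 bits (the same condition under which DIVU.W does not
   overflow). */
static qr16 shift_sub_div(uint32_t n, uint16_t d)
{
    uint32_t rem = n >> 16;      /* running partial remainder */
    uint16_t lo  = (uint16_t)n;  /* dividend bits go out, quotient bits come in */
    qr16 out;
    int i;

    assert(d != 0 && rem < d);
    for (i = 0; i < 16; i++) {
        rem = (rem << 1) | (uint32_t)(lo >> 15);  /* bring down the next bit */
        lo  = (uint16_t)(lo << 1);
        if (rem >= d) {          /* divisor fits: subtract, quotient bit = 1 */
            rem -= d;
            lo |= 1;
        }
    }
    out.quot = lo;
    out.rem  = (uint16_t)rem;    /* whatever is left over is the remainder */
    return out;
}
```

The loop always runs 16 steps and the remainder is simply what is left in the partial-remainder register at the end, so producing it costs nothing extra. A variant that returns only the quotient would do the same work and discard `rem`.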
245 times divu.w, 1 time divu.l
387 times divu.w, 1 time divu.l
1647 times divu.w, 2 times divu.l
etc., up to e.g. 10000 times. If someone calls this routine more than 10000 times, then after 10000 calls the old routine can be called within the BVS check. |