Old 13 May 2021, 23:55   #93
saimo
Quote:
Originally Posted by litwr
So we have a mystery again ...
Again: simply don't use emulators as a reference.

Quote:
Thank you very much. However, I still have doubts about the usefulness of this 68010+ feature. The interrupt-initiation sequence is rather slow on the 68k, so a possible loss of 4 cycles means very little overhead in actual interrupt processing. On the other hand, the 68030+ and even some 68000/10/20 systems have an MMU, which allows us to do the same trick in a more natural and common way. Maybe I missed something?
Yes.
Interrupts are critical (especially on machines with limited processing power, like the Amiga, or on those that use M68k CPUs as embedded controllers), so the faster they are handled, the better. Now, the VBR helps speed up interrupt handling on the Amiga because of its RAM architecture, but that might not apply to other systems.
But even without the speed factor, being able to relocate the vector table can be an advantage: for example, more freedom for hardware/OS designers; the possibility of reclaiming those 1024 bytes in page 0 (which can be accessed more quickly thanks to short 16-bit absolute addressing); more robustness (less risk of the vectors being destroyed by stray writes through null pointers).
The MMU is meant for other purposes, and it's best to use it for those instead of complicating its usage even more. Not to mention that the MMU slows things down a bit and requires RAM (OK, the transparent translation registers might help, but then again those are better used for other purposes).
Moreover, among the 68030 and later CPUs, only the non-EC/LC variants have an MMU.
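
For reference, relocating the table on a 68010+ boils down to loading the VBR with MOVEC, which is privileged and therefore must run in supervisor mode (just a sketch, not code from this thread; the label is made up):
Code:
   lea.l   NewVectors,a0 ;address of the new 1024-byte vector table
   movec   a0,vbr        ;68010+ only; privileged instruction
After this, the CPU fetches all exception vectors relative to NewVectors, and the 1024 bytes at address 0 become free for other uses.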

Quote:
IMHO it is possible that your last two results have a typo in the 100-digit timing. The 0.06 value for pi-amiga1200 doesn't look correct: it can't be higher than the corresponding value from pi-amiga.
No typo: those are the actual figures. I ran the tests multiple times.

Quote:
It is also interesting that your results show a big advantage of the 68030 over the 68020.
No surprise at all there: if you look at the figures, you can easily see that the 68030 is about 3.5 times faster than the 68020, and that's perfectly logical, since that's the frequency ratio between the two machines and the CPU caches and instruction timings are (basically) the same.

Quote:
Table #2 shows an ER equal to 422 for the 68030 and 438 for the 68020. These numbers were almost equal before you published the results from your Blizzard card.
I can't follow this. What's ER?

Quote:
Of course, we have to provide conditions for the best timings: maximize the shell window and clear the screen.
Nope: the solution is to disable the printout. If you want to check the speed of the algorithm, eliminate all the other factors (especially considering the limited precision of the timing mechanism).

Quote:
Of course we can do something in the code outside the main loop, but it doesn't affect the timings, so I am rather reluctant to change anything there. IMHO your changes may save several cycles, but this doesn't affect the program's performance. I am also not sure whether your changes make the code a bit larger. Anyway, thank you.
At a very quick glance (no time for a thorough analysis now), the differences are the following (I have left out the similar/identical/equivalent parts).

EDIT: I gave the code a less swift look and also added 68020 cached-case timings; b = bytes, c = cycles.

Your code:
Code:
   move d5,kv      ;6b 6c
   ...
   move kv(pc),d4  ;4b 3c
   add.l d4,d4     ;2b 2c
   move.l a2,a3    ;2b 2c
   adda.l d4,a3    ;2b 2c
   ...
   sub.w #14,kv    ;6b 10c

total: 22b 25c
My code:
Code:
   movea.w #28,a1       ;4b 4c
   ...
   move.w  d2,d4        ;2b 2c
   lea.l   (a2,d2.w),a3 ;4b 6c
   ...
   sub.w   a1,d2        ;2b 2c

total: 12b 14c
The timings assume 0-wait-state memory accesses, so the actual timings of the instructions that access kv in memory are much slower (like two or three times slower).
Note: even just 11 cycles saved per outer-loop iteration add up to several thousand cycles when thousands of digits are calculated; the difference might escape the vertical-blanking-based time measurement, but I just can't see why not make the code faster - that's one of the reasons you started this thread, isn't it?
Additional note: like I said in the original post, whether my code does the right thing is still to be verified.

By the way, I have an alternative version of the inner loop which is 2 cycles faster on the 68020 (and probably even faster on the 68040 and 68060, but not on the 68030) when the mulu optimization is not used. I'm trying to figure out a way to make it work also with the mulu optimization, but it looks like it might be impossible.

Last edited by saimo; 14 May 2021 at 16:44.