Optimizing the 68020+ 32-bit math - Page 11

litwr · 22 May 2021, 19:18

Quote:

Originally Posted by Don_Adan

Then you used a buggy program. I counted the instructions manually: 28 instructions, 56 bytes. You can tell me which instruction should not be counted; I marked all 28 instructions from your post.

Can D3 really be odd?
Wow, that's a surprise to me.
But maybe you can learn something new about 68000 coding. How does this work?

addx.w D3,D3

What is wrong with this?
btst #14,(A0)
Think about this or check a 68000 assembler book.

I've shown you the listing; the program is OK. So it is your calculations that are buggy. If nobody notices this, then it seems we have something very wrong. If nobody comments on this, I will leave this thread. Thanks a lot to all the people who helped.
What crazy things are you talking about?! How is addx.w D3,D3 connected with your code?
About BTST, just read the manual and study it a bit yourself.

Quote:

If a memory location is the destination, a byte is read from that location, and the bit operation performed using the bit number, modulo 8, with zero referring to the least significant bit.

Note that any value from 0 to 0xffff is allowed here. It really is very sad that you write so much nonsense.

Quote:

Originally Posted by roondar

Of course, no problem.

Thank you very much. I wish Don_Adan could explain his thoughts so clearly. However my code doesn't use absolute addressing.

Quote:

Originally Posted by Thorham

I wasn't talking about more digits, I was talking about memory constraints. Two different things. Not having memory constraints enables more optimizations such as using a table for converting to decimal digits for example.

As I explained before, these things are connected by the nature of the pi-spigot algo. If you want more than 64 KB of memory you need other data types. This is not a trivial matter; it is a kind of math magic around the number pi.
EDIT: May I ask how you would test 8-bit systems without the 64 KB limit?
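
For readers who want to see how digit count, table size and data types hang together, here is a minimal Python sketch of the widely circulated base-10000 pi-spigot. It is a model for illustration, not the code under discussion: the names `pi_spigot` and `table_budget` are made up for this sketch, and the 2 bytes per table entry is an assumption based on the word-sized `r[i]` accesses in the 68k listings in this thread.

```python
def pi_spigot(digits):
    """Return the first `digits` decimal digits of pi (digits must be a multiple of 4)."""
    n = digits // 4 * 14            # 14 table entries per group of 4 digits
    r = [2000] * n + [0]            # r[0..n-1] = 10000 / 5; r[n] starts at zero
    out = []
    carry = 0                       # low 4 digits held back from the previous group
    c = n
    while c > 0:
        d = 0
        for i in range(c, 0, -1):
            b = 2 * i - 1           # odd divisor for term i
            d += r[i] * 10000
            r[i] = d % b            # remainder goes back into the table
            d //= b                 # quotient is carried on to the next term
            if i > 1:
                d *= i - 1
        out.append("%04d" % (carry + d // 10000))
        carry = d % 10000
        c -= 14
    return "".join(out)


def table_budget(digits, bytes_per_entry=2):
    """Return (table size in bytes, largest divisor) for a given digit count."""
    n = digits // 4 * 14
    return n * bytes_per_entry, 2 * n - 1
```

Under these assumptions `table_budget(9000)` gives 63000 bytes and a largest divisor of 62999, which still fits a 16-bit divisor. Going much further needs wider table entries and divisors, which is the link between the memory limit and the data types.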

robinsonb5 · 22 May 2021, 19:38

Quote:

Originally Posted by litwr

I've shown you the listing; the program is OK. So it is your calculations that are buggy. If nobody notices this, then it seems we have something very wrong.

It's very simple:

Code:

F00:0160       .longdiv
F00:0161         if __VASM&28              ;68020/30?
F00:0162                divul d4,d7:d3
F00:0163         else
F00:0164                swap d3
               S01:000000CE:  48 43

In your listing the opcode comes after the code which generated it - but at the end you have:

Code:

F00:0213                subq #2,d4    ;i <- i - 1
               S01:00000102:  55 44
F00:0214                bcc .l2       ;the main loop

So Don_Adan is counting the final bcc (which occupies bytes 0x104 and 0x105) but you're not, so you're measuring 2 bytes fewer than he is.

meynaf · 22 May 2021, 19:44

Quote:

Originally Posted by litwr

About BTST, just read the manual and study it a bit yourself.

Perhaps you need to know that btst in memory is a byte operation so bit number ranges from 0 to 7, hence btst #14,(a0) is incorrect.

Thorham · 22 May 2021, 20:19

Quote:

Originally Posted by litwr

As I explained before, these things are connected by the nature of the pi-spigot algo. If you want more than 64 KB of memory you need other data types. This is not a trivial matter; it is a kind of math magic around the number pi.

I'm not talking about the spigot algorithm's table size, but about the total 64 kb limit:

Quote:

3) it uses less than 64 KB RAM for the code and data

This places a limitation on potential optimizations on bigger systems. For example, it makes it impossible to use a base 10000 conversion table (if that would be beneficial) and still get to ~9000 digits.

Don_Adan · 22 May 2021, 22:55

I found one bug: in my version D7 is not handled correctly for odd values. I must rethink this routine again.

Don_Adan · 23 May 2021, 01:44

Perhaps fixed now, but code is longer.

Code:

         clr.l -(SP)   ; cv
         moveq #0,D7

.l0      clr.l d5       ;d <- 0
         move.l d6,d4     ;i <- kv, i <- i*2
         adda.l d4,a3
         subq.l #1,d4     ;b <- 2*i-1
         move.w #10000,d1
         bra.b .l4

.l2      sub.l d3,d5
         sub.l d7,d5
         lsr.l #1,d5
.l4
         move -(a3),d0      ; r[i]
         mulu.w d1,d0       ;r[i]*10000
         add.l d0,d5       ;d += r[i]*10000
         move.l d5,d3
         lsr.l #1,D3
         divu.w d4,d3
         move.w d3,d7
         clr.w d3
         swap d3
         addx.w  D3,D3

         move.w D4,D0 ; D4
         sub.w D3,D0 ; check if D3 is greater or equal D4
         sls D0 ; if yes then $FF, if not then 0
         extb.l D0 ; -1 or 0
         add.l D7,D7 
         sub.l D0,D7
         and.w D4,D0 ; D4 or 0
         sub.w D0,D3 ; fixed D3

         exg D3,D7
         move.w D7,(A3)     ;r[i] <- d%b

         subq.w #2,d4    ;i <- i - 1
         bcc.b .l2       ;the main loop
         divu.w d1,d5      ;removed with MULU optimization
 
         add.w (SP),D5 ; cv
         move.l D5,(SP) ; cv
         ext.l D5   ; necessary only for litwr version of PR0000 routine
         bsr PR0000

         sub.w #28,d6   ;kv
         bne.b .l0
         addq.l #4,SP ; restore stack
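
One way to read the arithmetic in the listing above: the dividend is halved before `divu.w` so that the quotient fits in 16 bits, and the quotient and remainder are then corrected using the bit that was shifted out. The Python sketch below models that reading only. `div_by_halving` is a hypothetical name, and the model ignores register widths and condition codes, so it says nothing about whether the 68k routine itself is bug-free.

```python
def div_by_halving(d, b):
    """Compute divmod(d, b) from a division of d // 2, as the listing appears to do."""
    x = d & 1                           # bit shifted out by lsr.l #1 (the X flag)
    q_half, r_half = divmod(d >> 1, b)  # the divu.w step; q_half must fit in 16 bits
    q = 2 * q_half                      # add.l D7,D7
    r = 2 * r_half + x                  # addx.w D3,D3
    if r >= b:                          # r < 2*b always, so one correction is enough
        q += 1
        r -= b
    return q, r
```

Since `q_half` has to fit in 16 bits, the trick extends the usable quotient range by one bit only, which matches the "1 bit overflow" limitation mentioned later in the thread.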

Bruce Abbott · 23 May 2021, 04:07

Quote:

Originally Posted by Thorham

Removing this limit is crucial if you want to use this Pi spigot as a benchmark. Artificially limiting the more powerful systems just makes them look less powerful than they are. A good benchmark doesn't play favorites.

This isn't a 'good' benchmark anyway, if you want something that gauges real-world performance. Its only real use is to show how an algorithm can be implemented on various retro computers. To this end I think it is 'fairer' to specify limits that allow the less powerful systems to be competitive.

One of my goals in life (if I can find the time) is to do some assembly language programming on all of the retro computers in my collection. litwr's 'pipack' has working examples for many of them in one handy archive, so it is useful to me even if the benchmark results aren't very relevant.

Quote:

Originally Posted by Thorham

This places a limitation on potential optimizations on bigger systems. For example, it makes it impossible to use a base 10000 conversion table (if that would be beneficial) and still get to ~9000 digits.

You should see the new FPGA based 68k CPU I am designing for the Amiga. One of the extra instructions is called 'picalc' - which blasts all 9000 digits into RAM in a single clock cycle. It will blow the other machines out of the water!

Seriously though, who cares what optimizations can be done on larger systems? It's not like computing 9000 digits of pi has any practical application.

I would rather see the original algorithm reproduced 'accurately', ie. in a form that closely follows it in an obvious way, rather than 'cheating' with lookup tables etc. Otherwise it opens up the possibility of ridiculous developments like what happened with America's Cup racing boats - which had strict hull dimension rules except that nowhere did the rules state it had to be a single hull (!), opening the way for catamarans and hydrofoils that blew conventional designs out of the water.

modrobert · 23 May 2021, 09:33

Quote:

Originally Posted by Bruce Abbott

Seriously though, who cares what optimizations can be done on larger systems? It's not like computing 9000 digits of pi has any practical application.

How about a number system with the base of a circle circumference where pi is a single digit integer?

Thorham · 23 May 2021, 12:40

Quote:

Originally Posted by Bruce Abbott

This isn't a 'good' benchmark anyway, if you want something that gauges real-world performance.

It's a benchmark, and as such should be fair and not artificially limit systems. Whether or not it's a good benchmark doesn't matter.

Quote:

Originally Posted by Bruce Abbott

Seriously though, who cares what optimizations can be done on larger systems?

This thread is eleven pages long.

litwr · 23 May 2021, 19:43

Quote:

Originally Posted by robinsonb5

It's very simple:
In your listing the opcode comes after the code which generated it - but at the end you have:

Code:

F00:0213                subq #2,d4    ;i <- i - 1
               S01:00000102:  55 44
F00:0214                bcc .l2       ;the main loop

So Don_Adan is counting the final bcc (which occupies bytes 0x104 and 0x105) but you're not, so you're measuring 2 bytes fewer than he is.

Dear Sir! Please look at the math I provided for Don_Adan: 0x104-0xCE = 0x36 = 54 bytes. Do you see 0x102 there?! I used 0x104 as the final label.

litwr · 23 May 2021, 19:47

Quote:

Originally Posted by meynaf

Perhaps you need to know that btst in memory is a byte operation so bit number ranges from 0 to 7, hence btst #14,(a0) is incorrect.

Cher Monsieur!
I already pointed out the manual snippet about BTST to you. Please read it now.

Quote:

If a memory location is the destination, a byte is read from that location, and the bit operation performed using the bit number, modulo 8, with zero referring to the least significant bit.

It is perfectly fine to use any number in the range 0..0xffff. Why write this nonsense?

litwr · 23 May 2021, 19:56

Quote:

Originally Posted by Thorham

I'm not talking about the spigot algorithm's table size, but about the total 64 kb limit:

This places a limitation on potential optimizations on bigger systems. For example, it makes it impossible to use a base 10000 conversion table (if that would be beneficial) and still get to ~9000 digits.

I already explained this. Please, don't ignore 8-bit and some 16-bit systems. This limit was initially imposed because those systems just can't address more than 64 kb.
However, if we want 10000 digits on 32-bit and some 16-bit systems, this makes the algo slower on those systems because they have to use bigger tables and variables even for 1000 digits. And let me repeat: this excludes a lot of systems from the benchmark, so it gives nothing good. You know, every sport has rules.

litwr · 23 May 2021, 20:13

Quote:

Originally Posted by Don_Adan

Perhaps fixed now, but code is longer.

Code:

         add.l d0,d5       ;d += r[i]*10000
         move.l d5,d3
         lsr.l #1,D3
        divu.w d4,d3
        move.w d3,d7

It seems this has become an obsession for you.

Thank you for the help, but your latest posts are strange. It is better for you to stop now. The code is good enough now.
Your code snippet is wrong again. Just try D4=1: this can overflow DIVU D4,D3.

Please, if you want to continue, try running your code first.
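
The overflow in question can be illustrated with a small Python model of `DIVU.W`, which divides a 32-bit dividend by a 16-bit divisor and signals overflow when the quotient does not fit in 16 bits. The function name `divu_w` and the sample dividend are assumptions chosen for illustration, not values taken from a run of the program.

```python
def divu_w(dividend, divisor):
    """Model DIVU.W: return (quotient, remainder), or None on overflow (V set)."""
    dividend &= 0xFFFFFFFF
    divisor &= 0xFFFF
    if divisor == 0:
        raise ZeroDivisionError("DIVU.W by zero takes the divide-by-zero trap")
    quotient, remainder = divmod(dividend, divisor)
    if quotient > 0xFFFF:
        return None                 # overflow: the destination is left unchanged
    return quotient, remainder


# Hypothetical dividend: a table entry of 2000 scaled by 10000.
d = 2000 * 10000
halved = d >> 1                     # what lsr.l #1,D3 would leave in D3
print(divu_w(halved, 1))            # overflow: 10,000,000 does not fit in 16 bits
print(divu_w(halved, 999))          # fits in 16 bits
```

With a divisor of 1 the quotient equals the dividend, so halving the dividend first is not enough once the dividend exceeds 17 bits.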

roondar · 23 May 2021, 20:29

Quote:

Originally Posted by litwr

Dear Sir! Please look at the math I provided for Don_Adan: 0x104-0xCE = 0x36 = 54 bytes. Do you see 0x102 there?! I used 0x104 as the final label.

Edit: please note that the information below is not correct. I made an error while verifying the code length. I'll leave it anyway because it's always good to keep in mind that we can all make mistakes.

Original post follows...

I've had enough with everyone in this thread bickering over this. Litwr is correct: the code is 54 bytes long. And because I no longer want to see any silly discussion about it, you can find attached a screenshot of the output that VASM gives for the following code when assembled:

Code:

.longdiv
        move d3,d7
        divu d4,d7
        swap d7
        move d7,d3
        swap d3
        divu d4,d3
        move d3,d7
        exg.l d3,d7
        clr d7
        swap d7
        move d7,(a3)     ;r[i] <- d%b
        bra.s .enddiv
.l2     sub.l d3,d5
        sub.l d7,d5
        lsr.l d5
.l4
        move -(a3),d0      ; r[i]
        mulu d1,d0       ;r[i]*10000
        add.l d0,d5       ;d += r[i]*10000
        move.l d5,d3
        divu d4,d3
        bvs.s .longdiv
        move d3,d7
        clr d3
        swap d3
        move d3,(a3)     ;r[i] <- d%b
.enddiv
        subq #2,d4    ;i <- i - 1
        bcc .l2       ;the main loop
.endcode
    printv    .endcode-.longdiv

For those not aware, printv prints a value. In this case the length of the code.
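
The `.longdiv` path in the listing handles the case where `divu` overflows. It looks like the usual two-step scheme: divide the high word first, then divide the remainder joined with the low word. The Python sketch below models that scheme under this assumption. `long_div` is a hypothetical name and the code is not a line-by-line translation of the 68k routine.

```python
def long_div(d, b):
    """32-bit by 16-bit division built from two steps that each fit DIVU.W."""
    hi, lo = (d >> 16) & 0xFFFF, d & 0xFFFF
    q_hi, r_hi = divmod(hi, b)              # first divu: high word only
    # r_hi < b, so the second quotient is guaranteed to fit in 16 bits.
    q_lo, r = divmod((r_hi << 16) | lo, b)  # second divu: remainder:low word
    return (q_hi << 16) | q_lo, r
```

The key property is that neither step can overflow, because each partial dividend is smaller than the divisor times 65536.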

litwr · 23 May 2021, 20:39

Quote:

Originally Posted by roondar

I've had enough with everyone in this thread bickering over this.

Thank you very much. IMHO we've just reached the goals of this thread... Indeed I would like to solve the mystery discovered by modrobert but it is another goal. Maybe I need to start a new topic for this.

Quote:

Originally Posted by modrobert

How about a number system with the base of a circle circumference where pi is a single digit integer?

A lot of algos exist for the pi-number computation. IMHO the pi-spigot is the easiest. It is not the fastest but it is the shortest.

robinsonb5 · 23 May 2021, 20:43

Quote:

Originally Posted by roondar

I've had enough with everyone in this thread bickering over this. Litwr is correct: the code is 54 bytes long. And because I no longer want to see any silly discussion about it, you can find attached a screenshot of the output that VASM gives for the following code when assembled:

That's all well and good, but you haven't counted the swap at the start of the listing, the $4843 at address $ce.

Quote:

Originally Posted by litwr

Dear Sir! Please look at the math I provided for Don_Adan: 0x104-0xCE = 0x36 = 54 bytes. Do you see 0x102 there?! I used 0x104 as the final label.

Yes, indeed, I see the 0x104 - however, the 0x5544 at 0x102 is *not* the bcc, it's the "subq #2,d4". The bcc *starts* at 0x104, and thus ends at 0x106 - therefore you're not counting the bcc.
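
The arithmetic behind the 54 versus 56 byte disagreement can be checked in a few lines of Python. The addresses come from the listing quoted in this thread; the 2-byte size of the final `bcc` is taken from the discussion above (a short branch), and `decode_subq` is a hypothetical helper that only handles the data-register form of `SUBQ`.

```python
START = 0xCE        # address of the first opcode (swap d3, $4843)
BCC_ADDR = 0x104    # address where the final bcc starts
BCC_SIZE = 2        # short branch, as assumed in the thread


def loop_size(include_bcc):
    """Byte count from START up to, or including, the final bcc."""
    end = BCC_ADDR + (BCC_SIZE if include_bcc else 0)
    return end - START


def decode_subq(opcode):
    """Decode SUBQ #data,Dn and return (data, size, register number)."""
    assert opcode & 0xF100 == 0x5100, "not a SUBQ opcode"
    size_bits = (opcode >> 6) & 3
    assert size_bits != 3, "size field 3 is Scc/DBcc, not SUBQ"
    assert (opcode >> 3) & 7 == 0, "only the Dn addressing mode is handled"
    data = ((opcode >> 9) & 7) or 8     # a field value of 0 encodes 8
    return data, "bwl"[size_bits], opcode & 7


print(loop_size(False), loop_size(True))    # without and with the bcc
print(decode_subq(0x5544))                  # the opcode at 0x102
```

Using 0x104 as the end label stops just before the `bcc`, which gives 54; counting the branch itself gives 56.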

roondar · 23 May 2021, 20:59

Quote:

Originally Posted by robinsonb5

That's all well and good, but you haven't counted the swap at the start of the listing, the $4843 at address $ce.

Sorry, I find it really funny that I try to get a 100% answer on this and then make an error myself. Especially considering how I phrased it. So errr, yeah... About that...

You're right, I did accidentally cut off the swap while reformatting the listing. I do apologize, I used the program listing as supplied by litwr (the one with all the line numbers, offsets and such added in) and deleted one line more than I should've. Which means, yeah - it is 56 bytes if the instruction I managed to delete were added back in.

Quote:

Originally Posted by litwr

It is perfectly fine to use any number in the range 0..0xffff. Why write this nonsense?

In most assemblers, you can certainly specify a bit number larger than 7 when using BTST on memory, but be aware that the instruction only uses the low 3 bits of that number (the bit number modulo 8) in this case and only tests a single byte. So BTST #14,<memory> doesn't check bit 14, but bit 6.
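
A short Python model of the behaviour described in the manual quote may help here. It assumes only what the quote states (a byte is read and the bit number is taken modulo 8) plus the fact that the 68000 is big-endian, so the byte at a word's address is that word's high byte. The function names are made up for this sketch.

```python
def btst_memory(byte_value, bit_number):
    """Return (effective bit, Z flag) for BTST #bit_number,<memory>."""
    effective = bit_number % 8                      # bit number modulo 8
    z_flag = ((byte_value >> effective) & 1) == 0   # Z is set when the bit is 0
    return effective, z_flag


def btst_at_word_address(word, bit_number):
    """BTST applied to the address of a big-endian 16-bit word."""
    first_byte = (word >> 8) & 0xFF                 # the byte actually read
    return btst_memory(first_byte, bit_number)


print(btst_memory(0x40, 14))                # bit 6 of the byte is tested
print(btst_at_word_address(0x4000, 14))     # word with only bit 14 set
```

So `btst #14,(a0)` tests bit 6 of the byte at `(a0)`. Because that byte is the high byte of the word stored at the same address, this coincides with bit 14 of that word, but only for bit numbers 8 to 15.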

robinsonb5 · 23 May 2021, 21:13

Quote:

Originally Posted by roondar

Sorry, I find it really funny that I try to get a 100% answer on this and then make an error myself.

Such is life - it's all good

The funniest part is how much effort and energy we've all expended over two bytes!

It has been an interesting thread, though - I love that we can still learn new things about instruction timings decades after the CPUs were released.

Don_Adan · 23 May 2021, 22:06

Quote:

Originally Posted by litwr

It seems this has become an obsession for you.

Thank you for the help, but your latest posts are strange. It is better for you to stop now. The code is good enough now.
Your code snippet is wrong again. Just try D4=1: this can overflow DIVU D4,D3.

Please, if you want to continue, try running your code first.

You are very funny. You used a buggy program which can't calculate the size of the loop routine correctly. You were too lazy to read or check my reply, where I counted all the instructions used in the main loop. You think you know better what is correct when using btst on memory. Now you tell me that my routine will overflow if D4 is 1. I know this. This routine works only for a 1-bit overflow, not more. Maybe you know how lsr.l #1,D3 works? You haven't shown example D4 and D3 values for which the overflow problem occurs. At present I don't have access to my Amiga to check this. The loop code is good enough, but it can be better. You use your program as a CPU benchmark. The same goes for PR0000: your version is only average.

Thorham · 23 May 2021, 23:47

Quote:

Originally Posted by litwr

I already explained this. Please, don't ignore 8-bit and some 16-bit systems. This limit was initially imposed because those systems just can't address more than 64 kb.
However, if we want 10000 digits on 32-bit and some 16-bit systems, this makes the algo slower on those systems because they have to use bigger tables and variables even for 1000 digits. And let me repeat: this excludes a lot of systems from the benchmark, so it gives nothing good. You know, every sport has rules.

This is not what I'm talking about. I'm talking about potential speed optimizations. I'm specifically not talking about the number of digits, spigot algorithm table sizes, or changing the algorithm in any way that would make it impractical/unusable on the small systems.

For example, there's a division by 10000 in the original program. It might be possible to make a division table for this and get some benefit. The artificial limitation prevents this. Another one might be a division + binary to decimal conversion table where the whole thing is done in one go. Has nothing to do with the spigot algorithm, and therefore doesn't affect the smaller systems at all.
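
To make the memory argument concrete, here is a rough Python sketch of a base-10000 to decimal conversion table of the kind mentioned above. The layout (four ASCII bytes per entry) and the names `DIGIT_TABLE` and `four_digits` are assumptions for illustration; the spigot table size assumes 14 two-byte entries per group of 4 digits, as in the common form of the algorithm.

```python
# One 4-character ASCII entry for every value 0..9999.
DIGIT_TABLE = b"".join(b"%04d" % n for n in range(10000))


def four_digits(n):
    """Look up the 4 decimal digits of n (0 <= n <= 9999) without dividing."""
    return DIGIT_TABLE[4 * n:4 * n + 4]


SPIGOT_BYTES = 9000 // 4 * 14 * 2           # assumed spigot table for 9000 digits
TOTAL_BYTES = SPIGOT_BYTES + len(DIGIT_TABLE)

print(four_digits(42))                      # zero-padded digits for 42
print(SPIGOT_BYTES, len(DIGIT_TABLE), TOTAL_BYTES)
```

Under these assumptions the conversion table alone takes 40000 bytes, so together with a 63000-byte spigot table it cannot fit in 64 KB, even before any code is counted.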
