Optimizing the 68020+ 32-bit math - Page 12

Don_Adan · 24 May 2021, 02:53

Quote:

Originally Posted by litwr

Cher Monsieur!
I already pointed you to the manual snippet about BTST. Please read it now.

It is perfectly right to use any number in the range 0..0xffff. Why write this nonsense?

Really "perfectly right?"

Tell me, if someone wrote:

btst #21,$10000

then which bit of which byte (address) does he want to test?

modrobert · 24 May 2021, 06:53

Quote:

Originally Posted by Don_Adan

Really "perfectly right?"

Tell me, if someone wrote:

btst #21,$10000

then which bit of which byte (address) does he want to test?

Will the assembler/compiler automatically pick the third byte address when you do this (effectively replacing with $10002)? Or will it just wrap around on the bits for the same $10000 address? Haven't tested (yet).

Seems more readable with 'btst #5,$10002' for the example case.

EDIT:

The assembler (vasm) produced this, as seen when checking the generated code in RAM:

btst #$15,$10000(pc)

(#$15 = #21)

meynaf · 24 May 2021, 08:07

Quote:

Originally Posted by litwr

Cher Monsieur!
I already pointed you to the manual snippet about BTST. Please read it now.

It is perfectly right to use any number in the range 0..0xffff. Why write this nonsense?

No it's not right. Many assemblers will emit a warning if you do so.
For example, phxass will assemble btst #14,(a0) to 0810 000E, but will emit the warning "Bit manipulation out of range".
So yes, at execution time any number will do, but as that same manual you pointed says, it's modulo 8. Hence it's not useful to write anything above 7, and when done, it's usually the sign of a programming mistake.

Quote:

Originally Posted by roondar

In most assemblers, you can certainly use a larger number than 0-7 using BTST in memory, but be aware that the instruction itself only has encoding space for 3 bits when used to test bits in memory and only tests on a single byte. So BTST #14,<<memory>> doesn't check the 14th bit, but the 6th bit.

It has encoding for full 16-bit word. Yeah, very wasteful.
But indeed btst #14 is same as btst #6 on memory.

Quote:

Originally Posted by Don_Adan

Really "perfectly right?"

Tell me, if someone wrote:

btst #21,$10000

then which bit of which byte (address) does he want to test?

Yep, not so easy. The programmer will get btst #5,$10000, but it's probably not what he wanted.
It's misleading at best, and this is why I consider it incorrect.
It is useful that assemblers accept it, only for resourcing purposes (to get identical binary).

Quote:

Originally Posted by modrobert

Will the assembler/compiler automatically pick the third byte address when you do this (effectively replacing with $10002)? Or will it just wrap around on the bits for the same $10000 address? Haven't tested (yet).

Seems more readable with 'btst #5,$10002' for the example case.

No, it will not. That would be incorrect.
Would mean bits 0-7 are first byte, bits 8-15 are second byte, bits 16-23 are third byte, bits 24-31 are fourth byte, i.e. little endian...
If you do :

Code:

 move.l $10000,d0
 btst #21,d0

then it's same as :

Code:

 btst #5,$10001

Yes, it's not $10002. Hence, as i said : misleading.

It could be useful, however, to do some kind of btst.l, doing btst #5,$10001 directly if you specify longword btst. I have a macro which does that.

modrobert · 24 May 2021, 09:18

Quote:

Originally Posted by meynaf

No, it will not. That would be incorrect.
Would mean bits 0-7 are first byte, bits 8-15 are second byte, bits 16-23 are third byte, bits 24-31 are fourth byte, i.e. little endian...
If you do :

Code:

 move.l $10000,d0
 btst #21,d0

then it's same as :

Code:

 btst #5,$10001

Yes, it's not $10002. Hence, as i said : misleading.

It could be useful, however, to do some kind of btst.l, doing btst #5,$10001 directly if you specify longword btst. I have a macro which does that.

Yes, my mistake about the byte order (least to most significant) within the longword; the Amiga is big endian.

I have a C program that I compile first on a new system to get general info like this.

Code:

#include <stdio.h>

int main(void)
{
    char		*chp;
    short		*shortp;
    int			*intp;
    long		*longp;
    float		*floatp;
    double		*doublep;
    void		*voidp;
    unsigned int        uint;
    
    union
    {
	long		Long;
	unsigned char	uChar[sizeof(long)];
    }u;
    
    (void)fprintf(stderr, "\nData type sizes\n===============\n");
    (void)fprintf(stderr, "char\tshort\tint\tlong\tfloat\tdouble\n");
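    /* sizeof yields size_t; cast so the %lu conversions are always matched */
    (void)fprintf(stderr, "%3lu\t%3lu\t%3lu\t%3lu\t%3lu\t%lu\n\n",
		  (unsigned long)sizeof(char),
		  (unsigned long)sizeof(short),
		  (unsigned long)sizeof(int),
		  (unsigned long)sizeof(long),
		  (unsigned long)sizeof(float),
		  (unsigned long)sizeof(double));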

    (void)fprintf(stderr, "Pointer sizes\n=============\n");

    (void)fprintf(stderr, "char\tshort\tint\tlong\tfloat\tdouble\tvoid\n\n");
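    (void)fprintf(stderr, "%3lu\t%3lu\t%3lu\t%3lu\t%3lu\t%3lu\t%3lu\n\n",
		  (unsigned long)sizeof(chp),
		  (unsigned long)sizeof(shortp),
		  (unsigned long)sizeof(intp),
		  (unsigned long)sizeof(longp),
		  (unsigned long)sizeof(floatp),
		  (unsigned long)sizeof(doublep),
		  (unsigned long)sizeof(voidp));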
    
    (void)fprintf(stderr, "Byte ordering\n=============\n");
    (void)fprintf(stderr, "Integer value 0x01020304 represented as:\n\n");
    (void)fprintf(stderr, "Byte 0\tByte 1\tByte 2\tByte 3\n");
    
    u.Long = 0x01020304;

    (void)fprintf(stderr, "%#04x\t%#04x\t%#04x\t%#04x\n\n",
		  u.uChar[0],
		  u.uChar[1],
		  u.uChar[2],
		  u.uChar[3]);
    
    if (u.uChar[0] == 0x01) {
	(void)fprintf(stderr, "Ordering is left-to-right (big endian)\n\n");
    }
    else if (u.uChar[0] == 0x04) {
	(void)fprintf(stderr, "Ordering is right-to-left (little endian)\n\n");
    } else {
	(void)fprintf(stderr, "Ordering is weird!\n\n");
    }

    uint = 0;

    (void)fprintf(stderr, "Misc\n====\n");
    (void)fprintf(stderr, "Largest value for positive int = %u\n", (uint - 1) / 2);
    (void)fprintf(stderr, "Largest value for unsigned int = %u\n", uint - 1);
    (void)fprintf(stderr, "\n");
    return 0;
}

Which outputs this on the A1200:

Code:

Data type sizes
===============
char    short   int     long    float   double
  1       2       4       4       4     8

Pointer sizes
=============
char    short   int     long    float   double  void

  4       4       4       4       4       4       4

Byte ordering
=============
Integer value 0x01020304 represented as:

Byte 0  Byte 1  Byte 2  Byte 3
0x01    0x02    0x03    0x04

Ordering is left-to-right (big endian)

Misc
====
Largest value for positive int = 2147483647
Largest value for unsigned int = 4294967295

Bruce Abbott · 24 May 2021, 09:39

Quote:

Originally Posted by modrobert

Seems more readable with 'btst #5,$10002' for the example case.

Magic numbers are less readable in general. What is bit #5? The code above provides no clue. To fix that you equate the bit number to a symbolic name, and then the fun starts.

For example, the InputEvent structure has a UWORD field called ie_Qualifier which contains 16 bits. So let's say you want to test IEQUALIFIERB_NUMERICPAD which is bit 8. Assuming that A0 points to the inputevent structure, you can just write...

Code:

btst #IEQUALIFIERB_NUMERICPAD,ie_Qualifier(A0)

...and the meaning is clear, even though the actual bit tested is bit #0 (of the upper byte in the word).

But to be 'correct' you should do...

Code:

btst #IEQUALIFIERB_NUMERICPAD-8,ie_Qualifier(A0)

...which is less clear.

If you want to test for eg. the left shift key (bit #0 of the word) then you must do...

Code:

btst #IEQUALIFIERB_LSHIFT,ie_Qualifier+1(A0)

...which is also clear. The only problem is remembering to add the '+1' for bits in the lower byte.

In high level languages it's easier because you don't have to know which byte each bit is in. The compiler knows which byte to access, and what bit number needs to be generated (though a lazy compiler could - correctly - assume that the bit # will wrap around and so not bother to adjust it).

modrobert · 24 May 2021, 10:00

Quote:

Originally Posted by Bruce Abbott

Magic numbers are less readable in general. What is bit #5? The code above provides no clue. To fix that you equate the bit number to a symbolic name, and then the fun starts.

For example, the InputEvent structure has a UWORD field called ie_Qualifier which contains 16 bits. So let's say you want to test IEQUALIFIERB_NUMERICPAD which is bit 8. Assuming that A0 points to the inputevent structure, you can just write...

Code:

btst #IEQUALIFIERB_NUMERICPAD,ie_Qualifier(A0)

...and the meaning is clear, even though the actual bit tested is bit #0 (of the upper byte in the word).

But to be 'correct' you should do...

Code:

btst #IEQUALIFIERB_NUMERICPAD-8,ie_Qualifier(A0)

...which is less clear.

If you want to test for eg. the left shift key (bit #0 of the word) then you must do...

Code:

btst #IEQUALIFIERB_LSHIFT,ie_Qualifier+1(A0)

...which is also clear. The only problem is remembering to add the '+1' for bits in the lower byte.

In high level languages it's easier because you don't have to know which byte each bit is in. The compiler knows which byte to access, and what bit number needs to be generated (though a lazy compiler could - correctly - assume that the bit # will wrap around and so not bother to adjust it).

Thanks for the explanation. I remember a gotcha that writes to hardware registers on the Amiga have to be 16-bit word sized, so I usually resort to 'and' or 'or' with a mask instead of 'btst' when dealing with bits above #7.

Thomas Richter · 24 May 2021, 10:34

Quote:

Originally Posted by Bruce Abbott

But to be 'correct' you should do...

Code:

btst #IEQUALIFIERB_NUMERICPAD-8,ie_Qualifier(A0)

...which is less clear.

There are macros to solve this type of problem. I have a "btstm" macro for DevPac which is a bit-test on a LONG in memory. It does the offset adjustment and bit-count adjustment for you, of course provided that you test for an immediate bit. It's on Aminet....

modrobert · 24 May 2021, 11:02

Quote:

Originally Posted by Thomas Richter

There are macros to solve this type of problem. I have a "btstm" macro for DevPac which is a bit-test on a LONG in memory. It does the offset adjustment and bit-count adjustment for you, of course provided that you test for an immediate bit. It's on Aminet....

Looks like this one...

http://aminet.net/package/dev/asm/DvPkMacros

Code:

btstm   Macro                   ;test one bit in a longword
        btst #(\1)&$7,(3^((\1)>>3))+\2
        Endm

Nice oneliner, trying to sort out the logic.

EDIT:

OK, so the first argument is masked down to its low three bits, which gives the bit number within the byte. The next two bits (the bit number shifted right by 3) select the byte, and the XOR with 3 flips that index because the 68k is big endian; the result is added to the second argument. Clever.

Getting this when I plug Don_Adan's example values (btstm 21,$10000):

btst #5,$10001

litwr · 26 May 2021, 19:01

Quote:

Originally Posted by robinsonb5

Yes, indeed, I see the 0x104 - however, the 0x5544 at 0x102 is *not* the bcc, it's the "subq #4, d4". The bcc *starts* at 0x104, and thus ends at 0x106 - therefore you're not counting the bcc.

You are right. Sorry, I should have been more accurate. However, I again wish Don_Adan could be less cryptic. He could have just said 0x106 and finished this.

Quote:

Originally Posted by Don_Adan

You are very funny. You used a buggy program which can't calculate the size of the loop routine correctly. You were too lazy to read or check my reply, where I counted all the instructions used in the main loop. You know better what is correct when using btst on memory. Now you tell me that my routine will overflow if D4 is 1. I know this. This routine works only for a 1-bit overflow, not more. Maybe you know how lsr.l #1,D3 works? You don't show example D4 and D3 values for which the overflow problem occurred. At present I don't have access to my Amiga to check this. The loop code is good enough, but can be better. You used your program as a CPU benchmark. The same goes for PR0000: your version is only average.

You were correct about the size of the loop. However, you didn't clarify your point, and that was bad. Thanks to robinsonb5, who helped to find the truth. Anyway, what is buggy? The program doesn't compute the size of the loop. So it is you who uses rather funny logic.

It is good that you have understood the BTST instruction. It is a sign of progress.
D3 may be equal to 31415926 when D4 = 1. LSR.L #1,D3 makes D3 = 15707963, but this doesn't help against the overflow. So your code is buggy.
It is sad that you don't have an Amiga nearby, but that is difficult to imagine. IMHO today everybody can have a decent Amiga configuration using an emulator.

And please be less cryptic about details. BTW I understand Polish...

I was in Warsaw many times.

Quote:

Originally Posted by Thorham

This is not what I'm talking about. I'm talking about potential speed optimizations. I'm specifically not talking about the number of digits, spigot algorithm table sizes, or changing the algorithm in any way that would make it unpractical/unusable on the small systems.

For example, there's a division by 10000 in the original program. It might be possible to make a division table for this and get some benefit. The artificial limitation prevents this. Another one might be a division + binary to decimal conversion table where the whole thing is done in one go. Has nothing to do with the spigot algorithm, and therefore doesn't affect the smaller systems at all.

Sorry, I still miss your general point. I can only answer about your examples. What prevents you from optimizing the division by 10000?! It doesn't break any rule! However, this division is outside the main loop, so it gives you nothing but larger code. The same is true for your idea about PR0000 optimization. Could you provide more examples?

Quote:

Originally Posted by meynaf

No it's not right. Many assemblers will emit a warning if you do so.

Of course, it is rather unusual to use numbers larger than 7 there, but they are allowed, and for some exotic purposes they may be useful. Someone could use the otherwise unused bits of the immediate operand to keep a separate value. Why waste 13 bits?!

litwr · 26 May 2021, 19:02

Quote:

Originally Posted by modrobert

Looks like this one...

http://aminet.net/package/dev/asm/DvPkMacros

Code:

btstm   Macro                   ;test one bit in a longword
        btst #(\1)&$7,(3^((\1)>>3))+\2
        Endm

What a nice macro!

litwr · 26 May 2021, 19:17

Quote:

Originally Posted by Don_Adan

Really "perfectly right?"

BTW your code which replaces SUB #14,D6 with SUB #28,D6 imposes a limit of 9360 digits.

The older one could be used up to 9400 digits. Your code has also made the algo less clear.

Don_Adan · 26 May 2021, 20:39

Quote:

Originally Posted by litwr

BTW your code which replaces SUB #14,D6 with SUB #28,D6 imposes a limit of 9360 digits. The older one could be used up to 9400 digits. Your code has also made the algo less clear.

My code which replaces sub #14,d6 with sub #28,d6 imposes a limit of 9360 digits?
Oooh, really? It must be magic. Here is this optimisation http://eab.abime.net/showpost.php?p=...&postcount=138

Code:

.l7
;        lsr d6                   ; 2 bytes
         mulu #7,d6               ;kv = d6
         move.l d6,d3
         lea.l ra(pc),a3

         exg.l a5,a6
         jsr Forbid(a6)
         moveq.l #INTB_VERTB,d0
         lea.l VBlankServer(pc),a1
         jsr AddIntServer(a6)
         exg.l a5,a6
         ;move.w #$4000,$dff096    ;DMA off
;        lsr d3                   ; 2 bytes

         lsr.w #2,D3              ; 2 bytes
         subq #1,d3
         move.l #2000*65537,d0
         move.l a3,a0
.fill    move.l d0,(a0)+
         dbra d3,.fill

.l0      clr.l d5                 ;d <- 0
;        clr.l d4                 ; 2 bytes less
         clr.l d7
;        move d6,d4               ;i <- kv  ; 2 bytes
;        add.l d4,d4              ;i <- i*2  ; 2 bytes

         move.l D6,D4             ; 2 bytes
         adda.l d4,a3
.....

endif
;        sub.w #14,d6             ;kv   ;4 bytes
         sub.w #28,D6             ; 4 bytes
         bne .l0

6 instructions (14 bytes) are replaced with 3 instructions (8 bytes).
Of course you can use any program to count this.

This is less clear? 4 digits, every digit 7 bytes. Then sub 28 is less readable than sub 14 ?

litwr · 26 May 2021, 20:48

Quote:

Originally Posted by Don_Adan

Oooh, really? It must be magic.

You have missed the point. Your code imposes that limit because D6 has to keep a larger value now.

Indeed, it is not important because we have a practical limit of 9280 digits now. 9360 is a much larger number.

Therefore your optimization is still valid.

Don_Adan · 26 May 2021, 21:01

Because the Pi routine eventually overflows by more than 1 bit (over $1FFFF), my idea cannot be used. Thanks to Phil for testing this. Saimo's version is the best option for the inner loop.

Anyway, if someone needs a fast (?) 32/16 divide with a maximum result of $1FFFF, he can use my last attempt. To be exact, the divisor is 15 bits at most, because its top (16th) bit must be zero, and the high word of D7 must already be cleared.

Code:

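 lsr.l #1,D3        ; D3 = dividend/2, X = the bit shifted out
 divu.w d4,d3       ; D3 = remainder:quotient of (dividend/2)/D4
 move.w d3,d7       ; D7 = quotient (high word of D7 already clear)
 clr.w d3 
 swap d3            ; D3 = remainder
 addx.w D3,D3       ; D3 = 2*remainder + X
 add.l D7,D7        ; D7 = 2*quotient
 sub.w D4,D3        ; does D4 fit once more?
 bpl.b OneMore      ; yes: quotient + 1
 add.w D4,D3        ; no: restore the remainder
 subq.l #1,D7       ; cancels the addq below
OneMore
 addq.l #1,D7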

Don_Adan · 26 May 2021, 21:06

Quote:

Originally Posted by litwr

You have missed the point. Your code imposes that limit because D6 has to keep a larger value now.

Indeed, it is not important because we have a practical limit of 9280 digits now. 9360 is a much larger number.

Therefore your optimization is still valid.

How did you calculate this? Which program did you use? 9360 is the maximum value for a $10000 buffer, and this has not changed.

Don_Adan · 26 May 2021, 21:23

Quote:

Originally Posted by litwr

And please be less cryptic about details. BTW I understand Polish...

I was in Warsaw many times.

Using English is an EAB rule, so you must either be happy with my very poor English or wait 100 years until Google Translate is good enough at translating Polish texts.

litwr · 27 May 2021, 11:29

Quote:

Originally Posted by Don_Adan

How did you calculate this? Which program did you use? 9360 is the maximum value for a $10000 buffer, and this has not changed.

Your changes make the value of D6 twice as large, and D6 holds a word value. 0xffff is enough for 9360 digits. If D6 were half as large, it would allow us to use up to 9400 digits.
I can assume that non-English language may be allowed in quotes.

Don_Adan · 27 May 2021, 12:07

Quote:

Originally Posted by litwr

Your changes make the value of D6 twice as large, and D6 holds a word value. 0xffff is enough for 9360 digits. If D6 were half as large, it would allow us to use up to 9400 digits.
I can assume that non-English language may be allowed in quotes.

Really? 9400x7=65800 bytes, which is more than 65536 ($10000) bytes. My version has no impact on the number of digits. The current version is limited only by this code:
move.l d6,d4
subq.l #1,d4
because
divu.w d4,d3 is used later.
So d4 cannot be larger than $ffff, and therefore D6 cannot be larger than $10000.
sub.w #28,d6 can be replaced with sub.l #28,d6, but this is not a problem here for values up to $10000. Since d6 cannot be higher than $10000 anyway, there is no problem at all.

litwr · 27 May 2021, 18:51

Quote:

Originally Posted by Don_Adan

Really? 9400x7=65800 bytes, which is more than 65536 ($10000) bytes. My version has no impact on the number of digits. The current version is limited only by this code:
move.l d6,d4
subq.l #1,d4
because
divu.w d4,d3 is used later.
So d4 cannot be larger than $ffff, and therefore D6 cannot be larger than $10000.
sub.w #28,d6 can be replaced with sub.l #28,d6, but this is not a problem here for values up to $10000. Since d6 cannot be higher than $10000 anyway, there is no problem at all.

Yes, D4 sets the same limit too but you added another one.

If we want 9400 digits we must make both D4 and D6 long words now. For the previous version it was enough to make D4 a long word.
You wrote Then sub 28 is less readable than sub 14 ? - Exactly! It is because the original algorithm uses 14.

robinsonb5 · 27 May 2021, 20:12

Quote:

Originally Posted by litwr

You wrote Then sub 28 is less readable than sub 14 ? - Exactly! It is because the original algorithm uses 14.

That, my friend, is what comments are for.

Reduced readability is an expected side-effect of optimisation - hence the saying "premature optimisation is the root of all evil".

