Optimizing the 68020+ 32-bit math - Page 9

litwr · 20 May 2021, 22:51

Quote:

Originally Posted by alkis

Take note of 2. Use OS for print. I think it was meynaf who suggested the OS's RawDoFmt/Write a gazillion years ago, but the troll said it was not fair. So, use the OS, but don't use the OS if the Amiga has an advantage.

IMHO another, larger troll has just confused everything.

Long ago we discussed how to make the code shorter, but now we are looking for ways to make it faster. RawDoFmt can make the code shorter but slower.

roondar · 20 May 2021, 23:03

Quote:

Originally Posted by litwr

EDIT. And 140 cycles for DIVU is the worst case. 78 is the best.

Just for the record, this is not correct for any model of 68K CPU.

DIVU.W on 68000 takes 140 cycles, with a maximum difference of less than 10% between slowest and fastest possible times*. It never takes as little as 78 cycles.
DIVU on 68020/030 never takes 140 cycles. Highest cost is 79 cycles for DIVU.L (DIVU.W takes up to 44 cycles)**.
DIVU on the 040 and 060 takes fewer cycles still, but I don't have the numbers on hand.

*) See page 8-4 of the 68000 user manual.
**) See page 8-30 of the 68020 user manual. If you have the one which has the cycle counts in chapter 9 instead, then it's on page 9-22.

litwr · 20 May 2021, 23:08

Quote:

Originally Posted by roondar

You could easily argue that the 64KB code/data limitation also gives an artificial advantage to some implementations. In particular, this will benefit 8 bit architectures and probably those that have 64KB segmentation as well. To me it's actually an odd choice, regardless of platform. Optimisation tends to be either best speed or best size. Asking for best speed and best size at the same time usually gives neither.

I'm not going to guess about the intentions here (they may be perfectly legitimate, they may not), but IMHO it's quite clear that the limitations as stated make the program not very good as a cross-platform benchmarking tool. Meaning, it won't really tell you all that much about real-world performance differences because of these kinds of specialised limitations.

Sorry, but you missed the idea behind the 64KB limit. It is the direct opposite of giving advantages to some platforms. Some people tried crazy optimizations, making separate programs for 100, 256, 1000, ... digits, so this limit just unifies all this crazy diversity. Moreover, the 64 KB limit is natural for the pi-spigot algorithm: if you want more than 9400 digits you have to use larger numbers, as 16 bits are not enough.
However, I agree that this program's benchmark results are very specific, since only one algorithm is tested. My project is named Rosetta Pi Spigot and I am sure you know what that means.

It would also be interesting to compare the most optimized programs - we don't have many alternatives for such comparisons.

roondar · 20 May 2021, 23:16

Quote:

Originally Posted by litwr

Sorry, but you missed the idea behind the 64KB limit. It is the direct opposite of giving advantages to some platforms. Some people tried crazy optimizations, making separate programs for 100, 256, 1000, ... digits, so this limit just unifies all this crazy diversity. Moreover, the 64 KB limit is natural for the pi-spigot algorithm: if you want more than 9400 digits you have to use larger numbers, as 16 bits are not enough.
However, I agree that this program's benchmark results are very specific, since only one algorithm is tested. My project is named Rosetta Pi Spigot and I am sure you know what that means.

It would also be interesting to compare the most optimized programs - we don't have many alternatives for such comparisons.

Fair enough, just be aware that a 64KB limit does help certain architectures more than others.

litwr · 20 May 2021, 23:22

Quote:

Originally Posted by roondar

Just for the record, this is not correct for any model of 68K CPU.

DIVU.W on 68000 takes 140 cycles, with a maximum difference of less than 10% between slowest and fastest possible times*. It never takes as little as 78 cycles.
DIVU on 68020/030 never takes 140 cycles. Highest cost is 79 cycles for DIVU.L (DIVU.W takes up to 44 cycles)**.
DIVU on the 040 and 060 takes fewer cycles still, but I don't have the numbers on hand.

*) See page 8-4 of the 68000 user manual.
**) See page 8-30 of the 68020 user manual. If you have the one which has the cycle counts in chapter 9 instead, then it's on page 9-22.

Sorry, but you are wrong again. https://www.atari-forum.com/viewtopic.php?t=6484 - the best case is 76+EA cycles. And Don_Adan only talked about the 68000 DIVU.W timing...

EDIT. More info is here.

litwr · 20 May 2021, 23:28

Quote:

Originally Posted by saimo

Syntax is not an opinion: it's a formal set of rules defined by the designer of the CPU. The fact that some assemblers can be tolerant doesn't change the syntax. lsl.l d5 does not exist in the official syntax and is therefore wrong.

It is wrong. The CPU designer sets only the basic rules. You know that GCC usually doesn't use Intel syntax for assembly. Moreover, GCC was not able to use this syntax until maybe 2005. For the x86, GCC instead uses AT&T syntax, which is closer to Moto's.

Don_Adan · 21 May 2021, 00:04

Quote:

Originally Posted by litwr

Thank you. But your version is longer and could be slower on the 68020/30. I am really very impressed by your efforts to make the code better. But you know, perfection is impossible; every next step towards the perfect result is much harder than the previous one. So IMHO we have very good code now. Further improvements will cost much and give almost nothing.

Even the top 68k (even the 68060) can move only words in memory and only by 1 bit.

You are right, but it seems that you are trying to prove things that are very well known to us both. I have never claimed that the LSR D5 encoding is a particular case of the LSR <ea> encoding. I claimed exactly the same thing as you do: LSR D5 is a convenient shorthand version (an alias) for LSR #1,D5. You know, the x86 SHR AX,1 and SHR AX,2 have very different encodings, and it is good that assemblers don't force programmers to think about it. Technically it would be more correct to write SHR AX instead of SHR AX,1, because this would allow different encodings for the two cases, but it breaks the convenience of the logic and is therefore not used.

I can't completely agree. Encoding only provides the base for the whole "building" of the assembly. It is very odd to reduce assembler usability just making it to blindly follow hardware encoding.

Thank you very much. You know there is a very old problem. You can just follow your understanding of the rules and try to satisfy everybody. This usually works worse than some people think. There is another way, someone can try to use better rules. IMHO briefer assembler statements are better for computer nerds.

IMHO we already got almost perfect code. I reported about this in http://eab.abime.net/showpost.php?p=...&postcount=115
However, saimo and Don_Adan are just trying to do the impossible. They pushed me to make some minor improvements which mean very little. Saimo also started this fruitless LSR D5 discussion.

VASM compiles MOVE.L #10,d4 into MOVEQ #10,D4 - however you offer to replace MOVE.W by MOVEQ and it saves 2 bytes! Thank you very much.

Thank you very much again. IMHO the code has become so polished that it can dazzle somebody by its light.

But its speed and digit count have not changed. However, the programs became 6 bytes smaller and this is good. The changes have just been committed.

IMHO even a 1% speedup is rather impossible, it requires some real magic.

All efforts gave us only 4 saved cycles. 4 more saved cycles were just rediscovered. You know, the main goal is speed, the code size is secondary and much less important.

Of course, these are only results for this particular algorithm. This is mostly the division benchmark.

Could you stop writing nonsense about a LONGER version? This version is shorter and much faster, because D5 is handled as a word, not as a longword as in your version. And all accesses to the D5 register are changed/optimised, not only those inside the PR0000 routine. It seems you have never optimised a longer program or routine. Often longer code in one place gives shorter code in many other places. And that is the case for this program.

And another thing: the average cycle count for this routine is NOT EQUAL to its average cycle count when printing Pi.

As you may know, Pi starts with 31415...; these digits are handled very fast by my routine and very slowly by yours.

And where did you find that the divu.w best case is 78 cycles on the 68000? It's 140 cycles plus EA calculation; according to my assembler book, 140 plus 4 (EA).

And again you compared 68000 cycles with 68020 cycles. On the 68020 my routine will be faster too.

It is very funny when someone who doesn't know 68k coding lectures me about 68k coding.
Aligning to a word with "align 2" makes no sense for 68k code, because all 68k code is aligned to 2 bytes. THIS IS NOT x86.
You can try to align to 4 (68020/68030) or 16 (68040/68060) bytes; maybe that will be faster.

Just because some assemblers handle lsl D5 as lsl.w #1,D5 does not mean you wrote READABLE code. Some assemblers accept swap.w Dx as swap Dx, but some reject it.

Good 68k code MUST be easy to read. You used move, not move.w, and this is just lazy code; I don't like reading code like that.

Don_Adan · 21 May 2021, 00:46

Quote:

Originally Posted by litwr

Sorry, but you are wrong again. https://www.atari-forum.com/viewtopic.php?t=6484 - the best case is 76+EA cycles. And Don_Adan only talked about the 68000 DIVU.W timing...

EDIT. More info is here.

OK, tell me, what is the BEST case for divu.w? 0/1000?
And yes, divu.w D1,D5 (2 bytes) will be 4 cycles faster on the 68000 than divu.w #1000,D5 (4 bytes), but you used the second version. It seems your write routine takes about 3-4 secs for 3000 iterations on the 68030.

Bruce Abbott · 21 May 2021, 04:47

Quote:

Originally Posted by saimo

Regarding appending ".l" to "moveq": it's redundant, but not technically wrong, because the size attribute of moveq is precisely .l.

Except that it isn't precisely long - it's a signed byte extended to long - despite what the 68000 programmer's manual may say about it.

Don_Adan · 21 May 2021, 04:55

Quote:

Originally Posted by litwr

The main loop starts from .longdiv label and it ends on the bcc .l2 statement. The main loops for 80286 and 68020 have the same size now.

So is this the full loop? By my count the main loop has 56 bytes, not 54 bytes.
Anyway, I think this can still be optimised. We have, or can have, free registers (1 data and 2 address). The best basis for later optimisations would be to know how many times the overflow occurs and in which cases. Perhaps longdiv can be removed later.

Code:

.l0      clr.l d5       ;d <- 0
         clr.l d7
         move.l d6,d4     ;i <- kv, i <- i*2
         adda.l d4,a3
         subq.l #1,d4     ;b <- 2*i-1
         move.w #10000,d1
         bra.b .l4

.longdiv
         swap d3
         move.w d3,d7
         divu.w d4,d7
         swap d7
         move.w d7,d3
         swap d3
         divu.w d4,d3

         move.w d3,d7
         exg d3,d7
         clr.w d7
         swap d7
         move.w d7,(a3)     ;r[i] <- d%b
         bra.b .enddiv

.l2      sub.l d3,d5
         sub.l d7,d5
         lsr.l #1,d5
.l4
         move -(a3),d0      ; r[i]
         mulu.w d1,d0       ;r[i]*10000
         add.l d0,d5       ;d += r[i]*10000
         move.l d5,d3
         divu.w d4,d3
         bvs.s .longdiv

         move.w d3,d7
         clr.w d3
         swap d3
         move.w d3,(a3)     ;r[i] <- d%b
.enddiv
         subq.w #2,d4    ;i <- i - 1
         bcc.b .l2       ;the main loop
         divu.w d1,d5      ;removed with MULU optimization
 
         sub.w #28,d6   ;kv
         bne.b .l0
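
For readers following the .longdiv path in the listing above: DIVU.W signals overflow when the quotient does not fit in 16 bits, and the code then falls back to schoolbook division in two 16-bit steps (high word first, then the remainder joined with the low word). A minimal Python sketch of that idea follows. The names divu_w and long_div are illustrative only, registers are modelled as plain integers, and a non-zero divisor is assumed.

```python
def divu_w(dividend32, divisor16):
    """Model of DIVU.W: 32-bit / 16-bit -> (quotient16, remainder16).

    Returns None when the quotient needs more than 16 bits, i.e. the case
    in which the real instruction reports overflow (bvs is taken).
    """
    q, r = divmod(dividend32 & 0xFFFFFFFF, divisor16 & 0xFFFF)
    if q > 0xFFFF:
        return None
    return q, r


def long_div(d, b):
    """32-bit / 16-bit division built from two DIVU.W steps, as in .longdiv."""
    hi, lo = d >> 16, d & 0xFFFF
    q1, r1 = divu_w(hi, b)               # high word first; cannot overflow
    q2, r2 = divu_w((r1 << 16) | lo, b)  # r1 < b, so this quotient fits in 16 bits
    return (q1 << 16) | q2, r2
```

For example, divu_w(196615, 3) overflows in this model, while long_div(196615, 3) gives the full quotient and the remainder that the listing stores to r[i].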

modrobert · 21 May 2021, 05:33

Quote:

Originally Posted by litwr

It is very strange.

Code:

CNOP 0,4

and

Code:

ALIGN 2

do the same things.

Sorry, think I misunderstood in my previous reply. Do you mean that the assembled result using 'vasm' is same for these two?

EDIT:

Confirmed, just checked, same output with 'vasm'.

@ thread,

Perhaps we should focus on the goal (thread topic) instead of bickering about details which have no actual impact on the result.

Bruce Abbott · 21 May 2021, 07:53

Quote:

Originally Posted by roondar

Fair enough, just be aware that a 64KB limit does help certain architectures more than others.

I think it is a reasonable limit, especially since some platforms targeted have 8 bit CPUs and less than 64k RAM.

As a benchmark this 'pi-spigot' is pretty silly, but then so are most synthetic benchmarks. So long as the rules are well defined and not too ridiculous I have no problem with them.

This thread has turned out to be more interesting than I thought it would be. We should thank litwr for giving us an opportunity to deepen our understanding of 68k code and hone our programming skills, even if the task itself is a little silly.

meynaf · 21 May 2021, 07:55

Quote:

Originally Posted by saimo

No problem. Just let me download the file and have a look... Oh, surprise!

The specific manual I linked to is the same manual you linked to, and one that I happen to have here on paper, straight from Motorola.

And this is what it says:

But look at Table 3-2. Data Movement Operation Format.
It says 8->32.
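
The "8->32" behaviour can be stated in a few lines. The following is only an illustrative Python model (the function name moveq is chosen here for the example and is not any assembler's API): the 8-bit immediate is sign-extended and the whole 32-bit register is written.

```python
def moveq(imm8):
    """Model of MOVEQ #imm,Dn: sign-extend an 8-bit immediate to 32 bits."""
    imm8 &= 0xFF                                   # only 8 bits are encoded
    value = imm8 - 0x100 if imm8 & 0x80 else imm8  # interpret as signed byte
    return value & 0xFFFFFFFF                      # full 32-bit register contents
```

So moveq(10) yields 10, while moveq(-1) yields 0xFFFFFFFF in this model: the operand is a byte, the destination is always a longword.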

Quote:

Originally Posted by saimo

Now, you're throwing in the mix two things I either didn't touch on or say:
* Technically, moveq is 8->32, not 32 - that's right, but I didn't even remotely touch on that aspect;
* we should write 'moveq.l' and not 'moveq' - nowhere did I say that.

My point is that the correct syntax of moveq does not take a size. Same as abcd, tas...
Badly worded as "moveq has no size", I admit.

Quote:

Originally Posted by saimo

Let's instead look at what actually happened.
In post #140 you wrote: As an example, most assemblers will accept moveq.l even though it is technically incorrect (moveq has no size).
With post #145 I showed that "moveq has no size" is false, as the official reference manual from Motorola (again, the same one you linked to) states that the size of moveq is long; additionally, I showed an example of an instruction that actually has no size (bfextu).

Yet for all practical purposes, moveq has no size. It does not need one. Specifying one can be misleading.

Quote:

Originally Posted by saimo

That's all there is to it, and I'm shocked that such a basic matter started such a reaction

You started this reaction, not me.

Quote:

Originally Posted by saimo

Regarding appending ".l" to "moveq": it's redundant, but not technically wrong, because the size attribute of moveq is precisely .l.

Not really. The ".l" is misleading as it suggests moveq is gonna take a longword argument -- but it takes a byte. This also suggests another size is possible -- not the case.

Quote:

Originally Posted by saimo

But that's a totally different story from lsr.l d5: that is just wrong, because Motorola's syntax - and, even more, instruction encoding - demands that a count be specified when the operand is a register.

So we can write lsr.w (a0) but not lsr.w d0. Inconsistent.
The encoding does not allow it, but the syntax - by just allowing a form without a count - does say that if no count is there then it's 1.
Encoding also does not allow add.b #$12,(a0) but most assemblers will silently convert it to addi. Is that incorrect in your view ?

Quote:

Originally Posted by saimo

Whoever designed the CPU and defined the instruction set with its syntax is the only authority in such matters, and that's Motorola. Alternative syntaxes can be and have been adopted, but they can't have higher authority.

Motorola didn't really make the syntax explicit. Everywhere they show it with the size omitted.

roondar · 21 May 2021, 10:14

Quote:

Originally Posted by litwr

https://www.atari-forum.com/viewtopic.php?t=6484 - the best case is 76+EA cycles.
EDIT. More info is here.

Interesting, I had not seen that before. It looks to be legitimate considering the sources, so seems you're right.

Quote:

Originally Posted by Bruce Abbott

Except that it isn't precisely long - it's a signed byte extended to long - despite what the 68000 programmer's manual may say about it.

I'm not too sure I'd agree with that logic. See, at best that'd imply the size is .b (just as add.w #1234,a0 is .w even though it is also sign extended to long). It certainly doesn't imply the instruction has no size.

But in this particular case, I still think the manual got it right, because this instruction can never touch only a byte or word; it always affects all 32 bits. This is different from add or move, because they can affect only bytes or words.

Quote:

Originally Posted by Bruce Abbott

I think it is a reasonable limit, especially since some platforms targeted have 8 bit CPUs and less than 64k RAM.

As a benchmark this 'pi-spigot' is pretty silly, but then so are most synthetic benchmarks. So long as the rules are well defined and not too ridiculous I have no problem with them.

Really, I think it's a neat idea. Limits always breed creativity. As long as we're all aware of what these limits do (including that they grant advantages to certain systems/architectures) and that this makes it a bad way to compare systems, there's no problem here.

This is all I was trying to point out.

meynaf · 21 May 2021, 10:40

Quote:

Originally Posted by roondar

But in this particular case, I still think the manual got it right, because this instruction can never touch only a byte or word; it always affects all 32 bits. This is different from add or move, because they can affect only bytes or words.

That same manual says swap is word size.
But swap affects the full 32 bits of the register.

roondar · 21 May 2021, 10:49

Quote:

Originally Posted by meynaf

That same manual says swap is word size.
But swap affects the full 32 bits of the register.

True, but I was pointing out why I agreed in the particular case of moveq. Swap is a different case and I may or may not agree with the manual on that one. Though I do kind of see what they're trying to say here.

Don_Adan · 21 May 2021, 13:07

For me, moveq.l, exg.l, bxxx.l etc. only waste 2 bytes of source code (and slow down assembling). If someone knows 68k assembler, then he knows the sizes of the operations he uses. If someone doesn't, then even a correctly written instruction like "movem.w (SP)+,D0-D2" can cause problems. Many coders don't know how this instruction works. Because litwr is a beginner he can use moveq.l, but because this is a GitHub repository it would be better if he cleaned his code of moveq.l, lsr d5, move etc. Someone may start to learn 68k coding from this source and will learn bad 68k coding practices too.

Don_Adan · 21 May 2021, 14:13

Seems that this:

Code:

.l0      clr.l d5       ;d <- 0
         clr.l d7
         move.l d6,d4     ;i <- kv, i <- i*2
         adda.l d4,a3

Can be changed to:

Code:

         moveq #0,D7
.l0      clr.l d5       ;d <- 0
         move.l d6,d4     ;i <- kv, i <- i*2
         adda.l d4,a3

Don_Adan · 21 May 2021, 14:48

Can someone check whether this longdiv will work?

Code:

.longdiv
 add.w D4,D4
 divu.w D4,D3
 lsr.w #1,D4
 move.w D3,D7
 clr.w D3
 swap D3
 lsr.w #1,D3
 addx.l D7,D7 
 exg D3,D7
 move.w D7,(A3) ;r[i] <- d%b
 bra.b .enddiv
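
One way to check a sequence like this without hardware is to model it in a few lines of Python. The sketch below is only one reading of the instructions (register contents as plain integers, X flag as bit 0 shifted out by lsr.w); the names proposed_longdiv and first_mismatch are hypothetical helpers, not part of any tool.

```python
def proposed_longdiv(d, b):
    """Model of the proposed sequence; returns (quotient, remainder) or None.

    None means the doubled divisor no longer fits in 16 bits, or DIVU.W
    would overflow again because d / (2*b) is above 65535.
    """
    b2 = (b + b) & 0xFFFF                # add.w D4,D4
    if b2 != b + b or b2 == 0:
        return None
    q, r = divmod(d, b2)                 # divu.w D4,D3
    if q > 0xFFFF:
        return None
    x = r & 1                            # lsr.w #1,D3 shifts bit 0 into X
    return 2 * q + x, r >> 1             # addx.l D7,D7 ; stored remainder is r >> 1


def first_mismatch(b, start=1 << 16, count=1000):
    """Search dividends just above the overflow threshold for a wrong result."""
    for d in range(start * b, start * b + count):
        got = proposed_longdiv(d, b)
        if got is not None and got != divmod(d, b):
            return d, got, divmod(d, b)
    return None
```

In this model the extra quotient bit comes from bit 0 of the remainder of d/(2b) rather than from comparing that remainder with b, so the result differs from true division for some inputs; first_mismatch(3), for instance, reports a dividend where the two disagree.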

litwr · 21 May 2021, 22:24

Quote:

Originally Posted by a/b

If you look at moveq's opcode you will not find size bits. moveq is always .L and there is no point, in my opinion, to write .L, it's unambiguous. Same with, for example, lea. And since these instructions are so common and frequently used it should be common knowledge what they do and cut the c... size out. And the fact that eg. winuae debugger's craptastic disassembler spits out nonsense like bt instead of bra, lea.l, moveq.l etc, does not change that.
If you look at addq.w #n,ax and addq.l #n,ax, they do exactly the same thing. You could say there's no point in writing the size, but they don't have the same opcode (size is part of the opcode in this case) so it does matter.
And finally...
lsl dx does not have its own opcode, it's an alias for lsl #1,dx at best. lsl <ea> does exist, *but* you should not stop there, you should look at its <ea> table and you'll see that dx is not supported (eg. that specific opcode might be used to encode some other instruction).

Thread moves fast... No, Moto doc *does* "forbid" lsl dx. Again, look at the <ea> table for lsl and you will see: Dn -
You cannot just look at the first part of the information and then ignore the other, relevant, part.
What assemblers accept or don't is another thing, they are typically written to accept all kinds of crap for back/cross/whatever compatibility.

Sorry, but you are not right. You've just confused (like saimo) ML and Assembly. For a good assembly language, there is no difference between MOVEQ and MOVEQ.L or between ROL D5 and ROL #1,D5. Assembly syntax is at a higher level than ML syntax. Moto's manual can't define assembly; it has a completely different purpose. It defines the capabilities of their CPU. Of course, assemblers are based on technical data provided by the CPU manufacturer. But to say that a CPU manual forbids the use of some assembly syntax is rather a kind of folly. ROL D5 is valid because ROL (A5) is valid. Moto, unlike Intel, suggested omitting the count for the case when it is always equal to 1. This provides a useful pattern. Some assembler writers just missed this and made a more rigid, less flexible syntax. However, the best assemblers like VASM or AsmOne do not miss this opportunity.

Quote:

Originally Posted by a/b

Use ALIGN 0,4.
I presume that ALIGN 2 is expanded to ALIGN 2,0, so it does no current address alignment (2nd argument is 0) and then adds 2 to the current address. I.e. it works the same only if the current address is not longword aligned (2, 6, 10, ...).

Sorry you are wrong. Let's check the VASM manual.

Quote:

align <bitcount>
Insert as much zero bytes as required to reach an address where <bitcount> low
order bits are zero. For example align 2 would make an alignment to the next
32-bit boundary.

cnop <offset>,<alignment>
Insert as much zero bytes as required to reach an address which can be divided
by <alignment>. Then add <offset> zero bytes. May fill the padding-bytes with
no-operation instructions for certain cpus.

Moreover, how could I publish files without checking their assembly listings first?! I deeply respect the people whom I ask for help.
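
The two directive descriptions quoted above can be compared directly. The following Python sketch simply transcribes them into padding calculations; align_padding and cnop_padding are names invented for this example and the sketch does not model fill bytes or NOP padding.

```python
def align_padding(addr, bitcount):
    """Padding for `align <bitcount>`: make the low `bitcount` address bits zero."""
    boundary = 1 << bitcount
    return -addr % boundary


def cnop_padding(addr, offset, alignment):
    """Padding for `cnop <offset>,<alignment>`: align, then add `offset` bytes."""
    return -addr % alignment + offset
```

Under this reading, align_padding(addr, 2) and cnop_padding(addr, 0, 4) give the same padding for every address, which is consistent with the identical vasm output reported earlier in the thread.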

Quote:

Originally Posted by modrobert

PS: I didn't get any source code this time, so you better change it.

I am sorry, I didn't attach the source because you can use any latest source code. Just set mc68020 and compile, this gives PI-ALIGN. Then comment ALIGN 2 before .l2-label and compile, this gives PI-NA. My github - https://github.com/litwr2/rosetta-pi...e/master/amiga

Quote:

Originally Posted by Don_Adan

And another thing: the average cycle count for this routine is NOT EQUAL to its average cycle count when printing Pi.
As you may know, Pi starts with 31415...; these digits are handled very fast by my routine and very slowly by yours.

Sorry I don't understand you.

Quote:

Originally Posted by Don_Adan

It is very funny when someone who doesn't know 68k coding lectures me about 68k coding.
Aligning to a word with "align 2" makes no sense for 68k code, because all 68k code is aligned to 2 bytes. THIS IS NOT x86.
You can try to align to 4 (68020/68030) or 16 (68040/68060) bytes; maybe that will be faster.

You have written nonsense.

Quote:

Originally Posted by Don_Adan

Just because some assemblers handle lsl D5 as lsl.w #1,D5 does not mean you wrote READABLE code. Some assemblers accept swap.w Dx as swap Dx, but some reject it.
Good 68k code MUST be easy to read. You used move, not move.w, and this is just lazy code; I don't like reading code like that.

Everybody who knows the 68k assembly can read MOVE D5,D6 or LSL D5 properly.

Quote:

Originally Posted by roondar

Fair enough, just be aware that a 64KB limit does help certain architectures more than others.

Removing the 64 KB limit makes it impossible to use 8-bit systems and even several famous 16-bit systems (like the PDP-11, TI99/4, ...) for testing, and that would be very bad for a Rosetta project. There is a disclaimer about the PDP-11: some PDP-11 systems can use arrays larger than 64 KB, but this requires more complex programming.
Of course, if we wanted to test only 16+ bit systems, removing the 64 KB limit would give some advantages to some systems. Let's think about calculating 10000 digits of pi. In this case we need array elements larger than 16 bits, and we need more than 16 bits to address an element of the array. This gives advantages to 32-bit systems. So the ARM/80386+/IBM370/68000+/VAX/32016 get some bonuses in comparison with the 8086/80286/PDP11. But the slowest operation is division, so those bonuses give only small advantages in performance. To show these advantages we would have to remove most of the systems used for testing. The price is too high.
