Optimizing the 68020+ 32-bit math - Page 8

Don_Adan · 20 May 2021, 11:07

Quote:

Originally Posted by Bruce Abbott

I think litwr wants fastest and smallest, so it's a bit tricky. Is a 5% speedup worthwhile if it adds 4 bytes to the file size?

On litwr's benchmark site the main loop code size is 54 bytes on a 50MHz 68030 vs 57 bytes (~6% larger) on a 25MHz 386, while the 386 would theoretically be ~5% faster if running at the same clock speed. Some speed optimization might make the Amiga code 5% quicker but 5% larger, and therefore virtually identical to the 386 (except that the 030 is 25% faster in real terms because 386's top out at 40MHz).

It's also good to see the Amiga 1200 with Blizzard 1230-IV beating a 36MHz ARM3 and a 33MHz 80486 (though of course these figures don't mean much in the real world).

Fair enough, but ultimately we want to know what our efforts have achieved. Hoping to see a side-by-side comparison between the original code and the final optimized version.

Wow, such easy pickings! Perhaps the total size can still be shrunk quite a lot and get even quicker!

If the main loop code starts from label .l0 (without the Write routine), then besides today's 4-byte size optimisation I can gain 6 more bytes too. It seems the 386 is not good enough to beat the 68020 in code density.

alkis · 20 May 2021, 11:57

Quote:

Originally Posted by Bruce Abbott

I think litwr wants fastest and smallest, so it's a bit tricky. Is a 5% speedup worthwhile if it adds 4 bytes to the file size?

You spelled "wants to troll" incorrectly there.

Basic Premise: (from troll's site)
Every program is satisfying four restrictions: 1) it measures time; 2) it uses an OS function to print digits, it prints 4 digits a time synchronously with the calculation of them; 3) it uses less than 64 KB RAM for the code and data; 4) it utilizes all available RAM below 64 KB limit to get the maximum number of calculated digits, so it is forbidden to restrict artificially the maximum number of digits.

Take note of 2. Use OS for print. I think it was meynaf who suggested the OS's RawDoFmt/Write a gazillion years ago, but troll said it was not fair. So, use OS but don't use OS if the Amiga has an advantage.

It's pretty pointless, unless you want to keep feeding the troll though.

My €0.02

roondar · 20 May 2021, 12:12

Quote:

Originally Posted by alkis

You spelled "wants to troll" incorrectly there.

Basic Premise: (from troll's site)
Every program is satisfying four restrictions: 1) it measures time; 2) it uses an OS function to print digits, it prints 4 digits a time synchronously with the calculation of them; 3) it uses less than 64 KB RAM for the code and data; 4) it utilizes all available RAM below 64 KB limit to get the maximum number of calculated digits, so it is forbidden to restrict artificially the maximum number of digits.

Take note of 2. Use OS for print. I think it was meynaf who suggested the OS's RawDoFmt/Write a gazillion years ago, but troll said it was not fair. So, use OS but don't use OS if the Amiga has an advantage.

It's pretty pointless, unless you want to keep feeding the troll though.

My €0.02

You could easily argue that the 64KB code/data limitation also gives an artificial advantage to some implementations. In particular, this will benefit 8 bit architectures and probably those that have 64KB segmentation as well. To me it's actually an odd choice, regardless of platform. Optimisation tends to be either best speed or best size. Asking for best speed and best size at the same time usually gives neither.

I'm not going to guess about the intentions here (they may be perfectly legitimate, they may not), but IMHO it's quite clear that the stated limitations, as they stand, make the program not very good as a cross-platform benchmarking tool. Meaning, it won't really tell you all that much about real-world performance differences because of these kinds of specialised limitations.

Don_Adan · 20 May 2021, 14:49

On the Amiga it is no problem to write a program which uses the full 64KB to store digits. But that would be unfair to 8-bit systems and maybe some other CPUs.

saimo · 20 May 2021, 19:10

I'm sorry if this sounds pedantic, but we should not spread wrong notions, especially when there is already some confusion.

Quote:

Originally Posted by meynaf

As an example, most assemblers will accept moveq.l even though it is technically incorrect (moveq has no size).

moveq does have a size, and that's long. Here is the official definition:

Please don't be fooled by the fact that the line "Assembler Syntax: MOVEQ #<data>,Dn" doesn't include ".l", as the same also happens with the other instructions - for example, here's move:

Instructions without a size are explicitly declared unsized - here's an example:

Quote:

So about LSL.L D5 being syntactically correct or not, it is a matter of how you see it

Syntax is not an opinion: it's a formal set of rules defined by the designer of the CPU. The fact that some assemblers can be tolerant doesn't change the syntax. lsl.l d5 does not exist in the official syntax and is therefore wrong.

Thorham · 20 May 2021, 19:25

Quote:

Originally Posted by alkis

Take note of 2. Use OS for print. I think it was meynaf who suggested the OS's RawDoFmt/Write a gazillion years ago, but troll said it was not fair. So, use OS but don't use OS if the Amiga has an advantage.

Just use VPrintf()

meynaf · 20 May 2021, 19:39

Quote:

Originally Posted by saimo

I'm sorry if this sounds pedantic, but we should not spread wrong notions, especially when there is already some confusion.

It is not wrong notions. It is 30+ years of coding.
That some "official" online doc says something does not change what is correct and what is not (btw. Freescale isn't Motorola as we knew it).

Quote:

Originally Posted by saimo

moveq does have a size, and that's long.

No. Most, if not all, disassemblers, will output it without a size and all assemblers accept it without a size -- while you may eventually find some which reject moveq.l (and bset.b, etc).

Quote:

Originally Posted by saimo

Syntax is not an opinion: it's a formal set of rules defined by the designer of the CPU. The fact that some assemblers can be tolerant doesn't change the syntax. lsl.l d5 does not exist in the official syntax and is therefore wrong.

If assemblers rejected all 'wrong' syntax with this definition, not a single source in the world would assemble

saimo · 20 May 2021, 20:00

Quote:

Originally Posted by meynaf

It is not wrong notions. It is 30+ years of coding.
That some "official" online doc says something does not change what is correct and what is not (btw. Freescale isn't Motorola as we knew it).

That online doc is the PDF version of Motorola's M68000 Family Programmer's Reference Manual. For the record, I have the physical book right here, and I can guarantee that the text matches 100%. Freescale and NXP simply acquired the assets and rights, but didn't redefine the syntax.

Quote:

No. Most, if not all, disassemblers, will output it without a size and all assemblers accept it without a size -- while you may eventually find some which reject moveq.l (and bset.b, etc).

What assemblers and disassemblers do means less than zero: Motorola defined the syntax, and Motorola said that moveq has a long size.

Quote:

If assemblers rejected all 'wrong' syntax with this definition, not a single source in the world would assemble

That doesn't change the fact that a wrong syntax is just that: wrong.

meynaf · 20 May 2021, 20:12

Quote:

Originally Posted by saimo

That online doc is the PDF version of Motorola's M68000 Family Programmer's Reference Manual. For the record, I have the physical book right here, and I can guarantee that the text matches 100%. Freescale and NXP simply acquired the assets and rights, but didn't redefine the syntax.

Technically, moveq is 8->32, not 32. You don't believe me ? Read this then :
http://www2.ece.ohio-state.edu/~degr.../M68000PRM.pdf

Quote:

Originally Posted by saimo

What assemblers and disassemblers do means less than zero: Motorola defined the syntax, and Motorola said that moveq has a long size.

Not exactly, no. They just say - and only in the specific manual you linked to - that it's a longword operation. Nowhere do they say we should write 'moveq.l' and not 'moveq'.

Quote:

Originally Posted by saimo

That doesn't change the fact that a wrong syntax is just that: wrong.

Other syntaxes exist, not just the one of Motorola :
https://www.nextop.de/NeXTstep_3.3_D...mld/index.html

litwr · 20 May 2021, 20:20

Quote:

Originally Posted by saimo

Don't worry, I'm not taking this conversation personally at all: I just had to point out that, unfortunately, your stubborn attitude is preventing you from seeing something that is very simple, and it also shows disrespect to those who are trying to help you see it.
Now, there is nothing controversial here: the pages of Motorola's official manual clearly define the syntax of the instructions, how the instructions work and how they are encoded. You simply fail to understand those pages. I tried to help you with almost word-by-word guidance, but given that you choose not to see, I won't add anything else.

You have your interpretation. I have mine.

I have shown you my logic; you prefer to stop showing yours. So I continue to insist that Moto's official doc doesn't forbid LSL.L D5.
BTW I have just checked LSL.L D5 with ASMONE - it works perfectly.

litwr · 20 May 2021, 20:24

Quote:

Originally Posted by Don_Adan

divu.w is called 3 times, 3x53 cycles: on average about 150 cycles less per access to the PR0000 routine. Plus much faster code for cv handling. In total, about 165 cycles faster per access.

Thank you but even 200 cycles give us less than 0.5% - it is still undetectable. Moreover your optimization may slow down the 68020.

EDIT. And 140 cycles for DIVU is the worst case. 78 is the best.

a/b · 20 May 2021, 20:28

If you look at moveq's opcode you will not find size bits. moveq is always .L and there is no point, in my opinion, to write .L, it's unambiguous. Same with, for example, lea. And since these instructions are so common and frequently used it should be common knowledge what they do and cut the c... size out. And the fact that eg. winuae debugger's craptastic disassembler spits out nonsense like bt instead of bra, lea.l, moveq.l etc, does not change that.
If you look at addq.w #n,ax and addq.l #n,ax, they do exactly the same thing. You could say there's no point in writing the size, but they don't have the same opcode (size is part of the opcode in this case) so it does matter.
And finally...
lsl dx does not have its own opcode, it's an alias for lsl #1,dx at best. lsl <ea> does exist, *but* you should not stop there, you should look at its <ea> table and you'll see that dx is not supported (e.g. that specific opcode might be used to encode some other instruction).

Thread moves fast... No, Moto doc *does* "forbid" lsl dx. Again, look at the <ea> table for lsl and you will see: Dn -
You cannot just look at the first part of the information and then ignore the other, relevant, part.
What assemblers accept or don't is another thing, they are typically written to accept all kinds of crap for back/cross/whatever compatibility.
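
For reference, the encoding argument can be checked with a short sketch. It is written in Python rather than assembly so that it can be run anywhere, it assumes the LSL/LSR bit layouts given in the M68000 Family Programmer's Reference Manual, and the function names are made up for illustration. The register form always carries a count field, while the <ea> form is word-sized, shifts by exactly one bit and accepts only memory alterable modes.

```python
# Sketch of the 68k logical-shift encodings (LSL/LSR only).
# Function names are hypothetical; bit layouts follow the Motorola manual.

SIZE_BITS = {"b": 0b00, "w": 0b01, "l": 0b10}

def encode_lsx_imm(count, dn, size="w", left=True):
    """Encode LSL/LSR #count,Dn. count is 1..8, and 8 is stored as 0."""
    if not 1 <= count <= 8 or not 0 <= dn <= 7:
        raise ValueError("count must be 1..8 and register 0..7")
    return (0xE000                      # 1110: shift/rotate group
            | ((count & 7) << 9)        # immediate count field
            | (int(left) << 8)          # direction: 1 = left, 0 = right
            | (SIZE_BITS[size] << 6)    # operand size
            | (0 << 5)                  # 0 = immediate count, 1 = count in Dn
            | (0b01 << 3)               # 01 = logical shift
            | dn)

def encode_lsx_mem(mode, reg, left=True):
    """Encode LSL/LSR <ea>: word-sized, one bit, memory alterable modes only."""
    memory_alterable = mode in (2, 3, 4, 5, 6) or (mode == 7 and reg in (0, 1))
    if not memory_alterable:
        raise ValueError("Dn, An and non-alterable modes are not valid here")
    return 0xE2C0 | (int(left) << 8) | (mode << 3) | reg

print(hex(encode_lsx_imm(1, 5, "l")))   # lsl.l #1,d5
print(hex(encode_lsx_mem(2, 0)))        # lsl (a0)
```

In this model lsl.l #1,d5 encodes as $E38D, and asking for the <ea> form with mode 0 (Dn) raises an error, so an assembler that accepts lsl.l d5 can only be emitting the #1 form.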

litwr · 20 May 2021, 20:32

Quote:

Originally Posted by modrobert

I tested now running in reverse order, exactly the same results, 'pi-na' is 0.24 seconds faster than 'pi-align'.

Try using 'cnop 0,4' to align with next long word address instead.

It is very strange.

Code:

CNOP 0,4

and

Code:

ALIGN 2

do the same things.

We need help from the 68k experts for this issue. The 68k experts! Help us! I am completely baffled here.

a/b · 20 May 2021, 20:41

Use ALIGN 0,4.
I presume that ALIGN 2 is expanded to ALIGN 2,0, so it does no current address alignment (2nd argument is 0) and then adds 2 to the current address. I.e. it works the same only if the current address is not longword aligned (2, 6, 10, ...).
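
The padding each directive inserts can be compared with a small Python sketch. It assumes that CNOP offset,alignment pads to the next address congruent to offset modulo alignment, and it models two readings of ALIGN 2: a 2^n-byte boundary (which, as far as I know, is how vasm's Motorola syntax module documents ALIGN) and the "add 2" presumption above. Which reading applies depends on the assembler, so both are assumptions, and the function names are hypothetical.

```python
def cnop_padding(addr, offset, alignment):
    """Bytes inserted by CNOP offset,alignment at address addr."""
    return (offset - addr) % alignment

def align_pow2_padding(addr, bits):
    """ASSUMPTION: ALIGN bits pads to the next 2**bits-byte boundary."""
    return -addr % (1 << bits)

def align_add_padding(addr, amount):
    """ASSUMPTION (the presumption above): no alignment, just skip amount bytes."""
    return amount

# Columns: address, CNOP 0,4, ALIGN 2 (power-of-two reading), ALIGN 2 (add-2 reading)
for addr in (0, 2, 4, 6):
    print(addr,
          cnop_padding(addr, 0, 4),
          align_pow2_padding(addr, 2),
          align_add_padding(addr, 2))
```

Under the first reading CNOP 0,4 and ALIGN 2 always insert the same padding; under the second they agree only at addresses 2, 6, 10, and so on.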

modrobert · 20 May 2021, 21:24

Quote:

Originally Posted by litwr

It is very strange.

Code:

CNOP 0,4

and

Code:

ALIGN 2

do the same things.

No, long word vs word.

PS: I didn't get any source code this time, so you better change it.

saimo · 20 May 2021, 22:26

Quote:

Originally Posted by meynaf

Technically, moveq is 8->32, not 32. You don't believe me ? Read this then :
http://www2.ece.ohio-state.edu/~degr.../M68000PRM.pdf

No, problem. Just let me download the file and have a look... Oh, surprise!

[Attached image]

Quote:

Not exactly, no. They just say - and only in the specific manual you linked to - that it's a longword operation. Nowhere do they say we should write 'moveq.l' and not 'moveq'.

The specific manual I linked to is the same manual you linked to, and that I happen to have here on paper, straight from Motorola.

And this is what it says:

[Attached image]

Now, you're throwing into the mix two things I either didn't touch on or didn't say:
* Technically, moveq is 8->32, not 32 - that's right, but I didn't even remotely touch on that aspect;
* we should write 'moveq.l' and not 'moveq' - nowhere did I say that.

Let's instead look at what actually happened.
In post #140 you wrote: As an example, most assemblers will accept moveq.l even though it is technically incorrect (moveq has no size).
With post #145 I showed that "moveq has no size" is false, as the official reference manual from Motorola (again, the same one you linked to) states that the size of moveq is long; additionally, I showed an example of an instruction that actually has no size (bfextu).
That's all there is to it, and I'm shocked that such a basic matter started such a reaction.

Regarding appending ".l" to "moveq": it's redundant, but not technically wrong, because the size attribute of moveq is precisely .l. But that's a totally different story from lsr.l d5: that is just wrong, because Motorola's syntax - and, even more, instruction encoding - demands that a count be specified when the operand is a register.

Quote:

Other syntaxes exist, not just the one of Motorola :
https://www.nextop.de/NeXTstep_3.3_D...mld/index.html

Whoever designed the CPU and defined the instruction set with its syntax is the only authority in such matters, and that's Motorola. Alternative syntaxes can be and have been adopted, but they can't have higher authority.

litwr · 20 May 2021, 22:27

Quote:

Originally Posted by Don_Adan

Next 2 bytes less.

Code:

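         move.l #start+$10000-ra,D7  ;bytes from ra up to start+$10000
         divu.w #7*4,D7              ;low word = quotient, high word = remainder
         ext.l D7                    ;clears the remainder while quotient < $8000
         lsl.w #2,D7    ;d7=maxn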

For my version of PR0000, 2 more bytes gained.
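
What the four instructions above compute can be traced with a short emulation. This is a Python sketch, not part of the posted code: the argument stands for start+$10000-ra, whose real value depends on the program layout, and the function name and sample value are hypothetical.

```python
def maxn_from_free_bytes(free_bytes):
    """Emulate: move.l #free_bytes,d7 / divu.w #7*4,d7 / ext.l d7 / lsl.w #2,d7."""
    d7 = free_bytes & 0xFFFFFFFF                  # move.l #imm,d7
    quotient, remainder = divmod(d7, 7 * 4)
    if quotient > 0xFFFF:
        # On real hardware DIVU.W sets the overflow flag and leaves d7 unchanged;
        # the sketch simply refuses such inputs.
        raise OverflowError("quotient does not fit in 16 bits")
    d7 = (remainder << 16) | quotient             # divu.w: remainder high, quotient low
    low = d7 & 0xFFFF
    d7 = low | (0xFFFF0000 if low & 0x8000 else 0)    # ext.l: sign-extend the low word
    d7 = (d7 & 0xFFFF0000) | ((d7 << 2) & 0xFFFF)     # lsl.w #2: shift the low word only
    return d7

# Hypothetical amount of free memory, for illustration only.
print(maxn_from_free_bytes(60000))
```

For any value below 64KB the quotient stays well under $8000, so ext.l clears the remainder and the result is simply (free_bytes / 28) * 4 with integer division.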

Thank you. But your version is longer and could be slower for the 68020/30. I am really very impressed by your efforts to make the code better. But you know, perfection is impossible; every next step towards the perfect result is much harder than the previous one. So IMHO we have very good code now. Further improvements will cost much and give almost nothing.

Quote:

Originally Posted by Bruce Abbott

Only because the 68000 has this arbitrary limitation. To shift by more than 1 bit (without a barrel shifter or temporary registers to store intermediate results) it would have to perform multiple reads and writes to memory, which would be very slow. Also the 16 bit opcode does not have enough space to specify both shift value and <ea>.

Even the top 68k CPUs (even the 68060) can shift only words in memory, and only by 1 bit.

Quote:

Originally Posted by Bruce Abbott

Yes. However in this case - as in many others - not all <ea> modes are valid. Specifically, modes An and Dn are illegal for shift/rotate, and will cause an exception if you try to execute them (even though some debuggers or disassemblers may think they are valid code).

There are equivalent opcodes using Dn explicitly that are valid, which an assembler could alias to for 'convenience'. This also applies to some other instructions that do have an equivalent <ea> opcode, which is a pain when the assembler silently changes one to the other without asking or providing any way to avoid it (sometimes we need to have the exact opcode we asked for!).

You are right, but it seems that you are trying to prove things that are very well known to us both. I have never claimed that the LSR D5 encoding is a particular case of the LSR <ea> encoding. I claimed exactly the same thing as you do: LSR D5 is a convenient shorthand version (an alias) for LSR #1,D5. You know, the x86 SHR AX,1 and SHR AX,2 have very different encodings, and it is good that assemblers don't force programmers to think about it. Technically it would be more correct to write SHR AX instead of SHR AX,1, because this would let us distinguish the two encodings, but it breaks the convenience of logic and is therefore not used.

Quote:

Originally Posted by Bruce Abbott

Encoding is relevant though, because it tells you which modes are valid for particular instructions.

I can't completely agree. Encoding only provides the base for the whole "building" of the assembly. It is very odd to reduce assembler usability just making it to blindly follow hardware encoding.

Quote:

Originally Posted by Bruce Abbott

"The great thing about standards is that there are so many of them!"

Yes, some code that is "syntactically correct" in one assembler may not be in another. To avoid confusion and maintain compatibility it is best to stick to a common subset with unambiguous syntax where possible, and specify the syntax used when it isn't. Otherwise people may have trouble understanding and using your code.

Thank you very much. You know there is a very old problem. You can just follow your understanding of the rules and try to satisfy everybody. This usually works worse than some people think. There is another way, someone can try to use better rules. IMHO briefer assembler statements are better for computer nerds.

Quote:

Originally Posted by Bruce Abbott

Time for summary of progress so far?
How much space and time have we now saved (or gained) over litwr's original code, as a proportion of it?

IMHO we already got almost perfect code. I reported about this in http://eab.abime.net/showpost.php?p=...&postcount=115
However saimo and Don_Adan are just trying to do the impossible. They pushed me to make some minor improvements which mean very little. Saimo also started this fruitless LSR D5 discussion.

Quote:

Originally Posted by Don_Adan

The critical code can perhaps be made only a little shorter/faster, but other code that is called only once can still be shortened.
Here is example:
from
move #10,d4
to
moveq #10,D4
Mostly the time calculation routine can be optimised for space.

VASM compiles MOVE.L #10,d4 into MOVEQ #10,D4 - however you offer to replace MOVE.W by MOVEQ and it saves 2 bytes! Thank you very much.
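
The 2-byte saving follows from the instruction encodings, which can be sketched in Python (function names are hypothetical; bit layouts as in the Motorola reference manual). Note that the two instructions are not equivalent in general: MOVEQ sign-extends its 8-bit immediate into all 32 bits of Dn, while MOVE.W leaves the upper word untouched, so the swap is only safe when the upper word does not matter.

```python
def encode_moveq(data, dn):
    """MOVEQ #data,Dn: a single 16-bit word holding a signed 8-bit immediate."""
    if not -128 <= data <= 127 or not 0 <= dn <= 7:
        raise ValueError("data must be -128..127 and register 0..7")
    return [0x7000 | (dn << 9) | (data & 0xFF)]

def encode_move_imm(data, dn, size):
    """MOVE.W / MOVE.L #data,Dn: opcode word plus one or two extension words."""
    if not 0 <= dn <= 7:
        raise ValueError("register must be 0..7")
    if size == "w":
        return [0x303C | (dn << 9), data & 0xFFFF]
    if size == "l":
        return [0x203C | (dn << 9), (data >> 16) & 0xFFFF, data & 0xFFFF]
    raise ValueError("size must be 'w' or 'l'")

def size_in_bytes(words):
    """Each 68k instruction word is 2 bytes."""
    return 2 * len(words)

for name, words in (("moveq  #10,d4", encode_moveq(10, 4)),
                    ("move.w #10,d4", encode_move_imm(10, 4, "w")),
                    ("move.l #10,d4", encode_move_imm(10, 4, "l"))):
    print(name, [hex(w) for w in words], size_in_bytes(words), "bytes")
```

The three forms take 2, 4 and 6 bytes respectively, which matches the 2 bytes saved by going from MOVE.W to MOVEQ.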

Quote:

Originally Posted by Don_Adan

Perhaps this code can be shortened/optimised too. A little shorter, a little faster.

Thank you very much again. IMHO the code has become so polished that it can dazzle somebody by its light.

But its speed and number of digits have not changed. However the programs became 6 bytes smaller and this is good. The changes have just been committed.

Quote:

Originally Posted by Bruce Abbott

I think litwr wants fastest and smallest, so it's a bit tricky. Is a 5% speedup worthwhile if it adds 4 bytes to the file size?

IMHO even a 1% speedup is rather impossible, it requires some real magic.

All efforts gave us only 4 saved cycles. 4 more saved cycles were just rediscovered. You know, the main goal is speed, the code size is secondary and much less important.

Quote:

Originally Posted by Bruce Abbott

It's also good to see the Amiga 1200 with Blizzard 1230-IV beating a 36MHz ARM3 and a 33MHz 80486 (though of course these figures don't mean much in the real world).

Of course, these are only results for this particular algorithm. This is mostly the division benchmark.

roondar · 20 May 2021, 22:32

Quote:

Originally Posted by saimo

That's all there is to it, and I'm shocked that such a basic matter started such a reaction

I can't help but agree. In my opinion, the Motorola manuals are the authority here (unless there are errata, in which case they take precedence).

Quote:

But that's a totally different story from lsr.l d5: that is just wrong, because Motorola's syntax - and, even more, instruction encoding - demands that a count be specified when the operand is a register.

It seems to me if the instruction coding for a specific instruction form does not actually exist then using a syntax that implies that form is being used is just plain wrong. End of story.

Quote:

Whoever designed the CPU and defined the instruction set with its syntax is the only authority in such matters, and that's Motorola. Alternative syntaxes can be and have been adopted, but they can't have higher authority.

I fully agree

litwr · 20 May 2021, 22:33

Quote:

Originally Posted by Don_Adan

If the main loop code starts from label .l0 (without the Write routine), then besides today's 4-byte size optimisation I can gain 6 more bytes too. It seems the 386 is not good enough to beat the 68020 in code density.

The main loop starts from .longdiv label and it ends on the bcc .l2 statement. The main loops for 80286 and 68020 have the same size now.

saimo · 20 May 2021, 22:37

Quote:

Originally Posted by litwr

You have your interpretation. I have mine.

I have shown you my logic; you prefer to stop showing yours.

The M68000UM is not a collection of poems. There is no room for interpretation. Your interpretation is wrong.

Quote:

So I continue to insist that Moto's official doc doesn't forbid LSL.L D5.

It does, and I already explained to you why, almost word by word.
The only result you'll achieve by not accepting that your interpretation is wrong is that you won't learn something new and your reputation will be affected negatively.

Quote:

BTW I have just checked LSL.L D5 with ASMONE - it works perfectly.

It means nothing.
