the multi-cpu code density contest - Page 3

litwr · 07 February 2017, 09:15

Quote:

Originally Posted by matthey

+ division starts at the most significant end making it faster

Thanks. Only this point is really important. But it is true for right-shift division only; left-shift division is generally faster.

Quote:

Originally Posted by meynaf

But ok, I'll give that 68k code. The 68020 has tools that are very powerful when you know how to use them (prepare to be shocked) :

Code:

bfset (a0){d0:1}

Fascinating!

However, I am afraid that it may be a bit slow on the 68020.
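For readers who don't use the bitfield instructions often, here is a rough C model of what a one-bit-wide BFSET does to memory. It is only a sketch: the function name `bfset1` is made up for illustration, it handles unsigned offsets only (the real instruction also accepts a negative offset in the data register), and it ignores the condition codes. The point it shows is that bitfield offsets count from the most significant bit of the base byte.

```c
#include <stdint.h>

/* Hypothetical model of "bfset (a0){d0:1}" on a byte array.
   Offset 0 is the MOST significant bit of mem[0]; higher offsets
   move towards the least significant bit and then on to the next byte. */
static void bfset1(uint8_t *mem, uint32_t bit_offset)
{
    /* byte index = offset / 8, bit inside the byte counted from the top */
    mem[bit_offset >> 3] |= (uint8_t)(0x80u >> (bit_offset & 7u));
}
```

With a zeroed buffer, offset 0 sets the top bit of the first byte and offset 9 sets the second-highest bit of the second byte.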

meynaf · 07 February 2017, 09:22

Quote:

Originally Posted by Thorham

Yeah, that's pretty short

I keep forgetting those bitfield instructions for some reason

However, is it faster (greater than 5 byte case is 22 cycles)?

Single-bit access can't be a 5-byte case

idrougge · 07 February 2017, 09:28

Quote:

Originally Posted by AnimaInCorpore

I wouldn't say "obsolete" but "less important".

Just like the 68k. There are probably people out there maintaining 68k/Coldfire code in all kinds of projects, but desktop applications, games and even demos are written in high-level languages today. The percentage of PC coders who know and use x86 assembly is much smaller than the amount of Amiga coders who know 68k assembly.

meynaf · 07 February 2017, 09:41

Quote:

Originally Posted by idrougge

The percentage of PC coders who know and use x86 assembly is much smaller than the amount of Amiga coders who know 68k assembly.

That's very true !
So there is little point seeking asm help on some PC forum.

idrougge · 07 February 2017, 09:42

I never said you should seek help on a PC forum.

meynaf · 07 February 2017, 09:50

Quote:

Originally Posted by idrougge

I never said you should seek help on a PC forum.

Stack Overflow is a PC forum.

Thorham · 07 February 2017, 10:11

Quote:

Originally Posted by litwr

However, I am afraid that it may be a bit slow on the 68020.

Just benched it on a 68030 (same cycle times as 68020). It's the same speed as this:

Code:

    move.l  d0,d1
    lsr.l   #5,d1
    bset    d0,(a0,d1.w*4)

Very nice instructions those bitfield instructions.

Quote:

Originally Posted by meynaf

Single-bit access can't be a 5-byte case

Thought it referred to the accessed bytes.

Quote:

Originally Posted by idrougge

Just like the 68k. There are probably people out there maintaining 68k/Coldfire code in all kinds of projects, but desktop applications, games and even demos are written in high-level languages today. The percentage of PC coders who know and use x86 assembly is much smaller than the amount of Amiga coders who know 68k assembly.

Of course, but assembly language isn't obsolete. Knowledge of assembly language is still critically important. We simply can't do without it yet. As long as computers have CPUs with instruction sets like we have now, we can't do without assembly language knowledge.

Quote:

Originally Posted by meynaf

Stack Overflow is a PC forum.

It's a programming forum. Not everything on stack overflow is peecee related. Most of it is language related.

idrougge · 07 February 2017, 10:20

Quote:

Originally Posted by Thorham

Of course, but assembly language isn't obsolete. Knowledge of assembly language is still critically important. We simply can't do without it yet. As long as computers have CPUs with instruction sets like we have now, we can't do without assembly language knowledge.

Do I strike you as so stupid as to suggest that? Take my statement with a grain of salt.

Quote:

Originally Posted by meynaf

Stack Overflow is a PC forum.

StackOverflow is a lot of forums. I mainly use it as a Mac/iOS programming forum, but Codegolf is… well, check for yourself instead of just being stubborn.

litwr · 07 February 2017, 10:27

Quote:

Originally Posted by matthey

Theoretically, any 68k CPU with a 16 bit data bus could be faster when adding/subtracting a 32 bit number in memory but memory was likely already fast enough (and the 68k slow enough) that it made little if any difference. I would love to hear anywhere you read that it made a difference and how much though.

Just read the manuals.
[68000] ADDI.l #,Dn 16 cycles, MOVE.l #,Dn 12 cycles
The timing should be equal for LE.

Thorham · 07 February 2017, 10:44

Quote:

Originally Posted by idrougge

Do I strike you as so stupid as to suggest that?

You didn't suggest it, you just outright wrote it

Quote:

Originally Posted by idrougge

Assembly language has been obsolete for just as long.

Quote:

Originally Posted by idrougge

StackOverflow is a lot of forums.

No, it's not. It's one forum which is part of StackExchange: http://stackexchange.com/

litwr · 07 February 2017, 10:54

Quote:

Originally Posted by matthey

+ more human readable hex/binaries and text in memory

DEC disappeared claiming octals have better readability. They didn't support hexadecimals to the end.

What a stupidity!

Motorola made a similar mistake...
BTW StackOverflow is one of the best IT forums. You may ask even about 8-bit Commodore there.

grond · 07 February 2017, 12:52

Quote:

Originally Posted by meynaf

Code:

; a0=source, a1-a4=dest
 move.w #1999,d0
.loop
 movem.l (a0)+,d1-d4
 move.l d1,d5
 swap d5
 move.w d3,d5
 move.l d5,(a2)+
 move.l d1,d5
 swap d3
 move.w d3,d5
 move.l d5,(a1)+
 move.l d2,d5
 swap d5
 move.w d4,d5
 move.l d5,(a4)+
 move.l d2,d5
 swap d4
 move.w d4,d5
 move.l d5,(a3)+
 dbf d0,.loop
 rts

If this code is about code density, it can be done better:

Code:

; a0=source, a1-a4=dest
 move.l a0,a5
 adda.w #32000,a5 ; end of source: 2000 iterations * 16 bytes
.loop
 movem.w (a0)+,d0-d7
 swap d0
 swap d1
 move.w d4,d0
 move.w d5,d1
 move.l d0,(a1)+
 swap d2
 move.l d1,(a2)+
 move.w d6,d2
 swap d3
 move.l d2,(a3)+
 move.w d7,d3
 move.l d3,(a4)+
 cmpa.l a0,a5
 bhi .loop ; loop while a0 < a5 (bpl would run one extra pass)
 rts

I haven't counted cycles but it might also be faster. Of course, optimising for speed is a different topic altogether, especially if the destination is in chipmem. For this reason I don't think it makes much sense to analyse one aspect, i.e. code density, without the others (execution speed on a real life system).
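As a reference for comparing the variants, here is a plain C sketch of the data movement both listings in this post appear to perform, traced by hand from the code: each group of eight source words w0..w7 becomes four longwords, one per destination. The function name `split_words` and its parameters are made up for illustration, and the word order is my reading of the listings, not something stated in the thread.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reference model. For each group of eight 16-bit words:
   d1 gets (w0:w4), d2 gets (w1:w5), d3 gets (w2:w6), d4 gets (w3:w7),
   with the first word of each pair in the upper half of the longword. */
static void split_words(const uint16_t *src, size_t groups,
                        uint32_t *d1, uint32_t *d2,
                        uint32_t *d3, uint32_t *d4)
{
    for (size_t g = 0; g < groups; g++) {
        const uint16_t *w = src + 8 * g;   /* 16 bytes per group */
        d1[g] = ((uint32_t)w[0] << 16) | w[4];
        d2[g] = ((uint32_t)w[1] << 16) | w[5];
        d3[g] = ((uint32_t)w[2] << 16) | w[6];
        d4[g] = ((uint32_t)w[3] << 16) | w[7];
    }
}
```

Calling it with `groups` set to 2000 corresponds to the 2000 passes of the dbf loop, i.e. 32000 source bytes.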

meynaf · 07 February 2017, 14:02

Quote:

Originally Posted by idrougge

StackOverflow is a lot of forums. I mainly use it as a Mac/iOS programming forum, but Codegolf is… well, check for yourself instead of just being stubborn.

Well, let me be more precise then : Stack Overflow is mostly a PC forum.
(And if you think they're really helpful in our case, just go and ask them instead of wasting my time here.)

Quote:

Originally Posted by Thorham

Thought it referred to the accessed bytes.

You don't need to access more than 1 byte if you want to get 1 bit.

Quote:

Originally Posted by litwr

DEC disappeared claiming octals have better readability. They didn't support hexadecimals to the end.

What a stupidity!

Motorola made a similar mistake...

Endianness has nothing to do with octals. Octals are unreadable, little endian is unreadable - if you really want to compare, it's Intel who made the mistake.

Quote:

Originally Posted by litwr

BTW StackOverflow is one of the best IT forums. You may ask even about 8-bit Commodore there.

You may ask many things in many places. Getting meaningful replies is something else.

Quote:

Originally Posted by grond

I haven't counted cycles but it might also be faster. Of course, optimising for speed is a different topic altogether, especially if the destination is in chipmem. For this reason I don't think it makes much sense to analyse one aspect, i.e. code density, without the others (execution speed on a real life system).

It was about doing it fully 32-bit. If you use 16-bit memory accesses, the loop is just 4 instructions !

grond · 07 February 2017, 14:12

Quote:

Originally Posted by meynaf

It was about doing it fully 32-bit. If you use 16-bit memory accesses, the loop is just 4 instructions !

Well, if you change or augment the rules mid-game... I assumed the destination was in chipmem and the source in fast. Otherwise using movem wouldn't have been the best decision anyway.

phx · 07 February 2017, 14:20

Quote:

Originally Posted by meynaf

Endianness has nothing to do with octals. Octals are unreadable, little endian is unreadable - if you really want to compare, it's Intel who made the mistake.

I agree. No sane person would design a new CPU with little-endian, except for compatibility issues with x86, PCI-bus, etc.

The x86 inherited little-endian from their 8-bit CPUs, where it makes sense to read the least significant bytes first from memory when doing operations on them. But there is no reason for real 32/64 bits CPUs, except for compatibility with former models.

Little-endian alone is a reason for me to stay away from a CPU. I never did much with ARM either, because it is mostly used in LE mode.
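The point about 8-bit CPUs can be sketched in C. This is an illustration only, with a made-up function name `add_le`: when a multi-byte number is added one byte at a time, the carry travels from the least significant byte upwards, so a little-endian layout lets the CPU start with the first byte it fetches.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch: add two n-byte little-endian numbers, one byte per
   step, the way an 8-bit ALU has to. dst[0] and src[0] are the LEAST
   significant bytes. Returns the carry out of the top byte. */
static unsigned add_le(uint8_t *dst, const uint8_t *src, size_t n)
{
    unsigned carry = 0;
    for (size_t i = 0; i < n; i++) {       /* ascending addresses */
        unsigned sum = (unsigned)dst[i] + src[i] + carry;
        dst[i] = (uint8_t)sum;             /* keep the low 8 bits */
        carry = sum >> 8;                  /* pass the carry to the next byte */
    }
    return carry;
}
```

With a big-endian layout the same loop would have to walk from the highest address down. A CPU that fetches the whole operand in one access does not care either way.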

meynaf · 07 February 2017, 14:25

Quote:

Originally Posted by grond

Well, if you change or augment the rules mid-game... I assumed the destination was in chipmem and the source in fast. Otherwise using movem wouldn't have been the best decision anyway.

Perhaps I was just a little unclear about it, sorry.
The destination is in chipmem and the source in fast.
It's just that reads and writes must be performed with 32-bit width, or it becomes meaningless (see my 4x move.w explanation in previous posts).

Thorham · 07 February 2017, 14:48

Quote:

Originally Posted by meynaf

You don't need to access more than 1 byte if you want to get 1 bit.

I meant that after the 5th byte you'd get a penalty similar to shifting more than 8 bits. The documentation says 14 cycles for < 5 bytes, and 22 cycles > 5 bytes.

grond · 07 February 2017, 15:01

Quote:

Originally Posted by meynaf

It's just that reads and writes must be performed with 32-bit width, or it becomes meaningless

Why would doing 16-bit reads from fast and 32-bit writes to chip be meaningless? I understand you are investigating code density but made the extra condition to use 32-bit moves for the writes because they are to chipmem on a 32-bit chipmem machine. Your four word-size moves example violates this condition. My code does not and shows better code density and possibly even better speed on some 68k. Too proud to admit this?

matthey · 07 February 2017, 16:02

Quote:

Originally Posted by Thorham

You mean 68060?

Yes, and likely any other superscalar 68k CPU.

Quote:

Originally Posted by Thorham

Why exactly? If it's the instruction ordering, then wouldn't that be irrelevant because of the pipelining (slow chipmem writes)?

It is instruction scheduling and forwarding concerns. Your code has many dependencies. The 68060 is in order superscalar so it can not reorder instructions which it executes in pairs (OoO execution benefits from better superscalar scheduling also). A calculation result of the first instruction of a pair can not be sourced for the 2nd instruction because it has not been completed yet. Superscalar dual execution of the code could save a few cycles allowing chip mem writes to start sooner. This is no different than other optimizations.

Quote:

Originally Posted by Thorham

Anyway, I optimize code for 68020s and 68030s because they need it far more than 68060s, and I don't know a whole lot about 68060 optimization (I only know something about instruction ordering). Especially when a plain A1200 is your target, 68060 optimization isn't relevant anymore.

It is possible to produce fairly optimal code for 68020-68060. Code for the 68020-68030 can usually be instruction scheduled for the 68060 with little if any slow down (see below for my attempt).

Quote:

Originally Posted by Thorham

Also, it was about code size

I wasn't criticizing your code. It was the shortest, even for the 68060.

Quote:

Originally Posted by Thorham

Code:

    move.w  #1999,d0 ; pOEP
.loop
    movem.l (a0)+,d1-d4 ; pOEP only
    swap    d3 ; pOEP only
    eor.w   d1,d3 ; pOEP
    eor.w   d3,d1 ; pOEP (dependency)
    move.l  d1,(a1)+ ; pOEP (only .l forwarded)
    eor.w   d1,d3 ; sOEP only
    swap    d3 ; pOEP only
    move.l  d3,(a2)+ ; pOEP
    swap    d4 ; pOEP only
    eor.w   d2,d4 ; pOEP
    eor.w   d4,d2 ; pOEP (dependency)
    move.l  d2,(a3)+  ; pOEP (only .l forwarded)
    eor.w   d2,d4 ; sOEP
    swap    d4 ; pOEP only
    move.l  d4,(a4)+ ; pOEP
    dbra    d0,.loop ; pOEP only
    rts  ; pOEP only

pOEP = primary integer pipe
sOEP = secondary integer pipe

Optimum would be every other instruction being sOEP although that is rarely possible. There are some instructions which can't be sOEP in the 68060 and don't even allow an sOEP instruction at the same time like MOVEM, SWAP (oversight/mistake as it could and should have been), MUL and DIV. There isn't much room to reschedule your code. This is just the nature of the EOR exchange algorithm which does more calculations.
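To make the dependency chain of the EOR exchange easier to see, here is a small C model of the three eor.w steps that swap the low words of two registers. The helper names `eor_w` and `xchg_low_words` are made up for illustration. Each step reads the result of the step before it, which is why there is so little room to pair or reorder them.

```c
#include <stdint.h>

/* Hypothetical model of "eor.w src,dst": XOR the low 16 bits of src
   into dst and leave the upper half of dst untouched. */
static uint32_t eor_w(uint32_t src, uint32_t dst)
{
    return dst ^ (src & 0xFFFFu);
}

/* Swap the low words of *a and *b with three dependent EOR.W steps. */
static void xchg_low_words(uint32_t *a, uint32_t *b)
{
    *b = eor_w(*a, *b);   /* b.lo = a.lo ^ b.lo              */
    *a = eor_w(*b, *a);   /* a.lo = old b.lo (needs step 1)  */
    *b = eor_w(*a, *b);   /* b.lo = old a.lo (needs step 2)  */
}
```

The move-based versions need a scratch register instead, but their instructions depend on each other less.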

Quote:

Originally Posted by meynaf

Code:

; a0=source, a1-a4=dest
 move.w #1999,d0 ; pOEP
.loop
 movem.l (a0)+,d1-d4 ; pOEP only
 move.l d1,d5 ; pOEP
 swap d5 ; pOEP only
 move.w d3,d5 ; pOEP
 move.l d5,(a2)+ ; pOEP
 move.l d1,d5 ; sOEP
 swap d3 ; pOEP only
 move.w d3,d5 ; pOEP
 move.l d5,(a1)+ ; pOEP
 move.l d2,d5 ; sOEP
 swap d5 ; pOEP only
 move.w d4,d5 ; pOEP
 move.l d5,(a4)+ ; pOEP
 move.l d2,d5 ; sOEP
 swap d4 ; pOEP only
 move.w d4,d5 ; pOEP
 move.l d5,(a3)+ ; pOEP
 dbf d0,.loop ; pOEP only
 rts ; pOEP only

Meynaf's code is not much better but has more opportunities to reschedule. The 68060 has optimizations for MOVE.L which helps his code. I believe I found 2 instructions which can be removed from his code as follows.

Code:

; a0=source, a1-a4=dest
 move.w #1999,d0 ; pOEP
.loop
 movem.l (a0)+,d1-d4 ; pOEP only
 move.l d1,d5 ; pOEP
 swap d1 ; pOEP only
 move.w d3,d1 ; pOEP
 swap d3 ; pOEP only
 move.l d1,(a2)+ ; pOEP
 move.w d3,d5 ; sOEP
 move.l d5,(a1)+ ; pOEP
 move.l d2,d5 ; sOEP
 swap d5 ; pOEP only
 move.w d4,d5 ; pOEP
 swap d4 ; pOEP only
 move.l d5,(a4)+ ; pOEP
 move.w d4,d2 ; sOEP
 move.l d2,(a3)+ ; pOEP
 dbf d0,.loop ; pOEP only
 rts ; pOEP only

On a superscalar CPU like the 68060 but with SWAP available in the sOEP (I believe the Apollo core would qualify), we would have much better dual execution as the following code shows.

Code:

; a0=source, a1-a4=dest
 move.w #1999,d0 ; pOEP
.loop
 movem.l (a0)+,d1-d4 ; pOEP only
 move.l d1,d5 ; pOEP
 swap d1 ; sOEP
 move.w d3,d1 ; pOEP
 swap d3 ; sOEP
 move.l d1,(a2)+ ; pOEP
 move.w d3,d5 ; sOEP
 move.l d5,(a1)+ ; pOEP
 move.l d2,d5 ; sOEP
 swap d5 ; pOEP
 move.w d4,d5 ; pOEP (dependency)
 swap d4 ; sOEP
 move.l d5,(a4)+ ; pOEP
 move.w d4,d2 ; sOEP
 move.l d2,(a3)+ ; pOEP
 dbf d0,.loop ; pOEP only
 rts ; pOEP only

Such a simple mistake of not supporting SWAP in the sOEP was very costly here. SWAP is quite common and simple, and its result can be forwarded, so not allowing it in the sOEP is right up there with removing the multiply with 64-bit result on the 68060 for brain farts.

litwr · 07 February 2017, 16:53

Quote:

Originally Posted by Thorham

Just benched it on a 68030 (same cycle times as 68020). It's the same speed as this:

Code:

    move.l  d0,d1
    lsr.l   #5,d1
    bset    d0,(a0,d1.w*4)

Very nice instructions those bitfield instructions.

Theoretically it is nice, but practically... Motorola always has oddities like CLR, which looks fine but works like a slow RMW-type instruction. I am aware that CLR works properly on the 68020. However, this illustrates the common fact that Motorola was always too theoretical and forced users of its CPUs to use somewhat raw and bulky instructions. Bit instructions were also part of the forgotten NS 320xx ISA. BTW it was the first 32-bit CPU on a chip.
The 386 code:

Code:

mov ebx,eax ;2
shr ebx,5   ;3
bts [esi+4*ebx],eax  ;12

17 ticks and 9 bytes.
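The sequences in this exchange are built around splitting a bit number into a longword index (bit number divided by 32) and a bit position inside that longword. A portable C sketch of that idea follows. The name `set_bit32` is made up, and numbering bits from the least significant end of each longword is an assumption chosen to match x86 BTS. The 68k bitfield instructions count from the most significant end instead, so the two families do not touch the same physical bit for the same number.

```c
#include <stdint.h>

/* Hypothetical helper: set bit number `bit` in an array of 32-bit words
   and return its previous state (BTS reports that in the carry flag). */
static int set_bit32(uint32_t *words, uint32_t bit)
{
    uint32_t mask = (uint32_t)1 << (bit & 31u);   /* position inside the word */
    uint32_t *w = &words[bit >> 5];               /* word index = bit / 32    */
    int old = (*w & mask) != 0;
    *w |= mask;
    return old;
}
```

For example, bit 37 lands in the second longword as mask 0x20.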
BTW, BE is a horror! It is even worse than octals. Decimals are horrific too, but IBM implemented BCD floating point in its latest mainframes... We have an imperfect world. C'est la vie.

Somebody confuses the external and internal representations.

The same shame we have with Unicode.
