English Amiga Board


English Amiga Board > Coders > Coders. Asm / Hardware

 
 
Old 17 February 2017, 15:12   #121
buggs
Registered User
 
Join Date: May 2016
Location: Rostock/Germany
Posts: 132
Quote:
Originally Posted by meynaf View Post
Ok, as I wish to recruit, I must at least give something to do.
Code:
; a0=source, a1-a4=dest
; word k of the source stream goes to destination (k mod 4): a1,a2,a3,a4
 move.w #1999,d0
.loop
 movem.l (a0)+,d1-d4   ; d1-d4 = words w0:w1 w2:w3 w4:w5 w6:w7
 move.l d1,d5
 swap d5
 move.w d3,d5
 move.l d5,(a2)+       ; w1:w5
 move.l d1,d5
 swap d3
 move.w d3,d5
 move.l d5,(a1)+       ; w0:w4
 move.l d2,d5
 swap d5
 move.w d4,d5
 move.l d5,(a4)+       ; w3:w7
 move.l d2,d5
 swap d4
 move.w d4,d5
 move.l d5,(a3)+       ; w2:w6
 dbf d0,.loop
 rts
<snip> Or do it for ARM or whatever cpu. <snip>
With VASM out, I figured it might be time to post something for "whatever CPU".
Code:
 move #999,d0
.loop
 load (a0)+,E0          ;a0 b0 c0 d0 (.w)
 load (a0)+,E1          ;a1 b1 c1 d1
 load (a0)+,E2          ;a2 b2 c2 d2
 load (a0)+,E3          ;a3 b3 c3 d3
 transhi E0-E3,E4:E5    ;E4: a0 a1 a2 a3 E5: b0 b1 b2 b3
 translo E0-E3,E6:E7    ;E6: c0 c1 c2 c3 E7: d0 d1 d2 d3
                     ;TRANS has latency, 1 cyc lost in this example
 store E4,(a1)+         ;
 store E5,(a2)+         ;
 store E6,(a3)+         ;
 store E7,(a4)+         ;inner loop assembles to 10 * 32 Bit
 dbf   d0,.loop          ;plus move, dbf = 12 * 32 Bit
Code as shown will process 32 bytes per run in 11 cycles. Obviously, it won't be of much use in the original scenario as data keeps piling up in the write buffers (as long as A1-A4 are in Chip). But it'll perform quite nicely when A1-A4 point to a fast memory location.
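What the 68k loop quoted above computes is a 4-way word de-interleave: 16-bit word k of the source stream ends up in destination stream k mod 4 (a1..a4). A minimal C model of that data movement (the function name is mine, not from the thread):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* De-interleave a stream of 16-bit words into four destination streams:
   source word k goes to destination k % 4, as in the 68k routine above. */
static void deinterleave4(const uint16_t *src, size_t nwords,
                          uint16_t *d1, uint16_t *d2,
                          uint16_t *d3, uint16_t *d4)
{
    uint16_t *dst[4] = { d1, d2, d3, d4 };
    for (size_t k = 0; k < nwords; k++)
        *dst[k % 4]++ = src[k];
}
```

The 68k version does this 8 words (16 bytes) at a time, using movem.l for the load and swap/move.w to reassemble two words per output longword.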
buggs is offline  
Old 17 February 2017, 15:29   #122
meynaf
son of 68k
 
 
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,355
Unfortunately for you this thread is about code density. And your example shows no benefit at all in this area.

Last edited by prowler; 01 March 2017 at 22:11. Reason: Cleanup.
meynaf is offline  
Old 18 February 2017, 13:40   #123
DamienD
Banned
 
 
Join Date: Aug 2005
Location: London / Sydney
Age: 47
Posts: 20,420
Why do all these threads keep going downhill; it's usually the same people arguing???

I really don't have time to be reading through 7 pages of bickering between you guys... I need to prepare for a new job over the weekend so...

Closed for now until another GM has time to review.
DamienD is offline  
Old 01 March 2017, 22:16   #124
prowler
Global Moderator
 
 
Join Date: Aug 2008
Location: Sidcup, England
Posts: 10,300
Quote:
Originally Posted by DamienD View Post
Closed for now until another GM has time to review.
Done, thanks Damien!

Thread reopened. Now, let's try again, shall we, guys?

Last edited by prowler; 03 March 2017 at 22:24. Reason: typo.
prowler is offline  
Old 03 March 2017, 13:21   #125
meynaf
son of 68k
 
 
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,355
Ok, let's try again.

I have a new case to submit. It's a complicated one - to be done, of course, in a minimal amount of code.
This is a real life case, but discussing details would probably lead to endless OT.

Here is pseudo-code, as I guess a plain explanation wouldn't be clear enough:
Code:
flag=0
start:
struct = data           ; some rel(pc) array of structs with sizeof =8
loop:
x = struct[5] >>4       ; -- -- -- -- -- x-
v1=table[x*2]           ; attn: table too far for d8(pc,ix) but ok for d16(pc)
v2=table[x*2 +1]
if flag and v2>=0 goto skip  ; v2>=0 is bit #7 test
cc = (v2<0)                  ; set condition (passed to call func via some reg)
if flag then cc=0
var = (v1&128) + (v2&15)     ; but bits 6,5,4 of var are "don't care"
call func (struct,cc,var)
skip:                   ; value in var is unimportant if we skip
struct = struct +8      ; next item (sizeof =8)
if struct[0]<>0 or struct[1]<>0 or struct[2]<>0 or struct[3]<>0 goto loop
if flag goto error
flag = 1
goto start
error:
The function called here (actually unrelated inline code) will sometimes not return, so we don't necessarily end up at the error label.

Note : pseudo-code is of course not optimal.
Use as few registers as you can.
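For readers who prefer C to pseudo-code, here is a direct, unoptimized transliteration of the control flow above (all names are mine; func is a stub standing in for the unrelated inline code, and the error path is simply the loop exit):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct item { uint8_t b[8]; };          /* sizeof == 8, as in the spec */

static int calls;                       /* instrumentation for testing */
static void func(const struct item *s, int cc, unsigned var)
{
    (void)s; (void)cc; (void)var;       /* real code would act here */
    calls++;
}

static void run(const struct item *data, const int8_t *table)
{
    for (int flag = 0; ; flag = 1) {    /* two passes: flag = 0, then 1 */
        const struct item *s = data;
        do {
            unsigned x = s->b[5] >> 4;                /* -- -- -- -- -- x- */
            int8_t v1 = table[2*x], v2 = table[2*x + 1];
            if (!(flag && v2 >= 0)) {                 /* else: skip */
                int cc = flag ? 0 : (v2 < 0);
                unsigned var = (v1 & 128) + (v2 & 15);
                func(s, cc, var);
            }
            s++;                                      /* next item */
        } while (s->b[0] | s->b[1] | s->b[2] | s->b[3]);
        if (flag)
            break;                                    /* error: */
    }
}
```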

Last edited by meynaf; 03 March 2017 at 13:36. Reason: fixed two mistakes in pseudo code
meynaf is offline  
Old 03 May 2017, 10:22   #126
litwr
Registered User
 
Join Date: Mar 2016
Location: Ozherele
Posts: 229
Sorry for the delay - I was a bit sick.

Quote:
Originally Posted by meynaf View Post
Total nonsense. CLR is a very useful instruction (when I counted, I found that 2.8% of instructions were CLR). The 68000 bug is just an implementation mistake, like the Pentium's FDIV bug.
CLR is useful but slow. We all like more speed, don't we? I suspect the same thing about BSET: it is indeed useful, but it may be too slow. So it is good in theory, but in practice Moto used people as beta-testers of their raw products. The FDIV bug required a good scientist to reveal it, but CLR braked almost every 68000/68010 program.
I can also mention the very expensive PDP/VAX-11 ISA, which Moto used as a pattern. It is interesting that the "mighty" VAX-11/730 can be outperformed by a 6502 @4MHz! Look at the pi-spigot results for proof. Moto came too close to this madness.

Quote:
Originally Posted by meynaf View Post
17 ticks, depends on cpu implementation.
8 bytes for 68000, 4 bytes for 68020. I think the situation is clear.
And where are the exact tick counts for the 680x0? I suspect it will be more than 17 for the 68000/68010/68008.

Quote:
Originally Posted by meynaf View Post
Replace "BE" by "LE" in the above and it might become true.
Endianness is only different in the memory interface. For inside the cpu, everything is exactly the same.
My point is that the user can't feel any difference between BE and LE, but LE is faster for multi-word operations like additions and subtractions. So what is BE for? Do people like slow code?! Granted, modern CPUs don't need multi-word additions or subtractions. But, I repeat, Moto forced people to use slower code for purely abstract theoretical reasons.
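litwr's speed argument is the classic one about limb order: multi-word addition has to start at the least-significant word, because that is where the carry originates. A small C sketch of the idea (limb 0 is the least-significant word, i.e. little-endian storage order; the function name is mine). Whether this layout detail costs anything on a given CPU is exactly what the following posts dispute:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Multi-word addition: the carry propagates upward from limb 0, so a
   layout storing the least-significant word first can be streamed in
   increasing address order. */
static void add_limbs(const uint16_t *a, const uint16_t *b,
                      uint16_t *sum, size_t n)
{
    uint32_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        uint32_t t = (uint32_t)a[i] + b[i] + carry;
        sum[i] = (uint16_t)t;          /* low 16 bits of the partial sum */
        carry = t >> 16;               /* carry into the next limb */
    }
}
```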

Quote:
Originally Posted by meynaf View Post
The registers are 32 bits so the 16 bit memory is never touched in the latter case yet the timing difference is the same. It looks to me like ADD.L is just slower than MOVE.L and there is no Big Endian penalty for code fetch. It would be easy to buffer (cache) a small amount of the instruction stream code to avoid the BE penalty with longwords (I vaguely recall the 68000 having just such a buffer). Memory data accesses are what you want to look at.
It may be a different point: the entire 68000 architecture treats BE as the top priority, and they sacrificed some clocks for it everywhere. BE objectively creates a delay for multi-byte operations; all the other reasons are just speculative.

Quote:
Originally Posted by meynaf View Post
Ok, let's try again.
I have a new case to submit. It's a complicated one - to be done, of course, in a minimal amount of code.
It is too big to be a sport event.
litwr is offline  
Old 03 May 2017, 20:32   #127
meynaf
son of 68k
 
 
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,355
Quote:
Originally Posted by litwr View Post
CLR is useful but slow. We all like more speed, don't we? I suspect the same thing about BSET: it is indeed useful, but it may be too slow. So it is good in theory, but in practice Moto used people as beta-testers of their raw products.
CLR isn't slow. It's a single instruction doing the job of 2. Is STZ on 65c02 slow ?
BSET isn't slow. And you can't criticize Moto for this one - x86 has it since 386.
Starting with 68040 both are just 1 clock.


Quote:
Originally Posted by litwr View Post
The FDIV bug required a good scientist to reveal it, but CLR braked almost every 68000/68010 program.
Wrong. CLR did not break even a single program on the ST, and on Amiga it makes a difference only for write-only hardware registers (the mistake here - there is one, yes - isn't in the cpu but in these hw regs that can't be read).
x86 is full of bugs (like the ol' opcode F0 0F C7 C8 that simply hung the first Pentiums, like a $02 on the 6502). The 80286 was so buggy that it was impossible to get out of protected mode...


Quote:
Originally Posted by litwr View Post
I can also mention the very expensive PDP/VAX-11 ISA, which Moto used as a pattern. It is interesting that the "mighty" VAX-11/730 can be outperformed by a 6502 @4MHz! Look at the pi-spigot results for proof. Moto came too close to this madness.
Unlike VAX which took just about everything and more, Moto did statistical analysis on existing programs and removed what was not useful.


Quote:
Originally Posted by litwr View Post
And where are the exact tick counts for the 680x0? I suspect it will be more than 17 for the 68000/68010/68008.
I don't know. I don't care. The number of ticks depends on the implementation; you can't judge an instruction set by tick counts.


Quote:
Originally Posted by litwr View Post
My point is that the user can't feel any difference between BE and LE, but LE is faster for multi-word operations like additions and subtractions.
LE isn't faster on anything but an 8-bit CPU!
And a user looking at data files will find them very hard to read if they're in LE. I've been through that too many times.
Now say you want to check some file contents. Like a WAV. Even though the format is LE, it will still cause trouble when you compare e.g. "RIFF" (52494646). On 68k it's a simple cmpi.l #"RIFF",(a0)+. On x86 "RIFF" may well translate to 46464952 and then you get wrong code.
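meynaf's "RIFF" example in C terms: the four header bytes read as one 32-bit integer give 0x52494646 on a big-endian CPU but 0x46464952 on a little-endian one, so the naive integer compare is endian-dependent while a byte compare is not (function names are mine):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Endian-dependent: reads the 4 header bytes as a native 32-bit integer. */
static int is_riff_int(const uint8_t *p)
{
    uint32_t v;
    memcpy(&v, p, 4);
    return v == 0x52494646u;   /* 'R','I','F','F' - true only on big-endian */
}

/* Portable: compares bytes, independent of host byte order. */
static int is_riff_portable(const uint8_t *p)
{
    return memcmp(p, "RIFF", 4) == 0;
}
```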


Quote:
Originally Posted by litwr View Post
So what is BE for? Do people like slow code?! Granted, modern CPUs don't need multi-word additions or subtractions. But, I repeat, Moto forced people to use slower code for purely abstract theoretical reasons.
BE is for a clean architecture, a concept apparently alien to many people.


Quote:
Originally Posted by litwr View Post
It may be a different point: the entire 68000 architecture treats BE as the top priority, and they sacrificed some clocks for it everywhere. BE objectively creates a delay for multi-byte operations; all the other reasons are just speculative.
Wake up, it's not 1980 anymore. The 6800 was maybe slowed down by this - not the 68000.


Quote:
Originally Posted by litwr View Post
It is too big to be a sport event.
Big asm code shouldn't be a problem with a decent instruction set.
And this is the problem with x86's so-called good code density: it's good only for very small code - as soon as the code gets larger, it starts to suck. HOMM2 on x86 is 1.5MB of code; on 68k it's 0.9MB (even though the compiler did a really poor job - it could have been half that size).
For your "sport event", it needs to be big enough to put some pressure on the register file. What would a c2p look like on x86, for example?
meynaf is offline  
Old 05 May 2017, 09:17   #128
litwr
Registered User
 
Join Date: Mar 2016
Location: Ozherele
Posts: 229
Quote:
Originally Posted by meynaf View Post
CLR isn't slow. It's a single instruction doing the job of 2. Is STZ on 65c02 slow ?
Using RMW cycles for CLR - is that OK?! It should have been just one write.

Quote:
Originally Posted by meynaf View Post
Wrong. CLR did not break even a single program on the ST, and on Amiga
I meant brake, not break.

Quote:
Originally Posted by meynaf View Post
On x86 "RIFF" may well translate to 46464952 and then you get wrong code.
I agree some assemblers for LE may have this problem, but it is just representation. There is no problem writing "RIFF" with a properly configured x86 assembler. The machine-language level is the level of the legendary coders of the 50s...

Last edited by litwr; 05 May 2017 at 19:44.
litwr is offline  
Old 05 May 2017, 20:56   #129
meynaf
son of 68k
 
 
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,355
Quote:
Originally Posted by litwr View Post
Using RMW cycles for CLR - is that OK?! It should have been just one write.
Of course it should have been just a write, but this is only an implementation problem.
I'm not saying the implementation of 68k is good. I'm just saying the instruction set is good (at least, good enough for asm use).
And anyway this was fixed in 68020 (or 68010 ? I don't remember).


Quote:
Originally Posted by litwr View Post
I meant brake, not break.
Damned english
Again, it's just an old 68000 problem.


Quote:
Originally Posted by litwr View Post
I agree some assemblers for LE may have this problem, but it is just representation. There is no problem writing "RIFF" with a properly configured x86 assembler. The machine-language level is the level of the legendary coders of the 50s...
This is just representation, of course it is. The whole thing IS a matter of representation! On BE it's obvious; on LE you may get bad surprises. This has nothing to do with writing machine language directly.

Now... perhaps I should remind you that this is a thread about code density.
So when will you write some code?
meynaf is offline  
Old 19 May 2017, 23:17   #130
matthey
Banned
 
Join Date: Jan 2010
Location: Kansas
Posts: 1,284
Dr. Vince Weaver finally updated his code density web site and documentation using some improvements suggested in this thread.

http://www.deater.net/weave/vmwprod/asm/ll/

The 68k has the best code density for the LZSS decoder and 2nd best for total size. Thanks to Vince, to all who contributed code, and to the 68k designers for one of the greatest CPU architectures of all time.
matthey is offline  
Old 24 May 2017, 13:55   #131
Megol
Registered User
 
 
Join Date: May 2014
Location: inside the emulator
Posts: 377
Quote:
Originally Posted by matthey View Post
Dr. Vince Weaver finally updated his code density web site and documentation using some improvements suggested in this thread.

http://www.deater.net/weave/vmwprod/asm/ll/

The 68k has the best code density for the LZSS decoder and 2nd best for total size. Thanks to Vince, to all who contributed code, and to the 68k designers for one of the greatest CPU architectures of all time.
Congratulations - see how it's better to try to help instead of just ranting?

Note that the ARM and x86 code can be improved, probably the 68k and others too.
Megol is offline  
Old 24 May 2017, 18:39   #132
matthey
Banned
 
Join Date: Jan 2010
Location: Kansas
Posts: 1,284
Quote:
Originally Posted by Megol View Post
Congratulations - see how it's better to try to help instead of just ranting?
Yes, of course it is better to cooperate. Some improvement is better than none even when people do things the hard (or flawed) way.

Quote:
Originally Posted by Megol View Post
Note that the ARM and x86 code can be improved, probably the 68k and others too.
I expect other architecture code can be improved too. Even the 68k code could be improved more. I submitted 3 source files with increasing optimizations. NorthWay's code was the most aggressively optimized but did not work (I could have messed up somewhere). I probably could have made it run if I could have debugged it on the Amiga but it will only run on Linux. I wasn't too worried about it when Vince asked and was happy to have the 68k in the ballpark of where it should be. I told him he should work on the others using what he learned from my submissions which are mostly like what a compiler should do. He really should have started with compiler generated code although that is no guarantee of good code quality on the 68k. Maybe we will start to see some 68k compiler improvements with bebbo's gcc changes and vbcc being sponsored.

http://eab.abime.net/showthread.php?t=85474
http://eab.abime.net/showthread.php?t=87205
matthey is offline  
Old 24 May 2017, 21:37   #133
Megol
Registered User
 
 
Join Date: May 2014
Location: inside the emulator
Posts: 377
The GCC patches are promising, I must say. I haven't looked at VBCC lately, but any improvement is nice.
Given how clean the 68k architecture is, it's strange IMHO that compilers should have any problem generating good code. A compiler that uses the (mostly legacy) quirky instructions of x86 instead of a "RISC" subset having problems - sure. But 68k? The only real "problem" is the split D and A register sets, and that's not too hard to work with...
--
I have never been good at compression/decompression code; however, the LZSS decompression code in the logo routine(s) feels odd. "Feels" is the right word, as I haven't really analysed it.
Megol is offline  
Old 28 May 2017, 00:07   #134
matthey
Banned
 
Join Date: Jan 2010
Location: Kansas
Posts: 1,284
Vince added RISC-V ISAs to the comparison.

http://www.deater.net/weave/vmwprod/asm/ll/

He said there was room for improvement for RISC-V code. So far the results come up a little short of the code density hype and claims although RV64C appears to have pretty good code density for a 64 bit CPU. RV64C is beating arm64 (ARMv8 AArch64). All other RISC-V variants are unimpressive in code density and make me wonder why they even bothered.

Quote:
Originally Posted by Megol View Post
The GCC patches are promising, I must say. I haven't looked at VBCC lately, but any improvement is nice.
Given how clean the 68k architecture is, it's strange IMHO that compilers should have any problem generating good code. A compiler that uses the (mostly legacy) quirky instructions of x86 instead of a "RISC" subset having problems - sure. But 68k? The only real "problem" is the split D and A register sets, and that's not too hard to work with...
There are a few things which make the 68k more challenging for compilers to generate good quality code, like the A/D register split, but they are no more difficult than x86 issues. GCC will even change alignment when a particular CPU architecture is specified for x86/x86_64. The 68k, on the other hand, has only barely-maintained old support, passed down and translated from earlier compiler versions. The 68k was abandoned at about the same time compilers were starting to get good. Popular processors get all the support, and there hasn't been a popular 68k design in decades. The innate advantage of a CPU with good code density is easily sabotaged by poor compiler support.
matthey is offline  
Old 28 May 2017, 01:25   #135
Thorham
Computer Nerd
 
 
Join Date: Sep 2007
Location: Rotterdam/Netherlands
Age: 48
Posts: 3,840
Quote:
Originally Posted by matthey View Post
The innate advantage of a CPU with good code density is easily sabotaged by poor compiler support.
Just write it in asm then
Thorham is offline  
Old 29 May 2017, 12:54   #136
Megol
Registered User
 
 
Join Date: May 2014
Location: inside the emulator
Posts: 377
Quote:
Originally Posted by Thorham View Post
Just write it in asm then
Of course :P

However, even on x86 with all its support and "optimization" (quotes for a reason...), the generated code quality is generally lacking, especially for size-optimized code. Enable vectorization and increase the optimization level -> useless vectorization of scalar integer code that is extremely bloated and runs slower than the most naive integer code due to setup overheads.

I'm beginning to think I'm getting old, as I'm starting to long for less complex compilers...
Megol is offline  
Old 28 July 2017, 11:06   #137
Photon
Moderator
 
 
Join Date: Nov 2004
Location: Eksjö / Sweden
Posts: 5,655
Intel will win because it still supports special-purpose 8-bit CPU instructions that do more than other 8-bit CPUs'. 16-bit is fluffier, with the exception of mul/div, and 32-bit RISC is the worst. Even with Thumb they can't quite get there. There are 64-bit etc. CPUs too, ofc ;-)
Photon is offline  
Old 18 August 2017, 12:23   #138
NorthWay
Registered User
 
Join Date: May 2013
Location: Grimstad / Norway
Posts: 852
Back to that peculiar obsession with the size of the LZSS decompression loop. I sat down and tinkered with how I would do it natively if size was all I cared about and I could arrange data as I like. 34 bytes:
Code:
get_bits
	move.b	(a3)+,d5
get_bit
	roxr.b	#1,d5
	beq.s	get_bits

	bcs.s	string

literal
	move.b	(a3)+,(a2)+
	bra.s	check

string
	move.w	(a3)+,d0	; 4(negative) + 12(negative)
	move.w	d0,d1
	or.w	d2,d0		; $F000
	sub.w	d3,d1		; 2<<12

	lea	(a2,d0.w),a0
copyloop
	move.b	(a0)+,(a2)+
	add.w	d4,d1		; 1<<12
	bcc.s	copyloop

check
	cmp.l	a2,a1	; swap order? might need to adjust a1 by 1
	bcc.s	get_bit
If you are willing to reduce the max copy string length from 18 to 16 bytes then you can save 2 more bytes ("sub.w d3,d1").
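For comparison, here is a plain C model of the kind of token stream a tiny LZSS loop like the one above consumes: one control bit per token (0 = literal byte, 1 = match), with each match packing a 4-bit length (3-18 bytes) and a 12-bit backward distance into one word. This is my own simplified encoding for illustration, not NorthWay's exact negative-offset trick:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Decode LZSS: control bits are read LSB-first from control bytes;
   a match word is big-endian, top nibble = length-3, low 12 bits =
   backward distance (must be >= 1). */
static size_t lzss_decode(const uint8_t *in, uint8_t *out, size_t out_len)
{
    size_t o = 0;
    uint8_t ctrl = 0;
    int bits = 0;
    while (o < out_len) {
        if (bits == 0) { ctrl = *in++; bits = 8; }   /* refill, cf. get_bits */
        int match = ctrl & 1;
        ctrl >>= 1;
        bits--;
        if (!match) {
            out[o++] = *in++;                        /* literal */
        } else {
            uint16_t w = (uint16_t)((in[0] << 8) | in[1]);
            in += 2;
            size_t len  = (w >> 12) + 3;             /* 3..18 bytes */
            size_t dist = w & 0x0FFF;                /* distance back */
            for (size_t i = 0; i < len && o < out_len; i++, o++)
                out[o] = out[o - dist];              /* overlapping copy */
        }
    }
    return o;
}
```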
NorthWay is offline  
Old 19 August 2017, 23:33   #139
ross
Defendit numerus
 
 
Join Date: Mar 2017
Location: Crossing the Rubicon
Age: 54
Posts: 4,491
Quote:
Originally Posted by NorthWay View Post
Back to that peculiar obsession with the size of the LZSS decompression loop.
--- cut ---
If you are willing to reduce the max copy string length from 18 to 16 bytes then you can save 2 more bytes ("sub.w d3,d1").
Hi NorthWay, some remarks.
To be fair, D2, D3 and A1 need to be initialized, and that consumes code space (with an escape token, A1 can be omitted).

And this code does not work on 68000 machines

[EDIT: there is even a subtle initialization bug..]

Regards,
ross

Last edited by ross; 19 August 2017 at 23:49. Reason: []
ross is offline  
Old 20 August 2017, 02:55   #140
NorthWay
Registered User
 
Join Date: May 2013
Location: Grimstad / Norway
Posts: 852
I know, but I said it was about the loop itself. That was the only thing that was counted for some reason, and so I cut out all init etc.
If I cared about a more realistic total code size then I would arrange it differently. Chances are I would care about speed too.

And if you are willing to drop in-buffer overwrite decompression, then you can separate the literals from the control/length+distance bits and read out the control bits ("get_bits") 16 at a time.

Last edited by NorthWay; 20 August 2017 at 03:03.
NorthWay is offline  
 

