07 February 2017, 17:42 | #61 | ||||||
son of 68k
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,323
|
Quote:
So you're not at risk of these 5 bytes if you don't ask for a bitfield that's larger than 24 bits. Quote:
It was there only as an explanation, to show why 16-bit memory accesses must not be used (because then a much shorter version exists). Quote:
I explained this now, so why do you insist ? For the sake of contradicting me maybe ? And no, your code cannot be faster. Word-sized movem completely kills the performance of your code : my movem.l costs approx. the same as 4 normal moves. Your movem.w costs as much as 8 normal moves, but you removed far fewer than 8 instructions to compensate for this. The performance hit is smaller on the 68000 but nevertheless big enough. Also, your code is not good for 020/030 execution because there are too many instructions between the fastmem read and the first chipmem write. Quote:
Quote:
Now your version isn't faster on 68060 because chipmem is just too slow ; however it has a performance hit on 020/030 so there it's better this way : Code:
    move.w #1999,d0
.loop
    movem.l (a0)+,d1-d4
    move.l d1,d5
    swap d1
    move.w d3,d1
    move.l d1,(a2)+
    swap d3
    move.w d3,d5
    move.l d5,(a1)+
    move.l d2,d5
    swap d2
    move.w d4,d2
    move.l d2,(a4)+
    swap d4
    move.w d4,d5
    move.l d5,(a3)+
    dbf d0,.loop
Code:
    move.w #1999,d0
.loop
    movem.l (a0)+,d1-d4
    swap d3
    exg.w d1,d3
    move.l d1,(a1)+
    swap d3
    move.l d3,(a2)+
    swap d4
    exg.w d2,d4
    move.l d2,(a3)+
    swap d4
    move.l d4,(a4)+
    dbf d0,.loop
    rts
Quote:
Motorola wasn't theoretical. When they designed the 68000 they profiled real programs. Intel did not; they just blindly extended their 8080 to make it 16-bit and produced the start of a horror story.
Raw and bulky instructions are on the x86 side or on other cpus. 68k is fine.
17 ticks, depends on cpu implementation.
8 bytes for 68000, 4 bytes for 68020. I think the situation is clear.
Replace "BE" by "LE" in the above and it might become true. Otherwise it's just ad nauseam nonsense.
Who does ? Endianness is only different in the memory interface. Inside the cpu, everything is exactly the same.
There I might eventually agree...
Last edited by meynaf; 07 February 2017 at 17:45. Reason: oops
||||||
07 February 2017, 17:43 | #62 | ||
Banned
Join Date: Jan 2010
Location: Kansas
Posts: 1,284
|
Quote:
Quote:
Code:
ADD.L #<d32>,Rn ; 16 cycles (PMD says 14 cycles?)
MOVE.L #<d32>,Rn ; 12 cycles
-----------------------
diff is 16 - 12 = 4 cycles
Code:
ADD.L Rm,Rn ; 8 cycles
MOVE.L Rm,Rn ; 4 cycles
-----------------------
diff is 8 - 4 = 4 cycles
Code:
ADD.L (An),Dn ; 14 cycles
MOVE.L (An),Dn ; 12 cycles
-----------------------
diff is 14 - 12 = 2 cycles
||
07 February 2017, 18:02 | #63 | ||||
Registered User
Join Date: Jun 2015
Location: Germany
Posts: 1,918
|
Quote:
Quote:
Quote:
Quote:
|
||||
07 February 2017, 18:32 | #64 | |||||
son of 68k
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,323
|
Quote:
Quote:
Setup your own example and you will decide. Quote:
Quote:
To benefit from them we would need to schedule the code like this : Code:
    move.l (a0)+,d1
    move.l d1,d5
    move.l (a0)+,d2
    swap d1
    move.l (a0)+,d3
    move.w d3,d1
    move.l (a0)+,d4
Quote:
|
|||||
07 February 2017, 18:39 | #65 |
Registered User
Join Date: Jun 2015
Location: Germany
Posts: 1,918
|
OK, you obviously want to play Calvinball. I don't. Regarding the wrong immediate, I noticed immediately after posting it but didn't care to correct it because it is irrelevant (still within signed word-size integer). I was all lol when I saw you clutch this straw.
|
07 February 2017, 18:50 | #66 | |
son of 68k
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,323
|
Nope. I just wasn't clear enough at the start. I have explained it three times now and you still didn't get it.
Read again. I didn't change any rule that was previously explicitly written. Quote:
It wasn't an important remark, just a "by the way". But you obviously take things as it suits you.
|
07 February 2017, 19:12 | #67 | |||
Computer Nerd
Join Date: Sep 2007
Location: Rotterdam/Netherlands
Age: 47
Posts: 3,762
|
To matthey:
Thanks for explaining Quote:
Quote:
Bitfield instructions can be faster than doing it by hand. They're decent additions. Quote:
CLR is only needed for memory clears, and sadly it's slow for that. For the rest, clr isn't needed at all: Code:
clr.b dx -> eor.b dx,dx
clr.w dx -> eor.w dx,dx
clr.l dx -> eor.l dx,dx
clr ax -> sub.l ax,ax
clra ax -> sub.l ax,ax
There are more such redundancies; tst dx is one. Just do move dx,dx. It can also be disassembled properly.
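To make the substitutions above easy to check, here is a minimal Python sketch that models only the register contents (not flags, cycle counts or encoding sizes). The function names `clr`, `eor_self` and `sub_self` are made up for this illustration, and registers are assumed to be 32 bits wide.

```python
SIZE_MASK = {"b": 0xFF, "w": 0xFFFF, "l": 0xFFFFFFFF}

def clr(reg, size):
    """Model of clr.<size> dx: zero the low part, keep the upper bits."""
    return reg & ~SIZE_MASK[size] & 0xFFFFFFFF

def eor_self(reg, size):
    """Model of eor.<size> dx,dx: XOR the low part with itself."""
    mask = SIZE_MASK[size]
    low = (reg ^ reg) & mask          # always 0
    return (reg & ~mask & 0xFFFFFFFF) | low

def sub_self(reg):
    """Model of sub.l ax,ax: a register minus itself."""
    return (reg - reg) & 0xFFFFFFFF
```

For example, `clr(0x12345678, "w")` and `eor_self(0x12345678, "w")` both leave the upper word untouched and clear the lower one.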
|||
07 February 2017, 19:40 | #68 | |
Banned
Join Date: Jan 2010
Location: Kansas
Posts: 1,284
|
Quote:
|
|
07 February 2017, 19:47 | #69 | ||
son of 68k
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,323
|
Quote:
Quote:
Therefore, the more instructions you put between the read and the first chipmem write, the worse it becomes. |
||
07 February 2017, 19:52 | #70 | |
Registered User
Join Date: May 2014
Location: inside the emulator
Posts: 377
|
Quote:
|
|
07 February 2017, 20:01 | #71 | ||
Banned
Join Date: Jan 2010
Location: Kansas
Posts: 1,284
|
Quote:
Quote:
|
||
07 February 2017, 20:16 | #72 | ||||||
Registered User
Join Date: May 2014
Location: inside the emulator
Posts: 377
|
Quote:
Quote:
Quote:
Quote:
Quote:
Quote:
|
||||||
07 February 2017, 20:20 | #73 |
Registered User
Join Date: May 2014
Location: inside the emulator
Posts: 377
|
Hmm indeed. So now I have to know how Atari ST graphics work and how Amiga graphics work. It isn't a description of the algorithm.
While I do know both of those (I wrote a blitter-based converter doing the same task after a demo was released using the blitter like that), most people wouldn't...
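For readers who don't know both formats, here is a rough Python sketch of what the conversion seems to be. It assumes the movem.l loop earlier in the thread reads Atari ST style data, where the words of four bitplanes are interleaved (plane 0, 1, 2, 3, then plane 0 again), and writes four separate Amiga style planes. That reading is an assumption based on the remark above and on the loop count (2000 iterations of 16 bytes is 32000 bytes); the name `deinterleave_st` is made up for this example.

```python
import struct

def deinterleave_st(data, planes=4):
    """Split word-interleaved bitplane data into separate planes.

    Assumed input layout: p0 p1 p2 p3 p0 p1 p2 p3 ... as 16-bit
    big-endian words. Returns one bytes object per plane.
    """
    if len(data) % (2 * planes):
        raise ValueError("length must be a multiple of planes * 2 bytes")
    words = struct.unpack(">%dH" % (len(data) // 2), data)
    out = [bytearray() for _ in range(planes)]
    for index, word in enumerate(words):
        # Word number 'index' belongs to plane 'index % planes'.
        out[index % planes] += struct.pack(">H", word)
    return [bytes(plane) for plane in out]
```

Each 16 bytes of input yields one longword per plane, which matches the four `move.l` writes per iteration in the assembly loop.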
07 February 2017, 20:28 | #74 | |
son of 68k
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,323
|
Quote:
|
|
07 February 2017, 21:59 | #75 | ||||||
Banned
Join Date: Jan 2010
Location: Kansas
Posts: 1,284
|
Most humans would think that writing 0x0a0b0c0d to address 0 would give:
[0] = 0x0a
[1] = 0x0b
[2] = 0x0c
[3] = 0x0d
Most humans more naturally read left to right, so 0x0a is first, 0x0b is second, 0x0c is third and 0x0d is fourth. This is learned though. Maybe Little Endian would be more natural for Hebrew and Arabic writers until they looked at a core dump. Quote:
Quote:
https://en.wikipedia.org/wiki/Endianness Quote:
Quote:
Quote:
He chose LE for RISC-V because of its popularity while observing that "string manipulation" could have advantages with BE. Maybe he would have some examples for you but I wouldn't accuse him of being biased toward BE. Quote:
http://eab.abime.net/showthread.php?t=85525&page=3 |
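The byte layout described at the top of this post can be checked with a few lines of Python. This is only an illustration: `memory_layout` and `compare_as_words` are names invented here, and the string comparison is my own example of the kind of "string manipulation" advantage mentioned above, not one taken from the quoted sources.

```python
def memory_layout(value, byteorder):
    """Bytes of a 32-bit value in increasing address order."""
    return list(value.to_bytes(4, byteorder))

def compare_as_words(a, b, byteorder):
    """Compare two equal-length byte strings by loading each as one integer."""
    return int.from_bytes(a, byteorder) < int.from_bytes(b, byteorder)
```

With "big" the layout of 0x0a0b0c0d is 0a 0b 0c 0d, with "little" it is 0d 0c 0b 0a. Comparing b"ab" with b"ba" as whole words gives the same answer as a byte-by-byte comparison only in the big-endian case.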
||||||
08 February 2017, 10:08 | #76 |
son of 68k
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,323
|
Little endian is an electronics engineers' shortcut taken in the 70's if not before, i.e. old archaic legacy.
And like many electronics engineers' shortcuts, its benefits were short-lived. It made sense in e.g. the 6502. If you did something like LDA $aabb,Y it could start adding Y to $bb before having read $aa, so $bb had to be stored first. Had it been able to do 16-bit accesses though, there would have been zero benefit in LE. Nowadays things are of course no longer done this way. A 32-bit add might even be split into several 16-bit adds done in parallel, with the top result computed for both carry and no carry, the end result being chosen when said carry is known.
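Both ideas can be sketched in Python. This is a simplified model under my own assumptions, not a description of any particular chip: `lda_absolute_y_address` shows why having the low byte first helps a byte-at-a-time CPU, and `carry_select_add32` shows the "compute both top halves, pick one later" trick. Both names are made up for the example.

```python
MASK16 = 0xFFFF
MASK32 = 0xFFFFFFFF

def lda_absolute_y_address(operand_bytes, y):
    """Effective address of LDA $hhll,Y from the operand bytes in memory order."""
    low, high = operand_bytes            # little-endian: low byte arrives first
    low_sum = low + y                    # can start before 'high' is fetched
    new_high = (high + (low_sum >> 8)) & 0xFF
    return (new_high << 8) | (low_sum & 0xFF)

def carry_select_add32(a, b):
    """32-bit add built from 16-bit adds, carry-select style."""
    a &= MASK32
    b &= MASK32
    low_sum = (a & MASK16) + (b & MASK16)
    carry = low_sum >> 16
    # Both candidates depend only on the inputs, so hardware could
    # compute them at the same time as the low half.
    high_if_no_carry = ((a >> 16) + (b >> 16)) & MASK16
    high_if_carry = ((a >> 16) + (b >> 16) + 1) & MASK16
    high = high_if_carry if carry else high_if_no_carry
    return (high << 16) | (low_sum & MASK16)
```

In the second function nothing about byte order matters any more, which is the point being made: once the adder works on whole words, storing the low byte first buys nothing.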
10 February 2017, 03:10 | #77 |
Banned
Join Date: Jan 2010
Location: Kansas
Posts: 1,284
|
I talked about the amateur (somehow with a PhD in Electrical and Computer Engineering) posting his baby steps and grossly misrepresenting the code density of CPU architectures.
http://www.deater.net/weave/vmwprod/asm/ll/ The following is his 68k code for LZSS decompression. Code:
| offsets into the results returned by the uname syscall
.equ U_SYSNAME,0
.equ U_NODENAME,65
.equ U_RELEASE,65*2
.equ U_VERSION,(65*3)
.equ U_MACHINE,(65*4)
.equ U_DOMAINNAME,65*5

| offset into the results returned by the sysinfo syscall
.equ S_TOTALRAM,16

| Sycscalls
.equ SYSCALL_EXIT, 1
.equ SYSCALL_READ, 3
.equ SYSCALL_WRITE, 4
.equ SYSCALL_OPEN, 5
.equ SYSCALL_CLOSE, 6
.equ SYSCALL_SYSINFO, 116
.equ SYSCALL_UNAME, 122

|
.equ STDIN,0
.equ STDOUT,1
.equ STDERR,2

    .globl _start
_start:
    |=========================
    | PRINT LOGO
    |=========================

| LZSS decompression algorithm implementation
| by Stephan Walter 2002, based on LZSS.C by Haruhiko Okumura 1989
| optimized some more by Vince Weaver

    move.l #out_buffer,%a6 | buffer we are printing to
    move.l %a6,%a1
    move.l #(N-F),%d2 | R
    move.l #(logo),%a3 | a3 points to logo data
    move.l #(logo_end),%a4 | a4 points to logo end
    move.l #text_buf,%a5 | r5 points to text buf

decompression_loop:
    clr.l %d5 | clear the %d5 register
    move.b %a3@+,%d5 | load a byte, increment pointer
    or.w #0xff00,%d5 | load top as a hackish 8-bit counter

test_flags:
    cmp.l %a4,%a3 | have we reached the end?
    bge done_logo | if so, exit
    lsr #1,%d5 | shift bottom bit into carry flag
    bcs discrete_char | if set, we jump to discrete char

offset_length:
    clr.l %d4
    move.b %a3@+,%d0 | load 16-bits, increment pointer
    move.b %a3@+,%d4 | do it in 2 steps because our data is little-endian :(
    lsl.l #8,%d4
    move.b %d0,%d4
    move.l %d4,%d6 | copy d4 to d6
                   | no need to mask d6, as we do it
                   | by default in output_loop
    moveq.l #P_BITS,%d0
    lsr.l %d0,%d4
    move.l #(THRESHOLD+1),%d0
    add.l %d0,%d4
    add %d4,%d1 | d1 = (d4 >> P_BITS) + THRESHOLD + 1
                | (=match_length)

output_loop:
#   andi #((POSITION_MASK<<8)+0xff),%d6 | mask it
    andi #0x3ff,%d6 | mask it
    move.b %a5@(0,%d6),%d4 | load byte from text_buf[]
    addq #1,%d6 | advance pointer in text_buf

store_byte:
    move.b %d4,%a1@+ | store a byte, increment pointer
    move.b %d4,%a5@(0,%d2) | store a byte to text_buf[r]
    add #1,%d2 | r++
    andi #(N-1),%d2 | mask r
    dbf %d1,output_loop | decrement count and loop
                        | if %d1 is zero or above
    bftst %d5,16:8 | are the top bits 0?
    bne test_flags | if not, re-load flags
    jmp decompression_loop

discrete_char:
    move.b %a3@+,%d4 | load a byte, increment pointer
    clr.l %d1 | we set d1 to zero which on m68k
              | means do the loop once
    jmp store_byte | and store it

| end of LZSS code

done_logo:
    ...
    rts

#===========================================================================
#   section .data
#===========================================================================
.data
data_begin:
ver_string: .ascii " Version \0"
compiled_string: .ascii ", Compiled \0"
one: .ascii "One \0"
processor: .ascii " Processor, \0"
ram_comma: .ascii "M RAM, \0"
bogo_total: .ascii " Bogomips Total\n\0"
default_colors: .ascii "\033[0m\n\n\0"
escape: .ascii "\033[\0"
C: .ascii "C\0"

.ifdef FAKE_PROC
cpuinfo: .ascii "proc/cpu.m68k\0"
.else
cpuinfo: .ascii "/proc/cpuinfo\0"
.endif

.include "logo.lzss_new"

#============================================================================
#   section .bss
#============================================================================
.bss
bss_begin:
.lcomm uname_info,(65*6)
.lcomm sysinfo_buff,(64)
.lcomm ascii_buffer,10
.lcomm text_buf, (N+F-1)
#.lcomm text_buf, 4096
.lcomm disk_buffer,4096 | we cheat!!!!
.lcomm out_buffer,16384
Code:
_start:
    movea.l #(lab_1810),a6       ; 0 : 2c7c 0000 1810
    movea.l a6,a1                ; 6 : 224e
    move.l #$3c0,d2              ; 8 : 243c 0000 03c0
    movea.l #(lab_db),a3         ; e : 267c 0000 00db
    movea.l #(lab_1f6),a4        ; 14 : 287c 0000 01f6
    movea.l #(lab_3d0),a5        ; 1a : 2a7c 0000 03d0
lzss_begin:
decompression_loop:
    clr.l d5                     ; 20 : 4285
    move.b (a3)+,d5              ; 22 : 1a1b
    ori.w #$ff00,d5              ; 24 : 0045 ff00
test_flags:
    cmpa.l a4,a3                 ; 28 : b7cc
    bge done_logo                ; 2a : 6c00 0050
    lsr.w #1,d5                  ; 2e : e24d
    bcs discrete_char            ; 30 : 6500 0042
offset_length:
    clr.l d4                     ; 34 : 4284
    move.b (a3)+,d0              ; 36 : 101b
    move.b (a3)+,d4              ; 38 : 181b
    lsl.l #8,d4                  ; 3a : e18c
    move.b d0,d4                 ; 3c : 1800
    move.l d4,d6                 ; 3e : 2c04
    moveq #$a,d0                 ; 40 : 700a
    lsr.l d0,d4                  ; 42 : e0ac
    move.l #3,d0                 ; 44 : 203c 0000 0003
    add.l d0,d4                  ; 4a : d880
    add.w d4,d1                  ; 4c : d244
output_loop:
    andi.w #$3ff,d6              ; 4e : 0246 03ff
    move.b (a5,d6.l),d4          ; 52 : 1835 6800
    addq.w #1,d6                 ; 56 : 5246
store_byte:
    move.b d4,(a1)+              ; 58 : 12c4
    move.b d4,(a5,d2.l)          ; 5a : 1b84 2800
    addq.w #1,d2                 ; 5e : 5242
    andi.w #$3ff,d2              ; 60 : 0242 03ff
    dbra d1,output_loop          ; 64 : 51c9 ffe8
    bftst d5{$10:8}              ; 68 : e8c5 0408
    bne test_flags               ; 6c : 6600 ffba
    jmp (decompression_loop,pc)  ; 70 : 4efa ffae
discrete_char:
    move.b (a3)+,d4              ; 74 : 181b
    clr.l d1                     ; 76 : 4281
    jmp (store_byte,pc)          ; 78 : 4efa ffde
lzss_end:
done_logo:
Perhaps this attempt deserves a D- for trying but I have to go with an F for doing education and research a disservice by posting his meaningless results (FUD) like they mean something. He was also missing files needed to assemble and reproduce his results (I am not able to execute the program for testing or obtain the total size which requires Linux supposedly). A basic quick cleanup would give us something like the following. Code:
_start:
    movea.l #(lab_1810),a6
    movea.l a6,a1
    move.l #$3c0,d2
    movea.l #(lab_db),a3
    movea.l #(lab_1f6),a4
    movea.l #(lab_3d0),a5
    moveq #10,d7
    move.w #$3ff,d3
    move.l #$ff00,d0
lzss_begin:
decompression_loop:
    move.l d0,d5
    move.b (a3)+,d5
test_flags:
    cmpa.l a4,a3
    bge done_logo
    lsr.w #1,d5
    bcs discrete_char
    clr.l d4 ; necessary?
    move.w (a3)+,d4
    ror.w #8,d4 ; LE->BE
    move.l d4,d6
    lsr.l d7,d4
    addq.l #3,d4
    add.w d4,d1
output_loop:
    and.w d3,d6
    move.b (a5,d6.l),d4
    addq.w #1,d6
store_byte:
    move.b d4,(a1)+
    move.b d4,(a5,d2.l)
    addq.w #1,d2
    and.w d3,d2
    dbra d1,output_loop
    bftst d5{$10:8}
    bne test_flags
    bra decompression_loop
discrete_char:
    move.b (a3)+,d4
    clr.l d1
    bra store_byte
lzss_end:
done_logo:
Last edited by matthey; 11 March 2017 at 22:39.
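For anyone who wants to follow what the loop does without reading 68k, here is a rough Python model of the decompressor as I read the listings above. The constants are inferred from the disassembly ($3ff mask, $3c0 start position, shift by 10, add 3), so N = 1024, F = 64, P_BITS = 10 and THRESHOLD = 2 are assumptions, as is the zero-filled ring buffer (it lives in .bss). The name `lzss_decompress` is made up, and this is a sketch for understanding, not a tested replacement for the original program.

```python
N = 1024        # ring buffer size; the listing masks positions with 0x3ff
F = 64          # N - F = 0x3c0 is the initial write position
P_BITS = 10     # number of position bits in a 16-bit match token
THRESHOLD = 2   # the listing adds THRESHOLD + 1 = 3 to the stored length

def lzss_decompress(data):
    text_buf = bytearray(N)   # assumption: starts zeroed, like .bss
    r = N - F                 # write position in the ring buffer
    out = bytearray()
    pos = 0
    while pos < len(data):
        flags = data[pos]     # 8 flag bits, consumed lowest bit first
        pos += 1
        for _ in range(8):
            if pos >= len(data):          # end of input reached
                return bytes(out)
            is_literal = flags & 1
            flags >>= 1
            if is_literal:
                run = [data[pos]]
                pos += 1
            else:
                if pos + 1 >= len(data):
                    raise ValueError("truncated match token")
                token = data[pos] | (data[pos + 1] << 8)   # little-endian
                pos += 2
                src = token & (N - 1)
                length = (token >> P_BITS) + THRESHOLD + 1
                run = None
            if run is not None:
                for byte in run:
                    out.append(byte)
                    text_buf[r] = byte
                    r = (r + 1) & (N - 1)
            else:
                # Copy byte by byte so a match may overlap its own output.
                for _ in range(length):
                    byte = text_buf[src]
                    src = (src + 1) & (N - 1)
                    out.append(byte)
                    text_buf[r] = byte
                    r = (r + 1) & (N - 1)
    return bytes(out)
```

As a quick check, the stream `07 61 62 63 c0 03` (three literals "abc", then a match at position 0x3c0 with the minimum length of 3) expands to "abcabc".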
10 February 2017, 09:22 | #78 |
son of 68k
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,323
|
Oddly enough, the source code for the winner doesn't appear to be available on his page. I'd like to see that with the encoding.
Indeed, using inefficient code isn't the proper way to compare code densities, but this is the problem with people coming from other architectures. They didn't have the relevant tools before, and they code as they always have, i.e. they don't use 68k's good features and merely translate the code. Anyway, if he had used a decent assembler, peephole optimization would already have made the size fall by a fair amount...
Note : you can put the clr.l d4 outside of the loop and get down to 62. So the 68k should appear in second place.
EDIT: Down to 60. You can replace : Code:
    move.w (a3)+,d4
    ror.w #8,d4 ; LE->BE
    move.l d4,d6
    bfextu d4{0:22},d4 ; same as lsr.l #10,d0
with : Code:
    move.b (a3),d4
    move.w (a3)+,d6
    ror.w #8,d6
    lsr.l #2,d4
Last edited by meynaf; 10 February 2017 at 09:29. Reason: -2 bytes
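The bit-field and rotate tricks used in these snippets can be checked with a small Python model. It assumes the usual 68020 convention that bit-field offsets count from the most significant bit and that a field in a data register wraps around; `bfextu_reg` and `ror16_by8` are names invented for this sketch.

```python
MASK32 = 0xFFFFFFFF

def bfextu_reg(value, offset, width):
    """Unsigned bit-field extract from a 32-bit register value.

    offset 0 is the most significant bit (bit 31); width is 1..32.
    """
    if not 1 <= width <= 32:
        raise ValueError("width must be 1..32")
    value &= MASK32
    offset %= 32
    # Rotate left so the field starts at bit 31, then keep the top 'width' bits.
    rotated = ((value << offset) | (value >> (32 - offset))) & MASK32
    return rotated >> (32 - width)

def ror16_by8(word):
    """ror.w #8: swap the two bytes of a 16-bit word (LE <-> BE)."""
    word &= 0xFFFF
    return ((word >> 8) | (word << 8)) & 0xFFFF
```

Under that model, extracting 22 bits at offset 0 gives the same value as a logical shift right by 10, and a field of 8 bits at offset 16 (as in the `bftst d5{$10:8}` test) is bits 15 to 8 of the register.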
10 February 2017, 10:19 | #79 |
Registered User
Join Date: Jun 2015
Location: Germany
Posts: 1,918
|
|
10 February 2017, 10:22 | #80 |
son of 68k
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,323
|
You are wrong again. The bit-field instruction is slower than a read (especially this one, which goes in dcache). And please stop trolling - rules didn't change.
|