06 February 2017, 17:10 | #21 |
Computer Nerd
Join Date: Sep 2007
Location: Rotterdam/Netherlands
Age: 48
Posts: 3,847
|
|
06 February 2017, 17:16 | #22 |
Natteravn
Join Date: Nov 2009
Location: Herford / Germany
Posts: 2,546
|
Ok, I'm not a PPC-expert, but to give you something I converted your first example for the PPC. It needs fewer instructions, but some more bytes because of RISC (4 bytes for each instruction).
Code:
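# r3 source
# r4-r7 dest (-4, because stwu pre-increments the pointer)
        li      r12,2000
        mtctr   r12                 # loop counter: 2000 iterations
loop:
        lmw     r28,0(r3)           # load r28..r31 (16 bytes) from source
        rlwinm  r11,r28,16,0,15     # low half of r28 -> high half of r11
        rlwimi  r11,r30,0,16,31     # low half of r30 -> low half of r11
        stwu    r11,4(r5)
        rlwinm  r11,r28,0,0,15      # keep high half of r28
        rlwimi  r11,r30,16,16,31    # high half of r30 -> low half of r11
        stwu    r11,4(r4)
        rlwinm  r11,r29,16,0,15     # same pattern for r29/r31
        rlwimi  r11,r31,0,16,31
        stwu    r11,4(r7)
        rlwinm  r11,r29,0,0,15
        rlwimi  r11,r31,16,16,31
        stwu    r11,4(r6)
        addi    r3,r3,16            # advance source by 16 bytes
        bdnz    loop                # decrement CTR, loop while non-zero
        blr
Last edited by phx; 06 February 2017 at 17:18. Reason: addi was missing |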
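For readers who don't read PPC assembly, the data movement of one loop iteration can be sketched in C roughly as below. This is only an illustration derived from the listing above: the function name `deinterleave16` and the parameter names `dst4`..`dst7` (chosen to mirror r4..r7) are made up here, and the sketch works on 32-bit register values rather than reproducing the pre-increment addressing.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical helper: for each group of four 32-bit words (the values lmw
   would place in r28..r31), regroup the 16-bit halves into four output
   streams, one 32-bit word per stream per group. */
void deinterleave16(const uint32_t *src,
                    uint32_t *dst4, uint32_t *dst5,
                    uint32_t *dst6, uint32_t *dst7,
                    size_t groups)
{
    for (size_t i = 0; i < groups; i++) {
        uint32_t w28 = src[0], w29 = src[1], w30 = src[2], w31 = src[3];

        dst5[i] = (w28 << 16)         | (w30 & 0xFFFFu); /* low halves of r28, r30  */
        dst4[i] = (w28 & 0xFFFF0000u) | (w30 >> 16);     /* high halves of r28, r30 */
        dst7[i] = (w29 << 16)         | (w31 & 0xFFFFu); /* low halves of r29, r31  */
        dst6[i] = (w29 & 0xFFFF0000u) | (w31 >> 16);     /* high halves of r29, r31 */

        src += 4;                                        /* 16 bytes per group */
    }
}
```

Each shift/mask pair corresponds to one rlwinm/rlwimi pair in the PPC version; a C compiler is of course free to pick different instructions.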
06 February 2017, 17:21 | #23 | |||
son of 68k
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,355
|
Quote:
Quote:
Quote:
Btw. Some recent ARM models apparently have a division instruction. |
|||
06 February 2017, 17:32 | #24 |
son of 68k
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,355
|
You must be a PPC expert to be able to write this. PPC asm isn't exactly very easy.
|
06 February 2017, 17:33 | #25 |
Registered User
Join Date: Nov 2012
Location: Willich/Germany
Posts: 233
|
|
06 February 2017, 17:35 | #26 | |||
Computer Nerd
Join Date: Sep 2007
Location: Rotterdam/Netherlands
Age: 48
Posts: 3,847
|
Quote:
Quote:
Too easy.
Quote:
Code:
        moveq   #-1,d2
.loop   addq.l  #1,d2
        sub.l   d0,d1
        bgt     .loop
|
|||
06 February 2017, 17:37 | #27 |
Computer Nerd
Join Date: Sep 2007
Location: Rotterdam/Netherlands
Age: 48
Posts: 3,847
|
|
06 February 2017, 17:48 | #28 |
son of 68k
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,355
|
|
06 February 2017, 18:24 | #29 |
Registered User
Join Date: Mar 2016
Location: Ozherele
Posts: 229
|
@Thorham Thank you for your code. I missed that it is better to start with the high byte.
Twenty lines of code are easy to translate directly, but we have another goal: to show ISA advantages. That requires taking the idea behind the code, not the code itself. BTW I am a bit disappointed by your joke about division; I wrote about fast division.
@meynaf
Code:
        MOVS    R12,0,2         ;clears R12 and CY
        RSB     R1,R1,0         ;NEG
repeat 32
        ADCS    R0,R0,R0
        ADCS    R12,R1,R12,lsl 1
        SUBCC   R12,R12,R1
end repeat
        ADC     R0,R0,R0
Last edited by litwr; 06 February 2017 at 18:30. |
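The ARM listing above is an unrolled shift-and-subtract division that produces one quotient bit per step. A minimal C sketch of the same idea follows, assuming unsigned 32-bit operands and a non-zero divisor. The function name `udiv32_restoring` is hypothetical, and the sketch compares before subtracting instead of subtracting and then restoring, which gives the same result.

```c
#include <stdint.h>

/* Hypothetical helper: restoring-style shift-and-subtract division.
   Assumes divisor != 0; like the ARM snippet, it has no overflow or
   divide-by-zero detection. */
uint32_t udiv32_restoring(uint32_t dividend, uint32_t divisor,
                          uint32_t *remainder)
{
    uint64_t rem = 0;   /* partial remainder (R12's role); 64-bit so the shift cannot overflow */
    uint32_t quot = 0;  /* quotient, built MSB first */

    for (int i = 31; i >= 0; i--) {
        rem = (rem << 1) | ((dividend >> i) & 1u);  /* bring down the next dividend bit */
        quot <<= 1;
        if (rem >= divisor) {                       /* trial subtraction would not borrow */
            rem -= divisor;
            quot |= 1u;
        }
    }
    if (remainder)
        *remainder = (uint32_t)rem;
    return quot;
}
```

For example, `udiv32_restoring(100, 7, &r)` returns 14 and leaves 2 in `r`.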
06 February 2017, 18:38 | #30 | ||
Natteravn
Join Date: Nov 2009
Location: Herford / Germany
Posts: 2,546
|
Quote:
Quote:
|
||
06 February 2017, 18:52 | #31 | |
son of 68k
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,355
|
Quote:
More seriously, you can always find examples like this, but add a fused branch to the 68k and it's not only of similar size: there isn't a speed difference anymore.
Btw. Your code isn't like the 68k div because it cannot detect overflow.
Btw2. Having an ARM version of my example would be nice, too, but it's probably even more time-consuming than for x86. |
|
06 February 2017, 18:58 | #32 | ||
Banned
Join Date: Jan 2010
Location: Kansas
Posts: 1,284
|
Quote:
This code is *not* good for superscalar (meynaf's code is likely faster with superscalar).
Quote:
PPC would need 71% more ICache and almost 2x the instruction fetch to keep up with the 68k if these numbers were typical (actually closer to 40% more ICache and almost 2x the instruction fetch). PPC was supposed to outclock CISC, but that increases power consumption just as cache misses from too small an ICache do.
Last edited by matthey; 06 February 2017 at 19:26. |
||
06 February 2017, 19:16 | #33 | |
son of 68k
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,355
|
Quote:
Compile it with GCC if you don't have the time to do handwritten x86 asm.
But let's try another example if you prefer.
We have an array of bits in memory. They are numbered as on the PPC: 0 is the leftmost bit (most significant bit, aka sign bit). The array can span many bytes, each one holding 8 flags (numbered 0 to 7). For example, flag number 555 ($22b) is bit 555%8 (=3, giving mask %00010000) of byte 555/8 (offset +69).
Now we want to set a specific bit (its number is in some register, like D0 or EAX, whichever you want) and get its old state in some flag (in CCR or EFLAGS). The memory region is pointed to by some other register (like A0 or ESI).
Real-life use case: check whether a newly explored cell is already known or not, while indicating it's not in the "fog of war" anymore.
Is that one short enough for you? |
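To pin down the semantics of the challenge independently of any instruction set, here is a small C sketch of the test-and-set operation described above. The function name `test_and_set_flag` is made up for illustration, and a `bool` return value stands in for the CPU flag that would carry the old state.

```c
#include <stdint.h>
#include <stddef.h>
#include <stdbool.h>

/* Hypothetical helper: set flag number n in a bit array where flag 0 is the
   most significant bit of byte 0, and return the flag's previous state. */
bool test_and_set_flag(uint8_t *flags, size_t n)
{
    uint8_t *byte = &flags[n / 8];                 /* e.g. 555 / 8 = byte 69    */
    uint8_t  mask = (uint8_t)(0x80u >> (n % 8));   /* e.g. 555 % 8 = 3 -> 0x10  */
    bool     old  = (*byte & mask) != 0;           /* old state of the flag     */

    *byte |= mask;                                 /* mark the cell as explored */
    return old;
}
```

With a zeroed array, the first call for flag 555 returns false and leaves byte 69 equal to %00010000; a second call returns true.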
|
06 February 2017, 20:09 | #34 |
Registered User
Join Date: Mar 2016
Location: Ozherele
Posts: 229
|
This looks more interesting, but where is the 680x0 code?
BTW I know very little about PPC. It has a somewhat odd history IMHO. I can't understand the idea of big endian byte order. It looks contrived to me. It slows down arithmetic on the 6809, 68008, 68000 and 68010. So what is this oddity for?
Last edited by litwr; 06 February 2017 at 20:49. |
06 February 2017, 21:11 | #35 | |
son of 68k
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,355
|
You will be real sorry when you see it, believe me. It'll show how much other cpus are crap in comparison.
But shht! It's too early for this. This time, you show it first. Then I'll show the 68k version. Ready?
Quote:
That horror which makes data totally unreadable. That completely illogical thing. Little endian has absolutely no advantage!
Thankfully internet protocol designers were smart enough to use big endian; however, too many hardware makers missed the point...
Big endian costs ZERO in hardware implementation. It does not slow down anything either. ARM even provides both options and is supposed to be simple and easy to implement.
"ABCD" is the same as $41424344. You won't see "DCBA" when reading memory afterwards and wonder what happened. |
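The "ABCD" remark can be checked with a short C sketch that assembles a 32-bit value from four bytes in each byte order. The helper names `load_be32` and `load_le32` are hypothetical; they use explicit shifts, so the result does not depend on the byte order of the machine running the code.

```c
#include <stdint.h>

/* Hypothetical helper: interpret 4 bytes as a big-endian 32-bit value
   (first byte in memory is the most significant). */
uint32_t load_be32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* Hypothetical helper: interpret the same 4 bytes as a little-endian value
   (first byte in memory is the least significant). */
uint32_t load_le32(const uint8_t *p)
{
    return ((uint32_t)p[3] << 24) | ((uint32_t)p[2] << 16) |
           ((uint32_t)p[1] << 8)  |  (uint32_t)p[0];
}
```

For the bytes 'A','B','C','D' (ASCII $41 to $44), `load_be32` yields $41424344 and `load_le32` yields $44434241, which is the "DCBA" effect mentioned above.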
|
06 February 2017, 22:05 | #36 | ||
Banned
Join Date: Jan 2010
Location: Kansas
Posts: 1,284
|
Quote:
Quote:
Big Endian
+ used more for networking standards
+ division starts at the most significant end making it faster
+ magnitude of numbers can be determined more quickly
+ natural order better for text handling
+ more human readable hex/binaries and text in memory
Little Endian
+ more common
+ addition/subtraction starts at the least significant end allowing faster carry propagation to be used
Theoretically, any 68k CPU with a 16 bit data bus could be faster when adding/subtracting a 32 bit number in memory but memory was likely already fast enough (and the 68k slow enough) that it made little if any difference. I would love to hear anywhere you read that it made a difference and how much though.
Last edited by matthey; 06 February 2017 at 22:16. |
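The carry-propagation point in the list above can be illustrated with a byte-at-a-time addition of multi-byte numbers held in memory. This is only a sketch of the access order, not a claim about cycle counts on any particular CPU; the function names `add_le` and `add_be` are made up for the example.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical helper: acc += addend, numbers stored least significant byte
   first. The carry flows in the same direction as ascending addresses.
   Returns the final carry out. */
unsigned add_le(uint8_t *acc, const uint8_t *addend, size_t len)
{
    unsigned carry = 0;
    for (size_t i = 0; i < len; i++) {
        unsigned sum = acc[i] + addend[i] + carry;
        acc[i] = (uint8_t)sum;
        carry  = sum >> 8;
    }
    return carry;
}

/* Hypothetical helper: acc += addend, numbers stored most significant byte
   first. The loop has to start at the highest address and walk backwards. */
unsigned add_be(uint8_t *acc, const uint8_t *addend, size_t len)
{
    unsigned carry = 0;
    for (size_t i = len; i-- > 0; ) {
        unsigned sum = acc[i] + addend[i] + carry;
        acc[i] = (uint8_t)sum;
        carry  = sum >> 8;
    }
    return carry;
}
```

Both functions compute the same sum; the only difference is which end of the number is fetched first, which is where a narrow data bus could in theory matter.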
||
07 February 2017, 03:48 | #37 | |
Computer Nerd
Join Date: Sep 2007
Location: Rotterdam/Netherlands
Age: 48
Posts: 3,847
|
To PHX:
Thanks for that.
The idea is transposition (each pair of letters stands for two bytes, i.e. one 16-bit half of a register):
aabb ccdd
becomes
aacc bbdd
Quote:
You mean 68060? Why exactly? If it's the instruction ordering, then wouldn't that be irrelevant because of the pipelining (slow chipmem writes)? Anyway, I optimize code for 68020s and 68030s because they need it far more than 68060s, and I don't know a whole lot about 68060 optimization (I only know something about instruction ordering). Especially when a plain A1200 is your target, 68060 optimization isn't relevant anymore. Also, it was about code size.
Last edited by Thorham; 07 February 2017 at 04:12. |
|
07 February 2017, 04:10 | #38 | |
Computer Nerd
Join Date: Sep 2007
Location: Rotterdam/Netherlands
Age: 48
Posts: 3,847
|
Quote:
Code:
;
; bit array set
;
; a0 = array
; d0 = bit number
;
        move.l  d0,d1
        lsr.l   #5,d0
        bset    d1,(a0,d0.w*4)
Code:
;
; bit array set
;
; a0 = array
; d0 = bit number
;
        move.l  d0,d1
        lsr.l   #5,d0
        not.l   d1
        bset    d1,(a0,d0.w*4)
|
|
07 February 2017, 08:11 | #39 | ||
son of 68k
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,355
|
Quote:
Code:
        move.w  (a0)+,(a1)+
        move.w  (a0)+,(a2)+
        move.w  (a0)+,(a3)+
        move.w  (a0)+,(a4)+
Number of instructions, yes. Size, no. For real life cases, even less.
Quote:
But ok, I'll give that 68k code. The 68020 has tools that are very powerful when you know how to use them (prepare to be shocked):
Code:
        bfset   (a0){d0:1}
|
||
07 February 2017, 09:06 | #40 | ||
Computer Nerd
Join Date: Sep 2007
Location: Rotterdam/Netherlands
Age: 48
Posts: 3,847
|
Quote:
Quote:
Yeah, that's pretty short. I keep forgetting those bitfield instructions for some reason. However, is it faster (the greater-than-5-byte case is 22 cycles)? |
||