#1 |
Going nowhere
Join Date: Oct 2001
Location: United Kingdom
Age: 50
Posts: 9,016
|
optimisations for 68000
I'm sure others have done the calculations, which is why I'm asking.
Only considering the 68000 here, as that's the default minimum my code runs on. When using, for instance, MOVEM.W (a0),d0-d1, are there cases where this isn't faster until you move a certain number of registers? For instance, is MOVEM.W (a0),d0-d1 faster, slower or the same as MOVE.W (a0),d0 followed by MOVE.W 2(a0),d1? I'm doing a lot of movem.w (a0),d0-d3, for instance, and just want to gauge whether this is optimal for what I'm doing or whether it's marginally quicker to do each register separately.
Also, some of my code does MOVEM.L a2-a3,$50(a6). Is this the same speed, slower or faster than moving those registers separately?
Also, for pointing to map data, which this code has to do quite frequently, I'm using a MULU. Would it be significantly quicker to use a lookup table instead? I'm guessing it would be.
I just don't want to fall into the trap where the code is neater and appears to do more with fewer resources, but the movem.X instruction is only beneficial once we move more than X registers.
Ta in advance |
#2 |
Alien Bleed
Join Date: Aug 2022
Location: UK
Posts: 4,444
|
I don't have the tables to hand, but it'll come down to bus cycles in the end, I expect. When you do individual moves you are having to fetch instructions as well as transferring the data. For movem you have the register mask to get, but after that it's just down to the microcode.
|
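As a rough illustration of the point about instruction fetches, here is the MOVEM.L a2-a3,$50(a6) case from the first post next to a two-instruction version. The cycle counts in the comments are taken from the standard 68000 instruction timing tables and assume no wait states or DMA contention, so treat them as a guide. The $54 offset for the second register is simply $50 plus 4.

```asm
; One instruction: opcode, register mask and displacement are fetched once.
        movem.l a2-a3,$50(a6)   ; 12+8n = 28 cycles for n=2 (3 reads, 4 writes)

; Two instructions: opcode and displacement are fetched twice.
        move.l  a2,$50(a6)      ; 16 cycles (2 reads, 2 writes)
        move.l  a3,$54(a6)      ; 16 cycles (2 reads, 2 writes)
                                ; total: 32 cycles (4 reads, 4 writes)
```

On these numbers the movem.l store is already ahead at two registers, by one word fetch.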
#3 |
This cat is no more
Join Date: Dec 2004
Location: FRANCE
Age: 52
Posts: 8,369
|
there's a threshold, and I'm pretty sure that 2 values isn't enough for MOVEM to be worth it.
MULU is slower than a lookup table. I'm using a macro to generate a table with 256 entries (maybe that's too many, maybe it's not enough). Code:

    MUL_TABLE:MACRO
    mul\1_table
        rept 256
        dc.w REPTN*\1
        endr
        ENDM

        MUL_TABLE 27

        ; d0 is your value
        lea mul27_table(pc),a0
        add.w d0,d0
        move.w (a0,d0.w),d0   ; d0*=27

|
#4 |
Registered User
Join Date: Feb 2017
Location: Denmark
Posts: 1,205
|
A while ago someone posted a neat online cycle counting thingy that I tend to use these days for quick stuff: https://68kcounter.grahambates.com/ I haven't noticed any errors so far (note that invalid instructions count as 0). For your MOVEM.W (a0),d0-d1 example it's a wash, but with anything more movem.w wins (the answer is different if you're OK with a0 being incremented, in which case you need to move 4 registers for movem.w to win).
For MULU vs table lookup it also shows the actual cycles for the MULU instruction. Here you want to pay special attention to the number of memory accesses for each approach. Even if the table is slightly faster in raw cycle numbers, the extra memory accesses for both code and data will sometimes make it a worse approach, assuming you don't have true fast RAM. (I'm sure you know, just pointing it out.) |
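For reference, the break-even points described above can be sketched with the figures from the standard 68000 timing tables. This assumes no wait states or DMA contention; the register choices are just the ones from the first post.

```asm
; Plain (a0): a tie at two registers, movem.w wins from three upwards.
        movem.w (a0),d0-d1      ; 12+4n = 20 cycles for n=2
        move.w  (a0),d0         ; 8 cycles
        move.w  2(a0),d1        ; 12 cycles, 20 in total

; Post-increment: each separate move is only 8 cycles, so three
; registers is a tie (24 vs 24) and movem.w only wins at four.
        movem.w (a0)+,d0-d3     ; 12+4n = 28 cycles for n=4
        move.w  (a0)+,d0        ; 8 cycles
        move.w  (a0)+,d1        ; 8 cycles
        move.w  (a0)+,d2        ; 8 cycles
        move.w  (a0)+,d3        ; 8 cycles, 32 in total
```

So the movem.w (a0),d0-d3 from the first post comes out at 28 cycles against 44 for four separate moves with displacements (8+12+12+12).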
#5 |
Registered User
Join Date: Jun 2016
Location: europe
Posts: 1,062
|
Pushing 2 registers with movem is the same speed for (ax) or -(ax), and faster vs. other modes.
Popping 3 registers with movem (takes extra 4 cycles) is the same speed for (ax) or (ax)+, and faster vs. other modes. |
#6 |
Registered User
Join Date: Jan 2019
Location: Germany
Posts: 3,307
|
movem.w and move.w are not equivalent. movem.w to registers includes a sign extension to 32 bits, move.w does not.
|
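A small sketch of the difference, assuming a hypothetical test word at the label `value` with bit 15 set:

```asm
        lea     value(pc),a0
        move.l  #$12345678,d0
        move.l  d0,d1

        move.w  (a0),d0         ; d0 = $12348001: upper word left untouched
        movem.w (a0),d1         ; d1 = $ffff8001: word sign-extended to 32 bits

        rts

value:  dc.w    $8001           ; hypothetical test value with bit 15 set
```

The low words of d0 and d1 are identical afterwards; only the upper words differ.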
#7 |
Registered User
Join Date: Jan 2008
Location: Warsaw/Poland
Age: 56
Posts: 2,039
|
And because of the 32-bit register extension (as Thomas Richter mentioned), it's better not to use movem.w unless you know exactly what you are doing. Because one coder used movem.w where movem.l was correct, one famous Amiga game was never sold.
|
#8 |
Going nowhere
Join Date: Oct 2001
Location: United Kingdom
Age: 50
Posts: 9,016
|
Thanks gents, as I suspected.
movem.w and move.w are equivalent as long as the code that uses the results only makes word-sized accesses, so we don't get caught out. |
#9 |
Registered User
Join Date: Jul 2015
Location: The Netherlands
Posts: 3,436
|
MULU/MULS vs using a lookup table is actually interesting in the Amiga environment. There is no doubt that on the 68000 using a table is faster in terms of CPU cycles (especially if you can keep the lookup table pointer in an address register across multiple lookups and/or use a PC-relative table).
However, in terms of memory accesses it's worse to use the table than the instruction. In some cases (notably setting up for future blits while the Blitter is already running), this changes things - IIRC it ends up taking almost the same amount of actual elapsed frame time to just use the MULU instead of the table. |
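To put rough numbers on the trade-off, here is the multiply-by-27 case from post #3 done both ways. The counts come from the standard 68000 timing tables with no wait states assumed, and d0.w is assumed to be in the range 0..255 so that it fits the 256-entry table.

```asm
; Multiply d0 by 27 with MULU.
        mulu.w  #27,d0          ; 38+2n cycles, n = set bits in the source:
                                ; 27 = %11011, so 38+8, plus 4 for the
                                ; immediate = 50 cycles, 2 memory reads

; Same product from the mul27_table generated by the macro in post #3.
        lea     mul27_table(pc),a0  ; 8 cycles, 2 reads (skip if a0 is kept)
        add.w   d0,d0               ; 4 cycles, 1 read
        move.w  (a0,d0.w),d0        ; 14 cycles, 3 reads
                                    ; 18 cycles, 4 reads with a0 preloaded
```

The table is clearly ahead in cycles, but it needs at least twice the memory reads, which is where contention for chip RAM can eat into the gain.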
#10 |
Registered User
Join Date: Jan 2008
Location: Warsaw/Poland
Age: 56
Posts: 2,039
|
If you never forget that after movem.w the high words of the Dx and Ax registers are always trashed with $ffff or $0000, then it can be OK. But I don't think this is good for speed optimisation on the 68000, because you waste the high words of the Dx/Ax registers, so you almost always need more registers and more instructions. For speed it is much better to use movem.l or move.l together with swap.
|
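One way to read the move.l plus swap suggestion, as a sketch only: fetch two adjacent words with a single long read and keep both in one data register. The add.w lines and the d2/d3 registers are placeholders for whatever word operation uses the values; cycle counts are from the standard timing tables with no wait states assumed.

```asm
; a0 points at two consecutive words: first, second.
        move.l  (a0),d0         ; 12 cycles: high word = first, low word = second
        add.w   d0,d2           ; use the second value (example operation)
        swap    d0              ; 4 cycles: low word = first, high word = second
        add.w   d0,d3           ; use the first value (example operation)
```

That is 16 cycles for the load and swap against 20 for movem.w (a0),d0-d1, and it ties up one data register instead of two.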
#11 | |
Registered User
Join Date: Mar 2018
Location: Hastings, New Zealand
Posts: 2,720
|
Quote:
99% of the time it is better to improve the high level program structure than do micro-optimizations like this. |
|
#12 |
son of 68k
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,355
|
|
#13 |
German Translator
Join Date: Aug 2018
Location: Drübeck / Germany
Age: 49
Posts: 197
|
If you want to know the cycle usage exactly, run the code you are comparing in the WinUAE debugger. Stop the code at the beginning and set a breakpoint, then run the code. The top line of the output shows the cycles.

    >fi nop
    Cycles: 1619 Chip, 3238 CPU. (V=105 H=24 -> V=112 H=54)
    VPOS: 112 ($070) HPOS: 054 ($036) COP: $0002388c

(Sometimes it's necessary to turn off all DMA and interrupt channels, otherwise the results could be wrong.) |
#14 |
Registered User
Join Date: Mar 2018
Location: Hastings, New Zealand
Posts: 2,720
|
|
#15 | |
Registered User
Join Date: Sep 2019
Location: Finland
Posts: 371
|
Quote:
|
|
#16 |
son of 68k
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,355
|
|
#17 |
Computer Nerd
Join Date: Sep 2007
Location: Rotterdam/Netherlands
Age: 48
Posts: 3,840
|
|