optimisations for 68000

Galahad/FLT · 08 January 2023, 18:05

I'm sure others have done the calculations, which is why I'm asking.

Only considering 68000 here, as that's the default minimum my code is running on.

When using, for instance, MOVEM.W (a0),d0-d1, are there cases where this isn't faster until you move a certain number of registers?

For instance, is MOVEM.W (a0),d0-d1 faster or slower or the same as:
MOVE.w (a0),d0 &
MOVE.W 2(a0),d1

I'm doing a lot of movem.w (a0),d0-d3 for instance, just wanting to gauge if this is optimal for what I'm doing or if it's marginally quicker doing each register separately.

Also some of my code does MOVEM.L a2-a3,$50(a6)

Is this the same speed, slower or faster than moving those registers separately?

Also, for pointing to map data, which this code has to do quite frequently, I'm using a MULU to do it.

Would it be significantly quicker to do a lookup table instead? I'm guessing it would be.

I just don't want to fall into the trap where the code is neater and appears to do more with fewer resources, but the movem.X instruction is actually only beneficial once we move more than X registers.

Ta in advance

Karlos · 08 January 2023, 18:19

I don't have the tables to hand, but it'll come down to bus cycles in the end, I expect. When you do individual moves you are having to fetch instructions as well as transferring the data. For movem you have the register mask to get, but after that it's just down to the microcode.

jotd · 08 January 2023, 18:22

there's a threshold, and I'm pretty sure that 2 values isn't enough for MOVEM to be worth it.

MULU is slower than a lookup table. I'm using a macro to generate it up to 256 (maybe it's too much maybe it's not enough)

Code:

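MUL_TABLE:	MACRO
mul\1_table
	rept	256
	dc.w	REPTN*\1	; REPTN is the current repeat count, 0..255
	endr
	ENDM

	MUL_TABLE	27		; emits mul27_table

	; d0.w is your value, must be 0..255
	lea	mul27_table(pc),a0
	add.w	d0,d0			; scale the index to a word offset
	move.w	(a0,d0.w),d0		; d0 *= 27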

paraj · 08 January 2023, 18:29

A while ago someone posted a neat online cycle counting thingy that I tend to use these days for quick stuff: https://68kcounter.grahambates.com/ I haven't noticed any errors so far (note that invalid instructions count as 0). For your MOVEM.W (a0),d0-d1 example it's a wash, but with any more registers movem.w wins (the answer is different if you're OK with a0 being incremented, in which case you need to move 4 registers for movem.w to win).

For MULU vs table lookup it also shows the actual cycles for the MULU instruction. Here you want to pay special attention to the number of memory accesses for each approach. Even if the table is slightly faster in raw cycle numbers the extra memory accesses for both code and data will sometimes make it a worse approach assuming you don't have true fast ram. (I'm sure you know, just pointing it out).
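As a rough sketch of where those MOVEM.W numbers come from: the counts below are taken from the standard 68000 instruction timing tables, written as cycles(reads/writes), and assume no wait states and no chip bus contention. The register choices are only for illustration.

```asm
; Two registers, a0 left unchanged: a wash
	movem.w	(a0),d0-d1	; 12+4n = 20(5/0) for n=2
	; versus
	move.w	(a0),d0		;  8(2/0)
	move.w	2(a0),d1	; 12(3/0)  -> 20(5/0) in total

; Four registers, a0 left unchanged: movem.w clearly wins
	movem.w	(a0),d0-d3	; 12+4n = 28(7/0) for n=4
	; versus
	move.w	(a0),d0		;  8(2/0)
	move.w	2(a0),d1	; 12(3/0)
	move.w	4(a0),d2	; 12(3/0)
	move.w	6(a0),d3	; 12(3/0)  -> 44(11/0) in total

; Four registers, a0 allowed to advance: movem.w only just wins
	movem.w	(a0)+,d0-d3	; 12+4n = 28(7/0) for n=4
	; versus
	move.w	(a0)+,d0	;  8(2/0)
	move.w	(a0)+,d1	;  8(2/0)
	move.w	(a0)+,d2	;  8(2/0)
	move.w	(a0)+,d3	;  8(2/0)  -> 32(8/0) in total
```

With post-increment and three registers both forms come to 24 cycles, which is why four is the break-even point there.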

a/b · 08 January 2023, 18:35

Pushing 2 registers with movem is the same speed for (ax) or -(ax), and faster vs. other modes.
Popping 3 registers with movem (takes extra 4 cycles) is the same speed for (ax) or (ax)+, and faster vs. other modes.
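A sketch of the store side, including the MOVEM.L a2-a3,$50(a6) case from the first post. Counts are again from the 68000 timing tables as cycles(reads/writes), assuming no wait states or bus contention; the use of a7 for the push and pop cases is just an illustrative choice.

```asm
; Store two registers with a displacement: movem.l is 4 cycles faster
	movem.l	a2-a3,$50(a6)	; 12+8n = 28(3/4) for n=2
	; versus
	move.l	a2,$50(a6)	; 16(2/2)
	move.l	a3,$54(a6)	; 16(2/2)  -> 32(4/4) in total

; Push two registers: same speed
	movem.l	a2-a3,-(a7)	; 8+8n = 24(2/4) for n=2
	; versus (a3 first, so a2 ends up at the lower address as with movem)
	move.l	a3,-(a7)	; 12(1/2)
	move.l	a2,-(a7)	; 12(1/2)  -> 24(2/4) in total

; Pop three registers: same speed
	movem.l	(a7)+,d0-d2	; 12+8n = 36(9/0) for n=3
	; versus
	move.l	(a7)+,d0	; 12(3/0)
	move.l	(a7)+,d1	; 12(3/0)
	move.l	(a7)+,d2	; 12(3/0)  -> 36(9/0) in total
```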

Thomas Richter · 08 January 2023, 18:40

movem.w and move.w are not equivalent. movem.w includes an extension to 32 bits, move.w does not.
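A minimal sketch of the difference for data registers, assuming the word at (a0) happens to be $8001. The values and register choices are made up for illustration.

```asm
	; assumed: the word at (a0) is $8001
	move.l	#$12345678,d0
	move.l	#$12345678,d1

	move.w	(a0),d0		; d0 = $12348001, upper word left alone
	movem.w	(a0),d1		; d1 = $FFFF8001, sign-extended to 32 bits

	cmp.w	d0,d1		; low words match, Z is set
	cmp.l	d0,d1		; full registers differ, Z is clear
```

For address registers there is no such difference, since move.w to An (MOVEA.W) sign-extends as well.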

Don_Adan · 08 January 2023, 20:36

And because of the 32-bit register extension (as Thomas Richter mentioned), it's better not to use movem.w unless you know exactly what you are doing. Because one coder used movem.w where movem.l was correct, one famous Amiga game was never sold.

Galahad/FLT · 09 January 2023, 08:50

Thanks gents, as I suspected.

movem.w and move.w are equivalent as long as the code that uses the results only makes word-sized accesses, so we don't get caught out.

roondar · 09 January 2023, 10:01

MULU/MULS vs using a lookup table is actually interesting on the Amiga environment. There is no doubt that on 68000 using a table is faster in terms of CPU cycles (especially if you can keep the lookup table pointer in an address register during multiple lookups and/or use a PC relative table).

However, in terms of memory accesses it's worse to use the table than the instruction. In some cases (notably setting up for future blits while the Blitter is already running), this changes things - IIRC it ends up taking almost the same amount of actual elapsed frame time to just use the MULU instead of the table.
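To put rough numbers on that trade-off, here is a sketch using the multiply-by-27 table from earlier in the thread. Counts are from the 68000 timing tables as cycles(reads/writes) with no wait states assumed; mul27_table is the table generated by the macro above and d0 is assumed to hold a value in 0..255.

```asm
; Instruction: fewer bus accesses, more cycles
	mulu.w	#27,d0		; 38+2n, n = set bits in the source operand
				; 27 = %11011, so 46, plus 4(1/0) for the
				; immediate word -> 50(2/0)
				; with a register source it is 38..70(1/0)

; Table: fewer cycles, three times the bus accesses
	lea	mul27_table(pc),a0	;  8(2/0)
	add.w	d0,d0			;  4(1/0)
	move.w	(a0,d0.w),d0		; 14(3/0)  -> 26(6/0) in total
```

If a0 already holds the table pointer, the lookup drops to 18(4/0), which is still twice the bus accesses of the immediate MULU. Each of those accesses can stall while the Blitter or other DMA owns the chip bus.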

Don_Adan · 09 January 2023, 12:08

If you never forget that after using movem.w the high word of the Dx and Ax registers will always be trashed with $ffff or $0000, then it can be OK. But I don't think this is good for speed optimisation on 68000, because you waste the high words of the Dx/Ax registers. Then you almost always need more registers and more instructions. For speed it is much better to use movem.l, or move.l together with swap.
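A sketch of the move.l plus swap idea, assuming two adjacent words at (a0) and using d2/d3 as made-up accumulators. Counts are cycles(reads/writes) from the 68000 timing tables.

```asm
	move.l	(a0),d0		; 12(3/0): high word = word at 0(a0),
				;          low word  = word at 2(a0)
	add.w	d0,d3		; use the second word
	swap	d0		;  4(1/0): exchange the two halves
	add.w	d0,d2		; use the first word
```

That is 16 cycles and one data register for the load and swap, against 20 cycles and two registers for movem.w (a0),d0-d1.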

Bruce Abbott · 09 January 2023, 14:49

Quote:

Originally Posted by Don_Adan

I don't think this is good for speed optimisation on 68000, because you waste the high words of the Dx/Ax registers.

It isn't good anyway, because (according to my tests on a stock A500) there is no speedup.

99% of the time it is better to improve the high level program structure than do micro-optimizations like this.

meynaf · 09 January 2023, 17:51

Quote:

Originally Posted by Bruce Abbott

99% of the time it is better to improve the high level program structure than do micro-optimizations like this.

One doesn't preclude the other.

Rock'n Roll · 09 January 2023, 20:45

If you want to know the cycle usage exactly, run the code you are comparing in the WinUAE debugger.
Stop the code at the beginning and set a breakpoint, then run the code.
The top line of the output shows the cycles.

>fi nop
Cycles: 1619 Chip, 3238 CPU. (V=105 H=24 -> V=112 H=54)
VPOS: 112 ($070) HPOS: 054 ($036) COP: $0002388c

(sometimes it's necessary to turn off all DMA and Interrupt channels, otherwise
the results could be wrong.)
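A minimal sketch of a test fragment for this kind of measurement. It assumes the system has already been taken over (otherwise switching everything off like this will hang the OS), and that `fi nop` runs until the next nop. The labels and test data are hypothetical.

```asm
CUSTOM	equ	$dff000
INTENA	equ	$09a
DMACON	equ	$096

	lea	CUSTOM,a6
	move.w	#$7fff,INTENA(a6)	; clear every interrupt enable bit
	move.w	#$7fff,DMACON(a6)	; clear every DMA enable bit
	lea	testdata(pc),a0
	nop				; set the breakpoint here
	movem.w	(a0),d0-d3		; code under test
	nop				; stop here and read the cycle count
	; restoring INTENA/DMACON and exiting cleanly is omitted
	rts

testdata:
	dc.w	1,2,3,4
```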

Bruce Abbott · 10 January 2023, 19:44

Quote:

Originally Posted by meynaf

One doesn't preclude the other.

Except, depending on what your goals are, the time wasted doing micro-optimizations might be better spent elsewhere.

koobo · 10 January 2023, 20:00

Quote:

Originally Posted by Bruce Abbott

Except, depending on what your goals are, the time wasted doing micro-optimizations might be better spent elsewhere.

A perfect opportunity for a shameless plug about optimizations! In case y'all missed the original post a few years ago about doing a mandelbrot on the A500: http://eab.abime.net/showthread.php?t=103710

meynaf · 11 January 2023, 07:44

Quote:

Originally Posted by Bruce Abbott

Except, depending on what your goals are, the time wasted doing micro-optimizations might be better spent elsewhere.

If it takes too much time, you're doing it wrong.

Thorham · 13 January 2023, 11:45

Quote:

Originally Posted by Bruce Abbott

99% of the time it is better to improve the high level program structure than do micro-optimizations like this.

Micro optimizations are for tight loops (after you picked the right algorithms and data formats, of course).
