English Amiga Board


English Amiga Board > Coders > Coders. Asm / Hardware

08 January 2023, 18:05   #1
Galahad/FLT
Going nowhere
Join Date: Oct 2001
Location: United Kingdom
Age: 50
Posts: 9,016
optimisations for 68000

I'm sure others have done the calculations, which is why I'm asking.

I'm only considering the 68000 here, as that's the minimum my code runs on.

When using, for instance, MOVEM.W (a0),d0-d1, are there cases where it isn't faster until you move a certain number of registers?

For instance, is MOVEM.W (a0),d0-d1 faster, slower or the same as:
MOVE.W (a0),d0
MOVE.W 2(a0),d1

I'm doing a lot of movem.w (a0),d0-d3, for instance, and just want to gauge whether this is optimal for what I'm doing or whether it's marginally quicker to load each register separately.

Also, some of my code does MOVEM.L a2-a3,$50(a6).

Is this the same speed as, slower than or faster than moving those registers separately?

Also, for pointing into map data, which this code has to do quite frequently, I'm using a MULU.

Would it be significantly quicker to use a lookup table instead? I'm guessing it would be.

I just don't want to fall into the trap where the code is neater and appears to do more with fewer resources, but the MOVEM.x instruction is only beneficial once you move more than a certain number of registers.

Ta in advance

08 January 2023, 18:19   #2
Karlos
Alien Bleed
Join Date: Aug 2022
Location: UK
Posts: 4,444
I don't have the tables to hand, but I expect it'll come down to bus cycles in the end. With individual moves you have to fetch each instruction as well as transfer the data. With MOVEM you have to fetch the register mask, but after that it's all down to the microcode.
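
To make the bus-cycle argument concrete, here is a small Python sketch that counts 68000 bus reads for loading n words from consecutive addresses. The read counts are taken from the Motorola 68000 timing tables as I understand them (assumptions: MOVE.W (An),Dn is one opcode fetch plus one data read, MOVE.W d16(An),Dn adds a displacement word, and MOVEM.W (An),list is opcode + mask + n data reads + one extra read). The helper names are hypothetical, and the model ignores chip RAM wait states.

```python
# Hypothetical helpers: bus reads per approach, from the 68000 timing tables.
# Each bus access takes 4 clocks (no wait states, no DMA contention modelled).
CLOCKS_PER_ACCESS = 4

def separate_moves_reads(n):
    """MOVE.W (a0),d0 followed by MOVE.W d16(a0),Dn for each further register."""
    if n <= 0:
        return 0
    first = 1 + 1                   # opcode fetch + data read
    others = (n - 1) * (1 + 1 + 1)  # opcode + displacement word + data read
    return first + others

def movem_reads(n):
    """MOVEM.W (a0),reglist: opcode + mask word + n data reads + 1 extra read."""
    return 2 + n + 1

for n in range(1, 6):
    a, b = separate_moves_reads(n), movem_reads(n)
    print(f"{n} regs: MOVE.W x{n} = {a} reads ({a * CLOCKS_PER_ACCESS} clocks), "
          f"MOVEM.W = {b} reads ({b * CLOCKS_PER_ACCESS} clocks)")
```

At 2 registers both approaches come to 5 reads (20 clocks); from 3 registers on, MOVEM.W needs fewer reads because it fetches one opcode and one mask word no matter how many registers it loads.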

08 January 2023, 18:22   #3
jotd
This cat is no more
Join Date: Dec 2004
Location: FRANCE
Age: 52
Posts: 8,369
There's a threshold, and I'm pretty sure 2 registers isn't enough for MOVEM to be worth it.

MULU is slower than a lookup table. I'm using a macro to generate one with 256 entries (maybe that's too many, maybe it's not enough):

Code:
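MUL_TABLE: MACRO
mul\1_table
	rept	256
	dc.w	REPTN*\1        ; entry n holds n*\1 (REPTN = repeat count, from 0)
	endr
    ENDM

    MUL_TABLE  27           ; emits mul27_table: 0, 27, 54, ... (512 bytes)

    ; d0.w is your value (0..255)
    lea     mul27_table(pc),a0
    add.w   d0,d0           ; word-sized entries: scale the index by 2
    move.w  (a0,d0.w),d0    ; d0 = d0*27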
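
What the macro and the three-instruction lookup do can be modelled in a few lines of Python. This is a sketch of the data layout only, not of the assembler: the function names are hypothetical, the table is laid out big-endian as it would sit in 68000 memory, and the index is doubled exactly as add.w d0,d0 does before the indexed read.

```python
def mul_table(factor, entries=256):
    """Equivalent of 'rept 256 / dc.w REPTN*factor': entry n holds n*factor."""
    table = [n * factor for n in range(entries)]
    # dc.w stores 16-bit words, so every entry must fit in a word
    assert all(0 <= v <= 0xFFFF for v in table)
    return table

def table_as_bytes(table):
    """Lay the words out big-endian, as they would sit in 68000 memory."""
    return b"".join(v.to_bytes(2, "big") for v in table)

def lookup(memory, d0):
    """add.w d0,d0 then move.w (a0,d0.w),d0: the index becomes a byte offset."""
    offset = (d0 + d0) & 0xFFFF
    return int.from_bytes(memory[offset:offset + 2], "big")

mem = table_as_bytes(mul_table(27))
print(lookup(mem, 10))   # 270
```

The doubling step is why the table costs 512 bytes for 256 entries, and the assert in mul_table shows the limit of the approach: the largest entry (255 times the factor) must still fit in a dc.w.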

08 January 2023, 18:29   #4
paraj
Registered User
Join Date: Feb 2017
Location: Denmark
Posts: 1,205
A while ago someone posted a neat online cycle-counting tool that I tend to use these days for quick checks: https://68kcounter.grahambates.com/. I haven't noticed any errors so far (note that invalid instructions count as 0 cycles). For your MOVEM.W (a0),d0-d1 example it's a wash, but with any more registers MOVEM.W wins. (The answer is different if you're OK with a0 being incremented: in that case you need to move 4 registers for MOVEM.W to win.)
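
Those break-even points can be reproduced with plain arithmetic. The sketch below uses clock counts from the 68000 timing tables as I understand them (MOVE.W (An),Dn and MOVE.W (An)+,Dn = 8, MOVE.W d16(An),Dn = 12, MOVEM.W (An)/(An)+ to registers = 12 + 4n) and assumes no wait states; the function names are illustrative, and the 68kcounter site linked above is a good way to double-check the figures.

```python
# Clock counts from the 68000 timing tables (no wait states); names are illustrative.
def movem_w_load(n):
    return 12 + 4 * n            # MOVEM.W (An),list or (An)+,list

def moves_keep_a0(n):
    return 8 + 12 * (n - 1)      # MOVE.W (a0),d0 then MOVE.W 2(a0),d1, ...

def moves_postinc(n):
    return 8 * n                 # MOVE.W (a0)+,Dn for each register

def break_even(separate):
    """Smallest register count at which MOVEM.W is strictly faster."""
    n = 1
    while movem_w_load(n) >= separate(n):
        n += 1
    return n

print(break_even(moves_keep_a0))   # 3
print(break_even(moves_postinc))   # 4
```

With a0 left untouched, the second MOVE.W needs a displacement word, so MOVEM.W catches up at 2 registers and wins from 3. With (a0)+ every separate move is only 8 clocks, which pushes the win out to 4 registers.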

For MULU vs. a table lookup, it also shows the actual cycle count for the MULU instruction. Here you want to pay special attention to the number of memory accesses for each approach. Even if the table is slightly faster in raw cycles, the extra memory accesses for both code and data will sometimes make it the worse approach, assuming you don't have true fast RAM. (I'm sure you know; just pointing it out.)
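
A rough cycle and memory-access comparison for the MUL_TABLE 27 example can be sketched like this. The figures are from the 68000 timing tables as I understand them: MULU takes 38 + 2 clocks per set bit in the 16-bit source operand, plus 4 for an immediate; LEA d16(PC) is 8 clocks (2 reads), ADD.W Dn,Dn is 4 (1 read) and MOVE.W d8(An,Xn),Dn is 14 (3 reads). It assumes the multiplier is the constant 27 given as an immediate, and no wait states.

```python
def mulu_imm_clocks(factor):
    """MULU #factor,Dn: 38 + 2 per set bit in the 16-bit source, +4 for the immediate."""
    ones = bin(factor & 0xFFFF).count("1")
    return 38 + 2 * ones + 4

MULU_IMM_READS = 2   # opcode word + immediate word

# jotd's lookup sequence as (instruction, clocks, bus reads)
TABLE_SEQUENCE = [
    ("lea mul27_table(pc),a0", 8, 2),
    ("add.w d0,d0", 4, 1),
    ("move.w (a0,d0.w),d0", 14, 3),
]

def table_cost(a0_preloaded=False):
    """Total (clocks, reads); skip the LEA if a0 already holds the table address."""
    steps = TABLE_SEQUENCE[1:] if a0_preloaded else TABLE_SEQUENCE
    return sum(s[1] for s in steps), sum(s[2] for s in steps)

print(mulu_imm_clocks(27), MULU_IMM_READS)   # 50 2
print(table_cost())                          # (26, 6)
print(table_cost(a0_preloaded=True))         # (18, 4)
```

The table wins on raw clocks, but it makes two to three times as many bus reads, which is exactly the trade-off described above when the code and data live in chip RAM.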

08 January 2023, 18:35   #5
a/b
Registered User
Join Date: Jun 2016
Location: europe
Posts: 1,062
Pushing 2 registers with movem is the same speed for (ax) or -(ax), and faster vs. other modes.
Popping 3 registers with movem (takes extra 4 cycles) is the same speed for (ax) or (ax)+, and faster vs. other modes.
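
These break-even points, and the MOVEM.L a2-a3,$50(a6) case from the opening post, can be checked with the same kind of arithmetic. The sketch assumes clock counts from the 68000 timing tables as I understand them: MOVEM.L registers to memory is 8 + 8n for -(An) and 12 + 8n for d16(An), MOVE.L Rn to memory is 12 for -(An) and 16 for d16(An), MOVEM.L (An)+ to registers is 12 + 8n, and MOVE.L (An)+,Dn is 12. The names are illustrative and wait states are ignored.

```python
# 68000 clock counts (no wait states) for moving n longwords; names are illustrative.
MOVEM_L_STORE_BASE = {"-(An)": 8, "d16(An)": 12}   # MOVEM.L list,<ea>: base + 8n
MOVE_L_STORE_EACH = {"-(An)": 12, "d16(An)": 16}   # MOVE.L Rn,<ea> per register

def movem_l_store(n, mode):
    return MOVEM_L_STORE_BASE[mode] + 8 * n

def move_l_store(n, mode):
    return MOVE_L_STORE_EACH[mode] * n

def movem_l_load(n):
    return 12 + 8 * n          # MOVEM.L (An)+,list (same as (An)); 4 more than a store

def move_l_load_postinc(n):
    return 12 * n              # MOVE.L (An)+,Dn per register

print(movem_l_store(2, "-(An)"), move_l_store(2, "-(An)"))       # 24 24
print(movem_l_load(3), move_l_load_postinc(3))                   # 36 36
print(movem_l_store(2, "d16(An)"), move_l_store(2, "d16(An)"))   # 28 32
```

The last line compares MOVEM.L a2-a3,$50(a6) with MOVE.L a2,$50(a6) plus MOVE.L a3,$54(a6): under these assumptions the MOVEM.L is 4 clocks faster. The 4-clock gap between the store base (8) and the load base (12) is the extra cost a/b mentions for popping.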

08 January 2023, 18:40   #6
Thomas Richter
Registered User
Join Date: Jan 2019
Location: Germany
Posts: 3,307
MOVEM.W and MOVE.W are not equivalent: MOVEM.W sign-extends each word it loads to the full 32 bits of the register; MOVE.W to a data register does not.
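
The difference is easy to see by modelling the 32-bit register contents. This is a minimal Python sketch; the function names are hypothetical, and the old register value 0x12340000 is just an example.

```python
def movem_w_load(word):
    """MOVEM.W memory->register: the word is sign-extended to fill all 32 bits."""
    word &= 0xFFFF
    return (word | 0xFFFF0000) if (word & 0x8000) else word

def move_w_load(old_reg, word):
    """MOVE.W <ea>,Dn: only the low word changes; the old high word is kept."""
    return (old_reg & 0xFFFF0000) | (word & 0xFFFF)

old_d0 = 0x12340000
print(hex(movem_w_load(0x8000)))          # 0xffff8000
print(hex(move_w_load(old_d0, 0x8000)))   # 0x12348000
```

The low words agree, so code that only ever uses word-sized operations on the result sees no difference. Note that MOVEA.W (MOVE.W to an address register) also sign-extends, so the difference only shows up with data registers.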

08 January 2023, 20:36   #7
Don_Adan
Registered User
Join Date: Jan 2008
Location: Warsaw/Poland
Age: 56
Posts: 2,039
And because of the 32-bit register extension (as Thomas Richter mentioned), it's better not to use MOVEM.W unless you know exactly what you are doing. Because one coder used MOVEM.W where MOVEM.L was correct, one famous Amiga game was never sold.

09 January 2023, 08:50   #8
Galahad/FLT
Going nowhere
Join Date: Oct 2001
Location: United Kingdom
Age: 50
Posts: 9,016
Thanks gents, as I suspected.

MOVEM.W and MOVE.W are equivalent as long as the code that uses the results only makes word-sized accesses, so we don't get caught out.

09 January 2023, 10:01   #9
roondar
Registered User
Join Date: Jul 2015
Location: The Netherlands
Posts: 3,436
MULU/MULS vs. a lookup table is actually an interesting case on the Amiga. There is no doubt that on the 68000 a table is faster in terms of CPU cycles (especially if you can keep the table pointer in an address register across multiple lookups and/or use a PC-relative table).

However, in terms of memory accesses the table is worse than the instruction. In some cases (notably setting up future blits while the Blitter is already running) this changes things: IIRC, using MULU instead of the table ends up taking almost the same amount of actual elapsed frame time.
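
One simplified way to picture this: on a 68000 each bus access occupies 4 clocks, so the clocks an instruction spends not touching the bus are its total minus 4 times its bus reads. Those are the clocks where the Blitter can use the bus without the CPU waiting. The sketch below assumes the timing-table figures for MULU #27,d0 (50 clocks, 2 reads) and for the LEA/ADD.W/MOVE.W lookup sequence (26 clocks, 6 reads), and it deliberately ignores how the custom chips actually allocate bus slots.

```python
# Simplified model: every 68000 bus access occupies 4 clocks; the rest are internal.
def internal_clocks(clocks, reads):
    """Clocks during which the CPU is busy but does not need the bus."""
    return clocks - 4 * reads

mulu = (50, 2)     # MULU #27,d0
table = (26, 6)    # lea mul27_table(pc),a0 + add.w d0,d0 + move.w (a0,d0.w),d0
print(internal_clocks(*mulu))    # 42
print(internal_clocks(*table))   # 2
```

Almost all of MULU's time is internal, so a running Blit barely slows it down, while nearly every clock of the table lookup competes for the bus.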

09 January 2023, 12:08   #10
Don_Adan
Registered User
Join Date: Jan 2008
Location: Warsaw/Poland
Age: 56
Posts: 2,039
If you never forget that after MOVEM.W the high word of every Dx and Ax register it loads is always overwritten with $FFFF or $0000, then it can be OK. But I don't think it's good for speed optimisation on the 68000, because you waste the high words of the Dx/Ax registers, so you almost always need more registers and more instructions. For speed it's much better to use MOVEM.L or MOVE.L together with SWAP.
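
Don_Adan's alternative packs two word values into one register: MOVE.L fetches both words in a single instruction, the low word is usable directly, and SWAP brings the other word into the low half. Here is a minimal Python sketch of the register contents; the data values are made up for illustration.

```python
def swap(reg):
    """SWAP Dn: exchange the high and low 16-bit halves of a 32-bit register."""
    reg &= 0xFFFFFFFF
    return ((reg << 16) | (reg >> 16)) & 0xFFFFFFFF

# MOVE.L (a0),d0 loads two consecutive words at once (big-endian: first word = high)
memory_words = (0x0123, 0xBEEF)
d0 = (memory_words[0] << 16) | memory_words[1]

second = d0 & 0xFFFF        # low word: the word at 2(a0)
d0 = swap(d0)
first = d0 & 0xFFFF         # after SWAP: the word at (a0)
print(hex(first), hex(second))   # 0x123 0xbeef
```

Both words end up in one register instead of two, which is the register saving he is pointing at; the price is one SWAP per switch between the halves.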

09 January 2023, 14:49   #11
Bruce Abbott
Registered User
Join Date: Mar 2018
Location: Hastings, New Zealand
Posts: 2,720
Quote:
Originally Posted by Don_Adan
i dont think this is good for speed optimisations for 68000, because You wasted highwords of Dx/Ax registers.
It isn't good anyway, because (according to my tests on a stock A500) there is no speedup.

99% of the time it is better to improve the high level program structure than do micro-optimizations like this.

09 January 2023, 17:51   #12
meynaf
son of 68k
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,355
Quote:
Originally Posted by Bruce Abbott
99% of the time it is better to improve the high level program structure than do micro-optimizations like this.
One doesn't preclude the other.

09 January 2023, 20:45   #13
Rock'n Roll
German Translator
Join Date: Aug 2018
Location: Drübeck / Germany
Age: 49
Posts: 197
If you want to know the cycle usage exactly, run the pieces of code you are comparing in
the WinUAE debugger.
Stop the code at the beginning, set a breakpoint at the end, then run the code.
The top line of the output shows the cycles.

>fi nop
Cycles: 1619 Chip, 3238 CPU. (V=105 H=24 -> V=112 H=54)
VPOS: 112 ($070) HPOS: 054 ($036) COP: $0002388c

(Sometimes it's necessary to turn off all DMA channels and interrupts; otherwise
the results could be wrong.)
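
The numbers in that output are self-consistent, which helps when reading it. Assuming a PAL display with 227 colour clocks per scanline and HPOS counted in colour clocks, the start and end beam positions give exactly the reported chip cycles, and the 68000 clock runs at twice the colour clock rate, which gives the CPU figure. A small Python check (the constant and function names are illustrative):

```python
PAL_COLOR_CLOCKS_PER_LINE = 227   # assumption: PAL, HPOS counted in colour clocks

def elapsed_color_clocks(v0, h0, v1, h1, per_line=PAL_COLOR_CLOCKS_PER_LINE):
    """Colour clocks between beam position (v0, h0) and (v1, h1)."""
    return (v1 - v0) * per_line + (h1 - h0)

chip = elapsed_color_clocks(105, 24, 112, 54)   # V=105 H=24 -> V=112 H=54
print(chip, chip * 2)   # 1619 3238
```

So "Chip" cycles are colour clocks and "CPU" cycles are 68000 clocks, and the beam positions can be used to sanity-check a measurement.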

10 January 2023, 19:44   #14
Bruce Abbott
Registered User
Join Date: Mar 2018
Location: Hastings, New Zealand
Posts: 2,720
Quote:
Originally Posted by meynaf
One doesn't preclude the other.
Except, depending on what your goals are, the time wasted doing micro-optimizations might be better spent elsewhere.

10 January 2023, 20:00   #15
koobo
Registered User
Join Date: Sep 2019
Location: Finland
Posts: 371
Quote:
Originally Posted by Bruce Abbott
Except, depending on what your goals are, the time wasted doing micro-optimizations might be better spent elsewhere.
A perfect opportunity for a shameless plug about optimizations! In case y'all missed the original post a few years ago about doing a mandelbrot on the A500 : http://eab.abime.net/showthread.php?t=103710

11 January 2023, 07:44   #16
meynaf
son of 68k
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,355
Quote:
Originally Posted by Bruce Abbott
Except, depending on what your goals are, the time wasted doing micro-optimizations might be better spent elsewhere.
If it takes too much time, you're doing it wrong.

13 January 2023, 11:45   #17
Thorham
Computer Nerd
Join Date: Sep 2007
Location: Rotterdam/Netherlands
Age: 48
Posts: 3,840
Quote:
Originally Posted by Bruce Abbott
99% of the time it is better to improve the high level program structure than do micro-optimizations like this.
Micro optimizations are for tight loops (after you picked the right algorithms and data formats, of course).