68000 code optimisations - Page 11

meynaf · 05 June 2021, 18:42

Perhaps it should, yes. Alas this movem addressing mode isn't allowed.

roondar · 05 June 2021, 18:46

Well, there's two minor issues there. The first is that movem.l a0-a2,(a4)+ doesn't actually exist, so you'd need to use movem.l a0-a2,(a4) and then update the value of A4 by hand. The second is that you're indeed correct - movem is only faster if you use a certain number of registers. I'm not sure off the top of my head how many, but it's either three or four IIRC (and that's not counting the cost of updating A4 in this case).

Edit: didn't see meynaf's post when I started writing this, sorry for the double info.

jotd · 06 June 2021, 00:08

If you have a series of moves to perform, you can add the total offset to a4 first and then use movem.l xxx,-(a4).

Slightly harder to maintain though. The gain doesn't seem too significant.

Photon · 02 August 2021, 12:47

An optimization of mine was brought to my attention yesterday

It's nothing advanced but maybe it fits. It goes under basic ALU operations really, which we could make a list of.

not = neg;sub #1

For example, if a number is negative and should be used for a loop count (e.g. dbf), not.w d0 negates it and subtracts 1 in a single instruction.
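The identity can be checked exhaustively for word-sized values. The following is a minimal Python sketch of the register arithmetic, not 68k code; the helper names (not_w, neg_sub1_w, dbf_passes) are made up for illustration, and dbf_passes assumes a body-first loop closed by dbf.

```python
MASK16 = 0xFFFF

def not_w(x):
    """NOT.W: invert all 16 bits."""
    return ~x & MASK16

def neg_sub1_w(x):
    """NEG.W followed by SUBQ.W #1."""
    return (-x - 1) & MASK16

def dbf_passes(counter):
    """Passes made by a body-first DBF loop entered with this 16-bit counter."""
    return (counter & MASK16) + 1

# The identity holds for every 16-bit pattern.
assert all(not_w(x) == neg_sub1_w(x) for x in range(0x10000))

minus_ten = -10 & MASK16                 # $FFF6
print(not_w(minus_ten), dbf_passes(not_w(minus_ten)))   # 9 10
```

So a count stored as -10 turns into a dbf counter of 9, which gives 10 passes.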

Thomas Richter · 02 August 2021, 13:54

A typical use case is strlen:

Code:

 move.l a0,d0
.loop:
 tst.b (a0)+
 bne.s .loop
 sub.l a0,d0
 not.l d0
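Why the final not.l yields the length: after the loop a0 points one byte past the terminating zero, so start minus a0 equals -(len + 1), and NOT turns that into len. Below is a small Python model of the listing, offered as a sketch only; the bytearray standing in for memory and the name strlen_68k are assumptions for illustration.

```python
MASK32 = 0xFFFFFFFF

def strlen_68k(memory, a0):
    """Mirror the listing: d0 = start, scan with (a0)+, then d0 = ~(d0 - a0)."""
    d0 = a0                          # move.l a0,d0
    while True:
        byte = memory[a0]            # tst.b (a0)+
        a0 = (a0 + 1) & MASK32       # the post-increment also happens on the NUL
        if byte == 0:                # bne.s .loop falls through on zero
            break
    d0 = (d0 - a0) & MASK32          # sub.l a0,d0  -> -(len + 1)
    return ~d0 & MASK32              # not.l d0     -> len

memory = bytearray(64)               # hypothetical RAM, all zero
memory[16:22] = b"Amiga\0"           # string placed at address 16
print(strlen_68k(memory, 16))        # 5
```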

Photon · 02 August 2021, 14:02

Quote:

Originally Posted by Thomas Richter

A typical use case is strlen:

Yep, and languages can keep count during string operations - this avoids running this counting loop even once

(strlen simply loads the count attached to the string and returns.)

It would actually be interesting to see similar cases where a chunk of code can be completely omitted by planning ahead!

jotd · 25 May 2022, 16:39

Question:

I have this '020 code

Code:

     moveq.l     #127,d5
     moveq.l     #126,d3
     move.l     (a0,d5.l*4),d0
     sub.l (a0,d3.l*4),d0

As d5 and d3 are clobbered just afterwards, we don't really need the values there, so I figured that I could write

Code:

     move.l     (127*4,a0),d0
     sub.l (126*4,a0),d0

But the moveq is very quick and now I have 16 bit offsets instead of registers (but the *4 operation is done at compile time)

Is my optimisation useful?

Or I could use another register:

Code:

   lea 127*4(a0),a1
   move.l  (A1),d0
   sub.l    -(A1),d0  ; pre-decrementing to get offset 126*4

ross · 25 May 2022, 18:11

Yes, this is faster:

Code:

     move.l  (127*4,a0),d0
     sub.l   (126*4,a0),d0

Also on bare 68k (EDIT: of course, even if it were only (ax,dx.l),d0).

jotd · 25 May 2022, 18:50

thanks that's what I thought

jotd · 08 June 2022, 21:46

I was asked to optimize some 68020 code for work (yes, I know, that's great)

The original code shifts D1:D0 by D3 bits on the right.

Code:

GO_ON:
    ASR.L #1,D1
    ROXR.L #1,D0
    SUBQ.L #1,D3
    BGE GO_ON

If we have:

D1 = $12345678
D0 = $9ABCDEF0

D3 = 23, so the loop runs 24 times (easier to understand what it does)

In the end we get:

D1 = $00000012
D0 = $3456789A
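A literal Python model of the loop helps pin down the counter convention. It is a sketch under the assumption that D3 is treated as a signed 32-bit counter, and the function names are made up. The loop makes D3 + 1 passes, so the 24-bit shift in the example corresponds to a starting counter of 23, and a counter of 24 gives one more shift.

```python
MASK32 = 0xFFFFFFFF

def asr1(x):
    """ASR.L #1: returns (result, bit shifted out into X)."""
    return ((x >> 1) | (x & 0x80000000)) & MASK32, x & 1

def roxr1(x, xflag):
    """ROXR.L #1: the old X flag enters bit 31."""
    return ((x >> 1) | (xflag << 31)) & MASK32, x & 1

def shift_loop(d1, d0, d3):
    """GO_ON loop: body first, then SUBQ.L #1,D3 and BGE."""
    passes = 0
    while True:
        d1, x = asr1(d1)             # ASR.L  #1,D1
        d0, x = roxr1(d0, x)         # ROXR.L #1,D0
        passes += 1
        d3 -= 1                      # SUBQ.L #1,D3
        if d3 < 0:                   # BGE falls through once D3 is negative
            break
    return d1, d0, passes

for counter in (23, 24):
    d1, d0, passes = shift_loop(0x12345678, 0x9ABCDEF0, counter)
    print(f"{d1:08X} {d0:08X} {passes}")
# 00000012 3456789A 24
# 00000009 1A2B3C4D 25
```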

Of course, one trivial optimization is to replace SUBQ+BGE by DBF, but it only speeds up a bit. My idea was to get rid of the loop, with the help of extra registers that I could spare

Code:

       addq.l  #1,d3   ; loop counter is one off
        lsr.l   d3,d0
        moveq.l #0,d5
        bset    d3,d5
        subq.l  #1,d5   ; generate 1111s mask
        move.l  d1,d2
        and.l   d5,d2
        asr.l   d3,d1
        sub.l   #32,d3
        neg.l   d3              ; shift = 32-shift
        lsl.l   d3,d2   
        or.l    d2,d0

The code is faster for D3 > 2. It's 5 times faster when D3=20, so it's already great.

Anyone can propose further improvements on that one?
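For anyone who wants to check the loop-free listing before timing it, here is a minimal Python sketch that mirrors it instruction by instruction and compares it with a plain 64-bit arithmetic shift by D3 + 1. The function names are hypothetical, and the model only covers starting counters 0 to 30 (shifts of 1 to 31 bits); beyond that the 68k's bset and shift-count wrapping would need modelling too.

```python
MASK32 = 0xFFFFFFFF

def asr32(x, n):
    """ASR.L by n bits on a 32-bit pattern."""
    signed = x - (1 << 32) if x & 0x80000000 else x
    return (signed >> n) & MASK32

def reference(d1, d0, d3):
    """D1:D0 arithmetically shifted right by D3 + 1, as the loop does."""
    value = (d1 << 32) | d0
    if d1 & 0x80000000:
        value -= 1 << 64
    value >>= d3 + 1
    return (value >> 32) & MASK32, value & MASK32

def loop_free(d1, d0, d3):
    d3 += 1                       # addq.l #1,d3
    d0 >>= d3                     # lsr.l d3,d0
    d5 = (1 << d3) - 1            # moveq #0,d5 / bset d3,d5 / subq.l #1,d5
    d2 = d1 & d5                  # move.l d1,d2 / and.l d5,d2
    d1 = asr32(d1, d3)            # asr.l d3,d1
    d3 = 32 - d3                  # sub.l #32,d3 / neg.l d3
    d2 = (d2 << d3) & MASK32      # lsl.l d3,d2
    return d1, d0 | d2            # or.l d2,d0

d1, d0 = loop_free(0x12345678, 0x9ABCDEF0, 23)
print(f"{d1:08X} {d0:08X}")       # 00000012 3456789A
```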

ross · 08 June 2022, 22:17

Quote:

Originally Posted by jotd

Anyone can propose further improvements on that one?

Micro optimization..

Code:

        addq.w  #1,d3   ; loop counter is one off
        lsr.l   d3,d0
        moveq   #0,d5
        bset    d3,d5
        subq.l  #1,d5   ; generate 1111s mask
        move.l  d1,d2
        and.l   d5,d2
        asr.l   d3,d1
        moveq   #32,d5
        sub.w   d3,d5   ; shift = 32-shift
        lsl.l   d5,d2   
        or.l    d2,d0

But probably only on 68000

EDIT: I had wasted a register...

phx · 08 June 2022, 22:18

This is a standard 64-bit shift-right operation, which you find in any m68k C-compiler's clib.
For example:

Code:

        tst.w   d3
        beq     .2
        moveq   #32,d2
        sub.l   d3,d2
        bgt.b   .1
        move.l  d0,d1
        neg.l   d2
        add.l   d0,d0
        subx.l  d0,d0
        asr.l   d2,d1
        bra.b   .2
.1:     move.l  d0,d4
        lsr.l   d3,d1
        lsl.l   d2,d4
        asr.l   d3,d0
        or.l    d4,d1
.2:     rts

EDIT: Don't know if this is faster than the jotd/ross version. Probably not. Too lazy to count.
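To make the two paths of the listing easier to follow, here is a Python sketch of the same logic, with d0 as the high word and d1 as the low word as in the listing. It is a model under the assumption that d3 holds a count from 0 to 63; the function names are made up.

```python
MASK32 = 0xFFFFFFFF

def asr32(x, n):
    """ASR.L by n bits on a 32-bit pattern."""
    signed = x - (1 << 32) if x & 0x80000000 else x
    return (signed >> n) & MASK32

def asr64_clib(d0, d1, d3):
    """d0 = high word, d1 = low word, d3 = shift count (0..63)."""
    if d3 == 0:                              # tst.w d3 / beq .2
        return d0, d1
    d2 = 32 - d3                             # moveq #32,d2 / sub.l d3,d2
    if d2 > 0:                               # bgt.b .1 (count below 32)
        d4 = (d0 << d2) & MASK32             # move.l d0,d4 / lsl.l d2,d4
        d1 = (d1 >> d3) | d4                 # lsr.l d3,d1 / or.l d4,d1
        return asr32(d0, d3), d1             # asr.l d3,d0
    d1 = asr32(d0, -d2)                      # move.l d0,d1 / neg.l d2 / asr.l d2,d1
    d0 = MASK32 if d0 & 0x80000000 else 0    # add.l d0,d0 / subx.l d0,d0 (sign fill)
    return d0, d1

def reference(d0, d1, d3):
    """Plain 64-bit arithmetic shift right of d0:d1 by d3."""
    value = (d0 << 32) | d1
    if d0 & 0x80000000:
        value -= 1 << 64
    value >>= d3
    return (value >> 32) & MASK32, value & MASK32

hi, lo = asr64_clib(0x12345678, 0x9ABCDEF0, 24)
print(f"{hi:08X} {lo:08X}")                  # 00000012 3456789A
```

The add.l/subx.l pair is the part worth noting: it replicates the sign bit across the whole high word when the count is 32 or more.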

ross · 08 June 2022, 22:31

Quote:

Originally Posted by phx

EDIT: Don't know if this is faster than the jotd/ross version. Probably not. Too lazy to count.

For sure it's more generic (support shift >32), I have to try it ...

EDIT: ah, the input are reversed, and in jotd's one the counter is +1
It is best to use this, with proper register input

Don_Adan · 08 June 2022, 23:03

Quote:

Originally Posted by phx

This is a standard 64-bit shift-right operation, which you find in any m68k C-compiler's clib.
For example:

Code:

        tst.w   d3
        beq     .2
        moveq   #32,d2
        sub.l   d3,d2
        bgt.b   .1
        move.l  d0,d1
        neg.l   d2
        add.l   d0,d0
        subx.l  d0,d0
        asr.l   d2,d1
        bra.b   .2
.1:     move.l  d0,d4
        lsr.l   d3,d1
        lsl.l   d2,d4
        asr.l   d3,d0
        or.l    d4,d1
.2:     rts

EDIT: Don't know if this is faster than the jotd/ross version. Probably not. Too lazy to count.

If this is for C compilers then change "bra.b .2" to "rts".

a/b · 08 June 2022, 23:25

Perhaps this works? Didn't do much testing (looked fine with d3=24, 16, 8, 4, 0).

Code:

;	addq.w	#1,d3		; include this if needed (e.g. d3=23 for 24 shifts)
	moveq	#0,d2
	bset	d3,d2
	subq.l	#1,d2		; mask
	eor.l	d0,d1
	and.l	d2,d1
	or.l	d1,d0
	ror.l	d3,d0

jotd · 09 June 2022, 00:21

a/b what about the asr part? There's only one shift in that version. How can it give a correct result for d1?

ross micro optim looks good

Code:

        moveq   #32,d5
        sub.w   d3,d5   ; shift = 32-shift

why is that better than sub.l #32,d3 only on 68000? moveq+sub register isn't faster in all cases? plus you're eliminating the neg instruction. That looks marginally faster to me.

My target is a 68020 CPU

a/b · 09 June 2022, 00:44

Ah, you need both d0 and d1. OK, I thought you only needed d0.

How about this (020+, as you mentioned)?

Code:

;	addq.l	#1,d3		; include this if needed (e.g. d3=23 for 24 shifts)
	moveq	#32,d2
	sub.l	d3,d2
	bfins	d1,d0{d2:d3}
	rol.l	d2,d0
	lsr.l	d3,d1
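A quick way to sanity-check the bit-field version is to model it in Python. This is a sketch, not 68020 code: bfins_reg is a hypothetical helper that only handles fields which do not wrap around the register, and the model assumes d3 holds the actual shift count in the range 1 to 31. Since the listing ends with lsr.l, the comparison is against a logical 64-bit shift, which matches the original ASR-based loop only when D1 is non-negative.

```python
MASK32 = 0xFFFFFFFF

def bfins_reg(src, dst, offset, width):
    """BFINS Dn,Dm{offset:width}; offset 0 is bit 31, no wrap-around handled."""
    assert 1 <= width <= 32 and offset >= 0 and offset + width <= 32
    shift = 32 - offset - width              # distance of the field from bit 0
    mask = ((1 << width) - 1) << shift
    return (dst & ~mask & MASK32) | ((src << shift) & mask)

def rol32(x, n):
    """ROL.L by n (taken modulo 32 here)."""
    n %= 32
    return ((x << n) | (x >> (32 - n))) & MASK32

def shift_bfins(d1, d0, d3):
    d2 = 32 - d3                             # moveq #32,d2 / sub.l d3,d2
    d0 = bfins_reg(d1, d0, d2, d3)           # bfins d1,d0{d2:d3}
    d0 = rol32(d0, d2)                       # rol.l d2,d0
    d1 = d1 >> d3                            # lsr.l d3,d1
    return d1, d0

def logical_reference(d1, d0, d3):
    value = ((d1 << 32) | d0) >> d3
    return value >> 32, value & MASK32

d1, d0 = shift_bfins(0x12345678, 0x9ABCDEF0, 24)
print(f"{d1:08X} {d0:08X}")                  # 00000012 3456789A
```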

ross · 09 June 2022, 00:47

Maybe this is the fastest, if the shift range is limited and d3 already holds the 'right' shift count (no +1 needed):

Code:

    moveq   #32,d2
    sub.l   d3,d2
    move.l  d1,d4
    lsr.l   d3,d0
    lsl.l   d2,d4
    asr.l   d3,d1
    or.l    d4,d0

It is a specialized version from the generic one (using the specifications of your registers).
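The specialized listing can be modelled the same way. The Python sketch below assumes d3 holds the real shift count in the range 1 to 31 and uses made-up function names; it compares the result with a plain 64-bit arithmetic shift of D1:D0.

```python
MASK32 = 0xFFFFFFFF

def asr32(x, n):
    """ASR.L by n bits on a 32-bit pattern."""
    signed = x - (1 << 32) if x & 0x80000000 else x
    return (signed >> n) & MASK32

def shift_specialized(d1, d0, d3):
    d2 = 32 - d3                  # moveq #32,d2 / sub.l d3,d2
    d4 = (d1 << d2) & MASK32      # move.l d1,d4 / lsl.l d2,d4
    d0 = (d0 >> d3) | d4          # lsr.l d3,d0 / or.l d4,d0
    d1 = asr32(d1, d3)            # asr.l d3,d1
    return d1, d0

def reference(d1, d0, d3):
    """Plain 64-bit arithmetic shift right of D1:D0 by d3."""
    value = (d1 << 32) | d0
    if d1 & 0x80000000:
        value -= 1 << 64
    value >>= d3
    return (value >> 32) & MASK32, value & MASK32

d1, d0 = shift_specialized(0x12345678, 0x9ABCDEF0, 24)
print(f"{d1:08X} {d0:08X}")       # 00000012 3456789A
```

No mask is needed here because lsl.l already discards the bits of D1 that stay in the high word.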

About the 68020+: sometimes the speed is the same even if you use immediate values, but I think that in this case it is faster anyway (and uses less memory).

phx · 09 June 2022, 00:47

Quote:

Originally Posted by ross

For sure it's more generic (support shift >32)

Indeed, I overlooked that. Then they are not comparable.

Quote:

ah, the input are reversed,

Missed that too. Usually the lower register is the MSW in 64-bit register pairs. I like big-endian.

Quote:

Originally Posted by Don_Adan

If this is for C compilers then change "bra.b .2" to "rts".

I extracted it from vclib, exchanged d2 and d3 and removed the prolog and epilog, which includes movem. The bra.b was for the movem.

Quote:

Originally Posted by jotd

why is that better than sub.l #32,d3 only on 68000? moveq+sub register isn't faster in all cases?

I would always prefer moveq+sub over sub.l-immediate as well. It also saves two bytes.

If a/b's solution works it would be brilliant. But I don't think it does. Did a quick check with d0:d1=$12345678:abcdef0 shifted by 7 and the result was $f02468ac:00000008.
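This check can be replayed with a small Python model of a/b's first (eor/and/or/ror) listing, which reproduces the figures above. The sketch also tries an exclusive-or in place of the final or.l; that substitution is purely an assumption of this sketch and not something proposed in the thread. With it the low word comes out as expected, while d1 is left unshifted either way.

```python
MASK32 = 0xFFFFFFFF

def ror32(x, n):
    """ROR.L by n (taken modulo 32 here)."""
    n %= 32
    return ((x >> n) | (x << (32 - n))) & MASK32

def first_listing(d0, d1, d3, merge):
    """a/b's first listing; `merge` stands in for the instruction before ror.l."""
    d2 = (1 << d3) - 1            # moveq #0,d2 / bset d3,d2 / subq.l #1,d2
    d1 = (d1 ^ d0) & d2           # eor.l d0,d1 / and.l d2,d1
    d0 = merge(d0, d1)            # or.l d1,d0 as posted
    return ror32(d0, d3), d1      # ror.l d3,d0

as_posted = first_listing(0x12345678, 0x0ABCDEF0, 7, lambda a, b: a | b)
with_eor = first_listing(0x12345678, 0x0ABCDEF0, 7, lambda a, b: a ^ b)   # assumption
expected_low = ((0x12345678 >> 7) | (0x0ABCDEF0 << 25)) & MASK32

print(f"{as_posted[0]:08X}:{as_posted[1]:08X}")   # F02468AC:00000008
print(f"{with_eor[0]:08X} {expected_low:08X}")    # E02468AC E02468AC
```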

EDIT: Wow... ross and me posted in the same minute again. How likely is that?

jotd · 09 June 2022, 00:49

I wanted to look into bitfield instructions but thought they didn't cover registers as sources

That looks & reads great, but maybe it's too good to be true.

Great to see so many answers for my question. Thanks all.
