English Amiga Board (https://eab.abime.net/index.php)
-   Coders. Asm / Hardware (https://eab.abime.net/forumdisplay.php?f=112)
-   -   68000 code optimisations (https://eab.abime.net/showthread.php?t=57587)

meynaf 05 June 2021 18:42

Perhaps it should, yes. Alas this movem addressing mode isn't allowed.

roondar 05 June 2021 18:46

Well, there are two minor issues there. The first is that movem.l a0-a2,(a4)+ doesn't actually exist, so you'd need to use movem.l a0-a2,(a4) and then update A4 by hand. The second is that you're indeed correct: movem is only faster once you move a certain number of registers. I'm not sure off the top of my head how many, but it's either three or four IIRC (and that's not counting the cost of updating A4 in this case).

Edit: didn't see meynaf's post when I started writing this, sorry for the double info.

jotd 06 June 2021 00:08

If you have a series of moves to perform, you can add the total offset to a4 first and then use movem.l xxx,-(a4).

It's slightly harder to maintain though, and the gain doesn't seem too significant.

Photon 02 August 2021 12:47

An optimization of mine was brought to my attention yesterday :) It's nothing advanced but maybe it fits. It goes under basic ALU operations really, which we could make a list of.

not = neg;sub #1

For example, if a number is negative and should be used for a loop count (e.g. dbf), not.w d0 negates it and subtracts 1 in a single instruction.
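A minimal C sketch of the identity behind this, assuming 16-bit two's-complement values as with not.w. The function names are made up for illustration only.

```c
#include <assert.h>
#include <stdint.h>

/* Model of "not.w d0": one's complement of a 16-bit value. */
static uint16_t not_w(uint16_t d0)
{
    return (uint16_t)~d0;
}

/* Model of "neg.w d0" followed by "subq.w #1,d0". */
static uint16_t neg_then_sub1(uint16_t d0)
{
    return (uint16_t)(0u - d0 - 1u);
}

/* Hypothetical helper: turn a negative count -n into the dbf counter n-1
   (dbf runs the loop body counter+1 times). */
static uint16_t dbf_counter(int16_t negative_count)
{
    assert(negative_count < 0);
    return not_w((uint16_t)negative_count);
}
```

For instance, a count of -10 becomes a dbf counter of 9, which gives ten passes through the loop.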

Thomas Richter 02 August 2021 13:54

A typical use case is strlen:

Code:

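        move.l  a0,d0      ; d0 = start of string
.loop:
        tst.b   (a0)+      ; step over each byte, including the terminator
        bne.s   .loop
        sub.l   a0,d0      ; d0 = start - (start+len+1) = -(len+1)
        not.l   d0         ; d0 = len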
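The same arithmetic can be followed in C. This is only a sketch: pointers are modelled as integers so that the subtraction and the final NOT mirror the register operations, and the name strlen_not is invented here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* C model of the strlen loop: after the scan, a0 points one byte past
   the terminator, so start - a0 is -(len+1) and NOT turns that into len. */
static size_t strlen_not(const char *s)
{
    assert(s != NULL);
    uintptr_t d0 = (uintptr_t)s;       /* move.l a0,d0           */
    const char *a0 = s;
    while (*a0++ != '\0') { }          /* tst.b (a0)+ / bne.s    */
    d0 -= (uintptr_t)a0;               /* sub.l a0,d0 : -(len+1) */
    return (size_t)~d0;                /* not.l d0    : len      */
}
```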


Photon 02 August 2021 14:02

Quote:

Originally Posted by Thomas Richter (Post 1499099)
A typical use case is strlen:

Yep, and languages can keep count during string operations - this avoids running this counting loop even once :) (strlen simply loads the count attached to the string and returns.)

It would actually be interesting to see similar cases where a chunk of code can be omitted completely by planning ahead! :great

jotd 25 May 2022 16:39

Question:

I have this '020 code

Code:

    moveq.l    #127,d5
    moveq.l    #126,d3
    move.l    (a0,d5.l*4),d0
    sub.l (a0,d3.l*4),d0

As d5 and d3 are clobbered just afterwards, we don't really need the values in them, so I figured that I could write


Code:

    move.l    (127*4,a0),d0
    sub.l (126*4,a0),d0

But moveq is very quick, and now I have 16-bit offsets instead of registers (although the *4 is now done at assembly time).

Is my optimisation useful?

Or I could use another register:

Code:

  lea     127*4(a0),a1
  move.l  (a1),d0
  sub.l   -(a1),d0   ; pre-decrement a1 to reach offset 126*4


ross 25 May 2022 18:11

Yes, this is faster:

Code:

    move.l  (127*4,a0),d0
    sub.l  (126*4,a0),d0

This also holds on a plain 68000 (EDIT: of course, even if the original had only been (ax,dx.l),d0 ;)).

jotd 25 May 2022 18:50

thanks that's what I thought

jotd 08 June 2022 21:46

I was asked to optimize some 68020 code for work (yes, I know, that's great).

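The original code shifts the 64-bit value D1:D0 right (arithmetically) by D3+1 bits, one bit per iteration.

Code:


GO_ON:
    ASR.L #1,D1
    ROXR.L #1,D0
    SUBQ.L #1,D3
    BGE GO_ON


If we have:

D1 = $12345678
D0 = $9ABCDEF0

D3 = 23, i.e. 24 iterations (a whole number of hex digits, so it is easier to see what it does)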

In the end we get:

D1 = $00000012
D0 = $3456789A
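As a cross-check, here is a small C model of that loop. It is a sketch only: the Pair type and shift_loop name are invented for illustration, and the X flag is modelled as a local variable. The loop body runs D3+1 times because the counter is tested after the decrement, so the values listed above correspond to 24 iterations.

```c
#include <assert.h>
#include <stdint.h>

typedef struct { uint32_t d1, d0; } Pair;   /* d1 = upper long, d0 = lower long */

/* One ASR.L #1,D1 / ROXR.L #1,D0 step per pass; the body runs d3+1 times,
   like the SUBQ.L #1,D3 / BGE loop. */
static Pair shift_loop(Pair p, int32_t d3)
{
    assert(d3 >= 0);
    do {
        uint32_t x = p.d1 & 1u;                      /* bit shifted out of D1 -> X  */
        p.d1 = (p.d1 >> 1) | (p.d1 & 0x80000000u);   /* ASR: sign bit is kept       */
        p.d0 = (p.d0 >> 1) | (x << 31);              /* ROXR: X enters at bit 31    */
        d3 -= 1;
    } while (d3 >= 0);
    return p;
}
```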

Of course, one trivial optimization is to replace SUBQ+BGE with DBF, but that only speeds things up a little. My idea was to get rid of the loop, with the help of extra registers that I could spare:

Code:

        addq.l  #1,d3  ; loop counter is one off
        lsr.l  d3,d0
        moveq.l #0,d5
        bset    d3,d5
        subq.l  #1,d5  ; generate 1111s mask
        move.l  d1,d2
        and.l  d5,d2
        asr.l  d3,d1
        sub.l  #32,d3
        neg.l  d3              ; shift = 32-shift
        lsl.l  d3,d2 
        or.l    d2,d0

The code is faster for D3 > 2. It's 5 times faster when D3=20, so it's already great.

Anyone can propose further improvements on that one?
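For readers who want to follow the loop-free sequence step by step, here is a C sketch of it. Assumptions: n stands for the count after the addq (D3+1) and is limited to 1..31, and the Pair type and the asr32 and shift_noloop names are invented for illustration.

```c
#include <assert.h>
#include <stdint.h>

typedef struct { uint32_t d1, d0; } Pair;   /* d1 = upper long, d0 = lower long */

/* Portable model of asr.l for counts 1..31. */
static uint32_t asr32(uint32_t x, uint32_t n)
{
    uint32_t fill = (x & 0x80000000u) ? ~(0xFFFFFFFFu >> n) : 0u;
    return (x >> n) | fill;
}

/* Model of the loop-free sequence; n is the count after the addq. */
static Pair shift_noloop(Pair p, uint32_t n)
{
    assert(n >= 1 && n <= 31);
    uint32_t mask = (1u << n) - 1u;     /* moveq #0 / bset / subq #1     */
    uint32_t d2 = p.d1 & mask;          /* bits that leave D1            */
    p.d0 >>= n;                         /* lsr.l d3,d0                   */
    p.d1 = asr32(p.d1, n);              /* asr.l d3,d1                   */
    p.d0 |= d2 << (32u - n);            /* lsl.l by 32-n, then or.l      */
    return p;
}
```

In this model the mask does not change the result, since the 32-bit left shift by 32-n already discards every bit above the low n bits of D1.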

ross 08 June 2022 22:17

Quote:

Originally Posted by jotd (Post 1549249)
Anyone can propose further improvements on that one?

Micro optimization..
Code:

        addq.w  #1,d3  ; loop counter is one off
        lsr.l  d3,d0
        moveq  #0,d5
        bset    d3,d5
        subq.l  #1,d5  ; generate 1111s mask
        move.l  d1,d2
        and.l  d5,d2
        asr.l  d3,d1
        moveq  #32,d5
        sub.w  d3,d5  ; shift = 32-shift
        lsl.l  d5,d2 
        or.l    d2,d0

But probably only on 68000 :)

EDIT: I had wasted a register...

phx 08 June 2022 22:18

This is a standard 64-bit shift-right operation, which you find in any m68k C-compiler's clib.
For example:
Code:

        tst.w  d3
        beq    .2
        moveq  #32,d2
        sub.l  d3,d2
        bgt.b  .1
        move.l  d0,d1
        neg.l  d2
        add.l  d0,d0
        subx.l  d0,d0
        asr.l  d2,d1
        bra.b  .2
.1:    move.l  d0,d4
        lsr.l  d3,d1
        lsl.l  d2,d4
        asr.l  d3,d0
        or.l    d4,d1
.2:    rts

EDIT: Don't know if this is faster than the jotd/ross version. Probably not. Too lazy to count. ;)
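A C sketch of the same two-branch structure may help when comparing it with the other versions. Assumptions: d0 is the upper long and d1 the lower long (the convention this routine uses), the count is limited to 0..63, and the Pair64 and asr64 names are invented for illustration.

```c
#include <assert.h>
#include <stdint.h>

typedef struct { uint32_t hi, lo; } Pair64;   /* hi models d0, lo models d1 */

static Pair64 asr64(Pair64 p, uint32_t n)
{
    assert(n <= 63);
    if (n == 0) {
        return p;                                  /* tst / beq .2               */
    }
    /* add.l d0,d0 / subx.l d0,d0 leaves 0 or $FFFFFFFF depending on the sign */
    uint32_t sign = (p.hi & 0x80000000u) ? 0xFFFFFFFFu : 0u;
    if (n >= 32) {
        uint32_t k = n - 32u;                      /* neg.l d2                   */
        /* move.l d0,d1 then asr.l d2,d1 */
        p.lo = k ? ((p.hi >> k) | (sign << (32u - k))) : p.hi;
        p.hi = sign;
    } else {
        p.lo = (p.lo >> n) | (p.hi << (32u - n));  /* lsr.l / lsl.l / or.l       */
        p.hi = (p.hi >> n) | (sign << (32u - n));  /* asr.l d3,d0                */
    }
    return p;
}
```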

ross 08 June 2022 22:31

Quote:

Originally Posted by phx (Post 1549253)
EDIT: Don't know if this is faster than the jotd/ross version. Probably not. Too lazy to count. ;)

For sure it's more generic (support shift >32), I have to try it ... ;)

EDIT: ah, the input are reversed, and in jotd's one the counter is +1
It is best to use this, with proper register input :D

Don_Adan 08 June 2022 23:03

Quote:

Originally Posted by phx (Post 1549253)
This is a standard 64-bit shift-right operation, which you find in any m68k C-compiler's clib.
For example:
Code:

        tst.w  d3
        beq    .2
        moveq  #32,d2
        sub.l  d3,d2
        bgt.b  .1
        move.l  d0,d1
        neg.l  d2
        add.l  d0,d0
        subx.l  d0,d0
        asr.l  d2,d1
        bra.b  .2
.1:    move.l  d0,d4
        lsr.l  d3,d1
        lsl.l  d2,d4
        asr.l  d3,d0
        or.l    d4,d1
.2:    rts

EDIT: Don't know if this is faster than the jotd/ross version. Probably not. Too lazy to count. ;)

If this is for C compilers then change "bra.b .2" to "rts".

a/b 08 June 2022 23:25

Perhaps this works? Didn't do much testing (looked fine with d3=24, 16, 8, 4, 0).
Code:

;        addq.w        #1,d3                ; include this if needed (e.g. d3=23 for 24 shifts)
        moveq        #0,d2
        bset        d3,d2
        subq.l        #1,d2                ; mask
        eor.l        d0,d1
        and.l        d2,d1
        or.l        d1,d0
        ror.l        d3,d0


jotd 09 June 2022 00:21

a/b what about the asr part? There's only one shift in that version. How can it give a correct result for d1?

ross micro optim looks good

Code:

        moveq  #32,d5
        sub.w  d3,d5  ; shift = 32-shift

why is that better than sub.l #32,d3 only on 68000? moveq+sub register isn't faster in all cases? plus you're eliminating the neg instruction. That looks marginally faster to me.

My target is a 68020 CPU

a/b 09 June 2022 00:44

Ah, you need both d0 and d1. OK, I thought you only needed d0.

How about this (020+, as you mentioned)?
Code:

;        addq.l        #1,d3                ; include this if needed (e.g. d3=23 for 24 shifts)
        moveq        #32,d2
        sub.l        d3,d2
        bfins        d1,d0{d2:d3}
        rol.l        d2,d0
        lsr.l        d3,d1
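To see why the bitfield insert followed by a rotate yields the shifted lower long, here is a C sketch of the three steps. Assumptions: n is the shift count in 1..31, d1 is the upper long and d0 the lower long, and the Pair type and shift_bfins name are invented for illustration. The model keeps the logical shift of d1 exactly as written in the snippet; the original loop used asr.l, and the two agree only when D1 is non-negative.

```c
#include <assert.h>
#include <stdint.h>

typedef struct { uint32_t d1, d0; } Pair;   /* d1 = upper long, d0 = lower long */

static Pair shift_bfins(Pair p, uint32_t n)
{
    assert(n >= 1 && n <= 31);
    uint32_t mask = (1u << n) - 1u;
    /* bfins d1,d0{32-n:n}: offset 32-n counted from the MSB with width n
       covers bits n-1..0, so the low n bits of d1 replace those of d0. */
    p.d0 = (p.d0 & ~mask) | (p.d1 & mask);
    /* rol.l by 32-n is the same as a rotate right by n. */
    p.d0 = (p.d0 >> n) | (p.d0 << (32u - n));
    /* lsr.l d3,d1 */
    p.d1 >>= n;
    return p;
}
```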


ross 09 June 2022 00:47

This may be the fastest if the shift count is limited and d3 already holds the actual count (no +1 adjustment):

Code:

    moveq  #32,d2
    sub.l  d3,d2
    move.l  d1,d4
    lsr.l  d3,d0
    lsl.l  d2,d4
    asr.l  d3,d1
    or.l    d4,d0

It is a specialized version of the generic one (using your register assignments).

About the 68020+: sometimes the speed is the same even if you use immediate values, but I think that in this case moveq+sub is faster anyway (and uses less memory).

phx 09 June 2022 00:47

Quote:

Originally Posted by ross (Post 1549256)
For sure it's more generic (support shift >32)

Indeed, I overlooked that. Then they are not comparable.

Quote:

ah, the input are reversed,
Missed that too. Usually the lower register is the MSW in 64-bit register pairs. I like big-endian. :)

Quote:

Originally Posted by Don_Adan (Post 1549262)
If this is for C compilers then change "bra.b .2" to "rts".

I extracted it from vclib, exchanged d2 and d3 and removed the prolog and epilog, which includes movem. The bra.b was for the movem.

Quote:

Originally Posted by jotd (Post 1549272)
why is that better than sub.l #32,d3 only on 68000? moveq+sub register isn't faster in all cases?

I would always prefer moveq+sub over sub.l-immediate as well. It also saves two bytes.

If a/b's solution works it would be brilliant. But I don't think it does. Did a quick check with d0:d1=$12345678:abcdef0 shifted by 7 and the result was $f02468ac:00000008.

EDIT: Wow... ross and I posted in the same minute again. How likely is that? :)

jotd 09 June 2022 00:49

I wanted to look into bitfield instructions but thought they didn't cover registers as sources

That looks & reads great, but maybe it's too good to be true.

Great to see so many answers for my question. Thanks all.


All times are GMT +2. The time now is 21:04.
