68000 code optimisations - Page 8

paraj · 23 April 2017, 23:35

Ye olde amycoders website has a tutorial on avoiding multiplications. Looking at the "using squares" part, I'm having trouble finding an immediate use. All the other tips are more or less "drop-in" replacements for their slower counterparts. While I can certainly think of some uses if one is willing to sacrifice precision, it seems more of an advanced building block compared to the other tricks. Am I missing something?

I'm talking ~7MHz 68000 here, while I know 020+ has (aX,dY*4) and probably other stuff that helps don't they multiply fast enough for this trick not to matter?

Below is my attempt doing this in code. As usual there's probably errors, inefficiencies and wrong cycle counts, but in general it seems that the target of 38 (or even 50) seems long out of reach.

EDIT: Hmm.. testing shows that folding the shift into the table doesn't lose precision. It's probably easy to see why, but it'll have to wait and it's still not fast

Code:

; d0 *= d1, assumes a5=square table, d6 used as scratch
; Instruction to beat: muls.w d1, d0 38-70(1/0)
        move.w  d0, d6          ;  4(1/0)
        add.w   d1, d0          ;  4(1/0)
	add.w	d0, d0          ;  4(1/0)
	add.w	d0, d0          ;  4(1/0)
	move.l	(a5,d0.w), d0   ; 18(4/0) d0 = (A+B)^2
        sub.w   d1, d6          ;  4(1/0)
	add.w	d6, d6          ;  4(1/0)
	add.w	d6, d6          ;  4(1/0)
	sub.l	(a5,d6.w), d0   ; 20(3/0) d0 = (A+B)^2 - (A-B)^2
	; see edit asr.l	#2, d0	        ; 12(1/0) d0 = ((A+B)^2 - (A-B)^2))/4
                                ;--------
                                ; 66(14/0)

    ; ...
    dc.l 9>>2
    dc.l 4>>2
    dc.l 1>>2
squares: ; <- a5 points here
    dc.l 0>>2
    dc.l 1>>2
    dc.l 4>>2
    ; ...

Galahad/FLT · 24 April 2017, 00:47

Quote:

Originally Posted by paraj

Ye olde amycoders website has a tutorial on avoiding multiplications. Looking at the "using squares" part, I'm having trouble finding an immediate use. All the other tips are more or less "drop-in" replacements for their slower counterparts. While I can certainly think of some uses if one is willing to sacrifice precision, it seems more of an advanced building block compared to the other tricks. Am I missing something?

I'm talking ~7MHz 68000 here, while I know 020+ has (aX,dY*4) and probably other stuff that helps don't they multiply fast enough for this trick not to matter?

Below is my attempt doing this in code. As usual there's probably errors, inefficiencies and wrong cycle counts, but in general it seems that the target of 38 (or even 50) seems long out of reach.

EDIT: Hmm.. testing shows that folding the shift into the table doesn't lose precision. It's probably easy to see why, but it'll have to wait and it's still not fast

Code:

; d0 *= d1, assumes a5=square table, d6 used as scratch
; Instruction to beat: muls.w d1, d0 38-70(1/0)
        move.w  d0, d6          ;  4(1/0)
        add.w   d1, d0          ;  4(1/0)
	add.w	d0, d0          ;  4(1/0)
	add.w	d0, d0          ;  4(1/0)
	move.l	(a5,d0.w), d0   ; 18(4/0) d0 = (A+B)^2
        sub.w   d1, d6          ;  4(1/0)
	add.w	d6, d6          ;  4(1/0)
	add.w	d6, d6          ;  4(1/0)
	sub.l	(a5,d6.w), d0   ; 20(3/0) d0 = (A+B)^2 - (A-B)^2
	; see edit asr.l	#2, d0	        ; 12(1/0) d0 = ((A+B)^2 - (A-B)^2))/4
                                ;--------
                                ; 66(14/0)

    ; ...
    dc.l 9>>2
    dc.l 4>>2
    dc.l 1>>2
squares: ; <- a5 points here
    dc.l 0>>2
    dc.l 1>>2
    dc.l 4>>2
    ; ...

Surely where you have the 2x add.w d0,d0, it would be quicker to do LSL.W #2,d0 and the same again for the 2x add.w d6,d6 being LSL.W #2,d6 ?

paraj · 24 April 2017, 01:11

Quote:

Originally Posted by Galahad/FLT

Surely where you have the 2x add.w d0,d0, it would be quicker to do LSL.W #2,d0 and the same again for the 2x add.w d6,d6 being LSL.W #2,d6 ?

2 x add.w is 8(2/0), lsl is 6+2*2=10(1/0) (according to the instruction references I'm currently looking at). So lsl is probably better when DMA is contended (and it's shorter), but it might be close. I mostly did it because it's mentioned in the very first post of this thread and I didn't want to be scolded

a/b · 24 April 2017, 07:51

muls is 56 on average
2x18 to fetch LW data + 4x4 to align access = 52, so you are about even without all the other instructions

If you are dealing with limited range, say approx. 256x256 *and* you include >>2 in your table, you could use words instead.
2x14 fetch + 2x4 align = 36

Something like:

Code:

	add.w	d0,d0
	add.w	d1,d1
	move.w	d0,d6
	add.w	d1,d0
	move.w	(a5,d0.w),d0
	sub.w	d1,d6
	sub.w	(a5,d6.w),d0

That's 5x4 + 2x14 = 48 so it could work. In general case (16-bit input data) you would need a rather large table so doing a muls instead is a good compromise.

paraj · 24 April 2017, 09:43

Quote:

Originally Posted by a/b

muls is 56 on average
2x18 to fetch LW data + 4x4 to align access = 52, so you are about even without all the other instructions

If you are dealing with limited range, say approx. 256x256 *and* you include >>2 in your table, you could use words instead.
2x14 fetch + 2x4 align = 36

Something like:

Code:

	add.w	d0,d0
	add.w	d1,d1
	move.w	d0,d6
	add.w	d1,d0
	move.w	(a5,d0.w),d0
	sub.w	d1,d6
	sub.w	(a5,d6.w),d0

That's 5x4 + 2x14 = 48 so it could work. In general case (16-bit input data) you would need a rather large table so doing a muls instead is a good compromise.

Yeah, that could work, but if the input is limited to $100 won't the worst case for MULS.W be (38+8*2) 54 cycles (46 average), in which case we're back to muls being hard to beat?

I was thinking of using the optimization in a bog-standard vector/matrix multiplication routine, so some of the cost could be amortized, but I still have a hard time gaining anything (register pressure is also increased and the code has to be changed to accommodate the optimization).

I.e. when doing X' = MUL(MXX, VX) + MUL(MXY, VY) + MUL(MXZ, VZ), M?? can be premultiplied by 2/4 (the pointer to the square table could also be added). And as VX/VY/VZ are used 3 times, we can save 2*2 scalings for those.

Something like this, but again the gain seems marginal at best:

Code:

        ; assume \1.l is premultipled by 4 (amortized cost=2*4/3)
        ;        \2 is premultipled by 4 and added with a5 (amortized cost=3*9*4/npoints)
        ;        \3 dest
        ;        a6 scratch
        ;        \1 datareg
        ;        \2 any ea
        ;        \3 should be datareg

        move.l  \2, a6          ;  4(1/0) + ea [e.g. 12 for (d16,An), 0 for registers]
        move.l  (a6,\1.w), \3   ; 18(4/0)
        suba.w  \1, a6          ;  8(1/0)
        sub.l   (a6), \3        ; 14(3/0)
                                ; -------
                                ; 44(9/0) + ea + (8/3 + 108/npoints ~ 3)

ross · 24 April 2017, 11:53

Hi paraj, long time ago I came across the same question...
My results? Use mulu/muls, but whit care.

In pure 68k mul(x) is predictable: the algorithm requires 38+2n+address_calculation clocks where n is defined as:
MULU: n = the number of ones in the source
MULS: n = concatenate the source with a zero as the LSB; n is the resultant number of 10 plus number of 01 patterns in the 17-bit source.

So: destination must have max bit resolution (16), source can be minor.

Suppose a 16x16 multiply but in source 8 bit precision suffice.
With MULx dx=xxxxxxxx00000000,dy=xxxxxxxxxxxxxxxx min cycle is 38 and max 54!
[worst case MULU dx=#$FF00->1111111100000000->38+8*2, worst case MULS dx=#$5500->0101010100000000->38+4*2+4*2]

Cheers,
ross

paraj · 24 April 2017, 12:43

Quote:

Originally Posted by ross

Hi paraj, long time ago I came across the same question...
My results? Use mulu/muls, but whit care.

In pure 68k mul(x) is predictable: the algorithm requires 38+2n+address_calculation clocks where n is defined as:
MULU: n = the number of ones in the source
MULS: n = concatenate the source with a zero as the LSB; n is the resultant number of 10 plus number of 01 patterns in the 17-bit source.

So: destination must have max bit resolution (16), source can be minor.

Suppose a 16x16 multiply but in source 8 bit precision suffice.
With MULx dx=xxxxxxxx00000000,dy=xxxxxxxxxxxxxxxx min cycle is 38 and max 54!
[worst case MULU dx=#$FF00->1111111100000000->38+8*2, worst case MULS dx=#$5500->0101010100000000->38+4*2+4*2]

Cheers,
ross

A yeah, that's a good point. For non-scaling transformations the matrix elements will usually have a magnitude of <= 1 ($100 for 8.8 fixpoint), while the vectors also probably have a known range. So it should be easy to decide the best order.

Too bad there isn't an easy way to speed up multiplication like there is with division

at least it's not too slow...

Kalms · 25 April 2017, 08:32

Not sure if this helps, but you can change the logic of the table lookup approach to 'setup code, move.l from table, neg.w table offset, sub.l from table'. If you are doing matrix operations you can re-use some of the setup code across many multiplies. If you are transforming vertices you can also pre-shift your input data. If precision is not very important to you then you can do the transform in logarithmic space and thereby get down to 1 table lookup per multiply.

paraj · 26 April 2017, 20:16

Quote:

Originally Posted by Kalms

Not sure if this helps, but you can change the logic of the table lookup approach to 'setup code, move.l from table, neg.w table offset, sub.l from table'. If you are doing matrix operations you can re-use some of the setup code across many multiplies. If you are transforming vertices you can also pre-shift your input data. If precision is not very important to you then you can do the transform in logarithmic space and thereby get down to 1 table lookup per multiply.

Hi Kalms,

I was already considering re-using the setup code for the matrix across many operations, but couldn't get the cycle down due to too many expensive uses of long words, but your input made me rethink some ideas, and now I think a decent saving could be achieved.

My transform loop has the following structure:

Code:

; a0 = dest (3 x cnt x 8.8 fixpoint words)
; a1 = src   ( --- "" ---)
; a2 = mat  ( 3 x 4 x 8.8 fixpoint words)
; d7 = cnt-1
.loop:
        move.l  a2, a3          ; reload matrix pointer

        move.w  (a0)+, d0       ;  8(2/0) d0 = X
        move.w  (a0)+, d1       ;  8(2/0) d1 = Y
        move.w  (a0)+, d2       ;  8(2/0) d2 = Z
        ; Process X
        ; ...
        ; Process Y
        ; ...
        ; Process Z (can trash d0-d2 here)
        ; ...
        dbf d7, .loop

Focusing on the "Process X" part my code currently looks as follows (should be fairly standard I think):

Code:

        move.w  d0, d4          ;   4(1/0)      d4 = X
        muls.w  (a3)+, d4       ;  42(2/0)+     d4 = X*XX (16.16 fixpoint)
        move.w  d1, d5          ;   4(1/0)      d5 = Y
        muls.w  (a3)+, d5       ;  42(2/0)+     d5 = Y*XY
        add.l   d5, d4          ;   8(1/0)      d4 = X*XX + Y*XY
        move.w  d2, d5          ;   4(1/0)      d5 = Z
        muls.w  (a3)+, d5       ;  42(2/0)+     d5 = Z*XZ
        add.l   d5, d4          ;   8(1/0)      d4 = X*XX + Y*XY + Z*XZ
        asr.l   #8, d4          ;  24(1/0)      convert back to 8.8
        move.w  (a3)+, d5       ;   8(2/0)      d5 = XW (translation)
        addx.w  d5, d4          ;   4(1/0)      d4 = X*XX + Y*XY + Z*XZ + XW
        move.w  d4, (a0)+       ;   8(1/1)      store transformed coordinate
                                ; --------
                                ; 198+

Now pre-multiplying the input vectors by 2 (cheap even if they're not static) and changing the matrix format from:

Code:

XX.w XY.w XZ.w XW.w
YX.w YY.w YZ.w YW.w
ZX.w ZY.w ZZ.w ZW.w

to:

Code:

&squares[XX].l &squares[XY].l &squares[XZ].l XW.w
&squares[YX].l ...

where squares contains (n*n)>>10

Processing X could be done like this:

Code:

        move.l  (a3)+, a4       ;  12(3/0)      a4 = &squares[XX]
        move.w  (a4,d0.w), d4   ;  14(3/0)      d4 = (XX+X)^2
        neg.w   d0              ;   4(1/0)
        sub.w   (a4,d0.w), d4   ;  14(3/0)      d4 = (XX+X)^2 - (XX-X)^2 = X*XX
        neg.w   d0              ;   4(1/0)      restore d0 for processing Y
        move.l  (a3)+, a4       ;  12(3/0)      a4 = &squares[XY]
        add.w   (a4,d1.w), d4   ;  14(3/0)      d4 = X*XX + (XY+Y)^2
        neg.w   d1              ;   4(1/0)
        sub.w   (a4,d1.w), d4   ;  14(3/0)      d4 = X*XX + (XY+Y)^2 - (XY-Y)^2 = X*XX+Y*XY
        neg.w   d1              ;   4(1/0)      restore d1 for processing Y
        move.l  (a3)+, a4       ;  12(3/0)      a4 = &squares[XZ]
        add.w   (a4,d2.w), d4   ;  14(3/0)      d4 = X*XX + Y*XY + (XZ+Z)^2
        neg.w   d2              ;   4(1/0)
        sub.w   (a4,d2.w), d4   ;  14(3/0)      d4 = X*XX + Y*XY + (XZ+Z)^2 - (XZ-Z)^2 = X*XX+Y*XY+Z*XZ
        neg.w   d2              ;   4(1/0)      restore d2 for processing Y
        add.w   (a3)+, d4       ;  14(3/0)      d4 = X*XX + Y*XY + Z*XZ + XW
        move.w  d4, (a0)+       ;   8(1/1)
                                ;---------
                                ; 166

And it's starting to look OK in terms of cycles won vs. loss of precision!

Doing the transforms in logarithmic space is definitely also an interesting option, I think I'll have to look into later. Now I at least have some baselines to compare cycle count precision against.

EDIT: while redoing the cycle timing i also realized that a lot of the neg.w's could be omitted by inverting the add/sub sequence for Y..

Kalms · 28 April 2017, 02:00

Nice! You can get further still if you think in terms of data parallel computation and try all different permutations of the nested loops and input data format changes that you can think of, with the aim of minimizing unnecessary logic beside the 6 table lookups themselves. Consider if self modifying code could become useful as well.

paraj · 01 May 2017, 11:00

Quote:

Originally Posted by Kalms

Nice! You can get further still if you think in terms of data parallel computation and try all different permutations of the nested loops and input data format changes that you can think of, with the aim of minimizing unnecessary logic beside the 6 table lookups themselves. Consider if self modifying code could become useful as well.

Once again, massive thanks for the hints!

Too many years of self modifying code being verbotten has crippled my optimization skills I think

I have to admit I don't quite understand what you're getting at with 'data parallel computations' though. Do you mean like using both the upper and lower word simultaneously?

For ayone curious with the above tricks and the square table in low memory (16-bit addressable) the following loop is possible:

Code:

                movem.w (a1)+, a2-a4    ; 24(7/0)

                ; X
.offXXp:        move.w  $1234(a2), d0   ; 12(3/0)
.offXXm:        sub.w   $1234(a2), d0   ; 12(3/0)
.offXYp:        add.w   $1234(a3), d0   ; 12(3/0)
.offXYm:        sub.w   $1234(a3), d0   ; 12(3/0)
.offXZp:        add.w   $1234(a4), d0   ; 12(3/0)
.offXZm:        sub.w   $1234(a4), d0   ; 12(3/0)
.offXW:         add.w   #$1234, d0      ;  8(2/0)
                move.w  d0, (a0)+       ;  8(1/1)
                                        ; -------
                                        ; 88
                ; Y & Z handled similarly

with the points prescaled by 2 and the offset at .offXXp being &squares[XX].w and .offXXm &squares[-XX].w and so on.

EDIT: My simple 1-plane transform/project/putpixel test went from being able to handle 140 points to 237 in one frame just by applying the above optimization - pretty decent gain!

And of course the square table should be used for projection as well - &squares[$10000/z].w seems like a nice format for the reciprocal table.

meynaf · 11 May 2017, 12:05

As this thread is for generic 68000 optimizations, here is another.
(Actually this one is valid for just about any cpu

)

You can replace :

Code:

 not.w d0
 and.w d0,d1
 not.w d0

By :

Code:

 or.w d0,d1
 eor.w d0,d1

(Makes "bit clear" / "and not" instruction proposals a lot less sexy doesn't it ?)

paraj · 11 May 2017, 20:31

Quote:

Originally Posted by meynaf

As this thread is for generic 68000 optimizations,

Sorry if the earlier MUL discussion ended up being too specific, that's the opposite of my intend, but I guess it's hard to be generic when tables are involved.

And nice little bitwise identity I hadn't seen before!

mikro · 05 February 2018, 01:10

My humble contribution.

Sometimes I need to isolate the carry:

Code:

subx.b	d1,d1	;
neg.b	d1	;
add.b   d1,d0   ; add carry do d0's lower byte

Code:

subx.w	d1,d1	;
negx.b	d1	;
sub.w   d1,d0   ; add carry to d0's upper byte

Nice feature is that you can use use long size if needed, making it adda/suba friendly.

Gorf · 16 February 2018, 14:11

Multiplication by a constant can be done very efficiently. So if you know by what factor you are going to multiply, you could use one of these tricks:

multiply d0 by 29:

Code:

move.l  d0, d1
lsl.l   #4, d0
sub.l   d1, d0
add.l   d0, d0
sub.l   d1, d0

multiply d0 by 625:

Code:

move.l  d0, d1
lsl.l   #2, d0
add.l   d1, d0
lsl.l   #3, d0
sub.l   d1, d0
lsl.l   #4, d0
add.l   d1, d0

multiply d0 by 156:

Code:

move.l  d0, d1
lsl.l   #2, d1
add.l   d1, d0
lsl.l   #5, d0
sub.l   d1, d0

dissident · 21 November 2018, 10:20

Quote:

Originally Posted by Gorf

Multiplication by a constant can be done very efficiently. So if you know by what factor you are going to multiply, you could use one of these tricks:

multiply d0 by 29:

Code:

move.l  d0, d1
lsl.l   #4, d0
sub.l   d1, d0
add.l   d0, d0
sub.l   d1, d0

multiply d0 by 625:

Code:

move.l  d0, d1
lsl.l   #2, d0
add.l   d1, d0
lsl.l   #3, d0
sub.l   d1, d0
lsl.l   #4, d0
add.l   d1, d0

multiply d0 by 156:

Code:

move.l  d0, d1
lsl.l   #2, d1
add.l   d1, d0
lsl.l   #5, d0
sub.l   d1, d0

I've attached my collection of these multiply tricks based on the MC68000 execution times (including these three cases above) as macros, so they can be easily used in every sourcecode.

NorthWay · 02 December 2018, 06:27

Quote:

Originally Posted by pmc

Code:

                     lea                 vals(pc),An
                    movem.w             (An),Dn-Dn

gets you the ext.l's for free

I didn't know this before I read this thread again recently.

Which reminded me of something; once upon a time there was a thing called the GNU(gcc?) super-optimizer that supposedly knew about all sorts of of tricks to get things done in as few instructions as possible (and I think IBM was very proud of the Power instruction set as it had been proven to be possible to do 'all' of the instruction optimization equivalents in a single opcode).
Does anyone know if that is still a thing, or if it has been folded into the mainline product? I'm guessing there would have to be a description of what it did for each and every architecture - possibly some interesting stuff there.

PeterJ · 27 April 2019, 00:02

sorry to wake up this old thread,
i use it from time to time, getting to old to remember, so i have to see these tips every time i start coding again.

Quote:

Originally Posted by frank_b

Code:

       move.b  d0,-(sp)
       move.w (sp)+,d0  ; magic fast shift :)

i do get some garbage in the lower byte when using this.

one can also use this trick to scroll back(i don't know if its as fast.

Code:

	Moveq	#0,D0
	move.w	#$2200,-(SP)
	Move.b  (Sp)+,D0

End result in d0 is #$22

ross · 27 April 2019, 00:39

Quote:

Originally Posted by PeterJ

sorry to wake up this old thread,
i use it from time to time, getting to old to remember, so i have to see these tips every time i start coding again.

And if you want 'consistent'

lsl #8

you can:

Code:

    lea spare(pc),a0
    clr.w   (a0)
    ....

    move.b  d0,(a0)    ;8
    move.w  (a0),d0    ;8

    ....
spare   ds.w    1

16 cycles instead of 22 in bare 68k

PeterJ · 27 April 2019, 00:57

Many Thanks trying to optimize an old Atari ST game.
it was a shi..y arcade conversion(outrun) for both the ST and the Amiga.

just want to see how it would run if it had got some more love

and it uses a lot of lsr #8 and Lsl #8 so the about is just perfect

23 April 2017, 23:35	#141
paraj Registered User Join Date: Feb 2017 Location: Denmark Posts: 1,098	Ye olde amycoders website has a tutorial on avoiding multiplications. Looking at the "using squares" part, I'm having trouble finding an immediate use. All the other tips are more or less "drop-in" replacements for their slower counterparts. While I can certainly think of some uses if one is willing to sacrifice precision, it seems more of an advanced building block compared to the other tricks. Am I missing something? I'm talking ~7MHz 68000 here, while I know 020+ has (aX,dY4) and probably other stuff that helps don't they multiply fast enough for this trick not to matter? Below is my attempt doing this in code. As usual there's probably errors, inefficiencies and wrong cycle counts, but in general it seems that the target of 38 (or even 50) seems long out of reach. EDIT: Hmm.. testing shows that folding the shift into the table doesn't lose precision. It's probably easy to see why, but it'll have to wait and it's still not fast Code: ; d0 = d1, assumes a5=square table, d6 used as scratch ; Instruction to beat: muls.w d1, d0 38-70(1/0) move.w d0, d6 ; 4(1/0) add.w d1, d0 ; 4(1/0) add.w d0, d0 ; 4(1/0) add.w d0, d0 ; 4(1/0) move.l (a5,d0.w), d0 ; 18(4/0) d0 = (A+B)^2 sub.w d1, d6 ; 4(1/0) add.w d6, d6 ; 4(1/0) add.w d6, d6 ; 4(1/0) sub.l (a5,d6.w), d0 ; 20(3/0) d0 = (A+B)^2 - (A-B)^2 ; see edit asr.l #2, d0 ; 12(1/0) d0 = ((A+B)^2 - (A-B)^2))/4 ;-------- ; 66(14/0) ; ... dc.l 9>>2 dc.l 4>>2 dc.l 1>>2 squares: ; <- a5 points here dc.l 0>>2 dc.l 1>>2 dc.l 4>>2 ; ... Last edited by paraj; 24 April 2017 at 00:00.

24 April 2017, 07:51	#144
a/b Registered User Join Date: Jun 2016 Location: europe Posts: 1,038	muls is 56 on average 2x18 to fetch LW data + 4x4 to align access = 52, so you are about even without all the other instructions If you are dealing with limited range, say approx. 256x256 and you include >>2 in your table, you could use words instead. 2x14 fetch + 2x4 align = 36 Something like: Code: add.w d0,d0 add.w d1,d1 move.w d0,d6 add.w d1,d0 move.w (a5,d0.w),d0 sub.w d1,d6 sub.w (a5,d6.w),d0 That's 5x4 + 2x14 = 48 so it could work. In general case (16-bit input data) you would need a rather large table so doing a muls instead is a good compromise.

28 April 2017, 02:00	#150
Kalms Registered User Join Date: Nov 2006 Location: Stockholm, Sweden Posts: 237	Nice! You can get further still if you think in terms of data parallel computation and try all different permutations of the nested loops and input data format changes that you can think of, with the aim of minimizing unnecessary logic beside the 6 table lookups themselves. Consider if self modifying code could become useful as well. Last edited by Kalms; 28 April 2017 at 02:12.

11 May 2017, 12:05	#152
meynaf son of 68k Join Date: Nov 2007 Location: Lyon / France Age: 51 Posts: 5,322	As this thread is for generic 68000 optimizations, here is another. (Actually this one is valid for just about any cpu ) You can replace : Code: not.w d0 and.w d0,d1 not.w d0 By : Code: or.w d0,d1 eor.w d0,d1 (Makes "bit clear" / "and not" instruction proposals a lot less sexy doesn't it ?)

05 February 2018, 01:10	#154
mikro Registered User Join Date: Oct 2007 Location: Bratislava / Slovakia Posts: 34	My humble contribution. Sometimes I need to isolate the carry: Code: subx.b d1,d1 ; neg.b d1 ; add.b d1,d0 ; add carry do d0's lower byte Code: subx.w d1,d1 ; negx.b d1 ; sub.w d1,d0 ; add carry to d0's upper byte Nice feature is that you can use use long size if needed, making it adda/suba friendly. Last edited by mikro; 16 February 2018 at 11:12.

24 April 2017, 11:53	#146
ross Defendit numerus Join Date: Mar 2017 Location: Crossing the Rubicon Age: 53 Posts: 4,468	Hi paraj, long time ago I came across the same question... My results? Use mulu/muls, but whit care. In pure 68k mul(x) is predictable: the algorithm requires 38+2n+address_calculation clocks where n is defined as: MULU: n = the number of ones in the source MULS: n = concatenate the source with a zero as the LSB; n is the resultant number of 10 plus number of 01 patterns in the 17-bit source. So: destination must have max bit resolution (16), source can be minor. Suppose a 16x16 multiply but in source 8 bit precision suffice. With MULx dx=xxxxxxxx00000000,dy=xxxxxxxxxxxxxxxx min cycle is 38 and max 54! [worst case MULU dx=#$FF00->1111111100000000->38+82, worst case MULS dx=#$5500->0101010100000000->38+42+4*2] Cheers, ross

25 April 2017, 08:32	#148
Kalms Registered User Join Date: Nov 2006 Location: Stockholm, Sweden Posts: 237	Not sure if this helps, but you can change the logic of the table lookup approach to 'setup code, move.l from table, neg.w table offset, sub.l from table'. If you are doing matrix operations you can re-use some of the setup code across many multiplies. If you are transforming vertices you can also pre-shift your input data. If precision is not very important to you then you can do the transform in logarithmic space and thereby get down to 1 table lookup per multiply.

16 February 2018, 14:11	#155
Gorf Registered User Join Date: May 2017 Location: Munich/Bavaria Posts: 2,294	Multiplication by a constant can be done very efficiently. So if you know by what factor you are going to multiply, you could use one of these tricks: multiply d0 by 29: Code: move.l d0, d1 lsl.l #4, d0 sub.l d1, d0 add.l d0, d0 sub.l d1, d0 multiply d0 by 625: Code: move.l d0, d1 lsl.l #2, d0 add.l d1, d0 lsl.l #3, d0 sub.l d1, d0 lsl.l #4, d0 add.l d1, d0 multiply d0 by 156: Code: move.l d0, d1 lsl.l #2, d1 add.l d1, d0 lsl.l #5, d0 sub.l d1, d0

27 April 2019, 00:57	#160
PeterJ Registered User Join Date: Feb 2015 Location: Copehagen Posts: 36	Many Thanks trying to optimize an old Atari ST game. it was a shi..y arcade conversion(outrun) for both the ST and the Amiga. just want to see how it would run if it had got some more love and it uses a lot of lsr #8 and Lsl #8 so the about is just perfect Last edited by PeterJ; 27 April 2019 at 01:04.

Currently Active Users Viewing This Thread: 1 (0 members and 1 guests)

Similar Threads
Thread	Thread Starter	Forum	Replies	Last Post
68000 boot code	billt	Coders. General	15	05 May 2012 20:13
Wasted Dreams on 68000	sanjyuubi	support.Games	5	27 May 2011 17:11
680x0 to 68000	Counia	Hardware mods	1	01 March 2011 10:18
quitting on 68000?	Hungry Horace	project.WHDLoad	60	19 December 2006 20:17
3D code and/or internet code for Blitz Basic 2.1	EdzUp	Retrogaming General Discussion	0	10 February 2002 11:40