English Amiga Board


Go Back   English Amiga Board > Coders > Coders. Asm / Hardware

 
 
Old 24 April 2017, 00:35   #141
paraj
Registered User

 
Join Date: Feb 2017
Location: Denmark
Posts: 78
Ye olde amycoders website has a tutorial on avoiding multiplications. Looking at the "using squares" part, I'm having trouble finding an immediate use. All the other tips are more or less "drop-in" replacements for their slower counterparts. While I can certainly think of some uses if one is willing to sacrifice precision, it seems more of an advanced building block compared to the other tricks. Am I missing something?

I'm talking ~7MHz 68000 here; while I know 020+ has (aX,dY*4) and probably other stuff that helps, don't they multiply fast enough for this trick not to matter?

Below is my attempt at doing this in code. As usual there are probably errors, inefficiencies and wrong cycle counts, but in general the target of 38 (or even 50) cycles seems far out of reach.

EDIT: Hmm.. testing shows that folding the shift into the table doesn't lose precision. It's probably easy to see why, but it'll have to wait, and it's still not fast

Code:
; d0 *= d1, assumes a5=square table, d6 used as scratch
; Instruction to beat: muls.w d1, d0 38-70(1/0)
        move.w  d0, d6          ;  4(1/0)
        add.w   d1, d0          ;  4(1/0)
	add.w	d0, d0          ;  4(1/0)
	add.w	d0, d0          ;  4(1/0)
	move.l	(a5,d0.w), d0   ; 18(4/0) d0 = (A+B)^2
        sub.w   d1, d6          ;  4(1/0)
	add.w	d6, d6          ;  4(1/0)
	add.w	d6, d6          ;  4(1/0)
	sub.l	(a5,d6.w), d0   ; 20(3/0) d0 = (A+B)^2 - (A-B)^2
	; see edit asr.l	#2, d0	        ; 12(1/0) d0 = ((A+B)^2 - (A-B)^2))/4
                                ;--------
                                ; 66(14/0)

    ; ...
    dc.l 9>>2
    dc.l 4>>2
    dc.l 1>>2
squares: ; <- a5 points here
    dc.l 0>>2
    dc.l 1>>2
    dc.l 4>>2
    ; ...
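The EDIT above can be cross-checked quickly. This is a Python sketch of the identity behind the table, not 68k code; the function and table names are mine, and the tested range is an arbitrary choice:

```python
# Square table with the >>2 folded in, mirroring the dc.l entries above.
# Indices run negative as well, like the words placed before 'squares'.
LIMIT = 128
table = {n: (n * n) >> 2 for n in range(-2 * LIMIT, 2 * LIMIT + 1)}

def mul_via_squares(a, b):
    # A*B = ((A+B)^2 - (A-B)^2) / 4; with the /4 folded into the table
    # the floor errors of the two lookups cancel exactly, because
    # (A+B)^2 and (A-B)^2 always leave the same remainder mod 4
    # (A+B and A-B have the same parity).
    return table[a + b] - table[a - b]

assert all(mul_via_squares(a, b) == a * b
           for a in range(-LIMIT, LIMIT)
           for b in range(-LIMIT, LIMIT))
```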

Last edited by paraj; 24 April 2017 at 01:00.
paraj is offline  
Old 24 April 2017, 01:47   #142
Galahad/FLT
Going nowhere

Galahad/FLT's Avatar
 
Join Date: Oct 2001
Location: United Kingdom
Age: 45
Posts: 7,070
Quote:
Originally Posted by paraj View Post
Ye olde amycoders website has a tutorial on avoiding multiplications. [...]
Surely where you have the 2x add.w d0,d0, it would be quicker to do LSL.W #2,d0, and the same again for the 2x add.w d6,d6 being LSL.W #2,d6?
Galahad/FLT is offline  
Old 24 April 2017, 02:11   #143
paraj
Registered User

 
Join Date: Feb 2017
Location: Denmark
Posts: 78
Quote:
Originally Posted by Galahad/FLT View Post
Surely where you have the 2x add.w d0,d0, it would be quicker to do LSL.W #2,d0 and the same again for the 2x add.w d6,d6 being LSL.W #2,d6 ?
2 x add.w is 8(2/0), lsl is 6+2*2=10(1/0) (according to the instruction references I'm currently looking at). So lsl is probably better when DMA is contended (and it's shorter), but it might be close. I mostly did it because it's mentioned in the very first post of this thread and I didn't want to be scolded
paraj is offline  
Old 24 April 2017, 08:51   #144
a/b
Registered User

 
Join Date: Jun 2016
Location: europe
Posts: 77
muls is 56 on average
2x18 to fetch LW data + 4x4 to align access = 52, so you are about even without all the other instructions

If you are dealing with limited range, say approx. 256x256 *and* you include >>2 in your table, you could use words instead.
2x14 fetch + 2x4 align = 36

Something like:
Code:
	add.w	d0,d0
	add.w	d1,d1
	move.w	d0,d6
	add.w	d1,d0
	move.w	(a5,d0.w),d0
	sub.w	d1,d6
	sub.w	(a5,d6.w),d0
That's 5x4 + 2x14 = 48, so it could work. In the general case (16-bit input data) you would need a rather large table, so doing a muls instead is a good compromise.
a/b is offline  
Old 24 April 2017, 10:43   #145
paraj
Registered User

 
Join Date: Feb 2017
Location: Denmark
Posts: 78
Quote:
Originally Posted by a/b View Post
muls is 56 on average [...]
Yeah, that could work, but if the input is limited to $100 won't the worst case for MULS.W be (38+8*2) 54 cycles (46 average), in which case we're back to muls being hard to beat?

I was thinking of using the optimization in a bog-standard vector/matrix multiplication routine, so some of the cost could be amortized, but I still have a hard time gaining anything (register pressure is also increased and the code has to be changed to accommodate the optimization).

I.e. when doing X' = MUL(MXX, VX) + MUL(MXY, VY) + MUL(MXZ, VZ), M?? can be premultiplied by 2/4 (the pointer to the square table could also be added). And as VX/VY/VZ are used 3 times, we can save 2*2 scalings for those.

Something like this, but again the gain seems marginal at best:

Code:
        ; assume \1.l is premultiplied by 4 (amortized cost=2*4/3)
        ;        \2 is premultiplied by 4 and added with a5 (amortized cost=3*9*4/npoints)
        ;        \3 dest
        ;        a6 scratch
        ;        \1 datareg
        ;        \2 any ea
        ;        \3 should be datareg

        move.l  \2, a6          ;  4(1/0) + ea [e.g. 12 for (d16,An), 0 for registers]
        move.l  (a6,\1.w), \3   ; 18(4/0)
        suba.w  \1, a6          ;  8(1/0)
        sub.l   (a6), \3        ; 14(3/0)
                                ; -------
                                ; 44(9/0) + ea + (8/3 + 108/npoints ~ 3)
paraj is offline  
Old 24 April 2017, 12:53   #146
ross
Sum, ergo Cogito

ross's Avatar
 
Join Date: Mar 2017
Location: Crossing the Rubicon
Age: 48
Posts: 1,363
Hi paraj, a long time ago I came across the same question...
My results? Use mulu/muls, but with care.

In pure 68k mul(x) is predictable: the algorithm requires 38+2n+address_calculation clocks where n is defined as:
MULU: n = the number of ones in the source
MULS: n = concatenate the source with a zero as the LSB; n is the resultant number of 10 plus number of 01 patterns in the 17-bit source.

So: destination must have max bit resolution (16), source can be minor.

Suppose a 16x16 multiply but in source 8 bit precision suffice.
With MULx dx=xxxxxxxx00000000,dy=xxxxxxxxxxxxxxxx min cycle is 38 and max 54!
[worst case MULU dx=#$FF00->1111111100000000->38+8*2, worst case MULS dx=#$5500->0101010100000000->38+4*2+4*2]
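For anyone who wants to sanity-check the rule, here is a hedged Python transcription of it (function names are mine, not from any reference):

```python
def mulu_cycles(src):
    # MULU: 38 + 2n, where n = number of set bits in the 16-bit source
    return 38 + 2 * bin(src & 0xFFFF).count("1")

def muls_cycles(src):
    # MULS: append a zero as the LSB, then count adjacent "10" and "01"
    # bit pairs in the resulting 17-bit value
    bits = (src & 0xFFFF) << 1
    n = sum(1 for i in range(16) if (bits >> i) & 3 in (0b01, 0b10))
    return 38 + 2 * n

assert mulu_cycles(0x0000) == 38
assert mulu_cycles(0xFF00) == 54   # worst case quoted above
assert muls_cycles(0x5500) == 54   # worst case quoted above
```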

Cheers,
ross
ross is offline  
Old 24 April 2017, 13:43   #147
paraj
Registered User

 
Join Date: Feb 2017
Location: Denmark
Posts: 78
Quote:
Originally Posted by ross View Post
Hi paraj, long time ago I came across the same question... [...]
Ah yeah, that's a good point. For non-scaling transformations the matrix elements will usually have a magnitude of <= 1 ($100 for 8.8 fixpoint), while the vectors also probably have a known range. So it should be easy to decide the best order.

Too bad there isn't an easy way to speed up multiplication like there is with division; at least it's not too slow...
paraj is offline  
Old 25 April 2017, 09:32   #148
Kalms
Registered User
 
Join Date: Nov 2006
Location: Stockholm, Sweden
Posts: 198
Not sure if this helps, but you can change the logic of the table lookup approach to 'setup code, move.l from table, neg.w table offset, sub.l from table'. If you are doing matrix operations you can re-use some of the setup code across many multiplies. If you are transforming vertices you can also pre-shift your input data. If precision is not very important to you then you can do the transform in logarithmic space and thereby get down to 1 table lookup per multiply.
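The logarithmic-space idea at the end can be prototyped quickly. This is a rough Python sketch only; the table sizes, the scale factor, and the name `log_mul` are arbitrary choices, and precision drops to well under a percent rather than being exact:

```python
import math

SCALE = 256     # fixed-point bits for the log index
MAXV = 4096     # largest input the tables handle
LOG = [0] + [round(math.log2(n) * SCALE) for n in range(1, MAXV)]
EXP = [round(2 ** (i / SCALE)) for i in range(2 * LOG[-1] + 2)]

def log_mul(a, b):
    # one add + one table lookup replaces the multiply
    # (positive inputs only; signs would be handled separately)
    return EXP[LOG[a] + LOG[b]]
```

For example, log_mul(100, 200) lands within a fraction of a percent of 20000, which is the kind of trade-off the suggestion describes.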
Kalms is offline  
Old 26 April 2017, 21:16   #149
paraj
Registered User

 
Join Date: Feb 2017
Location: Denmark
Posts: 78
Quote:
Originally Posted by Kalms View Post
Not sure if this helps, but you can change the logic of the table lookup approach to 'setup code, move.l from table, neg.w table offset, sub.l from table'. If you are doing matrix operations you can re-use some of the setup code across many multiplies. If you are transforming vertices you can also pre-shift your input data. If precision is not very important to you then you can do the transform in logarithmic space and thereby get down to 1 table lookup per multiply.
Hi Kalms,

I was already considering re-using the setup code for the matrix across many operations, but couldn't get the cycle down due to too many expensive uses of long words, but your input made me rethink some ideas, and now I think a decent saving could be achieved.

My transform loop has the following structure:
Code:
; a0 = dest (3 x cnt x 8.8 fixpoint words)
; a1 = src   ( --- "" ---)
; a2 = mat  ( 3 x 4 x 8.8 fixpoint words)
; d7 = cnt-1
.loop:
        move.l  a2, a3          ; reload matrix pointer

        move.w  (a1)+, d0       ;  8(2/0) d0 = X
        move.w  (a1)+, d1       ;  8(2/0) d1 = Y
        move.w  (a1)+, d2       ;  8(2/0) d2 = Z
        ; Process X
        ; ...
        ; Process Y
        ; ...
        ; Process Z (can trash d0-d2 here)
        ; ...
        dbf d7, .loop
Focusing on the "Process X" part my code currently looks as follows (should be fairly standard I think):
Code:
        move.w  d0, d4          ;   4(1/0)      d4 = X
        muls.w  (a3)+, d4       ;  42(2/0)+     d4 = X*XX (16.16 fixpoint)
        move.w  d1, d5          ;   4(1/0)      d5 = Y
        muls.w  (a3)+, d5       ;  42(2/0)+     d5 = Y*XY
        add.l   d5, d4          ;   8(1/0)      d4 = X*XX + Y*XY
        move.w  d2, d5          ;   4(1/0)      d5 = Z
        muls.w  (a3)+, d5       ;  42(2/0)+     d5 = Z*XZ
        add.l   d5, d4          ;   8(1/0)      d4 = X*XX + Y*XY + Z*XZ
        asr.l   #8, d4          ;  24(1/0)      convert back to 8.8
        move.w  (a3)+, d5       ;   8(2/0)      d5 = XW (translation)
        addx.w  d5, d4          ;   4(1/0)      d4 = X*XX + Y*XY + Z*XZ + XW
        move.w  d4, (a0)+       ;   8(1/1)      store transformed coordinate
                                ; --------
                                ; 198+
Now pre-multiplying the input vectors by 2 (cheap even if they're not static) and changing the matrix format from:
Code:
XX.w XY.w XZ.w XW.w
YX.w YY.w YZ.w YW.w
ZX.w ZY.w ZZ.w ZW.w
to:
Code:
&squares[XX].l &squares[XY].l &squares[XZ].l XW.w
&squares[YX].l ...
where squares contains (n*n)>>10

Processing X could be done like this:
Code:
        move.l  (a3)+, a4       ;  12(3/0)      a4 = &squares[XX]
        move.w  (a4,d0.w), d4   ;  14(3/0)      d4 = (XX+X)^2
        neg.w   d0              ;   4(1/0)
        sub.w   (a4,d0.w), d4   ;  14(3/0)      d4 = (XX+X)^2 - (XX-X)^2 = X*XX
        neg.w   d0              ;   4(1/0)      restore d0 for processing Y
        move.l  (a3)+, a4       ;  12(3/0)      a4 = &squares[XY]
        add.w   (a4,d1.w), d4   ;  14(3/0)      d4 = X*XX + (XY+Y)^2
        neg.w   d1              ;   4(1/0)
        sub.w   (a4,d1.w), d4   ;  14(3/0)      d4 = X*XX + (XY+Y)^2 - (XY-Y)^2 = X*XX+Y*XY
        neg.w   d1              ;   4(1/0)      restore d1 for processing Y
        move.l  (a3)+, a4       ;  12(3/0)      a4 = &squares[XZ]
        add.w   (a4,d2.w), d4   ;  14(3/0)      d4 = X*XX + Y*XY + (XZ+Z)^2
        neg.w   d2              ;   4(1/0)
        sub.w   (a4,d2.w), d4   ;  14(3/0)      d4 = X*XX + Y*XY + (XZ+Z)^2 - (XZ-Z)^2 = X*XX+Y*XY+Z*XZ
        neg.w   d2              ;   4(1/0)      restore d2 for processing Y
        add.w   (a3)+, d4       ;  14(3/0)      d4 = X*XX + Y*XY + Z*XZ + XW
        move.w  d4, (a0)+       ;   8(1/1)
                                ;---------
                                ; 166
And it's starting to look OK in terms of cycles won vs. loss of precision!
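To put a number on that precision loss, here is a quick Python model of the 8.8 scheme above (function names are mine; the table entries fold in both the *4 of the identity and the 8.8 re-scaling, i.e. (n*n)>>10):

```python
def sq(n):
    # one word entry of the squares table: (n*n)>>10
    return (n * n) >> 10

def fix_mul(m, x):
    # d4 = squares[m+x] - squares[m-x]  ~  (m*x)>>8 in 8.8 fixpoint
    return sq(m + x) - sq(m - x)

# The two lookups floor independently, so the result is at most one
# LSB above the exact floored 8.8 product -- never below it.
for m in range(-256, 257):
    for x in range(-256, 257):
        err = fix_mul(m, x) - ((m * x * 4) >> 10)
        assert err in (0, 1)
```

So for 8.8 inputs up to magnitude 1.0 the error stays within one LSB of the truncated exact product, which matches the "OK in terms of cycles won vs. loss of precision" verdict.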

Doing the transforms in logarithmic space is definitely also an interesting option, I think I'll have to look into later. Now I at least have some baselines to compare cycle count precision against.

EDIT: while redoing the cycle timing I also realized that a lot of the neg.w's could be omitted by inverting the add/sub sequence for Y.

Last edited by paraj; 27 April 2017 at 02:50. Reason: Messed up instruction timing, gave it another try
paraj is offline  
Old 28 April 2017, 03:00   #150
Kalms
Registered User
 
Join Date: Nov 2006
Location: Stockholm, Sweden
Posts: 198
Nice! You can get further still if you think in terms of data parallel computation and try all different permutations of the nested loops and input data format changes that you can think of, with the aim of minimizing unnecessary logic beside the 6 table lookups themselves. Consider if self modifying code could become useful as well.

Last edited by Kalms; 28 April 2017 at 03:12.
Kalms is offline  
Old 01 May 2017, 12:00   #151
paraj
Registered User

 
Join Date: Feb 2017
Location: Denmark
Posts: 78
Quote:
Originally Posted by Kalms View Post
Nice! You can get further still if you think in terms of data parallel computation and try all different permutations of the nested loops and input data format changes that you can think of, with the aim of minimizing unnecessary logic beside the 6 table lookups themselves. Consider if self modifying code could become useful as well.
Once again, massive thanks for the hints! Too many years of self-modifying code being verboten have crippled my optimization skills, I think

I have to admit I don't quite understand what you're getting at with 'data parallel computations' though. Do you mean like using both the upper and lower word simultaneously?

For anyone curious: with the above tricks and the square table in low memory (16-bit addressable), the following loop is possible:

Code:
                movem.w (a1)+, a2-a4    ; 24(7/0)

                ; X
.offXXp:        move.w  $1234(a2), d0   ; 12(3/0)
.offXXm:        sub.w   $1234(a2), d0   ; 12(3/0)
.offXYp:        add.w   $1234(a3), d0   ; 12(3/0)
.offXYm:        sub.w   $1234(a3), d0   ; 12(3/0)
.offXZp:        add.w   $1234(a4), d0   ; 12(3/0)
.offXZm:        sub.w   $1234(a4), d0   ; 12(3/0)
.offXW:         add.w   #$1234, d0      ;  8(2/0)
                move.w  d0, (a0)+       ;  8(1/1)
                                        ; -------
                                        ; 88
                ; Y & Z handled similarly
with the points prescaled by 2 and the offset at .offXXp being &squares[XX].w and .offXXm &squares[-XX].w and so on.

EDIT: My simple 1-plane transform/project/putpixel test went from being able to handle 140 points to 237 in one frame just by applying the above optimization - pretty decent gain!

And of course the square table should be used for projection as well - &squares[$10000/z].w seems like a nice format for the reciprocal table.

Last edited by paraj; 01 May 2017 at 14:22. Reason: Added some stats
paraj is offline  
Old 11 May 2017, 13:05   #152
meynaf
son of 68k
meynaf's Avatar
 
Join Date: Nov 2007
Location: Lyon / France
Age: 45
Posts: 3,339
As this thread is for generic 68000 optimizations, here is another.
(Actually this one is valid for just about any CPU.)

You can replace :
Code:
 not.w d0
 and.w d0,d1
 not.w d0
By :
Code:
 or.w d0,d1
 eor.w d0,d1
(Makes "bit clear" / "and not" instruction proposals a lot less sexy, doesn't it?)
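The identity is easy to convince yourself of; in Python terms (treating d0/d1 as plain 16-bit values, names mine):

```python
def andnot_via_or_eor(d0, d1):
    # or.w d0,d1 ; eor.w d0,d1  is equivalent to  d1 = d1 & ~d0,
    # because (d1 | d0) ^ d0 clears exactly the bits set in d0
    d1 |= d0
    d1 ^= d0
    return d1

assert all(andnot_via_or_eor(a, b) == (b & ~a) & 0xFFFF
           for a in range(0, 0x10000, 257)
           for b in range(0, 0x10000, 257))
```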
meynaf is online now  
Old 11 May 2017, 21:31   #153
paraj
Registered User

 
Join Date: Feb 2017
Location: Denmark
Posts: 78
Quote:
Originally Posted by meynaf View Post
As this thread is for generic 68000 optimizations,
Sorry if the earlier MUL discussion ended up being too specific; that's the opposite of my intent, but I guess it's hard to be generic when tables are involved.

And a nice little bitwise identity I hadn't seen before!
paraj is offline  
Old 05 February 2018, 02:10   #154
mikro
Registered User
 
Join Date: Oct 2007
Location: Bratislava / Slovakia
Posts: 31
My humble contribution.

Sometimes I need to isolate the carry:
Code:
subx.b	d1,d1	;
neg.b	d1	;
add.b   d1,d0   ; add carry to d0's lower byte
Code:
subx.w	d1,d1	;
negx.b	d1	;
sub.w   d1,d0   ; add carry to d0's upper byte
Nice feature is that you can use long size if needed, making it adda/suba friendly.
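A hedged Python model of the word variant, with the X flag passed in explicitly as `x` (0 or 1) since Python has no flags register; the function name is mine:

```python
def add_carry_to_upper(d0, x):
    d1 = (0 - x) & 0xFFFF                 # subx.w d1,d1 -> 0 or $FFFF
    low = (0 - (d1 & 0xFF) - x) & 0xFF    # negx.b d1 (low byte only)
    d1 = (d1 & 0xFF00) | low              # d1 is now 0 or $FF00
    return (d0 - d1) & 0xFFFF             # sub.w d1,d0 adds $100 when x=1

assert add_carry_to_upper(0x1234, 0) == 0x1234
assert add_carry_to_upper(0x1234, 1) == 0x1334
```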

Last edited by mikro; 16 February 2018 at 12:12.
mikro is offline  
Old 16 February 2018, 15:11   #155
Gorf
Registered User

 
Join Date: May 2017
Location: Munich/Bavaria
Posts: 551
Multiplication by a constant can be done very efficiently. So if you know by what factor you are going to multiply, you could use one of these tricks:

multiply d0 by 29:

Code:
move.l  d0, d1
lsl.l   #4, d0
sub.l   d1, d0
add.l   d0, d0
sub.l   d1, d0
multiply d0 by 625:

Code:
move.l  d0, d1
lsl.l   #2, d0
add.l   d1, d0
lsl.l   #3, d0
sub.l   d1, d0
lsl.l   #4, d0
add.l   d1, d0
multiply d0 by 156:

Code:
move.l  d0, d1
lsl.l   #2, d1
add.l   d1, d0
lsl.l   #5, d0
sub.l   d1, d0
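The three sequences are easy to verify by transcribing them directly into Python (register names kept in the comments; function names are mine):

```python
def mul29(d0):
    d1 = d0          # move.l d0,d1
    d0 <<= 4         # lsl.l #4,d0   -> 16x
    d0 -= d1         # sub.l d1,d0   -> 15x
    d0 += d0         # add.l d0,d0   -> 30x
    d0 -= d1         # sub.l d1,d0   -> 29x
    return d0

def mul625(d0):
    d1 = d0          # move.l d0,d1
    d0 <<= 2         # -> 4x
    d0 += d1         # -> 5x
    d0 <<= 3         # -> 40x
    d0 -= d1         # -> 39x
    d0 <<= 4         # -> 624x
    d0 += d1         # -> 625x
    return d0

def mul156(d0):
    d1 = d0 << 2     # move.l d0,d1 ; lsl.l #2,d1 -> d1 = 4x
    d0 += d1         # -> 5x
    d0 <<= 5         # -> 160x
    d0 -= d1         # -> 156x
    return d0

assert all(mul29(n) == 29 * n and mul625(n) == 625 * n and mul156(n) == 156 * n
           for n in range(-1000, 1000))
```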
Gorf is offline  
Old 21 November 2018, 11:20   #156
dissident
Registered User

 
Join Date: Sep 2015
Location: Germany
Posts: 174
Quote:
Originally Posted by Gorf View Post
Multiplication by a constant can be done very efficiently. [...]
I've attached my collection of these multiply tricks based on the MC68000 execution times (including these three cases above) as macros, so they can be easily used in every sourcecode.
Attached Files
File Type: s MULU-Macros.s (25.1 KB, 20 views)
dissident is offline  
Old 02 December 2018, 07:27   #157
NorthWay
Registered User
 
Join Date: May 2013
Location: Grimstad / Norway
Posts: 523
Quote:
Originally Posted by pmc View Post
Code:
                     lea                 vals(pc),An
                    movem.w             (An),Dn-Dn
gets you the ext.l's for free
I didn't know this before I read this thread again recently.


Which reminded me of something: once upon a time there was a thing called the GNU (gcc?) superoptimizer that supposedly knew about all sorts of tricks to get things done in as few instructions as possible (and I think IBM was very proud of the POWER instruction set, as it had been proven possible to do 'all' of the instruction optimization equivalents in a single opcode).
Does anyone know if that is still a thing, or if it has been folded into the mainline product? I'm guessing there would have to be a description of what it did for each and every architecture - possibly some interesting stuff there.
NorthWay is offline  
 

