English Amiga Board - View Single Post

paraj · 26 April 2017, 20:16

Quote:

Originally Posted by Kalms

Not sure if this helps, but you can change the logic of the table lookup approach to 'setup code, move.l from table, neg.w table offset, sub.l from table'. If you are doing matrix operations you can re-use some of the setup code across many multiplies. If you are transforming vertices you can also pre-shift your input data. If precision is not very important to you then you can do the transform in logarithmic space and thereby get down to 1 table lookup per multiply.

Hi Kalms,

I was already considering re-using the setup code for the matrix across many operations, but couldn't get the cycle down due to too many expensive uses of long words, but your input made me rethink some ideas, and now I think a decent saving could be achieved.

My transform loop has the following structure:

Code:

; a0 = dest (3 x cnt x 8.8 fixpoint words)
; a1 = src   ( --- "" ---)
; a2 = mat  ( 3 x 4 x 8.8 fixpoint words)
; d7 = cnt-1
.loop:
        move.l  a2, a3          ; reload matrix pointer

        move.w  (a0)+, d0       ;  8(2/0) d0 = X
        move.w  (a0)+, d1       ;  8(2/0) d1 = Y
        move.w  (a0)+, d2       ;  8(2/0) d2 = Z
        ; Process X
        ; ...
        ; Process Y
        ; ...
        ; Process Z (can trash d0-d2 here)
        ; ...
        dbf d7, .loop

Focusing on the "Process X" part my code currently looks as follows (should be fairly standard I think):

Code:

        move.w  d0, d4          ;   4(1/0)      d4 = X
        muls.w  (a3)+, d4       ;  42(2/0)+     d4 = X*XX (16.16 fixpoint)
        move.w  d1, d5          ;   4(1/0)      d5 = Y
        muls.w  (a3)+, d5       ;  42(2/0)+     d5 = Y*XY
        add.l   d5, d4          ;   8(1/0)      d4 = X*XX + Y*XY
        move.w  d2, d5          ;   4(1/0)      d5 = Z
        muls.w  (a3)+, d5       ;  42(2/0)+     d5 = Z*XZ
        add.l   d5, d4          ;   8(1/0)      d4 = X*XX + Y*XY + Z*XZ
        asr.l   #8, d4          ;  24(1/0)      convert back to 8.8
        move.w  (a3)+, d5       ;   8(2/0)      d5 = XW (translation)
        addx.w  d5, d4          ;   4(1/0)      d4 = X*XX + Y*XY + Z*XZ + XW
        move.w  d4, (a0)+       ;   8(1/1)      store transformed coordinate
                                ; --------
                                ; 198+

Now pre-multiplying the input vectors by 2 (cheap even if they're not static) and changing the matrix format from:

Code:

XX.w XY.w XZ.w XW.w
YX.w YY.w YZ.w YW.w
ZX.w ZY.w ZZ.w ZW.w

to:

Code:

&squares[XX].l &squares[XY].l &squares[XZ].l XW.w
&squares[YX].l ...

where squares contains (n*n)>>10

Processing X could be done like this:

Code:

        move.l  (a3)+, a4       ;  12(3/0)      a4 = &squares[XX]
        move.w  (a4,d0.w), d4   ;  14(3/0)      d4 = (XX+X)^2
        neg.w   d0              ;   4(1/0)
        sub.w   (a4,d0.w), d4   ;  14(3/0)      d4 = (XX+X)^2 - (XX-X)^2 = X*XX
        neg.w   d0              ;   4(1/0)      restore d0 for processing Y
        move.l  (a3)+, a4       ;  12(3/0)      a4 = &squares[XY]
        add.w   (a4,d1.w), d4   ;  14(3/0)      d4 = X*XX + (XY+Y)^2
        neg.w   d1              ;   4(1/0)
        sub.w   (a4,d1.w), d4   ;  14(3/0)      d4 = X*XX + (XY+Y)^2 - (XY-Y)^2 = X*XX+Y*XY
        neg.w   d1              ;   4(1/0)      restore d1 for processing Y
        move.l  (a3)+, a4       ;  12(3/0)      a4 = &squares[XZ]
        add.w   (a4,d2.w), d4   ;  14(3/0)      d4 = X*XX + Y*XY + (XZ+Z)^2
        neg.w   d2              ;   4(1/0)
        sub.w   (a4,d2.w), d4   ;  14(3/0)      d4 = X*XX + Y*XY + (XZ+Z)^2 - (XZ-Z)^2 = X*XX+Y*XY+Z*XZ
        neg.w   d2              ;   4(1/0)      restore d2 for processing Y
        add.w   (a3)+, d4       ;  14(3/0)      d4 = X*XX + Y*XY + Z*XZ + XW
        move.w  d4, (a0)+       ;   8(1/1)
                                ;---------
                                ; 166

And it's starting to look OK in terms of cycles won vs. loss of precision!

Doing the transforms in logarithmic space is definitely also an interesting option, I think I'll have to look into later. Now I at least have some baselines to compare cycle count precision against.

EDIT: while redoing the cycle timing i also realized that a lot of the neg.w's could be omitted by inverting the add/sub sequence for Y..