View Single Post
Old 26 April 2017, 20:16   #149
paraj
Registered User
 
paraj's Avatar
 
Join Date: Feb 2017
Location: Denmark
Posts: 1,172
Quote:
Originally Posted by Kalms View Post
Not sure if this helps, but you can change the logic of the table lookup approach to 'setup code, move.l from table, neg.w table offset, sub.l from table'. If you are doing matrix operations you can re-use some of the setup code across many multiplies. If you are transforming vertices you can also pre-shift your input data. If precision is not very important to you then you can do the transform in logarithmic space and thereby get down to 1 table lookup per multiply.
Hi Kalms,

I was already considering re-using the setup code for the matrix across many operations, but couldn't get the cycle down due to too many expensive uses of long words, but your input made me rethink some ideas, and now I think a decent saving could be achieved.

My transform loop has the following structure:
Code:
; a0 = dest (3 x cnt x 8.8 fixpoint words)
; a1 = src   ( --- "" ---)
; a2 = mat  ( 3 x 4 x 8.8 fixpoint words)
; d7 = cnt-1
.loop:
        move.l  a2, a3          ; reload matrix pointer

        move.w  (a0)+, d0       ;  8(2/0) d0 = X
        move.w  (a0)+, d1       ;  8(2/0) d1 = Y
        move.w  (a0)+, d2       ;  8(2/0) d2 = Z
        ; Process X
        ; ...
        ; Process Y
        ; ...
        ; Process Z (can trash d0-d2 here)
        ; ...
        dbf d7, .loop
Focusing on the "Process X" part my code currently looks as follows (should be fairly standard I think):
Code:
        move.w  d0, d4          ;   4(1/0)      d4 = X
        muls.w  (a3)+, d4       ;  42(2/0)+     d4 = X*XX (16.16 fixpoint)
        move.w  d1, d5          ;   4(1/0)      d5 = Y
        muls.w  (a3)+, d5       ;  42(2/0)+     d5 = Y*XY
        add.l   d5, d4          ;   8(1/0)      d4 = X*XX + Y*XY
        move.w  d2, d5          ;   4(1/0)      d5 = Z
        muls.w  (a3)+, d5       ;  42(2/0)+     d5 = Z*XZ
        add.l   d5, d4          ;   8(1/0)      d4 = X*XX + Y*XY + Z*XZ
        asr.l   #8, d4          ;  24(1/0)      convert back to 8.8
        move.w  (a3)+, d5       ;   8(2/0)      d5 = XW (translation)
        addx.w  d5, d4          ;   4(1/0)      d4 = X*XX + Y*XY + Z*XZ + XW
        move.w  d4, (a0)+       ;   8(1/1)      store transformed coordinate
                                ; --------
                                ; 198+
Now pre-multiplying the input vectors by 2 (cheap even if they're not static) and changing the matrix format from:
Code:
XX.w XY.w XZ.w XW.w
YX.w YY.w YZ.w YW.w
ZX.w ZY.w ZZ.w ZW.w
to:
Code:
&squares[XX].l &squares[XY].l &squares[XZ].l XW.w
&squares[YX].l ...
where squares contains (n*n)>>10

Processing X could be done like this:
Code:
        move.l  (a3)+, a4       ;  12(3/0)      a4 = &squares[XX]
        move.w  (a4,d0.w), d4   ;  14(3/0)      d4 = (XX+X)^2
        neg.w   d0              ;   4(1/0)
        sub.w   (a4,d0.w), d4   ;  14(3/0)      d4 = (XX+X)^2 - (XX-X)^2 = X*XX
        neg.w   d0              ;   4(1/0)      restore d0 for processing Y
        move.l  (a3)+, a4       ;  12(3/0)      a4 = &squares[XY]
        add.w   (a4,d1.w), d4   ;  14(3/0)      d4 = X*XX + (XY+Y)^2
        neg.w   d1              ;   4(1/0)
        sub.w   (a4,d1.w), d4   ;  14(3/0)      d4 = X*XX + (XY+Y)^2 - (XY-Y)^2 = X*XX+Y*XY
        neg.w   d1              ;   4(1/0)      restore d1 for processing Y
        move.l  (a3)+, a4       ;  12(3/0)      a4 = &squares[XZ]
        add.w   (a4,d2.w), d4   ;  14(3/0)      d4 = X*XX + Y*XY + (XZ+Z)^2
        neg.w   d2              ;   4(1/0)
        sub.w   (a4,d2.w), d4   ;  14(3/0)      d4 = X*XX + Y*XY + (XZ+Z)^2 - (XZ-Z)^2 = X*XX+Y*XY+Z*XZ
        neg.w   d2              ;   4(1/0)      restore d2 for processing Y
        add.w   (a3)+, d4       ;  14(3/0)      d4 = X*XX + Y*XY + Z*XZ + XW
        move.w  d4, (a0)+       ;   8(1/1)
                                ;---------
                                ; 166
And it's starting to look OK in terms of cycles won vs. loss of precision!

Doing the transforms in logarithmic space is definitely also an interesting option, I think I'll have to look into later. Now I at least have some baselines to compare cycle count precision against.

EDIT: while redoing the cycle timing i also realized that a lot of the neg.w's could be omitted by inverting the add/sub sequence for Y..

Last edited by paraj; 27 April 2017 at 01:50. Reason: Messed up instruction timing, gave it another try
paraj is offline  
 
Page generated in 0.04543 seconds with 11 queries