Quote:
Originally Posted by Kalms
Nice! You can get further still if you think in terms of data parallel computation and try all different permutations of the nested loops and input data format changes that you can think of, with the aim of minimizing unnecessary logic beside the 6 table lookups themselves. Consider if self modifying code could become useful as well.
Once again, massive thanks for the hints!
Too many years of self-modifying code being verboten have crippled my optimization skills, I think.
I have to admit I don't quite understand what you're getting at with 'data parallel computations' though. Do you mean like using both the upper and lower word simultaneously?
For anyone curious: with the above tricks and the square table in low memory (16-bit addressable), the following loop body is possible:
Code:
movem.w (a1)+, a2-a4 ; 24(6/0)
; X
.offXXp: move.w $1234(a2), d0 ; 12(3/0)
.offXXm: sub.w $1234(a2), d0 ; 12(3/0)
.offXYp: add.w $1234(a3), d0 ; 12(3/0)
.offXYm: sub.w $1234(a3), d0 ; 12(3/0)
.offXZp: add.w $1234(a4), d0 ; 12(3/0)
.offXZm: sub.w $1234(a4), d0 ; 12(3/0)
.offXW: add.w #$1234, d0 ; 8(2/0)
move.w d0, (a0)+ ; 8(1/1)
; -------
; 88 (for X alone, movem not included)
; Y & Z handled similarly
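Here the points are prescaled by 2 so they can be used directly as byte offsets into the word-sized table, and the $1234 placeholders get patched with the table addresses: the offset at .offXXp is &squares[XX].w, the one at .offXXm is &squares[-XX].w, and so on. Each move/sub (or add/sub) pair then reads squares[x+XX] and squares[x-XX], and their difference is the product x*XX.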
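For anyone who wants to check the arithmetic, here is a minimal C model of the X row above (a sketch, not 68000 code). It assumes the table holds n*n/4 rounded down, so that squares[x+m] - squares[x-m] comes out as exactly x*m, and that coordinates and coefficients all fit in -128..127. The names storage, init_squares and rotate_row are made up for illustration; the real table's range and fixed-point scaling are not given in this post.

```c
#include <assert.h>
#include <stdint.h>

#define HALF 256                          /* assumed: table covers indices -256..255 */
static int16_t storage[2 * HALF];
static int16_t *const squares = storage + HALF;   /* centred, so negative indices work */

/* Fill the table with floor(n*n/4); the largest entry is 16384, which fits a word. */
static void init_squares(void) {
    for (int n = -HALF; n < HALF; n++)
        squares[n] = (int16_t)((n * n) / 4);
}

/* Models one output coordinate: xx*x + xy*y + xz*z + xw, with no multiply. */
static int rotate_row(int xx, int xy, int xz, int xw, int x, int y, int z) {
    assert(xx >= -128 && xx <= 127 && xy >= -128 && xy <= 127 && xz >= -128 && xz <= 127);
    assert(x >= -128 && x <= 127 && y >= -128 && y <= 127 && z >= -128 && z <= 127);

    /* These pointers play the role of the patched $1234 displacements. */
    const int16_t *xxp = &squares[xx], *xxm = &squares[-xx];
    const int16_t *xyp = &squares[xy], *xym = &squares[-xy];
    const int16_t *xzp = &squares[xz], *xzm = &squares[-xz];

    int d0 = xxp[x];      /* squares[x + xx]                 */
    d0 -= xxm[x];         /* squares[x - xx]  -> d0 = x * xx */
    d0 += xyp[y];
    d0 -= xym[y];         /* d0 += y * xy                    */
    d0 += xzp[z];
    d0 -= xzm[z];         /* d0 += z * xz                    */
    d0 += xw;             /* translation term                */
    return d0;
}
```

After init_squares(), rotate_row(3, -5, 7, 100, 10, 20, -30) gives 30 - 100 - 210 + 100 = -180. The rounding in the table cancels because x+m and x-m always have the same parity.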
EDIT: My simple 1-plane transform/project/putpixel test went from being able to handle 140 points to 237 in one frame just by applying the above optimization - pretty decent gain!
And of course the square table should be used for projection as well - &squares[$10000/z].w seems like a nice format for the reciprocal table.
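One possible reading of that reciprocal-table idea, again as a C model under assumptions: each reciprocal entry stores a pointer into the square table instead of a number, so the projection multiply is just two lookups through that pointer. Since the table is symmetric (squares[-n] == squares[n]), squares[x - r] can be read as squares[r - x]. The scale ONE, the table size, the z range and the names recip, init_tables and project are all hypothetical; the post uses $10000 as the numerator and does not say how the result is scaled back.

```c
#include <assert.h>
#include <stdint.h>

#define HALF 2048                         /* assumed table half-width */
#define ONE  1024                         /* assumed reciprocal scale (the post uses $10000) */
#define ZMAX 256                          /* assumed depth range: 1..ZMAX-1 */

static int32_t storage[2 * HALF];
static int32_t *const squares = storage + HALF;   /* centred, so negative indices work */
static const int32_t *recip[ZMAX];                /* recip[z] = &squares[ONE / z] */

static void init_tables(void) {
    for (int n = -HALF; n < HALF; n++)
        squares[n] = (n * n) / 4;         /* floor(n*n/4) */
    for (int z = 1; z < ZMAX; z++)
        recip[z] = &squares[ONE / z];     /* store the address, not the value */
}

/* Returns x * (ONE / z) using two table reads and no multiply or divide. */
static int project(int x, int z) {
    assert(z >= 1 && z < ZMAX);
    assert(x >= -1023 && x <= 1023);      /* keeps r + x and r - x inside the table */
    const int32_t *p = recip[z];
    return p[x] - p[-x];                  /* squares[r + x] - squares[r - x] = r * x */
}
```

With these assumed constants, project(100, 4) returns 100 * 256 = 25600 after init_tables().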