Some quick, non-scientific tests.
Pure code seems slightly faster than this lazy 8-bit LUT implementation built on bfextu:
Code:
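_lut8flip:
        lea     _8lut(pc),a0        ; a0 = 256-byte bit-reversal table
        move.l  d0,d1               ; keep the source longword in d1
        bfextu  d1{8:8},d2          ; source bits 23..16
        move.b  (a0,d2.w),d0        ; reversed byte into the low byte of d0
        ror.l   #8,d0               ; rotate it up, freeing the low byte
        bfextu  d1{16:8},d2         ; source bits 15..8
        move.b  (a0,d2.w),d0
        ror.l   #8,d0
        bfextu  d1{24:8},d2         ; source bits 7..0
        move.b  (a0,d2.w),d0
        ror.l   #8,d0
        bfextu  d1{0:8},d2          ; source bits 31..24
        move.b  (a0,d2.w),d0        ; reversed top byte stays in the low byte
        rts                         ; d0 = flipped longword (d1/d2/a0 trashed)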
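For anyone who wants to check the logic off-Amiga, here is a rough C model of the same byte-wise idea. It is only a sketch: the names lut8, init_lut8 and lut8flip are made up for illustration, and the post does not show how _8lut is actually built, so the table fill below is an assumption (each entry is simply its index with the 8 bits reversed).

```c
#include <stdint.h>

/* Hypothetical stand-in for _8lut: lut8[b] is b with its 8 bits reversed. */
static uint8_t lut8[256];

/* Fill the table once at startup (assumed layout, not taken from the post). */
static void init_lut8(void)
{
    for (unsigned b = 0; b < 256; b++) {
        unsigned r = 0;
        for (unsigned bit = 0; bit < 8; bit++) {
            if (b & (1u << bit))
                r |= 0x80u >> bit;      /* bit n moves to bit 7-n */
        }
        lut8[b] = (uint8_t)r;
    }
}

/* Reverse all 32 bits: reverse each byte via the table and mirror the byte order. */
static uint32_t lut8flip(uint32_t x)
{
    return ((uint32_t)lut8[x & 0xFFu] << 24)
         | ((uint32_t)lut8[(x >> 8) & 0xFFu] << 16)
         | ((uint32_t)lut8[(x >> 16) & 0xFFu] << 8)
         |  (uint32_t)lut8[x >> 24];
}
```

Call init_lut8() once, then lut8flip() on each longword of the bitplane data. The asm gets the same result with four bfextu/move.b pairs and three rotates instead of shifts and ORs.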
But the absolute winner is the 16-bit LUT approach (as much as 50% faster).
It's as simple as:
Code:
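_lut16flip:
        lea     _16lut+65536,a0     ; point at the middle of the 128 KB table
        move.w  (a0,d0.w*2),d0      ; d0.w is a signed index, hence the +65536 bias
        swap    d0                  ; reversed low word goes high, old high word comes low
        move.w  (a0,d0.w*2),d0      ; reverse the old high word into the low word
        rts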
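The one non-obvious bit is the +65536: d0.w is sign-extended when used as an index, so words $8000..$FFFF land below a0 and the table has to be addressed from its middle. A rough C model of that layout follows, as a sketch only. The names lut16, rev16, signed_index, init_lut16 and lut16flip are hypothetical, and the table fill is my assumption of what _16lut must contain for the routine to work, since the post does not show it.

```c
#include <stdint.h>

/* Hypothetical stand-in for _16lut: 65536 words = 128 KB. */
static uint16_t lut16[65536];

/* Reverse the 16 bits of v the slow way; only used to fill the table. */
static uint16_t rev16(uint16_t v)
{
    uint16_t r = 0;
    for (int bit = 0; bit < 16; bit++) {
        r = (uint16_t)((r << 1) | (v & 1u));
        v >>= 1;
    }
    return r;
}

/* Models the sign extension of d0.w in (a0,d0.w*2). */
static int32_t signed_index(uint16_t w)
{
    return (w < 0x8000u) ? (int32_t)w : (int32_t)w - 65536;
}

/* Fill the table so that indexing from the middle with a signed word works. */
static void init_lut16(void)
{
    uint16_t *mid = lut16 + 32768;          /* same spot as _16lut+65536 bytes */
    for (uint32_t v = 0; v < 65536; v++)
        mid[signed_index((uint16_t)v)] = rev16((uint16_t)v);
}

/* Reverse all 32 bits with two table reads and a word swap. */
static uint32_t lut16flip(uint32_t x)
{
    const uint16_t *mid = lut16 + 32768;
    uint32_t hi = mid[signed_index((uint16_t)(x & 0xFFFFu))]; /* reversed low word  */
    uint32_t lo = mid[signed_index((uint16_t)(x >> 16))];     /* reversed high word */
    return (hi << 16) | lo;
}
```

In this model the entries for $8000..$FFFF sit in the first half of the table and $0000..$7FFF in the second half, which is what the biased base address implies.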
Spending 128 KB on a table is debatable, BUT:
suppose you have a lot of big AGA sprites (64x64, 4 planes) and also a lot of tiles (32x32, 4/8 planes), for a total of 1 MB of data, all to be flipped.
In that case it may be useful
(the table is then only about 12.5% of the data, the overhead becomes proportionally less significant as the data grows, and CPU time is precious on the 020...)
But pure code, like Thorham suggested, is certainly a great option too!
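To make "pure code" concrete: the routine that was actually timed is not shown in this post, so the following is NOT Thorham's code, just the common table-free mask-and-shift bit reversal, written in C as a sketch of the general idea. The name purecode_flip is made up for illustration.

```c
#include <stdint.h>

/* Table-free 32-bit bit reversal: swap ever larger groups of bits. */
static uint32_t purecode_flip(uint32_t x)
{
    x = ((x >> 1) & 0x55555555u) | ((x & 0x55555555u) << 1);  /* adjacent bits    */
    x = ((x >> 2) & 0x33333333u) | ((x & 0x33333333u) << 2);  /* bit pairs        */
    x = ((x >> 4) & 0x0F0F0F0Fu) | ((x & 0x0F0F0F0Fu) << 4);  /* nibbles          */
    x = ((x >> 8) & 0x00FF00FFu) | ((x & 0x00FF00FFu) << 8);  /* bytes            */
    return (x >> 16) | (x << 16);                             /* words, like swap */
}
```

No memory is touched beyond the instruction stream, which is the trade-off against the LUT versions.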
[EDIT, PS]
Why non-scientific?
I don't have a CD32, nor an Amiga for that matter.
So it's all based on WinUAE's emulation, which is not perfectly cycle-exact for the 020 (or maybe it is for code this simple? Well, it's not that important...).
Also, I didn't feel like writing an 8-bit LUT version other than the bfextu one, and anyway the speed difference between pure code and the 8-bit LUT doesn't seem significant enough to justify using only the LUT.