Just an idea: if you render from the middle towards the upper and lower ends, you can exploit some symmetry. If you put texel i above the center, you'll always put texel h-i-1 below it (h being the texture height). If you store your texture rotated by 90 degrees and scramble it like 0, h-1, 1, h-2, 2, h-3, ..., you can read and write a word holding two pixels instead of a byte holding one pixel, thereby roughly doubling the speed of the inner loop. You'd need two additional blitter passes to separate/unscramble the two bytes in the output (or maybe you could even bake that into the c2p?), but you could still come out faster.
EDIT: Here's a nice example where a byte-swap instruction is missing...
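
To make the scrambled-pair idea a bit more concrete, here's a rough sketch in plain C. It is not tuned 68k code, and all the function names (`scramble_column`, `render_half`, `unscramble_column`) are made up for illustration. It assumes an even texture height, 8-bit chunky texels and a 16.16 fixed-point step, and it does the unscrambling with a CPU loop where the real thing would use the blitter passes (or the c2p).

```c
#include <stddef.h>
#include <stdint.h>

/* Pack one texture column (even height h) into h/2 words.
 * Word p holds texel p in the high byte and its mirror texel h-1-p in
 * the low byte, so on a big-endian CPU the bytes sit in memory in the
 * order 0, h-1, 1, h-2, ... */
void scramble_column(const uint8_t *texels, size_t h, uint16_t *pairs)
{
    for (size_t p = 0; p < h / 2; p++)
        pairs[p] = (uint16_t)((texels[p] << 8) | texels[h - 1 - p]);
}

/* Render from the screen center outward: one word read and one word
 * write per pair of pixels. 'step' is texels per screen row in 16.16
 * fixed point. out[j] holds the pixel j rows above the center (high
 * byte) and the pixel j rows below it (low byte). */
void render_half(const uint16_t *pairs, size_t h, uint32_t step,
                 uint16_t *out, size_t half_rows)
{
    size_t last = h / 2 - 1;   /* pair closest to the texture center */
    uint32_t pos = 0;

    for (size_t j = 0; j < half_rows; j++) {
        size_t off = pos >> 16;
        if (off > last)
            off = last;        /* clamp instead of reading out of range */
        out[j] = pairs[last - off];
        pos += step;
    }
}

/* Stand-in for the extra blitter passes: split each word into the
 * upper-half and lower-half pixel of a normal chunky column of
 * 2 * half_rows bytes. */
void unscramble_column(const uint16_t *out, size_t half_rows,
                       uint8_t *column)
{
    for (size_t j = 0; j < half_rows; j++) {
        column[half_rows - 1 - j] = (uint8_t)(out[j] >> 8);
        column[half_rows + j]     = (uint8_t)(out[j] & 0xFF);
    }
}
```

With a step of 1.0 (`1u << 16`) and `half_rows == h / 2`, scrambling, rendering and unscrambling should give back the original column unchanged, which makes for an easy sanity check.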