For a C2P it basically comes down to how many places you need to move the bits, and how to do that efficiently.
For a 1x1, %0000abcd needs to turn into %a0000000, %b0000000, %c0000000, %d0000000 (one bit per bitplane, appropriately shifted for each of the 8 pixels). If you look at the good old
https://amycoders.org/sources/c2ptut.html, the prescramble part is all about avoiding one (or more) of the parts mentioned in section 10.
I have some old code lying around that does 320x120x4 (1x1) with the blitter using 281 raster lines (with no CPU interleaving and no other DMA active), but it assumes this format: %a3b3c3d3a2b2c2d2a1b1c1d1a0b0c0d0, i.e. you need 4 different textures, one for each horizontal position (or other trickery).
It might be worth playing around with a quickish proof of concept that just assumes the basic texture format, and only then working on optimizing it. You can just use a C2P written in C (or whatever) for this. After that you can ask the optimization gods in asm/hardware,
and you'll have a better idea about where it's possible to optimize.
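
For the proof of concept, something like this naive C version would do. It's only a sketch and makes a few assumptions that aren't from the code mentioned above: one pixel per chunky byte in the basic %0000abcd format, 4 separate (non-interleaved) bitplanes, bit 0 of the pixel going to plane 0, and a width that is a multiple of 8. The name c2p_reference is made up for the example.

```c
#include <stddef.h>
#include <stdint.h>

#define PLANES 4

/* Naive reference C2P (assumed layout, not anyone's optimized routine):
   chunky holds one 4-bit pixel per byte (%0000abcd), planes[] points to
   PLANES separate bitplane buffers with 8 pixels packed per byte. */
void c2p_reference(const uint8_t *chunky, uint8_t *planes[PLANES],
                   size_t width, size_t height)
{
    size_t bytes_per_row = width / 8;   /* assumes width % 8 == 0 */

    for (size_t y = 0; y < height; y++) {
        for (size_t xb = 0; xb < bytes_per_row; xb++) {
            const uint8_t *src = chunky + y * width + xb * 8;
            uint8_t out[PLANES] = {0};

            /* Pixel i of the group lands in bit (7 - i) of each plane
               byte, so the leftmost pixel ends up in the top bit. */
            for (int i = 0; i < 8; i++) {
                for (int p = 0; p < PLANES; p++) {
                    out[p] |= (uint8_t)(((src[i] >> p) & 1u) << (7 - i));
                }
            }

            for (int p = 0; p < PLANES; p++) {
                planes[p][y * bytes_per_row + xb] = out[p];
            }
        }
    }
}
```

For 320x120x4 you'd call it with width 320, height 120 and four 40*120 byte plane buffers. It's slow, but it gives you known-good output to compare against once you start moving the work to asm or the blitter.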