Well, I'm not actually "bit-banging", I'm just reading/writing bytes from/to the parallel port.
The real problem here is the damn slow CIA access, and the way it works.
My interface senses the STROBE pin and starts sending each byte as soon as you write it to port B of the CIA (the CIA generates a one-cycle low pulse on STROBE for each byte written to or read from the port).
But to read a byte from the SPI, you first have to write something to it, usually a dummy byte (0xFF).
And that's not the only issue here... Every time you change from read to write or vice versa, you need to write 2 more bytes to the CIA (one to set the port to in/out and one to change the output control line).
So, a basic SPI transfer goes like this:
1) Assuming the CIA port B is in output mode, and the control line is set, you write your data to the port. 1 CIA access
2) Now you change the port B to input mode, and clear the control line (to tell the interface we're going to read from it). 2 CIA accesses
3) If you want to continue working, you have to put the CIA port back in output mode and set the control line again. 2 CIA accesses
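The steps above can be sketched in C against a mock CIA, just to make the access count visible. This is only an illustration: the struct, the register layout and the function names are made up here and are not the real driver code, and only the accesses counted in the steps are modelled.

```c
#include <stdint.h>

/* Hypothetical stand-in for the CIA registers involved. On real hardware
   these would be memory-mapped; the fields here are illustrative only. */
typedef struct {
    uint8_t data;      /* port B data register */
    uint8_t ddr;       /* port B direction: 0xFF = output, 0x00 = input */
    uint8_t ctrl;      /* control line to the interface: 1 = write, 0 = read */
    unsigned accesses; /* number of CIA accesses made so far */
} MockCia;

/* Every register write counts as one (slow) CIA access. */
void cia_write(MockCia *cia, uint8_t *reg, uint8_t value)
{
    *reg = value;
    cia->accesses++;
}

/* One basic transfer, following steps 1) to 3). */
void spi_transfer(MockCia *cia, uint8_t out)
{
    /* 1) port B is in output mode, control line set: write the byte */
    cia_write(cia, &cia->data, out);

    /* 2) switch port B to input and clear the control line */
    cia_write(cia, &cia->ddr, 0x00);
    cia_write(cia, &cia->ctrl, 0);

    /* 3) put port B back in output mode and set the control line again */
    cia_write(cia, &cia->ddr, 0xFF);
    cia_write(cia, &cia->ctrl, 1);
}
```

Starting from a port in output mode with the control line set, one call to spi_transfer() leaves the port in the same state and bumps the access counter by 5.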
So you need 5 accesses to the (very) slow CIA. Bearing in mind that the CIAs are clocked at about 1 MHz (in fact, even less), the typical access time is around 1 usec, so you're actually spending 5 usecs on each SPI transfer...
And don't forget that there's no cache or pipeline on a plain 68000, so it's going to take even longer... (the CPU has to fetch the opcodes from memory, execute them, move data between its internal registers and memory, etc...).
I'm using some unrolling techniques to speed things up, trying to minimize jumps and all those things that take forever on a 68000, and I'm getting close to a 2x performance boost over unoptimized code.
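For anyone not familiar with unrolling, here is a minimal sketch of the idea in C. It is not my actual code (that is 68000 assembly); the mock port, the function names and the unroll factor of 4 are all assumptions picked for the example. The point is simply that the loop test and branch are paid once per 4 bytes instead of once per byte.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mock of the output port: counts writes and sums the data. */
typedef struct {
    unsigned writes;
    uint32_t sum;
} MockPort;

void port_write(MockPort *p, uint8_t b)
{
    p->writes++;
    p->sum += b;
}

/* Send len bytes with the loop body unrolled 4 times. */
void send_unrolled(MockPort *p, const uint8_t *buf, size_t len)
{
    size_t blocks = len / 4; /* full groups of 4 bytes */
    size_t rest = len % 4;   /* leftover bytes */

    while (blocks--) {
        port_write(p, buf[0]);
        port_write(p, buf[1]);
        port_write(p, buf[2]);
        port_write(p, buf[3]);
        buf += 4;
    }
    /* Handle the tail one byte at a time. */
    while (rest--) {
        port_write(p, *buf++);
    }
}
```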
All this means that the fastest I can go on a bare A500 is around 25 Kbytes/sec (actually 27 Kbytes/sec), and close to 50 Kbytes/sec on a '060 A1200 machine (48.5 Kbytes/sec, to be exact).
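To put those figures next to the CIA cost, a back-of-the-envelope sketch in C. It uses the 5 accesses and roughly 1 usec per access mentioned earlier, and assumes 1 Kbyte = 1024 bytes; the function names are made up for the example.

```c
/* Time spent per byte, in usec, at a given throughput in Kbytes/sec.
   Assumes 1 Kbyte = 1024 bytes. */
double usec_per_byte(double kbytes_per_sec)
{
    return 1e6 / (kbytes_per_sec * 1024.0);
}

/* Upper bound on throughput, in Kbytes/sec, if the CIA accesses were
   the only cost of a transfer (one byte per transfer). */
double cia_bound_kbytes_per_sec(unsigned accesses, double usec_per_access)
{
    double usec_per_transfer = accesses * usec_per_access;
    return 1e6 / usec_per_transfer / 1024.0;
}
```

With these numbers, 27 Kbytes/sec works out to about 36 usec per byte, while the CIA accesses alone would allow a bit under 200 Kbytes/sec, so most of the time on the A500 goes into the CPU overhead around the accesses.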
So, draw your own conclusions...