Only expected interrupts. However, I've figured out why it's slow on my machine: ATC misses! If I disable page address translation and rely on just the TTRs, I get 34.72 / 28.57 fps respectively.
I timed a small test program that reads a byte from a (pseudo-)random offset into an array and varied the size:
Up to 256KB there are no differences. Beyond that, the no-MMU case stays flat at around 375 ns per loop iteration, while with the MMU enabled it grows to 867 ns for a 32MB array.
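For anyone who wants to try this on their own machine, a test of this kind could look roughly like the sketch below. It is not the exact program behind the numbers above: the function names, the LCG constants, the seed and the use of `clock()` are all assumptions chosen for illustration, and `size` has to be a power of two no larger than 32MB so the offset can be masked.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <time.h>

/* Hypothetical helper: sums `iters` bytes read from pseudo-random offsets
   into buf. `size` must be a power of two, at most 2^25 (32MB). */
static uint32_t random_reads(const uint8_t *buf, size_t size, uint32_t iters)
{
    uint32_t state = 12345u;  /* arbitrary seed */
    uint32_t sum = 0;
    for (uint32_t i = 0; i < iters; i++) {
        /* simple 32-bit LCG; constants are an assumption, any decent PRNG works */
        state = state * 1664525u + 1013904223u;
        /* drop the weak low bits, then mask down to the array size */
        sum += buf[(state >> 7) & (size - 1)];
    }
    return sum;
}

/* Returns the average time per loop iteration in ns, or -1.0 if malloc fails. */
static double time_random_reads(size_t size, uint32_t iters)
{
    uint8_t *buf = malloc(size);
    if (buf == NULL)
        return -1.0;
    for (size_t i = 0; i < size; i++)
        buf[i] = (uint8_t)i;  /* touch every page before timing */

    clock_t t0 = clock();
    volatile uint32_t sink = random_reads(buf, size, iters);  /* keep the loop alive */
    clock_t t1 = clock();
    (void)sink;

    free(buf);
    return (double)(t1 - t0) * 1e9 / CLOCKS_PER_SEC / iters;
}
```

Calling `time_random_reads` for sizes from, say, 64KB up to 32MB, once with page translation and once without, should reproduce the general shape of the curve. Note that `clock()` resolution varies a lot between platforms, so the iteration count needs to be large enough for the measurement to mean anything.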
EDIT: 256KB of course lines up perfectly with a 64-entry ATC and a page size of 4K (64 × 4K = 256K). I also forgot something actionable: being more cache friendly overall is likely a big rework, but I think grouping (height, color) together rather than keeping them in separate arrays would likely be an easy win. Obviously you can't just switch off the MMU without consequences, so don't change something like that in your own code.
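To illustrate the grouping idea: with separate arrays, one lookup at a given index touches two different pages (one per array), so it can cost two ATC misses; with the pair stored together it touches one. The sketch below is only an assumption about the layout, since I'm guessing at one byte each for height and color, and `MapCell`, `interleave_maps` and `sample_cell` are made-up names.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical combined cell: height and color sit next to each other,
   so a single lookup stays within one page (and usually one cache line). */
typedef struct {
    uint8_t height;
    uint8_t color;
} MapCell;

/* One-time conversion from the two separate arrays to the grouped layout. */
static void interleave_maps(const uint8_t *height, const uint8_t *color,
                            MapCell *cells, size_t count)
{
    for (size_t i = 0; i < count; i++) {
        cells[i].height = height[i];
        cells[i].color  = color[i];
    }
}

/* A lookup now fetches both values with one memory access pattern. */
static MapCell sample_cell(const MapCell *cells, size_t index)
{
    return cells[index];
}
```

The total memory footprint stays the same, so this doesn't reduce the number of pages the map covers; it only halves the number of distinct pages touched per sample.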