Quote:
Originally Posted by koobo
A bit off topic I suppose, but isn't it more likely that both of these instructions have been prefetched into the 96-byte FIFO buffer, as the previous instructions are short? Not sure tho
|
Perfectly on topic, and it's bitten me before. A correctly predicted Bcc instruction is free if taken, but it discards the instruction stream (section 1.4.2.1 of the 68060UM):
Quote:
If a hit occurs in the branch cache, indicating a branch taken instruction, the current instruction stream is discarded and a new instruction stream is fetched starting at the location indicated by the branch cache.
|
Just checked by timing the loops over 200 iterations with everything in cache: the original loop takes ~9 cycles per iteration (as I expected) and my first version takes ~6. The d7 version also takes 6 (maybe a stall on a2? Not sure; lea doesn't help, and it also increases code size...). However, this variation takes it down to 5:
Code:
add.l d0,d2
move.b d7,(a2)
move.b (a1,d4.l),d5
move.l d2,d4
lsr.l d3,d4
add.l d6,a2
move.b (a0,d5.l),d7
subq.l #1,d1
bne.b .loop
But going from 6 to 5 probably isn't going to give a measurable speedup if 9->6 didn't.
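
For a rough feel of why, here is a back-of-envelope sketch in Python (host side, nothing Amiga-specific). The cycle counts are the ones measured above; the 10% loop share is a made-up figure purely for illustration, since nothing in this thread measured it, and the function names are hypothetical.

```python
def loop_time_saved(old_cycles, new_cycles):
    """Fraction of the loop's own running time removed by the change."""
    return (old_cycles - new_cycles) / old_cycles


def overall_speedup(loop_share, old_cycles, new_cycles):
    """Amdahl-style whole-program speedup.

    loop_share is the fraction of total run time originally spent in the
    loop (an assumed figure, not something measured in this thread).
    """
    remaining = (1.0 - loop_share) + loop_share * new_cycles / old_cycles
    return 1.0 / remaining


if __name__ == "__main__":
    # Cycles per iteration as timed above, with everything in cache.
    for old, new in ((9, 6), (6, 5)):
        saved = loop_time_saved(old, new)
        print(f"{old}->{new}: loop itself is {saved:.0%} shorter")

    # Hypothetical: assume the loop is 10% of the total run time.
    share = 0.10
    print(f"9->6 overall: {overall_speedup(share, 9, 6):.3f}x")
    print(f"9->5 overall: {overall_speedup(share, 9, 5):.3f}x")
```

Under that assumed 10% share, 9->6 is worth only about 3.4% overall, and the extra step down to 5 adds roughly 1.2% on top, so if the first change got lost in the noise the second one almost certainly will too. In real use the loop may also not run entirely from cache, which this sketch ignores.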