25 October 2013, 01:26 | #1 |
Registered User
Join Date: Oct 2013
Location: Hamburg
Posts: 70
|
Loop optimization + cycle counts
Hey y'all,
I am optimizing a loop for a copper-chunky effect. I've read the 68000 optimization thread and picked up some ideas here and there. So far, I have sped up my first version: Code:
; Plot a row of copper colors (doubled, i.e. 2 scanlines).
;
; Uses $YYyyXXxx as the internal texture table offset;
; this is converted to ($YYXX)&$7ffe when plotting
; (texture is 128*128 words=$8000 bytes).
;
; d6: texture table offset
; d1: step offset in texture table
; a1: copper row1 (points to: $0rgb $0182 $0rgb $0182 $0rgb...)
; a2: copper row2
;
    move.l  d6,d0               ; offset = offset0
    rept    cc_COLS             ; cycles
    move.l  d0,d2               ; 8
    and.l   #$7f00fe00,d2       ; 16
    lsr.w   #8,d2               ; 22 (6+2*8)
    move.w  d2,d5               ; 4
    swap    d2                  ; 4
    or.w    d2,d5               ; 4
    move.w  (a0,d5.w),(a1)      ; 14 -- plot
    move.w  (a1),(a2)           ; 12
    add.l   d1,d0               ; 6 -- increment offset
    lea     4(a1),a1            ; 8? -- copper += 4
    lea     4(a2),a2            ; 8?
    endr                        ; == 106
Code:
    move.l  d6,(a3)             ; offset = offset0
    rept    cc_COLS             ; cycles
    movep.w 0(a3),d5            ; 16
    and.w   #$7ffe,d5           ; 8
    move.w  (a0,d5.w),(a1)      ; 14 -- plot
    move.w  (a1),(a2)           ; 12
    add.l   d1,(a3)             ; 16 -- increment offset
    lea     4(a1),a1            ; 8? -- copper += 4
    lea     4(a2),a2            ; 8?
    endr                        ; == 82
At the moment, I cannot see an obvious way to optimize further, but I am sure there is one.
Also, I would be interested to know whether my cycle counts are correct. Going from 106 to 82 cycles in the tight loop should bring the total raster time down to about 77%, but I only got from 255 to 216 raster lines (85%). Of course, that may be caused by entirely different effects. The test program does not perform any other tasks and interrupts are disabled. I'm using Blueberry's startup code.
(Also, »Superoriginal« by Supergroup does 40x96 with a similar setup... Maybe the setup is all wrong and I should use tables?)
So, any hints or suggestions would be greatly appreciated!
Cheers, losso
PS: For my first post, I have to add: This place rocks! Without it, I wouldn't even have considered getting into Amiga coding again after 20 years. Thanks to Blueberry, britelite, leffmann, Photon, phx, pmc, StingRay, TheDarkCoder, Toni W. and all of you! |
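As a quick check of the arithmetic in the post above, the per-instruction counts annotated in the two listings can be tallied and compared with the measured raster lines. This is only a sketch in Python: the variable names are made up for illustration, and the numbers are copied from the listing comments as the poster's own estimates, not verified against the 68000 timing tables.

```python
# Cycle counts exactly as annotated in the two listings (poster's estimates).
v1_cycles = [8, 16, 22, 4, 4, 4, 14, 12, 6, 8, 8]
v2_cycles = [16, 8, 14, 12, 16, 8, 8]

v1_total = sum(v1_cycles)
v2_total = sum(v2_cycles)

# Expected speed-up from the annotated counts vs. the measured raster lines.
expected_ratio = v2_total / v1_total
measured_ratio = 216 / 255

print(f"v1: {v1_total} cycles, v2: {v2_total} cycles")
print(f"expected: {expected_ratio:.1%}, measured: {measured_ratio:.1%}")
```

The totals match the 106 and 82 given in the listings. The gap between the expected and measured ratios may mean that some of the annotated counts are off, or that part of the raster time is spent outside the unrolled loop.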
25 October 2013, 02:04 | #2 |
2 contact me: email only!
Join Date: May 2001
Location: Auckland / New Zealand
Posts: 3,187
|
One small thing is to move the $7ffe value into a register. Probably 2 or 4 cycles saved there. Not much but better than nothing!
You are also never using a2 other than to copy one value from a1 into it. It might be quicker (haven't checked cycle counts) to move the value into (a2) for the first loop, 4(a2) for the second, 8(a2) for the third etc rather than into (a2) and doing a lea 4(a2),a2 every loop. Also remember that movep isn't available on all CPUs. |
25 October 2013, 09:21 | #3 |
Registered User
Join Date: Feb 2010
Location: Espoo / Finland
Posts: 821
|
In Superoriginal the effects are actually 80x96, with the pixels spread across two scanlines (even pixels on first scanline, odd pixels on second). So the resolution is basically 40x192.
|
25 October 2013, 09:38 | #4 |
WinUAE developer
Join Date: Aug 2001
Location: Hämeenlinna/Finland
Age: 49
Posts: 26,574
|
I'd use ADDQ instead of LEA. Both have the same total cycle count, but ADDQ/SUBQ has 4 idle cycles (the same as 1 memory cycle) and 1 opcode fetch, while LEA has 1 opcode fetch and 1 extension word fetch. ADDQ causes less bus activity, so it can be faster during heavy DMA activity (and it is also 1 word shorter).
(Not sure if MOVEP is exactly emulated, don't remember if it was logic analyzer confirmed or not..) |
25 October 2013, 09:44 | #5 | |
Registered User
Join Date: Feb 2010
Location: Espoo / Finland
Posts: 821
|
Quote:
But anyway, some small changes (some of what Codetapper suggested) Code:
    move.w  #$7ffe,d7
    move.l  d6,(a3)             ; offset = offset0
off set     0
    rept    cc_COLS
    movep.w 0(a3),d5
    and.w   d7,d5
    move.w  (a0,d5.w),d4
    move.w  d4,off(a1)
    move.w  d4,off+nextline(a1)
    add.l   d1,(a3)
off set     off+4
    endr
There are also other possible improvements, depending on what kind of effect you're doing. |
|
25 October 2013, 10:15 | #6 |
Registered User
Join Date: Oct 2013
Location: Hamburg
Posts: 70
|
Fantastic, thanks everyone! Now I am down to 190 raster lines with all the improvements above, which is fast enough to do 26x64 resolution in 1 frame and is slowly getting into the "good enough" zone for me.
It's for a plain old rotozoomer, by the way (forgot to mention that). The timing should improve a bit more as soon as I narrow the display/DMA window down to the exact area displayed, right? Currently I have it set up as a 320x256 1-bitplane screen. Will try that later and post an update… |
25 October 2013, 12:53 | #7 |
Registered User
Join Date: Oct 2013
Location: Hamburg
Posts: 70
|
I'm still curious about the instruction cycle counts. I am not sure, for example, if I interpret the tables on this page correctly.
Let's say I want to count the cycles for: Code:
    move.w  (a0,d5.w),d4
Code:
Effective Address Calculation Times:
  d(An,ix) = 10 address calculation cycles for word access
             2 bus read cycles
             0 bus write cycles
Move Byte and Word Instruction Execution Times:
  d(An,ix) to dN = 14 execution cycles
                   3 bus read cycles
                   0 bus write cycles
Is the total 14 cycles, or is it 24 cycles (10 + 14) *including* 5 bus read cycles? |
25 October 2013, 12:58 | #8 |
WinUAE developer
Join Date: Aug 2001
Location: Hämeenlinna/Finland
Age: 49
Posts: 26,574
|
MOVE tables have effective address calculation included. (There is no "+ add effective address calculation time" at the bottom)
d(An,ix) to dN = total 14 cycles. 3 memory fetches and 2 internal execution (bus is idle) cycles = 3 * 4 + 2 = 14. |
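The rule of thumb in the reply above (4 cycles per memory access, plus any cycles where the bus is idle) can be written as a small calculator. This is a sketch in Python; the function name `cycles` is hypothetical, and the access counts for ADDQ and LEA are taken from the earlier reply about bus activity.

```python
def cycles(bus_accesses: int, idle_cycles: int = 0) -> int:
    """Total 68000 cycles: 4 per memory access plus bus-idle cycles."""
    return 4 * bus_accesses + idle_cycles

# move.w d(An,ix),Dn: 3 memory fetches + 2 idle cycles
move_indexed = cycles(3, 2)
# addq.l #4,a1: 1 opcode fetch + 4 idle cycles
addq = cycles(1, 4)
# lea 4(a1),a1: 1 opcode fetch + 1 extension word fetch, no idle cycles
lea = cycles(2)

print(move_indexed, addq, lea)  # prints "14 8 8"
```

ADDQ and LEA come out at the same total, but ADDQ reaches it with a single bus access, which is why it can be the better choice when DMA is competing for the bus.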
05 November 2013, 11:50 | #9 |
Registered User
Join Date: Oct 2013
Location: Hamburg
Posts: 70
|
I see, thanks for clarifying that!
As for the effect… I think I will have to rethink my overall approach if I want to achieve a larger resolution. It seems wasteful to me that the blitter isn’t doing anything. Maybe I could write the pixels out with (aX)+ moves instead of d(aX) moves and have the blitter copy the pixels into the copperlist? Would save 4 cycles per move in each iteration, if I’m counting right. I will try to resist the urge to dissect existing rotozoomers and see where this approach leads me… |