English Amiga Board


Old 25 October 2013, 01:26   #1
losso
Registered User
 
 
Join Date: Oct 2013
Location: Hamburg
Posts: 70
Loop optimization + cycle counts

Hey y'all,

I am optimizing a loop for a copper-chunky effect. I've read the 68000 optimization thread and picked up some ideas here and there. So far, I have sped up my first version:

Code:
; Plot a row of copper colors (doubled, i.e. 2 scanlines).
;
; Uses $YYyyXXxx as the internal texture table offset;
; this is converted to ($YYXX)&$7ffe when plotting
; (texture is 128*128 words=$8000 bytes).
;
; d6: texture table offset
; d1: step offset in texture table
; a1: copper row1 (points to: $0rgb $0182 $0rgb $0182 $0rgb...)
; a2: copper row2
; 
    move.l   d6,d0          ; offset = offset0
    rept cc_COLS
                            ; cycles
    move.l   d0,d2          ;  8
    and.l    #$7f00fe00,d2  ; 16
    lsr.w    #8,d2          ; 22 (6+2*8)
    move.w   d2,d5          ;  4
    swap     d2             ;  4
    or.w     d2,d5          ;  4
    move.w   (a0,d5.w),(a1) ; 14  -- plot
    move.w   (a1),(a2)      ; 12
    add.l    d1,d0          ;  6  -- increment offset
    lea      4(a1),a1       ;  8? -- copper += 4
    lea      4(a2),a2       ;  8?
    endr                    ; == 106
...by some cycles:

Code:
    move.l   d6,(a3)         ; offset = offset0
    rept cc_COLS
                             ; cycles
    movep.w  0(a3),d5        ; 16
    and.w    #$7ffe,d5       ;  8
    move.w   (a0,d5.w),(a1)  ; 14  -- plot
    move.w   (a1),(a2)       ; 12
    add.l    d1,(a3)         ; 16  -- increment offset
    lea      4(a1),a1        ;  8? -- copper += 4
    lea      4(a2),a2        ;  8?
    endr                     ; == 82
(In my test effect, it's 216 vs 255 raster lines, with 20 columns and 50 rows, WinUAE/A500 config.)
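To sanity-check that both loop bodies compute the same texture index, here is a small Python model (not Amiga code; the function names are made up for illustration). It assumes the long at (a3) sits in memory big-endian as the bytes YY yy XX xx, so that movep.w 0(a3),d5 picks up bytes 0 and 2.

```python
def index_v1(offset):
    """Model of the and.l / lsr.w / swap / or.w sequence (first listing)."""
    d2 = offset & 0x7F00FE00
    low = (d2 & 0xFFFF) >> 8         # lsr.w #8,d2 leaves $00XX in the low word
    high = (d2 >> 16) & 0xFFFF       # swap d2 brings $YY00 into the low word
    return high | low                # or.w d2,d5


def index_v2(offset):
    """Model of movep.w 0(a3),d5 followed by and.w #$7ffe,d5 (second listing)."""
    mem = offset.to_bytes(4, "big")  # move.l d6,(a3): bytes YY yy XX xx
    d5 = (mem[0] << 8) | mem[2]      # movep.w reads every other byte
    return d5 & 0x7FFE


if __name__ == "__main__":
    for value in (0x00000000, 0x12345678, 0x7F00FE00, 0xFFFFFFFF):
        print(f"${value:08x} -> ${index_v1(value):04x} / ${index_v2(value):04x}")
```

Both functions should agree for every 32-bit offset, which is why the shorter loop is a drop-in replacement.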

At the moment, I cannot see an obvious way to optimize further, but I am sure there is one.

Also, I would be interested if my cycle counts are correct. Going from 106 to 82 cycles in the tight loop should lead to about 77% of the total raster time, but I only got from 255 to 216 (85%). Of course, that may be caused by entirely different effects. The test program does not perform any other tasks and interrupts are disabled. I'm using Blueberry's startup code.
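The mismatch can be spelled out with a few lines of Python. This only restates the numbers from this post; the variable names are made up, and it assumes the whole frame time would scale with the per-column cycle count.

```python
old_cycles, new_cycles = 106, 82   # per-column counts from the two listings
old_lines, new_lines = 255, 216    # measured raster lines

expected_ratio = new_cycles / old_cycles
measured_ratio = new_lines / old_lines
predicted_lines = old_lines * expected_ratio

print(f"expected {expected_ratio:.1%}, measured {measured_ratio:.1%}")
print(f"predicted {predicted_lines:.1f} lines, measured {new_lines}")
```

So the measurement is about 19 raster lines slower than the counts predict, which points at either the counts themselves or at time spent outside the instructions (e.g. bus contention).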

(Also, »Superoriginal« by Supergroup does 40x96 with a similar setup... Maybe the setup is all wrong and I should use tables?)

So, any hints or suggestions would be greatly appreciated!

Cheers,

losso

PS: For my first post, I have to add: This place rocks! Without it, I wouldn't even have considered getting into Amiga coding again after 20 years. Thanks to Blueberry, britelite, leffmann, Photon, phx, pmc, StingRay, TheDarkCoder, Toni W. and all of you!
Old 25 October 2013, 02:04   #2
Codetapper
2 contact me: email only!
 
 
Join Date: May 2001
Location: Auckland / New Zealand
Posts: 3,187
One small thing is to move the $7ffe value into a register. Probably 2 or 4 cycles saved there. Not much but better than nothing!

You are also never using a2 other than to copy one value from a1 into it. It might be quicker (haven't checked cycle counts) to move the value into (a2) for the first loop, 4(a2) for the second, 8(a2) for the third etc rather than into (a2) and doing a lea 4(a2),a2 every loop.

Also remember that movep isn't available on all CPUs.
Old 25 October 2013, 09:21   #3
britelite
Registered User
 
Join Date: Feb 2010
Location: Espoo / Finland
Posts: 820
Quote:
Originally Posted by losso View Post
(Also, »Superoriginal« by Supergroup does 40x96 with a similar setup... Maybe the setup is all wrong and I should use tables?)
In Superoriginal the effects are actually 80x96, with the pixels spread across two scanlines (even pixels on first scanline, odd pixels on second). So the resolution is basically 40x192.
Old 25 October 2013, 09:38   #4
Toni Wilen
WinUAE developer
 
Join Date: Aug 2001
Location: Hämeenlinna/Finland
Age: 49
Posts: 26,545
I'd use ADDQ instead of LEA. Both have the same total cycle count, but ADDQ/SUBQ has 1 opcode fetch plus 4 idle cycles (the same length as 1 memory cycle), whereas LEA has 1 opcode fetch and 1 extension word fetch. With less bus activity, ADDQ can be faster during heavy DMA activity (and it is also 1 word shorter).
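As a rough Python model of that comparison: the table below holds (instruction words, total cycles, bus reads, bus writes) as read from the standard 68000 timing tables, and the helper name is made up for the example. It assumes each bus access takes 4 cycles when there is no contention.

```python
# (instruction words, total cycles, bus reads, bus writes)
INCREMENT = {
    "lea 4(a1),a1": (2, 8, 2, 0),
    "addq.l #4,a1": (1, 8, 1, 0),
}


def bus_profile(instruction):
    """Split an instruction's total time into bus cycles and idle cycles."""
    words, total, reads, writes = INCREMENT[instruction]
    bus = 4 * (reads + writes)
    return {"words": words, "total": total, "bus": bus, "idle": total - bus}


if __name__ == "__main__":
    for name in INCREMENT:
        print(name, bus_profile(name))
```

Same total time, but the ADDQ form needs the bus for only half of it, so it has less to lose when DMA is busy.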

(Not sure if MOVEP is exactly emulated, don't remember if it was logic analyzer confirmed or not..)
Old 25 October 2013, 09:44   #5
britelite
Registered User
 
Join Date: Feb 2010
Location: Espoo / Finland
Posts: 820
Quote:
Originally Posted by losso View Post
Code:
    move.l   d6,(a3)         ; offset = offset0
    rept     cc_COLS
                             ; cycles
    movep.w  0(a3),d5        ; 16
    and.w    #$7ffe,d5       ;  8
    move.w   (a0,d5.w),(a1)  ; 14  -- plot
    move.w   (a1),(a2)       ; 12
    add.l    d1,(a3)         ; 16  -- increment offset
    lea      4(a1),a1        ;  8? -- copper += 4
    lea      4(a2),a2        ;  8?
    endr                     ; == 82
First of all, I think your cycle counts are off. The tables I have show higher counts.
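For reference, here is the quoted loop re-added in Python with per-instruction figures as read from the standard 68000 timing tables. This is one reading of those tables under the assumption of no DMA contention, not numbers given in this post.

```python
# (instruction, cycles) for one column of the quoted loop
LOOP_V2 = [
    ("movep.w 0(a3),d5", 16),
    ("and.w #$7ffe,d5", 8),
    ("move.w (a0,d5.w),(a1)", 18),  # listed as 14 in the quote
    ("move.w (a1),(a2)", 12),
    ("add.l d1,(a3)", 20),          # listed as 16 in the quote
    ("lea 4(a1),a1", 8),
    ("lea 4(a2),a2", 8),
]

total = sum(cycles for _, cycles in LOOP_V2)
print(total)
```

Under that reading the loop comes to 90 cycles per column rather than 82, with the difference coming from the indexed move to memory and the long add to memory.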

But anyway, some small changes (some of what Codetapper suggested)
Code:
    move.w   #$7ffe,d7
    move.l   d6,(a3)         ; offset = offset0
off set 0
    rept     cc_COLS
    movep.w  0(a3),d5
    and.w    d7,d5
    move.w   (a0,d5.w),d4
    move.w   d4,off(a1)
    move.w   d4,off+nextline(a1)
    add.l    d1,(a3)
off set off+4
    endr
Where nextline is the offset between scanline 1 and 2.
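As a rough illustration of what the off set / rept combination expands to, the Python sketch below lists the displacement pairs the assembler ends up with. The function name and the sample nextline value are made up for the example; the real nextline depends on the copper list layout.

```python
def unrolled_offsets(cols, nextline):
    """Return (row1, row2) displacements, one pair per unrolled column."""
    pairs = [(off, off + nextline) for off in range(0, cols * 4, 4)]
    # d16(An) displacements are signed 16-bit, so they must stay in range.
    for row1, row2 in pairs:
        if not (-32768 <= row1 <= 32767 and -32768 <= row2 <= 32767):
            raise ValueError("displacement does not fit d16(An)")
    return pairs


if __name__ == "__main__":
    # nextline = 100 is an arbitrary sample value
    for row1, row2 in unrolled_offsets(3, 100):
        print(f"move.w d4,{row1}(a1)   /   move.w d4,{row2}(a1)")
```

Because every displacement is baked into the instruction, neither pointer has to be incremented inside the loop.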

There are also other possible improvements, depending on what kind of effect you're doing.
Old 25 October 2013, 10:15   #6
losso
Registered User
 
 
Join Date: Oct 2013
Location: Hamburg
Posts: 70
Fantastic, thanks everyone! Now I am down to 190 raster lines with all the improvements above — fast enough to make it in 26x64 resolution in 1 frame, which is slowly in the "good enough" zone for me.

It's for a plain old rotozoomer, by the way (forgot to mention that).

The timing should improve a bit more as soon as I narrow the display/DMA window down to the exact area displayed, right? Currently I have it set up as a 320x256 1-bitplane screen. Will try that later and post an update…
Old 25 October 2013, 12:53   #7
losso
Registered User
 
 
Join Date: Oct 2013
Location: Hamburg
Posts: 70
I'm still curious about the instruction cycle counts. I am not sure, for example, if I interpret the tables on this page correctly.

Let's say I want to count the cycles for:

Code:
move.w (a0,d5.w),d4
Looking at the table at that page, I get:

Code:
Effective Address Calculation Times:
d(An,ix) = 10 address calculation cycles for word access
            2 bus read cycles
            0 bus write cycles

Move Byte and Word Instruction Execution Times:
d(An,ix) to dN = 14 execution cycles
                  3 bus read cycles
                  0 bus write cycles
Does this mean the instruction clocks in at 10+2+14+3 = 29 cycles?
Or is it 24 cycles *including* 5 bus read cycles?
Old 25 October 2013, 12:58   #8
Toni Wilen
WinUAE developer
 
Join Date: Aug 2001
Location: Hämeenlinna/Finland
Age: 49
Posts: 26,545
MOVE tables have effective address calculation included. (There is no "+ add effective address calculation time" at the bottom)

d(An,ix) to dN = total 14 cycles. 3 memory fetches and 2 internal execution (bus is idle) cycles = 3 * 4 + 2 = 14.
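The same reading can be written down as a small Python helper. It assumes the usual "total(reads/writes)" notation of the timing tables and 4 cycles per bus access; the function name is made up for the example.

```python
import re


def split_timing(entry):
    """Split a 'total(reads/writes)' table entry into bus and idle cycles."""
    match = re.fullmatch(r"(\d+)\((\d+)/(\d+)\)", entry)
    if match is None:
        raise ValueError(f"not a timing entry: {entry!r}")
    total, reads, writes = map(int, match.groups())
    bus = 4 * (reads + writes)       # each bus access takes 4 cycles
    return {"total": total, "bus": bus, "idle": total - bus}


if __name__ == "__main__":
    # move.w (a0,d5.w),d4 is listed as 14(3/0)
    print(split_timing("14(3/0)"))
```

The bus cycles are already part of the total, so nothing from the effective address table gets added on top for MOVE.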
Old 05 November 2013, 11:50   #9
losso
Registered User
 
 
Join Date: Oct 2013
Location: Hamburg
Posts: 70
I see, thanks for clarifying that!

As for the effect… I think I will have to rethink my overall approach if I want to achieve a larger resolution. It seems wasteful to me that the blitter isn’t doing anything. Maybe I could write the pixels out with (aX)+ moves instead of d(aX) moves and have the blitter copy the pixels into the copperlist? Would save 4 cycles per move in each iteration, if I’m counting right.
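A quick Python check of that estimate, using the timing-table figures for a word move from a data register (8 cycles to (An)+, 12 cycles to d16(An)). The helper name is made up, and the sketch assumes one such move per pixel and ignores whatever the blitter copy itself costs in bus time.

```python
# move.w Dn,<ea> timings from the 68000 tables
MOVE_W_FROM_DN = {"(An)+": 8, "d16(An)": 12}


def cycles_saved(moves_per_frame):
    """CPU cycles saved per frame by switching d16(An) writes to (An)+."""
    per_move = MOVE_W_FROM_DN["d16(An)"] - MOVE_W_FROM_DN["(An)+"]
    return moves_per_frame * per_move


if __name__ == "__main__":
    # 26x64 is the resolution mentioned earlier in the thread
    print(cycles_saved(26 * 64))
```

So the count of 4 cycles per move holds; whether it is a net win depends on how much bus time the blitter takes away from the CPU.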

I will try to resist the urge to dissect existing rotozoomers and see where this approach leads me…
 

