#1 |
old chunk of coal
![]() Join Date: Nov 2011
Location: Hungary
Posts: 1,059
|
Build improvements for 68060
I'm trying to speed up the column rendering functions in Build, as these take up most of the rendering time.
For walls vlineasm4 is used to draw 4 textured columns at once to a longword aligned place in the framebuffer. The C version looks something like this: Code:
void vlineasm4(int cnt, char *p)
{
    unsigned char * const pal[4] = {palookupoffse[0], palookupoffse[1], palookupoffse[2], palookupoffse[3]};
    unsigned char * const buf[4] = {bufplce[0], bufplce[1], bufplce[2], bufplce[3]};
    const int vinc[4] = {vince[0], vince[1], vince[2], vince[3]};
    unsigned int vplc[4] = {vplce[0], vplce[1], vplce[2], vplce[3]};
    const int logy = glogy, ourbpl = bpl;
    int *plong = (int *)p;

    while (cnt--)
    {
        *plong = (pal[0][buf[0][vplc[0]>>logy]] << 24);
        *plong |= (pal[1][buf[1][vplc[1]>>logy]] << 16);
        *plong |= (pal[2][buf[2][vplc[2]>>logy]] << 8);
        *plong |= (pal[3][buf[3][vplc[3]>>logy]]);
        vplc[0] += vinc[0];
        vplc[1] += vinc[1];
        vplc[2] += vinc[2];
        vplc[3] += vinc[3];
        plong += (ourbpl/4);
    }

    Bmemcpy(&vplce[0], &vplc[0], sizeof(unsigned int) * 4);
}
This is the asm version I took from the original 2003 Duke3D port: Code:
_vlineasm4:
        movem.l d2-d7/a2-a6,-(sp)
        lea     _ylookup,a0
        move.l  (a0,d0.l*4),d0
        move.l  d0,a0
        add.l   d1,a0
        neg.l   d0
        lea     _bufplce,a1
        lea     _palookupoffse,a2
        lea     _vince,a4
        move.l  (a4),d3
        move.l  4(a4),d4
        move.l  8(a4),d5
        move.l  12(a4),d6
        lea     _vplce,a3
        move.l  12(a3),a6
        move.l  8(a3),a5
        move.l  4(a3),a4
        move.l  (a3),a3
        move.l  mach3_al(pc),d7
        move.l  a3,d1
        lsr.l   d7,d1
        bra.b   .loop
        cnop    0,16
.loop
        move.b  ([a1],d1.w),d1
        add.l   d3,a3
        move.b  ([a2],d1.w),d2
        move.l  a4,d1
        lsl.l   #8,d2
        lsr.l   d7,d1
        move.b  ([4,a1],d1.w),d1
        add.l   d4,a4
        move.b  ([4,a2],d1.w),d2
        move.l  a5,d1
        lsl.l   #8,d2
        lsr.l   d7,d1
        move.b  ([8,a1],d1.w),d1
        add.l   d5,a5
        move.b  ([8,a2],d1.w),d2
        move.l  a6,d1
        lsl.l   #8,d2
        lsr.l   d7,d1
        move.b  ([12,a1],d1.w),d1
        add.l   d6,a6
        move.b  ([12,a2],d1.w),d2
        move.l  a3,d1
        move.l  d2,(a0,d0.l)
        lsr.l   d7,d1
        add.l   _fixchain(pc),d0
        bcc.b   .loop
.end
        lea     _vplce,a2
        move.l  a3,(a2)
        move.l  a4,4(a2)
        move.l  a5,8(a2)
        move.l  a6,12(a2)
        movem.l (sp)+,a2-a6/d2-d7
        rts
This was described as 68030+ in the source file, so I was wondering: is there a way to improve it for the 68060 to avoid stalls and other problems?
#2 |
Registered User
![]() Join Date: Feb 2017
Location: Denmark
Posts: 590
|
Have you measured whether it's actually faster to render 4 columns at a time rather than simply doing 4x1? What you lose on doing more writes could be won by better cache utilization, simpler setup and not having to use indirect addressing modes. "vlineasm1" obviously also looks much more approachable for tuning.
EDIT: What I mean is something like this. Looking at what I think is the latest code for vlineasm1, I think this shaves 3 cycles off the current version by reducing change/use register stalls: Code:
        move.l  d2,d4
        lsr.l   d3,d4
        bra.b   .loop
        cnop    0,16
.loop
        move.b  (a1,d4.l),d5
        add.l   d0,d2
        move.l  d2,d4
        lsr.l   d3,d4
        ; d5 stall
        move.b  (a0,d5.l),(a2)  ; write
        add.l   d6,a2
        subq.l  #1,d1
        bne.b   .loop
The next idea would be to change move.b (a0,d5.l),(a2) to move to e.g. d7 and fit in the write from the previous iteration in one of the wasted slots (like the d5 stall).
Last edited by paraj; 27 November 2022 at 23:34.
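For anyone following along in C, the single-column loop being tuned here corresponds roughly to the sketch below. The register mapping is inferred from the asm (a1 = texture column, a0 = palette lookup, a2 = destination, d2 = position, d0 = step, d3 = shift, d6 = bytes per line, d1 = count). The name `vline1_sketch` and its parameter list are made up for illustration and are not the real Build signature.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical C equivalent of the single-column loop (illustrative names). */
static uint32_t vline1_sketch(const uint8_t *pal, const uint8_t *buf,
                              uint32_t vplc, uint32_t vinc, unsigned logy,
                              uint8_t *dest, int bpl, int cnt)
{
    assert(cnt > 0);
    do {
        uint8_t texel = buf[vplc >> logy];  /* move.b (a1,d4.l),d5   */
        *dest = pal[texel];                 /* move.b (a0,d5.l),(a2) */
        vplc += vinc;                       /* add.l  d0,d2          */
        dest += bpl;                        /* add.l  d6,a2          */
    } while (--cnt);
    return vplc;                            /* position after the last pixel */
}
```

Each pixel costs two dependent loads and one store, which is why the discussion keeps coming back to memory accesses rather than instruction count.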
#3 |
old chunk of coal
![]() Join Date: Nov 2011
Location: Hungary
Posts: 1,059
|
Thank you, that's a very good question! At one point I did measure the vlineasm1 against the vlineasm4 and the framerate was the same. Since I didn't do any real timings I never questioned whether the 4x1 version is worth pursuing - until now. I'll make some measurements for all 3 versions to see what's going on (vlineasm4, vlineasm1 before and after).
Edit: This is how many milliseconds the game spends in wallscan minus the initial setup in 320x200 at the start of each episode: Code:
        vlineasm4  vlineasm1
E1M1    13-14      13-14
E2M1    15-16      14-15
E3M1    11         10
E4M1    14-15      13-14
E6M1    9-10       9-10
What would the d7 version look like? Something like this? Code:
        move.b  (a0,d5.l),d7
        move.b  d7,(a2)         ; write
        add.l   d6,a2
        subq.l  #1,d1
        bne.b   .loop
Last edited by BSzili; 28 November 2022 at 10:14. Reason: added measurements
#4 |
Registered User
![]() Join Date: Feb 2017
Location: Denmark
Posts: 590
|
It's probably completely memory bound then, and not spending many cycles doing anything else. Also noticed I paired the instructions wrong (the first move.b can't pair with the add.l because it's 32-bits long and therefore the add hasn't been fetched yet).
The "d7 idea" (which is probably not worth it given my "new" version didn't improve anything) is to work around the 060's limitation of only being able to do one memory access per cycle. I.e. Code:
move.b (a0,d5.l),(a2) Code:
        move.b  (a0,d5.l),d7
        move.b  d7,(a2)
So the idea is to have the inner loop write the pixel from the previous iteration while preparing the next one, so something like this untested code (note: you also need to preserve d7): Code:
        ; Calc pixel for first iteration
        move.l  d2,d4
        lsr.l   d3,d4
        move.b  (a1,d4.l),d5
        add.l   d0,d2
        move.b  (a0,d5.l),d7
        subq.l  #1,d1
        beq.b   .last
        ; Prepare d4 for loop
        move.l  d2,d4
        lsr.l   d3,d4
        bra.b   .loop
        cnop    0,16
.loop
        move.b  (a1,d4.l),d5
        add.l   d0,d2
        move.b  d7,(a2)
        move.l  d2,d4
        lsr.l   d3,d4
        move.b  (a0,d5.l),d7
        add.l   d6,a2           ; lea (a2,d6.l),a2 might be better
        subq.l  #1,d1
        bne.b   .loop
.last
        move.b  d7,(a2)
.end
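The structure of that loop (prologue, steady state, epilogue) may be easier to see in C. This is only a sketch of the same software-pipelining idea with made-up names (`vline1_pipelined`, `pending`); a C compiler is free to schedule it differently, so it illustrates the shape rather than guaranteeing the same instruction order.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical pipelined column loop: the store of pixel N is issued while
   the lookups for pixel N+1 are already in flight. */
static uint32_t vline1_pipelined(const uint8_t *pal, const uint8_t *buf,
                                 uint32_t vplc, uint32_t vinc, unsigned logy,
                                 uint8_t *dest, int bpl, int cnt)
{
    assert(cnt > 0);

    /* Prologue: compute the first pixel but do not store it yet (d7). */
    uint8_t pending = pal[buf[vplc >> logy]];
    vplc += vinc;

    while (--cnt) {
        uint8_t texel = buf[vplc >> logy];  /* texel for the next pixel     */
        vplc += vinc;
        *dest = pending;                    /* store the previous pixel     */
        pending = pal[texel];               /* remap the next pixel         */
        dest += bpl;
    }

    *dest = pending;                        /* epilogue: store last pixel   */
    return vplc;
}
```

The output is identical to the straightforward loop; only the order of loads and stores within an iteration changes.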
#5 | |
Registered User
![]() Join Date: Sep 2019
Location: Finland
Posts: 278
|
Quote:
![]() |
|
#6 | ||
Registered User
![]() Join Date: Feb 2017
Location: Denmark
Posts: 590
|
Quote:
![]() Quote:
Code:
        add.l   d0,d2
        move.b  d7,(a2)
        move.b  (a1,d4.l),d5
        move.l  d2,d4
        lsr.l   d3,d4
        add.l   d6,a2
        move.b  (a0,d5.l),d7
        subq.l  #1,d1
        bne.b   .loop
Last edited by paraj; 03 December 2022 at 12:28. Reason: Strike out wrong info
||
#7 |
old chunk of coal
![]() Join Date: Nov 2011
Location: Hungary
Posts: 1,059
|
Thanks, I tried the latest variant and the speed is pretty much the same. The biggest saving was switching to the single column mode, probably because it does fewer memory accesses overall.
|
#8 |
Registered User
![]() Join Date: Feb 2017
Location: Denmark
Posts: 590
|
Yeah, at 320x200 with no overdraw, 4 cycles/pixel would only give ~0.5% (50MHz).
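As a back-of-the-envelope check (one reading of the estimate, not necessarily how it was derived): 320x200 pixels at 4 cycles each is 256,000 cycles per frame, which is about 5.1 ms on a 50 MHz CPU, or roughly 0.5% of the cycles available in one second. The helper name `frame_cost_fraction` is made up for this sketch.

```c
#include <assert.h>

/* Cycles spent on one full frame at `cycles_per_pixel`, expressed as a
   fraction of the cycles available in one second at `clock_hz`. */
static double frame_cost_fraction(int width, int height,
                                  int cycles_per_pixel, double clock_hz)
{
    assert(clock_hz > 0.0);
    return (double)width * height * cycles_per_pixel / clock_hz;
}
```

For example, `frame_cost_fraction(320, 200, 4, 50e6)` evaluates 256000 / 50000000, i.e. about 0.005.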
If most of the time is spent in the wallscan function it might be worth it to write the function completely in asm to avoid unnecessary reloads of variables etc. Even if you don't go that route it's probably worth looking into reducing memory accesses/reducing cache pollution. I'd ditch the ylookup table (likely a pessimization on 060) and use local scalar variables for everything (maybe the compiler is smart enough to figure out that e.g. the use of the global vince array can be replaced by a local variable, but I wouldn't count on it). The "ynice" optimization is probably also not worth it for 060. Copying the value of the globals used in the loop (globalshade, globalhoriz, etc.) to local const variables is also worth a try (probably just moves them to the stack, but that should be more cache friendly, and might allow other optimizations). |
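A minimal sketch of the "copy globals to locals" suggestion, assuming stand-in globals named `g_vince`, `g_vplce`, `g_logy` and `g_bpl` (hypothetical; the real Build declarations differ). Because the destination is a byte pointer, the compiler generally has to assume each store might modify a global, so reading the globals once into locals avoids reloads inside the loop.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-ins for engine globals; names and types are assumptions. */
static int32_t  g_vince;
static uint32_t g_vplce;
static int      g_logy, g_bpl;

static void draw_column_locals(const uint8_t *pal, const uint8_t *buf,
                               uint8_t *dest, int cnt)
{
    /* Read each global exactly once before the loop. */
    const uint32_t vinc = (uint32_t)g_vince;
    const int logy = g_logy, bpl = g_bpl;
    uint32_t vplc = g_vplce;

    assert(cnt > 0);
    do {
        *dest = pal[buf[vplc >> logy]];
        vplc += vinc;
        dest += bpl;
    } while (--cnt);

    g_vplce = vplc;  /* write the running position back once, after the loop */
}
```

This is the same pattern the C vlineasm4 in the first post already uses for its `vinc`/`vplc` arrays, applied to scalars.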
#9 |
old chunk of coal
![]() Join Date: Nov 2011
Location: Hungary
Posts: 1,059
|
For the first test I got rid of ylookup and stuffed everything into local variables:
http://franke.ms/cex/z/8KvrKP
This shaved off 1ms of the wall rendering time in every case: Code:
        vlineasm4  vlineasm1  wallscan
E1M1    13-14      13-14      12-13
E2M1    15-16      14-15      13-14
E3M1    11         10         9-10
E4M1    14-15      13-14      12-13
E6M1    9-10       9-10       9
#10 |
Registered User
![]() Join Date: Feb 2017
Location: Denmark
Posts: 590
|
Nice, that's actually better than I would have hoped for.
At this point it's probably better/easier to turn vlineasm1 into inline assembly. You only need the loop part (d3/d6 should just be loaded from local globalshiftval [and ditch setupvlineasm]/mybpl). There's no reason to do y2-y1-1 and then undo it in the assembly code either (meaning the check if it's 0 can be skipped, that's already checked in the C code). You might also get slightly better codegen by being more flexible in your assembly constraints (i.e. not requiring specific registers to be used).
#11 |
Registered User
![]() Join Date: Feb 2017
Location: Denmark
Posts: 590
|
And for non-nice x's
Code:
if (bufplce[0] >= tsizx)
{
    if (xnice == 0) bufplce[0] %= tsizx;
    else bufplce[0] &= tsizx;
}
Code:
do bufplce[0] -= tsizx; while (bufplce[0] >= tsizx);
For inlining with better register usage something like this (which probably has some wrong constraints and likely doesn't work out) is what I mean: Code:
int cnt = y2ve[0]-y1ve[0];
int ptemp,vtemp,dest=x+myframeoffset+mybpl*y1ve[0];
asm volatile (
    "   moveq #0,%[ptemp]\n\t"
    ".loop:\n\t"
    "   move.l %[v],%[vtemp]\n\t"
    "   lsr.l %[logy],%[vtemp]\n\t"
    "   move.b (%[buf],%[vtemp].l),%[ptemp]\n\t"
    "   add.l %[vinc],%[v]\n\t"
    "   move.b (%[pallookup],%[ptemp].l),(%[dest])\n\t"
    "   subq.l #1,%[cnt]\n\t"
    "   add.l %[mybpl],%[dest]\n\t"
    "   bne.b .loop\n\t"
    : [cnt] "+d" (cnt),         // 0
      [v] "+r" (myvplce),       // 1
      [dest] "+a" (dest),       // 2
      [ptemp] "=&d" (ptemp),    // 3
      [vtemp] "=&d" (vtemp)     // 4
    : [vinc] "r" (myvince),
      [pallookup] "a" (mypalookupoffse),
      [buf] "a" (mybufplce+mywaloff),
      [mybpl] "r" (mybpl),
      [logy] "d" (logy),
      "0" (cnt), "1" (myvplce), "2" (dest)
);
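To make the texture-wrap alternatives above concrete, here is a small sketch of the three ways to wrap a column index. The helper names are hypothetical, and `wrap_pot` takes the real size and masks with `size - 1`, which is an assumption about how the mask is prepared in the "nice" case. The subtract loop only pays off when the index overshoots by a small multiple of the size, which avoids a division that is comparatively slow on the 68060.

```c
#include <assert.h>
#include <stdint.h>

/* Generic wrap for any positive size, using a division. */
static int32_t wrap_mod(int32_t x, int32_t size)
{
    assert(size > 0 && x >= 0);
    return x % size;
}

/* Same result via repeated subtraction; cheap if x is only slightly >= size. */
static int32_t wrap_sub(int32_t x, int32_t size)
{
    assert(size > 0 && x >= 0);
    while (x >= size)
        x -= size;
    return x;
}

/* Power-of-two sizes only: a single AND with size - 1. */
static int32_t wrap_pot(int32_t x, int32_t size)
{
    assert(size > 0 && (size & (size - 1)) == 0);
    return x & (size - 1);
}
```

All three agree wherever their preconditions hold, so swapping one for another is purely a speed decision.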
#12 |
Moderator
![]() Join Date: Nov 2004
Location: Eksjö / Sweden
Posts: 5,397
|
paraj already contributed some good ideas; whether they translate to speed gains depends much more on what's cached than on the occasional stall from instruction ordering. If you can control what's in the cache when you start a loop, you have taken care of the biggest potential time sink. If you have a way of measuring cache misses, it is worth gold.
Likewise, reduce memory accesses wherever possible. Here, you look up values, hopefully sometimes getting some cache hits, write to increasing offsets with gaps which likely never get accessed again, and read a hopefully cached ![]() |
#13 |
old chunk of coal
![]() Join Date: Nov 2011
Location: Hungary
Posts: 1,059
|
I'll check these out later. I think the wall textures in Build games have POT dimensions for speed, so they should all be "nice", but sprites have arbitrary dimensions. Those are drawn using the maskwallscan function, so I'll try the xnice loop optimization there.
|
#14 |
Registered User
![]() Join Date: Feb 2017
Location: Denmark
Posts: 590
|
Have to correct some mistaken stuff I wrote above. I made some wrong conclusions from my experiments due to lack of understanding. koobo is absolutely right that the instructions are prefetched across correctly predicted loops. For completeness I've added how I now think the cycles should be counted (for the bulk of the loop (i.e. not first/last iteration) and assuming no cache misses) [attached since it's a wall of text].
As a consolation prize here's a stall free (I think) version of the loop. Would be limited by instruction fetch if there were really no cache misses: Code:
        move.b  0(a1,d4.l),d5
        add.l   d0,d2
        move.l  d2,d4
        lsr.l   d3,d4
        move.b  d7,(a2)
        add.l   d6,a2
        move.b  0(a0,d5.l),d7
        subq.l  #1,d1
#15 |
Registered User
![]() Join Date: Sep 2019
Location: Finland
Posts: 278
|
It's not straightforward to see by eye how the execution goes on the 060. I recently measured the execution time, did a small change, and measured again. Some speed-ups were found which I didn't really understand, but no matter, as long as there was an improvement.
#16 |
old chunk of coal
![]() Join Date: Nov 2011
Location: Hungary
Posts: 1,059
|
Thanks, I'll try this version as well. I'll look for a better place to benchmark it, where there are not a lot of floors/ceilings in the view. In the meantime I checked the NPOT tsizx loop, and I gained about 1 ms looking at a sprite from up close.
Last edited by BSzili; 03 December 2022 at 22:16. |
#17 |
old chunk of coal
![]() Join Date: Nov 2011
Location: Hungary
Posts: 1,059
|
The latest version of vlineasm1 is measurably faster in 320x400. Could this be applied to mvlineasm1 as well?
It's used for non-translucent sprites and masked walls, and becomes a heavy hitter when you get up close and personal with enemies. I cobbled this together, but I guess the changes will mess with the timings? Code:
        move.b  0(a1,d4.l),d5
        add.l   d0,d2
        move.l  d2,d4
        lsr.l   d3,d4
        cmp.b   #255,d7
        beq.b   .skip
        move.b  d7,(a2)
.skip
        add.l   d6,a2
        move.b  0(a0,d5.l),d7
        subq.l  #1,d1
#18 | |
Registered User
![]() Join Date: Feb 2017
Location: Denmark
Posts: 590
|
Quote:
Maybe something like this is faster: Code:
        move.l  d2,d3
        lsr.l   d7,d3
        add.l   d0,d2
        sub.l   d6,a2
.loop
        move.b  (a1,d3.l),d4
        add.l   d6,a2
        move.l  d2,d3
        lsr.l   d7,d3
        add.l   d0,d2
        cmp.b   #255,d4
        beq.b   .skip
        move.b  (a0,d4.l),(a2)
.skip
        subq.l  #1,d1
        bpl.b   .loop
|
#19 |
old chunk of coal
![]() Join Date: Nov 2011
Location: Hungary
Posts: 1,059
|
Oops. Entry 255 usually remains 255 in the various colormaps, but of course it's faster if transparent pixels are not remapped. Also Exhumed doesn't follow this convention. Thanks, I'll try this out later.
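In C terms, the point about where the transparency test goes looks roughly like this. It is a sketch with made-up names (`mvline1_sketch` is not the Build function): the raw texel is compared against 255 before the palette lookup, so transparent pixels cost one load instead of two and the result doesn't depend on what the colormap does with entry 255.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical masked column loop: skip texel 255 before remapping. */
static uint32_t mvline1_sketch(const uint8_t *pal, const uint8_t *buf,
                               uint32_t vplc, uint32_t vinc, unsigned logy,
                               uint8_t *dest, int bpl, int cnt)
{
    assert(cnt > 0);
    do {
        uint8_t texel = buf[vplc >> logy];
        if (texel != 255)           /* test the raw texel, not pal[texel] */
            *dest = pal[texel];
        vplc += vinc;
        dest += bpl;
    } while (--cnt);
    return vplc;
}
```

Testing the remapped value instead would only work when the colormap keeps entry 255 at 255, which, as noted above, is not the case for every game.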
|
#20 |
old chunk of coal
![]() Join Date: Nov 2011
Location: Hungary
Posts: 1,059
|
The plot thickens. While switching to the *lineasm1 functions exclusively sped up the wallscan/maskwallscan functions, it reduced the overall performance. I didn't pay attention to the FPS counter before, so I'll have to redo all the tests.
Anyway, reducing the memory accesses is a good direction in general; I was able to speed up the sound mixers quite a lot. Maybe some bit-twiddling hack for the clamping to [-128, 127] could speed it up even more.
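One well-known trick for that kind of clamp, shown here only as a sketch and not measured on an 060: a single unsigned compare detects both overflow directions, and the sign bit selects the limit. `clamp_s8` and `mix_s8` are made-up names, and the code assumes `>>` on a negative `int32_t` is an arithmetic shift (true for GCC, but implementation-defined in C).

```c
#include <assert.h>
#include <stdint.h>

/* Clamp v to [-128, 127] with one range check. */
static inline int8_t clamp_s8(int32_t v)
{
    /* v is in range exactly when v + 128 lies in 0..255 as unsigned. */
    if (((uint32_t)v + 128u) > 255u)
        v = (v >> 31) ^ 127;    /* negative -> -128, positive -> 127 */
    return (int8_t)v;
}

/* Example use: mix two signed 8-bit streams with saturation. */
static void mix_s8(int8_t *out, const int8_t *a, const int8_t *b, int n)
{
    assert(n >= 0);
    for (int i = 0; i < n; i++)
        out[i] = clamp_s8((int32_t)a[i] + b[i]);
}
```

Whether this beats two compares depends on how the compiler turns the remaining branch into code, so it's worth timing both.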