English Amiga Board


Old 27 November 2022, 22:29   #1
BSzili
old chunk of coal

 
Join Date: Nov 2011
Location: Hungary
Posts: 926
Build improvements for 68060

I'm trying to speed up the column rendering functions in Build, as these take up most of the rendering time.
For walls vlineasm4 is used to draw 4 textured columns at once to a longword aligned place in the framebuffer.
The C version looks something like this:
Code:
void vlineasm4(int cnt, char *p)
{
    unsigned char * const pal[4] = {palookupoffse[0], palookupoffse[1], palookupoffse[2], palookupoffse[3]};
    unsigned char * const buf[4] = {bufplce[0], bufplce[1], bufplce[2], bufplce[3]};
    const int vinc[4] = {vince[0], vince[1], vince[2], vince[3]};
    unsigned int vplc[4] = {vplce[0], vplce[1], vplce[2], vplce[3]};
    const int logy = glogy, ourbpl = bpl;
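    unsigned int *plong = (unsigned int *)p;

    while (cnt--)
    {
        /* Cast before shifting: a palette byte >= 128 shifted left by 24 would overflow a signed int. */
        *plong =  ((unsigned int)pal[0][buf[0][vplc[0]>>logy]] << 24);
        *plong |= ((unsigned int)pal[1][buf[1][vplc[1]>>logy]] << 16);
        *plong |= ((unsigned int)pal[2][buf[2][vplc[2]>>logy]] << 8);
        *plong |= ((unsigned int)pal[3][buf[3][vplc[3]>>logy]]);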

        vplc[0] += vinc[0];
        vplc[1] += vinc[1];
        vplc[2] += vinc[2];
        vplc[3] += vinc[3];
        plong += (ourbpl/4);
    }

    Bmemcpy(&vplce[0], &vplc[0], sizeof(unsigned int) * 4);
}
Here bpl is the framebuffer's bytes per line, p is the starting address in the framebuffer, palookupoffse holds the palette remap lookup table for each column, bufplce holds the texture addresses, vplc is the fixed-point texel position of each column (shifted down by glogy to get the texel index) and vinc is its per-pixel increment.
This is the asm version I took from the original 2003 Duke3D port:
Code:
_vlineasm4:
	movem.l d2-d7/a2-a6,-(sp)

	lea     _ylookup,a0
	move.l  (a0,d0.l*4),d0

	move.l  d0,a0
	add.l   d1,a0

	neg.l   d0

	lea     _bufplce,a1
	lea     _palookupoffse,a2

	lea     _vince,a4
	move.l  (a4),d3
	move.l  4(a4),d4
	move.l  8(a4),d5
	move.l  12(a4),d6

	lea     _vplce,a3
	move.l  12(a3),a6
	move.l  8(a3),a5
	move.l  4(a3),a4
	move.l  (a3),a3

	move.l  mach3_al(pc),d7
	move.l  a3,d1
	lsr.l   d7,d1
	bra.b   .loop

	cnop 0,16
.loop
	move.b  ([a1],d1.w),d1
	add.l   d3,a3
	move.b  ([a2],d1.w),d2
	move.l  a4,d1
	lsl.l   #8,d2

	lsr.l   d7,d1
	move.b  ([4,a1],d1.w),d1
	add.l   d4,a4
	move.b  ([4,a2],d1.w),d2
	move.l  a5,d1
	lsl.l   #8,d2

	lsr.l   d7,d1
	move.b  ([8,a1],d1.w),d1
	add.l   d5,a5
	move.b  ([8,a2],d1.w),d2
	move.l  a6,d1
	lsl.l   #8,d2

	lsr.l   d7,d1
	move.b  ([12,a1],d1.w),d1
	add.l   d6,a6
	move.b  ([12,a2],d1.w),d2
	move.l  a3,d1

	move.l  d2,(a0,d0.l)
	lsr.l   d7,d1

	add.l   _fixchain(pc),d0
	bcc.b   .loop
.end
	lea     _vplce,a2
	move.l  a3,(a2)
	move.l  a4,4(a2)
	move.l  a5,8(a2)
	move.l  a6,12(a2)

	movem.l (sp)+,a2-a6/d2-d7
	rts
Here _fixchain is bytesperline, mach3_al is glogy and ylookup[i] = bytesperline*i. There is some clever stuff there, like combining the framebuffer offset and the loop variable into d0.
This was described as 68030+ in the source file, so I was wondering, is there a way to improve it for the 68060 to avoid stalls and other problems?
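The trick of folding the framebuffer offset and the loop counter into d0 can be sketched in C roughly as follows. This is only an illustration under assumed names (fill_column and its parameters are hypothetical, not part of Build): the base pointer is moved to the end of the column and a negative offset counts up towards zero, so one variable serves as both address index and loop counter.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not from the Build source: writes `value` into `cnt`
 * rows of one column. `end` points one row past the last pixel written, so
 * the caller's buffer must extend at least that far. */
static void fill_column(uint8_t *p, int cnt, int bpl, uint8_t value)
{
    uint8_t *end = p + (ptrdiff_t)cnt * bpl;  /* like a0 = p + ylookup[cnt] */
    ptrdiff_t off = -(ptrdiff_t)cnt * bpl;    /* like neg.l d0 */

    while (off != 0) {
        end[off] = value;  /* like move.l d2,(a0,d0.l) */
        off += bpl;        /* like add.l _fixchain(pc),d0; the asm exits on carry */
    }
}
```

The asm version detects the end of the loop from the carry flag set by the add, so it needs no separate compare or counter decrement.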
Old 27 November 2022, 23:45   #2
paraj
Registered User

 
Join Date: Feb 2017
Location: Denmark
Posts: 499
Have you measured whether it's actually faster to render 4 columns at a time rather than simply doing 4x1? What you lose on doing more writes could be won by better cache utilization, simpler setup and not having to use indirect addressing modes. "vlineasm1" obviously also looks much more approachable for tuning.

EDIT: What I mean is something like:

Looking at what I think is the latest code
For vlineasm1 I think this shaves 3 cycles off the current version by reducing change/use register stalls:
Code:
    move.l  d2,d4
    lsr.l   d3,d4
    bra.b   .loop
    cnop 0,16
.loop
    move.b  (a1,d4.l),d5
    add.l   d0,d2

    move.l  d2,d4
    lsr.l   d3,d4

    ; d5 stall

    move.b  (a0,d5.l),(a2)

    ; write

    add.l   d6,a2
    subq.l  #1,d1

    bne.b   .loop
I think it should also be possible to do a bit more loop fiddling and change
move.b (a0,d5.l),(a2)
into a move to e.g. d7, fitting the write from the previous iteration into one of the wasted slots (like the d5 stall).
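For reference, the single-column loop above corresponds roughly to the following C. This is a sketch under assumed names (vline1_ref and its parameter list are hypothetical); the comments map each statement to the register roles used in the asm.

```c
#include <stdint.h>

/* Hypothetical C reference for the single-column loop. Returns the updated
 * texel position so the caller can store it back. */
static uint32_t vline1_ref(const uint8_t *pal, const uint8_t *buf,
                           uint32_t vplc, uint32_t vinc, int logy,
                           int cnt, int bpl, uint8_t *dest)
{
    while (cnt-- > 0) {
        uint8_t texel = buf[vplc >> logy];  /* move.b (a1,d4.l),d5 */
        vplc += vinc;                       /* add.l  d0,d2 */
        *dest = pal[texel];                 /* move.b (a0,d5.l),(a2) */
        dest += bpl;                        /* add.l  d6,a2 */
    }
    return vplc;
}
```

Each pixel costs two dependent byte loads and one byte store, which is why the discussion focuses on memory accesses rather than arithmetic.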

Last edited by paraj; 28 November 2022 at 00:34.
Old 28 November 2022, 09:15   #3
BSzili
old chunk of coal

 
Join Date: Nov 2011
Location: Hungary
Posts: 926
Thank you, that's a very good question! At one point I did measure the vlineasm1 against the vlineasm4 and the framerate was the same. Since I didn't do any real timings I never questioned whether the 4x1 version is worth pursuing - until now. I'll make some measurements for all 3 versions to see what's going on (vlineasm4, vlineasm1 before and after).

Edit: This is how many milliseconds the game spends in wallscan minus the initial setup in 320x200 at the start of each episode:
Code:
     vlineasm4 vlineasm1
E1M1     13-14     13-14
E2M1     15-16     14-15
E3M1        11        10
E4M1     14-15     13-14
E6M1      9-10      9-10
The single column version seems to be a little faster in heavy scenes. The new version of vlineasm1 is about the same, so I didn't include it in the table.
What would the d7 version look like? Something like this?
Code:
    move.b  (a0,d5.l),d7
    move.b  d7,(a2)

    ; write

    add.l   d6,a2
    subq.l  #1,d1

    bne.b   .loop

Last edited by BSzili; 28 November 2022 at 11:14. Reason: added measurements
Old 28 November 2022, 12:00   #4
paraj
Registered User

 
Join Date: Feb 2017
Location: Denmark
Posts: 499
It's probably completely memory bound then, and not spending many cycles doing anything else. Also noticed I paired the instructions wrong (the first move.b can't pair with the add.l because it's 32-bits long and therefore the add hasn't been fetched yet).

The "d7 idea" (which is probably not worth it given my "new" version didn't improve anything) is to work around the 060's limitation of only being able to do one memory access per cycle.
I.e.
Code:
    move.b  (a0,d5.l),(a2)
Always takes at least two cycles, even with everything cached. Splitting it up like this
Code:
    move.b  (a0,d5.l),d7
    move.b  d7,(a2)
doesn't improve anything on its own (still takes at least two cycles, and of course the code is larger), but if you can find instructions for the sOEP to pair with the two moves, you get those for "free".

So the idea is to have the innerloop write the pixel for the last iteration while preparing the next one, so something like this untested code (note: you also need to preserve d7):
Code:
    ; Calc pixel for first iteration
    move.l  d2,d4
    lsr.l   d3,d4
    move.b  (a1,d4.l),d5
    add.l   d0,d2
    move.b  (a0,d5.l),d7
    subq.l  #1,d1
    beq.b   .last

    ; Prepare d4 for loop
    move.l  d2,d4
    lsr.l   d3,d4

    bra.b   .loop

    cnop 0,16
.loop
    move.b  (a1,d4.l),d5
    add.l   d0,d2
    move.b  d7,(a2)
    move.l  d2,d4
    lsr.l   d3,d4
    move.b  (a0,d5.l),d7
    add.l   d6,a2           ; lea (a2,d6.l),a2 might be better
    subq.l  #1,d1
    bne.b   .loop
.last
    move.b  d7,(a2)
.end
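The data flow of this loop (store the previous pixel while fetching the next one) can be sketched in C as below. The function name and parameters are hypothetical, and a C compiler is free to schedule the statements differently; the sketch only shows how the first lookup is peeled off before the loop and the final store is done after it.

```c
#include <stdint.h>

/* Hypothetical software-pipelined variant of the single-column loop.
 * `pending` plays the role of d7 in the asm. */
static uint32_t vline1_pipelined(const uint8_t *pal, const uint8_t *buf,
                                 uint32_t vplc, uint32_t vinc, int logy,
                                 int cnt, int bpl, uint8_t *dest)
{
    if (cnt <= 0)
        return vplc;

    uint8_t pending = pal[buf[vplc >> logy]];  /* pixel for the first row */
    vplc += vinc;

    while (--cnt > 0) {
        uint8_t texel = buf[vplc >> logy];  /* fetch texel for the next row */
        vplc += vinc;
        *dest = pending;                    /* store the previous row's pixel */
        dest += bpl;
        pending = pal[texel];               /* remap for the next row */
    }

    *dest = pending;                        /* last row, like .last in the asm */
    return vplc;
}
```

The output is the same as the straightforward loop; only the order of loads and stores within an iteration changes.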
Old 28 November 2022, 12:31   #5
koobo
Registered User

 
Join Date: Sep 2019
Location: Finland
Posts: 208
Quote:
Originally Posted by paraj
Also noticed I paired the instructions wrong (the first move.b can't pair with the add.l because it's 32-bits long and therefore the add hasn't been fetched yet).
A bit off topic I suppose, but isn't it more likely that both of these instructions have been prefetched into the 96-byte FIFO buffer, as the previous instructions are short? Not sure tho
Old 28 November 2022, 13:06   #6
paraj
Registered User

 
Join Date: Feb 2017
Location: Denmark
Posts: 499
Quote:
Originally Posted by koobo
A bit off topic I suppose, but isn't it more likely that both of these instructions have been prefetched into the 96-byte FIFO buffer, as the previous instructions are short? Not sure tho
Perfectly on topic, and it's bitten me before. A correctly predicted Bcc instruction is free if taken, but discards the instruction stream (section 1.4.2.1 of the 68060UM):
Quote:
If a hit occurs in the branch cache, indicating a branch taken instruction, the current instruction stream is discarded and a new instruction stream is fetched starting at the location indicated by the branch cache.
Just checked by timing loops with 200 iterations and everything in cache, and the original loop takes ~9 cycles (as I expected) and my first version takes ~6. The d7 version also takes 6 (maybe a stall for a2? not sure, and lea doesn't help, but that also increases code size...). However this variation takes it down to 5:
Code:
    add.l   d0,d2
    move.b  d7,(a2)
    move.b  (a1,d4.l),d5
    move.l  d2,d4
    lsr.l   d3,d4
    add.l   d6,a2
    move.b  (a0,d5.l),d7
    subq.l  #1,d1
    bne.b   .loop
But going from 6 to 5 probably isn't going to give a measurable speed up if 9->6 didn't.

Last edited by paraj; 03 December 2022 at 13:28. Reason: Strike out wrong info
Old 28 November 2022, 14:46   #7
BSzili
old chunk of coal

 
Join Date: Nov 2011
Location: Hungary
Posts: 926
Thanks, I tried the latest variant and the speed is pretty much the same. The biggest saving was switching to the single column mode, probably because it does fewer memory accesses overall.
Old 28 November 2022, 15:43   #8
paraj
Registered User

 
Join Date: Feb 2017
Location: Denmark
Posts: 499
Yeah, at 320x200 with no overdraw, 4 cycles/pixel would only give ~0.5% (50MHz).
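One way to read that figure (an interpretation, since the post does not spell out the arithmetic): 320x200 pixels at 4 cycles each is 256,000 cycles per full-screen pass, which is about 0.5% of the 50 million cycles a 50 MHz CPU executes per second, or roughly 5.1 ms. The helper names below are made up for the illustration.

```c
#include <stdint.h>

/* Cycles spent per full-screen pass for a given per-pixel cost. */
static uint32_t cycles_per_frame(uint32_t width, uint32_t height,
                                 uint32_t cycles_per_pixel)
{
    return width * height * cycles_per_pixel;
}

/* The same amount expressed in microseconds at a clock given in MHz. */
static uint32_t cycles_to_us(uint32_t cycles, uint32_t clock_mhz)
{
    return cycles / clock_mhz;
}
```

With width 320, height 200, 4 cycles per pixel and a 50 MHz clock this gives 256000 cycles and 5120 microseconds.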

If most of the time is spent in the wallscan function it might be worth it to write the function completely in asm to avoid unnecessary reloads of variables etc.

Even if you don't go that route it's probably worth looking into reducing memory accesses/reducing cache pollution. I'd ditch the ylookup table (likely a pessimization on 060) and use local scalar variables for everything (maybe the compiler is smart enough to figure out that e.g. the use of the global vince array can be replaced by a local variable, but I wouldn't count on it). The "ynice" optimization is probably also not worth it for 060. Copying the value of the globals used in the loop (globalshade, globalhoriz, etc.) to local const variables is also worth a try (probably just moves them to the stack, but that should be more cache friendly, and might allow other optimizations).
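The point about copying globals into locals can be sketched as follows. The names g_bpl and g_shade are stand-ins, not the engine's real variables. Because the loop stores through a byte pointer, which is allowed to alias any object, the compiler may have to reload the globals on every iteration in the first version; the second version reads them once.

```c
#include <stdint.h>

/* Stand-ins for engine globals such as bytesperline or globalshade. */
static int32_t g_bpl = 320;
static uint8_t g_shade = 3;

/* Reads the globals inside the loop: the byte store may alias them,
 * so they may be reloaded from memory each time round. */
static void fill_using_globals(uint8_t *dest, int cnt)
{
    while (cnt-- > 0) {
        *dest = g_shade;
        dest += g_bpl;
    }
}

/* Copies the globals to local constants first, so they can stay in
 * registers (or at worst on the stack) for the whole loop. */
static void fill_using_locals(uint8_t *dest, int cnt)
{
    const int32_t bpl = g_bpl;
    const uint8_t shade = g_shade;

    while (cnt-- > 0) {
        *dest = shade;
        dest += bpl;
    }
}
```

Both functions write the same pixels; whether the second one is actually faster depends on the compiler and has to be measured.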
Old 28 November 2022, 17:12   #9
BSzili
old chunk of coal

 
Join Date: Nov 2011
Location: Hungary
Posts: 926
For the first test I got rid of ylookup and stuffed everything into local variables:
http://franke.ms/cex/z/8KvrKP
This shaved off 1ms of the wall rendering time in every case:
Code:
     vlineasm4 vlineasm1 wallscan
E1M1     13-14     13-14    12-13
E2M1     15-16     14-15    13-14
E3M1        11        10     9-10
E4M1     14-15     13-14    12-13
E6M1      9-10      9-10        9
Old 28 November 2022, 18:28   #10
paraj
Registered User

 
Join Date: Feb 2017
Location: Denmark
Posts: 499
Nice, that's actually better than I would have hoped for.

At this point it's probably better/easier to turn vlineasm1 into inline assembly. You only need the loop part (d3/d6 should just be loaded from local globalshiftval [and ditch setupvlineasm]/mybpl). There's no reason to do y2-y1-1 and then undo it in the assembly code either (meaning the check if it's 0 can be skipped, that's already checked in the C code). You might also get slightly better codegen by being more flexible in your assembly constraints (i.e. not requiring specific registers to be used).
Old 28 November 2022, 19:41   #11
paraj
Registered User

 
Join Date: Feb 2017
Location: Denmark
Posts: 499
And for non-nice x's
Code:
  if (bufplce[0] >= tsizx) {
    if (xnice == 0)
      bufplce[0] %= tsizx;
    else
      bufplce[0] &= tsizx;
  }
it's probably faster to do:
Code:
do bufplce[0] -= tsizx; while (bufplce[0] >= tsizx);
if bufplce[0] isn't much larger than tsizx. divsl.l being 38 cycles allows quite a few loop iterations before it's slower (for reasons I don't yet know my micro benchmarks give 4 cycles as the minimum for any loop, but that's still 8-9 iterations).
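The two ways of wrapping the texture column can be compared side by side in C. The function names are hypothetical; the break-even estimate (a 38-cycle divide against roughly 4 cycles per loop iteration, so about 8-9 iterations) is the one from the post, not something measured here.

```c
#include <stdint.h>

/* Wrap with a division, as in the non-"nice" branch of the original code. */
static uint32_t wrap_mod(uint32_t x, uint32_t tsizx)
{
    return (x >= tsizx) ? (x % tsizx) : x;
}

/* Wrap by repeated subtraction: cheaper than a divide as long as x is
 * only a few multiples of tsizx. Requires tsizx > 0. */
static uint32_t wrap_sub(uint32_t x, uint32_t tsizx)
{
    while (x >= tsizx)
        x -= tsizx;
    return x;
}
```

Both return the same value for every input; only the cost differs, and the subtraction loop becomes the slower option once x exceeds tsizx by many multiples.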

For inlining with better register usage something like this (which probably has some wrong constraints and likely doesn't work out) is what I mean:
Code:
int cnt = y2ve[0]-y1ve[0];
int ptemp,vtemp,dest=x+myframeoffset+mybpl*y1ve[0];
asm volatile (
"    moveq   #0,%[ptemp]\n\t"                   // clear the upper bits of the palette index
"1:\n\t"                                        // numeric local label: safe if the asm is emitted more than once
"    move.l  %[v],%[vtemp]\n\t"
"    lsr.l   %[logy],%[vtemp]\n\t"
"    move.b  (%[buf],%[vtemp].l),%[ptemp]\n\t"
"    add.l   %[vinc],%[v]\n\t"
"    move.b  (%[pallookup],%[ptemp].l),(%[dest])\n\t"
"    subq.l  #1,%[cnt]\n\t"
"    add.l   %[mybpl],%[dest]\n\t"              // add to an address register leaves the flags from subq intact
"    bne.b   1b\n\t"
: [cnt]     "+d" (cnt),
  [v]       "+r" (myvplce),
  [dest]    "+a" (dest),
  [ptemp]   "=&d" (ptemp),
  [vtemp]   "=&d" (vtemp)
: [vinc]    "r" (myvince),
  [pallookup] "a" (mypalookupoffse),
  [buf]     "a" (mybufplce+mywaloff),
  [mybpl]   "r" (mybpl),
  [logy]    "d" (logy)
  // "+" already makes cnt, v and dest read-write, so the matching "0"/"1"/"2" inputs are not needed
: "cc", "memory"                                // flags change and the framebuffer is written
);
Old 28 November 2022, 21:22   #12
Photon
Moderator

 
Join Date: Nov 2004
Location: Eksjö / Sweden
Posts: 5,290
paraj already contributed some good ideas. Whether they translate into speed gains depends much more on what's cached than on the occasional stall from instruction ordering. If you can control what's in the cache when you start a loop, you have taken care of the biggest potential time sink. If you have a way of measuring cache misses, it is worth its weight in gold.

Likewise, reduce memory accesses wherever possible.

Here, you look up values hopefully sometimes getting some cache hits, write to increasing offsets with gaps which likely never get accessed again, and read a hopefully cached fixed longword from memory to add to a counter register each loop.
Old 28 November 2022, 21:23   #13
BSzili
old chunk of coal

 
Join Date: Nov 2011
Location: Hungary
Posts: 926
I'll check these out later. I think the wall textures in Build games have POT dimensions for speed, so they should all be "nice", but sprites have arbitrary dimensions. Those are drawn using the maskwallscan function, so I'll try the xnice loop optimization there.
Old 03 December 2022, 16:04   #14
paraj
Registered User

 
Join Date: Feb 2017
Location: Denmark
Posts: 499
Have to correct some mistaken stuff I wrote above. I made some wrong conclusions from my experiments due to lack of understanding. koobo is absolutely right that the instructions are prefetched across correctly predicted loops. For completeness I've added how I now think the cycles should be counted (for the bulk of the loop (i.e. not first/last iteration) and assuming no cache misses) [attached since it's a wall of text].

As a consolation prize here's a stall free (I think) version of the loop. Would be limited by instruction fetch if there were really no cache misses
Code:
        move.b  0(a1,d4.l),d5
        add.l   d0,d2
        move.l  d2,d4
        lsr.l   d3,d4
        move.b  d7,(a2)
        add.l   d6,a2
        move.b  0(a0,d5.l),d7
        subq.l  #1,d1
Attached Files
File Type: txt cyclecount.txt (2.6 KB, 18 views)
Old 03 December 2022, 17:53   #15
koobo
Registered User

 
Join Date: Sep 2019
Location: Finland
Posts: 208
It's not straightforward to see by eye how execution goes on the 060. I recently measured the execution time, did a small change, and measured again. Some speed ups were found which I didn't really understand, but no matter, as long as there was an improvement
Old 03 December 2022, 22:57   #16
BSzili
old chunk of coal

BSzili's Avatar
 
Join Date: Nov 2011
Location: Hungary
Posts: 926
Thanks, I'll try this version as well. I'll look for a better place to benchmark it, where there aren't a lot of floors/ceilings in view. In the meantime I checked the NPOT tsizx loop, and I gained about 1 ms looking at a sprite from up close.

Last edited by BSzili; 03 December 2022 at 23:16.
Old 04 December 2022, 13:36   #17
BSzili
old chunk of coal

 
Join Date: Nov 2011
Location: Hungary
Posts: 926
The latest version of vlineasm1 is measurably faster in 320x400. Could this be applied to mvlineasm1 as well?
It's used for non-translucent sprites and masked walls, and becomes a heavy hitter when you get up close and personal with enemies. I cobbled this together, but I guess the changes will mess with the timings?
Code:
        move.b  0(a1,d4.l),d5
        add.l   d0,d2
        move.l  d2,d4
        lsr.l   d3,d4
        cmp.b   #255,d7
        beq.b   .skip
        move.b  d7,(a2)
.skip
        add.l   d6,a2
        move.b  0(a0,d5.l),d7
        subq.l  #1,d1
Old 04 December 2022, 14:33   #18
paraj
Registered User

 
Join Date: Feb 2017
Location: Denmark
Posts: 499
Quote:
Originally Posted by BSzili
The latest version of vlineasm1 is measurably faster in 320x400. Could this be applied to mvlineasm1 as well?
It's used for non-translucent sprites and masked walls, and becomes a heavy hitter when you get up close and personal with enemies. I cobbled this together, but I guess the changes will mess with the timings?
Code:
        move.b  0(a1,d4.l),d5
        add.l   d0,d2
        move.l  d2,d4
        lsr.l   d3,d4
        cmp.b   #255,d7
        beq.b   .skip
        move.b  d7,(a2)
.skip
        add.l   d6,a2
        move.b  0(a0,d5.l),d7
        subq.l  #1,d1
Hmm, I don't think that does the same thing as the original code. It's the value read from the a1-array that's supposed to be compared with 255 (d5 in your code), right?


Maybe something like this is faster:
Code:
    move.l  d2,d3
    lsr.l   d7,d3
    add.l   d0,d2
    sub.l   d6,a2
.loop
    move.b  (a1,d3.l),d4
    add.l   d6,a2
    move.l  d2,d3
    lsr.l   d7,d3
    add.l   d0,d2
    cmp.b   #255,d4
    beq.b   .skip
    move.b  (a0,d4.l),(a2)
.skip
    subq.l  #1,d1
    bpl.b   .loop
Old 04 December 2022, 15:02   #19
BSzili
old chunk of coal

 
Join Date: Nov 2011
Location: Hungary
Posts: 926
Oops. Entry 255 usually remains 255 in the various colormaps, but of course it's faster if transparent pixels are not remapped. Also Exhumed doesn't follow this convention. Thanks, I'll try this out later.
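The ordering issue discussed here is easier to see in C: the transparency test has to be made on the texel read from the texture, before the palette remap, so that transparent pixels skip both the lookup and the store. The sketch below uses a hypothetical function name and parameter list.

```c
#include <stdint.h>

/* Hypothetical C reference for a masked column: texel value 255 means
 * "transparent" and is tested before the palette lookup. */
static uint32_t mvline1_ref(const uint8_t *pal, const uint8_t *buf,
                            uint32_t vplc, uint32_t vinc, int logy,
                            int cnt, int bpl, uint8_t *dest)
{
    while (cnt-- > 0) {
        uint8_t texel = buf[vplc >> logy];
        vplc += vinc;
        if (texel != 255)           /* compare the texel, not the remapped colour */
            *dest = pal[texel];
        dest += bpl;
    }
    return vplc;
}
```

Testing the remapped colour instead only gives the same picture when the colormap leaves entry 255 unchanged, which, as noted above, is not true for every game.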
Old 11 December 2022, 15:29   #20
BSzili
old chunk of coal

 
Join Date: Nov 2011
Location: Hungary
Posts: 926
The plot thickens. While switching to the *lineasm1 functions exclusively sped up the wallscan/maskwallscan functions, it reduced the overall performance. I didn't pay attention to the FPS counter before, so I'll have to redo all the tests.
Anyway, reducing the memory accesses is a good direction in general, I was able to speed up the sound mixers quite a lot. Maybe some bit twiddling hack for the -128, 127 clamping could speed it up even more.
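One commonly used trick for this kind of clamp is sketched below; it is not taken from the mixer code, and the function name is made up. A single unsigned compare covers both bounds, so in-range samples (the common case) pay for one test instead of two, and the out-of-range value is derived from the sign bit.

```c
#include <stdint.h>

/* Clamp a mixed sample to the signed 8-bit range -128..127. */
static int32_t clamp_s8(int32_t x)
{
    /* x + 128 lies in 0..255 exactly when -128 <= x <= 127, so one
     * unsigned compare checks both limits. */
    if ((uint32_t)x + 128u > 255u) {
        /* Assumes >> on a negative int32_t is an arithmetic shift (true
         * for GCC): negative x gives -1 ^ 127 = -128, positive x gives
         * 0 ^ 127 = 127. */
        x = (x >> 31) ^ 127;
    }
    return x;
}
```

Whether this beats two plain compares on the 060 would have to be timed, since a well-predicted branch is cheap there.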
 



All times are GMT +2.