05 August 2022, 19:16 | #21 | |
Registered User
Join Date: Feb 2017
Location: Denmark
Posts: 1,104
Quote:
On a side note, when testing I was worried my 060 was dying, but "luckily" it just turned out to be a nearby building on fire
06 August 2022, 12:15 | #22 |
Registered User
Join Date: Feb 2017
Location: Denmark
Posts: 1,104
Thanks to bebbo for sharing an improved version via PM. After massaging it a bit further, the signed multiply function is now at 60 cycles (including overhead):
Code:
_smul64:
        fmove.l d0,fp0
        move.l  d2,a0
        fmul.l  d1,fp0
        fmove.x fp0,-(a7)
        move.w  (a7),d2
        addq.l  #4,a7
        bmi.s   .Neg64
        sub.w   #16382+32,d2
        ble.s   .L1
        move.l  (a7)+,d0
        move.l  d3,a1
        move.l  d0,d3
        lsl.l   d2,d3
        neg.w   d2
        add.w   #32,d2
        move.l  (a7)+,d1
        lsr.l   d2,d0
        lsr.l   d2,d1
        move.l  a0,d2
        or.l    d3,d1
        move.l  a1,d3
        rts

.Neg64:
        sub.w   #16382+$8000+32,d2
        ble.s   .L1neg
        move.l  (a7)+,d0
        move.l  d3,a1
        move.l  d0,d3
        lsl.l   d2,d3
        neg.w   d2
        add.w   #32,d2
        move.l  (a7)+,d1
        lsr.l   d2,d0
        lsr.l   d2,d1
        or.l    d3,d1
        move.l  a1,d3
        neg.l   d1
        negx.l  d0
        move.l  a0,d2
        rts

.L1:
        ; bfextu (a7){0:d2},d1
        neg.w   d2
        move.l  (a7),d1
        lsr.l   d2,d1
        moveq   #0,d0
        addq.l  #8,a7
        move.l  a0,d2
        rts

.L1neg:
        ; bfextu (a7){0:d2},d1
        neg.w   d2
        move.l  (a7),d1
        lsr.l   d2,d1
        moveq   #-1,d0
        addq.l  #8,a7
        neg.l   d1
        move.l  a0,d2
        rts

Last edited by paraj; 06 August 2022 at 17:54. Reason: Fixed error spotted by a/b, and useless moveq (Don_Adan)
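For anyone checking the routine against test vectors, here is a minimal C reference model of what _smul64 computes. My reading of the register convention (two signed 32-bit operands in d0/d1, full 64-bit product returned in d0:d1) is an assumption rather than something stated outright in the thread; the FPU trick itself works because the extended-precision format has a 64-bit mantissa, so a 32x32 product is held exactly.

```c
#include <assert.h>
#include <stdint.h>

/* Reference model (register convention is my assumption): full 64-bit
 * product of two signed 32-bit values, split into high (d0) and low
 * (d1) longwords. */
typedef struct { uint32_t hi; uint32_t lo; } s64pair;

static s64pair smul64_ref(int32_t a, int32_t b)
{
    /* The 68881/68882/060 extended format carries a 64-bit mantissa,
     * so the fmul.l in the assembly is exact, just like this product. */
    int64_t p = (int64_t)a * (int64_t)b;
    s64pair r = { (uint32_t)((uint64_t)p >> 32), (uint32_t)(uint64_t)p };
    return r;
}
```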
06 August 2022, 14:18 | #23 |
Registered User
Join Date: Jun 2016
Location: europe
Posts: 1,039
|
Shouldn't move.l d2,a0 in .L1 and .L1Neg be move.l a0,d2?
06 August 2022, 14:40 | #24 |
Registered User
Join Date: Feb 2017
Location: Denmark
Posts: 1,104
|
06 August 2022, 16:52 | #25 | |
Registered User
Join Date: Jan 2008
Location: Warsaw/Poland
Age: 55
Posts: 1,975
|
Quote:
06 August 2022, 17:36 | #26 |
Registered User
Join Date: Feb 2017
Location: Denmark
Posts: 1,104
|
It is indeed useless now that all the bitfield operations have been removed, thanks.
Also, for the int64_t/int32_t (sdiv64) case it is faster to just use the FPU to do the float->int conversion (a 30-cycle improvement over bebbo's improved version, which is at 130). I.e. for d0:d1/d2 -> d0: Code:
        ; d0>0 case (for d0 < 0, negate d0:d1 before and d0 after the
        ; division; checking for d0=0 may or may not be worth it)
        tst.l   d1
        bpl.s   .L1
        addq.l  #1,d0           ; add 2**32 (can't overflow if result is <= 2**31)
.L1:
        fmove.l d0,fp0
        fmul.s  #$4f800000,fp0  ; 2**32
        fadd.l  d1,fp0
        fdiv.l  d2,fp0
        fintrz.x fp0
        fmove.l fp0,d0
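A hedged C model of the same trick (names are mine): the 64-bit dividend is rebuilt as hi*2^32 + lo, but since fadd.l treats the low longword as signed, hi is pre-incremented when lo's top bit is set, mirroring the tst.l/addq.l pair above. Note that a portable C double only carries a 53-bit mantissa, whereas the 060's extended precision carries 64 bits, so this sketch is only exact for dividends that fit in 53 bits.

```c
#include <assert.h>
#include <stdint.h>

/* Model of the FPU-based int64/int32 division (d0:d1 / d2 -> d0),
 * positive-dividend case only, as in the assembly comment. */
static int32_t sdiv64_fpu_ref(int32_t hi, uint32_t lo, int32_t d)
{
    int32_t h = hi;
    if ((int32_t)lo < 0)
        h += 1;                 /* fadd.l adds lo - 2^32, so pre-add 2^32 here */
    double v = (double)h * 4294967296.0 + (double)(int32_t)lo;
    return (int32_t)(v / d);    /* cast truncates toward zero, like fintrz */
}
```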
08 August 2022, 18:38 | #27 |
Registered User
Join Date: Feb 2017
Location: Denmark
Posts: 1,104
|
Where were we? Oh right, optimizing texture mappers for 060. I've had my eye on getting a one-frame (mostly) fullscreen effect done for some time. Trying to replicate the free-directional tunnel/planes effects in hotstyle takeover (only 20 years late), I came up with this innerloop for use with a 64x64 texture (taking up half of the data cache) and 16 shades for each color (8 seems to give too much banding, and 32 leaves too few colors for the texture).

"v" is only 6.6 fixpoint (the rest are .16), which is not great. The whole thing is very imprecise anyway (due to the 32x16 interpolation grid), so that might not matter too much. AFAIK the demo also used that grid size, but I'm not sure why my version seems to look much worse. I've already tried tweaking the calculations a bit to hide the worst artifacts, but haven't spent much time on it. In the demo it's mostly noticeable in the planes part, but looks quite OK (it also goes by quickly).
Code:
; d6 = ufrac    | v
; d1 = zfrac    | uint
; d2 = -------- | zint (<=15)
; a2 = dudxfrac | dvdx
; d4 = dzdxfrac | dudxint
; d5 = -------- | dzdxint
; d7 = 63*64
; a0 = output buffer
; a1 = texture (only upper bits are set in the texels)

        move.l  d6,d0           ; 1
        and.l   d7,d0
        and.w   #63,d1          ; 2
        move.b  d2,d3
        or.b    d1,d0           ; 3
        add.l   a2,d6
        addx.l  d4,d1           ; 4
        addx.b  d5,d2           ; 5
        or.b    (a1,d0.l),d3    ; 6 ; -
        move.b  d3,(a0)+        ; 7 ; - (cmp.l)

I've attached the complete source; it can be compiled with vc +aos68k -o boxi.exe -O3 -cpu=68060 -fpu=68060 -DNDEBUG boxi.c boxiasm.s demosys.c -I$NDK -lm060 -lamiga -lauto (should also work with gcc). On my Blizzard 1260/50MHz it runs in one frame (at 320x208), with "Inner" reported as taking 141.4ns (7.1 cycles) per pixel. At 320x256 I get 40 FPS. The way the c2p/effect overlapping is done might be the bottleneck here - I suspect there aren't enough cycles between the first part of the C2P writing 4 long words to the store buffer and the later handling of the final four (but overlapping that is already a bit complicated, and moving them slightly later didn't give any improvement). Sorry in advance for the nausea-inducing effect
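To make the addressing explicit, here is a small C model of the texel fetch (helper names are mine, not from the source): because v is 6.6 fixpoint, its integer part already sits in bits 6-11 of the low word, which is exactly the row offset into a 64x64 texture; masking with 63*64 and OR-ing in the u integer gives the texel index, and because the texels keep the color index in their upper bits, the 4-bit shade can simply be OR'ed into the low nibble.

```c
#include <assert.h>
#include <stdint.h>

/* v is 6.6 fixpoint: its integer part in bits 6..11 is row*64, no shift
 * needed before masking with 63*64 (= $0fc0). */
static uint32_t tex_index(uint32_t d6_low, uint32_t u_int)
{
    return (d6_low & (63u * 64u)) | (u_int & 63u);
}

/* Texels store the color index in their upper bits, so the shade (0..15)
 * lands in the free low nibble. */
static uint8_t shade_texel(uint8_t texel, uint8_t zint)
{
    return (uint8_t)(texel | (zint & 15u));
}
```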
08 August 2022, 21:15 | #28 |
Defendit numerus
Join Date: Mar 2017
Location: Crossing the Rubicon
Age: 53
Posts: 4,474
|
Nice!
(apart from a slight stomach ache)
08 August 2022, 22:09 | #29 | |
Registered User
Join Date: Nov 2006
Location: Stockholm, Sweden
Posts: 237
|
Fun stuff!
Quote:
When you switch to "andi.l #63*64,d0", your innerloop becomes larger: 30 bytes per pixel yields a total of 9600 bytes of innerloop, which ought to result in ~1.4kB being ping-ponged in/out of the cache for each iteration through the outer loop. Perhaps that is the reason why you see a performance degradation?

Generally speaking, it is usually good to unroll a couple of times, say 8x-32x; this can reduce time spent doing loop control. However, even if the unrolled code fits into the cache, it still needs to be fetched once. Therefore, there is a sweet spot past which excessive unrolling makes the whole-frame time slower, because the time saved by eliminating loop control instructions is instead spent doing memory reads to fill the I-cache during the first iteration through the loop.

When I look at _boxi_inner it seems that you are not unrolling 320 times there, though. Therefore, I think the perf difference you saw in your measurement loop was not something you observed in the "real" loop.

I'm not sure what to do to get performance up in the C2P parts. One thing you can try is to eliminate any logic that doesn't affect memory accesses and check whether that improves performance; if it does, it is a hint that you could theoretically reach that performance by positioning your memory accesses better. For example, erase the contents of the C2P_STEP macro and check if that causes a performance difference. You could also try removing the texture interpolation logic, keeping only the memory accesses, to get another measure of the theoretical max throughput.
|
08 August 2022, 22:40 | #30 |
Registered User
Join Date: Nov 2006
Location: Stockholm, Sweden
Posts: 237
|
Regarding the innerloop itself: if you want to speed it up, note that the ADDX operations eat up two pipeline slots. The ADD/ADDX/ADDX scheme costs as much as 5 ADDs would, or as much as 3 ADDs + 2 shifts.
Here is an example of using 1 ADD + 1 ADDX, combined with some creative packing to reduce the amount of shifting/masking. First, the linear flow, without rolling the loop into itself: Code:
; d1 (step) & d0 (counter)  vvvvvv-- -------- CCCCcccc cc------
; d3 (step) & d2 (counter)  UUUUUUuu uuuu---- -------- --VVVVVV

        add.l   d1,d0
        addx.l  d3,d2
        move.l  d2,d4
        rol.l   #6,d4
        and.l   d6,d4           ; d6 = $00000fff
        move.w  d0,d5
        lsr.w   d7,d5           ; d7 = 16-4
        ; d5.w will be in the range $0000 ... $000f
        or.b    (a0,d4.l),d5
        move.b  d5,(a1)+
Code:
; d1 (step) & d0 (counter)  vvvvvv-- -------- CCCCcccc cc------
; d3 (step) & d2 (counter)  UUUUUUuu uuuu---- -------- --VVVVVV

        or.b    (a0,d4.l),d5
        move.l  d2,d4
        move.b  d5,(a1)+
        rol.l   #6,d4
        and.l   d6,d4           ; d6 = $00000fff
        move.w  d0,d5
        lsr.w   d7,d5           ; d7 = 16-4
        ; d5.w will be in the range $0000 ... $000f
        add.l   d1,d0
        addx.l  d3,d2
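The clever part is that a carry out of d0's top bit is precisely a v-integer increment, so one ADD.L plus one ADDX.L steps all four channels at once. A C model of the stepping and the ROL #6 index extraction follows (an illustration under my reading of the bit layout above, not the poster's code):

```c
#include <assert.h>
#include <stdint.h>

/* d0 packs the v fraction at the top and the shade counter in its low
 * word; d2 packs U as 12-bit fixpoint at the top and the V integer in
 * the low 6 bits. ADD.L's carry out of d0 feeds ADDX.L, bumping V. */
typedef struct { uint32_t d0, d2; } counters;

static counters step(counters c, uint32_t d1_step, uint32_t d3_step)
{
    uint64_t sum = (uint64_t)c.d0 + d1_step;         /* add.l  d1,d0 */
    c.d0 = (uint32_t)sum;
    c.d2 = c.d2 + d3_step + (uint32_t)(sum >> 32);   /* addx.l d3,d2 */
    return c;
}

/* rol.l #6 then and.l #$fff: V lands in bits 6..11, U's integer part
 * rotates around into bits 0..5, giving index = V*64 + U. */
static uint32_t packed_index(uint32_t d2)
{
    uint32_t r = (d2 << 6) | (d2 >> 26);
    return r & 0xfffu;
}
```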
09 August 2022, 14:01 | #31 |
Registered User
Join Date: Feb 2017
Location: Denmark
Posts: 1,104
|
Wow, thanks a lot, that's amazing! Humbling to see such a massive improvement is possible; clearly I still have much to learn. That brings the FPS at 320x256 to 46, and 320x240 is 99.6% one-frame (3-4 frames missed out of 1000). While refactoring I seem to have introduced an issue (or perhaps it's lack of precision) where C (Z) seems to "bleed" into V in some way that I'll have to debug, but this is clearly a massive step forward! Also, great tips on how to place memory accesses.
Regarding the "andi.l" thing, I tried both rolled and unrolled at various sizes and it seems pretty consistent.

I've done some more measurements, and one thing I forgot about was branch target alignment (I noticed this when I timed individual instructions and e.g. add was faster than sub!). I must have missed one or more additional details in the MC68060UM, because it seems like using immediate operands (for otherwise 0.5-cycle instructions) incurs an extra cost. Perhaps some part of the instruction fetch machinery isn't able to keep up with the execution units. Instructions were timed like this: Code:
        cnop    0,8
.loop:
        ; <instruction sequence>
        subq.l  #1,d0
        bne.b   .loop

; unrolled
        cnop    0,8
.loop:
        rept    UNROLLX ;50
        ; <instruction sequence>
        endr
        subq.l  #1,d0
        bne.w   .loop

; sample instruction sequence (all are 3 operations)
        add.l   d1,d2
        add.l   d1,d3
        add.l   d1,d4

; mix1
        move.l  d1,d2
        andi.l  #42,d3
        addq.l  #1,d4

; mix2
        move.l  d1,d2
        and.l   d5,d3
        addq.l  #1,d4
Code:
Instruction  Looped  Unrolled
addq           2.0     1.5
addi           5.0     4.5
addiw          4.0     3.0
add            2.0     1.5
subq           2.0     1.5
subi           5.0     4.5
sub            2.0     1.5
andi           5.0     4.5
andiw          4.0     3.0
and            2.0     1.5
eori           5.0     4.5
eoriw          4.0     3.0
eor            2.0     1.5
addx           4.0     3.0
mul            7.0     6.0
muli          10.0     9.0
movei          5.0     4.5
moveiw         4.0     3.0
move           2.0     1.5
mix1           3.0     2.5
mix2           2.0     1.5

EDIT: Added nofpu version of test.

Last edited by paraj; 09 August 2022 at 14:28.
09 August 2022, 23:04 | #32 | |
Registered User
Join Date: Nov 2006
Location: Stockholm, Sweden
Posts: 237
|
Quote:
Regarding andi.l: if it wasn't down to code size, and you are not getting lots of cache hits, then yes, you are probably instruction fetch limited.

Instruction fetch works like this: the I-cache can supply one aligned longword of code bytes each cycle. The IFU fetches longword after longword of code bytes, using the branch prediction logic to determine whether to just walk linearly ahead or to skip somewhere else next time. Branch prediction information is associated with the instruction before the branch itself, so if it says "the following instruction is predicted to be a taken branch", the branch instruction itself will not be fetched by the IFU.

The IFU then feeds the fetched results through Early Decode (IED), which takes a variable number of words as input and spits out 6-byte packets. These 6-byte packets go into the Instruction Buffer (IB), a 16-entry FIFO. Instructions smaller than 6 bytes still turn into 6-byte packets; instructions larger than 6 bytes turn into multiple 6-byte packets. The Integer Unit, with its two pipelines, attempts to pull two 6-byte packets from the Instruction Buffer each cycle and feed them into the pOEP/sOEP if possible.

Now -- if you have a stream of instructions that pair (example: add.l d0,d1 / add.l d2,d3, unrolled 16x), then the Integer Unit will process 2 instructions/cycle, and thus the IFU needs to enqueue 2 instructions/cycle to keep the Integer Unit busy. The IFU will succeed at this if the instructions are 2 bytes each (since 2 instructions * 2 bytes/instruction = 4 bytes, and the I-cache can sustain 4 bytes/cycle). However, if these pairable instructions are larger than 2 bytes, you will become instruction fetch limited.

The worst example is if you use 32-bit immediates (example: add.l #$12345678,d0 / add.l #$87654321,d1) -- the I-cache will spend 3 cycles fetching this 12-byte instruction pair, so the IFU can only submit the pair once every 3 cycles on average, even though the Integer Unit could execute it much faster. What will happen is that the IFU submits 0,1,1,0,1,1,0,1,1,... instructions per cycle, and the Integer Unit will snatch an instruction from the FIFO as soon as it is enqueued, running it through the pOEP with the sOEP idle.

In your synthetic benchmark, the "x_andi" case is 3x ANDI.L unrolled 50 times. That is 6 bytes per instruction of pairable ops, so that case will be IFU/I-cache limited, and the 1.5 cycles/instruction figure is effectively a measure of instruction fetch speed from the I-cache.

In a more complex scenario, like the texturemapper, any Integer Unit stall (processing a slow instruction like DIVS/DIVU, or a data cache read miss) will pause the Integer Unit's processing of instructions, but the IFU can continue to fetch, early-decode, and fill the FIFO. This buys the IFU time for the future, thereby hiding subsequent IFU throughput problems... but if your texturemapper really is having no data cache misses and using all pairable instructions, then you will eventually become IFU throughput bound: the Integer Unit is processing instructions as quickly as or quicker than the I-cache can deliver them to the IFU.

If we go back to your 7-cycle example: 7 cycles allows the I-cache to deliver 7*4=28 bytes. The original version of your 7-cycle example is 26 bytes in size, but with ANDI.L it becomes 30 bytes. If you really have zero d-cache misses, the I-cache won't be able to sustain the throughput you need; it needs an extra 0.5 cycles per iteration on average.
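That last bound is just arithmetic on the 4-bytes-per-cycle fetch limit; a tiny helper (mine, for illustration) makes it easy to check loop variants:

```c
#include <assert.h>

/* Lower bound, in tenths of a cycle, for one pass over loop_bytes bytes
 * of fully pairable code: the 060 I-cache delivers at most one aligned
 * longword (4 bytes) per cycle. Integer tenths avoid float comparisons. */
static int fetch_limit_tenths(int loop_bytes)
{
    return loop_bytes * 10 / 4;
}
```

The 26-byte loop fits the 7-cycle budget (6.5 cycles), while the 30-byte ANDI.L variant needs 7.5 cycles, the extra half cycle per iteration discussed above.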
10 August 2022, 10:30 | #33 | |
Registered User
Join Date: Feb 2017
Location: Denmark
Posts: 1,104
|
Quote:
Once again, thanks a bunch for the detailed information! It was extremely helpful, and there's not much info on the 060 to be found on the net (mostly posts by you and Blueberry). Attached is the latest version, including an 020/nofpu-compatible version. It's a bit slow on an unexpanded A1200 (0.6 fps).
12 August 2022, 14:54 | #34 |
Registered User
Join Date: Feb 2017
Location: Denmark
Posts: 1,104
|
Cycles found. 52.6 fps at 320x256 (B1260@50MHz); gonna say that's good enough and call it a day. If I were to use it in some way, I'd probably have to spend quite a bit of time tweaking the effect to hide the lack of precision from the huge interpolation grid distance, and find someone with graphic skills to create a decent 64x64 16-color texture.
The final speedup came from interleaving the last 2 chip writes with the inner loop. Did some rough estimation of when there would be room in the store buffer, and that turned out to work. Played around a bit more with the placement of the last 4 chip writes, but didn't see any improvement. EDIT: If you have a different Amiga setup and try it, please post results; the "nofpu" version should run on any AGA machine. 040/060LC results might be interesting. Last edited by paraj; 12 August 2022 at 19:22.
15 August 2022, 21:09 | #35 | |
Registered User
Join Date: Mar 2018
Location: Hastings, New Zealand
Posts: 2,581
|
Quote:
boxi.exe: 6.21 fps. boxi-nofpu: 4.13 fps.

I also did some experiments in an attempt to see where the bottleneck was (obviously not the FPU).

Allocating bitplanes in FastRAM: 6.8 fps. So writing to ChipRAM is not the bottleneck.
No c2p code at all (_boxi = rts): 50.08 fps. Aha!
Remove c2p register manipulations, copying to FastRAM: 27.81 fps.
Same but copying to ChipRAM: 22.36 fps.

I haven't figured out exactly what all the code does, so I probably left in more than I could have. How would I modify your code to simply do each of the following (not caring that the display would be scrambled)?

1. Render the 'chunky' pixels directly to ChipRAM.
2. Copy the rendered pixels from FastRAM to ChipRAM.

The reason I want to do this is to evaluate the relative speed of planar vs chunky graphics, had it been implemented in the AGA chipset or added via some external hardware.
15 August 2022, 21:56 | #36 | |
Registered User
Join Date: Feb 2017
Location: Denmark
Posts: 1,104
|
Quote:
The idea is to calculate 32 chunky pixels and do C2P on those while the previous stores to chip RAM complete. Since the store buffer can only hold 4 longwords, the C2P part after the main loop is structured to finish 4 LWs ASAP and get them on their way (almost at the end of boxi_scanline). This leaves 4 more LWs, which are stored in "pdata(sp)". Except for the final iteration, the first two are stored during the setup for the main loop (move.l (a3)+,a2 ... move.l a2,X*ROWBYTES(a5)) and the last two after 8 and 16 pixels respectively - see the rept 8 INNER part.

The code is tuned for what's fast on my 060, so it's not surprising that it's not great on a 030 (though I would have hoped it was faster).

So if I understand you correctly: you should get rid of the chip RAM stores in the setup/mainloop (a3+ -> a2/3 -> X(a5)) and the C2P part (after the INNER stuff). That should leave you with a loop that just renders pixels to fast RAM (pixels(sp)). If you want to render directly to chip RAM (doing byte accesses), change the "lea pixels(sp),a0" to "move.l a6,a0" just before the main loop. If you just want to render to fast RAM and then copy, do that after the main loop instead.

If you want, I can share earlier versions of the code (less optimized/060-specific) if that helps (PM me, and I'll just send you a copy of the git repo).