20 June 2022, 02:18 | #61 | |||
Registered User
Join Date: Dec 2019
Location: North Dakota
Posts: 741
|
Quote:
1. Generic 3D Starfield (as in Star Raiders)
2. 3D Point Cloud (as in Rez)
3. Additional pixel detail (quasi texture-like) over flat-shaded polygons via a perspective-interpolated 5x5 vertex grid
Quote:
That's the best way to learn and come up with the best code.
Quote:
I have looked up the cycle timings in the text file that @a/b attached a few posts above (apologies if the numbers are incorrect), and this is what one bitplane batch looks like:
Code:
    12c   move.b  ($6000,a1),d3
     4c   and.b   d1,d3
    12c   lsl.b   #3,d2
    10c   bcc.b   .b5
     4c   or.b    d0,d3
    .b5
    12c   move.b  d3,($6000,a1)
However, the or.b d0,d3 will be executed on average in 50% of cases (each bit has a 50% chance of being set, as all 64 colors are used across the entire screen), so I will count the or.b d0,d3 as 50%, which is 2c; hence 54c-2c = 52c per bitplane.
My current version is 18+10+10 + (18/2) = 47c, assuming I got the cycles right. There are a few cycles fewer for the one BP which is addressed as (a1).
Code:
    18    and.b   d3,(-$4000,a1)
    10    btst    #0,d2
    10    beq     dp9_2
    18    or.b    d3,(-$4000,a1)   ; BP1
    dp9_2:
Also, what is the cycle timing of BEQ on the 68000 if the branch is not taken? Is it 10c taken and 12c not taken, perhaps?
|||
20 June 2022, 03:18 | #62 |
Registered User
Join Date: Jun 2016
Location: europe
Posts: 1,068
|
The first pixel is slower (lsl.b #3,d2 12c) than the rest (add.b d2,d2 4c); that's what I was talking about in my previous post about replacing 6x btst #x,d2 (10c).
The basic idea is to push every bit in d2 out to the carry flag, which can be done with a quick add.b (except for the first one, which has to get everything in place by skipping the top 2 bits in d2). So it's 1*12+5*4=32 cycles vs. 6x10=60 cycles. And beq is replaced with bcc. Branch taken/not taken cases for bcc/dbcc/... are all in the table:
|bcc.b |label | 10/8 (taken/not taken) |
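For anyone following along without a 68000 at hand, here is a rough Python model of the shift-out trick described above. It only mimics the byte-sized lsl/add and the carry flag; the function name is made up for illustration, and the cycle figures are simply the ones quoted in the post.

```python
def carries_from_shift_out(color):
    """Model lsl.b #3,d2 followed by five add.b d2,d2 on a 6-bit color.

    Returns the carry flag after each instruction, i.e. the bit that
    decides whether or.b runs for bitplanes 5, 4, 3, 2, 1, 0 (in that order).
    """
    d2 = color & 0xFF
    carries = []
    # lsl.b #3,d2: bits 7, 6, 5 are shifted out; the last one (bit 5) lands in carry.
    carries.append((d2 >> 5) & 1)
    d2 = (d2 << 3) & 0xFF
    for _ in range(5):
        # add.b d2,d2: bit 7 goes to carry, everything else moves up by one.
        carries.append((d2 >> 7) & 1)
        d2 = (d2 << 1) & 0xFF
    return carries


# Cycle figures as quoted in the post (register operands on a plain 68000).
LSL_B_3, ADD_B, BTST_IMM = 12, 4, 10
shift_out_cycles = 1 * LSL_B_3 + 5 * ADD_B   # 32
btst_cycles = 6 * BTST_IMM                   # 60

if __name__ == "__main__":
    print(carries_from_shift_out(0b100101))  # bits 5..0 of the color
    print(shift_out_cycles, btst_cycles)
```

The model only covers the bit extraction; the and/or/move work per bitplane is the same in both variants, so the saving is the 28 cycles between the two totals.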
20 June 2022, 07:23 | #63 | |
Registered User
Join Date: Aug 2010
Location: Italy
Posts: 855
|
@VladR
Quote:
Additional optimization:
1. put the addresses relative to the last bitplane in the table;
2. put movea.w #$2000,a2 outside of the plotting loop;
3. read/write bytes with (a1) everywhere;
4. put suba.l a2,a1 at the end of the code of each byte but the last.
On a 68000 it's the same cycle-wise (8 cycles more for suba, 8 cycles less for the addressing modes), but 1 word less per byte is required: in all, 5 words instead of 10, i.e. 5 fewer memory accesses.
Edit: here's the updated code:
Code:
        movea.w #$2000,a2       ;bitplanes distance (somewhere outside of the loop)
        ...
        asl.w   #2,d0
        movea.l (a3,d0.w),a1    ;line base address
        move.w  d1,d0
        lsr.w   #3,d1           ;X offset
        adda.w  d1,a1           ;pixel base address in last bitplane
        moveq.l #7,d1
        and.w   d1,d0
        sub.w   d0,d1           ;bit number
        moveq.l #0,d0
        bset.l  d1,d0           ;OR mask
        move.b  d0,d1
        not.b   d1              ;AND mask

        move.b  (a1),d3
        and.b   d1,d3
        lsl.b   #3,d2
        bcc.b   .b5
        or.b    d0,d3
    .b5 move.b  d3,(a1)
        suba.l  a2,a1

        move.b  (a1),d3
        and.b   d1,d3
        add.b   d2,d2
        bcc.b   .b4
        or.b    d0,d3
    .b4 move.b  d3,(a1)
        suba.l  a2,a1

        move.b  (a1),d3
        and.b   d1,d3
        add.b   d2,d2
        bcc.b   .b3
        or.b    d0,d3
    .b3 move.b  d3,(a1)
        suba.l  a2,a1

        move.b  (a1),d3
        and.b   d1,d3
        add.b   d2,d2
        bcc.b   .b2
        or.b    d0,d3
    .b2 move.b  d3,(a1)
        suba.l  a2,a1

        move.b  (a1),d3
        and.b   d1,d3
        add.b   d2,d2
        bcc.b   .b1
        or.b    d0,d3
    .b1 move.b  d3,(a1)
        suba.l  a2,a1

        move.b  (a1),d3
        and.b   d1,d3
        add.b   d2,d2
        bcc.b   .b0
        or.b    d0,d3
    .b0 move.b  d3,(a1)
As for cycles, see a/b's answer.
EDIT: got some unforeseen extra free time, so I thought I'd calculate the size of the code and the overall number of words read/written from/to RAM (given that the 68000 has no cache, that's to be taken into account as well).
ORIGINAL CODE
setup size: 10
setup reads: 4
plot size: 7*5+5 = 40
plot reads/writes (best): 6*2 = 12
plot reads/writes (average): 6*3 = 18
plot reads/writes (worst): 6*4 = 24
total (best): 66
total (average): 72
total (worst): 78
ALTERNATIVE CODE
setup size (movea.w #$2000,a2 excluded): 13
setup reads: 1
plot size: 8*5+7 = 47
plot reads/writes: 6*2 = 12
total: 73
One thing I forgot to point out is that the alternative code needs only 1 lookup table.
Last edited by saimo; 20 June 2022 at 11:43.
|
20 June 2022, 14:49 | #64 | |
Registered User
Join Date: Dec 2019
Location: North Dakota
Posts: 741
|
Quote:
Awesome, I will go adjust the timings accordingly. Since each branch has a 50% chance of being taken, each branch is effectively 9c on average (whole screen of pixels). |
|
20 June 2022, 15:21 | #65 | |
Registered User
Join Date: Dec 2019
Location: North Dakota
Posts: 741
|
Quote:
Code:
    Version 13 - Shifting out the color to the Left
    [c] : Cycles
    EHB : 0.54*119,333 = 64,439c (available cycles after DMA given ~54% utilization)
    ---------------------------------------------------------------------
    | CPU   | MHz  | Frame [c] | Colors | DrawPixel [c] | Pixels/Frame |
    ---------------------------------------------------------------------
      6502    1.79     24,186       4          33             732.9
    ---------------------------------------------------------------------
      68000   7.16    119,333       4         159             750.5
      68000   7.16     64,439      64         319             202.0
    ---------------------------------------------------------------------
    No Overdraw version (No AND Masking)
      68000   7.16    119,333       4         115           1,037.7
      68000   7.16     64,439      64         203             317.4
    ErasePixel
      68000   7.16    119,333       4          96           1,243.1
      68000   7.16     64,439      64         168             383.6
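The Pixels/Frame column above is just the frame budget divided by the per-pixel cost. A small Python sketch to recompute it, in case the DrawPixel figures change again (the constant and function names are made up here; the numbers are the ones from the table):

```python
FRAME_68000 = 119_333                 # cycles per frame at 7.16 MHz, from the table
EHB_FRAME = int(0.54 * FRAME_68000)   # ~54% left for the CPU in 6 BPL, per the table


def pixels_per_frame(frame_cycles, draw_cycles):
    """Frame budget divided by the cost of one pixel, rounded like the table."""
    return round(frame_cycles / draw_cycles, 1)


if __name__ == "__main__":
    rows = [
        ("6502, 4 colors", 24_186, 33),
        ("68000, 4 colors", FRAME_68000, 159),
        ("68000, 64 colors", EHB_FRAME, 319),
        ("68000, 64 colors, no overdraw", EHB_FRAME, 203),
        ("68000, 64 colors, erase", EHB_FRAME, 168),
    ]
    for label, frame, draw in rows:
        print(f"{label:32s} {pixels_per_frame(frame, draw):8.1f}")
```

This assumes the whole budget goes into plotting, so the results are upper bounds rather than expected in-game numbers.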
|
20 June 2022, 15:33 | #66 | |
Registered User
Join Date: Dec 2019
Location: North Dakota
Posts: 741
|
Quote:
Will the instructions that operate just on registers (e.g. add.b d2,d2) execute in parallel at every other DMA slot in 6 BPL, just like they would in 4 BPL? Or will they still have to wait for an available DMA slot before executing, despite not doing any RAM R/W whatsoever?
Last edited by VladR; 20 June 2022 at 15:34. Reason: typo
|
20 June 2022, 17:19 | #67 |
Registered User
Join Date: Jun 2016
Location: europe
Posts: 1,068
|
Speedy 4 cycle instructions are not a problem. You could see it as they are executed while the next opcode is being fetched (or waiting on a free dma slot), so they are not slowed down "in the middle of execution" (once they are fetched, which can take a while, they are good to go).
The problem is instructions with multiple memory accesses, whether for the extended opcode, operands, or execution. You could calculate how many cycles you have left in 1/50sec for the CPU, but once you are dealing with 5+ bitplanes and/or blitter and/or heavy copper lists, that becomes increasingly inaccurate due to instruction times being extended because of memory access conflicts "in the middle of execution".
20 June 2022, 17:31 | #68 | |
Registered User
Join Date: Feb 2017
Location: Denmark
Posts: 1,228
|
Quote:
They still have to wait because of the need to (pre)fetch the next instructions. I think the easiest way to think about it is like this: without DMA contention you can just look at raw cycle numbers. With DMA contention you also look at cycle count - 4*number of memory accesses. Those additional cycles (where the CPU isn't accessing memory) can run in parallel and are given "for free" in some sense. E.g. add.w d0,d0 and add.l d0,d0 will run at the same speed if the CPU can only access memory every other time it wants to (like in 6BPL when Agnus is fetching data).
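To put numbers on that rule of thumb, here is a deliberately crude Python model. It assumes the worst case described above, where the CPU gets only every other memory slot, so each access effectively occupies 8 cycles instead of 4 and the register-only cycles hide inside the waiting. The function name and the 8-cycle figure are assumptions for illustration; real bus timing depends on where in the scanline the instruction falls.

```python
def effective_cycles(raw_cycles, mem_accesses, contended=True):
    """Rough instruction time when the CPU only gets every other bus slot.

    raw_cycles    documented 68000 timing without contention
    mem_accesses  number of bus accesses (opcode fetch included)
    """
    if not contended:
        return raw_cycles
    # Each access costs 4 cycles plus (assumed) 4 cycles waiting for a free slot;
    # cycles not spent on the bus overlap with that waiting.
    return max(raw_cycles, 8 * mem_accesses)


if __name__ == "__main__":
    print(effective_cycles(4, 1))   # add.w d0,d0
    print(effective_cycles(8, 1))   # add.l d0,d0
    print(effective_cycles(8, 2))   # move.b (a1),d3
```

With these assumptions add.w d0,d0 and add.l d0,d0 both come out at 8 cycles, matching the example in the post, while a memory-heavy instruction doubles.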
|
20 June 2022, 21:28 | #69 | ||
Registered User
Join Date: Dec 2019
Location: North Dakota
Posts: 741
|
Quote:
Quote:
Of course, it's not impossible: there are still around 64,000 cycles left for the CPU (assuming no Blitter, Copper or Sprite DMA is happening). It's just that with spikes due to AI, explosions, etc., it might be very hard to avoid frame drops, given the unpredictability of the execution. It's a challenge, alright
||
20 June 2022, 23:16 | #70 |
Lemon. / Core Design
Join Date: Mar 2016
Location: Tier 5
Posts: 1,213
|
I'm wondering if it might be more efficient to plot a pixel with the blitter in 6bpl ?
|
21 June 2022, 00:29 | #71 |
Also known as GarethQ
Join Date: May 2019
Location: Twickenham / U.K.
Posts: 733
|
Well this has to be one of the best thread hijacks I have seen in a while and I don't understand assembly at all!!
|
21 June 2022, 01:14 | #72 | |
Registered User
Join Date: Dec 2019
Location: North Dakota
Posts: 741
|
Quote:
I didn't even notice initially that this is a sticky thread |
|
21 June 2022, 02:04 | #73 |
Registered User
Join Date: Aug 2010
Location: Italy
Posts: 855
|
One more optimization to the alternative code (can't go into the details now, sorry)...
The setup part that calculates the masks can be rewritten as follows, after putting moveq.l #7,d4 outside of the loop: Code:
    and.l   d4,d0
    moveq.l #-128,d1
    lsr.b   d0,d1
    move.b  d1,d0
    not.b   d1
Last edited by saimo; 21 June 2022 at 02:11.
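Since the details were left out above, here is a quick Python check that the shorter setup yields the same OR/AND masks as the bset-based one. The function names are made up; the old variant follows the setup code posted earlier in the thread (moveq #7 / and / sub / bset), the new one follows the five instructions above with d4 = 7.

```python
def masks_old(x):
    """moveq #7,d1 / and.w d1,d0 / sub.w d0,d1 / moveq #0,d0 / bset d1,d0 / not.b"""
    bit = 7 - (x & 7)            # bit number inside the byte
    or_mask = 1 << bit           # bset.l d1,d0
    and_mask = ~or_mask & 0xFF   # move.b d0,d1 / not.b d1
    return or_mask, and_mask


def masks_new(x):
    """and.l d4,d0 / moveq #-128,d1 / lsr.b d0,d1 / move.b d1,d0 / not.b d1"""
    shift = x & 7                # and.l d4,d0 with d4 = 7
    or_mask = 0x80 >> shift      # low byte of -128 is $80, shifted right by x&7
    and_mask = ~or_mask & 0xFF
    return or_mask, and_mask


if __name__ == "__main__":
    for x in range(8):
        print(x, [f"${m:02x}" for m in masks_new(x)])
```

The leftmost pixel of a byte (x&7 = 0) maps to $80 in both versions, so the two setups are interchangeable as far as the masks go.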
21 June 2022, 03:56 | #74 |
Registered User
Join Date: Jan 2008
Location: Warsaw/Poland
Age: 56
Posts: 2,050
|
Perhaps this code:
Code:
        move.b  (a1),d3
        and.b   d1,d3
        add.b   d2,d2
        bcc.b   .b0
        or.b    d0,d3
    .b0 move.b  d3,(a1)
Code:
        and.b   (a1),d1
        add.b   d2,d2
        bcc.b   .b0
        or.b    d0,d1
    .b0 move.b  d1,(a1)
Code:
        and.b   d1,(a1)
        add.b   d2,d2
        bcc.b   .b0
        or.b    d0,(a1)
    .b0
21 June 2022, 13:54 | #75 | ||
Registered User
Join Date: Aug 2010
Location: Italy
Posts: 855
|
@Don_Adan
Quote:
(One more late night copy&paste-induced mistake - thanks again for opening my eyes!)
Quote:
@VladR
Adding more information to my previous post...
Here's a side-by-side comparison of the old and new lookup-table-less setup code, with timings:
Code:
    OLD                             NEW
    moveq.l #7,d1       ;4          ;moveq.l #7,d4 outside of the loop
    and.w   d1,d0       ;4          and.l   d4,d0       ;4
    sub.w   d0,d1       ;4
    moveq.l #0,d0       ;4          moveq.l #-128,d1    ;4
    bset.l  d1,d0       ;6          lsr.b   d0,d1       ;6-20
    move.b  d0,d1       ;4          move.b  d1,d0       ;4
    not.b   d1          ;4          not.b   d1          ;4
    ;total: 30                      ;total: 22-36, average 29
The alternative code, modified as per all of the above, thus would look like this:
Code:
    * OUTSIDE OF THE PLOTTING LOOP
        movea.w #$2000,a2       ;bitplanes distance
        moveq.l #7,d4           ;X offset mask

    * PLOT ROUTINE
        lsl.w   #2,d0
        movea.l (a3,d0.w),a1    ;line base address
        move.w  d1,d0
        lsr.w   #3,d1           ;X offset
        adda.w  d1,a1           ;pixel base address in last bitplane
        and.l   d4,d0
        moveq.l #-128,d1
        lsr.b   d0,d1
        move.b  d1,d0           ;OR mask
        not.b   d1              ;AND mask

        move.b  (a1),d3
        and.b   d1,d3
        lsl.b   #3,d2
        bcc.b   .b5
        or.b    d0,d3
    .b5 move.b  d3,(a1)
        suba.l  a2,a1

        move.b  (a1),d3
        and.b   d1,d3
        add.b   d2,d2
        bcc.b   .b4
        or.b    d0,d3
    .b4 move.b  d3,(a1)
        suba.l  a2,a1

        move.b  (a1),d3
        and.b   d1,d3
        add.b   d2,d2
        bcc.b   .b3
        or.b    d0,d3
    .b3 move.b  d3,(a1)
        suba.l  a2,a1

        move.b  (a1),d3
        and.b   d1,d3
        add.b   d2,d2
        bcc.b   .b2
        or.b    d0,d3
    .b2 move.b  d3,(a1)
        suba.l  a2,a1

        move.b  (a1),d3
        and.b   d1,d3
        add.b   d2,d2
        bcc.b   .b1
        or.b    d0,d3
    .b1 move.b  d3,(a1)
        suba.l  a2,a1

        and.b   (a1),d1
        add.b   d2,d2
        bcc.b   .b0
        or.b    d0,d1
    .b0 move.b  d1,(a1)

    * setup code size: 11
    * setup code reads: 1
    * plot code size: 8*5+6 = 46
    * plot code reads/writes: 6*2 = 12
    * total: 70
Last edited by saimo; 21 June 2022 at 18:52.
||
21 June 2022, 16:32 | #76 |
Registered User
Join Date: Jun 2016
Location: europe
Posts: 1,068
|
I'd just go with something similar to what paraj posted a few pages ago... If you can handle 2KB of code (written for asm-one/pro):
Code:
    ****************************************************************
    Depth   EQU     6

    MKDRAW  MACRO   (Color)
    .Start\@
    .BPL    SET     0
    .Offset SET     -$4000
            REPT    Depth
            IFNE    (\1)&(1<<.BPL)
            IFEQ    .Offset
            bset    d0,(a0)
            ELSE
            bset    d0,(.Offset,a0)
            ENDIF
            ELSE
            IFEQ    .Offset
            bclr    d0,(a0)
            ELSE
            bclr    d0,(.Offset,a0)
            ENDIF
            ENDIF
    .BPL    SET     .BPL+1
    .Offset SET     .Offset+$2000
            ENDR
    ; 10 bytes free, either rts/bra, or dbf and bra/rts can fit easily
            rts
            DCB.W   (32-(*-.Start\@))/2,$4e71
            ENDM
    ****************************************************************
    ; d0=y, d1=x, d2=color, a3=bm_rows
    DrawPixel
            lsl.w   #5,d2           ; pre-shift d0/d2 if possible
            lsl.w   #2,d0
    ;       add.w   d0,d0           ; faster if mem access
    ;       add.w   d0,d0           ; is not a problem (-2c)
            move.l  (a3,d0.w),a0
            move.w  d1,d0
            lsr.w   #3,d1
            add.w   d1,a0           ; pixel address
            not.b   d0              ; bit
            jmp     (.Draw,pc,d2.w)

            ALIGN   0,4
    .Draw
    .Col    SET     0
            REPT    1<<Depth
            MKDRAW  .Col
    .Col    SET     .Col+1
            ENDR
    ****************************************************************
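For readers who don't have asm-one handy, the following Python sketch mirrors what the macro above expands to and checks the "2KB" and "10 bytes free" figures. It is only a model of the layout: the helper names are invented, and the instruction sizes assume 2 bytes for bset/bclr d0,(a0) and 4 bytes for the (d16,a0) forms.

```python
DEPTH = 6
SLOT_BYTES = 32          # lsl.w #5,d2 turns the color into a 32-byte slot offset


def routine_for(color):
    """List (mnemonic, plane offset, size in bytes) as MKDRAW would emit them."""
    ops = []
    for bpl in range(DEPTH):
        offset = -0x4000 + bpl * 0x2000
        op = "bset" if color & (1 << bpl) else "bclr"
        size = 2 if offset == 0 else 4     # (a0) needs no displacement word
        ops.append((op, offset, size))
    return ops


def free_bytes(color):
    """Bytes left in the slot before the rts and nop padding."""
    return SLOT_BYTES - sum(size for _, _, size in routine_for(color))


def bit_number(x):
    """not.b d0: bset/bclr on memory only use the low 3 bits of the bit number."""
    return (~x) & 7


if __name__ == "__main__":
    print(routine_for(0b000101))
    print(free_bytes(0), (1 << DEPTH) * SLOT_BYTES)
```

Every one of the 64 routines is the same size, which is what makes the single jmp (.Draw,pc,d2.w) dispatch possible without a separate address table.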
21 June 2022, 19:42 | #77 | |
Registered User
Join Date: Feb 2017
Location: Denmark
Posts: 1,228
|
Quote:
Something that works (assuming a non-interleaved bitmap): Set up BLTAFWM/BLTALWM/BLTCDAT=$ffff, BLTAMOD=BLTDMOD=bplsize in bytes-2, BLTBMOD=0, BLTCON0=SRCA!SRCB!DEST!$B8 (Ab+BC). Have a table for each color with 6 words where the MSB is set if the pixel should be drawn (e.g. 0 -> 6 times 0, 63 -> $8000, $8000, ...). Put the destination in BLTAPT and BLTDPT (doesn't have to be word aligned), the pixel mask from above in BLTBPT and (x&15)<<12 in BLTCON1, then write 64*6+1 to BLTSIZE and off you go.
Last edited by paraj; 21 June 2022 at 19:49.
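A small Python sketch of the values involved in that blitter setup, for anyone who wants to generate the table offline. The helper names are made up, and the order of the 6 words (plane 0 first) is an assumption, since the post doesn't say which bitplane the blit starts from; flip the range if the destination pointer starts at the last plane.

```python
DEPTH = 6


def color_mask_words(color):
    """6 words per color, MSB set where that plane's bit of the color is 1.

    Assumption: word 0 belongs to bitplane 0.
    """
    return [0x8000 if color & (1 << plane) else 0x0000 for plane in range(DEPTH)]


def bltcon1_for(x):
    """B shift in the top nibble: (x & 15) << 12."""
    return (x & 15) << 12


def bltsize(height=DEPTH, width_words=1):
    """Height in bits 15-6, width in words in bits 5-0: 64*6+1 for this blit."""
    return (height << 6) | width_words


def minterm_b8(a, b, c):
    """Evaluate minterm $B8 for one bit; bit index is A*4 + B*2 + C."""
    return (0xB8 >> ((a << 2) | (b << 1) | c)) & 1


if __name__ == "__main__":
    print([f"${w:04x}" for w in color_mask_words(63)])
    print(f"${bltcon1_for(5):04x}", bltsize())
```

The minterm helper is only there to confirm that $B8 reads as Ab+BC: with C held at $ffff, the destination keeps A wherever the shifted mask B is 0 and gets a 1 wherever B is 1.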
|
21 June 2022, 20:54 | #78 |
Registered User
Join Date: Aug 2010
Location: Italy
Posts: 855
|
Yeah, as far as CPU-only routines go, that's as fast as it gets (accessory code aside, which is in the same ballpark anyway, 6 instructions, 5*2+1 = 11 words and 6*2 = 12 reads/writes are the bare minimum; I've been thinking of minimizing the writes by restricting them to just when changes are needed, but ANDs and branches cancel the theoretical advantages).
|
22 June 2022, 16:17 | #79 | |
Registered User
Join Date: Dec 2019
Location: North Dakota
Posts: 741
|
Quote:
The fastest version will undergo the full unrolling (e.g. jump table to 64 routines). Besides, I am pretty sure I will have 3 different options for the end user, and thus 3 different pixel rasterizer sets in code:
1. 2 BPL
2. 4 BPL
3. 6 BPL
Since everything else in the game takes the exact same time, in theory, to account for these 3 drastically different scenarios (on the bus), all I would have to do is change the number of pixels (stars) while the framerate stays unchanged.
I am toying with the idea of not clearing the framebuffer and just erasing the stars from the last frame, since that would be doable (outside of cutscenes) if everything else was done with sprites (with 512 KB RAM, I could pre-render all 3D meshes into sprites at loading time). And this would give me almost a full frame (out of 3), as that's how long it takes to clear 6 planes via Blitter...
|
22 June 2022, 17:09 | #80 |
Registered User
Join Date: Jun 2016
Location: europe
Posts: 1,068
|
Then I would suggest that you also record a bitmap ptr for each star as you draw them (maybe overwrite x/y to save space), so the clearing is then simply: read a bitmap ptr, set whole byte to 0 for each bitplane.
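A minimal Python model of that bookkeeping, assuming 6 non-interleaved planes spaced $2000 bytes apart as in the code earlier in the thread. The framebuffer is just a bytearray and all names are invented for the sketch.

```python
PLANE_SIZE = 0x2000
DEPTH = 6


def plot(fb, saved, byte_offset, bit, color):
    """Set/clear one bit per plane and remember where the star was drawn."""
    for plane in range(DEPTH):
        addr = plane * PLANE_SIZE + byte_offset
        if (color >> plane) & 1:
            fb[addr] |= 1 << bit
        else:
            fb[addr] &= ~(1 << bit) & 0xFF
    saved.append(byte_offset)        # the "bitmap ptr" recorded per star


def erase_all(fb, saved):
    """Clear last frame's stars: one zero byte per plane, no masks needed."""
    for byte_offset in saved:
        for plane in range(DEPTH):
            fb[plane * PLANE_SIZE + byte_offset] = 0
    saved.clear()


if __name__ == "__main__":
    fb = bytearray(DEPTH * PLANE_SIZE)
    saved = []
    plot(fb, saved, byte_offset=40 * 10 + 3, bit=5, color=37)
    erase_all(fb, saved)
    print(any(fb))   # nothing left on screen
```

Zeroing the whole byte also wipes any other pixel sharing it, which is fine for stars on a black background but would need the AND mask back if anything else is drawn into the same planes.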
|