13 June 2022, 18:27 | #41 | ||
Registered User
Join Date: Feb 2017
Location: Denmark
Posts: 1,291
|
Quote:
Quote:
The first blitter pass runs in reverse (so you can shift left) over the x coordinates and outputs the wanted instruction with the correct data register. If you can ensure the y coordinate is preshifted/multiplied by the row size (if you're doing 3D stuff anyway, maybe you could fold it into your projection routine), then just a second pass is needed to combine the two into an offset for the instruction. Otherwise you need two more passes, one for x and one for y, shifting them correctly into place.
More concretely, say you have a 256x256 screen with a line going from 0,0 to 255,255:
Code:
linelist: dc.w 0,0,1,1,....
Code:
8128 0000 or.b d0,$0000(a0)
8328 0000 or.b d1,$0000(a0)
8528 0000 or.b d2,$0000(a0)
8728 0000 or.b d3,$0000(a0)
...
Code:
8128 0000 or.b d0,$0000(a0)
8328 0000 or.b d1,$0000(a0)
...
8f28 0000 or.b d7,$0000(a0)
8128 0001 or.b d0,$0001(a0)
Code:
8128 0000 or.b d0,$0000(a0)
8328 0020 or.b d1,$0020(a0)
...
8F28 00E0 or.b d7,$00e0(a0)
8128 0101 or.b d0,$0101(a0)
Last edited by paraj; 13 June 2022 at 19:17. |
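The three listings above can be read as successive stages of the generated code: opcode with the right data register, then the x byte offset, then the y row offset. As a rough cross-check, here is a minimal Python sketch of the final result, assuming a single bitplane with 32-byte rows (256 pixels) and that `x & 7` selects d0..d7. The names `ROW_BYTES`, `plot_instruction` and `generate` are made up for illustration and are not from the post.

```python
ROW_BYTES = 256 // 8        # assumption: 256-pixel-wide bitplane, 32 bytes per row
OR_B_D0_D16_A0 = 0x8128     # opcode word of or.b d0,d16(a0), as listed in the post


def plot_instruction(x, y):
    """Return (opcode word, displacement word) for one pixel."""
    opcode = OR_B_D0_D16_A0 | ((x & 7) << 9)   # bits 9-11 select d0..d7
    offset = y * ROW_BYTES + (x >> 3)          # byte offset relative to a0
    return opcode, offset


def generate(points):
    """Build the instruction list for a list of (x, y) points."""
    return [plot_instruction(x, y) for x, y in points]


# The diagonal from the example: (0,0), (1,1), ... (255,255)
code = generate([(i, i) for i in range(256)])
for opcode, offset in code[:9]:
    print(f"{opcode:04X} {offset:04X}")
```

The blitter version computes the same two words with shifts and masks instead of a multiply, which is why the y coordinate needs to arrive already scaled by the row size.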
||
13 June 2022, 20:15 | #42 |
Registered User
Join Date: Jun 2016
Location: europe
Posts: 1,091
|
You can still do it in 2 passes by shifting Y 16-* in the opposite direction (and doing a few minor adjustments).
In one of my routines I'm using fixed point 11:5 for both X and Y, and a screen width of 512 (64=2^6 bytes). So I'd have to shift X by 3+5=8 to the right, and shift Y by 6-5=1 to the left. Since that doesn't work, I shifted Y by 16-(6-5)=15 to the right instead, with an adjusted Y bltptr and an extra row in bltsize (height+1).
EDIT: Forgot to mention that you also have to (manually) patch every 1024th pixel, because you're doing multiple blits and the Y bits you shift out of the last row won't carry over to the next blit's first row.
Last edited by a/b; 13 June 2022 at 20:56. Reason: patching |
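To see why a right shift by 15 can stand in for a left shift by 1, here is a small Python model of the blitter's barrel shifter in ascending mode, where each output word takes its upper bits from the previously fetched word. It is a sketch under assumptions: the function names are invented, the Y values are hypothetical 11:5 fixed-point numbers with the top bit clear, and masking off the fractional bits is my reading of the "minor adjustments".

```python
def blit_shift_right(words, shift):
    """Model of an ascending-mode shift: bits shifted out of one word
    enter the top of the next word that is processed."""
    out = []
    prev = 0
    for w in words:
        combined = (prev << 16) | w
        out.append((combined >> shift) & 0xFFFF)
        prev = w
    return out


def row_offset(y_fixed):
    """Byte offset of a row for 11:5 fixed-point Y and 64-byte rows."""
    return (y_fixed << 1) & 0xFFC0     # drop the (doubled) fractional bits


ys = [0x0020, 0x0040, 0x0123]              # hypothetical 11:5 Y values
shifted = blit_shift_right(ys + [0], 15)   # one extra row, as in the post

# Each result lands one word later than its source, hence the adjusted
# pointer and the extra row in bltsize.
print([hex(v) for v in shifted])
```

In this model the value `ys[i] << 1` shows up in `shifted[i + 1]`. The same one-word displacement is why bits shifted out of the last row of one blit are lost unless they are patched in by hand.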
14 June 2022, 18:20 | #43 | |
Registered User
Join Date: Feb 2017
Location: Denmark
Posts: 1,291
|
Quote:
Ah cool, I considered that something like that should be possible, but it seemed like something that would take quite a bit of work to get right, though maybe now I'll have to give it a shot |
|
15 June 2022, 00:21 | #44 | |
Registered User
Join Date: Dec 2019
Location: North Dakota
Posts: 741
|
Quote:
That is, indeed, a very interesting approach! |
|
15 June 2022, 00:41 | #45 | |
Registered User
Join Date: Dec 2019
Location: North Dakota
Posts: 741
|
Quote:
I still have 2 things on my ToDo list that should shave off some more...
Code:
Version 7
[c] : Cycles
EHB : 0.54*119,333 = 64,439c (available cycles after DMA given ~54% utilization)
---------------------------------------------------------------------
| CPU   | MHz  | Frame [c] | Colors | DrawPixel [c] | Pixels/Frame |
---------------------------------------------------------------------
| 6502  | 1.79 |    24,186 |      4 |            33 |        732.9 |
| 68000 | 7.16 |   119,333 |      4 |           264 |        452.0 |
| 68000 | 7.16 |    64,439 |     64 |           466 |        138.2 |
---------------------------------------------------------------------
EDIT:
Code:
Version 8 - using a 4-cycle btst.b instead of move.l/andi.w
[c] : Cycles
EHB : 0.54*119,333 = 64,439c (available cycles after DMA given ~54% utilization)
---------------------------------------------------------------------
| CPU   | MHz  | Frame [c] | Colors | DrawPixel [c] | Pixels/Frame |
---------------------------------------------------------------------
| 6502  | 1.79 |    24,186 |      4 |            33 |        732.9 |
| 68000 | 7.16 |   119,333 |      4 |           248 |        481.1 |
| 68000 | 7.16 |    64,439 |     64 |           426 |        151.2 |
---------------------------------------------------------------------
EDIT2:
Code:
Version 9 - LUT for YPOS
[c] : Cycles
EHB : 0.54*119,333 = 64,439c (available cycles after DMA given ~54% utilization)
---------------------------------------------------------------------
| CPU   | MHz  | Frame [c] | Colors | DrawPixel [c] | Pixels/Frame |
---------------------------------------------------------------------
| 6502  | 1.79 |    24,186 |      4 |            33 |        732.9 |
| 68000 | 7.16 |   119,333 |      4 |           212 |        562.8 |
| 68000 | 7.16 |    64,439 |     64 |           390 |        165.2 |
---------------------------------------------------------------------
No Overdraw version (No AND Masking)
| 68000 | 7.16 |   119,333 |      4 |           128 |        932.2 |
| 68000 | 7.16 |    64,439 |     64 |           230 |        280.1 |
---------------------------------------------------------------------
Last edited by VladR; 15 June 2022 at 15:20. Reason: Performance Update |
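The Pixels/Frame column in these tables is just the per-frame cycle budget divided by the DrawPixel cost. A minimal Python sketch of that arithmetic follows, using the figures from the tables. The constant and function names are illustrative, and the last digit can differ slightly from the tables depending on rounding versus truncation.

```python
FRAME_CYCLES_68000 = 119_333   # 7.16 MHz / 60 Hz, as given in the tables
FRAME_CYCLES_6502 = 24_186     # 6502 budget as given in the tables
EHB_CPU_SHARE = 0.54           # share of cycles left to the CPU in EHB mode


def pixels_per_frame(frame_cycles, draw_pixel_cycles):
    """How many DrawPixel calls fit into one frame."""
    return frame_cycles / draw_pixel_cycles


ehb_cycles = int(EHB_CPU_SHARE * FRAME_CYCLES_68000)

print(f"6502,  4 colors : {pixels_per_frame(FRAME_CYCLES_6502, 33):.1f}")
print(f"68000, 4 colors : {pixels_per_frame(FRAME_CYCLES_68000, 212):.1f}")
print(f"68000, 64 colors: {pixels_per_frame(ehb_cycles, 390):.1f}")
```

Seen this way, every cycle removed from DrawPixel matters more in the 64-color case, where the frame budget is roughly halved by DMA.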
|
15 June 2022, 18:32 | #46 | ||||
Registered User
Join Date: Feb 2017
Location: Denmark
Posts: 1,291
|
Quote:
Code:
move.l #pointlist+4*numpoints-4,bltapt(a6)
move.l #codebuffer+4*numpoints-4,bltdpt(a6)
move.w #$8128,bltbdat(a6) ; opcode word of or.b d0,ofs(a0)
move.w #7<<9,bltcdat(a6) ; mask for x (register field, bits 9-11)
move.w #9<<12!SRCA!DEST!$E4,bltcon0(a6) ; $E4: D = Bc+AC (c = NOT C), shift x & 7 into the right place
move.w #BLITREVERSE, bltcon1(a6)
move.w #numpoints*64+1,bltsize(a6)
Quote:
Quote:
Quote:
|
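As a sanity check on the minterm used in the blitter setup above, here is a small Python model that evaluates a minterm byte bit by bit and applies it to the first pass: A is the x coordinate shifted left by 9, B is the constant opcode word $8128, and C is the mask 7<<9. The function names are invented for this sketch. It also ignores the bits the real barrel shifter pulls in from the neighbouring word, which are assumed to fall outside the C mask.

```python
def blit_minterm(minterm, a, b, c, width=16):
    """Evaluate a blitter minterm byte on A, B, C words.
    Bit (A*4 + B*2 + C) of the minterm gives D for that input combination."""
    out = 0
    for bit in range(width):
        ai = (a >> bit) & 1
        bi = (b >> bit) & 1
        ci = (c >> bit) & 1
        index = (ai << 2) | (bi << 1) | ci
        if (minterm >> index) & 1:
            out |= 1 << bit
    return out


def first_pass_word(x):
    """Opcode word produced for one x coordinate in the first pass."""
    a = (x << 9) & 0xFFFF            # A channel after the 9-bit shift
    return blit_minterm(0xE4, a, 0x8128, 7 << 9)


for x in (0, 1, 7, 8):
    print(x, f"{first_pass_word(x):04X}")
```

With minterm $E4 the result is A where the mask is set and B elsewhere, so only the register field of the opcode changes.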
||||
16 June 2022, 14:56 | #47 | ||
Registered User
Join Date: Dec 2019
Location: North Dakota
Posts: 741
|
Quote:
I did keep regretting that decision during the last 8 attempts at optimization, as the difference in cycles between 16- and 32-bit adds up pretty quickly everywhere.
Quote:
210 is a really good number for 6 BPL. I guess I am going to have to work for it a bit harder.
EDIT: I was just about to do the last item on the ToDo list: BSET/BCLR instead of OR/AND. Except that they don't support 32-bit access to memory (only 8-bit), hence I gotta switch to 8/16-bit access. I don't think I can do the full rewrite of the LUTs (and everything else) now; that's possible only during the weekend.
Last edited by VladR; 16 June 2022 at 15:25. |
||
17 June 2022, 17:56 | #48 | ||
Registered User
Join Date: Feb 2017
Location: Denmark
Posts: 1,291
|
Quote:
Quote:
Learning to use the more advanced features really pays dividends in speed of development, maintainability and performance. In this case I had to use some slightly esoteric features for the "colfunc" macro (to get a proper label), but otherwise it was bread and butter stuff. |
||
17 June 2022, 18:21 | #49 | |
Registered User
Join Date: Dec 2019
Location: North Dakota
Posts: 741
|
Quote:
Code:
Version 10 - accessing Bytes instead of long-words
[c] : Cycles
EHB : 0.54*119,333 = 64,439c (available cycles after DMA given ~54% utilization)
---------------------------------------------------------------------
| CPU   | MHz  | Frame [c] | Colors | DrawPixel [c] | Pixels/Frame |
---------------------------------------------------------------------
| 6502  | 1.79 |    24,186 |      4 |            33 |        732.9 |
---------------------------------------------------------------------
| 68000 | 7.16 |   119,333 |      4 |           173 |        689.8 |
| 68000 | 7.16 |    64,439 |     64 |           337 |        191.2 |
---------------------------------------------------------------------
No Overdraw version (No AND Masking)
| 68000 | 7.16 |   119,333 |      4 |           121 |        986.2 |
| 68000 | 7.16 |    64,439 |     64 |           213 |        302.5 |
ErasePixel
| 68000 | 7.16 |   119,333 |      4 |           128 |        932.3 |
| 68000 | 7.16 |    64,439 |     64 |           198 |        325.5 |
Code:
or.b d7,(-$2000,a1)
Code:
bset d7,(-$2000,a1)
Is that correct? If so, then it makes no sense to write a new version using bclr / bset. |
|
17 June 2022, 18:39 | #50 | |
Registered User
Join Date: Feb 2017
Location: Denmark
Posts: 1,291
|
Quote:
bset.b Dn,(ofs,Am) and or.b Dn,(ofs,Am) should both take 16 cycles. Like most 68000 instructions (mul/div/shift being the most common exceptions) they're limited by each memory access taking 4 cycles. In this case 4 (word-sized) memory accesses are needed: 1 for the offset, 2 to do the RMW and 1 for prefetch. The advantage of using bset in a plain putpixel routine would come from not having to calculate a bitmask like you need for or.b (either through a LUT or by shifting). When I recommended operating on bytes (or words), it was because it seemed like you were doing long word accesses (which double the memory access time). |
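The rule of thumb above (count the word-sized bus accesses and multiply by 4) is easy to put into a tiny Python helper. This is only a sketch of the estimate, not a full timing model: instructions with internal cycles (mul/div/shifts, some register operations) need the extra `internal` term, and the long-word case assumes a long read-modify-write costs four word accesses.

```python
BUS_ACCESS_CYCLES = 4   # one word-sized memory access on a 68000


def estimate_cycles(accesses, internal=0):
    """Bus-bound estimate: accesses * 4, plus any internal cycles."""
    return accesses * BUS_ACCESS_CYCLES + internal


# or.b Dn,(ofs,An) / bset.b Dn,(ofs,An):
# 1 offset word + 2 for the read-modify-write + 1 prefetch
byte_rmw = estimate_cycles(1 + 2 + 1)

# Same instruction form on a long word (assumption: the RMW takes
# 2 reads + 2 writes, everything else unchanged)
long_rmw = estimate_cycles(1 + 4 + 1)

print("byte RMW:", byte_rmw, "cycles")
print("long RMW:", long_rmw, "cycles")
```

Under this model the byte and long versions differ only in the read-modify-write part, which is where the doubling of memory access time comes from.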
|
18 June 2022, 00:07 | #51 | ||
Registered User
Join Date: Dec 2019
Location: North Dakota
Posts: 741
|
Quote:
But I knew that eventually I would get to rewrite it using bytes (which I finally did, though it took some time). Quote:
It was faster to compute the mask when it was 4 bytes in a LUT, but it's 14 cycles to read as a byte, so it makes sense now. It's only 10c faster, but every little bit helps. And it looks like I finally matched the 6502 with pixel throughput (though, admittedly, 3 separate DrawPixel versions (for each color) would raise that number).
Code:
Version 11 - OR mask (byte) from LUT
[c] : Cycles
EHB : 0.54*119,333 = 64,439c (available cycles after DMA given ~54% utilization)
---------------------------------------------------------------------
| CPU   | MHz  | Frame [c] | Colors | DrawPixel [c] | Pixels/Frame |
---------------------------------------------------------------------
| 6502  | 1.79 |    24,186 |      4 |            33 |        732.9 |
---------------------------------------------------------------------
| 68000 | 7.16 |   119,333 |      4 |           163 |        732.1 |
| 68000 | 7.16 |    64,439 |     64 |           327 |        197.1 |
---------------------------------------------------------------------
No Overdraw version (No AND Masking)
| 68000 | 7.16 |   119,333 |      4 |           113 |      1,056.0 |
| 68000 | 7.16 |    64,439 |     64 |           205 |        314.3 |
ErasePixel
| 68000 | 7.16 |   119,333 |      4 |           118 |      1,011.3 |
| 68000 | 7.16 |    64,439 |     64 |           190 |        339.2 |
||
18 June 2022, 00:15 | #52 | |
Registered User
Join Date: Dec 2019
Location: North Dakota
Posts: 741
|
Quote:
I do have another cycle question. I just started using a different cycle table from https://mrjester.hapisan.com/04_MC68/CycleTimes.htm, but those numbers are different from the ones I was inferring from the PDF I got.
If I am reading it right, then the following op (ColorByte is a variable, so I presume it's the (addr).l column) takes 20c (I repeat it 6x, once for each BP), which is way more than I read from the PDF.
Code:
btst.b #0,ColorByte |
|
18 June 2022, 00:28 | #53 |
Registered User
Join Date: Jun 2016
Location: europe
Posts: 1,091
|
Yeah, 20c (4 fetch, 4 src operand, 2x4 dst operand, 4 mem read).
|
18 June 2022, 01:54 | #54 |
Registered User
Join Date: Dec 2019
Location: North Dakota
Posts: 741
|
Thanks.
But then the numbers in my last summary table are off and I need to recompute it all. Probably best to go over every single instruction using the new cycle table. Highly likely there are a few other errors... |
18 June 2022, 02:40 | #55 |
Registered User
Join Date: Jun 2016
Location: europe
Posts: 1,091
|
I don't make any claims this is 100% accurate but it shouldn't be far off.
|
18 June 2022, 03:53 | #56 |
Registered User
Join Date: Dec 2019
Location: North Dakota
Posts: 741
|
Damn, this is amazing. Thank you !
Now I don't have to constantly switch over to Adobe Acrobat; I can just keep this file open in the second window of Notepad++ and see both the method I am timing and the table at the same time!!! |
19 June 2022, 14:05 | #57 |
Registered User
Join Date: Dec 2019
Location: North Dakota
Posts: 741
|
So, that raised the 6-BPL version from 327c to 415c, but I instantly reverted to one of the previous versions (to which I had also assigned a wrong cycle value from that PDF) that used btst against a register. It never made sense to me why an op working against a register would be slower than one against RAM, but I didn't question the PDF...
Now I'm at 353c:
Code:
Version 12 - BTST #x,d2
[c] : Cycles
EHB : 0.54*119,333 = 64,439c (available cycles after DMA given ~54% utilization)
---------------------------------------------------------------------
| CPU   | MHz  | Frame [c] | Colors | DrawPixel [c] | Pixels/Frame |
---------------------------------------------------------------------
| 6502  | 1.79 |    24,186 |      4 |            33 |        732.9 |
---------------------------------------------------------------------
| 68000 | 7.16 |   119,333 |      4 |           165 |        723.2 |
| 68000 | 7.16 |    64,439 |     64 |           353 |        182.5 |
---------------------------------------------------------------------
No Overdraw version (No AND Masking)
| 68000 | 7.16 |   119,333 |      4 |           121 |        986.2 |
| 68000 | 7.16 |    64,439 |     64 |           237 |        271.9 |
ErasePixel
| 68000 | 7.16 |   119,333 |      4 |            96 |      1,243.1 |
| 68000 | 7.16 |    64,439 |     64 |           168 |        383.6 |
Now I can go examine the Blitter-based approaches... |
19 June 2022, 17:15 | #58 |
Registered User
Join Date: Jun 2016
Location: europe
Posts: 1,091
|
|
19 June 2022, 19:58 | #59 | |
Registered User
Join Date: Dec 2019
Location: North Dakota
Posts: 741
|
Quote:
Either way, I don't think I understand what you mean here. Could you please elaborate? Once you shift it right, you lose those bits. And any of the 6 bits might be on (and across the whole screen they will be, as the input range of color is <0,63>). Either way, this is my current code:
Code:
; d0:ypos d1:xpos d2:color a2/a3: LUTs
; Compute Address Offset (xpos,ypos) : (yp*40) + (xp / 8)
asl.w #2,d0
move.l (a3,d0),a1 ; a1 = vidPtr [(yp * 40)]
asl.w #2,d1 ; d1 = (xpos*4) : ArrayIndex into LUT_XPOS_REL
add.w (a2,d1),a1 ; a1 += xpos address Offset
move.b 3(a2,d1),d3 ; d3 = MaskAND = $FF - (1 << xpRelMask)
; MaskAND: Clear the pixel's bit in all 6 bitplanes
and.b d3,(-$4000,a1)
and.b d3,(-$2000,a1)
and.b d3,(a1)
and.b d3,($2000,a1)
and.b d3,($4000,a1)
and.b d3,($6000,a1)
; MaskOR: d3 = (1 << xpRelMask)
move.b 2(a2,d1),d3
btst #0,d2 ; 10c
beq dp9_2
or.b d3,(-$4000,a1) ; BP1
dp9_2:
btst #1,d2
beq dp9_3
or.b d3,(-$2000,a1) ; BP2
dp9_3:
btst #2,d2
beq dp9_4
or.b d3,(a1) ; BP3
dp9_4:
btst #3,d2
beq dp9_5
or.b d3,($2000,a1) ; BP4
dp9_5:
btst #4,d2
beq dp9_6
or.b d3,($4000,a1) ; BP5
dp9_6:
btst #5,d2
beq dp9_7
or.b d3,($6000,a1) ; BP6
dp9_7: |
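For readers who want to check the logic of the routine above without an emulator, here is a rough Python model of the same clear-then-set scheme: six bitplanes spaced $2000 bytes apart, 40 bytes per row, one AND mask to clear the pixel and one OR mask applied only where the color bit is set. It is a sketch under assumptions: the leftmost pixel is taken to be bit 7 of the byte, the planes are indexed from 0 instead of being addressed relative to the middle plane, and all names are invented.

```python
ROW_BYTES = 40          # 320 pixels / 8
PLANE_SIZE = 0x2000     # spacing between bitplanes, as in the listing
NUM_PLANES = 6


def draw_pixel(screen, x, y, color):
    """Clear the pixel in every plane, then set it where the color bit is 1."""
    offset = y * ROW_BYTES + (x >> 3)
    or_mask = 0x80 >> (x & 7)        # assumption: leftmost pixel is bit 7
    and_mask = 0xFF ^ or_mask
    for plane in range(NUM_PLANES):
        addr = plane * PLANE_SIZE + offset
        screen[addr] &= and_mask     # the six and.b instructions
        if (color >> plane) & 1:     # btst #plane,d2
            screen[addr] |= or_mask  # conditional or.b


def read_pixel(screen, x, y):
    """Reassemble the color index from the six planes."""
    offset = y * ROW_BYTES + (x >> 3)
    mask = 0x80 >> (x & 7)
    color = 0
    for plane in range(NUM_PLANES):
        if screen[plane * PLANE_SIZE + offset] & mask:
            color |= 1 << plane
    return color


screen = bytearray(NUM_PLANES * PLANE_SIZE)
draw_pixel(screen, 10, 5, 63)
draw_pixel(screen, 10, 5, 21)        # overdraw: old bits must be cleared
print(read_pixel(screen, 10, 5))
```

The overdraw case is what the AND pass is for; the "No Overdraw" rows in the tables correspond to skipping that first loop step.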
|
19 June 2022, 21:07 | #60 |
Registered User
Join Date: Aug 2010
Location: Italy
Posts: 901
|
@VladR
I never had to write a generic pixel-plotting routine for planar graphics in my life (at least, I can't remember), so this attracted my attention. I couldn't help but take your code and whip up an alternative version that minimizes the memory accesses (which is crucial, given that you're using 6 bitplanes). Written on the fly and totally untested, so apologies if it contains bugs! Anyway, even in that case, it's still good enough to illustrate the concepts.
Code:
asl.w #2,d0
movea.l (a3,d0.w),a1 ;line base address
move.w d1,d0
lsr.w #3,d1 ;X offset
adda.w d1,a1 ;pixel base address
moveq.l #7,d1
and.w d1,d0
sub.w d0,d1 ;bit number
moveq.l #0,d0
bset.l d1,d0 ;OR mask
move.b d0,d1
not.b d1 ;AND mask
move.b ($6000,a1),d3
and.b d1,d3
lsl.b #3,d2
bcc.b .b5
or.b d0,d3
.b5 move.b d3,($6000,a1)
move.b ($4000,a1),d3
and.b d1,d3
add.b d2,d2
bcc.b .b4
or.b d0,d3
.b4 move.b d3,($4000,a1)
move.b ($2000,a1),d3
and.b d1,d3
add.b d2,d2
bcc.b .b3
or.b d0,d3
.b3 move.b d3,($2000,a1)
move.b (a1),d3
and.b d1,d3
add.b d2,d2
bcc.b .b2
or.b d0,d3
.b2 move.b d3,(a1)
move.b (-$2000,a1),d3
and.b d1,d3
add.b d2,d2
bcc.b .b1
or.b d0,d3
.b1 move.b d3,(-$2000,a1)
move.b (-$4000,a1),d3
and.b d1,d3
add.b d2,d2
bcc.b .b0
or.b d0,d3
.b0 move.b d3,(-$4000,a1)
Last edited by saimo; 20 June 2022 at 07:42. Reason: Fixed some offsets I had forgotten after copying and pasting. |
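The listing above pulls the color bits out through the carry flag: one lsl.b #3 moves bit 5 of the 6-bit color into carry, and each following add.b d2,d2 shifts out the next lower bit. The Python sketch below models only that part, to show that the bits come out in the order bit 5 down to bit 0, matching the plane order $6000 ... -$4000. The helper names are made up.

```python
def lsl_b(value, count):
    """8-bit logical shift left; returns (result, carry) like lsl.b."""
    carry = 0
    for _ in range(count):
        carry = (value >> 7) & 1
        value = (value << 1) & 0xFF
    return value, carry


def plane_bits_via_carry(color):
    """Carry values seen by the six bcc tests, highest bitplane first."""
    bits = []
    d2, carry = lsl_b(color, 3)      # lsl.b #3,d2 : carry = color bit 5
    bits.append(carry)
    for _ in range(5):
        d2, carry = lsl_b(d2, 1)     # add.b d2,d2 : carry = next lower bit
        bits.append(carry)
    return bits


print(plane_bits_via_carry(0b101101))
```

Because each plane byte is read once, modified in a register and written once, this variant spends two memory accesses per plane on data instead of up to four.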