Amiga Games I'm willing to fund the development of - Page 3

paraj · 13 June 2022, 18:27

Quote:

Originally Posted by VladR

Given 6 bitplanes, the CPU will be at about 54% utilization after all the DMA - right ?
So, 0.54*119,333 = 64,439c available per frame, which results in 126 px (64439/510) rendered per frame.

Back of the envelope calculation I did was that 320x256x6 would take 320*256*6/16 out of 313*223 available slots ~45%, so 54% left for CPU sounds about right.
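
For anyone who wants to redo that estimate, here is a small Python sketch of the same arithmetic. It only restates the figures used above (313 lines of 223 slots each, one DMA slot per 16-bit bitplane word); the variable names are made up for the example.

```python
# Rough bitplane DMA budget for a 320x256, 6-bitplane PAL screen,
# using the same figures as the estimate above.
WIDTH, HEIGHT, PLANES = 320, 256, 6
LINES, SLOTS_PER_LINE = 313, 223   # slot counts as assumed in the post

bitplane_words = WIDTH * HEIGHT * PLANES // 16  # one slot per 16-bit word
total_slots = LINES * SLOTS_PER_LINE

dma_share = bitplane_words / total_slots
print(f"bitplane fetches: {bitplane_words} of {total_slots} slots")
print(f"DMA share {dma_share:.1%}, left over {1 - dma_share:.1%}")
```

The result lands within a point or two of the 45% / 54% split quoted above, which is as much as a back-of-the-envelope figure can promise.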

Quote:

Originally Posted by VladR

I am certainly curious how we can use the Blitter for this scenario.

Had a quick go at it, and for power-of-two bitplane widths (like a/b suggested, otherwise it's not as easy as I first envisioned) it's straightforward once you have the idea. You start with a list of 16-bit x,y coordinates and a code buffer (of the same size + 2 bytes) in chipmem.

The 1st blitter pass goes in reverse (so you can shift left) over the x coordinates and outputs the wanted instruction with the correct data register. If you can ensure the y-coordinate is preshifted/multiplied by the row size (if you're doing 3d stuff anyway maybe you could fold it into your projection routine) then just a second pass is needed to combine the two into an offset for the instruction. Otherwise you need two more passes, one for x and one for y, shifting them correctly into place.

More concretely, say you have a 256x256 screen with a line going from 0,0 to 255,255:

Code:

linelist: dc.w 0,0,1,1,....

After the first pass you'd have:

Code:

8128 0000                or.b d0,$0000(a0)
8328 0000                or.b d1,$0000(a0)
8528 0000                or.b d2,$0000(a0)
8728 0000                or.b d3,$0000(a0)
...

Then updating the x-coordinate:

Code:

8128 0000                or.b d0,$0000(a0)
8328 0000                or.b d1,$0000(a0)
...

8f28 0000                or.b d7,$0000(a0)
8128 0001                or.b d0,$0001(a0)

Finally the y-coordinate:

Code:

8128 0000                or.b d0,$0000(a0)
8328 0020                or.b d1,$0020(a0)
...

8F28 00E0                or.b d7,$00e0(a0)
8128 0101                or.b d0,$0101(a0)

Put an RTS instruction at the end and you have a function that will (with d0-d7 set to $80..$01) plot pixels in a bitplane. Repeat for all the bitplanes where you need bits set. Or generate eor.b instead if you think it'll be quicker to clear the bitplanes that way rather than with a normal clear. Generate and.b (and set d0-d7 to the complement) if you need explicit clearing. If you need some combination, you'd only need to rerun step 1 to change the instruction.
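
To make the encoding concrete, here is a minimal Python model of what the passes are meant to produce for the 256x256 example (32 bytes per row). It is not blitter code, just the target values; `encode_plot` and the constant names are invented for this sketch.

```python
# Model of what the blitter passes produce for a 256-pixel-wide bitplane
# (32 bytes per row). Function and constant names are illustrative only.
OR_B_D0_D16_A0 = 0x8128   # or.b d0,d16(a0); bits 9-11 select the data register
ROW_BYTES = 256 // 8

def encode_plot(x, y):
    """Return (opcode word, displacement word) for one point."""
    opcode = OR_B_D0_D16_A0 | ((x & 7) << 9)   # pass 1: pick d0..d7 from x & 7
    offset = (x >> 3) + y * ROW_BYTES          # later passes: byte offset
    return opcode, offset

for point in [(0, 0), (1, 1), (7, 7), (8, 8)]:
    op, ofs = encode_plot(*point)
    print(f"{op:04x} {ofs:04x}    or.b d{(op >> 9) & 7},${ofs:04x}(a0)")
```

Point (1,1) comes out with displacement $0020, i.e. one row of 32 bytes, and (8,8) with $0101, matching the last listing.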

a/b · 13 June 2022, 20:15

You can still do it in 2 passes by shifting Y 16-* in the opposite direction (and doing a few minor adjustments).
In one of my routines I'm using fixed point 11:5 for both X and Y, and screen width 512 (64=2^6 bytes). So I'd have to shift X by 3+5=8 to the right, and shift Y by 6-5=1 to the left. And since that doesn't work, I did 16-(6-5)=15 to the right for Y. With adjusted Y bltptr and an extra row in bltsize (height+1).

EDIT: Forgot to mention that you also have to (manually) patch every 1024th pixel because you're doing multiple blits and the Y bits you shift out of the last row won't carry over to next blit's first row.
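
A rough Python model of why the 15-bit right shift works as a 1-bit left shift. It assumes a simplified view of the blitter's A-channel shifter in ascending mode (each output word is built from the previous and the current source word) and assumes the unwanted low bits get masked off afterwards; the function name and the mask are part of the sketch, not of the actual routine.

```python
# Simplified model of the blitter's A-channel barrel shifter in ascending
# mode: each output word is built from the previous and current source words.
def blit_shift_right(words, shift):
    out, prev = [], 0
    for cur in words:
        out.append((((prev << 16) | cur) >> shift) & 0xFFFF)
        prev = cur
    return out

Y_FRAC_BITS = 5      # 11:5 fixed point
ROW_SHIFT = 6        # 512 pixels = 64 = 2**6 bytes per row
MASK = 0xFFFF & ~((1 << ROW_SHIFT) - 1)   # keep only the row-offset bits

ys_fixed = [y << Y_FRAC_BITS for y in (1, 2, 300)]
# One extra (dummy) word so the last Y is pushed out: the "height+1" row.
shifted = blit_shift_right(ys_fixed + [0], 16 - (ROW_SHIFT - Y_FRAC_BITS))
offsets = [w & MASK for w in shifted[1:]]  # results land one word later
print(offsets)
```

Each result shows up one word after its source, which is where the adjusted pointer and the extra row come from. In the same model, the first word of every new blit starts with an empty "previous" word, hence the manual patching.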

paraj · 14 June 2022, 18:20

Quote:

Originally Posted by a/b

You can still do it in 2 passes by shifting Y 16-* in the opposite direction (and doing a few minor adjustments).
In one of my routines I'm using fixed point 11:5 for both X and Y, and screen width 512 (64=2^6 bytes). So I'd have to shift X by 3+5=8 to the right, and shift Y by 6-5=1 to the left. And since that doesn't work, I did 16-(6-5)=15 to the right for Y. With adjusted Y bltptr and an extra row in bltsize (height+1).

EDIT: Forgot to mention that you also have to (manually) patch every 1024th pixel because you're doing multiple blits and the Y bits you shift out of the last row won't carry over to next blit's first row.

Ah cool, I figured something like that should be possible, but it seemed like it would take quite a bit of work to get right. Maybe now I'll have to give it a shot.

VladR · 15 June 2022, 00:21

Quote:

Originally Posted by paraj

Back of the envelope calculation I did was that 320x256x6 would take 320*256*6/16 out of 313*223 available slots ~45%, so 54% left for CPU sounds about right.

Had a quick go at it, and for power-of-two bitplane widths (like a/b suggested, otherwise it's not as easy as I first envisioned) it's straightforward once you have the idea. You start with a list of 16-bit x,y coordinates and a code buffer (of the same size + 2 bytes) in chipmem.

The 1st blitter pass goes in reverse (so you can shift left) over the x coordinates and outputs the wanted instruction with the correct data register. If you can ensure the y-coordinate is preshifted/multiplied by the row size (if you're doing 3d stuff anyway maybe you could fold it into your projection routine) then just a second pass is needed to combine the two into an offset for the instruction. Otherwise you need two more passes, one for x and one for y, shifting them correctly into place.

More concretely, say you have a 256x256 screen with a line going from 0,0 to 255,255:

Code:

linelist: dc.w 0,0,1,1,....

After the first pass you'd have:

Code:

8128 0000                or.b d0,$0000(a0)
8328 0000                or.b d1,$0000(a0)
8528 0000                or.b d2,$0000(a0)
8728 0000                or.b d3,$0000(a0)
...

Then updating the x-coordinate:

Code:

8128 0000                or.b d0,$0000(a0)
8328 0000                or.b d1,$0000(a0)
...

8f28 0000                or.b d7,$0000(a0)
8128 0001                or.b d0,$0001(a0)

Finally the y-coordinate:

Code:

8128 0000                or.b d0,$0000(a0)
8328 0020                or.b d1,$0020(a0)
...

8F28 00E0                or.b d7,$00e0(a0)
8128 0101                or.b d0,$0101(a0)

Put an RTS instruction at the end and you have a function that will (with d0-d7 set to $80..$01) plot pixels in a bitplane. Repeat for all the bitplanes where you need bits set. Or generate eor.b instead if you think it'll be quicker to clear the bitplanes that way rather than with a normal clear. Generate and.b (and set d0-d7 to the complement) if you need explicit clearing. If you need some combination, you'd only need to rerun step 1 to change the instruction.

Do I get the basic idea right that you prepare all the instruction opcodes first and let the Blitter compute the address offsets and bit masks ?
That is, indeed, very interesting approach!

VladR · 15 June 2022, 00:41

Quote:

Originally Posted by paraj

Also don't know why you'd forgo using LUTs?

...

Even with this version you're not going to be drawing more than a couple of hundred pixels per frame (with a 320x256x6 display active).

I have an update on that. I implemented the LUT (2*320 longs long) and the result is quite underwhelming, honestly. Only 44 cycles were gained

I still have 2 things on my ToDo list that should shave off some more...

Code:

 Version 7
[c] : Cycles
EHB : 0.54*119,333 = 64,439c (available cycles after DMA given ~54% utilization)
---------------------------------------------------------------------
| CPU  |  MHz | Frame [c]  | Colors | DrawPixel [c] |  Pixels/Frame |
---------------------------------------------------------------------
  6502   1.79      24,186       4           33            732.9
 68000   7.16     119,333       4          264            452.0
 68000   7.16      64,439      64          466            138.2
---------------------------------------------------------------------

EDIT:

Code:

 Version 8 - using a 4-cycle btst.b instead of move.l/andi.w
[c] : Cycles
EHB : 0.54*119,333 = 64,439c (available cycles after DMA given ~54% utilization)
---------------------------------------------------------------------
| CPU  |  MHz | Frame [c]  | Colors | DrawPixel [c] |  Pixels/Frame |
---------------------------------------------------------------------
  6502   1.79      24,186       4           33            732.9
 68000   7.16     119,333       4          248            481.1
 68000   7.16      64,439      64          426            151.2
---------------------------------------------------------------------

I suppose, if I could guarantee that there would be no overdraw, then I could get rid of 6x and.l d6,($X000,a1) , which is 140 cycles less for EHB (raising throughput to 225.3 pixels/frame), but it's messing my voxel test data (lots of overdraw for proper 3D perspective) now.
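
The throughput figure is just the cycle budget divided by the per-pixel cost. A quick Python check of the no-overdraw estimate, using only numbers from the tables above (the helper name is made up):

```python
FRAME_CYCLES = 119_333                  # 68000 cycles per frame, from the tables
EHB_CYCLES = int(0.54 * FRAME_CYCLES)   # ~54% left after 6-bitplane DMA

def pixels_per_frame(budget, draw_pixel_cycles):
    return budget / draw_pixel_cycles

with_mask = 426                  # Version 8, 64 colours
without_mask = with_mask - 140   # drop the six and.l instructions
print(EHB_CYCLES, without_mask)
print(round(pixels_per_frame(EHB_CYCLES, without_mask), 1))
```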

EDIT2:

Code:

 Version 9 - LUT for YPOS
[c] : Cycles
EHB : 0.54*119,333 = 64,439c (available cycles after DMA given ~54% utilization)
---------------------------------------------------------------------
| CPU  |  MHz | Frame [c]  | Colors | DrawPixel [c] |  Pixels/Frame |
---------------------------------------------------------------------
  6502   1.79      24,186       4           33            732.9
 68000   7.16     119,333       4          212            562.8
 68000   7.16      64,439      64          390            165.2
---------------------------------------------------------------------
         No Overdraw version (No AND Masking)
 68000   7.16     119,333       4          128            932.2
 68000   7.16      64,439      64          230            280.1
---------------------------------------------------------------------

There is one more option on my ToDo List - using BCLR/BSET instead of AND/OR - doesn't look like it's an instant savings, though (but needs to be implemented for completeness). After that, I am out of ideas...

paraj · 15 June 2022, 18:32

Quote:

Do I get the basic idea right that you prepare all the instruction opcodes first and let the Blitter compute the address offsets and bit masks ?
That is, indeed, very interesting approach!

1st pass can both prepare instruction and "shift" (select correct dN). For example:

Code:

        move.l  #pointlist+4*numpoints-4,bltapt(a6)
        move.l  #codebuffer+4*numpoints-4,bltdpt(a6)
        move.w  #$8128,bltbdat(a6) ; opcode word for or.b d0,d16(a0)
        move.w  #7<<9,bltcdat(a6) ; mask for x
        move.w  #9<<12!SRCA!DEST!$E4,bltcon0(a6) ; $E4 D = Bc+AC, shift x & 7 into right place
        move.w  #BLITREVERSE, bltcon1(a6)
        move.w  #numpoints*64+1,bltsize(a6)
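
As a sanity check on the $E4 minterm, here is a small Python sketch that derives the byte from the logic function D = AC + Bc and applies it to a single word. It assumes the usual minterm bit numbering (bit index = A*4 + B*2 + C) and that only x & 7 ends up under the mask after the shift; `minterm` and `first_pass` are names invented for the example.

```python
# Derive the blitter minterm byte for D = (A AND C) OR (B AND NOT C).
# Minterm bit index is A*4 + B*2 + C, as in the hardware LF byte.
def minterm(fn):
    value = 0
    for a in (0, 1):
        for b in (0, 1):
            for c in (0, 1):
                if fn(a, b, c):
                    value |= 1 << (a * 4 + b * 2 + c)
    return value

lf = minterm(lambda a, b, c: (a & c) | (b & (1 - c)))
print(hex(lf))

# Word-level effect of the pass (assumes only x & 7 reaches bits 9-11
# after the 9-bit left shift in descending mode).
def first_pass(x, b=0x8128, c=7 << 9):
    a = (x << 9) & 0xFFFF
    return (a & c) | (b & ~c & 0xFFFF)

print(hex(first_pass(3)))
```

So C acts as a per-bit selector: where the mask is set the shifted x bits go through, everywhere else the constant opcode in B does.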

Quote:

I have an update on that. I implemented the LUT (2*320 longs long) and the result is quite underwhelming, honestly. Only 44 cycles were gained

Not saying LUTs are an instant way to get massive speed, just that they should be in your toolbox, and 44 cycles is nothing to scoff at

Quote:

get rid of 6x and.l d6,($X000,a1)

You should be operating on bytes (or words), not longwords for a putpixel routine. A plain 68000 can only access one word at a time.

Quote:

There is one more option on my ToDo List - using BCLR/BSET instead of AND/OR - doesn't look like it's an instant savings, though (but needs to be implemented for completeness). After that, I am out of ideas

In my example code I have one function for each possible color (so 64 functions for 6bpl) and jump to the correct one with a jump table, and if I didn't miscount my "overdraw" version takes 210 cycles w/o any (other) nasty tricks.

VladR · 16 June 2022, 14:56

Quote:

Originally Posted by paraj

You should be operating on bytes (or words), not longwords for a putpixel routine. A plain 68000 can only access one word at a time.

Yeah, but then you have to split top and bottom 16 bits, which I totally didn't feel like doing just yet

I did keep regretting that decision during last 8 attempts at optimization, as the difference in cycles between 16 and 32-bit adds up pretty quickly everywhere.

Quote:

Originally Posted by paraj

In my example code I have one function for each possible color (so 64 functions for 6bpl) and jump to the correct one with a jump table, and if I didn't miscount my "overdraw" version takes 210 cycles w/o any (other) nasty tricks.

This is an exercise in patience

I certainly didn't mind writing 4 versions of DrawPixel on 6502. But 64?

210 is a really good number for 6 BPL

I guess I am going to have to work for it a bit harder

EDIT: I was just about to do the last item on the ToDo list - BSET/BCLR instead of OR/AND

Except, they don't support the 32-bit addressing mode (only 8-bit). Hence I gotta switch to 8/16-bit access. I don't think I can do the full rewrite of LUTs (and everything else) now, that's possible only during weekend.

paraj · 17 June 2022, 17:56

Quote:

Originally Posted by VladR

Yeah, but then you have to split top and bottom 16 bits, which I totally didn't feel like doing just yet

I did keep regretting that decision during last 8 attempts at optimization, as the difference in cycles between 16 and 32-bit adds up pretty quickly everywhere.

Why? If you're just plotting a pixel you're only accessing a single bit in one byte for each plane?

Quote:

Originally Posted by VladR

This is an exercise in patience

I certainly didn't mind writing 4 versions of DrawPixel on 6502. But 64?

210 is a really good number for 6 BPL

I guess I am going to have to work for it a bit harder

Patience is not needed, a decent assembler is

(code is in post #36)

Learning to use the more advanced features really pays dividends in both speed of development, maintainability and performance. In this case I had to use some slightly esoteric features for the "colfunc" macro (to get a proper label), but otherwise it was bread and butter stuff.

VladR · 17 June 2022, 18:21

Quote:

Originally Posted by paraj

You should be operating on bytes (or words), not longwords for a putpixel routine. A plain 68000 can only access one word at a time.

Well, I did get to rewrite it using bytes this morning (before I started working), but it's slightly underwhelming, performance-wise.

Code:

 Version 10 - accessing Bytes instead of long-words

[c] : Cycles
EHB : 0.54*119,333 = 64,439c (available cycles after DMA given ~54% utilization)
---------------------------------------------------------------------
| CPU  |  MHz | Frame [c]  | Colors | DrawPixel [c] |  Pixels/Frame |
---------------------------------------------------------------------
  6502   1.79      24,186       4           33            732.9
---------------------------------------------------------------------
 68000   7.16     119,333       4          173            689.8
 68000   7.16      64,439      64          337            191.2
---------------------------------------------------------------------
         No Overdraw version (No AND Masking)
 68000   7.16     119,333       4          121            986.2
 68000   7.16      64,439      64          213            302.5
         ErasePixel
 68000   7.16     119,333       4          128            932.3
 68000   7.16      64,439      64          198            325.5

I just checked the cycle table and if I am reading it right, then

Code:

or.b d7,(-$2000,a1)

takes exactly the same 18 cycles as

Code:

bset d7,(-$2000,a1)

Is that correct ? If so, then it makes no sense to write a new version using bclr / bset.

paraj · 17 June 2022, 18:39

Quote:

Originally Posted by VladR

Well, I did get to rewrite it using bytes this morning (before I started working), but it's slightly underwhelming, performance-wise.
I just checked the cycle table and if I am reading it right, then

Code:

or.b d7,(-$2000,a1)

takes exactly the same 18 cycles as

Code:

bset d7,(-$2000,a1)

Is that correct ? If so, then it makes no sense to write a new version using bclr / bset.

bset.b Dn,(ofs,Am) and or.b Dn,(ofs,Am) should both take 16 cycles. Like most 68000 instructions (mul/div/shift being the most common exceptions) they're limited by each memory access taking 4 cycles. In this case 4 (word-sized) memory accesses are needed: 1 for the offset, 2 to do RMW and 1 for prefetch.

The advantage in a plain putpixel routine of using bset would come from not having to calculate a bitmask like you need for or.b (either through a LUT or by shifting).

When I recommended operating on bytes (or words) it was because it seemed like you were doing long word accesses (which double the memory access time).

VladR · 18 June 2022, 00:07

Quote:

Originally Posted by paraj

When I recommended operating on bytes (or words) it was because it seemed like you were doing long word accesses (which double the memory access time).

Yes, that was a deliberate decision on my part to not complicate things too much at the beginning, even though it was instantly obvious there would be some cycles lost due to 32-bit access being much slower.
But I knew that eventually I would get to rewrite it using bytes (which I finally did, though it took some time).

Quote:

Originally Posted by paraj

The advantage in a plain putpixel routine of using bset would come from not having to calculate a bitmask like you need for or.b (either through a LUT or by shifting).

So, using bytes provided an opportunity to merge both OR and AND mask together (inside LUT) with XPOS byte offset from the start of the line. All 3 things fit into 4 bytes now.
It was faster to compute the mask when it was 4 bytes in a LUT, but it's 14 cycles to read as a byte, so it makes sense now.

It's only 10c faster, but every little bit helps. And it looks like I finally matched the 6502 with pixel throughput (though, admittedly, 3 separate DrawPixel versions (for each color) would raise that number).

Code:

 Version 11 - OR mask (byte) from LUT

[c] : Cycles
EHB : 0.54*119,333 = 64,439c (available cycles after DMA given ~54% utilization)
---------------------------------------------------------------------
| CPU  |  MHz | Frame [c]  | Colors | DrawPixel [c] |  Pixels/Frame |
---------------------------------------------------------------------
  6502   1.79      24,186       4           33            732.9
---------------------------------------------------------------------
 68000   7.16     119,333       4          163            732.1
 68000   7.16      64,439      64          327            197.1
---------------------------------------------------------------------
         No Overdraw version (No AND Masking)
 68000   7.16     119,333       4          113          1,056.0
 68000   7.16      64,439      64          205            314.3
         ErasePixel
 68000   7.16     119,333       4          118          1,011.3
 68000   7.16      64,439      64          190            339.2

VladR · 18 June 2022, 00:15

Quote:

Originally Posted by paraj

bset.b Dn,(ofs,Am) and or.b Dn,(ofs,Am) should both take 16 cycles. Like most 68000 instructions (mul/div/shift being the most common exceptions) they're limited by each memory access taking 4 cycles. In this case 4 (word-sized) memory accesses are needed: 1 for the offset, 2 to do RMW and 1 for prefetch.

Thank you!
I do have another cycle question. I just started using a different cycle table from https://mrjester.hapisan.com/04_MC68/CycleTimes.htm

But those numbers are different from the ones I was inferring from the PDF I got.
If I am reading it right, then the following op (ColorByte is a variable, so I presume it's the (addr).l column) takes 20c (I repeat it 6x, once for each BP), which is way more than I read from the PDF.

Code:

btst.b #0,ColorByte

If that's indeed 20c, I gotta revert to previous way...

a/b · 18 June 2022, 00:28

Yeah, 20c (4 fetch, 4 src operand, 2x4 dst operand, 4 mem read).
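
Both cycle counts quoted in this thread follow from counting word-sized bus accesses at 4 cycles each. A tiny Python sketch of that bookkeeping, assuming no extra wait states from DMA contention; the dictionary labels are descriptive names chosen for the example:

```python
BUS_ACCESS_CYCLES = 4   # one 68000 memory access, with no wait states

def cycles(accesses):
    return BUS_ACCESS_CYCLES * sum(accesses.values())

# or.b Dn,d16(An) / bset Dn,d16(An): breakdown as given earlier in the thread
or_b_d16 = {"displacement word": 1, "read-modify-write": 2, "prefetch": 1}
# btst #n,(addr).l: breakdown as given in the post above
btst_abs_long = {"fetch": 1, "bit number word": 1, "address words": 2, "memory read": 1}

print(cycles(or_b_d16), cycles(btst_abs_long))
```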

VladR · 18 June 2022, 01:54

Quote:

Originally Posted by a/b

Yeah, 20c (4 fetch, 4 src operand, 2x4 dst operand, 4 mem read).

Thanks.

But, then the numbers in my last summary table are off and I need to recompute it all. Probably best to go over every single instruction using the new cycle table. Highly likely there are few other errors...

a/b · 18 June 2022, 02:40

I don't make any claims this is 100% accurate but it shouldn't be far off.

VladR · 18 June 2022, 03:53

Damn, this is amazing. Thank you !
I no longer have to keep switching to Adobe Acrobat and can just keep this file open in the second window of Notepad++, seeing both the method I am timing and the table at the same time !!!

VladR · 19 June 2022, 14:05

Quote:

Originally Posted by a/b

Yeah, 20c (4 fetch, 4 src operand, 2x4 dst operand, 4 mem read).

So, that raised the 6-BPL version from 327c to 415c, but I immediately went back to one of the previous versions (to which I had also assigned a wrong cycle value from that PDF) that used btst against a register. It never made sense to me why an op working against a register would be slower than one against RAM, but I didn't question the PDF...
Now I'm at 353c:

Code:

Version 12 - BTST #x,d2

[c] : Cycles
EHB : 0.54*119,333 = 64,439c (available cycles after DMA given ~54% utilization)
---------------------------------------------------------------------
| CPU  |  MHz | Frame [c]  | Colors | DrawPixel [c] |  Pixels/Frame |
---------------------------------------------------------------------
  6502   1.79      24,186       4           33            732.9
---------------------------------------------------------------------
 68000   7.16     119,333       4          165            723.2
 68000   7.16      64,439      64          353            182.5
---------------------------------------------------------------------
         No Overdraw version (No AND Masking)
 68000   7.16     119,333       4          121            986.2
 68000   7.16      64,439      64          237            271.9
         ErasePixel
 68000   7.16     119,333       4           96          1,243.1
 68000   7.16      64,439      64          168            383.6

At this point, for a CPU-based plotter, I should create separate versions for each color, like on Atari (though there were just 4 versions for 4 colors, not 64).
Now I can go examine the Blitter-based approaches...

a/b · 19 June 2022, 17:15

Quote:

Originally Posted by VladR

...btst against a register...

In case you don't have to preserve the color and if I understand correctly what you are doing, did you consider lsl.b #3,dx/bcc for the top bit and then add.b dx,dx/bcc for the rest (instead of btst#y,dx six times)?
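
A small Python simulation of the idea, in case it helps: the colour is shifted left so that each bitplane's bit falls into the carry in turn, from plane 5 down to plane 0. The helper names are made up, and the 8-bit shift is modelled by hand.

```python
def lsl_b(value, count):
    """8-bit logical shift left; returns (result, carry = last bit shifted out)."""
    carry = (value >> (8 - count)) & 1
    return (value << count) & 0xFF, carry

def plane_bits(color):
    """Return [(plane index, bit)] from plane 5 down to plane 0."""
    d, carry = lsl_b(color, 3)        # lsl.b #3,dx : bit 5 falls into carry
    bits = [(5, carry)]
    for plane in (4, 3, 2, 1, 0):
        d, carry = lsl_b(d, 1)        # add.b dx,dx : next bit falls into carry
        bits.append((plane, carry))
    return bits

print(plane_bits(0b101101))
```

Every bit of the colour is visited exactly once before it is lost, which is why the register only has to be expendable, not preserved.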

VladR · 19 June 2022, 19:58

Quote:

Originally Posted by a/b

In case you don't have to preserve the color and if I understand correctly what you are doing, did you consider lsl.b #3,dx/bcc for the top bit and then add.b dx,dx/bcc for the rest (instead of btst#y,dx six times)?

Somewhere around version 4 I was doing bitshifting. But then I found something faster. However, I should revisit it again because of the new cycle table, as I don't recall how many cycles was that. But I don't think it was 10 as the btst...
Either way, I don't think I understand what you mean here. Could you please elaborate ? Once you shift it, you lose those bits. And any of the 6 bits might be on (and across the whole screen they will be, as the input range of color is <0,63>)

Either way, this is my current code:

Code:

    ; d0:ypos   d1:xpos   d2:color     a2/a3: LUTs
		;  Compute Address Offset (xpos,ypos) : (yp*40) + (xp / 8)
	asl.w #2,d0
	move.l	(a3,d0),a1	; a1 = vidPtr [(yp * 40)]
	
	asl.w #2,d1		; d1 = (xpos*4) : ArrayIndex into LUT_XPOS_REL
	add.w	(a2,d1),a1	; a1 += xpos address offset
	move.b	3(a2,d1),d3	; d3 = MaskAND = $FF - (1 << xpRelMask)
		
		; MaskAND:	Clear all bits
	and.b d3,(-$4000,a1)
	and.b d3,(-$2000,a1)
	and.b d3,(a1)
	and.b d3,($2000,a1)
	and.b d3,($4000,a1)
	and.b d3,($6000,a1)
	
		; MaskOR:	d3 = (1 << xpRelMask)
	move.b	2(a2,d1),d3

	btst #0,d2			; 10c 
	beq dp9_2
	or.b d3,(-$4000,a1)		; BP1
		dp9_2:
	btst #1,d2
	beq dp9_3
	or.b d3,(-$2000,a1)		; BP2
		dp9_3:
	btst #2,d2
	beq dp9_4
	or.b d3,(a1)			; BP3
		dp9_4:
	btst #3,d2
	beq dp9_5
	or.b d3,($2000,a1)		; BP4
		dp9_5:
	btst #4,d2
	beq dp9_6
	or.b d3,($4000,a1)		; BP5
		dp9_6:
	btst #5,d2
	beq dp9_7
	or.b d3,($6000,a1)		; BP6
		dp9_7:
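
For reference, the LUT layout the routine above reads can be sketched in Python as follows. The field order (word offset at +0, OR mask at +2, AND mask at +3) is taken from the code; the leftmost-pixel-is-bit-7 ordering, the function names and the `vid_ptr` parameter are assumptions made for the example.

```python
import struct

SCREEN_WIDTH = 320
ROW_BYTES = SCREEN_WIDTH // 8    # 40 bytes per row

def build_xpos_lut():
    """4 bytes per x: word byte-offset, byte OR mask, byte AND mask (big-endian)."""
    lut = bytearray()
    for x in range(SCREEN_WIDTH):
        or_mask = 0x80 >> (x & 7)
        lut += struct.pack(">HBB", x >> 3, or_mask, ~or_mask & 0xFF)
    return bytes(lut)

def build_ypos_lut(vid_ptr, height=256):
    """One long per row: address of the start of that row (vid_ptr is hypothetical)."""
    return b"".join(struct.pack(">I", vid_ptr + y * ROW_BYTES) for y in range(height))

xlut = build_xpos_lut()
print(xlut[9 * 4:9 * 4 + 4].hex())
```

Both tables are indexed with the coordinate times 4, which is what the two asl.w #2 instructions are for.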

saimo · 19 June 2022, 21:07

@VladR

I never had to write a generic pixel-plotting routine for planar graphics in my life (at least, I can't remember), so this attracted my attention. I couldn't help but take your code and whip up an alternative version that minimizes the memory accesses (which is crucial, given that you're using 6 bitplanes).
Written on the fly and totally untested, so apologies if it contains bugs! Anyway, even in that case, it's still good enough to illustrate the concepts.

Code:

    asl.w   #2,d0
    movea.l (a3,d0.w),a1   ;line base address
    move.w  d1,d0
    lsr.w   #3,d1          ;X offset
    adda.w  d1,a1          ;pixel base address

    moveq.l #7,d1
    and.w   d1,d0
    sub.w   d0,d1          ;bit number
    moveq.l #0,d0
    bset.l  d1,d0          ;OR mask
    move.b  d0,d1
    not.b   d1             ;AND mask

    move.b  ($6000,a1),d3
    and.b   d1,d3
    lsl.b   #3,d2
    bcc.b   .b5
    or.b    d0,d3
.b5 move.b  d3,($6000,a1)

    move.b  ($4000,a1),d3
    and.b   d1,d3
    add.b   d2,d2
    bcc.b   .b4
    or.b    d0,d3
.b4 move.b  d3,($4000,a1)

    move.b  ($2000,a1),d3
    and.b   d1,d3
    add.b   d2,d2
    bcc.b   .b3
    or.b    d0,d3
.b3 move.b  d3,($2000,a1)

    move.b  (a1),d3
    and.b   d1,d3
    add.b   d2,d2
    bcc.b   .b2
    or.b    d0,d3
.b2 move.b  d3,(a1)

    move.b  (-$2000,a1),d3
    and.b   d1,d3
    add.b   d2,d2
    bcc.b   .b1
    or.b    d0,d3
.b1 move.b  d3,(-$2000,a1)

    move.b  (-$4000,a1),d3
    and.b   d1,d3
    add.b   d2,d2
    bcc.b   .b0
    or.b    d0,d3
.b0 move.b  d3,(-$4000,a1)
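
A plain Python reference model of what the routine is supposed to do can be handy for checking an implementation against. This is a sketch only: it uses separate bytearrays per bitplane instead of the fixed $2000 spacing, and all names are invented for the example.

```python
PLANES = 6
ROW_BYTES = 40

def plot(planes, x, y, color):
    """planes: list of 6 bytearrays, one per bitplane; overwrites the pixel at (x, y)."""
    offset = y * ROW_BYTES + (x >> 3)
    or_mask = 0x80 >> (x & 7)
    and_mask = ~or_mask & 0xFF
    for p in range(PLANES):
        byte = planes[p][offset] & and_mask   # one read, clear the old bit
        if (color >> p) & 1:
            byte |= or_mask                   # set it if this plane needs it
        planes[p][offset] = byte              # one write

def read_pixel(planes, x, y):
    offset = y * ROW_BYTES + (x >> 3)
    shift = 7 - (x & 7)
    return sum(((planes[p][offset] >> shift) & 1) << p for p in range(PLANES))

screen = [bytearray(ROW_BYTES * 256) for _ in range(PLANES)]
plot(screen, 13, 2, 63)
plot(screen, 13, 2, 0b100110)   # overdraw the same pixel
print(read_pixel(screen, 13, 2))
```

Each plane gets exactly one read and one write per pixel regardless of the colour, which is the memory-access pattern the assembly above is aiming for.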



