Amiga Games I'm willing to fund the development of - Page 4

VladR · 20 June 2022, 02:18

Quote:

Originally Posted by saimo

@VladR
I never had to write a generic pixel-plotting routine for planar graphics in my life (at least, I can't remember), so this attracted my attention.

I have several use cases:
1. Generic 3D Starfield (as in Star Raiders)
2. 3D Point Cloud (as in Rez)
3. Additional Pixel detail (quasi texture-like) over flatshaded polygons via perspective interpolated 5x5 vertex grid

Quote:

Originally Posted by saimo

I couldn't help but take your code and whip up an alternative version that minimizes the memory accesses (which is crucial, given that you're using 6 bitplanes).

Thank you! I truly appreciate the brainstorming here in this community

That's the best way to learn and come up with best code

Quote:

Originally Posted by saimo

Written on the fly and totally untested, so apologies if it contains bugs! - anyway, even in that case, it's still good enough to illustrate the concepts.

Code:

    asl.w   #2,d0
    movea.l (a3,d0.w),a1   ;line base address
    move.w  d1,d0
    lsr.w   #3,d1          ;X offset
    adda.w  d1,a1          ;pixel base address

    moveq.l #7,d1
    and.w   d1,d0
    sub.w   d0,d1          ;bit number
    moveq.l #0,d0
    bset.l  d1,d0          ;OR mask
    move.b  d0,d1
    not.b   d1             ;AND mask

    move.b  ($6000,a1),d3
    and.b   d1,d3
    lsl.b   #3,d2
    bcc.b   .b5
    or.b    d0,d3
.b5 move.b  d3,($6000,a1)

    move.b  ($4000,a1),d3
    and.b   d1,d3
    add.b   d2,d2
    bcc.b   .b4
    or.b    d0,d3
.b4 move.b  d3,($4000,a1)

    move.b  ($2000,a1),d3
    and.b   d1,d3
    add.b   d2,d2
    bcc.b   .b3
    or.b    d0,d3
.b3 move.b  d3,($2000,a1)

    move.b  (a1),d3
    and.b   d1,d3
    add.b   d2,d2
    bcc.b   .b2
    or.b    d0,d3
.b2 move.b  d3,(a1)

    move.b  (-$2000,a1),d3
    and.b   d1,d3
    add.b   d2,d2
    bcc.b   .b1
    or.b    d0,d3
.b1 move.b  d3,(-$2000,a1)

    move.b  (-$4000,a1),d3
    and.b   d1,d3
    add.b   d2,d2
    bcc.b   .b0
    or.b    d0,d3
.b0 move.b  d3,(-$4000,a1)

OK, so it appears that you are doing bit operations in the registers, thus you have to do the memory access twice (even in cases where the bit is 0 and it's not needed).

I have looked up the cycle timings in the text file that @a/b attached a few posts above (apologies if the numbers are incorrect), and this is what one bitplane batch looks like:

Code:

12c     move.b  ($6000,a1),d3
 4c     and.b   d1,d3
12c     lsl.b   #3,d2
10c     bcc.b   .b5
 4c     or.b    d0,d3
    .b5
12c     move.b  d3,($6000,a1)

12+4+12+10+4+12 = 54c
However, the or.b d0,d3 will be executed on average in 50% of cases (each bit has 50% chance of being set as all 64 colors are used across entire screen), so I will count the or.b d0,d3 as 50% - which is 2c, hence 54c-2c = 52c per BitPlane

My current version is 18+10+10 + (18/2) = 47c, assuming I got the cycles right. It's a few cycles less for the one BP that is addressed as (a1).

Code:

    18   and.b d3,(-$4000,a1)
    10   btst  #0,d2
    10   beq   dp9_2
    18   or.b  d3,(-$4000,a1)   ; BP1
dp9_2:

Also, what is the cycle timing of BEQ jumps on the 68000 if the branch is not taken? Is it 10c taken and 12c not taken, perhaps?

a/b · 20 June 2022, 03:18

First pixel is slower (lsl.b #3,d2 12c) than the rest (add.b d2,d2 4c), it's what I was talking about in my previous post about replacing 6x btst #x,d2 (10c).
The basic idea is to push every bit in d2 out to carry flag, which can be done with a quick add.b (except for the first one, to get everything in place by skipping top 2 bits in d2).
So it's 1*12+5*4=32 cycles vs. 6x10=60 cycles. And beq is replaced with bcc.
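To make the carry trick concrete, here is a small Python model of it (an illustrative sketch, not part of the original post; the name `shift_out_bits` is made up). It assumes a 6-bit color sits in the low bits of d2 and mimics how `lsl.b #3,d2` followed by five `add.b d2,d2` leave one color bit at a time in the carry flag, from bit 5 down to bit 0.

```python
def shift_out_bits(color):
    """Model lsl.b #3,d2 followed by five add.b d2,d2 on a 6-bit color.

    Returns the carry flag seen after each instruction (bits 5,4,3,2,1,0).
    """
    d2 = color & 0xFF
    carries = []

    # lsl.b #3,d2: bits 7, 6 and 5 fall out; carry holds the last one (bit 5).
    carries.append((d2 >> 5) & 1)
    d2 = (d2 << 3) & 0xFF

    # add.b d2,d2 doubles the byte, so the old bit 7 ends up in carry.
    for _ in range(5):
        carries.append((d2 >> 7) & 1)
        d2 = (d2 << 1) & 0xFF

    return carries


print(shift_out_bits(0b101100))  # [1, 0, 1, 1, 0, 0]
```

Each carry value then decides whether the `bcc` skips the `or.b` for that bitplane.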

Branch taken/not taken cases for bcc/dbcc/... are all in the table:
|bcc.b |label | 10/8 (taken/not taken)
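As a rough worked example, the 10/8 figure can be plugged into the per-bitplane listing quoted earlier in the thread. The Python sketch below only does the arithmetic; it assumes each color bit is set half of the time, ignores DMA contention and the address/mask setup, and uses made-up names (`plane_cycles`, `average_plane_cycles`).

```python
# Cycle counts as quoted in this thread (68000, no DMA contention assumed).
MOVE_READ = 12                     # move.b (d16,a1),d3
AND_REG = 4                        # and.b d1,d3
OR_REG = 4                         # or.b d0,d3
MOVE_WRITE = 12                    # move.b d3,(d16,a1)
BCC_TAKEN, BCC_NOT_TAKEN = 10, 8   # bcc.b from the table row above
LSL_3 = 12                         # lsl.b #3,d2 (first bitplane only)
ADD_B = 4                          # add.b d2,d2 (remaining bitplanes)


def plane_cycles(shift, bit_set):
    # Bit set -> carry set -> bcc falls through (not taken) and or.b runs.
    branch = BCC_NOT_TAKEN + OR_REG if bit_set else BCC_TAKEN
    return MOVE_READ + AND_REG + shift + branch + MOVE_WRITE


def average_plane_cycles(shift):
    return (plane_cycles(shift, True) + plane_cycles(shift, False)) / 2


first = average_plane_cycles(LSL_3)
rest = average_plane_cycles(ADD_B)
total = first + 5 * rest
print(first, rest, total)  # 51.0 43.0 266.0
```

Under these assumptions the two branch outcomes differ by only 2 cycles (8+4 versus 10), so the average is not very sensitive to how often each bit is set.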

saimo · 20 June 2022, 07:23

@VladR

Quote:

Originally Posted by VladR

OK, so it appears that you are doing bit operations in the registers, thus you have to do the memory access twice (even in cases where the bit is 0 and it's not needed).

2 accesses are the minimum, i.e. it is not possible to do it with fewer. Keep in mind that instructions like and.b d3,(a1) first perform a read and then a write.

Additional optimization:
1. put the addresses relative to the last bitplane in the table;
2. put movea.w #$2000,a2 outside of the plotting loop;
3. read/write bytes with (a1) everywhere;
4. put suba.l a2,a1 at the end of the code of each but the last byte.
On a 68000 it's the same cycle-wise (8 cycles more for suba, 8 cycles less for the addressing modes), but 1 word less per byte is required - in all, 5 words instead of 10, i.e. 5 fewer memory accesses.

Edit: here's the updated code:

Code:

    movea.w #$2000,a2      ;bitplanes distance (somewhere outside of the loop)
    ...
    asl.w   #2,d0
    movea.l (a3,d0.w),a1   ;line base address
    move.w  d1,d0
    lsr.w   #3,d1          ;X offset
    adda.w  d1,a1          ;pixel base address in last bitplane
    
    moveq.l #7,d1
    and.w   d1,d0
    sub.w   d0,d1          ;bit number
    moveq.l #0,d0
    bset.l  d1,d0          ;OR mask
    move.b  d0,d1
    not.b   d1             ;AND mask

    move.b  (a1),d3
    and.b   d1,d3
    lsl.b   #3,d2
    bcc.b   .b5
    or.b    d0,d3
.b5 move.b  d3,(a1)
    suba.l  a2,a1

    move.b  (a1),d3
    and.b   d1,d3
    add.b   d2,d2
    bcc.b   .b4
    or.b    d0,d3
.b4 move.b  d3,(a1)
    suba.l  a2,a1

    move.b  (a1),d3
    and.b   d1,d3
    add.b   d2,d2
    bcc.b   .b3
    or.b    d0,d3
.b3 move.b  d3,(a1)
    suba.l  a2,a1

    move.b  (a1),d3
    and.b   d1,d3
    add.b   d2,d2
    bcc.b   .b2
    or.b    d0,d3
.b2 move.b  d3,(a1)
    suba.l  a2,a1

    move.b  (a1),d3
    and.b   d1,d3
    add.b   d2,d2
    bcc.b   .b1
    or.b    d0,d3
.b1 move.b  d3,(a1)
    suba.l  a2,a1

    move.b  (a1),d3
    and.b   d1,d3
    add.b   d2,d2
    bcc.b   .b0
    or.b    d0,d3
.b0 move.b  d3,(a1)

Note: in my previous post I had forgotten to change some offsets after copying and pasting

As for cycles, see a/b's answer.

EDIT: got some unforeseen extra free time, so I thought I'd calculate the size of the code and the overall number of words read/written from/to RAM (given that the 68000 has no cache, that's to be taken into account as well).

ORIGINAL CODE

setup size: 10
setup reads: 4
plot size: 7*5+5 = 40
plot reads/writes (best): 6*2 = 12
plot reads/writes (average): 6*3 = 18
plot reads/writes (worst): 6*4 = 24
total (best): 66
total (average): 72
total (worst): 78

ALTERNATIVE CODE

setup size (movea.w #$2000,a2 excluded): 13
setup reads: 1
plot size: 8*5+7 = 47
plot reads/writes: 6*2 = 12
total: 73

One thing I forgot to point out is that the alternative code needs only 1 lookup table.
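The totals above are plain sums of the four figures listed for each version. A quick Python check of that arithmetic (a sketch only; the helper name `total_words` is made up, the numbers are the ones from the post):

```python
def total_words(setup_size, setup_reads, plot_size, plot_accesses):
    """Instruction words fetched plus data reads/writes for one pixel."""
    return setup_size + setup_reads + plot_size + plot_accesses


# ORIGINAL CODE: 2, 3 or 4 data accesses per bitplane (best/average/worst).
original = {
    case: total_words(10, 4, 7 * 5 + 5, 6 * per_plane)
    for case, per_plane in (("best", 2), ("average", 3), ("worst", 4))
}

# ALTERNATIVE CODE: always 2 data accesses per bitplane.
alternative = total_words(13, 1, 8 * 5 + 7, 6 * 2)

print(original)     # {'best': 66, 'average': 72, 'worst': 78}
print(alternative)  # 73
```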

VladR · 20 June 2022, 14:49

Quote:

Originally Posted by a/b

First pixel is slower (lsl.b #3,d2 12c) than the rest (add.b d2,d2 4c), it's what I was talking about in my previous post about replacing 6x btst #x,d2 (10c).
The basic idea is to push every bit in d2 out to carry flag, which can be done with a quick add.b (except for the first one, to get everything in place by skipping top 2 bits in d2).
So it's 1*12+5*4=32 cycles vs. 6x10=60 cycles. And beq is replaced with bcc.

I admit I didn't get this before, but now I see why you're shifting in opposite direction - that is indeed quite clever and I haven't considered it

It's Monday, but I should be able to quickly try this approach before I start working...

Quote:

Originally Posted by a/b

Branch taken/not taken cases for bcc/dbcc/... are all in the table:
|bcc.b |label | 10/8 (taken/not taken)

Awesome, I will go adjust the timings accordingly. Since each branch has a 50% chance of being taken, each branch is effectively 9c on average (whole screen of pixels).

VladR · 20 June 2022, 15:21

Quote:

Originally Posted by a/b

First pixel is slower (lsl.b #3,d2 12c) than the rest (add.b d2,d2 4c), it's what I was talking about in my previous post about replacing 6x btst #x,d2 (10c).
The basic idea is to push every bit in d2 out to carry flag, which can be done with a quick add.b (except for the first one, to get everything in place by skipping top 2 bits in d2).
So it's 1*12+5*4=32 cycles vs. 6x10=60 cycles. And beq is replaced with bcc.

Damn, it works

6 BPL dropped from 353c to 319c! And we're finally faster (just a hair) in full-frame pixel throughput (at the same 4 colors) than a 1.79 MHz 6502! Took a lot of effort to beat that puny little micro

Code:

Version 13 - Shifting out the color to the Left

[c] : Cycles
EHB : 0.54*119,333 = 64,439c (available cycles after DMA given ~54% utilization)
---------------------------------------------------------------------
| CPU  |  MHz | Frame [c]  | Colors | DrawPixel [c] |  Pixels/Frame |
---------------------------------------------------------------------
  6502   1.79      24,186       4           33            732.9
---------------------------------------------------------------------
 68000   7.16     119,333       4          159            750.5
 68000   7.16      64,439      64          319            202.0
---------------------------------------------------------------------
         No Overdraw version (No AND Masking)
 68000   7.16     119,333       4          115          1,037.7
 68000   7.16      64,439      64          203            317.4
         ErasePixel
 68000   7.16     119,333       4           96          1,243.1
 68000   7.16      64,439      64          168            383.6
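For reference, the Pixels/Frame column is just the frame budget divided by the DrawPixel cost. The Python sketch below reproduces a few rows from the figures in the table; the names are made up, and the 54% EHB share is taken from the table header as given.

```python
FRAME_CYCLES_6502 = 24_186
FRAME_CYCLES_68000 = 119_333
EHB_SHARE = 0.54  # share of cycles left for the CPU, per the table header

# Cycles available per frame in the 64-color (EHB) case.
ehb_cycles = int(FRAME_CYCLES_68000 * EHB_SHARE)


def pixels_per_frame(frame_cycles, draw_pixel_cycles):
    return round(frame_cycles / draw_pixel_cycles, 1)


print(ehb_cycles)                                  # 64439
print(pixels_per_frame(FRAME_CYCLES_6502, 33))     # 732.9
print(pixels_per_frame(FRAME_CYCLES_68000, 159))   # 750.5
print(pixels_per_frame(ehb_cycles, 319))           # 202.0
```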

VladR · 20 June 2022, 15:33

Quote:

Originally Posted by saimo

EDIT: got some unforeseen extra free time, so I thought I'd calculate the size of the code and the overall number of words read/written from/to RAM (given that the 68000 has no cache, that's to be taken into account as well).

I'll go check out this approach later tonight, but just quick question about the number of RAM reads/writes in 6 BPL (due to heavy DMA load of 6 BPL) :

Will the instructions that operate just on registers (e.g. add.b d2,d2) execute in parallel at every other DMA slot in 6 BPL just like they would in 4 BPL ?

Or will they still have to wait to be executed till they get the available DMA slot despite not doing any RAM R/W whatsoever?

a/b · 20 June 2022, 17:19

Speedy 4 cycle instructions are not a problem. You could see it as they are executed while the next opcode is being fetched (or waiting on a free dma slot), so they are not slowed down "in the middle of execution" (once they are fetched, which can take a while, they are good to go).
The problem are instructions with multiple memory accesses, either extended opcode, operands, or execution. You could calculate how many cycles you have left in 1/50sec for the cpu, but once you are dealing with 5+ bitplanes and/or blitter and/or heavy copper lists, that becomes increasingly inaccurate due to instruction times being extended because of memory access conflicts "in the middle of execution".

paraj · 20 June 2022, 17:31

Quote:

Originally Posted by VladR

I'll go check out this approach later tonight, but just quick question about the number of RAM reads/writes in 6 BPL (due to heavy DMA load of 6 BPL) :

Will the instructions that operate just on registers (e.g. add.b d2,d2) execute in parallel at every other DMA slot in 6 BPL just like they would in 4 BPL ?

Or will they still have to wait to be executed till they get the available DMA slot despite not doing any RAM R/W whatsoever?

They still have to wait because of the need to (pre)fetch the next instructions. I think the easiest way to think about it is like this: Without DMA contention you can just look at raw cycle numbers. With DMA contention you also look at cycle count - 4*number of memory access. The additional cycles (where the CPU isn't accessing memory) can run in parallel and are given "for free" in some sense.

E.g. add.w d0,d0 and add.l d0,d0 will run at the same speed if the CPU can only access memory every other time it wants to (like in 6BPL when Agnus is fetching data).
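A toy Python model of this rule of thumb, for illustration only. It assumes 4 bus cycles per memory access, and that under contention every access effectively costs 8 cycles, with the extra 4 overlapping any internal work. That is a simplification, not a description of the real Agnus slot allocation. The 4c/8c timings for add.w/add.l are the usual 68000 figures, not quoted in the thread, and the function names are made up.

```python
def internal_cycles(total_cycles, accesses):
    """Cycles in which the CPU is not on the bus (cycle count - 4*accesses)."""
    return total_cycles - 4 * accesses


def contended_estimate(total_cycles, accesses):
    """Crude estimate when the CPU only gets every other memory slot.

    Assumption: each access costs 8 cycles (4 on the bus + 4 waiting),
    and internal cycles can overlap with the waiting time.
    """
    waiting = 4 * accesses
    leftover = max(0, internal_cycles(total_cycles, accesses) - waiting)
    return 8 * accesses + leftover


# add.w d0,d0: 4 cycles, 1 access (the prefetch of the next opcode)
# add.l d0,d0: 8 cycles, 1 access
print(contended_estimate(4, 1), contended_estimate(8, 1))  # 8 8
```

In this model the 4 extra internal cycles of add.l hide behind the wait for the next free slot, which is why both come out the same.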

VladR · 20 June 2022, 21:28

Quote:

Originally Posted by a/b

Speedy 4 cycle instructions are not a problem. You could see it as they are executed while the next opcode is being fetched (or waiting on a free dma slot), so they are not slowed down "in the middle of execution" (once they are fetched, which can take a while, they are good to go).
The problem are instructions with multiple memory accesses, either extended opcode, operands, or execution. You could calculate how many cycles you have left in 1/50sec for the cpu, but once you are dealing with 5+ bitplanes and/or blitter and/or heavy copper lists, that becomes increasingly inaccurate due to instruction times being extended because of memory access conflicts "in the middle of execution".

Quote:

Originally Posted by paraj

They still have to wait because of the need to (pre)fetch the next instructions. I think the easiest way to think about it is like this: Without DMA contention you can just look at raw cycle numbers. With DMA contention you also look at cycle count - 4*number of memory access. The additional cycles (where the CPU isn't accessing memory) can run in parallel and are given "for free" in some sense.

E.g. add.w d0,d0 and add.l d0,d0 will run at the same speed if the CPU can only access memory every other time it wants to (like in 6BPL when Agnus is fetching data).

Thanks. For a second I thought that maybe I could find some compromise and get it to run at 60fps at 6 BPL (by clearing the pixels drawn last frame instead of double buffering - certainly doable with 32 stars), as I could just do a lot of ops in registers, but since the instructions still need to be fetched from RAM, they still need to wait for a DMA slot.

Of course, it's not impossible, there are still around ~64,000 cycles left for the CPU (assuming no Blitter, Copper or Sprite DMA is happening). It's just that with spikes due to AI, explosions, etc. it might be very hard to avoid frame drops, given the unpredictability of the execution.

It's a challenge, alright

DanScott · 20 June 2022, 23:16

I'm wondering if it might be more efficient to plot a pixel with the blitter in 6bpl ?

fxgogo · 21 June 2022, 00:29

Well this has to be one of the best thread hijacks I have seen in a while and I don't understand assembly at all!!

VladR · 21 June 2022, 01:14

Quote:

Originally Posted by fxgogo

Well this has to be one of the best thread hijacks I have seen in a while and I don't understand assembly at all!!

Yeah, I was hoping some mod would be bored enough to cut this thread 2 pages back into Coders.Asm / HW section

I didn't even notice initially that this is a sticky thread

saimo · 21 June 2022, 02:04

One more optimization to the alternative code (can't go into the details now, sorry)...
The setup part that calculates the masks can be rewritten as follows, after putting moveq.l #7,d4 outside of the loop:

Code:

   and.l   d4,d0
   moveq.l #-128,d1
   lsr.b   d0,d1
   move.b  d1,d0
   not.b   d1

This is on average 1 cycle faster and has the advantage of being 2 words shorter.

Don_Adan · 21 June 2022, 03:56

Perhaps this code:

Code:

    move.b  (a1),d3
    and.b   d1,d3
    add.b   d2,d2
    bcc.b   .b0
    or.b    d0,d3
.b0 move.b  d3,(a1)

Can be replaced with:

Code:

    and.b  (a1),d1
    add.b   d2,d2
    bcc.b   .b0
    or.b    d0,d1
.b0 move.b  d1,(a1)

Or maybe even with:

Code:

    and.b  d1,(a1)
    add.b   d2,d2
    bcc.b   .b0
    or.b    d0,(a1)
.b0

I haven't coded in a long time, but perhaps it can work.

saimo · 21 June 2022, 13:54

@Don_Adan

Quote:

Originally Posted by Don_Adan

Perhaps this code:

Code:

    move.b  (a1),d3
    and.b   d1,d3
    add.b   d2,d2
    bcc.b   .b0
    or.b    d0,d3
.b0 move.b  d3,(a1)

Can be replaced with:

Code:

    and.b  (a1),d1
    add.b   d2,d2
    bcc.b   .b0
    or.b    d0,d1
.b0 move.b  d1,(a1)

Absolutely right - thank you!
(One more late night copy&paste-induced mistake - thanks again for opening my eyes!)

Quote:

Or maybe even with:

Code:

    and.b  d1,(a1)
    add.b   d2,d2
    bcc.b   .b0
    or.b    d0,(a1)
.b0

I'd avoid this, as it causes up to 4 reads/writes, with read-execute-write operations, while saving only 1 instruction word fetch.

@VladR

Adding more information to my previous post...

Here's a side-by-side comparison of the old and new lookup-table-less setup code, with timings:

Code:

    OLD                        NEW

    moveq.l #7,d1 ;4                            ;moveq.l #7,d4 outside of the loop
    and.w   d1,d0 ;4           and.l   d4,d0    ;4
    sub.w   d0,d1 ;4
    moveq.l #0,d0 ;4           moveq.l #-128,d1 ;4
    bset.l  d1,d0 ;6           lsr.b   d0,d1    ;6-20
    move.b  d0,d1 ;4           move.b  d1,d0    ;4
    not.b   d1    ;4           not.b   d1       ;4
                  ;total: 30                    ;total: 22-36, average 29

Although cycle-wise things don't change much, the advantage here is that there are two instructions less, i.e. two words less to fetch, which is crucial given the busy CHIP RAM bus (which tends to reduce the cycle differences).
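As a sanity check, the OLD and NEW sequences compared above can be modelled in Python to confirm they produce the same OR and AND masks for every X coordinate. This is a sketch for illustration: the function names are made up, and the only timing used is the 6-20 range quoted for `lsr.b d0,d1`, read as 6 cycles plus 2 per shifted bit.

```python
def masks_old(x):
    bit = 7 - (x & 7)            # moveq #7,d1 / and.w d1,d0 / sub.w d0,d1
    or_mask = 1 << bit           # moveq #0,d0 / bset.l d1,d0
    and_mask = ~or_mask & 0xFF   # move.b d0,d1 / not.b d1
    return or_mask, and_mask


def masks_new(x):
    shift = x & 7                # and.l d4,d0 (with d4 = 7)
    or_mask = 0x80 >> shift      # moveq #-128,d1 / lsr.b d0,d1
    and_mask = ~or_mask & 0xFF   # move.b d1,d0 / not.b d1
    return or_mask, and_mask


same = all(masks_old(x) == masks_new(x) for x in range(320))
lsr_costs = [6 + 2 * n for n in range(8)]
print(same, min(lsr_costs), max(lsr_costs), sum(lsr_costs) / 8)
# True 6 20 13.0
```

For x = 0 both give the leftmost pixel of the byte (OR mask $80, AND mask $7F).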

The alternative code, modified as per all of the above, thus would look like this:

Code:

* OUTSIDE OF THE PLOTTING LOOP

    movea.w #$2000,a2      ;bitplanes distance
    moveq.l #7,d4          ;X offset mask

* PLOT ROUTINE

    lsl.w   #2,d0
    movea.l (a3,d0.w),a1   ;line base address
    move.w  d1,d0
    lsr.w   #3,d1          ;X offset
    adda.w  d1,a1          ;pixel base address in last bitplane

    and.l   d4,d0
    moveq.l #-128,d1
    lsr.b   d0,d1
    move.b  d1,d0          ;OR mask
    not.b   d1             ;AND mask

    move.b  (a1),d3
    and.b   d1,d3
    lsl.b   #3,d2
    bcc.b   .b5
    or.b    d0,d3
.b5 move.b  d3,(a1)
    suba.l  a2,a1

    move.b  (a1),d3
    and.b   d1,d3
    add.b   d2,d2
    bcc.b   .b4
    or.b    d0,d3
.b4 move.b  d3,(a1)
    suba.l  a2,a1

    move.b  (a1),d3
    and.b   d1,d3
    add.b   d2,d2
    bcc.b   .b3
    or.b    d0,d3
.b3 move.b  d3,(a1)
    suba.l  a2,a1

    move.b  (a1),d3
    and.b   d1,d3
    add.b   d2,d2
    bcc.b   .b2
    or.b    d0,d3
.b2 move.b  d3,(a1)
    suba.l  a2,a1

    move.b  (a1),d3
    and.b   d1,d3
    add.b   d2,d2
    bcc.b   .b1
    or.b    d0,d3
.b1 move.b  d3,(a1)
    suba.l  a2,a1

    and.b   (a1),d1
    add.b   d2,d2
    bcc.b   .b0
    or.b    d0,d1
.b0 move.b  d1,(a1)

The word count is now:
* setup code size: 11
* setup code reads: 1
* plot code size: 8*5+6 = 46
* plot code reads/writes: 6*2 = 12
* total: 70

a/b · 21 June 2022, 16:32

I'd just go with something similar to what paraj posted a few pages ago... If you can handle 2KB of code (written for asm-one/pro):

Code:

****************************************************************

Depth		EQU	6

MKDRAW	MACRO	(Color)
.Start\@
.BPL	SET	0
.Offset	SET	-$4000
	REPT	Depth
		IFNE	(\1)&(1<<.BPL)
			IFEQ	.Offset
			bset	d0,(a0)
			ELSE
			bset	d0,(.Offset,a0)
			ENDIF
		ELSE
			IFEQ	.Offset
			bclr	d0,(a0)
			ELSE
			bclr	d0,(.Offset,a0)
			ENDIF
		ENDIF
.BPL		SET	.BPL+1
.Offset		SET	.Offset+$2000
	ENDR
; 10 bytes free, either rts/bra, or dbf and bra/rts can fit easily
	rts
	DCB.W	(32-(*-.Start\@))/2,$4e71
	ENDM

****************************************************************

; d0=y, d1=x, d2=color, a3=bm_rows
DrawPixel
	lsl.w	#5,d2			; pre-shift d0/d2 if possible
	lsl.w	#2,d0
;	add.w	d0,d0			; faster if mem access
;	add.w	d0,d0			;  is not a problem (-2c)
	move.l	(a3,d0.w),a0
	move.w	d1,d0
	lsr.w	#3,d1
	add.w	d1,a0			; pixel address
	not.b	d0			; bit
	jmp	(.Draw,pc,d2.w)

	ALIGN	0,4
.Draw
.Col	SET	0
	REPT	1<<Depth
		MKDRAW	.Col
.Col		SET	.Col+1
	ENDR

****************************************************************
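For readers who don't speak 68000, the idea behind the listing above can be sketched in Python: one pre-built routine per color, each doing a fixed set or clear per bitplane, selected by indexing with the color. This is an illustrative model only. The bitplanes are plain bytearrays indexed 0 to 5, the 40-bytes-per-row layout is an assumption, and names such as `make_draw` and `read_pixel` are made up.

```python
PLANE_SIZE = 0x2000   # distance between bitplanes, as in the thread
DEPTH = 6
ROUTINE_SIZE = 32     # bytes reserved per color routine by the macro


def make_draw(color):
    """Build the routine for one color: a set or clear per bitplane."""
    ops = [(plane, (color >> plane) & 1) for plane in range(DEPTH)]

    def draw(planes, offset, bit):
        for plane, value in ops:
            if value:
                planes[plane][offset] |= 1 << bit            # bset d0,(ofs,a0)
            else:
                planes[plane][offset] &= ~(1 << bit) & 0xFF  # bclr d0,(ofs,a0)

    return draw


DRAW = [make_draw(c) for c in range(1 << DEPTH)]   # the "jump table"


def draw_pixel(planes, row_offsets, x, y, color):
    offset = row_offsets[y] + (x >> 3)   # pixel address
    bit = ~x & 7                         # not.b d0 (bit number is taken mod 8)
    DRAW[color](planes, offset, bit)


def read_pixel(planes, row_offsets, x, y):
    offset, bit = row_offsets[y] + (x >> 3), ~x & 7
    return sum(((planes[p][offset] >> bit) & 1) << p for p in range(DEPTH))


planes = [bytearray(PLANE_SIZE) for _ in range(DEPTH)]
row_offsets = [y * 40 for y in range(200)]   # assumed 320-pixel-wide rows
draw_pixel(planes, row_offsets, 13, 5, 44)
print(read_pixel(planes, row_offsets, 13, 5))   # 44
print(len(DRAW) * ROUTINE_SIZE)                 # 2048
```

The last line matches the "2KB of code" mentioned in the post: 64 routines of 32 bytes each.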

paraj · 21 June 2022, 19:42

Quote:

Originally Posted by DanScott

I'm wondering if it might be more efficient to plot a pixel with the blitter in 6bpl ?

Interesting thought, and maybe it can be slightly faster, but I think if you have to wait for the blitter to finish (like a good boy) it'll kill any potential benefit (though I'd be happy to be proved wrong).
Something that works (assuming non-interleaved bitmap):
Setup BLTAFWM/BLTALWM/BLTCDAT=$ffff, BLTAMOD=BLTDMOD=bplsize in bytes-2,BLTBMOD=0,BLTCON0=SRCA!SRCB!DEST!$B8 (Ab+BC).
Have a table for each color with 6 words where MSB is set if the pixel should be drawn (e.g. 0 -> 6 times 0, 63 -> $8000, $8000, ...).
Put destination in BLTAPT and BLTDPT (doesn't have to be word aligned), pixel mask from above in BLTBPT and (x&15)<<12 in BLTCON1 then write 64*6+1 to BLTSIZE and off you go.
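The per-color table and the two computed register values from this description can be sketched in Python as follows. It is only an illustration of the data layout: the plane order within the 6 words (lowest bitplane first) is an assumption, and the names are made up.

```python
DEPTH = 6


def color_words(color):
    """6 words per color; MSB set where that bitplane's bit should be drawn."""
    return [0x8000 if (color >> plane) & 1 else 0x0000 for plane in range(DEPTH)]


TABLE = [color_words(c) for c in range(1 << DEPTH)]


def bltcon1_value(x):
    return (x & 15) << 12   # shift for the pixel's position within the word


BLTSIZE = 64 * DEPTH + 1    # height 6 (one row per bitplane), width 1 word

print(TABLE[0])                    # [0, 0, 0, 0, 0, 0]
print([hex(w) for w in TABLE[63]])
print(hex(bltcon1_value(5)), BLTSIZE)   # 0x5000 385
```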

saimo · 21 June 2022, 20:54

Quote:

Originally Posted by a/b

I'd just go with something similar to what paraj posted a few pages ago...

Yeah, as far as CPU-only routines go, that's as fast as it gets (accessory code aside, which is in the same ballpark anyway, 6 instructions, 5*2+1 = 11 words and 6*2 = 12 reads/writes are the bare minimum; I've been thinking of minimizing the writes by restricting them to just when changes are needed, but ANDs and branches cancel the theoretical advantages).

VladR · 22 June 2022, 16:17

Quote:

Originally Posted by a/b

I'd just go with something similar to what paraj posted a few pages ago... If you can handle 2KB of code (written for asm-one/pro):

Yeah, 2 KB out of 512 KB is nothing compared to 64 KB on the Atari 800XL.
The fastest version will undergo the full unrolling (e.g. jump table to 64 routines).

Besides, I am pretty sure I will have 3 different options for end user, and thus 3 different pixel rasterizer sets in code:
1. 2 BPL
2. 4 BPL
3. 6 BPL

Since everything else in the game takes exact same time, in theory, to account for these 3 drastically different scenarios (on the bus), all I would have to do is change the number of pixels (stars) while framerate will stay unchanged.

I am toying with the idea of not clearing the framebuffer and just erasing the stars from last frame, since that would be doable (outside of cutscenes) if everything else was done with sprites (with 512 KB RAM, I could pre-render all 3D meshes into sprites at loading time).
And this would give me almost a full frame (out of 3), as that's how long it takes to clear 6 planes via the Blitter...

a/b · 22 June 2022, 17:09

Then I would suggest that you also record a bitmap ptr for each star as you draw them (maybe overwrite x/y to save space), so the clearing is then simply: read a bitmap ptr, set whole byte to 0 for each bitplane.
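A minimal Python model of that suggestion, assuming the bitplanes are bytearrays $2000 bytes apart and that the recorded "bitmap ptr" is simply the byte offset of the star (all names here are made up for illustration):

```python
PLANE_SIZE = 0x2000
DEPTH = 6


def plot(planes, offset, bit, color, drawn):
    """Plot one star and remember where it went."""
    for plane in range(DEPTH):
        if (color >> plane) & 1:
            planes[plane][offset] |= 1 << bit
        else:
            planes[plane][offset] &= ~(1 << bit) & 0xFF
    drawn.append(offset)   # the recorded bitmap pointer


def erase_all(planes, drawn):
    """Erase last frame's stars: zero the whole byte in every bitplane."""
    for offset in drawn:
        for plane in range(DEPTH):
            planes[plane][offset] = 0
    drawn.clear()


planes = [bytearray(PLANE_SIZE) for _ in range(DEPTH)]
drawn = []
plot(planes, 100, 7, 63, drawn)
plot(planes, 2500, 2, 21, drawn)
erase_all(planes, drawn)
print(any(any(p) for p in planes))  # False
```

Note that clearing the whole byte also wipes any other pixel sharing that byte, which is fine when only stars are drawn on those bitplanes.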
