English Amiga Board


Old 20 June 2022, 02:18   #61
VladR
Registered User
 
Join Date: Dec 2019
Location: North Dakota
Posts: 741
Quote:
Originally Posted by saimo View Post
@VladR
I never had to write a generic pixel-plotting routine for planar graphics in my life (at least, I can't remember), so this attracted my attention.
I have several use cases:
1. Generic 3D Starfield (as in Star Raiders)
2. 3D Point Cloud (as in Rez)
3. Additional Pixel detail (quasi texture-like) over flatshaded polygons via perspective interpolated 5x5 vertex grid

Quote:
Originally Posted by saimo View Post
I couldn't help but take your code and whip up an alternative version that minimizes the memory accesses (which is crucial, given that you're using 6 bitplanes).
Thank you! I truly appreciate the brainstorming here in this community.
That's the best way to learn and come up with the best code.

Quote:
Originally Posted by saimo View Post
Written on the fly and totally untested, so apologies if it contains bugs! Anyway, even in that case, it's still good enough to illustrate the concepts.

Code:
    asl.w   #2,d0
    movea.l (a3,d0.w),a1   ;line base address
    move.w  d1,d0
    lsr.w   #3,d1          ;X offset
    adda.w  d1,a1          ;pixel base address

    moveq.l #7,d1
    and.w   d1,d0
    sub.w   d0,d1          ;bit number
    moveq.l #0,d0
    bset.l  d1,d0          ;OR mask
    move.b  d0,d1
    not.b   d1             ;AND mask

    move.b  ($6000,a1),d3
    and.b   d1,d3
    lsl.b   #3,d2
    bcc.b   .b5
    or.b    d0,d3
.b5 move.b  d3,($6000,a1)

    move.b  ($4000,a1),d3
    and.b   d1,d3
    add.b   d2,d2
    bcc.b   .b4
    or.b    d0,d3
.b4 move.b  d3,($4000,a1)

    move.b  ($2000,a1),d3
    and.b   d1,d3
    add.b   d2,d2
    bcc.b   .b3
    or.b    d0,d3
.b3 move.b  d3,($2000,a1)

    move.b  (a1),d3
    and.b   d1,d3
    add.b   d2,d2
    bcc.b   .b2
    or.b    d0,d3
.b2 move.b  d3,(a1)

    move.b  (-$2000,a1),d3
    and.b   d1,d3
    add.b   d2,d2
    bcc.b   .b1
    or.b    d0,d3
.b1 move.b  d3,(-$2000,a1)

    move.b  (-$4000,a1),d3
    and.b   d1,d3
    add.b   d2,d2
    bcc.b   .b0
    or.b    d0,d3
.b0 move.b  d3,(-$4000,a1)
OK, so it appears that you are doing bit operations in the registers, thus you have to do the memory access twice (even in cases where the bit is 0 and it's not needed).

I have looked up the cycle timings in the text file that @a/b attached a few posts above (apologies if the numbers are incorrect), and this is what one bitplane batch looks like:
Code:
12c     move.b  ($6000,a1),d3
 4c     and.b   d1,d3
12c     lsl.b   #3,d2
10c     bcc.b   .b5
 4c     or.b    d0,d3
.b5
12c     move.b  d3,($6000,a1)
12+4+12+10+4+12 = 54c
However, or.b d0,d3 is executed in about 50% of cases on average (each bit has a 50% chance of being set, as all 64 colors are used across the entire screen), so I will count it at half its cost, which is 2c. Hence 54c-2c = 52c per bitplane.

My current version is 18+10+10 + (18/2) = 47c, assuming I got the cycles right. It's a few cycles less for the one bitplane that is addressed as (a1).
Code:
    18   and.b   d3,(-$4000,a1)
    10   btst    #0,d2
    10   beq     dp9_2
    18   or.b    d3,(-$4000,a1)     ; BP1
dp9_2:

Also, what is the cycle timing of BEQ on the 68000 if the branch is not taken? Is it 10c taken and 12c not taken, perhaps?
VladR is offline  
Old 20 June 2022, 03:18   #62
a/b
Registered User
 
Join Date: Jun 2016
Location: europe
Posts: 1,039
First pixel is slower (lsl.b #3,d2 12c) than the rest (add.b d2,d2 4c), it's what I was talking about in my previous post about replacing 6x btst #x,d2 (10c).
The basic idea is to push every bit in d2 out to carry flag, which can be done with a quick add.b (except for the first one, to get everything in place by skipping top 2 bits in d2).
So it's 1*12+5*4=32 cycles vs. 6x10=60 cycles. And beq is replaced with bcc.

Branch taken/not taken cases for bcc/dbcc/... are all in the table:
|bcc.b |label | 10/8 (taken/not taken)
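To double-check the carry trick described above, here is a minimal Python model (not 68000 code; the function names are made up for illustration). It mimics what lsl.b #3,d2 followed by five add.b d2,d2 leave in the carry flag, and compares that with testing bits 5..0 directly, the way six btst instructions would.

```python
def planes_by_carry(color):
    """Bits 5..0 of a 6-bit color, read via the carry flag."""
    d2 = color & 0xFF
    bits = []
    # lsl.b #3,d2: the last bit shifted out (old bit 5) ends up in carry
    carry = (d2 >> 5) & 1
    d2 = (d2 << 3) & 0xFF
    bits.append(carry)
    for _ in range(5):
        # add.b d2,d2: old bit 7 goes to carry, the byte doubles
        carry = (d2 >> 7) & 1
        d2 = (d2 << 1) & 0xFF
        bits.append(carry)
    return bits


def planes_by_btst(color):
    """Same bits, read with one explicit bit test per plane."""
    return [(color >> n) & 1 for n in range(5, -1, -1)]


# Cycle totals quoted in the post above
SHIFT_CYCLES = 1 * 12 + 5 * 4   # one lsl.b #3 plus five add.b
BTST_CYCLES = 6 * 10            # six btst #x,d2
```

Both functions return the same list for every color from 0 to 63, so bcc can replace beq with the branch sense unchanged (branch when the bit is 0).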
a/b is offline  
Old 20 June 2022, 07:23   #63
saimo
Registered User
 
Join Date: Aug 2010
Location: Italy
Posts: 787
@VladR

Quote:
Originally Posted by VladR View Post
OK, so it appears that you are doing bit operations in the registers, thus you have to do the memory access twice (even in cases where the bit is 0 and it's not needed).
2 accesses are the minimum, i.e. it can't be done with fewer. Keep in mind that instructions like and.b d3,(a1) perform first a read and then a write.

Additional optimization:
1. put the addresses relative to the last bitplane in the table;
2. put movea.w #$2000,a2 outside of the plotting loop;
3. read/write bytes with (a1) everywhere;
4. put suba.l a2,a1 at the end of the code of each but the last byte.
On a 68000 it's the same cycle-wise (8 cycles more for suba, 8 cycles fewer for the addressing modes), but 1 word fewer per byte is required: in all, 5 words instead of 10, i.e. 5 fewer memory accesses.

Edit: here's the updated code:
Code:
    movea.w #$2000,a2      ;bitplanes distance (somewhere outside of the loop)
    ...
    asl.w   #2,d0
    movea.l (a3,d0.w),a1   ;line base address
    move.w  d1,d0
    lsr.w   #3,d1          ;X offset
    adda.w  d1,a1          ;pixel base address in last bitplane
    
    moveq.l #7,d1
    and.w   d1,d0
    sub.w   d0,d1          ;bit number
    moveq.l #0,d0
    bset.l  d1,d0          ;OR mask
    move.b  d0,d1
    not.b   d1             ;AND mask

    move.b  (a1),d3
    and.b   d1,d3
    lsl.b   #3,d2
    bcc.b   .b5
    or.b    d0,d3
.b5 move.b  d3,(a1)
    suba.l  a2,a1

    move.b  (a1),d3
    and.b   d1,d3
    add.b   d2,d2
    bcc.b   .b4
    or.b    d0,d3
.b4 move.b  d3,(a1)
    suba.l  a2,a1

    move.b  (a1),d3
    and.b   d1,d3
    add.b   d2,d2
    bcc.b   .b3
    or.b    d0,d3
.b3 move.b  d3,(a1)
    suba.l  a2,a1

    move.b  (a1),d3
    and.b   d1,d3
    add.b   d2,d2
    bcc.b   .b2
    or.b    d0,d3
.b2 move.b  d3,(a1)
    suba.l  a2,a1

    move.b  (a1),d3
    and.b   d1,d3
    add.b   d2,d2
    bcc.b   .b1
    or.b    d0,d3
.b1 move.b  d3,(a1)
    suba.l  a2,a1

    move.b  (a1),d3
    and.b   d1,d3
    add.b   d2,d2
    bcc.b   .b0
    or.b    d0,d3
.b0 move.b  d3,(a1)
Note: in my previous post I had forgotten to change some offsets after copying and pasting

As for cycles, see a/b's answer.

EDIT: got some unforeseen extra free time, so I thought I'd calculate the size of the code and the overall number of words read/written from/to RAM (given that the 68000 has no cache, that's to be taken into account as well).

ORIGINAL CODE

setup size: 10
setup reads: 4
plot size: 7*5+5 = 40
plot reads/writes (best): 6*2 = 12
plot reads/writes (average): 6*3 = 18
plot reads/writes (worst): 6*4 = 24
total (best): 66
total (average): 72
total (worst): 78

ALTERNATIVE CODE

setup size (movea.w #$2000,a2 excluded): 13
setup reads: 1
plot size: 8*5+7 = 47
plot reads/writes: 6*2 = 12
total: 73

One thing I forgot to point out is that the alternative code needs only 1 lookup table.

Last edited by saimo; 20 June 2022 at 11:43.
saimo is offline  
Old 20 June 2022, 14:49   #64
VladR
Registered User
 
Join Date: Dec 2019
Location: North Dakota
Posts: 741
Quote:
Originally Posted by a/b View Post
First pixel is slower (lsl.b #3,d2 12c) than the rest (add.b d2,d2 4c), it's what I was talking about in my previous post about replacing 6x btst #x,d2 (10c).
The basic idea is to push every bit in d2 out to carry flag, which can be done with a quick add.b (except for the first one, to get everything in place by skipping top 2 bits in d2).
So it's 1*12+5*4=32 cycles vs. 6x10=60 cycles. And beq is replaced with bcc.
I admit I didn't get this before, but now I see why you're shifting in the opposite direction. That is indeed quite clever and I hadn't considered it. It's Monday, but I should be able to quickly try this approach before I start working...

Quote:
Originally Posted by a/b View Post
Branch taken/not taken cases for bcc/dbcc/... are all in the table:
|bcc.b |label | 10/8 (taken/not taken)
Awesome, I will go adjust the timings accordingly. Since each branch has a 50% chance of being taken, each branch is effectively 9c on average (whole screen of pixels).
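As a sanity check on the adjusted timings, here is a small Python tally of one bitplane block of the register-based version, using only the cycle figures quoted in this thread (12c for the displacement moves, 4c for the register ops, 10/8 for bcc). It ignores the setup code, the cheaper (a1) plane and any DMA contention, so treat it as a rough sketch rather than a measurement.

```python
MOVE_DISP = 12                      # move.b (d16,a1),d3 and move.b d3,(d16,a1)
AND_REG = 4                         # and.b d1,d3
OR_REG = 4                          # or.b d0,d3
LSL_3 = 12                          # lsl.b #3,d2 (first bitplane only)
ADD_REG = 4                         # add.b d2,d2 (remaining bitplanes)
BCC_TAKEN, BCC_NOT_TAKEN = 10, 8    # from the table quoted above


def plane_cycles(shift, bit_set):
    """Cycles for one read / mask / shift / branch / write block."""
    cycles = MOVE_DISP + AND_REG + shift
    if bit_set:
        cycles += BCC_NOT_TAKEN + OR_REG   # falls through into or.b
    else:
        cycles += BCC_TAKEN                # branch skips or.b
    return cycles + MOVE_DISP


def average_plane_cycles(shift):
    """Average over a color bit that is set half of the time."""
    return (plane_cycles(shift, True) + plane_cycles(shift, False)) / 2
```

With these figures the first bitplane averages 51c and each of the others 43c, which matches counting the branch as 9c and the or.b as 2c.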
VladR is offline  
Old 20 June 2022, 15:21   #65
VladR
Registered User
 
Join Date: Dec 2019
Location: North Dakota
Posts: 741
Quote:
Originally Posted by a/b View Post
First pixel is slower (lsl.b #3,d2 12c) than the rest (add.b d2,d2 4c), it's what I was talking about in my previous post about replacing 6x btst #x,d2 (10c).
The basic idea is to push every bit in d2 out to carry flag, which can be done with a quick add.b (except for the first one, to get everything in place by skipping top 2 bits in d2).
So it's 1*12+5*4=32 cycles vs. 6x10=60 cycles. And beq is replaced with bcc.
Damn, it works! 6 BPL dropped from 353c to 319c! And we're finally faster (just a hair) in full-frame pixel throughput (at the same 4 colors) than a 1.79 MHz 6502! It took a lot of effort to beat that puny little micro.

Code:
Version 13 - Shifting out the color to the Left

[c] : Cycles
EHB : 0.54*119,333 = 64,439c (available cycles after DMA given ~54% utilization)
---------------------------------------------------------------------
| CPU  |  MHz | Frame [c]  | Colors | DrawPixel [c] |  Pixels/Frame |
---------------------------------------------------------------------
  6502   1.79      24,186       4           33            732.9
---------------------------------------------------------------------
 68000   7.16     119,333       4          159            750.5
 68000   7.16      64,439      64          319            202.0
---------------------------------------------------------------------
         No Overdraw version (No AND Masking)
 68000   7.16     119,333       4          115          1,037.7
 68000   7.16      64,439      64          203            317.4
         ErasePixel
 68000   7.16     119,333       4           96          1,243.1
 68000   7.16      64,439      64          168            383.6
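The Pixels/Frame column is simply the cycle budget divided by the DrawPixel cost. A short Python check of the rows above (the helper name is made up, and the 54% figure is the estimate from the table header):

```python
FRAME_68000 = 119_333                  # cycles per frame at 7.16 MHz
FRAME_EHB = int(0.54 * FRAME_68000)    # budget left after 6 BPL DMA (~54%)
FRAME_6502 = 24_186                    # 6502 budget as given in the table


def pixels_per_frame(frame_cycles, draw_cycles):
    """Pixels that fit in one frame, rounded like the table."""
    return round(frame_cycles / draw_cycles, 1)


rows = {
    "6502, 4 colors": pixels_per_frame(FRAME_6502, 33),
    "68000, 4 colors": pixels_per_frame(FRAME_68000, 159),
    "68000, 64 colors": pixels_per_frame(FRAME_EHB, 319),
    "68000, 64 colors, no overdraw": pixels_per_frame(FRAME_EHB, 203),
    "68000, 64 colors, erase": pixels_per_frame(FRAME_EHB, 168),
}
```

This only reproduces the arithmetic; the real throughput under 6 BPL DMA will deviate because instruction times stretch unevenly.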
VladR is offline  
Old 20 June 2022, 15:33   #66
VladR
Registered User
 
Join Date: Dec 2019
Location: North Dakota
Posts: 741
Quote:
Originally Posted by saimo View Post
EDIT: got some unforeseen extra free time, so I thought I'd calculate the size of the code and the overall number of words read/written from/to RAM (given that the 68000 has no cache, that's to be taken into account as well).
I'll go check out this approach later tonight, but just a quick question about the number of RAM reads/writes in 6 BPL (due to the heavy DMA load of 6 BPL):

Will the instructions that operate just on registers (e.g. add.b d2,d2) execute in parallel at every other DMA slot in 6 BPL, just like they would in 4 BPL?

Or will they still have to wait for an available DMA slot before executing, despite not doing any RAM R/W whatsoever?

Last edited by VladR; 20 June 2022 at 15:34. Reason: typo
VladR is offline  
Old 20 June 2022, 17:19   #67
a/b
Registered User
 
Join Date: Jun 2016
Location: europe
Posts: 1,039
Speedy 4 cycle instructions are not a problem. You could see it as they are executed while the next opcode is being fetched (or waiting on a free dma slot), so they are not slowed down "in the middle of execution" (once they are fetched, which can take a while, they are good to go).
The problem is instructions with multiple memory accesses, whether for the extended opcode, the operands, or the execution. You could calculate how many cycles you have left in 1/50 sec for the CPU, but once you are dealing with 5+ bitplanes and/or blitter and/or heavy copper lists, that becomes increasingly inaccurate due to instruction times being extended because of memory access conflicts "in the middle of execution".
a/b is offline  
Old 20 June 2022, 17:31   #68
paraj
Registered User
 
Join Date: Feb 2017
Location: Denmark
Posts: 1,098
Quote:
Originally Posted by VladR View Post
I'll go check out this approach later tonight, but just a quick question about the number of RAM reads/writes in 6 BPL (due to the heavy DMA load of 6 BPL):

Will the instructions that operate just on registers (e.g. add.b d2,d2) execute in parallel at every other DMA slot in 6 BPL, just like they would in 4 BPL?

Or will they still have to wait for an available DMA slot before executing, despite not doing any RAM R/W whatsoever?

They still have to wait because of the need to (pre)fetch the next instructions. I think the easiest way to think about it is like this: without DMA contention you can just look at raw cycle numbers. With DMA contention you also look at the cycle count minus 4 * the number of memory accesses. Those additional cycles (where the CPU isn't accessing memory) can run in parallel and are given "for free" in some sense.


E.g. add.w d0,d0 and add.l d0,d0 will run at the same speed if the CPU can only access memory every other time it wants to (like in 6BPL when Agnus is fetching data).
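To put numbers on that rule of thumb, here is a deliberately crude Python model. It assumes every memory access costs 4 cycles, and that under contention the CPU gets exactly one access per 8 cycles (every other slot). Both function names and the fixed slot period are assumptions for illustration; real Agnus slot allocation is less regular.

```python
ACCESS_CYCLES = 4   # one 68000 bus access


def free_cycles(raw_cycles, accesses):
    """Cycles where the CPU is not on the bus (the 'free' ones)."""
    return raw_cycles - ACCESS_CYCLES * accesses


def contended_cycles(raw_cycles, accesses, slot_period=8):
    """Rough duration when only one access fits per slot_period cycles.

    Internal cycles overlap with the wait for the next slot, so the
    instruction takes whichever is longer: its raw time or its bus time.
    """
    return max(raw_cycles, slot_period * accesses)


# add.w d0,d0: 4 cycles, 1 access (the prefetch)
# add.l d0,d0: 8 cycles, 1 access
ADD_W = contended_cycles(4, 1)
ADD_L = contended_cycles(8, 1)
```

In this model both adds come out at 8 cycles, while a 12-cycle move.b (d16,a1),d3 with 3 accesses stretches to 24.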
paraj is offline  
Old 20 June 2022, 21:28   #69
VladR
Registered User
 
Join Date: Dec 2019
Location: North Dakota
Posts: 741
Quote:
Originally Posted by a/b View Post
Speedy 4 cycle instructions are not a problem. You could see it as they are executed while the next opcode is being fetched (or waiting on a free dma slot), so they are not slowed down "in the middle of execution" (once they are fetched, which can take a while, they are good to go).
The problem is instructions with multiple memory accesses, whether for the extended opcode, the operands, or the execution. You could calculate how many cycles you have left in 1/50 sec for the CPU, but once you are dealing with 5+ bitplanes and/or blitter and/or heavy copper lists, that becomes increasingly inaccurate due to instruction times being extended because of memory access conflicts "in the middle of execution".
Quote:
Originally Posted by paraj View Post
They still have to wait because of the need to (pre)fetch the next instructions. I think the easiest way to think about it is like this: without DMA contention you can just look at raw cycle numbers. With DMA contention you also look at the cycle count minus 4 * the number of memory accesses. Those additional cycles (where the CPU isn't accessing memory) can run in parallel and are given "for free" in some sense.


E.g. add.w d0,d0 and add.l d0,d0 will run at the same speed if the CPU can only access memory every other time it wants to (like in 6BPL when Agnus is fetching data).
Thanks. For a second I thought that maybe I could find some compromise and get it to run at 60fps at 6 BPL (by clearing the pixels drawn last frame instead of double buffering, which is certainly doable with 32 stars), as I could just do a lot of ops in registers. But since the instructions still need to be fetched from RAM, they still need to wait for a DMA slot.

Of course, it's not impossible: there are still ~64,000 cycles left for the CPU (assuming no Blitter, Copper or Sprite DMA is happening). It's just that with spikes due to AI, explosions, etc. it might be very hard to avoid frame drops, given the unpredictability of the execution.

It's a challenge, alright
VladR is offline  
Old 20 June 2022, 23:16   #70
DanScott
Lemon. / Core Design
 
Join Date: Mar 2016
Location: Tier 5
Posts: 1,209
I'm wondering if it might be more efficient to plot a pixel with the blitter in 6bpl ?
DanScott is offline  
Old 21 June 2022, 00:29   #71
fxgogo
Also known as GarethQ
 
Join Date: May 2019
Location: Twickenham / U.K.
Posts: 715
Well this has to be one of the best thread hijacks I have seen in a while and I don't understand assembly at all!!
fxgogo is offline  
Old 21 June 2022, 01:14   #72
VladR
Registered User
 
Join Date: Dec 2019
Location: North Dakota
Posts: 741
Quote:
Originally Posted by fxgogo View Post
Well this has to be one of the best thread hijacks I have seen in a while and I don't understand assembly at all!!
Yeah, I was hoping some mod would be bored enough to cut this thread 2 pages back into Coders.Asm / HW section
I didn't even notice initially that this is a sticky thread
VladR is offline  
Old 21 June 2022, 02:04   #73
saimo
Registered User
 
Join Date: Aug 2010
Location: Italy
Posts: 787
One more optimization to the alternative code (can't go into the details now, sorry)...
The setup part that calculates the masks can be rewritten as follows, after putting moveq.l #7,d4 outside of the loop:
Code:
   and.l   d4,d0
   moveq.l #-128,d1
   lsr.b   d0,d1
   move.b  d1,d0
   not.b   d1
This is on average 1 cycle faster and has the advantage of being 2 words shorter.
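For anyone who wants to convince themselves that the shorter setup yields the same masks, here is a quick Python model of both sequences (function names are made up; the comments show which instructions each line stands for).

```python
def masks_old(x):
    """Original setup: build the OR mask with bset."""
    bit = 7 - (x & 7)              # moveq #7,d1 / and.w d1,d0 / sub.w d0,d1
    or_mask = 1 << bit             # moveq #0,d0 / bset.l d1,d0
    and_mask = ~or_mask & 0xFF     # move.b d0,d1 / not.b d1
    return or_mask, and_mask


def masks_new(x):
    """Shorter setup: shift $80 right by the pixel position."""
    shift = x & 7                  # and.l d4,d0 (with d4 = 7)
    or_mask = 0x80 >> shift        # moveq #-128,d1 / lsr.b d0,d1
    and_mask = ~or_mask & 0xFF     # move.b d1,d0 / not.b d1
    return or_mask, and_mask
```

Both give $80/$7F for the leftmost pixel of a byte and $01/$FE for the rightmost one.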

Last edited by saimo; 21 June 2022 at 02:11.
saimo is offline  
Old 21 June 2022, 03:56   #74
Don_Adan
Registered User
 
Join Date: Jan 2008
Location: Warsaw/Poland
Age: 55
Posts: 1,957
Perhaps this code:

Code:
    move.b  (a1),d3
    and.b   d1,d3
    add.b   d2,d2
    bcc.b   .b0
    or.b    d0,d3
.b0 move.b  d3,(a1)
Can be replaced with:

Code:
    and.b  (a1),d1
    add.b   d2,d2
    bcc.b   .b0
    or.b    d0,d1
.b0 move.b  d1,(a1)
Or maybe even with:

Code:
    and.b  d1,(a1)
    add.b   d2,d2
    bcc.b   .b0
    or.b    d0,(a1)
.b0
I haven't coded in a long time, but perhaps it can work.
Don_Adan is offline  
Old 21 June 2022, 13:54   #75
saimo
Registered User
 
Join Date: Aug 2010
Location: Italy
Posts: 787
@Don_Adan

Quote:
Originally Posted by Don_Adan View Post
Perhaps this code:
Code:
    move.b  (a1),d3
    and.b   d1,d3
    add.b   d2,d2
    bcc.b   .b0
    or.b    d0,d3
.b0 move.b  d3,(a1)
Can be replaced with:

Code:
    and.b  (a1),d1
    add.b   d2,d2
    bcc.b   .b0
    or.b    d0,d1
.b0 move.b  d1,(a1)
Absolutely right - thank you!
(One more late night copy&paste-induced mistake - thanks again for opening my eyes!)

Quote:
Or maybe even with:

Code:
    and.b  d1,(a1)
    add.b   d2,d2
    bcc.b   .b0
    or.b    d0,(a1)
.b0
I'd avoid this, as it causes up to 4 reads/writes, with read-execute-write operations, while saving only 1 instruction word fetch.


@VladR

Adding more information to my previous post...

Here's a side-by-side comparison of the old and new lookup-table-less setup code, with timings:
Code:
    OLD                        NEW

    moveq.l #7,d1 ;4                            ;moveq.l #7,d4 outside of the loop
    and.w   d1,d0 ;4           and.l   d4,d0    ;4
    sub.w   d0,d1 ;4
    moveq.l #0,d0 ;4           moveq.l #-128,d1 ;4
    bset.l  d1,d0 ;6           lsr.b   d0,d1    ;6-20
    move.b  d0,d1 ;4           move.b  d1,d0    ;4
    not.b   d1    ;4           not.b   d1       ;4
                  ;total: 30                    ;total: 22-36, average 29
Although cycle-wise things don't change much, the advantage here is that there are two instructions less, i.e. two words less to fetch, which is crucial given the busy CHIP RAM bus (which tends to reduce the cycle differences).
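The totals in the comparison can be re-derived in a few lines of Python. The only assumption is that lsr.b d0,d1 costs 6 + 2n cycles for a shift count of n, which is what the 6-20 range in the listing implies, and that every shift count from 0 to 7 is equally likely.

```python
# Cycle figures copied from the OLD column above
OLD_CYCLES = [4, 4, 4, 4, 6, 4, 4]


def new_cycles(shift):
    """NEW column total for a given pixel position within the byte (0-7)."""
    lsr = 6 + 2 * shift            # lsr.b d0,d1 depends on the shift count
    return 4 + 4 + lsr + 4 + 4     # and.l, moveq, lsr.b, move.b, not.b


new_totals = [new_cycles(n) for n in range(8)]
new_average = sum(new_totals) / len(new_totals)
```

So the new version ranges from 22 to 36 cycles with an average of 29, against a flat 30 for the old one.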

The alternative code, modified as per all of the above, thus would look like this:
Code:
* OUTSIDE OF THE PLOTTING LOOP

    movea.w #$2000,a2      ;bitplanes distance
    moveq.l #7,d4          ;X offset mask

* PLOT ROUTINE

    lsl.w   #2,d0
    movea.l (a3,d0.w),a1   ;line base address
    move.w  d1,d0
    lsr.w   #3,d1          ;X offset
    adda.w  d1,a1          ;pixel base address in last bitplane

    and.l   d4,d0
    moveq.l #-128,d1
    lsr.b   d0,d1
    move.b  d1,d0          ;OR mask
    not.b   d1             ;AND mask

    move.b  (a1),d3
    and.b   d1,d3
    lsl.b   #3,d2
    bcc.b   .b5
    or.b    d0,d3
.b5 move.b  d3,(a1)
    suba.l  a2,a1

    move.b  (a1),d3
    and.b   d1,d3
    add.b   d2,d2
    bcc.b   .b4
    or.b    d0,d3
.b4 move.b  d3,(a1)
    suba.l  a2,a1

    move.b  (a1),d3
    and.b   d1,d3
    add.b   d2,d2
    bcc.b   .b3
    or.b    d0,d3
.b3 move.b  d3,(a1)
    suba.l  a2,a1

    move.b  (a1),d3
    and.b   d1,d3
    add.b   d2,d2
    bcc.b   .b2
    or.b    d0,d3
.b2 move.b  d3,(a1)
    suba.l  a2,a1

    move.b  (a1),d3
    and.b   d1,d3
    add.b   d2,d2
    bcc.b   .b1
    or.b    d0,d3
.b1 move.b  d3,(a1)
    suba.l  a2,a1

    and.b   (a1),d1
    add.b   d2,d2
    bcc.b   .b0
    or.b    d0,d1
.b0 move.b  d1,(a1)
The word count is now:
* setup code size: 11
* setup code reads: 1
* plot code size: 8*5+6 = 46
* plot code reads/writes: 6*2 = 12
* total: 70

Last edited by saimo; 21 June 2022 at 18:52.
saimo is offline  
Old 21 June 2022, 16:32   #76
a/b
Registered User
 
Join Date: Jun 2016
Location: europe
Posts: 1,039
I'd just go with something similar to what paraj posted a few pages ago... If you can handle 2KB of code (written for asm-one/pro):
Code:
****************************************************************

Depth		EQU	6

MKDRAW	MACRO	(Color)
.Start\@
.BPL	SET	0
.Offset	SET	-$4000
	REPT	Depth
		IFNE	(\1)&(1<<.BPL)
			IFEQ	.Offset
			bset	d0,(a0)
			ELSE
			bset	d0,(.Offset,a0)
			ENDIF
		ELSE
			IFEQ	.Offset
			bclr	d0,(a0)
			ELSE
			bclr	d0,(.Offset,a0)
			ENDIF
		ENDIF
.BPL		SET	.BPL+1
.Offset		SET	.Offset+$2000
	ENDR
; 10 bytes free, either rts/bra, or dbf and bra/rts can fit easily
	rts
	DCB.W	(32-(*-.Start\@))/2,$4e71
	ENDM

****************************************************************

; d0=y, d1=x, d2=color, a3=bm_rows
DrawPixel
	lsl.w	#5,d2			; pre-shift d0/d2 if possible
	lsl.w	#2,d0
;	add.w	d0,d0			; faster if mem access
;	add.w	d0,d0			;  is not a problem (-2c)
	move.l	(a3,d0.w),a0
	move.w	d1,d0
	lsr.w	#3,d1
	add.w	d1,a0			; pixel address
	not.b	d0			; bit
	jmp	(.Draw,pc,d2.w)

	ALIGN	0,4
.Draw
.Col	SET	0
	REPT	1<<Depth
		MKDRAW	.Col
.Col		SET	.Col+1
	ENDR

****************************************************************
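For readers who don't use asm-one macros, here is a rough Python model of what the MKDRAW expansion produces and how the jump into it behaves. It is only a sketch: ROUTINES stands in for the 64 generated code slots, and draw_pixel takes the already computed byte address (a0 in the listing) instead of doing the row lookup.

```python
PLANE_STRIDE = 0x2000
DEPTH = 6
SLOT_BYTES = 32          # each generated routine is padded to 32 bytes


def make_draw(color):
    """(op, offset) pairs mirroring one MKDRAW expansion for a color."""
    ops = []
    offset = -0x4000
    for bpl in range(DEPTH):
        op = "bset" if color & (1 << bpl) else "bclr"
        ops.append((op, offset))
        offset += PLANE_STRIDE
    return ops


# Stand-in for the REPT 1<<Depth block: one routine per color
ROUTINES = [make_draw(c) for c in range(1 << DEPTH)]


def draw_pixel(memory, base, x, color):
    """Model of jmp (.Draw,pc,d2.w): run the routine for this color."""
    bit = ~x & 7         # not.b d0; bset/bclr on memory use the bit number modulo 8
    for op, offset in ROUTINES[color]:
        if op == "bset":
            memory[base + offset] |= 1 << bit
        else:
            memory[base + offset] &= ~(1 << bit) & 0xFF
```

64 routines of 32 bytes each account for the 2 KB mentioned above, and since every routine touches all six planes, overdrawing with a different color needs no separate masking pass.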
a/b is offline  
Old 21 June 2022, 19:42   #77
paraj
Registered User
 
Join Date: Feb 2017
Location: Denmark
Posts: 1,098
Quote:
Originally Posted by DanScott View Post
I'm wondering if it might be more efficient to plot a pixel with the blitter in 6bpl ?
Interesting thought, and maybe it can be slightly faster, but I think if you have to wait for the blitter to finish (like a good boy) it'll kill any potential benefit (though I'd be happy to be proved wrong).
Something that works (assuming non-interleaved bitmap):
Setup BLTAFWM/BLTALWM/BLTCDAT=$ffff, BLTAMOD=BLTDMOD=bplsize in bytes-2,BLTBMOD=0,BLTCON0=SRCA!SRCB!DEST!$B8 (Ab+BC).
Have a table for each color with 6 words where MSB is set if the pixel should be drawn (e.g. 0 -> 6 times 0, 63 -> $8000, $8000, ...).
Put destination in BLTAPT and BLTDPT (doesn't have to be word aligned), pixel mask from above in BLTBPT and (x&15)<<12 in BLTCON1 then write 64*6+1 to BLTSIZE and off you go.
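The per-color table and the two register values can be generated like this. It is a Python sketch with made-up helper names, and it assumes word 0 of each entry belongs to the first bitplane the blit touches, which the description above does not spell out.

```python
def blit_mask_table(depth=6):
    """One entry per color: 'depth' words, MSB set where the plane bit is 1.

    Assumption: word index == bitplane index (bit 0 of the color first).
    """
    return [
        [0x8000 if color & (1 << plane) else 0x0000 for plane in range(depth)]
        for color in range(1 << depth)
    ]


def bltsize(height, width_words):
    """BLTSIZE value: height in the upper 10 bits, width in words in the lower 6."""
    return (height << 6) | width_words


def bltcon1_shift(x):
    """BLTCON1 value carrying the B shift for pixel column x."""
    return (x & 15) << 12


TABLE = blit_mask_table()
```

bltsize(6, 1) gives the 64*6+1 from the post, i.e. one word wide and six rows high, one row per bitplane.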

Last edited by paraj; 21 June 2022 at 19:49.
paraj is offline  
Old 21 June 2022, 20:54   #78
saimo
Registered User
 
Join Date: Aug 2010
Location: Italy
Posts: 787
Quote:
Originally Posted by a/b View Post
I'd just go with something similar to what paraj posted a few pages ago...
Yeah, as far as CPU-only routines go, that's as fast as it gets. Accessory code aside (which is in the same ballpark anyway), 6 instructions, 5*2+1 = 11 words and 6*2 = 12 reads/writes are the bare minimum. I've been thinking of minimizing the writes by restricting them to just when changes are needed, but the ANDs and branches cancel the theoretical advantages.
saimo is offline  
Old 22 June 2022, 16:17   #79
VladR
Registered User
 
Join Date: Dec 2019
Location: North Dakota
Posts: 741
Quote:
Originally Posted by a/b View Post
I'd just go with something similar to what paraj posted a few pages ago... If you can handle 2KB of code (written for asm-one/pro):
Yeah, 2 KB out of 512 KB is nothing compared to 64 KB on the Atari 800XL.
The fastest version will undergo the full unrolling (e.g. jump table to 64 routines).

Besides, I am pretty sure I will have 3 different options for end user, and thus 3 different pixel rasterizer sets in code:
1. 2 BPL
2. 4 BPL
3. 6 BPL

Since everything else in the game takes exactly the same time, in theory all I would have to do to account for these 3 drastically different scenarios (on the bus) is change the number of pixels (stars), while the framerate stays unchanged.


I am toying with the idea of not clearing the framebuffer and just erasing the stars from last frame, since that would be doable (outside of cutscenes) if everything else was done with sprites (with 512 KB RAM, I could pre-render all 3D meshes into sprites at loading time).
And this would give me back almost a full frame (out of 3), as that's how long it takes to clear 6 planes via the Blitter...
VladR is offline  
Old 22 June 2022, 17:09   #80
a/b
Registered User
 
Join Date: Jun 2016
Location: europe
Posts: 1,039
Then I would suggest that you also record a bitmap ptr for each star as you draw them (maybe overwrite x/y to save space), so the clearing is then simply: read a bitmap ptr, set whole byte to 0 for each bitplane.
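A minimal Python model of that bookkeeping, assuming a non-interleaved bitmap with planes $2000 bytes apart and a plain list as the record of drawn stars (all names here are hypothetical):

```python
PLANE_STRIDE = 0x2000
DEPTH = 6


def plot_star(memory, drawn, base, x, color):
    """Plot one pixel and remember the byte address it landed in."""
    bit = 7 - (x & 7)
    for plane in range(DEPTH):
        addr = base + plane * PLANE_STRIDE
        if color & (1 << plane):
            memory[addr] |= 1 << bit
        else:
            memory[addr] &= ~(1 << bit) & 0xFF
    drawn.append(base)          # the recorded bitmap pointer


def erase_stars(memory, drawn):
    """Clear last frame's stars: zero the whole byte in each bitplane."""
    for base in drawn:
        for plane in range(DEPTH):
            memory[base + plane * PLANE_STRIDE] = 0
    drawn.clear()
```

Note that zeroing the whole byte also wipes any other pixel sharing it, which is fine for stars on an otherwise empty background but not over other graphics.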
a/b is offline  
 

