English Amiga Board


remz 24 May 2022 03:33

Chunky True Color 4 pixels
 
Hi Amiga coders,

I was thinking of something and wanted to see if it would be technically doable. Please correct me at any step if I have made any errors or miscalculations.

The Copper can MOVE to a color register in 8 clock cycles when 4 or fewer bitplanes are enabled. (If I am not mistaken, one Amiga CPU clock cycle lasts as long as one lowres pixel?)
It appears that even with bitplane DMA turned off, the Copper won't go faster than 8 pixels per color change. This seems to indicate that the Copper fetches its two 16-bit instruction words one at a time, with 2 clock cycles of "internal processing" in between, maybe like so:
Code:

read-work-read-Move-...
 0 1  2 3  4 5  6 7

The 68000 appears to act similarly: even with all DMA turned off, it won't go faster, since it usually reads from memory and then does internal work. This is what makes the Amiga's 68000 appear to run at full speed even when bitplane DMA uses all the odd cycles during the displayed portion of the screen.

With that in mind, I thought that by interleaving the Copper and the 68000, both setting color registers, it should be possible to have a 4-pixel-wide chunky full color screen running in 0 bitplanes. (However, the only way I have found so far to emit one word to color #0 in only 8 clock cycles is move d0,(a0), which implies preloading the CPU registers.) The DMA timing would look like this:
Code:

Copper: read-work-read-Move
CPU:    read-Move-read-work
clock:  0 1  2 3  4 5  6 7

But I wanted at least to start a discussion on whether this is feasible.
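As a rough illustration of the interleave in the diagram above, here is a small Python sketch. It is only a paper model, not measured on hardware: it assumes each phase lasts 2 clocks, that one clock equals one lowres pixel, and that both patterns repeat every 8 clocks exactly as drawn. The names `COPPER`, `CPU` and `color_write_clocks` are made up for the example.

```python
# Hypothetical model of the proposed interleave; not measured on real hardware.
# One unit = one 68000 clock = one lowres pixel (assumption taken from the post).
COPPER = ["read", "work", "read", "MOVE"]  # four phases of 2 clocks each
CPU = ["read", "MOVE", "read", "work"]     # same period, shifted by half


def color_write_clocks(pattern, period=8, phase_len=2, periods=2):
    """Return the clock at which each MOVE phase starts."""
    clocks = []
    for p in range(periods):
        for i, phase in enumerate(pattern):
            if phase == "MOVE":
                clocks.append(p * period + i * phase_len)
    return clocks


copper_writes = color_write_clocks(COPPER)
cpu_writes = color_write_clocks(CPU)
merged = sorted(copper_writes + cpu_writes)
gaps = [b - a for a, b in zip(merged, merged[1:])]
print(merged, gaps)
```

If the model holds, the merged color writes land 4 clocks apart, which is where the 4-pixel-wide figure comes from.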

(Note: the CPU doesn't appear to align perfectly at every scanline even with interrupts turned off. Maybe something precise needs to be taken into account, for example one 'nop' every other scanline, perhaps due to alternating horizontal line length; I am not sure at this point.)

Also note that this assumes everything runs from chip RAM. If the CPU runs from fast RAM, then technically bitplane DMA could still fetch up to 4 bitplanes without a problem.

hooverphonique 24 May 2022 09:56

The following thread discusses color changes using cpu: http://eab.abime.net/showthread.php?t=110394

bloodline 24 May 2022 10:05

Quote:

Originally Posted by remz (Post 1546801)

With that in mind, I thought that by interleaving copper and 68000 both setting color registers, it should be possible to have a 4 pixels wide chunky full color screen running in 0 bitplane.

I'm not sure what you are trying to achieve here.

The advantage of a chunky display is to reduce the number of RAM accesses. So for the CPU to write a pixel on a normal 16 colour Amiga planar display it requires at least 4 separate RAM writes (and possibly some reads as well) not to mention masking and shifting (though 68k Bitwise operations mitigate this somewhat).

With a chunky display the pixel can be written with a single RAM write (or perhaps a read and a write in the case of a 4 bit framebuffer where you might need to mask half of the byte you aren't writing to).

I don't see how changing the colour palette multiple times per scanline using both the CPU and the Copper helps in this situation :confused:

ross 24 May 2022 10:32

Quote:

Originally Posted by remz (Post 1546801)
However the only way I found so far to emit one word to color #0 in only 8 clock cycles is move d0,(a0), which implies preloading the cpu registers.

You have already answered what the main problem is :)

Considering a maximum preload of 15 registers (with a7=$dff180) and using 17 changes with the Copper (including the first and last in the line),
you would have a 128-pixel-wide chunky 'screen', (15+17)*4, which does not extend to the full view.

Quote:

(note: The cpu doesn't appear to align perfectly at every scanline even with interrupt turned off
This is another problem that is not trivial to solve; it is possible, but it takes effort.

That said, as an 'academic' problem it's interesting, but there are other ways to make chunky displays that are more usable. ;)
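The width estimate above is simple arithmetic, sketched here in Python. The register count and the number of Copper changes come from the post; the 320-pixel lowres width used for comparison is an assumption added for the example, as are the constant names.

```python
CPU_PRELOADED_REGISTERS = 15  # every register except a7, which holds $dff180
COPPER_CHANGES = 17           # includes the first and last change in the line
PIXELS_PER_CHANGE = 4         # one color write every 4 lowres pixels
LOWRES_WIDTH = 320            # assumption: standard lowres display width

chunky_width = (CPU_PRELOADED_REGISTERS + COPPER_CHANGES) * PIXELS_PER_CHANGE
changes_for_full_width = LOWRES_WIDTH // PIXELS_PER_CHANGE
print(chunky_width, changes_for_full_width)
```

Under these assumptions a full-width line would need 80 changes, well beyond what 15 preloaded registers plus the Copper can supply.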

bloodline 24 May 2022 11:01

Quote:

Originally Posted by ross (Post 1546826)
You have already answered what the main problem is :)

Considering a maximum preload of 15 registers (with a7=$dff180) and using 17 changes with copper (the first and last in the line),
you would have a 128-pixel-wide chunky 'screen', (15+17)*4, which does not extend to the full view.


This is another problem that is not trivial to solve; it is possible, but it takes effort.

That said, as an 'academic' problem it's interesting, but there are other ways to make chunky displays that are more usable. ;)

Ugh! So remz is actually trying to write a realtime Chunky to Planar conversion algorithm?

ross 24 May 2022 11:57

Quote:

Originally Posted by bloodline (Post 1546829)
Ugh! So remz is actually trying to write a realtime Chunky to Planar conversion algorithm?

Well, actually the chunky to planar conversion is not there :)

He is trying to directly display a 12-bit true color buffer on the screen.
The buffer itself is not linear (or double), because it contains the even pixels on one side for the copper, and the odd ones on the other for the cpu.

This of course also leads to the problem of rendering to this buffer (or these buffers).
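To make the split buffer concrete, here is a minimal Python sketch of the layout described above: even pixels go to the Copper's side, odd pixels to the CPU's side. The function names are hypothetical, and a strict even/odd alternation is assumed for simplicity; a real line would also have to account for the extra Copper writes at both ends.

```python
def split_for_copper_and_cpu(row):
    """Split one row of 12-bit colors into (copper_words, cpu_words)."""
    if any(not 0 <= color <= 0xFFF for color in row):
        raise ValueError("colors must be 12-bit values ($000-$FFF)")
    return row[0::2], row[1::2]


def merge_for_display(copper_words, cpu_words):
    """Rebuild the on-screen order: Copper pixel, CPU pixel, and so on."""
    row = []
    for i in range(max(len(copper_words), len(cpu_words))):
        if i < len(copper_words):
            row.append(copper_words[i])
        if i < len(cpu_words):
            row.append(cpu_words[i])
    return row


row = [0x000, 0x111, 0x222, 0x333]
copper_words, cpu_words = split_for_copper_and_cpu(row)
print(copper_words, cpu_words)
```

Any renderer would have to write through a mapping like this, which is the rendering problem mentioned above.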

remz 24 May 2022 19:31

Yes, I was intrigued by the "technical possibility" more than its real-life usefulness. As you both mentioned, the memory layout would be irksome and the timings complicated.
However with fast ram and 16 colors (4 bitplanes), possibly changing 80 colors per scanline could be intriguing.
A sort of "UltraDynamic HighColor" mode?

paraj 24 May 2022 21:02

With fast ram available it probably doesn't make sense to involve the copper at all (at least for changing colors).

defor 24 May 2022 23:14

I'm afraid that using CPU to fetch colors to Denise (plus rather uncomfortable color buffer as some colors are set by CPU and others by Copper) is too restrictive.
Here is a very nice write-up about good old classic copper chunky (on OCS using the 7bpl bug): https://eab.abime.net/showthread.php?t=107015

remz 25 May 2022 00:04

Yes that video of 57 copper chunky trick was very inspiring.
With code running from fast RAM, is it possible to fully saturate the chip RAM bus with just the 680x0 CPU? If I read correctly, the maximum chip RAM bandwidth is 7.15MB/sec?
This would mean being able to set one word every two pixels?
(Meaning a potential 160-pixel-wide true color mode?)
I tried it in WinUAE but I didn't manage to get narrower than 4 pixels wide.

[edit] Thinking about it, setting a color register has nothing to do with chip ram: It is direct access to Denise, so it doesn't have any dma bandwidth restriction.

Does anyone know if the display hardware fetches the color registers at every pixel during a scanline? Maybe there is a limit there too.

defor 25 May 2022 09:38

There are 8 bus cycles per 16 lo-res pixels. The bus arbitration allows CPU to access every second bus cycle only (*if cycle is available). Hence 4 pixels by CPU, at best. You must check cycle counts for your particular processor (and its operating frequency) if it is able to utilize every available bus cycle. Therefore it is very configuration dependent.
(P.S.: Custom registers access (i.e. custom-chips) happens through the bus = all chip-ram access restrictions apply.)
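The "4 pixels by CPU, at best" figure follows directly from the two numbers in the post. A tiny Python sketch of that reasoning, with hypothetical names and a 320-pixel line width assumed only for the usage example:

```python
BUS_CYCLES_PER_BLOCK = 8       # from the post: 8 bus cycles ...
LOWRES_PIXELS_PER_BLOCK = 16   # ... per 16 lowres pixels
CPU_ACCESS_STRIDE = 2          # CPU gets at most every second bus cycle


def lowres_pixels_per_cpu_write():
    """Lowres pixels that pass between two back-to-back CPU bus accesses."""
    pixels_per_bus_cycle = LOWRES_PIXELS_PER_BLOCK // BUS_CYCLES_PER_BLOCK
    return pixels_per_bus_cycle * CPU_ACCESS_STRIDE


def max_cpu_writes_across(width_pixels):
    """Upper bound on CPU register writes while the beam crosses width_pixels."""
    return width_pixels // lowres_pixels_per_cpu_write()


print(lowres_pixels_per_cpu_write(), max_cpu_writes_across(320))
```

This is a best-case bound; as noted above, whether a given processor can actually hit every available cycle depends on the configuration.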

Toni Wilen 25 May 2022 10:11

It is not possible for CPU to access chip ram (or chipset bus) every cycle. All chipset variants have same interleaved CPU access timing: first cycle is used to transfer address to Agnus/Alice (this cycle is always free for chipset DMA), second cycle is used to transfer data.

Fast CPUs waste lots of cycles doing nothing when accessing chip bus.

EDIT: 7M/s is possible if chip ram bus is 32-bit (A3000 or AGA)
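A back-of-the-envelope check of the 7M/s figure, as a Python sketch. It assumes a PAL machine (68000 clock of 7,093,790 Hz, two CPU clocks per chipset cycle) and the best case of one CPU access every second chipset cycle; NTSC values differ slightly. The names are made up for the example.

```python
PAL_CPU_CLOCK_HZ = 7_093_790               # assumption: PAL 68000 clock
CHIPSET_CYCLE_HZ = PAL_CPU_CLOCK_HZ / 2    # one chipset cycle = 2 CPU clocks
CPU_ACCESSES_PER_SECOND = CHIPSET_CYCLE_HZ / 2  # at best every second cycle


def cpu_chip_bandwidth(bus_width_bytes):
    """Best-case CPU bandwidth to chip RAM in bytes per second."""
    return CPU_ACCESSES_PER_SECOND * bus_width_bytes


print(cpu_chip_bandwidth(2))  # 16-bit chip RAM bus
print(cpu_chip_bandwidth(4))  # 32-bit chip RAM bus (A3000 or AGA)
```

With these assumptions the 16-bit bus tops out at roughly 3.5 MB/s for the CPU, and only the 32-bit bus reaches about 7 MB/s.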

remz 25 May 2022 23:23

Quote:

Originally Posted by defor (Post 1546981)
There are 8 bus cycles per 16 lo-res pixels. The bus arbitration allows CPU to access every second bus cycle only (*if cycle is available). Hence 4 pixels by CPU, at best. You must check cycle counts for your particular processor (and its operating frequency) if it is able to utilize every available bus cycle. Therefore it is very configuration dependent.
(P.S.: Custom registers access (i.e. custom-chips) happens through the bus = all chip-ram access restrictions apply.)

"allows CPU to access every second bus cycle only":
You mean that even with all DMA off, the CPU cannot use every bus cycle?
For example, if I tried to MOVEM 64 bytes to set the whole 32-color palette as fast as possible, would the MOVEM itself, when done on chip RAM, not take 14+4*32 = 142 clock cycles to set 64 bytes? (i.e., one color per 2 lowres pixels?)

Toni:
What you are saying is interesting for the Amiga 3000's 32-bit chip RAM: basically I would be inclined to say the Amiga 3000 could be running at "almost AGA" speed. With 32-bit chip RAM access and fast RAM, would the CPU be able to set sprites and colors potentially 4 times faster than the Copper?
This could open the door for massive ECS sprites by recycling them during a scanline by the CPU instead of the copper :) Oh I am tempted to try it :)

roondar 25 May 2022 23:46

Quote:

Originally Posted by remz (Post 1547085)
"allows CPU to access every second bus cycle only":
You mean that even with all DMA off, the CPU cannot use every bus cycle?
For example, if I tried to MOVEM 64 bytes to set the whole 32 color palette as fast as possible, the MOVEM itself when done on chip ram would not be 14+4*32 = 142 clock cycles to set 64 bytes? (i.e.: one color per 2 lo-res pixel?)

Assuming a 68000, it would be around 142 (maybe 144, I thought it was 16+8*registers cycles for a movem.l to an address). However, half of those cycles are not on the bus, but internal to the CPU. You can see this in WinUAE with cycle-accurate timing and the Visual DMA Debugger feature. Doing so, you'll notice that CPU activity is always interleaved with either idle cycles or other DMA, never back to back.
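For reference, the two cycle estimates mentioned in this exchange can be put side by side in Python. Both formulas are copied from the posts as written (one counts words, the other long-word registers) and are not checked against the 68000 timing tables here; the function names are hypothetical.

```python
def movem_cycles_as_asked(words):
    """Estimate from the question: 14 + 4 cycles per 16-bit word."""
    return 14 + 4 * words


def movem_cycles_as_recalled(long_registers):
    """Estimate recalled in the reply: 16 + 8 cycles per long-word register."""
    return 16 + 8 * long_registers


# 64 bytes = 32 color words = 16 long-word registers
asked = movem_cycles_as_asked(32)
recalled = movem_cycles_as_recalled(16)
print(asked, recalled, asked / 32)
```

Either way the total works out to a little over 4 CPU clocks per color word, which is consistent with half of the cycles being internal rather than on the bus.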

This half-internal, half-bus split is also why the 68000 on an OCS/ECS system isn't really slowed down by bitplane DMA until you go to 5 bitplanes lowres or 3 bitplanes hires.

Note however that this explanation is slightly simplified. For one, the CPU can access any cycle on the Chip Memory bus that isn't in use by DMA, it just can't access two cycles back to back.
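The 5-bitplane lowres / 3-bitplane hires threshold can be illustrated with a simplified slot model in Python. The fetch orders below are an assumption recalled from the DMA time slot diagram in the Hardware Reference Manual (0 marks a slot bitplane DMA never uses), and the model only asks whether at least half of the cycles in a block stay free, on the assumption that the 68000 wants at most every other cycle. It ignores all other DMA.

```python
# Assumed bitplane fetch order within one repeating block of chipset cycles
# (0 = slot never used by bitplane DMA). Recalled from the HRM diagram.
LOWRES_SLOTS = [0, 4, 6, 2, 0, 3, 5, 1]  # 8-cycle block
HIRES_SLOTS = [4, 2, 3, 1]               # 4-cycle block


def free_cycles(slots, planes):
    """Indices of cycles not taken by bitplane DMA when `planes` are enabled."""
    return [i for i, plane in enumerate(slots) if plane == 0 or plane > planes]


def cpu_is_slowed(slots, planes):
    """True if fewer than half the cycles in the block remain free."""
    return len(free_cycles(slots, planes)) < len(slots) // 2


for name, slots in (("lowres", LOWRES_SLOTS), ("hires", HIRES_SLOTS)):
    first_slow = min(p for p in range(1, 7) if cpu_is_slowed(slots, p))
    print(name, "first slowdown at", first_slow, "bitplanes")
```

Under this model the first slowdown appears at 5 bitplanes in lowres and 3 in hires, matching the statement above.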
Quote:

Toni:
What you are saying is interesting for the Amiga 3000's 32-bit chip RAM: basically I would be inclined to say the Amiga 3000 could be running at "almost AGA" speed. With 32-bit chip RAM access and fast RAM, would the CPU be able to set sprites and colors potentially 4 times faster than the Copper?
This could open the door for massive ECS sprites by recycling them during a scanline by the CPU instead of the copper :) Oh I am tempted to try it :)
Interesting idea, but do note that Fast RAM isn't infinite speed either, so you'll lose some speed compared to the theoretical maximum you point to here because you'd have to read in the data at some point too. That said, the Copper effectively manages to write only 1 word per 4 DMA cycles and this would be able to write 2 long words in the same time. So if you pre-load the CPU registers, this might be quite interesting.

It might be hard to time correctly, though, and I don't actually know whether the A3000 has static Chip RAM access speeds or whether they are CPU dependent. On the A1200 at least, many CPU cards don't get full bandwidth when accessing Chip RAM; this might also be the case on the A3000?

Edit: the above text was replaced, it erroneously referred to speed differences between the 68000/OCS and 32 bit Chip RAM speeds instead of Copper vs. CPU on 32 bit ECS/AGA.

remz 26 May 2022 00:58

Can you however interleave Copper and CPU to saturate the chip ram bandwidth if bitplane dma is turned off?
The problem that I expect with the copper is that it itself runs off chip ram: so any Move operation costs two word-fetches.

Please correct me if I'm wrong, but the Copper writing to a custom register, is it "using the bus"? From what I understand so far, it seems not: That would mean copper can write to any custom registers (even to other chips like Denise and Paula) "for free"?

bloodline 26 May 2022 08:03

Quote:

Originally Posted by remz (Post 1547102)
Can you however interleave Copper and CPU to saturate the chip ram bandwidth if bitplane dma is turned off?
The problem that I expect with the copper is that it itself runs off chip ram: so any Move operation costs two word-fetches.

Please correct me if I'm wrong, but the Copper writing to a custom register, is it "using the bus"? From what I understand so far, it seems not: That would mean copper can write to any custom registers (even to other chips like Denise and Paula) "for free"?

The Copper uses the chip bus to load its two instruction words, each of which uses a DMA slot. The Copper move instruction uses the first word to load the register address from chip RAM and the second word to load the register’s new value from chip RAM.

-Edit- I don’t call that “for free”, as all Copper instructions use the same two cycles.

Cyprian 26 May 2022 10:13

Quote:

Originally Posted by Toni Wilen (Post 1546983)
It is not possible for CPU to access chip ram (or chipset bus) every cycle. All chipset variants have same interleaved CPU access timing: first cycle is used to transfer address to Agnus/Alice (this cycle is always free for chipset DMA), second cycle is used to transfer data.

Is the same access scheme also valid for hardware registers, e.g. color registers?

Or they can be accessed faster (1 CPU clock or 1 chipset bus)?

Toni Wilen 26 May 2022 10:14

Quote:

Originally Posted by remz (Post 1547085)
would the CPU be able to set sprites and colors potentially 4 times faster than copper?

No because custom registers are 16-bit wide.

32-bit-wide chip RAM (A3000 or AGA):
CPU can read or write 32-bit word (if address is 32-bit aligned) every second chipset cycle.
Custom registers (all chipsets):
CPU can read or write 16-bit word every second chipset cycle.

BPLxDAT and SPRxDAT are also only 16-bit wide from CPU point of view. Only DMA can do AGA 32-bit or 2x32-bit transfers.

CPU can use any free chipset cycle but CPU chipset bus access will always take 2 chipset cycles to complete. (Note that this is from chipset point of view, CPU/accelerator board can have write buffer(s) that can improve performance noticeably)
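The rules above can be turned into a small throughput comparison in Python. The CPU figures come from this post; the Copper figure (1 word per 4 cycles) is the one given earlier in the thread. Write buffers on accelerator boards are ignored, and the helper name is hypothetical.

```python
from fractions import Fraction


def bytes_per_chipset_cycle(bytes_per_access, cycles_per_access):
    """Best-case throughput, counted in bytes per chipset cycle."""
    return Fraction(bytes_per_access, cycles_per_access)


copper_move = bytes_per_chipset_cycle(2, 4)      # 1 word per 4 cycles
cpu_to_custom = bytes_per_chipset_cycle(2, 2)    # 16-bit register, every 2nd cycle
cpu_to_chip_32 = bytes_per_chipset_cycle(4, 2)   # 32-bit chip RAM, every 2nd cycle

print(cpu_to_custom / copper_move)   # CPU vs Copper on custom registers
print(cpu_to_chip_32 / copper_move)  # only reachable on chip RAM, not registers
```

So for color and sprite registers the best case is a factor of 2 over the Copper; the factor of 4 applies only to 32-bit chip RAM accesses.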

Cyprian 26 May 2022 10:52

Quote:

Originally Posted by Toni Wilen (Post 1547135)
Custom registers (all chipsets):
CPU can read or write 16-bit word every second chipset cycle.

...

Only DMA can do AGA 32-bit or 2x32-bit transfers.

thanks for clarification

defor 26 May 2022 13:21

Quote:

Originally Posted by Toni Wilen (Post 1547135)
CPU can use any free chipset cycle but CPU chipset bus access will always take 2 chipset cycles to complete

Does it mean that the CPU can use odd-numbered cycles if they're free, but as soon as the chipset needs them the CPU is "synced" back to even-numbered cycles (because it must wait)?
I thought that the bus controller (Agnus?) strictly allows the CPU to access even-numbered cycles only (if available). The DMA time slot allocation diagram in the HRM suggests that :(.


All times are GMT +2.
