Chunk True Color 4 pixels - Page 2

remz · 26 May 2022, 13:37

Quote:

Originally Posted by bloodline

The Copper uses the chip bus to load its two instruction words, each of which uses a DMA slot. The Copper MOVE instruction uses the first word to load the register address from chip RAM and the second word to load the register’s new value from chip RAM.

-Edit- I don’t call that “for free”, as all Copper instructions use the same two cycles.

Bloodline: Yes indeed, thank you for the explanation. Sorry, I may have lacked precision: the 'free' portion I meant was the part where the Copper actually sets a register.
A CopMove takes 8 clock cycles, which requires 2 DMA slots (interleaved), a bit like:
fetch-wait-fetch-wait
0 1 2 3 4 5 6 7
But somewhere in there, the custom destination register is getting set: would you know when is this happening exactly?
The 'free' part I was implying is that compared to the 68000, which requires one DMA slot to actually write to a custom register, the Copper appears to do this 'for free', or else a CopMove would require 3 slots?

[Edit]: Pondering about it, maybe I am thinking of the Copper too much like a general CPU: Maybe the way it works is more like:
- First Word - Decode: instruction is a Move To Custom Register At Address xxxx: prepare destination for copy
- Second Word - Value to Copy: Transfer Value directly into Destination:
This operation is apparently special and unique to the Copper because it allows a direct copy of a word from chip RAM into a custom register in just one single DMA cycle. The CPU cannot do that, nor can the Blitter (because custom registers are out of its range).
Could that make more sense?

Toni Wilen · 26 May 2022, 14:33

Quote:

Originally Posted by remz

Pondering about it, maybe I am thinking of the Copper too much like a general CPU: Maybe the way it works is more like

Yes. MOVE's first DMA cycle loads register address to internal storage, second DMA cycle writes directly to selected register.
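
A toy model may make that two-step MOVE easier to picture. This is only a sketch under my own assumptions, not a description of the real silicon: `ToyCopper`, `chip_ram` and `custom_regs` are made-up names, and the model ignores slot timing entirely.

```python
class ToyCopper:
    """Toy model of a Copper MOVE: two chip RAM fetches, no separate write slot."""

    def __init__(self, chip_ram, custom_regs):
        self.chip_ram = chip_ram        # hypothetical copper list, as 16-bit words
        self.custom_regs = custom_regs  # hypothetical register file: offset -> value
        self.pc = 0                     # index of the next word to fetch
        self.latched_addr = None        # internal storage for the destination
        self.dma_cycles = 0             # DMA slots consumed so far

    def fetch(self):
        """Read one word from chip RAM; each fetch costs one DMA slot."""
        word = self.chip_ram[self.pc]
        self.pc += 1
        self.dma_cycles += 1
        return word

    def move(self):
        # First DMA cycle: latch the destination register offset.
        self.latched_addr = self.fetch() & 0x01FE
        # Second DMA cycle: the fetched word goes straight into that register.
        self.custom_regs[self.latched_addr] = self.fetch()


# One MOVE that writes $0F00 to register offset $180 (COLOR00).
copper = ToyCopper([0x0180, 0x0F00], {})
copper.move()
print(hex(copper.custom_regs[0x180]), copper.dma_cycles)
```

The point of the sketch is that the register write rides on the second fetch, so the model counts two DMA slots and never a third.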

Quote:

Originally Posted by defor

Does it mean that CPU can use odd numbered cycles if they're free, but as soon as chip-set needs them CPU is "synced" back to even numbered cycles (because he must wait)?
I thought that the bus controller (Agnus?) strictly allows the CPU to access even numbered cycles only (if available). The DMA time slot allocation diagram in the HRM suggests that.

HRM is wrong. The CPU can use both odd and even cycles, but it usually uses even ones only because DMA uses odd cycles first (exceptions: high resolutions/number of planes or blitter), forcing the CPU onto even cycles sooner or later. Many CPU instructions also have a cycle count that is divisible by 4, which keeps the alignment.

bloodline · 27 May 2022, 20:51

Quote:

Originally Posted by remz

[Edit]: Pondering about it, maybe I am thinking of the Copper too much like a general CPU: Maybe the way it works is more like:
- First Word - Decode: instruction is a Move To Custom Register At Address xxxx: prepare destination for copy
- Second Word - Value to Copy: Transfer Value directly into Destination:
This operation is apparently special and unique to the Copper because it allows a direct copy of a word from chip RAM into a custom register in just one single DMA cycle. The CPU cannot do that, nor can the Blitter (because custom registers are out of its range).
Could that make more sense?

It’s just occurred to me that you might not have written a copper list before (I used to struggle with them back in the day).

If the first instruction word is a valid Chip register address then the copper just loads the second instruction word value directly into that valid register address. That’s your 2 cycles, it really is that simple.

If the first instruction word is an odd value (all Chip registers are even), then the second instruction word is used to establish if the copper is going to wait or skip at a certain beam position (the first instruction word then holds the beam position, and the rest of the second word is the compare mask for it).

-Edit- Also remember that the copper only has bus access on odd cycles, so each move operation takes at least 4 cycles!
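
The MOVE versus WAIT/SKIP distinction can be sketched as a small decoder. This is a minimal sketch going by the instruction layout in the HRM as I read it (first word: register offset for MOVE, or beam position with bit 0 set; second word: data for MOVE, or compare-enable mask with bit 0 choosing WAIT or SKIP). The function name `decode_copper` and the tuple layout are my own invention, and the blitter-finished-disable bit is ignored.

```python
def decode_copper(ir1, ir2):
    """Classify one Copper instruction from its two 16-bit words."""
    if ir1 & 1 == 0:
        # MOVE: bits 8..1 of the first word select the custom register,
        # the second word is the value written to it.
        return ("MOVE", ir1 & 0x01FE, ir2)

    # WAIT or SKIP: the first word is the beam position to compare against.
    vp = (ir1 >> 8) & 0xFF   # vertical beam position
    hp = ir1 & 0xFE          # horizontal beam position (bit 0 is the flag)
    ve = (ir2 >> 8) & 0x7F   # vertical compare-enable mask
    he = ir2 & 0xFE          # horizontal compare-enable mask
    kind = "SKIP" if ir2 & 1 else "WAIT"
    return (kind, vp, hp, ve, he)


print(decode_copper(0x0180, 0x0F00))  # a MOVE to offset $180
print(decode_copper(0xFFFF, 0xFFFE))  # the usual end-of-list WAIT
print(decode_copper(0x6401, 0xFF01))  # a SKIP on line $64
```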

remz · 28 May 2022, 03:52

Quote:

Originally Posted by bloodline

It’s just occurred to me that you might not have written a copper list before (I used to struggle with them back in the day).

If the first instruction word is a valid Chip register address then the copper just loads the second instruction word into that valid address. That’s your 2 cycles, it really is that simple.

If the first instruction word is an odd value (all Chip registers are even), then the second instruction word is used to establish if the copper is going to wait or skip at a certain beam position (the first instruction word then holds the beam position, and the rest of the second word is the compare mask for it).

-Edit- Also remember that the copper only has bus access on odd cycles, so each move operation takes at least 4 cycles!

Thank you Bloodline for the detailed explanation. I did dabble a little bit in a few copperlists. The thing that I wanted to highlight and understand is the fact that a general CPU like the 68000 *cannot* read and write in one single DMA cycle, like the Copper 'appears' to do.
As a comparison, from what I understood so far, to move a 16-bit word to a 16-bit custom register, a CPU running off Chip RAM needs to:
- Read Address from Chip Ram (1 dma cycle)
- Read Value from Chip Ram (1 dma cycle)
- Write Value to Custom Register (1 dma cycle)
3 cycles required, whereas Copper does that in 2 cycles.

Please correct me if I am wrong, but so far I deduced the following facts:
Assuming stock A500 with 512KB Chip Ram, running in NTSC at 7159090 clock/sec. At 59.94Hz, this yields 455 clock cycle per horizontal scanline.
DMA bus runs at half that rate, so 227.5 dma-cycle per scanline, which is reduced to 226 in practice.

Copper only ever uses Odd cycles, and takes 8 clock cycles to perform a 16-bit move. Maximum throughput is thus: 1.7MB/sec

Blitter can use any cycles (even or odd), and is extremely efficient when copying: It can theoretically saturate the bus and copy 3.5MB/sec (using A/D channels).
Blitter is however not as efficient when filling/clearing: still 3.5MB/sec, so it wastes half the DMA bus doing nothing.

CPU, like blitter, can also use any dma cycles (odd or even), however is not as fast for copy. I think with movem.w, it can reach slightly less than half the speed of the blitter with approx 1.4MB/sec. *(I think that movem.l would not be significantly faster because the chip bus width is 16-bit?).
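
For anyone wanting to check the arithmetic behind those figures, here is a quick sketch. It only uses the numbers quoted above (7159090 Hz, 455 clocks per line, 8 clocks per Copper move) plus my assumption that a blitter copy spends one DMA cycle reading and one writing each word. The CPU figure is left out because it depends on instruction timings I have not verified.

```python
CLOCK_HZ = 7_159_090       # NTSC clock quoted in the post
CLOCKS_PER_LINE = 455      # clock cycles per scanline, from the post
BYTES_PER_WORD = 2         # the chip bus is 16 bits wide

line_rate_hz = CLOCK_HZ / CLOCKS_PER_LINE        # scanlines per second
dma_cycles_per_line = CLOCKS_PER_LINE / 2        # one DMA cycle = 2 clocks
dma_cycles_per_sec = CLOCK_HZ / 2

# Copper: one 16-bit move every 8 clocks.
copper_bytes_per_sec = CLOCK_HZ / 8 * BYTES_PER_WORD

# Blitter copy (assumed A -> D): two DMA cycles per word copied.
blitter_copy_bytes_per_sec = dma_cycles_per_sec / 2 * BYTES_PER_WORD

print(f"line rate      : {line_rate_hz:.1f} Hz")
print(f"DMA cycles/line: {dma_cycles_per_line}")
print(f"Copper moves   : {copper_bytes_per_sec / 1e6:.2f} MB/s")
print(f"Blitter copy   : {blitter_copy_bytes_per_sec / 1e6:.2f} MB/s")
```

Whether you read "MB" as 10^6 or 2^20 bytes shifts the results a little, which probably accounts for the rounded 1.7 and 3.5 figures.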

One thing that I don't understand is that if Copper only runs on Odd cycles, does it mean that during a 4-bitplane lo-res screen where Display takes 80 odd cycles, Copper cannot do anything?

Also another question I didn't find a clear answer in the HRM:
If Display runs in 1-bitplane mode, does Display DMA only take 20 odd cycles?

Ah if only Commodore could have decided to add even just 64KB of fast ram on all Amigas, it would have made a world of difference. Or even more versatile: how about making the chip ram/fast ram frontier software-programmable? A game could decide for example to assign 256KB for chip ram, and 256KB for fast ram. Even the CD32 would have been almost twice as fast with that kind of flexibility.

bloodline · 28 May 2022, 10:42

Quote:

Originally Posted by remz

Thank you Bloodline for the detailed explanation. I did dabble a little bit in a few copperlists. The thing that I wanted to highlight and understand is the fact that a general CPU like the 68000 *cannot* read and write in one single DMA cycle, like the Copper 'appears' to do.
As a comparison, from what I understood so far, to move a 16-bit word to a 16-bit custom register, a CPU running off Chip RAM needs to:
- Read Address from Chip Ram (1 dma cycle)
- Read Value from Chip Ram (1 dma cycle)
- Write Value to Custom Register (1 dma cycle)
3 cycles required, whereas Copper does that in 2 cycles.

At step two the CPU is loading a value from memory into a Dx register... at step two the Copper is loading a value from memory into a Custom chip register.

To follow your model, the Custom Chip registers are the Copper's registers.
The copper cannot write to RAM; if it needs to do so, it has to go through the blitter by setting the blitter registers.

Quote:

Please correct me if I am wrong, but so far I deduced the following facts:
Assuming stock A500 with 512KB Chip Ram, running in NTSC at 7159090 clock/sec. At 59.94Hz, this yields 455 clock cycle per horizontal scanline.
DMA bus runs at half that rate, so 227.5 dma-cycle per scanline, which is reduced to 226 in practice.

Copper only ever uses Odd cycles, and takes 8 clock cycles to perform a 16-bit move. Maximum throughput is thus: 1.7MB/sec

Blitter can use any cycles (even or odd), and is extremely efficient when copying: It can theoretically saturate the bus and copy 3.5MB/sec (using A/D channels).
Blitter is however not as efficient when filling/clearing: still 3.5MB/sec, so it wastes half the DMA bus doing nothing.

CPU, like blitter, can also use any dma cycles (odd or even), however is not as fast for copy. I think with movem.w, it can reach slightly less than half the speed of the blitter with approx 1.4MB/sec. *(I think that movem.l would not be significantly faster because the chip bus width is 16-bit?).

One thing that I don't understand is that if Copper only runs on Odd cycles, does it mean that during a 4-bitplane lo-res screen where Display takes 80 odd cycles, Copper cannot do anything?

No, there are 8 DMA slots per 16 lores pixels. These 8 DMA slots constitute a bitplane fetch cycle.
1bit graphics only uses one of those slots per fetch cycle, which leaves 7 slots free per fetch cycle.
2bit graphics uses two of those slots per fetch cycle, which leaves 6 slots free per fetch cycle... etc.
The most slots used in lores mode per fetch cycle is 6 slots (EHB and HAM), leaving 2 free.

In hires, there are 8 DMA slots per 32 pixels.
1bit graphics uses 2 DMA slots per fetch cycle.
So 4bit graphics in hires mode does completely saturate the bus for the duration of the bitplane fetch per scanline. Thus very few games use hires mode.
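
The slot counting above boils down to a one-line rule, sketched here. The function name `bitplane_slots` is made up, and the plane limits (6 in lores, 4 in hires) are the OCS/ECS ones, which I am assuming is the chipset under discussion.

```python
def bitplane_slots(planes, hires=False):
    """Return (used, free) DMA slots per 8-slot bitplane fetch cycle."""
    max_planes = 4 if hires else 6          # assumed OCS/ECS limits
    if not 0 <= planes <= max_planes:
        raise ValueError(f"planes must be 0..{max_planes}")
    # Lores: one slot per plane per 16 pixels.
    # Hires: two slots per plane per 8 slots, since 8 slots cover 32 pixels.
    used = planes * (2 if hires else 1)
    return used, 8 - used


for planes in range(1, 7):
    print("lores", planes, bitplane_slots(planes))
for planes in range(1, 5):
    print("hires", planes, bitplane_slots(planes, hires=True))
```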

Quote:

Also another question I didn't find a clear answer in the HRM:
If Display runs in 1-bitplane mode, does Display DMA only take 20 odd cycles?

The HRM does have a quite good (but with plenty of errors) diagram of the DMA allocation per scanline.

Quote:

Ah if only Commodore could have decided to add even just 64KB of fast ram on all Amigas, it would have made a world of difference. Or even more versatile: how about making the chip ram/fast ram frontier software-programmable? A game could decide for example to assign 256KB for chip ram, and 256KB for fast ram. Even the CD32 would have been almost twice as fast with that kind of flexibility.

Sure, even a tiny amount of FastRAM would have totally improved the Amiga hardware performance... But having FastRAM would have required a separate DRAM controller, and obviously more motherboard space, clearly the cost was prohibitive for Commodore's management.

As an aside, once I realised that the A1200 didn't use Alice for Chipram DRAM refreshes, but instead had a Budgie chip, it was clear to me that Commodore had totally lost the plot... why not use Alice for Chipram, and Budgie for Fastram? Then the A1200 could have had 1meg Chipram and 1 meg Fastram.

-Edit- I also think the Amiga graphics system should have been feature frozen at ECS, and a new chunky system used for 8bit graphics onwards. But hindsight is always very clear, especially for an "armchair engineer" like me.

remz · 28 May 2022, 13:35

Quote:

Originally Posted by bloodline

At step two the CPU is loading a value from memory into a Dx register... at step two the Copper is loading a value from memory into a Custom chip register.

To follow your model, the Custom Chip registers are the Copper's registers.
The copper cannot write to RAM, if it needs to do so, then it needs to do that with the blitter, by setting the blitter registers.

Oh that is a really good explanation of how to view it! Thank you!

Quote:

Originally Posted by bloodline

No, there are 8 DMA slots per 16 lores pixels. These 8 DMA slots constitute a bitplane fetch cycle.
1bit graphics only uses one of those slots per fetch cycle, which leaves 7 slots free per fetch cycle.

I never grasped that HRM illustration completely. Also their -2, -1, 1, 2,... numbering below confuses me because it lacks zero.
So my question is when using 4 bitplanes lo-res: bitplane dma uses all odd cycles because 4-2-3-1 are interleaved: Does that mean Copper is completely stopped during the 320 visible pixels portion? Since only Blitter and CPU are able to utilize even cycles?

ross · 28 May 2022, 14:37

Quote:

Originally Posted by remz

Does that mean Copper is completely stopped during the 320 visible pixels portion? Since only Blitter and CPU are able to utilize even cycles?

It is the exact opposite: since even cycles are free, the copper can use them all (at least as long as the active bitplanes are <= 4 and lo-res).
The Blitter and the CPU have lower priority, so the Copper gets those cycles whenever it requests them.
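
To put numbers on that, here is a small sketch that counts the even slots left to the Copper in one lores fetch unit. It assumes the lores fetch order usually given for the 8-slot unit (free, 4, 6, 2, free, 3, 5, 1) and that the unit starts on an even cycle. Both are assumptions on my side, and `LORES_SLOT_ORDER` and `copper_slots_per_unit` are made-up names.

```python
# Assumed lores fetch order within one 8-slot unit; 0 marks a slot
# that no bitplane ever uses.
LORES_SLOT_ORDER = [0, 4, 6, 2, 0, 3, 5, 1]


def copper_slots_per_unit(planes):
    """Count even slots in one lores fetch unit not taken by bitplane DMA."""
    free = 0
    for position, plane in enumerate(LORES_SLOT_ORDER):
        taken = plane != 0 and plane <= planes
        if position % 2 == 0 and not taken:
            free += 1
    return free


FETCH_UNITS_PER_LINE = 320 // 16   # 16 lores pixels per fetch unit

for planes in range(1, 7):
    per_unit = copper_slots_per_unit(planes)
    print(planes, "planes:", per_unit, "per unit,",
          per_unit * FETCH_UNITS_PER_LINE, "per 320-pixel line")
```

Under these assumptions planes 1 to 4 sit entirely on odd slots, so the Copper keeps all four even slots per unit, and only planes 5 and 6 start eating into them.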

remz · 28 May 2022, 15:43

Quote:

Originally Posted by ross

It is the exact opposite: since even cycles are free, the copper can use them all (at least as long as the active bitplanes are <= 4 and lo-res).
The Blitter and the CPU have lower priority, so if requested by the Copper they can be used.

I'm confused: Bloodline said: "Also remember that the copper only has bus access on odd cycles, so each move operation takes at least 4 cycles!"

Copper is using odd, or even cycles?

ross · 28 May 2022, 15:57

Quote:

Originally Posted by remz

I'm confused: Bloodline said: "Also remember that the copper only has bus access on odd cycles, so each move operation takes at least 4 cycles!"

Copper is using odd, or even cycles?

If you follow the allocation of DMA cycles in the attached diagram, you can see that the Copper uses even cycles only.
Odd cycles are reserved for the many predefined channels.

But it is only a matter of definition, in fact the unreadable cycles during the refresh ones are even HPOS values,
so you can easily invert them, the final effect does not change.

EDIT:
For example, here is how WinUAE indicates them; as you can see the Copper cycles are even:

Code:

 [00   0]  [01   1]  [02   2]  [03   3]  [04   4]  [05   5]  [06   6]  [07   7] 

                               RFS0 038  COP  1FE  RFS1 1FE  COP  08C  RFS2 1FE

EDIT2: to clarify: this does not mean that you can reverse the function that even or odd cycles have, it simply means that once you have defined 'the even ones' then you have also defined which are the odd ones (which are normally used by generic DMA channels: so refresh, disk, audio, sprites, bitplanes up to 4 planes).
The cycles are realigned at the end of the video line (or at the beginning if you prefer, again it's just a matter of definition), in case the total number per line is odd (like in PAL, or alternating lines in NTSC).

And this 'realignment' allows the Copper cycles to always be even.
I hope I have not made you more confused than before

bloodline · 28 May 2022, 18:24

Quote:

Originally Posted by remz

I'm confused: Bloodline said: "Also remember that the copper only has bus access on odd cycles, so each move operation takes at least 4 cycles!"

Copper is using odd, or even cycles?

Definitely go with whatever Ross and Toni say, they have tested real hardware, where I just go by the HRM which is notoriously incorrect in places. Even cycles would make more sense

remz · 03 June 2022, 04:42

You all are extremely helpful

So if I attempt to recap, please inform me if my statements are correct:
Assuming a stock Amiga 500 with 512KB chip ram, running a lowres screen in 6bpp, NTSC or PAL doesn't matter, with interrupt disabled:
Excluding the short DMAs like disk, audio, ram refresh, for the sake of simplicity:
- During VBlank, even and odd cycles are free:
One possible usage to maximize the DMA usage could be having Copper using Even cycles, while CPU can run at full speed on the Odd cycles.
Another possible usage could be using Blitter, which can run at full speed using all cycles, with an option (blithog) to let CPU run once every 3 DMA: Such setting is the 'best pipelining' achievable since CPU spends half its clock cycles on DMA, and the other half on internal instruction execution: it means that CPU borrowing 1 DMA cycle every 3 cycles will slow down the blitter slightly, but essentially yields more effective 'work per clock'.
- During horizontal blank: Copper can be used on Even cycles at full speed to setup sprites and stuff, and CPU and/or blitter can also use the Odd Cycles to do a bit of work
- During display portion: Display DMA takes all Odd cycles (planes 1 to 4), and borrows half the Even cycles for the planes 5 and 6, leaving 40 cycles free. Copper can use all of those for example to change colors or reposition a few sprites. CPU would essentially be completely idle during this part.
One possible way to make the CPU do parallel work even during a fully saturated chip bus could be doing a few mul or div instructions.

With some careful timing, the CPU could be doing almost 400 div or 1200 mul per frame essentially "for free" while the DMA is completely used by display & copper.

All at the same time, 4 channel audio could be playing with less than 2% DMA performance cost, and maybe even reading of a disk with a 1% DMA performance cost (although I have never written a disk reading routine; it is possible that interrupts would take a much larger toll on the CPU to handle copying buffers, etc.).

Also if I calculate correctly, having sprite DMA active will cost 7% of dma, but just during the visible scanlines which are about 76%~82% of the total time, so real cost of sprites is approximately 5.5% per frame.
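
Here is the rough arithmetic behind those percentages, as a sketch. The per-line slot counts (4 refresh, 3 disk, 4 audio, 16 sprite) are what I take from the HRM time slot diagram, which is not fully reliable, and 227 slots per line is a rounding of the 227.5 figure, so treat every number as approximate. `FIXED_SLOTS` and `share` are made-up names.

```python
SLOTS_PER_LINE = 227   # rounded from 227.5 DMA cycles per scanline

# Assumed fixed DMA slots per scanline, taken from the HRM diagram.
FIXED_SLOTS = {"refresh": 4, "disk": 3, "audio": 4, "sprites": 16}


def share(slots, lines_fraction=1.0):
    """Percentage of all DMA slots used, scaled by the fraction of lines active."""
    return 100.0 * slots / SLOTS_PER_LINE * lines_fraction


audio_pct = share(FIXED_SLOTS["audio"])
disk_pct = share(FIXED_SLOTS["disk"])
sprite_line_pct = share(FIXED_SLOTS["sprites"])
sprite_frame_low = share(FIXED_SLOTS["sprites"], 0.76)
sprite_frame_high = share(FIXED_SLOTS["sprites"], 0.82)

print(f"audio  : {audio_pct:.2f}% of slots")
print(f"disk   : {disk_pct:.2f}% of slots")
print(f"sprites: {sprite_line_pct:.2f}% per active line, "
      f"{sprite_frame_low:.2f}% to {sprite_frame_high:.2f}% per frame")
```

This counts all eight sprite channels fetching on every visible line, so it is an upper bound for the sprite cost.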

bloodline · 08 June 2022, 13:04

Quote:

Originally Posted by remz

You all are extremely helpful

So if I attempt to recap, please inform me if my statements are correct:
Assuming a stock Amiga 500 with 512KB chip ram, running a lowres screen in 6bpp, NTSC or PAL doesn't matter, with interrupt disabled:
Excluding the short DMAs like disk, audio, ram refresh, for the sake of simplicity:
- During VBlank, even and odd cycles are free:
One possible usage to maximize the DMA usage could be having Copper using Even cycles, while CPU can run at full speed on the Odd cycles.

The Copper doing what though? All it can do is load chip registers. Which is useful for setting up the display in the VBlank, as you say without interfering with the CPU. But other than that the Copper doesn't have much use during a VBlank.

Quote:

Another possible usage could be using Blitter, which can run at full speed using all cycles, with an option (blithog) to let CPU run once every 3 DMA: Such setting is the 'best pipelining' achievable since CPU spends half its clock cycles on DMA, and the other half on internal instruction execution: it means that CPU borrowing 1 DMA cycle every 3 cycles will slow down the blitter slightly, but essentially yields more effective 'work per clock'.

So this is a really useful feature of the odd/even memory interleaving. Both the CPU and Blitter can perform operations at the same time, but with interrupts disabled the Blitter can only do a single operation... Maybe you could time the Blitter and use the Copper to set up the blitter at specific scanlines and perform a regular set of copies (not sure how useful that would be).

Quote:

- During horizontal blank: Copper can be used on Even cycles at full speed to setup sprites and stuff, and CPU and/or blitter can also use the Odd Cycles to do a bit of work

With no active disk or audio DMA, I guess the copper could do some colour palette/sprite set up, but you don't have many DMA slots before the sprites need their fetch slots. This is all very shaky.

Quote:

- During display portion: Display DMA takes all Odd cycles (planes 1 to 4), and borrows half the Even cycles for the planes 5 and 6, leaving 40 cycles free. Copper can use all of those for example to change colors or reposition a few sprites. CPU would essentially be completely idle during this part.

I always think of a display fetch as a single 8-cycle operation. 6bitplane Lowres uses 6 of those cycles, leaving 2 cycles free for CPU or Blitter (very useful if you have double buffered your display), but the Copper won't be much use here, the free cycles are not available to the Copper if I've understood the HRM diagram correctly.

Quote:

One possible way to make the CPU do parallel work even during a fully saturated chip bus could be doing a few mul or div instructions.

With some careful timing, the CPU could be doing almost 400 div or 1200 mul per frame essentially "for free" while the DMA is completely used by display & copper.

I always was (and still am) a terrible 68000 coder, I preferred to treat the CPU like a load/store machine so I didn't really have to worry too much about addressing modes... by using Address registers for temporary data register storage, I would try and make as much code as possible operate in the registers... so it can be done, but your code must be extremely deterministic to the point that the returns are so diminished, it's not worth the effort. Also what about interrupts?

Quote:

All at the same time, 4 channel audio could be playing with less than 2% DMA performance cost, and maybe even reading of a disk with a 1% DMA performance cost (although I have never written a disk reading routine; it is possible that interrupts would take a much larger toll on the CPU to handle copying buffers, etc.).

Also if I calculate correctly, having sprite DMA active will cost 7% of dma, but just during the visible scanlines which are about 76%~82% of the total time, so real cost of sprites is approximately 5.5% per frame.

Depends how tall/multiplexed your sprites are...
