English Amiga Board
Old 30 August 2022, 01:04   #441
nonarkitten
Registered User
 
 
Join Date: Jun 2018
Location: Calgary/Canada
Posts: 247
Quote:
Originally Posted by Gorf View Post
Exactly ... well it probably could even do some Gouraud shading, which would be useful in 3D.



Well in that case we have to drop the DSP56K for the AT&T DSP3210, which was at least in an A3000 prototype ...

https://amitopia.com/updated-dsp-321...a-3000-is-out/
Not at all.
http://amiga.resource.cx/exp/delfina

And looking beyond Amiga.
https://en.wikipedia.org/wiki/Atari_Falcon

The NeXT also had it.
https://en.wikipedia.org/wiki/NeXTcube

The DSP3210 is fine, of course, if you're okay with state of the art in 1992. I know Dave Haynie loved it and I know there's prototypes and some libraries for it, but there are a few* things I disagree with him on and this is one of them.

- It really was never anything more than Dave's dream. The DSP56K at least made it into some real products for the Amiga. No Amiga was ever sold with the DSP3210 in it and no expansion exists with it.
- It needs more RAM throughput and storage for the same amount of data.
- It leans on the 68040 to convert to and from integer formats that can actually be used to, e.g., draw actual pixels on screen.
- It's an architectural dead-end, there was nothing after it.
- It wasn't as fast as AT&T claimed, Radius PhotoEngine (quad 66MHz DSP3210) on a Quadra only gave you a 2-4 times increase in Photoshop (over a 33MHz 68040!)
- They were stupidly expensive; Radius PhotoEngine retailed for $1,099 in 1994 when a brand-new Power Mac 6100 was about $1,750 and made EVERYTHING two-to-four times as fast.

They maxed out around 66MHz. Modern DSP56Ks clock at 250MHz and are dual-core. If this is about just resurrecting the best 1994 had to offer, then fine. If this is about making something amazing, now, then the DSP3210 is laughable and even the DSP56K is reaching its end-of-life. But if you want SOME compatibility, it's the only thing remotely modern.

* The other was the pointlessly complex AAA chipset when the industry had already moved to simpler chunky bitmap graphics.

Last edited by nonarkitten; 30 August 2022 at 01:09.
Old 30 August 2022, 01:11   #442
nonarkitten
Registered User
 
 
Join Date: Jun 2018
Location: Calgary/Canada
Posts: 247
Like, do the math -- a 100MHz 68060 beats a quad 66MHz DSP3210 setup. It's so "meh" it transitions from uninteresting to being the literal embodiment of an anti-pattern.
Old 30 August 2022, 01:54   #443
Gorf
Registered User
 
 
Join Date: May 2017
Location: Munich/Bavaria
Posts: 2,361
Quote:
Originally Posted by nonarkitten View Post
Not at all.
http://amiga.resource.cx/exp/delfina
The Delfina would have terrible bus performance if it were to write data back to (chip-)RAM.

Quote:
And looking beyond Amiga.
...
I know where the DSP56K was used and never doubted its use in other systems...

Quote:
The DSP3210 is fine, of course, if you're okay with state of the art in 1992.
It's not as if the DSP56K hadn't been available since 1990...
Of course we are talking about vintage implementations in both cases here.

Quote:
I know Dave Haynie loved it and I know there's prototypes and some libraries for it, but there are a few* things I disagree with him on and this is one of them.

- It really was never anything more than Dave's dream. The DSP56K at least made it into some real products for the Amiga. No Amiga was ever sold with the DSP3210 in it and no expansion exists with it.
Not true:
This combination was sold as a working and certified ultrasound unit to many medical doctors and hospitals
(the ATL HDI 1000 ultrasound machine).

Quote:
- It needs more RAM throughput and storage for the same amount of data.
On the other hand it has a (for the time) large 8K internal buffer/RAM, which only some models of the DSP56K have...

Quote:
- It leans on the 68040 to convert to-and-from integer formats the can actually be used to, e.g., draw actual pixels on screen.
Well, here we are back to Blitter-integration ... something that needs specific writes to ChipRAM is likely something that needs to be displayed and therefore should be part of the GFX-pipeline...

Quote:
- It's an architectural dead-end, there was nothing after it.
hmm - ok... aren't we talking about more or less abandoned hardware all the time?
Why else would we discuss how to reimplement them today?
AT&T's history is quite complex - the fact that they stopped developing DSPs is not really a sign of anything regarding the merits of that specific design, is it?

Quote:
- It wasn't as fast as AT&T claimed, Radius PhotoEngine (quad 66MHz DSP3210) on a Quadra only gave you a 2-4 times increase in Photoshop (over a 33MHz 68040!)
Come on ... I am sure you know you cannot take one specific and potentially mediocre software implementation to determine the speed of a particular piece of hardware.

Quote:
- They were stupidly expensive; Radius PhotoEngine retailed for $1,099 in 1994 when a brand-new Power Mac 6100 was about $1,750 and made EVERYTHING two-to-four times as fast.
Which is completely irrelevant here and now, isn't it?

Quote:
They maxed out around 66MHz.
Again: what does this have to do with a potential modern (FPGA?) reimplementation?

Quote:
Modern DSP56K's clock at 250MHz and are dual core.
So? What is stopping someone from implementing multiple 3210s if it comes to that?

Quote:
If this is about just resurrecting the best 1994 had to offer, then fine.
Well, as the subject of this thread is a reimplementation of the latest 68K processor in conjunction with the Amiga chipset and some compatible and sensible upgrades to this standard: yes, we are indeed talking about "resurrecting" some 1994 designs. At least this is the common point from which we can look forward, isn't it?

Quote:
If this is about making something amazing, now, then the DSP3210 is laughable and even the DSP56K is reaching it's end-of-life. But if you want SOME compatibility, it's the only thing remotely modern.
Both are of course outdated. The question here is not how some actual old chip performs now, but what would make sense in terms of reimplementing/optimizing it within an FPGA, also taking into account what is supported by existing libraries...


Quote:
* The other was the pointlessly complex AAA chipset when the industry had already moved to simpler chunky bitmap graphics.
AAA would have had chunky modes.

Last edited by Gorf; 30 August 2022 at 02:29.
Old 30 August 2022, 02:23   #444
Gorf
Registered User
 
 
Join Date: May 2017
Location: Munich/Bavaria
Posts: 2,361
Quote:
Originally Posted by nonarkitten View Post
Like do the math -- a 100MHz 68060 beats a quad 66MHz DSP3210 set up. It's so "meh" it transitions from uninteresting to being the literal embodiment of an anti-pattern.
And the very same is true for the DSP56001 @ 66MHz.
This comparison of old chips to even older chips doesn't tell us very much about how a modern reimplementation of one or the other concept/ISA would behave within something like the Vampire or any other FPGA (and potentially ASIC) implementation.

Last edited by Gorf; 30 August 2022 at 02:42.
Old 30 August 2022, 06:53   #445
Promilus
Registered User
 
Join Date: Sep 2013
Location: Poland
Posts: 847
@Gorf - the point is - and it should always be considered - what advantage does a DSP3210 implemented in an FPGA have over a "hard processor" in silicon? You can hook it up directly to the common, large, DMA-enabled chip-RAM-like local memory shared by SAGA and the AC68080, but that's the only real advantage. It won't get anywhere near a real 56K running at 250MHz. And it needs both a larger and more expensive FPGA (which is a first "NO") and additional coding effort to implement it and keep it in line with the rest of the "virtual chipset" (which is a second "NO"). Had the ColdFire V4 been more compatible with the 68060, there would have been no need to make the AC68080 in the first place. The same applies to a DSP... although it's not as if the DSP56K code base for Amiga users is large, or as if introducing it would have much impact on the Amiga world (the same goes for AMMX at the moment). And to use DSP features (either from a hard, external DSP or a softcore inside the FPGA) you'll have to make an extra effort, since it's a different architecture from the main processor and has its own set of development tools. In this respect I must say that AMMX is the straightforward approach: it gives some performance benefit while allowing one set of coding tools with an updated compiler. If there were any effort to make a heterogeneous architecture, there are plenty of other choices out there with even greater performance. And since the potential code base for either the 3210 or the 56K isn't big enough to make a difference in the Amiga world, we wouldn't lose much by dumping both of those solutions anyway.

As for the blitter: the blitter and copper are co-processors with very limited programming capabilities, because anything more would have been too expensive at the time. One way of making them better is making them faster; another is expanding their bandwidth and the range of memory they can access. Both are done in SAGA, AFAIK. There's also the option of adding a fully programmable unit close by. The Apollo card doesn't really need that, since the AC68080 is as close as it can be - by design. Since I am a fan of the on-board chipset (the genuine Commodore chipset), I'd rather see a solution which lets the original chipset and CPU coexist, and perhaps adds a third coprocessor working on chip RAM in between CPU and chipset cycles. That would most likely require dropping support for on-board chip RAM and moving it to e.g. fast SRAM or PSRAM under FPGA control, but that might introduce new effects with relatively simple and inexpensive hardware. Just think about, e.g., a RISC-V softcore moving around a few dozen software sprites while the 68k just handles regular stuff within its DMA time slot limits, and so does Agnus/Alice.

BTW, what is the timetable for getting an ASIC?
Old 30 August 2022, 10:10   #446
grond
Registered User
 
Join Date: Jun 2015
Location: Germany
Posts: 1,920
Quote:
Originally Posted by nonarkitten View Post
First of all, I'm comparing bitfields on 68080 versus AMMX on 68080.
So you are arguing against Gunnar's design using the strength of Gunnar's design as an argument? And using one use case for reference where one strength of the design was particularly strong in comparison to another feature of the design? Um, ok.


Quote:
Comparing AMMX on 68080 versus bitfields on a 68020 would be stupid, agreed?
If code execution on lesser processors doesn't matter, there is no argument against AMMX anymore.


Quote:
Bitfield instructions were pipelined from the 040 onward though and only take one cycle from then-on.
This is wrong. Check the Motorola 68040 manual, page 10-15. Bitfield instructions take at least 3 cycles when working on a register and up to 18 cycles when working on memory. For all I know bitfields are also slow on the 060. Hence, you can make a choice between using bitfield instructions, which are too slow on any processor preceding the 080, and new instructions that only exist on the 080. Obviously, if the program can run well enough on lower processors, it is advisable to use the backward-compatible approach, but if it doesn't, well, what's there to lose...


Quote:
This is the code for one pixel. It interleaves the pixel fetch to ensure there's no pipeline stalls, so dx and dy are swapped on odd/even pixels. This is what the compiler outputs:
Code:
        bfextu d0{#12:#4},d2
        tst.b d3
        jeq .nextPixel
        move.w (a0,d3.l*2),(6,a1,d1.l)
.nextPixel
You can see the whole function here: http://franke.ms/cex/z/86T7rx
I don't know much about the problem you are tackling with this code, but the code doesn't look optimal to me. At least GCC manages to avoid all those useless moves and other instructions we used to see in compiler-generated code back in the day; the code is pretty dense. It may even be the best the compiler can generate with the code you are giving it.

Why do you skip zero pixels instead of writing them? Is this to preserve background information? If not, don't clear the buffer and write out zeros avoiding all the branching as this should be faster.

In any case I would treat two pixels at once using a 1024 byte table instead of a 64 byte table and thereby avoid the bitfields altogether by working on byte indices. Or is this loop run only very few times and needs a new table set up each time it gets called? In that case it is no wonder that using AMMX instructions for the scatter operation isn't much faster. Instead of bitfields I would expect masks and AND-instructions working on a register full of pixel data to be faster.
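
A minimal C sketch of the byte-indexed pair-table idea described above (all names and the palette layout are hypothetical, not GNGEO's): a 256-entry table, rebuilt whenever the palette changes, maps one byte of sprite data - i.e. two 4-bit pixels - to two 16-bit colour words in a single lookup.
Code:
#include <stdint.h>

/* 256 entries x 4 bytes = the 1024-byte table mentioned above,
 * rebuilt whenever the 16-entry palette changes. */
static uint32_t pair_table[256];

static void build_pair_table(const uint16_t clut[16])
{
    for (int i = 0; i < 256; i++) {
        uint16_t first  = clut[i >> 4];  /* high nibble = first pixel (assumed order) */
        uint16_t second = clut[i & 15];  /* low nibble = second pixel */
        pair_table[i] = ((uint32_t)first << 16) | second;
    }
}

/* One 8-pixel slice: four byte reads, four lookups, four pixel-pair writes.
 * Transparency handling is left out to keep the sketch short. */
static void draw8_pairs(const uint8_t *src, uint16_t *dst)
{
    for (int i = 0; i < 4; i++) {
        uint32_t pair = pair_table[src[i]];
        dst[2 * i]     = (uint16_t)(pair >> 16);
        dst[2 * i + 1] = (uint16_t)pair;
    }
}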
Old 30 August 2022, 10:18   #447
grond
Registered User
 
Join Date: Jun 2015
Location: Germany
Posts: 1,920
Quote:
Originally Posted by nonarkitten View Post
2. The DSP56K is way more powerful than AMMX. AMMX robs the 68080 of performance (i.e., you have to hope that the slot for AMMX is more efficient than the slot for a 68K instruction), the DSP56k only adds to it. So if the DSP56K adds 16MIPS, that's 16MIPS more than the system had before; there's no "stealing" a pipeline slot from the 68K.
But there is stealing of memory bandwidth. If a single processor can saturate the memory bandwidth (as the 080 can), there is nothing to gain if a second processor works on the same memory in parallel.


Quote:
And a modern DSP56K runs up to 250MHz with dual cores for a supplemental 500 MIPS -- more than three times that of the Apollo core alone.
But it doesn't run any Amiga code at all. And it would have to be a resource available to exactly one task. Nowhere near as flexible as a processor extension.


Quote:
And sure, an AC68080 can brute force better than the Atari Quake demo, but this was a 68030 and 68882 at its base, the DSP56K was doing all the heavy lifting here.
I'm not impressed. I'd rather have standard code run faster than have to code something to take advantage of a very specific coprocessor.
Old 30 August 2022, 12:06   #448
Thomas Richter
Registered User
 
Join Date: Jan 2019
Location: Germany
Posts: 3,249
Quote:
Originally Posted by nonarkitten View Post
First of all, I'm comparing bitfields on 68080 versus AMMX on 68080. Comparing AMMX on 68080 versus bitfields on a 68020 would be stupid, agreed? Bitfield instructions were pipelined from the 040 onward though and only take one cycle from then-on. So it would be fair to compare this against the 040 or 060 as well.
Unfortunately, that is not the case for the 68060. On this microprocessor, bitfield instructions are among the rare instructions that are microcoded. Even worse, the 68060 implements bitfields by reading the memory byte-wise. Now, if you attempt to emulate the blitter with bitfields on the 68060 and want to read from chip memory with bitfields, the CPU has to wait, for each byte(!) included in the bitfield, for a chip RAM cycle to become available.

Needless to say, this is terribly slow. Not by the microcode, but because the 68060 breaks up the instructions into multiple reads.

I know because I'm just through a couple of optimizations of the latest P96 release where the blitter emulation changed for the 68060 for exactly this reason. You are better off reading the data manually and shifting it in place rather than using the bitfields.

For the 68030, the situation is interestingly just the reverse. Compared with the rest of the CPU, the bitfield instructions are fast. They surely require multiple cycles, but so do many other instructions, and they operate with a single bus cycle if possible, not with multiple cycles.
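
A rough C sketch of that "read the data manually and shift it in place" approach (a hypothetical helper, not the actual P96 code): the longwords are fetched whole and the field is carved out with shifts and masks, so chip memory sees one or two aligned reads instead of a byte-wise access per byte of the field.
Code:
#include <stdint.h>

/* Extract a width-bit field (1..32) starting 'bitoff' bits into a
 * big-endian bit stream - roughly what BFEXTU does, but with at most
 * two aligned 32-bit reads instead of byte-wise accesses. */
static uint32_t extract_field(const uint32_t *base, uint32_t bitoff, unsigned width)
{
    const uint32_t *p = base + (bitoff >> 5);   /* longword holding the first bit */
    unsigned shift = bitoff & 31;               /* bit offset inside that longword */

    uint64_t window = (uint64_t)p[0] << 32;     /* data assumed in 68k (big-endian) bit order */
    if (shift + width > 32)
        window |= p[1];                         /* field straddles a longword boundary */

    window <<= shift;                           /* discard the leading bits */
    return (uint32_t)(window >> (64 - width));  /* keep the top 'width' bits */
}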
Old 30 August 2022, 18:43   #449
nonarkitten
Registered User
 
 
Join Date: Jun 2018
Location: Calgary/Canada
Posts: 247
Quote:
Originally Posted by grond View Post
So you are arguing against Gunnar's design using the strength of Gunnar's design as an argument? And using one usecase for reference where one strength of the design was particularly strong in comparison to another feature of the design? Um, ok.
I'm only pointing out the uselessness of AMMX, not whether the Vampire as a whole is terrible or not.

Quote:
Originally Posted by grond View Post
If code execution on lesser processors doesn't matter, there is no argument against AMMX anymore.
Vampire is on-par with an overclocked 68060.

Quote:
Originally Posted by grond View Post
This is wrong. Check the Motorola 68040 manual, page 10-15. <snip>
That's not at all how you read this table, and it's an oversimplification for a pipelined processor. In our use case this is 1 fetch cycle and four execute cycles. In our pipeline this may or may not produce a stall depending on surrounding load/store/EA times; I didn't look that far into it. Same thing with the 060, but now we have two pipelines to consider. GCC knows all this and produces the optimal code accordingly, and that's all that snippet was.

Quote:
Originally Posted by grond View Post
I don't know much about the problem you are treating with this code but the code doesn't look optimal to me. At least GCC manages to avoid all those useless moves and other instructions we used to see in compiler generated code back in the day, the code is pretty dense. It may even be the best the compiler can generate with the code you are giving it.
What useless moves? If this doesn't look optimal to you, you're high on something. If you can provide something better, then great. And this is exactly what GCC produces, if you took five seconds to check out the link. This is literally the code Bebbo's GCC makes.

Quote:
Originally Posted by grond View Post
Why do you skip zero pixels instead of writing them?
Those are transparent pixels. We're rendering sprites. Keep up.

Quote:
Originally Posted by grond View Post
Is this to preserve background information? If not, don't clear the buffer and write out zeros avoiding all the branching as this should be faster.
Branching on both the 68060 and 68080 is basically free in this use case.

Quote:
Originally Posted by grond View Post
In any case I would treat two pixels at once using a 1024 byte table instead of a 64 byte table and thereby avoid the bitfields altogether by working on byte indeces.
This makes no sense. I don't get to change how the NEOGEO defined its colour palettes or sprite data. If you actually looked at the link to franke.ms, you can see I'm reading eight sprite pixels at a time (4 bits per pixel) and then pulling those bits out using bitfields. I could replace the BFEXTU with MOVE dx,dy; AND #15,dy; LSR.L #4,dx -- you think these three operations are faster than one BFEXTU? Prove it. Don't speculate. Don't post links. Prove me wrong.

Coalescing two 16-bit writes into a single 32-bit write MIGHT speed things up a tiny bit, but with caching probably not.

Quote:
Originally Posted by grond View Post
Or is this loop run only very few times and needs a new table set up each time it gets called? In this case it is no miracle that using AMMX instructions for the scatter operation isn't much faster. Instead of bitfields I would expect masks and AND-instructions working on a register full of pixel data to be faster.
On the NEOGEO every sprite is a 16x16 tile on the sprite layers and 8x8 on the character layers (aka "fixed"). Each one has its own unique palette; there are 256 16-colour palettes (well, 15 colours for most layers except for the bottom-most char plane).

Since on the NEOGEO, everything's a sprite, this code is executed exhaustively for the entire screen. Possibly many times per pixel since there's no "overdraw" testing -- I would guess in the 2-3 times territory.
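
For reference, a stripped-down C sketch of the kind of inner loop being argued about (names and nibble order are illustrative; the actual routine is behind the franke.ms link above): eight 4-bit indices per 32-bit word, a 16-entry CLUT, and index 0 treated as transparent.
Code:
#include <stdint.h>

/* Draw one 8-pixel slice of a tile row: eight 4-bit palette indices
 * packed in 'packed', looked up in a 16-entry CLUT, zero skipped as
 * transparent. */
static void draw8(uint32_t packed, const uint16_t clut[16], uint16_t *dst)
{
    for (int i = 0; i < 8; i++) {
        unsigned idx = (packed >> (28 - 4 * i)) & 0xF;  /* like bfextu d0{#off:#4} */
        if (idx != 0)                                   /* transparent: leave dst alone */
            dst[i] = clut[idx];
    }
}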

I also like how you ignored the actual metrics and are still harping on "tables" and "crap code" to try and prove some point. AMMX was 10% faster on sprite rendering in GNGEO. Only 10% over my so-called "crap code". If you think you can write better 68K code, then that only proves my point further that AMMX is basically rubbish in something it was specifically designed for.
Old 30 August 2022, 18:48   #450
nonarkitten
Registered User
 
 
Join Date: Jun 2018
Location: Calgary/Canada
Posts: 247
Quote:
Originally Posted by grond View Post
But there is stealing memory bandwidth. If a single processor can saturate the memory bandwidth (as the 080 can), there is nothing to gain if a second processors works on the same memory in parallel.
DSPs usually have massive amounts of local SRAM to deal with intermediate work. The last DSP56724 has 112,000 24-bit words (about 336KB on chip).

Quote:
Originally Posted by grond View Post
But it doesn't run any Amiga code at all. And it would have to be a resource available to exactly one task. Nowhere as flexible as a processor extension.
So GPUs suck? Audio codecs suck? These can all be shared, just not at the granularity of opcodes. And no, DSPs are co-processors; of course they don't run Amiga code.

Quote:
Originally Posted by grond View Post
I'm not impressed. I rather have standard code run faster than having to code something to take advantage of a very specific coprocessor.
So you don't like the Amiga then? Interesting. Why are you here then?
Old 30 August 2022, 19:01   #451
Gorf
Registered User
 
 
Join Date: May 2017
Location: Munich/Bavaria
Posts: 2,361
Quote:
Originally Posted by Promilus View Post
@Gorf - the point is - and should be always considered - what advantage has DSP3210 implemented in FPGA over "hard processor" in silicon?
....
Quote:
BTW what is the timetable to get ASIC ?
You gave one answer yourself: once the DSP is part of the FPGA implementation it would also be part of a much faster ASIC.

I really have nothing against the DSP56K or any other DSP per se.
It is just that the DSP3210 was part of the A3000+ and those ultrasound machines, and now even newly rebuilt A3000+ boards exist and are actually running ...

Yes, the DSP56K is on the Delfina, but I could not find any software that makes use of it other than for sound effects directly on this Zorro II card ...

That said: I would have nothing against some DSP features built directly into Paula ...

Quote:
As for blitter - blitter and copper are co-processors with very limited programming capabilities. That's because anything more would've been more expensive at that time.
Well, not any more, when we talk about an FPGA reimplementation ...

Quote:
Just think about e.g. RISC-V softcore moving around few dozens of software sprites while 68k just handles regular stuff within it's dma time slot limits and so does Agnus/Alice.
A RISC-V would probably be overkill for moving things around ... but again, some more complex (DSP-like) features within Agnus would be nice.

And if you think about it, by definition the blitter already is a DSP:
It takes one or more input streams, aka signals, and transforms them into one output stream. In this case the operations on the data are rather simple, but it is nevertheless digital signal processing ...
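
To illustrate the "simple operations on input streams" point, a rough C sketch of a blitter-style minterm combine (per-bit for clarity, not speed; the real blitter does this across the whole word in parallel and adds shifting and masking on top):
Code:
#include <stdint.h>

/* Combine one word from each blitter source A, B, C into D using an
 * 8-bit minterm table 'lf'. Bit n of 'lf' is the output for input
 * combination n = ABC, with A as the most significant of the three. */
static uint16_t blit_word(uint16_t a, uint16_t b, uint16_t c, uint8_t lf)
{
    uint16_t d = 0;
    for (int bit = 0; bit < 16; bit++) {
        unsigned sel = (((a >> bit) & 1) << 2)
                     | (((b >> bit) & 1) << 1)
                     |  ((c >> bit) & 1);
        d |= (uint16_t)((lf >> sel) & 1) << bit;
    }
    return d;
}
For example, lf = 0xCA gives the classic cookie-cut D = (A AND B) OR (NOT A AND C).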
Old 30 August 2022, 19:31   #452
grond
Registered User
 
Join Date: Jun 2015
Location: Germany
Posts: 1,920
Quote:
Originally Posted by nonarkitten View Post
Vampire is on-par with an overclocked 68060.
In some things yes, in others it is superior, in some inferior. Who would've thunk?


Quote:
That's not at all how you read this table and an oversimplification for a pipelined processor. In our use case this is 1 fetch cycle and four execute cycles. In our pipeline this may or may not produce a stall depending on surrounding load/store/ea times, I didn't look that far into it.
Ah, you are sure that my interpretation is wrong (because it contradicts your unsupported statement) but "didn't look that far into it". Well, then that is that.


Quote:
Same thing with the 060 but now we have two pipelines to consider.
Yep. And some 60 cycles for a bitfield instruction as Thomas kindly pointed out.


Quote:
GCC knows all this and produces the optimal code accordingly and that's all that snippet was.
Ah, you are of the "the compiler knows better than the assembly coder" type.


Quote:
What useless moves?
Perhaps you just read again what I wrote?


Quote:
If this doesn't look optimal to you, you're high on something.
Ain't you a peach?



Quote:
If you can provide something better, then great. And this is exactly what GCC produces if you took five seconds to checkout the link. This is literally the code Bebbo's GCC makes.
I actually looked at the code for a few minutes which is why I made a complimentary remark about the quality of GCC. It looks like the C code or rather the algorithm is bad. And I hinted at how it could be made much better.


Quote:
Branching on both the 68060 and 68080 is basically free in this use case.
Nonsense. The pixel data is highly unpredictable which is why branch prediction will fail every so often.


Quote:
This makes no sense. I don't get to change how the NEOGEO defined its colour palettes or sprite data. If you actually looked at the link to franke.ms, you can see I'm reading eight sprite pixels at a time (4 bits per pixel)
Yes. And then you treat it four bits at a time, which is the deficiency I pointed out. Use 8 bits and read two pixels at once from the table. Then write the words.


Quote:
I could replace the BFEXTU with MOVE dx,dy; AND #15,dy; LSR.L #4,dx -- you think these three operations are faster than one BFEXTU?
You could use masks such as f0f0f0f0 and 0f0f0f0f or ff00ff00 and 00ff00ff. With two pipes you can do that fully in parallel. Then your pixel data is organised in bytes which can be handled easily using move.b, shift and swap instructions. And yes, I'm pretty sure these operations are faster on 060.
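
A small C sketch of the mask trick described above (names and pixel order hypothetical): two ANDs and a shift split one 32-bit word of eight 4-bit pixels into two words of byte-sized indices, which can then be handled with plain byte moves.
Code:
#include <stdint.h>

/* Split eight packed 4-bit pixels into byte-sized palette indices using
 * the 0x0F0F0F0F mask. 'even' gets pixels 0,2,4,6 and 'odd' gets
 * pixels 1,3,5,7, one index per byte. */
static void split_nibbles(uint32_t packed, uint32_t *even, uint32_t *odd)
{
    *even = (packed >> 4) & 0x0F0F0F0Fu;  /* high nibble of each byte */
    *odd  =  packed       & 0x0F0F0F0Fu;  /* low nibble of each byte  */
}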


Quote:
Prove it. Don't speculate. Don't post links. Prove me wrong.
Why would I? I already proved your claim about bitfield execution timings wrong (with some help from Thomas).


Quote:
Coalescing two 16-bit writes into a single 32-bit write MIGHT speed things up a tiny bit, but with caching probably not.
I was referring to the table look-ups: larger table, larger index, fewer table lookups. Coder 101.


Quote:
I also like how you ignored the actual metrics and are still harping on "tables" and "crap code" to try and prove some point. AMMX was 10% faster on sprite rendering in GNGEO. Only 10% over my so-called "crap code".
Well, there sure are things that AMMX may not be useful for, such as taking four bits worth of data out of a data word in an arcane data format and making table lookups. There is nothing in AMMX that could replace the table lookups, which are the major part of the execution time in that routine. AMMX is designed to work on specific data formats in a single operation. Your example just proves that there are problems for which SIMD units are not optimal. We have known that at least since the Itanium flop.


Quote:
If you think you can write better 68K code, then that only proves my point further that AMMX is basically rubbish in something it was specifically designed for.
AMMX wasn't designed to make many table lookups easier; it is designed to work on standard data and pixel formats such as 8-bit chunky and 16-bit chunky, certainly not 4-bit-indexed table lookups.

Last edited by grond; 30 August 2022 at 19:37.
Old 30 August 2022, 19:36   #453
grond
Registered User
 
Join Date: Jun 2015
Location: Germany
Posts: 1,920
Quote:
Originally Posted by nonarkitten View Post
So GPU's suck? Audio codecs suck? These can all be shared, just not as granular as opcodes. And no, DSPs are co-processors, of course they don't run Amiga code.
They don't run any Amiga code as there is none that uses DSPs. Somebody may write some code for it but still only one program will be able to use the DSP at a time. So it would have to be either GPU or audio or whatever. Run two programs and you'll get "device or resource busy". We moved beyond single-tasking in 1985.



Quote:
So you don't like the Amiga then? Interesting. Why are you here then?
Old 30 August 2022, 19:48   #454
nonarkitten
Registered User
 
 
Join Date: Jun 2018
Location: Calgary/Canada
Posts: 247
Quote:
Originally Posted by Thomas Richter View Post
Unfortunately, that is not the case for the 68060. On this microprocessor, bitfield instructions are one of the rare instructions that are microcoded. Even worse, the 68060 implements bitfields by reading the memory byte-wise. Now, if you attempt to emulate the blitter by bitfields on the 68060 and want to read with bitfields from chip-memory, the CPU has to wait for each byte(!) included in the bitfield to wait for a chip RAM cycle becoming available.
First, I'm not using bitfields from memory and I was not arguing that AMMX isn't faster than the 68040 or 68060. I'm arguing that AMMX on the 68080 only nets a negligible increase in performance over GCC-generated code. ON THE VAMPIRE.

Now, can you make this code run equally fast on the 68060? Maybe. Probably. But running pure 68K isn't for the benefit of people with 68060s, since GNGEO could not, ever, practically run on one -- it's to be able to debug in UAE. UAE doesn't have AMMX. Probably never will. But compiling and debugging ON the Vampire is painful in comparison.

And all that optimization to make it faster on the 68060 would make it a lot slower on the Vampire. So yeah, on the 68040 and 68060 this is perhaps not the fastest code. It's what GCC gave me, it runs well enough to debug and test on UAE and then run on the Vampire to check performance.

AMMX only gets in the way here.
Old 30 August 2022, 19:53   #455
nonarkitten
Registered User
 
nonarkitten's Avatar
 
Join Date: Jun 2018
Location: Calgary/Canada
Posts: 247
Quote:
Originally Posted by Gorf View Post
<snip>
I never proposed using the DSP56K in-FPGA. It's a tiny chip that was widely available and would outperform anything the Cyclone could do, so why would you waste FPGA space on it?
Old 30 August 2022, 20:18   #456
Gorf
Registered User
 
 
Join Date: May 2017
Location: Munich/Bavaria
Posts: 2,361
Quote:
Originally Posted by nonarkitten View Post
I never proposed using the DSP56K in-FPGA. It's a tiny chip that was widely available and would outperform anything the Cyclone could do, so why would you just waste FPGA space for it.
Since we were talking about a pure FPGA reimplementation in this thread (Vampire), I was assuming this was a given - and it would make sense if this implementation were finally to make the jump to ASIC.

Well OK ... if we are talking about real chips now ... the fastest easily available DSP would probably be some TMS320xxxx @ 1.25 GHz.
Old 30 August 2022, 20:25   #457
grond
Registered User
 
Join Date: Jun 2015
Location: Germany
Posts: 1,920
Quote:
Originally Posted by Gorf View Post
Well OK ... If we are talking about real chips now ... the fastest easy available DSP would probably be some TMS320xxxx @ 1.25 GHz
And we could add a 5 GHz Ryzen processor. We could have an Amiga task allocate the Ryzen as a resource and load Linux or Windows into it. No more need for PCTask...
Old 30 August 2022, 20:31   #458
nonarkitten
Registered User
 
 
Join Date: Jun 2018
Location: Calgary/Canada
Posts: 247
Quote:
Originally Posted by grond View Post
In some things yes, in others it is superior, in some inferior. Who would've thunk?
Your sarcasm isn't helpful.

Quote:
Originally Posted by grond View Post
Ah, you are sure that my interpretation is wrong (because it contradicts your unsupported statement) but "didn't look that far into it". Well, then that is that.
I did look into it pretty far. But the initial argument wasn't about 68040s or 68060s, which couldn't run GNGEO if they wanted to. Yeah, the bitfield takes five execute cycles on non-Vampire hardware. That kind of sucks, but it doesn't matter.

Quote:
Originally Posted by grond View Post
Yep. And some 60 cycles for a bitfield instruction as Thomas kindly pointed out.
60? LOL. It's five. For register-to-register it's five. I don't even get where you read 60 from; the table shows it maxes out at 18 with a complex EA.

Quote:
Originally Posted by grond View Post
Ah, you are of the "the compiler knows better than the assembly coder" type.
You guys sure like making your straw man arguments. I never said that. I could easily write 68K assembly that outperforms this. That wasn't the point, and it would only further prove my argument that ON THE VAMPIRE, AMMX is basically useless.

Quote:
Originally Posted by grond View Post
Perhaps you just read again what I wrote?
The move is necessary to move the sprite data from sprite memory to screen memory. This is not useless. It's required by the engine.

Quote:
Originally Posted by grond View Post
Ain't you a peach?
You can shove your condescending tone right along with your sarcasm.

Quote:
Originally Posted by grond View Post
I actually looked at the code for a few minutes which is why I made a complimentary remark about the quality of GCC. It looks like the C code or rather the algorithm is bad. And I hinted at how it could be made much better.
And yet it's still 90% the speed of hand-optimized AMMX. As shitty as it is.

Quote:
Originally Posted by grond View Post
Nonsense. The pixel data is highly unpredictable which is why branch prediction will fail every so often.
Both the 68060 and 68080 can execute branches in parallel. The 68060 has a separate third pipeline dedicated to branching. On the 68080, there is specifically a condition where, if the branch is skipping the next instruction, it gets folded and doesn't eat any time. That's why it's not split into two moves.

Quote:
Originally Posted by grond View Post
Yes. And then you treat it four bits a time which is the deficiency I pointed out. Use 8 bits and read two pixels at once from the table. Then write the words.
The saving here would bring the 68060 in line with the 68080 at around 2 cycles per pixel. Again, we're talking about hand-written ASM beating C. I'm not arguing that. I'm not defending C. I'm stating that compiled C code is almost as fast as ideal AMMX code.

Quote:
Originally Posted by grond View Post
You could use masks such as f0f0f0f0 and 0f0f0f0f or ff00ff00 and 00ff00ff. With two pipes you can do that fully in parallel. Then your pixel data is organised in bytes which can be handled easily using move.b, shift and swap instructions. And yes, I'm pretty sure these operations are faster on 060.
Not an idiot, already thought of that. You're trading eight inline ANDs for two up-front ANDs. Big whoop.

If I was trying to make GNGEO run on a real, physical 68060, I might care about this level of hyper-optimization. But you people are so far off the reservation at this point.

Quote:
Originally Posted by grond View Post
Why would I? I already proved your claim about bitfield execution timings wrong (with some help from Thomas).
Yes thanks for providing a totally meaningless argument that's completely beside the point. All that's been proven is the BF needs five execute cycles because it's microcoded. The argument that that's faster than the three opcodes to replace it hasn't been proven.

Quote:
Originally Posted by grond View Post
I was referring to the table look-ups: larger table, larger index, less table lookups. Coder 101.
That is literally nonsense. A larger table doesn't make for fewer lookups. The only thing we could do with more registers is load the CLUT entirely into registers once and dereference those. AMMX does this; it's a pretty cool operation. It's why it's a whole 10% faster than stock 68K code.

Quote:
Originally Posted by grond View Post
Well, there sure are things that AMMX may not be useful for such as taking four bits worth of data out of a data word in an arkane data format and making table lookups. There is nothing in AMMX that could replace the table lookups which is the major part of the execution time in that routine. AMMX is designed to work on specific data formats in a single operation. Your example just proves that there are problems for which SIMD units are not optimal. We have known that at least since the Itanium flop.
AMMX isn't perfect SIMD. There are a lot of missing opcodes, but one thing Gunnar did do is focus on the Amiga by giving us AMMX opcodes that are specific to CLUT and C2P operations. And in spite of this, it only provides a small gain over M68K code.

Quote:
Originally Posted by grond View Post
AMMX wasn't designed to make many table lookups easier, it is designed to work on standard data and pixel formats such as 8bit chunky and 16bit chunky, certainly not 4bit-indexed table lookups.
See, here you're completely wrong: AMMX has a specific set of instructions to make 16-bit CLUT lookups really fast. Like damn, you're a fanboi and don't even know what AMMX can do?

For the record, here's the AMMX inner loop. At the time, GCC didn't understand AMMX (not sure if it does yet), so that's all using DC.W with the original instructions in the comments.
Code:
__asm__ volatile ( "\n"
"\tmove.w  0(%0),d0 \n" 
"\tmove.w  2(%0),d1 \n"
"\tmove.w  4(%0),d2 \n" 
"\tmove.w  6(%0),d3 \n" 

// TRANSi takes 8, 4-bit values from source and uses
// words stored in E8 thru E23 to write the dest
// since this needs 128-bit, this uses a register pair
"\tdc.w 	0xfe00,0x1803 \n" // TRANSi-LO D0, E0:E1
"\tdc.w 	0xfe01,0x1a03 \n" // TRANSi-LO D1, E2:E3
"\tdc.w 	0xfe02,0x1c03 \n" // TRANSi-LO D2, E4:E5
"\tdc.w 	0xfe03,0x1e03 \n" // TRANSi-LO D3, E6:E7

// STOREM3 will conditionally store each word
"\tdc.w 	0xfe11,0x9926 \n"        // STOREM3.W E1,E1,(A1)
"\tdc.w 	0xfe29,0xbb26,0x0008 \n" // STOREM3.W E3,E3,(8,A1)
"\tdc.w 	0xfe29,0xdd26,0x0010 \n" // STOREM3.W E5,E5,(16,A1)
"\tdc.w 	0xfe29,0xff26,0x0018 \n" // STOREM3.W E7,E7,(24,A1)

: "+a"(gfxdata),"+a"(tilepos)
:: "d0","d1","d2","d3"
);
Old 30 August 2022, 20:33   #459
nonarkitten
Registered User
 
 
Join Date: Jun 2018
Location: Calgary/Canada
Posts: 247
Quote:
Originally Posted by Gorf View Post
since we were talking about a pure FPGA reimplementation in this thread (Vampire) I was assuming this was a given - and would make sense if this implementation would finally make the jump to ASIC.

Well OK ... If we are talking about real chips now ... the fastest easy available DSP would probably be some TMS320xxxx @ 1.25 GHz
There will never be a Vampire ASIC.

Yes, there are some impressive DSPs from TI. Absolutely zero Amiga legacy though, and zero code that would use them.
Old 30 August 2022, 20:33   #460
Gorf
Registered User
 
 
Join Date: May 2017
Location: Munich/Bavaria
Posts: 2,361
Quote:
Originally Posted by grond View Post
And we could add a 5 GHz Ryzen processor. We could have an Amiga task allocate the Ryzen as a resource and load Linux or Windows into it. No more need for PCTask...
It's not like the A1060-sidecar was not a full PC ...
 

