Next gen Amiga - Page 28

meynaf · 24 May 2018, 12:54

Quote:

Originally Posted by Gorf

We are talking about next-gen here - so targeting a 7MHz 68000 is not necessary. But the performance problems on low end processors show that there is room for improvement.

Exactly, something that runs fine on a low end machine will just fly on a fast one !

Quote:

Originally Posted by Gorf

Can you give me some examples of code that take two or more instructions on 68k, but are just one instruction on RISC?

An example of this is ARM's predication plus barrel shifter. But each is less useful on its own than claimed, and they are almost never used together.
So rather than one-liner examples, which can be biased toward some particular architecture, a whole routine would be better (especially one that puts some pressure on the register file).

Why not a code contest ?
Everyone interested designs his own ISA (or chooses an existing one to defend) and then writes some routine (doing something useful).
We could then finally see who's powerful and who's not.

meynaf · 24 May 2018, 13:05

Quote:

Originally Posted by Gorf

I still do not understand why one would hold multi-gigabytes of instrument-samples in RAM - especially in a highly redundant resolution.
You mentioned some latency related reasons, but that still makes no sense to me at all, since we are in the digital realm:
a higher sampling rate just means more data that needs to be processed. More data representing the same period of time. That means higher demand for RAM, bandwidth and processing power. How can this possibly reduce latency?

Do not look for a technical reason to do this - there is no valid one.
It's always the same old reason why folks do complicated things in place of simple ones.
Something like :
- Hey, I'm managing complex projects handling several gigabytes of data !
In comparison to :
- Huh, I'm doing my work with only a few MB of memory.
You see ? Usual "we've got the biggest balls" stuff.
Unfortunately they do it without realizing, and telling them does not help.

Gorf · 24 May 2018, 13:21

I do not want to defend any ISA, but I will start with the one most familiar around here:

Find the maximum and the minimum of two values - both are already in registers d0 and d1 - result needs to be in the same registers.

Code:

sub.l  d1,d0
subx.l d2,d2
and.l  d0,d2
eor.l  d2,d0
add.l  d1,d0
add.l  d2,d1

max is now in d0, min is in d1
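
The sequence can be checked step by step with a small model. This is a minimal sketch in Python rather than 68k code; the function name `minmax_branchless` and the `MASK` constant are made up for illustration. It assumes the standard 68k behaviour that `sub.l` sets the X flag on an unsigned borrow, which means the values are compared as unsigned 32-bit numbers, and it shows that d2 is used as scratch.

```python
MASK = 0xFFFFFFFF  # model 32-bit data registers


def minmax_branchless(d0, d1):
    """Model of the six-instruction sequence; returns (d0, d1) afterwards."""
    d0 &= MASK
    d1 &= MASK
    x = 1 if d0 < d1 else 0   # X flag: borrow out of sub.l (unsigned d0 < d1)
    d0 = (d0 - d1) & MASK     # sub.l  d1,d0
    d2 = (-x) & MASK          # subx.l d2,d2  -> 0 or 0xFFFFFFFF
    d2 &= d0                  # and.l  d0,d2  -> difference if d0 was smaller, else 0
    d0 ^= d2                  # eor.l  d2,d0  -> 0 if d0 was smaller, else difference
    d0 = (d0 + d1) & MASK     # add.l  d1,d0  -> max
    d1 = (d1 + d2) & MASK     # add.l  d2,d1  -> min
    return d0, d1


print(minmax_branchless(3, 7))
print(minmax_branchless(7, 3))
```

Because the mask comes from the borrow, a value such as 0xFFFFFFFF counts as the largest possible input here, not as -1.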

Thorham · 24 May 2018, 13:34

Faster on 68020 and 68030:

Code:

   cmp.l    d0,d1
   bgt.s    .l1
   exg      d0,d1
.l1
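
For comparison, a minimal Python model of the three-instruction version (the helper names `to_signed32` and `minmax_branch` are hypothetical). It assumes the usual 68k semantics: `cmp.l d0,d1` sets flags from d1 - d0, and `bgt` is a signed test. Under that reading the smaller value ends up in d0 and the larger in d1, and the comparison is signed, so the result convention is not the same as in the six-instruction sequence.

```python
def to_signed32(value):
    """Interpret a 32-bit register value as a signed number."""
    value &= 0xFFFFFFFF
    return value - 0x100000000 if value & 0x80000000 else value


def minmax_branch(d0, d1):
    """Model of cmp.l d0,d1 / bgt.s .l1 / exg d0,d1; returns (d0, d1) afterwards."""
    # cmp.l d0,d1 computes d1 - d0; bgt.s is taken when d1 > d0 (signed)
    if to_signed32(d1) > to_signed32(d0):
        return d0, d1   # branch taken, exg skipped
    return d1, d0       # fall through: exg d0,d1 swaps the registers


print(minmax_branch(3, 7))
print(minmax_branch(7, 3))
```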

Gorf · 24 May 2018, 14:05

shorter but not faster ;-)
(my example always needs 12 cycles, yours 6/10 + jmp)

Dunny · 24 May 2018, 14:05

Quote:

Originally Posted by Gorf

While I am not against more than 2 or 4 GB and can see the benefit of a larger address-space for some (rare) applications, your example does not convince me.

I still do not understand why one would hold multi-gigabytes of instrument-samples in RAM - especially in a highly redundant resolution.
You mentioned some latency related reasons, but that still makes no sense to me at all, since we are in the digital realm:
a higher sampling rate just means more data that needs to be processed. More data representing the same period of time. That means higher demand for RAM, bandwidth and processing power. How can this possibly reduce latency?

Imagine you have a track made of around 20 or 30 sampled instruments. Each one has a sample per note (127 of those), one per velocity per note (so each of those 127 notes has 127 samples) and each of those velocities was recorded with 12 or 16 microphones, each having their own spatial properties. That's a pretty extreme example - most instruments have only 30 velocities and are recorded with only three or four mics - but we have to allow for it.

Now for playback of such a track you could render the whole thing to a WAV file (we allow for this) but it takes minutes to do, and making adjustments involves going back to individual samples, so to make things flow a little easier, we keep the lot in memory where necessary. That means that there may be a slight delay when loading in samples that haven't been used yet, but that's fine for editing.

Where it's absolutely not fine is in a live performance. In that situation we cannot tell ahead of time which samples will be needed and we certainly cannot allow any time at all to pull samples off a disk. We need the whole lot in memory.

Then there's effect mixing, which if done on 44.1 kHz 16-bit sound samples quickly gathers aliasing errors, so to minimise that we use 192 kHz 32-bit float samples.

It all adds up, I'm afraid, and HDDs (even SSDs) are not yet fast enough.
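
To put rough numbers on this, a small Python sketch counts the recordings per instrument from the figures above. The recording length is not given anywhere in the thread, so `ASSUMED_SECONDS` is purely a made-up placeholder, and mono recordings are assumed as well; only the recording counts come from the description.

```python
def recordings(notes, velocities, mics):
    """Number of separate recordings for one sampled instrument."""
    return notes * velocities * mics


def bytes_per_recording(seconds, rate_hz=192_000, bytes_per_sample=4, channels=1):
    """Size of one recording stored as 192 kHz 32-bit float (mono assumed)."""
    return seconds * rate_hz * bytes_per_sample * channels


ASSUMED_SECONDS = 2  # hypothetical length of one recording, not from the thread

extreme = recordings(127, 127, 16)  # every velocity, 16 microphones
typical = recordings(127, 30, 4)    # 30 velocities, four microphones

for name, count in (("extreme", extreme), ("typical", typical)):
    gigabytes = count * bytes_per_recording(ASSUMED_SECONDS) / 1e9
    print(f"{name}: {count} recordings, about {gigabytes:.1f} GB per instrument")
```

The totals scale linearly with the assumed length, so the point is the order of magnitude, not the exact figure.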

meynaf · 24 May 2018, 14:18

Quote:

Originally Posted by Gorf

shorter but not faster ;-)
(my example always needs 12 cycles, yours 6/10 + jmp)

You're both wrong.

They have more or less the same speed on 020/030.
The linear example is always 12.
The branch example is 10/14 depending on the case (2+8 if taken, 2+6+6 if not taken).

Of course this can be just 2 cycles on a CPU that does instruction fusion.

Thorham · 24 May 2018, 14:19

Quote:

Originally Posted by Gorf

shorter but not faster ;-)
(my example always needs 12 cycles, yours 6/10 + jmp)

The cmp is 2 cycles, the exg is 4 cycles, the bgt.s is 4 cycles when it's not taken and 8 cycles when it is. This adds up to 10 cycles in both cases. Note that this is based on 68030 timings, so it may actually not be faster on 68020.
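
The two sets of figures being argued over can be tallied side by side. This Python sketch only adds up the per-instruction cycle counts as claimed in the posts; it takes no position on which set is right, and the dictionary layout and the name `totals` are made up for illustration.

```python
# Per-instruction cycle counts as claimed in the thread (not measured here).
claims = {
    "meynaf": {"cmp": 2, "exg": 6, "bcc_taken": 8, "bcc_not_taken": 6},
    "Thorham": {"cmp": 2, "exg": 4, "bcc_taken": 8, "bcc_not_taken": 4},
}


def totals(timing):
    """Return (cycles if branch taken, cycles if branch not taken)."""
    taken = timing["cmp"] + timing["bcc_taken"]                            # exg is skipped
    not_taken = timing["cmp"] + timing["bcc_not_taken"] + timing["exg"]    # falls through to exg
    return taken, not_taken


for who, timing in claims.items():
    print(who, totals(timing))
```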

meynaf · 24 May 2018, 14:22

As I remember it, exg is 6 and the branch is 6 when not taken and 8 if taken.

meynaf · 24 May 2018, 14:23

Quote:

Originally Posted by Dunny

Imagine you have a track made of around 20 or 30 sampled instruments. Each one has a sample per note (127 of those), one per velocity per note (so each of those 127 notes has 127 samples) and each of those velocities was recorded with 12 or 16 microphones, each having their own spatial properties. That's a pretty extreme example - most instruments have only 30 velocities and are recorded with only three or four mics - but we have to allow for it.

Now for playback of such a track you could render the whole thing to a WAV file (we allow for this) but it takes minutes to do, and making adjustments involves going back to individual samples, so to make things flow a little easier, we keep the lot in memory where necessary. That means that there may be a slight delay when loading in samples that haven't been used yet, but that's fine for editing.

Where it's absolutely not fine is in a live performance. In that situation we cannot tell ahead of time which samples will be needed and we certainly cannot allow any time at all to pull samples off a disk. We need the whole lot in memory.

Then there's effect mixing, which if done on 44.1 kHz 16-bit sound samples quickly gathers aliasing errors, so to minimise that we use 192 kHz 32-bit float samples.

It all adds up, I'm afraid, and HDDs (even SSDs) are not yet fast enough.

Yet this does not require such large amounts of memory, because you could preload just the start of each sample (enough to cover the latency) and load what follows only for the samples that are actually used.
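
As a rough sketch of that idea in Python: the amount to preload per recording is just the storage latency expressed in audio frames. The function name `preload_bytes` and the 50 ms latency in the usage line are assumptions for illustration, not figures from the thread.

```python
def preload_bytes(latency_ms, rate_hz, bytes_per_sample, channels=1):
    """Bytes of audio needed to cover latency_ms of playback for one recording."""
    # Integer ceiling division: round up to a whole number of frames.
    frames = -(-latency_ms * rate_hz // 1000)
    return frames * bytes_per_sample * channels


# Hypothetical 50 ms storage latency, 192 kHz 32-bit float, mono.
print(preload_bytes(50, 192_000, 4))
```

Only this head of each recording stays resident; the remainder would be streamed once a note actually starts, which is the trade-off described above.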

Gorf · 24 May 2018, 14:26

Quote:

Originally Posted by Dunny

Where it's absolutely not fine is in a live performance. In that situation we cannot tell ahead of time which samples will be needed and we certainly cannot allow any time at all to pull samples off a disk. We need the whole lot in memory.

You have a limited number of keyboards with a limited number of keys.
The maximum a single person can handle is probably an arrangement similar to a big pipe organ in a church with all the panels and registers.

So yes, you do know which limited set of samples is needed.
You are not going to evaluate different microphone settings of your samples in a live performance - these are things you choose up front.

Quote:

Then there's effect mixing, which if done on 44.1 kHz 16-bit sound samples quickly gathers aliasing errors, so to minimise that we use 192 kHz 32-bit float samples.

Sure, you need some higher (virtual) sample rate while calculating the mix - but upsampling should be part of your algorithm. Doubling the value range to 32 bit makes sense - doubling the rate to 88.2 kHz might also be useful - anything more is useless, since any aliasing errors that are then still left cannot influence your hearing experience.

But again: you only need to do that within your calculation, but there is no need to store the instruments in this "quality" since it is only intermediate redundant information.

Quote:

It all adds up, I'm afraid, and HDDs (even SSDs) are not yet fast enough.

With your approach they aren't, of course!
First you blow up your data by a factor of >8 without adding information and then you complain about the transfer speed...
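
The factor follows directly from the two formats mentioned. A quick Python check, per channel (the helper name `data_rate` is made up for illustration):

```python
def data_rate(rate_hz, bits_per_sample):
    """Bytes per second for one channel of uncompressed audio."""
    return rate_hz * bits_per_sample // 8


stored = data_rate(192_000, 32)  # 192 kHz, 32-bit float
cd = data_rate(44_100, 16)       # 44.1 kHz, 16-bit
ratio = stored / cd

print(stored, cd, round(ratio, 2))
```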

Thorham · 24 May 2018, 14:27

Quote:

Originally Posted by meynaf

In my memory exg is 6 and branch is 6 when not taken and 8 if taken.

Check the manual. Exg really is 4 and non taken byte branches are 4. Just benched it, and my code always executes in 10 cycles.

meynaf · 24 May 2018, 14:41

Quote:

Originally Posted by Thorham

Check the manual. Exg really is 4 and non taken byte branches are 4. Just benched it, and my code always executes in 10 cycles.

Hmm... well... what can i say...

Anyway, instruction timings depend heavily on the implementation, so we'd better favor small code - simply because it's small everywhere.

Gorf · 24 May 2018, 14:54

Quote:

Originally Posted by Thorham

Check the manual. Exg really is 4 and non taken byte branches are 4. Just benched it, and my code always executes in 10 cycles.

hmm - according to the user manual EXG is 4 but branching at least 6 - more if it misses the cache.
https://www.nxp.com/docs/en/referenc.../MC68030UM.pdf
(page 11-48)

meynaf · 24 May 2018, 15:04

One sure thing is that the code is 6 bytes

Gorf · 24 May 2018, 15:09

ok - now we got the 68k case more than covered.
Next ISA please ;-)

Thorham · 24 May 2018, 15:41

Quote:

Originally Posted by Gorf

hmm - according to the user manual EXG is 4 but branching at least 6 - more if it misses the cache.
https://www.nxp.com/docs/en/referenc.../MC68030UM.pdf
(page 11-48)

I benchmarked my code on a 50 MHz 68030, and it really is 10 cycles. Obviously, the code was executed from the cache completely.

Gorf · 24 May 2018, 16:05

Quote:

Originally Posted by Thorham

I benchmarked my code on a 50 MHz 68030, and it really is 10 cycles. Obviously, the code was executed from the cache completely.

good to know! thank you for evaluating

meynaf · 24 May 2018, 16:17

Quote:

Originally Posted by Gorf

ok - now we got the 68k case more than covered.
Next ISA please ;-)

What ? You ask risc lovers to write actual asm code ? Ahem...

Oh, by the way. This is a rather boring example ; with my own instruction set it would be a single instruction. But who cares.

Gorf · 24 May 2018, 17:13

Quote:

Originally Posted by meynaf

What ? You ask risc lovers to write actual asm code ? Ahem...

Oh, by the way. This is a rather boring example ; with my own instruction set it would be a single instruction. But who cares.

I do.

You mentioned instruction fusing... and maybe your instruction set would be a good intermediate representation:

A sophisticated decoder/translator in FPGA would find that both code snippets do the same thing in the end and can be represented by a single (intermediate) instruction.

The FPGA would take every instruction and identify its group. It can do that in parallel for many instructions. (parallelism)

In the second step it compares every instruction with the one that follows - if it belongs to the right group and such a comparison makes sense. Meanwhile the next group of instructions is passing through step one. (pipelining)

In the third step matching pairs of instructions are fused - there can be more than one fusing step. (Meanwhile another group of instructions enters step one and the former step one instructions go on to the comparison in step two....)

Now we would end up with an architecture-independent and very short intermediate representation of the code.
By traversing a LUT or a tree, each intermediate instruction would be translated into either host-CPU code or sent to some special SIMD unit in the FPGA.

There could be more than one of these decoders/translators, allowing for some kind of "speculative translation" of branches.
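
The three steps can be illustrated in software. This is a minimal sequential sketch in Python, not a model of parallel FPGA hardware: the group table, the `minmax` intermediate instruction, and the function names are all hypothetical. It recognises only the compare / branch / swap pattern from earlier in the thread, and it assumes that nothing else jumps to the label and that the flags are not needed afterwards.

```python
# Step 1 lookup: which group each mnemonic belongs to (hypothetical table).
GROUPS = {"cmp.l": "compare", "bgt.s": "branch", "exg": "swap", "label": "label"}


def classify(insns):
    """Step 1: tag every instruction with its group."""
    return [(GROUPS.get(insn[0], "other"), insn) for insn in insns]


def match_minmax(window):
    """Step 2: check whether neighbouring instructions form the min/max pattern."""
    if [group for group, _ in window] != ["compare", "branch", "swap", "label"]:
        return None
    compare, branch, swap, label = (insn for _, insn in window)
    same_regs = len(compare) == 3 and set(compare[1:]) == set(swap[1:])
    skips_swap = branch[1] == label[1]  # the branch jumps just past the swap
    if same_regs and skips_swap:
        return ("minmax", compare[1], compare[2])  # hypothetical intermediate op
    return None


def fuse(insns):
    """Step 3: replace each matching group with one intermediate instruction."""
    tagged = classify(insns)
    out, pos = [], 0
    while pos < len(tagged):
        fused = match_minmax(tagged[pos:pos + 4])
        if fused:
            out.append(fused)
            pos += 4
        else:
            out.append(tagged[pos][1])
            pos += 1
    return out


program = [
    ("move.l", "d3", "d4"),
    ("cmp.l", "d0", "d1"),
    ("bgt.s", ".l1"),
    ("exg", "d0", "d1"),
    ("label", ".l1"),
]
print(fuse(program))
```

Recognising the six-instruction branchless sequence as the same operation would need a further pattern, and the translator would also have to account for its different result order and unsigned comparison.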
