Enhanced 68k ISA - Page 6

Mrs Beanbag · 02 September 2016, 21:35

Quote:

Originally Posted by NorthWay

The change to Exec is, but the struct it points to is indeed not documented, and is definitely intended not to be.

good show, it means they can change it without notice for any purpose they like... and so can we...

but back to those pesky RMWs...

well which instructions fit that bill anyway? Any operation with <ea> as destination, i suppose. Such as "addq #1,<ea>", that could be useful for thread-safe stuff, right?
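
A minimal C++ sketch of what an atomic "addq #1,<ea>" would buy across cores, assuming std::atomic's fetch_add as a stand-in for the indivisible read-modify-write. The helper name count_with_atomic_add and its parameters are hypothetical, chosen only for illustration:

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical helper: each of `threads` workers bumps one shared counter
// `per_thread` times. fetch_add is an indivisible read-modify-write, which
// is the guarantee an atomic "addq #1,<ea>" would give across cores.
int count_with_atomic_add(int threads, int per_thread) {
    assert(threads >= 0 && per_thread >= 0);
    std::atomic<int> counter{0};
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&counter, per_thread] {
            for (int i = 0; i < per_thread; ++i) {
                counter.fetch_add(1, std::memory_order_relaxed);
            }
        });
    }
    for (auto& th : pool) th.join();  // wait for every worker to finish
    return counter.load();
}
```

With a plain int and "++counter" instead, increments from different threads could be lost, which is exactly the hazard a non-atomic RMW has on a multi-core system.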

Well, give each core a priority (which might rotate on a per-cycle basis), the highest priority core that encounters a destination <ea> gets to set a busy bit that blocks all other memory reads (on this address?) until it clears it again at the end of the instruction. (How does the 68060 pipeline cope with such potential RAM-based data hazards, anyway? Just stall?)

As for TAS/CAS &c, found a thread about that very issue here on EAB, well i can see the problem when using it on chip RAM, because DMA doesn't respect it, which can lead to incorrect results. However on fast RAM, the situation is different. The Amiga only lives on one side of the expansion socket. Accelerators might have their own DMAs for various things, however, if we are designing an accelerator of our own, we have control over that.

Megol · 02 September 2016, 23:09

Quote:

Originally Posted by meynaf

Quite a lot of trouble. Consider two programs accessing the same list, especially with inline versions of Forbid/Disable.

Snooping for those cases is simple in hardware, and handling them is too. Making RMW instructions atomic isn't a problem either; there are a lot of other problems, though, that are mostly OS related. Don't see any need to add special SMP instructions except for switching the snooping mechanism on and off.

Quote:

Originally Posted by Mrs Beanbag

good show, it means they can change it without notice for any purpose they like... and so can we...

but back to those pesky RMWs...

well which instructions fit that bill anyway? Any operation with <ea> as destination, i suppose. Such as "addq #1,<ea>", that could be useful for thread-safe stuff, right?

Well, give each core a priority (which might rotate on a per-cycle basis), the highest priority core that encounters a destination <ea> gets to set a busy bit that blocks all other memory reads (on this address?) until it clears it again at the end of the instruction. (How does the 68060 pipeline cope with such potential RAM-based data hazards, anyway? Just stall?)

As for TAS/CAS &c, found a thread about that very issue here on EAB, well i can see the problem when using it on chip RAM, because DMA doesn't respect it, which can lead to incorrect results. However on fast RAM, the situation is different. The Amiga only lives on one side of the expansion socket. Accelerators might have their own DMAs for various things, however, if we are designing an accelerator of our own, we have control over that.

IMHO cache-line locking is the best solution to making RMW instructions atomic.

When reading from memory the data can be 1) in the local cache 2) in the cache of another processor 3) in main memory. If it is in the local cache the data will be locked (will not change coherency state until the instruction retires), if it is in a remote cache it will be fetched to the local cache and then locked. If it is in main memory it will be fetched and then locked when in the local cache.

In this way other processors can still continue to run programs unless they happen to access a cache-line that is locked.
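
The locking rule above can be sketched as a tiny software model. This is only an illustration under stated assumptions: one cache line, a single owner, and hypothetical names (CacheLine, request, begin_rmw, retire_rmw) that do not come from any real design:

```cpp
#include <cassert>

enum class Access { Granted, Stalled };

// Hypothetical model of one cache line; names are illustrative only.
struct CacheLine {
    int owner = -1;       // core whose cache holds the line, -1 = main memory
    bool locked = false;  // set while an RMW instruction is in flight
};

// A core asks for the line. Whether the data was local, in a remote cache,
// or in main memory, it ends up in the requester's cache, unless another
// core currently has the line locked.
Access request(CacheLine& line, int core) {
    if (line.locked && line.owner != core) return Access::Stalled;
    line.owner = core;
    return Access::Granted;
}

// Start an RMW: acquire the line, then lock it until the instruction retires.
Access begin_rmw(CacheLine& line, int core) {
    if (request(line, core) == Access::Stalled) return Access::Stalled;
    line.locked = true;
    return Access::Granted;
}

// The RMW retires: release the lock so stalled cores can retry.
void retire_rmw(CacheLine& line, int core) {
    assert(line.locked && line.owner == core);
    line.locked = false;
}
```

A core that touches a different line is never affected, so only genuine contention on the locked line costs anything.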

NorthWay · 03 September 2016, 01:01

Quote:

Originally Posted by Megol

IMHO cache-line locking is the best solution to making RMW instructions atomic.

That was what I was thinking of myself. Any idea how expensive it is in gate cost, execution slowdown from cache ping-ponging and implementation complexity?

And for once you really _need_ to have a per-cpu cache :-)

Mrs Beanbag · 05 September 2016, 21:25

lately i've been musing about the possibility of doing massive hyperthreading/massive ILP instead of having multiple cores, but coming back to ISA stuff... or maybe this should go back in the "other thread"...

but something i've wondered before, if it could be possible to have a "fork" instruction, have some really simple hardware scheduler allowing one to create another thread directly from asm code.
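
To make the "fork" idea concrete, here is a toy C++ interpreter with a round-robin ready queue standing in for the simple hardware scheduler. Everything in it is an assumption made for illustration: the three opcodes (Fork, Emit, Halt), the Instr layout, and the fact that a thread is nothing more than a program counter.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <vector>

enum class Op { Fork, Emit, Halt };

struct Instr {
    Op op;
    int arg;  // Fork: pc where the new thread starts; Emit: value to record
};

// Runs `program` from pc 0, one instruction per thread per turn, and
// returns the emitted values in the order they were produced.
std::vector<int> run(const std::vector<Instr>& program) {
    std::deque<std::size_t> ready;  // the "hardware" ready queue of pcs
    ready.push_back(0);
    std::vector<int> trace;
    while (!ready.empty()) {
        std::size_t pc = ready.front();
        ready.pop_front();
        assert(pc < program.size());
        const Instr& in = program[pc];
        switch (in.op) {
        case Op::Fork:
            ready.push_back(pc + 1);                            // parent continues
            ready.push_back(static_cast<std::size_t>(in.arg));  // child starts
            break;
        case Op::Emit:
            trace.push_back(in.arg);
            ready.push_back(pc + 1);
            break;
        case Op::Halt:
            break;  // thread ends: it is simply not re-queued
        }
    }
    return trace;
}
```

A program that forks at pc 0 and then lets parent and child each emit values produces an interleaved trace, since the queue hands out one instruction per thread in turn. A real design would also need per-thread register state, which this sketch leaves out.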

Megol · 06 September 2016, 22:02

Quote:

Originally Posted by NorthWay

That was what I was thinking of myself. Any idea how expensive it is in gate cost, execution slowdown from cache ping-ponging and implementation complexity?

How expensive it is depends on how coherency is done, but it is essentially only a check, when another processor wants access to a certain cacheline, of whether it is local (this is always done anyway) and whether it is locked. If it is locked, the remote request is stalled until the instruction is finished. In most cases very little extra hardware is required.

There would be no more ping-pong effect than in other systems with cache coherency: a remote processor that wants to access a local cacheline always has to request it before continuing execution. Well, one could do a very complicated design where the remote processor can execute an instruction on the local processor, but for real workloads it would be slower.

The extra cost is mostly in the coherency logic itself; however, that is a cost one has to bear to make a user-friendly system.

(here local processor = the processor that currently owns the cacheline, remote processor = the processor that wants to access the same cacheline)

Quote:

And for once you really _need_ to have a per-cpu cache :-)

Not really as one could modify the protocol and have per core locking. But for most cases at least one cache (preferably two: instruction, data) is the best solution anyway.

Megol · 06 September 2016, 22:11

Quote:

Originally Posted by Mrs Beanbag

lately i've been musing about the possibility of doing massive hyperthreading/massive ILP instead of having multiple cores, but coming back to ISA stuff... or maybe this should go back in the "other thread"...

but something i've wondered before, if it could be possible to have a "fork" instruction, have some really simple hardware scheduler allowing one to create another thread directly from asm code.

The transputer had a hardware scheduler (well, microcode anyway) and very cheap creation of new threads. There have been other processors with cheap hardware-supported multithreading but I can't remember anything except the transputer ATM; a search should show something.

Some other related design ideas for speculative threading etc. have the equivalent of fork instructions to spin off a speculative thread.

NorthWay · 06 September 2016, 22:59

I've seen some of the states for modern Power caches and I think they have more than 20 different possible ones. Shared read-only is one of them (IIRC). That can of course be converted to local-rw plus far-purge on first write. Or if the caches aren't exclusive as local-rw plus far-ro.

But I still don't know how the OS should behave as it considers itself alone.

Megol · 07 September 2016, 21:19

Quote:

Originally Posted by NorthWay

I've seen some of the states for modern Power caches and I think they have more than 20 different possible ones. Shared read-only is one of them (IIRC). That can of course be converted to local-rw plus far-purge on first write. Or if the caches aren't exclusive as local-rw plus far-ro.

Well, that's a question of optimization.

Systems supporting large numbers of processors want to reduce coherency overheads as much as practically possible; a smaller number of processors/cores can use less complicated designs.

A common coherency design uses four states per cacheline: MESI, or Modified/Exclusive/Shared/Invalid. In short:

Modified: The cacheline has the current data, which has been modified (not matching main memory).
Exclusive: The cacheline has the only cached copy of the data, and it matches main memory.
Shared: This cacheline has a copy of the data; other processors may also have it.
Invalid: Well, the line is invalid.

Making RMW instructions atomic can be done in one of three ways: by adding external logic that doesn't modify the states themselves (just changing how state changes happen); by slightly redefining the Modified state (such as an RMW instruction changing the state to Modified before actually modifying the data, and not allowing the state to change until the instruction is retired); or by adding a new state similar to Modified that can only change to Modified (Locked? Atomic? Something like that).
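
The third option (an extra state that can only change to Modified) could look roughly like the following transition function. It is a simplified sketch, not a full protocol: it models one cache's view of one line, assumes that RmwBegin also gains ownership and invalidates remote copies, and all type and event names are made up for the example.

```cpp
#include <cassert>

enum class State { Modified, Exclusive, Shared, Invalid, Locked };
enum class Event { LocalRead, LocalWrite, RmwBegin, RmwRetire,
                   RemoteRead, RemoteWrite };

struct Result {
    State next;
    bool stall_remote;  // true if the remote request must wait and retry
};

// One step of a MESI-plus-Locked state machine (simplified).
Result step(State s, Event e) {
    if (s == State::Locked) {
        // Locked can only leave towards Modified, when the RMW retires.
        if (e == Event::RmwRetire) return {State::Modified, false};
        bool remote = (e == Event::RemoteRead || e == Event::RemoteWrite);
        return {State::Locked, remote};
    }
    switch (e) {
    case Event::RmwBegin:    return {State::Locked, false};
    case Event::LocalWrite:  return {State::Modified, false};
    case Event::LocalRead:   // simplification: a miss always fills as Shared
        return {s == State::Invalid ? State::Shared : s, false};
    case Event::RemoteRead:  // M/E are downgraded; write-back is implied
        return {s == State::Invalid ? State::Invalid : State::Shared, false};
    case Event::RemoteWrite: return {State::Invalid, false};
    case Event::RmwRetire:
        assert(false && "RmwRetire without a matching RmwBegin");
        return {s, false};
    }
    return {s, false};
}
```

The only place a remote processor ever waits is the Locked row, which matches the point that the extra hardware is small once the coherency logic exists.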

Quote:

But I still don't know how the OS should behave as it considers itself alone.

Yes. 100% compatibility will never be possible but I've not seen a good argument that an almost-SMP mode would be possible. Almost-SMP as there are some corner cases where true multi-processing would break down; however, most of the time the system should behave like an SMP one.

Mrs Beanbag · 10 September 2016, 17:19

Quote:

Originally Posted by Megol

Yes. 100% compatibility will never be possible but I've not seen a good argument that an almost-SMP mode would be possible. Almost-SMP as there are some corner cases where true multi-processing would break down; however, most of the time the system should behave like an SMP one.

right. (i presume you meant "would NOT be possible"?)

As for inter-process communication, a thing that didn't get mentioned when it should have been: as meynaf says, on a modern OS different processes are quite isolated from each other in their own address spaces, so they can't just write to each other's data structures. Well, that is true for different processes but it is not true for different threads, which one can spawn (with, for instance, std::thread in C++11). All the threads of one process exist in the same address space, so you can do exactly what Amiga applications do, and write to shared memory. It is sometimes difficult to get right but there is no hardware problem. The programmer doesn't know, or need to know, whether the other threads are running on a different core or not. 68k actually should make this sort of thing easier if atomicity of <ea> destination instructions can be guaranteed.
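
A small C++11 sketch of that point: several std::thread workers append to one list that lives in their common address space. The SharedList type and the fill_shared_list helper are hypothetical names for the example; the mutex is there only to serialise the updates, not to make the memory visible.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical shared structure: a list plus the lock that guards it.
struct SharedList {
    std::mutex guard;
    std::vector<int> items;
};

// Spawns `writers` threads that each append `per_writer` values to the same
// list. Every thread sees the same `list` object because all threads of a
// process share one address space.
std::size_t fill_shared_list(SharedList& list, int writers, int per_writer) {
    assert(writers >= 0 && per_writer >= 0);
    std::vector<std::thread> pool;
    for (int w = 0; w < writers; ++w) {
        pool.emplace_back([&list, w, per_writer] {
            for (int i = 0; i < per_writer; ++i) {
                std::lock_guard<std::mutex> hold(list.guard);
                list.items.push_back(w);  // direct write to shared memory
            }
        });
    }
    for (auto& th : pool) th.join();
    return list.items.size();
}
```

Nothing in the code says which core a thread runs on; the scheduler and the cache coherency hardware take care of that.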

The only difficulty i can see is with interrupts. When an interrupt happens, the program it belongs to might reasonably assume the interrupt routine cannot be interrupted by anything else, so it could cause trouble if that program can carry on executing even while the interrupt is being serviced. One would need either to know which process set up the interrupt and pause that one, or to cautiously suspend all processes during servicing of any interrupt.

For instance if i try to stop a Protracker module while the playroutine interrupt is happening on another hardware thread, it might cause trouble. This is not a situation anyone currently needs to worry about.

Megol · 10 September 2016, 19:47

Quote:

Originally Posted by Mrs Beanbag

right. (i presume you meant "would NOT be possible"?)

Yes.

Quote:

As for inter-process communication, a thing that didn't get mentioned when it should have been: as meynaf says, on a modern OS different processes are quite isolated from each other in their own address spaces, so they can't just write to each other's data structures. Well, that is true for different processes but it is not true for different threads, which one can spawn (with, for instance, std::thread in C++11). All the threads of one process exist in the same address space, so you can do exactly what Amiga applications do, and write to shared memory. It is sometimes difficult to get right but there is no hardware problem. The programmer doesn't know, or need to know, whether the other threads are running on a different core or not. 68k actually should make this sort of thing easier if atomicity of <ea> destination instructions can be guaranteed.

The only difficulty i can see is with interrupts. When an interrupt happens, the program it belongs to might reasonably assume the interrupt routine cannot be interrupted by anything else, so it could cause trouble if that program can carry on executing even while the interrupt is being serviced. One would need either to know which process set up the interrupt and pause that one, or to cautiously suspend all processes during servicing of any interrupt.

For instance if i try to stop a Protracker module while the playroutine interrupt is happening on another hardware thread, it might cause trouble. This is not a situation anyone currently needs to worry about.

Interrupts are one problem but an easily solved one: just make all interrupts/exceptions be handled by one processor halting all others.

The fact that programs rely on RMW instructions being effectively atomic on a one-processor system is handled by making them actually atomic on multiprocessor systems.

The disable/forbid etc. routines and their in-line macros can be handled by snooping changes of the two relevant addresses and stalling other processors. A more optimized way to handle it would be to support "virtual stalling": as long as other processors don't disturb the one calling forbid etc., they can be allowed to continue execution.
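
The simple (non-virtual) stalling policy can be modelled in a few lines of C++. This is a sketch under assumptions: the two watched addresses are treated as nesting counters that go up on forbid/disable and down on the matching permit/enable, and the SnoopUnit type with its member names is invented for the example.

```cpp
#include <cassert>

// Hypothetical snoop unit: watches writes to the two nesting counters and
// decides which cores must stall. Names are illustrative only.
struct SnoopUnit {
    int holder = -1;  // core inside a forbid/disable section, -1 = none
    int depth = 0;    // combined nesting depth seen on the watched addresses

    // Core `core` incremented one of the watched counters.
    // Returns false if that core must stall and retry the write later.
    bool on_increment(int core) {
        if (holder != -1 && holder != core) return false;
        holder = core;
        ++depth;
        return true;
    }

    // Core `core` decremented one of the watched counters.
    void on_decrement(int core) {
        assert(holder == core && depth > 0);
        if (--depth == 0) holder = -1;  // outermost section left: release
    }

    // Simple policy: every core except the holder stalls.
    bool must_stall(int core) const {
        return holder != -1 && holder != core;
    }
};
```

"Virtual stalling" would replace must_stall with a check that only fires when another core actually touches something the holder could observe, which is much harder to define than this global stall.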

Programs executing under Amiga OS can assume that they will not be preempted by lower priority ones and can use this as a kind of synchronizing mechanism. The simple way to handle this is making sure all processors are executing programs of the same priority, it isn't optimal though.

There are certainly some other problems that have to be handled however I think multiprocessing is possible and a way to provide Amiga systems with a nice speed boost. Could be wrong of course, I often am.

Mrs Beanbag · 10 September 2016, 20:48

Quote:

Originally Posted by Megol

The disable/forbid etc. routines and their in-line macros can be handled by snooping changes of the two relevant addresses and stalling other processors. A more optimized way to handle it would be to support "virtual stalling": as long as other processors don't disturb the one calling forbid etc., they can be allowed to continue execution.

Forbid() is often used to prevent other processes from doing stuff temporarily. For instance, to shrink an allocated area of memory, one calls Forbid(), frees the memory and then re-allocates it with AllocAbs(). That is one example. If some other process could reserve the same memory in the meantime, that could get nasty. Mind you, certain memory tracker programs already break this, since they write tags all over blocks of RAM at the point of freeing.

Quote:

Programs executing under Amiga OS can assume that they will not be preempted by lower priority ones and can use this as a kind of synchronizing mechanism. The simple way to handle this is making sure all processors are executing programs of the same priority, it isn't optimal though.

that is true, setting a process to a high priority is like a way of doing "thread.join()" i suppose.. the OS could make it "safe" at the expense of stalling all other running processes even though they don't have anything to do with the high priority task, which is not optimal.

Samurai_Crow · 11 September 2016, 10:09

The AROS source has an experimental project called "Silly SMP" that didn't help much on Intel because of the scheduler. The way they were doing it required high priority tasks to be non-blocking like the third-party Executive utility on AmigaOS.

BeamCoder · 09 December 2021, 06:28

Quote:

Originally Posted by meynaf

I wonder if hardware loops couldn't replace the SIMD extensions.
That would be called hardware autovectorization.
Then it would be potentially beneficial to every program, not just ones that make the effort to use those filthy vector extensions.
And the next gen could have better performance without rewriting any program.

Just dreaming...

Might be related to what you're saying?

Virtual Vector Method Data Cache:
https://groups.google.com/g/comp.arc...m/yYCLvOSqCgAJ

Other methods:
https://web.archive.org/web/20211101...flaws-of-simd/

meynaf · 09 December 2021, 09:46

Quote:

Originally Posted by BeamCoder

Might be related to what you're saying?

Virtual Vector Method Data Cache:
https://groups.google.com/g/comp.arc...m/yYCLvOSqCgAJ

Other methods:
https://web.archive.org/web/20211101...flaws-of-simd/

Related yes, but not the exact same thing.

BeamCoder · 09 December 2021, 13:41

Quote:

Originally Posted by meynaf

Related yes, but not the exact same thing.

Does it perhaps alleviate your dislike of having SIMD/Vectors on CPU?

By the way, I kind of like your philosophy of having a simple/humanly programmable ISA. I'd like to know: how is your ISA going?

meynaf · 09 December 2021, 14:49

Quote:

Originally Posted by BeamCoder

Does it perhaps alleviate your dislike of having SIMD/Vectors on CPU?

Well, not really

Quote:

Originally Posted by BeamCoder

By the way, I kind of like your philosophy of having a simple/humanly programmable ISA. I'd like to know: how is your ISA going?

The ISA is more or less finished. I'm more focused on making it work in real life, and actually it even works - though not being able to do HW implementation, i did it in pure SW.

coldacid · 09 December 2021, 14:59

Quote:

Originally Posted by meynaf

The ISA is more or less finished. I'm more focused on making it work in real life, and actually it even works - though not being able to do HW implementation, i did it in pure SW.

Verilog? VHDL? Or a non-synthesizable programming language?

meynaf · 09 December 2021, 15:09

Quote:

Originally Posted by coldacid

Verilog? VHDL? Or a non-synthesizable programming language?

When i write pure SW, it's pure SW. That is, software emulation -- IOW a VM.

Samurai_Crow · 09 December 2021, 17:58

Quote:

Originally Posted by coldacid

Verilog? VHDL? Or a non-synthesizable programming language?

This repo is a place you can start if you want VHDL. It's got a 4-stage pipeline already though it still needs some work.

BeamCoder · 12 December 2021, 21:20

Quote:

Originally Posted by meynaf

Well, not really

The ISA is more or less finished. I'm more focused on making it work in real life, and actually it even works - though not being able to do HW implementation, i did it in pure SW.

Does it also work on PC? If so, any chance we could have the source code or at least use it? I am interested in programming on it.

About your VM:

Currently, I am interested in fantasy consoles: programs which simulate classic consoles/computers and provide tools to make games for them. The problem I have with those is that they are programmed in scripting languages instead of assembly. Only a few actually try to be similar to classic machines, like this one: http://www.vircon32.com/index.html. And so, I'm throwing out the idea of extending your VM into a fantasy console; maybe some people would be interested in programming in your ISA. Or, if you don't want to, maybe I could have the source code of your VM to create a fantasy console, as it would be an interesting project for me.


