English Amiga Board


English Amiga Board > Coders > Coders. Asm / Hardware
Old 12 December 2020, 16:02   #1
Gorf
Registered User
 
 
Join Date: May 2017
Location: Munich/Bavaria
Posts: 2,295
some fancy ideas for an extended (68k?) CISC-CPU

I hope this is not too off-topic here....
I was thinking of implementing a CISC CPU (as a VM) that is very close to the 68K or the NS320xx, with emphasis on code density, and came up with some ideas that might provide some nice benefits ... but I am not sure how useful my ideas would be for real assembler coders, so your input is most welcome.

So the instruction words are mostly 16 bits in length - that size should cover all operations on registers. Memory operations are longer of course, due to addresses and immediate values.
The first 8 bits are the instruction key, followed by two 4-bit register addresses - hence we have 16 general-purpose registers.
Very familiar so far...
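As a quick sketch of that layout (the field order - key first, then the two register fields - is my assumption for illustration, not part of the original spec):

```python
# Hypothetical layout of the proposed 16-bit instruction word:
# bits 15..8 = instruction key, bits 7..4 = first register,
# bits 3..0 = second register.

def encode(key, rd, rs):
    assert 0 <= key < 256 and 0 <= rd < 16 and 0 <= rs < 16
    return (key << 8) | (rd << 4) | rs

def decode(word):
    return (word >> 8) & 0xFF, (word >> 4) & 0xF, word & 0xF

word = encode(0x2A, 3, 7)            # key 0x2A, registers r3 and r7
assert decode(word) == (0x2A, 3, 7)
assert word <= 0xFFFF                # still fits in 16 bits
```

With 8 bits of key there is room for 256 register-to-register operations before any escape code is needed.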

Now the improvements:

Stacks of registers

This feature is inspired by stack-machines and/or so called register-windowing like in SPARC-CPUs, but differs in implementation.
Imagine every register is in fact a stack of entries - to keep it simple we start with only two values. Every write operation to one of our 16 registers is in fact a push operation, copying the existing value to the second (otherwise hidden) entry before the new value is written.
Special instructions allow swapping these two values and restoring the former value - for a single register, a range of registers and/or all 16 registers.

That feature gives the coder 32 registers without the need for longer instruction words...
It could be expanded to deeper stacks of e.g. 4 entries per register.
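A minimal Python model of the idea as I read it (depth 2, write-as-push, and a swap over an arbitrary set of registers; the class and method names are illustrative):

```python
# Toy model of the "stack of registers" idea: each of the 16 registers
# is a two-entry stack. Every write pushes the old value into a hidden
# slot; a swap instruction exposes it again.

class RegFile:
    def __init__(self):
        self.visible = [0] * 16
        self.hidden = [0] * 16   # the second, normally invisible entry

    def write(self, r, value):   # every write is a push
        self.hidden[r] = self.visible[r]
        self.visible[r] = value

    def swap(self, regs):        # swap visible/hidden for a set of regs
        for r in regs:
            self.visible[r], self.hidden[r] = self.hidden[r], self.visible[r]

rf = RegFile()
rf.write(0, 11)
rf.write(0, 22)    # 11 is pushed into the hidden entry
rf.swap([0])       # bring 11 back
assert rf.visible[0] == 11
rf.swap([0])       # and back again
assert rf.visible[0] == 22
```

In hardware the swap would not move data at all; as described later in the thread, it would just flip which half of a wide register the CPU looks at.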

"I know what you did last summer!"

The CPU keeps track of the last 16 instructions - imagine a kind of internal log or special instruction cache.
A special instruction tells the CPU to execute any of these instructions again, in any order (so it is not a jump or a loop!!)
The special instruction can repeat either two out of the last 16 instructions or three out of the last 8.
This should allow for very compact code.
Example:

Code:
Instruction-a---
Instruction-b  |
Instruction-c  |
Instruction-d  |-- log entries
Instruction-e  |
Instruction-f  |
Instruction-g  |
Instruction-h---
Repeat -4, -6, -1
which executes e, c and h again

Now our instruction log has changed - the special repeat command is decoded into three independent instructions and stored as such in the log:

Code:
Instruction-d---
Instruction-e  |
Instruction-f  |
Instruction-g  |-- log entries
Instruction-h  |
Instruction-e  |
Instruction-c  |
Instruction-h---
if we now order again
Repeat -4, -6, -1
it executes h, f and h

and so on...
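To make the repeat semantics concrete, here is a toy Python model of the three-out-of-the-last-8 variant; it reproduces the two worked examples above (the deque stands in for the internal log):

```python
from collections import deque

# Sliding window of the last 8 executed instructions, plus a repeat
# that re-issues entries by negative offset. The repeated instructions
# are themselves appended to the log, as described in the post.

LOG_DEPTH = 8

def repeat(log, offsets):
    # resolve all offsets against the current log, then append the
    # re-executed instructions as ordinary log entries
    executed = [log[off] for off in offsets]
    log.extend(executed)
    return executed

log = deque("abcdefgh", maxlen=LOG_DEPTH)
assert repeat(log, [-4, -6, -1]) == ["e", "c", "h"]
assert list(log) == ["d", "e", "f", "g", "h", "e", "c", "h"]
assert repeat(log, [-4, -6, -1]) == ["h", "f", "h"]
```

Note the model resolves all three offsets before appending anything; resolving them one at a time would give a different (and harder to reason about) result.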

The 8-Bit-Turbo

This is a special mode for our CPU that allows the (relatively) fast emulation of old 8-bit CPUs like the Z80 - or indeed any byte-code you come up with.

After switching to this mode, the next 8-bit value in the instruction cache will be treated as a combination of a shift (by a given value), an additional offset and a jump.
So if your byte-code interpreter has a table or library for all its commands, this provides an easy and fast way to jump to the right location - no lookup needed.
Just set your table entries e.g. 1k apart, starting at "tablestart".
The CPU will shift the byte left by 10 bits, add "tablestart" and jump to that location!
(the shift value and "tablestart" are stored in special registers beforehand)

So far we have used only the first 8 bits for this, but on a 32-bit machine we have already fetched the next three 8-bit values automatically ... these will be stored in separate CPU registers before the jump, allowing them to be used as operands by the interpreter code.
At the end of our interpreter code block, we have a special jump instruction, that jumps back to our 8-bit code section and puts the CPU in the special mode again...
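The dispatch arithmetic can be sketched in a few lines (the base address is invented; the 10-bit shift and 1k spacing follow the example above):

```python
# Sketch of the "8-Bit-Turbo" dispatch: handler code blocks are laid
# out 1k apart starting at `tablestart`, so the jump target for an
# 8-bit opcode is just (opcode << shift) + tablestart - no table
# lookup. SHIFT and TABLESTART model the special registers.

SHIFT = 10              # 1 << 10 = 1024 bytes between handlers
TABLESTART = 0x40000    # hypothetical start of the handler table

def dispatch_target(opcode):
    assert 0 <= opcode < 256
    return TABLESTART + (opcode << SHIFT)

assert dispatch_target(0x00) == 0x40000    # first handler
assert dispatch_target(0x01) == 0x40400    # exactly 1k later
```

The shift-and-add replaces the usual read-byte / table-lookup / indirect-branch sequence of a byte-code interpreter with a single address computation.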


Ok... that's it for now.
Maybe I am just reinventing the wheel here? I did not research patents ...
Anyways - what are your thoughts?

Last edited by Gorf; 12 December 2020 at 20:07.
Old 12 December 2020, 18:20   #2
meynaf
son of 68k
 
 
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,323
Quote:
Originally Posted by Gorf View Post
I hope this is not too off-topic here....
I was thinking of implementing a CISC CPU (as a VM) that is very close to the 68K or the NS320xx, with emphasis on code density, and came up with some ideas that might provide some nice benefits ... but I am not sure how useful my ideas would be for real assembler coders, so your input is most welcome.
Oh noes you've gotten me started on the subject

Real assembler coders often run into annoying limitations. Even though 68k is by far the most friendly cpu there is - especially 68020+ - it still has its issues.

Yes, a CISC cpu as a VM might be a great idea, and not that difficult to implement if you don't focus too much on actual speed.
Actually I even have one in development.


Quote:
Originally Posted by Gorf View Post
So the instruction words are mostly 16 bits in length - that size should cover all operations on registers. Memory operations are longer of course, due to addresses and immediate values.
The first 8 bits are the instruction key, followed by two 4-bit register addresses - hence we have 16 general-purpose registers.
Very familiar so far...
Very familiar, but be aware that with 16-bit opcodes there isn't enough space for 4-bit register addresses and a full set of memory addressing modes.


Quote:
Originally Posted by Gorf View Post
Now the improvements:

Stacks of registers

This feature is inspired by stack-machines and/or so called register-windowing like in SPARC-CPUs, but differs in implementation.
Imagine every register is in fact a stack of entries - to keep it simple we start with only two values. Every write operation to one of our 16 registers is in fact a push operation, copying the existing value to the second (otherwise hidden) entry before the new value is written.
Special instructions allow swapping these two values and restoring the former value - for a single register, a range of registers and/or all 16 registers.

That feature gives the coder 32 registers, without the need of longer instruction words...
It could be expanded to deeper stacks of eg. 4 entries per register.
I'd like to see this in action, especially in comparison to just using memory.
Do you have some example code in mind, or was it just food for thought?


Quote:
Originally Posted by Gorf View Post
"I know what you did last summer!"

The CPU keeps track of the last 16 instructions - imagine a kind of internal log or special instruction cache.
I'd push the concept further and do hardware loops based on this. As long as all instructions in the cache are usable for vectorization, the loop gets auto-vectorized and does several iterations per instruction (if target cpu can do that).


Quote:
Originally Posted by Gorf View Post
The 8-Bit-Turbo

This is a special mode for our CPU that allows the (relatively) fast emulation of old 8-bit CPUs like the Z80 - or indeed any byte-code you come up with.
Using an emulated cpu to emulate another cpu ? Hmm...


Quote:
Originally Posted by Gorf View Post
Ok...thats it for now.
Maybe I am just reinventing the wheel here? I did not research for patents ...
Anyways - what are your thoughts?
There is nothing wrong in reinventing the wheel. You might come up with a better wheel.
Old 12 December 2020, 19:15   #3
Gorf
Quote:
Originally Posted by meynaf View Post
Oh noes you've gotten me started on the subject

Real assembler coders often run into annoying limitations. Even though 68k is by very far the most friendly cpu there is - especially 68020+ - it still has its issues.

Yes a CISC cpu as a VM might be a great idea and not that difficult to implement if you don't focus too much on actual speed.
Actually i even have one in development.
Speed of the VM is no consideration at the moment - I am trying to think in hardware terms and implement things that could actually work in real silicon or an FPGA.
So things that could be done fast/in parallel in hardware have priority over things that would run fast in software.

Quote:
Very familiar, but be aware that with 16-bit opcodes there isn't enough space for 4-bit register addresses and a full set of memory addressing modes.
As the focus is on code density, I try to put as much as possible into just 16-bit opcodes - as said, memory addressing modes are longer of course.
I do not want to go full RISC here...
Also there is always the possibility for an escape code to use the next 16 bits additionally as opcode.

Quote:
I'd like to see this in action, especially in comparison to just using memory.
Do you have some example code in mind or was it just food for thoughts ?
The advantage of the register stacks is that they are much faster than memory accesses (or should be, in theory, on real hardware).
This has probably no benefit at all in a VM ...
But in hardware a swap of one or several registers to their counterparts could be done in one cycle ...
(actually this would be implemented as 64-bit (or 128-bit) wide registers, and we would only change which part of the register the CPU looks at: the lower or the upper part ...)

Quote:
I'd push the concept further and do hardware loops based on this. As long as all instructions in the cache are usable for vectorization, the loop gets auto-vectorized and does several iterations per instruction (if target cpu can do that).
That is also a very interesting idea - and actually not mutually exclusive ...
My idea was again based on an imaginary hardware implementation:
it would use already-decoded op-codes inside the cpu, which would make the repeat call not only smaller but also faster.

How exactly would your auto-vectorization work?


Quote:
Using an emulated cpu to emulate another cpu ? Hmm...

this idea was born from a "what if" scenario ... how e.g. Commodore could have provided some C64 compatibility in a new 32-bit CPU, without limiting it to just one architecture but being as flexible as possible - and achieving a near 2:1 ratio in per-clock execution, while still doing interpretation and not JIT.
But this should also speed up the byte-code of a scripting language.

Quote:
There is nothing wrong in reinventing the wheel. You might come up with a better wheel.

Last edited by Gorf; 12 December 2020 at 19:50.
Old 12 December 2020, 20:57   #4
meynaf
Quote:
Originally Posted by Gorf View Post
Speed of the VM is no consideration at the moment - I am trying to think in hardware terms and implement things that could actually work in real silicon or an FPGA.
So things that could be done fast/in parallel in hardware have priority over things that would run fast in software.
Ah, but that's a different story then. You can't do everything with the encoding, strange features are probably not good ideas, etc. There is less freedom in silicon than in software.

In addition, the quest for speed often limits ease of coding, which in turn forces coders to go away from asm, which in turn is bad for speed. Beware of the altar of speed. It's evil


Quote:
Originally Posted by Gorf View Post
As the focus is on code density I try to put as much in just 16bit opcodes as possible - as said, memory addressing modes are longer of course.
Making memory accesses longer would decrease code density i'm afraid.


Quote:
Originally Posted by Gorf View Post
I do not want to go full RISC here...
Also there is always the possibility for an escape code to use the next 16 bits additionally as opcode.
I think the 68k way was quite nice - use 16 registers with the cost of 8.
The D/A split was just pushed too far in how it limits possibilities.


Quote:
Originally Posted by Gorf View Post
The advantage of the register stacks is that they are much faster than memory accesses (or should be, in theory, on real hardware).
This has probably no benefit at all in a VM ...
But in hardware a swap of one or several registers to their counterparts could be done in one cycle ...
(actually this would be implemented as 64-bit (or 128-bit) wide registers, and we would only change which part of the register the CPU looks at: the lower or the upper part ...)
Well, it depends. On modern cpus cached memory accesses aren't slower than registers. Besides, swapping would indeed be one more cycle - and furthermore make things harder for a possible out-of-order implementation.
Not to mention cases where more than 16 regs are needed are quite rare.


Quote:
Originally Posted by Gorf View Post
That is also a very interesting idea - and actually not mutually exclusive ...
My idea was again based on an imaginary hardware implementation:
it would use already-decoded op-codes inside the cpu, which would make the repeat call not only smaller but also faster.
Caching already decoded instructions isn't new


Quote:
Originally Posted by Gorf View Post
How exactly would your auto-vectorization work?
Ideally, by not having to add any new instruction at all.
There are internally two modes : the normal one, and the vector one.
The cpu simply stores the position of the last incompatible instruction it saw (one that can not be converted to a parallel version). Then, if a loop instruction is seen and it does not reach back beyond this limit, a new mode is activated that chooses a vector version of all instructions in the loop.
The vector size can be anything - even no vectorized implementation at all; the same code would still work. Vector registers are internal and never seen by the programmer.
I don't know if this is even possible of course, but if it is, it would mean the end of simd extensions being made obsolete by new versions, and so on, giving an awful legacy (think of mmx -> sse1 -> sse2 -> sse4.1 -> sse4.2 -> avx -> avx2...). Now you just change the vector size and all existing programs benefit from it.
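A toy model of that rule, under my reading of it (the program representation and the predicate are invented; real hardware would track the position as instructions retire rather than rescanning):

```python
# The CPU remembers where it last saw an instruction with no parallel
# (vector) version; a backward loop may enter vector mode only if its
# whole body lies after that point.

def can_vectorize(program, loop_start, loop_end, has_vector_version):
    last_incompatible = -1
    for pc in range(loop_end + 1):
        if not has_vector_version(program[pc]):
            last_incompatible = pc
    return loop_start > last_incompatible

prog = ["add", "mul", "syscall", "add", "mul", "add"]
ok = lambda insn: insn != "syscall"
assert can_vectorize(prog, 3, 5, ok)        # body is after the syscall
assert not can_vectorize(prog, 1, 5, ok)    # body contains the syscall
```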


Quote:
Originally Posted by Gorf View Post
this idea was born from a "what if" scenario ... how e.g. Commodore could have provided some C64 compatibility in a new 32-bit CPU, without limiting it to just one architecture but being as flexible as possible - and achieving a near 2:1 ratio in per-clock execution, while still doing interpretation and not JIT.
But this should also speed up byte-code of a scripting language.
Well, 8-bit cpus don't need to be this fast. But scripting languages, yes.
However it's just 3 instructions to read the byte, fetch the address and branch, and you won't escape the pipeline stall - or rather, a strong OoO implementation could fetch the next emulated instruction and find its address while the previous one isn't finished, leading to zero benefit from doing that with a single instruction or a special mode.

Anyway, if I were you I would not focus on the special features; rather, on the general purpose instruction set. IMO this is too often overlooked.
Old 12 December 2020, 21:33   #5
robinsonb5
Registered User
 
Join Date: Mar 2012
Location: Norfolk, UK
Posts: 1,153
You guys might find the forum at anycpu.org interesting.
Old 13 December 2020, 02:05   #6
Gorf
Quote:
Ah, but that's a different story then. You can't do everything with the encoding, strange features are probably not good ideas, etc. There is less freedom in silicon than in software.
my CPU's features may be strange ... they certainly are, but they come from thinking in hardware terms. These are things I came up with because they are possible or even cheap, not so much because they are useful ;-)
That's why I am asking here how much sense they make for experienced coders.


Quote:
Well, it depends. On modern cpus cached memory accesses aren't slower than registers. Besides, swapping would indeed be one more cycle - and furthermore make things harder for a possible out-of-order implementation.
Not to mention cases where more than 16 regs are needed are quite rare.
Memory accesses are still considerably slower: 2-4 times. Modern CPUs have a lot of tricks to hide this fact: additional store buffers, prefetching, out-of-order execution...
But then we are no longer talking about something one person alone could design ....

But I take your hint that more than 16 registers are not as relevant for coders as I might have thought.

Quote:
Making memory accesses longer would decrease code density i'm afraid.
Well - since you can not fit a 32-bit address into 16 bits ... there is no way around it.

Quote:
Well, 8-bit cpus don't need to be this fast. But scripting languages, yes.
Well ... you have no idea how slow my CPU is ...
I guess you are thinking in terms of a modern CPU and GHz, while I am thinking of a small hobby project - a proof-of-concept thingy.
After a software VM, the best-case scenario would be that I manage to learn enough FPGA HDL to implement it on such a board ... and then we are talking about maybe 100 MHz

Quote:
Anyway, if i were you i would not focus on the special features
sure ... but would they be useful?
So far:
more than 16 registers are not as needed as I thought.
Byte-code mode: depends on whether it is really faster ...

Maybe the "repeat" feature?

Quote:
Ideally, by not having to add any new instruction at all.
There are internally two modes : the normal one, and the vector one.
The cpu simply stores the position of the last incompatible instruction it saw (one that can not be converted to a parallel version). Then, if a loop instruction is seen and it does not reach back beyond this limit, a new mode is activated that chooses a vector version of all instructions in the loop.
I like the idea, but that is definitely beyond my capabilities.
But it definitely is food for thought!

Last edited by Gorf; 13 December 2020 at 16:24.
Old 13 December 2020, 10:23   #7
meynaf
Quote:
Originally Posted by Gorf View Post
my CPU's features may be strange ... they certainly are, but they come from thinking in hardware terms. These are things I came up with because they are possible or even cheap, not so much because they are useful ;-)
That's why I am asking here how much sense they make for experienced coders.
What coders need is something easy to code on. If speed is their only concern they just code for today's multi-GHz machines.
Easier to code means fewer instructions for doing the same work, which in turn enhances code density.


Quote:
Originally Posted by Gorf View Post
Memory accesses are still considerably slower: 2-4 times. Modern CPUs have a lot of tricks to hide this fact: additional store buffers, prefetching, out-of-order execution...
But then we are no longer talking about something one person alone could design ....
This is less problematic in the 100Mhz range, and not at all with a VM.


Quote:
Originally Posted by Gorf View Post
But I take your hint that more than 16 registers are not that relevant for coders as I might have thought.
I sometimes need more than what the 68k offers, but it's typically 2-3 more regs, hardly ever more.
This is easy to fix: use a separate register for the stack pointer (instead of using A7), add some kind of data pointer for the small data model (in which A4 or A5 is typically used), add features for the high parts of registers, allow more for address registers, and you're done.


Quote:
Originally Posted by Gorf View Post
Well - since you can not fit a 32bit address in 16bits ... there is no way around it.
You can not fit a 32 bit address in 16 bits but you can use a 16 bit offset relative to some 32 bit address located in some register.


Quote:
Originally Posted by Gorf View Post
Well ... you have no idea how slow my CPU is ...
I guess you are thinking in terms of a modern CPU and GHz, while I am thinking of a small hobby project - a proof-of-concept thingy.
After a software VM, the best-case scenario would be that I manage to learn enough FPGA HDL to implement it on such a board ... and then we are talking about maybe 100 MHz
If your cpu is inherently slow and can absolutely not compete with what currently exists, why fill it with features targeted at speed ?

Instead of designing the cpu for speedy things that will never be fast, why not put that at a higher level ? My VM contains features that the emulated cpu can call, and which then run at native speed. While it is an emulated machine running on an emulated machine, it still does 2.5Gb/s copymem...


Quote:
Originally Posted by Gorf View Post
sure ... but would they be useful?
So far:
more than 16 registers are not as needed as I thought.
Byte-code-mode depends if it is really faster ...

Maybe the "repeat" feature?
Well, you could concentrate on the code density. Design the full encoding, take a few examples and see how it manages them. If you think in terms of special features, you can also see how they fare in these examples.
Old 13 December 2020, 16:01   #8
Gorf
Quote:
Originally Posted by meynaf View Post
What coders need is something easy to code on. If speed is their only concern they just code for today's multi-GHz machines.
Easier to code means fewer instructions for doing the same work, which in turn enhances code density.
That is the goal.
And that is exactly what e.g. the "repeat x,y,z" command does: it is effectively three commands in one.

Same with the "byte-code-mode": it is just one command instead of three for read byte, fetch address and branch.

The "register stacks" would make many lines obsolete where one saves a register back to RAM just to load it again later. Just one short command instead of at least two lengthy ones, which need memory addresses.
The gain is much bigger if you want to save back a whole bunch of registers...

Imagine your program jumps back and forth between two subroutines: wouldn't it be nice to switch some of the registers back and forth with just one command?

Quote:
This is less problematic in the 100Mhz range, and not at all with a VM.
If your argument is "RAM is now as fast as registers", then the question is:
Why would we use a register machine at all?
The Transputer had very fast internal RAM (1-2kB) and they made the same argument ... and mostly got rid of registers in their design, shortening the 15 most common op-codes to just one byte (4-bit key + 4-bit value).

If you want to keep registers because you consider them convenient, then at least there is no use for real data registers: all registers should just be shortcuts for locations in RAM, since it is the same speed anyway.
(this would also spare you from loading and saving registers - you would just be pointing to locations in RAM at all times...)

But my goal is to follow a more classic approach - as I said: I am just reinventing the wheel

Quote:
This is easy to fix : use separate register for stack pointer (instead of using A7), add some kind of data pointer for small data model (in which A4 or A5 are typically used), add features for high parts of registers, allow more for address registers, and you're done.
That is what I am aiming at with my 16 general-purpose registers.
As you pointed out, the space for the op-codes gets very cramped, but I think I have some ideas and tricks to make that possible...

Quote:
You can not fit a 32 bit address in 16 bits but you can use a 16 bit offset relative to some 32 bit address located in some register.
sure - but at some point you must have loaded this 32-bit value into your register ... you could work with two 16-bit loads and a shift - but that would not make the code denser ...

and even a 16-bit immediate value or offset would not fit together with the actual op-code in just 16 bits...

But I have of course planned a series of commands using offsets - these offsets are 8 bits plus a shift (remember, there is a special shift-value register) - so this fits nicely into a 16-bit-wide op-code. It gives you effectively a 9-bit range in most cases, if your structures are word-aligned ... if you work with other data structures and can guarantee e.g. 4-byte alignment (BCPL) or 1k alignment, you can set your shift-value register accordingly.

I would skip a 16-bit offset, since it messes up my instruction fetches - the next offset size is 24 bits, which results in a 32-bit-wide op-code.
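The shifted-offset scheme can be illustrated in a couple of lines (the function name is invented; the shift comes from the special shift-value register mentioned above):

```python
# Effective displacement of the 8-bit offset scheme: the instruction
# carries offset8, a special register carries the shift, and the
# displacement is offset8 << shift.

def effective_address(base, offset8, shift):
    assert 0 <= offset8 < 256
    return base + (offset8 << shift)

# word alignment (shift = 1): the 8-bit field covers a 9-bit range
assert effective_address(0x1000, 255, 1) == 0x1000 + 510
# 1k-aligned entries (shift = 10), as in the byte-code dispatch table
assert effective_address(0x1000, 3, 10) == 0x1000 + 3 * 1024
```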

Quote:
If your cpu is inherently slow and can absolutely not compete with what currently exists, why filling it with features targeted at speed ?
these features are targeted at code density - speed is just a welcome byproduct.


Quote:
Instead of designing the cpu for speedy things that will never be fast, why not putting that into a higher level ? My VM contains features that the emulated cpu can call, and which can then run at native speeds. While it is an emulated machine running on an emulated machine, it still does 2.5Gb/s copymem...
A very different approach and goal. Probably a better one than mine, and I am looking forward to seeing this in action - if you want to release it at all, of course...

Here I just wanted to see if some ideas I considered useful or helpful would resonate.
Well ... it does not look that way ... so:

Quote:
Well, you could concentrate on the code density.
didn't I?

Quote:
Design the full encoding, take a few examples and see how it manages them. If you think in terms of special features, you can also see how they fare in these examples.
I will think of some examples

Last edited by Gorf; 13 December 2020 at 17:01.
Old 13 December 2020, 17:12   #9
meynaf
Quote:
Originally Posted by Gorf View Post
That is the goal.
And that is exactly what e.g. the "repeat x,y,z" command does: it is effectively three commands in one.

Same with the "byte-code-mode": it is just one command instead of three for read byte, fetch address and branch.
Yes but said commands must be common enough for them to have a significant impact on overall code density. And, of course, be smaller than otherwise equivalent code...


Quote:
Originally Posted by Gorf View Post
The "register stacks" would make many lines obsolete where one saves a register back to RAM just to load it again later. Just one short command instead of at least two lengthy ones, which need memory addresses.
The gain is much bigger if you want to save back a whole bunch of registers...

Imagine your program jumps back and forth between two subroutines: wouldn't it be nice to switch some of the registers back and forth with just one command?
I'm not only jumping back and forth between two subroutines - it's many more.
Register windows failed because of this - at some point you *have* to push them to some memory place.


Quote:
Originally Posted by Gorf View Post
If your argument is "RAM is now as fast as registers", then the question is:
Why would we use a register machine at all?
Because speed isn't the only reason for having registers. As they don't come in vast numbers, only a few bits are needed to designate them, and that's good for code density.
Besides, RAM is as fast as registers but not for all operations. Regular data access yes, but not when used as a pointer (aka double indirect).


Quote:
Originally Posted by Gorf View Post
That is what I am aiming at with my 16 general-purpose registers.
As you pointed out, the space for the op-codes gets very cramped, but I think I have some ideas and tricks to make that possible...
Swapping and direct access aren't the same thing.
Consider data in the low word of a 32-bit register, and now you're out of registers to hold the loop counter. My VM has an instruction using the high word as loop counter, so no swap is necessary.
But with a register window you still need to swap the register twice, adding two instructions.
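For illustration, a toy version of the high-word loop counter meynaf describes (the function name and the DBcc-like branch semantics are my invention, to make the point concrete):

```python
# The low word of a 32-bit register holds live data while the high
# word counts loop iterations, so no extra register (and no swap) is
# needed.

def dec_high(reg):
    """Decrement the high word, leave the low word untouched.
    Returns (new_reg, taken) like a DBcc-style loop instruction."""
    high = ((reg >> 16) - 1) & 0xFFFF
    new_reg = (high << 16) | (reg & 0xFFFF)
    return new_reg, high != 0xFFFF   # branch until the counter wraps

reg = (3 << 16) | 0xBEEF    # counter = 3, data = 0xBEEF
iterations = 0
taken = True
while taken:
    reg, taken = dec_high(reg)
    iterations += 1
assert iterations == 4              # 3, 2, 1, 0 - like 68k DBRA
assert reg & 0xFFFF == 0xBEEF       # the data survives the loop
```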


Quote:
Originally Posted by Gorf View Post
sure - but at some point you must have loaded this 32-bit value into your register ... you could work with two 16-bit loads and a shift - but that would not make the code denser ...
The 32-bit value has to be loaded only once for many 16-bit accesses relative to it.
And if it is your stack pointer, you may not even need to load it at all.


Quote:
Originally Posted by Gorf View Post
and even a 16-bit immediate value or offset would not fit together with the actual op-code in just 16 bits...
This is why variable lengths have been invented.


Quote:
Originally Posted by Gorf View Post
these features are targeted for code density - speed is just a welcome byproduct.
Code density is similar to data compression : the most often used must take the least space.


Quote:
Originally Posted by Gorf View Post
A very different approach and goal. Probably a better one than mine, and I am looking forward to seeing this in action - if you want to release it at all, of course...
I do have a VM implementing nearly everything, but I don't see the point in releasing it. It's not as if it were an impressive demo...


Quote:
Originally Posted by Gorf View Post
didn't I?
Not really. You only mentioned it.
Code density does not come from using special tricks. It comes from overall good encoding method, efficient instruction set, short immediates...


Quote:
Originally Posted by Gorf View Post
I will think of some examples
The more code is written, the more "nice features" start to backfire
Old 13 December 2020, 19:26   #10
Gorf
Quote:
Originally Posted by meynaf View Post
Yes but said commands must be common enough for them to have a significant impact on overall code density. And, of course, be smaller than otherwise equivalent code...
obviously.

Quote:
I'm not only jumping back and forth between two subroutines - it's many more.
Register windows failed because of this - at some point you *have* to push them to some memory place.
but less often

Quote:
Because speed isn't the only reason for having registers. As they don't come in vast numbers, only a few bits are needed to designate them, and that's good for code density.
Besides, RAM is as fast as registers but not for all operations. Regular data access yes, but not when used as a pointer (aka double indirect).
my point here was: if you argue with the register-like speed of caches, you would probably rethink the design in a more radical way.

The TI TMS9900 had "virtual" register banks in RAM and a real register pointing to the current bank - that made sense back then and would make sense again with very fast cache.

As I already pointed out: without registers and only a stack, you can use even shorter op-codes ... see the Transputer or Forth CPUs like the RTX2010, the Novix N4000 and GreenArrays.
Also see the PERL CPU (Performance Enhanced Registerless):
https://www.researchgate.net/profile...ication_detail

But all that is a different kind of discussion ... here I just wanted to see if coders would like the features I mentioned in my first post...

Quote:
Swapping and direct access aren't the same thing.
Consider data in low word of a 32-bit register, and now you're out of registers to hold the loop counter. My VM has an instruction using high word as loop counter so no swap is necessary.
But with a register window you still need to swap the register twice, adding two instructions.
actually only once, since the first swap is implicit...
(can be turned on and off)

Your idea is very nice, but also limited to 16-bit operations ... if you need the full 32-bit width of your register this won't help, but my method would.

Quote:
The 32-bit value has to be loaded only once for many 16-bit accesses relative to it.
And if it is your stack pointer, you may not even need to load it at all.
Also the stack pointer has to come from somewhere...
But I was just saying that you need a longer command for loading longer addresses or data ...

Quote:
This is why variable lengths have been invented.
As I pointed out already in my first reply to you.

Quote:
Code density is similar to data compression : the most often used must take the least space.
also obvious ;-)

Quote:
Not really. You only mentioned it.
Code density does not come from using special tricks. It comes from overall good encoding method, efficient instruction set, short immediates...
It sure does.
I have a lot of other ideas as well and of course all the obvious and well known ways of reducing code size are considered or already part of the concept.
But here I was only talking about the 3 features I mentioned in my first post.

Quote:
The more code is written, the more "nice features" start to backfire
How so?
Gorf is offline  
Old 13 December 2020, 19:37   #11
Thomas Richter
Registered User
 
Join Date: Jan 2019
Location: Germany
Posts: 3,233
Pointless. Rather pointless. Don't we have the Vampire already for those that want to experiment with alternative 68K interpretations? Who will write software for that? I certainly won't.
Thomas Richter is offline  
Old 13 December 2020, 19:51   #12
Gorf
Registered User
 
Gorf's Avatar
 
Join Date: May 2017
Location: Munich/Bavaria
Posts: 2,295
Quote:
Originally Posted by Thomas Richter View Post
Pointless. Rather pointless. Don't we have the Vampire already for those that want to experiment with alternative 68K interpretations?
That is of course true for any new hobby cpu-architecture.
I am not aiming at revolutionizing the world ...
Quote:
Who will write software for that? I certainly won't.
probably only me, myself and I ... same as with meynaf's VM for him.

But if you think this thread does not belong here, a moderator might move it to some other place or delete it - I did not want to cause any fuss.
Gorf is offline  
Old 13 December 2020, 20:27   #13
meynaf
son of 68k
 
meynaf's Avatar
 
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,323
Quote:
Originally Posted by Gorf View Post
but less often
That depends of the size of the register window. If you just have a second shadow register, don't expect it to be of any use to replace regular register saving.


Quote:
Originally Posted by Gorf View Post
my point here was, when you argue with the register-like speed of caches, you probably would rethink the design in a more radical way.
Registers have their uses, memory has its uses. It's just about using the right tool for a given job.


Quote:
Originally Posted by Gorf View Post
actually only once, since the first swap is implicit...
(can be turned on and off)
How can the first swap be implicit ?
You have to specify somewhere which registers are gonna be swapped, haven't you ?


Quote:
Originally Posted by Gorf View Post
Your idea is very nice, but also limited to 16bit operations .. if you need the full 32 bit width of your register this won't help, but my method would.
It's again about usage statistics - most loop counters fit in 16 bit. If one does not, well, there are other ways.
My method at least does not require additional instructions to perform swapping.


Quote:
Originally Posted by Gorf View Post
Also the stack-pointer has to come from somewhere...
Yep, from the OS loading the program...


Quote:
Originally Posted by Gorf View Post
But I was just saying, that you need a longer command for loading longer addresses or data ...
But my point is that you don't need to do that often so it is perfectly acceptable.


Quote:
Originally Posted by Gorf View Post
It sure does.
I have a lot of other ideas as well and of course all the obvious and well known ways of reducing code size are considered or already part of the concept.
May we see these, then ?


Quote:
Originally Posted by Gorf View Post
But here I was only talking about the 3 features I mentioned in my first post.
And as a coder i don't see the point in any of the 3, sorry.


Quote:
Originally Posted by Gorf View Post
How so?
Doesn't the microprocessor industry have many examples of original, great looking at first sight, failed designs ?



Quote:
Originally Posted by Thomas Richter View Post
Pointless. Rather pointless. Don't we have the Vampire already for those that want to experiment with alternative 68K interpretations?
The vampire does not allow tinkering with the instruction set.


Quote:
Originally Posted by Thomas Richter View Post
Who will write software for that? I certainly won't.
I certainly won't ask you to write any software, so that's fine.
meynaf is offline  
Old 13 December 2020, 20:44   #14
Thomas Richter
Registered User
 
Join Date: Jan 2019
Location: Germany
Posts: 3,233
Quote:
Originally Posted by meynaf View Post
The vampire does not allow tinkering with the instruction set.
Huh? It certainly does. It is an FPGA that is reprogrammed from a flash ROM upon startup. Just that the API and the internals are not exposed to you.
Thomas Richter is offline  
Old 13 December 2020, 21:20   #15
Gorf
Registered User
 
Gorf's Avatar
 
Join Date: May 2017
Location: Munich/Bavaria
Posts: 2,295
As an example, let's take the question from the neighboring thread:
"Quickest way to add two colours"

a/b has proposed a solution for generating a table:
http://eab.abime.net/showpost.php?p=1445979&postcount=7

here is how we could reduce the size of the loop with my ideas:

Code:
loop	moveq	#15,d0
	and.w	d7,d0
	ror.w	#4,d7
	moveq	#15,d1
	and.w	d7,d1
	ror.w	#4,d7
	moveq	#15,d2
	and.w	d7,d2
	ror.w	#4,d7
	moveq	#15,d3
	and.w	d7,d3
	ror.w	#4,d7
	add.w	d2,d0
	add.w	d3,d1
        .....

becomes:

loop	moveq	#15,d3
	and.w	d7,d3 ; implicit "switch" on d3
	ror.w	#4,d7
	pop	d3,d0 ; moves d3 to d0 and switches back d3 to #15
	rep.2	-3, -2 ; does and.w d7,d3 and ror.w #4,d7
	pop	d3,d1
	rep.2	-3, -2
	pop	d3,d2
        rep.2	-3, -2
	add.w	d2,d0
	add.w	d3,d1
This is a reduction of 30% in this case.
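For readers unfamiliar with the scheme, here is one possible reading of the "instruction log" feature as a Python sketch. The class, op representation, and replay semantics are my assumptions from this thread (replayed instructions re-enter the log, the rep itself does not), not a definitive specification:

```python
from collections import deque

# Toy model of the "instruction log" idea: the VM remembers the last
# 16 executed operations and rep can replay a slice of them by
# negative offsets (e.g. rep -3, -2 replays the 3rd- and 2nd-last ops).
class LogVM:
    def __init__(self):
        self.regs = {}
        self.log = deque(maxlen=16)

    def execute(self, op):
        fn, *args = op
        fn(self.regs, *args)
        self.log.append(op)   # every executed op enters the log

    def rep(self, first, last):
        # first/last are negative offsets into the log; the rep
        # itself is not logged, so offsets stay meaningful.
        ops = list(self.log)[first:last + 1 if last != -1 else None]
        for op in ops:
            self.execute(op)

def set_reg(regs, name, v): regs[name] = v
def add(regs, dst, src): regs[dst] = regs[dst] + regs[src]

vm = LogVM()
vm.execute((set_reg, 'd0', 1))
vm.execute((add, 'd0', 'd0'))   # d0 = 2
vm.rep(-1, -1)                  # replay the add: d0 = 4
assert vm.regs['d0'] == 4
```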

Last edited by Gorf; 13 December 2020 at 22:25.
Gorf is offline  
Old 14 December 2020, 09:29   #16
meynaf
son of 68k
 
meynaf's Avatar
 
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,323
Quote:
Originally Posted by Thomas Richter View Post
Huh? It certainly does. It is an FPGA that is reprogrammed from a flash ROM upon startup. Just that the API and the internals are not exposed to you.
Without actual VHDL sources it's not very useful...



Quote:
Originally Posted by Gorf View Post
As for an example let's take the question for the neighboring thread:
"Quickest way to add two colours"

a/b has proposed a solution for generating a table:
http://eab.abime.net/showpost.php?p=1445979&postcount=7

here is how we could reduce the size of the loop, with my ideas:
(...)

This is a reduction of 30% in the case.
The code makes dangerous assumptions about what instructions do a switch and what instructions do not. There is one with 'and' but not with 'add' or 'ror' ? Where's the sense in that ? What if in some other example the switch is actually not wanted ?

At the end : code not readable - maybe also not working.
But a very nice example actually ! It's real life work.

For your reading pleasure (or not ), here's the version for my VM (with renamed instructions to be more 68k-like) :
Code:
 pdep #$0f0f0f0f,d7:d0		; ****GBgb -> 0G0B0g0b
 moverp d0.h,d1			; d1=****0G0B
 add.w d1,d0			; d0=****0G0B + 0g0b (-> ******0B + 0b)
 moverp d0.t,d1			; d1=******0G + 0g
I think that's a little better than 30% reduction.
Ask me if you want explanation on individual instructions.
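For reference, pdep here is modelled on the x86 BMI2 "parallel bits deposit" operation: the low bits of the source are scattered into the 1-bit positions of a mask. A software model in Python shows why the mask $0f0f0f0f helps - each 4-bit channel ends up in its own byte, so adding two spread values cannot carry from one colour channel into the next:

```python
# Software model of pdep (x86 BMI2 "parallel bits deposit"):
# the low bits of src are scattered into the set-bit positions of mask.
def pdep(src, mask):
    result = 0
    bit = 0
    for i in range(32):
        if mask & (1 << i):
            if src & (1 << bit):
                result |= 1 << i
            bit += 1
    return result

# With mask $0f0f0f0f, packed nibbles spread out one per byte,
# e.g. GBgb -> 0G0B0g0b, so per-channel adds can't cross nibbles.
packed = 0x1234                      # nibbles: 1,2,3,4
spread = pdep(packed, 0x0f0f0f0f)
assert spread == 0x01020304
```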
meynaf is offline  
Old 14 December 2020, 12:12   #17
Gorf
Registered User
 
Gorf's Avatar
 
Join Date: May 2017
Location: Munich/Bavaria
Posts: 2,295
Quote:
Originally Posted by meynaf View Post
The code makes dangerous assumptions about what instructions do a switch and what instructions do not.
No, of course it does not.
As I wrote in my first post: all write operations act as a "push" on a register.
The new value is now the actual one, the old value the other one..

Quote:
There is one with 'and' but not with 'add' or 'ror' ?
They all do.
I just pointed it out in an extra comment in the one case where we made use of it later.

Quote:

Where's the sense in that ? What if in some other example the switch is actually not wanted ?
You can disallow this behavior per register, to preserve a value in the background.
This might be useful for registers containing addresses - but it is not the default behavior.
(Not done in this example)
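The push-on-write behavior described above can be sketched in a few lines of Python. This is a minimal model of one two-deep register; the class name and the per-register 'hold' flag name are my inventions for illustration:

```python
# Sketch of the two-deep "register stack" idea: every write pushes
# the old value into a shadow slot; swap brings it back.
class StackedReg:
    def __init__(self):
        self.top = 0
        self.shadow = 0
        self.hold = False   # when True, writes do not push (assumed
                            # per-register flag to preserve a value)

    def write(self, value):
        if not self.hold:
            self.shadow = self.top   # implicit push of the old value
        self.top = value

    def swap(self):
        self.top, self.shadow = self.shadow, self.top

d3 = StackedReg()
d3.write(15)        # top=15
d3.write(15 & 7)    # top=7, shadow=15 (the write pushed #15 aside)
d3.swap()           # the old #15 is back without reloading it
assert d3.top == 15 and d3.shadow == 7
```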

Quote:
At the end : code not readable - maybe also not working.
But a very nice example actually ! It's real life work.

For your reading pleasure (or not ), here's the version for my VM (with renamed instructions to be more 68k-like) :
Code:
 pdep #$0f0f0f0f,d7:d0		; ****GBgb -> 0G0B0g0b
 moverp d0.h,d1			; d1=****0G0B
 add.w d1,d0			; d0=****0G0B + 0g0b (-> ******0B + 0b)
 moverp d0.t,d1			; d1=******0G + 0g
Guess what:
This code is not readable and maybe also not working

Quote:
I think that's a little better than 30% reduction.
Ask me if you want explanation on individual instructions.
Very much so

Last edited by Gorf; 14 December 2020 at 12:31.
Gorf is offline  
Old 14 December 2020, 16:47   #18
meynaf
son of 68k
 
meynaf's Avatar
 
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,323
Quote:
Originally Posted by Gorf View Post
No it does not of course.
As I wrote in my first post: all write operations act as a "push" on a register.
The new value is now the actual one, the old value the other one..
But reads don't automatically pop, do they ?
Anyway, don't you fear it's gonna overload the register file by too much data to move around ? I mean, simple EXG (= 2 writes) would become 128 bits to change in a single instruction...


Quote:
Originally Posted by Gorf View Post
They all do.
I just pointed this out in an extra comment in the one case we made use of it later.
Ok, the comment was just misleading then.


Quote:
Originally Posted by Gorf View Post
You, can disallow this behavior per register, to preserve a value in the background.
This might be useful for registers, containing addresses - but it is not the default behavior.
(Not done in this example)
So registers will also have settings ? It's not becoming simpler


Quote:
Originally Posted by Gorf View Post
Guess what:
This code is not readable and maybe also not working
It is much shorter than your attempt and does not use "tricks".
So unreadable maybe, but less so than yours and it is working. Remember, i have a vm so i can test any code.

But actually, yours really does not work, at least, according to your specifications in the first post. Your two last reps will not repeat the right instructions. Remember, the log is supposed to have evolved since...
(Or maybe neither pop nor rep are entered in the log ?)

Also, must be some fun if an irq comes right in the middle of this code. Or if the coder wants to debug the code using trace mode.
Do you have solutions for these issues ?
meynaf is offline  
Old 14 December 2020, 17:08   #19
Gorf
Registered User
 
Gorf's Avatar
 
Join Date: May 2017
Location: Munich/Bavaria
Posts: 2,295
Quote:
Originally Posted by meynaf View Post
But reads don't automatically pop, do they ?
Anyway, don't you fear it's gonna overload the register file by too much data to move around ? I mean, simple EXG (= 2 writes) would become 128 bits to change in a single instruction...
No, only pop op-codes pop.
It would still be just 64 bits to change with EXG. It actually makes EXG internally easier, since the CPU does not need to temporarily store the content of one register somewhere else ...

Quote:
So registers will also have settings ? It's not becoming simpler
For a VM it gets more complicated and slower.
But in hardware it is just a flag - one bit per register that decides whether a write uses the "other" entry or not.

Quote:
It is much shorter than your attempt and does not use "tricks".
That is very debatable - you are in fact using a huge trick there, by using an Intel feature that was only introduced a couple of years ago... it takes a lot of cycles on AMD (until Zen3) and is not available on other architectures...

Quote:
So unreadable maybe, but less so than yours and it is working. Remember, i have a vm so i can test any code.

But actually, yours really does not work, at least, according to your specifications in the first post. Your two last reps will not repeat the right instructions. Remember, the log is supposed to have evolved since...
(Or maybe neither pop nor rep are entered in the log ?)
It does work according to my specifications in the first post!
Since the reps are decoded as two instructions in the log, they refer again to the right ones.

I could also have used
rep -6, -5
for the second one
and
rep -6, -5
or
rep -9, -8
for the third one.

Quote:
Also, must be some fun if an irq comes right in the middle of this code. Or if the coder wants to debug the code using trace mode.
Do you have solutions for these issues ?
Interrupts have to wait until the operation is finished - as usual.
A simple trace would point to the rep - a more sophisticated trace would show you the actual command doublet or triplet.

Last edited by Gorf; 14 December 2020 at 17:24.
Gorf is offline  
Old 14 December 2020, 17:43   #20
meynaf
son of 68k
 
meynaf's Avatar
 
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,323
Quote:
Originally Posted by Gorf View Post
It would still just be 64 bits to change with EXG.
You consider pushes aren't writes maybe ?


Quote:
Originally Posted by Gorf View Post
It actually makes EXG internally easier, since the CPU does not need to store the content of one register temporarily somewhere else ...
The CPU does not need to store the content of one register anywhere. The data is read, then transmitted to the ALU, where it's just wiring to swap the values.


Quote:
Originally Posted by Gorf View Post
For a VM it's getting more complicated and slower.
But in hardware it is just a flag - a bit per register, that decides if a write does use the "other" entry or not.
A flag that must be saved/restored upon context switches as well. It can't be just an internal, hidden flag.


Quote:
Originally Posted by Gorf View Post
That is very debatable - you are in fact using a huge trick there, by using an Intel feature that was only introduced a couple of years ago... it takes a lot of cycles on AMD (until Zen3) and is not available on other architectures...
Who cares ? Intel did it, it's reasonably fast on their cpus, so i have a proof it's perfectly doable in HW. It's not a trick, it's just an instruction - it's not like a feature that would change the whole architecture of the cpu (this is what i regard as a trick).


Quote:
Originally Posted by Gorf View Post
It does work according to my specifications in the first post!
Since the reps are decoded as two instructions in the log, it is referring again to the right ones.

I could also have used
rep -6, -5
for the second one
and
rep -6, -5
or
rep -9, -8
for the third one.
This implies the rep itself never enters the log.
Also implies it's difficult to spot which instructions will be executed.


Quote:
Originally Posted by Gorf View Post
Interrupts have to wait until the operation is finished - as usual.
How can you possibly do that ? After completion of some operation you never know whether there will be another rep after it. When the interrupt has finished, it has polluted the instruction log. And no, you can't fix that by having interrupt code not enter the log - it could just switch to another task. The only way is to save the whole log upon context switches, or even with all exceptions/interrupts...


Quote:
Originally Posted by Gorf View Post
A simple trace would point to the rep - a more sophisticated trace would show you the actual command doublet or triplet.
That's not the problem.
The trace exception would mess up the log by adding its own instructions in it. It's log pollution - same problem as the interrupt above.
meynaf is offline  
 

