12 December 2020, 16:02 | #1 |
Registered User
Join Date: May 2017
Location: Munich/Bavaria
Posts: 2,426
|
Some fancy ideas for an extended (68k?) CISC CPU
I hope this is not too off-topic here....
I was thinking of implementing a CISC CPU (as a VM) that is very close to the 68k or the NS320xx, with an emphasis on code density, and came up with some ideas that might provide some nice benefits ... but I am not sure how useful my ideas would be for real assembler coders, so your input is most welcome.

The instruction words are mostly 16 bits long - that size should cover all operations on registers. Memory operations are longer, of course, due to addresses and immediate values. The first 8 bits are the instruction key, followed by two 4-bit register addresses, hence we have 16 general-purpose registers. Very familiar so far...

Now the improvements:

Stacks of registers
This feature is inspired by stack machines and/or so-called register windowing as in SPARC CPUs, but differs in implementation. Imagine every register is in fact a stack of entries - to keep it simple we start with only two values. Every write to one of our 16 registers is in fact a push operation, copying the existing value to the second (otherwise hidden) entry before the new value is written. Special instructions allow swapping these two values and restoring the former value of a register, of a range of registers and/or of all 16 registers. That feature gives the coder 32 registers without the need for longer instruction words... It could be expanded to deeper stacks of e.g. 4 entries per register.

"I know what you did last summer!"
The CPU keeps track of the last 16 instructions - imagine a kind of internal log or special instruction cache. A special instruction tells the CPU to execute any of these instructions again, in any order (so it is not a jump or loop!). It can repeat either two out of the last 16 instructions or three out of the last 8. This should allow for very compact code. Example:
Code:
Instruction-a---
Instruction-b   |
Instruction-c   |
Instruction-d   |-- log entries
Instruction-e   |
Instruction-f   |
Instruction-g   |
Instruction-h---
Repeating entries -4, -6 and -1 here (counting back from the newest one) executes e, c and h again. Now our instruction log has changed - the special repeat command is decoded into three independent instructions and stored as such in the log:
Code:
Instruction-d---
Instruction-e   |
Instruction-f   |
Instruction-g   |-- log entries
Instruction-h   |
Instruction-e   |
Instruction-c   |
Instruction-h---
Repeat -4, -6, -1
It executes h, f and h, and so on...

The 8-Bit-Turbo
This is a special mode for our CPU that allows the (relatively) fast emulation of old 8-bit CPUs like the Z80, or simply any byte code you come up with. After switching to this mode, the next 8-bit value in the instruction cache is treated as a combination of shift (by a given value), additional offset and jump. So if your byte-code interpreter has a table or library for all its commands, this provides an easy and fast way to jump to the right location - no lookup needed. Just set your table entries e.g. 1k apart, starting at "tablestart". The CPU will shift the value left by 10 bits, add "tablestart" and jump to that location! (The shift value and "tablestart" are stored in special registers beforehand.)

Until now we used only the first 8 bits for this, but a 32-bit machine has already fetched the next three 8-bit values automatically ... these are stored in different CPU registers before the jump, so the interpreter code can use them as values. At the end of each interpreter code block we have a special jump instruction that jumps back to our 8-bit code section and puts the CPU in the special mode again...

OK ... that's it for now. Maybe I am just reinventing the wheel here? I did not research patents ... Anyway - what are your thoughts?
Last edited by Gorf; 12 December 2020 at 20:07.
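The repeat log from the first post can be played through in a few lines of Python. This is only a sketch under assumptions: the class name `InstructionLog`, the `depth` parameter and the use of plain strings as stand-ins for decoded instructions are made up for illustration, and all offsets are taken to be resolved against the log as it stood before the repeat started, which is what the two examples in the post imply.

```python
from collections import deque


class InstructionLog:
    """Toy model of the 'last N instructions' log (hypothetical names)."""

    def __init__(self, depth=16):
        # The oldest entry drops out automatically once the log is full.
        self.entries = deque(maxlen=depth)

    def execute(self, instr):
        """Run one ordinary instruction and record it in the log."""
        self.entries.append(instr)
        return instr

    def repeat(self, *offsets):
        """Replay logged instructions; -1 is the newest entry.

        The repeat itself is not logged. Instead, the instructions it
        expands to are logged as independent entries.
        """
        snapshot = list(self.entries)
        for off in offsets:
            if off >= 0 or -off > len(snapshot):
                raise ValueError(f"offset {off} is outside the log")
        replayed = [snapshot[off] for off in offsets]
        for instr in replayed:
            self.execute(instr)
        return replayed


# The example from the post, with an 8-entry log holding a..h.
log = InstructionLog(depth=8)
for name in "abcdefgh":
    log.execute(name)

first = log.repeat(-4, -6, -1)    # e, c, h
after = "".join(log.entries)      # "defghech"
second = log.repeat(-4, -6, -1)   # h, f, h
```

Resolving all offsets against a snapshot matters: if each replayed instruction were appended before the next offset was looked up, -1 in the first repeat would already point at the replayed e instead of h.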
12 December 2020, 18:20 | #2 |
son of 68k
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,355
|
Quote:
Real assembler coders often run into annoying limitations. Even though the 68k is by far the most friendly CPU there is - especially the 68020+ - it still has its issues. Yes, a CISC CPU as a VM might be a great idea, and not that difficult to implement if you don't focus too much on actual speed. Actually, I even have one in development. Quote:
Quote:
Do you have some example code in mind, or was it just food for thought? Quote:
Quote:
There is nothing wrong with reinventing the wheel. You might come up with a better wheel.
12 December 2020, 19:15 | #3 |
Registered User
Join Date: May 2017
Location: Munich/Bavaria
Posts: 2,426
|
Quote:
So things that could be done fast or in parallel in hardware have priority over things that would run fast in software. Quote:
I do not want to go full RISC here... Also, there is always the possibility of an escape code that uses the next 16 bits as additional opcode bits. Quote:
This probably has no benefit at all in a VM ... But in hardware, swapping one or several registers with their counterparts could be done in one cycle ... (Actually, this would be implemented as 64-bit (or 128-bit) wide registers, and we would only change which part of the register the CPU looks at: the lower or the upper part ...) Quote:
My idea was again based on an imaginary hardware implementation: it would use already decoded op-codes in the CPU, which would make the repeat call not only smaller but also faster. How exactly would your auto-vectorization work? Quote:
This idea was born from a "what if" scenario ... how e.g. Commodore could have provided some C64 compatibility in a new 32-bit CPU without limiting it to just one architecture, being as flexible as possible and achieving a near 2:1 ratio in per-clock execution, while still doing interpretation and not JIT. But this should also speed up the byte code of a scripting language. Quote:
Last edited by Gorf; 12 December 2020 at 19:50. |
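The "wide register with a selectable half" idea from this post can be sketched in Python as follows. Everything here is an assumption made for illustration: the names `ShadowRegister`, `RegisterFile`, `pop` and `swap_many` are hypothetical, and the write-flips-the-select-bit behaviour is just one way to get the "every write is a push" semantics of the first post with two entries per register.

```python
MASK32 = 0xFFFFFFFF


class ShadowRegister:
    """One architectural register backed by two 32-bit halves."""

    def __init__(self):
        self.halves = [0, 0]
        self.select = 0          # which half the CPU currently "looks at"

    def read(self):
        return self.halves[self.select]

    def write(self, value):
        # A write acts as a push: flip to the other half and store there,
        # so the previous value stays behind in the now hidden half.
        self.select ^= 1
        self.halves[self.select] = value & MASK32

    def swap(self):
        # No data moves, only the select bit changes.
        self.select ^= 1

    def pop(self):
        # Return the visible value and make the previous one visible again.
        value = self.read()
        self.swap()
        return value


class RegisterFile:
    def __init__(self, count=16):
        self.regs = [ShadowRegister() for _ in range(count)]

    def swap_many(self, mask):
        """Swap every register whose bit is set in mask (bit 0 = register 0)."""
        for index, reg in enumerate(self.regs):
            if (mask >> index) & 1:
                reg.swap()
```

In this model a swap of any subset of registers is only a change of select bits, which is why it could plausibly be a single-cycle operation in hardware while bringing little to a software VM.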
12 December 2020, 20:57 | #4 |
son of 68k
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,355
|
Quote:
In addition, the pursuit of speed often limits ease of coding, which in turn drives coders away from asm, which in turn is bad for speed. Beware of the altar of speed. It's evil. Quote:
Quote:
The D/A split was just pushed too far in how it limits possibilities. Quote:
Not to mention that cases where more than 16 regs are needed are quite rare. Quote:
Ideally, by not having to add any new instructions at all. Internally there are two modes: the normal one and the vector one. The CPU simply records where it last saw an incompatible instruction (one that cannot be converted to a parallel version). Then, if a loop instruction is seen and it does not go beyond this limit, a new mode is activated that chooses a vector version of all instructions in the loop. The vector size can be anything. Even with no vectorized implementation at all, the same code would still work. Vector registers are internal and never seen by the programmer. I don't know if this is even possible, of course, but if it is, it would mean the end of SIMD extensions being made obsolete by each new version and leaving an awful legacy behind (think of MMX -> SSE1 -> SSE2 -> SSE4.1 -> SSE4.2 -> AVX -> AVX2...). Now you just change the vector size and all existing programs benefit from it. Quote:
However, it's just 3 instructions to read the byte, fetch the address and branch, and you won't escape the pipeline stall - or a strong OoO core could actually fetch the next emulated instruction and find out the address while the previous one isn't finished, leaving zero benefit in doing that with a single instruction or a special mode. Anyway, if I were you I would not focus on the special features but rather on the general-purpose instruction set. IMO this is too often overlooked.
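For reference, the two dispatch styles being compared here can be written down as a short Python sketch. It only models the address arithmetic, not timing or pipeline behaviour, and it rests on assumptions: the function names are hypothetical, and the byte-code mode is taken to compute `tablestart + (opcode << shift)` and to hand the three following bytes of the 32-bit fetch to the handler, as described in the first post.

```python
def table_dispatch(opcode, handler_table):
    """Classic interpreter step: read the byte, look the address up, branch."""
    return handler_table[opcode]


def turbo_dispatch(fetched, tablestart, shift):
    """Byte-code mode step (assumed semantics, see lead-in).

    fetched is one 32-bit fetch as four bytes: the opcode followed by
    three bytes that are passed on to the handler as operands.
    """
    opcode, *operands = fetched
    target = tablestart + (opcode << shift)
    return target, operands


# Handlers spaced 1k apart, starting at a made-up address.
TABLESTART = 0x10000
SHIFT = 10

target, operands = turbo_dispatch(bytes([0x3E, 0x12, 0x34, 0x56]),
                                  TABLESTART, SHIFT)
# target is 0x1F800, operands are [0x12, 0x34, 0x56]
```

Both variants end in the same indirect branch, so the saving is in instruction count and code size; whether it is also faster depends on the implementation, which is the point under discussion.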
12 December 2020, 21:33 | #5 |
Registered User
Join Date: Mar 2012
Location: Norfolk, UK
Posts: 1,157
|
You guys might find the forum at anycpu.org interesting.
|
13 December 2020, 02:05 | #6 |
Registered User
Join Date: May 2017
Location: Munich/Bavaria
Posts: 2,426
|
Quote:
That's why I am asking here how much sense they make for experienced coders. Quote:
But then we are no longer talking about something one person alone could design .... But I take your hint that more than 16 registers are not as relevant for coders as I might have thought. Quote:
Quote:
I guess you are thinking in terms of a modern CPU and GHz, while I am thinking of some small hobby project - a proof-of-concept thingy. After a software VM, the best-case scenario would be that I manage to learn enough FPGA HDL to implement it on such a board ... and then we are talking about maybe 100 MHz. Quote:
So far: more than 16 registers are not as needed as I thought. Byte-code mode depends on whether it is really faster ... Maybe the "repeat" feature? Quote:
But it definitely is food for thought! Last edited by Gorf; 13 December 2020 at 16:24.
13 December 2020, 10:23 | #7 |
son of 68k
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,355
|
Quote:
Easier to code means fewer instructions for doing the same work, which in turn enhances code density. Quote:
Quote:
This is easy to fix: use a separate register for the stack pointer (instead of using A7), add some kind of data pointer for the small data model (for which A4 or A5 is typically used), add features for the high parts of registers, allow more operations on address registers, and you're done. Quote:
Quote:
Instead of designing the CPU for speedy things that will never be fast, why not put that at a higher level? My VM contains features that the emulated CPU can call, and which then run at native speed. While it is an emulated machine running on an emulated machine, it still does 2.5Gb/s copymem... Well, you could concentrate on code density. Design the full encoding, take a few examples and see how it manages them. If you think in terms of special features, you can also see how they fare in these examples.
13 December 2020, 16:01 | #8 |
Registered User
Join Date: May 2017
Location: Munich/Bavaria
Posts: 2,426
|
Quote:
And that is exactly what e.g. the "repeat x,y,z" command does: it is effectively three commands in one. Same with the "byte-code mode": it is just one command instead of three for read byte, fetch address and branch. The "register stacks" would make many lines obsolete where one saves a register to RAM just to load it again later: just one short command instead of at least two lengthy ones that need memory addresses. The gain is much bigger if you want to save a whole bunch of registers... Imagine your program jumps back and forth between two subroutines: wouldn't it be nice to switch some of the registers back and forth with just one command? Quote:
Why would we use a register machine at all? The Transputer had very fast internal RAM (1-2kB), and its designers made the same argument ... and mostly got rid of registers in their design, shortening the 15 most common op-codes to just one byte (4-bit key + 4-bit value). If you want to keep registers because you consider them convenient, then at least there is no use for real data registers: all registers should just be shortcuts for locations in RAM, since it is the same speed anyway. (This would also spare you from loading and saving registers - you are just pointing to locations in RAM at all times...) But my goal is to follow a more classic approach - as I said: I am just reinventing the wheel. Quote:
As you pointed out: the space for the op-codes gets very cramped, but I think I have some ideas and tricks to make that possible... Quote:
and even a 16-bit immediate value or offset would not fit together with the actual op-code in just 16 bits... But I have of course planned a series of commands using offsets. These offsets are 8 bits plus a shift (remember, there is a special shift-value register), which fits nicely in a 16-bit-wide op-code. That gives you effectively a 9-bit range in most cases, if your structures are word-aligned ... If you work with other data structures and can guarantee e.g. 4-byte alignment (BCPL) or 1k alignment, you can set your shift-value register accordingly. I would skip the 16-bit offset, since it messes up my instruction fetches - the next offset size is 24 bits, which results in a 32-bit-wide op-code. Quote:
Quote:
Here I just wanted to see if some ideas I considered useful or helpful would resonate. Well ... it does not look that way ... so: Quote:
Quote:
Last edited by Gorf; 13 December 2020 at 17:01. |
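The "8 bits plus a shift" offsets mentioned in this post boil down to a small piece of arithmetic, sketched here in Python. The function names are hypothetical, and the offset field is assumed to be unsigned, which the post does not state; a signed field would halve the positive reach.

```python
def effective_offset(offset8, shift):
    """Byte offset reached by an 8-bit offset field scaled by the shift register."""
    if not 0 <= offset8 <= 0xFF:
        raise ValueError("offset field must fit in 8 bits")
    return offset8 << shift


def encodable(byte_offset, shift):
    """True if byte_offset can be expressed with the 8-bit field at this shift."""
    aligned = byte_offset % (1 << shift) == 0
    return aligned and 0 <= (byte_offset >> shift) <= 0xFF


# Word-aligned structures (shift = 1): 8 bits reach byte offsets 0..510,
# i.e. effectively a 9-bit range, but only even offsets.
word_reach = effective_offset(0xFF, 1)

# 4-byte alignment (shift = 2) and 1k alignment (shift = 10).
long_reach = effective_offset(0xFF, 2)
block_reach = effective_offset(0xFF, 10)
```

The trade-off is visible in `encodable`: a larger shift extends the reach but excludes every offset that is not a multiple of the alignment, so those would need the longer 24-bit form.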
13 December 2020, 17:12 | #9 |
son of 68k
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,355
|
Quote:
Quote:
Register windows failed because of this - at some point you *have* to push them to some memory place. Quote:
Besides, RAM is as fast as registers, but not for all operations. Regular data access, yes, but not when used as a pointer (aka double indirect). Quote:
Consider data in the low word of a 32-bit register, and now you're out of registers to hold the loop counter. My VM has an instruction using the high word as loop counter, so no swap is necessary. But with a register window you still need to swap the register twice, adding two instructions. Quote:
And if it is your stack pointer, you may not even need to load it at all. Quote:
Quote:
Quote:
Not really. You only mentioned it. Code density does not come from using special tricks. It comes from a good overall encoding method, an efficient instruction set, short immediates... The more code is written, the more "nice features" start to backfire.
13 December 2020, 19:26 | #10 |
Registered User
Join Date: May 2017
Location: Munich/Bavaria
Posts: 2,426
|
Quote:
Quote:
Quote:
The TI TMS9900 had "virtual" register banks in RAM and a real register pointing to a bank - that made sense back then and would make sense again with a very fast cache. As I already pointed out: without registers and with only a stack, you can use even shorter op-codes ... see the Transputer or Forth CPUs like the RTX2010, Novix NC4000 and GreenArrays. Also see PERL-CPU (Performance Enhanced Registerless) https://www.researchgate.net/profile...ication_detail But all that is a different kind of discussion ... here I just wanted to see if coders would like the features I mentioned in my first post... Quote:
(can be turned on and off) Your idea is very nice, but also limited to 16-bit operations ... if you need the full 32-bit width of your register this won't help, but my method would. Quote:
But I was just saying, that you need a longer command for loading longer addresses or data ... Quote:
Quote:
Quote:
I have a lot of other ideas as well and of course all the obvious and well known ways of reducing code size are considered or already part of the concept. But here I was only talking about the 3 features I mentioned in my first post. Quote:
|
13 December 2020, 19:37 | #11 |
Registered User
Join Date: Jan 2019
Location: Germany
Posts: 3,310
|
Pointless. Rather pointless. Don't we have the Vampire already for those that want to experiment with alternative 68K interpretations? Who will write software for that? I certainly won't.
|
13 December 2020, 19:51 | #12 |
Registered User
Join Date: May 2017
Location: Munich/Bavaria
Posts: 2,426
|
Quote:
I am not aiming at revolutionizing the world ... Quote:
But if you think this thread does not belong here, a moderator might move it to some other place or delete it - I did not want to cause any fuss.
13 December 2020, 20:27 | #13 |
son of 68k
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,355
|
That depends on the size of the register window. If you just have a second shadow register, don't expect it to be of any use for replacing regular register saving.
Quote:
Quote:
You have to specify somewhere which registers are gonna be swapped, don't you? Quote:
My method at least does not require additional instructions to perform swapping. Yep, from the OS loading the program... Quote:
Quote:
Quote:
Doesn't the microprocessor industry have many examples of original designs that looked great at first sight and failed? Quote:
I certainly won't ask you to write any software, so that's fine. |
13 December 2020, 20:44 | #14 |
Registered User
Join Date: Jan 2019
Location: Germany
Posts: 3,310
|
|
13 December 2020, 21:20 | #15 |
Registered User
Join Date: May 2017
Location: Munich/Bavaria
Posts: 2,426
|
As an example, let's take the question from the neighboring thread:
"Quickest way to add two colours" a/b has proposed a solution for generating a table: http://eab.abime.net/showpost.php?p=1445979&postcount=7 Here is how we could reduce the size of the loop with my ideas: Code:
loop
    moveq #15,d0
    and.w d7,d0
    ror.w #4,d7
    moveq #15,d1
    and.w d7,d1
    ror.w #4,d7
    moveq #15,d2
    and.w d7,d2
    ror.w #4,d7
    moveq #15,d3
    and.w d7,d3
    ror.w #4,d7
    add.w d2,d0
    add.w d3,d1
    .....
becomes:
loop
    moveq #15,d3
    and.w d7,d3      ; implicit "switch" on d3
    ror.w #4,d7
    pop d3,d0        ; moves d3 to d0 and switches back d3 to #15
    rep.2 -3, -2     ; does and.w d7,d3 and ror.w #4,d7
    pop d3,d1
    rep.2 -3, -2
    pop d3,d2
    rep.2 -3, -2
    add.w d2,d0
    add.w d3,d1
Last edited by Gorf; 13 December 2020 at 22:25.
14 December 2020, 09:29 | #16 |
son of 68k
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,355
|
Quote:
Quote:
and but not with add or ror? Where's the sense in that? What if in some other example the switch is actually not wanted? In the end: code not readable - maybe also not working. But a very nice example actually! It's real-life work. For your reading pleasure (or not), here's the version for my VM (with instructions renamed to be more 68k-like): Code:
pdep #$0f0f0f0f,d7:d0   ; ****GBgb -> 0G0B0g0b
moverp d0.h,d1          ; d1=****0G0B
add.w d1,d0             ; d0=****0G0B + 0g0b (-> ******0B + 0b)
moverp d0.t,d1          ; d1=******0G + 0g
Ask me if you want an explanation of individual instructions.
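The first two steps of this snippet can be followed in Python. This is a sketch under a loud assumption: the VM's `pdep` is taken to behave like the x86 BMI2 PDEP instruction (low source bits are deposited, in order, into the set bit positions of the mask), which matches the `****GBgb -> 0G0B0g0b` comment but is not confirmed anywhere in the thread. `moverp` is not modelled; the high word is simply extracted with a shift.

```python
def pdep(src, mask):
    """Parallel bit deposit, modelled on x86 BMI2 PDEP (assumed semantics)."""
    result = 0
    src_bit = 0      # next source bit to hand out, starting at bit 0
    pos = 0          # current bit position in the mask
    while mask >> pos:
        if (mask >> pos) & 1:
            if (src >> src_bit) & 1:
                result |= 1 << pos
            src_bit += 1
        pos += 1
    return result


# Two colours packed as nibbles G B g b in the low word of d7.
d7 = 0x9C47                         # G=9, B=C, g=4, b=7
d0 = pdep(d7, 0x0F0F0F0F)           # 0x090C0407: one nibble per byte
high_word = d0 >> 16                # 0x090C, colour 1 as 0G0B
low_word = d0 & 0xFFFF              # 0x0407, colour 2 as 0g0b
summed = (high_word + low_word) & 0xFFFF   # 0x0D13: G+g = 0x0D, B+b = 0x13
```

Spreading each nibble into its own byte is what makes the single `add.w` safe: every per-channel sum has four spare bits, so a carry out of B+b cannot spill into G+g.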
14 December 2020, 12:12 | #17 |
Registered User
Join Date: May 2017
Location: Munich/Bavaria
Posts: 2,426
|
Quote:
As I wrote in my first post: all write operations act as a "push" on a register. The new value is now the current one, the old value the other one. Quote:
I just pointed this out in an extra comment in the one case where we make use of it later. Quote:
This might be useful for registers containing addresses - but it is not the default behavior. (Not done in this example) Quote:
This code is not readable and maybe also not working Quote:
Last edited by Gorf; 14 December 2020 at 12:31. |
14 December 2020, 16:47 | #18 |
son of 68k
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,355
|
Quote:
Anyway, don't you fear it's gonna overload the register file with too much data to move around? I mean, a simple EXG (= 2 writes) would become 128 bits to change in a single instruction... Quote:
Quote:
It is much shorter than your attempt and does not use "tricks". So unreadable maybe, but less so than yours, and it is working. Remember, I have a VM, so I can test any code. But actually, yours really does not work, at least according to your specifications in the first post. Your last two reps will not repeat the right instructions. Remember, the log is supposed to have evolved since... (Or maybe neither pop nor rep is entered in the log?) Also, it must be some fun if an IRQ comes right in the middle of this code. Or if the coder wants to debug the code using trace mode. Do you have solutions for these issues?
14 December 2020, 17:08 | #19 |
Registered User
Join Date: May 2017
Location: Munich/Bavaria
Posts: 2,426
|
Quote:
It would still just be 64 bits to change with EXG. It actually makes EXG internally easier, since the CPU does not need to store the content of one register temporarily somewhere else ... Quote:
But in hardware it is just a flag - a bit per register that decides whether a write uses the "other" entry or not. Quote:
Quote:
Since the reps are decoded as two instructions in the log, each one again refers to the right ones. I could also have used rep -6, -5 for the second one and rep -6, -5 or rep -9, -8 for the third one. Quote:
A simple trace would point to the rep - a more sophisticated trace would show you the actual command doublet or triplet. Last edited by Gorf; 14 December 2020 at 17:24.
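The claim about the rep offsets in the colour example can be checked mechanically. The Python sketch below is an assumption-laden model, not the author's implementation: `expand` is a hypothetical helper, instructions are plain strings, a rep is written as a tuple, `pop` is assumed to be logged like any other instruction, and a rep is assumed to enter the log only as the instructions it expands to.

```python
def expand(program, depth=16):
    """Return the flat instruction stream after expanding every rep.

    A rep is a tuple ("rep", off1, off2, ...) whose negative offsets
    index the last `depth` executed instructions, -1 being the newest.
    """
    executed = []
    for instr in program:
        if isinstance(instr, tuple):
            window = executed[-depth:]
            # All offsets are resolved before anything new is logged.
            executed.extend([window[off] for off in instr[1:]])
        else:
            executed.append(instr)
    return executed


def colour_loop(rep2, rep3):
    """The shortened loop from post #15 with selectable offsets for reps 2 and 3."""
    return [
        "moveq #15,d3", "and.w d7,d3", "ror.w #4,d7",
        "pop d3,d0", ("rep", -3, -2),
        "pop d3,d1", ("rep",) + rep2,
        "pop d3,d2", ("rep",) + rep3,
    ]


as_posted = expand(colour_loop((-3, -2), (-3, -2)))
variant_a = expand(colour_loop((-6, -5), (-6, -5)))
variant_b = expand(colour_loop((-6, -5), (-9, -8)))
# All three streams are identical: every rep replays "and.w d7,d3"
# followed by "ror.w #4,d7".
```

Under these assumptions the posted code does repeat the intended pair each time. If rep itself were logged as a single entry instead, the offsets would have to be different, which is the ambiguity raised in post #18.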
14 December 2020, 17:43 | #20 |
son of 68k
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,355
|
Maybe you consider that pushes aren't writes?
Quote:
Quote:
Quote:
Quote:
It also implies that it's difficult to spot which instructions will be executed. Quote:
Quote:
The trace exception would mess up the log by adding its own instructions to it. It's log pollution - the same problem as the interrupt above.