Enhanced 68k ISA

matthey · 03 August 2016, 04:47

I was recently asked some questions about the enhanced 68k ISA I worked on a few years ago but my old documentation is no longer available elsewhere. I have added the 68kF_PRMv7.pdf as an attachment. Sorry that pdf is the only recognized EAB attachment that works well.

First a little history. I was contacted by Gunnar and asked to join the Apollo Team and help create an enhanced FPGA 68k CPU, which I did. I took the initiative, without being asked, to document ISA enhancements to help evaluate 68k ISA ideas for the apollo-core.com FPGA project. Originally I named the document ApolloPRM but decided it would be advantageous to have an ISA usable by other 68k FPGA processors also. I came up with the 68kF (68koolFusion) name with the idea to sound like kool fun and the Fusion part being a fusion of 68k and ColdFire. It turns out the ISA is more like cold Fusion as it fizzled when Gunnar decided he was going to make all the decisions himself while abandoning all previous work and our work group ideas.

I cleaned up the documentation a little recently including some new ideas and going back to the extended precision FPU (which I favor especially for compatibility). I added the ColdFire BYTEREV and BITREV as REVB.L (Reverse Bytes) and BREV.L (Bit Reverse) respectively which was talked about in another thread including mentioning the possibility of fusing with a MOVE.L. I tried to create a REVB.W using ROR.W #8,Dn but the CC would be set differently so it wouldn't have been consistent. One of the last changes was for BScc (Bit Set Condition) including the name change as suggested by meynaf to better match the 68k naming conventions. I came up with an idea to add an operation to the other bits but I don't know the best syntax to specify it. Two bits in the encoding can easily be used to specify an operation on the other bits including no operation, change the bits, clear the bits and set the bits. I believe this would be a simple enough operation but makes the instructions much more powerful. The following are some ideas of possible syntax to set or clr bit #0 according to CC while clearing other bits.

Code:

    bsne #0:clr_other,(a0)
    bsne #0,(a0),clr_other
    bsne_clr #0,(a0)

The functionality is now perfect for setting a bool in C99, which is a byte in the CISC implementations I've seen (and created for vbcc). Of course BScc would allow a bool to only be one bit and is perfect for working with bit packed settings like the AmigaOS commonly uses. I don't know the best syntax but it really doesn't matter as the ISA is dead. Gunnar created his toy and the retro 68k FPGA guys don't want to add enhancements. This is really only of interest for the last of us old school 68k Amiga intellectual geeks. Feel free to discuss any 68k enhancements (ISA, ABI or CPU design) in this thread.
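
For readers who want to experiment with the idea, here is a minimal C sketch of what BScc with an "other bits" operation could compute on a byte. The function name `bscc_byte` and the `other_op` enum are hypothetical and only model the four operations listed above (no operation, change, clear, set); the real encoding and syntax were never settled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names for the four "other bits" operations. */
enum other_op { OTHER_NONE, OTHER_CHANGE, OTHER_CLEAR, OTHER_SET };

/* Model of BScc on a byte: bit `bit` receives the condition result,
   the remaining bits are transformed according to `op`. */
static uint8_t bscc_byte(uint8_t value, unsigned bit, bool cond, enum other_op op)
{
    assert(bit < 8);
    uint8_t mask = (uint8_t)(1u << bit);
    uint8_t others = (uint8_t)(value & ~mask);

    switch (op) {
    case OTHER_NONE:   break;                                  /* keep other bits */
    case OTHER_CHANGE: others = (uint8_t)(~others & ~mask); break; /* invert them */
    case OTHER_CLEAR:  others = 0; break;                      /* clear them */
    case OTHER_SET:    others = (uint8_t)~mask; break;         /* set them */
    }
    return (uint8_t)(others | (cond ? mask : 0));
}
```

With `OTHER_CLEAR` and bit 0 the result is always 0 or 1, which is the C99 bool case described above.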

buggs · 03 August 2016, 07:54

This one can be fun. Off the top of my head, two things I'd be happy to see:
1. cmove is one instruction I came to enjoy while away from the Amiga. Conditional moves instead of the usual 68k bcc, move combination are quite convenient, at least in Asm code. Edit: found SELcc

2. One of the favorite toys of you Apollo guys is missing from the PDF: the SIMD stuff.

Otherwise, thank you for the update regarding the ISA. Which brings me to one more thing: How can one reliably detect a CPU with the feature set in question (at different core development levels)? Do we have to probe instruction by instruction or is there already something like "CPUID"?

meynaf · 03 August 2016, 10:48

Quote:

Originally Posted by buggs

1. cmove is one instruction I came to enjoy while away from the Amiga. Conditional moves instead of the usual 68k bcc, move combination are quite convenient, at least in Asm code.

A macro is enough for the conditional move. In hardware the case can be detected and the branch merged with the instruction. Other instructions can as well be merged this way, so you have conditional add, etc.

Adding a conditional move instruction wouldn't help for code density because it would have to be a 32-bit instruction (due to lack of encoding space for it).

Quote:

Originally Posted by buggs

Edit: found SELcc

Nice in theory, but the most common case i've found was the choice between two immediates or two memory addresses (LEA)...

Quote:

Originally Posted by buggs

2. One of the favorite toys of you Apollo guys is missing from the PDF: the SIMD stuff.

It doesn't seem to be fully documented anywhere. Seems they didn't want to look ridiculous

Quote:

Originally Posted by matthey

I was recently asked some questions about the enhanced 68k ISA I worked on a few years ago but my old documentation is no longer available elsewhere. I have added the 68kF_PRMv7.pdf as an attachment. Sorry that pdf is the only recognized EAB attachment that works well.

Some remarks :
1. Your ABS instruction documents a 4-bit field for the register, while saying it works on Dn. To the best of my knowledge you didn't add extra data registers so it ought to be a 3-bit field.
2. You document longword only for ADDQ to address regs. But word size exists and is used in programs - even though it's useless as it leads to exactly the same result.
3. You have a branch hint bit. Do you remember what Gunnar said about what occurred with this bit in the Power cpu ? You are a compiler writer so you should know that the compiler doesn't have this kind of info, btw.
4. You used Gunnar's dirty encoding for DBcc.L. This ain't good for assemblers and disassemblers and isn't overall nice. Just set b7=b6=0 from normal DBcc and you're done.
5. Perhaps you could simply mention REMU/REMS as alias for DIVUR/DIVSR, instead of having one entry for each.
6. How could you enable (An)+ and -(An) for JMP/JSR ? It's unsized, so what's the operation ? Same for LEA/PEA.
7. SATS is useless. Yes, really. I can develop if you want. Clearly I wouldn't have put it.
8. You seem to be moving 4-bit condition fields here and there. That's no good. At least BScc and SELcc should have it at the same place.
9. I don't see SELcc as very useful. One could just preload the target register with case 1, then test the condition, then conditionally load the target with case 2. Not fully against it, but the encoding really isn't nice.

Quote:

Originally Posted by matthey

I cleaned up the documentation a little recently including some new ideas and going back to the extended precision FPU (which I favor especially for compatibility). I added the ColdFire BYTEREV and BITREV as REVB.L (Reverse Bytes) and BREV.L (Bit Reverse) respectively which was talked about in another thread including mentioning the possibility of fusing with a MOVE.L.

BREV and REVB look a little bit ambiguous to me. I'd favor BITREV/BYTEREV not only because of the CF, but also because it's more readable.

Quote:

Originally Posted by matthey

One of the last changes was for BScc (Bit Set Condition) including the name change as suggested by meynaf to better match the 68k naming conventions.

Thanks

Quote:

Originally Posted by matthey

Gunnar created his toy and the retro 68k FPGA guys don't want to add enhancements. This is really only of interest for the last of us old school 68k Amiga intellectual geeks.

If enough of us agree (yeah i can dream) then perhaps some HW guy could come up and do something.

Quote:

Originally Posted by matthey

Feel free to discuss any 68k enhancements (ISA, ABI or CPU design) in this thread.

Let's go for it. I present nearly all additions i'd like to see here.

I start by adding the Coldfire's useful stuff : MVZ, MVS, BITREV, BYTEREV.

Write to PC-relative is now allowed, unless (like in the case of Scc) the encoding has been stolen.

Address regs are supported wherever this has a natural encoding and that encoding isn't stolen by something else (i.e. not for Scc).
This means you can now do MOVEA.B to use the same extend trick as MOVEA.W. Note : unsigned extend (more common for bytes).

Code:

0001<r>0 01< ea >            movea.b ea,An

As we have byte access to An, the reason why MOVEM.B didn't exist is removed. Massive byte-extend in a single instruction is now available.

Code:

00101001 11< ea >            movem.b ea,rlist
00111001 11< ea >            movem.b rlist,ea

Quite often you have to extend a byte when indexing. This is quite painful.
Therefore the byte index is added.
If compatibility didn't have to be maintained, i'd have done this otherwise.

Code:

  110<r> d<r>0ss1 00111000        (an,dn.b)
  110<r> d<r>0ss1 00111010        d16(an,dn.b)
  110<r> d<r>0ss1 00111011        d32(an,dn.b)
  111011 d<r>0ss1 00111010        d16(pc,dn.b)
  111011 d<r>0ss1 00111011        d32(pc,dn.b)

Another addressing mode "added". Normal (d32,pc) is quite large, larger than a 32-bit address. This fixes that problem so people who hate relocs should like it.

Code:

  111101                d32(pc)

Quite often you use a register to point to BSS. It's quite nice but nevertheless eats a register. Having some constant address doesn't need all the addressing flexibility of A4/A5 that are used for this purpose, and therefore a new read-only register could do the trick.
Alas it needs OS patches to be saved and therefore i won't defend this much.
Like SP, it could have a different value for user and supervisor, hence the UBP here.
Thus :

Code:

  111110                d16(bp)
  111111                d8(bp,ix)
01001110 01111010 d<r>1111 11111111    movec bp,rn    (base pointer)
01001110 01111011 d<r>1111 11111111    movec rn,bp
01001110 01111010 d<r>1111 11111110    movec ubp,rn   (user base pointer)
01001110 01111011 d<r>1111 11111110    movec rn,ubp

Bit-fields are extended to support a "reverse" mode.
The new mode makes the bit-field behave like this :
- for the position, we name the bit like for BTST (and it's the last bit of the field we target here) ; the operation is limited to 32 bits
- for the size, we use (32-n)
This gives the following bit-field word extension encoding :

Code:

    b15    D/A
    b14-12 reg
    b11    static/dynamic
    b10    mode reverse (pos)
    b9     D/A (pos)
    b8-6   reg (pos)
    b5     static/dynamic
    b4     mode reverse (len)
    b3     D/A (len)
    b2-0   reg (len)

Like Matt did, we now have LEA to Dn. Ok it was originally my idea

Anyone knowing the x86 well enough knows the LEA to EAX trick. There you can do it as well. Also remember you can do things such as MOVE.x (Dn.l),ea - for the cases you're short of address regs.

Code:

0100<r>1 01< ea >            lea    ea,Dn

A few useful ops here, with a similar encoding.
ABS is like in Matt's doc.
BITCNT is called POPCNT but i don't like this name. Even if it's more or less "standard issue", we count bits, not people.
DUP will duplicate the lowest byte into all others, i.e. $12345678 will become $78787878.

Code:

00000000 11001<r>            dup    Dn
00000010 11001<r>            abs    Dn
00000100 11001<r>            bitcnt Dn
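
To make the three operations concrete, here is a small C sketch of what they would compute on a 32-bit register. The function names are made up for illustration, and the behaviour of ABS on $80000000 is an assumption since it isn't specified above.

```c
#include <stdint.h>

/* DUP: copy the lowest byte into all four byte lanes. */
static uint32_t dup_byte(uint32_t x)
{
    return (x & 0xFFu) * 0x01010101u;
}

/* ABS: two's-complement absolute value. 0x80000000 maps to itself here
   (overflow behaviour is an assumption, it is not specified above). */
static uint32_t abs32(uint32_t x)
{
    return (x & 0x80000000u) ? (uint32_t)(0u - x) : x;
}

/* BITCNT: number of set bits, clearing the lowest set bit each round. */
static unsigned bitcnt32(uint32_t x)
{
    unsigned n = 0;
    while (x != 0) {
        x &= x - 1;
        n++;
    }
    return n;
}
```

The loop in `bitcnt32` runs once per set bit, which shows why a single-instruction version is attractive.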

Sometimes you have to test some value then clear it (like "process this code once if the value is set"). It involves clearing the value and thus repeating the same EA - a pain for modes such as (An)+.
Hence the TAC (Test And Clear) instruction, testing the byte like TST.B and clearing it.

Code:

00011001 11< ea >            tac    ea

Good enough coders should know about BTST Dn,#i.
Dynamic bit-testing of a constant is a nice trick, however limited to 8 bits.
This limit will go away.

Code:

01001011 11000<r>            btst    Dn,#i16
01001101 11000<r>            btst    Dn,#i32
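
As an illustration of why a wider constant helps, here is a C sketch of the trick used as a set membership test. The helper name `btst_const32` and the example mask are hypothetical, and taking the bit number modulo 32 is an assumption about how the 32-bit form would behave.

```c
#include <stdbool.h>
#include <stdint.h>

/* Returns true when bit (index mod 32) of the constant is set.
   On the 68k, BTST reports this through the Z flag (Z = bit is clear). */
static bool btst_const32(uint32_t index, uint32_t constant)
{
    return ((constant >> (index & 31u)) & 1u) != 0;
}

/* Example constant: membership mask for the set {1, 4, 9, 16, 25}. */
#define SQUARES_MASK ((1u << 1) | (1u << 4) | (1u << 9) | (1u << 16) | (1u << 25))
```

With an 8-bit constant only values 0 to 7 can be tested this way; the 16 and 32-bit forms widen the set to 0 to 15 and 0 to 31.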

I don't like adding an addressing mode for short immediates : it will only be useful for a handful of cases.
MOVE.L #16bit,Dn can be handled with LEA adr.w,Dn.
ADD/SUB are useful but relatively rare.
OR/AND/EOR are useless - just operate on the lower part of the reg.
CMP is quite useful but we can add it alone :

Code:

01001000 01100<r>            cmpq.l    #i16,Dn

Normal DBcc has many limitations. It's one of the major reasons for data register shortage, is zero-based and sometimes it's an annoyance, and of course can't operate on anything but a word size.
In addition it works on the lower word only and having it sometimes on the high word would spare a register.
I'm not sure full DBcc.L is really useful even though i have an encoding for it. I've put simple DBF.L here, this probably needs discussion.

Code:

01000001 11000<r>            dbfh    Dn,d16    (high word dbf)
01000011 11000<r>           dbf.b    Dn,d16
01000101 11000<r>            dbf1    Dn,d16    (base 1 dbf)
01000111 11000<r>            dbfa    An,d16    (dbf An)
01001111 11000<r>           dbf.l    Dn,d16

Square roots are useful sometimes in demos (for 3D stuff). There is nothing for this, apart maybe floating-point SQR, but FP suffers from rounding errors.
I prefer having an integer SQR, and here it is. May be complicated to do in HW but if it can be done in FP, it can be done in integer.

Code:

1100<r>1 10000<r>            sqr    Dn,Dn    (sqr(64) -> 32)
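
For reference, an integer square root needs only shifts, adds and compares, which is part of why it looks feasible in hardware. Below is a C sketch of the classic bit-by-bit method for a 64-bit input and 32-bit result. The name `isqrt64` is made up, and rounding down (floor) is an assumption since the rounding mode isn't stated above.

```c
#include <stdint.h>

/* Floor of the square root of a 64-bit unsigned value. */
static uint32_t isqrt64(uint64_t n)
{
    uint64_t root = 0;
    uint64_t bit = (uint64_t)1 << 62;   /* highest power of four in 64 bits */

    /* Start at the highest power of four not above n. */
    while (bit > n)
        bit >>= 2;

    /* Decide one result bit per iteration, from the top down. */
    while (bit != 0) {
        if (n >= root + bit) {
            n -= root + bit;
            root = (root >> 1) + bit;
        } else {
            root >>= 1;
        }
        bit >>= 2;
    }
    return (uint32_t)root;
}
```

The second loop runs at most 32 times, one iteration per result bit.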

Of course my stuff wouldn't be complete without my bit-set-on-condition.
I don't see targeting other bits as useful and my encoding is different, but Matt's doc explains it quite well.

Code:

00001110 11< ea > <cc>0100 00i< n >    bs<cc>    #n/Dn,ea (bit set from cc)

Normal shifts (LSR & friends) are extended to target any writeable EA, for counts>8 and with an address reg as a counter.
It means you can do ROR.B a0,8(a1) if you want.

Code:

00001110 tt< ea > 00oos100 00i< n >    shift.t #n/Rn,ea

One annoying limit is EOR. We can't EOR from mem. Having a 16-bit version would require stealing the line-A space and i don't want to do this.
Still, a 32-bit version can be useful.

Code:

00001110 tt< ea > 0<r>0101 00000000    eor.t    ea,Dn

x86 has XCHG with mem. Even spartan ARM has SWP. But the 68k doesn't have EXG with mem, and that's often quite a pain.
And so, even though it's quite a large instruction (again due to encoding space constraints), i add it :

Code:

00001110 tt< ea > 0<r>0101 01000000    exg.t    ea,Dn

Quite often you have branches that directly go to RTS. Old z80 has conditional RTS. Now we can have one as well.
RTS is done only if condition is true, else NOP.

Code:

1110<cc> 11111100            rts<cc>

An invention of mine.
It compares the byte in the source register to all 4 bytes in the target register. If found, it puts the index of that byte (0 to 3) in the target register and sets the CCR accordingly.

Code:

0100<r>1 11001<r>            mbcmp    Dn,Dn

The following instruction swaps the bits in the register pair, wherever the bit in the mask (the source operand) is 1. This nicely replaces the complex EOR merges.

Code:

00001110 11< ea > a<r>0100 0100a<r>    mix    ea,rn:rn
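
The "EOR merge" that MIX would replace can be sketched in C as follows. The `reg_pair` struct and the function name `mix32` are only illustrative; the point is that the masked difference of the two registers is XORed into both.

```c
#include <stdint.h>

/* The two registers of the pair after the operation. */
struct reg_pair {
    uint32_t first;
    uint32_t second;
};

/* Swap the bits of a and b wherever mask has a 1, leave the rest alone. */
static struct reg_pair mix32(uint32_t mask, uint32_t a, uint32_t b)
{
    uint32_t diff = (a ^ b) & mask;   /* bits that differ and are selected */
    struct reg_pair out = { a ^ diff, b ^ diff };
    return out;
}
```

Without the instruction this takes a temporary register plus an EOR, an AND and two more EORs.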

RXL/RXR below are same as ROXL/ROXR. However this time the carry isn't a single bit but a whole register (the LSB are used for this, MSB being unchanged).

Code:

00001110 11< ea > 0<r>s100 10i< n >    rxl/rxr #n/dn,dn:ea

Ouch. What a large post.
I have a few other instruction ideas but that's all for now. I don't expect positive feedback ; cpu discussions often end up in flame wars. But who knows.

matthey · 03 August 2016, 17:32

Quote:

Originally Posted by buggs

This one can be fun. Off the top of my head, two things I'd be happy to see:
1. cmove is one instruction I came to enjoy while away from the Amiga. Conditional moves instead of the usual 68k bcc, move combination are quite convenient, at least in Asm code. Edit: found SELcc

2. One of the favorite toys of you Apollo guys is missing from the PDF: the SIMD stuff.

This is *not* the official documentation for the Apollo-core ISA. No part of it was approved or sanctioned by Gunnar although it includes some ideas from him and others who were in the Apollo Team (meynaf and me mostly). SIMD and privileged instructions were generally not included as discussions about them were at an early stage. The active work group of 3 was too small and I tried to get permission to bring in other experts including compiler designers, CPU designers, embedded designers and 68k hardware experts to help evaluate and create the new ISA. It looked like Gunnar was going to use and evaluate some of the ideas before he changed his mind and went completely a different direction. He has since changed his mind again and backed off of some of the incompatibilities we had disagreed on. We still don't know what he has in mind or if he will change his mind once again but the current Apollo-core "documentation" is at the following link.

http://www.apollo-core.com/index.htm?page=instructions

The SELcc instruction from the 68kF ISA is simpler than a CMOVcc style instruction but the instruction can be large and may be unnecessary with predication like the Apollo core has. It needs further evaluation especially in regards to how well compilers could use it. CMOVcc is often no better for more modern x86/x86_64 processors.
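
As a rough illustration of what a select operation buys, here is a C sketch of a branch-free select between two values. The function name `sel32` is hypothetical and this only models the result, not the actual 68kF SELcc operands or encoding.

```c
#include <stdbool.h>
#include <stdint.h>

/* Pick if_true when cond holds, if_false otherwise, without a branch. */
static uint32_t sel32(bool cond, uint32_t if_true, uint32_t if_false)
{
    uint32_t mask = (uint32_t)0 - (uint32_t)cond;   /* all ones when cond is true */
    return (if_true & mask) | (if_false & ~mask);
}
```

Whether a compiler can reliably map code like this onto such an instruction is exactly the open question mentioned above.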

Quote:

Originally Posted by buggs

Which brings me to one more thing: How can one reliably detect a CPU with the feature set in question (at different core development levels)? Do we have to probe instruction by instruction or is there already something like "CPUID"?

A query system for hardware capabilities was discussed but it was not a priority. The ColdFire added it although their system would be inadequate for enhanced 68k processors. ARM has a nice query system but needs it with an overload of variations. I would hope an enhanced 68k would have a more standard set of hardware (SoC) and not as many variations considering it doesn't need to support low end embedded processors where it is not competitive.

Knocker · 03 August 2016, 18:30

I think it would be very useful to have a conditional BSR, "BSRcc". Essentially a Bcc where you can do rts to get back. Would allow for much cleaner code. Or can you use a macro with some local labels to do this somehow?

meynaf · 03 August 2016, 19:18

Quote:

Originally Posted by Knocker

I think it would be very useful to have a conditional BSR, "BSRcc". Essentially a Bcc where you can do rts to get back. Would allow for much cleaner code. Or can you use a macro with some local labels to do this somehow?

You mean, something like this ?

Code:

bsreq macro           ; bsr if eq
 bne .\@
 bsr \1
.\@
 endm

Mrs Beanbag · 03 August 2016, 21:04

Quote:

Originally Posted by meynaf

I start by adding the Coldfire's useful stuff : MVZ, MVS, BITREV, BYTEREV.

Write to PC-relative is now allowed, unless (like in the case of Scc) the encoding has been stolen.

Quote:

Address regs are supported wherever this has a natural encoding and that encoding isn't stolen by something else (i.e. not for Scc).
This means you can now do MOVEA.B to use the same extend trick as MOVEA.W. Note : unsigned extend (more common for bytes).

Maybe more common but then inconsistent with d8(An,Dn) addressing mode, if you care about that.

Quote:

Quite often you have to extend a byte when indexing. This is quite painful.
Therefore the byte index is added.
If compatibility didn't have to be maintained, i'd have done this otherwise.

Code:

  110<r> d<r>0ss1 00111000        (an,dn.b)
  110<r> d<r>0ss1 00111010        d16(an,dn.b)
  110<r> d<r>0ss1 00111011        d32(an,dn.b)
  111011 d<r>0ss1 00111010        d16(pc,dn.b)
  111011 d<r>0ss1 00111011        d32(pc,dn.b)

Another addressing mode "added". Normal (d32,pc) is quite large, larger than a 32-bit address. This fixes that problem so people who hate relocs should like it.

Quote:

Like Matt did, we now have LEA to Dn. Ok it was originally my idea

Quote:

I don't like adding an addressing mode for short immediates : it will only be useful for a handful of cases.

Hmm i don't know, i often find i need to add or subtract a value bigger than 8 but smaller than 64k..

Quote:

Quite often you have branches that directly go to RTS. Old z80 has conditional RTS. Now we can have one as well.

Quote:

Code:

1110<cc> 11111100            rts<cc>

An invention of mine.
It compares the byte in the source register to all 4 bytes in the target register. If found, it puts the index of that byte (0 to 3) in the target register and sets the CCR accordingly.

a tad high-level for my liking, string functions in asm? hmm... x86 is weird enough...

Quote:

Code:

0100<r>1 11001<r>            mbcmp    Dn,Dn

The following instruction swaps the bits in the register pair, wherever the bit in the mask (the source operand) is 1. This nicely replaces the complex EOR merges.

Sometimes what i want to do is conditional AND or OR with register. Like Scc but only sets to -1 (OR) or clears (AND).

meynaf · 03 August 2016, 21:36

Quote:

Originally Posted by Mrs Beanbag

Maybe more common but then inconsistent with d8(An,Dn) addressing mode, if you care about that.

It's true that i care more about usefulness than consistency.

Quote:

Originally Posted by Mrs Beanbag

Hmm i don't know, i often find i need to add or subtract a value bigger than 8 but smaller than 64k..

For values bigger than 8 but smaller than 128 the old moveq+add trick still works and a new instruction wouldn't be smaller.

Quote:

Originally Posted by Mrs Beanbag

a tad high-level for my liking, string functions in asm? hmm... x86 is weird enough...

This is useful for switch..case as well. It's not a string function strictly speaking, just a general-purpose multi-byte cmp.
I planned an immediate mode (but forgot it here). Perhaps even better than the register mode :

Code:

01001110 10000<r>    mbcmp dn,#i32   (disables write ; only sets ccr)
01001110 11000<r>    mbcmp #i8,dn

And thus instead of :

Code:

 cmpi.b #$a9,d0
 beq found
 cmpi.b #$98,d0
 beq found
 cmpi.b #$11,d0
 beq found
 cmpi.b #$fe,d0
 beq found

You could write :

Code:

 mbcmp d0,#$a99811fe
 beq found
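
For anyone wanting to try the semantics, here is a C sketch of the multi-byte compare. The function name `mbcmp` mirrors the mnemonic, but numbering the bytes from the least significant end and returning -1 for "not found" are assumptions; the immediate form would only report found / not found through the CCR.

```c
#include <stdint.h>

/* Returns the index (0 = least significant byte, an assumption) of the first
   byte of `haystack` equal to `needle`, or -1 when none matches. */
static int mbcmp(uint8_t needle, uint32_t haystack)
{
    for (int i = 0; i < 4; i++) {
        if (((haystack >> (8 * i)) & 0xFFu) == needle)
            return i;
    }
    return -1;
}
```

A result other than -1 corresponds to the `beq found` case in the example above.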

Something else while i'm here.
The 68k isn't especially good in range-limiting values (aka saturation).
So my conversion functions take care about that.
If unsigned, values <0 become 0 and values >$ff (for byte) or $ffff (for word) become $ff(ff).
If signed, values < -$80 (for byte) or -$8000 (for word) become $ffffff80 or $ffff8000, and values >$7f or $7fff become $7f(ff).
This allows doing the computation with the full register size and then convert back to the normal size at the end of the process.

Code:

01001110 1w000<r>    cnvu.bw dn
01001110 1w001<r>    cnvs.bw dn
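
A C sketch of the saturating conversions, for reference. The function names follow the mnemonics but are otherwise made up, the input is treated as a signed 32-bit value, and the signed lower bounds are assumed to be -$80 and -$8000 (giving $ffffff80 and $ffff8000 in the register).

```c
#include <stdint.h>

/* Limit v to the inclusive range [lo, hi]. */
static int32_t clamp32(int32_t v, int32_t lo, int32_t hi)
{
    return v < lo ? lo : (v > hi ? hi : v);
}

/* Unsigned saturation to byte and word range. */
static int32_t cnvu_b(int32_t v) { return clamp32(v, 0, 0xFF); }
static int32_t cnvu_w(int32_t v) { return clamp32(v, 0, 0xFFFF); }

/* Signed saturation to byte and word range. */
static int32_t cnvs_b(int32_t v) { return clamp32(v, -0x80, 0x7F); }
static int32_t cnvs_w(int32_t v) { return clamp32(v, -0x8000, 0x7FFF); }
```

Each call stands for two compares and two conditional loads in plain 68k code.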

Quote:

Originally Posted by Mrs Beanbag

Sometimes what i want to do is conditional AND or OR with register. Like Scc but only sets to -1 (OR) or clears (AND).

I don't get it. Can you be more precise ?

Mrs Beanbag · 03 August 2016, 21:45

Quote:

Originally Posted by meynaf

I don't get it. Can you be more precise ?

ok. i mean...
* set a byte to -1 if the condition is met, otherwise leave unchanged, or
* clear a byte to 0 if the condition is not met, otherwise leave unchanged

I was looking at the 88000 instruction set the other day and it had "pixel" instructions, presumably (i didn't read in detail) for working with 32-bit RGBA values.

meynaf · 03 August 2016, 22:10

Quote:

Originally Posted by Mrs Beanbag

ok. i mean...
* set a byte to -1 if the condition is met, otherwise leave unchanged, or
* clear a byte to 0 if the condition is not met, otherwise leave unchanged

You need a temporary to do that :

Code:

 scc dn         ; dn is tmp
 or.b dn,dest

If the target is a data register, you can also use the stack :

Code:

 scc -(sp)
 or.b (sp)+,d0

Adding an instruction for this would take more encoding space than what it's worth. A macro could be valuable, though.

Quote:

Originally Posted by Mrs Beanbag

I was looking at the 88000 instruction set the other day and it had "pixel" instructions, presumably (i didn't read in detail) for working with 32-bit RGBA values.

Makes me think about fast conversion of pixel formats. We had such a discussion with Gunnar ; he wanted to create a new instruction for this.
Say, something like 16-bit argb to 32-bit argb or vice versa. It's straightforward but a little bit cumbersome.
Yet it isn't worth creating a new instruction especially for it.

However, the x86 has the quite interesting parallel bit extract/deposit (PEXT, PDEP) which would do wonders for this and other tasks. Think of them as PACK/UNPK with variable bit pos.
I don't like x86 too much but these two look like gems to me.
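
To show how bit deposit/extract would cover the pixel case, here is a C sketch with software versions of both operations. The function names are made up, and the 16-bit format is assumed to be ARGB4444 (4 bits per channel), which is only one of the possible "16-bit argb" layouts.

```c
#include <stdint.h>

/* Deposit the low bits of src into the positions marked by mask (like PDEP). */
static uint32_t pdep32(uint32_t src, uint32_t mask)
{
    uint32_t result = 0;
    for (uint32_t bit = 1; mask != 0; bit <<= 1) {
        uint32_t lowest = mask & (~mask + 1u);   /* lowest set bit of mask */
        if (src & bit)
            result |= lowest;
        mask &= mask - 1;                        /* clear that bit */
    }
    return result;
}

/* Gather the bits of src selected by mask into the low bits (like PEXT). */
static uint32_t pext32(uint32_t src, uint32_t mask)
{
    uint32_t result = 0;
    for (uint32_t bit = 1; mask != 0; bit <<= 1) {
        uint32_t lowest = mask & (~mask + 1u);
        if (src & lowest)
            result |= bit;
        mask &= mask - 1;
    }
    return result;
}

/* ARGB4444 -> ARGB8888: put each nibble in the high half of a byte,
   then copy it into the low half so $F becomes $FF. */
static uint32_t argb4444_to_8888(uint16_t p)
{
    uint32_t hi = pdep32(p, 0xF0F0F0F0u);
    return hi | (hi >> 4);
}

/* ARGB8888 -> ARGB4444: keep the high nibble of every channel. */
static uint16_t argb8888_to_4444(uint32_t p)
{
    return (uint16_t)pext32(p, 0xF0F0F0F0u);
}
```

Other pixel formats only need a different mask, which is the appeal over a dedicated conversion instruction.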

matthey · 04 August 2016, 00:35

Quote:

Originally Posted by meynaf

A macro is enough for the conditional move. In hardware the case can be detected and the branch merged with the instruction. Other instructions can as well be merged this way, so you have conditional add, etc.

Adding a conditional move instruction wouldn't help for code density because it would have to be a 32-bit instruction (due to lack of encoding space for it).

Predicated branches are good for code density and nice for the 68k. I don't know if we can assume all processors using the ISA will have predication though. Most of the places a SELcc instruction would be larger are where a single instruction predicated branch can't be used. Of course the Apollo core can sometimes skip 2 instructions with a predicated branch by combining the 2 instructions following a branch into a 3 op instruction but that would really be assuming too much for an ISA. The SELcc instruction is very easy to use as documented but the big test is how well compilers can use it and whether there would be a performance advantage in the most common CPU designs.

Quote:

Originally Posted by meynaf

Some remarks :
1. Your ABS instruction documents a 4-bit field for the register, while saying it works on Dn. To the best of my knowledge you didn't add extra data registers so it ought to be a 3-bit field.

I am aware of this which is easy to change. The register field was originally Rn with address registers allowed. There are several of these instructions which could be allowed for Rn like ABS, BREV (BITREV), REVB (BYTEREV), POPCNT, etc. I'm not sure some make sense but should we just open them anyway? What is the best logic to decide when to open An registers?

Quote:

Originally Posted by meynaf

2. You document longword only for ADDQ to address regs. But word size exists and is used in programs - even though it's useless as it leads to exactly the same result.

This is a purposeful omission. There are several places where I tried to reduce or hide the use of OP.W EA,An because it is confusing.

Quote:

Originally Posted by meynaf

3. You have a branch hint bit. Do you remember what Gunnar said about what occurred with this bit in the Power cpu ? You are a compiler writer so you should know that the compiler doesn't have this kind of info, btw.

Compilers can use profiling info to set a branch hint bit. At least some support for this is available in GCC, Clang/LLVM and vbcc compilers. High performance processors don't use hint bits much because nobody bothers optimizing code on them anymore. Two branch mis-predictions on an embedded processor to turn around a branch that couldn't be reversed is a different story and several of them can lead to inconsistent variations in the timing and responsiveness of executing code. The hint bit could potentially give a little extra optimization potential for software that was meant to execute on a more powerful processor than an FPGA CPU.

Quote:

Originally Posted by meynaf

4. You used Gunnar's dirty encoding for DBcc.L. This ain't good for assemblers and disassemblers and isn't overall nice. Just set b7=b6=0 from normal DBcc and you're done.

Setting bits #7 and #6 looks to me like it would interfere with ADDQ.B and SUBQ.B? Yea, the current encoding of DBcc.L is not that great. It is a questionable enhancement anyway unless some way can be figured out to improve the performance over a separate decrement (SUBQ) and branch.

Quote:

Originally Posted by meynaf

5. Perhaps you could simply mention REMU/REMS as alias for DIVUR/DIVSR, instead of having one entry for each.

Good idea. I only fairly recently added aliases. It should be as simple as adding aliases for REMU->DIVUR and REMS->DIVSR.

Quote:

Originally Posted by meynaf

6. How could you enable (An)+ and -(An) for JMP/JSR ? It's unsized, so what's the operation ? Same for LEA/PEA.

The size is longword (unsized=not specified?) although it is not operating in memory but on the contents of An. It may be possible to save an instruction (ADDQ #4,An or SUBQ #4,An) sometimes but it could be confusing too. Maybe I should change back?

Quote:

Originally Posted by meynaf

7. SATS is useless. Yes, really. I can develop if you want. Clearly I wouldn't have put it.

I agree that it is a weak instruction (good name though) especially with predicated branches. The ColdFire sets the CC different for a few instructions so I'm not sure it would help much for CF compatibility. It is very simple, short and doesn't take much encoding space though.

Quote:

Originally Posted by meynaf

8. You seem to be moving 4-bit condition fields here and there. That's no good. At least BScc and SELcc should have it at the same place.

BScc and SELcc are splits of different instructions and the data is different so I doubt it would make a big difference. The data in the encoding could be changed around as needed. That is where help from someone implementing the ISA could be useful, not that anyone is likely to do that.

Quote:

Originally Posted by meynaf

9. I don't see SELcc as very useful. One could just preload the target register with case 1, then test the condition, then conditionally load the target with case 2. Not fully against it, but the encoding really isn't nice.

How would you do this without branches? The point is to do away with speculative branches.

Quote:

Originally Posted by meynaf

BREV and REVB look a little bit ambiguous to me. I'd favor BITREV/BYTEREV not only because of the CF, but also because it's more readable.

I thought 68k assembler programmers were programmed to read the first 'B' of an instruction as bit and the last B of an instruction as Byte. Spelling out BITREV is not bad but then for consistency shouldn't you have BITCHG, BITCLR, BITSET, BITFxxx, etc.? BYTEREV is getting a little long after I made an attempt to shorten the mnemonics. Longer mnemonics are more readable and shorter ones are faster to type. We need to find a consistent compromise between the two but opinions vary.

Quote:

Originally Posted by meynaf

Write to PC-relative is now allowed, unless (like in the case of Scc) the encoding has been stolen.

I've already changed the ISA a couple of times on this. I'm happy to accept either way just don't make me change the documentation again until the decision is finalized.

Quote:

Originally Posted by meynaf

Address regs are supported wherever this has a natural encoding and that encoding isn't stolen by something else (i.e. not for Scc).
This means you can now do MOVEA.B to use the same extend trick as MOVEA.W. Note : unsigned extend (more common for bytes).

Code:

0001<r>0 01< ea >            movea.b ea,An

As we have byte access to An, the reason why MOVEM.B didn't exist is removed. Massive byte-extend in a single instruction is now available.

Code:

00101001 11< ea >            movem.b ea,rlist
00111001 11< ea >            movem.b rlist,ea

Quite often you have to extend a byte when indexing. This is quite painful.

I'm generally in favor of opening up address registers but the CC is set different with an An destination. This makes the implementation and documentation a little more complicated. Adding byte addressing for address registers I would rather not do. We can zero and sign extend bytes more easily when loading now so we can use a longword index. Adding byte indexes makes no sense if adding 64 bit support where longword->quadword indexes would be more valuable. I have my doubts that full 64 bit support can be added in a consistent and beneficial enough way though. Gunnar's Apollo-core documentation only gives a handicapped 64 bit integer CPU so he can rob the registers for his SIMD.

Quote:

Originally Posted by meynaf

Another addressing mode "added". Normal (d32,pc) is quite large, larger than a 32-bit address. This fixes that problem so people who hate relocs should like it.

I don't like it as I don't see any advantage. Relocs are not evil and they work fine.

Quote:

Originally Posted by meynaf

Quite often you use a register to point to BSS. It's quite nice but nevertheless eats a register. Having some constant address doesn't need all the addressing flexibility of A4/A5 that are used for this purpose, and therefore a new read-only register could make the trick.
Alas it needs OS patches to be saved and therefore i won't defend this much.

I'm not a fan. It may be ok for a different processor but I haven't seen a suggested retrofit for the 68k which I considered worthwhile.

Quote:

Originally Posted by meynaf

Bit-fields are extended to support a "reverse" mode.
The new mode makes the bit-field behave like this :
- for the position, we name the bit like for BTST (and it's the last bit of the field we target here) ; the operation is limited to 32 bits
- for the size, we use (32-n)

This is similar to a suggestion I made back in the day. We do have BITREV now which reverses the bits of registers to be like BTST, BCHG, BCLR and BSET. The CF FF1 D0 is used to count the leading zeros for CLZ() but BITREV D0 + FF1 D0 can be used to count the trailing zeros for CTZ(). This was not so common before and even less important because of other additions now. I could change my mind though with common examples.
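To make the CTZ trick concrete, here is a minimal Python sketch (a behavioural model, not part of the 68kF document). The function names `bitrev32`, `ff1` and `ctz32` are made up for illustration, and `ff1` assumes the ColdFire behaviour of returning 32 for a zero source.

```python
MASK32 = 0xFFFFFFFF

def bitrev32(x):
    """Model of BITREV: mirror all 32 bits of a register value."""
    x &= MASK32
    result = 0
    for _ in range(32):
        result = (result << 1) | (x & 1)
        x >>= 1
    return result

def ff1(x):
    """Model of FF1 as used above: offset of the first one bit from the MSB,
    i.e. the number of leading zeros (assumed to be 32 for a zero source)."""
    x &= MASK32
    return 32 - x.bit_length()

def ctz32(x):
    """Count trailing zeros as BITREV followed by FF1."""
    return ff1(bitrev32(x))
```

For example, `ctz32(0x00000008)` evaluates to 3, because the reversed value has exactly three leading zeros.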

Quote:

Originally Posted by meynaf

Like Matt did, we now have LEA to Dn. Ok it was originally my idea.

Yes. I believe there is a use for it and the encoding is open, so it is at least worth considering. There may be some complexity in the implementation though.

Quote:

Originally Posted by meynaf

BITCNT is called POPCNT but i don't like this name. Even if it's more or less "standard issue", we count bits, not people.

Both names are ok with me but POPCNT does seem to be the more common name used for this operation. This is not the most common instruction (like BITREV) but it is difficult to do in software.
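As a rough illustration of why a software population count is awkward, here is the well-known parallel (SWAR) bit count written as a Python model. The name `popcnt32` is only illustrative. Each line corresponds to roughly two to four 68k instructions, which is the cost a single POPCNT/BITCNT would remove.

```python
def popcnt32(x):
    """Count the set bits of a 32-bit value with the classic SWAR reduction."""
    x &= 0xFFFFFFFF
    x = x - ((x >> 1) & 0x55555555)                  # 2-bit partial sums
    x = (x & 0x33333333) + ((x >> 2) & 0x33333333)   # 4-bit partial sums
    x = (x + (x >> 4)) & 0x0F0F0F0F                  # 8-bit partial sums
    return ((x * 0x01010101) & 0xFFFFFFFF) >> 24     # add the four bytes
```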

Quote:

Originally Posted by meynaf

DUP will duplicate the lowest byte into all others, i.e. $12345678 will become $78787878.

This is very much like a SIMD instruction. PERM can do it (as documented) but Gunnar changed PERM in his ISA. There could be a PERM and VPERM I suppose but it doesn't make much sense with the SIMD grafted on the integer units. If all the 68k FPU instructions start with FOP then shouldn't SIMD instructions start with VOP (V for Vector) or POP (P for parallel) or similar?
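For reference, the DUP behaviour quoted above can be modelled in a couple of lines of Python. This is only a sketch of the described result; `dup_byte` is a hypothetical name and nothing here reflects an actual encoding or the PERM definition.

```python
def dup_byte(x):
    """Replicate the lowest byte of a 32-bit value into all four byte lanes."""
    return ((x & 0xFF) * 0x01010101) & 0xFFFFFFFF
```

With this model, `dup_byte(0x12345678)` gives `0x78787878`, matching the example in the quote.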

Quote:

Originally Posted by meynaf

Sometimes you have to test some value then clear it (like "process this code once if the value is set"). It involves clearing the value and thus repeating the same EA - a pain for modes such as (An)+.
Hence the TAC (Test And Clear) instruction, testing the byte like TST.B and clearing it.

Interesting. The similar looking and sounding TAS instruction does a locked test and set which behaves much differently. I'll have to think about the functionality.

Quote:

Originally Posted by meynaf

Good enough coders should know about BTST Dn,#i.
Dynamic bit-test a constant is a nice trick, however limited to 8 bits.
This limit will go away.

Fast bitfield instructions reduce the need but it is a cheap trick.

Quote:

Originally Posted by meynaf

I don't like adding an addressing mode for short immediates : it will only be useful for a handful cases.
MOVE.L #16bit,Dn can be handled with LEA adr.w,Dn.
ADD/SUB are useful but relatively rare.
OR/AND/EOR are useless - just operate on the lower part of the reg.
CMP is quite useful but we can add it alone :

Code:

01001000 01100<r>            cmpq.l    #i16,Dn

I disagree that OR/AND are useless. There are plenty of cases where the immediate is <16 bits or the extended value can be used. One of the best uses is for MOVE.L #16bit,EA as the ColdFire added the ugly MOV3Q instruction for this reason. It is possible to interpret the addressing mode as something different for .B or .W sizes. These addressing mode slots without a register are of limited use anyway. They are less valuable than the encoding space used by other solutions. MVS and MVZ can do some of the same immediate compression to a data register as your LEA (xxx).W,Dn and take away some of the advantage for MOVE.L #d16,Dn but there is still a nice savings.

Quote:

Originally Posted by meynaf

Normal DBcc has many limitations. It's one of the major reason for data register shortage, is zero-based and sometimes it's an annoyance, and of course can't operate on anything but a word size.
In addition it works on the lower word only and having it sometimes on the high word would spare a register.
I'm not sure full DBcc.L is really useful even though i have an encoding for it. I've put simple DBF.L here, this probably needs discussion.

Code:

01000001 11000<r>            dbfh    Dn,d16    (high word dbf)
01000011 11000<r>           dbf.b    Dn,d16
01000101 11000<r>            dbf1    Dn,d16    (base 1 dbf)
01000111 11000<r>            dbfa    An,d16    (dbf An)
01001111 11000<r>           dbf.l    Dn,d16

Running out of data registers are we? It is possible to use an address register for a loop counter already if DBcc is abandoned although an extra TST/CMP is needed. I am a little worried about loop complexity with all these loop types. I'm sure you would use all these but they may be more difficult for compilers and beginner assembler programmers to use.

Quote:

Originally Posted by meynaf

Square roots are useful sometimes in demos (for 3D stuff). There is nothing for this, apart maybe floating-point SQR, but FP suffers from rounding errors.
I prefer having an integer SQR, and here it is. May be complicated to do in HW but if it can be done in FP, it can be done in integer.

Code:

1100<r>1 10000<r>            sqr    Dn,Dn    (sqr(64) -> 32)

I haven't needed an integer square root ever so I don't have enough experience to say much. I have seen FSQRT used in 3D code though.
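For anyone wanting to judge the hardware cost, the usual bit-by-bit (shift and subtract) integer square root is sketched below in Python. It is an assumption that a hardware SQR would work along these lines; the function name `isqrt64` is illustrative. The loop runs at most 32 times for a 64 bit source and needs only compares, subtracts and shifts.

```python
def isqrt64(n):
    """Bit-by-bit square root of a 64-bit unsigned value.

    Returns floor(sqrt(n)), which always fits in 32 bits.
    """
    n &= 0xFFFFFFFFFFFFFFFF
    root = 0
    bit = 1 << 62            # highest power of four that fits in 64 bits
    while bit > n:
        bit >>= 2
    while bit:
        if n >= root + bit:
            n -= root + bit
            root = (root >> 1) + bit
        else:
            root >>= 1
        bit >>= 2
    return root
```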

Quote:

Originally Posted by meynaf

Of course my stuff wouldn't be complete without my bit-set-on-condition.
I don't see targeting other bits as useful and my encoding is different, but Matt's doc explains it quite well.

Code:

00001110 11< ea > <cc>0100 00i< n >    bs<cc>    #n/Dn,ea (bit set from cc)

An operation on the other bits is simple and powerful. It could be added in a compatible way to BTST, BCLR, BCHG and BSET to make them more powerful as well. They could replace a CLR+BScc, AND+OR, Scc+EXTB.L+NEG.L, etc. The encoding is not so important at this point though.
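A small Python model may help show what the "operation on the other bits" buys. This is a sketch under my own assumptions: the operation names `"none"`, `"change"`, `"clear"` and `"set"` and the function name `bscc` are placeholders, not proposed syntax, and a 32 bit register destination is assumed.

```python
MASK32 = 0xFFFFFFFF

def bscc(value, bit, cond, other="none"):
    """Set bit `bit` when cond is true, clear it when false,
    then apply the selected operation to all the other bits."""
    mask = 1 << (bit & 31)
    if other == "none":
        rest = value & ~mask & MASK32
    elif other == "change":
        rest = ~value & ~mask & MASK32
    elif other == "clear":
        rest = 0
    elif other == "set":
        rest = ~mask & MASK32
    else:
        raise ValueError("unknown operation: " + other)
    return rest | (mask if cond else 0)
```

With `other="clear"` and bit #0, the result is exactly 0 or 1, which is what the Scc+EXTB.L+NEG.L sequence produces.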

Quote:

Originally Posted by meynaf

One annoying limit is EOR. We can't EOR from mem. Having a 16-bit version would require stealing the line-A space and i don't want to do this.
Still, a 32-bit version can be useful.

No EOR EA,Dn is one of the weird decisions for the 68k but your fix is also weird.

Quote:

Originally Posted by meynaf

x86 has XCHG with mem. Even spartan ARM has SWP. But the 68k doesn't have EXG with mem, and that's often quite a pain.
And so, even though it's quite a large instruction (again due to encoding space constraints), i add it :

Code:

00001110 tt< ea > 0<r>0101 01000000    exg.t    ea,Dn

I would have added EXG with memory for a new ISA but I don't like the retrofit to the 68k or encoding options.

Quote:

Originally Posted by meynaf

Quite often you have branches that directly go to RTS. Old z80 has conditional RTS. Now we can have one as well.
RTS is done only if condition is true, else NOP.

Code:

1110<cc> 11111100            rts<cc>

Can't we just do a predicated branch over the RTS? What percentage of z80 returns are conditional?

Quote:

Originally Posted by meynaf

An invention of mine.
It compares the byte in the source register, to all 4 bytes in the target register. If found, it puts the index of that byte (0 to 3) in the target register and sets the CCR accordingly.

Code:

0100<r>1 11001<r>            mbcmp    Dn,Dn

Could be useful for strings but text handling is already so inefficient and rarely performance critical. It may be possible to do something like this in the SIMD with much more data at a time. However, I believe text handling attempts in the SIMD have generally not been successful.
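To pin down what MBCMP would compute, here is a Python sketch of one possible reading. Several details are my assumptions because the quote does not state them: index 0 is taken to be the least significant byte, the lowest matching index wins, and "not found" is returned as `None` where the real instruction would report it through the CCR.

```python
def mbcmp(src, target):
    """Compare the low byte of src with each of the 4 bytes of target.

    Returns the index (0 = least significant byte, an assumption) of the
    first match, or None when no byte matches.
    """
    needle = src & 0xFF
    for index in range(4):
        if (target >> (8 * index)) & 0xFF == needle:
            return index
    return None
```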

Quote:

Originally Posted by meynaf

The following instruction swaps the bits in the register pair, wherever the bit in the mask (the source operand) is 1. This nicely replaces the complex EOR merges.

Code:

00001110 11< ea > a<r>0100 0100a<r>    mix    ea,rn:rn

Doesn't this just replace two instructions with one? Where is this performance critical?
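For comparison, one common form of the EOR merge that MIX is meant to replace is modelled below in Python. The name `mix` and the argument order are illustrative. On a plain 68k each line of the body maps to one or two instructions (an EOR into a scratch register, an AND with the mask, then two EORs), so the sequence also costs a temporary register.

```python
MASK32 = 0xFFFFFFFF

def mix(mask, a, b):
    """Swap the bits of a and b wherever the corresponding mask bit is 1."""
    diff = (a ^ b) & mask & MASK32   # bits that differ and are selected
    return a ^ diff, b ^ diff        # flipping them in both values swaps them
```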

Quote:

Originally Posted by meynaf

RXL/RXR below are same as ROXL/ROXR. However this time the carry isn't a single bit but a whole register (the LSB are used for this, MSB being unchanged).

Code:

00001110 11< ea > 0<r>s100 10i< n >    rxl/rxr #n/dn,dn:ea

This is basically a 64 bit rotate. It could be useful in some cases but is not too common. It would be nice for emulating x86 processors.
It doesn't make much sense if a 64 bit register could be rotated.

Quote:

Originally Posted by meynaf

Ouch. What a large post.
I have a few other instruction ideas but that's all for now. I don't expect positive feedback ; cpu discussions often end up in flame wars. But who knows.

You are not kidding about the long post. It has taken forever to respond. We should probably narrow down the number of topics to discuss.

buggs · 04 August 2016, 08:15

Quote:

Originally Posted by matthey

This is *not* the official documentation for the Apollo-core ISA. No part of it was approved or sanctioned by Gunnar although it includes some ideas from him and others who were in the Apollo Team (meynaf and me mostly). SIMD and privileged instructions were generally not included as discussions about them were at an early stage.

My apologies. It's good to see that some of the work of you two is actually going to be available for practical use.

Quote:

Originally Posted by matthey

A query system for hardware capabilities was discussed but it was not a priority. The ColdFire added it although their system would be inadequate for enhanced 68k processors. ARM has a nice query system but needs it with an overload of variations. I would hope an enhanced 68k would have a more standard set of hardware (SoC) and not as many variations considering it doesn't need to support low end embedded processors where it is not competitive.

Yes indeed and furthermore, we always had the issue that we couldn't directly poll the CPU for its capabilities from userspace and either had to rely on the flags in exec.library or do it the hard way. In the end, a kind of cpu.library (name just a placeholder) would do the trick just as well.

Mrs Beanbag · 04 August 2016, 10:26

Quote:

Originally Posted by matthey

I thought 68k assembler programmers were programmed to read the first 'B' of an instruction as bit and the last B of an instruction as Byte.

except when it's BRA/BSR/Bcc...

Anyway BITREV doesn't just operate on a byte, it operates on an entire register. Reversing a single bit wouldn't do very much! BFREV would make sense for a bitfield reverse, however...

meynaf · 04 August 2016, 11:18

Quote:

Originally Posted by matthey

Predicated branches are good for code density and nice for the 68k. I don't know if we can assume all processors using the ISA will have predication though. Most of the places a SELcc instruction would be larger are where a single instruction predicated branch can't be used. Of course the Apollo core can sometimes skip 2 instructions with a predicated branch by combining the 2 instructions following a branch into a 3 op instruction but that would really be assuming too much for an ISA. The SELcc instruction is very easy to use as documented but the big test is how well compilers can use it and whether there would be a performance advantage in the most common CPU designs.

I have studied the possibility of such an instruction. And unfortunately, when i need a simple choice between values, they're never in registers.

Quote:

Originally Posted by matthey

I am aware of this which is easy to change. The register field was originally Rn with address registers allowed. There are several of these instructions which could be allowed for Rn like ABS, BREV (BITREV), REVB (BYTEREV), POPCNT, etc. I'm not sure some make sense but should we just open them anyway? What is the best logic to decide when to open An registers?

You can open An registers for very common stuff in which they have a natural encoding. AND, OR are common enough. ABS, BITREV, BYTEREV, POPCNT, aren't.

Quote:

Originally Posted by matthey

Compilers can use profiling info to set a branch hint bit. At least some support for this is available in GCC, Clang/LLVM and vbcc compilers. High performance processors don't use hint bits much because nobody bothers optimizing code on them anymore. Two branch mis-predictions on an embedded processor to turn around a branch that couldn't be reversed is a different story and several of them can lead to inconsistent variations in the timing and responsiveness of executing code. The hint bit could potentially give a little extra optimization potential for software that was meant to execute on a more powerful processor than an FPGA CPU.

It's quite ugly, for a small result. Same as DBcc.L ; bit #0 should perhaps be left alone. Especially as old code may eventually wish to deliberately throw an exception by conditionally branching to an odd address. 68020+ has TRAPcc but not 68000.

Quote:

Originally Posted by matthey

Setting bits #7 and #6 looks to me like it would interfere with ADDQ.B and SUBQ.B? Yea, the current encoding of DBcc.L is not that great. It is a questionable enhancement anyway unless some way can be figured out to improve the performance over a separate decrement (SUBQ) and branch.

It doesn't interfere with ADDQ/SUBQ .B as it's reusing An mode. And ADDQ/SUBQ .W to An is already useless enough, no need to allow .B as well.

Quote:

Originally Posted by matthey

The size is longword (unsized=not specified?) although it is not operating in memory but on the contents of An. It may be possible to save an instruction (ADDQ #4,An or SUBQ #4,An) sometimes but it could be confusing too. Maybe I should change back?

If something useful can be done with this, you can keep it.
But is a free addq/subq #4 worth all that encoding space ?

Quote:

Originally Posted by matthey

I agree that it is a weak instruction (good name though) especially with predicated branches. The ColdFire sets the CC different for a few instructions so I'm not sure it would help much for CF compatibility. It is very simple, short and doesn't take much encoding space though.

It takes encoding space i wished to use for something else.

The problems of SATS are :
1) It works only for signed stuff. Its algorithm can't work at all for unsigned.
2) It works after a single op which must be ADD/SUB. Yet the data comes out of many operations which can individually overflow for better precision, so clamping should be done at the very end of the process.
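The first point can be seen in a small Python model. This is a sketch assuming the ColdFire definition of SATS (act only when V is set, and pick the bound from the sign of the wrapped result); `add32` and `sats` are illustrative names.

```python
MASK32 = 0xFFFFFFFF

def add32(a, b):
    """32-bit ADD returning (result, V), V being the signed overflow flag."""
    a &= MASK32
    b &= MASK32
    result = (a + b) & MASK32
    # Overflow: operands share a sign and the result has the other sign.
    v = bool(~(a ^ b) & (a ^ result) & 0x80000000)
    return result, v

def sats(result, v):
    """Assumed SATS behaviour: clamp to the signed bound after an overflow."""
    if not v:
        return result
    # A wrapped result that looks negative means positive overflow.
    return 0x7FFFFFFF if result & 0x80000000 else 0x80000000
```

An unsigned wrap such as `add32(0xFFFFFFFF, 1)` leaves V clear, so `sats` returns 0 rather than the unsigned maximum.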

Quote:

Originally Posted by matthey

BScc and SELcc are splits of different instructions and the data is different so I doubt it would make a big difference. The data in the encoding could be changed around as needed. That is where help from someone implementing the ISA could be useful, not that anyone is likely to do that.

I argued with Gunnar enough to know that he doesn't like fields not located at a fixed place, and I think he has a good reason for this.
Yet for BScc it's impossible to use the normal place unless reclaiming line-A, and if you add SELcc on top of this, better have as few different pos as possible.

Quote:

Originally Posted by matthey

How would you do this without branches? The point is to do away with speculative branches.

I don't do that without a branch, but, as the branch only skips one instruction the case is easily treated by turning the instruction into a conditional one at decode time.

Quote:

Originally Posted by matthey

I thought 68k assembler programmers were programmed to read the first 'B' of an instruction as bit and the last B of an instruction as Byte. Spelling out BITREV is not bad but then for consistency shouldn't you have BITCHG, BITCLR, BITSET, BITFxxx, etc.?

Personally i read this 'B' as "single bit". Seems i'm not alone.

Quote:

Originally Posted by matthey

BYTEREV is getting a little long after I made an attempt to shorten the mnemonics. Longer mnemonics are more readable and shorter ones are faster to type. We need to find a consistent compromise between the two but opinions vary.

If you need a short name, then why not SWB or SWAPB (swap bytes).
Or even SWAP.B (where the current SWAP is SWAP.W).

Quote:

Originally Posted by matthey

I'm generally in favor of opening up address registers but the CC is set differently with an An destination. This makes the implementation and documentation a little more complicated.

This allows moving things around without touching the CC. Sometimes it's practical to have CC set by MOVE, but sometimes it's not.

Quote:

Originally Posted by matthey

Adding byte addressing for address registers I would rather not do. We can zero and sign extend bytes more easily when loading now so we can use a longword index.

It's not for indexes. It's just about having more flexibility when using address registers.

Quote:

Originally Posted by matthey

Adding byte indexes makes no sense if adding 64 bit support where longword->quadword indexes would be more valuable.

Let's be realistic please. Quadword indexes are totally useless. Do you plan arrays more than 4GB large ?
On the contrary, bytes are very, very commonly used as index. And sometimes in very critical areas, like byte code interpretation.

Quote:

Originally Posted by matthey

I have my doubts that full 64 bit support can be added in a consistent and beneficial enough way though. Gunnar's Apollo-core documentation only gives a handicapped 64 bit integer CPU so he can rob the registers for his SIMD.

64 bit is necessarily dirty, no matter how you do it. Look at the horror the x86_64 is. And even with these ugly REX prefixes, some operations can't be done (because the instruction would become too large).

Quote:

Originally Posted by matthey

I don't like it as I don't see any advantage. Relocs are not evil and they work fine.

Discuss that with Mrs Beanbag then.

Quote:

Originally Posted by matthey

I'm not a fan. It may be ok for a different processor but I haven't seen a suggested retrofit for the 68k which I considered worthwhile.

If the new 68k cpu needs rom patches anyway then i would want to see this basereg feature added ; however it's not worth doing rom patches for itself.

Quote:

Originally Posted by matthey

This is similar to a suggestion I made back in the day. We do have BITREV now which reverses the bits of registers to be like BTST, BCHG, BCLR and BSET. The CF FF1 D0 is used to count the leading zeros for CLZ() but BITREV D0 + FF1 D0 can be used to count the trailing zeros for CTZ(). This was not so common before and even less important because of other additions now. I could change my mind though with common examples.

Bitrev before bfextu won't help, you'll get the bits in the wrong order.

Quote:

Originally Posted by matthey

Yes. I believe there is a use for it and the encoding is open, so it is at least worth considering. There may be some complexity in the implementation though.

If you support PEA, LEA to Dn should be no big deal. It's the same "don't read the data but use the address" trick.

Quote:

Originally Posted by matthey

If all the 68k FPU instructions start with FOP then shouldn't SIMD instructions start with VOP (V for Vector) or POP (P for parallel) or similar?

Not P, as the PMMU instructions already use it. But V for vector, yes.
I don't see my byte dup as a vector instruction, though.

Quote:

Originally Posted by matthey

Fast bitfield instructions reduce the need but it is a cheap trick.

Bitfield instructions don't reduce the need. As bitfield instructions don't operate on immediate data.

Quote:

Originally Posted by matthey

I disagree that OR/AND are useless. There are plenty of cases where the immediate is <16 bits or the extended value can be used.

But if the immediate is <16 bits then you can just operate on the lower part, ok ?
No difference between or.l #$1ff,d0 and or.w #$1ff,d0, ok ?
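The register results of the two forms can be checked with a quick Python model. This sketch covers the destination register only and uses made-up helper names. The condition codes are not modelled, and N and Z are computed on a different width in the two forms. It also shows that AND behaves differently from OR here, since a long AND with a zero-extended immediate clears the upper word while a word AND leaves it alone.

```python
MASK32 = 0xFFFFFFFF

def or_l(imm, reg):
    """OR.L #imm,Dn with a zero-extended 16-bit immediate."""
    return (reg | (imm & 0xFFFF)) & MASK32

def or_w(imm, reg):
    """OR.W #imm,Dn: only the low word of the register is written."""
    return (reg & 0xFFFF0000) | ((reg | imm) & 0xFFFF)

def and_l(imm, reg):
    """AND.L #imm,Dn with a zero-extended 16-bit immediate."""
    return reg & (imm & 0xFFFF)

def and_w(imm, reg):
    """AND.W #imm,Dn: the upper word of the register is preserved."""
    return (reg & 0xFFFF0000) | (reg & imm & 0xFFFF)
```

For `reg = 0x12345678` and `imm = 0x1FF`, both OR forms give `0x123457FF`, while `and_l` gives `0x00000078` and `and_w` gives `0x12340078`.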

Quote:

Originally Posted by matthey

One of the best uses is for MOVE.L #16bit,EA as the ColdFire added the ugly MOV3Q instruction for this reason. It is possible to interpret the addressing mode as something different for .B or .W sizes. These addressing mode slots without a register are of limited use anyway. They are less valuable than the encoding space used by other solutions. MVS and MVZ can do some of the same immediate compression to a data register as your LEA (xxx).W,Dn and take away some of the advantage for MOVE.L #d16,Dn but there is still a nice savings.

If you really want MOVE.L #16bit,EA, then why not :

Code:

01001110 00< ea >          MOVE.L #i16,EA

Quote:

Originally Posted by matthey

Running out of data registers are we? It is possible to use an address register for a loop counter already if DBcc is abandoned although an extra TST/CMP is needed. I am a little worried about loop complexity with all these loop types. I'm sure you would use all these but they may be more difficult for compilers and beginner assembler programmers to use.

They seem a cheap enough solution vs adding more registers.

Quote:

Originally Posted by matthey

An operation on the other bits is simple and powerful. It could be added in a compatible way to BTST, BCLR, BCHG and BSET to make them more powerful as well. They could replace a CLR+BScc, AND+OR, Scc+EXTB.L+NEG.L, etc. The encoding is not so important at this point though.

Well i still don't see much use, but i won't grumble if it's there.

Quote:

Originally Posted by matthey

No EOR EA,Dn is one of the weird decisions for the 68k but your fix is also weird.

Weird but logical : they did statistical analysis of programs and EOR didn't come high (as it's indeed less used than AND, OR). Nevertheless a mistake.
The only other way to fix it, is to reclaim line-A. Perhaps not the best thing to do either.

Quote:

Originally Posted by matthey

I would have added EXG with memory for a new ISA but I don't like the retrofit to the 68k or encoding options.

If we had line-A free it could be there, along with EOR from mem and BScc.
But really if i had this i wouldn't care if it's a long instruction or not. I needed it just too many times.

Quote:

Originally Posted by matthey

Can't we just do a predicated branch over the RTS? What percentage of z80 returns are conditional?

I don't know for z80 (and don't care much), but in 68k code i have many use cases for it.
Would spare many long branches because RTS isn't necessarily nearby and quite a few RTS are here only because none were available in the routine.
Ok, it's not targeted at speed ; rather, it's for code density. Even though it might help making code faster, i don't know.

Quote:

Originally Posted by matthey

Could be useful for strings but text handling is already so inefficient and rarely performance critical. It may be possible to do something like this in the SIMD with much more data at a time. However, I believe text handling attempts in the SIMD have generally not been successful.

It is not especially for text handling. Seen my switch..case example ?

Quote:

Originally Posted by matthey

Doesn't this just replace two instructions with one? Where is this performance critical?

If you can do this in just two instructions i want to see the code.

Quote:

Originally Posted by matthey

This is basically a 64 bit rotate. It could be useful in some cases but is not too common. It would be nice for emulating x86 processors.
It doesn't make much sense if a 64 bit register could be rotated.

Don't think that. With 64 bit rotates you push the problem further one step, but it comes back if you have to rotate larger values.
These instructions being true multiprecision shifts, you can do any size.
Remember the discussion we had with Gunnar about the new instruction he needed for blitter emulation ?
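To illustrate the multiprecision point, here is a Python sketch of one possible reading of RXL. It is an assumption on my part that the n bits leaving the top of the operand land in the low n bits of the carry register, that the old low n bits of the carry register enter at the bottom, and that the upper bits of the carry register are unchanged. The names `rxl` and `shift_left_multi` are illustrative.

```python
MASK32 = 0xFFFFFFFF

def rxl(n, carry, word):
    """One reading of RXL #n,Dn:ea. Returns (new_carry, new_word)."""
    low = (1 << n) - 1
    out = (word >> (32 - n)) & low                  # bits leaving the top
    new_word = ((word << n) & MASK32) | (carry & low)
    new_carry = (carry & ~low & MASK32) | out       # upper carry bits kept
    return new_carry, new_word

def shift_left_multi(words, n):
    """Shift a little-endian list of 32-bit words left by n (1..31) bits."""
    carry = 0
    result = []
    for word in words:                              # least significant first
        carry, word = rxl(n, carry, word)
        result.append(word)
    return result
```

Chaining one such instruction per longword handles a value of any length, which is the property a fixed 64 bit rotate does not give.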

Quote:

Originally Posted by matthey

You are not kidding about the long post. It has taken forever to respond. We should probably narrow down the number of topics to discuss.

And i didn't write everything, i have more potential additions in store.

Mrs Beanbag · 04 August 2016, 22:14

Quote:

Originally Posted by meynaf

Discuss that with Mrs Beanbag then

The subject of disassembling Frontier happened to come up at work today (i love my job) and apparently the whole thing is PC relative despite being >600k in size, there are various jump tables in it.

But, generally, as frustrating as it is to have to LEA label(PC),An whenever i want to write to static storage, that's pretty rare for me, it's usually just for reading constants. d32(PC) would be really useful though.

As for the more complex addressing modes of the 68020+, memory indirect and all that... how useful really are they? I can think of some uses, but mostly in conjunction with JSR/JMP to implement VTables of polymorphic classes. A version of JMP that reads its effective address instead of jumping to it could be useful then.

jsrm / jmpm d16(An) ; reads longword address pointed to by d16(An) and jumps to it

i.e. same as the following:
move.l d16(An),Am
jsr / jmp (Am)

Quote:

bit #0 should perhaps be left alone. Especially as old code may eventually wish to deliberately throw an exception by conditionally branching to an odd address.

I do wonder whether we should have to pander to this sort of code.

But the 68k branch encodings are another puzzle. Why is it even possible to branch to an odd address, if it cannot possibly be valid? Why didn't they simply use a 7-bit field instead of 8 and use the upper bit to indicate whether it is a short branch or a long one? Because then a long branch could be 24 bit, which is as big as the original 68k's address bus.

matthey · 05 August 2016, 00:21

Quote:

Originally Posted by Knocker

I think it would be very useful to have a conditional BSR, "BSRcc". Essentially a Bcc where you can do rts to get back. Would allow for much cleaner code. Or can you use a macro with some local labels to do this somehow?

Meynaf suggested using a predicated branch with a macro but I would like to take a little closer look at this.

Code:

  bne .skip
  bsr function
.skip:

Branch predication executes a short sequence of instructions branched over conditionally and throws away the results if not needed. The instruction predicated in this case would be "bsr function" but what can we do here? In the best case, the bsr function just calculates the address to be branched to and has it ready if needed. In the worst case, it loads the start of the function code into the ICache for a function which may have a very low probability of being executed. This would defeat the goal of predication and I can only guess that it may be disabled or would not be effective here. The default 68k static branch prediction of Backward Taken Forward Not Taken (BTFN) would then likely be used to predict that the function is taken and would be mispredicted twice with a 2 bit saturating predictor (which I believe the Apollo Core is using) if the function was rarely called (a used branch hint bit could avoid this).
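The "mispredicted twice" figure follows from how a 2 bit saturating counter behaves. The Python sketch below is a generic textbook model, not the Apollo Core's predictor. It assumes the counter starts in the strong state matching the wrong direction, and the name `mispredictions` is illustrative.

```python
def mispredictions(outcomes, state=3):
    """Count mispredictions of one 2-bit saturating counter.

    state ranges over 0..3; values 2 and 3 predict taken.
    outcomes is a sequence of booleans (True = branch taken).
    """
    misses = 0
    for taken in outcomes:
        if (state >= 2) != taken:
            misses += 1
        # Saturating update towards the actual outcome.
        state = min(state + 1, 3) if taken else max(state - 1, 0)
    return misses
```

A branch that starts as "strongly taken" but is never taken costs two mispredictions before the counter flips. Starting from a weak state it would cost one.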

I could not find any statistics on whether conditional branches to subroutines were more likely to be taken or not. ARM has a conditional branch to subroutine with BLcc which could be analyzed but I couldn't find any statistics. If conditional branches are usually taken, then meynaf's macro works pretty well even without predication. If conditional branches to subroutines are usually not taken (which I suspect), then the macro is not efficient. A BSRcc could use static branch prediction appropriate for a branch to subroutine (much as DBcc uses different static prediction than Bcc on the 68060). There is an acceptable encoding available to add an extra word for the condition code of a BSRcc. However, a branch hint bit would allow for the most efficient (semi-)static branch prediction in all cases without increasing code size.

I did find the following ARM code on the internet.

Code:


   CMP    R1, R2		; branch conditionally
   BLLT   SUB_B		; if R1 < R2, then branch to SUB_B
   BLLE   SUB_C		; if R1 <= R2, then branch to SUB_C
   BLGT   SUB_D		; if R1 > R2, then branch to SUB_D
   BLGE   SUB_F		; if R1 >= R2, then branch to SUB_F

If conditional subroutines were used like this then I believe they would not be efficient with meynaf's macro which would assume the functions were called and there would be 4x2 incorrectly predicted branches. This code is also poor on fast processors with good branch predictors because the branches are too close together and the code sequences too short. Branch predictors like gshare commonly take several cycles to lookup the prediction which creates a problem much like a tight loop (unrolling fixes loops). The code sequence is too short before the next branch to lookup the branch prediction in time. Such processors can fall back to (semi-)static prediction which in the case of the existing 68k would shoot many bubbles in a row with this code. The ARM ISA which ran well at low clock speeds with simple branch prediction now doesn't perform so well. Hmm, does BSRcc still look like a good idea?

The case for RTScc is a little different. The code that will be returned to is already likely in the ICache and it is acceptable to load the return code into the ICache if it is not as it will likely be returned to shortly anyway. I expect a branch over RTS could be problematic for predication also. I would like to see statistics of how common RTScc could be used and of whether the conditional return is more commonly taken or not. I expect a branch hint bit could be effective if predication could not be used but meynaf did come up with a 16 bit encoding which would be smaller than a Bcc+RTS. The case for it is not overly compelling but I believe it deserves more research.

Quote:

Originally Posted by buggs

My apologies. It's good to see that some of the work of you two is actually going to be available for practical use.

No need for an apology for such a simple mistake. It is nice to be open when possible. There are some very good ideas here. I did add a copyright but that is because there should only be one standard and not because I'm trying to limit the ISA use. I wish there was a diverse standards committee to help set and standardize it and that it would be open and free for anyone to use.

Quote:

Originally Posted by buggs

Yes indeed and furthermore, we always had the issue that we couldn't directly poll the CPU for its capabilities from userspace and either had to rely on the flags in exec.library or do it the hard way. In the end, a kind of cpu.library (name just a placeholder) would do the trick just as well.

IMO, the exec.library flags are good and a better idea than every program using CPU querying instructions. Later, ThoR created a generic 68k cpu.library which abstracts many CPU and hardware differences also. The AmigaOS itself was too hardware specific and we are paying the price, especially with the 68k AmigaOS abandoned and road blocked while others attempt to improve it.

Quote:

Originally Posted by Mrs Beanbag

Anyway BITREV doesn't just operate on a byte, it operates on an entire register. Reversing a single bit wouldn't do very much! BFREV would make sense for a bitfield reverse, however...

A BFREV would have been nice but there isn't a good encoding location for more BF instructions. The existing ones are powerful but they are also challenging to implement in hardware.

IMO, BITREV and BYTEREV are not so easy to understand. REVBITS.L and REVBYTES.L are easy to understand but too long. BREV.L and REVB.L are shorter compromises. I would change them back if the majority prefers the original ColdFire names.

I still have meynaf's post to get to. Yikes.

matthey · 05 August 2016, 03:01

Quote:

Originally Posted by meynaf

I have studied the possibility of such an instruction. And unfortunately, when i need a simple choice between values, they're never in registers.

SELcc would commonly need one immediate but it can use your favorite addressing mode to compress <16 bit immediates. This gives an instruction length of only 6 bytes while avoiding a branch. Some of the memory accessing addressing modes could get to be long but I doubt the average length of SELcc would be much over 6 bytes.

Quote:

Originally Posted by meynaf

You can open An registers for very common stuff in which they have a natural encoding. AND, OR are common enough. ABS, BITREV, BYTEREV, POPCNT, aren't.

AND and OR are 2 OP while ABS, BITREV and BYTEREV are 1 OP. I opened up An sources everywhere I could which is an easy decision. I'm not sure of what logic is best for opening up An destinations though.

Quote:

Originally Posted by meynaf

It's quite ugly, for a small result. Same as DBcc.L ; bit #0 should perhaps be left alone. Especially as old code may eventually wish to deliberately throw an exception by conditionally branching to an odd address. 68020+ has TRAPcc but not 68000.

I have already discussed the branch hint bit some. There is nothing bad about the encoding even though some people may consider it ugly. Most code would probably not use it but it is very cheap to implement. It is just another level of optimization that would not be there otherwise. As I've mentioned, it is not that unpopular in the embedded space. Here is an article from 2006 suggesting a semi-static prediction branch hint bit using profiling to save power.

Quote:

Dynamic branch predictor logic alone accounts for approximately 10% of total processor power dissipation. Recent research indicates that the power cost of a large dynamic branch predictor is offset by the power savings created by its increased accuracy. We describe a method of reducing dynamic predictor power dissipation without degrading prediction accuracy by using a combination of local delay region scheduling and run time profiling of branches. Feedback into the static code is achieved with hint bits and avoids the need for dynamic prediction for some individual branches. This method requires only minimal hardware modifications and coexists with a dynamic predictor.

https://www.researchgate.net/publica...lock_Formation

They didn't show results unfortunately but it is easy to see a power savings for embedded. Fast processors should benefit also as the best dynamic branch prediction has several cycles of latency before the prediction is ready. A cheaper and faster 2 bit saturating predictor with (semi-)static prediction to improve it a little may be a better way to go while also being good for embedded.

As for small results, the semi-static hint with profiling should be good for at least 10% improvement over static BTFN. The following article gives some results.

http://www.ele.uri.edu/~uht/research...apers/bert.pdf

Just a few percent difference in branch prediction accuracy makes a huge difference.
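
To make the hint bit argument concrete, here is a minimal Python sketch of a single branch under a 2 bit saturating counter. It assumes (my assumption, not something stated in the linked papers or the PDF) that the static guess or the hint bit simply seeds the counter's starting state. The names `mispredictions`, `STRONGLY_TAKEN` and `STRONGLY_NOT_TAKEN` are made up for the illustration.

```python
def mispredictions(outcomes, initial_state):
    """Count mispredictions of one branch under a 2 bit saturating counter.

    States 0 and 1 predict not taken, states 2 and 3 predict taken.
    outcomes is a sequence of booleans (True = branch was taken).
    """
    state = initial_state
    misses = 0
    for taken in outcomes:
        predicted_taken = state >= 2
        if predicted_taken != taken:
            misses += 1
        # Move one step toward the observed outcome, saturating at 0 and 3.
        state = min(state + 1, 3) if taken else max(state - 1, 0)
    return misses


# Hypothetical seeding: BTFN guesses "taken" for a backward branch,
# while a hint bit could say "not taken" for the same branch.
STRONGLY_TAKEN = 3
STRONGLY_NOT_TAKEN = 0

# A backward branch that is never taken in this run, so BTFN guesses wrong.
rarely_taken = [False] * 10
btfn_misses = mispredictions(rarely_taken, STRONGLY_TAKEN)
hint_misses = mispredictions(rarely_taken, STRONGLY_NOT_TAKEN)
print(btfn_misses, hint_misses)
```

Under these assumptions the wrongly seeded counter needs two misses before it flips, which is the cost a correct hint would avoid.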

Quote:

Originally Posted by meynaf

It doesn't interfere with ADDQ/SUBQ .B as it's reusing An mode. And ADDQ/SUBQ .W to An is already useless enough, no need to allow .B as well.

Yes. This encoding looks like it will work for DBcc.L and it is a big improvement! The size is not only in the first word of the encoding but it uses the normal bits for size. A good encoding like this makes it much more likely that DBcc.L could be kept.

Quote:

Originally Posted by meynaf

If something useful can be done with this, you can keep it.
But is a free addq/subq #4 worth all that encoding space ?

The encoding space it uses looks limited to me anyway. I would be happy to change it though.

Quote:

Originally Posted by meynaf

It takes encoding space i wished to use for something else

The problems of SATS are :
1) It works only for signed stuff. Its algorithm can't work at all for unsigned.
2) It works after a single op which must be ADD/SUB. Yet the data comes out of many operations which can individually overflow for better precision, so clamping should be done at the very end of the process.

I can't argue much for SATS. I thought some of the condition codes might have been changed to support SATS but I didn't think anything but ADD/SUB would work with it either. The ColdFire designers would have been better off creating 4 byte long saturating ADD/SUB (with signed/unsigned bit) instructions. I could have used an "unsigned" saturating add for my DivMagic project also.
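
For reference, this is roughly what a saturating ADD/SUB with a signed/unsigned bit would compute. It is only a Python sketch of the arithmetic; `sat_add` and `sat_sub` are hypothetical helper names and say nothing about encoding or condition codes.

```python
def sat_add(a, b, bits=16, signed=True):
    """Add two integers and clamp the result to the range of the given width."""
    if signed:
        lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    else:
        lo, hi = 0, (1 << bits) - 1
    return max(lo, min(hi, a + b))


def sat_sub(a, b, bits=16, signed=True):
    """Subtract b from a and clamp the result to the range of the given width."""
    return sat_add(a, -b, bits, signed)


print(sat_add(30000, 10000))                # clamps at the signed 16 bit maximum
print(sat_add(60000, 10000, signed=False))  # clamps at the unsigned 16 bit maximum
print(sat_sub(10, 20, signed=False))        # clamps at zero
```

This clamps after every operation, which is exactly meynaf's second objection: a chain of operations would lose precision compared with clamping once at the end.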

Quote:

Originally Posted by meynaf

I argued with Gunnar enough to know that he doesn't like fields not located at a fixed place, and I think he has a good reason for this.
Yet for BScc it's impossible to use the normal place unless reclaiming line-A, and if you add SELcc on top of this, better have as few different pos as possible.

Yes, it is better to use standard field locations, especially in the first word. With that said, much encoding space is lost if always using the same field locations. Oddly, I asked for improved encodings for several instructions and didn't get much feedback from Gunnar about it.

Quote:

Originally Posted by meynaf

If you need a short name, then why not SWB or SWAPB (swap bytes).
Or even SWAP.B (where the current SWAP is SWAP.W).

IMO, swap is less specific than reverse. Swap could be to move around in any order. SWAP.W is specific as 2 can only be reversed. Perhaps SWAP.W should be REVW.L which is better than SWAPW.L which is more logical at first thought. SWAP has a longword result but it is a word size. What is the logic being used?

Quote:

Originally Posted by meynaf

Let's be realistic please. Quadword indexes are total useless. Do you plan arrays more than 4GB large ?

Longword indexes would be important with quadword sized registers. I suppose it would be possible to not allow quadword indexes though.

Quote:

Originally Posted by meynaf

64 bit is necessarily dirty, no matter how you do it. Look at the horror the x86_64 is. And even with these ugly REX prefixes, some operations can't be done (because the instruction would become too large).

IMO, the ISA designers of x86_64 did a good job. Yes, it is far from perfect but it is a huge improvement which allowed them to become the most powerful general purpose personal processors in the world. Granted, they did create a new mode for x86_64 which is expensive but this may also be required to do a good job of moving the 68k to 64 bit.

Quote:

Originally Posted by meynaf

If you support PEA, LEA to Dn should be no big deal. It's the same "don't read the data but use the address" trick.

True. LEA EA,Dn should even be simpler than PEA.

Quote:

Originally Posted by meynaf

Not P, as the PMMU instructions already use it. But V for vector, yes.
I don't see my byte dup as a vector instruction, though.

Most of Gunnar's SIMD instructions start with 'P' but then I guess it is not a unit because it is grafted onto the integer unit. I wonder if he used some of the coprocessor encoding space though. It would probably be better than A-line at least.

Quote:

Originally Posted by meynaf

But if the immediate is <16 bits then you can just operate on the lower part, ok ?
No difference between or.l #$1ff,d0 and or.w #$1ff,d0, ok ?

That is true for OR but not AND.
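
A quick Python model of the register result shows why (condition codes are ignored here, and the helper names are invented for the illustration):

```python
MASK32 = 0xFFFFFFFF


def or_l(imm, d):
    """or.l #imm,Dn: the whole 32 bit register is affected."""
    return (d | imm) & MASK32


def or_w(imm, d):
    """or.w #imm,Dn: only the low word is affected, the upper word is kept."""
    return (d & 0xFFFF0000) | ((d | imm) & 0xFFFF)


def and_l(imm, d):
    """and.l #imm,Dn: a 16 bit immediate clears the whole upper word."""
    return d & imm & MASK32


def and_w(imm, d):
    """and.w #imm,Dn: the upper word is kept unchanged."""
    return (d & 0xFFFF0000) | (d & imm & 0xFFFF)


d0 = 0x12345678
print(hex(or_l(0x1FF, d0)), hex(or_w(0x1FF, d0)))    # same register result
print(hex(and_l(0x1FF, d0)), hex(and_w(0x1FF, d0)))  # upper word differs
```

The upper bits of a zero extended immediate are neutral for OR but destructive for AND, so the .W form is only a drop-in replacement for OR.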

Quote:

Originally Posted by meynaf

If you really want MOVE.L #16bit,EA, then why not :

Code:

01001110 00< ea >          MOVE.L #i16,EA

They seem a cheap enough solution vs adding more registers.

This is a minimal use of encoding space but using the addressing mode uses even less and gives more. Did you ever figure out a good use for those addressing modes without registers?

Quote:

Originally Posted by meynaf

I don't know for z80 (and don't care much), but in 68k code i have many use cases for it.
Would spare many long branches because RTS isn't necessarily nearby and quite a few RTS are here only because none were available in the routine.
Ok, it's not targeted at speed ; rather, it's for code density. Even though it might help making code faster, i don't know.

You don't have to branch to the RTS at the end of the function like compilers. RTS is small so you could have many. It may be necessary to do this anyway to optimize the branches since there is no branch hint bit.

Quote:

Originally Posted by meynaf

It is not especially for text handling. Seen my switch..case example ?

I vaguely remember our discussion. Too bad Gunnar recycles his forum on a whim. There could still be discussions on the Natami.net forum though.

Quote:

Originally Posted by meynaf

If you can do this in just two instructions i want to see the code

Ok, probably not. I'll have to play with mix.

Quote:

Originally Posted by meynaf

Don't think that. With 64 bit rotates you push the problem further one step, but it comes back if you have to rotate larger values.
These instructions being true multiprecision shifts, you can do any size.
Remember the discussion we had with Gunnar about the new instruction he needed for blitter emulation ?

Well, 64 bit shifts are pretty efficient with what we have now but maybe it would be useful for blitter emulation or some CPU intensive codecs.

Quote:

Originally Posted by meynaf

And i didn't write everything, i have more potential additions in store

You do like a complex instruction set.

meynaf · 05 August 2016, 09:40

Quote:

Originally Posted by Mrs Beanbag

But the 68k branch encodings are another puzzle. Why is it even possible to branch to an odd address, if it cannot possibly be valid? Why didn't they simply use a 7-bit field instead of 8 and use the upper bit to indicate whether it is a short branch or a long one? Because then a long branch could be 24 bit, which is as big as the original 68k's address bus.

For the long branch they would have had to create a new mode. Current way is the same as d16(pc). HW guys don't like creating something new when they can reuse existing stuff.

Quote:

Originally Posted by matthey

The default 68k static branch prediction of Backward Taken Forward Not Taken (BTFN) would then likely be used to predict that the function is taken and would be mis-predicted incorrectly twice with a 2 bit saturating predictor (which I believe the Apollo Core is using) if the function was rarely called (a used branch hint bit could avoid this).

Whether we have a new BSRcc instruction or we have a cpu merging Bcc+BSR, the problem remains the same.

Quote:

Originally Posted by matthey

The AmigaOS itself was too hardware specific and we are paying the price, especially with the 68k AmigaOS abandoned and road blocked while others attempt to improve it.

You can't have your cake and eat it. AOS is hardware specific but this is what makes it lightweight.

Quote:

Originally Posted by matthey

I still have meynaf's post to get to. Yikes.

Quote:

Originally Posted by matthey

SELcc would commonly need one immediate but it can use your favorite addressing mode to compress <16 bit immediates. This gives an instruction length of only 6 bytes while avoiding a branch. Some of the memory accessing addressing modes could get to be long but I doubt the average length of SELcc would be much over 6 bytes.

Yet i have to see some real life code where SELcc would bring a real benefit.

Common cases are :

Code:

; min/max stuff
 cmp.w (a1),d1
 bhs.s .n0
 move.w d1,(a1)
.n0
 cmp.w 2(a1),d2
 bhs.s .n1
 move.w d2,2(a1)
.n1
 cmp.w 4(a1),d3
 bls.s .n2
 move.w d3,4(a1)
.n2
 cmp.w 6(a1),d4
 bls.s .n3
 move.w d4,6(a1)
.n3

Code:

; two messages depending on a condition
 lea msg1(pc),a0
 tst something
 beq .done
 lea msg2(pc),a0
.done
; show msg here

In first case it's not usable - writes to EA. And even if the targets were registers, it wouldn't make the code shorter.
In second case it's not usable at all.
This was just a quick look i've had in actual code, but nevertheless.

Quote:

Originally Posted by matthey

AND and OR are 2 OP while ABS, BITREV and BYTEREV are 1 OP. I opened up An sources everywhere I could which is an easy decision. I'm not sure of what logic is best for opening up An destinations though.

I'm for opening An when we have a 4-bit register field (like in bit-field or long mul&div) and where we have an EA mode.
Being 1 OP means the encoding is small, but that's about all.
And i'm certainly not for total crazy things such as SWAPA.

Quote:

Originally Posted by matthey

They didn't show results unfortunately but it is easy to see a power savings for embedded. Fast processors should benefit also as the best dynamic branch prediction has several cycles of latency before the prediction is ready. A cheaper and faster 2 bit saturating predictor with (semi-)static prediction to improve it a little may be a better way to go while also being good for embedded.

As for small results, the semi-static hint with profiling should be good for at least 10% improvement over static BTFN. The following article gives some results.

http://www.ele.uri.edu/~uht/research...apers/bert.pdf

Just a few percent difference in branch prediction accuracy makes a huge difference.

This is for specific implementations. Change to something different, and the hint bit becomes useless and gets ignored.

Quote:

Originally Posted by matthey

The encoding space it uses looks limited to me anyway. I would be happy to change it though.

Limited ? Consider the LEA space. Two double-reg positions for each LEA (An and now Dn).

Quote:

Originally Posted by matthey

IMO, swap is less specific than reverse. Swap could be to move around in any order. SWAP.W is specific as 2 can only be reversed. Perhaps SWAP.W should be REVW.L which is better than SWAPW.L which is more logical at first thought. SWAP has a longword result but it is a word size. What is the logic being used?

For me the logic is to use the small size, i.e. MVZ.B, SWAP.W. But in some cases the 68k defeats this (EXT.L).
Anyway for me a better name would have been SWW (swap words).

Quote:

Originally Posted by matthey

Longword indexes would be important with quadword sized registers. I suppose it would be possible to not allow quadword indexes though.

We have longword indexes, so no problem here.
If compatibility weren't an issue i'd use Dn.B, Dn.W, Dn.L, An.L instead of Dn.W, Dn.L, An.W, An.L.

Quote:

Originally Posted by matthey

IMO, the ISA designers of x86_64 did a good job. Yes, it is far from perfect but it is a huge improvement which allowed them to become the most powerful general purpose personal processors in the world. Granted, they did create a new mode for x86_64 which is expensive but this may also be required to do a good job of moving the 68k to 64 bit.

But why want to move the 68k to 64 bit ? Just because everyone else does something, we should do it too ?

Quote:

Originally Posted by matthey

That is true for OR but not AND.

Indeed. Now consider and.l #$1ff,d0 vs bfextu d0{23:9},d0.

Quote:

Originally Posted by matthey

This is a minimal use of encoding space but using the addressing mode uses even less and gives more.

The addressing mode doesn't use less encoding space. Why would it ?

Quote:

Originally Posted by matthey

Did you ever figure out a good use for those addressing modes without registers?

Not something you would call a good use, but it's in this thread.

Perhaps it could be a good idea to have a look in actual code to find some short immediate use cases.

Quote:

Originally Posted by matthey

You don't have to branch to the RTS at the end of the function like compilers. RTS is small so you could have many. It may be necessary to do this anyway to optimize the branches since there is no branch hint bit.

I have some routines where all RTS are conditional and therefore i have to add a new RTS. And i can't add an RTS, as small as it can be, right in the middle of a big loop.

Quote:

Originally Posted by matthey

You do like a complex instruction set.

That's still far simpler than x86. And we're CISC, after all

matthey · 05 August 2016, 23:22

Quote:

Originally Posted by Mrs Beanbag

The subject of disassembling Frontier happened to come up at work today (i love my job) and apparently the whole thing is PC relative despite being >600k in size, there are various jump tables in it.

Jump tables as several BSRs to a JMP? SAS/C generated these tables for 68000 code as there is no Bcc.L or BSR.L. They are slow and ugly but they do help to avoid RELOCS and reduce the code size a small amount. A 68020 version with RELOCS could have been much faster. I'm not trying to take anything away from David Braben's accomplishment as there were compromises to be made and the Amiga version is good.

Quote:

Originally Posted by Mrs Beanbag

But, generally, as frustrating as it is to have to LEA label(PC),An whenever i want to write to static storage, that's pretty rare for me, it's usually just for reading constants. d32(PC) would be really useful though.

It is no more difficult to use an absolute location than a PC relative one. The performance and code size would be the same. The 68020 does allow (d32,PC) addressing with (bd,PC,Rn.Size*Scale). This addressing mode may be used by some compilers when (d16,PC) can't be but the instructions grow by 4 bytes in length when it happens. It is possible to add a new addressing mode which would use the last 4 bits of the full format extension word to allow (d20,PC) as well as all (d20,PC,Rn.Size*Scale) variations. This would save 2 bytes when it could be used making PC relative instructions the same size as absolute addressing with RELOCS. The encoding is not bad at all.
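
To put numbers on the displacement sizes being discussed, here is a small Python sketch of the reach of a signed displacement of a given width. The 20 bit case corresponds to the proposed (d20,PC) mode and the 24 bit case to the long branch idea raised earlier in the thread; `signed_reach` is just a helper name for this illustration.

```python
def signed_reach(bits):
    """Return (lowest, highest) byte offset reachable with a signed displacement."""
    return -(1 << (bits - 1)), (1 << (bits - 1)) - 1


for bits in (8, 16, 20, 24, 32):
    lowest, highest = signed_reach(bits)
    print(f"d{bits}: {lowest} .. {highest}")
```

A 20 bit displacement reaches 512 KiB in either direction, so it would cover most but not necessarily all references inside a program of more than 600k.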

Quote:

Originally Posted by Mrs Beanbag

As for the more complex addressing modes of the 68020+, memory indirect and all that... how useful really are they? I can think of some uses, but mostly in conjunction with JSR/JMP to implement VTables of polymorphic classes. A version of Jmp that reads its effective address instead of jumping to it could be useful then.

jsrm / jmpm d16(An) ; reads longword address pointed to by d16(An) and jumps to it

i.e. same as the following:
move.l d16(An),Am
jsr / jmp (Am)

The 68020 addressing modes are very versatile and can do what you want.

Code:

    jmp ([d16,An])
    jsr ([d16,An])

Without OoO, this reduces the chances of instruction scheduling placing an instruction between what would be 2 dependent instructions. They do save a register and sometimes give a smaller instruction though.

The 68020 address modes are good for OOP but OOP is bad for processor performance. Indirect branches are difficult for even modern processors to deal with.

Quote:

Originally Posted by Mrs Beanbag

But the 68k branch encodings are another puzzle. Why is it even possible to branch to an odd address, if it cannot possibly be valid? Why didn't they simply use a 7-bit field instead of 8 and use the upper bit to indicate whether it is a short branch or a long one? Because then a long branch could be 24 bit, which is as big as the original 68k's address bus.

The 68k designers probably had OCD because most data in encodings is well organized and commonly 8, 16 or 32 bits. Data extensions are in multiples of 16 bits as the variable length encodings are multiples of 16 bits.
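
For anyone following the encoding question, this is a Python sketch of how the existing Bcc/BRA/BSR displacement is decoded: an 8 bit displacement in the opcode word, with $00 meaning a 16 bit extension word follows and, on the 68020 and later, $FF meaning a 32 bit displacement follows. The function names and the `has_long` switch are made up for the illustration.

```python
def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a two's complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value & (1 << (bits - 1)) else value


def decode_bcc(words, has_long=True):
    """Decode a Bcc/BRA/BSR from a list of 16 bit words.

    Returns (condition, displacement, length_in_bytes). The branch target
    is the address of the opcode word + 2 + displacement.
    has_long=False models the 68000, where $FF is a plain 8 bit displacement.
    """
    opcode = words[0]
    if opcode >> 12 != 0x6:
        raise ValueError("not a Bcc/BRA/BSR opcode")
    cond = (opcode >> 8) & 0xF
    d8 = opcode & 0xFF
    if d8 == 0x00:
        return cond, sign_extend(words[1], 16), 4
    if d8 == 0xFF and has_long:
        return cond, sign_extend((words[1] << 16) | words[2], 32), 6
    return cond, sign_extend(d8, 8), 2


print(decode_bcc([0x6604]))                  # bne.s with displacement 4
print(decode_bcc([0x6000, 0xFFFE]))          # bra.w with displacement -2
print(decode_bcc([0x61FF, 0x0001, 0x0000]))  # bsr.l with displacement $10000
```

Two of the 8 bit values are spent as size escapes and the odd values can never be valid targets, which is the waste Mrs Beanbag is pointing at.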

matthey · 06 August 2016, 01:08

Quote:

Originally Posted by meynaf

You can't have your cake and eat it. AOS is hardware specific but this is what makes it lightweight.

Being hardware specific and close to the hardware helps to make AmigaOS lightweight but it should be possible to swap one low level driver out and another in more easily. I do like the idea of having standardized hardware instead of supporting a multitude of hardware configurations much like a console although most of them are closed.

Quote:

Originally Posted by meynaf

Yet i have to see some real life code where SELcc would bring a real benefit.

If a conditional or SELcc style instruction could replace all branches then we wouldn't have branches. SELcc is at its best when it can replace an IF/THEN/ELSE eliminating 2 branches. It would also be another possible tool for those evil branches which can't be predicted well.
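
As a rough illustration of the IF/THEN/ELSE case, here is a Python sketch comparing a branchy select with a branch-free one. It assumes SELcc simply writes one of two source values to the destination depending on the condition; the exact operands and encoding are in the PDF and are not reproduced here, and both function names are hypothetical.

```python
MASK32 = 0xFFFFFFFF


def select_branchy(cond, if_true, if_false):
    """IF/THEN/ELSE form: two control flow paths."""
    if cond:
        return if_true
    else:
        return if_false


def select_branch_free(cond, if_true, if_false):
    """Select form: one straight line data path, no control flow change."""
    mask = -int(bool(cond)) & MASK32  # all ones if cond holds, else zero
    return (if_true & mask) | (if_false & ~mask & MASK32)


# Stand-ins for the msg1/msg2 addresses in meynaf's second example.
msg1, msg2 = 0x00001000, 0x00002000
for something in (0, 1):
    print(hex(select_branch_free(something != 0, msg2, msg1)))
```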

Quote:

Originally Posted by meynaf

This is for specific implementations. Change to something different, and the hint bit becomes useless and gets ignored.

Yes, some CPU designs may choose not to use the branch hint bit and for others it may be the only branch prediction help as it is so cheap. Although many programmers will choose not to use a hint bit or optimize to this level, others may. I see PPC hint bits in much of Frank Wille's PPC assembler code for vbcc and most modern PPC processors don't use them. Amiga programmers may be more likely to use a hint bit because we like optimized code and we generally use slower processors and often in assembler.

Quote:

Originally Posted by meynaf

For me the logic is to use the small size, i.e. MVZ.B, SWAP.W. But in some cases the 68k defeats this (EXT.L).
Anyway for me a better name would have been SWW (swap words).

Yes, this is a source of inconsistency for me. I prefer to look at the result size. How can you tell new programmers to use longword sizes when the result does not match the instruction size?

SWW.L would not be bad but REVW.L is more understandable, IMO. My preferred syntax is very clear at least. There may be too much use of REV with BREV.L, REVB.L and REVW.L though. Maybe REVB.L could be EREV.L or ESWAP.L for endian reverse or endian swap. I don't know. The originals aren't horrible either even if BYTEREV is a little long.
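
Whatever the names end up being, the three operations under discussion are easy to pin down. A Python sketch of the 32 bit register results (the function names are descriptive placeholders, not proposed mnemonics):

```python
MASK32 = 0xFFFFFFFF


def swap_words(d):
    """SWAP: exchange the upper and lower 16 bit halves."""
    return ((d << 16) | (d >> 16)) & MASK32


def reverse_bytes(d):
    """BYTEREV / REVB.L: reverse the order of the four bytes."""
    return int.from_bytes(d.to_bytes(4, "big"), "little")


def reverse_bits(d):
    """BITREV / BREV.L: reverse the order of all 32 bits."""
    return int(format(d, "032b")[::-1], 2)


d0 = 0x12345678
print(hex(swap_words(d0)), hex(reverse_bytes(d0)), hex(reverse_bits(d0)))
```

All three are their own inverse, which is part of why both "swap" and "reverse" sound plausible for them.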

Quote:

Originally Posted by meynaf

We have longword indexes, so no problem here.
If compatibility weren't an issue i'd use Dn.B, Dn.W, Dn.L, An.L instead of Dn.W, Dn.L, An.W, An.L.

That would have made sense considering the upper half of An is almost always already sign extended as needed.

Quote:

Originally Posted by meynaf

But why want to move the 68k to 64 bit ? Just because everyone else does something, we should do it too ?

More addressing space is really the only good reason to move to 64 bit and there are other ways of working around this issue. I guess Gunnar would answer, "so I have 64 bit registers for my SIMD instructions." What happens if the SIMD gets floating point or grows to 128 bit registers though?

Quote:

Originally Posted by meynaf

Indeed. Now consider and.l #$1ff,d0 vs bfextu d0{23:9},d0.

The AND.L is simpler and what a compiler is most likely going to use. The AND.L #d16.w,Dn addressing mode is the same size as BFEXTU but it may be faster on some CPU implementations.
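
The equivalence of the two forms can be checked with a small Python model of BFEXTU on a data register. Bit field offsets count from the most significant bit, so {23:9} selects the low 9 bits. The `bfextu` function below is a sketch of the register case only and ignores condition codes.

```python
MASK32 = 0xFFFFFFFF


def bfextu(d, offset, width):
    """Model BFEXTU Dn{offset:width} on a 32 bit data register.

    Offset 0 is bit 31. Fields that run past bit 0 wrap around to bit 31,
    which is handled here by rotating the field to the top of the register.
    """
    if not 1 <= width <= 32:
        raise ValueError("width must be 1..32")
    offset &= 31
    rotated = ((d << offset) | (d >> (32 - offset))) & MASK32
    return rotated >> (32 - width)


d0 = 0x12345678
print(hex(bfextu(d0, 23, 9)), hex(d0 & 0x1FF))  # both give the low 9 bits
```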

Quote:

Originally Posted by meynaf

The addressing mode doesn't use less encoding space. Why would it ?

I guess it depends on how you look at it. The addressing mode doesn't use much valuable encoding space but I guess you could say it uses a little encoding space in every instruction with an EA.

Quote:

Originally Posted by meynaf

Perhaps it could be a good idea to have a look in actual code to find some short immediate use cases.

There are 2 more of these slots which I suggested designating as immediates 1 and -1.

I played with your mix and it is trickier than it first looks. It might have been a good algorithm for the programming competition. I came up with the following.

Code:

mix:
; d0 = mask (trashed)
; d1 = number 1 (mixed result 1)
; d2 = number 2 (mixed result 2)
; d3 = scratch
   move.l d0,d3 ; pOEP
   and.l d1,d3 ; sOEP
   and.l d2,d0 ; pOEP
   eor.l d3,d2 ; sOEP
   eor.l d0,d1 ; pOEP
   eor.l d0,d2 ; sOEP
   eor.l d3,d1 ; pOEP

Is this the correct operation? Is there shorter/faster/better code for it? What types of algorithms is this used for?
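
One way to check the sequence is to model it in Python. The sketch below steps through the seven instructions and compares the result with a plain "exchange the bits of the two numbers wherever the mask is set" definition. That definition is my reading of what the code computes, not a confirmed statement of what mix is meant to be. The third function is the well known masked XOR swap, which gives the same result with one temporary.

```python
MASK32 = 0xFFFFFFFF


def mix_seven(mask, a, b):
    """Step through the seven instruction sequence, one line per instruction."""
    d0, d1, d2 = mask, a, b
    d3 = d0      # move.l d0,d3
    d3 &= d1     # and.l d1,d3
    d0 &= d2     # and.l d2,d0
    d2 ^= d3     # eor.l d3,d2
    d1 ^= d0     # eor.l d0,d1
    d2 ^= d0     # eor.l d0,d2
    d1 ^= d3     # eor.l d3,d1
    return d1, d2


def mix_reference(mask, a, b):
    """Assumed meaning: exchange the bits of a and b where mask is set."""
    keep = ~mask & MASK32
    return (a & keep) | (b & mask), (b & keep) | (a & mask)


def mix_xor_swap(mask, a, b):
    """Masked XOR swap: t = (a ^ b) & mask, then flip those bits in both."""
    t = (a ^ b) & mask
    return a ^ t, b ^ t


result = mix_seven(0x0000FFFF, 0x11112222, 0x33334444)
print(hex(result[0]), hex(result[1]))
```

If that reading is right, the XOR swap form maps to five instructions (move.l d1,d3 / eor.l d2,d3 / and.l d0,d3 / eor.l d3,d1 / eor.l d3,d2) and leaves the mask in d0 intact, though I have not checked how it pairs.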

I'm not a fan of the MIX EA,Rn:Rn syntax as Rn:Rn should be reserved for a high:low 64 bit value. I would just go for MIX EA,Rn,Rn. The ColdFire tried to limit instructions to 2 OPs also which is why it ended up with REM(S/U) EA,Dr:Dq when there is no 64 bit operation. It is confusing and didn't work for 64 bit REMS/REMU which is one of the reasons I created DIVUR/DIVSR. It also keeps me from using an alias of REMS->DIVSR and REMU->DIVUR like you suggested.

buggs · 03 August 2016, 07:54

This one can be fun. Off the top of my head, two things I'd be happy to see:

1. cmove is one instruction I came to enjoy while away from the Amiga. Conditional moves instead of the usual 68k bcc, move combination are quite convenient, at least in Asm code. Edit: found SELcc
2. One of the favorite toys of you Apollo guys is missing from the PDF: the SIMD stuff.

Otherwise, thank you for the update regarding the ISA. Which brings me to one more thing: How can one reliably detect a CPU with the feature set in question (at different core development levels)? Do we have to probe instruction by instruction or is there already something like "CPUID"?

Knocker · 03 August 2016, 18:30

I think it would be very useful to have a conditional BSR, "BSRcc". Essentially a Bcc where you can do rts to get back. Would allow for much cleaner code. Or can you use a macro with some local labels to do this somehow?
