English Amiga Board - View Single Post

meynaf · 19 March 2021, 16:34

Quote:

Originally Posted by a/b

Hmmm... Good point, FPU :\. Let me poke around...
Bad news, FPU regs are saved last and loaded first, and their stack frame has variable size.

Sigh. I expected this.

Quote:

Originally Posted by Thomas Richter

It's not documented, which means "hands off!".

A warning sign will not stop me if that's the only available solution.
At least, can I expect the stored PC to always be at the same place?
It's not as if this part of the OS will change overnight.

Quote:

Originally Posted by Thomas Richter

You can at least avoid a stack push/pop if the called functions are tiny by loading the "continue from here" location into a register and replace the JSR by JMP, and the RTS (for small functions) by a JMP (return). For longer functions, you can still make a "move.l return,-(a7)" and have a RTS at the end. Thus, at least tiny "no-op" functions do not need to touch memory or the stack. Some heavy-duty functions of P96 work this way.

You mean ending every function with something like JMP (A3), with A3 pointing to bigloop? That does not solve the problem that A3 is not accessible from other contexts.
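
For illustration, a minimal sketch of that scheme as I read it, built around the fetch/dispatch loop quoted elsewhere in this thread. The choice of A3, the handler name r1 and the cleared upper word of d7 are assumptions, not something stated in the suggestion:

Code:

    lea     bigloop(pc),a3      ; a3 = "continue from here" address (register choice assumed)
bigloop:
    move.w  (a6)+,d7            ; fetch next opcode (upper word of d7 assumed to be zero)
    jmp     ([a4,d7.l*4])       ; dispatch through the jump table at a4

r1:                             ; hypothetical tiny handler
    ; ... handler body ...
    jmp     (a3)                ; back to bigloop, no stack access

Redirecting the loop would mean changing a3, which lives only in the running code's register set, hence the objection above.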

Quote:

Originally Posted by Thomas Richter

War time story: GFA basic had a similar (mis-)design where the basic token interpreter was interrupted from an input.device event handler by hot-patching the interpreter loop (self-modifying code). Needless to say, the whole thing broke with processors having data and code caches.

I guess I could call CacheClearE() to handle this, but I'd prefer to avoid SMC altogether.

Quote:

Originally Posted by a/b

Well, another stupid idea (OS lovers rejoice!): install your exception vector for illegal instructions, get task's loadseg list and locate your code, put a juicy illegal into main loop (+icache nuke), and now you have access to all the registers and don't have to check any flags at all.

I'd prefer to avoid SMC if possible, but if I went this way I'd rather patch in a BRA.B than an ILLEGAL.
A problem here is that the caches get trashed twice: once to install the temporary patch, and once more to restore the normal instruction. And this isn't good for performance.
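
A rough sketch of what the two patches would look like, to show where the two flushes come from. patch_point, saved_opcode and PATCH_DISP are hypothetical names, and whether this sequence is safe from the interrupting context is not checked here:

Code:

_LVOCacheClearE equ -642            ; exec.library V37+
CACRF_ClearI    equ 1<<3            ; from exec/execbase.i
CACRF_ClearD    equ 1<<11

; install the temporary patch
    move.l  4.w,a6                  ; ExecBase (a6 of this context, not the interpreter's)
    lea     patch_point,a0          ; hypothetical: instruction word to overwrite
    move.w  (a0),saved_opcode       ; remember the original instruction word
    move.w  #$6000+PATCH_DISP,(a0)  ; BRA.B with a hypothetical 8-bit displacement
    moveq   #2,d0                   ; length in bytes
    move.l  #CACRF_ClearI|CACRF_ClearD,d1
    jsr     _LVOCacheClearE(a6)     ; first flush

; later, restore the normal instruction
    lea     patch_point,a0
    move.w  saved_opcode,(a0)
    moveq   #2,d0
    move.l  #CACRF_ClearI|CACRF_ClearD,d1
    jsr     _LVOCacheClearE(a6)     ; second flush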

Quote:

Originally Posted by robinsonb5

Is there a spare entry in your jumptable? Or can you add a -1th entry?

Not -1, as that's opcode $ffff, but I guess I could find a spare one.

Quote:

Originally Posted by robinsonb5

If so, point that spare entry to bigloop, and jump to it at the end of every function. You could then override it in your interrupt / other task, and restore it at the end of your special handling functions?

This means reloading the address to jump to at the end of every function.
It's like doing this for every executed instruction:

Code:

 move.l loopaddr(a5),a0
 jmp (a0)

That will add more clocks to every instruction than I can afford...

Quote:

Originally Posted by grond

If you are always doing the JSR from the same place, your minimal interrupt handler could replace the return address on the stack with the entry address of the extended service code such that the JSR will not return to the caller PC but enter the extended service code. The extended service code would then just JMP back into the main loop when it has done its work.

This means I must be sure the return address is there, but it won't be if we get interrupted during the execution of the fetch or jmp instructions in the main loop.

Quote:

Originally Posted by NorthWay

Not sure I know myself. It was more or less about spacing out the opcodes every N bytes and shifting the (a6) bits to get a direct address to jump to (possibly needing some rather memory-hungry address alignment - every opcode starts 2^N bytes apart).
If the code is dense then it could have been a series of 4-byte branch opcodes (possibly not faster though) if you could drop bits or shift bits so it only was 14 bits.

Branching through a 4-byte branch would indeed add too many clocks.
But branching directly to regularly spaced code? That would make the code quite a lot bigger. Not a problem in itself, but the icache won't like it.
If the opcodes were 14 bits (they're not), it would mean 32 bytes per opcode. Enough for simple instructions, not enough for complex ones.
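
A minimal sketch of what such a direct dispatch could look like with 32-byte slots. Using a4 as the base of the handler area is an assumption; the other registers follow the loop quoted elsewhere in this thread:

Code:

bigloop:
    moveq   #0,d7               ; clear the upper word left over from the previous shift
    move.w  (a6)+,d7            ; fetch opcode
    lsl.l   #5,d7               ; opcode * 32 = byte offset of its handler slot
    jmp     (a4,d7.l)           ; a4 = base of the handler area (assumption)

With 32-byte slots, 2^14 opcodes would span 512 KB and a full 16-bit opcode space 2 MB, which is where the icache concern comes from.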

Quote:

Originally Posted by NorthWay

I'm just rambling. Orthogonality, spacing, massaged data to make it fit the problem. Is the problem set in stone, or can you modify it to make it easier to handle?

It's more or less set in stone.

Quote:

Originally Posted by a/b

Expanding on robinsonb5's nice idea, if you have a spare register:

Code:

    lea    (main_ptr),a5
bigloop:
    move.w    (a6)+,d7
    jmp    ([a4,d7.l*4])

main_ptr:
    DC.L    bigloop

r1:
...
    jmp    ([a5])
r2:
...
    jmp    ([a5])

When you want to break out, simply change main_ptr and you don't have to mess with the registers. Is this faster than tst/bra combo on 020? It's not as fast as bra/dbf, but shouldn't be much behind.

I guess a spare register can be found.
However, I can't really count clocks for these, and I can't check on real hardware. The docs I have are unclear about the timing of jmp ([An]).