fast interpreting technique(s)

meynaf · 19 March 2021, 11:54

This time it's about optimising code to death.
Little care about the size, it's all about speed. It can't be fast enough.
The goal is to execute code depending on some values read from a stream.
See it as some kind of cpu emulation.

Bare minimum is 68020, as 68000 is too lacking (too slow overall, no scale factor, lots of misaligned accesses). Can not be OS killing (needs HD access, has to multitask cleanly, etc).

Somewhat "naïve" implementation :

Code:

; main loop
bigloop
 move.w (a6)+,d7
 move.l (a4,d7.l*4),a0
 jsr (a0)
 tst.b stopflag(a5)
 beq.s bigloop
; code here treats special cases
spec

; jsr (a0) points in many, many routines looking like this :
one_i
*** do something here ***
 rts

Maybe better but problematic :

Code:

bigloop
 move.w (a6)+,d7
 jmp ([a4,d7.l*4])

one_i
*** do something here ***
 dbf d6,bigloop
 bra spec

The trick is that i must be able to conditionnally stop execution to handle things such as interrupt requests (trace mode too). As doing a check for every instruction would slow the bloody thing down.

First version above is ok, i can just set the flag from whereever i need to. But the tst.b/beq pair isn't the fastest way of doing it.
Second version is non ok, as d6 isn't available from 68k interrupt or another task. Or maybe it is ? Is there a trick i could use to access a task register from elsewhere ?

So the question is : how to perform all this, in the fastest possible way ?
All ideas are welcome.

NorthWay · 19 March 2021, 12:33

Is "(a6)+" a full 16-bit word, or are there any unused top/bottom bits?
How much memory can you spend?
How big are the 'opcode' subrutines? The biggest one?

I reworked a 6502 emulator (much easier with only byte opcodes), but one of the things it does is to have the dispatch for next opcode duplicated in every jump target.
If you insist on a central loop then perhaps have two sets of jumptables, and have the opcode that is escaping change your base pointer for the jumptable (as long as a6 can be determined what it was for the previous instruction).

meynaf · 19 March 2021, 13:02

Quote:

Originally Posted by NorthWay

Is "(a6)+" a full 16-bit word, or are there any unused top/bottom bits?

No unused bits, it's full 16-bit word.
Top 16-bit of d7 are free (currently =0) and i can arrange to use word index if that's useful.

There are, however, patterns that will lead to the same routine. Lower bits are often parameters, but not always. You can consider that the situation is similar to that of 68k, so ask yourself what you would do if you had to emulate another 68k model.

Out of curiosity, if it were the case that i had free bits available, what benefit could have come out of it ?

Quote:

Originally Posted by NorthWay

How much memory can you spend?

I already have a 256kb table. A second 256kb is acceptable, but i'd avoid going further if possible. More is only acceptable if there is tremendous benefit.

Quote:

Originally Posted by NorthWay

How big are the 'opcode' subrutines? The biggest one?

That's very variable. More than 8 bytes for sure. Average around 16. Biggest ones can reach 50+ bytes, i'm not sure - but they can probably use subroutines.

Quote:

Originally Posted by NorthWay

I reworked a 6502 emulator (much easier with only byte opcodes), but one of the things it does is to have the dispatch for next opcode duplicated in every jump target.
If you insist on a central loop then perhaps have two sets of jumptables, and have the opcode that is escaping change your base pointer for the jumptable (as long as a6 can be determined what it was for the previous instruction).

I don't insist of having a central loop. The jump for next instruction can be done on the previous one if that's better.
However what must be central is the handling of all irq-like stuff.

The main issue here is that it's not necessarily an opcode that will be escaping. It can come from outside (an interrupt or another task).
For example, there can be something to do every vbl, or on mouse move, etc.

a/b · 19 March 2021, 13:09

Quote:

Originally Posted by meynaf

Second version is non ok, as d6 isn't available from 68k interrupt or another task. Or maybe it is ? Is there a trick i could use to access a task register from elsewhere ?

If you are in a task, then you should be able to safely assume the other task is not running, and if you Forbid() you can change its regs on its local stack via task->tc_SPReg (maybe put something unique in top 16-bits of d6 to make it easier to locate), and once it's rescheduled it will have all the regs loaded from stack.
Interrupts... maybe if you check exec->ThisTask to make sure it's not running, but in general very tricky to do 'right now', as far as I'd guess.

meynaf · 19 March 2021, 13:18

Quote:

Originally Posted by a/b

If you are in a task, then you should be able to safely assume the other task is not running, and if you Forbid() you can change its regs on its local stack via task->tc_SPReg (maybe put something unique in top 16-bits of d6 to make it easier to locate), and once it's rescheduled it will have all the regs loaded from stack.

Does the location of registers change in different versions of the OS ?
I don't think using a "magic number" is very reliable, registers could be anything.

Quote:

Originally Posted by a/b

Interrupts... maybe if you check exec->ThisTask to make sure it's not running, but in general very tricky to do 'right now', as far as I'd guess.

I can eventually Signal() a task of a higher priority from an interrupt so it will be done in that other task - so if a task can do it, an interrupt can do it too.

a/b · 19 March 2021, 13:37

Quote:

Originally Posted by meynaf

Does the location of registers change in different versions of the OS ?
I don't think using a "magic number" is very reliable, registers could be anything.

Task stack always contains (ascending adresses): pc, sr, d0-a6. So d6 should be at +6+6*4.
Magic number just to verify your code works, then you can get rid of it.

About interrupts... I wasn't sure how responsive it has to be. If you can activate your control task (signal, raising its priority temporarily, ..) and do it in the next time slice, yeah it all good. But, as I said, if you need it 'right now', dunno. Can't mess around with stacks too much ;P. Maybe you interrupted a lower level interrupt that interrupted your work task, etc.

meynaf · 19 March 2021, 13:53

Quote:

Originally Posted by a/b

Task stack always contains (ascending adresses): pc, sr, d0-a6. So d6 should be at +6+6*4.
Magic number just to verify your code works, then you can get rid of it.

Even if there are FPU registers there too ?
If yes, that's good news

Now only thing to take care of, is to not call any OS routine when this feature is active - and perhaps find a way to not kill my debugger as well when it's there and i'm hunting an error.

Quote:

Originally Posted by a/b

About interrupts... I wasn't sure how responsive it has to be. If you can activate your control task (signal, raising its priority temporarily, ..) and do it in the next time slice, yeah it all good. But, as I said, if you need it 'right now', dunno. Can't mess around with stacks too much ;P. Maybe you interrupted a lower level interrupt that interrupted your work task, etc.

Usually it's not a problem : can be vbl or some input. This can be delayed somehow.
Else if it really must be "right now" then so be it : i can spawn another, temporary instance of the emulation and do it from within the interrupt. It's just that instructions (of the other side) will cease to be atomic.

a/b · 19 March 2021, 14:04

Hmmm... Good point, FPU :\. Let me poke around...
Bad news, FPU regs are saved last and loaded first, and their stack frame has variable size.

Thomas Richter · 19 March 2021, 14:38

Quote:

Originally Posted by meynaf

Does the location of registers change in different versions of the OS ?

It's not documented, which means "hands off!".

Thomas Richter · 19 March 2021, 14:50

Quote:

Originally Posted by meynaf

So the question is : how to perform all this, in the fastest possible way ?
All ideas are welcome.

You can at least avoid a stack push/pop if the called functions are tiny by loading the "continue from here" location into a register and replace the JSR by JMP, and the RTS (for small functions) by a JMP (return). For longer functions, you can still make a "move.l return,-(a7)" and have a RTS at the end. Thus, at least tiny "no-op" functions do not need to touch memory or the stack. Some heavy-duty functions of P96 work this way.

War time story: GFA basic had a similar (mis-)design where the basic token interpreter was interrupted from an input.device event handler by hot-patching the interpreter loop (self-modifying code). Needless to say, the whole thing broke with processors having data and code caches.

a/b · 19 March 2021, 15:04

Well, another stupid idea (OS lovers rejoice!): install your exception vector for illegal instructions, get task's loadseg list and locate your code, put a juicy illegal into main loop (+icache nuke), and now you have access to all the registers and don't have to check any flags at all.

robinsonb5 · 19 March 2021, 15:04

Is there a spare entry in your jumptable? Or can you add a -1th entry?

If so, point that spare entry to bigloop, and jump to it at the end of every function. You could then override it in your interrupt / other task, and restore it at the end of your special handling functions?

grond · 19 March 2021, 15:33

If you are always doing the JSR from the same place, your minimal interrupt handler could replace the return address on the stack with the entry address of the extended service code such that the JSR will not return to the caller PC but enter the extended service code. The extended service code would then just JMP back into the main loop when it has done its work.

NorthWay · 19 March 2021, 15:34

Quote:

Originally Posted by meynaf

Out of curiosity, if it were the case that i had free bits available, what benefit could have come out of it ?

Not sure I know myself. It was more or less about spacing out the opcodes every N bytes and shifting the (a6) bits to get a direct address to jump to (possibly needing some rather memory-hungry address alignment - every opcode starts 2^N bytes apart).
If the code is dense then it could have been a series of 4-byte branch opcodes (possibly not faster though) if you could drop bits or shift bits so it only was 14 bits.

I'm just rambling. Orthogonality, spacing, massaged data to make it fit the problem. Is the problem set in stone, or can you modify it to make it easier to handle?

a/b · 19 March 2021, 15:54

Expanding on robinsonb5's nice idea, if you have a spare register:

Code:

	lea	(main_ptr),a5
bigloop:
	move.w	(a6)+,d7
	jmp	([a4,d7.l*4])

main_ptr:
	DC.L	bigloop

r1:
...
	jmp	([a5])
r2:
...
	jmp	([a5])

When you want to break out, simply change main_ptr and you don't have to mess with the registers. Is this faster than tst/bra combo on 020? It's not as fast as bra/dbf, but shouldn't be much behind.

meynaf · 19 March 2021, 16:34

Quote:

Originally Posted by a/b

Hmmm... Good point, FPU :\. Let me poke around...
Bad news, FPU regs are saved last and loaded first, and their stack frame has variable size.

Sigh. I expected this.

Quote:

Originally Posted by Thomas Richter

It's not documented, which means "hands off!".

A warning sign will not stop me if that's the only available solution.
At least, should I expect the stored PC to be at the same place ?
It's not as if this part of the OS will change overnight.

Quote:

Originally Posted by Thomas Richter

You can at least avoid a stack push/pop if the called functions are tiny by loading the "continue from here" location into a register and replace the JSR by JMP, and the RTS (for small functions) by a JMP (return). For longer functions, you can still make a "move.l return,-(a7)" and have a RTS at the end. Thus, at least tiny "no-op" functions do not need to touch memory or the stack. Some heavy-duty functions of P96 work this way.

You mean, end all functions with something JMP (A3) with A3 leading to bigloop ? That does not solve the problem of A3 not available from other contexts.

Quote:

Originally Posted by Thomas Richter

War time story: GFA basic had a similar (mis-)design where the basic token interpreter was interrupted from an input.device event handler by hot-patching the interpreter loop (self-modifying code). Needless to say, the whole thing broke with processors having data and code caches.

I guess i could call CacheClearE() to handle this, but i'd prefer to avoid SMC altogether.

Quote:

Originally Posted by a/b

Well, another stupid idea (OS lovers rejoice!): install your exception vector for illegal instructions, get task's loadseg list and locate your code, put a juicy illegal into main loop (+icache nuke), and now you have access to all the registers and don't have to check any flags at all.

I'd prefer to avoid SMC if possible, but if following this way i'd rather put some BRA.B instead of ILLEGAL.
A problem here is that caches will be trashed twice (first time to install the temporary patch, the other to restore normal instruction). And this isn't good for performance.

Quote:

Originally Posted by robinsonb5

Is there a spare entry in your jumptable? Or can you add a -1th entry?

Not -1 as it's $ffff opcode, but i guess i could find a spare one.

Quote:

Originally Posted by robinsonb5

If so, point that spare entry to bigloop, and jump to it at the end of every function. You could then override it in your interrupt / other task, and restore it at the end of your special handling functions?

This means reloading an address to jump to at the end of every function.
It's like doing this for every executed instruction :

Code:

 move.l loopaddr(a5),a0
 jmp (a0)

That will add more clocks to every instruction than i wish to afford...

Quote:

Originally Posted by grond

If you are always doing the JSR from the same place, your minimal interrupt handler could replace the return address on the stack with the entry address of the extended service code such that the JSR will not return to the caller PC but enter the extended service code. The extended service code would then just JMP back into the main loop when it has done its work.

This means i must be sure the return address is there, but it won't if we got interrupted during the execution of the fetch or jmp instructions in the main loop.

Quote:

Originally Posted by NorthWay

Not sure I know myself. It was more or less about spacing out the opcodes every N bytes and shifting the (a6) bits to get a direct address to jump to (possibly needing some rather memory-hungry address alignment - every opcode starts 2^N bytes apart).
If the code is dense then it could have been a series of 4-byte branch opcodes (possibly not faster though) if you could drop bits or shift bits so it only was 14 bits.

Branching on a 4-byte branch would indeed add too many clocks.
But branching directly on regularly spaced code ? That would make the code quite a lot bigger. Not a problem by itself, but icache won't like this.
If the code was 14 bit (it's not), it would mean 32 bytes per opcode. Enough for simple instructions, not enough for complex ones.

Quote:

Originally Posted by NorthWay

I'm just rambling. Orthogonality, spacing, massaged data to make it fit the problem. Is the problem set in stone, or can you modify it to make it easier to handle?

It's more or less set in stone.

Quote:

Originally Posted by a/b

Expanding on robinsonb5's nice idea, if you have a spare register:

Code:

    lea    (main_ptr),a5
bigloop:
    move.w    (a6)+,d7
    jmp    ([a4,d7.l*4])

main_ptr:
    DC.L    bigloop

r1:
...
    jmp    ([a5])
r2:
...
    jmp    ([a5])

When you want to break out, simply change main_ptr and you don't have to mess with the registers. Is this faster than tst/bra combo on 020? It's not as fast as bra/dbf, but shouldn't be much behind.

I guess a spare register can be found.
However i can't really count clocks for these, and i can't check on hw. Docs i have are unclear about timing for jmp ([An]).

robinsonb5 · 19 March 2021, 17:04

Quote:

Originally Posted by meynaf

This means reloading an address to jump to at the end of every function.
It's like doing this for every executed instruction :

Code:

 move.l loopaddr(a5),a0
 jmp (a0)

That will add more clocks to every instruction than i wish to afford...

OK, how about:

Code:

  lea bigloop,a0
  move.l a0,-(a7)
  move.l a7,retptr
bigloop
  move.w (a6)+,d7
  jmp ([a4,d7.l*4])

retptr:
  dc.l 0

one_i
 *** do something here ***
  jmp (a7)

The longword at retptr points to the top stack entry. Since you're entering and leaving functions with jmp rather than jsr/rts the stack pointer is constant at the critical moments.

You can then use retptr to modify that stack entry from interrupts / other task, and once again restore it from your special function.

meynaf · 19 March 2021, 17:18

Quote:

Originally Posted by robinsonb5

OK, how about:

Code:

  lea bigloop,a0
  move.l a0,-(a7)
  move.l a7,retptr
bigloop
  move.w (a6)+,d7
  jmp ([a4,d7.l*4])

retptr:
  dc.l 0

one_i
 *** do something here ***
  jmp (a7)

The longword at retptr points to the top stack entry. Since you're entering and leaving functions with jmp rather than jsr/rts the stack pointer is constant at the critical moments.

You can then use retptr to modify that stack entry from interrupts / other task, and once again restore it from your special function.

I don't understand what you want to do here.
As

jmp (a7)

will not reload retptr so altering it has no effect, and, worse, said jmp will jump to code that's located in the stack.
But maybe you mean

jmp ([a7])

instead ?
In this case, it's the same as

jmp ([a3])

or similar, it just leaves one extra register available.
Note : your two first instructions can be replaced by

pea bigloop

robinsonb5 · 19 March 2021, 17:23

Quote:

Originally Posted by meynaf

But maybe you mean

jmp ([a7])

instead ?

Sorry, yes, that's what I intended - essentially rts without popping the value.

meynaf · 19 March 2021, 17:58

Quote:

Originally Posted by robinsonb5

Sorry, yes, that's what I intended - essentially rts without popping the value.

What is the timing of this indirect jmp ? I can't evaluate it precisely.

19 March 2021, 11:54	#1
meynaf son of 68k Join Date: Nov 2007 Location: Lyon / France Age: 51 Posts: 5,322	fast interpreting technique(s) This time it's about optimising code to death. Little care about the size, it's all about speed. It can't be fast enough. The goal is to execute code depending on some values read from a stream. See it as some kind of cpu emulation. Bare minimum is 68020, as 68000 is too lacking (too slow overall, no scale factor, lots of misaligned accesses). Can not be OS killing (needs HD access, has to multitask cleanly, etc). Somewhat "naïve" implementation : Code: ; main loop bigloop move.w (a6)+,d7 move.l (a4,d7.l4),a0 jsr (a0) tst.b stopflag(a5) beq.s bigloop ; code here treats special cases spec ; jsr (a0) points in many, many routines looking like this : one_i do something here * rts Maybe better but problematic : Code: bigloop move.w (a6)+,d7 jmp ([a4,d7.l4]) one_i do something here * dbf d6,bigloop bra spec The trick is that i must be able to conditionnally stop execution to handle things such as interrupt requests (trace mode too). As doing a check for every instruction would slow the bloody thing down. First version above is ok, i can just set the flag from whereever i need to. But the tst.b/beq pair isn't the fastest way of doing it. Second version is non ok, as d6 isn't available from 68k interrupt or another task. Or maybe it is ? Is there a trick i could use to access a task register from elsewhere ? So the question is : how to perform all this, in the fastest possible way ? All ideas are welcome.

19 March 2021, 15:54	#15
a/b Registered User Join Date: Jun 2016 Location: europe Posts: 1,038	Expanding on robinsonb5's nice idea, if you have a spare register: Code: lea (main_ptr),a5 bigloop: move.w (a6)+,d7 jmp ([a4,d7.l*4]) main_ptr: DC.L bigloop r1: ... jmp ([a5]) r2: ... jmp ([a5]) When you want to break out, simply change main_ptr and you don't have to mess with the registers. Is this faster than tst/bra combo on 020? It's not as fast as bra/dbf, but shouldn't be much behind.

Similar Threads
Thread	Thread Starter	Forum	Replies	Last Post
Ripping Sprites - Technique...	method	project.Sprites	43	12 October 2021 16:17
Profiling C code, interpreting results	Ernst Blofeld	Coders. C/C++	5	19 November 2020 18:45
Interpreting DMA-Debugger output	selco	support.WinUAE	10	27 November 2019 20:48
Amazing New Retrobrighting Technique	Hewitson	Retrogaming General Discussion	12	12 June 2019 09:27
Error while interpreting script	Makkinen	support.Apps	1	15 October 2004 15:58

19 March 2021, 12:33	#2
NorthWay Registered User Join Date: May 2013 Location: Grimstad / Norway Posts: 839	Is "(a6)+" a full 16-bit word, or are there any unused top/bottom bits? How much memory can you spend? How big are the 'opcode' subrutines? The biggest one? I reworked a 6502 emulator (much easier with only byte opcodes), but one of the things it does is to have the dispatch for next opcode duplicated in every jump target. If you insist on a central loop then perhaps have two sets of jumptables, and have the opcode that is escaping change your base pointer for the jumptable (as long as a6 can be determined what it was for the previous instruction).

19 March 2021, 14:04	#8
a/b Registered User Join Date: Jun 2016 Location: europe Posts: 1,038	Hmmm... Good point, FPU :\. Let me poke around... Bad news, FPU regs are saved last and loaded first, and their stack frame has variable size.

19 March 2021, 15:04	#11
a/b Registered User Join Date: Jun 2016 Location: europe Posts: 1,038	Well, another stupid idea (OS lovers rejoice!): install your exception vector for illegal instructions, get task's loadseg list and locate your code, put a juicy illegal into main loop (+icache nuke), and now you have access to all the registers and don't have to check any flags at all.

19 March 2021, 15:04	#12
robinsonb5 Registered User Join Date: Mar 2012 Location: Norfolk, UK Posts: 1,153	Is there a spare entry in your jumptable? Or can you add a -1th entry? If so, point that spare entry to bigloop, and jump to it at the end of every function. You could then override it in your interrupt / other task, and restore it at the end of your special handling functions?

19 March 2021, 15:33	#13
grond Registered User Join Date: Jun 2015 Location: Germany Posts: 1,918	If you are always doing the JSR from the same place, your minimal interrupt handler could replace the return address on the stack with the entry address of the extended service code such that the JSR will not return to the caller PC but enter the extended service code. The extended service code would then just JMP back into the main loop when it has done its work.

Currently Active Users Viewing This Thread: 1 (0 members and 1 guests)