English Amiga Board


Go Back   English Amiga Board > Coders > Coders. Asm / Hardware

 
 
Thread Tools
Old 19 March 2021, 11:54   #1
meynaf
son of 68k
 
meynaf's Avatar
 
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,322
fast interpreting technique(s)

This time it's about optimising code to death.
Little care about the size, it's all about speed. It can't be fast enough.
The goal is to execute code depending on some values read from a stream.
See it as some kind of cpu emulation.

Bare minimum is 68020, as 68000 is too lacking (too slow overall, no scale factor, lots of misaligned accesses). Can not be OS killing (needs HD access, has to multitask cleanly, etc).

Somewhat "naïve" implementation :
Code:
; main loop
bigloop
 move.w (a6)+,d7
 move.l (a4,d7.l*4),a0
 jsr (a0)
 tst.b stopflag(a5)
 beq.s bigloop
; code here treats special cases
spec

; jsr (a0) points in many, many routines looking like this :
one_i
*** do something here ***
 rts
Maybe better but problematic :
Code:
bigloop
 move.w (a6)+,d7
 jmp ([a4,d7.l*4])

one_i
*** do something here ***
 dbf d6,bigloop
 bra spec
The trick is that i must be able to conditionnally stop execution to handle things such as interrupt requests (trace mode too). As doing a check for every instruction would slow the bloody thing down.

First version above is ok, i can just set the flag from whereever i need to. But the tst.b/beq pair isn't the fastest way of doing it.
Second version is non ok, as d6 isn't available from 68k interrupt or another task. Or maybe it is ? Is there a trick i could use to access a task register from elsewhere ?

So the question is : how to perform all this, in the fastest possible way ?
All ideas are welcome.
meynaf is offline  
Old 19 March 2021, 12:33   #2
NorthWay
Registered User
 
Join Date: May 2013
Location: Grimstad / Norway
Posts: 839
Is "(a6)+" a full 16-bit word, or are there any unused top/bottom bits?
How much memory can you spend?
How big are the 'opcode' subrutines? The biggest one?

I reworked a 6502 emulator (much easier with only byte opcodes), but one of the things it does is to have the dispatch for next opcode duplicated in every jump target.
If you insist on a central loop then perhaps have two sets of jumptables, and have the opcode that is escaping change your base pointer for the jumptable (as long as a6 can be determined what it was for the previous instruction).
NorthWay is offline  
Old 19 March 2021, 13:02   #3
meynaf
son of 68k
 
meynaf's Avatar
 
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,322
Quote:
Originally Posted by NorthWay View Post
Is "(a6)+" a full 16-bit word, or are there any unused top/bottom bits?
No unused bits, it's full 16-bit word.
Top 16-bit of d7 are free (currently =0) and i can arrange to use word index if that's useful.

There are, however, patterns that will lead to the same routine. Lower bits are often parameters, but not always. You can consider that the situation is similar to that of 68k, so ask yourself what you would do if you had to emulate another 68k model.

Out of curiosity, if it were the case that i had free bits available, what benefit could have come out of it ?


Quote:
Originally Posted by NorthWay View Post
How much memory can you spend?
I already have a 256kb table. A second 256kb is acceptable, but i'd avoid going further if possible. More is only acceptable if there is tremendous benefit.


Quote:
Originally Posted by NorthWay View Post
How big are the 'opcode' subrutines? The biggest one?
That's very variable. More than 8 bytes for sure. Average around 16. Biggest ones can reach 50+ bytes, i'm not sure - but they can probably use subroutines.


Quote:
Originally Posted by NorthWay View Post
I reworked a 6502 emulator (much easier with only byte opcodes), but one of the things it does is to have the dispatch for next opcode duplicated in every jump target.
If you insist on a central loop then perhaps have two sets of jumptables, and have the opcode that is escaping change your base pointer for the jumptable (as long as a6 can be determined what it was for the previous instruction).
I don't insist of having a central loop. The jump for next instruction can be done on the previous one if that's better.
However what must be central is the handling of all irq-like stuff.

The main issue here is that it's not necessarily an opcode that will be escaping. It can come from outside (an interrupt or another task).
For example, there can be something to do every vbl, or on mouse move, etc.
meynaf is offline  
Old 19 March 2021, 13:09   #4
a/b
Registered User
 
Join Date: Jun 2016
Location: europe
Posts: 1,038
Quote:
Originally Posted by meynaf View Post
Second version is non ok, as d6 isn't available from 68k interrupt or another task. Or maybe it is ? Is there a trick i could use to access a task register from elsewhere ?
If you are in a task, then you should be able to safely assume the other task is not running, and if you Forbid() you can change its regs on its local stack via task->tc_SPReg (maybe put something unique in top 16-bits of d6 to make it easier to locate), and once it's rescheduled it will have all the regs loaded from stack.
Interrupts... maybe if you check exec->ThisTask to make sure it's not running, but in general very tricky to do 'right now', as far as I'd guess.
a/b is offline  
Old 19 March 2021, 13:18   #5
meynaf
son of 68k
 
meynaf's Avatar
 
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,322
Quote:
Originally Posted by a/b View Post
If you are in a task, then you should be able to safely assume the other task is not running, and if you Forbid() you can change its regs on its local stack via task->tc_SPReg (maybe put something unique in top 16-bits of d6 to make it easier to locate), and once it's rescheduled it will have all the regs loaded from stack.
Does the location of registers change in different versions of the OS ?
I don't think using a "magic number" is very reliable, registers could be anything.


Quote:
Originally Posted by a/b View Post
Interrupts... maybe if you check exec->ThisTask to make sure it's not running, but in general very tricky to do 'right now', as far as I'd guess.
I can eventually Signal() a task of a higher priority from an interrupt so it will be done in that other task - so if a task can do it, an interrupt can do it too.
meynaf is offline  
Old 19 March 2021, 13:37   #6
a/b
Registered User
 
Join Date: Jun 2016
Location: europe
Posts: 1,038
Quote:
Originally Posted by meynaf View Post
Does the location of registers change in different versions of the OS ?
I don't think using a "magic number" is very reliable, registers could be anything.
Task stack always contains (ascending adresses): pc, sr, d0-a6. So d6 should be at +6+6*4.
Magic number just to verify your code works, then you can get rid of it.

About interrupts... I wasn't sure how responsive it has to be. If you can activate your control task (signal, raising its priority temporarily, ..) and do it in the next time slice, yeah it all good. But, as I said, if you need it 'right now', dunno. Can't mess around with stacks too much ;P. Maybe you interrupted a lower level interrupt that interrupted your work task, etc.
a/b is offline  
Old 19 March 2021, 13:53   #7
meynaf
son of 68k
 
meynaf's Avatar
 
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,322
Quote:
Originally Posted by a/b View Post
Task stack always contains (ascending adresses): pc, sr, d0-a6. So d6 should be at +6+6*4.
Magic number just to verify your code works, then you can get rid of it.
Even if there are FPU registers there too ?
If yes, that's good news
Now only thing to take care of, is to not call any OS routine when this feature is active - and perhaps find a way to not kill my debugger as well when it's there and i'm hunting an error.


Quote:
Originally Posted by a/b View Post
About interrupts... I wasn't sure how responsive it has to be. If you can activate your control task (signal, raising its priority temporarily, ..) and do it in the next time slice, yeah it all good. But, as I said, if you need it 'right now', dunno. Can't mess around with stacks too much ;P. Maybe you interrupted a lower level interrupt that interrupted your work task, etc.
Usually it's not a problem : can be vbl or some input. This can be delayed somehow.
Else if it really must be "right now" then so be it : i can spawn another, temporary instance of the emulation and do it from within the interrupt. It's just that instructions (of the other side) will cease to be atomic.
meynaf is offline  
Old 19 March 2021, 14:04   #8
a/b
Registered User
 
Join Date: Jun 2016
Location: europe
Posts: 1,038
Hmmm... Good point, FPU :\. Let me poke around...
Bad news, FPU regs are saved last and loaded first, and their stack frame has variable size.
a/b is offline  
Old 19 March 2021, 14:38   #9
Thomas Richter
Registered User
 
Join Date: Jan 2019
Location: Germany
Posts: 3,214
Quote:
Originally Posted by meynaf View Post
Does the location of registers change in different versions of the OS ?
It's not documented, which means "hands off!".
Thomas Richter is offline  
Old 19 March 2021, 14:50   #10
Thomas Richter
Registered User
 
Join Date: Jan 2019
Location: Germany
Posts: 3,214
Quote:
Originally Posted by meynaf View Post
So the question is : how to perform all this, in the fastest possible way ?
All ideas are welcome.

You can at least avoid a stack push/pop if the called functions are tiny by loading the "continue from here" location into a register and replace the JSR by JMP, and the RTS (for small functions) by a JMP (return). For longer functions, you can still make a "move.l return,-(a7)" and have a RTS at the end. Thus, at least tiny "no-op" functions do not need to touch memory or the stack. Some heavy-duty functions of P96 work this way.



War time story: GFA basic had a similar (mis-)design where the basic token interpreter was interrupted from an input.device event handler by hot-patching the interpreter loop (self-modifying code). Needless to say, the whole thing broke with processors having data and code caches.
Thomas Richter is offline  
Old 19 March 2021, 15:04   #11
a/b
Registered User
 
Join Date: Jun 2016
Location: europe
Posts: 1,038
Well, another stupid idea (OS lovers rejoice!): install your exception vector for illegal instructions, get task's loadseg list and locate your code, put a juicy illegal into main loop (+icache nuke), and now you have access to all the registers and don't have to check any flags at all.
a/b is offline  
Old 19 March 2021, 15:04   #12
robinsonb5
Registered User
 
Join Date: Mar 2012
Location: Norfolk, UK
Posts: 1,153
Is there a spare entry in your jumptable? Or can you add a -1th entry?


If so, point that spare entry to bigloop, and jump to it at the end of every function. You could then override it in your interrupt / other task, and restore it at the end of your special handling functions?
robinsonb5 is offline  
Old 19 March 2021, 15:33   #13
grond
Registered User
 
Join Date: Jun 2015
Location: Germany
Posts: 1,918
If you are always doing the JSR from the same place, your minimal interrupt handler could replace the return address on the stack with the entry address of the extended service code such that the JSR will not return to the caller PC but enter the extended service code. The extended service code would then just JMP back into the main loop when it has done its work.
grond is offline  
Old 19 March 2021, 15:34   #14
NorthWay
Registered User
 
Join Date: May 2013
Location: Grimstad / Norway
Posts: 839
Quote:
Originally Posted by meynaf View Post
Out of curiosity, if it were the case that i had free bits available, what benefit could have come out of it ?
Not sure I know myself. It was more or less about spacing out the opcodes every N bytes and shifting the (a6) bits to get a direct address to jump to (possibly needing some rather memory-hungry address alignment - every opcode starts 2^N bytes apart).
If the code is dense then it could have been a series of 4-byte branch opcodes (possibly not faster though) if you could drop bits or shift bits so it only was 14 bits.

I'm just rambling. Orthogonality, spacing, massaged data to make it fit the problem. Is the problem set in stone, or can you modify it to make it easier to handle?
NorthWay is offline  
Old 19 March 2021, 15:54   #15
a/b
Registered User
 
Join Date: Jun 2016
Location: europe
Posts: 1,038
Expanding on robinsonb5's nice idea, if you have a spare register:
Code:
	lea	(main_ptr),a5
bigloop:
	move.w	(a6)+,d7
	jmp	([a4,d7.l*4])

main_ptr:
	DC.L	bigloop

r1:
...
	jmp	([a5])
r2:
...
	jmp	([a5])
When you want to break out, simply change main_ptr and you don't have to mess with the registers. Is this faster than tst/bra combo on 020? It's not as fast as bra/dbf, but shouldn't be much behind.
a/b is offline  
Old 19 March 2021, 16:34   #16
meynaf
son of 68k
 
meynaf's Avatar
 
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,322
Quote:
Originally Posted by a/b View Post
Hmmm... Good point, FPU :\. Let me poke around...
Bad news, FPU regs are saved last and loaded first, and their stack frame has variable size.
Sigh. I expected this.


Quote:
Originally Posted by Thomas Richter View Post
It's not documented, which means "hands off!".
A warning sign will not stop me if that's the only available solution.
At least, should I expect the stored PC to be at the same place ?
It's not as if this part of the OS will change overnight.


Quote:
Originally Posted by Thomas Richter View Post
You can at least avoid a stack push/pop if the called functions are tiny by loading the "continue from here" location into a register and replace the JSR by JMP, and the RTS (for small functions) by a JMP (return). For longer functions, you can still make a "move.l return,-(a7)" and have a RTS at the end. Thus, at least tiny "no-op" functions do not need to touch memory or the stack. Some heavy-duty functions of P96 work this way.
You mean, end all functions with something JMP (A3) with A3 leading to bigloop ? That does not solve the problem of A3 not available from other contexts.


Quote:
Originally Posted by Thomas Richter View Post
War time story: GFA basic had a similar (mis-)design where the basic token interpreter was interrupted from an input.device event handler by hot-patching the interpreter loop (self-modifying code). Needless to say, the whole thing broke with processors having data and code caches.
I guess i could call CacheClearE() to handle this, but i'd prefer to avoid SMC altogether.


Quote:
Originally Posted by a/b View Post
Well, another stupid idea (OS lovers rejoice!): install your exception vector for illegal instructions, get task's loadseg list and locate your code, put a juicy illegal into main loop (+icache nuke), and now you have access to all the registers and don't have to check any flags at all.
I'd prefer to avoid SMC if possible, but if following this way i'd rather put some BRA.B instead of ILLEGAL.
A problem here is that caches will be trashed twice (first time to install the temporary patch, the other to restore normal instruction). And this isn't good for performance.


Quote:
Originally Posted by robinsonb5 View Post
Is there a spare entry in your jumptable? Or can you add a -1th entry?
Not -1 as it's $ffff opcode, but i guess i could find a spare one.


Quote:
Originally Posted by robinsonb5 View Post
If so, point that spare entry to bigloop, and jump to it at the end of every function. You could then override it in your interrupt / other task, and restore it at the end of your special handling functions?
This means reloading an address to jump to at the end of every function.
It's like doing this for every executed instruction :
Code:
 move.l loopaddr(a5),a0
 jmp (a0)
That will add more clocks to every instruction than i wish to afford...


Quote:
Originally Posted by grond View Post
If you are always doing the JSR from the same place, your minimal interrupt handler could replace the return address on the stack with the entry address of the extended service code such that the JSR will not return to the caller PC but enter the extended service code. The extended service code would then just JMP back into the main loop when it has done its work.
This means i must be sure the return address is there, but it won't if we got interrupted during the execution of the fetch or jmp instructions in the main loop.


Quote:
Originally Posted by NorthWay View Post
Not sure I know myself. It was more or less about spacing out the opcodes every N bytes and shifting the (a6) bits to get a direct address to jump to (possibly needing some rather memory-hungry address alignment - every opcode starts 2^N bytes apart).
If the code is dense then it could have been a series of 4-byte branch opcodes (possibly not faster though) if you could drop bits or shift bits so it only was 14 bits.
Branching on a 4-byte branch would indeed add too many clocks.
But branching directly on regularly spaced code ? That would make the code quite a lot bigger. Not a problem by itself, but icache won't like this.
If the code was 14 bit (it's not), it would mean 32 bytes per opcode. Enough for simple instructions, not enough for complex ones.


Quote:
Originally Posted by NorthWay View Post
I'm just rambling. Orthogonality, spacing, massaged data to make it fit the problem. Is the problem set in stone, or can you modify it to make it easier to handle?
It's more or less set in stone.


Quote:
Originally Posted by a/b View Post
Expanding on robinsonb5's nice idea, if you have a spare register:
Code:
    lea    (main_ptr),a5
bigloop:
    move.w    (a6)+,d7
    jmp    ([a4,d7.l*4])

main_ptr:
    DC.L    bigloop

r1:
...
    jmp    ([a5])
r2:
...
    jmp    ([a5])
When you want to break out, simply change main_ptr and you don't have to mess with the registers. Is this faster than tst/bra combo on 020? It's not as fast as bra/dbf, but shouldn't be much behind.
I guess a spare register can be found.
However i can't really count clocks for these, and i can't check on hw. Docs i have are unclear about timing for jmp ([An]).
meynaf is offline  
Old 19 March 2021, 17:04   #17
robinsonb5
Registered User
 
Join Date: Mar 2012
Location: Norfolk, UK
Posts: 1,153
Quote:
Originally Posted by meynaf View Post


This means reloading an address to jump to at the end of every function.
It's like doing this for every executed instruction :
Code:
 move.l loopaddr(a5),a0
 jmp (a0)
That will add more clocks to every instruction than i wish to afford...

OK, how about:
Code:
  lea bigloop,a0
  move.l a0,-(a7)
  move.l a7,retptr
bigloop
  move.w (a6)+,d7
  jmp ([a4,d7.l*4])

retptr:
  dc.l 0

one_i
 *** do something here ***
  jmp (a7)
The longword at retptr points to the top stack entry. Since you're entering and leaving functions with jmp rather than jsr/rts the stack pointer is constant at the critical moments.

You can then use retptr to modify that stack entry from interrupts / other task, and once again restore it from your special function.
robinsonb5 is offline  
Old 19 March 2021, 17:18   #18
meynaf
son of 68k
 
meynaf's Avatar
 
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,322
Quote:
Originally Posted by robinsonb5 View Post
OK, how about:
Code:
  lea bigloop,a0
  move.l a0,-(a7)
  move.l a7,retptr
bigloop
  move.w (a6)+,d7
  jmp ([a4,d7.l*4])

retptr:
  dc.l 0

one_i
 *** do something here ***
  jmp (a7)
The longword at retptr points to the top stack entry. Since you're entering and leaving functions with jmp rather than jsr/rts the stack pointer is constant at the critical moments.

You can then use retptr to modify that stack entry from interrupts / other task, and once again restore it from your special function.
I don't understand what you want to do here.
As
jmp (a7)
will not reload retptr so altering it has no effect, and, worse, said jmp will jump to code that's located in the stack.
But maybe you mean
jmp ([a7])
instead ?
In this case, it's the same as
jmp ([a3])
or similar, it just leaves one extra register available.
Note : your two first instructions can be replaced by
pea bigloop
meynaf is offline  
Old 19 March 2021, 17:23   #19
robinsonb5
Registered User
 
Join Date: Mar 2012
Location: Norfolk, UK
Posts: 1,153
Quote:
Originally Posted by meynaf View Post
But maybe you mean
jmp ([a7])
instead ?

Sorry, yes, that's what I intended - essentially rts without popping the value.
robinsonb5 is offline  
Old 19 March 2021, 17:58   #20
meynaf
son of 68k
 
meynaf's Avatar
 
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,322
Quote:
Originally Posted by robinsonb5 View Post
Sorry, yes, that's what I intended - essentially rts without popping the value.
What is the timing of this indirect jmp ? I can't evaluate it precisely.
meynaf is offline  
 


Currently Active Users Viewing This Thread: 1 (0 members and 1 guests)
 
Thread Tools

Similar Threads
Thread Thread Starter Forum Replies Last Post
Ripping Sprites - Technique... method project.Sprites 43 12 October 2021 16:17
Profiling C code, interpreting results Ernst Blofeld Coders. C/C++ 5 19 November 2020 18:45
Interpreting DMA-Debugger output selco support.WinUAE 10 27 November 2019 20:48
Amazing New Retrobrighting Technique Hewitson Retrogaming General Discussion 12 12 June 2019 09:27
Error while interpreting script Makkinen support.Apps 1 15 October 2004 15:58

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off

Forum Jump


All times are GMT +2. The time now is 09:52.

Top

Powered by vBulletin® Version 3.8.11
Copyright ©2000 - 2024, vBulletin Solutions Inc.
Page generated in 0.20918 seconds with 16 queries