19 March 2021, 11:54 | #1 |
son of 68k
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,322
|
fast interpreting technique(s)
This time it's about optimising code to death.
Little care about the size, it's all about speed. It can't be fast enough. The goal is to execute code depending on some values read from a stream. See it as some kind of cpu emulation. Bare minimum is 68020, as 68000 is too lacking (too slow overall, no scale factor, lots of misaligned accesses). Can not be OS killing (needs HD access, has to multitask cleanly, etc). Somewhat "naïve" implementation : Code:
; main loop bigloop move.w (a6)+,d7 move.l (a4,d7.l*4),a0 jsr (a0) tst.b stopflag(a5) beq.s bigloop ; code here treats special cases spec ; jsr (a0) points in many, many routines looking like this : one_i *** do something here *** rts Code:
bigloop move.w (a6)+,d7 jmp ([a4,d7.l*4]) one_i *** do something here *** dbf d6,bigloop bra spec First version above is ok, i can just set the flag from whereever i need to. But the tst.b/beq pair isn't the fastest way of doing it. Second version is non ok, as d6 isn't available from 68k interrupt or another task. Or maybe it is ? Is there a trick i could use to access a task register from elsewhere ? So the question is : how to perform all this, in the fastest possible way ? All ideas are welcome. |
19 March 2021, 12:33 | #2 |
Registered User
Join Date: May 2013
Location: Grimstad / Norway
Posts: 839
|
Is "(a6)+" a full 16-bit word, or are there any unused top/bottom bits?
How much memory can you spend? How big are the 'opcode' subrutines? The biggest one? I reworked a 6502 emulator (much easier with only byte opcodes), but one of the things it does is to have the dispatch for next opcode duplicated in every jump target. If you insist on a central loop then perhaps have two sets of jumptables, and have the opcode that is escaping change your base pointer for the jumptable (as long as a6 can be determined what it was for the previous instruction). |
19 March 2021, 13:02 | #3 | ||
son of 68k
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,322
|
Quote:
Top 16-bit of d7 are free (currently =0) and i can arrange to use word index if that's useful. There are, however, patterns that will lead to the same routine. Lower bits are often parameters, but not always. You can consider that the situation is similar to that of 68k, so ask yourself what you would do if you had to emulate another 68k model. Out of curiosity, if it were the case that i had free bits available, what benefit could have come out of it ? I already have a 256kb table. A second 256kb is acceptable, but i'd avoid going further if possible. More is only acceptable if there is tremendous benefit. That's very variable. More than 8 bytes for sure. Average around 16. Biggest ones can reach 50+ bytes, i'm not sure - but they can probably use subroutines. Quote:
However what must be central is the handling of all irq-like stuff. The main issue here is that it's not necessarily an opcode that will be escaping. It can come from outside (an interrupt or another task). For example, there can be something to do every vbl, or on mouse move, etc. |
||
19 March 2021, 13:09 | #4 | |
Registered User
Join Date: Jun 2016
Location: europe
Posts: 1,038
|
Quote:
Interrupts... maybe if you check exec->ThisTask to make sure it's not running, but in general very tricky to do 'right now', as far as I'd guess. |
|
19 March 2021, 13:18 | #5 | |
son of 68k
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,322
|
Quote:
I don't think using a "magic number" is very reliable, registers could be anything. I can eventually Signal() a task of a higher priority from an interrupt so it will be done in that other task - so if a task can do it, an interrupt can do it too. |
|
19 March 2021, 13:37 | #6 | |
Registered User
Join Date: Jun 2016
Location: europe
Posts: 1,038
|
Quote:
Magic number just to verify your code works, then you can get rid of it. About interrupts... I wasn't sure how responsive it has to be. If you can activate your control task (signal, raising its priority temporarily, ..) and do it in the next time slice, yeah it all good. But, as I said, if you need it 'right now', dunno. Can't mess around with stacks too much ;P. Maybe you interrupted a lower level interrupt that interrupted your work task, etc. |
|
19 March 2021, 13:53 | #7 | ||
son of 68k
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,322
|
Quote:
If yes, that's good news Now only thing to take care of, is to not call any OS routine when this feature is active - and perhaps find a way to not kill my debugger as well when it's there and i'm hunting an error. Quote:
Else if it really must be "right now" then so be it : i can spawn another, temporary instance of the emulation and do it from within the interrupt. It's just that instructions (of the other side) will cease to be atomic. |
||
19 March 2021, 14:04 | #8 |
Registered User
Join Date: Jun 2016
Location: europe
Posts: 1,038
|
Hmmm... Good point, FPU :\. Let me poke around...
Bad news, FPU regs are saved last and loaded first, and their stack frame has variable size. |
19 March 2021, 14:38 | #9 |
Registered User
Join Date: Jan 2019
Location: Germany
Posts: 3,214
|
|
19 March 2021, 14:50 | #10 | |
Registered User
Join Date: Jan 2019
Location: Germany
Posts: 3,214
|
Quote:
You can at least avoid a stack push/pop if the called functions are tiny by loading the "continue from here" location into a register and replace the JSR by JMP, and the RTS (for small functions) by a JMP (return). For longer functions, you can still make a "move.l return,-(a7)" and have a RTS at the end. Thus, at least tiny "no-op" functions do not need to touch memory or the stack. Some heavy-duty functions of P96 work this way. War time story: GFA basic had a similar (mis-)design where the basic token interpreter was interrupted from an input.device event handler by hot-patching the interpreter loop (self-modifying code). Needless to say, the whole thing broke with processors having data and code caches. |
|
19 March 2021, 15:04 | #11 |
Registered User
Join Date: Jun 2016
Location: europe
Posts: 1,038
|
Well, another stupid idea (OS lovers rejoice!): install your exception vector for illegal instructions, get task's loadseg list and locate your code, put a juicy illegal into main loop (+icache nuke), and now you have access to all the registers and don't have to check any flags at all.
|
19 March 2021, 15:04 | #12 |
Registered User
Join Date: Mar 2012
Location: Norfolk, UK
Posts: 1,153
|
Is there a spare entry in your jumptable? Or can you add a -1th entry?
If so, point that spare entry to bigloop, and jump to it at the end of every function. You could then override it in your interrupt / other task, and restore it at the end of your special handling functions? |
19 March 2021, 15:33 | #13 |
Registered User
Join Date: Jun 2015
Location: Germany
Posts: 1,918
|
If you are always doing the JSR from the same place, your minimal interrupt handler could replace the return address on the stack with the entry address of the extended service code such that the JSR will not return to the caller PC but enter the extended service code. The extended service code would then just JMP back into the main loop when it has done its work.
|
19 March 2021, 15:34 | #14 | |
Registered User
Join Date: May 2013
Location: Grimstad / Norway
Posts: 839
|
Quote:
If the code is dense then it could have been a series of 4-byte branch opcodes (possibly not faster though) if you could drop bits or shift bits so it only was 14 bits. I'm just rambling. Orthogonality, spacing, massaged data to make it fit the problem. Is the problem set in stone, or can you modify it to make it easier to handle? |
|
19 March 2021, 15:54 | #15 |
Registered User
Join Date: Jun 2016
Location: europe
Posts: 1,038
|
Expanding on robinsonb5's nice idea, if you have a spare register:
Code:
lea (main_ptr),a5 bigloop: move.w (a6)+,d7 jmp ([a4,d7.l*4]) main_ptr: DC.L bigloop r1: ... jmp ([a5]) r2: ... jmp ([a5]) |
19 March 2021, 16:34 | #16 | ||||||||||
son of 68k
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,322
|
Quote:
A warning sign will not stop me if that's the only available solution. At least, should I expect the stored PC to be at the same place ? It's not as if this part of the OS will change overnight. Quote:
Quote:
Quote:
A problem here is that caches will be trashed twice (first time to install the temporary patch, the other to restore normal instruction). And this isn't good for performance. Quote:
Quote:
It's like doing this for every executed instruction : Code:
move.l loopaddr(a5),a0 jmp (a0) Quote:
Quote:
But branching directly on regularly spaced code ? That would make the code quite a lot bigger. Not a problem by itself, but icache won't like this. If the code was 14 bit (it's not), it would mean 32 bytes per opcode. Enough for simple instructions, not enough for complex ones. Quote:
Quote:
However i can't really count clocks for these, and i can't check on hw. Docs i have are unclear about timing for jmp ([An]). |
||||||||||
19 March 2021, 17:04 | #17 | |
Registered User
Join Date: Mar 2012
Location: Norfolk, UK
Posts: 1,153
|
Quote:
OK, how about: Code:
lea bigloop,a0 move.l a0,-(a7) move.l a7,retptr bigloop move.w (a6)+,d7 jmp ([a4,d7.l*4]) retptr: dc.l 0 one_i *** do something here *** jmp (a7) You can then use retptr to modify that stack entry from interrupts / other task, and once again restore it from your special function. |
|
19 March 2021, 17:18 | #18 | |
son of 68k
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,322
|
Quote:
As jmp (a7)will not reload retptr so altering it has no effect, and, worse, said jmp will jump to code that's located in the stack. But maybe you mean jmp ([a7])instead ? In this case, it's the same as jmp ([a3])or similar, it just leaves one extra register available. Note : your two first instructions can be replaced by pea bigloop |
|
19 March 2021, 17:23 | #19 |
Registered User
Join Date: Mar 2012
Location: Norfolk, UK
Posts: 1,153
|
|
19 March 2021, 17:58 | #20 |
son of 68k
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,322
|
|
Currently Active Users Viewing This Thread: 1 (0 members and 1 guests) | |
Thread Tools | |
Similar Threads | ||||
Thread | Thread Starter | Forum | Replies | Last Post |
Ripping Sprites - Technique... | method | project.Sprites | 43 | 12 October 2021 16:17 |
Profiling C code, interpreting results | Ernst Blofeld | Coders. C/C++ | 5 | 19 November 2020 18:45 |
Interpreting DMA-Debugger output | selco | support.WinUAE | 10 | 27 November 2019 20:48 |
Amazing New Retrobrighting Technique | Hewitson | Retrogaming General Discussion | 12 | 12 June 2019 09:27 |
Error while interpreting script | Makkinen | support.Apps | 1 | 15 October 2004 15:58 |
|
|