English Amiga Board


Old 19 March 2021, 18:17   #21
Thomas Richter
Registered User
 
Join Date: Jan 2019
Location: Germany
Posts: 3,233
Quote:
Originally Posted by meynaf View Post
A warning sign will not stop me if that's the only available solution.
At least, should I expect the stored PC to be at the same place ?
Once again: It is not documented where in the stack frame the PC is stored, so you cannot expect anything. Developing software means working against documented interfaces, and there is nothing documented there. The stack frame changes depending on whether there is a FPU in the system or not, or whether this is a 68040 or 68060 FPU. The vampire will store its additional registers also somewhere in the stack frame, also undocumented. This is "beyond limits".



Quote:
Originally Posted by meynaf View Post

You mean, end all functions with something JMP (A3) with A3 leading to bigloop ? That does not solve the problem of A3 not available from other contexts.
I'm not sure what you mean by "context". Concerning A3 not being usable, I understood that your problem is that many functions are just an RTS, and for those "many dummies", you could replace them with "jmp (a3)". For regular functions, you can always do a "move.l a3,-(a7)" on entry and end them with RTS. This would keep A3 available for them.


Frankly, did you actually measure that you have a performance problem before you try to solve it? A single "tst.b breakcondition" once in a while does not seem to ask for too much.



Quote:
Originally Posted by meynaf View Post
I guess i could call CacheClearE() to handle this, but i'd prefer to avoid SMC altogether.
GFABasic could have solved it this way, but at that time, there was no CacheClearE() either.


Quote:
Originally Posted by meynaf View Post
A problem here is that caches will be trashed twice (first time to install the temporary patch, the other to restore normal instruction). And this isn't good for performance.
How often will that actually happen that you have to abort the interpreter loop? If it is happening regularly, then an explicit test is the cheaper option as it does not have to go through a context switch (the CPU time for that has to come from somewhere, after all). If it is not happening regularly, then there is no need to bother about the rare cache pushes.
Thomas Richter is offline  
Old 19 March 2021, 18:30   #22
jotd
This cat is no more
 
jotd's Avatar
 
Join Date: Dec 2004
Location: FRANCE
Age: 52
Posts: 8,200
tst.b stopflag(a5) should be tst.b (a5), with a5 properly set, to save the offset computation

(or is that offset thing free?) (or use 0 for stopflag in your a5 struct)

also not sure if someone suggested this:

Code:
; main loop
bigloop
 move.l stopflag(a5),d7  ; 0 normally, $10000 when stopping; survives in the high word
 move.w (a6)+,d7         ; fetch the next opcode into the low word
 jmp    ([a4,d7.l*4])    ; suggested earlier; the flag selects the second table
Then the first 256kb table is your routine table, whose routines all end with "bra bigloop".
And use another 256kb table that only contains the same address: the "special routine" address.

Set stopflag so it's $10000 when set, 0 if not.
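The effect of folding the flag into the dispatch index can be sketched in C (a hypothetical, scaled-down model: NOPS stands in for 65536, and normal_op/break_op are made-up names; the real tables hold 65536 four-byte entries, i.e. 256kb each):

```c
#include <assert.h>

/* Scaled-down model of the two-half dispatch table: in the 68k version
 * stopflag is $10000 and each half of the table is 256kb. */
#define NOPS 8

typedef void (*handler_t)(void);

static int normal_runs, break_runs;

static void normal_op(void) { normal_runs++; }  /* would end with "bra bigloop" */
static void break_op(void)  { break_runs++;  }  /* the "special routine" */

/* one array, two halves: [0,NOPS) real handlers, [NOPS,2*NOPS) all break_op */
static handler_t table[2 * NOPS];

static void init_table(void) {
    for (int i = 0; i < NOPS; i++) {
        table[i]        = normal_op;
        table[NOPS + i] = break_op;  /* every slot of the second half is the same */
    }
}

/* models: move.l stopflag(a5),d7 / move.w (a6)+,d7 / jmp ([a4,d7.l*4]) */
static void dispatch(unsigned stopflag, unsigned opcode) {
    table[stopflag | opcode]();  /* stopflag is 0 or NOPS, so it picks the half */
}
```

The inner-loop cost is one extra move.l over the two-instruction version; the win is that stopping needs no code patching and no cache maintenance.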

Last edited by jotd; 19 March 2021 at 18:37.
jotd is offline  
Old 19 March 2021, 18:38   #23
meynaf
son of 68k
 
meynaf's Avatar
 
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,323
Quote:
Originally Posted by Thomas Richter View Post
Once again: It is not documented where in the stack frame the PC is stored, so you cannot expect anything. Developing software means working against documented interfaces, and there is nothing documented there. The stack frame changes depending on whether there is a FPU in the system or not, or whether this is a 68040 or 68060 FPU. The vampire will store its additional registers also somewhere in the stack frame, also undocumented. This is "beyond limits".
Perhaps this should be documented then
Listen, i have considered all possibilities, including using the task's exception signals (which will not work due to having no access to the registers), and nothing seems to beat this one.

So far the best i have is :
Code:
 move.w (a6)+,d7
 jmp ([a4,d7.l*4])
And that's gonna be repeated at the end of each function.
The only way to interrupt it is to access the registers there, either A4 or bit #16 of D7.
Any extra instruction in the loop is gonna make every emulated instruction slower.


Quote:
Originally Posted by Thomas Richter View Post
I'm not sure what you mean by "context". Concerning A3 not being usable, I understood that your problem is that many functions are just an RTS, and for those "many dummies", you could replace them with "jmp (a3)". For regular functions, you can always do a "move.l a3,-(a7)", and end them with RTS. This would keep A3 for them available.
Context can be : our task, an interrupt, or another task.
Not many functions are simply RTS (actually none); what made you jump to that conclusion ? It's just that many functions will be identical (i.e. the opcode contains some parameter).
Ending them with RTS does not help either, there is still the need to be as fast as possible for normal code while keeping the ability to break from either here or elsewhere.


Quote:
Originally Posted by Thomas Richter View Post
Frankly, did you actually measure that you have a performance problem before you try to solve it? A single "tst.b breakcondition" once in a while does not seem to ask for too much.
Frankly, there is no need to measure anything. There IS a performance problem, period. Because it's not once in a while, it's potentially millions of times per second.


Quote:
Originally Posted by Thomas Richter View Post
GFABasic could have solved it this way, but at that time, there was no CacheClearE() either.
So we can't blame GFA for having done something without knowing what the future would be


Quote:
Originally Posted by Thomas Richter View Post
How often will that actually happen that you have to abort the interpreter loop? If it is happening regularly, then an explicit test is the cheaper option as it does not have to go through a context switch (the CPU time for that has to come from somewhere, after all). If it is not happening regularly, then there is no need to bother about the rare cache pushes.
It is not happening very often, but nevertheless at least once per frame.
On 040/060 a cache push is probably nothing, but on 020/030 the whole cache is gonna be invalidated - doing that every vbl does not look fine to me.
meynaf is offline  
Old 19 March 2021, 18:51   #24
meynaf
son of 68k
 
meynaf's Avatar
 
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,323
Quote:
Originally Posted by jotd View Post
tst.b stopflag(a5) should be tst.b (a5) with a5 properly set to save offset computation

(or is that offset thing free?) (or use 0 for stopflag in your a5 struct)
That offset thing is free starting with 040, but not for 020/030.
Indeed tst.b (a5) is possible. But even if not, it's possible to use a7 instead like suggested before.


Quote:
Originally Posted by jotd View Post
also not sure if someone suggested this:

Code:
; main loop
bigloop
 move.l  stopflag(a5),d7
 move.w (a6)+,d7
 jmp    ([a4,d7.l*4])  ; suggested earlier
then first 256kb table is your routine table that all end by "bra bigloop"
And use another 256kb table that only contains the same address: the "special routine" address.

Set stopflag so it's $10000 when set, 0 if not.
Didn't think about that one. Quite clever

But that's 4 instructions (move.l d7, move.w a6+, jmp, bra).
Removing the bra (pun not intended) is possible by copying the whole block for every routine, but then the code is bigger, which means more pressure on the icache.
And of course that's gonna be slower than if i could access d7 or a4 from outside.
meynaf is offline  
Old 19 March 2021, 19:21   #25
grond
Registered User
 
Join Date: Jun 2015
Location: Germany
Posts: 1,918
Quote:
Originally Posted by meynaf View Post
This means i must be sure the return address is there, but it won't if we got interrupted during the execution of the fetch or jmp instructions in the main loop.
True. You could push the entry of your main loop onto the stack and JMP into the subroutines. The subroutines would then do a RTS on exit. If an interrupt happened, the minimal interrupt handler would have changed the return address and the subroutine would go into the extended handler which then would JMP to the beginning of your main routine again.

But I've totally lost track whether this would give any advantage or not...
grond is offline  
Old 19 March 2021, 19:35   #26
meynaf
son of 68k
 
meynaf's Avatar
 
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,323
Quote:
Originally Posted by grond View Post
True. You could push the entry of your main loop onto the stack and JMP into the subroutines. The subroutines would then do a RTS on exit. If an interrupt happened, the minimal interrupt handler would have changed the return address and the subroutine would go into the extended handler which then would JMP to the beginning of your main routine again.

But I've totally lost track whether this would give any advantage or not...
The problem is that RTS will pop the return address, so it needs to be pushed back at the start of main loop. And if the interrupt occurs right between those two, guess what will happen...
The solution to that is the "RTS that does not pop", i.e. JMP ([A7]) suggested before.
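The single-slot idea behind JMP ([A7]) can be modelled in C (hypothetical names: resume_slot plays the role of the longword at (A7), and fake_interrupt stands in for the real interrupt handler):

```c
#include <assert.h>

/* Model of the "RTS that does not pop": the resume address lives in one
 * memory slot which handlers jump through without consuming it, so an
 * interrupt can redirect execution with a single atomic pointer write,
 * with no window where the address is missing from the stack. */

typedef void (*cont_t)(void);

static volatile cont_t resume_slot;  /* plays the role of (A7) */
static int loop_iters, break_hits;

static void mainloop_step(void) { loop_iters++; }
static void break_handler(void) { break_hits++; }

/* what an interrupt would do: overwrite the slot, no stack surgery */
static void fake_interrupt(void) { resume_slot = break_handler; }

static void run(int steps) {
    resume_slot = mainloop_step;
    for (int i = 0; i < steps; i++)
        resume_slot();  /* models jmp ([a7]): read the slot, don't pop it */
}
```

Because nothing is ever popped, there is no instant at which the return address is absent, which is exactly the race the push-after-RTS variant suffers from.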
meynaf is offline  
Old 19 March 2021, 19:42   #27
grond
Registered User
 
Join Date: Jun 2015
Location: Germany
Posts: 1,918
Quote:
Originally Posted by meynaf View Post
The problem is that RTS will pop the return address, so it needs to be pushed back at the start of main loop. And if the interrupt occurs right between those two, guess what will happen...
I don't need to guess, I know what happens. That's why the first thing the proposed routine would do is to push its start address onto the stack again.


Quote:
The solution to that is the "RTS that does not pop", i.e. JMP ([A7]) suggested before.
Yes, but how are cycle counts for that double-indirection?
grond is offline  
Old 19 March 2021, 19:57   #28
meynaf
son of 68k
 
meynaf's Avatar
 
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,323
Quote:
Originally Posted by grond View Post
I don't need to guess, I know what happens. That's why the first thing the proposed routine would do is to push its start address onto the stack again.
Right, but the interrupt might come between the rts and the push - even if it's the first thing it does, it's already too late.
And if we attempt to push before the rts, then said rts will use the new value, not the old one.


Quote:
Originally Posted by grond View Post
Yes, but how are cycle counts for that double-indirection?
I don't know for sure. It's probably similar to that of doing the same thing with an intermediate address register.
meynaf is offline  
Old 19 March 2021, 20:38   #29
a/b
Registered User
 
Join Date: Jun 2016
Location: europe
Posts: 1,039
OK, back to the SMC approach.
What if you only clear a single icache line? Set bit 2 in CACR and select cache line in CAAR, instead of bit 3 in CACR (full clear). That's 4 bytes on 020, and 16 bytes on 030. Haven't used those on 040+, because of cinv/cpush. So maybe some extra footwork to support all CPUs. If you know the instruction address (task->loadseg list->abacus), for example on 020 the cache line is (address&255)/4, right?
a/b is online now  
Old 19 March 2021, 21:46   #30
grond
Registered User
 
Join Date: Jun 2015
Location: Germany
Posts: 1,918
Quote:
Originally Posted by meynaf View Post
Right, but the interrupt might come between the rts and the push - even if it's the first thing it does, it's already too late.
And if we attempt to push before the rts, then said rts will use the new value, not the old one.
You could set up the return address before installing the irq handler. Then again each time you enter the main loop.



Quote:
I don't know for sure. It's probably similar to that of doing the same thing with an intermediate address register.
I wouldn't be surprised if both methods were the same on 020/030 but saving one register could be useful.
grond is offline  
Old 20 March 2021, 02:34   #31
Bruce Abbott
Registered User
 
Bruce Abbott's Avatar
 
Join Date: Mar 2018
Location: Hastings, New Zealand
Posts: 2,584
Quote:
Originally Posted by meynaf View Post
This time it's about optimising code to death.
Little care about the size, it's all about speed. It can't be fast enough.
The goal is to execute code depending on some values read from a stream.
See it as some kind of cpu emulation.
You want it as fast as possible? Then you have to consider what you are emulating.

Quote:
More than 8 bytes for sure. Average around 16. Biggest ones can reach 50+ bytes, i'm not sure - but they can probably use subroutines.
I bet your dispatch code is only a small proportion of the total execution time, but it depends on the mix of 8 to 50+ instructions. "Average of around 16" means nothing if 50+ is executed more often.

Instead of obsessing over micro-optimizations I suggest you just go with the "somewhat naïve implementation", then profile it on live emulation code. That gives you a baseline to see what improvement any future optimizations might have.

IOW, get the code working first - then worry about how to make it faster.
Bruce Abbott is offline  
Old 20 March 2021, 07:48   #32
meynaf
son of 68k
 
meynaf's Avatar
 
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,323
Quote:
Originally Posted by a/b View Post
OK, back to the SMC approach.
What if you only clear a single icache line? Set bit 2 in CACR and select cache line in CAAR, instead of bit 3 in CACR (full clear). That's 4 bytes on 020, and 16 bytes on 030. Haven't used those on 040+, because of cinv/cpush. So maybe some extra footwork to support all CPUs. If you know the instruction address (task->loadseg list->abacus), for example on 020 the cache line is (address&255)/4, right?
I've never done this by hand, but i suppose this can work.
But, thinking twice about it, it means the main decode loop is always at the same place, which forbids repeating that decode part at the end of every instruction - which in turn means an extra branch in the critical path.
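The trade-off weighed here (one shared loop head versus a copy of the fetch/dispatch at the end of every handler) can be sketched in C with a tiny function-pointer VM. The opcodes and the DISPATCH macro are invented for illustration; in asm the replicated tail is a jump, whereas this C model recurses:

```c
#include <assert.h>
#include <stdint.h>

typedef void handler_fn(const uint16_t **pc);
static handler_fn *table[4];
static int acc;

/* each handler ends with its own copy of the fetch+jump, so there is
 * no "bra bigloop" back to a shared loop head */
#define DISPATCH(pc) do { table[*(*(pc))++](pc); } while (0)

static void op_halt(const uint16_t **pc) { (void)pc; }            /* no dispatch: stop */
static void op_inc (const uint16_t **pc) { acc += 1; DISPATCH(pc); }
static void op_dbl (const uint16_t **pc) { acc *= 2; DISPATCH(pc); }

static int run(const uint16_t *prog) {
    table[0] = op_halt;
    table[1] = op_inc;
    table[2] = op_dbl;
    acc = 0;
    DISPATCH(&prog);  /* kick off; every handler re-dispatches itself */
    return acc;
}
```

Replicating DISPATCH saves one taken branch per emulated instruction, at the cost of bigger handlers - the icache pressure discussed above.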


Quote:
Originally Posted by grond View Post
You could set up the return address before installing the irq handler. Then again each time you enter the main loop.
That's not the problem. Here the RTS will pop the return address and it's possible to have an irq before it gets pushed again.


Quote:
Originally Posted by grond View Post
I wouldn't be surprised if both methods were the same on 020/030 but saving one register could be useful.
Indeed.


Quote:
Originally Posted by Bruce Abbott View Post
You want it as fast as possible? Then you have to consider what you are emulating.
What does that change ?


Quote:
Originally Posted by Bruce Abbott View Post
I bet your dispatch code is only a small proportion of the total execution time, but it depends on the mix of 8 to 50+ instructions. "Average of around 16" means nothing if 50+ is executed more often.
A small proportion, not really. The dispatch code is executed for every instruction. Like the decode time is there for every instruction a cpu has to execute.
And as to whether the 50+ byte routines are executed more often, just do some statistical study of typical cpu instruction streams to see that simple instructions are somewhat more common than complex ones.


Quote:
Originally Posted by Bruce Abbott View Post
Instead of obsessing over micro-optimizations I suggest you just go with the "somewhat naïve implementation", then profile it on live emulation code. That gives you a baseline to see what improvement any future optimizations might have.
I don't need profiling to know these instructions are the most critical of the whole app.


Quote:
Originally Posted by Bruce Abbott View Post
IOW, get the code working first - then worry about how to make it faster.
It already works. I currently have a working VM - albeit very slow.
meynaf is offline  
Old 23 March 2021, 10:26   #33
Bruce Abbott
Registered User
 
Bruce Abbott's Avatar
 
Join Date: Mar 2018
Location: Hastings, New Zealand
Posts: 2,584
Quote:
Originally Posted by meynaf View Post
I don't need profiling to know these instructions are the most critical of the whole app.
I'm skeptical.

But there's a way you can convince me... release your code! Then we can have a competition to see who can make it fastest.

Quote:
It already works. I currently have a working VM - albeit very slow.
What is it for?
Bruce Abbott is offline  
Old 23 March 2021, 10:54   #34
meynaf
son of 68k
 
meynaf's Avatar
 
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,323
Quote:
Originally Posted by Bruce Abbott View Post
I'm skeptical.
Why ? Inner loops have always been the most speed sensitive, and this part is THE inner loop.


Quote:
Originally Posted by Bruce Abbott View Post
But there's a way you can convince me... release your code! Then we can have a competition to see who can make it fastest.
Release the code ? Maybe, but it's too big to be suitable for a contest.
This is more refactoring than micro-optimization, as used strategy here will in turn impact everything else.
But if you're interested enough to want to not only see the code but also help, you can PM me.


Quote:
Originally Posted by Bruce Abbott View Post
What is it for?
That's my own stuff. I can't do hardware so i do my thing in software.
meynaf is offline  
Old 23 March 2021, 19:31   #35
Bruce Abbott
Registered User
 
Bruce Abbott's Avatar
 
Join Date: Mar 2018
Location: Hastings, New Zealand
Posts: 2,584
Quote:
Originally Posted by meynaf View Post
But if you're interested enough to want to not only see the code but also help, you can PM me.
PM sent.
Bruce Abbott is offline  
Old 26 March 2021, 09:00   #36
Bruce Abbott
Registered User
 
Bruce Abbott's Avatar
 
Join Date: Mar 2018
Location: Hastings, New Zealand
Posts: 2,584
To get an idea of how much overhead you have I isolated the 'vmloop' code, inserted the code of unconditional subroutines and removed code that was executed conditionally. Inlining the subroutines would save a few bsr/rts pairs, but there is still a lot of code that could possibly be bypassed completely by some instructions.

I suggest encoding the instructions into 'classes' using eg. the upper 4 bits like the 68000 does. Then do a partial decode on the class, jumping to code that is specific to that class of instruction. Further decoding would then be applied for miscellaneous instructions.

After 'executing' the instruction you can jump to the finishing code that it needs (eg. setcc, writeback), which then jumps back to the main loop. Small routines could be inlined to avoid unnecessary calls or jumps.

With that technique you should be able to significantly reduce the minimum instruction execution time. However real code often spends most of its time on more complex instructions, so it would be better to concentrate on speeding up the more 'popular' time consuming instructions rather than having the fastest possible NOP.
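The two-level decode described above can be sketched in C (the class split and return values are invented for illustration; they are not meynaf's actual encoding):

```c
#include <assert.h>
#include <stdint.h>

typedef int class_fn(uint16_t opcode);

/* class 0: fully decoded by the first level, like 68000 "move" */
static int class_move(uint16_t op) { return 100 + (op & 0x0FFF); }

/* class 1: needs a secondary decode on more bits */
static int class_alu(uint16_t op) {
    switch ((op >> 8) & 0x0F) {
    case 0:  return 200;  /* e.g. add */
    case 1:  return 201;  /* e.g. sub */
    default: return 299;  /* misc, decoded further in a real interpreter */
    }
}

static int class_misc(uint16_t op) { (void)op; return 300; }

/* first-level table indexed by the top 4 bits of the opcode */
static class_fn *classes[16] = {
    class_move, class_alu,
    class_misc, class_misc, class_misc, class_misc, class_misc, class_misc,
    class_misc, class_misc, class_misc, class_misc, class_misc, class_misc,
    class_misc, class_misc,
};

static int decode(uint16_t opcode) {
    return classes[opcode >> 12](opcode);  /* partial decode on the class bits */
}
```

Common classes pay only the first-level jump; only the miscellaneous ones take the extra decoding steps.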

Last edited by Bruce Abbott; 26 March 2021 at 10:53. Reason: code removed as meynaf wants to keep it private
Bruce Abbott is offline  
Old 26 March 2021, 09:50   #37
meynaf
son of 68k
 
meynaf's Avatar
 
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,323
Quote:
Originally Posted by Bruce Abbott View Post
To get an idea of how much overhead you have I isolated the 'vmloop' code, inserted the code of unconditional subroutines and removed code that was executed conditionally. Here is the result:-
(...)

Inlining the subroutines would save a few bsr/rts pairs, but there is still a lot of code that could possibly be bypassed completely by some instructions.

I suggest encoding the instructions into 'classes' using eg. the upper 4 bits like the 68000 does. Then do a partial decode on the class, jumping to code that is specific to that class of instruction. Further decoding would then be applied for miscellaneous instructions.

After 'executing' the instruction you can jump to the finishing code that it needs (eg. setcc, writeback), which then jumps back to the main loop. Small routines could be inlined to avoid unnecessary calls or jumps.

With that technique you should be able to significantly reduce the minimum instruction execution time. However real code often spends most of its time on more complex instructions, so it would be better to concentrate on speeding up the more 'popular' time consuming instructions rather than having the fastest possible NOP.
I told you, optimizing this code without changing the algorithm is futile.
Besides, please don't publish code here. When i said "PM me", it was because i prefer this to remain private.
meynaf is offline  
Old 26 March 2021, 11:19   #38
Bruce Abbott
Registered User
 
Bruce Abbott's Avatar
 
Join Date: Mar 2018
Location: Hastings, New Zealand
Posts: 2,584
Quote:
Originally Posted by meynaf View Post
please don't publish code here. When I said "PM me", it was that i prefer this to remain private.
Oh sorry, when you said "Release the code ? Maybe, but it's too big to be suitable for a contest" I didn't realize you meant you wanted to keep it private.

Quote:
I told you, optimizing this code without changing the algorithm is futile.
Not sure what you mean by this. But since I don't seem to have any ideas that you find useful I will bow out now.
Bruce Abbott is offline  
Old 26 March 2021, 12:22   #39
meynaf
son of 68k
 
meynaf's Avatar
 
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,323
Quote:
Originally Posted by Bruce Abbott View Post
Oh sorry, when you said "Release the code ? Maybe, but it's too big to be suitable for a contest" I didn't realize you meant you wanted to keep it private.
I was unclear maybe, i thought the "PM me" would have been enough. Sorry.
I would prefer that it remains private for now, but that's not set in stone.


Quote:
Originally Posted by Bruce Abbott View Post
Not sure what you mean by this. But since I don't seem to have any ideas that you find useful I will bow out now.
No harm done in trying, right ?
Of course there is code that must be inlined, and code that must be completely bypassed. For now individual instructions are only the alu part - the thing is that the code must be reworked so that everything is there, with as much predecode as possible without making the code size completely explode. And i think this requires a rewrite. And i will not rewrite everything without first having the basic structure, which is precisely the decoding part i wanted to optimize here.
meynaf is offline  
Old 27 March 2021, 03:48   #40
Don_Adan
Registered User
 
Join Date: Jan 2008
Location: Warsaw/Poland
Age: 55
Posts: 1,975
Quote:
Originally Posted by meynaf View Post
Even if there are FPU registers there too ?
If yes, that's good news
Now only thing to take care of, is to not call any OS routine when this feature is active - and perhaps find a way to not kill my debugger as well when it's there and i'm hunting an
What would be the problem with adding a special ID (f.e. $5754) in the high word of the D6 register, and later scanning the stack for that D6 word value?
Code:
 move.l SP,A0
findID
 cmp.w  #$5754,(A0)+
 bne.b  findID
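The marker-scan idea can be modelled in C (hypothetical: the frame layout is invented, and $5754 is just the example ID from above, not a documented value):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Tag D6 with a known 16-bit ID in its high word before the exception,
 * then locate D6 in the (undocumented) stack frame by scanning for the
 * ID. Returns the word index of the marker, or (size_t)-1. */
#define MARKER 0x5754u

static size_t find_marker(const uint16_t *stack, size_t nwords) {
    for (size_t i = 0; i < nwords; i++)  /* models cmp.w #$5754,(A0)+ / bne.b */
        if (stack[i] == MARKER)
            return i;
    return (size_t)-1;  /* not found: don't run off the stack */
}
```

Unlike the raw asm loop, this version bounds the scan, so a missing marker cannot loop forever; a false positive is still possible if $5754 happens to occur in other frame data.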
Don_Adan is offline  
 


