fast interpreting technique(s) - Page 3

a/b · 27 March 2021, 05:44

I suggested that already, but it's not 100% reliable. If you start at the bottom you eventually have to go through FPU stack frame which also could include a rather large FPU state, so who knows what's in there.
I was also thinking about starting from the top side but it suffers from the same problem, although it could be a little better, depending on stack usage. I don't know what the code looks like or what is happening with the stack. I'm pretty sure there are some subroutine calls, maybe even nested ones; for example, what if you end up storing d6 on the stack so you have it there multiple times?
It's ~99% likely that the worker task will be switched out during opcode interpretation (and not in the "main loop", which is ~2 instructions) so the stack could look something like this:
<bottom, task->tc_SPReg> <variable FPU stack frame> pc sr d0 ... a6 <local stack> <rts address> <local stack> <rts address> <top, this is where we are in main loop>
Still very sketchy...

meynaf · 27 March 2021, 07:52

It is not a problem that the value might be anywhere in stack, IF it's always located at the same place on the same setup.
I mean, i could look for a magical value at startup. Then change the value, trigger a task switch, then check if the value in stack has also changed. If yes keep the offset, else continue scanning.
But i am not sure the fpu stack frame has constant length over time. Reading rom code didn't suggest it had.
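
The calibration idea can be modelled in a few lines of C. This is only a sketch on plain arrays, not real Exec code: `before` and `after` stand for two copies of the saved stack area taken across two task switches, and `locate_slot`, `magic1` and `magic2` are names made up for the illustration. A slot is accepted only if it held the first magic value and then the second, and only if exactly one slot behaves that way, which filters out stale copies of the register left elsewhere on the stack. It says nothing about whether the offset stays constant afterwards.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns the index of the single slot that changed from magic1 to magic2
   between the two snapshots, or -1 if no slot or more than one slot did. */
long locate_slot(const uint32_t *before, const uint32_t *after, size_t n,
                 uint32_t magic1, uint32_t magic2)
{
    long found = -1;

    assert(before != NULL && after != NULL);
    for (size_t i = 0; i < n; i++) {
        if (before[i] == magic1 && after[i] == magic2) {
            if (found >= 0)
                return -1;      /* ambiguous: two slots track the register */
            found = (long)i;
        }
    }
    return found;
}
```

A slot that merely contains the magic value but does not follow the change (for example a copy pushed by a subroutine) is ignored, and an ambiguous result is reported as a failure rather than guessed at.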

Don_Adan · 27 March 2021, 11:13

If you want to be 1000% sure that this is the correct D6 word value, you can add a second (different) ID in the high word of D4 too. Then only one extra check is needed later.

meynaf · 27 March 2021, 11:50

Quote:

Originally Posted by Don_Adan

If you want to be 1000% sure that this is the correct D6 word value, you can add a second (different) ID in the high word of D4 too. Then only one extra check is needed later.

This requires extra space in high parts of registers, something not guaranteed (in my current register allocation the only high part that's available is D7).

My current idea anyway is to repeat this code at the end of every routine :

Code:

 move.w (a6)+,d7
 jmp ([a4,d7.w*4])

This means i will not change D6, but A4 (ok, could be D7, but A4 seems more handy here).
The end result is the same, except that it's faster (removes the branch returning to main loop) and frees D6 (in which i would prefer to have full 32 bits available).
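
For readers who want to see the table-swap idea in isolation, here is a rough C model. It is a sketch under assumptions, not the actual interpreter: C has no equivalent of the `jmp` at the end of each routine, so a small loop stands in for it (the `stopped` test in that loop is an artifact of the model), and `VM`, `vm_request_break`, `op_halt` and the two-opcode instruction set are all invented for the illustration. The point it shows is that the handlers contain no flag test at all: to stop the interpreter, the controlling side only replaces the table base (the role A4 plays) with a table whose entries all lead to a halt routine.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct VM VM;
typedef void (*Handler)(VM *);

struct VM {
    const uint16_t *pc;     /* role of a6: interpreted program counter */
    const Handler  *table;  /* role of a4: base of the active jump table */
    long            acc;    /* some interpreted state */
    int             stopped;
};

enum { OP_NOP, OP_INC, OP_COUNT };  /* hypothetical opcodes */

static void op_nop(VM *vm) { (void)vm; }
static void op_inc(VM *vm) { vm->acc++; }

/* Every entry of the stop table lands here, whatever the opcode was. */
static void op_halt(VM *vm)
{
    vm->pc--;               /* un-fetch, so the opcode can be retried later */
    vm->stopped = 1;
}

static const Handler run_table[OP_COUNT]  = { op_nop, op_inc };
static const Handler stop_table[OP_COUNT] = { op_halt, op_halt };

void vm_init(VM *vm, const uint16_t *program)
{
    vm->pc = program;
    vm->table = run_table;
    vm->acc = 0;
    vm->stopped = 0;
}

/* What the controlling task would do: redirect dispatch, touch nothing else. */
void vm_request_break(VM *vm) { vm->table = stop_table; }

/* Runs at most max_steps dispatches (a stand-in for one time slice).
   Returns the number of dispatches performed, including the halting one. */
size_t vm_run(VM *vm, size_t max_steps)
{
    size_t steps = 0;
    while (steps < max_steps && !vm->stopped) {
        uint16_t op = *vm->pc++;    /* move.w (a6)+,d7 */
        assert(op < OP_COUNT);
        vm->table[op](vm);          /* jmp ([a4,d7.w*4]) */
        steps++;
    }
    return steps;
}
```

In this model the break takes effect at the next dispatch after the table pointer changes, and the interpreted program counter is left pointing at the opcode that was not executed.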

a/b · 27 March 2021, 13:19

OK, one thing I completely missed. If *your* worker task is not using FPU at all or not using it once you go into main loop, FPU stack frame will *always* be NULL (or maybe also IDLE is possible?). So if you kind of calibrate/synchronize your tasks (for the current hw/system) before you enter the main loop, a4 should always be at the same offset.

Thomas Richter · 27 March 2021, 13:56

Quote:

Originally Posted by meynaf

It is not a problem that the value might be anywhere in stack, IF it's always located at the same place on the same setup.

It isn't. The stack frame is FPU-model dependent, and state-dependent as well. The NULL-stateframe is 4 bytes on the 68881 through 68040, but that's all. The 68060 NULL-stateframe is different and 12 bytes. Depending on the state of the FPU, the stateframe may also be an "idle" frame (with less information) or a "busy" frame (with more information) or an "exception stack frame". I know neither where nor how the Vampire puts its registers there. This is really off-limits, and system-, hardware- and state-dependent.

meynaf · 27 March 2021, 14:08

Quote:

Originally Posted by Thomas Richter

It isn't. The stack frame is FPU-model dependent, and state-dependent as well. The NULL-stateframe is 4 bytes on the 68881 through 68040, but that's all. The 68060 NULL-stateframe is different and 12 bytes. Depending on the state of the FPU, the stateframe may also be an "idle" frame (with less information) or a "busy" frame (with more information) or an "exception stack frame". I know neither where nor how the Vampire puts its registers there. This is really off-limits, and system-, hardware- and state-dependent.

But my task will not be using FPU at all (not even thru some math lib), and, let's be honest, i don't care if it fails on the vampire - i didn't agree in the first place with adding all these registers.

What matters isn't that the stack frame depends on the config of the machine; the question is : does a register remain at the same place once it has been located ?

a/b · 27 March 2021, 14:31

That's what I addressed in my previous post. If your task is not using FPU at all, then the state is either its initial state (set when the task was created: NULL=0.L), or it doesn't matter because you are running on older KS version that doesn't support FPU (no extra 4 zero bytes).

meynaf · 27 March 2021, 14:38

Sure thing is that MY task's not gonna use the FPU. But what if another one does in the meanwhile ? Can this change the state ?

a/b · 27 March 2021, 14:56

Yes, but not the FPU state of *your* task. If another task is using it, this happens during task switch:
- FPU state is saved to another task's stack (if not NULL, it's followed by fp0-7, fpcr/fpsr/fpiar and possibly other stuff but we don't have to know that at all)
- task switch
- FPU state is restored from your stack (=NULL)
- you do your thing until task switch
- FPU state is saved to your stack (=NULL)
etc.

Thomas Richter · 27 March 2021, 15:17

Well, it depends on what your task does... If anything in your task uses the FPU, even if only indirectly by opening a math library, or opening something that opens the math library, then the stack frame changes mid-term.

The problem is really that you depend on something the Os does not document, and it does not document this to be extensible.

The old amiga problem. Failing to understand the difference between interface and implementation.

meynaf · 27 March 2021, 15:29

Quote:

Originally Posted by Thomas Richter

Well, it depends on what your task does... If anything in your task uses the FPU, even if only indirectly by opening a math library, or opening something that opens the math library, then the stack frame changes mid-term.

As i said my task will not, directly or indirectly, use FPU.

Quote:

Originally Posted by Thomas Richter

The problem is really that you depend on something the Os does not document, and it does not document this to be extensible.

It's not a problem. Lots of things the Os does not document have been used already.

Quote:

Originally Posted by Thomas Richter

The old amiga problem. Failing to understand the difference between interface and implementation.

This is better than failing to provide an alternative path. At least there is something that can work.
But maybe you have a better idea ? Something that can work as fast but does not depend on anything undocumented ? As currently it's the choice between doing it this way and not doing it at all...

a/b · 27 March 2021, 15:32

Quote:

Originally Posted by Thomas Richter

The old amiga problem. Failing to understand the difference between interface and implementation.

There's no such failure, I completely understand what I'm suggesting and what risks it implies. I'd avoid that in any public/commercial software of my own, but private stuff... I've done 2^(a lot) worse.
Yeah, it *is* OS internal implementation, and relying on it is 100% unsupported. I can accept that and still proceed with a certain probability of success. If the pattern is present in KS1.2 to KS3.1.x, it's 100% and I can live with that.
Well, it's up to Meynaf in this case.

meynaf · 27 March 2021, 16:48

Quote:

Originally Posted by a/b

Well, it's up to Meynaf in this case.

Yep. And i'll go for it as long as there is no alternative giving the same level of performance.

Thomas Richter · 27 March 2021, 16:57

Quote:

Originally Posted by meynaf

It's not a problem. Lots of things the Os does not document have been used already.

Such practice blocks the development of the Os and the platform, that is the problem.

Quote:

Originally Posted by meynaf

This is better than failing to provide an alternative path. At least there is something that can work.

There are many things that can work. Some work with the system, some against the system, and some by pure chance.

Quote:

Originally Posted by meynaf

But maybe you have a better idea ?

I already gave you a better idea. But first things first: a) measure, b) improve if necessary. I am not convinced that there is much of a noticeable difference, and you should first establish that there is a problem that needs to be solved.

Quote:

Originally Posted by meynaf

Something that can work as fast but does not depend on anything undocumented ? As currently it's the choice between doing it this way and not doing it at all...

And that is just not true - why do you state something that is obviously false? You haven't even measured the various implementation choices. It might or might not be faster, depending on what your problem is, or it might be slower by a small margin that does not matter. If that makes the end result stable and independent of undocumented internals, it may be worth it.

meynaf · 27 March 2021, 18:15

Quote:

Originally Posted by Thomas Richter

Such practice blocks the development of the Os and the platform, that is the problem.

Frankly others have done a lot worse. I won't block anything.

Quote:

Originally Posted by Thomas Richter

There are many things that can work. Some work with the system, some against the system, and some by pure chance.

There are also things that don't work.

Quote:

Originally Posted by Thomas Richter

I already gave you a better idea. But first things first: a) measure, b) improve if necessary. I am not convinced that there is much of a noticeable difference, and you should first establish that there is a problem that needs to be solved.

You have not in any manner given me a better idea.
What was it already ? The tst.b on a variable ? 4 instructions instead of 2 in the most critical code, not a clever idea.

Quote:

Originally Posted by Thomas Richter

And that is just not true - why do you state something that is obviously false? You haven't even measured the various implementation choices. It might or might not be faster, depending on what your problem is, or it might be slower by a small margin that does not matter. If that makes the end result stable and independent of undocumented internals, it may be worth it.

Why would i measure ? If you really need measurement to know that any added instruction in the innermost critical loop of a program will make it slower, you should learn to code.

Sure, cpu designers should measure whether clock cycles added to every instruction will make their cpu slower.

(No, really, adding 1 clock decoding all our instructions is no big deal - most of them already take 4.)

a/b · 27 March 2021, 20:19

Quote:

Originally Posted by Thomas Richter

Well, it depends on what your task does... If anything in your task uses the FPU, even if only indirectly by opening a math library, or opening something that opens the math library, then the stack frame changes mid-term.

Sure, that's a possibility (actually using it, but only Meynaf can answer that since I don't know what dependencies his software has), but if it's only a possibility of probing or opening for whatever reason without actually using it (yeah, it's all hypothetical here because, again, I haven't seen the code/project), you can eliminate that before you start doing any "nasty" things:

Code:

; check exec->AttnFlags if FPU is present, and if it is load a NULL state
; (4-byte NULL frame as on the 68881 through 68040; the 68060 NULL frame is 12 bytes)
	clr.l	-(a7)
	frestore	(a7)+

And you can also do a state+regs save/restore on your own if you are worried about being used after you're done with your "main loop".

Bruce Abbott · 27 March 2021, 21:18

I did some testing on my A1200 with Blizzard 1230-IV 50MHz 030 to see what execution speed can be expected. First I timed 50 million nops (a block of 1000 nops repeated 50,000 times) which took 3 seconds. That's 16.7 mips.

Then I timed the following code, threading its way through 1000 different interpreted instructions that all did nothing (ie. equivalent to 50 million nops).

Code:

   move.w   (a6)+,d7
   jmp      ([a5,d7.l*4])

This took 34 seconds, which is ~1.5 mips. That is probably the upper limit on interpretation speed.

Finally I added code to break out of it if any 'flag' bits are set in a particular memory location (pointed to by A4), like this:-

Code:

   move.w   (a6)+,d7
   tst.b    (a4)
   bne.s    break
   jmp      ([a5,d7.l*4])
break:
   jmp     stop

This took 44 seconds, which is ~1.1 mips.

With CPU caches disabled it was slower of course, taking 37 seconds and 51 seconds respectively (about 9% and 16% slower).

In practice a lot more code will be required to interpret most instructions, so the difference between the 'fastest possible' code that is difficult to break out of, and the more useful code with test and branch, will be much less than these numbers suggest.
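
The arithmetic behind these figures, and behind the claim that the gap shrinks once handlers do real work, can be written down as a small C sketch. The timings (50 million instructions in 3, 34 and 44 seconds) are the ones measured above; the function names are made up, and the model simply assumes that dispatch, flag test and handler body costs add up, which ignores cache and pipeline effects. Any `body_us` value passed in is a hypothetical handler cost, not a measurement.

```c
#include <assert.h>

/* Million instructions per second for a run of `count` instructions. */
double mips(double count, double seconds)
{
    assert(seconds > 0.0);
    return count / seconds / 1e6;
}

/* Relative slowdown caused by the flag test when each handler also spends
   body_us microseconds on real work. base_s and checked_s are the timings
   of the empty-handler runs without and with the test. */
double check_overhead(double count, double base_s, double checked_s,
                      double body_us)
{
    assert(count > 0.0);
    double dispatch_us = base_s / count * 1e6;              /* fetch + jmp */
    double check_us = (checked_s - base_s) / count * 1e6;   /* tst + bne */
    return check_us / (dispatch_us + body_us);
}
```

With empty handlers, `check_overhead(50e6, 34, 44, 0)` comes out at roughly 29%. If a handler body is assumed to cost 2 microseconds, the same test costs about 7%.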

Rather than wasting time trying to figure out some sneaky and problematic way to break out of the execution sequence (like examining stack frames or poking the interpreter code) I suggest using the simple technique above. You can always try changing it later if you manage to make the rest of the interpreter fast enough to justify it.

meynaf · 29 March 2021, 08:15

Quote:

Originally Posted by a/b

only Meynaf can answer that since I don't know what dependencies his software has

Dependencies are limited : a few libraries (dos, intuition, graphics, keymap, asl), some devices (timer, input, audio), and one resource (ciab).
Not all of them will be used every time.

Quote:

Originally Posted by Bruce Abbott

This took 34 seconds, which is ~1.5 mips. That is probably the upper limit on interpretation speed.

This "upper limit on interpretation speed" is the reason why i'd like to keep it this way.

It is quite obvious that fast instructions will be more sensitive to this than slow instructions.
Some will be very swift :

Code:

; rts - fast if we keep stack ptr in a3
 move.l (a3)+,a6

; 8-bit bra
 extb.l d7
 add.l d7,a6

More typical ones can look like this :

Code:

 moveq #7,d0
 and.w d7,d0
 lsr.w #6,d7
 andi.w #15,d7
 move.l (a5,d7.w*4),$20(a5,d0.w*4)

Note that currently i don't have a real strategy to handle the ccr.
But you can already try to time the above if you want.
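
As a reading aid, the "more typical" handler above can be restated in C. This is a sketch under assumptions: it supposes that A5 points at sixteen consecutive longwords, so that the `$20` displacement selects the upper eight of them, and the name `op_move_model` is invented. It does not model the fetch/dispatch pair or the ccr.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* regs[0..15] models the sixteen longwords assumed to sit at (a5). */
void op_move_model(uint32_t regs[16], uint16_t opcode)
{
    assert(regs != NULL);
    unsigned dst = opcode & 7u;             /* moveq #7,d0 / and.w d7,d0 */
    unsigned src = (opcode >> 6) & 15u;     /* lsr.w #6,d7 / andi.w #15,d7 */
    regs[8 + dst] = regs[src];              /* move.l (a5,d7.w*4),$20(a5,d0.w*4) */
}
```

The low three bits of the opcode pick one of eight destinations in the upper bank, and bits 6 to 9 pick any of the sixteen sources.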

Quote:

Originally Posted by Bruce Abbott

Rather than wasting time trying to figure out some sneaky and problematic way to break out of the execution sequence (like examining stack frames or poking the interpreter code) I suggest using the simple technique above. You can always try changing it later if you manage to make the rest of the interpreter fast enough to justify it.

I will be using macros for this, so it could change anytime regardless of the initial choice. So nothing wrong in discussing it right now.

Don_Adan · 29 March 2021, 11:10

I think you can add the break check to only a few interpreter routines, not all of them, e.g. your rts routine and perhaps a few others. That will speed up your big loop. In my view, checking the break signal in every interpreter routine just wastes CPU time.
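
A minimal C sketch of this selective check, with invented names (`LoopVM`, `L_INC`, `L_LOOP`) and a made-up two-opcode instruction set: the straight-line opcode never looks at the flag, only the control-flow opcode does. Since any long-running interpreted program has to pass through a branch or return sooner or later, the break is still noticed, just with some delay. As in any C model, the surrounding loop and its `stopped` test only stand in for the threaded dispatch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { L_INC, L_LOOP };     /* hypothetical two-opcode instruction set */

typedef struct {
    const uint16_t *start;       /* L_LOOP jumps back here */
    const uint16_t *pc;
    volatile int    break_flag;  /* set by another task or an interrupt */
    long            acc;
    int             stopped;
} LoopVM;

void loopvm_init(LoopVM *vm, const uint16_t *program)
{
    assert(program != NULL);
    vm->start = program;
    vm->pc = program;
    vm->break_flag = 0;
    vm->acc = 0;
    vm->stopped = 0;
}

/* Runs at most max_steps instructions; returns how many were executed. */
size_t loopvm_run(LoopVM *vm, size_t max_steps)
{
    size_t steps = 0;
    while (steps < max_steps && !vm->stopped) {
        uint16_t op = *vm->pc++;
        steps++;
        switch (op) {
        case L_INC:             /* straight-line opcode: no flag test */
            vm->acc++;
            break;
        case L_LOOP:            /* control flow: the only place that tests */
            if (vm->break_flag)
                vm->stopped = 1;
            else
                vm->pc = vm->start;
            break;
        }
    }
    return steps;
}
```

Setting `break_flag` in the middle of a run lets the current straight-line stretch finish and stops the interpreter at the next `L_LOOP`.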


