27 March 2021, 05:44 | #41 |
Registered User
Join Date: Jun 2016
Location: europe
Posts: 1,053
|
I suggested that already, but it's not 100% reliable. If you start at the bottom you eventually have to go through the FPU stack frame, which could also include a rather large FPU state, so who knows what's in there.
I was also thinking about starting from the top side, but it suffers from the same problem, although it could be a little better, depending on stack usage. I don't know what the code looks like or what is happening with the stack. I'm pretty sure there are some subroutine calls, maybe even nested ones; for example, what if you end up storing d6 on the stack, so you have it there multiple times? It's ~99% likely that the worker task will be switched out during opcode interpretation (and not in the "main loop", which is ~2 instructions), so the stack could look something like this:
<bottom, task->tc_SPReg>
<variable FPU stack frame>
pc
sr
d0 ... a6
<local stack>
<rts address>
<local stack>
<rts address>
<top, this is where we are in main loop>
Still very sketchy... |
27 March 2021, 07:52 | #42 |
son of 68k
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,351
|
It is not a problem that the value might be anywhere in stack, IF it's always located at the same place on the same setup.
I mean, i could look for a magical value at startup. Then change the value, trigger a task switch, then check if the value in stack has also changed. If yes keep the offset, else continue scanning. But i am not sure the fpu stack frame has constant length over time. Reading rom code didn't suggest it had. |
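The calibration idea above can be modelled in plain C, away from any Amiga specifics. This is only a sketch under assumptions: the two arrays stand in for snapshots of the task's saved stack taken after two different task switches, and `find_register_slot`, `MAGIC_A` and `MAGIC_B` are hypothetical names, not anything provided by exec.library. A slot is accepted only if it held the first magic value before the change and the second one after it, which is the "keep the offset, else continue scanning" rule.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAGIC_A 0xC0DE0001u  /* hypothetical value put in the register first */
#define MAGIC_B 0xC0DE0002u  /* hypothetical value put there before the 2nd switch */

/* Return the index of the first longword that followed the register across
   both snapshots, or -1 if no slot did (calibration failed). */
long find_register_slot(const uint32_t *before, const uint32_t *after,
                        size_t count)
{
    assert(before != NULL && after != NULL);
    for (size_t i = 0; i < count; i++) {
        /* A stale copy of MAGIC_A elsewhere on the stack does not turn
           into MAGIC_B, so it is skipped and the scan continues. */
        if (before[i] == MAGIC_A && after[i] == MAGIC_B)
            return (long)i;
    }
    return -1;
}
```

The sketch only shows how a candidate offset is confirmed once. Whether that offset stays valid later is exactly the open question about the FPU frame length.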
27 March 2021, 11:13 | #43 |
Registered User
Join Date: Jan 2008
Location: Warsaw/Poland
Age: 56
Posts: 2,029
|
If you want to be 1000% sure that this is the correct D6 word value, you can add a second (different) ID in the D4 high word too. Later, only one extra check is needed.
|
27 March 2021, 11:50 | #44 | |
son of 68k
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,351
|
Quote:
My current idea anyway is to repeat this code at the end of every routine : Code:
move.w (a6)+,d7
jmp ([a4,d7.w*4])
That's the same at the end, except that it's faster (removes the branch returning to main loop) and frees D6 (in which i would prefer to have full 32 bits available). |
|
27 March 2021, 13:19 | #45 |
Registered User
Join Date: Jun 2016
Location: europe
Posts: 1,053
|
OK, one thing I completely missed. If *your* worker task is not using FPU at all or not using it once you go into main loop, FPU stack frame will *always* be NULL (or maybe also IDLE is possible?). So if you kind of calibrate/synchronize your tasks (for the current hw/system) before you enter the main loop, a4 should always be at the same offset.
|
27 March 2021, 13:56 | #46 |
Registered User
Join Date: Jan 2019
Location: Germany
Posts: 3,300
|
It isn't. The stack frame is FPU-model dependent, and state-dependent as well. The NULL state frame is 4 bytes on the 68881 through 68040, but that's all. The 68060 NULL state frame is different: 12 bytes. Depending on the state of the FPU, the state frame may also be an "idle" frame (with less information), a "busy" frame (with more information) or an "exception stack frame". I also don't know where or how the Vampire puts its registers there. This is really off-limits, and system-, hardware- and state-dependent.
|
27 March 2021, 14:08 | #47 | |
son of 68k
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,351
|
Quote:
What matters isn't that the stack frame depends on the config of the machine; the problem is : does a register remain at the same place once it has been located ? |
|
27 March 2021, 14:31 | #48 |
Registered User
Join Date: Jun 2016
Location: europe
Posts: 1,053
|
That's what I addressed in my previous post. If your task is not using the FPU at all, then the state is either its initial state (set when the task was created: NULL=0.L), or it doesn't matter because you are running on an older KS version that doesn't support the FPU (no extra 4 zero bytes).
|
27 March 2021, 14:38 | #49 |
son of 68k
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,351
|
One sure thing is that MY task's not gonna use the FPU. But what if another one does in the meantime ? Can this change the state ?
|
27 March 2021, 14:56 | #50 |
Registered User
Join Date: Jun 2016
Location: europe
Posts: 1,053
|
Yes, but not the FPU state of *your* task. If another task is using it, this happens during task switch:
- FPU state is saved to another task's stack (if not NULL, it's followed by fp0-7, fpcr/fpsr/fpiar and possibly other stuff but we don't have to know that at all)
- task switch
- FPU state is restored from your stack (=NULL)
- you do your thing until task switch
- FPU state is saved to your stack (=NULL)
etc. |
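The save/restore sequence in the list above can be illustrated with a toy model in C. It is a sketch under stated assumptions, not Exec's actual code: a single integer stands in for the whole FPU state frame (0 meaning NULL), and `struct task`, `task_switch`, `use_fpu` and `fpu_current` are hypothetical names. It shows only the bookkeeping claim: another task's FPU activity ends up in that task's saved state, not in yours.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define FPU_NULL 0u                    /* stands in for the NULL state frame */

struct task { uint32_t saved_fpu; };   /* hypothetical: one saved state per task */

static uint32_t fpu_live = FPU_NULL;   /* state currently held by the FPU */

/* Model of one task switch: save the live state for the outgoing task,
   then restore whatever the incoming task saved earlier. */
void task_switch(struct task *from, struct task *to)
{
    assert(from != NULL && to != NULL);
    from->saved_fpu = fpu_live;
    fpu_live = to->saved_fpu;
}

/* Model of the running task executing FPU code: state becomes non-NULL. */
void use_fpu(void) { fpu_live = 1u; }

/* Read back the state the FPU currently holds. */
uint32_t fpu_current(void) { return fpu_live; }
```

In this model a task that never calls `use_fpu` always gets `FPU_NULL` saved for it, whatever the other tasks do. The objection raised elsewhere in the thread (something your task opens using the FPU indirectly) corresponds to `use_fpu` being called while your task is the running one.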
27 March 2021, 15:17 | #51 |
Registered User
Join Date: Jan 2019
Location: Germany
Posts: 3,300
|
Well, it depends on what your task does... If anything in your task uses the FPU, even if only indirectly by opening a math library, or opening something that opens the math library, then the stack frame changes mid-term.
The problem is really that you depend on something the OS does not document, and it leaves this undocumented so that it can remain extensible. The old Amiga problem: failing to understand the difference between interface and implementation. |
27 March 2021, 15:29 | #52 | |||
son of 68k
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,351
|
Quote:
Quote:
Quote:
But maybe you have a better idea ? Something that can work as fast but does not depend on anything undocumented ? As currently it's the choice between doing it this way and not doing it at all... |
|||
27 March 2021, 15:32 | #53 | |
Registered User
Join Date: Jun 2016
Location: europe
Posts: 1,053
|
Quote:
Yeah, it *is* OS internal implementation, and relying on it is 100% unsupported. I can accept that and still proceed with a certain probability of success. If the pattern is present in KS1.2 to KS3.1.x, it's 100% and I can live with that. Well, it's up to Meynaf in this case. |
|
27 March 2021, 16:48 | #54 |
son of 68k
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,351
|
|
27 March 2021, 16:57 | #55 | ||
Registered User
Join Date: Jan 2019
Location: Germany
Posts: 3,300
|
Quote:
Quote:
I already gave you a better idea. But first things first: a) measure, b) improve if necessary. I am not convinced that there is much of a noticeable difference, and you should first establish that there is a problem that needs to be solved. And that is just not true - why do you state something that is obviously false? You haven't even measured the various implementation choices. It might or might not be faster, depending on what your problem is, or it might be slower by a small margin that does not matter. If that makes the end result stable and independent of undocumented internals, it may be worth it. |
||
27 March 2021, 18:15 | #56 | ||||
son of 68k
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,351
|
Quote:
Quote:
Quote:
What was it already ? The tst.b on a variable ? 4 instructions instead of 2 in the most critical code, not a clever idea. Quote:
Sure, cpu designers should measure whether clock cycles added to every instruction will make their cpu slower. (No, really, adding 1 clock decoding all our instructions is no big deal - most of them already take 4.) |
||||
27 March 2021, 20:19 | #57 | |
Registered User
Join Date: Jun 2016
Location: europe
Posts: 1,053
|
Quote:
Code:
; check exec->AttnFlags if FPU is present, and if it is load a NULL state
clr.l -(a7)
frestore (a7)+
|
|
27 March 2021, 21:18 | #58 |
Registered User
Join Date: Mar 2018
Location: Hastings, New Zealand
Posts: 2,708
|
I did some testing on my A1200 with Blizzard 1230-IV 50MHz 030 to see what execution speed can be expected. First I timed 50 million nops (a block of 1000 nops repeated 50,000 times) which took 3 seconds. That's 16.7 mips.
Then I timed the following code, threading its way through 1000 different interpreted instructions that all did nothing (ie. equivalent to 50 million nops). Code:
move.w (a6)+,d7
jmp ([a5,d7.l*4])
Finally I added code to break out of it if any 'flag' bits are set in a particular memory location (pointed to by A4), like this:- Code:
move.w (a6)+,d7
tst.b (a4)
bne.s break
jmp ([a5,d7.l*4])
break:
jmp stop
With CPU caches disabled it was slower of course, taking 37 seconds and 51 seconds respectively (about 15% slower).
In practice a lot more code will be required to interpret most instructions, so the difference between the 'fastest possible' code that is difficult to break out of, and the more useful code with test and branch, will be much less than these numbers suggest.
Rather than wasting time trying to figure out some sneaky and problematic way to break out of the execution sequence (like examining stack frames or poking the interpreter code) I suggest using the simple technique above. You can always try changing it later if you manage to make the rest of the interpreter fast enough to justify it. |
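For readers who do not follow 68k assembly, the test-and-branch dispatch above can be sketched in C. This is an illustration under assumptions, not the poster's benchmark: `run`, `op_nop`, `op_raise` and `break_flags` are hypothetical names, a function-pointer table replaces the memory-indirect `jmp`, and a central loop replaces the tail dispatch repeated at the end of each routine. The comments map each step to the instruction it models.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef void (*handler_fn)(void);

volatile uint8_t break_flags = 0;   /* the byte A4 points to in the listing */

void op_nop(void) { }                    /* interpreted instruction doing nothing */
void op_raise(void) { break_flags = 1; } /* stands in for another task setting a flag */

/* Dispatch up to `count` opcodes, testing *flags before every one.
   Returns the number of handlers actually executed. */
size_t run(const uint16_t *code, size_t count,
           const handler_fn *table, const volatile uint8_t *flags)
{
    assert(code != NULL && table != NULL && flags != NULL);
    size_t done = 0;
    for (size_t pc = 0; pc < count; ) {
        uint16_t op = code[pc++];   /* move.w (a6)+,d7 */
        if (*flags != 0)            /* tst.b (a4) / bne.s break */
            break;
        table[op]();                /* jmp ([a5,d7.l*4]) */
        done++;
    }
    return done;
}
```

As in the assembly, the opcode is fetched before the flag is tested, so the instruction after the one that raised the flag is fetched but never executed.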
29 March 2021, 08:15 | #59 | |||
son of 68k
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,351
|
Quote:
Not all of them will be used every time. Quote:
It is quite obvious that fast instructions will be more sensitive to this than slow instructions. Some will be very swift : Code:
; rts - fast if we keep stack ptr in a3
move.l (a3)+,a6
; 8-bit bra
extb.l d7
add.l d7,a6
Code:
moveq #7,d0
and.w d7,d0
lsr.w #6,d7
andi.w #15,d7
move.l (a5,d7.w*4),$20(a5,d0.w*4)
But you can already try to time the above if you want. Quote:
|
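The second listing above decodes two register fields from the opcode word and copies one longword. A C rendering may make the bit fields easier to see. It rests on an assumption the post does not state: that A5 points at a block of 16 longwords with the data registers first and the address registers starting at byte offset $20. `struct regfile` and `op_move_to_areg` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed layout: 16 longwords, data registers at 0..7 and address
   registers at 8..15 (byte offset $20 = 8 longwords). */
struct regfile { uint32_t r[16]; };

/* Copy any of the 16 registers (opcode bits 6-9) into the address
   register selected by opcode bits 0-2. */
void op_move_to_areg(struct regfile *rf, uint16_t opcode)
{
    assert(rf != NULL);
    unsigned dst = opcode & 7u;           /* moveq #7,d0 / and.w d7,d0 */
    unsigned src = (opcode >> 6) & 15u;   /* lsr.w #6,d7 / andi.w #15,d7 */
    rf->r[8 + dst] = rf->r[src];          /* move.l (a5,d7.w*4),$20(a5,d0.w*4) */
}
```

With only five instructions of real work, a two-instruction flag test added to the dispatch is a large relative cost for a routine like this one, which is the point being made about fast instructions.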
|||
29 March 2021, 11:10 | #60 |
Registered User
Join Date: Jan 2008
Location: Warsaw/Poland
Age: 56
Posts: 2,029
|
I think that you can add the break check to only a few interpreter routines, not to all of them, e.g. to your rts interpreter routine and perhaps a few others. That will speed up your big loop. To me, checking for the break signal in every interpreter routine only wastes CPU time.
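The selective check suggested above could look roughly like this in C. It is a sketch under assumptions: the opcode numbers, `run_sparse`, and the choice of rts as the only polling routine are illustrative, not taken from the interpreter being discussed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { OP_NOP = 0, OP_RTS = 1 };   /* hypothetical opcode numbers */

/* Interpret up to `count` opcodes; only OP_RTS looks at *flags.
   Returns how many opcodes ran and stores the number of polls in *polls. */
size_t run_sparse(const uint8_t *code, size_t count,
                  const volatile uint8_t *flags, size_t *polls)
{
    assert(code != NULL && flags != NULL && polls != NULL);
    size_t done = 0;
    *polls = 0;
    while (done < count) {
        uint8_t op = code[done++];
        if (op == OP_RTS) {        /* the only routine that pays for the test */
            (*polls)++;
            if (*flags != 0)
                break;
        }
        /* every other opcode runs without touching the flag */
    }
    return done;
}
```

The trade-off is latency: a raised flag is noticed only at the next polling opcode, and interpreted code that loops without ever reaching one would never notice it. That is why branch routines are a natural candidate to poll alongside rts.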
|