19 March 2021, 18:17 | #21 | |||
Registered User
Join Date: Jan 2019
Location: Germany
Posts: 3,233
|
Quote:
Quote:
Frankly, did you actually measure that you have a performance problem before you try to solve it? A single "tst.b breakcondition" once in a while does not seem to ask for too much.
Quote:
How often will it actually happen that you have to abort the interpreter loop? If it is happening regularly, then an explicit test is the cheaper option as it does not have to go through a context switch (the CPU time for that has to come from somewhere, after all). If it is not happening regularly, then there is no need to bother about the rare cache pushes.
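As a rough illustration of the explicit-test option, here is a minimal C model of an interpreter loop that polls a break flag before each emulated instruction. The names `break_condition`, `execute` and `run_with_flag_test` are hypothetical, and the opcode 0xFFFF that raises the flag only stands in for an external event; this is a sketch of the idea, not the code discussed in the thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical break flag: the byte a "tst.b breakcondition" would test.
   volatile because something outside the loop is expected to set it. */
static volatile uint8_t break_condition = 0;

/* Hypothetical stand-in for executing one emulated instruction. */
static void execute(uint16_t opcode, uint32_t *acc)
{
    *acc += opcode;
    if (opcode == 0xFFFF)
        break_condition = 1;    /* pretend an external event arrived */
}

/* Runs until the stream ends or the flag is set.
   Returns the number of instructions executed. */
size_t run_with_flag_test(const uint16_t *code, size_t count, uint32_t *acc)
{
    size_t executed = 0;

    assert(code != NULL && acc != NULL);
    while (executed < count) {
        if (break_condition)    /* the one extra test per instruction */
            break;
        execute(code[executed], acc);
        executed++;
    }
    return executed;
}
```

The cost being debated is exactly the `if (break_condition)` line: it is paid on every pass through the loop, whether or not a break ever happens.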
19 March 2021, 18:30 | #22 |
This cat is no more
Join Date: Dec 2004
Location: FRANCE
Age: 52
Posts: 8,200
|
tst.b stopflag(a5) should be tst.b (a5), with a5 set up so that no offset computation is needed (or is that offset free anyway?), or make stopflag the field at offset 0 of your a5 struct.
Also, not sure if someone suggested this already:
Code:
; main loop
bigloop
    move.l  stopflag(a5),d7
    move.w  (a6)+,d7
    jmp     ([a4,d7.l*4])       ; suggested earlier
And use another 256 KB table that only contains the same address in every entry: the "special routine" address. Set stopflag so that it's $10000 when set, 0 if not.
Last edited by jotd; 19 March 2021 at 18:37.
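To make the doubled-table idea concrete, here is a small C model of it. The stop flag supplies bit 16 of the table index, so a set flag sends every opcode to the same "special routine" without any compare in the loop. All names (`vm_t`, `op_normal`, `op_raise`, `op_special`, `init_tables`, `run`) are hypothetical, and opcode 0xFFFF raising the flag is only there to simulate an outside event.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define OPCODES  0x10000u   /* 65536 possible 16-bit opcodes */
#define STOP_BIT 0x10000u   /* value of stopflag when set, as in the post */

typedef struct vm vm_t;
typedef void (*handler_t)(vm_t *vm, uint16_t opcode);

struct vm {
    volatile uint32_t stopflag; /* 0 or STOP_BIT, written from outside */
    uint32_t executed;          /* normal instructions executed */
    int stopped;                /* set by the special routine */
};

/* First half: normal handlers. Second half: the special routine only. */
static handler_t table[2 * OPCODES];

static void op_normal(vm_t *vm, uint16_t opcode)
{
    (void)opcode;
    vm->executed++;
}

/* Hypothetical opcode that simulates an external event raising the flag. */
static void op_raise(vm_t *vm, uint16_t opcode)
{
    (void)opcode;
    vm->executed++;
    vm->stopflag = STOP_BIT;
}

/* The fetched opcode has NOT been executed when we land here, so a real
   special routine would have to rewind or re-dispatch it. */
static void op_special(vm_t *vm, uint16_t opcode)
{
    (void)opcode;
    vm->stopped = 1;
}

void init_tables(void)
{
    for (uint32_t i = 0; i < OPCODES; i++) {
        table[i] = op_normal;
        table[OPCODES + i] = op_special;
    }
    table[0xFFFF] = op_raise;
}

void run(vm_t *vm, const uint16_t *code, size_t count)
{
    size_t pc = 0;

    /* The loop condition is an artifact of this C model; the 68k version
       simply jumps out of the loop from the special routine. */
    while (!vm->stopped && pc < count) {
        uint32_t index = vm->stopflag;              /* move.l stopflag(a5),d7 */
        index = (index & ~0xFFFFu) | code[pc++];    /* move.w (a6)+,d7 */
        table[index](vm, (uint16_t)index);          /* jmp ([a4,d7.l*4]) */
    }
}
```

The trade-off is memory: the second half of the table exists only so that bit 16 can redirect the jump.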
19 March 2021, 18:38 | #23 | |||||
son of 68k
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,323
|
Quote:
Listen, I have considered all possibilities, including using the task's exception signals (which will not work due to no access to registers), and nothing seems to beat this one. So far the best I have is:
Code:
    move.w  (a6)+,d7
    jmp     ([a4,d7.l*4])
The only way to interrupt it is to access the registers there, either A4 or bit #16 of D7. Any extra instruction in the loop is gonna make every emulated instruction slower.
Quote:
Not many functions are simply RTS (actually none), what made you jump to that conclusion? It's just that many functions will be identical (i.e. the opcode contains some parameter). Ending them with RTS does not help either, there is still the need to be as fast as possible for normal code while keeping the ability to break from either here or elsewhere.
Quote:
Quote:
Quote:
On 040/060 a cache push is probably nothing, but on 020/030 the whole cache is gonna be invalidated - doing that every vbl does not look fine to me.
19 March 2021, 18:51 | #24 | ||
son of 68k
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,323
|
Quote:
Indeed tst.b (a5) is possible. But even if not, it's possible to use a7 instead, as suggested before.
Quote:
But that's 4 instructions (move.l d7, move.w a6+, jmp, bra). Removing the bra (pun not intended) is possible by copying the whole block for every routine, but now it's bigger, which means more drain on the icache. And of course that's gonna be slower than if I could access d7 or a4 from outside.
19 March 2021, 19:21 | #25 | |
Registered User
Join Date: Jun 2015
Location: Germany
Posts: 1,918
|
Quote:
But I've totally lost track of whether this would give any advantage or not...
19 March 2021, 19:35 | #26 | |
son of 68k
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,323
|
Quote:
The solution to that is the "RTS that does not pop", i.e. JMP ([A7]), suggested before.
19 March 2021, 19:42 | #27 | ||
Registered User
Join Date: Jun 2015
Location: Germany
Posts: 1,918
|
Quote:
Quote:
19 March 2021, 19:57 | #28 | |
son of 68k
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,323
|
Quote:
And if we attempt to push before the rts, then said rts will use the new value, not the old one. I don't know for sure. It's probably similar to that of doing the same thing with an intermediate address register.
19 March 2021, 20:38 | #29 |
Registered User
Join Date: Jun 2016
Location: europe
Posts: 1,039
|
OK, back to the SMC approach.
What if you only clear a single icache line? Set bit 2 in CACR and select cache line in CAAR, instead of bit 3 in CACR (full clear). That's 4 bytes on 020, and 16 bytes on 030. Haven't used those on 040+, because of cinv/cpush. So maybe some extra footwork to support all CPUs. If you know the instruction address (task->loadseg list->abacus), for example on 020 the cache line is (address&255)/4, right? |
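The index arithmetic in the question can be checked with a small C helper. This is only a sketch of the formula as posted: the 68020 case assumes the 256-byte instruction cache with 64 longword entries, and the 68030 case takes the post's 16-byte line size at face value. The function names are made up, and how the result would be written to CAAR is deliberately left out.

```c
#include <assert.h>
#include <stdint.h>

/* 68020: 256-byte instruction cache, 64 entries of one longword each.
   The entry is selected by address bits 7..2. */
unsigned icache_entry_020(uint32_t address)
{
    return (address & 255u) / 4u;   /* result is 0..63 */
}

/* Assumption taken from the post: 16-byte lines on the 68030,
   so 256 / 16 = 16 lines, selected by address bits 7..4. */
unsigned icache_line_030(uint32_t address)
{
    return (address & 255u) / 16u;  /* result is 0..15 */
}
```

Addresses that are a multiple of 256 bytes apart map to the same entry, so clearing one entry only matters if the modified instruction is the one currently cached there.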
19 March 2021, 21:46 | #30 | ||
Registered User
Join Date: Jun 2015
Location: Germany
Posts: 1,918
|
Quote:
Quote:
20 March 2021, 02:34 | #31 | ||
Registered User
Join Date: Mar 2018
Location: Hastings, New Zealand
Posts: 2,584
|
Quote:
Quote:
Instead of obsessing over micro-optimizations I suggest you just go with the "somewhat naïve implementation", then profile it on live emulation code. That gives you a baseline to see what improvement any future optimizations might have. IOW, get the code working first - then worry about how to make it faster.
20 March 2021, 07:48 | #32 | ||||||
son of 68k
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,323
|
Quote:
But, thinking twice about it, it means the main decode loop is always at the same place, which forbids repeating that decode part at the end of every instruction - which in turn means an extra branch in the critical path.
Quote:
Quote:
Quote:
Quote:
And as for whether those 50+ will or will not be executed more often, just do some statistical study of typical cpu instruction streams to see if simple instructions are somewhat more common than complex ones.
Quote:
It already works. I currently have a working VM - albeit very slow.
23 March 2021, 10:26 | #33 | ||
Registered User
Join Date: Mar 2018
Location: Hastings, New Zealand
Posts: 2,584
|
Quote:
But there's a way you can convince me... release your code! Then we can have a competition to see who can make it fastest.
Quote:
23 March 2021, 10:54 | #34 | |
son of 68k
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,323
|
Why? Inner loops have always been the most speed sensitive, and this part is THE inner loop.
Quote:
This is more refactoring than micro-optimization, as the strategy used here will in turn impact everything else. But if you're interested enough to want to not only see the code but also help, you can PM me. That's my own stuff. I can't do hardware so I do my thing in software.
23 March 2021, 19:31 | #35 |
Registered User
Join Date: Mar 2018
Location: Hastings, New Zealand
Posts: 2,584
|
|
26 March 2021, 09:00 | #36 |
Registered User
Join Date: Mar 2018
Location: Hastings, New Zealand
Posts: 2,584
|
To get an idea of how much overhead you have I isolated the 'vmloop' code, inserted the code of unconditional subroutines and removed code that was executed conditionally. Inlining the subroutines would save a few bsr/rts pairs, but there is still a lot of code that could possibly be bypassed completely by some instructions.
I suggest encoding the instructions into 'classes' using e.g. the upper 4 bits like the 68000 does. Then do a partial decode on the class, jumping to code that is specific to that class of instruction. Further decoding would then be applied for miscellaneous instructions. After 'executing' the instruction you can jump to the finishing code that it needs (e.g. setcc, writeback), which then jumps back to the main loop. Small routines could be inlined to avoid unnecessary calls or jumps.
With that technique you should be able to significantly reduce the minimum instruction execution time. However real code often spends most of its time on more complex instructions, so it would be better to concentrate on speeding up the more 'popular' time consuming instructions rather than having the fastest possible NOP.
Last edited by Bruce Abbott; 26 March 2021 at 10:53. Reason: code removed as meynaf wants to keep it private
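The class-based decode suggested above can be sketched in C as a two-level dispatch: the upper 4 bits pick a class handler, the handler decodes the rest, then calls whatever finishing code it needs. The encodings used here (68000 MOVEQ in class 7, NOP as $4E71 in class 4) are borrowed from the 68000 purely for illustration and say nothing about the VM's real instruction set; `cpu_t`, `step` and the handler names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint32_t reg[8];
    int flags_updated;      /* set by the setcc finishing code */
    unsigned unknown;       /* count of opcodes nobody handled */
} cpu_t;

typedef void (*class_handler_t)(cpu_t *cpu, uint16_t op);

/* Finishing code shared by instructions that update the flags. */
static void finish_setcc(cpu_t *cpu) { cpu->flags_updated = 1; }

/* Class 7, 68000 MOVEQ layout: 0111 rrr0 dddddddd. */
static void class_moveq(cpu_t *cpu, uint16_t op)
{
    unsigned r = (op >> 9) & 7u;
    int32_t value = op & 0xFF;

    if (value & 0x80)
        value -= 0x100;     /* sign-extend the 8-bit immediate */
    cpu->reg[r] = (uint32_t)value;
    finish_setcc(cpu);
}

/* Class 4, miscellaneous: needs a further decode step of its own. */
static void class_misc(cpu_t *cpu, uint16_t op)
{
    switch (op) {
    case 0x4E71:            /* NOP: nothing to do, no finishing code */
        break;
    default:
        cpu->unknown++;
        break;
    }
}

static void class_unimplemented(cpu_t *cpu, uint16_t op)
{
    (void)op;
    cpu->unknown++;
}

/* Unlisted classes stay NULL and fall back to class_unimplemented. */
static const class_handler_t class_table[16] = {
    [0x4] = class_misc,
    [0x7] = class_moveq,
};

void step(cpu_t *cpu, uint16_t op)
{
    class_handler_t handler = class_table[op >> 12];    /* partial decode */

    if (handler == NULL)
        handler = class_unimplemented;
    handler(cpu, op);
}
```

A 16-entry class table is far smaller than a full 65536-entry opcode table, at the price of a second decode step inside each class.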
26 March 2021, 09:50 | #37 | |
son of 68k
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,323
|
Quote:
Besides, please don't publish code here. When I said "PM me", I meant that I prefer this to remain private.
26 March 2021, 11:19 | #38 | ||
Registered User
Join Date: Mar 2018
Location: Hastings, New Zealand
Posts: 2,584
|
Quote:
Quote:
26 March 2021, 12:22 | #39 | ||
son of 68k
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,323
|
Quote:
I would prefer that it remains private for now, but that's not set in stone.
Quote:
Of course there is code that must be inlined, and code that must be completely bypassed. For now individual instructions are only the alu part - the thing is that code must be reworked so that everything is there, with as much predecode as possible without making code size completely explode. And I think this requires a rewrite. And I will not rewrite everything without the basic structure, which is precisely the decoding part I wanted to optimize here.
27 March 2021, 03:48 | #40 | |
Registered User
Join Date: Jan 2008
Location: Warsaw/Poland
Age: 55
Posts: 1,975
|
Quote:
    move.l  SP,A0
findID
    cmp.w   #$5754,(A0)+
    bne.b   findID
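For readers less used to 68k post-increment addressing, the scan above behaves roughly like the following C model: it walks upward word by word until it sees the marker $5754, and ends up pointing just past it. The function name `find_id` and the `count` bound are additions for this sketch; the original loop has no bound and keeps scanning until the marker is found.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ID_WORD 0x5754u     /* the marker word compared in the loop */

/* Returns a pointer just past the first ID_WORD (where A0 would point
   after the final (A0)+), or NULL if it is not within 'count' words.
   The bound is an assumption of this model, not part of the original. */
const uint16_t *find_id(const uint16_t *start, size_t count)
{
    assert(start != NULL);
    for (size_t i = 0; i < count; i++) {
        if (start[i] == ID_WORD)
            return &start[i + 1];
    }
    return NULL;
}
```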