English Amiga Board


Old 19 March 2021, 18:17   #21
Thomas Richter
Registered User
 
Join Date: Jan 2019
Location: Germany
Posts: 3,233
Quote:
Originally Posted by meynaf View Post
A warning sign will not stop me if that's the only available solution.
At least, should I expect the stored PC to be at the same place ?
Once again: It is not documented where in the stack frame the PC is stored, so you cannot expect anything. Developing software means working against documented interfaces, and there is nothing documented there. The stack frame changes depending on whether there is a FPU in the system or not, or whether this is a 68040 or 68060 FPU. The vampire will store its additional registers also somewhere in the stack frame, also undocumented. This is "beyond limits".



Quote:
Originally Posted by meynaf View Post

You mean, end all functions with something JMP (A3) with A3 leading to bigloop ? That does not solve the problem of A3 not available from other contexts.
I'm not sure what you mean by "context". Concerning A3 not being usable, I understood that your problem is that many functions are just an RTS, and for those "many dummies", you could replace them with "jmp (a3)". For regular functions, you can always do a "move.l a3,-(a7)" on entry and end them with RTS. This would keep A3 available for them.


Frankly, did you actually measure that you have a performance problem before you try to solve it? A single "tst.b breakcondition" once in a while does not seem to ask for too much.



Quote:
Originally Posted by meynaf View Post
I guess i could call CacheClearE() to handle this, but i'd prefer to avoid SMC altogether.
GFABasic could have solved it this way, but at that time, there was no CacheClearE() either.


Quote:
Originally Posted by meynaf View Post
A problem here is that caches will be trashed twice (first time to install the temporary patch, the other to restore normal instruction). And this isn't good for performance.
How often will that actually happen that you have to abort the interpreter loop? If it is happening regularly, then an explicit test is the cheaper option as it does not have to go through a context switch (the CPU time for that has to come from somewhere, after all). If it is not happening regularly, then there is no need to bother about the rare cache pushes.
Thomas Richter is offline  
Old 19 March 2021, 18:30   #22
jotd
This cat is no more
 
jotd's Avatar
 
Join Date: Dec 2004
Location: FRANCE
Age: 52
Posts: 8,200
tst.b stopflag(a5) should be tst.b (a5), with a5 properly set, to save the offset computation

(or is that offset thing free?) (or use 0 for stopflag in your a5 struct)

also not sure if someone suggested this:

Code:
; main loop
bigloop
 move.l stopflag(a5),d7  ; 0 normally, $10000 when stopping; survives in the high word
 move.w (a6)+,d7         ; fetch the next opcode into the low word
 jmp    ([a4,d7.l*4])    ; suggested earlier; the flag selects the second table
Then the first 256kb table is your routine table, whose routines all end with "bra bigloop".
And use another 256kb table that only contains the same address: the "special routine" address.

Set stopflag so it's $10000 when set, 0 if not.
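The effect of folding the flag into the dispatch index can be sketched in C (a hypothetical, scaled-down model: NOPS stands in for 65536, and normal_op/break_op are made-up names; the real tables hold 65536 four-byte entries, i.e. 256kb each):

```c
#include <assert.h>

/* Scaled-down model of the two-half dispatch table: in the 68k version
 * stopflag is $10000 and each half of the table is 256kb. */
#define NOPS 8

typedef void (*handler_t)(void);

static int normal_runs, break_runs;

static void normal_op(void) { normal_runs++; }  /* would end with "bra bigloop" */
static void break_op(void)  { break_runs++;  }  /* the "special routine" */

/* one array, two halves: [0,NOPS) real handlers, [NOPS,2*NOPS) all break_op */
static handler_t table[2 * NOPS];

static void init_table(void) {
    for (int i = 0; i < NOPS; i++) {
        table[i]        = normal_op;
        table[NOPS + i] = break_op;  /* every slot of the second half is the same */
    }
}

/* models: move.l stopflag(a5),d7 / move.w (a6)+,d7 / jmp ([a4,d7.l*4]) */
static void dispatch(unsigned stopflag, unsigned opcode) {
    table[stopflag | opcode]();  /* stopflag is 0 or NOPS, so it picks the half */
}
```

The inner-loop cost is one extra move.l over the two-instruction version; the win is that stopping needs no code patching and no cache maintenance.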

Last edited by jotd; 19 March 2021 at 18:37.
jotd is offline  
Old 19 March 2021, 18:38   #23
meynaf
son of 68k
 
meynaf's Avatar
 
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,323
Quote:
Originally Posted by Thomas Richter View Post
Once again: It is not documented where in the stack frame the PC is stored, so you cannot expect anything. Developing software means working against documented interfaces, and there is nothing documented there. The stack frame changes depending on whether there is a FPU in the system or not, or whether this is a 68040 or 68060 FPU. The vampire will store its additional registers also somewhere in the stack frame, also undocumented. This is "beyond limits".
Perhaps this should be documented then
Listen, i have considered all possibilities, including using the task's exception signals (which will not work due to having no access to the registers), and nothing seems to beat this one.

So far the best i have is :
Code:
 move.w (a6)+,d7
 jmp ([a4,d7.l*4])
And that's gonna be repeated at the end of each function.
The only way to interrupt it is to access the registers there, either A4 or bit #16 of D7.
Any extra instruction in the loop is gonna make every emulated instruction slower.


Quote:
Originally Posted by Thomas Richter View Post
I'm not sure what you mean by "context". Concerning A3 not being usable, I understood that your problem is that many functions are just an RTS, and for those "many dummies", you could replace them with "jmp (a3)". For regular functions, you can always do a "move.l a3,-(a7)", and end them with RTS. This would keep A3 for them available.
Context can be : our task, an interrupt, or another task.
Not many functions are simply RTS (actually none); what made you jump to that conclusion ? It's just that many functions will be identical (i.e. the opcode contains some parameter).
Ending them with RTS does not help either, there is still the need to be as fast as possible for normal code while keeping the ability to break from either here or elsewhere.


Quote:
Originally Posted by Thomas Richter View Post
Frankly, did you actually measure that you have a performance problem before you try to solve it? A single "tst.b breakcondition" once in a while does not seem to ask for too much.
Frankly, there is no need to measure anything. There IS a performance problem, period. Because it's not once in a while, it's potentially millions of times per second.


Quote:
Originally Posted by Thomas Richter View Post
GFABasic could have solved it this way, but at that time, there was no CacheClearE() either.
So we can't blame GFA for having done something without knowing what the future would be


Quote:
Originally Posted by Thomas Richter View Post
How often will that actually happen that you have to abort the interpreter loop? If it is happening regularly, then an explicit test is the cheaper option as it does not have to go through a context switch (the CPU time for that has to come from somewhere, after all). If it is not happening regularly, then there is no need to bother about the rare cache pushes.
It is not happening very often, but nevertheless at least once per frame.
On 040/060 a cache push is probably nothing, but on 020/030 the whole cache is gonna be invalidated - doing that every vbl does not look fine to me.
meynaf is offline  
Old 19 March 2021, 18:51   #24
meynaf
son of 68k
 
meynaf's Avatar
 
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,323
Quote:
Originally Posted by jotd View Post
tst.b stopflag(a5) should be tst.b (a5) with a5 properly set to save offset computation

(or is that offset thing free?) (or use 0 for stopflag in your a5 struct)
That offset thing is free starting with 040, but not for 020/030.
Indeed tst.b (a5) is possible. But even if not, it's possible to use a7 instead like suggested before.


Quote:
Originally Posted by jotd View Post
also not sure if someone suggested this:

Code:
; main loop
bigloop
 move.l  stopflag(a5),d7
 move.w (a6)+,d7
 jmp    ([a4,d7.l*4])  ; suggested earlier
then first 256kb table is your routine table that all end by "bra bigloop"
And use another 256kb table that only contains the same address: the "special routine" address.

Set stopflag so it's $10000 when set, 0 if not.
Didn't think about that one. Quite clever

But that's 4 instructions (move.l d7, move.w a6+, jmp, bra).
Removing the bra (pun not intended) is possible by copying the whole block for every routine, but then the code is bigger, which means more pressure on the icache.
And of course that's gonna be slower than if i could access d7 or a4 from outside.
meynaf is offline  
Old 19 March 2021, 19:21   #25
grond
Registered User
 
Join Date: Jun 2015
Location: Germany
Posts: 1,918
Quote:
Originally Posted by meynaf View Post
This means i must be sure the return address is there, but it won't if we got interrupted during the execution of the fetch or jmp instructions in the main loop.
True. You could push the entry of your main loop onto the stack and JMP into the subroutines. The subroutines would then do a RTS on exit. If an interrupt happened, the minimal interrupt handler would have changed the return address and the subroutine would go into the extended handler which then would JMP to the beginning of your main routine again.

But I've totally lost track whether this would give any advantage or not...
grond is offline  
Old 19 March 2021, 19:35   #26
meynaf
son of 68k
 
meynaf's Avatar
 
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,323
Quote:
Originally Posted by grond View Post
True. You could push the entry of your main loop onto the stack and JMP into the subroutines. The subroutines would then do a RTS on exit. If an interrupt happened, the minimal interrupt handler would have changed the return address and the subroutine would go into the extended handler which then would JMP to the beginning of your main routine again.

But I've totally lost track whether this would give any advantage or not...
The problem is that RTS will pop the return address, so it needs to be pushed back at the start of main loop. And if the interrupt occurs right between those two, guess what will happen...
The solution to that is the "RTS that does not pop", i.e. JMP ([A7]) suggested before.
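The single-slot idea behind JMP ([A7]) can be modelled in C (hypothetical names: resume_slot plays the role of the longword at (A7), and fake_interrupt stands in for the real interrupt handler):

```c
#include <assert.h>

/* Model of the "RTS that does not pop": the resume address lives in one
 * memory slot which handlers jump through without consuming it, so an
 * interrupt can redirect execution with a single atomic pointer write,
 * with no window where the address is missing from the stack. */

typedef void (*cont_t)(void);

static volatile cont_t resume_slot;  /* plays the role of (A7) */
static int loop_iters, break_hits;

static void mainloop_step(void) { loop_iters++; }
static void break_handler(void) { break_hits++; }

/* what an interrupt would do: overwrite the slot, no stack surgery */
static void fake_interrupt(void) { resume_slot = break_handler; }

static void run(int steps) {
    resume_slot = mainloop_step;
    for (int i = 0; i < steps; i++)
        resume_slot();  /* models jmp ([a7]): read the slot, don't pop it */
}
```

Because nothing is ever popped, there is no instant at which the return address is absent, which is exactly the race the push-after-RTS variant suffers from.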
meynaf is offline  
Old 19 March 2021, 19:42   #27
grond
Registered User
 
Join Date: Jun 2015
Location: Germany
Posts: 1,918
Quote:
Originally Posted by meynaf View Post
The problem is that RTS will pop the return address, so it needs to be pushed back at the start of main loop. And if the interrupt occurs right between those two, guess what will happen...
I don't need to guess, I know what happens. That's why the first thing the proposed routine would do is to push its start address onto the stack again.


Quote:
The solution to that is the "RTS that does not pop", i.e. JMP ([A7]) suggested before.
Yes, but how are cycle counts for that double-indirection?
grond is offline  
Old 19 March 2021, 19:57   #28
meynaf
son of 68k
 
meynaf's Avatar
 
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,323
Quote:
Originally Posted by grond View Post
I don't need to guess, I know what happens. That's why the first thing the proposed routine would do is to push its start address onto the stack again.
Right, but the interrupt might come between the rts and the push - even if it's the first thing it does, it's already too late.
And if we attempt to push before the rts, then said rts will use the new value, not the old one.


Quote:
Originally Posted by grond View Post
Yes, but how are cycle counts for that double-indirection?
I don't know for sure. It's probably similar to that of doing the same thing with an intermediate address register.
meynaf is offline  
Old 19 March 2021, 20:38   #29
a/b
Registered User
 
Join Date: Jun 2016
Location: europe
Posts: 1,039
OK, back to the SMC approach.
What if you only clear a single icache line? Set bit 2 in CACR and select cache line in CAAR, instead of bit 3 in CACR (full clear). That's 4 bytes on 020, and 16 bytes on 030. Haven't used those on 040+, because of cinv/cpush. So maybe some extra footwork to support all CPUs. If you know the instruction address (task->loadseg list->abacus), for example on 020 the cache line is (address&255)/4, right?
a/b is online now  
Old 19 March 2021, 21:46   #30
grond
Registered User
 
Join Date: Jun 2015
Location: Germany
Posts: 1,918
Quote:
Originally Posted by meynaf View Post
Right, but the interrupt might come between the rts and the push - even if it's the first thing it does, it's already too late.
And if we attempt to push before the rts, then said rts will use the new value, not the old one.
You could set up the return address before installing the irq handler. Then again each time you enter the main loop.



Quote:
I don't know for sure. It's probably similar to that of doing the same thing with an intermediate address register.
I wouldn't be surprised if both methods were the same on 020/030 but saving one register could be useful.
grond is offline  
Old 20 March 2021, 02:34   #31
Bruce Abbott
Registered User
 
Bruce Abbott's Avatar
 
Join Date: Mar 2018
Location: Hastings, New Zealand
Posts: 2,584
Quote:
Originally Posted by meynaf View Post
This time it's about optimising code to death.
Little care about the size, it's all about speed. It can't be fast enough.
The goal is to execute code depending on some values read from a stream.
See it as some kind of cpu emulation.
You want it as fast as possible? Then you have to consider what you are emulating.

Quote:
More than 8 bytes for sure. Average around 16. Biggest ones can reach 50+ bytes, i'm not sure - but they can probably use subroutines.
I bet your dispatch code is only a small proportion of the total execution time, but it depends on the mix of 8 to 50+ instructions. "Average of around 16" means nothing if 50+ is executed more often.

Instead of obsessing over micro-optimizations I suggest you just go with the "somewhat naïve implementation", then profile it on live emulation code. That gives you a baseline to see what improvement any future optimizations might have.

IOW, get the code working first - then worry about how to make it faster.
Bruce Abbott is offline  
Old 20 March 2021, 07:48   #32
meynaf
son of 68k
 
meynaf's Avatar
 
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,323
Quote:
Originally Posted by a/b View Post
OK, back to the SMC approach.
What if you only clear a single icache line? Set bit 2 in CACR and select cache line in CAAR, instead of bit 3 in CACR (full clear). That's 4 bytes on 020, and 16 bytes on 030. Haven't used those on 040+, because of cinv/cpush. So maybe some extra footwork to support all CPUs. If you know the instruction address (task->loadseg list->abacus), for example on 020 the cache line is (address&255)/4, right?
I've never done this by hand, but i suppose this can work.
But, thinking twice about it, it means the main decode loop is always at the same place, which forbids repeating that decode part at the end of every instruction - which in turn means an extra branch in the critical path.
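The trade-off weighed here (one shared loop head versus a copy of the fetch/dispatch at the end of every handler) can be sketched in C with a tiny function-pointer VM. The opcodes and the DISPATCH macro are invented for illustration; in asm the replicated tail is a jump, whereas this C model recurses:

```c
#include <assert.h>
#include <stdint.h>

typedef void handler_fn(const uint16_t **pc);
static handler_fn *table[4];
static int acc;

/* each handler ends with its own copy of the fetch+jump, so there is
 * no "bra bigloop" back to a shared loop head */
#define DISPATCH(pc) do { table[*(*(pc))++](pc); } while (0)

static void op_halt(const uint16_t **pc) { (void)pc; }            /* no dispatch: stop */
static void op_inc (const uint16_t **pc) { acc += 1; DISPATCH(pc); }
static void op_dbl (const uint16_t **pc) { acc *= 2; DISPATCH(pc); }

static int run(const uint16_t *prog) {
    table[0] = op_halt;
    table[1] = op_inc;
    table[2] = op_dbl;
    acc = 0;
    DISPATCH(&prog);  /* kick off; every handler re-dispatches itself */
    return acc;
}
```

Replicating DISPATCH saves one taken branch per emulated instruction, at the cost of bigger handlers - the icache pressure discussed above.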


Quote:
Originally Posted by grond View Post
You could set up the return address before installing the irq handler. Then again each time you enter the main loop.
That's not the problem. Here the RTS will pop the return address and it's possible to have an irq before it gets pushed again.


Quote:
Originally Posted by grond View Post
I wouldn't be surprised if both methods were the same on 020/030 but saving one register could be useful.
Indeed.


Quote:
Originally Posted by Bruce Abbott View Post
You want it as fast as possible? Then you have to consider what you are emulating.
What does that change ?


Quote:
Originally Posted by Bruce Abbott View Post
I bet your dispatch code is only a small proportion of the total execution time, but it depends on the mix of 8 to 50+ instructions. "Average of around 16" means nothing if 50+ is executed more often.
A small proportion, not really. The dispatch code is executed for every instruction. Like the decode time is there for every instruction a cpu has to execute.
And as to whether the 50+ byte routines are executed more often, just do some statistical study of typical cpu instruction streams to see that simple instructions are somewhat more common than complex ones.


Quote:
Originally Posted by Bruce Abbott View Post
Instead of obsessing over micro-optimizations I suggest you just go with the "somewhat naïve implementation", then profile it on live emulation code. That gives you a baseline to see what improvement any future optimizations might have.
I don't need profiling to know these instructions are the most critical of the whole app.


Quote:
Originally Posted by Bruce Abbott View Post
IOW, get the code working first - then worry about how to make it faster.
It already works. I currently have a working VM - albeit very slow.
meynaf is offline  
Old 23 March 2021, 10:26   #33
Bruce Abbott
Registered User
 
Bruce Abbott's Avatar
 
Join Date: Mar 2018
Location: Hastings, New Zealand
Posts: 2,584
Quote:
Originally Posted by meynaf View Post
I don't need profiling to know these instructions are the most critical of the whole app.
I'm skeptical.

But there's a way you can convince me... release your code! Then we can have a competition to see who can make it fastest.

Quote:
It already works. I currently have a working VM - albeit very slow.
What is it for?
Bruce Abbott is offline  
Old 23 March 2021, 10:54   #34
meynaf
son of 68k
 
meynaf's Avatar
 
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,323
Quote:
Originally Posted by Bruce Abbott View Post
I'm skeptical.
Why ? Inner loops have always been the most speed sensitive, and this part is THE inner loop.


Quote:
Originally Posted by Bruce Abbott View Post
But there's a way you can convince me... release your code! Then we can have a competition to see who can make it fastest.
Release the code ? Maybe, but it's too big to be suitable for a contest.
This is more refactoring than micro-optimization, as used strategy here will in turn impact everything else.
But if you're interested enough to want to not only see the code but also help, you can PM me.


Quote:
Originally Posted by Bruce Abbott View Post
What is it for?
That's my own stuff. I can't do hardware so i do my thing in software.
meynaf is offline  
Old 23 March 2021, 19:31   #35
Bruce Abbott
Registered User
 
Bruce Abbott's Avatar
 
Join Date: Mar 2018
Location: Hastings, New Zealand
Posts: 2,584
Quote:
Originally Posted by meynaf View Post
But if you're interested enough to want to not only see the code but also help, you can PM me.
PM sent.
Bruce Abbott is offline  
Old 26 March 2021, 09:00   #36
Bruce Abbott
Registered User
 
Bruce Abbott's Avatar
 
Join Date: Mar 2018
Location: Hastings, New Zealand
Posts: 2,584
To get an idea of how much overhead you have I isolated the 'vmloop' code, inserted the code of unconditional subroutines and removed code that was executed conditionally. Inlining the subroutines would save a few bsr/rts pairs, but there is still a lot of code that could possibly be bypassed completely by some instructions.

I suggest encoding the instructions into 'classes' using eg. the upper 4 bits like the 68000 does. Then do a partial decode on the class, jumping to code that is specific to that class of instruction. Further decoding would then be applied for miscellaneous instructions.

After 'executing' the instruction you can jump to the finishing code that it needs (eg. setcc, writeback), which then jumps back to the main loop. Small routines could be inlined to avoid unnecessary calls or jumps.

With that technique you should be able to significantly reduce the minimum instruction execution time. However real code often spends most of its time on more complex instructions, so it would be better to concentrate on speeding up the more 'popular' time consuming instructions rather than having the fastest possible NOP.
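The two-level decode described above can be sketched in C (the class split and return values are invented for illustration; they are not meynaf's actual encoding):

```c
#include <assert.h>
#include <stdint.h>

typedef int class_fn(uint16_t opcode);

/* class 0: fully decoded by the first level, like 68000 "move" */
static int class_move(uint16_t op) { return 100 + (op & 0x0FFF); }

/* class 1: needs a secondary decode on more bits */
static int class_alu(uint16_t op) {
    switch ((op >> 8) & 0x0F) {
    case 0:  return 200;  /* e.g. add */
    case 1:  return 201;  /* e.g. sub */
    default: return 299;  /* misc, decoded further in a real interpreter */
    }
}

static int class_misc(uint16_t op) { (void)op; return 300; }

/* first-level table indexed by the top 4 bits of the opcode */
static class_fn *classes[16] = {
    class_move, class_alu,
    class_misc, class_misc, class_misc, class_misc, class_misc, class_misc,
    class_misc, class_misc, class_misc, class_misc, class_misc, class_misc,
    class_misc, class_misc,
};

static int decode(uint16_t opcode) {
    return classes[opcode >> 12](opcode);  /* partial decode on the class bits */
}
```

Common classes pay only the first-level jump; only the miscellaneous ones take the extra decoding steps.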

Last edited by Bruce Abbott; 26 March 2021 at 10:53. Reason: code removed as meynaf wants to keep it private
Bruce Abbott is offline  
Old 26 March 2021, 09:50   #37
meynaf
son of 68k
 
meynaf's Avatar
 
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,323
Quote:
Originally Posted by Bruce Abbott View Post
To get an idea of how much overhead you have I isolated the 'vmloop' code, inserted the code of unconditional subroutines and removed code that was executed conditionally. Here is the result:-
(...)

Inlining the subroutines would save a few bsr/rts pairs, but there is still a lot of code that could possibly be bypassed completely by some instructions.

I suggest encoding the instructions into 'classes' using eg. the upper 4 bits like the 68000 does. Then do a partial decode on the class, jumping to code that is specific to that class of instruction. Further decoding would then be applied for miscellaneous instructions.

After 'executing' the instruction you can jump to the finishing code that it needs (eg. setcc, writeback), which then jumps back to the main loop. Small routines could be inlined to avoid unnecessary calls or jumps.

With that technique you should be able to significantly reduce the minimum instruction execution time. However real code often spends most of its time on more complex instructions, so it would be better to concentrate on speeding up the more 'popular' time consuming instructions rather than having the fastest possible NOP.
I told you, optimizing this code without changing the algorithm is futile.
Besides, please don't publish code here. When i said "PM me", it was because i prefer this to remain private.
meynaf is offline  
Old 26 March 2021, 11:19   #38
Bruce Abbott
Registered User
 
Bruce Abbott's Avatar
 
Join Date: Mar 2018
Location: Hastings, New Zealand
Posts: 2,584
Quote:
Originally Posted by meynaf View Post
please don't publish code here. When I said "PM me", it was that i prefer this to remain private.
Oh sorry, when you said "Release the code ? Maybe, but it's too big to be suitable for a contest" I didn't realize you meant you wanted to keep it private.

Quote:
I told you, optimizing this code without changing the algorithm is futile.
Not sure what you mean by this. But since I don't seem to have any ideas that you find useful I will bow out now.
Bruce Abbott is offline  
Old 26 March 2021, 12:22   #39
meynaf
son of 68k
 
meynaf's Avatar
 
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,323
Quote:
Originally Posted by Bruce Abbott View Post
Oh sorry, when you said "Release the code ? Maybe, but it's too big to be suitable for a contest" I didn't realize you meant you wanted to keep it private.
I was unclear maybe, i thought the "PM me" would have been enough. Sorry.
I would prefer that it remains private for now, but that's not set in stone.


Quote:
Originally Posted by Bruce Abbott View Post
Not sure what you mean by this. But since I don't seem to have any ideas that you find useful I will bow out now.
No harm done in trying, right ?
Of course there is code that must be inlined, and code that must be completely bypassed. For now individual instructions are only the alu part - the thing is that the code must be reworked so that everything is there, with as much predecode as possible without making the code size completely explode. And i think this requires a rewrite. And i will not rewrite everything without first having the basic structure, which is precisely the decoding part i wanted to optimize here.
meynaf is offline  
Old 27 March 2021, 03:48   #40
Don_Adan
Registered User
 
Join Date: Jan 2008
Location: Warsaw/Poland
Age: 55
Posts: 1,975
Quote:
Originally Posted by meynaf View Post
Even if there are FPU registers there too ?
If yes, that's good news
Now only thing to take care of, is to not call any OS routine when this feature is active - and perhaps find a way to not kill my debugger as well when it's there and i'm hunting an
What would be the problem with adding a special ID (f.e. $5754) in the high word of the D6 register, and later scanning the stack for that D6 word value?
Code:
 move.l SP,A0
findID
 cmp.w  #$5754,(A0)+
 bne.b  findID
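The marker-scan idea can be modelled in C (hypothetical: the frame layout is invented, and $5754 is just the example ID from above, not a documented value):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Tag D6 with a known 16-bit ID in its high word before the exception,
 * then locate D6 in the (undocumented) stack frame by scanning for the
 * ID. Returns the word index of the marker, or (size_t)-1. */
#define MARKER 0x5754u

static size_t find_marker(const uint16_t *stack, size_t nwords) {
    for (size_t i = 0; i < nwords; i++)  /* models cmp.w #$5754,(A0)+ / bne.b */
        if (stack[i] == MARKER)
            return i;
    return (size_t)-1;  /* not found: don't run off the stack */
}
```

Unlike the raw asm loop, this version bounds the scan, so a missing marker cannot loop forever; a false positive is still possible if $5754 happens to occur in other frame data.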
Don_Adan is offline  
 


