26 March 2009, 14:29 | #101 |
move.l #$c0ff33,throat
Join Date: Dec 2005
Location: Berlin/Joymoney
Posts: 6,863
|
Quote:
I don't need to defend it. I just use it, and I don't judge other assemblers that I never really used. You just think it's crap anyway; no matter what I say, you won't change your opinion. Says the one who doesn't even know how to set up Asm1. No further comments needed. And as I know what your next argument will be, here's a line from my 3d-engine source:
Code:
PHXASS = 0 ; set to 1 for phx-ass support
Last edited by StingRay; 26 March 2009 at 18:46. Reason: typo/grammar
17 April 2009, 11:24 | #102 |
son of 68k
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,350
|
Quote:
And I can write bxx.b everywhere, but not in macros, which can end up needing byte, word or even long branches depending on where they are used!
Quote:
Of course you can optimize code yourself, but some cases (code in macros) can't be handled like that. There is also the problem of code that is reused and made to be included: sometimes the branch will be in range, sometimes it will not.
Quote:
Then try this: meynaf.free.fr/tmp/v.lzx
The source to assemble is v.s. Good luck. And please check "all errors", just for a laugh. When I did, it crashed. And tell me why some perfectly valid 020+ constructions like this one:
Code:
move.b ($1000.w,a0,d0.l),d0
Quote:
But now you have to tell me why the debugger simply freezes my machine with a grey screen. Perhaps I didn't spend much time with it, but its learning curve seems to be a little too slow to rise...
Quote:
If I could do asm-one support, I would have done it. You think I didn't do it because I didn't want to, but it's because I couldn't.
Quote:
But when I was talking to Thorham, I didn't really bash it. Then you came and said I was running a hate campaign. You said phxass was horribly broken. I just had to defend it, man. Now we can stop all that junk talk, and you can tell me why my code won't assemble in asm-one, so I can use it. That would be much more useful than telling me I'm wrong to bash it.
02 May 2009, 11:44 | #103 |
Computer Nerd
Join Date: Sep 2007
Location: Rotterdam/Netherlands
Age: 47
Posts: 3,806
|
Off-topic:
After looking at myself and realizing I had a big drinking problem, I decided to do something about it and gave up the stuff completely. I had come to a point where I didn't finish projects, and didn't finish code I said I would write for people, all because my drinking was consuming my energy. Of course, while drinking, I had plenty of energy, but just try to code when you're in between sober and drunk: it's bloody hard, and it takes an hour to do 15 minutes' worth of work properly. This had to stop.
It meant I started to lie about work I should have done already. People will probably remember the PNG codec and the scaling loops I promised meynaf I'd write. Although I started them, I, of course, didn't finish them and had little to show for it (PMed meynaf about that). The same goes for another project I'm not naming here. I'll write a piece in the appropriate thread, but it comes down to the same problem.
It wasn't easy to kick the habit, and from what I've heard, I should feel lucky, because I wasn't physically addicted to alcohol. It's over and done now. However, I'm not going to get into any more projects unless I can finish them in a reasonable amount of time. It's all too easy to let one project after another pile up, and this is another bad habit I'm breaking.
Glad to be back, and a big sorry to meynaf for putting up with my nonsense. It won't happen again, mate!
02 May 2009, 16:05 | #104 |
son of 68k
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,350
|
If it won't happen again, then let's code!
I'm afraid it's a little bit too late for the PNG code. But I have other projects I need help with, and you probably have other projects of your own. I've started working again on my optimized version of mpega.library and got some results. I can't say I've understood everything in that bunch of muls, but some parts are understood well enough to be rewritten, and indeed have been. Any specialist in muls removal is welcome.
The current part is the Inverse Modified Discrete Cosine Transform for long blocks (imdct_l). The current routine suffers from register shortage (I would need 19 of them). Perhaps I can post it if someone is interested?
07 May 2009, 15:54 | #105 |
Computer Nerd
Join Date: Sep 2007
Location: Rotterdam/Netherlands
Age: 47
Posts: 3,806
|
Okay, meynaf, post it! A link to the library source with includes (if any) would be nice. It can never hurt to take a look, but I can't promise anything.
Do you still feel like continuing the discussion in this thread? If so, I can continue right where we left off.
09 May 2009, 09:52 | #106 |
son of 68k
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,350
|
Well, I think the best place to talk about this project is the old mpega thread I started. Or maybe a new one?
With a thread on that subject, more interested people will read it. I'll post the complete code if you need to run some tests. For now you can have a look at the current dct code: http://meynaf.free.fr/tmp/mdct_l.s
Or at the subband filter: http://meynaf.free.fr/tmp/subb_f.s
I have extracted them from the main bulk to make testing easier. I think that's enough to keep you busy for a while.
12 May 2009, 11:33 | #107 |
Computer Nerd
Join Date: Sep 2007
Location: Rotterdam/Netherlands
Age: 47
Posts: 3,806
|
After my initial look at those muls instructions (things like muls #$187e,dx), I thought: why not just make a, wait for it, table... I know you probably won't like 256kb tables, but before you dismiss them, read further.
Advantages:
- Tables offer better optimization for large constants.
- Tables use fewer instructions in tight loops, which potentially helps the cache.
- If the program can be made resident because it's pure, the tables only have to be set up or read once, if I'm not mistaken.
- One to two megabytes of overhead doesn't hurt configurations like yours or mine. Since the program has little to no use on bare machines, this overhead doesn't have to be a problem.
Disadvantage:
- You don't like large tables
Now for the muls expert thing. I'm not an expert, but it's almost the same as for mulu. All you need to do is ext.l the register and apply the normal shifts, adds and subs. If the constant is negative, just neg.l the end result. I know that this isn't very helpful, but unless muls is a lot slower than mulu, I'd just go for a couple of tables.
By the way, you forgot to answer about continuing this thread in the way it was going.
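For readers who want to see the "ext.l, shifts and adds, then neg.l for a negative constant" recipe checked, here is a rough C model. It is only a sketch and not code from the thread: the constant 10 and all function names are made up for the example, the 32-bit widening stands in for ext.l, and the final negation stands in for neg.l.

```c
#include <assert.h>
#include <stdint.h>

/* x * 10 built from shifts and adds only: 10 = 8 + 2.
   The constant 10 is a hypothetical example, not one from mpega. */
int32_t mul_by_10(int16_t x)
{
    int32_t v  = (int32_t)x; /* ext.l: widen the 16-bit input first */
    int32_t v2 = v * 2;      /* add.l dn,dn */
    int32_t v8 = v2 * 4;     /* lsl.l #2,dn (written as * 4 so that
                                negative inputs stay well defined in C) */
    return v8 + v2;
}

/* Negative constant: same sequence, then negate the result (neg.l). */
int32_t mul_by_minus_10(int16_t x)
{
    return -mul_by_10(x);
}

/* Compares both routines with a real multiply for every 16-bit input.
   Returns 1 if they always agree, 0 otherwise. */
int shift_add_selftest(void)
{
    for (int32_t i = -32768; i <= 32767; i++) {
        int16_t x = (int16_t)i;
        if (mul_by_10(x) != 10 * i) return 0;
        if (mul_by_minus_10(x) != -10 * i) return 0;
    }
    return 1;
}
```

Calling shift_add_selftest() walks all 65536 inputs, which is cheap enough to repeat for any other decomposition before hand-translating it to 68k.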
14 May 2009, 09:11 | #108 |
son of 68k
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,350
|
Quote:
Perhaps you can guess that I thought about tables quite a long time ago, too.
Quote:
Most constants can be done in 18 to 22 cycles, so the difference wouldn't be significant, especially given the register shortage and the load/save of the table pointer. Large tables would be ok if they gave stunning results, say, 50% faster. But alas, over the whole process, each 256kb block of constants will probably gain less than 1% in speed.
Quote:
So when I play, say, a 16MB MP3 and do something else in the background (yes I can), memory would go just too low (I think). Besides, computing such a big table could cost a lot of time at startup (to be checked).
This is true.
Quote:
Did you know that:
movea.w d0,a0
is actually faster than:
ext.l d0
?
Quote:
But it wouldn't be a couple of tables, but ten of them, if not many more.
Quote:
Another thing about muls is that the program constantly multiplies the same input by different constants, and shift-and-add sequences can share common parts (intermediate results). Look at the following code, and tell me how many cycles 4 tables would gain, especially when there is no free register at all:
Code:
move.w xx(a0),a4 ; x, sign-extended
move.l a4,d3
add.l d3,d3 ; d3 = 2x
add.l d3,a4 ; a4 = 3x
move.l a4,d4
add.l d3,a4 ; a4 = 5x
lsl.l #3,d4 ; d4 = 24x
move.l d4,d5
move.l d4,d6
lsl.l #2,d4 ; d4 = 96x
add.l d4,d5 ; d5 = 120x
add.l d3,d4 ; d4 = 98x
lsl.l #6,d4 ; d4 = 6272x
sub.l d3,d4 ; 187e
lsl.l #8,d3 ; d3 = 512x
sub.l d3,d5
add.l d3,d6 ; d6 = 536x
add.l d6,d6 ; d6 = 1072x
lsl.l #3,d3 ; d3 = 4096x
add.l d3,d5
add.l d4,d5 ; 26f6
add.l d5,d3
sub.l d6,d3 ; 32c6
add.l d6,d6
sub.l a4,d6 ; 85b
I tried to put them in a loop and it didn't prove to be faster. If you really think tables are the way to go, I encourage you to try them anyway. I'd give you the whole code so you can assemble and test. The best way to test that I know of is to play something with settings your machine can't handle: it may hurt your ears, but the final playtime will give you precious indications. Of course it's possible to open the lib and measure decoding time, but I've got no program to do that (btw, you could help a lot by writing one).
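The shared-intermediate sequence above can be checked on any machine with a step-by-step C model. This is only a sketch: the struct and function names are invented here, each C variable mirrors the 68k register of the same name, and shifts are written as multiplications by powers of two so that negative inputs stay well defined in C.

```c
#include <assert.h>
#include <stdint.h>

/* One input multiplied by the four constants of the listing. */
struct products { int32_t m187e, m26f6, m32c6, m085b; };

struct products mul4_shared(int16_t x)
{
    int32_t a4 = x;   /* move.w xx(a0),a4 sign-extends the word */
    int32_t d3 = a4;  /* move.l a4,d3 */
    int32_t d4, d5, d6;

    d3 += d3;         /* 2x */
    a4 += d3;         /* 3x */
    d4 = a4;
    a4 += d3;         /* 5x */
    d4 *= 8;          /* lsl.l #3: 24x */
    d5 = d4;
    d6 = d4;
    d4 *= 4;          /* lsl.l #2: 96x */
    d5 += d4;         /* 120x */
    d4 += d3;         /* 98x */
    d4 *= 64;         /* lsl.l #6: 6272x */
    d4 -= d3;         /* 6270x = $187e * x */
    d3 *= 256;        /* lsl.l #8: 512x */
    d5 -= d3;         /* -392x */
    d6 += d3;         /* 536x */
    d6 += d6;         /* 1072x */
    d3 *= 8;          /* lsl.l #3: 4096x */
    d5 += d3;         /* 3704x */
    d5 += d4;         /* 9974x = $26f6 * x */
    d3 += d5;         /* 14070x */
    d3 -= d6;         /* 12998x = $32c6 * x */
    d6 += d6;         /* 2144x */
    d6 -= a4;         /* 2139x = $85b * x */

    struct products p = { d4, d5, d3, d6 };
    return p;
}

/* Returns 1 if all four results match a real multiply for every
   16-bit input, 0 otherwise. */
int mul4_selftest(void)
{
    for (int32_t i = -32768; i <= 32767; i++) {
        struct products p = mul4_shared((int16_t)i);
        if (p.m187e != i * 0x187e || p.m26f6 != i * 0x26f6 ||
            p.m32c6 != i * 0x32c6 || p.m085b != i * 0x085b)
            return 0;
    }
    return 1;
}
```

The largest intermediate is 14070x, which stays well inside 32 bits for a 16-bit x, so the model never overflows.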
14 May 2009, 11:09 | #109 |
Computer Nerd
Join Date: Sep 2007
Location: Rotterdam/Netherlands
Age: 47
Posts: 3,806
|
Quote:
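Unsigned version:
Code:
move.l #table,a0
move.l #$187e,d0 ; constant to multiply by
moveq #0,d1 ; running product
moveq #-1,d2 ; low word is $ffff, so dbra runs 65536 times
.loop
move.l d1,(a0)+
add.l d0,d1
dbra d2,.loop
Signed version:
Code:
move.l #table,a0 ; fill upwards from entry 0
move.l #table+2^18,a1 ; fill downwards from the end of the table
move.l #$187e,d0
moveq #0,d1 ; positive products
moveq #0,d2 ; negative products
move.l #2^15-1,d3 ; 32768 iterations
.loop
move.l d1,(a0)+
add.l d0,d1
sub.l d0,d2
move.l d2,-(a1)
dbra d3,.loop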
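As a cross-check of the two table builders, here is a hedged C sketch of the same loops. The function names are invented for the example. It assumes the finished table is indexed with the raw 16-bit pattern of the input (so negative values land in the upper half, as the signed loop above fills it), and that the constant is small enough for 32768 times the constant to fit in 32 bits.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define ENTRIES 65536 /* one longword per 16-bit value = 256 KB */

/* Unsigned variant: entry i holds i * c for i = 0..65535. */
uint32_t *build_unsigned_table(uint32_t c)
{
    uint32_t *t = malloc(ENTRIES * sizeof *t);
    if (t == NULL) return NULL;
    uint32_t product = 0;
    for (int i = 0; i < ENTRIES; i++) {
        t[i] = product;  /* move.l d1,(a0)+ */
        product += c;    /* add.l d0,d1 */
    }
    return t;
}

/* Signed variant: filled from both ends, 32768 iterations. */
int32_t *build_signed_table(int32_t c)
{
    int32_t *t = malloc(ENTRIES * sizeof *t);
    if (t == NULL) return NULL;
    int32_t up = 0, down = 0;
    int32_t *lo = t, *hi = t + ENTRIES;
    for (int i = 0; i < ENTRIES / 2; i++) {
        *lo++ = up;      /* move.l d1,(a0)+ */
        up += c;         /* add.l d0,d1 */
        down -= c;       /* sub.l d0,d2 */
        *--hi = down;    /* move.l d2,-(a1) */
    }
    return t;
}

/* Lookup by the raw 16-bit pattern of x. */
int32_t table_mul(const int32_t *t, int16_t x)
{
    return t[(uint16_t)x];
}

/* Returns 1 if both tables agree with a real multiply everywhere. */
int table_selftest(void)
{
    const int32_t c = 0x187e;
    uint32_t *u = build_unsigned_table((uint32_t)c);
    int32_t *s = build_signed_table(c);
    int ok = (u != NULL && s != NULL);
    for (int32_t i = -32768; ok && i <= 32767; i++) {
        uint16_t idx = (uint16_t)(int16_t)i;
        if (table_mul(s, (int16_t)i) != i * c) ok = 0;
        if (u[idx] != (uint32_t)idx * (uint32_t)c) ok = 0;
    }
    free(u);
    free(s);
    return ok;
}
```

Each table costs 65536 longwords, which is where the 256kb per constant figure in the discussion comes from.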
Quote:
That's what I thought, but I was unsure.
Quote:
I don't need it, either, but I did enjoy it! If you did as well, then I'll continue.
Quote:
Edited: Well, so much for the simple and dirty method. I thought it would be enough to just insert a custom vbl interrupt, but the level 3 vector is changed back to the system default automatically. Bah, I thought this would be a lot easier than it may actually be.
Last edited by Thorham; 14 May 2009 at 19:19.
14 May 2009, 15:04 | #110 |
move.l #$c0ff33,throat
Join Date: Dec 2005
Location: Berlin/Joymoney
Posts: 6,863
|
Quote:
If you're happy with PHX-Ass (and I think you are), then by all means use it! I mean, I am using quite an old version of AsmPro, and most probably I'll never update it, because I've even learned to make use of some of its bugs (no joke!). However, if you need help using Asm1, I'll happily help you.
Last edited by StingRay; 14 May 2009 at 15:10.
14 May 2009, 15:37 | #111 |
Join Date: Jul 2008
Location: Sweden
Posts: 2,269
|
Just want to point out that indirect with index and base displacement is doable in ASM-one, you just have to put the base displacement last:
move.b (a0, d0.l, $1000.w), d0
14 May 2009, 15:42 | #112 |
move.l #$c0ff33,throat
Join Date: Dec 2005
Location: Berlin/Joymoney
Posts: 6,863
|
I know, just didn't want to mention it as it doesn't work in all Asm1 versions. (AsmPro doesn't support it at all)
Last edited by StingRay; 14 May 2009 at 16:01.
14 May 2009, 17:37 | #113 |
Computer Nerd
Join Date: Sep 2007
Location: Rotterdam/Netherlands
Age: 47
Posts: 3,806
|
Edited: Sorry for the mistake, but this code only works if it can't be preempted, or if you have a spare address register to swap with a7.
A little update on the table thing. I think this is as good as it's going to get:
Code:
move.w xx(a0),a4 ; x, sign-extended
add.l a4,a4
add.l a4,a4 ; a4 = 4x, a longword offset
add.l (sp)+,a4 ; add the first value popped from the stack
move.l (a4),d3
add.l (sp)+,a4
move.l (a4),d4
add.l (sp)+,a4
move.l (a4),d5
add.l (sp)+,a4
move.l (a4),d6
subq.l #8,sp ; rewind sp by 16 bytes in total
subq.l #8,sp
Edited: I checked the code with my speed testing program, and my table reading idea is only, count 'em, two whole frames faster for a million loop iterations: yours takes 67 frames, mine takes 65. In other words: do not bother, coz it sucks. Well, it sucks in this case. It might be useful in cases where you need every bit of speed you can get, such as in demos. What's also bad is that my table reading idea needs the stack to be set up properly. Probably no significant overhead, but it's just not handy, especially given the small speed increase.
Last edited by Thorham; 15 May 2009 at 11:16.
15 May 2009, 12:03 | #114 |
Computer Nerd
Join Date: Sep 2007
Location: Rotterdam/Netherlands
Age: 47
Posts: 3,806
|
Ok, I've looked at the file mdct_l.s a little closer, and there are several constants, namely 85b, 187e, 26f6 and 32c6, which are used over and over again. Furthermore, in those code blocks (up to and including ; u7) registers a1, a2 and a3 are free. It seems to me it would make at least some sense to use simple tables for three of those constants, including 187e, which is used more often than the others. You can then just optimize the simplest constant in the usual way.
Of course, if you plan to use the optimization you've shown me in all those code blocks, then it might be pointless. Now I don't want to sound like I'm insisting on tables, but it might be possible to get a reasonable gain without making tables for all the constants. It just seems to me that there may be a balance between the usual optimizations and tables.
Edited: Trivial, but here's an example:
Code:
move.w d4,d3
muls #$32c6,d3
Code:
move.l (a1,d4.w*4),d3
Last edited by Thorham; 15 May 2009 at 12:18.
16 May 2009, 15:26 | #115 |
son of 68k
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,350
|
Quote:
Or isn't it one of those debuggers which need an external (parallel or serial) device to display things?
Quote:
I don't see any use for this command-line stuff, and it constantly gets in the way between editor and assembler.
Quote:
Code:
ifd _PHXASS_
PHXASS = 1
else
PHXASS = 0
endc
Quote:
So, how good is asmpro? As for help with asm1, I think I pointed you to some example code... It would be good to make it asm1 compatible.
16 May 2009, 16:02 | #116 |
son of 68k
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,350
|
Quote:
Yes.
Quote:
Ok, for several megabytes it can be a quarter of a second or so, which is still acceptable. But the 187e constant can be done in 18 cycles. That's not very different from a table lookup, especially when there is no free register to hold the pointer.
I'm using it massively. If you compare with the earlier mpega code, you'll see that many negs were there and no longer are!
Quote:
However, for massive sign-extension I prefer movem.w.
It's divu and divs which don't.
Quote:
Not only mine. My friend Don Adan (Wanted Team) has to be credited for some parts. The 12-muls blocks are his. Incredible code, that is. And my 4-muls block is just an improvement on his.
Quote:
I can't, because I don't know exactly how to use the library (this may sound strange, as I'm recoding parts of it).
Quote:
But the problem isn't measuring time. It is actually calling the library to do the job without using DT's player.
Quote:
. save old a7 in memory = memory write
. move 4000 to DFF09A to kill pre-emption = equivalent to a chipmem write!
. read a7 from some memory area = memory read
And afterwards:
. restore A7 = memory write
. move C000 to DFF09A to restore the system = equivalent to a chipmem write!
Better to push-n-pop a reg!
Quote:
Besides, movem being slower is not true for the 040, and probably not for the 060 either. Furthermore, I can't measure such a small difference. Removing 30 cycles here is equivalent to roughly one second of decoding time in an 8-minute layer 3 stream...
18 May 2009, 08:00 | #117 |
Computer Nerd
Join Date: Sep 2007
Location: Rotterdam/Netherlands
Age: 47
Posts: 3,806
|
Mostly only tables and quality reductions
Quote:
Right, I keep forgetting.
Quote:
Yup, that's where it's going, isn't it?
Quote:
If you already have code, then it's still easy to do. All you really need is a program that sets up the irq handler with a counter routine and outputs the location of that counter. Next you need a little program that resets the counter and a program that reads the counter. The final ingredient is a simple script that calls the counter reset program, then the program you want to benchmark, and finally the counter reading program. Not a very elegant way, but surely good enough for most purposes, and since you already have system-friendly interrupt code, it should be very easy to write.
Quote:
Yes, I've read that more than once. How often is this code actually executed? They're not used in the blocks I was talking about; I checked with a search function. Because they're not used in a relatively large chunk of code, I thought they might be used to hold table pointers.
Quote:
Doesn't that mean you could cut this code in half and still only save a few extra seconds on said eight minutes? It seems to me that it's more useful to just concentrate on the tight loops.
18 May 2009, 17:17 | #118 |
son of 68k
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,350
|
Yeah, perhaps also things that make the exe size grow by 60% to gain 2% speed (that one isn't aimed at you).
Btw, can you believe my ham8 rendering routine can be optimised further? In fact it's possible to gain 5 clocks on average (but I'll let you have another look at it before I tell).
Quote:
Second, I think a 100+ mb mpeg isn't a very clever thing to do...
Third, some people have 128mb of mem to waste.
Quote:
Now I've uploaded the whole "development suite": all sources, the latest version of the library, and DT2 with the player: meynaf.free.fr/tmp/mpega.lzx
Now you can't say you can't test.
If it's shorter & faster, it's always worth it! Those things are easily forgotten. Anyway, the 68030 would have been much better IMO if mul were hardwired. Note that the same kind of mulu vs muls thingy happened to me with lsr vs asr. I thought lsr was faster, but the docs and tests showed they have the same timing.
Quote:
It is.
Quote:
Moreover, you can only concentrate on tight loops when there actually are some, and there are very few!
Quote:
Besides, a command line utility which uses the lib to decode is usable only if it can decode partially (I don't want to decode to disk, and the decoded data would be huge in ram:), and if it supports all the different quality settings. Also, it's better to have the benchmark stuff built right in.
Quote:
On the other hand, as I've given you the whole thing with everything needed to use it, you can just see for yourself (there are even debug macros to write to dff180 at the start of the main source, just to see).
Quote:
So using them in the blocks would imply restoring them afterwards.
Quote:
Of course, for now it's 030, but all the 030-only modifications I make must be in if/endc pairs. Say, replacing muls by add/lsl isn't a winner at all on the 060 (for the 040 I frankly don't know!).
Quote:
However, if you like tight loops to optimize, then you can have a look at the biggest contributor to cpu use in the whole library: the subband window part! (now located in subb_w.s)
EDIT: I've looked on Aminet but didn't find an mpeg decoder which uses mpega.library to decode to a file...
Last edited by meynaf; 18 May 2009 at 17:27.
20 May 2009, 16:02 | #119 |
Computer Nerd
Join Date: Sep 2007
Location: Rotterdam/Netherlands
Age: 47
Posts: 3,806
|
Quote:
Code:
.vbrb
btst #0,d7 ; 0,2 -> b
beq.s .bl
btst #1,d7 ; 1 -> r
beq.s .ro
bra.s .ve ; 3 -> v (3210->vbrb)
Code:
.vbrb
btst #0,d7 ; 0,2 -> b
beq.s .bl
btst #1,d7 ; 1 -> r
beq.s .ro
.vbrb_ve
move.l d2,a3
addq.b #3,d2
move.b d2,(a1)+
dbf d7,.loop
rts
Quote:
Wish I was one of them.
You told me the code isn't in a tight loop, which I interpreted as not being called often, and it doesn't contain any loops itself. In such a case it wouldn't matter much, now would it?
Quote:
Indeed!
Quote:
Certainly. How odd...
Quote:
Maybe I could encode 2mb worth of wav to mp3, so you can decode to ram. That way you could use an easy-to-write benchmark program.
Alright, I should have it on a cd around here somewhere. As far as I know, I just got it from aminet, but I should still have it. I'll find it.
Quote:
It's only so to speak.
Quote:
As said, I think I have it here somewhere. I'll upload it when I find it.
Last edited by Thorham; 20 May 2009 at 16:08.
21 May 2009, 17:45 | #120 |
son of 68k
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,350
|
Quote:
I'm telling you the first one now and, same as you, I can't believe I didn't spot it sooner! Look at the move to a6. What's in the array? Addresses. Which point to what? 4 bytes. Hey man, why not copy the data there directly and use lea? Even if the data is in cache (as it often is in this case), that's 3 cycles gained!
Quote:
What is the bitrate? It would be pretty pointless on a 030 IMO, especially with this slow IDE controller we have. Frankly, if I had the choice I would go for a 3GHz 68080.
Quote:
asr keeps the sign where lsr inserts only zeroes, e.g. lsr.b #1 on $80 gives $40, but asr would give $C0. You can see this as asr being signed (half of -128 is -64, not 64). The difference between lsl and asl is much more subtle (but there is one, and you don't need it).
Probably. But when I discovered that 6 muls by constants a, b, c, d, e, f could be done with only 4 because e=a+b and f=c+d, all that repeated 8 times, that made 16 muls removed.
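Both points in the post above are easy to check with a small C model. This is a hedged sketch with invented function names: the byte shifts are written out explicitly instead of relying on how C shifts signed values, and the constants in the second routine are placeholders, not the real mpega coefficients.

```c
#include <assert.h>
#include <stdint.h>

/* lsr.b #1: logical shift, a zero enters at the top. */
uint8_t lsr_b1(uint8_t v)
{
    return (uint8_t)(v >> 1);
}

/* asr.b #1: arithmetic shift, the sign bit is kept. */
uint8_t asr_b1(uint8_t v)
{
    return (uint8_t)((v >> 1) | (v & 0x80));
}

/* Six products from four multiplies when e = a + b and f = c + d:
   x*e = x*a + x*b and x*f = x*c + x*d, so two muls become adds.
   Assumes the products fit in 32 bits. */
void six_from_four(int32_t x, int32_t a, int32_t b, int32_t c,
                   int32_t d, int32_t out[6])
{
    out[0] = x * a;
    out[1] = x * b;
    out[2] = x * c;
    out[3] = x * d;
    out[4] = out[0] + out[1]; /* x * (a + b) */
    out[5] = out[2] + out[3]; /* x * (c + d) */
}
```

Saving 2 multiplies per block over 8 blocks gives the 16 removed muls mentioned in the post.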
Quote:
About extra blank lines: okay, but please give me examples of where.
I hope
Quote:
(and remember to flush libs between attempts; you'll also need to eject and reload the module before new settings can take effect)
Quote:
Perhaps it looks odd. But my usual includes (you know them now) allow me to do that in two lines of code, without worrying about removing it at the end or saving regs in the int itself.
Quote:
However, if you feel like encoding something, a few mp1/mp2 files could be of use.
Quote:
Seen anything now?
Quote:
What's not clear is which parts are best for non-030!
What? Don't tell me you can't reduce that by half.
Quote:
In fact it is the only routine which changes at all when this setting is different!
Quote:
Now we need some optims to check.