Misc programming discussion (from pms). - Page 6

StingRay · 26 March 2009, 14:29

Quote:

Originally Posted by meynaf

You will never care manually of branch optimizations in a 40000-line source which evolves all the time.

If you have 1 source file with 40k lines it's your own fault anyway. And Asm1 does optimize forward branches too. Also, if you care about the size, just write bxx.b all the time and Asm1 will convert to bxx.w where necessary!

Quote:

Originally Posted by meynaf

But phxass gives better code than asm1. I resourced enough asm1 programs to know that.

Yeah, because it was the assembler that produced the code, not the coder. *cough* And once I used PHX-Ass to assemble a 4k of mine it indeed gave much better code, the intro was 10 or 12 bytes shorter and didn't run at all anymore. I wonder why I prefer to optimize code myself instead of letting the assembler do that work...

Quote:

Originally Posted by meynaf

Code:

 btst #8,(a0)

Whereas phxass says Bit manipulation out of range.

So who's seriously broken ?

Did I ever say it's perfect? Never did, never will, because it isn't!

Quote:

Originally Posted by meynaf

No, I never really used it. And you know why ? It's simply because I couldn't ! No source I tried accepts to assemble.

So far I could assemble everything I found.

Quote:

Originally Posted by meynaf

When I type a number in the editor (with numpad) it moves the cursor (how to disable that numpad stupidity ? not found !).

This shows how much time you spent with asm1. Preferences->Assembler->Numlock. Takes about 10 seconds to figure out.

Quote:

Originally Posted by meynaf

But please come and defend this software here. C'mon.

I don't need to defend it, I just use it and don't judge other assemblers that I never really used. You just think it's crap anyway, no matter what I'll say, you won't change your opinion.

Quote:

Originally Posted by meynaf

Asm-one will never be able to do so.

Says the one who doesn't even know how to setup Asm1. No further comments needed. And as I know what your next argument will be, here's a line from my 3d-engine source:

Code:

PHXASS        = 0            ; set to 1 for phx-ass support

Contrary to you I don't bash assemblers I don't like without any reason, after all tastes differ and as some coders I work with use and like phx-ass I'll just support it.

meynaf · 17 April 2009, 11:24

Quote:

Originally Posted by StingRay

If you have 1 source file with 40k lines it's your own fault anyway. And Asm1 does optimize forward branches too. Also, if you care about the size, just write bxx.b all the time and Asm1 will convert to bxx.w where necessary!

It's perhaps my fault, but this is re-sourced code, and searching where something gets used and renaming labels by search-replace would be a pain if the code was in several files.

And I can write bxx.b everywhere, but not in macros which can lead indifferently to byte, word and even long calls !

Quote:

Originally Posted by StingRay

Yeah, because it was the assembler that produced the code, not the coder. *cough* And once I used PHX-Ass to assemble a 4k of mine it indeed gave much better code, the intro was 10 or 12 bytes shorter and didn't run at all anymore. I wonder why I prefer to optimize code myself instead of letting the assembler do that work...

If it didn't run anymore, there has to be a reason... If you agree to give me that code, I will tell you what happens and fix it. You will see that the code is still shorter but works ;-)

Of course you can optimize code yourself, but some cases (code in macros) can't be handled like that.
There is also the problem of code that is re-used and made to be included. Sometimes the branch will be in range, sometimes it will not.
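
To make the range issue concrete, here is a minimal Python sketch of how a branch size could be chosen once the displacement is known. The function name `bcc_size` and its parameters are made up for illustration; this is not how Asm-One or PhxAss implement it, only the encoding rule of the 68k Bcc instruction written out.

```python
def bcc_size(displacement, cpu_020_plus=False):
    """Return the smallest Bcc size suffix ('b', 'w' or 'l') for a displacement.

    The displacement is counted from the address of the branch opcode + 2,
    as on the 68000 family.
    """
    # In the 8-bit field, $00 means "a 16-bit displacement follows" and, on
    # the 68020 and later, $FF means "a 32-bit displacement follows", so
    # those values cannot be used as short displacements.
    reserved = {0, -1} if cpu_020_plus else {0}
    if -128 <= displacement <= 127 and displacement not in reserved:
        return "b"
    if -32768 <= displacement <= 32767:
        return "w"
    if cpu_020_plus:
        return "l"
    raise ValueError("branch target out of range for a 68000 Bcc")
```

A macro expanded in different places, or an include reused by different programs, gets a different displacement each time, which is why the size cannot be fixed by hand once and for all in those cases.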

Quote:

Originally Posted by StingRay

Did I ever say it's perfect? Never did, never will, because it isn't!

And I never said phxass was perfect either, you know. But I can use it for what I need, and I can't say that for asm-one.

Quote:

Originally Posted by StingRay

So far I could assemble everything I found.

Then try this :
meynaf.free.fr/tmp/v.lzx
Source to assemble is v.s. Good luck. And please check "all errors", just for a laugh. When I did it it crashed.

And tell me why some perfectly valid 020+ constructions like this one :

Code:

 move.b ($1000.w,a0,d0.l),d0

... get rejected even in 030 mode. And don't tell me I shouldn't write this because it is inefficient, this is re-sourced code.

Quote:

Originally Posted by StingRay

This shows how much time you spent with asm1. Preferences->Assembler->Numlock. Takes about 10 seconds to figure out.

Yes, in "assembler" preferences, right in the middle of things like cpu type which have nothing to do with that. No "editor" preferences, ok. Well placed, easy to guess. Not to mention the option is more or less useless and the default setting a little bit on the stupid side.

But now you have to tell me why the debugger simply freezes my machine with a grey screen.

I perhaps didn't spend much time with it, but its learning curve seems to be a little bit too slow to raise...

Quote:

Originally Posted by StingRay

I don't need to defend it, I just use it and don't judge other assemblers that I never really used. You just think it's crap anyway, no matter what I'll say, you won't change your opinion.

Ok, I'll say it doesn't fit my needs instead of saying it's crap, just to please you, but it doesn't change the fact that I simply can't use it.

Quote:

Originally Posted by StingRay

Says the one who doesn't even know how to setup Asm1. No further comments needed. And as I know what your next argument will be, here's a line from my 3d-engine source:

Code:

PHXASS        = 0            ; set to 1 for phx-ass support

Theoretically you don't need this. It's easy to detect phxass with an if/endc pair.

If I could do asm-one support, I would have done it. You think I didn't do it because I didn't want, but it's because I couldn't.

Quote:

Originally Posted by StingRay

Contrary to you I don't bash assemblers I don't like without any reason, after all tastes differ and as some coders I work with use and like phx-ass I'll just support it.

It is useless for me and that's all.

But when I was talking to Thorham, I didn't really bash it. Then you came and said I did a hate campaign. You said phxass was horribly broken. I just had to defend, man

Now we can stop all that junk talk and you will tell me why my code won't assemble in asm-one, so I can use it. That would be much more useful than telling me I'm wrong in bashing it.

Thorham · 02 May 2009, 11:44

Off-topic:

After looking at myself and realizing I had a big drinking problem, I decided to do something about it, and gave up the stuff completely. I had come to a point where I didn't finish projects and didn't finish code I said I would write for people all because my drinking was consuming my energy. Of course, while drinking, I had plenty of energy, but just try to code when you're in between sober and drunk, bloody hard, and it takes an hour to do 15 minutes worth of work properly.

This had to stop. It means I started to lie about work I should have done already. People will probably remember the PNG codec and the scaling loops I promised meynaf to write. Although I started them, I, of course, didn't finish them and had little to show for (pmed meynaf about that). The same goes for another project I'm not naming here. I'll write a piece in the appropriate thread, but it comes down to the same problem.

It wasn't easy to kick the habit, and from what I've heard, I should feel lucky, because I wasn't physically addicted to alcohol. It's over and done now. However, I'm not going to get into any more projects unless I can finish them in a reasonable amount of time. It's all too easy to have one project after the other pile up on top of each other, and this is another bad habit I'm breaking.

Glad to be back, and a big sorry to meynaf for putting up with my nonsense, it won't happen again mate!

meynaf · 02 May 2009, 16:05

If it won't happen again, then let's code !

I'm afraid it's a little bit too late for the PNG code

But I have other projects I need help for, and you probably have other projects on your own.

I've started to work again on my optimized version of mpega.library and got some result. I can't say I've understood everything in that bunch of muls, but some parts are known enough to be rewritten - and effectively were.
Any specialist of muls removal is welcome

The current part is the Inverse Modified Discrete Cosine Transform for Long blocks (imdct_l). The actual routine suffers from register shortage (I would need 19 of them). Perhaps I can post it if someone is interested ?

Thorham · 07 May 2009, 15:54

Okay, meynaf, post it! A link to the library source with includes (if any) would be nice. It can never hurt to take a look, but I can't promise anything.

Do you still feel like continuing the discussion in this thread? If so, I can continue right where we left off.

meynaf · 09 May 2009, 09:52

Well, I think that the best place to talk about this project is the old mpega thread I did. Or maybe a new one ?
A thread dedicated to that subject would also reach more interested people.

I'll post the complete code if you need to make some tests. For now you can have a look at the actual dct code :
http://meynaf.free.fr/tmp/mdct_l.s
Or at the subband filter :
http://meynaf.free.fr/tmp/subb_f.s

I have extracted them out of the main bulk to ease tests on them.
I think it's enough to keep you busy for a while

Thorham · 12 May 2009, 11:33

After my initial look at those muls instructions (things like muls #$187e,dx), I thought: Why not just make a, wait for it, table... I know you probably won't like 256kb tables, but before you dismiss them, read further.

Advantages:

- Tables offer better optimization for large constants.
- Tables use up less instructions in tight loops, helps the cache potentially.
- If the program can be made resident because it's pure, the tables only have to be setup or read once, if I'm not mistaken.
- one to two megabytes in overhead doesn't hurt configurations like yours or mine. Since the program has little to no use on bare machines, this overhead doesn't have to be a problem.

Disadvantage:

- You don't like large tables

Now for the muls expert thing. I'm not an expert, but it's almost the same as for mulu. All you need to do is ext.l the register, and apply normal shifts, adds and subs. If the constant is negative, just neg.l the end result.

I know that this isn't very helpful, but unless muls is a lot slower than mulu, I'd just go for a couple of tables.

By the way, you forgot to answer about continuing this thread in the way it was going.

meynaf · 14 May 2009, 09:11

Quote:

Originally Posted by Thorham

After my initial look at those muls instructions (things like muls #$187e,dx), I thought: Why not just make a, wait for it, table... I know you probably won't like 256kb tables, but before you dismiss them, read further.

Ok, I will read further, but just because you ask for it

Perhaps you can guess that I've thought about tables quite long ago, too

Quote:

Originally Posted by Thorham

Advantages:
- Tables offer better optimization for large constants.

Simple move.l (an,dn.w*4),dn is around 12 cycles, if not 14 (compared to 28 or 30 of muls).
Most constants can be done in 18 to 22 cycles, so the difference wouldn't be important, especially because of register shortage and load/save of the table pointer.

Large tables would be ok if they gave stunning results, say, 50% faster. But alas on the overall process, each 256kb block of constants will probably be less than 1% speed.

Quote:

Originally Posted by Thorham

- Tables use up less instructions in tight loops, helps the cache potentially.

Even with muls, some parts have large cache contention. True that less code helps a lot - especially because there are no tight loops at all in some areas.

Quote:

Originally Posted by Thorham

- If the program can be made resident because it's pure, the tables only have to be setup or read once, if I'm not mistaken.

Yes.

Quote:

Originally Posted by Thorham

- one to two megabytes in overhead doesn't hurt configurations like yours or mine. Since the program has little to no use on bare machines, this overhead doesn't have to be a problem.

Each table would cost 4*64kb (256kb). Would be no real problem for one constant, but there are many of them (more than 20).

So when I play, say, a 16MB MP3 and do something else in the background (yes I can), memory would go just too low (I think).

Besides, computing such a big table could cost a lot of time at startup (to be checked).

Quote:

Originally Posted by Thorham

Disadvantage:

- You don't like large tables

This is true

Quote:

Originally Posted by Thorham

Now for the muls expert thing. I'm not an expert, but it's almost the same as for mulu. All you need to do is ext.l the register, and apply normal shifts, adds and subs. If the constant is negative, just neg.l the end result.

I've already passed the point of ext.l and neg.l. I use sub instead of add, and sign-extend with movea wherever possible.

Did you know that :
movea.w d0,a0
was actually faster than :
ext.l d0
?

Quote:

Originally Posted by Thorham

I know that this isn't very helpful, but unless muls is a lot slower than mulu, I'd just go for a couple of tables.

muls and mulu have same timing.

But it wouldn't be a couple of tables, but ten of them, if not many more.

Quote:

Originally Posted by Thorham

By the way, you forgot to answer about continuing this thread in the way it was going.

If you want to do so, do it, but personally I don't need it.

Another thing about muls, is that the program constantly multiplies the same input by different constants, and shift-and-add can have common parts (intermediate results).

Look at the following code, and tell me what amount of cycles 4 tables would gain - especially when there is no free register at all :

Code:

 move.w xx(a0),a4
 move.l a4,d3
 add.l d3,d3
 add.l d3,a4
 move.l a4,d4
 add.l d3,a4
 lsl.l #3,d4
 move.l d4,d5
 move.l d4,d6
 lsl.l #2,d4
 add.l d4,d5
 add.l d3,d4
 lsl.l #6,d4
 sub.l d3,d4  ; 187e
 lsl.l #8,d3
 sub.l d3,d5
 add.l d3,d6
 add.l d6,d6
 lsl.l #3,d3
 add.l d3,d5
 add.l d4,d5  ; 26f6
 add.l d5,d3
 sub.l d6,d3  ; 32c6
 add.l d6,d6
 sub.l a4,d6  ; 85b

This is the equivalent of 4 muls. Of course it's much larger code, but I think it would be more efficient to reduce those blocks than use tables.
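
The shared intermediate results are easier to follow when the sequence is replayed step by step. Below is a small Python model of the block above, one statement per instruction, with registers held as 32-bit patterns. The function names are made up for illustration; only the instruction order and the four constants come from the post.

```python
MASK = 0xFFFFFFFF  # registers are 32 bits wide


def to_signed(value):
    """Interpret a 32-bit pattern as a signed long."""
    value &= MASK
    return value - (1 << 32) if value & 0x80000000 else value


def four_products(x):
    """Replay the posted sequence for a signed 16-bit input x.

    Returns (x*$187e, x*$26f6, x*$32c6, x*$85b) as signed 32-bit values.
    """
    a4 = x & MASK              # move.w xx(a0),a4  (sign-extended load)
    d3 = a4                    # move.l a4,d3
    d3 = (d3 + d3) & MASK      # add.l d3,d3   -> 2x
    a4 = (a4 + d3) & MASK      # add.l d3,a4   -> 3x
    d4 = a4                    # move.l a4,d4
    a4 = (a4 + d3) & MASK      # add.l d3,a4   -> 5x
    d4 = (d4 << 3) & MASK      # lsl.l #3,d4   -> 24x
    d5 = d4                    # move.l d4,d5
    d6 = d4                    # move.l d4,d6
    d4 = (d4 << 2) & MASK      # lsl.l #2,d4   -> 96x
    d5 = (d5 + d4) & MASK      # add.l d4,d5   -> 120x
    d4 = (d4 + d3) & MASK      # add.l d3,d4   -> 98x
    d4 = (d4 << 6) & MASK      # lsl.l #6,d4   -> 6272x
    d4 = (d4 - d3) & MASK      # sub.l d3,d4   -> 6270x  = $187e
    d3 = (d3 << 8) & MASK      # lsl.l #8,d3   -> 512x
    d5 = (d5 - d3) & MASK      # sub.l d3,d5   -> -392x
    d6 = (d6 + d3) & MASK      # add.l d3,d6   -> 536x
    d6 = (d6 + d6) & MASK      # add.l d6,d6   -> 1072x
    d3 = (d3 << 3) & MASK      # lsl.l #3,d3   -> 4096x
    d5 = (d5 + d3) & MASK      # add.l d3,d5   -> 3704x
    d5 = (d5 + d4) & MASK      # add.l d4,d5   -> 9974x  = $26f6
    d3 = (d3 + d5) & MASK      # add.l d5,d3   -> 14070x
    d3 = (d3 - d6) & MASK      # sub.l d6,d3   -> 12998x = $32c6
    d6 = (d6 + d6) & MASK      # add.l d6,d6   -> 2144x
    d6 = (d6 - a4) & MASK      # sub.l a4,d6   -> 2139x  = $85b
    return tuple(to_signed(r) for r in (d4, d5, d3, d6))
```

The comments track the running multiple of x in each register, which shows where 2x, 512x and 4096x are reused across the four results.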

I tried to put them in a loop and it didn't prove to be faster.

If you really think tables are the way to go, I encourage you to experiment with them anyway. I'd give you the whole code so you can assemble and test.
The best way to test I know is to play something with settings your machine can't handle : it may hurt your ears but the final playtime will give you precious indications.
Of course it's possible to open the lib and measure decoding time, but I've got no program to do that (btw you could help a lot in writing one).

Thorham · 14 May 2009, 11:09

Quote:

Originally Posted by meynaf

Ok, I will read further, but just because you ask for it

Perhaps you can guess that I've thought about tables quite long ago, too

Well, yes, actually, but you may have dismissed the idea knowing how you don't always like big tables

Quote:

Originally Posted by meynaf

Simple move.l (an,dn.w*4),dn is around 12 cycles, if not 14 (compared to 28 or 30 of muls).
Most constants can be done in 18 to 22 cycles, so the difference wouldn't be important, especially because of register shortage and load/save of the table pointer.

That's not much

Quote:

Originally Posted by meynaf

Large tables would be ok if they gave stunning results, say, 50% faster. But alas on the overall process, each 256kb block of constants will probably be less than 1% speed.

And in the case of the example you posted, the four multiplies add up to 58 cycles, not counting the first instruction. That's only 14.5 cycles per multiply! Tables won't do any good I'm afraid...

Quote:

Originally Posted by meynaf

So when I play, say, a 16MB MP3 and do something else in the background (yes I can), memory would go just too low (I think).

Yeah, that's a problem. Reading parts of the file instead of just the whole file would slow things down again.

Quote:

Originally Posted by meynaf

Besides, computing such a big table could cost a lot of time at startup (to be checked).

Not in this case.

Unsigned version:

Code:

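    move.l  #table,a0       ; a0 -> start of the 65536-longword table
    move.l  #$187e,d0       ; constant to multiply by
    moveq   #0,d1           ; running product, starts at 0*$187e
    moveq   #-1,d2          ; d2.w = $ffff, so dbra loops 65536 times
.loop
    move.l  d1,(a0)+        ; table[i] = i*$187e
    add.l   d0,d1           ; next product
    dbra    d2,.loop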

Signed version:

Code:

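    move.l  #table,a0       ; a0 fills the entries for 0..32767 upwards
    move.l  #table+2^18,a1  ; a1 fills the entries for -1..-32768 downwards
    move.l  #$187e,d0       ; constant to multiply by
    moveq   #0,d1           ; positive running product
    moveq   #0,d2           ; negative running product
    move.l  #2^15-1,d3      ; dbra loops 32768 times
.loop
    move.l  d1,(a0)+        ; table[i] = i*$187e
    add.l   d0,d1
    sub.l   d0,d2
    move.l  d2,-(a1)        ; table[65536-k] = -k*$187e
    dbra    d3,.loop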
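
As a cross-check of the signed layout, here is a minimal Python model of that loop. It assumes `2^18` and `2^15` are meant as powers of two, so the table holds 65536 longwords (4 * 64 KB = 256 KB) and `a1` starts at its end. The names `build_signed_table` and `lookup` are hypothetical.

```python
CONSTANT = 0x187E
ENTRIES = 1 << 16  # one longword per possible 16-bit input


def build_signed_table(constant=CONSTANT):
    """Fill the table so that table[x & 0xFFFF] == x * constant."""
    table = [0] * ENTRIES
    up, down = 0, 0            # d1 and d2 in the posted loop
    low, high = 0, ENTRIES     # a0 (post-increment) and a1 (pre-decrement)
    for _ in range(1 << 15):   # dbra with d3 = 2^15-1 runs 32768 times
        table[low] = up        # move.l d1,(a0)+
        low += 1
        up += constant         # add.l d0,d1
        down -= constant       # sub.l d0,d2
        high -= 1
        table[high] = down     # move.l d2,-(a1)
    return table


def lookup(table, x):
    """Return x * constant for a signed 16-bit x, indexing by its bit pattern."""
    return table[x & 0xFFFF]
```

One caveat with this layout: the model indexes by the unsigned 16-bit pattern. On the 68020 a word-sized index register is sign-extended, so a lookup such as move.l (an,dn.w*4),dn would need either a zero-extended long index or a table with the negative half placed below the base pointer.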

Quote:

Originally Posted by meynaf

I've already passed the point of ext.l and neg.l. I use sub instead of add, and sign-extend with movea wherever possible.

You're right about using sub instead of add, of course

Quote:

Originally Posted by meynaf

Did you know that :
movea.w d0,a0
was actually faster than :
ext.l d0
?

Really? How silly

Good to know in case data registers are needed.

Quote:

Originally Posted by meynaf

muls and mulu have same timing.

That's what I thought, but I was unsure.

Quote:

Originally Posted by meynaf

But it wouldn't be a couple of tables, but ten of them, if not many more.

At first I was thinking to just replace the most common ones. Might give a speed boost without using up half the memory, but now that I've actually looked at your code more closely, I've come to the conclusion that tables are probably not worth the trouble.

Quote:

Originally Posted by meynaf

If you want to do so, do it, but personally I don't need it.

I don't need it, either, but I did enjoy it! If you did as well, then I'll continue.

Quote:

Originally Posted by meynaf

Another thing about muls, is that the program constantly multiplies the same input by different constants, and shift-and-add can have common parts (intermediate results).

Yes, I've seen that. Good job

Quote:

Originally Posted by meynaf

Look at the following code, and tell me what amount of cycles 4 tables would gain - especially when there is no free register at all :

Code:

 move.w xx(a0),a4
 move.l a4,d3
 add.l d3,d3
 add.l d3,a4
 move.l a4,d4
 add.l d3,a4
 lsl.l #3,d4
 move.l d4,d5
 move.l d4,d6
 lsl.l #2,d4
 add.l d4,d5
 add.l d3,d4
 lsl.l #6,d4
 sub.l d3,d4  ; 187e
 lsl.l #8,d3
 sub.l d3,d5
 add.l d3,d6
 add.l d6,d6
 lsl.l #3,d3
 add.l d3,d5
 add.l d4,d5  ; 26f6
 add.l d5,d3
 sub.l d6,d3  ; 32c6
 add.l d6,d6
 sub.l a4,d6  ; 85b

None, tables wouldn't do any good at all here.

Quote:

Originally Posted by meynaf

This is the equivalent of 4 muls. Of course it's much larger code, but I think it would be more efficient to reduce those blocks than use tables.

As said, this is only about 14.5 cycles per mul. Reducing the code, if possible, would be much better.

Quote:

Originally Posted by meynaf

If you really think tables are the way to go, I encourage you to experiment with them anyway. I'd give you the whole code so you can assemble and test.

If most constants can be optimized like in the above example, but not all of them, then there may still be a use for tables. At least this reduces the number of tables needed

Quote:

Originally Posted by meynaf

The best way to test I know is to play something with settings your machine can't handle : it may hurt your ears but the final playtime will give you precious indications.

Or you can decode to a file.

Quote:

Originally Posted by meynaf

Of course it's possible to open the lib and measure decoding time, but I've got no program to do that (btw you could help a lot in writing one).

If it's not a problem to use code that changes the vbl interrupt without using the system (dirty, I know), then such a program is easy enough to write, and also very small.

Edited: Well, so much for the simple and dirty method. I thought it would be enough to just insert a custom vbl interrupt, but the level 3 vector is changed back to the system default automatically. Bah, I thought this would be a lot easier than it may actually be

StingRay · 14 May 2009, 15:04

Quote:

Originally Posted by meynaf

If it didn't run anymore, there has to be a reason... If you agree to give me that code, I will tell you what happens and fix it. You will see that the code is still shorter but works ;-)

Of course there was a reason why it didn't work anymore. The reason was PHX-Ass trying to optimize my already optimized code which failed miserably.

I would happily give you the code but it's impossible as I released the intro at Mekka'99 and didn't keep the version that crashed when assembled with PHX-Ass. Please notice that I just used this as example to prove my point that assemblers never can be as good as humans when it comes to optimizing code.

Quote:

Originally Posted by meynaf

And I never said phxass was perfect either, you know. But I can use it for what I need, and I can't say that for asm-one.

Same for me with Asm-One/Pro.

Quote:

Originally Posted by meynaf

Then try this :
meynaf.free.fr/tmp/v.lzx
Source to assemble is v.s. Good luck. And please check "all errors", just for a laugh. When I did it it crashed.

Will check it when I have some time and/or am bored enough. :-)

Quote:

Originally Posted by meynaf

And tell me why some perfectly valid 020+ constructions like this one :

Code:

 move.b ($1000.w,a0,d0.l),d0

... get rejected even in 030 mode. And don't tell me I shouldn't write this because it is inefficient, this is re-sourced code.

This is where PHX-Ass is definitely better than any Asm1 version. Asm1 originally was an 68000 only Assembler, 680x0 support was added much later (using resourced Asm1 code mind you) and it wasn't done by Promax (original coder of Asm1) which is why there are a lot of problems with several perfectly valid 680x0 instructions.

Quote:

Originally Posted by meynaf

Yes, in "assembler" preferences, right in the middle of things like cpu type which have nothing to do with that. No "editor" preferences, ok. Well placed, easy to guess. Not to mention the option is more or less useless and the default setting a little bit on the stupid side.

Well, I had no problems to find that very option, after all there are just 2 pages for preferences, I don't see why it is hard to find. But it's a matter of taste I guess. I agree about the default setting being useless though but that's also a matter of taste, maybe some people like it.

Quote:

Originally Posted by meynaf

But now you have to tell me why the debugger simply freezes my machine with a grey screen.

That I don't know, the debugger always worked fine for me. Maybe too old version of Asm1?

Quote:

Originally Posted by meynaf

I perhaps didn't spend much time with it, but its learning curve seems to be a little bit too slow to raise...

Here I disagree, I find it VERY easy to use as you have everything in one package, editor, assembler and debugger. Load Asm1, press Escape, write your source, press Escape again, press A, j to start your program, couldn't be easier IMHO.

Quote:

Originally Posted by meynaf

Ok, I'll say it doesn't fit my needs instead of saying it's crap, just to please you, but it doesn't change the fact that I simply can't use it.

It's again a matter of taste.

I can use PHX-Ass but I don't like it. Might have to do with the fact that I started coding using SEKA and ASM1 is more or less a "pimped" SEKA so I never had problems to use it.

Quote:

Originally Posted by meynaf

Theoretically you don't need this. It's easy to detect phxass with an if/endc pair.

I want to use ASM1 to assemble my code even if I have to use PHX-Ass compatible mode thus the define. I have no other way in Asm1 to "detect" PHX-Ass.

I use it to disable certain Asm1 only features etc.

Quote:

Originally Posted by meynaf

But when I was talking to Thorham, I didn't really bash it. Then you came and said I did a hate campaign. You said phxass was horribly broken. I just had to defend, man

I know!

No hard feelings or anything, I just felt I need to "defend" Asm1 too. =)

Quote:

Originally Posted by meynaf

Now we can stop all that junk talk and you will tell me why my code won't assemble in asm-one, so I can use it. That would be much more useful than telling me I'm wrong in bashing it.

If you're happy with PHX-Ass (and I think you are) then by all means use it! I mean, I am using quite an old version of AsmPro and most probably I'll never update it because I learned to even make use of some of its bugs (no joke!). However, if you need help using Asm1 I'll happily help you.

Leffmann · 14 May 2009, 15:37

Just want to point out that indirect with index and base displacement is doable in ASM-one, you just have to put the base displacement last:

move.b (a0, d0.l, $1000.w), d0

StingRay · 14 May 2009, 15:42

I know, just didn't want to mention it as it doesn't work in all Asm1 versions. (AsmPro doesn't support it at all)

Thorham · 14 May 2009, 17:37

Edited: Sorry for the mistake, but this code only works if it can't be preempted, or if you have a spare address register to swap with a7

A little update on the table thing. I think this is going to be as good as it's going to get:

Code:

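    move.w  xx(a0),a4       ; sign-extended input
    
    add.l   a4,a4
    add.l   a4,a4           ; a4 = input*4 (longword offset)
    
    add.l   (sp)+,a4        ; a4 -> entry in the first table
    move.l  (a4),d3
    
    add.l   (sp)+,a4        ; next stack longword moves a4 to the second table
    move.l  (a4),d4
    
    add.l   (sp)+,a4        ; third table
    move.l  (a4),d5
    
    add.l   (sp)+,a4        ; fourth table
    move.l  (a4),d6

    subq.l   #8,sp
    subq.l   #8,sp           ; rewind sp by the 16 bytes consumed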

That is if the mem reads take four cycles each. In that case it's only 40 cycles, not counting the first instruction, but I could be dead wrong

Edited: I checked the code with my speed testing program and my table reading idea is only, count'm, two whole frames faster for a million loop iterations: Yours is 67 frames, mine is 65 frames. In other words: Do not bother coz it sucks

Well, it sucks in this case. It might be useful in cases where you need every bit of speed you can get, such as in demos.

And what's bad as well is that my table reading idea needs the stack to be setup properly. Probably no interesting overhead, but it's just not handy, especially not given the small speed increase.
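
For readers trying to picture the stack setup, here is a rough Python model of one way to read the snippet. It rests on several assumptions: the first longword on the stack is the address of the middle of the first table, each following longword is the distance to the middle of the next table, and the tables are assigned to the four constants in the order shown. The names `build_memory` and `lookup_all` and the base address are made up.

```python
CONSTANTS = (0x187E, 0x26F6, 0x32C6, 0x085B)  # assumed order for d3..d6
TABLE_BYTES = 4 << 16                          # 65536 longwords per table


def build_memory(base=0x100000):
    """Lay the tables out back to back; return (memory, stack).

    memory maps byte addresses to longwords. Each table is addressed through
    its middle, so a signed offset 4*x lands on x * constant.
    """
    memory = {}
    middles = []
    for n, constant in enumerate(CONSTANTS):
        middle = base + n * TABLE_BYTES + TABLE_BYTES // 2
        middles.append(middle)
        for x in range(-32768, 32768):
            memory[middle + 4 * x] = x * constant
    # First entry is an absolute address, the others are deltas between tables.
    stack = [middles[0]] + [b - a for a, b in zip(middles, middles[1:])]
    return memory, stack


def lookup_all(memory, stack, x):
    """Mirror the four add.l (sp)+,a4 / move.l (a4),dn pairs."""
    a4 = 4 * x                 # the two add.l a4,a4
    results = []
    for entry in stack:        # each pop moves a4 into the next table
        a4 += entry
        results.append(memory[a4])
    return tuple(results)
```

Because `a4` keeps the scaled index and only accumulates deltas, no extra address register is needed per table, which is the point of the trick. The cost is that sp is used as a data pointer for the duration of the block.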

Thorham · 15 May 2009, 12:03

Ok, I've looked at the file mdct_l.s a little closer, and there are several constants, namely 85b, 187e, 26f6 and 32c6, which are used over and over again. Furthermore, in those code blocks (up to and including ; u7) registers a1, a2 and a3 are free. Seems to me it would make at least some sense to use simple tables for three of those constants, including 187e, which is used more often than the other ones. You can then just optimize the simplest constant in the usual way.

Of course, if you plan to use the optimization you've shown me in all those code blocks, then it might be pointless.

Now I don't want to sound like I'm insisting on tables, but it might be possible to get a reasonable gain without making tables for all the constants. It just seems to me that there may be a possible balance between the usual optimizations and tables.

Edited: Trivial, but here's an example:

Code:

    move.w  d4,d3
    muls    #$32c6,d3

becomes

Code:

    move.l  (a1,d4.w*4),d3

Edited2: Sorry for all the edits, but isn't movem.l a1-a3,-(sp) slower than three separate moves?

meynaf · 16 May 2009, 15:26

Quote:

Originally Posted by StingRay

Of course there was a reason why it didn't work anymore. The reason was PHX-Ass trying to optimize my already optimized code which failed miserably.

I would happily give you the code but it's impossible as I released the intro at Mekka'99 and didn't keep the version that crashed when assembled with PHX-Ass.

'99 ? There has been quite a few new versions of phxass since then !

Quote:

Originally Posted by StingRay

Please notice that I just used this as example to prove my point that assemblers never can be as good as humans when it comes to optimizing code.

But both can work together. What one doesn't see, the other sees it. Or, in my case, what one is too lazy to do, the other will

Quote:

Originally Posted by StingRay

Will check it when I have some time and/or am bored enough. :-)

Yeah, sure. If you can find an asm which can assemble this with a minimal amount of modifications and produces code of the same quality, I'll drop phxass !

Quote:

Originally Posted by StingRay

This is where PHX-Ass is definitely better than any Asm1 version. Asm1 originally was an 68000 only Assembler, 680x0 support was added much later (using resourced Asm1 code mind you) and it wasn't done by Promax (original coder of Asm1) which is why there are a lot of problems with several perfectly valid 680x0 instructions.

Just too bad. And of course all asm1 development was abandoned ?

Quote:

Originally Posted by StingRay

Well, I had no problems to find that very option, after all there are just 2 pages for preferences, I don't see why it is hard to find. But it's a matter of taste I guess. I agree about the default setting being useless though but that's also a matter of taste, maybe some people like it.

It's perhaps to make people feel as if they were working on a pc. Damn num lock key

Quote:

Originally Posted by StingRay

That I don't know, the debugger always worked fine for me. Maybe too old version of Asm1?

Tested version was 1.48. Is there a newer one ?
Or isn't it one of those debuggers which need an external (parallel or serial) device to display things ?

Quote:

Originally Posted by StingRay

Here I disagree, I find it VERY easy to use as you have everything in one package, editor, assembler and debugger. Load Asm1, press Escape, write your source, press Escape again, press A, j to start your program, couldn't be easier IMHO.

It COULD be easier. What are the two "press Escape" for, eh ?
I don't see any use for this command-line stuff and it's constantly coming in the way between editor and assembler.

Quote:

Originally Posted by StingRay

It's again a matter of taste.

I can use PHX-Ass but I don't like it. Might have to do with the fact that I started coding using SEKA and ASM1 is more or less a "pimped" SEKA so I never had problems to use it.

I started on devpac 2 and I can't say I like it...

Quote:

Originally Posted by StingRay

I want to use ASM1 to assemble my code even if I have to use PHX-Ass compatible mode thus the define. I have no other way in Asm1 to "detect" PHX-Ass.

I use it to disable certain Asm1 only features etc.

But why not do something like this :

Code:

 ifd _PHXASS_
PHXASS        = 1
 else
PHXASS        = 0
 endc

Quote:

Originally Posted by StingRay

I know!

No hard feelings or anything, I just felt I need to "defend" Asm1 too. =)

We're fighting for nothing.

Quote:

Originally Posted by StingRay

If you're happy with PHX-Ass (and I think you are) then by all means use it! I mean, I am using quite an old version of AsmPro and most probably I'll never update it because I learned to even make use of some of its bugs (no joke!). However, if you need help using Asm1 I'll happily help you.

I can't say I'm happy with Phxass. I'm using it because it's simply the only one which is able to do the job.

So, how good is asmpro ?

For help on asm1, I think I pointed you some example code... Could be good to make it asm1 compatible.

meynaf · 16 May 2009, 16:02

Quote:

Originally Posted by Thorham

Well, yes, actually, but you may have dismissed the idea knowing how you don't always like big tables

Do I often dismiss ideas that are given to me ? Oh, well, ok

Quote:

Originally Posted by Thorham

That's not much

Yes.

Quote:

Originally Posted by Thorham

And in the case of the example you posted, the four multiplies add up to 58 cycles, not counting the first instruction. That's only 14.5 cycles per multiply! Tables won't do any good I'm afraid...

That's what I thought

Quote:

Originally Posted by Thorham

Yeah, that's a problem. Reading parts of the file instead of just the whole file would slow things down again.

It's already possible to check this, because DT's Mpega player can read like that. But I won't do it myself, as I think we need more than a few tables to compensate for that speed loss...

Quote:

Originally Posted by Thorham

Not in this case.

Ok, for several megabytes it can be a quarter of a second or so, still acceptable.
But a multiply by the 187e constant can be done in 18 cycles. Not very different from a table lookup, especially when there is no free register to hold the pointer.

Quote:

Originally Posted by Thorham

You're right about using sub instead of add, of course

I'm using it massively. If you compare with earlier mpega code, you'll see that many negs were there and no longer are !

Quote:

Originally Posted by Thorham

Really? How silly

Good to know in case data registers are needed.

Yes. I thought movea.w was 4 cycles like ext.l but it's 2 !
However for massive sign-extend I prefer movem.w
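
A small sketch of both sign-extension forms mentioned here; the registers are picked arbitrarily for illustration and are not taken from the mpega source:

Code:

 movea.w d0,a0			; a0 = d0.w sign-extended to 32 bits
 movem.w (a1)+,d0-d3		; load four words, each one sign-extended to 32 bits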

Quote:

Originally Posted by Thorham

That's what I thought, but I was unsure.

It's divu and divs which don't.

Quote:

Originally Posted by Thorham

At first I was thinking to just replace the most common ones. Might give a speed boost without using up half the memory, but now that I've actually looked at your code more closely, I've come to the conclusion that tables are probably not worth the trouble.

IMO using tables here would be using megabytes to (perhaps) gain 5 seconds out of 10 minutes of decoding. I'll let you guess if it's worth it or not.

Quote:

Originally Posted by Thorham

I don't need it, either, but I did enjoy it! If you did as well, then I'll continue.

Isn't it already what we're doing in here ?

Quote:

Originally Posted by Thorham

Yes, I've seen that. Good job

Not only mine. My friend Don Adan (Wanted Team) has to be credited for some parts. The 12 muls blocks are from him. Incredible code, that is. And my 4-muls block is just an improvement over his.

Quote:

Originally Posted by Thorham

None, tables wouldn't do any good at all here.
As said, this is only about 14.5 cycles per mul. Reducing the code, if possible, would be much better.

If possible ! Perhaps we need a program able to test all possible code combinations to do so !

Quote:

Originally Posted by Thorham

If most constants can be optimized like in the above example, but not all of them, then there may still be a use for tables. At least this reduces the number of tables needed

I think nearly all constants can be optimised like that, provided there are enough of them in the block (4 are ok, 2 are not enough for complex consts).

Quote:

Originally Posted by Thorham

Or you can decode to a file.

I can't, because I don't know exactly how to use the library (this may sound strange as I'm recoding parts of it).

Quote:

Originally Posted by Thorham

If it's not a problem to use code that changes the vbl interrupt without using the system (dirty, I know), then such a program is easy enough to write, and also very small.

It's surely not a problem for a benchmarking program, but it's not really needed, as the OS doesn't use a lot of cpu time. If our program runs for, say, 10 seconds, results will be precise enough (+/- 1 or 2 frames).

Quote:

Originally Posted by Thorham

Edited: Well, so much for the simple and dirty method. I thought it would be enough to just insert a custom vbl interrupt, but the level 3 vector is changed back to the system default automatically. Bah, I thought this would be a lot easier than it may actually be

As I already have code to use AddIntServer(), I find it much simpler to use it.

But the problem isn't to measure time. It is to actually call the library to do the job without using DT's player.

Quote:

Originally Posted by Thorham

Edited: Sorry for the mistake, but this code only works if it can't be preempted, or if you have a spare address register to swap with a7

A little update on the table thing. I think this is going to be as good as it's going to get:

That is if the mem reads take four cycles each. In that case it's only 40 cycles, not counting the first instruction, but I could be dead wrong

Edited: I checked the code with my speed testing program and my table reading idea is only, count'm, two whole frames faster for a million loop iterations: Yours is 67 frames, mine is 65 frames. In other words: Do not bother coz it sucks

Well, it sucks in this case. It might be useful in cases where you need every bit of speed you can get, such as in demos.

And what's bad as well is that my table reading idea needs the stack to be set up properly. Probably no interesting overhead, but it's just not handy, especially not given the small speed increase.

In our case, it's probably even worse. The whole mdct call isn't in a loop. So each time we'll need to :
. save old a7 in memory = memory write
. move 4000 to DFF09A to kill pre-emption = equivalent to chipmem write !
. read a7 from some memory area = memory read
And afterwards :
. restore A7 = memory write
. move C000 to DFF09A to restore system = equivalent to chipmem write !

Better to push-n-pop a reg !
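
To make the comparison concrete, here is a rough sketch of both variants. The labels saved_sp and table are hypothetical, a1 is only an example register, and the block in the middle is left out; this is not the actual mdct code:

Code:

; variant 1 : borrow a7 as table pointer
 move.l a7,saved_sp		; save the real stack pointer in memory
 move.w #$4000,$dff09a		; INTENA : clear the master enable bit, no more pre-emption
 lea table,a7			; a7 now points at the table
 ; ... block using a7 ...
 move.l saved_sp,a7		; get the real stack pointer back
 move.w #$c000,$dff09a		; INTENA : set the master enable bit again

; variant 2 : push-n-pop a reg
 move.l a1,-(sp)		; save a1 on the stack
 lea table,a1			; a1 now points at the table
 ; ... block using a1 ...
 move.l (sp)+,a1		; restore a1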

Quote:

Originally Posted by Thorham

Ok, I've looked at the file mdct_l.s a little closer, and there are several constants, namely 85b, 187e, 26f6 and 32c6, which are used over and over again.

Not only those, but also 3b21 and 3f74 (but they are sums of others).

Quote:

Originally Posted by Thorham

Furthermore, in those code blocks (up to and including ; u7) registers a1, a2 and a3 are free.

Oh no, they are not !

Quote:

Originally Posted by Thorham

Seems to me it would make at least some sense to use simple tables for three of those constants, including 187e, which is used more often than the other ones. You can then just optimize the simplest constant in the usual way.

Could be 3b21 too because 3b21=32c6+85b.

Quote:

Originally Posted by Thorham

Of course, if you plan to use the optimization you've shown me in all those code blocks, then it might be pointless.

I used it in the first 4 blocks, but couldn't for the 4 others due to massive register shortage.

Quote:

Originally Posted by Thorham

Now I don't want to sound like I'm insisting on tables, but it might be possible to get a reasonable gain without making tables for all the constants. It just seems to me that there may be a possible balance between the usual optimizations and tables.

A reasonable gain ? See below about what 30 cycles gain here would give, and then think again about making tables.

Quote:

Originally Posted by Thorham

Edited: Trivial, but here's an example:

Code:

    move.w  d4,d3
    muls    #$32c6,d3

becomes

Code:

    move.l  (a1,d4.w*4),d3

Could be trivial. But A1 isn't free (look twice if you don't believe me). You don't need to save it because it's already done, but it gets used after those blocks.
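
For reference, a sketch of what such a table implies, assuming one longword per possible signed 16-bit input, which makes 65536 x 4 bytes = 256 KB per constant. The labels build_table and table are hypothetical, and a1 is used as in the quoted example even though it is not free here:

Code:

build_table
 lea table,a0			; first entry is for input -32768
 move.l #-32768,d0
.fill
 move.w d0,d1
 muls #$32c6,d1			; d1.l = d0.w * $32c6
 move.l d1,(a0)+
 addq.l #1,d0
 cmp.l #32768,d0
 bne.s .fill			; 65536 entries in total
 rts

; lookup : a1 points at the entry for input 0, i.e. the middle of the table
 lea table+32768*4,a1
 move.l (a1,d4.w*4),d3		; same result as move.w d4,d3 / muls #$32c6,d3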

Quote:

Originally Posted by Thorham

Edited2: Sorry for all the edits, but isn't movem.l a1-a3,-(sp) slower than three separate moves?

In a tight loop, yes. But in that damn thing where code doesn't fit in caches and has no chance of it anyway because there is no loop at all in it, I'm sure of absolutely nothing !
Besides, movem being slower is not true for 040 and probably not for 060 as well.
Furthermore, I can't measure such a small difference. Removing 30 cycles here is the equivalent of roughly one second of decoding time in an 8-minute layer 3 stream...

Thorham · 18 May 2009, 08:00

Quote:

Originally Posted by meynaf

Do I often dismiss ideas that are given to me ? Oh, well, ok

Mostly only tables and quality reductions

Quote:

Originally Posted by meynaf

It's already possible to check this, because DT's Mpega player can read like that. But I won't do it myself, as I think we need more than a few tables to compensate for that speed loss...

Most certainly. However, what are you going to do in case someone tries to play an mpeg that's a 100+ mb large?

Quote:

Originally Posted by meynaf

Ok, for several megabytes it can be a quarter of a second or so, still acceptable.

But a multiply by the 187e constant can be done in 18 cycles. Not very different from a table lookup, especially when there is no free register to hold the pointer.

And it doesn't matter much anyway if the code isn't in a tight loop, such as in this case.

Quote:

Originally Posted by meynaf

I'm using it massively. If you compare with earlier mpega code, you'll see that many negs were there and no longer are !

Haven't looked at it for a while now, but it's an improvement.

Quote:

Originally Posted by meynaf

Yes. I thought movea.w was 4 cycles like ext.l but it's 2 !
However for massive sign-extend I prefer movem.w

That's quite a good one there

Quote:

Originally Posted by meynaf

It's divu and divs which don't.

Right, I keep forgetting.

Quote:

Originally Posted by meynaf

IMO using tables here would be using megabytes to (perhaps) gain 5 seconds out of 10 minutes of decoding. I'll let you guess if it's worth it or not.

That's very little

How many cycles would you have to chop off here to get a real speed increase?

Quote:

Originally Posted by meynaf

Isn't it already what we're doing in here ?

Yup, that's where it's going, isn't it

Quote:

Originally Posted by meynaf

Not only mine. My friend Don Adan (Wanted Team) has to be credited for some parts. The 12 muls blocks are from him. Incredible code, that is. And my 4-muls block is just an improvement over his.

If you mean the code at the end of the mdct file, then yes, that's quite awesome!

Quote:

Originally Posted by meynaf

If possible ! Perhaps we need a program able to test all possible code combinations to do so !

I wouldn't bother with that, because the code seems to play an insignificant role time wise. It seems better to just focus on the things which happen in the tight loops.

Quote:

Originally Posted by meynaf

I think nearly all constants can be optimised like that, provided there are enough of them in the block (4 are ok, 2 are not enough for complex consts).

Probably, but is it worth it in code that's not executed that often, or am I missing something?

Quote:

Originally Posted by meynaf

I can't, because I don't know exactly how to use the library (this may sound strange as I'm recoding parts of it).

You can use a command line utility that uses mpega.library to decode to files. I did it once for some mpegs before I had a peecee. If only I could remember the name of the program... I can tell you it's from Aminet though

Seemed to work quite well. I don't suppose dial up is a problem when accessing Aminet?

Quote:

Originally Posted by meynaf

It's surely not a problem for a benchmarking program, but it's not really needed, as the OS doesn't use a lot of cpu time. If our program runs for, say, 10 seconds, results will be precise enough (+/- 1 or 2 frames).

I was thinking of the dirty method because it's the easiest to implement.

Quote:

Originally Posted by meynaf

As I already have code to use AddIntServer(), I find it much simpler to use it.

Some obvious things here:

If you already have code, then it's still easy to do. All you really need is a program that sets up the irq handler with a counter routine and outputs the location of that counter. Next you need a little program that resets the counter and a program that reads the counter. The final ingredient is a simple script that calls the counter reset program, then the program you want to benchmark, and finally calls the counter reading program.

Not a very elegant way, but surely good enough for most purposes, and since you already have system friendly interrupt code, it should be very easy to write.

Quote:

Originally Posted by meynaf

But the problem isn't to measure time. It is to actually call the library to do the job without using DT's player.

Perhaps that command line decoder I was talking about could be of use here?

Quote:

Originally Posted by meynaf

The whole mdct call isn't in a loop.

Yes, I've read that more than once. How often is this code actually executed?

Quote:

Originally Posted by meynaf

Oh no, they are not !

They're not used in the blocks I was talking about, I checked with a search function. Because they're not used in a relatively large chunk of code, I thought they might be used to hold table pointers.

Quote:

Originally Posted by meynaf

Besides, movem being slower is not true for 040 and probably not for 060 as well.

If I were you, I'd focus on optimizing for '030 only for now. '040's and '060's are much faster, so the '030 is the cpu which needs optimization the most.

Quote:

Originally Posted by meynaf

Furthermore, I can't measure such a small difference. Removing 30 cycles here is the equivalent of roughly one second of decoding time in an 8-minute layer 3 stream...

Doesn't it mean you can cut this code in half and still only save a few extra seconds on said eight minutes? It seems to me that it's more useful to just concentrate on the tight loops.

meynaf · 18 May 2009, 17:17

Quote:

Originally Posted by Thorham

Mostly only tables and quality reductions

Yeah, perhaps also things that make the exe size grow by 60% to gain 2% speed

(you're not targeted on that one)

Btw can you believe my ham8 rendering routine can be optimised further ? In fact it's possible to gain 5 clocks on average

(but I'll let you have a look at it again before I tell)

Quote:

Originally Posted by Thorham

Most certainly. However, what are you going to do in case someone tries to play an mpeg that's a 100+ mb large?

First, DT2's Mpega player would be able to play it without preloading (perhaps I was unclear about that).
Second, I think a 100+ mb mpeg isn't a very clever thing to do...
Third, some people have 128mb of mem to waste

Quote:

Originally Posted by Thorham

And it doesn't matter much anyway if the code isn't in a tight loop, such as in this case.

And it doesn't matter much anyway, right.

Quote:

Originally Posted by Thorham

Haven't looked at it for a while now, but it's an improvement.

Now I've upped the whole "development suite". All sources, the last version of the library, and DT2 with player :
meynaf.free.fr/tmp/mpega.lzx

Now you can't say you can't test

Quote:

Originally Posted by Thorham

That's quite a good one there

If it's shorter & faster, it's always worth it !

Quote:

Originally Posted by Thorham

Right, I keep forgetting.

Those things are easily forgotten. Anyway the 68030 would have been much better IMO if mul were hardwired.

Note that the same kind of mulu vs muls thingy happened to me with lsr vs asr. Thought lsr was faster. But the docs and tests showed they have the same timing.

Quote:

Originally Posted by Thorham

That's very little

How many cycles would you have to chop off here to get a real speed increase?

As I said, it's something like 1 second out of 8 minutes for 1 muls totally removed. Remove 15, and you'll decode in 15 seconds less (not that bad ; that would bring 256kbps stereo low-qual into realtime range).

Quote:

Originally Posted by Thorham

Yup, that's where it's going, isn't it

It is

Quote:

Originally Posted by Thorham

If you mean the code at the end of the mdct file, then yes, that's quite awesome!

That's the code right before the last output pass, yes. It's easy to spot, as it's not formatted my way but his.

Quote:

Originally Posted by Thorham

I wouldn't bother with that, because the code seems to play an insignificant role time wise. It seems better to just focus on the things which happen in the tight loops.

It's not insignificant ; this routine is long enough to have a large impact on speed. It's just that individual instructions in it are not that important.

Moreover, you can only concentrate on tight loops when there actually are some, and there are very few !

Quote:

Originally Posted by Thorham

Probably, but is it worth it in code that's not executed that often, or am I missing something?

The code is executed a few times per frame (audio frame is ~240 samples if I'm not mistaken), but I don't know exactly how many.

Quote:

Originally Posted by Thorham

You can use a command line utility that uses mpega.library to decode to files. I did it once for some mpegs before I had a peecee. If only I could remember the name of the program... I can tell you it's from Aminet though

Seemed to work quite well. I don't suppose dial up is a problem when accessing Aminet?

Dial up is no problem for Aminet, except when I write my answers offline and go online just to post them

Besides, a command line utility which uses the lib to decode is usable only if it can decode partially (I don't want to decode to disk and decoded data would be huge in ram: ), and if it can support all different quality settings.
Also it's better to have the benchmark stuff right inside.

Quote:

Originally Posted by Thorham

I was thinking of the dirty method because it's the easiest to implement.

Not for me ;-)

Quote:

Originally Posted by Thorham

Some obvious things here:

If you already have code, then it's still easy to do. All you really need is a program that sets up the irq handler with a counter routine and outputs the location of that counter. Next you need a little program that resets the counter and a program that reads the counter. The final ingredient is a simple script that calls the counter reset program, then the program you want to benchmark, and finally calls the counter reading program.

Not a very elegant way, but surely good enough for most purposes, and since you already have system friendly interrupt code, it should be very easy to write.

It's indeed very easy to write, because it's exactly what I did in my picture viewer's benchmark program !

Quote:

Originally Posted by Thorham

Perhaps that command line decoder I was talking about could be of use here?

It could. Depends on what it can do.

Quote:

Originally Posted by Thorham

Yes, I've read that more than once. How often is this code actually executed?

I don't know exactly. See label u9f54 in the main source ; this is the calling routine.

On the other hand, as I've given you the whole thing with everything needed to use it, you can just see for yourself (there are even debug macros to write to dff180 at start of main source, just to see).

Quote:

Originally Posted by Thorham

They're not used in the blocks I was talking about, I checked with a search function. Because they're not used in a relatively large chunk of code, I thought they might be used to hold table pointers.

They're not used in the blocks themselves, but they still contain the addresses where to output what's computed in those blocks !
So using them in the blocks would imply restoring them afterwards.

Quote:

Originally Posted by Thorham

If I were you, I'd focus on optimizing for '030 only for now. '040's and '060's are much faster, so the '030 is the cpu which needs optimization the most.

Yes. But I was thinking of making a version for each. Perhaps 040 isn't fast enough for the highest settings, and even 060 users can enjoy a little spare cpu time ;-)

Of course, for now it's 030, but all 030-only modifications I do must be in if/endc pairs. Say, replacing muls by add/lsl isn't a winner at all for 060 (for 040 I frankly don't know !).
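
As an illustration of such an if/endc pair, here is a sketch that multiplies d0.w by 10. The symbol OPT030 and the constant 10 are made up for the example, and the shift/add version uses d1 as scratch; both branches leave the signed 32-bit product in d0:

Code:

 ifd OPT030			; hypothetical symbol, defined for 030 builds only
 ext.l d0			; sign-extend the 16-bit input
 move.l d0,d1
 lsl.l #2,d0			; x*4
 add.l d1,d0			; x*5
 add.l d0,d0			; x*10
 else
 muls #10,d0			; plain multiply for the other cpus
 endc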

Quote:

Originally Posted by Thorham

Doesn't it mean you can cut this code in half and still only save a few extra seconds on said eight minutes? It seems to me that it's more useful to just concentrate on the tight loops.

I said removing one muls, not cutting the code in half ;-)

However if you like tight loops to optimize, then you can have a look at the biggest cpu use contributor of the whole library : the subband window part ! (now located in subb_w.s)

EDIT: I've looked on Aminet but didn't find an mpeg decoder which uses mpega.library to decode to a file...

Thorham · 20 May 2009, 16:02

Quote:

Originally Posted by meynaf

Yeah, perhaps also things that make the exe size grow by 60% to gain 2% speed

(you're not targeted on that one)

Quote:

Originally Posted by meynaf

Btw can you believe my ham8 rendering routine can be optimised further ? In fact it's possible to gain 5 clocks on average

(but I'll let you have a look at it again before I tell)

Astounding

I've taken a look again, and I could only come up with a very small optimization. You know the code with the two btst instructions at the end? The bra at the end of that piece of code can be replaced by the code it jumps to:

Code:

.vbrb
 btst #0,d7				; 0,2 -> b
 beq.s .bl
 btst #1,d7				; 1 -> r
 beq.s .ro
 bra.s .ve				; 3 -> v	(3210->vbrb)

Code:

.vbrb
 btst #0,d7				; 0,2 -> b
 beq.s .bl
 btst #1,d7				; 1 -> r
 beq.s .ro
.vbrb_ve
 move.l d2,a3
 addq.b #3,d2
 move.b d2,(a1)+
 dbf d7,.loop
 rts

It's very little, of course, but it's still faster. Can't believe I didn't spot that one sooner. Anyway, this is not what you're referring to, so please enlighten me.

Quote:

Originally Posted by meynaf

Second, I think a 100+ mb mpeg isn't a very clever thing to do...

I have two. A dj (Tiesto) released his double cd album 'In search of sunrise' as two big mp3s on the net for free before the album was released on cd (or so the guy who gave me the music told me). Also, the music is one long mix, it's not a bunch of separate pieces of music.

Quote:

Originally Posted by meynaf

Third, some people have 128mb of mem to waste

Wish I was one of them

Quote:

Originally Posted by meynaf

And it doesn't matter much anyway, right.

You told me the code isn't in a tight loop, which I interpreted as not being called often, and it doesn't contain any loops itself. In such a case it wouldn't matter much, now would it

Quote:

Originally Posted by meynaf

Now I've upped the whole "development suite". All sources, the last version of the library, and DT2 with player :
meynaf.free.fr/tmp/mpega.lzx

Cool, thanks

I've taken a look at it, and I must say that it's looking a lot better than the last version I checked. Good job. Guess it can't really hurt to take a good look.

Quote:

Originally Posted by meynaf

Now you can't say you can't test

Indeed!

Quote:

Originally Posted by meynaf

Note that the same kind of mulu vs muls thingy happened to me with lsr vs asr. Thought lsr was faster. But the docs and tests showed they have the same timing.

What's the difference between them anyway?

Quote:

Originally Posted by meynaf

As I said, it's something like 1 second out of 8 minutes for 1 muls totally removed. Remove 15, and you'll decode in 15 seconds less (not that bad ; that would bring 256kbps stereo low-qual into realtime range).

Almost makes it seem not worth the trouble

Quote:

Originally Posted by meynaf

That's the code right before the last output pass, yes. It's easy to spot, as it's not formatted my way but his.

About formatting, you should really do something about that, IMHO. Tabs are better than spaces, and a few extra blank lines here and there can't hurt, either. Just my opinion.

Quote:

Originally Posted by meynaf

It's not insignificant ; this routine is long enough to have a large impact on speed. It's just that individual instructions in it are not that important.

Ok, that's clear now.

Quote:

Originally Posted by meynaf

Moreover, you can only concentrate on tight loops when there actually are some, and there are very few !

How do you actually define tight loops? Perhaps we see them in a different way.

Quote:

Originally Posted by meynaf

The code is executed a few times per frame (audio frame is ~240 samples if I'm not mistaken), but I don't know exactly how many.

Oh, but that's actually quite often. I guess it wasn't entirely clear to me how many times that code is executed, hence me harping on about how it doesn't matter because it's not in a tight loop

Quote:

Originally Posted by meynaf

Dial up is no problem for Aminet, except when I write my answers offline and go online just to post them

How come? Is it too slow? If so, upgrade! Get some cheap cable connection.

Quote:

Originally Posted by meynaf

Besides, a command line utility which uses the lib to decode is usable only if it can decode partially (I don't want to decode to disk and decoded data would be huge in ram: ), and if it can support all different quality settings.

Maybe it can. I've only used it once, years ago. If it can't, can't you cut the first, say, 50kb of a mp3, and decode just that part?

Quote:

Originally Posted by meynaf

Also it's better to have the benchmark stuff right inside.

Certainly.

Quote:

Originally Posted by meynaf

Not for me ;-)

How odd...

Quote:

Originally Posted by meynaf

It's indeed very easy to write, because it's exactly what I did in my picture viewer's benchmark program !

Not much use if you want it to be part of the library itself. And if you also want it system friendly, then there's no shortcut.

Maybe I could encode 2mb worth of wav to mp3, so you can decode to ram. That way you could use an easy to write benchmark program.

Quote:

Originally Posted by meynaf

It could. Depends on what it can do.

Alright, I should have it on a cd around here somewhere. As far as I know, I just got it from aminet, but I should still have it. I'll find it.

Quote:

Originally Posted by meynaf

I don't know exactly. See label u9f54 in the main source ; this is the calling routine.

Alright, I'll do that.

Quote:

Originally Posted by meynaf

On the other hand, as I've given you the whole thing with everything needed to use it, you can just see for yourself (there are even debug macros to write to dff180 at start of main source, just to see).

Yes, I can, and I will. Interesting to see how this performs now, as well.

Quote:

Originally Posted by meynaf

They're not used in the blocks themselves, but they still contain the addresses where to output what's computed in those blocks !
So using them in the blocks would imply restoring them afterwards.

That's exactly what I meant. A couple of moves for getting rid of a whole bunch of muls, sounds good to me...

Quote:

Originally Posted by meynaf

Yes. But I was thinking of making a version for each. Perhaps 040 isn't fast enough for the highest settings, and even 060 users can enjoy a little spare cpu time ;-)

Good, because it's a bad idea to make a one size fits all version. Non optimal for all cpus.

Quote:

Originally Posted by meynaf

Of course, for now it's 030, but all 030-only modifications I do must be in if/endc pairs. Say, replacing muls by add/lsl isn't a winner at all for 060 (for 040 I frankly don't know !).

Ok, also clear

Quote:

Originally Posted by meynaf

I said removing one muls, not cutting the code in half ;-)

It's only so to speak

Quote:

Originally Posted by meynaf

However if you like tight loops to optimize, then you can have a look at the biggest cpu use contributor of the whole library : the subband window part ! (now located in subb_w.s)

Ah, now we're talking. I think I'm going to start there, and focus my efforts on mainly that code for now. Don't know if I'll find anything, though. It looks tough.

Quote:

Originally Posted by meynaf

EDIT: I've looked on Aminet but didn't find an mpeg decoder which uses mpega.library to decode to a file...

As said, I think I have it here somewhere. I'll upload it when I find it.

meynaf · 21 May 2009, 17:45

Quote:

Originally Posted by Thorham

Astounding

I've taken a look again, and I could only come up with a very small optimization. You know the code with the two btst instructions at the end? The bra at the end of that piece of code can be replaced by the code it jumps to:
(...)
It's very little, of course, but it's still faster. Can't believe I didn't spot that one sooner. Anyway, this is not what you're referring to, so please enlighten me.

There are several things that could be done. In fact 3 ! (and 3rd is alas incompatible with yours because the code would no longer fit in cache)

I'm telling you the first one now, and, same as you, I can't believe I didn't spot that one sooner !
Look at the move to a6. What's in the array ? Addresses. Which point to what ? 4 bytes.
Hey man, why not copy the data there directly and use lea ?
Even if the data is in cache (often in this case) it's 3 cycles gained !
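
Roughly, the idea looks like this. It is only a sketch: the base register a0 and index d0 are assumptions made for the example, and only a6 comes from the routine itself:

Code:

; before : the array holds pointers, each one pointing at 4 bytes of data
 move.l (a0,d0.w*4),a6		; read the pointer from the array

; after : the array holds the 4 bytes of data themselves
 lea (a0,d0.w*4),a6		; a6 = address of the data, no memory read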

Quote:

Originally Posted by Thorham

I have two. A dj (Tiesto) released his double cd album 'In search of sunrise' as two big mp3s on the net for free before the album was released on cd (or so the guy who gave me the music told me). Also, the music is one long mix, it's not a bunch of separate pieces of music.

Should be very long then

What is the bitrate ?

Quote:

Originally Posted by Thorham

Wish I was one of them

Would be pretty pointless on a 030 IMO, especially with this slow IDE controller we have.
Frankly if I had the choice I would go for a 3Ghz 68080

Quote:

Originally Posted by Thorham

You told me the code isn't in a tight loop, which I interpreted as not being called often, and it doesn't contain any loops itself. In such a case it wouldn't matter much, now would it

As I told you, that code is called quite a few times, not many, but it's such a huge bunch of muls that it takes its time anyway.

Quote:

Originally Posted by Thorham

Cool, thanks

I've taken a look at it, and I must say that it's looking a lot better than the last version I checked. Good job. Guess it can't really hurt to take a good look.

I plan on separating (in files) more things that need improvements, but vast areas are completely unknown to me (especially layer I & II).

Quote:

Originally Posted by Thorham

What's the difference between them anyway?

asr keeps the sign where lsr inserts only zeroes, e.g. lsr.b #1 on $80 gives $40, but asr would give $C0. You can see this as asr being signed (half of -128 is -64, not 64).
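
The same example written out as code (d0 and d1 are arbitrary registers):

Code:

 move.b #$80,d0
 lsr.b #1,d0			; d0.b = $40 : a zero is shifted in at the top
 move.b #$80,d1
 asr.b #1,d1			; d1.b = $C0 : the sign bit is copied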

The difference between lsl and asl is much more subtle (but there is one - and you don't need it).

Quote:

Originally Posted by Thorham

Almost makes it seem not worth the trouble

Probably. But when I discovered that 6 muls by a,b,c,d,e,f constants could be done with only 4 because e=a+b and f=c+d, all that repeated 8 times, that made 16 muls removed
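
A sketch of the principle for one such pair, using constants from the earlier posts ($3b21 = $32c6 + $085b). It assumes the same input value in d4.w is multiplied by all three constants; the registers are chosen for illustration and this is not the actual block from mdct_l.s:

Code:

 move.w d4,d0
 muls #$32c6,d0			; d0 = x * $32c6
 move.w d4,d1
 muls #$085b,d1			; d1 = x * $085b
 move.l d0,d2
 add.l d1,d2			; d2 = x * $3b21, one add instead of a third muls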

Quote:

Originally Posted by Thorham

About formatting, you should really do something about that, IMHO. Tabs are better than spaces, and a few extra blank lines here and there can't hurt, either. Just my opinion.

My reason for not using tabs : I put comments at the right, and those are often too long to fit in the line if using tabs. But a lot of people don't put many comments in their code so they don't understand that

About extra blank lines : okay, but please give me examples on where.

Quote:

Originally Posted by Thorham

Ok, that's clear now.

I hope

Quote:

Originally Posted by Thorham

How do you actually define tight loops? Perhaps we see them in a different way.

For me it's code which fits in the cache and is responsible for an important part of the overall time. How do you define them ?

Quote:

Originally Posted by Thorham

Oh, but that's actually quite often. I guess it wasn't entirely clear to me how many times that code is executed, hence me harping on about how it doesn't matter because it's not in a tight loop

If you have any doubt, remember that you can now check it by yourself : you have the whole code and means to assemble and run it !
(and remember to flush libs between attempts - you'll also need to eject and reload the module before new settings can take effect)

Quote:

Originally Posted by Thorham

How come? Is it too slow? If so, upgrade! Get some cheap cable connection.

It's not a matter of being slow, it's a matter of not being free (even though overall it's still much cheaper than DSL), so I keep it online only when I'm actually using it.

Quote:

Originally Posted by Thorham

Maybe it can. I've only used it once, years ago. If it can't, can't you cut the first, say, 50kb of a mp3, and decode just that part?

Yeah, it's perfectly doable this way.

Quote:

Originally Posted by Thorham

How odd...

Perhaps it looks odd. But my usual includes (you know them now) allow me to do that in two lines of code, without worrying about removing it at the end or saving regs in the int itself.

Quote:

Originally Posted by Thorham

Not much use if you want it to be part of the library itself. And if you also want it system friendly, then there's no short cut.

I did not mean part of the library, but part of the benchmark program.

Quote:

Originally Posted by Thorham

Maybe I could encode 2mb worth of wav to mp3, so you can decode to ram. That way you could use an easy to write benchmark program.

Cutting is probably a better idea here.
However, if you feel like encoding something, a few mp1/mp2 files could be of use.

Quote:

Originally Posted by Thorham

Alright, I should have it on a cd around here somewhere. As far as I know, I just got it from aminet, but I should still have it. I'll find it.

It seems that the "mpega" program (which is a player) is able to decode to disk, too. Was it that program?

Quote:

Originally Posted by Thorham

Alright, I'll do that.

Seen anything now ?

Quote:

Originally Posted by Thorham

Yes, I can, and I will. Interesting to see how this performs now, as well.

Try with the default settings I've given and you'll see

Quote:

Originally Posted by Thorham

That's exactly what I meant. A couple of moves for getting rid of a whole bunch of muls, sounds good to me...

If it sounds good... just do it !

Quote:

Originally Posted by Thorham

Good, because it's a bad idea to make a one size fits all version. Non optimal for all cpus.

Well, some changes can be better for all cpus, but things are very different when it comes to muls vs shifts.

Quote:

Originally Posted by Thorham

Ok, also clear

What's not clear is which parts are the best for non-030 !

Quote:

Originally Posted by Thorham

It's only so to speak

What ? Don't tell me you can't reduce that to half

Quote:

Originally Posted by Thorham

Ah, now we're talking. I think I'm going to start there, and focus my efforts on mainly that code for now. Don't know if I'll find anything, though. It looks tough.

It IS tough. But if something significant were done here, it would dramatically reduce the speed impact of the "quality" setting.

In fact, it is the only routine that changes at all when this setting changes!

Quote:

Originally Posted by Thorham

As said, I think I have it here somewhere. I'll upload it when I find it.

If this is the "mpega" program then I have it.
Now we need some optims to check

14 May 2009, 15:42	#112
StingRay move.l #$c0ff33,throat Join Date: Dec 2005 Location: Berlin/Joymoney Posts: 6,863	I know, just didn't want to mention it as it doesn't work in all Asm1 versions. (AsmPro doesn't support it at all) Last edited by StingRay; 14 May 2009 at 16:01.

14 May 2009, 17:37	#113
Thorham Computer Nerd Join Date: Sep 2007 Location: Rotterdam/Netherlands Age: 47 Posts: 3,806	Edited: Sorry for the mistake, but this code only works if it can't be preempted, or if you have a spare address register to swap with a7.

A little update on the table thing. I think this is going to be as good as it's going to get:

Code:

 move.w xx(a0),a4
 add.l a4,a4
 add.l a4,a4
 add.l (sp)+,a4
 move.l (a4),d3
 add.l (sp)+,a4
 move.l (a4),d4
 add.l (sp)+,a4
 move.l (a4),d5
 add.l (sp)+,a4
 move.l (a4),d6
 subq.l #8,sp
 subq.l #8,sp

That is if the mem reads take four cycles each. In that case it's only 40 cycles, not counting the first instruction, but I could be dead wrong.

Edited: I checked the code with my speed testing program and my table reading idea is only, count 'em, two whole frames faster for a million loop iterations: yours is 67 frames, mine is 65 frames. In other words: do not bother, coz it sucks.

Well, it sucks in this case. It might be useful in cases where you need every bit of speed you can get, such as in demos. And what's bad as well is that my table reading idea needs the stack to be set up properly. Probably no significant overhead, but it's just not handy, especially not given the small speed increase.

Last edited by Thorham; 15 May 2009 at 11:16.

15 May 2009, 12:03	#114
Thorham Computer Nerd Join Date: Sep 2007 Location: Rotterdam/Netherlands Age: 47 Posts: 3,806	Ok, I've looked at the file mdct_l.s a little closer, and there are several constants, namely 85b, 187e, 26f6 and 32c6, which are used over and over again. Furthermore, in those code blocks (up to and including ; u7) registers a1, a2 and a3 are free. Seems to me it would make at least some sense to use simple tables for three of those constants, including 187e, which is used more often than the other ones. You can then just optimize the simplest constant in the usual way. Of course, if you plan to use the optimization you've shown me in all those code blocks, then it might be pointless.

Now I don't want to sound like I'm insisting on tables, but it might be possible to get a reasonable gain without making tables for all the constants. It just seems to me that there may be a possible balance between the usual optimizations and tables.

Edited: Trivial, but here's an example:

Code:

 move.w d4,d3
 muls #$32c6,d3

becomes

Code:

 move.l (a1,d4.w*4),d3

Edited2: Sorry for all the edits, but isn't movem.l a1-a3,-(sp) slower than three separate moves?

Last edited by Thorham; 15 May 2009 at 12:18.
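For readers who want to see what the table behind that lookup would hold, here is a rough C model. It is only a sketch under assumptions: the names are invented, and it assumes the base register points at the middle of the table so that a sign-extended 16-bit index reaches negative inputs too. The post itself doesn't spell out the table layout.

```c
#include <stdint.h>

#define MUL_CONST 0x32c6    /* the constant from the muls #$32c6 example */

/* One 32-bit product per possible 16-bit input: 65536 * 4 bytes = 256 KB. */
static int32_t mul_table[65536];

/* Pointer to the middle entry (assumed role of a1), so that indices
   -32768..32767 all land inside the table. */
static const int32_t *const mul_mid = mul_table + 32768;

/* Fill the table once, before the decoding loop runs. */
static void init_mul_table(void)
{
    for (int32_t x = -32768; x <= 32767; x++)
        mul_table[x + 32768] = x * MUL_CONST;
}

/* Replaces the move.w / muls pair with a single indexed read. */
static int32_t mul_lookup(int16_t x)
{
    return mul_mid[x];
}
```

Call `init_mul_table()` once at startup; after that `mul_lookup(x)` returns the same value as `x * 0x32c6`. Whether the memory read actually beats the multiply depends on the CPU and its caches.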


02 May 2009, 11:44	#103
Thorham Computer Nerd Join Date: Sep 2007 Location: Rotterdam/Netherlands Age: 47 Posts: 3,806	Off-topic: After looking at myself and realizing I had a big drinking problem, I decided to do something about it, and gave up the stuff completely. I had come to a point where I didn't finish projects and didn't finish code I said I would write for people, all because my drinking was consuming my energy. Of course, while drinking, I had plenty of energy, but just try to code when you're in between sober and drunk: it's bloody hard, and it takes an hour to do 15 minutes' worth of work properly. This had to stop.

It also meant I started to lie about work I should have done already. People will probably remember the PNG codec and the scaling loops I promised meynaf to write. Although I started them, I, of course, didn't finish them and had little to show for it (I PMed meynaf about that). The same goes for another project I'm not naming here. I'll write a piece in the appropriate thread, but it comes down to the same problem.

It wasn't easy to kick the habit, and from what I've heard, I should feel lucky, because I wasn't physically addicted to alcohol. It's over and done now. However, I'm not going to get into any more projects unless I can finish them in a reasonable amount of time. It's all too easy to have one project after the other pile up on top of each other, and this is another bad habit I'm breaking.

Glad to be back, and a big sorry to meynaf for putting up with my nonsense, it won't happen again mate!

02 May 2009, 16:05	#104
meynaf son of 68k Join Date: Nov 2007 Location: Lyon / France Age: 51 Posts: 5,350	If it won't happen again, then let's code! I'm afraid it's a little bit too late for the PNG code. But I have other projects I need help with, and you probably have other projects of your own.

I've started to work again on my optimized version of mpega.library and got some results. I can't say I've understood everything in that bunch of muls, but some parts are known well enough to be rewritten - and indeed were. Any specialist in muls removal is welcome.

The current part is the Inverse Modified Discrete Cosine Transform for Long blocks (imdct_l). The actual routine suffers from register shortage (I would need 19 of them). Perhaps I can post it if someone is interested?

07 May 2009, 15:54	#105
Thorham Computer Nerd Join Date: Sep 2007 Location: Rotterdam/Netherlands Age: 47 Posts: 3,806	Okay, meynaf, post it! A link to the library source with includes (if any) would be nice. It can never hurt to take a look, but I can't promise anything. Do you still feel like continuing the discussion in this thread? If so, I can continue right where we left off.

09 May 2009, 09:52	#106
meynaf son of 68k Join Date: Nov 2007 Location: Lyon / France Age: 51 Posts: 5,350	Well, I think that the best place to talk about this project is the old mpega thread I started. Or maybe a new one? With a thread on that subject, more (interested) people will read it.

I'll post the complete code if you need to make some tests. For now you can have a look at the actual dct code: http://meynaf.free.fr/tmp/mdct_l.s

Or at the subband filter: http://meynaf.free.fr/tmp/subb_f.s

I have extracted them out of the main bulk to ease tests on them. I think it's enough to keep you busy for a while.

12 May 2009, 11:33	#107
Thorham Computer Nerd Join Date: Sep 2007 Location: Rotterdam/Netherlands Age: 47 Posts: 3,806	After my initial look at those muls instructions (things like muls #$187e,dx), I thought: why not just make a, wait for it, table... I know you probably won't like 256kb tables, but before you dismiss them, read further.

Advantages:
- Tables offer better optimization for large constants.
- Tables use up fewer instructions in tight loops, which potentially helps the cache.
- If the program can be made resident because it's pure, the tables only have to be set up or read once, if I'm not mistaken.
- One to two megabytes of overhead doesn't hurt configurations like yours or mine. Since the program has little to no use on bare machines, this overhead doesn't have to be a problem.

Disadvantage:
- You don't like large tables.

Now for the muls expert thing. I'm not an expert, but it's almost the same as for mulu. All you need to do is ext.l the register, and apply normal shifts, adds and subs. If the constant is negative, just neg.l the end result. I know that this isn't very helpful, but unless muls is a lot slower than mulu, I'd just go for a couple of tables.

By the way, you forgot to answer about continuing this thread in the way it was going.
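The shift/add/sub recipe from the post above can be modelled in C roughly like this. It is a sketch only: the function names are invented, and the particular split 0x187e = 8192 - 2048 + 128 - 2 is one possible decomposition, not necessarily the one either poster would pick.

```c
#include <stdint.h>

/* x * 0x187e without a multiply, using 0x187e = 8192 - 2048 + 128 - 2. */
static int32_t mul_187e(int16_t x)
{
    /* Sign-extend to 32 bits (the ext.l step). Unsigned arithmetic keeps
       the shifts well defined in C for negative inputs. */
    uint32_t v = (uint32_t)(int32_t)x;
    uint32_t r = (v << 13) - (v << 11) + (v << 7) - (v << 1);

    /* Convert back to signed; assumes the usual two's complement
       conversion that mainstream compilers provide. */
    return (int32_t)r;
}

/* Negative constant: compute the positive product, then negate it
   (the neg.l step). */
static int32_t mul_minus_187e(int16_t x)
{
    return -mul_187e(x);
}
```

Four shifts and three adds/subs replace one muls here, so whether this wins depends on how slow muls and shifts are on the target CPU.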

14 May 2009, 15:37	#111
Leffmann Join Date: Jul 2008 Location: Sweden Posts: 2,269	Just want to point out that indirect with index and base displacement is doable in ASM-one, you just have to put the base displacement last: move.b (a0, d0.l, $1000.w), d0
