Quote:
Originally Posted by meynaf
There are several things that could be done. In fact 3 ! (and 3rd is alas incompatible with yours because the code would no longer fit in cache)
|
Pity, but it seems to save little, so it doesn't matter much.
Quote:
Originally Posted by meynaf
I'm telling you the first one now, and, same as you, I can't believe I didn't spot that one sooner !
Look at the move to a6. What's in the array ? Addresses. Which point on what ? 4 bytes.
Hey man, why not directly copying the data there and use lea ?
Even if the data is in cache (often in this case) it's 3 cycles gained !
|
Right, I get it. But that means an indexed lea is faster than an indexed move, that's great! Now tell me the other ones, please.
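A minimal sketch of that change, assuming a5 is the table base and d4 the entry index, as in the loop quoted further down in this post, and assuming every entry is exactly 4 bytes so the *4 scale still applies (scaled indexing needs a 68020 or later):

```asm
; Before: the table holds longword pointers, each pointing at a 4-byte entry.
        move.l  (a5,d4.l*4),a6  ; reads the pointer from memory

; After: the 4-byte entries are stored in the table itself.
        lea     (a5,d4.l*4),a6  ; only computes the address, no memory read
```

Either way a6 ends up addressing the entry; the table generator has to be changed to store the data instead of the pointers.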
Quote:
Originally Posted by meynaf
Should be very long then
What is the bitrate ?
|
Yes, they're long. Each is an hour of continuous music. They were originally encoded in mp3 format, and the guy I got them from burned them to CD, so I encoded them at 320 kbit/s to keep the quality good. They're both 182 MB. Still a lot better than 650 MB per CD.
Quote:
Originally Posted by meynaf
Would be pretty pointless on a 030 IMO, especially with this slow IDE controller we have.
|
Yeah, the interface sucks, but only because it dumps data to chip mem, which then has to be copied to fast mem. That's the real problem. If only it left the CPU idle and used little memory bandwidth, it would've been a lot better.
Quote:
Originally Posted by meynaf
Frankly if I had the choice I would go for a 3Ghz 68080
|
Not me. You might as well get a peecee with those speeds. Trying to get things fast on a 50 MHz '030 is part of the fun!
Quote:
Originally Posted by meynaf
I plan on separating (in files) more things that need improvements, but vast areas are completely unknown to me (especially layer I & II).
|
But aren't layers one and two video layers? You shouldn't really need those for mp3s as far as I know.
Quote:
Originally Posted by meynaf
asr keeps the sign where lsr inserts only zeroes, e.g. lsr.b #1 on $80 gives $40, but asr would give $C0. You can see this as asr being signed (half of -128 is -64, not 64).
|
That's very interesting, didn't know that, thanks.
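Worked through with the values from the quote (a small sketch; d0 and d1 are just scratch registers picked for illustration):

```asm
        move.b  #$80,d0
        lsr.b   #1,d0           ; logical shift: a zero enters at the top, d0.b = $40

        move.b  #$80,d1
        asr.b   #1,d1           ; arithmetic shift: the sign bit is copied, d1.b = $C0
```

Read as signed bytes, $80 is -128 and $C0 is -64, so asr halves the value and keeps it negative.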
Quote:
Originally Posted by meynaf
Probably. But when I discovered that 6 muls by a,b,c,d,e,f constants could be done with only 4 because e=a+b and f=c+d, all that repeated 8 times, that made 16 muls removed
|
Now that's optimizing!
Quote:
Originally Posted by meynaf
My reason about not using tabs : I put comments at the right, and those are often too long to fit in the line if using tabs. But a lot of people don't put many comments in their code so they don't understand that
|
If I comment code properly, I usually just split comments over several lines. If it doesn't fit, I'll compact the comment. Also, English comments are usually short enough.
Quote:
Originally Posted by meynaf
About extra blank lines : okay, but please give me examples on where.
|
This:
Code:
; 2. values 2,e,14,20
; we use the fact that 3b21=32c6+85b and 3f74=26f6+187e
move.l d1,a4 ; z0
move.l d0,a5 ; zc
move.l d1,a6 ; z14
move.l d0,d2 ; z10
neg.l d1 ; z4
neg.l d0 ; z8
move.w 2(a0),d4
beq.s .l0
move.w d4,d3
muls #$32c6,d3
sub.l d3,a4 ; z0
sub.l d3,d1 ; -32c6
move.w d4,d3
muls #$26f6,d3
sub.l d3,a5 ; zc
sub.l d3,d0 ; -26f6
move.w d4,d3
muls #$187e,d3
sub.l d3,d2 ; z10
sub.l d3,d0 ; -26f6 -187e -> -3f74
muls #$85b,d4
sub.l d4,a6 ; z14
sub.l d4,d1 ; -32c6 - 85b -> -3b21
And this:
Code:
.loop
ifne round
moveq #2,d1
moveq #2,d2
moveq #2,d3
add.b (a0)+,d1
subx.b d4,d4 ; 00 if ok, FF on overflow
or.b d4,d1 ; unchanged if ok, FF on overflow
add.b (a0)+,d2
subx.b d4,d4
or.b d4,d2
add.b (a0)+,d3
subx.b d4,d4
or.b d4,d3
else ; round=0 no rounding, just truncate
move.b (a0)+,d1
move.b (a0)+,d2
move.b (a0)+,d3
endc
;
; Do this in parts later.
;
; moveq #-4,d4 ; fc
; and.b d4,d1
; and.b d4,d2
; and.b d4,d3
;
move.l d1,d4 ; rrrr....
lsl.l #4,d4 ; rrrr....0000
move.b d2,d4 ; rrrrvvvv....
lsl.l #4,d4 ; rrrrvvvv....0000
move.b d3,d4 ; rrrrvvvvbbbb....
lsr.l #4,d4 ; rrrrvvvvbbbb
move.l (a5,d4.l*4),a6
;
; Register swap and dec instead of
; inc (modify table gen a little)
;
move.l d3,d6
sub.b -(a6),d6
bcc.s .n0
neg.b d6
.n0
move.l d2,d5
sub.b -(a6),d5
bcc.s .n1
neg.b d5
.n1
move.l d1,d0
sub.b (a6),d0
bcc.s .n2
neg.b d0
.n2
add.l d5,d0 ; r+v
add.l d0,d0 ; (r+v)*2 = 2r+2v
add.l d5,d0 ; 2r+3v
add.l d6,d0 ; 2r+3v+1b
move.l d1,d4
sub.l a2,d4
bpl.s .n3
neg.l d4
add.l d4,d4 ; r *2
.n3
move.l d2,d5
sub.l a3,d5
bpl.s .n4
neg.l d5
.n4
move.l d5,d6
add.l d5,d5
add.l d6,d5 ; v *3
move.l d3,d6
sub.l a4,d6
bpl.s .n5
neg.l d6
.n5 ; b *1
add.l d4,d5 ; d4=r d5=r+v d6=b
add.l d6,d4 ; r+b r+v b
add.l d6,d6 ; r+b r+v 2b
add.l d5,d6 ; r+b r+v 2b+r+v
beq.s .vbrb ; all together = 0 -> gbrb
sub.l d4,d6 ; r+b r+v v+b
; d6=r d5=b d4=v d0=f
cmp.l d4,d6
bls.s .br
cmp.l d4,d5
bls.s .bx
cmp.l d4,d0
bls.s .fi
.ve
moveq #-4,d4 ;Moved (see above)
and.b d4,d2
move.l d2,a3
addq.b #3,d2
move.b d2,(a1)+
dbf d7,.loop
rts
.br
cmp.l d6,d5
bls.s .bx
cmp.l d6,d0
bls.s .fi
.ro
moveq #-4,d4 ;Moved
and.b d4,d1
move.l d1,a2
addq.b #2,d1
move.b d1,(a1)+
dbf d7,.loop
rts
.bx
cmp.l d5,d0
bls.s .fi
.bl
moveq #-4,d4 ;Moved
and.b d4,d3
move.l d3,a4
addq.b #1,d3
move.b d3,(a1)+
dbf d7,.loop
rts
.fi
move.b (a6)+,d3 ;Changed from dec to inc.
move.b (a6)+,d2 ;This is why the palette table
move.b (a6)+,d1 ;has to be altered a little.
move.b (a6),(a1)+ ;Moved for pipeline.
move.l d1,a2
move.l d2,a3
move.l d3,a4
dbf d7,.loop
rts
.vbrb
btst #0,d7 ; 0,2 -> b
beq.s .bl
btst #1,d7 ; 1 -> r
beq.s .ro
bra.s .ve ; 3 -> v (3210->vbrb)
Quote:
Originally Posted by meynaf
For me they are code which fit in the cache and is responsible of an important part of the overall time. How do you define them ?
|
Code which is executed a large number of times in a loop. It doesn't have to fit in the cache, and it can call outside routines if need be. Not much difference.
Quote:
Originally Posted by meynaf
If you have any doubt, remember that you can now check it by yourself : you have the whole code and means to assemble and run it !
(and remember to flush libs between attempts - you'll also need to eject and reload the module before new settings can take effect)
|
Great, I forgot about the flush libs thing. Bah
Quote:
Originally Posted by meynaf
It's not a matter of being slow, it's a matter of being not free (even though at overall it's still much cheaper than a DSL), so I keep it online only when I'm actually using it.
|
That makes sense.
Quote:
Originally Posted by meynaf
Perhaps it looks odd. But my usual includes (you know them now ) allow me to do that in two lines of code, without worrying about removing it at the end or saving regs in the int itself.
|
Two lines, huh? Then such a program should still be easy to write.
Quote:
Originally Posted by meynaf
I did not mean part of the library, but part of the benchmark program.
|
Okay.
Quote:
Originally Posted by meynaf
Cutting is probably a better idea here.
However if you feel like encoding something, a few mp1/mp2 could be of use.
|
I don't know if I have the software. If not I can probably download it. But, who uses mp1/mp2? My guess is no one.
Quote:
Originally Posted by meynaf
It seems that the "mpega" program (which is a player) is able to decode to disk, too. Was it that program ?
|
I think it is.
Quote:
Originally Posted by meynaf
Seen anything now ?
|
Not yet. This code is quite tough. Maybe I'll never spot a single thing, so I'm not promising anything. It's fun to try, though.
Quote:
Originally Posted by meynaf
If it sounds good... just do it !
|
Not yet. I'm going to keep looking first. Maybe I'll spot more interesting optimizations. If tables won't increase the speed much, it's perhaps better to try them later, when all else has failed.
Quote:
Originally Posted by meynaf
Well, some changes can be better for all cpus, but things are very different when it comes to muls vs shifts.
|
Yeah, that's true. For '060 you don't need shifts at all, as far as I know. Makes life simple, though.
Quote:
Originally Posted by meynaf
What's not clear is which parts are the best for non-030 !
|
I'd still forget about it for now, if I were you. Just focus on the cpu that needs optimizations the most.
Quote:
Originally Posted by meynaf
What ? Don't tell me you can't reduce that to half
|
Yeah, right, if only.
Quote:
Originally Posted by meynaf
It IS tough. But if something significant was done in here, it would dramatically reduce the speed impact of the "quality" setting.
|
Haven't seen anything so far...
Quote:
Originally Posted by meynaf
If this is the "mpega" program then I have it.
Now we need some optims to check
|
Yes, I think it is. Just try decoding a single three-minute song in good quality to WAV and play it back. If it sounds good, it's the program.