English Amiga Board

Thorham · 23 May 2009, 12:10

Quote:

Originally Posted by meynaf

There are several things that could be done. Three, in fact! (And the third is alas incompatible with yours, because the code would no longer fit in the cache.)

Pity, but it seems to save little, so it doesn't matter much.

Quote:

Originally Posted by meynaf

I'm telling you the first one now, and, same as you, I can't believe I didn't spot that one sooner!
Look at the move to a6. What's in the array? Addresses. Which point to what? 4 bytes.
Hey man, why not copy the data there directly and use lea?
Even if the data is in the cache (as it often is in this case), that's 3 cycles gained!

Right, I get it. But that means an indexed lea is faster than an indexed move, that's great! Now tell me the other ones, please.
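
To make the idea concrete, here's a rough Python model (not Amiga code) of the two table layouts. The `Memory` class, the addresses and the function names are all made up for illustration, and it treats each 4-byte record as one long, while the real loop reads it byte by byte. It only counts memory reads and says nothing about actual cycle counts.

```python
import struct

class Memory:
    """Tiny big-endian memory model that counts long reads (hypothetical helper)."""
    def __init__(self, size):
        self.data = bytearray(size)
        self.reads = 0

    def read_long(self, addr):
        self.reads += 1
        return struct.unpack_from(">I", self.data, addr)[0]

    def write_long(self, addr, value):
        struct.pack_into(">I", self.data, addr, value)

def lookup_via_pointer(mem, table_base, index):
    # Old layout: the table slot holds the ADDRESS of the 4-byte record,
    # like move.l (a5,d4.l*4),a6 followed by a read through a6.
    record_addr = mem.read_long(table_base + index * 4)  # extra memory read
    return mem.read_long(record_addr)

def lookup_inline(mem, table_base, index):
    # New layout: the 4-byte record sits in the table slot itself,
    # like lea (a5,d4.l*4),a6, which only computes an address.
    record_addr = table_base + index * 4  # no memory read here
    return mem.read_long(record_addr)

def build_demo():
    """Pointer table at 0, inline table at 16, pointed-to records at 32."""
    mem = Memory(64)
    records = [0x11223344, 0x55667788, 0x99AABBCC]
    for i, rec in enumerate(records):
        mem.write_long(32 + 4 * i, rec)      # record storage
        mem.write_long(0 + 4 * i, 32 + 4 * i)  # pointer to the record
        mem.write_long(16 + 4 * i, rec)      # record copied into the table
    return mem, records
```

Both lookups return the same record, but the inline version gets there with one memory read instead of two, and the table doesn't grow because an address and a record are both 4 bytes.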

Quote:

Originally Posted by meynaf

Should be very long then

What is the bitrate ?

Yes, they're long. They're an hour of continuous music each. They were originally encoded in mp3 format, and the guy I got them from burned them to CD, so I encoded them at 320 kbit/s to keep the quality good. They're both 182 MB. Still a lot better than 650 MB per CD.

Quote:

Originally Posted by meynaf

Would be pretty pointless on a 030 IMO, especially with this slow IDE controller we have.

Yeah, the interface sucks, but only because it dumps data to chip mem, which then has to be copied to fast mem. That's the real problem. If only it left the CPU idle and used little memory bandwidth, it would've been a lot better.

Quote:

Originally Posted by meynaf

Frankly, if I had the choice I would go for a 3 GHz 68080

Not me. You might as well get a peecee with those speeds. Trying to get things fast on a 50 MHz '030 is part of the fun!

Quote:

Originally Posted by meynaf

I plan on separating (in files) more things that need improvements, but vast areas are completely unknown to me (especially layer I & II).

But aren't layers one and two video layers? You shouldn't really need those for mp3s as far as I know.

Quote:

Originally Posted by meynaf

asr keeps the sign where lsr inserts only zeroes, e.g. lsr.b #1 on $80 gives $40, but asr would give $C0. You can see this as asr being signed (half of -128 is -64, not 64).

That's very interesting, didn't know that, thanks.
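
For reference, a small Python model of the two byte-sized shifts (a sketch only: the function names are made up, and it models just the result, not the condition codes):

```python
def lsr_b(value, count):
    """Logical shift right on a byte: zeroes enter from the left."""
    return (value & 0xFF) >> count

def asr_b(value, count):
    """Arithmetic shift right on a byte: bit 7 (the sign) is copied in."""
    value &= 0xFF
    # Reinterpret the byte as a signed number in -128..127.
    signed = value - 0x100 if value & 0x80 else value
    # Python's >> on a negative int is arithmetic, so the sign is kept.
    return (signed >> count) & 0xFF

print(hex(lsr_b(0x80, 1)))  # 0x40
print(hex(asr_b(0x80, 1)))  # 0xc0
```

For values with bit 7 clear, both give the same result. They only differ when the byte is negative.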

Quote:

Originally Posted by meynaf

Probably. But when I discovered that 6 muls by the constants a,b,c,d,e,f could be done with only 4 because e=a+b and f=c+d, all that repeated 8 times, that removed 16 muls.

Now that's optimizing!
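
Here's how I understand the trick, as a Python sketch. The constants are the ones from the comment in the code further down ($3b21=$32c6+$85b and $3f74=$26f6+$187e). Which of them is a, b, c or d is my own guess, and the function names are made up.

```python
# Constants taken from the code comment; the letter assignment is an assumption.
A, B = 0x32C6, 0x085B
C, D = 0x26F6, 0x187E
E, F = 0x3B21, 0x3F74   # E = A + B, F = C + D

def six_muls(x):
    """Straightforward version: one multiply per constant."""
    return (x * A, x * B, x * C, x * D, x * E, x * F)

def four_muls(x):
    """Only four multiplies: x*E = x*A + x*B and x*F = x*C + x*D."""
    xa, xb, xc, xd = x * A, x * B, x * C, x * D
    return (xa, xb, xc, xd, xa + xb, xc + xd)

print(six_muls(1000) == four_muls(1000))  # True
```

Two multiplies are traded for two adds per pass, so eight passes remove the 16 muls mentioned above.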

Quote:

Originally Posted by meynaf

My reason for not using tabs: I put comments at the right, and those are often too long to fit in the line if using tabs. But a lot of people don't put many comments in their code, so they don't understand that.

If I comment code properly, I usually just split comments over several lines. If it doesn't fit, I'll compact the comment. Also, English comments are usually short enough.

Quote:

Originally Posted by meynaf

About extra blank lines: okay, but please give me examples of where.

This:

Code:

; 2. values 2,e,14,20
; uses the fact that 3b21=32c6+85b and 3f74=26f6+187e
 move.l d1,a4			; z0
 move.l d0,a5			; zc
 move.l d1,a6			; z14
 move.l d0,d2			; z10
 neg.l d1			; z4
 neg.l d0			; z8

 move.w 2(a0),d4
 beq.s .l0

 move.w d4,d3
 muls #$32c6,d3
 sub.l d3,a4			; z0
 sub.l d3,d1			; -32c6

 move.w d4,d3
 muls #$26f6,d3
 sub.l d3,a5			; zc
 sub.l d3,d0			; -26f6

 move.w d4,d3
 muls #$187e,d3
 sub.l d3,d2			; z10
 sub.l d3,d0			; -26f6 -187e -> -3f74

 muls #$85b,d4
 sub.l d4,a6			; z14
 sub.l d4,d1			; -32c6 - 85b -> -3b21

And this:

Code:

.loop

 ifne round
	moveq #2,d1
	moveq #2,d2
	moveq #2,d3
	add.b (a0)+,d1
	subx.b d4,d4			; 00 if ok, FF on overflow
	or.b d4,d1			; unchanged if ok, FF on overflow
	add.b (a0)+,d2
	subx.b d4,d4
	or.b d4,d2
	add.b (a0)+,d3
	subx.b d4,d4
	or.b d4,d3
 else					; round=0: no rounding, just truncate
	move.b (a0)+,d1
	move.b (a0)+,d2
	move.b (a0)+,d3
 endc
;
; Do this in parts later.
;
; moveq #-4,d4			; fc
; and.b d4,d1
; and.b d4,d2
; and.b d4,d3
;
 move.l d1,d4			; rrrr....
 lsl.l #4,d4			; rrrr....0000
 move.b d2,d4			; rrrrvvvv....
 lsl.l #4,d4			; rrrrvvvv....0000
 move.b d3,d4			; rrrrvvvvbbbb....
 lsr.l #4,d4			; rrrrvvvvbbbb
 move.l (a5,d4.l*4),a6
;
; Register swap and dec instead of
; inc (modify table gen a little)
;
 move.l d3,d6
 sub.b -(a6),d6
 bcc.s .n0
 neg.b d6

.n0
 move.l d2,d5
 sub.b -(a6),d5
 bcc.s .n1
 neg.b d5

.n1
 move.l d1,d0
 sub.b (a6),d0
 bcc.s .n2
 neg.b d0

.n2
 add.l d5,d0			; r+v
 add.l d0,d0			; (r+v)*2 = 2r+2v
 add.l d5,d0			; 2r+3v
 add.l d6,d0			; 2r+3v+1b

 move.l d1,d4
 sub.l a2,d4
 bpl.s .n3
 neg.l d4

.n3
 add.l d4,d4			; r *2
 move.l d2,d5
 sub.l a3,d5
 bpl.s .n4
 neg.l d5

.n4
 move.l d5,d6
 add.l d5,d5
 add.l d6,d5			; v *3
 move.l d3,d6
 sub.l a4,d6
 bpl.s .n5
 neg.l d6
.n5						; b *1

 add.l d4,d5			; d4=r d5=r+v d6=b
 add.l d6,d4			; r+b r+v b
 add.l d6,d6			; r+b r+v 2b
 add.l d5,d6			; r+b r+v 2b+r+v
 beq.s .vbrb			; all together = 0 -> vbrb
 sub.l d4,d6			; r+b r+v v+b

; d6=r d5=b d4=v d0=f
 cmp.l d4,d6
 bls.s .br
 cmp.l d4,d5
 bls.s .bx
 cmp.l d4,d0
 bls.s .fi

.ve
 moveq #-4,d4		;Moved (see above)
 and.b d4,d2

 move.l d2,a3
 addq.b #3,d2
 move.b d2,(a1)+
 dbf d7,.loop
 rts

.br
 cmp.l d6,d5
 bls.s .bx
 cmp.l d6,d0
 bls.s .fi

.ro
 moveq #-4,d4		;Moved
 and.b d4,d1

 move.l d1,a2
 addq.b #2,d1
 move.b d1,(a1)+
 dbf d7,.loop
 rts

.bx
 cmp.l d5,d0
 bls.s .fi

.bl
 moveq #-4,d4		;Moved
 and.b d4,d3

 move.l d3,a4
 addq.b #1,d3
 move.b d3,(a1)+
 dbf d7,.loop
 rts

.fi
 move.b (a6)+,d3	;Changed from dec to inc.
 move.b (a6)+,d2	;This is why the palette table
 move.b (a6)+,d1	;has to be altered a little.

 move.b (a6),(a1)+	;Moved for pipeline.

 move.l d1,a2
 move.l d2,a3
 move.l d3,a4
 dbf d7,.loop
 rts

.vbrb
 btst #0,d7				; 0,2 -> b
 beq.s .bl
 btst #1,d7				; 1 -> r
 beq.s .ro
 bra.s .ve				; 3 -> v	(3210->vbrb)
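
For anyone reading along, here's a rough Python model of the two tricks at the top of that loop: the saturating add done with subx/or, and the packing of the three bytes into a 12-bit table index. It assumes d1-d3 hold zero-extended bytes, and the function names are made up for illustration.

```python
def sat_add_byte(value, addend):
    """Models add.b / subx.b d4,d4 / or.b d4,d1 on a byte."""
    total = (value & 0xFF) + (addend & 0xFF)
    result = total & 0xFF              # add.b keeps only the low byte
    carry = total > 0xFF               # add.b sets X on overflow
    mask = 0xFF if carry else 0x00     # subx.b d4,d4 gives -X: 00 or FF
    return result | mask               # or.b forces FF when it overflowed

def pack_rgb444(r, v, b):
    """Models the move/lsl/move.b/lsl/move.b/lsr sequence step by step."""
    d4 = r & 0xFF                              # move.l d1,d4   rrrr....
    d4 = (d4 << 4) & 0xFFFFFFFF                # lsl.l #4,d4    rrrr....0000
    d4 = (d4 & 0xFFFFFF00) | (v & 0xFF)        # move.b d2,d4   rrrrvvvv....
    d4 = (d4 << 4) & 0xFFFFFFFF                # lsl.l #4,d4    rrrrvvvv....0000
    d4 = (d4 & 0xFFFFFF00) | (b & 0xFF)        # move.b d3,d4   rrrrvvvvbbbb....
    return d4 >> 4                             # lsr.l #4,d4    rrrrvvvvbbbb

def table_index_direct(r, v, b):
    """Same index written the obvious way: top nibble of each component."""
    return ((r >> 4) << 8) | ((v >> 4) << 4) | (b >> 4)

print(hex(sat_add_byte(0xFE, 2)))            # 0xff
print(hex(pack_rgb444(0xAB, 0xCD, 0xEF)))    # 0xace
```

Each move.b overwrites the low nibble left over from the previous component, which is why no and is needed before the table lookup. The index runs from 0 to $fff, so the table has 4096 entries of 4 bytes each.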

Quote:

Originally Posted by meynaf

For me they are code which fits in the cache and is responsible for an important part of the overall time. How do you define them?

Code which is executed a large number of times in a loop. It doesn't have to fit in the cache, and it can call outside routines if need be. Not much difference.

Quote:

Originally Posted by meynaf

If you have any doubt, remember that you can now check it yourself: you have the whole code and the means to assemble and run it!
(And remember to flush libs between attempts. You'll also need to eject and reload the module before new settings can take effect.)

Great, I forgot about the flush libs thing. Bah

Quote:

Originally Posted by meynaf

It's not a matter of being slow, it's a matter of not being free (even though overall it's still much cheaper than DSL), so I keep it online only when I'm actually using it.

That makes sense.

Quote:

Originally Posted by meynaf

Perhaps it looks odd. But my usual includes (you know them now) allow me to do that in two lines of code, without worrying about removing it at the end or saving regs in the int itself.

Two lines, huh? Then such a program should still be easy to write.

Quote:

Originally Posted by meynaf

I did not mean part of the library, but part of the benchmark program.

Okay.

Quote:

Originally Posted by meynaf

Cutting is probably a better idea here.
However if you feel like encoding something, a few mp1/mp2 could be of use.

I don't know if I have the software. If not I can probably download it. But, who uses mp1/mp2? My guess is no one.

Quote:

Originally Posted by meynaf

It seems that the "mpega" program (which is a player) is able to decode to disk, too. Was it that program ?

I think it is.

Quote:

Originally Posted by meynaf

Seen anything now ?

Not yet. This code is quite tough. Maybe I'll never spot a single thing, so I'm not promising anything. It's fun to try, though.

Quote:

Originally Posted by meynaf

If it sounds good... just do it !

Not yet. I'm going to keep looking first. Maybe I'll spot more interesting optimizations. If tables won't increase the speed much, it's perhaps better to try them later, when all else has failed.

Quote:

Originally Posted by meynaf

Well, some changes can be better for all cpus, but things are very different when it comes to muls vs shifts.

Yeah, that's true. For '060 you don't need shifts at all, as far as I know. Makes life simple, though.

Quote:

Originally Posted by meynaf

What's not clear is which parts are the best for non-030 !

I'd still forget about it for now, if I were you. Just focus on the cpu that needs optimizations the most.

Quote:

Originally Posted by meynaf

What ? Don't tell me you can't reduce that to half

Yeah, right, if only.

Quote:

Originally Posted by meynaf

It IS tough. But if something significant was done in here, it would dramatically reduce the speed impact of the "quality" setting.

Haven't seen anything so far...

Quote:

Originally Posted by meynaf

If this is the "mpega" program then I have it.
Now we need some optims to check

Yes, I think it is. Just try decoding a single three-minute song in good quality to wav and play it back. If it sounds good, it's the program.