Anyone up for an ASM coding competition? - Page 12

Thorham · 09 June 2016, 21:10

Quote:

Originally Posted by meynaf

Part of mpeg audio layer 3 (my accelerated mpega.library) ; huff quad decode (a part simple enough to be submitted here).

Why not just use and + table read? It's a 512 byte table. Doesn't seem excessive.

meynaf · 09 June 2016, 21:18

Quote:

Originally Posted by Thorham

Why not just use and + table read? It's a 512 byte table. Doesn't seem excessive.

I don't get it. Do you mean 4 table reads each giving 16 bits ? Or 2 table reads each giving 32 bits ?
What would it look like when turned into code ?

Thorham · 09 June 2016, 22:22

Quote:

Originally Posted by meynaf

I don't get it. Do you mean 4 table reads each giving 16 bits ? Or 2 table reads each giving 32 bits ?
What would it look like when turned into code ?

I made a mistake there, but this should be nice and simple, and it's still only a 512 byte table:

Code:

    clr.w   d6
    move.b  d0,d6
    move.w  table(pc,d6.w*2),(a1)+

meynaf · 09 June 2016, 22:42

That's not enough. You did only one value this way and there are 4 to do.

Leffmann · 09 June 2016, 23:25

Try a sequence of 4 bfexts d1{x:2}, d6 + move.w d6, (a1)+.

Thorham · 09 June 2016, 23:59

Quote:

Originally Posted by meynaf

That's not enough. You did only one value this way and there are 4 to do.

No, it does four values in one go, hence the 512 byte table. There's no need to look up each bit pair individually, you can just do all four at once.

Don_Adan · 10 June 2016, 00:56

Thorham's idea is the best and fastest. The code could look like this:

Code:

    moveq #31,D6
    and.b D0,D6
    move.l Table(PC,D6.W*4),(A1)+

The table will be 128x4.

meynaf · 10 June 2016, 08:15

Quote:

Originally Posted by Leffmann

Try a sequence of 4 bfexts d1{x:2}, d6 + move.w d6, (a1)+.

Assuredly the shortest solution

Unfortunately bfexts is slow (10 clocks on 030).

Quote:

Originally Posted by Thorham

No, it does four values in one go, hence the 512 byte table. There's no need to look up each bit pair individually, you can just do all four at once.

All four at once give 64 bits of data (4 16-bit blocks). You can't read that in one go. The code you presented does only one 16-bit value.
To fetch the whole 64 bits (8 bytes) you need a table of 256*8=2048 bytes.

Please present the full code doing it, to prevent any misunderstanding.

Quote:

Originally Posted by Don_Adan

Thorham's idea is the best and fastest. The code could look like this:

moveq #31,D6
and.b D0,D6
move.l Table(PC,D6.W*4),(A1)+

Table will be 128x4.

Full code please

This thread isn't for theoretical ideas, it's for submitting working code directly, if i'm not mistaken. So please submit full working code.

Leffmann · 10 June 2016, 08:56

If you want the fastest and still keep it within 100 bytes, it should be two lookups from a 64-byte table:

Code:

moveq    #15, d6
and.b    d1, d6
move.l   (table, pc, d6.w*4), (4, a1)
move.b   d1, d6
lsr.b    #4, d6
move.l   (table, pc, d6.w*4), (a1)
addq.l   #8, a1
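For anyone who wants to check the two-lookup idea off the Amiga, here is a minimal C model of it. It is only a sketch: the table contents are not given in the post, so the layout below (16 entries of two 16-bit words, each word a sign-extended 2-bit field, high pair first) is an assumption inferred from the other code in this thread. The names `nibble_table`, `sext2`, `build_nibble_table` and `quad_two_lookups` are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout: 16 entries x 2 words x 2 bytes = 64 bytes. */
int16_t nibble_table[16][2];

/* Sign-extend a 2-bit field: 00 -> 0, 01 -> 1, 10 -> -2, 11 -> -1. */
int16_t sext2(unsigned f)
{
    f &= 3;
    return (int16_t)((f & 2) ? (int)f - 4 : (int)f);
}

void build_nibble_table(void)
{
    assert(sizeof nibble_table == 64);
    for (unsigned n = 0; n < 16; n++) {
        nibble_table[n][0] = sext2(n >> 2);   /* upper pair of the nibble */
        nibble_table[n][1] = sext2(n);        /* lower pair of the nibble */
    }
}

/* Model of the posted sequence; out[] plays the role of (a1). */
void quad_two_lookups(uint8_t d1, int16_t out[4])
{
    unsigned d6 = d1 & 15;            /* moveq #15,d6 / and.b d1,d6   */
    out[2] = nibble_table[d6][0];     /* move.l (table,...),(4,a1)    */
    out[3] = nibble_table[d6][1];
    d6 = d1 >> 4;                     /* move.b d1,d6 / lsr.b #4,d6   */
    out[0] = nibble_table[d6][0];     /* move.l (table,...),(a1)      */
    out[1] = nibble_table[d6][1];
}
```

The low nibble is looked up first but stored second, which is why the asm writes to (4,a1) before (a1).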

meynaf · 10 June 2016, 09:11

Not bad. But compare it to :

Code:

 rept 4
 add.b d1,d1
 subx.w d6,d6
 add.b d1,d1
 addx.w d6,d6
 move.w d6,(a1)+
 endr
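The flag trick above can be followed step by step in a small C model. This is a sketch under assumptions: the original problem statement is not on this page, so the expected mapping (each 2-bit field sign-extended to a word, b7..b6 first) is inferred from the bfexts and table suggestions. The variable `x` stands for the 68k X flag, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Reference: sign-extend each 2-bit field of b to a 16-bit word,
   highest pair (b7..b6) first: 00 -> 0, 01 -> 1, 10 -> -2, 11 -> -1. */
void quad_ref(uint8_t b, int16_t out[4])
{
    for (int i = 0; i < 4; i++) {
        int field = (b >> (6 - 2 * i)) & 3;
        out[i] = (int16_t)(field >= 2 ? field - 4 : field);
    }
}

/* Instruction-by-instruction model of the rept 4 body. */
void quad_flags(uint8_t d1, int16_t out[4])
{
    int16_t d6 = 0x1234;  /* arbitrary: subx.w d6,d6 cancels the old value */
    for (int i = 0; i < 4; i++) {
        int x = d1 >> 7;                 /* add.b d1,d1 : b7 goes to X   */
        d1 = (uint8_t)(d1 << 1);
        d6 = (int16_t)(d6 - d6 - x);     /* subx.w d6,d6 : 0 or -1       */
        x = d1 >> 7;                     /* add.b d1,d1 : next bit to X  */
        d1 = (uint8_t)(d1 << 1);
        d6 = (int16_t)(d6 + d6 + x);     /* addx.w d6,d6 : 2*d6 + X      */
        out[i] = d6;
    }
}

/* Compare both versions over all 256 input bytes. */
void quad_self_test(void)
{
    for (int b = 0; b < 256; b++) {
        int16_t a[4], c[4];
        quad_ref((uint8_t)b, a);
        quad_flags((uint8_t)b, c);
        for (int i = 0; i < 4; i++)
            assert(a[i] == c[i]);
    }
}
```

The point of subx.w d6,d6 is that it needs no prior clear of d6: whatever was in the register is subtracted from itself, leaving only the negated flag.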

Don_Adan · 10 June 2016, 11:23

Quote:

Originally Posted by meynaf

Assuredly the shortest solution

Unfortunately bfexts is slow (10 clocks on 030).

All four at once give 64 bits of data (4 16-bit blocks). You can't read that in one go. The code you presented does only one 16-bit value.
To fetch the whole 64 bits (8 bytes) you need a table of 256*8=2048 bytes.

Please present the full code doing it, to prevent any misunderstanding.

Full code please

This thread isn't for theoretical ideas, it's for submitting working code directly, if i'm not mistaken. So please submit full working code.

Okay, I forgot that it must be a word output for every 2 bits of input. If the output cannot be changed to byte size, like $00, $01, $xx, $FF, then the table must be 1024 bytes long.
The code could look like this:

Code:

moveq    #63,D6
and.b    D1, D6
move.l   Table(PC,D6.W*8),(A1)+
move.l   Table+4(PC,D6.W*8),(A1)+

Table
dc.w 0,0,0,0
dc.w 0,0,0,1
....
dc.w -1,-1,-1,-1

meynaf · 10 June 2016, 11:32

It doesn't work. You clear b7 and b6 so first word will be wrong.

This method can work but it needs 256*8=2048 bytes. This is overkill (actually even a 64-byte table is too big for my taste ; this code isn't that important).

Don_Adan · 10 June 2016, 12:25

Quote:

Originally Posted by meynaf

It doesn't work. You clear b7 and b6 so first word will be wrong.

This method can work but it needs 256*8=2048 bytes. This is overkill (actually even a 64-byte table is too big for my taste ; this code isn't that important).

You're right, sorry. Only Leffmann's table version can be optimised a little. But you can benchmark both versions for speed. I don't know why you need word output rather than byte output for your code. You waste half of the destination buffer; a single ext.w when the value is read from the table would be better, in my opinion.

meynaf · 10 June 2016, 12:40

Quote:

Originally Posted by Don_Adan

I don't know why you need word output rather than byte output for your code. You waste half of the destination buffer; a single ext.w when the value is read from the table would be better, in my opinion.

Taken out of context it seems strange indeed.
But the huffquad data follows regular huff data (which is full 16-bit) in the same buffer, to be read by the same routines after that (first stereo handling, then imdct). The position at which quad data starts isn't constant : there may be a lot of it, or very little, or even none at all.

That said, the quickest method appears to be :

Code:

 moveq #0,d6
 move.b d1,d6
 move.l table(pc,d6.w*8),(a1)+
 move.l table+4(pc,d6.w*8),(a1)+

(with a 2048-byte table).

But... Did i say that i don't like tables, especially when they're put right in the middle of the code ?

Btw. if we're done with this one and/or ppl are ready i already have my next challenge idea...
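The table-size argument in the posts above (256*8=2048 bytes, and why masking the index to 6 bits breaks the first word) can be checked with a short C model. This is a sketch, not the code from mpega.library: the table contents are assumed to be the sign-extended 2-bit fields, high pair first, and the names `full_table`, `build_full_table`, `quad_full_lookup` and `first_word_masked` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* One entry per input byte: 256 entries x 4 words x 2 bytes = 2048 bytes. */
int16_t full_table[256][4];

/* Sign-extend a 2-bit field: 00 -> 0, 01 -> 1, 10 -> -2, 11 -> -1. */
int16_t sext2_field(unsigned f)
{
    f &= 3;
    return (int16_t)((f & 2) ? (int)f - 4 : (int)f);
}

void build_full_table(void)
{
    assert(sizeof full_table == 2048);
    for (unsigned b = 0; b < 256; b++)
        for (unsigned i = 0; i < 4; i++)
            full_table[b][i] = sext2_field(b >> (6 - 2 * i));
}

/* Model of: moveq #0,d6 / move.b d1,d6 / two move.l from table(pc,d6.w*8). */
void quad_full_lookup(uint8_t d1, int16_t out[4])
{
    unsigned d6 = d1;                 /* full 8-bit index, nothing masked */
    for (int i = 0; i < 4; i++)
        out[i] = full_table[d6][i];
}

/* What the first word becomes if the index is masked with #63:
   b7 and b6 are cleared, so the first field always reads as 00. */
int16_t first_word_masked(uint8_t d1)
{
    return full_table[d1 & 63][0];
}
```

With input $C6 the full lookup gives -1 as the first word, while the 6-bit index gives 0, which is the failure pointed out for the `moveq #63` version.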

Thorham · 10 June 2016, 12:54

Quote:

Originally Posted by meynaf

All four at once give 64 bits of data (4 16-bit blocks). You can't read that in one go. The code you presented does only one 16-bit value.
To fetch the whole 64 bits (8 bytes) you need a table of 256*8=2048 bytes.

Yeah, I misread it. I thought the output cases were in binary

meynaf · 10 June 2016, 13:02

Quote:

Originally Posted by Thorham

Yeah, I misread it. I thought the output cases were in binary

I suppose i wasn't too clear either

Ready for the next one ?

Don_Adan · 10 June 2016, 14:59

Quote:

Originally Posted by meynaf

Taken out of context it seems strange indeed.
But the huffquad data follows regular huff data (which is full 16-bit) in the same buffer, to be read by the same routines after that (first stereo handling, then imdct). The position at which quad data starts isn't constant : there may be a lot of it, or very little, or even none at all.

That said, the quickest method appears to be :

Code:

 moveq #0,d6
 move.b d1,d6
 move.l table(pc,d6.w*8),(a1)+
 move.l table+4(pc,d6.w*8),(a1)+

(with a 2048-byte table).

But... Did i say that i don't like tables, especially when they're put right in the middle of the code ?

Btw. if we're done with this one and/or ppl are ready i already have my next challenge idea...

You can use Leffmann's version with a 64-byte table. It is about 10 cycles slower than the 2048-byte table version, if I remember the 68030 timings correctly.

Code:

 moveq #0,d6
 move.b d1,d6
 ror.l #4,d6
 move.l table(pc,d6.w*4),(a1)+
 clr.w d6
 rol.l #4,d6
 move.l table(pc,d6.w*4),(a1)+

(with a 64-byte table).

What is the next challenge? And what is the main prize?
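The ror.l/rol.l variant in this post parks the low nibble in the top of the register while the high nibble is used as an index, then brings it back. A small C model of just that register shuffling, as a sketch (the helper names `ror32`, `rol32` and `split_nibbles` are invented for illustration, and `d6` is modelled as a 32-bit value):

```c
#include <assert.h>
#include <stdint.h>

/* 32-bit rotates; n must be in 1..31 here. */
uint32_t ror32(uint32_t v, unsigned n) { return (v >> n) | (v << (32 - n)); }
uint32_t rol32(uint32_t v, unsigned n) { return (v << n) | (v >> (32 - n)); }

/* Returns the two table indices in the order the asm uses them. */
void split_nibbles(uint8_t d1, unsigned *first, unsigned *second)
{
    uint32_t d6 = d1;               /* moveq #0,d6 / move.b d1,d6          */
    d6 = ror32(d6, 4);              /* ror.l #4,d6 : low nibble to b31..28 */
    *first = d6 & 0xFFFFu;          /* d6.w = high nibble                  */
    d6 &= 0xFFFF0000u;              /* clr.w d6                            */
    d6 = rol32(d6, 4);              /* rol.l #4,d6 : low nibble to b3..0   */
    *second = d6 & 0xFFFFu;         /* d6.w = low nibble                   */
}

/* Check every byte: first index is the high nibble, second the low one. */
void split_self_test(void)
{
    for (unsigned b = 0; b < 256; b++) {
        unsigned hi, lo;
        split_nibbles((uint8_t)b, &hi, &lo);
        assert(hi == (b >> 4) && lo == (b & 15));
    }
}
```

Because the high nibble is looked up first, both stores can use (a1)+ in order, which is the difference from the version that writes to (4,a1) and then (a1).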

meynaf · 10 June 2016, 15:10

Quote:

Originally Posted by Don_Adan

What is the next challenge? And what is the main prize?

No prize awarded, sorry

But i can explain my next challenge.

Not the same project, but still real life - and still used in some audio decoder.
This is middle-side stereo decode of flac.

We have :
c0 = longword data coming from (a0)+
c1 = longword data coming from (a1)+
l,r = left and right pcm samples - double word data to write to (a2)+

And the computation to get l,r from c0,c1 is (mid,side are temporaries) :
side = c1
mid = c0 *2 + (side & 1)
l = (mid+side) /2
r = (mid-side) /2

This is a lossless format and the data can just be truncated without any need to clamp.
However the end result must be exact.
(Anyone interested can read the original libflac C source ; the stuff comes from there.)

All regs can be used (personally i used d1,d2,d3). Again assume 030 timing.

I wonder if ppl will still try to use tables for this.
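As a reference to test asm submissions against, the four lines of the computation above can be written directly in C. This is a minimal sketch, not the libflac source: the function name `midside_decode` and the use of plain division are choices made here. Division by 2 is safe because mid and side always share the same low bit, so both mid+side and mid-side are even and the result is exact. Inputs are assumed to fit in about 17 bits, as stated later in the thread, so nothing overflows in 32 bits.

```c
#include <assert.h>
#include <stdint.h>

/* c0, c1: the two input longwords; l, r: decoded left/right samples. */
void midside_decode(int32_t c0, int32_t c1, int32_t *l, int32_t *r)
{
    int32_t side = c1;
    int32_t mid  = c0 * 2 + (side & 1);  /* restore the bit lost in the average */
    *l = (mid + side) / 2;               /* always even, so the division is exact */
    *r = (mid - side) / 2;
}

/* Hand-checked cases, where c0 = floor((l+r)/2) and c1 = l-r. */
void midside_self_test(void)
{
    int32_t l, r;
    midside_decode(3, 3, &l, &r);    /* l=5,  r=2  */
    assert(l == 5 && r == 2);
    midside_decode(0, -7, &l, &r);   /* l=-3, r=4  */
    assert(l == -3 && r == 4);
    midside_decode(-5, 2, &l, &r);   /* l=-4, r=-6 */
    assert(l == -4 && r == -6);
}
```

An asm entry can be validated by feeding it the same (c0, c1) pairs and comparing the longwords written to (a2)+ with `l` and `r`.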

Thorham · 10 June 2016, 16:20

Questions:

1. What's read from (a0) and (a1) exactly? Two 16 bit samples per long word, or one 32 bit sample? If it's two samples per long word, then what order are they in?
2. What's the order of the 2 x 16 bit output samples per longword?

meynaf · 10 June 2016, 16:43

Quote:

Originally Posted by Thorham

1. What's read from (a0) and (a1) exactly? Two 16 bit samples per long word, or one 32 bit sample? If it's two samples per long word, then what order are they in?

That's one 32 bit sample (up to 17 bits are actually used as it might be the sum of two 16-bit values).
As you have two sources you will end up with two 32 bit samples, of course.

Quote:

Originally Posted by Thorham

2. What's the order of the 2 x 16 bit output samples per longword?

Samples are output left channel, then right - like in a wave file (actually more like in aiff as it's signed and big endian).


