jpeg decoding in full asm

meynaf · 10 December 2007, 13:26

Hi there,

In the process of writing my own image viewer (why not ?), I ended up with an ilbm/gif viewer which is faster than everything else I tested (yes ! - see behind).
However, when it came to jpeg, things have gone much more complex...

All was written in full asm, but not the jpeg part which remained in C (using the IJG code). I removed unneeded stuff from the jpeg library, but now I want to rewrite it in asm. A bit too tough maybe...

So, if you want to help, you're welcome

I'm pretty sure my friend Thorham will be interested

You can find my actual sources here :
http://meynaf.free.fr/tmp/jpeg.lzx

You have to alter the project options for having the correct directories for sources. Also, you may wish to change the command line and working dir.

I compile them with HisoftDev 4.1. Anything else untested, may or may not work. What's needed is :
- no startup/cleanup code (this is all done in the asm part linked to it)
- first object to link is the asm part (file v.o)
- no small data model (c code shouldn't use an addr reg to access globals).

So you don't believe me when I say it's fast at displaying iff/gif files, eh ?

Here is a viewer benchmark. My own viewer is loosely labeled "v" (if someone can think of a better name it's welcome...).
The machine is an a1200/030-50, images are read from ram disk.
Quality of the rendering is indicated for jpegs ; results are in number of frames (50 is 1 sec).
I've put the results in a "code" tag to preserve formatting :

Code:

iff (768x512x8)
    v              31
    ppshow         45
    visage         48
    fastview       56
    viewtek        58

gif (800x533x8)
    v              76
    showgif        93
    zgif          119
    ppshow        207
    viewtek       378

jpeg (500x333x24)
    sjfif          67    (low qual)
    visage        129    (medium qual)
    fastview      158    (medium qual)
    fjpeg_aga     186    (high qual)
    v             233    (high qual)
    ppshow        235    (low qual)
    jpegaga       374    (high qual)
    viewtek       512    (medium qual)

Versions used : sjfif 1.72, visage 39.22, fastview 2.0, fjpeg_aga 1.10, ppshow 4.0, jpegaga 2.2, viewtek 2.00.

As you can see, there's still room for improvement on the jpeg part...
Who's willing to help ?

Thorham · 10 December 2007, 14:58

Quote:

Originally Posted by meynaf

I'm pretty sure my friend Thorham will be interested

You got that right

Quote:

Originally Posted by meynaf

As you can see, there's still room for improvement on the jpeg part...
Who's willing to help ?

You know I am, meynaf. Pretty cool, some more code look into. Thanks for sharing

For iff and gif the speed is definitely great, as for the jpeg part, if it's unoptimized c then there should be a lot of room for improvement.

meynaf · 10 December 2007, 15:38

Thanks for your help

The code isn't unoptimized C. The IJG (Independent Jpeg Group) have taken care about their code's speed quite a lot.
But, of course, it's still C, so it leaves a lot of room for improvements.

If you like code to look at, then you can also look at the other thread I've just opened :
http://eab.abime.net/showthread.php?t=33574

Thorham · 10 December 2007, 23:28

Quote:

Originally Posted by meynaf

If you like code to look at, then you can also look at the other thread I've just opened :
http://eab.abime.net/showthread.php?t=33574

Thanks

When I tried to assemble the v source code, all I got was an object file, as I didn't realize the c parts were needed, too

(that's called not reading a post properly) I have dice 3.16 and storm c++, and don't know a lot about them, so it's going to take some time to get this to work, and I don't have much experience with c compilers in general anyway, so it's not going to be easy for me, either. But that doesn't matter, maybe I can learn something from this

Although it might not be necessary to compile the jpeg decoder, as the c code hast to go anyway.

I must also admit that I don't have a whole lot of c programming experience, so I might not be able to help here

meynaf · 11 December 2007, 10:28

If you're unable to read c code, then it's possible to make the compiler output an asm source (I sure can do it), then to take this as a starting point. We don't even need to look at all the asm at once, just one part after another - some may require a little bit more code cleanup in the source before ; there are just too many structs with fields we won't use.

Alternatively, it could be possible to start from scratch.
If so, a good thing to do would be the dct part, as it takes a good deal of computational power - if you're a good mathematician then a discrete cosine transform implementation should be nothing for you

killergorilla · 11 December 2007, 11:52

Hmmm...

quick gif viewing...

could come in handy with my games frontend...

Photon · 11 December 2007, 12:08

This is a very interesting topic, cos this is exactly what I intend for code2x.com (for the GP2x). Applying skilled asm coding on selected c functions or loops.

Been offering my asm services to a few selected gp2x devs, but it's as if they don't know how to isolate a loop and give an asm coder the assembly of what is to be optimized. Could also be that they don't have time, don't think the code is near enough to final, or haven't had time to profile the code so they know which parts take the most time.

Seems to be a need for creating awareness of the fact that not a single c function can ever be made nearly as fast as the corresponding one in asm.

I think if you isolate out any init code from what you want optimized and give Thorham a chunk of asm (together with provided inputs and relied-upon output from that chunk) optimization will be EZ Street!

meynaf · 11 December 2007, 12:38

Quote:

Originally Posted by Photon

Seems to be a need for creating awareness of the fact that not a single c function can ever be made nearly as fast as the corresponding one in asm.

I agree.

In a C program you've got a very different environment than in an ASM program. Parameters are not passed in registers, references are via pointers rather than output registers, and you don't know your caller's register usage so you have to save whatever reg you use which isn't defined as a scratch reg.
That's why I intend to rewrite the whole thing ; not only a few selected, profiled parts.

Quote:

Originally Posted by Photon

I think if you isolate out any init code from what you want optimized and give Thorham a chunk of asm (together with provided inputs and relied-upon output from that chunk) optimization will be EZ Street!

Yes, I can write a wrapper to call ASM from C and vice-versa. This way the asm part gets its values in registers and outputs whatever it needs, whereever suitable.
Once the caller is also rewritten, the wrapper disappears.

If you look at my jpeg.s source, you'll see what I mean.

However the input/output data formats need to be built for the optimized part, not the original C code, e.g. the fixed-point format must be defined by the asm writer.

Thorham · 11 December 2007, 15:43

Quote:

Originally Posted by meynaf

If you're unable to read c code, then it's possible to make the compiler output an asm source (I sure can do it), then to take this as a starting point. We don't even need to look at all the asm at once, just one part after another - some may require a little bit more code cleanup in the source before ; there are just too many structs with fields we won't use.

I can read c, but I just don't have a lot of experience with it, thats all, really. It just means that I can't rush through the c parts. I'll make do!

Quote:

Originally Posted by meynaf

Alternatively, it could be possible to start from scratch.
If so, a good thing to do would be the dct part, as it takes a good deal of computational power - if you're a good mathematician then a discrete cosine transform implementation should be nothing for you

This is certainly preferable. However, I must point out that I'm not a math guy. I suppose I could learn how to read the inverse dct (there isn't that much to it), but I don't know if I can write a highly optimized asm implementation. Fingers crossed...

meynaf · 11 December 2007, 16:18

The main problem of such computations is that they're not integer and contain lots of multiplies.

There are two implementations you might want to take a look at.
. jidctint.c - fixed-point implementation, Amiga decoders generally use that one
. jidctflt.c - float implementation, simpler... but using floats

Of course, there are other algorithms...

Thorham · 11 December 2007, 16:37

Quote:

Originally Posted by meynaf

The main problem of such computations is that they're not integer and contain lots of multiplies.

O no, mulu

Quote:

Originally Posted by meynaf

There are two implementations you might want to take a look at.
. jidctint.c - fixed-point implementation, Amiga decoders generally use that one
. jidctflt.c - float implementation, simpler... but using floats

It seems doable to write optimized asm for the int version, but the float version must be thrown away, or you'll have to use the math libraries (very slow). It's a pity the 680x0 family doesn't always have a math co pro as a standard.

All thats needed to do the idct in asm is some test data to verify the output of the asm routine. Any suggestions???

Quote:

Originally Posted by meynaf

Of course, there are other algorithms...

Really, where can I find them? If they are faster (and just as good) then the before mentioned int routine then it's definitively worth taking a look at these. Maybe I'll just see if I can find them on the net if you don't have them.

meynaf · 11 December 2007, 17:21

Quote:

Originally Posted by Thorham

O no, mulu

No mulu, don't worry. Just muls

Quote:

Originally Posted by Thorham

It seems doable to write optimized asm for the int version, but the float version must be thrown away, or you'll have to use the math libraries (very slow). It's a pity the 680x0 family doesn't always have a math co pro as a standard.

Thrown away, I'm not sure. The float version can be changed into a fixed-point version.
And, you know, floating-point computations are slower than integer even if you have a math co-pro.

Quote:

Originally Posted by Thorham

All thats needed to do the idct in asm is some test data to verify the output of the asm routine. Any suggestions???

We could start with random data, and keep the original code to check the output. Once something is done, integrating it into the viewer is easy, and then we'll see the result

Or we could get some data in a real image, while debugging.

Quote:

Originally Posted by Thorham

Really, where can I find them? If they are faster (and just as good) then the before mentioned int routine then it's definitively worth taking a look at these. Maybe I'll just see if I can find them on the net if you don't have them.

I have next to nothing, so the net is the right solution

Thorham · 12 December 2007, 16:24

Quote:

Originally Posted by meynaf

No mulu, don't worry. Just muls

Nooooo, even slower

Quote:

Originally Posted by meynaf

Thrown away, I'm not sure. The float version can be changed into a fixed-point version.
And, you know, floating-point computations are slower than integer even if you have a math co-pro.

Well, that figures

Quote:

Originally Posted by meynaf[We could start with random data, and keep the original code to check the output. Once something is done, integrating it into the viewer is easy, and then we'll [I

see[/i] the result

Or we could get some data in a real image, while debugging.

Sounds like a good idea. I still have to get my mind around this one, though.

Quote:

Originally Posted by meynaf

I have next to nothing, so the net is the right solution

Ok, I'll just have a go at it. I'll let you know if anything cool shows up, as I'm pretty sure you'll want to know.

meynaf · 12 December 2007, 16:59

Quote:

Originally Posted by Thorham

Nooooo, even slower

Strangely, muls isn't slower than mulu on 020/030. But they're still 14 times slower than an add

Quote:

Originally Posted by Thorham

Ok, I'll just have a go at it. I'll let you know if anything cool shows up, as I'm pretty sure you'll want to know.

I sure will

Thorham · 12 December 2007, 18:03

Quote:

Originally Posted by meynaf

Strangely, muls isn't slower than mulu on 020/030. But they're still 14 times slower than an add

Must have mixed it up with divu vs divs, where divs is slower then divu (I think it's 44 cycles vs 56 cycles).

Quote:

Originally Posted by meynaf

I sure will

Great, I'll post about anything interesting I find.

Photon · 12 December 2007, 18:40

If I may offer advice (humbly, I assure you) there are some choices to be made:

- Converter for FPU users only?
- Otherwise, let libraries detect FPU?
- make FPU and non-FPU versions in asm.
- code it all in integer (if 32 bits is not enough precision, 68xxx has instructions to help with 64-bit math. A few examples are listed in AsmOne docs, I believe.)

For myself, I'd do it in 32-bit integer. Works on all 68xxx Amigas and should be good enough for 8bit rgb components.

When I offer asm services to the GP2xdev community, I prefer to keep it simple. c coder selects a time-consuming snippet, send me the asm from the binary, and when I've noted which inputs it gets I scrap that asm and write an exact optimized replica that generates the correct results. If you start looking at "how the c compiler does it in asm" I would be led astray

On 68k, muls and mulu take 70n, while add takes 8n. divs takes 158n and divu takes 140n (all worst case timings).

If you work from asm to asm, you don't really have to know formulas etc. If the c-asm code calls some routine in some fp library, all you have to do is replace the call with asm that delivers the same result. If you find that it multiplies a*sin x with b*cos y, then do the same "your way". (As you see I haven't looked at jpeg algorithms, was just an example

)

Asm of innermost loop and exact function description of the used library functions will get you a long way.

Thorham · 12 December 2007, 20:36

Meynaf, I've found some interesting stuff, considering I didn't search for hours

Some sites with seemingly in-depth explanations:
Exploring JPEG
Image Compression - from DCT to Wavelets : A Review

Some papers, with, I think, very in-depth explanations:
www.eie.polyu.edu.hk/~enyhchan/mm_jpeg.pdf
www.es.ele.tue.nl/mininoc/doc/report_sander_stuijk.pdf
www.cse.fau.edu/~borko/Chapter8_mc.pdf

Interesting stuff, no doubt, but if it's not useful I'll have to try a little harder

meynaf · 13 December 2007, 10:41

@Photon :
I really want to do it in pure integer ; a floating-point version would be slower anyway and won't bring a visible quality difference (I think).
There are several algorithms (for the dct part) that can be used, I still don't know which will be the best when reduced to (asm) code. I need to be as fast as possible, yet having enough precision so give the highest quality.
Knowing exactly what's to be done can end up with a very different architecture than what already exists, because it will be amiga asm and not portable c.
There are no such things as sin and cos, all of those are pre-computed and put in tables. What remains is a lot of adds, and a bunch of muls... oh, yes, so many muls

@Thorham :
I've had a quick look at the links you provided. And found more and more math formulas and schemas.

Sure interesting, but what would suit me better is an actual algorithm to perform a dct (which is the tricky part ; the rest is more or less integer, it can (and will ?) be done from actual c sources - once I know which fixed-point format to use).
Oh, well, ok. Maybe I can just compile jidctint.c, see the asm and try to optimize it

Thorham · 13 December 2007, 10:57

Quote:

Originally Posted by meynaf

I've had a quick look at the links you provided. And found more and more math formulas and schemas.

Sure interesting, but what would suit me better is an actual algorithm to perform a dct (which is the tricky part ; the rest is more or less integer, it can (and will ?) be done from actual c sources - once I know which fixed-point format to use).
Oh, well, ok. Maybe I can just compile jidctint.c, see the asm and try to optimize it

The Exploring Jpeg page has a complete explanation, which also shows how to implement it (in haskell, though the author says you can skip the code). Are you sure you've looked this page through completely? It might be helpful in understanding what's going on in the source code you have, unless you don't really need to.

Anyway, if it's source code you want, then that shouldn't be too hard to find.

Unfortunately, a lot of jpeg explanations are presented in the mathematical format, which is useless if you can't read it. I've had the same problem years ago, when I wanted to know how jpeg works.

meynaf · 13 December 2007, 11:21

Quote:

Originally Posted by Thorham

The Exploring Jpeg page has a complete explanation, which also shows how to implement it (in haskell, though the author says you can skip the code). Are you sure you've looked this page through completely? It might be helpful in understanding what's going on in the source code you have, unless you don't really need to.

I already know what's going on. The main problem for me is that dct trick, I know what it does, but I want the quickest way to do it without significantly losing precision - and the haskell code, not quite readable, apparently contains direct square roots and cosinus computations, which I do not want to do.

Quote:

Originally Posted by Thorham

Anyway, if it's source code you want, then that shouldn't be too hard to find.

Maybe not source code, but a detailed algorithm to efficiently perform the computations. Of course, a correctly commented source code will do

Things such as sqr/sin/cos are to avoid at all costs, muls should be reduced to the bare minimum. Else we'll end up with something sloooooow

Quote:

Originally Posted by Thorham

Unfortunately, a lot of jpeg explanations are presented in the mathematical format, which is useless if you can't read it. I've had the same problem years ago, when I wanted to know how jpeg works.

Hopefully you know what I feel about those formulas

10 December 2007, 13:26	#1
meynaf son of 68k Join Date: Nov 2007 Location: Lyon / France Age: 51 Posts: 5,323	jpeg decoding in full asm Hi there, In the process of writing my own image viewer (why not ?), I ended up with an ilbm/gif viewer which is faster than everything else I tested (yes ! - see behind). However, when it came to jpeg, things have gone much more complex... All was written in full asm, but not the jpeg part which remained in C (using the IJG code). I removed unneeded stuff from the jpeg library, but now I want to rewrite it in asm. A bit too tough maybe... So, if you want to help, you're welcome I'm pretty sure my friend Thorham will be interested You can find my actual sources here : http://meynaf.free.fr/tmp/jpeg.lzx You have to alter the project options for having the correct directories for sources. Also, you may wish to change the command line and working dir. I compile them with HisoftDev 4.1. Anything else untested, may or may not work. What's needed is : - no startup/cleanup code (this is all done in the asm part linked to it) - first object to link is the asm part (file v.o) - no small data model (c code shouldn't use an addr reg to access globals). So you don't believe me when I say it's fast at displaying iff/gif files, eh ? Here is a viewer benchmark. My own viewer is loosely labeled "v" (if someone can think of a better name it's welcome...). The machine is an a1200/030-50, images are read from ram disk. Quality of the rendering is indicated for jpegs ; results are in number of frames (50 is 1 sec). I've put the results in a "code" tag to preserve formatting : Code: iff (768x512x8) v 31 ppshow 45 visage 48 fastview 56 viewtek 58 gif (800x533x8) v 76 showgif 93 zgif 119 ppshow 207 viewtek 378 jpeg (500x333x24) sjfif 67 (low qual) visage 129 (medium qual) fastview 158 (medium qual) fjpeg_aga 186 (high qual) v 233 (high qual) ppshow 235 (low qual) jpegaga 374 (high qual) viewtek 512 (medium qual) Versions used : sjfif 1.72, visage 39.22, fastview 2.0, fjpeg_aga 1.10, ppshow 4.0, jpegaga 2.2, viewtek 2.00. As you can see, there's still room for improvement on the jpeg part... Who's willing to help ?

12 December 2007, 20:36	#17
Thorham Computer Nerd Join Date: Sep 2007 Location: Rotterdam/Netherlands Age: 47 Posts: 3,751	Meynaf, I've found some interesting stuff, considering I didn't search for hours Some sites with seemingly in-depth explanations: Exploring JPEG Image Compression - from DCT to Wavelets : A Review Some papers, with, I think, very in-depth explanations: www.eie.polyu.edu.hk/~enyhchan/mm_jpeg.pdf www.es.ele.tue.nl/mininoc/doc/report_sander_stuijk.pdf www.cse.fau.edu/~borko/Chapter8_mc.pdf Interesting stuff, no doubt, but if it's not useful I'll have to try a little harder Last edited by Thorham; 13 December 2007 at 01:26.

Similar Threads
Thread	Thread Starter	Forum	Replies	Last Post
JPEG to IFF Coverter	W4r3DeV1L	request.Apps	15	14 February 2020 17:21
Overzealous Kickstart ROM - address decoding?	robinsonb5	Hardware mods	3	30 June 2013 11:09
JPEG to PNG (via CLI)	amiga_user	support.Apps	3	28 November 2011 11:50
Decoding algorithm(s) for encoded disk sectors (ADOS)	andreas	Coders. General	10	02 November 2009 22:18
Blitter MFM decoding	Photon	Coders. General	14	16 March 2006 11:24

10 December 2007, 15:38	#3
meynaf son of 68k Join Date: Nov 2007 Location: Lyon / France Age: 51 Posts: 5,323	Thanks for your help The code isn't unoptimized C. The IJG (Independent Jpeg Group) have taken care about their code's speed quite a lot. But, of course, it's still C, so it leaves a lot of room for improvements. If you like code to look at, then you can also look at the other thread I've just opened : http://eab.abime.net/showthread.php?t=33574

11 December 2007, 10:28	#5
meynaf son of 68k Join Date: Nov 2007 Location: Lyon / France Age: 51 Posts: 5,323	If you're unable to read c code, then it's possible to make the compiler output an asm source (I sure can do it), then to take this as a starting point. We don't even need to look at all the asm at once, just one part after another - some may require a little bit more code cleanup in the source before ; there are just too many structs with fields we won't use. Alternatively, it could be possible to start from scratch. If so, a good thing to do would be the dct part, as it takes a good deal of computational power - if you're a good mathematician then a discrete cosine transform implementation should be nothing for you

11 December 2007, 11:52	#6
killergorilla Lesser Talent Join Date: Jan 2003 Location: UK Age: 42 Posts: 7,957	Hmmm... quick gif viewing... could come in handy with my games frontend...

11 December 2007, 12:08	#7
Photon Moderator Join Date: Nov 2004 Location: Eksjö / Sweden Posts: 5,602	This is a very interesting topic, cos this is exactly what I intend for code2x.com (for the GP2x). Applying skilled asm coding on selected c functions or loops. Been offering my asm services to a few selected gp2x devs, but it's as if they don't know how to isolate a loop and give an asm coder the assembly of what is to be optimized. Could also be that they don't have time, don't think the code is near enough to final, or haven't had time to profile the code so they know which parts take the most time. Seems to be a need for creating awareness of the fact that not a single c function can ever be made nearly as fast as the corresponding one in asm. I think if you isolate out any init code from what you want optimized and give Thorham a chunk of asm (together with provided inputs and relied-upon output from that chunk) optimization will be EZ Street!

11 December 2007, 16:18	#10
meynaf son of 68k Join Date: Nov 2007 Location: Lyon / France Age: 51 Posts: 5,323	The main problem of such computations is that they're not integer and contain lots of multiplies. There are two implementations you might want to take a look at. . jidctint.c - fixed-point implementation, Amiga decoders generally use that one . jidctflt.c - float implementation, simpler... but using floats Of course, there are other algorithms...

12 December 2007, 18:40	#16
Photon Moderator Join Date: Nov 2004 Location: Eksjö / Sweden Posts: 5,602	If I may offer advice (humbly, I assure you) there are some choices to be made: - Converter for FPU users only? - Otherwise, let libraries detect FPU? - make FPU and non-FPU versions in asm. - code it all in integer (if 32 bits is not enough precision, 68xxx has instructions to help with 64-bit math. A few examples are listed in AsmOne docs, I believe.) For myself, I'd do it in 32-bit integer. Works on all 68xxx Amigas and should be good enough for 8bit rgb components. When I offer asm services to the GP2xdev community, I prefer to keep it simple. c coder selects a time-consuming snippet, send me the asm from the binary, and when I've noted which inputs it gets I scrap that asm and write an exact optimized replica that generates the correct results. If you start looking at "how the c compiler does it in asm" I would be led astray On 68k, muls and mulu take 70n, while add takes 8n. divs takes 158n and divu takes 140n (all worst case timings). If you work from asm to asm, you don't really have to know formulas etc. If the c-asm code calls some routine in some fp library, all you have to do is replace the call with asm that delivers the same result. If you find that it multiplies asin x with bcos y, then do the same "your way". (As you see I haven't looked at jpeg algorithms, was just an example ) Asm of innermost loop and exact function description of the used library functions will get you a long way.

13 December 2007, 10:41	#18
meynaf son of 68k Join Date: Nov 2007 Location: Lyon / France Age: 51 Posts: 5,323	@Photon : I really want to do it in pure integer ; a floating-point version would be slower anyway and won't bring a visible quality difference (I think). There are several algorithms (for the dct part) that can be used, I still don't know which will be the best when reduced to (asm) code. I need to be as fast as possible, yet having enough precision so give the highest quality. Knowing exactly what's to be done can end up with a very different architecture than what already exists, because it will be amiga asm and not portable c. There are no such things as sin and cos, all of those are pre-computed and put in tables. What remains is a lot of adds, and a bunch of muls... oh, yes, so many muls @Thorham : I've had a quick look at the links you provided. And found more and more math formulas and schemas. Sure interesting, but what would suit me better is an actual algorithm to perform a dct (which is the tricky part ; the rest is more or less integer, it can (and will ?) be done from actual c sources - once I know which fixed-point format to use). Oh, well, ok. Maybe I can just compile jidctint.c, see the asm and try to optimize it

Currently Active Users Viewing This Thread: 1 (0 members and 1 guests)