English Amiga Board - jpeg decoding in full asm

Page 1 of 6

Last »

Show 20 post(s) from this thread on one page

English Amiga Board (https://eab.abime.net/index.php)

- Coders. General (https://eab.abime.net/forumdisplay.php?f=37)

- - jpeg decoding in full asm (https://eab.abime.net/showthread.php?t=33576)

meynaf

10 December 2007 13:26

jpeg decoding in full asm

Hi there,

In the process of writing my own image viewer (why not ?), I ended up with an ilbm/gif viewer which is faster than everything else I tested (yes ! - see behind).
However, when it came to jpeg, things have gone much more complex...

All was written in full asm, but not the jpeg part which remained in C (using the IJG code). I removed unneeded stuff from the jpeg library, but now I want to rewrite it in asm. A bit too tough maybe...

So, if you want to help, you're welcome :)

I'm pretty sure my friend Thorham will be interested ;)

You can find my actual sources here :
http://meynaf.free.fr/tmp/jpeg.lzx

You have to alter the project options for having the correct directories for sources. Also, you may wish to change the command line and working dir.

I compile them with HisoftDev 4.1. Anything else untested, may or may not work. What's needed is :
- no startup/cleanup code (this is all done in the asm part linked to it)
- first object to link is the asm part (file v.o)
- no small data model (c code shouldn't use an addr reg to access globals).

So you don't believe me when I say it's fast at displaying iff/gif files, eh ?

Here is a viewer benchmark. My own viewer is loosely labeled "v" (if someone can think of a better name it's welcome...).
The machine is an a1200/030-50, images are read from ram disk.
Quality of the rendering is indicated for jpegs ; results are in number of frames (50 is 1 sec).
I've put the results in a "code" tag to preserve formatting :

Code:

iff (768x512x8)

    v              31

    ppshow         45

    visage         48

    fastview       56

    viewtek        58



gif (800x533x8)

    v              76

    showgif        93

    zgif          119

    ppshow        207

    viewtek       378



jpeg (500x333x24)

    sjfif          67    (low qual)

    visage        129    (medium qual)

    fastview      158    (medium qual)

    fjpeg_aga     186    (high qual)

    v             233    (high qual)

    ppshow        235    (low qual)

    jpegaga       374    (high qual)

    viewtek       512    (medium qual)

Versions used : sjfif 1.72, visage 39.22, fastview 2.0, fjpeg_aga 1.10, ppshow 4.0, jpegaga 2.2, viewtek 2.00.

As you can see, there's still room for improvement on the jpeg part...
Who's willing to help ?

Thorham

10 December 2007 14:58

Quote:

Originally Posted by meynaf

I'm pretty sure my friend Thorham will be interested ;)

You got that right :D

Quote:

Originally Posted by meynaf

As you can see, there's still room for improvement on the jpeg part...
Who's willing to help ?

You know I am, meynaf. Pretty cool, some more code look into. Thanks for sharing :great

For iff and gif the speed is definitely great, as for the jpeg part, if it's unoptimized c then there should be a lot of room for improvement.

meynaf

10 December 2007 15:38

Thanks for your help :great

The code isn't unoptimized C. The IJG (Independent Jpeg Group) have taken care about their code's speed quite a lot.
But, of course, it's still C, so it leaves a lot of room for improvements.

If you like code to look at, then you can also look at the other thread I've just opened :
http://eab.abime.net/showthread.php?t=33574

Thorham

10 December 2007 23:28

Quote:

Originally Posted by meynaf

If you like code to look at, then you can also look at the other thread I've just opened :
http://eab.abime.net/showthread.php?t=33574

Thanks :great

When I tried to assemble the v source code, all I got was an object file, as I didn't realize the c parts were needed, too :banghead:banghead:banghead (that's called not reading a post properly) I have dice 3.16 and storm c++, and don't know a lot about them, so it's going to take some time to get this to work, and I don't have much experience with c compilers in general anyway, so it's not going to be easy for me, either. But that doesn't matter, maybe I can learn something from this :D Although it might not be necessary to compile the jpeg decoder, as the c code hast to go anyway.

I must also admit that I don't have a whole lot of c programming experience, so I might not be able to help here :o

meynaf

11 December 2007 10:28

If you're unable to read c code, then it's possible to make the compiler output an asm source (I sure can do it), then to take this as a starting point. We don't even need to look at all the asm at once, just one part after another - some may require a little bit more code cleanup in the source before ; there are just too many structs with fields we won't use.

Alternatively, it could be possible to start from scratch.
If so, a good thing to do would be the dct part, as it takes a good deal of computational power - if you're a good mathematician then a discrete cosine transform implementation should be nothing for you :D

killergorilla

11 December 2007 11:52

Hmmm...

quick gif viewing...

could come in handy with my games frontend...

Photon

11 December 2007 12:08

This is a very interesting topic, cos this is exactly what I intend for code2x.com (for the GP2x). Applying skilled asm coding on selected c functions or loops. :cool

Been offering my asm services to a few selected gp2x devs, but it's as if they don't know how to isolate a loop and give an asm coder the assembly of what is to be optimized. Could also be that they don't have time, don't think the code is near enough to final, or haven't had time to profile the code so they know which parts take the most time.

Seems to be a need for creating awareness of the fact that not a single c function can ever be made nearly as fast as the corresponding one in asm.

I think if you isolate out any init code from what you want optimized and give Thorham a chunk of asm (together with provided inputs and relied-upon output from that chunk) optimization will be EZ Street! :great

meynaf

11 December 2007 12:38

Quote:

Originally Posted by Photon (Post 377758)

Seems to be a need for creating awareness of the fact that not a single c function can ever be made nearly as fast as the corresponding one in asm.

I agree. :agree
In a C program you've got a very different environment than in an ASM program. Parameters are not passed in registers, references are via pointers rather than output registers, and you don't know your caller's register usage so you have to save whatever reg you use which isn't defined as a scratch reg.
That's why I intend to rewrite the whole thing ; not only a few selected, profiled parts.

Quote:

Originally Posted by Photon (Post 377758)

I think if you isolate out any init code from what you want optimized and give Thorham a chunk of asm (together with provided inputs and relied-upon output from that chunk) optimization will be EZ Street! :great

Yes, I can write a wrapper to call ASM from C and vice-versa. This way the asm part gets its values in registers and outputs whatever it needs, whereever suitable.
Once the caller is also rewritten, the wrapper disappears.

If you look at my jpeg.s source, you'll see what I mean.

However the input/output data formats need to be built for the optimized part, not the original C code, e.g. the fixed-point format must be defined by the asm writer.

Thorham

11 December 2007 15:43

Quote:

Originally Posted by meynaf

I can read c, but I just don't have a lot of experience with it, thats all, really. It just means that I can't rush through the c parts. I'll make do!

Quote:

Originally Posted by meynaf

Alternatively, it could be possible to start from scratch.
If so, a good thing to do would be the dct part, as it takes a good deal of computational power - if you're a good mathematician then a discrete cosine transform implementation should be nothing for you :D

This is certainly preferable. However, I must point out that I'm not a math guy. I suppose I could learn how to read the inverse dct (there isn't that much to it), but I don't know if I can write a highly optimized asm implementation. Fingers crossed...

meynaf

11 December 2007 16:18

The main problem of such computations is that they're not integer and contain lots of multiplies. :banghead

There are two implementations you might want to take a look at.
. jidctint.c - fixed-point implementation, Amiga decoders generally use that one
. jidctflt.c - float implementation, simpler... but using floats

Of course, there are other algorithms...

Thorham

11 December 2007 16:37

Quote:

Originally Posted by meynaf

The main problem of such computations is that they're not integer and contain lots of multiplies. :banghead

O no, mulu :guru

Quote:

Originally Posted by meynaf

There are two implementations you might want to take a look at.
. jidctint.c - fixed-point implementation, Amiga decoders generally use that one
. jidctflt.c - float implementation, simpler... but using floats

It seems doable to write optimized asm for the int version, but the float version must be thrown away, or you'll have to use the math libraries (very slow). It's a pity the 680x0 family doesn't always have a math co pro as a standard.

All thats needed to do the idct in asm is some test data to verify the output of the asm routine. Any suggestions???

Quote:

Originally Posted by meynaf

Of course, there are other algorithms...

Really, where can I find them? If they are faster (and just as good) then the before mentioned int routine then it's definitively worth taking a look at these. Maybe I'll just see if I can find them on the net if you don't have them.

meynaf

11 December 2007 17:21

Quote:

Originally Posted by Thorham (Post 377830)

O no, mulu :guru

No mulu, don't worry. Just muls :p

Quote:

Originally Posted by Thorham (Post 377830)

Thrown away, I'm not sure. The float version can be changed into a fixed-point version.
And, you know, floating-point computations are slower than integer even if you have a math co-pro. :banghead

Quote:

Originally Posted by Thorham (Post 377830)

All thats needed to do the idct in asm is some test data to verify the output of the asm routine. Any suggestions???

We could start with random data, and keep the original code to check the output. Once something is done, integrating it into the viewer is easy, and then we'll see the result :shocked
Or we could get some data in a real image, while debugging.

Quote:

Originally Posted by Thorham (Post 377830)

I have next to nothing, so the net is the right solution :D

Thorham

12 December 2007 16:24

Quote:

Originally Posted by meynaf

No mulu, don't worry. Just muls :p

Nooooo, even slower :D

Quote:

Originally Posted by meynaf

Thrown away, I'm not sure. The float version can be changed into a fixed-point version.
And, you know, floating-point computations are slower than integer even if you have a math co-pro. :banghead

Well, that figures :guru

Quote:

Originally Posted by meynaf[We could start with random data, and keep the original code to check the output. Once something is done, integrating it into the viewer is easy, and then we'll [I

see[/i] the result :shocked
Or we could get some data in a real image, while debugging.

Sounds like a good idea. I still have to get my mind around this one, though.

Quote:

Originally Posted by meynaf

I have next to nothing, so the net is the right solution :D

Ok, I'll just have a go at it. I'll let you know if anything cool shows up, as I'm pretty sure you'll want to know.

meynaf

12 December 2007 16:59

Quote:

Originally Posted by Thorham (Post 378213)

Nooooo, even slower :D

Strangely, muls isn't slower than mulu on 020/030. But they're still 14 times slower than an add :shocked

Quote:

Originally Posted by Thorham (Post 378213)

Ok, I'll just have a go at it. I'll let you know if anything cool shows up, as I'm pretty sure you'll want to know.

I sure will :agree

Thorham

12 December 2007 18:03

Quote:

Originally Posted by meynaf

Strangely, muls isn't slower than mulu on 020/030. But they're still 14 times slower than an add :shocked

Must have mixed it up with divu vs divs, where divs is slower then divu (I think it's 44 cycles vs 56 cycles).

Quote:

Originally Posted by meynaf

I sure will :agree

Great, I'll post about anything interesting I find.

Photon

12 December 2007 18:40

If I may offer advice (humbly, I assure you) there are some choices to be made:

- Converter for FPU users only?
- Otherwise, let libraries detect FPU?
- make FPU and non-FPU versions in asm.
- code it all in integer (if 32 bits is not enough precision, 68xxx has instructions to help with 64-bit math. A few examples are listed in AsmOne docs, I believe.)

For myself, I'd do it in 32-bit integer. Works on all 68xxx Amigas and should be good enough for 8bit rgb components.

When I offer asm services to the GP2xdev community, I prefer to keep it simple. c coder selects a time-consuming snippet, send me the asm from the binary, and when I've noted which inputs it gets I scrap that asm and write an exact optimized replica that generates the correct results. If you start looking at "how the c compiler does it in asm" I would be led astray ;)

On 68k, muls and mulu take 70n, while add takes 8n. divs takes 158n and divu takes 140n (all worst case timings).

If you work from asm to asm, you don't really have to know formulas etc. If the c-asm code calls some routine in some fp library, all you have to do is replace the call with asm that delivers the same result. If you find that it multiplies a*sin x with b*cos y, then do the same "your way". (As you see I haven't looked at jpeg algorithms, was just an example ;))

Asm of innermost loop and exact function description of the used library functions will get you a long way. :great

Thorham

12 December 2007 20:36

Meynaf, I've found some interesting stuff, considering I didn't search for hours :D

Some sites with seemingly in-depth explanations:
Exploring JPEG
Image Compression - from DCT to Wavelets : A Review

Some papers, with, I think, very in-depth explanations:
www.eie.polyu.edu.hk/~enyhchan/mm_jpeg.pdf
www.es.ele.tue.nl/mininoc/doc/report_sander_stuijk.pdf
www.cse.fau.edu/~borko/Chapter8_mc.pdf

Interesting stuff, no doubt, but if it's not useful I'll have to try a little harder :D

meynaf

13 December 2007 10:41

@Photon :
I really want to do it in pure integer ; a floating-point version would be slower anyway and won't bring a visible quality difference (I think).
There are several algorithms (for the dct part) that can be used, I still don't know which will be the best when reduced to (asm) code. I need to be as fast as possible, yet having enough precision so give the highest quality.
Knowing exactly what's to be done can end up with a very different architecture than what already exists, because it will be amiga asm and not portable c.
There are no such things as sin and cos, all of those are pre-computed and put in tables. What remains is a lot of adds, and a bunch of muls... oh, yes, so many muls :crazy

@Thorham :
I've had a quick look at the links you provided. And found more and more math formulas and schemas. :shocked
Sure interesting, but what would suit me better is an actual algorithm to perform a dct (which is the tricky part ; the rest is more or less integer, it can (and will ?) be done from actual c sources - once I know which fixed-point format to use).
Oh, well, ok. Maybe I can just compile jidctint.c, see the asm and try to optimize it :guru

Thorham

13 December 2007 10:57

Quote:

Originally Posted by meynaf

I've had a quick look at the links you provided. And found more and more math formulas and schemas. :shocked
Sure interesting, but what would suit me better is an actual algorithm to perform a dct (which is the tricky part ; the rest is more or less integer, it can (and will ?) be done from actual c sources - once I know which fixed-point format to use).
Oh, well, ok. Maybe I can just compile jidctint.c, see the asm and try to optimize it :guru

The Exploring Jpeg page has a complete explanation, which also shows how to implement it (in haskell, though the author says you can skip the code). Are you sure you've looked this page through completely? It might be helpful in understanding what's going on in the source code you have, unless you don't really need to.

Anyway, if it's source code you want, then that shouldn't be too hard to find.

Unfortunately, a lot of jpeg explanations are presented in the mathematical format, which is useless if you can't read it. I've had the same problem years ago, when I wanted to know how jpeg works.

meynaf

13 December 2007 11:21

Quote:

Originally Posted by Thorham (Post 378462)

I already know what's going on. The main problem for me is that dct trick, I know what it does, but I want the quickest way to do it without significantly losing precision - and the haskell code, not quite readable, apparently contains direct square roots and cosinus computations, which I do not want to do.

Quote:

Originally Posted by Thorham (Post 378462)

Anyway, if it's source code you want, then that shouldn't be too hard to find.

Maybe not source code, but a detailed algorithm to efficiently perform the computations. Of course, a correctly commented source code will do :)
Things such as sqr/sin/cos are to avoid at all costs, muls should be reduced to the bare minimum. Else we'll end up with something sloooooow :shocked

Quote:

Originally Posted by Thorham (Post 378462)

Unfortunately, a lot of jpeg explanations are presented in the mathematical format, which is useless if you can't read it. I've had the same problem years ago, when I wanted to know how jpeg works.

Hopefully you know what I feel about those formulas :scream

All times are GMT +2. The time now is 14:25.

Page 1 of 6

Last »

Show 20 post(s) from this thread on one page

Page generated in 0.05367 seconds with 10 queries