English Amiga Board


Go Back   English Amiga Board > Coders > Coders. General

 
 
Thread Tools
Old 10 December 2007, 13:26   #1
meynaf
son of 68k
meynaf's Avatar
 
Join Date: Nov 2007
Location: Lyon / France
Age: 48
Posts: 4,228
jpeg decoding in full asm

Hi there,

In the process of writing my own image viewer (why not ?), I ended up with an ilbm/gif viewer which is faster than everything else I tested (yes ! - see behind).
However, when it came to jpeg, things have gone much more complex...

All was written in full asm, but not the jpeg part which remained in C (using the IJG code). I removed unneeded stuff from the jpeg library, but now I want to rewrite it in asm. A bit too tough maybe...

So, if you want to help, you're welcome

I'm pretty sure my friend Thorham will be interested

You can find my actual sources here :
http://meynaf.free.fr/tmp/jpeg.lzx

You have to alter the project options for having the correct directories for sources. Also, you may wish to change the command line and working dir.

I compile them with HisoftDev 4.1. Anything else untested, may or may not work. What's needed is :
- no startup/cleanup code (this is all done in the asm part linked to it)
- first object to link is the asm part (file v.o)
- no small data model (c code shouldn't use an addr reg to access globals).


So you don't believe me when I say it's fast at displaying iff/gif files, eh ?

Here is a viewer benchmark. My own viewer is loosely labeled "v" (if someone can think of a better name it's welcome...).
The machine is an a1200/030-50, images are read from ram disk.
Quality of the rendering is indicated for jpegs ; results are in number of frames (50 is 1 sec).
I've put the results in a "code" tag to preserve formatting :
Code:
iff (768x512x8)
    v              31
    ppshow         45
    visage         48
    fastview       56
    viewtek        58

gif (800x533x8)
    v              76
    showgif        93
    zgif          119
    ppshow        207
    viewtek       378

jpeg (500x333x24)
    sjfif          67    (low qual)
    visage        129    (medium qual)
    fastview      158    (medium qual)
    fjpeg_aga     186    (high qual)
    v             233    (high qual)
    ppshow        235    (low qual)
    jpegaga       374    (high qual)
    viewtek       512    (medium qual)
Versions used : sjfif 1.72, visage 39.22, fastview 2.0, fjpeg_aga 1.10, ppshow 4.0, jpegaga 2.2, viewtek 2.00.

As you can see, there's still room for improvement on the jpeg part...
Who's willing to help ?
meynaf is offline  
Old 10 December 2007, 14:58   #2
Thorham
Computer Nerd

Thorham's Avatar
 
Join Date: Sep 2007
Location: Rotterdam/Netherlands
Age: 44
Posts: 3,216
Quote:
Originally Posted by meynaf
I'm pretty sure my friend Thorham will be interested
You got that right

Quote:
Originally Posted by meynaf
As you can see, there's still room for improvement on the jpeg part...
Who's willing to help ?
You know I am, meynaf. Pretty cool, some more code look into. Thanks for sharing

For iff and gif the speed is definitely great, as for the jpeg part, if it's unoptimized c then there should be a lot of room for improvement.
Thorham is offline  
Old 10 December 2007, 15:38   #3
meynaf
son of 68k
meynaf's Avatar
 
Join Date: Nov 2007
Location: Lyon / France
Age: 48
Posts: 4,228
Thanks for your help

The code isn't unoptimized C. The IJG (Independent Jpeg Group) have taken care about their code's speed quite a lot.
But, of course, it's still C, so it leaves a lot of room for improvements.

If you like code to look at, then you can also look at the other thread I've just opened :
http://eab.abime.net/showthread.php?t=33574
meynaf is offline  
Old 10 December 2007, 23:28   #4
Thorham
Computer Nerd

Thorham's Avatar
 
Join Date: Sep 2007
Location: Rotterdam/Netherlands
Age: 44
Posts: 3,216
Quote:
Originally Posted by meynaf
If you like code to look at, then you can also look at the other thread I've just opened :
http://eab.abime.net/showthread.php?t=33574
Thanks

When I tried to assemble the v source code, all I got was an object file, as I didn't realize the c parts were needed, too (that's called not reading a post properly) I have dice 3.16 and storm c++, and don't know a lot about them, so it's going to take some time to get this to work, and I don't have much experience with c compilers in general anyway, so it's not going to be easy for me, either. But that doesn't matter, maybe I can learn something from this Although it might not be necessary to compile the jpeg decoder, as the c code hast to go anyway.

I must also admit that I don't have a whole lot of c programming experience, so I might not be able to help here
Thorham is offline  
Old 11 December 2007, 10:28   #5
meynaf
son of 68k
meynaf's Avatar
 
Join Date: Nov 2007
Location: Lyon / France
Age: 48
Posts: 4,228
If you're unable to read c code, then it's possible to make the compiler output an asm source (I sure can do it), then to take this as a starting point. We don't even need to look at all the asm at once, just one part after another - some may require a little bit more code cleanup in the source before ; there are just too many structs with fields we won't use.

Alternatively, it could be possible to start from scratch.
If so, a good thing to do would be the dct part, as it takes a good deal of computational power - if you're a good mathematician then a discrete cosine transform implementation should be nothing for you
meynaf is offline  
Old 11 December 2007, 11:52   #6
killergorilla
Lesser Talent
killergorilla's Avatar
 
Join Date: Jan 2003
Location: UK
Age: 39
Posts: 7,957
Hmmm...

quick gif viewing...

could come in handy with my games frontend...
killergorilla is offline  
Old 11 December 2007, 12:08   #7
Photon
Moderator

Photon's Avatar
 
Join Date: Nov 2004
Location: Eksjö / Sweden
Posts: 4,948
This is a very interesting topic, cos this is exactly what I intend for code2x.com (for the GP2x). Applying skilled asm coding on selected c functions or loops.

Been offering my asm services to a few selected gp2x devs, but it's as if they don't know how to isolate a loop and give an asm coder the assembly of what is to be optimized. Could also be that they don't have time, don't think the code is near enough to final, or haven't had time to profile the code so they know which parts take the most time.

Seems to be a need for creating awareness of the fact that not a single c function can ever be made nearly as fast as the corresponding one in asm.

I think if you isolate out any init code from what you want optimized and give Thorham a chunk of asm (together with provided inputs and relied-upon output from that chunk) optimization will be EZ Street!
Photon is offline  
Old 11 December 2007, 12:38   #8
meynaf
son of 68k
meynaf's Avatar
 
Join Date: Nov 2007
Location: Lyon / France
Age: 48
Posts: 4,228
Quote:
Originally Posted by Photon View Post
Seems to be a need for creating awareness of the fact that not a single c function can ever be made nearly as fast as the corresponding one in asm.
I agree.
In a C program you've got a very different environment than in an ASM program. Parameters are not passed in registers, references are via pointers rather than output registers, and you don't know your caller's register usage so you have to save whatever reg you use which isn't defined as a scratch reg.
That's why I intend to rewrite the whole thing ; not only a few selected, profiled parts.

Quote:
Originally Posted by Photon View Post
I think if you isolate out any init code from what you want optimized and give Thorham a chunk of asm (together with provided inputs and relied-upon output from that chunk) optimization will be EZ Street!
Yes, I can write a wrapper to call ASM from C and vice-versa. This way the asm part gets its values in registers and outputs whatever it needs, whereever suitable.
Once the caller is also rewritten, the wrapper disappears.

If you look at my jpeg.s source, you'll see what I mean.

However the input/output data formats need to be built for the optimized part, not the original C code, e.g. the fixed-point format must be defined by the asm writer.
meynaf is offline  
Old 11 December 2007, 15:43   #9
Thorham
Computer Nerd

Thorham's Avatar
 
Join Date: Sep 2007
Location: Rotterdam/Netherlands
Age: 44
Posts: 3,216
Quote:
Originally Posted by meynaf
If you're unable to read c code, then it's possible to make the compiler output an asm source (I sure can do it), then to take this as a starting point. We don't even need to look at all the asm at once, just one part after another - some may require a little bit more code cleanup in the source before ; there are just too many structs with fields we won't use.
I can read c, but I just don't have a lot of experience with it, thats all, really. It just means that I can't rush through the c parts. I'll make do!

Quote:
Originally Posted by meynaf
Alternatively, it could be possible to start from scratch.
If so, a good thing to do would be the dct part, as it takes a good deal of computational power - if you're a good mathematician then a discrete cosine transform implementation should be nothing for you
This is certainly preferable. However, I must point out that I'm not a math guy. I suppose I could learn how to read the inverse dct (there isn't that much to it), but I don't know if I can write a highly optimized asm implementation. Fingers crossed...
Thorham is offline  
Old 11 December 2007, 16:18   #10
meynaf
son of 68k
meynaf's Avatar
 
Join Date: Nov 2007
Location: Lyon / France
Age: 48
Posts: 4,228
The main problem of such computations is that they're not integer and contain lots of multiplies.

There are two implementations you might want to take a look at.
. jidctint.c - fixed-point implementation, Amiga decoders generally use that one
. jidctflt.c - float implementation, simpler... but using floats

Of course, there are other algorithms...
meynaf is offline  
Old 11 December 2007, 16:37   #11
Thorham
Computer Nerd

Thorham's Avatar
 
Join Date: Sep 2007
Location: Rotterdam/Netherlands
Age: 44
Posts: 3,216
Quote:
Originally Posted by meynaf
The main problem of such computations is that they're not integer and contain lots of multiplies.
O no, mulu

Quote:
Originally Posted by meynaf
There are two implementations you might want to take a look at.
. jidctint.c - fixed-point implementation, Amiga decoders generally use that one
. jidctflt.c - float implementation, simpler... but using floats
It seems doable to write optimized asm for the int version, but the float version must be thrown away, or you'll have to use the math libraries (very slow). It's a pity the 680x0 family doesn't always have a math co pro as a standard.

All thats needed to do the idct in asm is some test data to verify the output of the asm routine. Any suggestions???

Quote:
Originally Posted by meynaf
Of course, there are other algorithms...
Really, where can I find them? If they are faster (and just as good) then the before mentioned int routine then it's definitively worth taking a look at these. Maybe I'll just see if I can find them on the net if you don't have them.
Thorham is offline  
Old 11 December 2007, 17:21   #12
meynaf
son of 68k
meynaf's Avatar
 
Join Date: Nov 2007
Location: Lyon / France
Age: 48
Posts: 4,228
Quote:
Originally Posted by Thorham View Post
O no, mulu
No mulu, don't worry. Just muls

Quote:
Originally Posted by Thorham View Post
It seems doable to write optimized asm for the int version, but the float version must be thrown away, or you'll have to use the math libraries (very slow). It's a pity the 680x0 family doesn't always have a math co pro as a standard.
Thrown away, I'm not sure. The float version can be changed into a fixed-point version.
And, you know, floating-point computations are slower than integer even if you have a math co-pro.

Quote:
Originally Posted by Thorham View Post
All thats needed to do the idct in asm is some test data to verify the output of the asm routine. Any suggestions???
We could start with random data, and keep the original code to check the output. Once something is done, integrating it into the viewer is easy, and then we'll see the result
Or we could get some data in a real image, while debugging.

Quote:
Originally Posted by Thorham View Post
Really, where can I find them? If they are faster (and just as good) then the before mentioned int routine then it's definitively worth taking a look at these. Maybe I'll just see if I can find them on the net if you don't have them.
I have next to nothing, so the net is the right solution
meynaf is offline  
Old 12 December 2007, 16:24   #13
Thorham
Computer Nerd

Thorham's Avatar
 
Join Date: Sep 2007
Location: Rotterdam/Netherlands
Age: 44
Posts: 3,216
Quote:
Originally Posted by meynaf
No mulu, don't worry. Just muls
Nooooo, even slower

Quote:
Originally Posted by meynaf
Thrown away, I'm not sure. The float version can be changed into a fixed-point version.
And, you know, floating-point computations are slower than integer even if you have a math co-pro.
Well, that figures

Quote:
Originally Posted by meynaf[We could start with random data, and keep the original code to check the output. Once something is done, integrating it into the viewer is easy, and then we'll [I
see[/i] the result
Or we could get some data in a real image, while debugging.
Sounds like a good idea. I still have to get my mind around this one, though.

Quote:
Originally Posted by meynaf
I have next to nothing, so the net is the right solution
Ok, I'll just have a go at it. I'll let you know if anything cool shows up, as I'm pretty sure you'll want to know.
Thorham is offline  
Old 12 December 2007, 16:59   #14
meynaf
son of 68k
meynaf's Avatar
 
Join Date: Nov 2007
Location: Lyon / France
Age: 48
Posts: 4,228
Quote:
Originally Posted by Thorham View Post
Nooooo, even slower
Strangely, muls isn't slower than mulu on 020/030. But they're still 14 times slower than an add

Quote:
Originally Posted by Thorham View Post
Ok, I'll just have a go at it. I'll let you know if anything cool shows up, as I'm pretty sure you'll want to know.
I sure will
meynaf is offline  
Old 12 December 2007, 18:03   #15
Thorham
Computer Nerd

Thorham's Avatar
 
Join Date: Sep 2007
Location: Rotterdam/Netherlands
Age: 44
Posts: 3,216
Quote:
Originally Posted by meynaf
Strangely, muls isn't slower than mulu on 020/030. But they're still 14 times slower than an add
Must have mixed it up with divu vs divs, where divs is slower then divu (I think it's 44 cycles vs 56 cycles).

Quote:
Originally Posted by meynaf
I sure will
Great, I'll post about anything interesting I find.
Thorham is offline  
Old 12 December 2007, 18:40   #16
Photon
Moderator

Photon's Avatar
 
Join Date: Nov 2004
Location: Eksjö / Sweden
Posts: 4,948
If I may offer advice (humbly, I assure you) there are some choices to be made:

- Converter for FPU users only?
- Otherwise, let libraries detect FPU?
- make FPU and non-FPU versions in asm.
- code it all in integer (if 32 bits is not enough precision, 68xxx has instructions to help with 64-bit math. A few examples are listed in AsmOne docs, I believe.)

For myself, I'd do it in 32-bit integer. Works on all 68xxx Amigas and should be good enough for 8bit rgb components.

When I offer asm services to the GP2xdev community, I prefer to keep it simple. c coder selects a time-consuming snippet, send me the asm from the binary, and when I've noted which inputs it gets I scrap that asm and write an exact optimized replica that generates the correct results. If you start looking at "how the c compiler does it in asm" I would be led astray

On 68k, muls and mulu take 70n, while add takes 8n. divs takes 158n and divu takes 140n (all worst case timings).

If you work from asm to asm, you don't really have to know formulas etc. If the c-asm code calls some routine in some fp library, all you have to do is replace the call with asm that delivers the same result. If you find that it multiplies a*sin x with b*cos y, then do the same "your way". (As you see I haven't looked at jpeg algorithms, was just an example )

Asm of innermost loop and exact function description of the used library functions will get you a long way.
Photon is offline  
Old 12 December 2007, 20:36   #17
Thorham
Computer Nerd

Thorham's Avatar
 
Join Date: Sep 2007
Location: Rotterdam/Netherlands
Age: 44
Posts: 3,216
Meynaf, I've found some interesting stuff, considering I didn't search for hours

Some sites with seemingly in-depth explanations:
Exploring JPEG
Image Compression - from DCT to Wavelets : A Review

Some papers, with, I think, very in-depth explanations:
www.eie.polyu.edu.hk/~enyhchan/mm_jpeg.pdf
www.es.ele.tue.nl/mininoc/doc/report_sander_stuijk.pdf
www.cse.fau.edu/~borko/Chapter8_mc.pdf

Interesting stuff, no doubt, but if it's not useful I'll have to try a little harder

Last edited by Thorham; 13 December 2007 at 01:26.
Thorham is offline  
Old 13 December 2007, 10:41   #18
meynaf
son of 68k
meynaf's Avatar
 
Join Date: Nov 2007
Location: Lyon / France
Age: 48
Posts: 4,228
@Photon :
I really want to do it in pure integer ; a floating-point version would be slower anyway and won't bring a visible quality difference (I think).
There are several algorithms (for the dct part) that can be used, I still don't know which will be the best when reduced to (asm) code. I need to be as fast as possible, yet having enough precision so give the highest quality.
Knowing exactly what's to be done can end up with a very different architecture than what already exists, because it will be amiga asm and not portable c.
There are no such things as sin and cos, all of those are pre-computed and put in tables. What remains is a lot of adds, and a bunch of muls... oh, yes, so many muls

@Thorham :
I've had a quick look at the links you provided. And found more and more math formulas and schemas.
Sure interesting, but what would suit me better is an actual algorithm to perform a dct (which is the tricky part ; the rest is more or less integer, it can (and will ?) be done from actual c sources - once I know which fixed-point format to use).
Oh, well, ok. Maybe I can just compile jidctint.c, see the asm and try to optimize it
meynaf is offline  
Old 13 December 2007, 10:57   #19
Thorham
Computer Nerd

Thorham's Avatar
 
Join Date: Sep 2007
Location: Rotterdam/Netherlands
Age: 44
Posts: 3,216
Quote:
Originally Posted by meynaf
I've had a quick look at the links you provided. And found more and more math formulas and schemas.
Sure interesting, but what would suit me better is an actual algorithm to perform a dct (which is the tricky part ; the rest is more or less integer, it can (and will ?) be done from actual c sources - once I know which fixed-point format to use).
Oh, well, ok. Maybe I can just compile jidctint.c, see the asm and try to optimize it
The Exploring Jpeg page has a complete explanation, which also shows how to implement it (in haskell, though the author says you can skip the code). Are you sure you've looked this page through completely? It might be helpful in understanding what's going on in the source code you have, unless you don't really need to.

Anyway, if it's source code you want, then that shouldn't be too hard to find.

Unfortunately, a lot of jpeg explanations are presented in the mathematical format, which is useless if you can't read it. I've had the same problem years ago, when I wanted to know how jpeg works.
Thorham is offline  
Old 13 December 2007, 11:21   #20
meynaf
son of 68k
meynaf's Avatar
 
Join Date: Nov 2007
Location: Lyon / France
Age: 48
Posts: 4,228
Quote:
Originally Posted by Thorham View Post
The Exploring Jpeg page has a complete explanation, which also shows how to implement it (in haskell, though the author says you can skip the code). Are you sure you've looked this page through completely? It might be helpful in understanding what's going on in the source code you have, unless you don't really need to.
I already know what's going on. The main problem for me is that dct trick, I know what it does, but I want the quickest way to do it without significantly losing precision - and the haskell code, not quite readable, apparently contains direct square roots and cosinus computations, which I do not want to do.

Quote:
Originally Posted by Thorham View Post
Anyway, if it's source code you want, then that shouldn't be too hard to find.
Maybe not source code, but a detailed algorithm to efficiently perform the computations. Of course, a correctly commented source code will do
Things such as sqr/sin/cos are to avoid at all costs, muls should be reduced to the bare minimum. Else we'll end up with something sloooooow

Quote:
Originally Posted by Thorham View Post
Unfortunately, a lot of jpeg explanations are presented in the mathematical format, which is useless if you can't read it. I've had the same problem years ago, when I wanted to know how jpeg works.
Hopefully you know what I feel about those formulas
meynaf is offline  
 


Currently Active Users Viewing This Thread: 1 (0 members and 1 guests)
 
Thread Tools

Similar Threads
Thread Thread Starter Forum Replies Last Post
JPEG to IFF Coverter W4r3DeV1L request.Apps 15 14 February 2020 17:21
Overzealous Kickstart ROM - address decoding? robinsonb5 Hardware mods 3 30 June 2013 11:09
JPEG to PNG (via CLI) amiga_user support.Apps 3 28 November 2011 11:50
Decoding algorithm(s) for encoded disk sectors (ADOS) andreas Coders. General 10 02 November 2009 22:18
Blitter MFM decoding Photon Coders. General 14 16 March 2006 11:24

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off

Forum Jump


All times are GMT +2. The time now is 14:10.


Powered by vBulletin® Version 3.8.11
Copyright ©2000 - 2021, vBulletin Solutions Inc.
Page generated in 0.09998 seconds with 13 queries