10 December 2007, 13:26 | #1 |
son of 68k
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,323
|
jpeg decoding in full asm
Hi there,
In the process of writing my own image viewer (why not?), I ended up with an ilbm/gif viewer which is faster than everything else I tested (yes! See below). However, when it came to jpeg, things got much more complex... Everything was written in full asm except the jpeg part, which remained in C (using the IJG code). I removed unneeded stuff from the jpeg library, but now I want to rewrite it in asm. A bit too tough maybe...
So, if you want to help, you're welcome. I'm pretty sure my friend Thorham will be interested.
You can find my current sources here: http://meynaf.free.fr/tmp/jpeg.lzx
You have to alter the project options to set the correct directories for the sources. Also, you may wish to change the command line and working dir. I compile them with HisoftDev 4.1. Anything else is untested and may or may not work. What's needed is:
- no startup/cleanup code (this is all done in the asm part linked to it)
- first object to link is the asm part (file v.o)
- no small data model (c code shouldn't use an addr reg to access globals).
So you don't believe me when I say it's fast at displaying iff/gif files, eh? Here is a viewer benchmark. My own viewer is loosely labeled "v" (if someone can think of a better name it's welcome...). The machine is an a1200/030-50, images are read from ram disk. Rendering quality is indicated for jpegs; results are in number of frames (50 frames is 1 sec), so lower is faster. I've put the results in a "code" tag to preserve formatting:
Code:
iff (768x512x8)
v          31
ppshow     45
visage     48
fastview   56
viewtek    58

gif (800x533x8)
v          76
showgif    93
zgif      119
ppshow    207
viewtek   378

jpeg (500x333x24)
sjfif      67 (low qual)
visage    129 (medium qual)
fastview  158 (medium qual)
fjpeg_aga 186 (high qual)
v         233 (high qual)
ppshow    235 (low qual)
jpegaga   374 (high qual)
viewtek   512 (medium qual)

As you can see, there's still room for improvement on the jpeg part... Who's willing to help? |
10 December 2007, 14:58 | #2 | ||
Computer Nerd
Join Date: Sep 2007
Location: Rotterdam/Netherlands
Age: 47
Posts: 3,751
|
Quote:
Quote:
For iff and gif the speed is definitely great. As for the jpeg part, if it's unoptimized c then there should be a lot of room for improvement. |
||
10 December 2007, 15:38 | #3 |
son of 68k
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,323
|
Thanks for your help
The code isn't unoptimized C. The IJG (Independent JPEG Group) has put quite a lot of care into the speed of its code. But, of course, it's still C, so it leaves a lot of room for improvement. If you like code to look at, then you can also look at the other thread I've just opened: http://eab.abime.net/showthread.php?t=33574 |
10 December 2007, 23:28 | #4 | |
Computer Nerd
Join Date: Sep 2007
Location: Rotterdam/Netherlands
Age: 47
Posts: 3,751
|
Quote:
When I tried to assemble the v source code, all I got was an object file, as I didn't realize the c parts were needed, too (that's called not reading a post properly). I have dice 3.16 and storm c++ but don't know a lot about them, and I don't have much experience with c compilers in general, so it's going to take some time to get this to work and it's not going to be easy for me. But that doesn't matter, maybe I can learn something from this. It might not even be necessary to compile the jpeg decoder, as the c code has to go anyway. I must also admit that I don't have a whole lot of c programming experience, so I might not be able to help here. |
|
11 December 2007, 10:28 | #5 |
son of 68k
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,323
|
If you're unable to read c code, then it's possible to make the compiler output an asm source (I sure can do it) and take that as a starting point. We don't even need to look at all the asm at once, just one part after another - some parts may require a little more cleanup in the source first; there are just too many structs with fields we won't use.
Alternatively, it could be possible to start from scratch. If so, a good thing to do would be the dct part, as it takes a good deal of computational power - if you're a good mathematician then a discrete cosine transform implementation should be nothing for you |
11 December 2007, 11:52 | #6 |
Lesser Talent
Join Date: Jan 2003
Location: UK
Age: 42
Posts: 7,957
|
Hmmm...
quick gif viewing... could come in handy with my games frontend... |
11 December 2007, 12:08 | #7 |
Moderator
Join Date: Nov 2004
Location: Eksjö / Sweden
Posts: 5,602
|
This is a very interesting topic, cos this is exactly what I intend for code2x.com (for the GP2x). Applying skilled asm coding on selected c functions or loops.
Been offering my asm services to a few selected gp2x devs, but it's as if they don't know how to isolate a loop and give an asm coder the assembly of what is to be optimized. Could also be that they don't have time, don't think the code is near enough to final, or haven't had time to profile the code so they know which parts take the most time. Seems to be a need for creating awareness of the fact that not a single c function can ever be made nearly as fast as the corresponding one in asm. I think if you isolate out any init code from what you want optimized and give Thorham a chunk of asm (together with provided inputs and relied-upon output from that chunk) optimization will be EZ Street! |
11 December 2007, 12:38 | #8 | ||
son of 68k
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,323
|
Quote:
In a C program you've got a very different environment than in an ASM program. Parameters are not passed in registers, references are via pointers rather than output registers, and you don't know your caller's register usage so you have to save whatever reg you use which isn't defined as a scratch reg. That's why I intend to rewrite the whole thing ; not only a few selected, profiled parts. Quote:
Once the caller is also rewritten, the wrapper disappears. If you look at my jpeg.s source, you'll see what I mean. However the input/output data formats need to be built for the optimized part, not the original C code, e.g. the fixed-point format must be defined by the asm writer. |
||
11 December 2007, 15:43 | #9 | ||
Computer Nerd
Join Date: Sep 2007
Location: Rotterdam/Netherlands
Age: 47
Posts: 3,751
|
Quote:
Quote:
|
||
11 December 2007, 16:18 | #10 |
son of 68k
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,323
|
The main problem of such computations is that they're not integer and contain lots of multiplies.
There are two implementations you might want to take a look at:
- jidctint.c - fixed-point implementation, Amiga decoders generally use that one
- jidctflt.c - float implementation, simpler... but using floats
Of course, there are other algorithms... |
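To make the fixed-point idea concrete, here is a minimal C sketch of multiplying a sample by a fractional constant with integer operations only. The 13 fractional bits and the FIX / descale naming are modelled on the general style of jidctint.c as far as I recall it, but this is an illustration written for this thread, not a copy of the IJG code; `mul_by_sqrt_half` is a hypothetical helper.

```c
#include <stdint.h>

#define CONST_BITS 13   /* fractional bits of the constants (an assumption) */

/* Convert a positive real constant to fixed point, rounding to nearest. */
#define FIX(x) ((int32_t)((x) * (1 << CONST_BITS) + 0.5))

/* Drop n fractional bits, rounding to nearest.
   Assumes >> on a negative value is an arithmetic shift (like asr on 68k). */
static int32_t descale(int32_t x, int n)
{
    return (x + ((int32_t)1 << (n - 1))) >> n;
}

/* sample * 0.7071..., using one integer multiply, one add and one shift. */
static int32_t mul_by_sqrt_half(int32_t sample)
{
    return descale(sample * FIX(0.707106781), CONST_BITS);
}
```

With 13 fractional bits the constant becomes 5793, so as long as the sample fits in 16 bits the product fits in 32 bits, which is what a single 16x16 muls would deliver. Whether 13 bits is the right trade-off between speed and precision here is exactly the open question in the thread.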
11 December 2007, 16:37 | #11 | |||
Computer Nerd
Join Date: Sep 2007
Location: Rotterdam/Netherlands
Age: 47
Posts: 3,751
|
Quote:
Quote:
All that's needed to do the idct in asm is some test data to verify the output of the asm routine. Any suggestions??? Quote:
|
|||
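One possible source of test data, sketched here under assumptions: a slow reference inverse DCT written straight from the textbook 8x8 definition, fed with blocks that have a single non-zero coefficient. For such blocks the expected output is easy to reason about (a DC-only block gives a flat result of DC/8), so they make handy known-answer inputs for an asm routine. The names `idct8x8_ref`, `cos16` and `single_coeff_block` are made up for this sketch, and the JPEG level shift (+128) and clamping are deliberately left out.

```c
/* cos(k*pi/16) for k = 0..8, written out so no math library is needed. */
static const double COS16[9] = {
    1.0,
    0.9807852804032304,
    0.9238795325112867,
    0.8314696123025452,
    0.7071067811865476,
    0.5555702330196022,
    0.3826834323650898,
    0.1950903220161283,
    0.0
};

/* cos(k*pi/16) for any k >= 0, using the symmetries of the cosine. */
static double cos16(int k)
{
    k &= 31;                    /* one full period is 32 steps */
    if (k > 16) k = 32 - k;     /* cos(2*pi - a) =  cos(a) */
    return (k > 8) ? -COS16[16 - k] : COS16[k];   /* cos(pi - a) = -cos(a) */
}

/* Reference 8x8 inverse DCT on row-major blocks. Slow; for checking only. */
static void idct8x8_ref(const double in[64], double out[64])
{
    for (int y = 0; y < 8; y++) {
        for (int x = 0; x < 8; x++) {
            double sum = 0.0;
            for (int v = 0; v < 8; v++) {
                for (int u = 0; u < 8; u++) {
                    double cu = u ? 1.0 : COS16[4];   /* C(0) = 1/sqrt(2) */
                    double cv = v ? 1.0 : COS16[4];
                    sum += cu * cv * in[v * 8 + u]
                         * cos16((2 * x + 1) * u)
                         * cos16((2 * y + 1) * v);
                }
            }
            out[y * 8 + x] = sum / 4.0;
        }
    }
}

/* Known-answer input: all zero except one coefficient at column u, row v. */
static void single_coeff_block(double block[64], int u, int v, double value)
{
    for (int i = 0; i < 64; i++) block[i] = 0.0;
    block[v * 8 + u] = value;
}
```

A fixed-point routine will not match these values exactly, so a comparison would need a small tolerance (for instance one unit after rounding); the right tolerance is a judgment call. Coefficients captured from a real image while debugging, as suggested in the next post, would complement these synthetic blocks.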
11 December 2007, 17:21 | #12 | ||
son of 68k
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,323
|
No mulu, don't worry. Just muls
Quote:
And, you know, floating-point computations are slower than integer even if you have a math co-pro. Quote:
Or we could get some data from a real image, while debugging. I have next to nothing, so the net is the right solution |
||
12 December 2007, 16:24 | #13 | ||||
Computer Nerd
Join Date: Sep 2007
Location: Rotterdam/Netherlands
Age: 47
Posts: 3,751
|
Quote:
Quote:
Quote:
Quote:
|
||||
12 December 2007, 16:59 | #14 |
son of 68k
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,323
|
Strangely, muls isn't slower than mulu on 020/030. But they're still 14 times slower than an add
I sure will |
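Since a multiply costs so much more than an add, one option worth measuring is replacing a multiply by a known constant with shifts and adds. The sketch below does this for 181, which is roughly 256/sqrt(2), so the result carries 8 fractional bits; the constant and the function name `mul181` are chosen purely for illustration and do not come from the viewer's source.

```c
#include <stdint.h>

/* x * 181 without a multiply: 181 = ((1 + 4) * (1 + 8)) * 4 + 1.
   The shifts are done on an unsigned copy, which is well-defined in C and
   produces the same bits as a left shift on a two's-complement register. */
static int32_t mul181(int32_t x)
{
    uint32_t u = (uint32_t)x;
    uint32_t t = (u << 2) + u;   /* 5 * x   */
    t = (t << 3) + t;            /* 45 * x  */
    t = (t << 2) + u;            /* 181 * x */
    return (int32_t)t;
}
```

That is three shifts and three adds instead of one muls. Whether it actually wins depends on the CPU and on the constant (the more set bits, the longer the chain), so it would have to be timed on the real machine rather than assumed.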
12 December 2007, 18:03 | #15 | ||
Computer Nerd
Join Date: Sep 2007
Location: Rotterdam/Netherlands
Age: 47
Posts: 3,751
|
Quote:
Quote:
|
||
12 December 2007, 18:40 | #16 |
Moderator
Join Date: Nov 2004
Location: Eksjö / Sweden
Posts: 5,602
|
If I may offer advice (humbly, I assure you) there are some choices to be made:
- Converter for FPU users only?
- Otherwise, let libraries detect FPU?
- make FPU and non-FPU versions in asm.
- code it all in integer (if 32 bits is not enough precision, 68xxx has instructions to help with 64-bit math. A few examples are listed in AsmOne docs, I believe.)
For myself, I'd do it in 32-bit integer. Works on all 68xxx Amigas and should be good enough for 8bit rgb components.
When I offer asm services to the GP2xdev community, I prefer to keep it simple. The c coder selects a time-consuming snippet and sends me the asm from the binary, and when I've noted which inputs it gets I scrap that asm and write an exact optimized replica that generates the correct results. If I started looking at "how the c compiler does it in asm" I would be led astray.
On 68k, muls and mulu take 70n, while add takes 8n. divs takes 158n and divu takes 140n (all worst case timings).
If you work from asm to asm, you don't really have to know formulas etc. If the c-asm code calls some routine in some fp library, all you have to do is replace the call with asm that delivers the same result. If you find that it multiplies a*sin x with b*cos y, then do the same "your way". (As you see I haven't looked at jpeg algorithms, it was just an example.) The asm of the innermost loop and an exact functional description of the library functions used will get you a long way. |
12 December 2007, 20:36 | #17 |
Computer Nerd
Join Date: Sep 2007
Location: Rotterdam/Netherlands
Age: 47
Posts: 3,751
|
Meynaf, I've found some interesting stuff, considering I didn't search for hours
Some sites with seemingly in-depth explanations:
- Exploring JPEG
- Image Compression - from DCT to Wavelets : A Review
Some papers, with, I think, very in-depth explanations:
- www.eie.polyu.edu.hk/~enyhchan/mm_jpeg.pdf
- www.es.ele.tue.nl/mininoc/doc/report_sander_stuijk.pdf
- www.cse.fau.edu/~borko/Chapter8_mc.pdf
Interesting stuff, no doubt, but if it's not useful I'll have to try a little harder
Last edited by Thorham; 13 December 2007 at 01:26. |
13 December 2007, 10:41 | #18 |
son of 68k
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,323
|
@Photon :
I really want to do it in pure integer; a floating-point version would be slower anyway and wouldn't bring a visible quality difference (I think). There are several algorithms (for the dct part) that can be used, and I still don't know which will be the best when reduced to (asm) code. I need to be as fast as possible, yet have enough precision to give the highest quality. Knowing exactly what's to be done can end up with a very different architecture than what already exists, because it will be amiga asm and not portable c. There are no such things as sin and cos at run time; all of those are pre-computed and put in tables. What remains is a lot of adds, and a bunch of muls... oh, yes, so many muls.
@Thorham :
I've had a quick look at the links you provided, and found more and more math formulas and diagrams. Sure interesting, but what would suit me better is an actual algorithm to perform a dct (which is the tricky part; the rest is more or less integer, and it can (and will?) be done from the actual c sources - once I know which fixed-point format to use). Oh, well, ok. Maybe I can just compile jidctint.c, see the asm and try to optimize it |
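As an illustration of "no sin and cos, only tables, adds and muls", here is a deliberately naive 8-point inverse DCT in C that uses a pre-computed cosine table in fixed point. It is a sketch under assumptions: 13 fractional bits, inputs small enough that the 32-bit accumulator cannot overflow, and the names `COS_TAB` and `idct8_table` are made up. It is not one of the fast algorithms (it spends 8 multiplies per output sample), so it shows the data format rather than a candidate for the final routine.

```c
#include <stdint.h>

#define FRAC_BITS 13

/* round(8192 * cos(k*pi/16)) for k = 0..31, i.e. one full period. */
static const int16_t COS_TAB[32] = {
     8192,  8035,  7568,  6811,  5793,  4551,  3135,  1598,
        0, -1598, -3135, -4551, -5793, -6811, -7568, -8035,
    -8192, -8035, -7568, -6811, -5793, -4551, -3135, -1598,
        0,  1598,  3135,  4551,  5793,  6811,  7568,  8035
};

/* out[x] = 1/2 * sum over u of C(u) * in[u] * cos((2x+1)*u*pi/16),
   with C(0) = 1/sqrt(2) and C(u) = 1 otherwise.
   Only table lookups, multiplies, adds and one final shift are used. */
static void idct8_table(const int16_t in[8], int16_t out[8])
{
    for (int x = 0; x < 8; x++) {
        /* u = 0 term: C(0) * cos(0) = cos(pi/4) = COS_TAB[4] */
        int32_t acc = (int32_t)in[0] * COS_TAB[4];
        for (int u = 1; u < 8; u++) {
            acc += (int32_t)in[u] * COS_TAB[((2 * x + 1) * u) & 31];
        }
        /* One extra bit in the shift applies the 1/2 factor; adding
           1 << FRAC_BITS first rounds to nearest. Assumes >> on a negative
           value is an arithmetic shift. */
        out[x] = (int16_t)((acc + ((int32_t)1 << FRAC_BITS)) >> (FRAC_BITS + 1));
    }
}
```

A 2-D transform would run this over the rows and then over the columns. In practice the intermediate pass would probably keep a few extra fractional bits instead of rounding to integers between passes; how many is part of choosing the fixed-point format.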
13 December 2007, 10:57 | #19 | |
Computer Nerd
Join Date: Sep 2007
Location: Rotterdam/Netherlands
Age: 47
Posts: 3,751
|
Quote:
Anyway, if it's source code you want, then that shouldn't be too hard to find. Unfortunately, a lot of jpeg explanations are presented in the mathematical format, which is useless if you can't read it. I've had the same problem years ago, when I wanted to know how jpeg works. |
|
13 December 2007, 11:21 | #20 | ||
son of 68k
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,323
|
Quote:
Quote:
Things such as sqr/sin/cos are to be avoided at all costs, and muls should be reduced to the bare minimum. Otherwise we'll end up with something sloooooow. Hopefully you know how I feel about those formulas |
||
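One well-known way to shave multiplies, sketched here in C: a 2-D rotation, which fast DCT algorithms are largely built from, can be computed with 3 multiplies instead of 4 at the cost of one extra add. I believe rearrangements of this kind are what the fast integer DCT implementations rely on, but the code below is a generic illustration with made-up names and an arbitrarily chosen angle (pi/8, constants scaled by 8192), not an extract from any decoder.

```c
#include <stdint.h>

enum {
    C_PI8 = 7568,   /* round(8192 * cos(pi/8)) */
    S_PI8 = 3135    /* round(8192 * sin(pi/8)) */
};

/* Plain rotation: 4 multiplies, 2 adds. Results carry 13 fractional bits. */
static void rotate_4mul(int32_t x, int32_t y, int32_t *r0, int32_t *r1)
{
    *r0 = x * C_PI8 + y * S_PI8;
    *r1 = y * C_PI8 - x * S_PI8;
}

/* Same rotation with 3 multiplies and 3 adds. The two combined constants
   (S - C) and (S + C) are compile-time values, so they cost nothing at
   run time. */
static void rotate_3mul(int32_t x, int32_t y, int32_t *r0, int32_t *r1)
{
    int32_t t = (x + y) * C_PI8;      /* x*C + y*C                  */
    *r0 = t + y * (S_PI8 - C_PI8);    /* x*C + y*S                  */
    *r1 = t - x * (S_PI8 + C_PI8);    /* y*C - x*S                  */
}
```

Because the combined constants are exact integer sums, both versions give bit-identical results. The price is that x + y and S + C must still fit the multiplier's operand width (16 bits for a word muls), which is one more thing the choice of fixed-point format has to allow for.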