jpeg decoding in full asm
Hi there,
In the process of writing my own image viewer (why not?), I ended up with an ilbm/gif viewer which is faster than everything else I tested (yes! See below). However, when it came to jpeg, things got much more complex... Everything is written in pure asm except the jpeg part, which remains in C (using the IJG code). I removed the unneeded stuff from the jpeg library, but now I want to rewrite it in asm. A bit too tough maybe... So, if you want to help, you're welcome :) I'm pretty sure my friend Thorham will be interested ;)

You can find my current sources here: http://meynaf.free.fr/tmp/jpeg.lzx
You have to alter the project options to set the correct source directories. You may also wish to change the command line and working dir. I compile them with HisoftDev 4.1. Anything else is untested and may or may not work. What's needed is:
- no startup/cleanup code (this is all done in the asm part linked to it)
- the first object to link is the asm part (file v.o)
- no small data model (the C code shouldn't use an address register to access globals).

So you don't believe me when I say it's fast at displaying iff/gif files, eh? Here is a viewer benchmark. My own viewer is loosely labeled "v" (if someone can think of a better name, it's welcome...). The machine is an A1200/030-50; images are read from RAM disk. Rendering quality is indicated for jpegs; results are in frames (50 frames = 1 sec). I've put the results in a "code" tag to preserve formatting:
Code:
iff (768x512x8)

As you can see, there's still room for improvement on the jpeg part... Who's willing to help? |
For iff and gif the speed is definitely great. As for the jpeg part, if it's unoptimized C then there should be a lot of room for improvement. |
Thanks for your help :great
The code isn't unoptimized C. The IJG (Independent JPEG Group) has taken a lot of care over the speed of its code. But, of course, it's still C, so it leaves a lot of room for improvement. If you like code to look at, you can also look at the other thread I've just opened: http://eab.abime.net/showthread.php?t=33574 |
Quote:
When I tried to assemble the v source code, all I got was an object file, as I didn't realize the C parts were needed too :banghead:banghead:banghead (that's called not reading a post properly).

I have Dice 3.16 and Storm C++ and don't know a lot about them, so it's going to take some time to get this to work. I don't have much experience with C compilers in general anyway, so it's not going to be easy for me either. But that doesn't matter, maybe I can learn something from this :D Although it might not be necessary to compile the jpeg decoder, as the C code has to go anyway. I must also admit that I don't have a whole lot of C programming experience, so I might not be able to help here :o |
If you're unable to read C code, it's possible to make the compiler output an asm source (I can certainly do that) and take that as a starting point. We don't even need to look at all the asm at once, just one part after another. Some parts may need a little more cleanup in the source first; there are just too many structs with fields we won't use.
Alternatively, it could be possible to start from scratch. If so, a good thing to do would be the dct part, as it takes a good deal of computational power - if you're a good mathematician then a discrete cosine transform implementation should be nothing for you :D |
Hmmm...
quick gif viewing... could come in handy with my games frontend... |
This is a very interesting topic, cos this is exactly what I intend for code2x.com (for the GP2x). Applying skilled asm coding on selected c functions or loops. :cool
Been offering my asm services to a few selected gp2x devs, but it's as if they don't know how to isolate a loop and give an asm coder the assembly of what is to be optimized. Could also be that they don't have time, don't think the code is near enough to final, or haven't had time to profile the code so they know which parts take the most time. Seems to be a need for creating awareness of the fact that not a single c function can ever be made nearly as fast as the corresponding one in asm. I think if you isolate out any init code from what you want optimized and give Thorham a chunk of asm (together with provided inputs and relied-upon output from that chunk) optimization will be EZ Street! :great |
Quote:
In a C program you've got a very different environment than in an ASM program. Parameters are not passed in registers, references are via pointers rather than output registers, and you don't know your caller's register usage so you have to save whatever reg you use which isn't defined as a scratch reg. That's why I intend to rewrite the whole thing ; not only a few selected, profiled parts. Quote:
Once the caller is also rewritten, the wrapper disappears. If you look at my jpeg.s source, you'll see what I mean. However the input/output data formats need to be built for the optimized part, not the original C code, e.g. the fixed-point format must be defined by the asm writer. |
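To make the point about the fixed-point format concrete, here is a minimal C sketch of what "defining the format" means. The 16.16 layout and the names `FX_SHIFT`, `fx_from_int`, `fx_mul` and `fx_to_int` are assumptions picked for illustration only; they are not taken from jpeg.s or from the IJG code, and an asm version would probably choose a format that fits 16x16 multiplies.

```c
#include <stdint.h>

/* Hypothetical 16.16 format: 16 integer bits, 16 fraction bits.
   This is an illustrative choice, not the format used in jpeg.s. */
#define FX_SHIFT 16
#define FX_ONE   ((int32_t)1 << FX_SHIFT)

/* Convert a small integer to fixed point (multiply, so negatives are safe). */
int32_t fx_from_int(int32_t v)
{
    return v * FX_ONE;
}

/* Multiply two 16.16 values. The 64-bit intermediate keeps the full
   product before the scale is removed. Assumes >> on a negative value
   is an arithmetic shift, which is what common compilers do. */
int32_t fx_mul(int32_t a, int32_t b)
{
    return (int32_t)(((int64_t)a * b) >> FX_SHIFT);
}

/* Convert back to an integer, rounding to nearest. */
int32_t fx_to_int(int32_t v)
{
    return (v + FX_ONE / 2) >> FX_SHIFT;
}
```

For example, `fx_to_int(fx_mul(fx_from_int(3), FX_ONE / 2))` computes 3 * 0.5 = 1.5 and rounds it to 2. Whoever writes the optimized part picks the shift, and the caller has to feed data in that format.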
The main problem with such computations is that they're not integer and contain lots of multiplies. :banghead
There are two implementations you might want to take a look at:
- jidctint.c - fixed-point implementation, Amiga decoders generally use that one
- jidctflt.c - float implementation, simpler... but using floats
Of course, there are other algorithms... |
Quote:
Quote:
All that's needed to do the idct in asm is some test data to verify the output of the asm routine. Any suggestions? Quote:
|
Quote:
Quote:
And, you know, floating-point computations are slower than integer even if you have a math co-pro. :banghead Quote:
Or we could get some data in a real image, while debugging. Quote:
|
If I may offer advice (humbly, I assure you) there are some choices to be made:
- Converter for FPU users only?
- Otherwise, let libraries detect the FPU?
- Make FPU and non-FPU versions in asm.
- Code it all in integer (if 32 bits is not enough precision, the 68xxx has instructions to help with 64-bit math. A few examples are listed in the AsmOne docs, I believe.)

For myself, I'd do it in 32-bit integer. It works on all 68xxx Amigas and should be good enough for 8-bit RGB components.

When I offer asm services to the GP2x dev community, I prefer to keep it simple. The C coder selects a time-consuming snippet and sends me the asm from the binary; once I've noted which inputs it gets, I scrap that asm and write an exact optimized replica that generates the correct results. If I started looking at "how the C compiler does it in asm", I would be led astray ;)

On 68k, muls and mulu take 70 cycles, while add takes 8. divs takes 158 and divu takes 140 (all worst-case timings).

If you work from asm to asm, you don't really have to know formulas etc. If the C-generated asm calls some routine in some fp library, all you have to do is replace the call with asm that delivers the same result. If you find that it multiplies a*sin x with b*cos y, then do the same "your way". (As you see I haven't looked at jpeg algorithms, that was just an example ;)) The asm of the innermost loop and an exact functional description of the library functions used will get you a long way. :great |
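Given the gap between the multiply and add timings quoted above, one common trick is to replace a multiply by a known constant with shifts and adds. A small C sketch of the idea follows; `mul10` is a made-up name and the constant 10 is only an example, not a constant from any jpeg code. Whether it actually wins depends on the CPU and on how many bits are set in the constant.

```c
#include <stdint.h>

/* x * 10 rewritten as x*8 + x*2, i.e. two shifts and one add.
   The work is done on unsigned values so that shifting a negative
   input is well defined; converting back to int32_t assumes the usual
   two's complement behaviour. */
int32_t mul10(int32_t x)
{
    uint32_t u = (uint32_t)x;
    return (int32_t)((u << 3) + (u << 1));
}
```

In asm the same thing would be a couple of shift or add instructions on a data register instead of one muls.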
Meynaf, I've found some interesting stuff, considering I didn't search for hours :D
Some sites with seemingly in-depth explanations:
Exploring JPEG Image Compression - from DCT to Wavelets : A Review

Some papers, with, I think, very in-depth explanations:
www.eie.polyu.edu.hk/~enyhchan/mm_jpeg.pdf
www.es.ele.tue.nl/mininoc/doc/report_sander_stuijk.pdf
www.cse.fau.edu/~borko/Chapter8_mc.pdf

Interesting stuff, no doubt, but if it's not useful I'll have to try a little harder :D |
@Photon :
I really want to do it in pure integer; a floating-point version would be slower anyway and won't bring a visible quality difference (I think). There are several algorithms (for the dct part) that can be used, and I still don't know which will be best once reduced to (asm) code. It needs to be as fast as possible, yet with enough precision to give the highest quality. Knowing exactly what's to be done can lead to a very different architecture from what already exists, because it will be Amiga asm and not portable C. There are no such things as sin and cos; all of those are pre-computed and put in tables. What remains is a lot of adds, and a bunch of muls... oh, yes, so many muls :crazy

@Thorham : I've had a quick look at the links you provided, and found more and more math formulas and diagrams. :shocked Sure, it's interesting, but what would suit me better is an actual algorithm to perform a dct (which is the tricky part; the rest is more or less integer and can (and will?) be done from the actual C sources, once I know which fixed-point format to use). Oh, well, ok. Maybe I can just compile jidctint.c, look at the asm and try to optimize it :guru |
Quote:
Anyway, if it's source code you want, then that shouldn't be too hard to find. Unfortunately, a lot of jpeg explanations are presented in the mathematical format, which is useless if you can't read it. I've had the same problem years ago, when I wanted to know how jpeg works. |
Quote:
Quote:
Things such as sqr/sin/cos are to be avoided at all costs, and muls should be reduced to the bare minimum. Otherwise we'll end up with something sloooooow :shocked Quote:
|