jpeg decoding in full asm - Page 2

Thorham · 13 December 2007, 17:50

Quote:

Originally Posted by meynaf

I already know what's going on. The main problem for me is that dct trick, I know what it does, but I want the quickest way to do it without significantly losing precision - and the haskell code, not quite readable, apparently contains direct square root and cosine computations, which I do not want to do.

The haskell code does suck, doesn't it? And I don't suppose you want to make your own algorithm based on the formula

Quote:

Originally Posted by meynaf

Hopefully you know what I feel about those formulas

The same as me, I suppose. Math is pretty much un-cool, and it always looks like it could be done just about 10 million times simpler.

Quote:

Originally Posted by meynaf

Maybe not source code, but a detailed algorithm to efficiently perform the computations. Of course, a correctly commented source code will do

Things such as sqr/sin/cos are to be avoided at all costs, muls should be reduced to the bare minimum. Else we'll end up with something sloooooow

I guess I'll try again then. How difficult is it to find what you need on the net

meynaf · 13 December 2007, 18:13

Quote:

Originally Posted by Thorham

The haskell code does suck, doesn't it? And I don't suppose you want to make your own algorithm based on the formula

I don't know if it sucks or not, but I surely don't want to make my own algorithm on the formula !

Quote:

Originally Posted by Thorham

The same as me, I suppose. Math is pretty much un-cool, and it always looks like it could be done just about 10 million times simpler.

I can't imagine writing a cosine calculation in integer maths

Quote:

Originally Posted by Thorham

I guess I'll try again then. How difficult is it to find what you need on the net

I'm not sure if it will be useful right now, 'coz I finally dared to start the rewrite of the integer dct in asm, from jidctint.c (I prefer to convert c into asm, rather than compiling c to optimize the asm it produces).
Well, it turned out not to be that hard (once all those constants and macros have been replaced by what they mean).

But... muls, muls, muls and more muls (might remind you of something you've read recently).
Oh, and muls again.
Did I forget muls ?

What I have now isn't a carbon copy of the original code, I had to move things to reduce register usage.
There is no init/exit code for now, and I only have the first half (columns).

I've included it here, so that you can have a look at it.
Not quite optimized yet, there is some unneeded data movement.

Of course I dunno if it works

Thorham · 14 December 2007, 16:55

Quote:

Originally Posted by meynaf

I'm not sure if it will be useful right now, 'coz I finally dared to start the rewrite of the integer dct in asm, from jidctint.c

Good luck. I've taken a look at the c code, and it shouldn't be that hard. Seems perfectly doable.

Quote:

Originally Posted by meynaf

(I prefer to convert c into asm, rather than compiling c to optimize the asm it produces).

Oh yeah, that is just much better, as everything gets named properly and you can easily comment everything.

Quote:

Originally Posted by meynaf

But... muls, muls, muls and more muls (might remind you of something you've read recently).
Oh, and muls again.
Did I forget muls ?

That's just completely uncool. It will be very hard to reduce the number of muls, if it's possible in the first place.

Quote:

Originally Posted by meynaf

What I have now isn't a carbon copy of the original code, I had to move things to reduce register usage.
There is no init/exit code for now, and I only have the first half (columns).

I've included it here, so that you can have a look at it.
Not quite optimized yet, there is some unneeded data movement.

Looking good! Is it my imagination, or does the asm look a lot cleaner than the c code? Anyway, that code looks like it's going to be pretty good. So keep up the good work

Quote:

Originally Posted by meynaf

Of course I dunno if it works

Impossible to say at this stage

meynaf · 14 December 2007, 17:25

Quote:

Originally Posted by Thorham

Good luck. I've taken a look at the c code, and it shouldn't be that hard. Seems perfectly doable.

That might take some time before I come out with something usable, so don't worry if I remain silent for a few days

Quote:

Originally Posted by Thorham

Oh yeah, that is just much better, as everything gets named properly and you can easily comment everything.

And you know who to blame if it doesn't work...

Quote:

Originally Posted by Thorham

That's just completely uncool. It will be very hard to reduce the number of muls, if it's possible in the first place.

Don't look at the muls, their number has already been reduced to its minimum by choosing the algorithm. Same goes for the adds (I think).
But there can be some unneeded moves, and it is possible that the register usage can be reduced as well.
I also strongly doubt it could be useful to replace those muls by tables, because there are just too many different constants.

Quote:

Originally Posted by Thorham

Looking good! Is it my imagination, or does the asm look a lot cleaner than the c code? Anyway, that code looks like it's going to be pretty good. So keep up the good work

Not hard to look a lot cleaner, as c code is always dirty

Quote:

Originally Posted by Thorham

Impossible to say at this stage

Oh, you didn't see a bug already ? Curious

Thorham · 17 December 2007, 16:18

Quote:

Originally Posted by meynaf

That might take some time before I come out with something usable, so don't worry if I remain silent for a few days

I hope it's been a productive few days

Productivity on my side of the net has been zero, because I've had a nasty cold (which is still not completely over).

Quote:

Originally Posted by meynaf

And you know who to blame if it doesn't work...

It's an advantage if every mistake is your own. No one else to rely on.

Quote:

Originally Posted by meynaf

Don't look at the muls, their number has already been reduced to its minimum by choosing the algorithm. Same goes for the adds (I think).
But there can be some unneeded moves, and it is possible that the register usage can be reduced as well.
I also strongly doubt it could be useful to replace those muls by tables, because there are just too many different constants.

That's what I was thinking, too. Ridding yourself of those muls is probably not possible. The numbers they multiply can't be optimized to a bunch of adds, either, so you're definitely right. Still a shame, though...

Quote:

Originally Posted by meynaf

Not hard to look a lot cleaner, as c code is always dirty

You got that right

Quote:

Originally Posted by meynaf

Oh, you didn't see a bug already ? Curious

So am I. Also looking forward to what you did this weekend.

meynaf · 21 December 2007, 10:47

Quote:

Originally Posted by Thorham

I hope it's been a productive few days

Productivity on my side of the net has been zero, because I've had a nasty cold (which is still not completely over).

I've had one too. Still coughing a little, but nothing more.

Quote:

Originally Posted by Thorham

It's an advantage if every mistake is your own. No one else to rely on.

Not entirely an advantage : you also have no one else to blame

Quote:

Originally Posted by Thorham

That's what I was thinking, too. Ridding yourself of those muls is probably not possible. The numbers they multiply can't be optimized to a bunch of adds, either, so you're definitely right. Still a shame, though...

Multiplies that could be removed already have been, thanks to the IJG... The constants are all derived from cosine/square root stuff, so, yes, they look like random values.

Quote:

Originally Posted by Thorham

So am I. Also looking forward to what you did this weekend.

I finished it, but I found a mistake (register confusion) in the code.
This :

Code:

move.w d6,d1
add.w d7,d1
muls #9633,d1

muls #-16069,d6
muls #-3196,d7
add.l d1,d6
add.l d1,d7

Should be replaced by this :

Code:

move.w d6,d4
add.w d7,d4
muls #9633,d4

muls #-16069,d6
muls #-3196,d7
add.l d4,d6
add.l d4,d7

Now it works. Gone from 233 to 197 frames for my test image
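For readers who wonder where constants like 9633 come from, here is a rough Python sketch of the same three-multiply step. It assumes the constants are real factors scaled by 2^13 (13 fractional bits); the names `CONST_BITS`, `fix`, `rotate`, `z3`, `z4` and `z5` are labels chosen for this illustration, not something taken from the asm listing above.

```python
CONST_BITS = 13  # assumed number of fractional bits in the fixed-point constants


def fix(x):
    """Scale a real constant to a fixed-point integer with CONST_BITS fractional bits."""
    return int(x * (1 << CONST_BITS) + 0.5)


FIX_1_175875602 = fix(1.175875602)  # 9633
FIX_1_961570560 = fix(1.961570560)  # 16069
FIX_0_390180644 = fix(0.390180644)  # 3196


def rotate(z3, z4):
    """Mirror of the corrected asm: the (z3 + z4) product is computed once and shared."""
    z5 = (z3 + z4) * FIX_1_175875602   # needs its own temporary (d4 in the fixed code)
    z3 = z3 * -FIX_1_961570560 + z5
    z4 = z4 * -FIX_0_390180644 + z5
    return z3, z4
```

The point of the bug fix is visible here too: `z5` must live in a temporary that holds nothing else of value, otherwise another intermediate result gets overwritten.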

If you want to have a look at it, it's in the zone, along with all modified files.

In the archive you'll find jidctint.s - the asm version of jidctint.c, which now is nothing but a wrapper for the asm version.

You'll also find a pre-compiled version ; after the c-code for dct has vanished the exe's size has dropped.

That code is probably tougher than the ham code you're used to, so to make things easier I've kept some (modified) c code as comments, and translated my comments for you.

Hint : try to free regs by moving things around, that is, output something right after it is computed, to free a reg for the next computation.
Then there could be a lot of possible opts if you have free regs (not only Dn but also An).

meynaf · 21 December 2007, 14:20

For the next step - the colorspace conversion may be my next victim - I'm looking for jpeg files with unusual color spaces, to test them, and to check whether they're worth supporting or not (certainly not if they are extremely rare).
Can someone fire up a photoshop and save RGB/YCCK/CMYK encoded jpeg files for me ? (as I don't have photoshop and those are apparently adobe specific)

Please...

Thorham · 22 December 2007, 23:02

Quote:

Originally Posted by meynaf

Now it works. Gone from 233 to 197 frames for my test image

Impressive

Quote:

Originally Posted by meynaf

If you want to have a look at it, it's in the zone, along with all modified files.

Thanks man

Quote:

Originally Posted by meynaf

That code is probably tougher than the ham code you're used to, so to make things easier I've kept some (modified) c code as comments, and translated my comments for you.

Yes, it is. But that's only logical, considering the ham renderer is mathematically a lot simpler than an idct routine. Still a shame that I don't understand those (i)dct formulas.

Thanks for the translation, greatly appreciated

Using those translators works, but they are annoying to use, and of course, having to fill in the context manually doesn't always help, either.

Quote:

Originally Posted by meynaf

Hint : try to free regs by moving things around, that is, output something right after it is computed, to free a reg for the next computation.
Then there could be a lot of possible opts if you have free regs (not only Dn but also An).

Oh yes, I can definitely give that a try.

I guess it was a bad idea to search for a full explanation. Even when you do completely understand the subject, it's going to be very tough to optimize the idct routine.

Anyway, good job, and great looking code. Keep up the good work.

Quote:

Originally Posted by meynaf

For the next step - the colorspace conversion may be my next victim - I'm looking for jpeg files with unusual color spaces, to test them, and to check whether they're worth supporting or not (certainly not if they are extremely rare).
Can someone fire up a photoshop and save RGB/YCCK/CMYK encoded jpeg files for me ? (as I don't have photoshop and those are apparently adobe specific)

Please...

Aren't all jpeg images, except gray scale, yuv color space encoded? Furthermore, those adobe specific formats are probably very rare indeed, so if you ask me, only gray scale and yuv are needed (as far as I can tell from the explanations). But hey, that's just my opinion

meynaf · 24 December 2007, 10:29

Quote:

Originally Posted by Thorham

Yes, it is. But that's only logical, considering the ham renderer is mathematically a lot simpler than an idct routine. Still a shame that I don't understand those (i)dct formulas.

What you have here is the algorithm from C. Loeffler, A. Ligtenberg and G. Moschytz (LL&M). But don't ask me more about it

If someone wants to optimize that, then it's better to just do what they do, regardless of what it mathematically means.
As in the original IJG code, it must also perform the dequantization.

Maybe the comments in jidctint.c (the original one) can be useful for you.

Quote:

Originally Posted by Thorham

Thanks for the translation, greatly appreciated

Using those translators works, but they are annoying to use, and of course, having to fill in the context manually doesn't always help, either.
Oh yes, I can definitely give that a try.

There is some important register pressure in here, but maybe it's possible to free one (I already did it with the variable z5). Did you find something already ?

Quote:

Originally Posted by Thorham

I guess it was a bad idea to search for a full explanation. Even when you do completely understand the subject, it's going to be very tough to optimize the idct routine.

The basis is that the dct is (a particular case of) a Fourier transform, and its inverse is the idct. The dct takes you from spatial values to frequencies (well, here we're going from frequencies back to values).
In jpegs, the high frequencies are stored with less precision (-> fewer bits) than the lower ones, because they are less visible. That's why jpegs can look somewhat blurred.
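To make the idea concrete, here is a small floating-point sketch in Python of an 8-point dct, its inverse, and a coarse quantization step in between. It is only an illustration of the principle, not the fast integer algorithm discussed in this thread; the `QUANT` step sizes and the sample `row` are made-up values.

```python
import math

N = 8  # jpeg works on blocks of 8 samples per row/column


def _scale(k):
    # Orthonormal scaling, so that idct(dct(x)) returns x.
    return math.sqrt(1 / N) if k == 0 else math.sqrt(2 / N)


def dct(samples):
    """Spatial values -> frequency coefficients (DCT-II)."""
    return [_scale(k) * sum(s * math.cos((2 * n + 1) * k * math.pi / (2 * N))
                            for n, s in enumerate(samples))
            for k in range(N)]


def idct(coeffs):
    """Frequency coefficients -> spatial values."""
    return [sum(_scale(k) * c * math.cos((2 * n + 1) * k * math.pi / (2 * N))
                for k, c in enumerate(coeffs))
            for n in range(N)]


# Hypothetical step sizes: higher frequencies get coarser steps (fewer bits).
QUANT = [4, 6, 8, 12, 16, 24, 32, 48]


def quantize(coeffs):
    return [round(c / q) for c, q in zip(coeffs, QUANT)]


def dequantize(levels):
    return [level * q for level, q in zip(levels, QUANT)]


row = [52, 55, 61, 66, 70, 61, 64, 73]
restored = idct(dequantize(quantize(dct(row))))
```

Without the quantize/dequantize step the round trip is exact up to floating-point error; with it, `restored` is close to `row` but not identical, which is where the loss (and the saving) comes from.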

Quote:

Originally Posted by Thorham

Anyway, good job, and great looking code. Keep up the good work.

I sure will

Quote:

Originally Posted by Thorham

Aren't all jpeg images, except gray scale, yuv color space encoded? Furthermore, those adobe specific formats are probably very rare indeed, so if you ask me, only gray scale and yuv are needed (as far as I can tell from the explanations). But hey, that's just my opinion

Most jpegs are yuv encoded (though they call it YCbCr in the code), but the format itself supports a bigger set of color spaces (as I've seen in the code).
However if nothing using them can be found, then they're not worth supporting in my asm code (except by throwing an error message in the face of the unfortunate user who accidentally stumbled upon such a file).

Thorham · 24 December 2007, 15:38

Quote:

Originally Posted by meynaf

There is some important register pressure in here, but maybe it's possible to free one (I already did it with the variable z5). Did you find something already ?

Yes, I think I did. It seems that the part which writes the columns can have addi.l #1<<10,d7 replaced by add.l d4,d7 if you add move.l #1024,d4 to the code, as d4 isn't used in this part of the routine, and gets set to another value in the beginning of the first loop. This is the first thing I've found, and I'm still looking for more.

Quote:

Originally Posted by meynaf

Most jpegs are yuv encoded (though they call it YCbCr in the code), but the format itself supports a bigger set of color spaces (as I've seen in the code).
However if nothing using them can be found, then they're not worth supporting in my asm code (except by throwing an error message in the face of the unfortunate user who accidentally stumbled upon such a file).

I'm completely convinced most jpegs are yuv encoded, and that you really don't need any other color spaces except gray, because that's what I keep reading in jpeg docs. Y'CbCr is just the digital variant of yuv, as yuv is for analog video.

I'll let you know if I find more, and I'm pretty sure I will, since the first versions of a piece of code are usually not completely optimized.

alexh · 24 December 2007, 15:50

Quote:

Originally Posted by Thorham

Y'CbCr is just the digital variant of yuv, as yuv is for analog video.

Wow, someone who actually knows the truth. I've been telling people this for years but they never listen.

Charles Poynton is my hero when it comes to this stuff.

Thorham · 24 December 2007, 16:03

Quote:

Originally Posted by alexh

Wow, someone who actually knows the truth. I've been telling people this for years but they never listen.

Really

Even wikipedia's explanation about yuv clearly states this. It's amazing how people can just dismiss these facts.

Quote:

Originally Posted by alexh

Charles Poynton is my hero when it comes to this stuff.

Thanks for the name

I've been interested in this kind of stuff for a while, and he seems to have a pretty cool site about video related stuff. Great!

meynaf · 24 December 2007, 16:14

Does that guy have asm code for YCbCr -> RGB conversion ?

What, no 68k version ? Well, ok, I'll do it... That's the next thing I've spotted that's not too difficult and can give us an important speed increase.

I promise I won't use the term yuv if it's digital video

Thorham · 29 December 2007, 02:18

Quote:

Originally Posted by meynaf

Does that guy have asm code for YCbCr -> RGB conversion ?

What, no 68k version ? Well, ok, I'll do it... That's the next thing I've spotted that's not too difficult and can give us an important speed increase.

I promise I won't use the term yuv if it's digital video

That code is a cake walk, as you've probably found out by now

I've done a version in free basic, and it was pretty simple.

Going off-topic a bit now. Ultimately I'm still wondering if there isn't a plain and simple way to effectively crunch gfx, something which doesn't require 'advanced' math, and can be implemented algorithmically. Surely something is possible, it's not as if everything has been thought of in the wonderful world of algorithms (your ham rendering engine seems to be a good example, haven't seen it before).

meynaf · 04 January 2008, 11:12

Of course the YCbCr->RGB conversion is simple. But it becomes more interesting if you try to do it without multiplies

The problem of gfx data is that it doesn't crunch very well. The more colourful it is, the worse the compression will be.
The one and only thing I see here is that we could exploit the fact that adjacent pixels are often close in color ; they're closer in YCbCr than in RGB but changing color spaces is a lossy process because of rounding errors.

My ham engine is more standard issue than it looks. I'm pretty sure nearly all high quality renderers use the very same algorithm.

Thorham · 04 January 2008, 18:36

Quote:

Originally Posted by meynaf

Of course the YCbCr->RGB conversion is simple. But it becomes more interesting if you try to do it without multiplies

Now that's a very good idea, meynaf. Can't wait to try it! Of course I'm going to try it in freebasic first (more convenient because of easy to use gfx functions which all just work in 24bit).

Quote:

Originally Posted by meynaf

The problem of gfx data is that it doesn't crunch very well. The more colourful it is, the worse the compression will be.

True, but it also applies to gray scale: The more visible detail the less compression.

Quote:

Originally Posted by meynaf

The one and only thing I see here is that we could exploit the fact that adjacent pixels are often close in color ; they're closer in YCbCr than in RGB but changing color spaces is a lossy process because of rounding errors.

That's true. There is a problem with this, though. I've tried (in freebasic, what else) scaling the color information by 50%x50%, then, even with interpolation, the final rendered image is not as good as the original because aliasing is introduced. Could be fixed by somehow not scaling the whole image, just parts of it. And, of course, my interpolation algorithm isn't the best, although it does yield better results than none at all.

Quote:

Originally Posted by meynaf

My ham engine is more standard issue than it looks. I'm pretty sure nearly all high quality renderers use the very same algorithm.

Oh, I hadn't realized this. Probably because I use Adpro as a quality reference, and this program is terribly slow, as you know, while your program is very fast

meynaf · 14 January 2008, 15:30

Quote:

Originally Posted by Thorham

Now that's a very good idea, meynaf. Can't wait to try it! Of course I'm going to try it in freebasic first (more convenient because of easy to use gfx functions which all just work in 24bit).

You could do it right in asm, because else it's already done. Look in jdcolor.c to see how.

I tried this already, but it was quite disappointing. I ran into a lot of bugs (you can't imagine the -ahem- beautiful images I've seen) and it didn't give a good speed gain. Too many tables to peek (4), 3 data sources for 1 destination -> too many address registers used -> too much swapping. Gosh !
Maybe you'll be luckier if you give it a try...
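As a rough picture of what a four-table conversion looks like, here is a Python sketch loosely modeled on the table approach mentioned above. The table names, the 16-bit scaling (`SCALEBITS`) and the `clamp` helper are assumptions made for this illustration; the coefficients are the usual YCbCr -> RGB factors.

```python
SCALEBITS = 16                    # assumed fixed-point precision of the green tables
ONE_HALF = 1 << (SCALEBITS - 1)   # rounding term


def fix(x):
    """Scale a real coefficient to a SCALEBITS fixed-point integer."""
    return int(x * (1 << SCALEBITS) + 0.5)


# One entry per chroma byte; (i - 128) recentres the chroma value on zero.
CR_R = [(fix(1.40200) * (i - 128) + ONE_HALF) >> SCALEBITS for i in range(256)]
CB_B = [(fix(1.77200) * (i - 128) + ONE_HALF) >> SCALEBITS for i in range(256)]
CR_G = [-fix(0.71414) * (i - 128) for i in range(256)]             # kept scaled
CB_G = [-fix(0.34414) * (i - 128) + ONE_HALF for i in range(256)]  # rounding folded in


def clamp(v):
    """Limit a value to the 0..255 byte range."""
    return 0 if v < 0 else 255 if v > 255 else v


def ycc_to_rgb(y, cb, cr):
    """Convert one pixel using only table lookups, adds and one shift."""
    r = y + CR_R[cr]
    g = y + ((CB_G[cb] + CR_G[cr]) >> SCALEBITS)  # green sums two scaled terms first
    b = y + CB_B[cb]
    return clamp(r), clamp(g), clamp(b)
```

All multiplies happen once, when the tables are built. The price is what the post describes: four lookups per pixel and three input streams competing for address registers.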

Quote:

Originally Posted by Thorham

True, but it also applies to gray scale: The more visible detail the less compression.

Of course.

Quote:

Originally Posted by Thorham

That's true. There is a problem with this, though. I've tried (in freebasic, what else) scaling the color information by 50%x50%, then, even with interpolation, the final rendered image is not as good as the original because aliasing is introduced. Could be fixed by somehow not scaling the whole image, just parts of it. And, of course, my interpolation algorithm isn't the best, although it does yield better results than none at all.

You're doing here something jpeg already does. They're using a triangle filter for upsampling, maybe you can try that too (look in jdsample.c for more info).
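A triangle filter for doubling the width of a chroma row can be sketched in a few lines of Python. Each output sample takes 3/4 of the nearest input sample and 1/4 of the next nearest one; the exact rounding terms and the edge handling used here are assumptions for this illustration, and `upsample_h2` is a made-up name.

```python
def upsample_h2(row):
    """Double a row's width with 3/4 + 1/4 weights (a triangle filter)."""
    n = len(row)
    out = []
    for i, cur in enumerate(row):
        # At the edges there is no neighbour, so the sample itself is reused.
        left = row[i - 1] if i > 0 else cur
        right = row[i + 1] if i < n - 1 else cur
        out.append((3 * cur + left + 1) >> 2)   # output nearer the left neighbour
        out.append((3 * cur + right + 2) >> 2)  # output nearer the right neighbour
    return out
```

For example, `upsample_h2([0, 100])` gives `[0, 25, 75, 100]`: a smooth ramp instead of the two flat steps that plain pixel doubling would produce. Only shifts and adds are needed, since 3*x is x+x+x.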

I'm more interested in lossless compression, though. What if you could predict what the image will be by computing it with whatever you've already decoded, and only store the difference between the reality and your prediction ?
(this has already been applied to audio, but afaik not to gfx data)
(and, oh, yes, I don't have a clue on the predictors to use)
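The simplest possible predictor is "the next sample equals the previous one". A minimal Python sketch of that idea follows, assuming 8-bit samples in a single row; `encode` and `decode` are names invented for this example, and a real scheme would use a better predictor plus an entropy coder on the residuals.

```python
def encode(pixels):
    """Store each sample as its difference from the previous one (modulo 256)."""
    prev = 0
    residuals = []
    for p in pixels:
        residuals.append((p - prev) & 0xFF)  # prediction error, wrapped to a byte
        prev = p
    return residuals


def decode(residuals):
    """Rebuild the samples by adding each residual to the running prediction."""
    prev = 0
    pixels = []
    for r in residuals:
        prev = (prev + r) & 0xFF
        pixels.append(prev)
    return pixels
```

On smooth image data the residuals cluster around zero, for example `encode([100, 101, 103, 103])` gives `[100, 1, 2, 0]`, and a run of small numbers crunches much better than the raw samples. The process is lossless because `decode` makes exactly the same predictions as `encode`.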

Quote:

Originally Posted by Thorham

Oh, I hadn't realized this. Probably because I use Adpro as a quality reference, and this program is terribly slow, as you know, while your program is very fast

FastJpeg also uses a similar algorithm, but probably not Visage because its rendering is quite ugly (though it is fast).

Adpro probably makes a lot of analysis to adapt its palette before rendering, which is terribly slow (that's why I didn't want to do it too).
Furthermore, if it's 100% compiled code then it's likely to be up to 4x slower...

Answer here for the viewer options (off-topic in the mpega thread) : just list a few here and we'll see. Maybe they're already planned.

About the scaling, there is something that annoys me quite a lot : what to do on a palettized display ? Skipping pixels will be ugly and we don't have enough colors to get all the rgb combinations an average would make, but ham display may be even uglier than pixel skipping on some images.

Thorham · 15 January 2008, 03:50

Quote:

Originally Posted by meynaf

You could do it right in asm, because else it's already done. Look in jdcolor.c to see how.

I tried this already, but it was quite disappointing. I ran into a lot of bugs (you can't imagine the -ahem- beautiful images I've seen) and it didn't give a good speed gain. Too many tables to peek (4), 3 data sources for 1 destination -> too many address registers used -> too much swapping. Gosh !
Maybe you'll be luckier if you give it a try...

I gave the color space conversion a go, and this is what I came up with:

Code:

    move.l In_Y,a0
    move.l In_CB,a1
    move.l In_CR,a2
    move.l Out,a3
    move.l CB_Table,a4
    move.l CR_Table,a5
    move.l Range_Table,a6
    move.l YCBCR_Buf_Size,d5
    
;Free regs: d6 and d7

.lp moveq  #0,d0
    moveq  #0,d1
    moveq  #0,d2

    move.b (a0)+,d0      ;Y
    move.b (a1)+,d1      ;CB
    move.b (a2)+,d2      ;CR

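    move.b (a4,d1.l),d1  ;0.34414*(Cb-128)
    move.b (a5,d2.l),d2  ;0.71414*(Cr-128)
                         ;Note: table entries are signed bytes, so d1/d2 still
                         ;need sign extension before the long adds/subs below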

    move.l d0,d4         ;Calc green
    sub.l  d1,d4
    sub.l  d2,d4
    move.b (a6,d4.l),d4  ;Clip green
    move.b d4,(a3)+      ;Write green

    move.l d1,d4         ;Calc blue (uses the Cb term in d1)
    add.l  d1,d1         ;5*0.34414*Cb=1.7207*Cb instead of 1.772*Cb
    add.l  d1,d1
    add.l  d1,d4
    add.l  d0,d4
    move.b (a6,d4.l),d4  ;Clip blue
    move.b d4,(a3)+      ;Write blue

    add.l  d2,d0         ;Calc red (uses the Cr term in d2)
    add.l  d2,d0         ;2*0.71414*Cr=1.42828*Cr instead of 1.402*Cr
    move.b (a6,d0.l),d0  ;Clip red
    move.b d0,(a3)+      ;Write red

    subq.l #1,d5
    bpl    .lp

The code is pretty simple. It uses two multiplication tables each with a subtraction (-128). It then just calculates green first and uses the old Cb and Cr values again by multiplying them by respectively 5 and 2. This gives a good approximation of the values needed to calculate red and blue, and gets rid of some table reading. After that the rgb values just need to be clipped to fit in the range 0-255. Like the c code, I use a table for this, but it might be faster to use compares instead. I don't know, because I can't test it!

About the approximations: I've tested these with my YCbCr program in freebasic (damned handy), and the color differences are quite small, meaning the images look great, you literally have to see the original next to the encoded version to see any difference, otherwise you'd think it's the original! I've tested it with a straight gray scale in rgb as well, and there is no difference whatsoever. If this code is faster than what you've tried, it's perfect for fast viewing in high quality.
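The size of the approximation error can be estimated with a short Python sketch. It assumes the two tables hold 0.34414*(Cb-128) and 0.71414*(Cr-128) rounded to whole numbers, and compares 5x and 2x those entries against the exact 1.772 and 1.402 products; `worst_error` is a name made up for this check.

```python
def worst_error(table_coeff, multiplier, exact_coeff):
    """Largest gap between multiplier * table entry and the exact product."""
    worst = 0.0
    for c in range(256):
        d = c - 128                                   # chroma centred on zero
        approx = multiplier * round(table_coeff * d)  # what the reused table gives
        worst = max(worst, abs(approx - exact_coeff * d))
    return worst


blue_err = worst_error(0.34414, 5, 1.772)  # 5 * 0.34414 = 1.7207 instead of 1.772
red_err = worst_error(0.71414, 2, 1.402)   # 2 * 0.71414 = 1.42828 instead of 1.402
```

Under these assumptions the error stays below roughly 9 levels for blue and 4.4 for red, and it is exactly zero at Cb = Cr = 128, which matches the observation that a pure gray scale comes out unchanged. The largest errors only occur for strongly saturated colors.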

However, you will have to integrate the code yourself, and although I've tested the approximation, the asm code itself is untested and may contain a bug here and there. Nothing serious, though. Should be easy to fix if there are any. Also due to data format differences the code may not work as is, but I suppose this should still give you a good idea of what can be done.

Quote:

Originally Posted by meynaf

You're doing here something jpeg already does. They're using a triangle filter for upsampling, maybe you can try that too (look in jdsample.c for more info).

I'm more interested in lossless compression, though. What if you could predict what the image will be by computing it with whatever you've already decoded, and only store the difference between the reality and your prediction ?
(this has already been applied to audio, but afaik not to gfx data)
(and, oh, yes, I don't have a clue on the predictors to use)

Ah, triangular interpolation eh? I'm going to do some yahooing on that one.

Lossless is indeed interesting as an extra option. Trying to predict the data is interesting, too. Hadn't thought of that. I'm going to have to see if I can come up with anything.

Quote:

Originally Posted by meynaf

Adpro probably makes a lot of analysis to adapt its palette before rendering, which is terribly slow (that's why I didn't want to do it too).
Furthermore, if it's 100% compiled code then it's likely to be up to 4x slower...

Yes, it does. And it's indeed compiled as far as I know, with a compiler from 1992... Yep, it doesn't get any slower than that

Quote:

Originally Posted by meynaf

Answer here for the viewer options (off-topic in the mpega thread) : just list a few here and we'll see. Maybe they're already planned.

As I think of options, I'll post them here. No problem.

Quote:

Originally Posted by meynaf

About the scaling, there is something that annoys me quite a lot : what to do on a palettized display ? Skipping pixels will be ugly and we don't have enough colors to get all the rgb combinations an average would make, but ham display may be even uglier than pixel skipping on some images.

Skipping just sucks rocks. One way, is to convert the data to rgb while scaling, then just render to ham. Should be ok. Another one is to do the same and count how many times each color is used during scaling. Then quick sort the table. Since the table is max 256 entries, this should be fast. Once that's done, use the 64 most frequent colors as the ham palette. Obviously, the first method is faster, but will never look as good as the original 256 color image. I do doubt it will look bad, though. This is the best I can come up with, since ham will be the only way, unless you want to do high quality rgb to 256 color conversion, which will never be as fast.
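The second method (count, sort, keep the most used colors) is easy to sketch in Python. Here `pixels` is assumed to be a flat list of color values after scaling, and `popular_palette` is a made-up name; the sort is left to the standard library instead of a hand-written quick sort.

```python
from collections import Counter


def popular_palette(pixels, size=64):
    """Return the `size` most frequently used colours, most used first."""
    counts = Counter(pixels)  # one counter per distinct colour
    return [colour for colour, _ in counts.most_common(size)]
```

With at most 256 distinct colors the counting pass over the pixels dominates the cost; the sort itself is negligible.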

meynaf · 15 January 2008, 10:58

Quote:

Originally Posted by Thorham

I gave the color space conversion a go, and this is what I came up with

You came up with something very interesting at your first try

Quote:

Originally Posted by Thorham

The code is pretty simple. It uses two multiplication tables each with a subtraction (-128). It then just calculates green first and uses the old Cb and Cr values again by multiplying them by respectively 5 and 2. This gives a good approximation of the values needed to calculate red and blue, and gets rid of some table reading. After that the rgb values just need to be clipped to fit in the range 0-255.

Hmm... I admit I don't like losses... I noted that there is further accuracy loss as compared to the original code, because the adds for green pixels (a*Cb + b*Cr) were done with 32-bit fixed-point values, not bytes.

Quote:

Originally Posted by Thorham

Like the c code, I use a table for this, but it might be faster to use compares instead. I don't know, because I can't test it!

I didn't really test it, but from the timings I know, compares would be slower. Or can you do a range-limit in less than 11 clock cycles ???
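For anyone unfamiliar with the range-limit table trick, here is a minimal Python sketch. The table is padded on both sides so that out-of-range results can be used directly as an index; the margin of 256 entries (`OFFSET`) is an assumption for this example, and in asm the offset is absorbed by pointing the base register into the middle of the table.

```python
OFFSET = 256  # assumed margin: how far below 0 or above 255 a result may stray

# Entries below OFFSET clamp to 0, the next 256 are the identity,
# and everything above clamps to 255.
RANGE_LIMIT = [0] * OFFSET + list(range(256)) + [255] * OFFSET


def limit(v):
    """Clamp v to 0..255 with a single table lookup, no compares."""
    return RANGE_LIMIT[v + OFFSET]
```

So `limit(-5)` is 0, `limit(200)` is 200 and `limit(300)` is 255. One indexed read replaces two compares and two conditional branches, which is why the table tends to win.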

Quote:

Originally Posted by Thorham

About the approximations: I've tested these with my YCbCr program in freebasic (damned handy), and the color differences are quite small, meaning the images look great, you literally have to see the original next to the encoded version to see any difference, otherwise you'd think it's the original! I've tested it with a straight gray scale in rgb as well, and there is no difference whatsoever. If this code is faster than what you've tried, it's perfect for fast viewing in high quality.

To be acceptable, such a loss must make the thing much faster.
I've counted clock cycles (including pipeline) and you should get something like 123 of them per pixel.
My actual code runs in 125 or so, with full accuracy. Not worth changing.
But you can get rid of 8 of them, going down to 115, if you replace :
- subq/bpl by a dbf (-2)
- move.b (a6,dn.l),dn / move.b dn,(a3)+ by move.b (a6,dn.l),(a3)+ (-2, 3 times)

However it's still not enough IMHO (a bit less than 10%, but much less for the overall speed). A 40% gain could be good though.

Quote:

Originally Posted by Thorham

However, you will have to integrate the code yourself, and although I've tested the approximation, the asm code itself is untested and may contain a bug here and there. Nothing serious, though. Should be easy to fix if there are any. Also due to data format differences the code may not work as is, but I suppose this should still give you a good idea of what can be done.

The data formats are the same, except that you must write red, then green, then blue, not red last.

Here is my code, should you spot something that can be done to accelerate it :

Code:

; parameters :
;    a0=input_buf[0]+input_row
;    a1=input_buf[1]+input_row
;    a2=input_buf[2]+input_row
;    a3=output_buf
;    d7=num_rows
;    d6=cinfo->output_width
ycc_rgb_convert
 subq.w #1,d6
 moveq #0,d0
 move.l #$100,d1     ; with *8, we will go to $800 bytes after the 1st array
 moveq #0,d2         ; (which will make us point on the 2nd array)
.yloop
 move.l d6,d5
 move.l (a0)+,a4
 move.l (a1)+,a5
 move.l (a2)+,a6
 movem.l a0-a3,-(a7)
 lea cxtab,a0
 lea range_limit2+$180,a2
 move.l (a3),a3
.xloop
; inner loop
 move.b (a4)+,d0
 move.b (a5)+,d1
 move.b (a6)+,d2
 lea (a0,d2.w*8),a1
 move.l (a1)+,d4
 move.l (a1),d3
 add.l d0,d3
 move.b (a2,d3.w),(a3)+
 lea (a0,d1.w*8),a1
 add.l (a1)+,d2
 add.l d4,d2
 swap d2
 add.w d0,d2
 move.b (a2,d2.w),(a3)+
 move.l d0,d3
 moveq #0,d2
 add.l (a1),d3
 move.b (a2,d3.w),(a3)+
 dbf d5,.xloop
; end of inner loop
 movem.l (a7)+,a0-a3
 addq.l #4,a3
 subq.l #1,d7
 bne.s .yloop
 rts

Note : I'm using 4 arrays, of which 2 are interleaved, all with the same pointer. Of course only the inner loop really has to be optimized.
Other note : this one has been tested and works. But it doesn't give as much gain as we could have expected...

Quote:

Originally Posted by Thorham

Ah, triangular interpolation eh? I'm going to do some yahooing on that one.

I've started this one in asm. Very funny to do (*cough*).

Quote:

Originally Posted by Thorham

Lossless is indeed interesting as an extra option. Trying to predict the data is interesting, too. Hadn't thought of that. I'm going to have to see if I can come up with anything.

Sure, this is an area where little has been done.

Quote:

Originally Posted by Thorham

Yes, it does. And it's indeed compiled as far as I know, with a compiler from 1992... Yep, it doesn't get any slower than that

I'm sure it can be made slower

Quote:

Originally Posted by Thorham

As I think of options, I'll post them here. No problem.

I'm waiting...

Quote:

Originally Posted by Thorham

Skipping just sucks rocks. One way is to convert the data to rgb while scaling, then just render to ham. Should be ok. Another is to do the same and count how many times each color is used during scaling, then quick sort the table. Since the table is max 256 entries, this should be fast. Once that's done, use the 64 most frequent colors as the ham palette. Obviously, the first method is faster, but will never look as good as the original 256 color image. I do doubt it will look bad, though. This is the best I can come up with, since ham will be the only way, unless you want to do high quality rgb to 256 color conversion, which will never be as fast.

What I fear is the brutal color changes when rendering to ham : they're often far too visible.
And when scaling down an iff, you have to p2c it, then scale it, then count colors to get the most used ones, then write them back into a buffer, then c2p it. Ouch ! Ilbm displaying has never been so slow !
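
The "count the colors, sort, keep the most used ones" step discussed here can be sketched in C as below. It is only an illustration under assumptions: the image is taken to be already chunky (one palette index per byte), and `ColorUse`, `by_count_desc` and `pick_top_colors` are hypothetical names, not anything from the viewer's source.

```c
#include <stdint.h>
#include <stdlib.h>
#include <stddef.h>

typedef struct { uint32_t count; uint8_t index; } ColorUse;

/* Sort by descending use count; ties are broken by palette index. */
static int by_count_desc(const void *a, const void *b)
{
    const ColorUse *x = a, *y = b;
    if (x->count != y->count) return x->count < y->count ? 1 : -1;
    return (int)x->index - (int)y->index;
}

/* pixels: chunky 8-bit palette indices.
   out: receives up to n_out palette indices, most used first.
   Returns how many entries were written (unused colors are skipped). */
size_t pick_top_colors(const uint8_t *pixels, size_t n_pixels,
                       uint8_t *out, size_t n_out)
{
    ColorUse use[256];
    for (int i = 0; i < 256; i++) {
        use[i].count = 0;
        use[i].index = (uint8_t)i;
    }
    for (size_t i = 0; i < n_pixels; i++)
        use[pixels[i]].count++;              /* can be merged into the scaling loop */

    qsort(use, 256, sizeof use[0], by_count_desc);

    size_t n = 0;
    while (n < n_out && n < 256 && use[n].count > 0) {
        out[n] = use[n].index;
        n++;
    }
    return n;
}
```

With n_out set to 64 this gives the candidate base palette mentioned in the quote; the sort itself only ever touches 256 entries, so the cost is dominated by the counting pass over the pixels.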

Thorham · 15 January 2008, 20:32

Quote:

Originally Posted by meynaf

You came up with something very interesting at your first try

Thank you!

Quote:

Originally Posted by meynaf

Hmm... I admit I don't like losses... I noted that there is further accuracy loss as compared to the original code, because the adds for green pixels (a*Cb + b*Cr) were done with 32-bit fixed-point values, not bytes.

I know you don't, but you really have to see this in action, so I've made a bunch of test images, and put them in the zone. The images are in 24bit png format. Each image has the original image on the right and the approximation on the left. Both images are 640x512 and have been fitted into 1280x512 to make it easy to compare them. Note that the originals were all in the jpeg format.
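
On the accuracy point in the quote above, the size of the extra error for green can be checked with a small C sketch. It assumes that the byte version stores each of the two chroma terms already rounded to the nearest integer, and that the original sums them in 16.16 fixed point and rounds once; the approximation actually posted earlier in the thread may differ in detail. `round_fix` and `max_green_term_error` are hypothetical names.

```c
#include <stdint.h>
#include <stdlib.h>

#define SCALEBITS 16
#define ONE_HALF  ((int32_t)1 << (SCALEBITS - 1))
#define FIX(x)    ((int32_t)((x) * (1L << SCALEBITS) + 0.5))

/* Round a 16.16 value to the nearest integer: floor(v/65536 + 0.5).
   The large offset keeps the shifted value positive. */
static int32_t round_fix(int32_t v)
{
    return (int32_t)(((int64_t)v + ONE_HALF + ((int64_t)1 << 40)) >> SCALEBITS)
           - (1 << 24);
}

/* Largest absolute difference between "sum then round" and
   "round each term then sum", over every (Cb, Cr) pair. */
int max_green_term_error(void)
{
    int worst = 0;
    for (int cb = -128; cb < 128; cb++) {
        for (int cr = -128; cr < 128; cr++) {
            int32_t a = -FIX(0.34414) * cb;            /* Cb contribution to green */
            int32_t b = -FIX(0.71414) * cr;            /* Cr contribution to green */
            int exact  = round_fix(a + b);             /* 32-bit sum, rounded once */
            int approx = round_fix(a) + round_fix(b);  /* two pre-rounded values */
            int d = abs(exact - approx);
            if (d > worst) worst = d;
        }
    }
    return worst;
}
```

Under those assumptions the two methods never differ by more than one level in the green channel, which fits with the difference being hard to see in the test images.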

Quote:

Originally Posted by meynaf

I didn't really test it, but from the timings I know, compares would be slower. Or can you do a range-limit in less than 11 clock cycles ???

No, I don't think it can be done, since that would require a cmp, three branches and two moves.
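
The two range-limit strategies being compared can be put side by side in C. This is a sketch under assumptions: the offset of 384 entries mirrors the +$180 applied to range_limit2 in the asm, but the total table size is a guess, and `init_range_tab`, `limit_by_table` and `limit_by_compare` are illustrative names.

```c
#include <stdint.h>

#define RANGE_PAD 384   /* assumed headroom on each side, matching the +$180 offset */

static uint8_t range_tab[RANGE_PAD + 256 + RANGE_PAD];

/* Build the table once: entry (v + RANGE_PAD) holds v clamped to 0..255. */
void init_range_tab(void)
{
    for (int i = 0; i < (int)sizeof range_tab; i++) {
        int v = i - RANGE_PAD;
        range_tab[i] = (uint8_t)(v < 0 ? 0 : v > 255 ? 255 : v);
    }
}

/* Table version: one indexed load and no branches.
   Only valid for -RANGE_PAD <= v < 256 + RANGE_PAD. */
uint8_t limit_by_table(int v)
{
    return range_tab[v + RANGE_PAD];
}

/* Compare version: two tests and up to two taken branches per component. */
uint8_t limit_by_compare(int v)
{
    if (v < 0) return 0;
    if (v > 255) return 255;
    return (uint8_t)v;
}
```

The table version is what lets the asm do the limiting and the store in a single move.b (a2,dn.w),(a3)+.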

Quote:

Originally Posted by meynaf

To be acceptable, such a loss must make the thing much faster.
I've counted clock cycles (including pipeline) and you should get something like 123 of them per pixel.
My actual code runs in 125 or so, with full accuracy. Not worth changing.
But you can get rid of 8 of them, going down to 115, if you replace :
- subq/bpl by a dbf (-2)
- move.b (a6,dn.l),dn / move.b dn,(a3)+ by move.b (a6,dn.l),(a3)+ (-2, 3 times)

However it's still not enough IMHO (a bit less than 10%, but much less for the overall speed). A 40% gain could be good though.

As you have seen, the loss is hardly noticeable, if at all. IMHO this is quite acceptable.

Silly me, I forgot about the moves

The subq/bpl can not be changed to dbf since dbf works on words, and the input can be larger than 64kb. Unless 68030 can handle 32bit dbf (wouldn't be surprised). On the other hand you could just use two of them, since there are two unused data regs.
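
For reference, dbf only decrements and tests the low 16 bits of its counter register, on every member of the 68k family. The usual way around that is the two-register approach mentioned above: the low word of count-1 goes in one register, the high word in another, and a second dbf follows the first. The C model below only illustrates how such a pair of counters behaves; `dbf_pair_iterations` is a hypothetical name and the loop body is reduced to a counter.

```c
#include <stdint.h>

/* Model of a loop closed by two consecutive dbf instructions.
   Returns how many times the loop body runs for a 32-bit count n. */
uint32_t dbf_pair_iterations(uint32_t n)
{
    if (n == 0)
        return 0;                       /* the caller must skip the loop entirely */

    uint32_t m  = n - 1;                /* dbf counts down to -1, so start at n-1 */
    uint16_t lo = (uint16_t)m;          /* counter of the inner dbf */
    uint16_t hi = (uint16_t)(m >> 16);  /* counter of the outer dbf */
    uint32_t iterations = 0;

    for (;;) {
        iterations++;                   /* loop body */
        if (lo-- != 0)
            continue;                   /* first dbf: branch unless lo wrapped to $FFFF */
        if (hi-- != 0)
            continue;                   /* second dbf: lo is $FFFF, so 65536 more passes */
        break;
    }
    return iterations;
}
```

The second dbf is only reached once every 65536 pixels, so its cost is negligible compared with the subq/bpl pair it replaces.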

Quote:

Originally Posted by meynaf

The data formats are the same, except that you must write red, then green, then blue, not red last.

Since we're rendering to ham, the order of the gun colors is not important:

Code:

;Code in ham rendering routine:
    move.b (a0)+,d1  ;red
    move.b (a0)+,d2  ;green
    move.b (a0)+,d3  ;blue

;Changes to:
    move.b (a0)+,d2  ;green
    move.b (a0)+,d3  ;blue
    move.b (a0)+,d1  ;red

This doesn't affect the rest of the ham rendering routine at all.

I am a bit surprised I got the data formats just right, I was quite unsure about it. Cool

Quote:

Originally Posted by meynaf

Here is my code, should you spot something that can be done to accelerate it :

Before I pain my brain, I want to know whether you think the loss is acceptable. If you like it, you might be able to come up with a faster method than the one used in the c code!

Quote:

Originally Posted by meynaf

I've started this one in asm. Very funny to do (*cough*).

Good luck

Quote:

Originally Posted by meynaf

Sure, this is an area where little has been done.

After thinking about it, I came to the conclusion that this is a bit like adaptive interpolation. But I still have to try it.

Quote:

Originally Posted by meynaf

What I fear is the brutal color changes when rendering to ham : they're often far too visible.

The only way to know is to try it; you might be in for a surprise. The thing is, you have to convert to full rgb anyway. This just makes the image 24bit. There will be differences, but I really doubt they're going to be very big. But again, the only way to know is to try it.

Edited: Testing will be easy. All you have to do is convert 256 color images to jpeg at the highest quality setting, and use your viewer to see what it looks like! Furthermore, I tried ADPro's ham rendering on 256 color images, and although there is a loss, it's really not bad. However, that is to be expected, and it can't be helped. I've also tried it with Visage, and that is just plain ugly

Since your ham rendering routine is much better, it might just be ok. If you don't have the time, I can make some test images for you, since I've got a whole bunch of 256 color bmps which I ripped from the Final Fantasy 6 Playstation cd edition.

Quote:

Originally Posted by meynaf

And when scaling down an iff, you have to p2c it, then scale it, then count colors to get the most used ones, then write them back into a buffer, then c2p it. Ouch ! Ilbm displaying has never been so slow !

Yep, that's true (although counting can be done while scaling). However, don't you agree 24bit iffs are a silly format? I mean, planar 24bit

I don't think there's any hardware capable of displaying this directly. It's all chunky rgb. IMHO 24bit iff should have never been created, and is not worthy of being supported. I know some amiga software uses it, but it's far better to store images as bmp or png.

As for iffs up to 8bit per pixel, these are typical amiga format images, and probably none of them need scaling. For those, scaling would be optional, and I seriously doubt anyone would use it.

meynaf · 21 December 2007, 14:20

For the next step - the colorspace conversion may be my next victim - I'm looking for jpeg files with unusual color spaces, to test them, and to check whether they're worth supporting or not (certainly not if they are extremely rare).

Can someone fire up a photoshop and save RGB/YCCK/CMYK encoded jpeg files for me ? (as I don't have photoshop and those are apparently adobe specific)

Please...

Last edited by meynaf; 21 December 2007 at 14:32.


meynaf · 24 December 2007, 16:14

Does that guy have asm code for YCbCr -> RGB conversion ? What, no 68k version ? Well, ok, I'll do it...

That's the next thing I've spotted that's not too difficult and can give us an important speed increase.

I promise I won't use the term yuv if it's digital video

meynaf · 04 January 2008, 11:12

Of course the YCbCr->RGB conversion is simple. But it becomes more interesting if you try to do it without multiplies.

The problem of gfx data is that it doesn't crunch very well. The more colourful it is, the worse the compression will be. The one and only thing I see here is that we could exploit the fact that adjacent pixels are often close in color ; they're closer in YCbCr than in RGB but changing color spaces is a lossy process because of rounding errors.

My ham engine is more standard issue than it looks. I'm pretty sure nearly all high quality renderers use the very same algorithm.
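
The remark about adjacent pixels being close in color can be illustrated with the simplest possible lossless predictor: store each sample as its difference from the previous one. The C sketch below is only that, an illustration. It assumes one 8-bit channel of one row, uses the previous sample as the prediction, and is not something either poster implemented; `delta_encode` and `delta_decode` are hypothetical names.

```c
#include <stdint.h>
#include <stddef.h>

/* Replace each sample by its difference from the previous one (modulo 256).
   Smooth image data turns into many small values, which pack better. */
void delta_encode(uint8_t *buf, size_t n)
{
    uint8_t prev = 0;
    for (size_t i = 0; i < n; i++) {
        uint8_t cur = buf[i];
        buf[i] = (uint8_t)(cur - prev);
        prev = cur;
    }
}

/* Exact inverse of delta_encode: a running sum modulo 256. */
void delta_decode(uint8_t *buf, size_t n)
{
    uint8_t prev = 0;
    for (size_t i = 0; i < n; i++) {
        prev = (uint8_t)(prev + buf[i]);
        buf[i] = prev;
    }
}
```

Because everything is done modulo 256 on the stored samples, the round trip is exact, so none of the rounding loss mentioned for color space changes is introduced.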
