fast HAM8 conversion ? - Page 4

meynaf · 26 November 2007, 14:53

I don't see the point for the div thing. You're supposed to average 3 pixels -> 1 to do 33%, not 9 pixels -> 1.

Thorham · 26 November 2007, 16:56

The nine pixel average is for scaling down both the x axis and y axis to 33% in one go (for 1600x1200 in hq ham). You are indeed right in saying you only need to average three pixels for 33% on one axis.

The code I uploaded scales down to 33%x50% (six pixels), so that's good for seeing this principle in action. It's done by the 'SetupBmp' subroutine.

And remember, the divs have nothing to do with the number of pixels, just with the gun colors, so it's always three divs or shifts per pixel set.
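A minimal sketch of the six-pixel average described above, written in C rather than 68k assembly for readability. It assumes one byte per colour gun and a row stride given in bytes; `avg3x2` and its parameters are hypothetical names and are not taken from the uploaded 'SetupBmp' routine.

```c
#include <stdint.h>
#include <stddef.h>

/* Average a 3-wide, 2-high block of ONE colour gun (33% x 50% scaling).
 * src points at the top-left sample of the block; stride is the distance
 * in bytes between two rows of that gun. Hypothetical helper. */
static uint8_t avg3x2(const uint8_t *src, size_t stride)
{
    unsigned sum = 0;
    for (size_t y = 0; y < 2; y++)
        for (size_t x = 0; x < 3; x++)
            sum += src[y * stride + x];
    /* One divide per gun, so three divides per output pixel. */
    return (uint8_t)(sum / 6);
}
```

Calling this once each for red, green and blue gives the "three divs per pixel set" mentioned above, whatever the block size is.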

meynaf · 26 November 2007, 17:31

Now that I know it was for 33%/33% scaling, things are much clearer !

But divides are basically slow. Exactly how many of the 60 frames (for an 800*600 image) are used by the downsampling?

Thorham · 26 November 2007, 18:47

Yes, I know! Divs are terrible at 44 cycles

And signed divs are even worse...

The downsampling takes 26 frames for 33%x50%, which isn't that bad. My c2p routine can be optimized, though: it's probably possible to strip off 30-40 instructions (first timer at the c2p business), so the 60 frames can still go down. Also, the c2p always converts 1280 x bmp height, so it can be optimized even more. In the end the downsampling is going to be a bit slower than the rendering.

Kalms · 27 November 2007, 00:11

You can replace a division by a constant with a multiplication by the reciprocal. The basic idea is: instead of dividing by 9, multiply by 65536/9 and then shift down 16 steps. For more info, search on terms like "division by constant" and "reciprocal multiplication".

Also, given that you are (in this example) adding up to 9 terms in the range 0..255, the end result will be in the range 0..2295, so you could use a lookup table for the division-by-9.
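The reciprocal trick can be sketched in C as follows. The function name `div9_recip` is hypothetical. The constant 7282 is 65536/9 rounded up, which is an assumption added here: rounding down to 7281 would make every exact multiple of 9 come out one too small.

```c
#include <stdint.h>

/* Divide a gun sum in the range 0..2295 (nine 8-bit samples) by 9
 * without a divide instruction: multiply by 65536/9, then drop the
 * low 16 bits. 7282 is 65536/9 rounded UP. */
static uint16_t div9_recip(uint16_t sum)
{
    return (uint16_t)(((uint32_t)sum * 7282u) >> 16);
}
```

For this input range the result matches `sum / 9` exactly. On a 68k the 16-bit shift would presumably be done by swapping the register halves rather than by an actual shift.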

meynaf · 27 November 2007, 10:16

Divs on a '060 are only 2 cycles

The multiply would lead you from 44 to 28 (in fact 32 because you'll have to swap).
A table look-up would cost 14 or so (and 1531 bytes for 33%x50%).

However if you can put a multiply right after a write to chipmem it'll only cost 5. Maybe you can use this to make the scaling+c2p in one pass ?

Also, if you consider using 25%x50% instead, shifting is 4 cycles.

When at home this week-end I'll check if I can do a 50%x50% with my method.

Thorham · 27 November 2007, 14:27

Cool guys

Those are some mighty fine optimizations indeed.

Implementing 50%x50% for hq ham should be a cake walk for you, meynaf! Doing the c2p and scaling in one pass would be really great, but it's out of the question for my current code. I can only hope you have more luck than me with that one. However, even with two passes, 50%x50% is already going to be pretty fast.

Can't wait to see the finished jpeg viewer, so long Visage

Edited: Tried the table lookup for scaling 800x600 down to 33%x50%. The number of frames dropped from 60 to 50! Also, using a table means you can do the divide and the write to fast mem in one instruction. With divu, you first divide and then write to memory, so the table is effectively even better than the 14 cycles suggest. Or is move.b (a0,d0.w),-(a1) slower than 14 cycles?
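As a rough C counterpart of the table lookup with a combined write, here is a sketch for the six-pixel case. The names `div6_table`, `init_div6_table` and `store_avg6` are hypothetical, and the mapping of registers to variables in the comment is an assumption about how the instruction is used.

```c
#include <stdint.h>

enum { MAX_SUM6 = 6 * 255 };              /* largest sum of six 8-bit samples */

static uint8_t div6_table[MAX_SUM6 + 1];  /* 1531 bytes, the size quoted earlier */

/* Fill the table once, before any scaling is done. */
static void init_div6_table(void)
{
    for (unsigned s = 0; s <= MAX_SUM6; s++)
        div6_table[s] = (uint8_t)(s / 6);
}

/* Store sum/6 just below dst and return the new dst. This mirrors
 * move.b (a0,d0.w),-(a1), assuming a0 = table, d0 = sum, a1 = dst. */
static uint8_t *store_avg6(uint8_t *dst, unsigned sum)
{
    *--dst = div6_table[sum];
    return dst;
}
```

The divide and the store collapse into one indexed read and one pre-decrement write, which is the point made above about needing only a single instruction.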

meynaf · 27 November 2007, 17:02

Errhm... for the "finished" jpeg viewer, you might have to wait a little. There is still the jpeg decoding part to rewrite in asm

Doing this down-scale for my high-quality rendering is a piece of cake, well maybe, but I have to remember the position of the next line (forcing me to swap a reg).

Doing move.b (a0,d0.w),-(a1) is probably slower than 14 cycles. So you might try to put register-only instructions after it, just to use the pipeline...

Kalms · 27 November 2007, 19:32

Division on 060 is a bit slower than 2 cycles unfortunately (they are at 17 or so cycles). It's the multiplications that are very quick on that chip.

Anyway, you should pay some consideration to what sort of filter kernel you are using when you are downsampling your image.

Right now you are using a filter whose kernel is like this for a 33%/33% shrink:

1 1 1
1 1 1 / 9
1 1 1

The 3x3 numbers are weights (scale factors) for the corresponding pixel, and the "/ 9" is a global scale factor for the entire filter.
Or described in plain English: you take the sum of the current pixel and its 8 neighbours, divide by 9, and use that as the final (downsampled) color value.

For better visual results you should use a filter shape where
A) the kernel is larger than 3x3 pixels and
B) pixels further away from the origin have less weight than pixels close to the origin. Signal processing theory and transform theory describe how to construct a "good" filter kernel.

Since you have performance concerns, you might want to test with:

1 2 1
2 4 2 / 16
1 2 1

You can filter using this kernel with just shifts & adds.
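A sketch of that weighted kernel for one colour gun, in C, using only shifts and adds. `tent3x3` is a hypothetical name, and the layout (one byte per sample, `stride` bytes per row) is an assumption.

```c
#include <stdint.h>
#include <stddef.h>

/* Weighted 3x3 filter for ONE colour gun:
 *   1 2 1
 *   2 4 2  / 16
 *   1 2 1
 * p points at the centre sample; stride is the number of bytes per row.
 * The caller must make sure all eight neighbours exist. */
static uint8_t tent3x3(const uint8_t *p, ptrdiff_t stride)
{
    unsigned corners = p[-stride - 1] + p[-stride + 1]
                     + p[ stride - 1] + p[ stride + 1];   /* weight 1 */
    unsigned edges   = p[-stride] + p[-1] + p[1] + p[stride]; /* weight 2 */
    unsigned centre  = p[0];                               /* weight 4 */
    return (uint8_t)((corners + (edges << 1) + (centre << 2)) >> 4);
}
```

Samples with the same weight are summed first, so each gun needs two weighting shifts plus the final shift by 4. Whether that beats a table on a given 68k is not something this sketch shows.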

Thorham · 27 November 2007, 20:54

Quote:

Originally Posted by meynaf

Errhm... for the "finished" jpeg viewer, you might have to wait a little. There is still the jpeg decoding part to rewrite in asm

Doing this down-scale for my high-quality rendering is a piece of cake, well maybe, but I have to remember the position of the next line (forcing me to swap a reg).

Doing move.b (a0,d0.w),-(a1) is probably slower than 14 cycles. So you might try to put register-only instructions after it, just to use the pipeline...

Thanks for the advice, I'll check it out immediately

Gonna have another go at your ham rendering engine, but it's starting to look like it's impossible to get it any faster

Quote:

Originally Posted by Kalms

Division on 060 is a bit slower than 2 cycles unfortunately (they are at 17 or so cycles). It's the multiplications that are very quick on that chip.

Anyway, you should pay some consideration to what sort of filter kernel you are using when you are downsampling your image.

Right now you are using a filter whose kernel is like this for a 33%/33% shrink:

1 1 1
1 1 1 / 9
1 1 1

The 3x3 numbers are weights (scale factors) for the corresponding pixel, and the "/ 9" is a global scale factor for the entire filter.
Or described in plain English: you take the sum of the current pixel and its 8 neighbours, divide by 9, and use that as the final (downsampled) color value.

For better visual results you should use a filter shape where
A) the kernel is larger than 3x3 pixels and
B) pixels further away from the origin have less weight than pixels close to the origin. Signal processing theory and transform theory describe how to construct a "good" filter kernel.

Since you have performance concerns, you might want to test with:

1 2 1
2 4 2 / 16
1 2 1

You can filter using this kernel with just shifts & adds.

So, I guess for six pixels it would be something like this:

1 2 1
1 2 1 / 8

I don't want to sound ungrateful, but:

Simple averaging is already very good. I've tried your nine pixel idea, but the quality does not seem to improve (might just be some lameness on my part). Also, it's going to be slower than a divu table: I've tried this idea with six pixels (as above), which is one frame faster than a divu table (for a 1280x1024 bmp). The reason is that you only need six extra adds to multiply two of the pixels by two. For nine pixels, it's quite different: four pixels need multiplying by two, so that's 12 adds, and one needs multiplying by 4, which is three shifts. This is a lot more extra work than what's needed for six pixels.

Sorry for that
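For reference, the six-pixel kernel from this post (1 2 1 over 1 2 1, divided by 8) might look like this in C for a single gun. `tent3x2` is a hypothetical name, and the layout (one byte per sample, `stride` bytes per row) is an assumption.

```c
#include <stdint.h>
#include <stddef.h>

/* Weighted 3x2 filter for ONE colour gun:
 *   1 2 1
 *   1 2 1  / 8
 * src points at the top-left sample of the block. */
static uint8_t tent3x2(const uint8_t *src, size_t stride)
{
    unsigned outer  = src[0] + src[2] + src[stride] + src[stride + 2]; /* weight 1 */
    unsigned middle = src[1] + src[stride + 1];                        /* weight 2 */
    return (uint8_t)((outer + (middle << 1)) >> 3);
}
```

The weights add up to 8, so the final division is a 3-bit shift and no divu or table is needed.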

StrategyGamer · 27 November 2007, 21:12

If you are going for speed then just do
1 2 1 /4

That way you can get rid of that nasty divu

Thorham · 27 November 2007, 21:22

True, but for a 1280x1024 bmp scaled down to 33%x50% it only saves one frame compared to a divu table... The real optimizations can be done in the c2p department.

meynaf · 28 November 2007, 10:09

Quote:

Originally Posted by Thorham

Gonna have another go at your ham rendering engine, but it's starting to look like it's impossible to get it any faster

If you're looking at it, then I'll give you the latest modifications I did.
I used your 32-bit palette idea, pushing it a little bit further.

You may want to take a look at the attached file. Dunno if it works or even assembles (I might have broken something).

The basic idea is 32-bit palette entries, of which the 4th byte is color# *4.
This allowed me to remove the temporary variable on the stack !
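The attached source is not reproduced in the thread, so the following C sketch is only a guess at what such a palette entry could look like. The byte order (R, G, B, then colour number * 4 in the last byte) and all function names are assumptions; the post only states that the 4th byte holds color# *4.

```c
#include <stdint.h>

/* Pack one palette entry as R, G, B, index*4 (most to least significant
 * byte). The byte order is an ASSUMPTION. index is masked to 0..63,
 * since HAM8 has 64 base colours. */
static uint32_t pack_entry(uint8_t r, uint8_t g, uint8_t b, uint8_t index)
{
    return ((uint32_t)r << 24) | ((uint32_t)g << 16) | ((uint32_t)b << 8)
         | (uint32_t)((index & 63u) << 2);
}

/* Read back the pre-scaled colour number (index * 4) from the last byte. */
static uint8_t entry_index4(uint32_t e) { return (uint8_t)(e & 0xFFu); }

/* Read back the red gun from the first byte. */
static uint8_t entry_red(uint32_t e) { return (uint8_t)(e >> 24); }
```

With the colour number stored next to its RGB value, a single 32-bit read returns both, which is presumably what made the temporary variable on the stack unnecessary.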

I have put some repetitive code in macros to ease testing of different methods. That might have made the thing even more unreadable

To be faster if quality loss is acceptable, it is also possible to use a more regular palette, thus removing the need of a table to find the closest color.
You may activate it with the "quick" equ. Enjoy !
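One way a "more regular palette" could avoid the lookup table is sketched below in C. This is an illustration only, not the code behind the "quick" equ: it assumes the 64 base colours are laid out as a 4x4x4 grid with four evenly spaced levels per gun, and `regular_index` is a hypothetical name.

```c
#include <stdint.h>

/* Map an 8-bit RGB colour to a base colour number 0..63, ASSUMING the
 * palette is a 4x4x4 grid: the top two bits of each gun select the level. */
static uint8_t regular_index(uint8_t r, uint8_t g, uint8_t b)
{
    return (uint8_t)(((r >> 6) << 4) | ((g >> 6) << 2) | (b >> 6));
}
```

The index comes straight from the top bits of each gun, so no closest-colour table is needed. The match is found by truncation and is therefore not always the nearest entry, which fits the quality loss mentioned above.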

Thorham · 28 November 2007, 14:43

Thanks for uploading

I tried it immediately and think there's a bug. I replied as soon as I found out, but maybe I can find it myself. It might be that I haven't set up the table properly; also, while trying things myself, I've seen exactly the same render bugs. It's either my table or something in the code.

It did assemble the first time I tried it. A while ago, you said you were going to try this at the weekend when you got home; does that mean you made the modifications without an Amiga/UAE? If so, then I can tell you the quick mode seems to be OK, and is indeed faster, about 34 frames I think. I have to admit that I'm using my Amiga's composite output, which gives bad image quality, meaning I can't always see everything, but it seems OK! I'll try it on WinUAE today as well (I hate using UAE for this, because the number of frames is going to drop beneath realistic values, making it useless for speed testing).

meynaf · 28 November 2007, 15:20

The quality seems good for the quick method ; however texts get blurred and you see green pixels at the left of important color changes.

And, yes, you're right : I made the modifications without amiga/uae

What does the bug look like?

Thorham · 28 November 2007, 15:38

In the test image I'm using, some parts which should be grayish, become a bit colored, and the ham fringing increases.

By the way, could you tell me what the two other data files are for? Currently I haven't even started to get into those parts, yet.

meynaf · 28 November 2007, 16:07

The symptoms make me think of fixed-pixels error. To check this I would uncomment one of the two moveq,d0 to force fixed or relative pixels.
But I can't do this here at work

The two data files are simply for building the palette table (which gives the closest palette entry for each color).

Thorham · 29 November 2007, 00:52

I've done some more optimizations, and got the number of frames down to 139.

Your code is fine, by the way. I made my own palette table without realizing your code generates its own palette table, and screwed up completely.

The optimizations are in the attached file, and are commented. They should be easy to spot because of the formatting. They're based on the version I already had, as I know that one better than the new one.

ham8.zip

Kalms · 29 November 2007, 11:13

Are you displaying the SHRES screen while performing the HAM8 conversion? A 1280x512 SHRES screen eats 85% of the chipram bandwidth.

If so, here are three things that might be worth considering:

* don't display the image until you have finished conversion
* display the image in HAM6 until conversion finishes (this should take approx 30% less chipram bandwidth), if HAM6 is supported in SHRES (not sure)
* adjust the bottom end of the display window (DIWSTOP) each frame such that you don't show any region of the screen which has not finished processing yet

meynaf · 29 November 2007, 13:15

I had a quick look at the code you posted.
Some remarks about it :
- I saw both moveq.b and moveq.l - as moveq has no size, giving it one could be misleading
- move.b (a6),d0 followed by move.b d0,(a1)+ can be replaced by move.b (a6),(a1)+
- is move.l (sp),a5 really faster than move.l #adr,a5 ?

For SHRES display, it's easy to open an Intuition screen in the background, and bring it to the front once finished. My current viewer already has the option to do this.
