fast HAM8 conversion ? - Page 3

Toni Wilen · 20 November 2007, 15:43

Quote:

Originally Posted by meynaf

I knew I didn't think like the rest of the world

But from this point of view it's a 13-bit register, as bit #15 is the transparency (for genlock video).

No, it is a 12.5-bit register, because the genlock bit only exists when BPLCON3 LOCT=0.

meynaf · 20 November 2007, 15:53

Quote:

Originally Posted by Toni Wilen

No, it is a 12.5-bit register, because the genlock bit only exists when BPLCON3 LOCT=0.

Thorham · 21 November 2007, 10:04

I've been looking at the ham8 test code. So far I've only been able to strip off about 10 frames (for an 800x600 picture), and that's only because you can chop off bits instead of rounding them. Also, using a 32-bit palette table (currently the code still reads the original table for the library, too) allows you to strip off one whole instruction (wow).

I know this isn't much yet...

As for the translators: the drawback is that they can't understand the context, but in this case they're quite helpful (better than nothing), and of course learning a language is always best!

I'll see if I can find some more optimizations (tough).

meynaf · 21 November 2007, 10:42

Could you please post exactly what you did here? (Or PM it to me?)

Thorham · 21 November 2007, 11:05

Of course, here goes (for the ham8.s file):

I replaced the rounding code with this:

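moveq.l #4,d0       ; d0 = $00000004
neg.b d0            ; low byte becomes $fc: d0 = $000000fc, the truncation mask

move.b (a0)+,d1     ; first component byte
and.l d0,d1         ; drop the 2 low bits and clear the upper 24 bits
move.b (a0)+,d2     ; second component byte
and.l d0,d2
move.b (a0)+,d3     ; third component byte
and.l d0,d3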

And I changed this:

lea zv_h8pal(pc),a6 ; a6 = table base
add.l d4,a6         ; a6 = base + index
add.l d4,d4         ; d4 = index*2
add.l d4,a6         ; a6 = base + index*3

To this:

lea Palette32(pc),a6 ; a6 = table base
lsl.l #2,d4          ; d4 = index*4
add.l d4,a6          ; a6 = base + index*4

Where Palette32 points to the palette table with the following format:

dc.b value,value,value,0

Here each value is the corresponding byte of the original table entry, and the trailing 0 pads the entry to four bytes.

As I said, the library still uses the original format, so both tables are loaded, which doesn't seem to slow anything down.
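A minimal sketch of how the padded table could be built once at startup from the 3-byte table. The routine name MakePalette32 and the register assignments are assumptions made for this example, not something taken from ham8.s:

```asm
; Hypothetical helper: expand a 3-bytes-per-entry table to 4 bytes per entry.
; In:  a0 = source table (3 bytes per entry, the layout zv_h8pal appears to use)
;      a1 = destination buffer (4 bytes per entry, the Palette32 layout)
;      d7 = number of entries - 1
MakePalette32:
.entry:
        move.b  (a0)+,(a1)+     ; first value of the entry
        move.b  (a0)+,(a1)+     ; second value
        move.b  (a0)+,(a1)+     ; third value
        clr.b   (a1)+           ; padding byte, gives a 4-byte stride
        dbra    d7,.entry
        rts
```

With the 4-byte stride, the address of an entry is base + index*4, which is what the lsl.l #2 / add.l pair computes.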

meynaf · 21 November 2007, 11:17

I dunno on which cpu you're testing, but on a 030, this :

Code:

lea Palette32(pc),a6
lsl.l #2,d4
add.l d4,a6

isn't faster than that :

Code:

lea zv_h8pal(pc),a6
add.l d4,a6
add.l d4,d4
add.l d4,a6

because the lsl takes 4 clock cycles (adds take only 2), so after the lea both versions cost 6 cycles (4+2 versus 2+2+2).

Also, I just noticed that the high part of d1-d2-d3 stays 0 all along, so 3 moveq #0 outside of the loop, and moveq #$fc / and.b should do the trick.
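A sketch of that suggestion, assuming nothing else in the loop writes the upper 24 bits of d1-d3. The loop counter in d7 and the label are placeholders, and moveq #-4 is used as an equivalent spelling of moveq #$fc, because some assemblers may complain about a moveq immediate above 127:

```asm
; Hypothetical loop skeleton, not the actual ham8.s loop.
; In:  a0 = source bytes, d7 = pixel count - 1
TruncLoop:
        moveq   #0,d1           ; clear the upper 24 bits once, outside the loop
        moveq   #0,d2
        moveq   #0,d3
        moveq   #-4,d0          ; d0 = $fffffffc, so the low byte is $fc
.pixel:
        move.b  (a0)+,d1        ; only the low byte is overwritten
        and.b   d0,d1           ; drop the two low bits
        move.b  (a0)+,d2
        and.b   d0,d2
        move.b  (a0)+,d3
        and.b   d0,d3
        ; ... rest of the per-pixel work would go here ...
        dbra    d7,.pixel
        rts
```

Because and.b only touches the low byte, the mask no longer has to clear the upper bits on every pixel as the and.l version does.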

Thorham · 21 November 2007, 11:39

I'm testing on a 68030 at 50 MHz. I didn't know lsl was twice as slow as add.

The main reason for doing this is that I was trying to do some further optimization by reading a whole long word (even have a third version of the palette table for that) instead of reading three bytes separately, but I found out it didn't work in this case, because you need two swaps. Also, this got me in trouble further down the source, where the table is read again.

Edited:
By the way, I don't have a good doc on instruction timings. Any suggestions? I only just found out that rol can be slower than reading bytes from fast mem, for example.

meynaf · 21 November 2007, 13:13

I think I'll boot up my Miga this Saturday to check your 32-bit palette idea.

Results here on Monday, maybe with a new version?

For the timings Flint/Darkness did a great job with his guide :
http://aminet.net/package/dev/asm/mc680x0
Timings when accessing memory can vary; however, on my configuration (030/50 with 60 ns EDO RAM) you need 8 cycles for a fastmem access and 26 for a chipmem access. Both can be pipelined if they are writes.

I wonder if we shouldn't continue this by PM, since few people here will understand what we're talking about (who has looked into the code?)...

StrategyGamer · 21 November 2007, 14:08

On 060 this is faster because first 2 instructions execute simultaneously:

Code:

lea Palette32(pc),a6
lsl.l #2,d4
add.l d4,a6

In this version all instructions depend on output of previous instruction. So none of them can dual execute on 060.

Code:

lea zv_h8pal(pc),a6
add.l d4,a6
add.l d4,d4
add.l d4,a6

meynaf · 21 November 2007, 14:25

Quote:

Originally Posted by StrategyGamer

On 060 this is faster because first 2 instructions execute simultaneously:

Code:

lea Palette32(pc),a6
lsl.l #2,d4
add.l d4,a6

In this version all instructions depend on output of previous instruction. So none of them can dual execute on 060.

Code:

lea zv_h8pal(pc),a6
add.l d4,a6
add.l d4,d4
add.l d4,a6

And what about the 040?
Does anyone here have 040/060 timings?

Thorham · 22 November 2007, 13:41

Yes, we could keep each other posted through pm. On the other hand, people are still replying, so there seems to be some interest in the topic. Just let me know what you prefer.

meynaf · 22 November 2007, 13:52

What I prefer is the forum for whatever can be understood by other people, and PM for the rest. Staying here doesn't bother me, but I was thinking of those who will read us and not understand a thing because they haven't seen the code.

Spellcoder · 22 November 2007, 13:53

Although I'm not familiar with the techniques or with calculating instruction timings, I find it interesting to read about the techniques and the optimizing. So please, do continue.
Those who don't understand/care should just not read the thread.

Wepl · 22 November 2007, 17:46

Quote:

Originally Posted by meynaf

And what about the 040?
Does anyone here have 040/060 timings?

Some time ago the CPU manuals could be downloaded as PDFs from the web. I also got the printed books for free from Motorola years ago.
In the manuals you will find the instruction timings. But especially on the 040/060 they depend on *.
The 040 is not superscalar and has only one integer unit.
It is usually impossible to write code that is the fastest on all CPUs, e.g. avoid PC-relative addressing on the 040, but it's no problem on the 030/060.
The timing also depends on the previous instruction because of the pipeline and alignment.

Thorham · 23 November 2007, 15:12

The MC680x0 manuals can be downloaded from www.freescale.com

Here's the link:

http://www.freescale.com/webapp/sear...Order=default&

The manuals cover the full 680x0 family

BippyM · 23 November 2007, 15:45

Quote:

Originally Posted by Thorham

The MC680x0 manuals can be downloaded from www.freescale.com

Here's the link:

http://www.freescale.com/webapp/sear...Order=default&

The manuals cover the full 680x0 family

I have grabbed all the relevant manuals and 7zipped..

There are 51mb worth zipped, over 8 files.. anyone wants lemme know and I'll zone it all

meynaf · 23 November 2007, 16:05

For me it's too late : I grabbed them already.

Thorham · 24 November 2007, 08:27

To get back to the topic, I made a simple BMP viewer in assembler which uses the 3x1 'screen mode'. It scales down (sorry) an 800x600 24-bit image to 33%x50% and displays the image in 60 frames. The program does kill the OS, and thus lacks an Intuition screen (pure metal banging).

The 60 frames include scaling and drawing, but exclude reading the BMP file and memory allocation. Furthermore, the c2p currently draws 1280 x scaled height.

It turns out the scaling/3x1 mode combination is less expensive than a high-quality HAM rendering engine, and yours is definitely high quality (the only way to improve it is to calculate the base palette for the whole image, maybe an idea for a super-high-quality mode).

Since you showed interest in the 3x1 screen for a quick and dirty mode, I thought I'd write a proper test program. If you want, I can put it in the zone, including the right include files, and AsmOne (if you don't have it), as you'll need that to do the speed test.

Another advantage of this quick and dirty stuff is that any image up to 1280x1024 fits on the screen completely. However, it's definitely not a replacement for the hq code you already have.

Have edited this reply far too many times now! One last go:

One more thing about scaling: it would be a great option. If someone doesn't want it, they don't have to use it, while if they do, it's there! The reason I'm mentioning the whole scaling thing again? Have you considered images which are way too large to fit on the screen, such as 1280x1024 JPEGs, or worse, 1600x1200? For your HQ mode, 1280x1024 can be scaled down to 640x512 by averaging 4 pixels, which means you can shift and don't have to use divs. Also, consider the chip mem needed for 1280x1024: it's a whopping 1.2 megs! Scaled down it's only 320 KB, and it fits the screen snugly! For 1600x1200 you can scale down to 33%x33%. Now you do need divs, but since it can be done in one go, you have three divs for nine pixels, which seems pretty good to me. And for 1600x1200 you have to scale, as it takes up 1.9 megs, simply making it impossible to display this directly. Scaling like this also has the advantage of fewer HAM pixels to render, so the scaling isn't much of an issue anyway!

It's these kinds of features I've always wanted to see in JPEG viewers, but never found them. Just thought to let you know.
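The 2x2 case can be sketched as follows, assuming 24-bit source rows with 3 bytes per pixel. The macro name ADDRGB, the routine name Halve2x2 and the register assignments are made up for this example:

```asm
; Hypothetical 2x2 box filter for one output row.
; In:  a0 = upper source row, a1 = lower source row (3 bytes per pixel)
;      a2 = output row (3 bytes per pixel)
;      d7 = number of output pixels - 1
ADDRGB  MACRO                   ; \1 = address register to read one pixel from
        move.b  (\1)+,d3
        add.w   d3,d0           ; sum of first component
        move.b  (\1)+,d3
        add.w   d3,d1           ; sum of second component
        move.b  (\1)+,d3
        add.w   d3,d2           ; sum of third component
        ENDM

Halve2x2:
        moveq   #0,d3           ; upper 24 bits of d3 stay clear
.pixel:
        moveq   #0,d0           ; reset the three sums for each output pixel
        moveq   #0,d1
        moveq   #0,d2
        ADDRGB  a0              ; upper left
        ADDRGB  a0              ; upper right
        ADDRGB  a1              ; lower left
        ADDRGB  a1              ; lower right
        lsr.w   #2,d0           ; sum of four / 4, a shift instead of a div
        lsr.w   #2,d1
        lsr.w   #2,d2
        move.b  d0,(a2)+
        move.b  d1,(a2)+
        move.b  d2,(a2)+
        dbra    d7,.pixel
        rts
```

Each sum is at most 4*255 = 1020, so word-sized adds are enough, and the shift truncates rather than rounds.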

meynaf · 26 November 2007, 11:27

Yes, you can upload it - with the bmp file if possible.
I'm sure I have asmone somewhere, but I'm also sure that I'll try to change the source to make it assemblable by phxass first

Of course scaling of enormous pics is useful, and it's amongst the planned features of my viewer - though not on the urgent todolist.
I intended to do it by averaging 50%/50% (hi-q mode) or simple pixel skipping (fast mode). Alternatively the jpeg library has code for down-sampled rendering, which could be even faster and get the best quality.

Speaking of quality, I don't believe that a palette adaptation is much better (however, I'm sure it will be much slower). I compared my rendering to other viewers which do that and didn't see a real difference.
Anyway, how do you find out what the "best" colors are?

EDIT: how can you possibly average 9 pixels in 3 divs ???

Thorham · 26 November 2007, 12:37

The files will be in the zone today, including some test bmps.

Palette adaptation only improves the image quality if the colors are chosen just right. Adpro is the only program I know of which does this properly. Choosing the best colors is probably done by analyzing the image first to determine which pixels would generate the largest errors when rendered in HAM. Then you probably just have to pick the 64 pixels with the highest error ratio, and use their values in the palette. However, I am not very sure about this; it's more of an educated guess. I'll also include an Adpro rendering, as this program probably does the best-quality HAM rendering.

Taking the average of nine pixels can be done with three divs like so:

move.b (a0)+,d3
add.l d3,d0 ;Red
move.b (a0)+,d3
add.l d3,d1 ;Green
move.b (a0)+,d3
add.l d3,d2 ;Blue

This code is then applied to all nine pixels, and when that is done you need one div per gun color, hence three divs only. Just make sure the upper 24 bits of d3 are cleared before the loop, and make sure d0/d1/d2 get cleared for each set of pixels. This principle works for any number of pixels which can be averaged in one go, so if you want to scale down to 25%x25% (16 pixels) you'd still only need three divs, or in this case shifts.
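Put together, one 3x3 block could look like this. It is a sketch only: the routine name Avg3x3, the stride parameter in d6 and the register assignments are assumptions, and it uses two small loops where the fragment above would be repeated nine times, which is slower but shorter to read:

```asm
; Hypothetical 3x3 average of one block.
; In:  a0 = top-left source pixel of the block (3 bytes per pixel)
;      d6 = bytes per source row - 9 (distance from block end to next row)
;      a1 = output pixel (3 bytes)
Avg3x3:
        moveq   #0,d0           ; red sum
        moveq   #0,d1           ; green sum
        moveq   #0,d2           ; blue sum
        moveq   #0,d3           ; upper 24 bits of d3 stay clear
        moveq   #3-1,d5         ; three rows
.row:
        moveq   #3-1,d4         ; three pixels per row
.pix:
        move.b  (a0)+,d3
        add.l   d3,d0           ; Red
        move.b  (a0)+,d3
        add.l   d3,d1           ; Green
        move.b  (a0)+,d3
        add.l   d3,d2           ; Blue
        dbra    d4,.pix
        adda.l  d6,a0           ; same column, next source row
        dbra    d5,.row
        divu.w  #9,d0           ; one div per gun: quotient in the low word
        divu.w  #9,d1
        divu.w  #9,d2
        move.b  d0,(a1)+
        move.b  d1,(a1)+
        move.b  d2,(a1)+
        rts
```

The largest possible sum is 9*255 = 2295, so the quotient always fits in a byte. On return a0 points three rows below the block, so the caller has to set it up again for the next block.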

I still recommend AsmOne for one reason, and that's the speed test, which stores the number of frames in a memory location, and AsmOne has a nice memory viewer. But you don't really need it.

