Optimising ILBM decode

pmc · 06 October 2011, 12:00

Hey fellas

For some work on a new prod I'm doing I need to decode lores EHB ILBM files. I only need to decode this type of ILBM, not ILBMs in general so I've written a decoder for just that purpose and it works perfectly fine.

The two main loops required are one to pull out the colour data and another to decode the RLE graphics data.

Here's my colour extraction loop:

Code:

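                    moveq.l             #32-1,d7                ; 32 CMAP entries
.put_colours:       move.b              (a0)+,d0                ; red byte
                    lsr.b               #4,d0                   ; keep the top nibble -> $0r
                    move.b              (a0)+,d1                ; green byte
                    andi.b              #$f0,d1                 ; -> $g0
                    move.b              (a0)+,d2                ; blue byte
                    lsr.b               #4,d2                   ; -> $0b
                    move.b              d0,-(sp)                ; byte push moves sp by 2, byte lands in the high half
                    move.w              (sp)+,d3                ; d3.w = $0r??
                    sf.b                d3                      ; clear the low byte -> $0r00
                    or.b                d1,d3                   ; $0rg0
                    or.b                d2,d3                   ; $0rgb
                    move.w              d3,(a1)                 ; colour word in the copperlist
                    addq.w              #4,a1                   ; on to the next colour word
                    dbf                 d7,.put_colours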

and here's my RLE decoder loop:

Code:

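                    movea.l             screenone_ptr(a5),a2    ; a2 = start of the current row, first bitplane
                    move.w              #screen_ht-1,d5         ; row counter
.next_row:          moveq.l             #screen_bpls-1,d6       ; bitplane counter for this row
                    movea.l             a2,a3                   ; a3 = write pointer
.crntrow_allbpls:   moveq.l             #screen_wd,d4           ; bytes left in this bitplane row
.rle_decode:        moveq.l             #0,d7
                    move.b              (a0)+,d7                ; control byte n
                    bmi.b               .replicate              ; negative: repeat run
                    sub.b               d7,d4                   ; literal run of n+1 bytes
.copy:              move.b              (a0)+,(a3)+
                    dbf                 d7,.copy
                    bra.b               .next_bpl
.replicate:         neg.b               d7                      ; repeat the next byte -n+1 times
                    sub.b               d7,d4
.do_replicate:      move.b              (a0),(a3)+
                    dbf                 d7,.do_replicate
                    addq.w              #1,a0                   ; step past the repeated byte
.next_bpl:          subq.b              #1,d4                   ; the +1 of the run just written
                    bne.b               .rle_decode             ; this bitplane row isn't full yet
                    lea                 screen_bplsz-screen_wd(a3),a3 ; same row, next bitplane
                    dbf                 d6,.crntrow_allbpls
                    lea                 screen_wd(a2),a2        ; next row, first bitplane
                    dbf                 d5,.next_row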

Now, while this works fine and doesn't take too long, I'd like to be certain I'm doing the ILBM decode as a whole as fast as possible.

So, my question is - is there any way the above routines could be optimised further than I already have or, alternatively, a completely different approach altogether which I've missed?

By the way, I should mention that I'm coding specifically for the 68000 processor and not 68020+

EDIT: in the RLE decoder, possibly the copy and replicate loops could be speeded up by determining the number of bytes in the current copy or replicate operation and moving words or longwords instead of bytes when possible...? The speedup of this would need to be traded off against how long it would take the code doing the decision logic for that to run of course...

Thorham · 06 October 2011, 20:33

For the color extraction part you could perhaps do this:

Code:

; assumed format: 0rgb
;
	move.b	#$f0,d6
	moveq	#32-1,d7
.loop
	moveq	#0,d0
	move.b	(a0)+,d0	; assumed red
	move.b	(a0)+,d1	; assumed green
	move.b	(a0)+,d2	; assumed blue

	lsl.w	#4,d0
	and.b	d6,d1
	lsr.b	#4,d2

	or.w	d1,d0
	or.w	d2,d0

	move.w	d0,(a1)
	addq.l	#4,a1

.next
	dbra	d7,.loop

pmc · 06 October 2011, 20:47

Thanks for taking a look and suggesting an alternative Thoram. All your assumptions about RGB were spot on

I did a quick test - the code you posted doesn't always work unfortunately.

For example, if d0=$f4, d1=$12 and d2=$23 (which is perfectly possible with the way colour bytes are written into the CMAP structure in an ILBM file) the resulting colour moved into the copperlist should be: $0f12 - your code outputs $0f52
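
Stepping through the posted loop with those three bytes shows where the 5 comes from. In this trace the bytes are loaded as immediates rather than read from (a0), and the upper bytes of d1 and d2 are assumed to be zero:

```asm
; trace of the first version with red=$f4, green=$12, blue=$23
	move.b	#$f0,d6
	moveq	#0,d0
	move.b	#$f4,d0		; red	d0.w = $00f4
	moveq	#$12,d1		; green	d1.w = $0012
	moveq	#$23,d2		; blue	d2.w = $0023

	lsl.w	#4,d0		; d0.w = $0f40, low red nibble is still in there
	and.b	d6,d1		; d1.w = $0010
	lsr.b	#4,d2		; d2.w = $0002

	or.w	d1,d0		; $0f40 | $0010 = $0f50, the 4 and the 1 merge
	or.w	d2,d0		; d0.w = $0f52 instead of $0f12
```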

Thorham · 06 October 2011, 20:56

Sorry

It should be this:

Code:

; assumed format: 0rgb
;
	move.b	#$f0,d6
	moveq	#32-1,d7
.loop
	moveq	#0,d0
	move.b	(a0)+,d0	; assumed red
	move.b	(a0)+,d1	; assumed green
	move.b	(a0)+,d2	; assumed blue

	lsl.w	#4,d0
	and.b	d6,d1
	lsr.b	#4,d2

	move.b	d1,d0
	or.b	d2,d0

	move.w	d0,(a1)
	addq.l	#4,a1

.next
	dbra	d7,.loop

pmc · 06 October 2011, 21:03

Nice one mate - looks better with that move in place of the or

Will do some speed testing and see if your version gains me some time.

EDIT: Out of interest, does anyone know of any utility available that could parse a text source code and add up all the cycles the various opcodes take? Something like that would be very very handy and it seems like it should be possible to do, although I'm not sure how easy or hard it would be to code such a utility in practice...
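
For loops this small a hand tally works too. As a sketch, here is the corrected loop from the previous post annotated with cycle counts from the standard MC68000 instruction timing tables. The figures assume no wait states and no chip RAM contention from DMA, so treat them as a lower bound and check them against the manual:

```asm
; cycles per instruction, plain 68000, no DMA contention assumed
	move.b	#$f0,d6		;  8  once
	moveq	#32-1,d7	;  4  once
.loop
	moveq	#0,d0		;  4
	move.b	(a0)+,d0	;  8
	move.b	(a0)+,d1	;  8
	move.b	(a0)+,d2	;  8

	lsl.w	#4,d0		; 14  (6+2n for a register shift by n)
	and.b	d6,d1		;  4
	lsr.b	#4,d2		; 14

	move.b	d1,d0		;  4
	or.b	d2,d0		;  4

	move.w	d0,(a1)		;  8
	addq.l	#4,a1		;  8
	dbra	d7,.loop	; 10 when taken, 14 on the final fall-through
				; = 94 per colour
				; 31*94 + 98 + 12 = 3024 for all 32 colours
```

The two shifts by four account for 28 of the 94 cycles per colour, so they are the first place to look for savings.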

Leffmann · 06 October 2011, 21:12

I would just do this to keep it short. I guess speed doesn't matter much since it's only 32 colors. Are you making an image converter or is it for a demo?

Code:

move.b  (a0)+, d0
lsl.w   #4, d0
move.b  (a0)+, d0
lsl.w   #4, d0
move.b  (a0)+, d0
lsr.w   #4, d0
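
Wrapped into a complete loop, the six instructions need neither a mask register nor a cleared d0, because whatever was left in d0 gets shifted out of the word along the way. A sketch with the register values for the $f4,$12,$23 example in the comments; the loop counter, the (a1) store and the 4-byte stride are borrowed from the other versions in this thread and are not part of the snippet above:

```asm
	moveq	#32-1,d7
.loop
	move.b	(a0)+,d0	; red	d0.w = $??f4 (?? = stale bits)
	lsl.w	#4,d0		;	d0.w = $?f40
	move.b	(a0)+,d0	; green	d0.w = $?f12
	lsl.w	#4,d0		;	d0.w = $f120
	move.b	(a0)+,d0	; blue	d0.w = $f123
	lsr.w	#4,d0		;	d0.w = $0f12
	move.w	d0,(a1)
	addq.l	#4,a1
	dbra	d7,.loop
```

By the usual 68000 timings each shift by four costs 14 cycles, so the three shifts alone come to 42 cycles per colour; the appeal here is the code size.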

Thorham · 06 October 2011, 21:14

Yeah, basically the 1 in $12 and the 4 in $f4 got or-ed and that's 5 of course

Here's one with the move removed:

Code:

;
; assumed format: 0rgb
;
	move.b	#$f0,d6
	moveq	#32-1,d7

.loop
	moveq	#0,d0
	move.b	(a0)+,d0	; assumed red
	lsl.w	#4,d0
	move.b	(a0)+,d0	; assumed green
	and.b	d6,d0
	move.b	(a0)+,d1	; assumed blue
	lsr.b	#4,d1
	or.b	d1,d0

	move.w	d0,(a1)
	addq.l	#4,a1

.next
	dbra	d7,.loop

pmc · 06 October 2011, 21:21

@ Thoram - thanks man. I can't see how your version will be anything but quicker but I'll test it out against the others.

@ Leffmann - it's not for an image converter, like I say - I'm only interested in being able to convert EHB pics and nothing else. I've been asked to help with making a slideshow and I want to be able to decode the images as quick as possible. Plus, I just enjoy trying to optimise my code and seeing how others would solve the same problems - helps me to learn to think more laterally.

Oh and thanks for posting a version yourself.

Like you say - the main speed savings would come from speeding up the RLE decode. Either of you got any ideas for that routine?

hitchhikr · 06 October 2011, 21:32

My RLE depacker looks like that, dunno if it's faster or not.

Code:

; d0=size
; a0=source
; a1=dest
RLEDecrunch:    moveq   #0,d2
                move.b  (a0)+,d2
                bmi.b   Pixels
CopyPixs:       move.b  (a0)+,(a1)+
                subq.l  #1,d0
                dble    d2,CopyPixs
                bra.b   NoPixel
Pixels:         neg.b   d2
                move.b  (a0)+,d1
CopyRepeat:     move.b  d1,(a1)+
                subq.l  #1,d0
                dble    d2,CopyRepeat
NoPixel:        tst.l   d0
                bgt.b   RLEDecrunch
                rts

Leffmann · 06 October 2011, 21:39

I guess raw palette and image data is the fastest then. The palette would only be 64 bytes ready to be written to the color registers so encoding them like this only eats time and space, and the gain from RL-encoding the images is typically not very big.

If image size does matter then plain sliding window compression might be better. It's very fast to decompress and compression ratio is always better than RLE. Ask Photon for his compressor, it does just this.

pmc · 06 October 2011, 21:40

hitchhikr: your version looks basically the same as mine, except it's missing all the other manipulations for putting the data in the correct places in the bitplanes.

I need to do those as I'm not using interleaved raw bitplanes.

I take it your version converts the interleaved ILBM bitplane data straight to interleaved raw bitplanes?

Leffmann - Agreed. RLE isn't very efficient. Mainly I'm trying to get a balance between ease of use of the images (cos I can just directly load and use the .iff images I'm provided) versus the time it takes to decode them.

hitchhikr · 06 October 2011, 22:03

Maybe using the blitter to "de-interleave" the bitmap afterwards would be faster?

Thorham · 08 October 2011, 22:13

Here's a small one for the RLE part.

Instead of copying memory to memory, like this:

Code:

.replicate
	neg.b	d7
	sub.b	d7,d4
.do_replicate:
	move.b	(a0),(a3)+
	dbf	d7,.do_replicate
	addq.w	#1,a0

You can move to a register first and then move the register to memory instead (hitchhikr's code also does this):

Code:

.replicate
	neg.b	d7
	sub.b	d7,d4
	move.b	(a0)+,d0
.do_replicate:
	move.b	d0,(a3)+
	dbf	d7,.do_replicate

Don't know how much faster it is, but the more often that loop gets executed, the more speed is gained.

Perhaps you can unroll both copy loops and try to copy words instead of bytes as well, but it might be a bit of a pain
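
To put a rough number on it, here are the two inner loops annotated with cycle counts from the standard MC68000 timing tables. This is a sketch that ignores DMA contention, and the second label is renamed only so both loops can sit in one listing:

```asm
; before: memory to memory inside the loop
.do_replicate:
	move.b	(a0),(a3)+		; 12 cycles
	dbf	d7,.do_replicate	; 10 cycles while looping
					; = 22 per byte

; after: fill byte already in d0
.do_replicate_reg:
	move.b	d0,(a3)+		;  8 cycles
	dbf	d7,.do_replicate_reg	; 10 cycles while looping
					; = 18 per byte
```

That is about 4 cycles saved per replicated byte, or roughly 18% of the inner loop.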

pmc · 08 October 2011, 22:53

Nice one Thoram - yeah, I'd already stolen that idea from the code hitchhikr posted and implemented it into my routine.

Thorham · 09 October 2011, 04:16

Quote:

Originally Posted by pmc

Nice one Thoram - yeah, I'd already stolen that idea from the code hitchhikr posted and implemented it into my routine.

Hadn't even noticed it

Have you tried unrolling the loop?

Photon · 09 October 2011, 17:44

If speed is of the essence, you will always do better with converting to a custom format. You might even save a few bytes in doing so! The fastest solution is to decode ilbm and save as raw files, then switch bitplane ptrs to that frame. By Grabthar's hammer... what a savings.

If this is for replaying animation frames, you will save even more space and time by making a player compatible with IFFanim.

But I think you just want to make the ultimate most fantastic superfast *drumroll* ILBM converter... which I find utterly unnecessary but there you go

Just REPT 256 the copy or fill instruction and jump into the chunk of moves at an offset of (256-count)*2. For the fill, move (a0) into Dn first.
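
A minimal sketch of how that could look for the replicate case, written against the registers of the decoder posted earlier in the thread. Assumptions: the assembler supports REPT/ENDR, the .fill label is made up for illustration, the block holds 128 moves rather than 256 because a ByteRun1 run never exceeds 128 bytes, and the stream contains no $80 no-op control bytes (those would jump outside the block):

```asm
.replicate:         neg.b               d7                      ; d7 = 1..127, run is d7+1 bytes
                    sub.b               d7,d4                   ; same row accounting as before
                    move.b              (a0)+,d3                ; fetch the fill byte once
                    add.w               d7,d7                   ; 2*d7
                    neg.w               d7                      ; -2*d7
                    addi.w              #254,d7                 ; (128-(d7+1))*2 = entry offset
                    jmp                 .fill(pc,d7.w)          ; skip the moves we don't need
.fill:
                    rept                128
                    move.b              d3,(a3)+                ; one word of code per byte written
                    endr
                                                                ; falls through into .next_bpl as before
```

The literal copy works the same way with move.b (a0)+,(a3)+, which is also a one-word instruction. By the usual timing tables the offset calculation and jmp cost around 30 cycles, against the 10-cycle dbf saved on every byte, so the unrolled path should win for runs of three bytes or more.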

pmc · 09 October 2011, 22:52

Quote:

Originally Posted by Photon

But I think you just want to make the ultimate most fantastic superfast *drumroll* ILBM converter... which I find utterly unnecessary but there you go

No, I quite agree. Under all previous circumstances I've converted graphics data to raw. Converting ILBMs on the fly is pointless - except when the specific task at hand is to convert them on the fly that is.

So my thinking was along these lines: if that's what I've gotta do, might as well do it as quick as possible cos, after all, doing things smaller and faster is where the fun comes into assembly coding.

Anyway, updated quicker versions are -

cols:

Code:

                    move.b              #$f0,d6
                    moveq.l             #32-1,d7
.put_colours:       move.b              (a0)+,d0
                    lsl.w               #4,d0
                    move.b              (a0)+,d0
                    and.b               d6,d0
                    move.b              (a0)+,d1
                    lsr.b               #4,d1
                    or.b                d1,d0
                    move.w              d0,(a1)
                    addq.w              #4,a1
                    dbf                 d7,.put_colours

RLE decode:

Code:

                    movea.l             screenone_ptr(a5),a2
                    move.w              #screen_ht-1,d5
.next_row:          moveq.l             #screen_bpls-1,d6
                    movea.l             a2,a3
.crntrow_allbpls:   moveq.l             #screen_wd,d4
.rle_decode:        moveq.l             #0,d7
                    move.b              (a0)+,d7
                    bmi.b               .replicate
                    sub.b               d7,d4
.copy:              move.b              (a0)+,(a3)+
                    dbf                 d7,.copy
                    bra.b               .next_bpl
.replicate:         neg.b               d7
                    sub.b               d7,d4
                    move.b              (a0)+,d3
.do_replicate:      move.b              d3,(a3)+
                    dbf                 d7,.do_replicate
.next_bpl:          subq.b              #1,d4
                    bne.b               .rle_decode
                    lea                 screen_bplsz-screen_wd(a3),a3
                    dbf                 d6,.crntrow_allbpls
                    lea                 screen_wd(a2),a2
                    dbf                 d5,.next_row
                    rts

Thanks for your advice guys - especially Thoram, much neater and tidier colour extraction loop and shaved about a raster line off execution time too
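
One case the decoder above doesn't cover: ByteRun1 defines a control byte of -128 ($80) as a no-op, whereas the loop treats it as a 129-byte repeat. Whether the files in question ever contain one is a separate matter, so this guard is a sketch for safety rather than a confirmed fix. Since neg.b leaves $80 unchanged and still negative, one extra branch is enough to catch it:

```asm
.replicate:         neg.b               d7                      ; -1..-127 -> 1..127, $80 stays $80
                    bmi.b               .rle_decode             ; still negative: no-op, fetch the next control byte
                    sub.b               d7,d4
                    move.b              (a0)+,d3
.do_replicate:      move.b              d3,(a3)+
                    dbf                 d7,.do_replicate
```

The cost is one untaken branch per replicate run.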

Thorham · 10 October 2011, 18:09

Quote:

Originally Posted by Photon

Just REPT 256 the copy or fill instruction and jump into the chunk of moves at an offset of (256-count)*2. For the fill, move (a0) into Dn first.

Yes, a good idea indeed

Quote:

Originally Posted by pmc

Thanks for your advice guys - especially Thoram, much neater and tidier colour extraction loop and shaved about a raster line off execution time too

You're welcome

Oh, and it's Thorham, not Thoram

pmc · 10 October 2011, 22:27

Sorry for getting your name wrong Thorham - no offence intended. Sometimes I seem to read names wrong - I keep spelling Leffmann's name with only one n too

I've been reminding myself to check how I've spelt his name in posts so I'll remind myself to check the spelling of your name too now as well

Thorham · 12 October 2011, 19:32

Quote:

Originally Posted by pmc

Sorry for getting your name wrong Thorham - no offence intended.

That's fine

Did you try Photon's REPT idea? It gets rid of the dbf instructions, and on a plain 68000, which has none of the caches and pipelining of the 68020+, this will most certainly be faster.
