Old 16 November 2014, 22:33   #21
Keir
Registered User
 
Join Date: May 2011
Location: Cambridge
Posts: 682
Quote:
Originally Posted by Photon View Post
Decompression speed means output rate, of course. In general, you have a bunch of files totalling n bytes. For either compression or decompression you're really only interested in how fast it compresses or decompresses the files, i.e. those n bytes. With the optimized version and measuring the output rate you will probably get a higher figure then!

Speed tests were run for the selected crunchers on 8 exe, graphics, and music files, each larger than 80KB. My timing routine had 1/100s precision, so the large file size was chosen to keep rounding from affecting the results. For LZX I used an optimized source code, which may account for its nice result. In all cases the depacker decompressed from fastram to fastram, with the code running in fastram, except in the case of LHA, where I didn't have the source and unpacked from file to RAM with lots of buffering, which may account for its low score.

Measurements were done on my A500 with a GVP HD8+ SCSI interface (2.3MB/s DMA) and 8MB of Zorro (24-bit) fastram.

With fastmem disabled, decompressing DPaint 3.25 (277616 bytes) with Nibbler from slowmem to slowmem on the A500 gave a speed of 132198 bytes/s.

Basically, the speed is throttled by the speed at which new bytes can be written to RAM.

In fact, you can compare a decompressor with a naïve (non-unrolled) byte-copy loop to get a figure to put on "decompression code overhead". Copying DPaint from slowmem to slowmem gave a rate of 182642 bytes/s, making Nibbler decompress 38% slower than this byte-copy loop, i.e. 38% decompression code overhead. On CPUs with instruction caches this figure would drop a little, but would still be throttled by how much burst writes help with writing the output byte stream to RAM.

A quick calculation puts LZX (the best competitor) at 658% decompression code overhead, so ...mmm... that chart still looks mighty fine to me!
Thank you very much for taking the time to write this detailed information. Well, I cannot compare fastmem numbers, since all my fastram expansions are on accelerators, so I can't get basic 68000 numbers with them.

However, chipmem only, I did test ASM-One_V1.48, which is one of the files you tested too. 'gzip -9' compresses it to 47.6% of its original size (288464 -> 137221). I unpack this at a rate of 27657 bytes/s ingest, or 57666 bytes/s output.

My original test was the first 204480 bytes of New Zealand Story in ADF form. It packs better (204480 -> 68741; 33.6% of original size). I unpack this at a rate of 23868 bytes/s ingest, or 71000 bytes/s output.

Decompression overhead relative to the simple byte-copy loop is 2.57x - 3.17x (versus your overhead of 1.38x, or LZX at 7.58x).

So I don't look so bad after all. Beats everyone but you by a good margin, and 'gzip -9' compresses very well. I don't think Inflate is going to touch the speed of your unpack routine however!

Last edited by Keir; 16 November 2014 at 23:59.
Keir is offline  
Old 16 November 2014, 23:24   #22
Photon
Moderator
 
Join Date: Nov 2004
Location: Eksjö / Sweden
Posts: 5,602
Well, it's simply how much longer it takes to decompress, compared to a byte-copy loop. The reference to a byte-copy loop is my own, but it's a real one. This is what optimization of decompressors will approach. At least until someone writes a word or longword compressor.

38% overhead simply means it takes 1.38x the time to decompress - compared to a byte-copy loop of the original file. 658% means it takes 7.58x the time.
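
For illustration, the baseline I'm comparing against is essentially a loop like this (a minimal sketch; register usage and label name are just an example):
Code:
; naive byte-copy baseline: copy d0.l bytes from (a0)+ to (a1)+
copyloop:
move.b (a0)+,(a1)+ ; read one byte, write one byte
subq.l #1,d0       ; one byte less to go
bne.s copyloop     ; loop until the whole file is copied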

Had a look now, and at first glance I'm not sure how to provide the correct parameters; e.g. it doesn't mention source and destination addresses, and I'm not sure what to provide as nodes[] etc. Is there a source that decompresses an example .gz file from RAM to RAM?
Photon is offline  
Old 16 November 2014, 23:49   #23
PeterK
Registered User
 
Join Date: Apr 2005
Location: digital hell, Germany, after 1984, but worse
Posts: 3,365
@kaffer

Thanks a lot for contributing your inflate routine here. It's very interesting for me, because I had to fight a lot with the zlib inflate code some time ago, which I extracted from zlib.library 3.2 and then tried to optimize as well as possible for my icon.library. I never really understood how Huffman decoding works, but I could at least reduce the size of the zlib decoder by nearly 80%.

Compared to the ported and optimized inflate function from the zlib.library, your code is a lot shorter and consumes much less memory than my current inflate function. So I replaced my routine with your code and checked it out by loading several PNG and OS4 iconsets.

The good news: Your inflate code could successfully decode all the tested iconsets. And the clear structure and your comments may finally help me to understand Huffman decoding.

The bad news: It is significantly slower than my current zlib decoder, and I really hoped and expected the opposite. (It needs 6.5 times as long, although the real bottleneck in my code is the color reduction). Using only the stack can also be a problem without MinStack or StackAttack, since StackCheck reported more than 5200 bytes used by the WB calling my icon.library.

In order to use your code for a zlib stream, the zlib header has to be processed before calling your inflate function:

Old inflate code removed. There is newer code at the end of my icon.library assembler source.
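
For reference, per RFC 1950 the zlib header is just two bytes (CMF and FLG), plus a 4-byte dictionary id if the FDICT flag is set. A generic sketch of skipping it (not the removed code above), assuming a5 points at the zlib stream:
Code:
; skip the zlib (RFC 1950) header so a5 ends up at the raw DEFLATE data
addq.l #1,a5    ; skip CMF byte (method / window size)
move.b (a5)+,d0 ; FLG byte
btst #5,d0      ; FDICT flag set?
beq.s nodict
addq.l #4,a5    ; skip the 4-byte preset dictionary id
nodict:
; a5 -> DEFLATE stream, ready to be passed to inflate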

Last edited by PeterK; 26 June 2020 at 13:10.
PeterK is offline  
Old 16 November 2014, 23:55   #24
Keir
Registered User
 
Join Date: May 2011
Location: Cambridge
Posts: 682
Quote:
Originally Posted by Photon View Post
Well, it's simply how much longer it takes to decompress, compared to a byte-copy loop. The reference to a byte-copy loop is my own, but it's a real one. This is what optimization of decompressors will approach. At least until someone writes a word or longword compressor.

38% overhead simply means it takes 1.38x the time to decompress - compared to a byte-copy loop of the original file. 658% means it takes 7.58x the time.
Ah yes, that's more intuitive to me. Then I'm in the range 2.57x - 3.17x.

Quote:
Had a look now, and at first glance I'm not sure how to provide the correct parameters; e.g. it doesn't mention source and destination addresses, and I'm not sure what to provide as nodes[] etc. Is there a source that decompresses an example .gz file from RAM to RAM?
Sorry, the entry-point routine is at the end of inflate.asm. Use it something like this:
Code:
lea (output_buf),a4
lea (input_stream),a5
bsr inflate
input_stream is a DEFLATE stream, such as one ripped from a gzip file. The gzip header is very easy to remove; there is a degzip.c utility which does that in my repository here https://github.com/keirf/Amiga-Stuff, but it also implements Inflate, so it is way more complex than is actually required. Still, if you build it you can simply run 'degzip in.gz out' and then pass out to inflate via a5.
Keir is offline  
Old 17 November 2014, 00:01   #25
Photon
Moderator
 
Join Date: Nov 2004
Location: Eksjö / Sweden
Posts: 5,602
Quote:
Originally Posted by kaffer View Post
Code:
lea (output_buf),a4
lea (input_stream),a5
bsr inflate
input_stream is a DEFLATE stream, such as one ripped from a gzip file. The gzip header is very easy to remove; there is a degzip.c utility which does that in my repository here https://github.com/keirf/Amiga-Stuff, but it also implements Inflate, so it is way more complex than is actually required. Still, if you build it you can simply run 'degzip in.gz out' and then pass out to inflate via a5.
That makes more sense, I guess I was looking too much at build_code.

No, I don't have a C compiler for the Amiga, so a degzip.exe for 68000 would be helpful (well, required!) for me to decompress any files, then.
Photon is offline  
Old 17 November 2014, 00:07   #26
Keir
Registered User
 
Join Date: May 2011
Location: Cambridge
Posts: 682
Quote:
Originally Posted by PeterK View Post
@kaffer
The good news: Your inflate code could successfully decode all the tested iconsets. And the clear structure and your comments may finally help me to understand Huffman decoding.
The fact that they are specifically Huffman trees is almost of no consequence -- to the unpacker they are simply binary tries for converting codes to symbols. RFC 1951 is actually really well written and also quite short; read Section 3.2 of that and all will be clear.
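
To give an idea, decoding one symbol is just walking such a trie one input bit at a time. A rough sketch with a hypothetical node layout (not the layout inflate.asm actually uses):
Code:
; hypothetical layout: each node is two words (child for bit 0, child for bit 1),
; a negative entry is a leaf holding the complemented symbol value
; a0 = current node, getbit returns the next input bit in d1
walk:
bsr getbit          ; d1 = 0 or 1
add.w d1,d1         ; bit*2 = word offset into the node
move.w (a0,d1.w),d2 ; fetch the selected child entry
bmi.s leaf          ; negative -> it is a symbol
add.w d2,a0         ; otherwise it is an offset from this node to the child
bra.s walk
leaf:
not.w d2            ; d2 = decoded symbol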

Quote:
The bad news: It is significantly slower than my current zlib decoder, and I really hoped and expected the opposite. (It needs 6.5 times as long, although the real bottleneck in my code is the color reduction).
Wow that sucks. What CPU do you test on? I wonder whether inlining all subroutines in the main decode loop hurts on 68020+ by overflowing the icache. I would suggest trying with OPT_INLINE_FUNCTIONS=0. Also be sure that OPT_TABLE_LOOKUP=1.

Quote:
Using only the stack can also be a problem without MinStack or StackAttack, since StackCheck reported more than 5200 bytes used by the WB calling my icon.library.
There is an OPT_STORAGE_OFFSTACK option you can specify, and then you must pass a few kB of temporary storage space via a6. The stack will then be left alone.

Quote:
In order to use your code for a zlib stream, the zlib header has to be processed before calling your inflate function:
Thanks, I will take a look!
Keir is offline  
Old 17 November 2014, 00:35   #27
Keir
Registered User
 
Join Date: May 2011
Location: Cambridge
Posts: 682
Quote:
Originally Posted by PeterK View Post
@kaffer
Compared to the ported and optimized inflate function from the zlib.library, your code is a lot shorter and consumes much less memory than my current inflate function. So I replaced my routine with your code and checked it out by loading several PNG and OS4 iconsets.
How big is an average iconset? I wonder whether files much bigger (or, especially, much smaller) than the ones I've mostly tested could mean that different areas of the code dominate, such as the rather less optimised routine that creates the lookup tables/tries.
Keir is offline  
Old 17 November 2014, 01:16   #28
PeterK
Registered User
 
Join Date: Apr 2005
Location: digital hell, Germany, after 1984, but worse
Posts: 3,365
I tested your code under WinUAE with 68020 CPU, since I don't have a working real Amiga anymore.

Disabling the Inline-option makes no difference for the JIT compiler.

The iconsets are usually a bunch of 50-200 PNG or OS4 icons, where each of their images has a compressed size of 3-10 kB.

Attached is my zlib decoder stuff, although it's not cleaned up and prepared for a public release. So, you will find a lot of confusing comments from a coder who did not understand the code and also still some references to my icon.library (A5).

Update:
Used the timer device now to compare the inflate functions exclusively. The result was 45x (not 6.5x).
But this huge difference may only occur for icon decoding: the icons make use of the big tables in my routine, in contrast to standard gzip streams, which don't use these tables. My code is not optimized for standard gzip decompression.

I think one problem is that you rebuild the tables again for every function call. That makes it slow.

Old inflate code removed. There is newer code at the end of my icon.library assembler source.

Last edited by PeterK; 26 June 2020 at 13:13.
PeterK is offline  
Old 17 November 2014, 08:02   #29
Keir
Registered User
 
Join Date: May 2011
Location: Cambridge
Posts: 682
Quote:
Originally Posted by PeterK View Post
I tested your code under WinUAE with 68020 CPU, since I don't have a working real Amiga anymore.

Disabling the Inline-option makes no difference for the JIT compiler.

The iconsets are usually a bunch of 50-200 PNG or OS4 icons, where each of their images has a compressed size of 3-10 kB.

Attached is my zlib decoder stuff, although it's not cleaned up and prepared for a public release. So, you will find a lot of confusing comments from a coder who did not understand the code and also still some references to my icon.library (A5).

Update:
Used the timer device now to compare the inflate functions exclusively. The result was 45x (not 6.5x).
I think one problem is that you rebuild the tables again for every function call. That makes it slow.
Thanks, it's an interesting case of optimising for different input test data. gzip rarely generates blocks that use the static Huffman trees (your static 'bigtabs'), so I made that code/data as small as possible and treat it as a special case of the dynamic Huffman tables. Recalculating those trees 100+ times for lots of small files is going to suck compared with a nice big static lookup table and going straight into decode.

I could optimise that case by allowing the static-case tables to be pre-generated and set aside at start of day. I reckon I could have a go at beating your code, because although it is unrolled and inlined up the wazoo and really quite slick, it is quite unnecessarily outputting data via a pointless 32kB window buffer (if I'm not misreading the code). I expect that's the bulk of your extra memory usage right there, plus you get an extra read and write of every output byte. The window buffer is only necessary if you are streaming output to disk or network (for example); there's no need for it when doing in-memory decompression.
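
To show what I mean: with the whole output in RAM, a match can be copied straight from the bytes already written to the destination, with no intermediate window at all. Roughly (a3/d0/d1 here are just illustrative, a4 being the running output pointer):
Code:
; illustrative match copy for in-memory decompression (no window buffer)
; a4 = current output position, d0 = match length, d1 = match distance
movea.l a4,a3
suba.l d1,a3       ; source = already-written output, 'distance' bytes back
matchloop:
move.b (a3)+,(a4)+ ; byte by byte, since source and destination may overlap
subq.w #1,d0
bne.s matchloop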

If you optimise out that intermediate buffer then there's not much fat to trim speed-wise and it's a very slick albeit ugly routine!
Keir is offline  
Old 17 November 2014, 08:18   #30
PeterK
Registered User
 
Join Date: Apr 2005
Location: digital hell, Germany, after 1984, but worse
Posts: 3,365
Yes, I think you're right, the 32 kB window buffer is definitely not optimal. But since I only ported the code from zlib.library, I just tried to optimize it as it was, without changing the concept. That's because I still don't have a clear picture of what's going on and no idea yet how to make it better. But I will try again.
PeterK is offline  
Old 17 November 2014, 08:52   #31
Keir
Registered User
 
Join Date: May 2011
Location: Cambridge
Posts: 682
Quote:
Originally Posted by PeterK View Post
Yes, I think you're right, the 32 kB window buffer is definitely not optimal. But since I only ported the code from zlib.library, I just tried to optimize it as it was, without changing the concept. That's because I still don't have a clear picture of what's going on and no idea yet how to make it better. But I will try again.
As will I, optimising for your usage case.

Maybe I am mad, but this kind of optimisation and algorithm work is quite fun!

My plan is to allow pre-generation of the len/distance base+extrabits tables and the static-Huffman tables. Also I will look into optimisations for 68020+: is there anything to look for there apart from allowing unaligned memory accesses and using scaled-index addressing modes? I guess shifts are cheap(er), so there is less need to work to avoid them...
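
For instance, the manual index scaling needed on the 68000 can be folded into the addressing mode on 68020+ (just a small illustration):
Code:
; 68000: scale the table index by hand for word-sized entries
add.w d0,d0           ; index*2
move.w (a0,d0.w),d1
; 68020+: the scaled-index addressing mode does the *2 for free
move.w (a0,d0.w*2),d1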
Keir is offline  
Old 17 November 2014, 17:50   #32
Keir
Registered User
 
Join Date: May 2011
Location: Cambridge
Posts: 682
Okay, I have implemented table pre-generation in the latest version of inflate.asm here: https://raw.githubusercontent.com/ke...er/inflate.asm

To use it you need to do the following (a rough call sketch follows the list):
  • Set OPT_STORAGE_OFFSTACK and OPT_PREGENERATE_TABLES to 1 (both are described at the top of the file, and default to 0)
  • At start of day 'bsr inflate_gentables' with a6 pointing at the end of a 6000-byte memory area
  • To inflate, 'bsr inflate_fromtables' with a6 set exactly the same as above, and a4/a5 set as for the normal inflate function
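Putting it together, something like this (the labels and the static scratch area are just an example):
Code:
; one-time setup
lea (scratch_end),a6  ; a6 = END of the 6000-byte area
bsr inflate_gentables

; per-stream decompression, reusing the same a6
lea (scratch_end),a6
lea (output_buf),a4   ; destination buffer
lea (input_stream),a5 ; DEFLATE stream
bsr inflate_fromtables

; somewhere in the data section:
scratch:     ds.b 6000 ; temporary storage area
scratch_end: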

I haven't yet done any 68020+ optimisations but I think there is only minor cycle shaving to do there; table pre-generation should make a really massive difference.
Keir is offline  
Old 17 November 2014, 20:17   #33
PeterK
Registered User
 
Join Date: Apr 2005
Location: digital hell, Germany, after 1984, but worse
Posts: 3,365
Thank you, kaffer

Your new inflate function is already much faster. It now needs only 60% longer than mine, but it is still more than 2 kB shorter and also saves a lot of memory, both of which can be very important for standard low-end Amigas like the A500 or A600. The PNG and OS4 icons can be converted into the OS 3.5 format. The difference for the complete icon reading, uncompressing, color processing and rendering is only 7%.
PeterK is offline  
Old 17 November 2014, 20:33   #34
Keir
Registered User
 
Join Date: May 2011
Location: Cambridge
Posts: 682
Quote:
Originally Posted by PeterK View Post
Thank you, kaffer

Your new inflate function is already much faster. It now needs only 60% longer than mine, but it is still more than 2 kB shorter and also saves a lot of memory, both of which can be very important for standard low-end Amigas like the A500 or A600. The PNG and OS4 icons can be converted into the OS 3.5 format. The difference for the complete icon reading, uncompressing, color processing and rendering is only 7%.
One algorithmic difference I can see is that your bigtable has a full 9-bit lookup array for the literal/length symbols (9 bits is the longest code length in the static Huffman encoding), whereas I do an 8-bit prefix lookup, and codes longer than that fall through to a subtree for the remaining bits. That could make a bit of difference, though I'm not sure it's all that much; the 8-bit + 1-bit lookup comes out quite nicely in the code.

Apart from that it might just be some optimisations with an eye to 68020+. I add bytes to the input-buffering register lazily, as shifting bytes up into the most-significant position is expensive on the 68000, whereas prefetching up front for the next few code lookups probably makes sense on the 68020, avoiding some tests and branches, and the longer shifts up to the far end of the shift register have no extra cost on 68020+.
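
DEFLATE streams are LSB-first, so one common refill arrangement (not necessarily exactly what inflate.asm does) keeps the bit buffer LSB-aligned and ORs each new byte in above the bits already held:
Code:
; generic LSB-first bit-buffer refill, for illustration only
; d5 = bit buffer, d6 = number of valid bits in d5, a5 = input pointer
refill:
moveq #0,d0
move.b (a5)+,d0 ; next input byte
lsl.l d6,d0     ; position it above the d6 bits already buffered
or.l d0,d5      ; merge into the bit buffer
addq.w #8,d6    ; eight more valid bits available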

I'll have another browse of your code and see if there are any tricks to steal.

If you could send me a Deflate stream for one or two icons that might be handy, then I can examine their profile a bit and see where time might be being spent.
Keir is offline  
Old 17 November 2014, 20:49   #35
PeterK
Registered User
 
Join Date: Apr 2005
Location: digital hell, Germany, after 1984, but worse
Posts: 3,365
Most likely I would use your code for the 68000 version of the icon.library, since the users of these systems will be happy about every kB of code saved, and the difference in memory consumption is even more than 40 kB, which is really a lot on a 500 kB Amiga. I didn't check your code for possible optimizations yet; maybe I will have a look on one of the next days.
PeterK is offline  
Old 17 November 2014, 22:07   #36
Keir
Registered User
 
Join Date: May 2011
Location: Cambridge
Posts: 682
Quote:
Originally Posted by PeterK View Post
Most likely I would use your code for the 68000 version of the icon.library, since the users of these systems will be happy about every kB of code saved, and the difference in memory consumption is even more than 40 kB, which is really a lot on a 500 kB Amiga. I didn't check your code for possible optimizations yet; maybe I will have a look on one of the next days.
That's the direction I approached my implementation from, so it makes sense it is suited to that. Thanks for trying it out and if you find time to do some optimising that's great!
Keir is offline  
Old 17 November 2014, 22:12   #37
PeterK
Registered User
 
Join Date: Apr 2005
Location: digital hell, Germany, after 1984, but worse
Posts: 3,365
Ok, two PNG icon streams are attached (only the inflate data, no zlib header and CRC). I hope that I grabbed the correct data bytes.

Thanks for your great work, kaffer !!
Attached Files
File Type: lha stream_dat.lha (10.9 KB, 196 views)
PeterK is offline  
Old 17 November 2014, 22:43   #38
Keir
Registered User
 
Join Date: May 2011
Location: Cambridge
Posts: 682
Quote:
Originally Posted by PeterK View Post
Ok, two PNG icon streams are attached (only the inflate data, no zlib header and CRC). I hope that I grabbed the correct data bytes.
Very interesting: Arnold.dat is the kind of stream I'm used to, just a couple of sub-blocks, both with their own embedded Huffman dictionaries.

BoingBall.dat is obviously made by a different and rather insane encoder; it contains nearly 100(!) sub-blocks, about 95% of which use the static Huffman dictionaries. This would explain why generating those static tables on the fly per block sucked so badly -- the generation overhead was incurred roughly once per 50 bytes of input. I will hook this one into my test harness.
Keir is offline  
Old 18 November 2014, 16:34   #39
Keir
Registered User
 
Join Date: May 2011
Location: Cambridge
Posts: 682
Quote:
Originally Posted by kaffer View Post
Very interesting: Arnold.dat is the kind of stream I'm used to, just a couple of sub-blocks, both with their own embedded Huffman dictionaries.

BoingBall.dat is obviously made by a different and rather insane encoder; it contains nearly 100(!) sub-blocks, about 95% of which use the static Huffman dictionaries. This would explain why generating those static tables on the fly per block sucked so badly -- the generation overhead was incurred roughly once per 50 bytes of input. I will hook this one into my test harness.
I have done some 68020+ optimisations, enabled by setting MC68020=1. Seems to gain about 5%. That's me done for a while on this, I expect.
Keir is offline  
Old 23 November 2014, 16:47   #40
Don_Adan
Registered User
 
Join Date: Jan 2008
Location: Warsaw/Poland
Age: 55
Posts: 1,959
Quote:
Originally Posted by kaffer View Post
Thanks, I'd spotted the LEA->ADD.w optimisations, but shortening the exit branch is nice to have too, shaves a few more cycles!
You can replace lea instructions, e.g.
Code:
lea (a1,d2.w),a3
with
Code:
move.l a1,a3
add.w d2,a3
This is faster than or the same speed as the lea version on most 680x0 CPUs (except the 68060), if I remember right.
Don_Adan is offline  
 

