|
Dr. Dobb's Journal
October, 1989 |
Thanks to Jan Hakenberg for correction of a couple of errors! In Figure 4, the values for new table entries 259 and 265 were truncated.
Thanks to David Littlewood for pointing out the missing line of pseudocde in Figure 6.
Thanks to Joe Snyder for pointing out a line where a macro should replace a hard-coded constant.
Any programmer working on mini or microcomputers in this day and age should have at least some exposure to the concept of data compression. In MS-DOS world, programs like ARC, by System Enhancement Associates, and PKZIP, by PKware are ubiquitous. ARC has also been ported to quite a few other machines, running UNIX, CP/M, and so on. CP/M users have long had SQ and USQ to squeeze and expand programs. Unix users have the COMPRESS and COMPACT utilities. Yet the data compression techniques used in these programs typically only show up in two places: file transfers over phone lines, and archival storage.
Data compression has an undeserved reputation for being difficult to master, hard to implement, and tough to maintain. In fact, the techniques used in the previously mentioned programs are relatively simple, and can be implemented with standard utilities taking only a few lines of code. This article discusses a good all-purpose data compression technique: Lempel-Ziv-Welch, or LZW compression.
The routines shown here belong in any programmer's toolbox. For example, a program that has a few dozen help screens could easily chop 50K bytes off by compressing the screens. Or 500K bytes of software could be distributed to end users on a single 360K byte floppy disk. Highly redundant database files can be compressed down to 10% of their original size. Once the tools are available, the applications for compression will show up on a regular basis.
LZW Fundamentals
The original Lempel Ziv approach to data compression was first published in in 1977, followed by an alternate approach in 1978. Terry Welch's refinements to the 1978 algorithm were published in 1984. The algorithm is surprisingly simple. In a nutshell, LZW compression replaces strings of characters with single codes. It does not do any analysis of the incoming text. Instead, it just adds every new string of characters it sees to a table of strings. Compression occurs when a single code is output instead of a string of characters.
The code that the LZW algorithm outputs can be of any arbitrary length, but it must have more bits in it than a single character. The first 256 codes (when using eight bit characters) are by default assigned to the standard character set. The remaining codes are assigned to strings as the algorithm proceeds. The sample program runs as shown with 12 bit codes. This means codes 0-255 refer to individual bytes, while codes 256-4095 refer to substrings.
Compression
The LZW compression algorithm in its simplest form is shown in Figure 1. A quick examination of the algorithm shows that LZW is always trying to output codes for strings that are already known. And each time a new code is output, a new string is added to the string table.
Routine LZW_COMPRESS
-
STRING = get input character
-
WHILE there are still input characters DO
-
CHARACTER = get input character
-
IF STRING+CHARACTER is in the string table then
-
STRING = STRING+character
-
ELSE
-
output the code for STRING
-
add STRING+CHARACTER to the string table
-
STRING = CHARACTER
-
END of IF
-
END of WHILE
-
output the code for STRING
The Compression Algorithm
Figure 1
A sample string used to demonstrate the algorithm is shown in Figure 2. The input string is a short list of English words separated by the '/' character. Stepping through the start of the algorithm for this string, you can see that the first pass through the loop, a check is performed to see if the string "/W" is in the table. Since it isn't, the code for '/' is output, and the string "/W" is added to the table. Since we have 256 characters already defined for codes 0-255, the first string definition can be assigned to code 256. After the third letter, 'E', has been read in, the second string code, "WE" is added to the table, and the code for letter 'W' is output. This continues until in the second word, the characters '/' and 'W' are read in, matching string number 256. In this case, the code 256 is output, and a three character string is added to the string table. The process continues until the string is exhausted and all of the codes have been output.
| Input String = /WED/WE/WEE/WEB/WET | |||
| Character Input | Code Output | New code value | New String |
| /W | / | 256 | /W |
| E | W | 257 | WE |
| D | E | 258 | ED |
| / | D | 259 | D/ |
| WE | 256 | 260 | /WE |
| / | E | 261 | E/ |
| WEE | 260 | 262 | /WEE |
| /W | 261 | 263 | E/W |
| EB | 257 | 264 | WEB |
| / | B | 265 | B/ |
| WET | 260 | 266 | /WET |
| EOF | T | ||
The Compression Process
Figure 2
The sample output for the string is shown in Figure 2 along with the resulting string table. As can be seen, the string table fills up rapidly, since a new string is added to the table each time a code is output. In this highly redundant input, 5 code substitutions were output, along with 7 characters. If we were using 9 bit codes for output, the 19 character input string would be reduced to a 13.5 byte output string. Of course, this example was carefully chosen to demonstrate code substitution. In real world examples, compression usually doesn't begin until a sizable table has been built, usually after at least one hundred or so bytes have been read in.
Decompression
The companion algorithm for compression is the decompression algorithm. It needs to be able to take the stream of codes output from the compression algorithm, and use them to exactly recreate the input stream. One reason for the efficiency of the LZW algorithm is that it does not need to pass the string table to the decompression code. The table can be built exactly as it was during compression, using the input stream as data. This is possible because the compression algorithm always outputs the STRING and CHARACTER components of a code before it uses it in the output stream. This means that the compressed data is not burdened with carrying a large string translation table.
Routine LZW_DECOMPRESS
-
Read OLD_CODE
-
output OLD_CODE
-
WHILE there are still input characters DO
-
Read NEW_CODE
-
STRING = get translation of NEW_CODE
-
output STRING
-
CHARACTER = first character in STRING
-
add OLD_CODE + CHARACTER to the translation table
-
OLD_CODE = NEW_CODE
-
END of WHILE
The Decompression Algorithm
Figure 3
The algorithm is shown in Figure 3. Just like the compression algorithm, it adds a new string to the string table each time it reads in a new code. All it needs to do in addition to that is translate each incoming code into a string and send it to the output.
Figure 4 shows the output of the algorithm given the input created by the compression earlier in the article. The important thing to note is that the string table ends up looking exactly like the table built up during compression. The output string is identical to the input string from the compression algorithm. Note that the first 256 codes are already defined to translate to single character strings, just like in the compression code.
| Input Codes: / W E D 256 E 260 261 257 B 260 T | ||||
| Input/ NEW_CODE |
OLD_CODE | STRING/ Output |
CHARACTER | New table entry |
| / | / | / | ||
| W | / | W | W | 256 = /W |
| E | W | E | E | 257 = WE |
| D | E | D | D | 258 = ED |
| 256 | D | /W | / | 259 = D/ |
| E | 256 | E | E | 260 = /WE |
| 260 | E | /WE | / | 261 = E/ |
| 261 | 260 | E/ | E | 262 = /WEE |
| 257 | 261 | WE | W | 263 = E/W |
| B | 257 | B | B | 264 = WEB |
| 260 | B | /WE | / | 265 = B/ |
| T | 260 | T | T | 266 = /WET |
The Decompression Process
Figure 4
The Catch
Unfortunately, the nice simple decompression algorithm shown in Figure 4 is just a little too simple. There is a single exception case in the LZW compression algorithm that causes some trouble to the decompression side. If there is a string consisting of a (STRING,CHARACTER) pair already defined in the table, and the input stream then sees a sequence of STRING, CHARACTER, STRING, CHARACTER, STRING, the compression algorithm will output a code before the decompressor gets a chance to define it.
A simple example will illustrate the point. Imagine the the string JOEYN is defined in the table as code 300. Later on, the sequence JOEYNJOEYNJOEY occurs in the table. The compression output will look like that shown in Figure 5.
| Input String: ...JOEYNJOEYNJOEY | ||
| Character Input | New Code/String | Code Output |
| JOEYN | 300 = JOEYN | 288 (JOEY) |
| A | 301 = NA | N |
| . | . | . |
| . | . | . |
| . | . | . |
| JOEYNJ | 400 = JOEYNJ | 300 (JOEYN) |
| JOEYNJO | 401 = JOEYNJO | 400 (???) |
A problem
Figure 5
When the decompression algorithm sees this input stream, it first decodes the code 300, and outputs the JOEYN string. After doing the output, it will add the definition for code 399 to the table, whatever that may be. It then reads the next input code, 400, and finds that it is not in the table. This is a problem, what do we do?
Fortunately, this is the only case where the decompression algorithm will encounter an undefined code. Since it is in fact the only case, we can add an exception handler to the algorithm. The modified algorithm just looks for the special case of an undefined code, and handles it. In the example in Figure 5, the decompression routine sees a code of 400, which is undefined. Since it is undefined, it translates the value of OLD_CODE, which is code 300. It then adds the CHARACTER value, which is 'J', to the string. This results in the correct translation of code 400 to string "JOEYNJ".
Routine LZW_DECOMPRESS
-
Read OLD_CODE
-
output OLD_CODE
-
CHARACTER = OLD_CODE
-
WHILE there are still input characters DO
-
Read NEW_CODE
-
IF NEW_CODE is not in the translation table THEN
-
STRING = get translation of OLD_CODE
-
STRING = STRING+CHARACTER
-
ELSE
-
STRING = get translation of NEW_CODE
-
END of IF
-
output STRING
-
CHARACTER = first character in STRING
-
add OLD_CODE + CHARACTER to the translation table
-
OLD_CODE = NEW_CODE
-
END of WHILE
The Modified Decompression Algorithm
Figure 6
The Implementation Blues
The concepts used in the compression algorithm are very simple -- so simple that the whole algorithm can be expressed in only a dozen lines. Implementation of this algorithm is somewhat more complicated, mainly due to management of the string table.
In the code accompanying this article, I have used code sizes of 12, 13, and 14 bits. In a 12 bit code program, there are potentially 4096 strings in the string table. Each and every time a new character is read in, the string table has to be searched for a match. If a match is not found, then a new string has to be added to the table. This causes two problems. First, the string table can get very large very fast. If string lengths average even as low as three or four characters each, the overhead of storing a variable length string and its code could easily reach seven or eight bytes per code.
In addition, the amount of storage needed is indeterminate, as it depends on the total length of all the strings.
The second problem involves searching for strings. Each time a new character is read in, the algorithm has to search for the new string formed by STRING+CHARACTER. This means keeping a sorted list of strings. Searching for each string will take on the order of log2 string comparisons. Using 12 bit words means potentially doing 12 string compares for each code. The computational overhead caused by this would be prohibitive.
The first problem can be solved by storing the strings as code/character combinations. Since every string is actually a combination of an existing code and an appended character, we can store each string as single code plus a character. For example, in the compression example shown, the string "/WEE" is actually stored as code 260 with appended character E. This takes only three bytes of storage instead of 5 (counting the string terminator). By backtracking, we find that code 260 is stored as code 256 plus an appended character E. Finally, code 256 is stored as a '/' character plus a 'W'.
Doing the string comparisons is a little more difficult. Our new method of storage reduces the amount of time for a string comparison, but it doesn't cut into the the number of comparisons needed to find a match. This problem is solved by storing strings using a hashing algorithm. What this means is that we don't store code 256 in location 256 of an array. We store it in a location in the array based on an address formed by the string itself. When we are trying to locate a given string, we can use the test string to generate a hashed address, and with luck can find the target string in one search.
Since the code for a given string is no longer known merely by its position in the array, we need to store the code for a given string along with the string data. In the attached program, there are three array elements for each string. They are code_value[i], prefix_code[i], and append_character[i].
When we want to add a new code to the table, we use the hashing function in routine find_match to generate the correct value of i. First find_match generates an address, then checks to see if the location is already in use by a different string. If it is, it performs a secondary probe until an open location is found.
The hashing function in use in this program is a straightforward XOR type hash function. The prefix code and appended character are combined to form an array address. If the contents of the prefix code and character in the array are a match, the correct address is returned. If that element in the array is in use, a fixed offset probe is used to search new locations. This continues until either an empty slot is found, or a match is found. By using a table about 25% larger than needed, the average number of searches in the table usually stays below 3. Performance can be improved by increasing the size of the table.
Note that in order for the secondary probe to always work, the size of the table needs to be a prime number. This is because the probe can be any integer between 0 and the table size. If the probe and the table size were not mutually prime, a search for an open slot could fail even if there were still open slots available.
Implementing the decompression algorithm has its own set of problems. One of the problems from the compression code goes away. When we are compressing, we need to search the table for a given string. During decompression, we are looking for a particular code. This means that we can store the prefix codes and appended characters in the table indexed by their string code. This eliminates the need for a hashing function, and frees up the array used to store code values.
Unfortunately, the method we are using of storing string values causes the strings to be decoded in reverse order. This means that all the characters for a given string have to be decoded into a stack buffer, then output in reverse order. In the program here this is done in the decode_string function. Once this code is written, the rest of the algorithm turns into code very easily.
One problem encountered when reading in data streams is determining when you have reached the end of the input data stream. In this particular implementation, I have reserved the last defined code, MAX_VALUE, as a special end of data indicator. While this may not be necessary when reading in data files, it is very helpful when reading compressed buffers out of memory. The expense of losing one defined code is minimal in comparison to the convenience.
Results
It is somewhat difficult to characterize the results of any data compression technique. The level of compression achieved varies quite a bit depending on several factors. LZW compression excels when confronted with data streams that have any type of repeated strings. Because of this, it does extremely well when compressing English text. Compression levels of 50% or better should be expected. Likewise, compressing saved screens and displays will generally show very good results.
Trying to compress binary data files is a little more risky. Depending on the data, compression may or may not yield good results. In some cases, data files will compress even more than text. A little bit of experimentation will usually give you a feel for whether your data will compress well or not.
Your Implementation
The code accompanying this article works. However, it was written with the goal of being illuminating, not efficient. Some parts of the code are relatively inefficient. For example, the variable length input and output routines are short and easy to understand, but require a lot of overhead. An LZW program using fixed length 12 bit codes could experience real improvements in speed just by recoding these two routines.
One problem with the code listed here is that it does not adapt well to compressing files of differing sizes. Using 14 or 15 bit codes gives better compression ratios on large files, (because they have a larger string table to work with), but actually has poorer performance on small files. Programs like ARC get around this problem by using variable length codes. For example, while the value of next_code is between 256 and 511, ARC inputs and outputs 9 bit codes. When the value of next_code increases to the point where 10 bit codes are needed, both the compression and decompression routines adjust the code size. This means that the 12 bit and 15 bit versions of the program will do equally well on small files.
Another problem on long files is that frequently the compression ratio begins to degrade as more of the file is read in. The reason for this is simple. Since the string table is of finite size, after a certain number of strings have been defined, no more can be added. But the string table is only good for the portion of the file that was read in while it was built. Later sections of the file may have different characteristics, and really need a different string table.
The conventional way to solve this problem is to monitor the compression ratio. After the string table is full, the compressor watches to see if the compression ratio degrades. After a certain amount of degradation, the entire table is flushed, and gets rebuilt from scratch. The expansion code is flagged when this happens by seeing a special code from the compression routine.
An alternative method would be to keep track of how frequently strings are used, and to periodically flush values that are rarely used. An adaptive technique like this may be too difficult to implement in a reasonably sized program.
One final technique for compressing the data is to take the LZW codes and run them through an adaptive Huffman coding filter. This will generally exploit a few more percentage points of compression, but at the cost of considerable more complexity in the code, as well as quite a bit more run time.
Portability
The code linked below was written and tested on MS-DOS machines, and has successfully compiled and executed with several C compilers. It should be portable to any machine that supports 16 integers and 32 bit longs in C. MS-DOS C compilers typically have trouble dealing with arrays larger than 64 Kbytes, preventing an easy implementation of 15 or 16 bit codes in this program. On machines using different processors, such as a VAX, these restrictions are lifted, and using larger code sizes becomes much simpler.
In addition, porting this code to assembly language should be fairly simple on any machine that supports 16 and 32 bit math. Significant performance improvements could be seen this way. Implementations in other high level languages should be straightforward.
Bibliography
-
Terry Welch, "A Technique for High-Performance Data Compression", Computer, June 1984
J. Ziv and A. Lempel, "A Universal Algorithm for Sequential Data Compression", IEEE Transactions on Information Theory, May 1977
>Rudy Rucker, "Mind Tools", Houghton Mifflin Company, 1987
Source Code
Update - September 9, 2007
Reader David McKellar offers up some source code with the following comments:
I modified your fine 1989 lzw.c code so it outputs the same as the GNU/Linux compress command. Of course, I also made the corresponding changes in the expand part of your code.
I made the smallest number of changes possible. You can diff to see what I did. I realize your program is an example program to show the concepts but I think it doesn't add to the complexity too much to be compress compatible.
I should mention that my changes don't totally support the GNU/Linux compress. The real compress flushes its hash table when it notices the ratio goes down. It changes the bits used for encoding when the table is full. And probably more tricky stuff. However that is only an issue when lzw.c is attempting to read a compressed file.
Update - November 8, 2007
Reader Bogdan Banica contributes source code which is an adapted version of the program that operates on memory buffers instead of files: stream_lzw.c
Recommended Reading
If you are interested in data compression, the texts below are all excellent reference material, and I've included a link to WinZip just in case you need a copy of that. If you make a purchase on Amazon.com after clicking through on the links below, you'll be expanding your knowledge and supporting this site at the same time. Thanks. (Note: all links will open in a new window.)
162 users commented in " LZW Data Compression "
Follow-up comment rss or Leave a TrackbackDear Mark,
I have one doubt on "If that element in the array is in use, a fixed offset probe is used to search new locations. This continues until either an empty slot is found, or a match is found, the average is usually below 3".
My question is:
Will it be possible that one search will never finish or take a huge number of search (infinite loop) to find a new location for some unpredictable strings? Any upper bounds on the number of search during find_match() given the size of the table? Thanks.
Steven.
It isn't possible for the search to never complete - the hash table has maybe 10% more slots than will be used - so for example if the code size is 14 bits, you only need 16K slots, and the table is sized at 18K.
Because the number of slots in the table is a prime number, you are guaranteed to eventually visit every slot, meaning sooner or later you will hit an empty slot.
As to whether this is an efficient method or nort, all I can say is that the original authors of the UNIX compress program who developed this strategy though that it was generally efficient, and their tests came up with the figure that said no more than three probes were usually needed.
I'm not an expert on hashing, so I just used the same algorithm that was in that code. I think in general to really know if your hash table is performing well you just have to run tests and gather statistics - it's difficult to predict in advance what the usage patterns will be.
Dear Mark,
what is the algorithm used by Winzip?
Someone said it is LZW, but someone said it is hybrid of LZ77 and Huffman coding known as 'Deflate'.
Thanks.
The second someone is right, the algorithm popularized by PKZip and still used in most "Zip-compatible" compressors is called deflate. It is an LZ77 compressor that uses a Huffman compressor on its token stream. It was very fast and efficient when it was first used 18 years ago 9http://www.c10n.info/archives/430) and is still the most popular compression algorithm for general purpose work today. LZ77 and LZ78, despite the similar names, are completely different from one another.
Mark,
How close is the output of this algo to the Unix compress (LZC?) algorithm? Will the output of your 12-bit version generate a file that can be restored by 'uncompress'? I'm looking for code that is compatible with the Unix version. Thanks!
My code is very close to UNIX compress, but not identical. I think the best approach to getting compatibility is, unfortunately, to hack the compress source code to get it to do your bidding.
If you search long enough on the net, you might be able to find the UNIX compress algorithm as implemented long ago by the ARC archiver - that source should still be compatible but might be a little easier to work with.
This might also be answered by one of the experts in comp.compression, you should consider posting the question there.
You know, I had been looking all over the net for it, but all I found were man pages and such. Thanks for pointing me in the right direction -- I think I found an original Unix distribution version!
Will this code work as is for 16 bit unsigned integers? If not, can it be easily modified to do so and how?
thanks
Martin, any time you change the integer type from signed to unsigned you run the risk of a few bugs. But I'm curious, why would you care? If you are on a compiler that uses unsigned 16 bit ints by default, you can just change the declaration of all ints in this this program to "signed int" and you are free of the problem.
I suspect the hashing probe code will break, but I'm not sure about any of the rest of it.
Give it a try!
Thanks a lot. I compiled your code on Microsoft Visual C and it looks like it's running just fine despite the file format I'm using. The expanded file and the original had no differences. What I'm wondering now is what is the integer size of the compressed file? If this sounds like a weird question here's why. I'm proficient in MatLab which as you may or may not know is a lot like C with the biggest single exception being the uncompiled execution of code. I know the basics of C but I'm still in the dark about a lot of things. In order to read the compressed file into MatLab, you have to specify the binary type in which the files were written (i.e. uint16 = unsigned 16 bit integer, or uint8 and so forth). I hope you know what I'm asking, but thanks for the help you've given me already. I wrote a LZW encoder in MatLab and the execution time is at least 100 times that of your code so you may have saved me months of execution time.
thanks again
Martin, the compressed file is a bit stream organized as bytes - so if you are trying to read it in Matlab you are not going to describe it as being organized as integers. Unsigned bytes is probably the way to go.
The program should run and write your output to a file called test.lzw. After it completes, check to see if test.lzw is present. It then decompresses the file to lzw.out. lzw.out should be identical to your original input file.
If you want to just compress or just decompress, and if you want the output file to have a different name, you will have to modify the program.
I have a couple of question ...
1) I am looking for LZW source code (C or C ) that operates totally in memory (no files) and on text strings. Do you know where I can find one? All the implementations I can find use files.
2) Can the LZW be modified to take advantage of a list of "words" that will most likely make up the text? For example, imagine compressing a C source code file, could you preload the keywords for better performance?
I don't have an LZW implementation handy that operates on memory, although you could certainly modify this code easily enough - no big deal. Since LZW kind of loses out to the much more efficient deflate algorithm, maybe you could try zlib, which will easily do what you want and do a lot better. If not, start with this source and hack it.
To test your question number 2, simply preload the compressor with a list of words, then chop those same words off when the output is read. If the list of words is chosen carefully, yes, you will see a big improvement.
You might also look at my article on star encoding. You could use that to preprocess the input, and you'd probably see a big improvement there as well.
For those of you that like VB, this article on The CodeProject web site has a VB.NET implementation of this code:
http://www.codeproject.com/useritems/VB_LZW.asp
Please note that I find that this site moves articles around from time to time. If you find that the URL has changed, please drop me a line so I can fix this comment.
Is there a way to use your code with 7 bits output? I mean, I need to be able to print out the output. So, transforming output bytes from 0-127 to 32-159 will solve my problem :-)
It's often useful to be able to see the numeric values of the output, and it's easy enough to do. Just rewrite output_code so that all it does is
fprintf( output, "%d\n", code )
Then do the same thing for the input routine.
Of course, you won't get actual compression this way, because you'll be using at least a 3:1 multiplier on your output stream, but you will be able to read it.
Mark,
Thanks for this quick answer.
But it doesn't solve my problem: when I said "print out", I meant "print out in a textarea" (to be copied). Of course, your solution works (I did it) but, as you said, in this case, I'd lose the compression ratio :-(
Well then, I don't understand exactly what you want. (For example, what is a textarea?) If what you are saying is that you want the output stream to only use characters 0x20 through 0x7f, I would suggest you just uuencode the stream. Still, you're going to lose most of your compression that way, because it's going to expand your data by at least 2:1.
I think you are stretching for something that is not quite possible.
Mark,
Thanks again.
A textarea is a zone on a web page where you can type text in (like the one I'm typing in).
I ported your code to javascript and wanted to store the output in a textarea as you can manipulate files with it.
Too bad, it's not possible... anyway, it was quite interesting!
Dear Mark,
Have you ever written Code in C to calculate the entropy and public available?
Thanks.
Aloha, remember that calculating the entropy always means "calculating the entropy with respect to a specific model." The idea that there is an absolute entropy that we can calculate is sort of a red herring.
So to answer your question, no. If I wanted to know the entropy with respect to an LZW compressor, I would just compress it with this code and see what the file size turned out to be!
Dear Mark,
Yes, the book written by you mentioned that the entropy depends on the model.
I would like to know how to calculate relative entropy by using LZW.
Thanks.
compress file.
entropy = -logbase2( ouput_file_size/input_file_size )
Hi Mark, your article say: "LZW is always trying to output codes for strings that are already known. And each time a new code is output, a new string is added to the string table."
My question is if It`s posible (or practical) create an external file, containing a large numbers of redundant strings in a arbitrary type of file (like JPG or other) and its code-equivalent (like a external dictionary), and make the code rebuild the original file looking at this "external dictionary"?
I´ve thinking in this concept for a while, and I cant figure out the code.
Thanks and greetings from Argentina!
What you're talking about is feasible, but what you end up with is not really LZW, it's something else.
To stick with the LZW paradigm and use an external dictionary is not too hard. When you are running the compressor, you simply load your dictionary file first, which adds the strings to the internal LZW dictionary. You then compress the file using the modified dictionary. On decompression, you will have to do something similar to load the dictionary before processing your compressed data.
Giving up on LZW and doing what you are suggesting is more like star encoding. See my article here for details on that:
http://marknelson.us/2002/08/01/star-encoding-in-cpp/
I want to know what the pros and cons are when lzw compressing tiff image files that will be e-mailed to a Lab. Do I risk loosing data? As I am not a programmer but a novice photographer please reply in simple english. Your help in understanding would very helpful.
you don't risk losing any data, lzw compression is lossless - it doesn't alter the data in any way. as long as your lab is able to accomodate the format you can save some upload time by using it.
Mark,
I'm using your compression algorithm in an other way:
for a particular reason I have to store binary data as a basic text-file without any bytes being interpreted as control-characters etc.
So I modified your algorithm (MS Access) in order to generate only byte values between 32 and 127 (basic ascii) and to store each LZW-code in 2 bytes (16 bits).
Although at first glance it looks like a poor compression, it's comparable with a 13 bits compression with the overhead of storing 13 bits in 2 bytes. The compression ratio degradation is less than 20%.
Maybe this idea is usefull to others.
Example of a text-based compression:
]~]E]R^w^i^n]4] ]F^i^l^e]V^e^r^s^i^o^n]=]"]4]2]...
Robert
Do anyone has an implementation with java?
Edge, I don't have any familiarity with any Java implementations, but a Google search on "lzw java" turns up 220,000 hits as of today so I don't think you'll have too much trouble.
this entry says:
I just compiled your test program and tried it out on an mpeg 2 formatted file and the compressed file was 30% larger than the source! It would appear that LZW is not optimum for the compression methods already inherent in the mpeg 2 file. The source file was 17M and a huffman compression reduced the size by about 500k. Win rar compresses the file by about 1M, clearly it is using some other compression algorithm.
Jabberwocky -
MPEG-2 is already pretty tightly compressed. In general, attempting to run a second pass on compressed data is not too productive, so your results aren't too suprising. WinRAR has a number of good algorithms, most of which are considerably more powerful than LZW, so squeezing an extra couple of percent out of the MPEG-2 file is possible.
If you really want to shring the MPEG-2 files, the best way to go about it is to transcode them to H.264, in which case you could easily shrink them by 50%.
Thanks very much. It'S a very good article.
i have a question , when i type "file.txt" ,i can get "file.lzw" and "file.out", "file.out" is the same than "file.txt". But i can't type "file.lzw" to get "file.out" .... why ? it's not a decompression program??
kiwi, this is just a demonstration program. It does both compression and decompression of a single file. If you enter file.txt as your input, it compresses it to file.lzw, then decompresses it to file.out. You can then compare to make sure the round trip was successful.
If you want a program that just does compression or decompression, it should be easy enough to take this program and give it some command line options to do just that. But for the purposes of making this program illustrate the topics in the article, that wasn't really necessary.
now i give some command line
char *a="test.lzw";
char *b=input_file_name;
if (strcmp(a,b)==0)
{
expand(input_file,output_file);
fclose(input_file);
fclose(output_file);
free(prefix_code);
free(append_character);
exit(-3);
};
but the result not correct , can you mark it where i got wrong?
No, sorry, I can't really tell from just that code fragment. I suggest you walk through the debugger - both in the original program and in your altered version and see if you can tell what you're doing wrong. It's pretty hard for me to debug your program in the comments section of the article. :-)
^^ that's ok , thank you again !!!!
Here is a VB implementation of LZW and Huffman compression:
http://www.danesfahani.com/Telecommunications%202/Tels2%20Simulink%20Simulations.htm
Dear Mark,
I would like to thank you for this article. I'm a 16 years old student from Holland and for school, we need to make a final work. I have chosen to do something with datacompression, I already wrote the Huffman compression program, know I'm almost finished with LZW. Your article is very helpful to me and I learnt much about LZW!
Peter.
Hello, Mark.
Is it possible to Gif encoding because using your lzw sorce?
It's possible that you could adapt this code to do GIF encoding and decoding. But it would be a bit of work.
I recently found a pretty easy to use C++ GIF codec in the VIGRA project:
http://kogs-www.informatik.uni-hamburg.de/~koethe/vigra/
The VIGRA license is pretty reasonable for non-commercial use.
Download vigra1.5.0.tar.gz and look in the source folder for gif.hxx and gif.cxx. They've added a framework around it for portability with other image codecs, you can strip it out without to much effort.
Hi, Mark,
I've already written an lzw implementation with Python for my university assignment. The biggest trouble was to make it read different length symbols, from 2 to 13 bits, as a symbol. Your article is great, and as you mentioned, my encoding program is very slow because of searching symbols in a 2k-8k dictionary. I dont understand why decoder doesnt need hash function.. as if the encoder stores strings in a dictionary at hash function indexes, the decoder should make the same dictionary, so the indexes are the same for same strings, to decode correctly (because compressed file is made from those indexes). Dictionaries must be identical, as i understand. then the question is, if the location in the dictionary at hash index is used, it jumps some other place for a fixed number. but how the decoder should know, if to take the string that is in that hash index place, or to jump further and take next string.. your program with xor hash and bit operations is to difficult for me to understand, i want to make something more simple, but cant understand the basics of hashing in this case. maybe you have wrote this in your article, but i read several times, and still cant understand why decoder doesnt need hashing (im not english speaker), how then should it store strings in the dictionary, that indexes match.. sorry for the big post, i just wanted to explain everything clearly.
greeting from Lithuania, Vilnius University
thanks, Aleksandras
You don't need to use a hash when decoding. This is because the strings are added to a table as sequential indices, and because of this are easy to look up directly. For example, if I see a code of 258 in the input stream of an encoded LZW file, I just look at location 258 in the table to get its definition.
The dictionary is structured different when encoding, which means you can't just look up an index. For example, if I want to see if 'A' followed by 'B' is in the encoder table, I can't just create an index by concatenating'A' and 'B' - this obviously won't work. unless I have a 16 bit table, which is usually not the case. And that won't work when I try to create an entry for "AB" 'C'. I have to combine the string and the following in character in some way to create an entry into a smaller table, and that's where the hash function comes in.
thanks a lot for the trouble Mark ! i laid in the bed, and just understood everything :) i just missed one tiny thing, maybe thats because im too exhausted doing those assignments.. thank you for your comment, it helped to put dots on "i" :) good luck to you !
Hi Mark,
Would it be possible to compress a text file with LZW, then search the compressed text without decompressing it? My idea was to compress the keywords with the same dictionary, then do something like a Boyer-Moore search, but I'm not sure if (A) it would even work, or (B) it would be accurate. I'm pretty green when it comes to compression, so any input you can give would be very helpful.
Thanks!
There are some (but not many) specialized algorithms used for searching through compressed text. I don't know of any that work with LZW. This is an area that people ask about fairly frequently, but I never really have a good answer or a good direction to point them.
As an example, here is somebody that has a commercial product that is designed to search through compressed text:
http://www.techworld.com/features/index.cfm?RSS&FeatureID=3137
The only person I know of that has any substantial work published on this is Paolo Ferragina, who has a few good papers and some software you might be interested in.
http://roquefort.di.unipi.it/~ferrax/index.html
hello, I have a question, how can I modify your LZW implementation for read different length symbols? I need to process symbols with 12 bits length.
DJ. the code posted with article reads 12 bit symbols by default, and you can change it by simply altering a #define value. Since it already does what you are asking, I would suggest that you don't need to do any modification.
I'm a little confused because in your implementation the dictionary length is 12,13,14 bits, but, i think you're processing 8 bit word. It's really right or am i mistake?
I have a file with 12-bit words and I need to process 12 bits really. Do I have to make some modification?
...thanks for your time.
Ah, you confused me by saying "symbols with 12 bit length."
The LZW.c encodes source text composed of an eight bit alphabet into (by default) 12 bit symbols.
So you are saying that your source text is composed of a 12-bit alphabet.
Yes, you can adapt this program to deal with those words without too much trouble, but it will require modifications of the key data structures and the hashing algorithm to fit the new size. In addition you will obviously need to adjust the input and output routines.
Hi Mark!
I m a student from India. At first thank u very much for giving a clear understanding of the LZW algorithm. I have also seen the source code. Though tough, but is understandable. But I m not getting anything about the function
"void output_code(FILE *output,unsigned int code)". How u have entered the characters into the file using a code buffer? Plz make it clear 4 me.
The output_code function is a pretty simple piece of code that supports the output of tokens in sizes different than the standard eight bits.
Most compression algorithms either use a bit stream of some kind or write words of varying size, so they are all going to have an output_code() routine of some kind. If it doesn't make since to you, I suggest you brush up on your C shift and mask operators.
Sir,
we are working on a algo of partial string matching.....but only for text files...can you please help us with any study material that would help us for the same.......
A good topic for a final year project might be to come up with a good implementation of the LZSS algorith, or some other variant of LZ77. The classic paper on compression by Ziv and Lempel from 1977 is available in illegal copies all over the internet, like here:
http://www.stanford.edu/class/ee398a/resources/ziv:77-SDC.pdf
Ziv and Lempel didn't concern themselves too much with implementation, so it was many years before programs like PKZip actually started using this algorithm.
The most interesting part of an LZ77 ipmlementation is the requirement that you look back through previously seen string matches. Replacing those matches with pointers is what allows you to achieve compression. You can implement a program that does a brute force search, but it will be hopelessly slow.
Your assignment, then, would be to create an algorithm to look through a bounded window of previously seen text for matches, and discuss the various parameters and compromises one must make to get reasonable performance.
Quick note, think there is a small bug in the second pseudocode for decompressing (modified decompression algorithm):
The line:
add OLD_CODE CHARACTER to the translation table
should be (I believe)
add (translation of OLD_CODE) character to the translation table
the reason is that OLD_CODE will always be a single character, and if there is enough repetition then the string to be added will be more than two characters.
Cheers, and thanks for the great description!!
Matt,
No, that's not the the way it works. Let's say old_code has a value of 555, and translates to "ABCD".
If you look at the implementation, we definitely store the value 555 in the translation table, not "ABCD". In LZW.c, old_code is stored in an integer array called prefix_table, and the character being appended to it is stored in append_character, a character array.
If you look at the decode_string() routine, you'll see how that provides enough information to decode any known code. Given a code n, I decode it in a loop like this:
string s = "";
while ( n > 255 )
{
s = append_character[ n ];
n = prefix_character[ n ];
}
s = n;
I work my backwards through older and older codes until I finally find the original character that started the sequence. String s then contains the decoded string, backwards.
Hi Mark,
Could this algorithm could be implemented for use in mobile phones using J2ME?
You could implement this in mobile phones, you'll find plenty of examples of LZW written in Java. But J2ME includes the deflate/zlib library, which performs quite a bit better and is very well tested. You'd be much better off using that.
Hi...... do you know about other implementations with best performance: compression ratio and time? What can i do for i to get better compression ratio and less time with this LZW implementation?
hiiii, are you there?
What can i do for i to get better compression ratio and less time with this LZW implementation? or
do you know about other implementations with best performance: compression ratio and time?
If you want a better algorithm, I suggest using zlib with the deflate algorithm. It's faster and performs much better.
Hi Mark,
After few days of sitting above the code, I decided to write here.
My english isn't even good, but I hope that you'll understand me (I'm 16, from Poland).
I'm working on a database project using FLTK and C .
After closing the main window, app is writing all the data into the file base.txt.
After it, it calls com() function: http://rafb.net/p/bg7wFK96.html (It compress the file and delete old base.txt file, leaving just base.lzw).
But my problem is, that not everytime compression is ok, sometimes, last few words (or lines), is crashed up. I'm wondering if it's your algorithm problem or mine, cause without compression, everything's ok. Maybe I missed something ? could you someday remake your LZW source code, to use new c technicues?
Sorry for my post length and a problem. Regards and have a nice day.
Hi marcin,
This code has been posted for a long time and I haven't had any reports of decompression errors. Do you have a specific file that demonstrates a failed compression/decompress cycle?
Theoretically everything in your example is ok... hmm, look at my 'algo':
- decompress file.lzw to file.txt
- open file.txt
- make changes etc in file.txt
- close file.txt
- compress file.txt to file.lzw
And sometimes compression is ok, sometimes last lines are messed up. But if there were no errors commited, I think that the problem is by my site. So sorry for problem.
hi there, it's me again.
I found out what's wrong. It's not cause of your algorithm, but error with my implementation of ARC4 coding.
Sorry for problem, and congratulate of best article refering to the LZW compression! Thank you!
Hi Mark,
Is this LZW also same as GIF89a?
Gif89a uses LZW as its compression method. You won't be able to just plug in the code from this example, there are minor variations in the implementation, but yes, same algorithm.
An interesting article that people might be interested in is located here, titled:
Multiple Pattern Matching in LZW Compressed Text
Abstract:
In this paper we address the problem of searching in LZW compressed text directly, and present a new algorithm for finding multiple patterns by simulating the move of the Aho-Corasick pattern matching machine. The new algorithm finds all occurrences of multiple patterns whereas the algorithm proposed by Amir, Benson, and Farach finds only the first occurrence of a single pattern.
Hello, very clear article. another question about entropy. Lempel-ziv entropy is been used in analysis of EEG signals, it is a relation between length of dictionary and length of signal, I think. I am doing my ph, and I need calculate that in matlab, so I will port your code and... how do I calculate the entropy? could you please give any advice?
thanks in advance
Hi Mark,
Can you please explain how you are printing the output into the file?
When we open the file test.lzw in a notepad (or changing it to test.txt and opening it in notepad),
we cannot see the symbols clearly.
Also i could not understand the shifting operations that are performed in both the find_match routine and output_code routine. Can you please explain these operations?
Extension of the above query.....
I actually want to incorporate a Hamming(7,4) code on the compressed file, introduce 1-bit error (randomly, but for every 7 bits), correct it , and then decompress the file to get the original file.
I have separated the encoder and decoder as two separate files and now I need to read the compressed file as binary data(blocks of 4 bits).
What modification in the output_code routine will make this possible?
I have tried reading the compressed file and creating a new file with 8bit ASCII values of the symbols, but the file terminated prematurely.
Please help me, I can compromise a bit on the compression ratio but it should not be bad.
Juan, LZW seems completely inappropriate for processing of EEG signals, unless they have been heavily preprocessed, or perhaps transfered to the frequency domain. Still, it doesn't really make sense. Analog signals such as the EEG are not particulary well served by processing with dictionary-based algorithms.
You don't really need to port my code to Matlab, you can get a couple of implementations from Mathworks.
My advice would be to pick another topic, this doesn't make sense.
Sainith, test.lzw contains compressed binary data, you are definitely not going to bea ble to view it in notepad. If you want to see nicely composed symbols you are going to have to modify the source code.
The shift operatins are all standard C code. What part of it isn't clear, maybe I could explain if I had a more specific question.
I want to know why you used variable shifting in " output_bit_buffer |= (unsigned long) code > 24,output); " both these lines are from the output_code routine.
I have tried to understand the code by giving a small input of 5 characters and writing down the steps and analysing them, but unable to understand it.
What is the modification required in the output_code routine to print the output in binary form so that any error correcting code(like Hamming(7,4) say) can take the input from this compressed file.When I tried to read directly it gave a lot of errors. Please suggest the modification to work for my program.
Variable shifting in " output_bit_buffer |= (unsigned long) code > 24,output); ".
These are the correct lines. Sorry for the mistake.
Anyway my main concern is to read the compressed file to encode error correction using hamming(7,4).
And to use the same for the decompression routine. Hence suggest some modifications for this.
Hi...What's the algorithm complexity? computational time?
Hi MG,
As to to algorithm's time and space complexity, I'll leave that as an exercise for the reader - sounds like you've been given that exercise ;-)
yes, my question is...for example the string: ABBBCCCC
so, AB is the first code, ABB is the second, ABBB is the third, ABBBC exist?, ABBBCC exist?, if ABBBC not exist then the maximum string length is 4 (ABBB), else, the maximum is 5 (ABBBC).
in other words, What's the maximum number of symbols in one comparison(to dictionary)
Hi great article by the way has been invaluable for a college project ive undertaken i have one small question, which is why the TABLE_SIZE constant is et to 5021 i understand the MAX_VALUE and MAX_CODE settings but am at a loss as to why 5021. I am probably being dense here but cannot figure it out. Thanks in advance
Gordon, the TABLE_SIZE is the size of the table that holds all the tokenized strings. When MAX_CODE is 12 bits, the maximum number of entries we will use in the table is 4096.
Because collisions in the hash table reduce its efficiency, we jack up the size to be a bit bigger than 4096, which reduces the likelihood of a collision. And to top it off, we need the size to be a prime number. The probing mechanism needed used in this algorithm is only guaranteed to find an empty slot if the table size is prime.
Hashing isn't too complicated, and you can find good explanations of it on the net without too much trouble.
Hi Mark Thanks for the reply, everything is clear now. Guess i was being dense and should've realised the MAX_CODE was a prime to ensure probing of the hashcodes was successful, and larger than table size to ensure a load factor, covered hashtables this year too so am feeling very sheepish, thanks again
please said to me if the lzw.c witch is in this web site can be compiled by unix or said to me witch langage i can compiled him whith it( visual c or.....)
Juliette, the code is written in very generic ANSI C, and should compile with any C compiler you are using. Visual C and gcc should both work.
Hi, just wanna say I'm performing a research on LZW vs. Huffman. Well, i know it's a long issue but anyway I'm still doing it.
You know LZW won't work well with random data. I actually mean data with little repetition. But if that data has a more frequent characters than others it will be well compressed by Huffman. Just try it with pi. you can get a 1 million number of pi from the calgary canterbury corpus.
The actual research i was actually doing if i can switch to huffman if the data is random and to lzw otherwise. But it seems i cannot do that without reading the file first. So now my problem is how to detect the randomness of a file while i am encoding.
If you want to comment please do so. Thanks.
You're tackling a difficult problem :-)
Characterizing "randomness" is not easy (not that I'm an expert in the area.) My personal bias is to define randomness using a pseudo-Kolmogorov Complexity definition, which basically says that you define how random a stream is by how well it compresses wrt. a specific model. I don't accept the notion that there is an absolute measure of randomness.
So, for example, a binary representation of Pi will not compress at all with an LZW compressor, and it will not compress at all with an order-0 Huffman coder. But does this make it random? Not at all - you can write a very simple program with perhaps a couple K characters that generates as many digits of PI as you please - resulting in a compression ratio of thousands to 1.
In the case you are describing, data with little repetition but a strong bias towards certain characters. I think you might be surprised if you try to come up with data sets and compare order-0 Huffman and LZW. I suspect that their performance tracks one another fairly closely. Let's say that a binary file is biased so that it has 50% character 'A' and the remaining characters are all randomly distributed. LZW is still going to pick up on that by generating codes for all the strings that start with 'A' more frequently than other codes.
I think it's a good topic to experiment with - but the first thing you have to do is come up with an adequate characterization of "randomness", and I think that's the hard part.
Keep us posted!
Well thanks. and you're right the randomness part is the hardest in the research. Even if we can use certain tests...it would not provide a significant drag in the performance of the compression...because I was actually considering an online approach.
Actually I'm now in the process of testing their performance according to generated random sources.
Don't worry I'll go back here.
Anyway I like your site because you don't have to do tiring registration and you're editor also has a built in spell checker. :)
hello, i am new at this algorithm and all. i need help...can i do an image compression using LZW algorithm in MatLab? Thanks..
yes, certainly, with something like this:
http://www.mathworks.co.uk/matlabcentral/fileexchange/loadFile.do?objectId=14741&objectType=FILE
hello mr mark.. maybe my english is very bad, i'm sorry. you just teach us how to compress using LZW algorithm but you don't teach us how much the rasio ( percentage this algorithm can save ). i hope that you want to teach us. I'm making a thesis with this algorithm, it's very difficult to find the literature in indonesia.. i don't understand how to calculate the rasio. If you have some literature, i hope you can send to jack_padang@yahoo.com , it's my email.. i will very gratefull to you. thank's before
You generally calculate the compression ratio as simply output-size/input-size. Thus, perfect compression would give a size of 0, no compression would give a size of 1. You can also multiply by 100 to make it a percentage.
Frequently when compressing images, people multiply the compression number by the size of a pixel to give a bpP number - bits per Pixel.
Hi, Mark,
THX a lot for this article and source codes. It is really helpful to my understanding of this algorithm. I am a software engineer working for a company, and a project requires compression. If I use your source codes in LZW.C in my project, does my company need to pay any fee? The company lawyer told us that since your article and source codes are published in the journal "Dr. Dobb's", it meant it was free for us to use for a commercial product. Is this true?
In general, the questions are: 1) Is it free for us to use for a commercial product? 2) What kind of license for your source codes?
Thanks a lot.
Scott Pan
Hi Scott,
My code use policy is here. Basically, the answer is yes, you are free to use this source.
However, I highly recommend you think about using zlib - it is fast, efficient, well-supported, and free. Plus, the deflate algorithm will outperform LZW in almost all cases.
Mark
There's a neat alternative implementation of the compression (data->code) and decompression (code->data) tables for in-memory compression (where all of the source data and compressed data is in-memory simultaneously). Rather than storing your code-mapped-data (your "strings") as either raw data, or a code character combination, you reference an offset/length in your uncompressed data (for the compression side) and similarly an offset/length in your decompressed data (for the decompression side). This works because all codes in the table map to contiguous data in the uncompressed data, and that when decompressing you always output data before adding new codes.
The advantage of doing this is that it makes transcription to/from the tables trivial, as it does the equality test for the hash-table lookup.
J
Yeah, I think I can see how this would be good. Of course, it's an unusual situation, and probably is only useful in a few situations... generally you won't have both tables in memory simultaneously.
Ah - I didn't mean that you have both tables in-memory simultaneously - the compressor would have the uncompressed text and the compression table in-memory simultaneously, the decompressor would have the decompressed data and the decompression table in memory simultaneously. For messaging-based applications this can be a pretty useful approach.
[...] My 1989 DDJ article on LZW data compression, including C source [...]
This might be a little off-topic, but I'm curious to know...
Do you know if RAR implements any part of the basic LZW code you have shown here, or is it more along the lines of variable deflate/zlib/arithmetic coding/range coding/bwt and other combinations as a hybrid algorithm?
@james:
RAR doesn't use LZW for compression. It chooses from several different algorithms depending on what it decides about the characteristics of the data.
I believe that the general purpose RAR algorithm is undisclosed.
This might sound a stupid question but i am unable to unsderstand and looking forward for your reply.
Question is "in your code while outputing any data you have
static unsigned long output_bit_buffer=0L;
and then
putc(output_bit_buffer >> 24,outfile);
does this mean that first you right shift output_bit_buffer 24 times and then write 32 bits in the output file? "
Because when I am trying to compress a file i get the output file size larger than the original file.
@jaidee:
putc outputs a single byte, so this line of code is used to first shift the bit buffer right by 24 positions, then output the result.
In effect, it outputs the top byte of that long value.
The C I/O library is oriented around bytes - that is all it knows how to read and write.
The LZW algorithm wants to read and write codes of variable length, generally greater than eight bits.
Because of this, we have to shift the bits into a temporary location (output_bit_buffer), then write them out eight bits at a time as the buffer fills.
I realize this code might be somewhat confusing, but most libraries and languages have no support for bit-oriented I/O, so this is what we are stuck with.
But why am I getting the size of compressed file (i.e. test.lzw) greater than the original file size?
I can't tell you that, I don't know what you are doing.
LZW won't compress every file - some files will actually get larger. But a standard test file made up of just readable text will definitely be compressed.
Mark - thanks for your demo and for the support you kindly do!
As others have already mentioned I was after a version of LZW that would work on memory streams of bytes, as opposed to files, so I've taken your algo as a basis and changed it here and there so it works as such.
How should I send you the modified file so you can include here for others to use?
@Bog:
If you mail me a copy of the program I'll be glad to add it to this article. Contact info on my About page.
thanks for this. i'm understanding the algorithm better.
but i'm a little confused about how the modified decompression algorithm would work. on the first try, isn't CHARACTER just a copy of OLD_CODE and essentially the same character? and because NEW_CODE wouldn't be in the dictionary at the point, would ENTRY just be (in this case) //?
@lehdi:
When you enter the loop OLD_CODE and CHARACTER have the same value, and OLD_CODE has already been sent to the output.
In the first pass through the loop, we then read in a second compressed character, which is stored in NEW_CODE. When we try to translate it, we will normally succeed, and get a single character translation. In the example in the article, we first read '/', and output it immediately before entering the loop.
We'll then read 'W', translate that to a string, which will also be "W", and output that. We then add '/' and 'W' to the table.
I recommend walking through the code for a simple example to see how it works.
Also, I am working on an updated version of the article that will include much simpler source code. By using modern C++ containers, I can implement this so that it's a lot easier to read.
How much efficient this compression algorithm are?. can it reduce 1GB "any" file to 1MB?. Mainly movie and data files.
So far i see all this compressions are useless. They do only 20%.
unless it is filled with same characters.
@Alex:
LZW compression is not really suitable for movie files - movies need a lossy compression algorith, such as H.264, in order to see substantial improvement.
It will work fine on text, but if you want higher performance compression, I suggest you look into the deflate algorithm, as implemented with zlib.
Can you teach me how to organize data structure to program lZW? I was read LZW.c, but i cant understand it. I want to know by step how to compress a string or row of integer. How to print result of compressed string or row of integer.(By pascal or nature language). Thanks a lots!!!
@TuanSon:
What do you need to know that isn't in this article? I don't know what else I can add to it.
Thanks your reply!
I have 2 questions to you:
At first, i want to process on a string. Do you have a simple source code for me?
Can you show me how to use 12->14 bits for string in table? Because i dont know how to use it. 1byte=8bits, 2bytes=16bits, 12 or 14 bits=?
Thanks
@TuanSon:
LZW.C is the only source code i have.
Reading and writing data of non-integral byte size is done with input_code() and output_code(). The source is pretty simple, if you take a look you should have an easy time seeing how it works.
Program lzw.c does not work with gcc. Segmentation fault at 289 line in function decode_string. I think function input_code(FILE *input) works not right.
Error occurs because of on my machine
sizeof(long)=8 (not 4). So, input_bit_buffer
@Neo: Okay, that makes sense, sorry for the problem. When the program was published in 1989 it was pretty unusual for a machine to have a 64 bit long. Nonetheless, the fact that it is hard-coded for 32 bits is a bit of a mistake.
You can make a quick fix by changing input_code and output_code to use an unsigned int instead of an unsigned long for the input bit buffer and output bit buffer.
Dear Mark,
Thanks for your code. I have a question about the LZW decoding. Am I right that any bit stream can be decoded by the LZW decoder, even though, the decoded symbol string is incorrect? For example, if I input 0000..0, namely, all zero sequences into the decoder. It seems that the decoder can still decode it into aa..a, where 'a' is the symbol corresponding to index 0 in the table. However, if we encode 'aaa..a', the obtained bit stream is not '000..0'. My question is: Is this statement right?
Many thanks
JT
@JT:
No, this definitely not the case. It would be very easy to give the LZW decoder an bit stream that it would recognize as being in error. For example, if the first symbol decoded by my program had a value of 260, it would generate an error.
If you want to find out more about this search the comp.compression archives for the phrase "bijective." David Scott has invested quite a bit of time making various algorithms obey the property you describe above. However, I don't know whether he has a modified version of LZW.
@Mark:
Thanks for your reply. I am thinking how we can find all the bit streams that can be detected as en error by the decoder. These bit streams can be treated as 'forbidden' bit stream that can be used for error detection. I am also thinking whether the initialization or the insertion algorithm will affect these 'forbidden' bit streams. If yes, we can design a LZW algorithm that has such kind of error detection capability. Any comments on that?
Thanks
JT
@JT: Two comments on that.
First, check into David Scott's work on bijective coding, as he has been working on similar problems.
Second, I am actually in the middle of an article on forbidden symbol usage for error detection in network streams. My particular implementation uses an arithmetic coder engine that simply reserves a fixed percentage of the symbol space for a special "error" symbol. In the case of a transmission error, you will sooner or later bumble into the forbidden symbol, with the amount of time varying depending on how much symbol space you have carved out for it.
With LZW, the easiest forbidden symbols are those for which we have not yet defined a code.
First off, thanks for the fantastic discussion and examples! Way back in the early '90s, I worked for a game company and your code from DDJ was used in our routines to compress graphic images. We used 2 or 3 different compression schemes. Recently, I came across some old files that I have not been able to uncompress. They don't appear to be RLE - looks more like LZW to me. I tried every piece of C code I could find - but alas, I guess that particular code is just gone. (it's been 16 years!) However I DO have 2 sample files that are both compressed and uncompressed. So because I have a 'before and after' state for those, I was hoping to be able to figure out how to uncompress the other dozen files. Could you help?
DL
@David:
The best way to get to the bottom of a 'what format is this' question is usually with a post to comp.compression. I've seen a few pretty obscure formats decoded there. If you have a nice hex dump of the first part of the message there's a good chance somebody will have seen it before.
I'm no format expert, so the chances of me recognizing it are pretty slim, sorry. :-(
Thanks! I'll put up a post and see what happens!
ei mark.
hi..i am currently doing a thesis on text compression algorithms and i ran to your code and convert it to java. I find it very useful but i was a bit confused with the bitwise part.
anyways, thanks man for the code..it was a great help.
while (1)
{
if (code_value[index] == -1)
return(index);
if (prefix_code[index] == hash_prefix &&
append_character[index] == hash_character)
return(index);
index -= offset;
if (index
[...] was designed to improve upon the GIF format. But if you want to read further about GIF’s LZW data compression, the GIF file format, the JPG file format, the PNG file format, or a comparison between image file [...]
Hello Mark!
Im trying t