Talk:Lempel-Ziv-Welch

If there is going to be code included in the page, it should at the very least be commented to show what is happening in relation to the description.


The introduction to the article incorrectly states that LZW is derived from LZ77, when it's really derived from LZ78. I'll change this when I come up with a decent wording. --Kgaughan 22:12, 27 May 2005 (UTC)

  • There, I finally carried out my threat to fix that introductory paragraph. --Kgaughan 29 June 2005 21:11 (UTC)

The IBM patent expires on March 21, 2006. For patents filed before June 1995, the expiration is the later of 20 years from the filing date or 17 years from the issue date.

PS: The relevant consideration is that the 20 years would be added to the filing date of the parent application, June 1, 1983, rather than to the filing date of the '746 patent, which is August 11, 1986. I believe that is where people are going awry; they are forgetting that this is a continuation.
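
For what it is worth, that arithmetic works out as follows (the filing date is the one given above; the issue date of the '746 patent is the one implied by the stated expiry):

from datetime import date

parent_filed = date(1983, 6, 1)   # parent application filing date, as above
issued = date(1989, 3, 21)        # issue date implied by "17 years -> March 21, 2006"

expiry = max(parent_filed.replace(year=parent_filed.year + 20),   # 2003-06-01
             issued.replace(year=issued.year + 17))               # 2006-03-21
print(expiry)                     # 2006-03-21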

redirects

I've just added redirects to here from both Ziv-Lempel-Welch and ZLW. --Heavyphotons 07:55, 27 July 2005 (UTC)

Lack of detail

It seems like this page is very, very lacking in detail about the workings of the LZW algorithm (Description of the algorithm). Any other thoughts? Stack 16:17, 28 July 2005 (UTC)

The algorithm is not described very clearly. I started to get confused when the author gave an example of 25 characters and said it had 27. I would appreciate it if the author explained it a bit more clearly.

Added info on the fact LZW is used in *.zip

I added a bit of info on the fact that LZW is used in the *.zip format, but it's not a great addition as-is; could someone help me clean it up? It also needs sourcing. I know that the OpenOffice.org help file has info on the fact that it uses .zip in its save formats (I don't know whether the new 2.0 format uses LZW or .zip), and the .zip information can be taken from the Wikipedia page for .zip, but I have no idea where to get a source for .pdf. I know it contains LZW because Acrobat Reader 6 contained a note in its startup window that said "This product contains an implementation of the LZW algorithm", or something along those lines. Can someone help me sort this out?

ZIP and PDF both use Deflate, which is based on LZ77, while LZW is based on LZ78. --Matthäus Wander 00:26, 19 November 2005 (UTC)


Did some research: PDF used LZW in version 2.1: [1]

IBM Patent

The article claimed that the IBM US patent expired on 11 Aug 2006, which is seven months in the future. I'm not sure when it will expire or if it has already (based on the patent I'd say it expired in 2005), so I removed the claim. If someone can verify that, I'd appreciate it. --DCrazy talk/contrib 18:39, 17 January 2006 (UTC)

Just caught the comment above. Seems there is a bit of dispute on this point, unless I'm misunderstanding it. --DCrazy talk/contrib 18:41, 17 January 2006 (UTC)

Encoding Example

The encoding example uses codes of 5 and 6 bits, but the very beginning of this article states that the algorithm uses fixed-length codes. It might be that I'm just not grasping the concept, as I'm not familiar with the algorithm, but is it really possible to change the length of the code words on the fly? If so, how can one deduce in the decoding phase whether the following code word is 5 or 6 bits long? 09:41, 7 November 2006 (UTC)

I think the example is flawed, as stated above. I've struggled to understand the concept from this page, and it seems to me that the only practical implementation would indeed use fixed-length codes. There's no mention in the example of how the decoder would know that some codes are 5-bit and some are 6-bit. This example should use all 6-bit codes and mention that the encoder specifies at the beginning of the data that they are 6 bits long. Therefore, "Total Length = 5*5 + 12*6 = 97 bits." just doesn't make sense; it should be 5*6 + 12*6, plus probably an extra byte (8 bits) at the beginning to specify the code length. I haven't done any research, but I'm almost positive this has to be right. —The preceding unsigned comment was added by 4.152.213.104 (talk) 03:08, 6 April 2007 (UTC).
Re: the above comment, as far as I'm aware, the number of bits used is the number of bits needed to encode the highest entry in the dictionary so far. Because the decoder builds the same dictionary the encoder used, the decoder always knows how many bits are needed to store an entry from the dictionary.
In GIF, for instance, codes start at 9 bits: the lower 256 codes are normal (byte) content, and the upper half of the code space is the dictionary. The code size expands up to something like 12 or 16 bits and then the table resets. Carewolf 07:36, 24 April 2007 (UTC)
Notice in the first encoding example that when the bit values in the extended dictionary reach 31 (highest value representable by 5 binary digits), the algorithm immediately switches to 6-bit mode and outputs ALL bit codes as 6-bit codes, so the R character, which previously had a 5-bit value of 10010, from now on has the 6-bit value of 010010. The description accompanying the first encoding example in this article should explain that better. In decoding, the algorithm starts building the extended dictionary right after reading the first character, and the bit values in the extended dictionary will reach 31 at the same time that the code it is decoding switches to 6-bit mode. The algorithm then uses the size of its extended dictionary as a cue to switch to the next-higher bit mode. —The preceding unsigned comment was added by 65.26.235.222 (talk) 14:39, 9 May 2007 (UTC).
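
A minimal sketch of that bookkeeping (illustrative only, not code from the article): both sides can compute the current code width from the largest code in the table, so the width never has to be transmitted.

from math import ceil, log2

def code_width(largest_code, minimum=5):
    # Bits needed to write any code up to largest_code, but never fewer
    # than the starting width (5 bits in the article's example, 9 in GIF).
    return max(minimum, ceil(log2(largest_code + 1)))

# Codes up to 31 fit in 5 bits; as soon as entry 32 exists, every further
# code is written with 6 bits.  Exactly when the width bumps (before or
# after the entry that overflows is added) is a convention the encoder
# and decoder have to share.
assert code_width(31) == 5
assert code_width(32) == 6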

Decoding Example

Shouldn't the "TO" entry in the decoding example have a bit value of 27, as stated in the encoding example? I don't know why the author of this article made such a big deal out of zero indexing; he/she seems to have confused himself/herself.

algorithm details

I have added code describing the details of the compressor and decompressor algorithms.

Python -> pseudocode

I think the Python example should be translated to pseudocode by someone who knows how. Not all readers understand Python.

Also, the given Python code is not entirely clear.

Additionally, I believe the given pseudocode is overly simplistic; I did a direct translation (and I mean "my code looks exactly like the pseudocode" direct), and it failed on "xxx". This is a known issue in the simplistic representation of LZW which is indicated at http://marknelson.us/1989/10/01/lzw-data-compression/.

If I get some time, I'll rework my code or clean up the given python to look more pseudocodey, then post that.
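
To make the "xxx" failure concrete, here is a worked trace (illustrative, using a byte-valued starting dictionary with codes 0-255 rather than the article's alphabet):

# Encoding "xxx":
#   read 'x'      -> current string "x" is in the dictionary, keep reading
#   read 'x'      -> "xx" unknown: emit 120 (the code for 'x'), define 256 = "xx"
#   read 'x'      -> "xx" is now known, keep reading
#   end of input  -> emit 256
#
# Output: [120, 256].  The decoder receives 256 one step before it would
# have defined entry 256 itself, so a decoder written straight from the
# simplistic pseudocode has nothing to look up.  The standard fix: when a
# code is not yet in the table, decode it as w + w[0].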


I have cleaned up the Python code a little bit:

  • removed unnecessary class
  • put docstrings in the correct place (after the def line)
  • removed unnecessary and "unpythonic" type checks
  • moved utf-8 encoding to the end of compress()
  • removed that ugly __join() by using the unicode strings directly as keys
  • some minor clean ups

The result can be seen here: http://paste.pocoo.org/show/4497/

Any objections against replacing the existing code in the article? Marrin 10:24, 25 September 2007 (UTC)

The new code is much, much better than the old one, and I'm all for changing it. 72.67.158.141 (talk) 01:26, 14 January 2008 (UTC)

Even the above Python code misses the point somewhat. The idea is that the algorithm knows the word size as the dictionary grows, and the output is a bitstring. Using UTF-8 encoding of integer keys pads 8- to 14-bit keys to 2 bytes. The use of unicode strings is incredibly ugly anyway - unicode is synonymous with text, and shouldn't feature in an algorithm that compresses a sequence of bytes. DanPope (talk) 00:26, 15 February 2008 (UTC)
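
As an illustration of the point above, variable-width codes can be packed into plain bytes along these lines (a sketch only; the helper and its arguments are made up for this example, and a real packer would derive each width from the dictionary size instead of taking it as a parameter):

def pack(codes, widths):
    # Concatenate each code using its own bit width, then pad to whole bytes.
    bits = ''.join(format(code, '0%db' % width)
                   for code, width in zip(codes, widths))
    bits += '0' * (-len(bits) % 8)
    return bytearray(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))

# Five 5-bit codes followed by twelve 6-bit codes (the 97 bits of the
# article's encoding example) pack into 13 bytes.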

def compress(uncompressed):
    """Compress a string to a list of integer output codes."""
    # Seed the dictionary with all single-character strings.
    chars = 256
    dictionary = {}
    for i in range(chars):
        dictionary[chr(i)] = i
    buffer = ''
    result = []
    for c in uncompressed:
        xstr = buffer + c
        if xstr in dictionary:
            # Keep extending the current match.
            buffer = xstr
        else:
            # Emit the code for the longest match, then record the new string.
            result.append(dictionary[buffer])
            dictionary[xstr] = chars
            chars += 1
            buffer = c
    # Emit the code for whatever is still in the buffer.
    if buffer:
        result.append(dictionary[buffer])
    return result

That's my compressor... just needs a corresponding decompressor. DanPope (talk) 09:42, 15 February 2008 (UTC)
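
For reference, a matching decompressor could look roughly like this (a sketch only, assuming the integer-code compress() above; not necessarily the code that was added to the article):

def decompress(compressed):
    """Turn a list of integer codes from compress() back into a string."""
    # Seed the dictionary with all single-character strings, keyed by code.
    chars = 256
    dictionary = {}
    for i in range(chars):
        dictionary[i] = chr(i)
    # The first code always names a single character.
    w = chr(compressed[0])
    result = [w]
    for k in compressed[1:]:
        if k in dictionary:
            entry = dictionary[k]
        elif k == chars:
            # Code not defined yet: the "xxx" (cScSc) case discussed below.
            entry = w + w[0]
        else:
            raise ValueError('Bad compressed code: %d' % k)
        result.append(entry)
        # Mirror the encoder: define the next code as w + first char of entry.
        dictionary[chars] = w + entry[0]
        chars += 1
        w = entry
    return ''.join(result)

A round trip such as decompress(compress('TOBEORNOT')) should give back the original string.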

Your code is the best one so far IMO; I have added it in along with my corresponding decompressor. Also, I fixed the final output line: the original ended with buffer.split(), but there is no separator between the characters of a dictionary string (e.g., 'OT' in the example), so split doesn't have the desired effect.

I did like the meaningful variable names, but for the sake of consistency I've changed them to match the pseudocode and examples. It's much easier to follow along if the notation agrees everywhere. Arguably we should change the rest of the article to your names... Kiv (talk) 22:15, 1 May 2008 (UTC)

failure on xxx

Which pseudocode failed on "xxx"? I tested the algorithm for compression and decompression on "xxx" and it worked!

I think the last line of the algorithm:

done output the code for w;

is not correct -- I implemented the algorithm and the resulting code had a duplicate of the penultimate code and no final code.

I suggest the last line should be:

done output the code for wk;

158.48.32.2 12:49, 4 August 2007 (UTC)

Well, I am almost sure that it should be 'output the code for w;'. As you can see, if the first if-clause is satisfied, it assigns wk to w. However, I'm going to implement a Java version of the algorithm and see whether it's wrong or not. Could you please review your implementation again? Thanks
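
A short worked trace (illustrative, byte-valued starting dictionary) supports 'output the code for w' as the final step:

# Encoding "ababa":
#   read 'a'  -> w = "a"
#   read 'b'  -> "ab" unknown: emit 97 ('a'), define 256 = "ab", w = "b"
#   read 'a'  -> "ba" unknown: emit 98 ('b'), define 257 = "ba", w = "a"
#   read 'b'  -> "ab" is known, w = "ab"
#   read 'a'  -> "aba" unknown: emit 256, define 258 = "aba", w = "a"
#   end       -> w = "a" has not been emitted yet, so emit 97
#
# Output: [97, 98, 256, 97], which decodes back to "ababa".  After the loop
# there is no pending k (and hence no wk); what is left over is w, so the
# final line has to be "output the code for w".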