Talk:Lempel-Ziv-Markov chain algorithm

Dictionary Size

This page (and the page for 7z) claims the variable dictionary size goes up to 1 GB. However, the 7-Zip page claims up to 4 GB. Any reference for it (once, or in practice) being 1 GB? --Gojomo 19:53, 16 February 2007 (UTC)

Algorithm Details

I can't find anywhere in this article, or in the external links, any details on this algorithm (as opposed to the implementation, 7-Zip). Who invented it? When? What is the difference between it and the traditional LZ77, LZ78 and LZW algorithms, or is it a completely new algorithm? What does the acronym "LZMA" stand for? Thanks. Nyh 08:33, 15 Dec 2004 (UTC)

  • Ditto that request. As far as I can tell, the author of 7-Zip, Igor Pavlov, developed the algorithm, so the source itself seems to be the only documentation. From what I have been able to find, LZMA is an extension of LZ77 with a more flexible "dictionary" structure. I put dictionary in quotes, since LZ77's dictionary is a history window. Judging by the LZMA compression parameters described on this page, it appears LZMA's secret sauce is a match-search method that allows it to scale its sliding window dramatically. mr_z 01:02:57, 21 Dec 2004 (UTC)
  • I came here looking for the same answers as well, especially for the quote "is a data compression algorithm in development until 2001": in development by whom, and why did it stop in 2001? Does anyone know the history of this algorithm? --ShaunMacPherson 5 July 2005 12:59 (UTC)
  • LZMA's "secret sauce" is the fact it's a literal/match scheme coupled with range encoding, using an adaptive probability table capable of taking things into account like "match offsets tend to be multiples of 4" or "if we're currently on a 4-byte boundary then this byte is very likely to be a non-match". I hadn't actually thought of it as being related to Markov chains but I guess it is. The reason it beats the pants off Deflate is because Deflate uses Huffman coding, which is easier to decode but nearly always suboptimal.
Range encoding is incrementally more efficient than Huffman coding, but not nearly enough so to account for the dramatic performance increase over Deflate. If you coupled LZMA with a Huffman coder, and LZ77 with a range/arithmetic coder, the gap between them would close somewhat, but LZMA would still generally beat the pants off LZ77. The real "secret sauce" is the huge dictionary size, and the lookup algorithms that let the encoder search a dictionary of that size while staying usefully fast. --Piet Delport 16:23, 23 December 2005 (UTC)
Granted, Deflate's 32K limit doesn't help. So the "secret sauce" is the combination of the bigger window size plus the encoding that allows the bigger window size to be represented economically. See, I learned something! Posting wild speculation until someone corrects you is the true spirit of Wikipedia. --24.203.227.176 18:23, 2 January 2006 (UTC)
If by "encoding" you mean "data structure", then yes, that's it. :) --Piet Delport 23:48, 2 January 2006 (UTC)
  • What's the story with patents on the compression algorithm? The compression format has been gaining some popularity with open-source software and distributions. Maybe 7-zip owns patents but will only enforce them when the format is popular. Data compression is a minefield WRT patents - for example, that's why we have bzip2 instead of the arithmetic-coding variant bzip. OTOH, the Range encoding entry notes that range encoding might be patent-free. -- Richard, 19 Jan 2007
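The adaptive-bit-coding point above can be made concrete. Below is a minimal C sketch of the kind of binary range decoding LZMA is known to use: 11-bit probability counters, an adaptation shift of 5, and a 24-bit renormalization threshold. The names and structure here are illustrative, not the SDK's; the real decoder keeps many such counters, selected by context (bits of the previous byte, position alignment, match state), which is where the "Markov chain" in the name comes from.

 #include <stdint.h>
 #include <stdio.h>
 
 #define PROB_BITS 11                 /* LZMA uses 11-bit probability counters */
 #define PROB_ONE  (1u << PROB_BITS)  /* fixed-point 1.0; counters start at PROB_ONE/2 */
 #define MOVE_BITS 5                  /* adaptation speed */
 #define TOP       (1u << 24)         /* renormalization threshold */
 
 typedef struct {
     uint32_t range, code;            /* current interval and lookahead window */
     FILE *in;                        /* compressed input stream */
 } RangeDec;
 
 /* Decode one bit whose estimated probability of being 0 is *prob
    (a fraction of PROB_ONE), then nudge *prob toward what was
    actually observed: the adaptive model in action. */
 static int decode_bit(RangeDec *rc, uint16_t *prob)
 {
     uint32_t bound = (rc->range >> PROB_BITS) * (uint32_t)*prob;
     int bit;
     if (rc->code < bound) {                        /* the 0-branch */
         rc->range = bound;
         *prob += (uint16_t)((PROB_ONE - *prob) >> MOVE_BITS); /* saw 0: raise p(0) */
         bit = 0;
     } else {                                       /* the 1-branch */
         rc->code  -= bound;
         rc->range -= bound;
         *prob -= (uint16_t)(*prob >> MOVE_BITS);              /* saw 1: lower p(0) */
         bit = 1;
     }
     if (rc->range < TOP) {           /* renormalize: pull in another byte */
         rc->range <<= 8;
         rc->code = (rc->code << 8) | (uint32_t)(uint8_t)getc(rc->in);
     }
     return bit;
 }

A counter that keeps seeing the same bit converges toward certainty, and near-certain bits cost almost nothing to encode; that is how regularities like "offsets tend to be multiples of 4" get exploited without being hard-coded.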

Compression speed

What benchmark is the claim of ~1 MB/s compression speed based on? I've done a quick-and-dirty benchmark on a P4 2.8 GHz and get nowhere near the 1 MB/s (on a 2 GHz processor) stated in the article. The test was done on the French Wikipedia current-pages dump (20050903_pages_current.xml.gz):

$ time lzma e pages_current.xml pages_current.xml.lzma
real    42m46.403s
user    41m47.801s
sys     0m2.220s
-rw-r--r--  1 ~~~ users 716524090 2005-09-23 23:39 pages_current.xml
-rw-r--r--  1 ~~~ users 141582811 2005-09-24 00:24 pages_current.xml.lzma
compression 280K/sec

$ time lzma d pages_current.xml.lzma pages_current.xml.1
real    0m42.048s
user    0m34.918s
sys     0m2.052s
decompression 17MB/sec

$ time bzip2 pages_current.xml
real    5m44.382s
user    5m31.425s
sys     0m1.560s
-rw-r--r--  1 ~~~ users 716524090 2005-09-24 00:27 pages_current.xml
-rw-r--r--  1 ~~~ users 164451016 2005-09-23 23:39 pages_current.xml.bz2
compression ~ 2 MB/sec

$ time bunzip2 pages_current.xml.bz2
real    1m50.196s
user    1m46.415s
sys     0m2.800s
decompression 6.5 MB/sec

$ time gzip pages_current.xml
real    1m13.921s
user    1m12.205s
sys     0m1.240s
-rw-r--r--  1 ~~~ users 716524090 2005-09-24 00:27 pages_current.xml
-rw-r--r--  1 ~~~ users 213544652 2005-09-23 23:39 pages_current.xml.gz
compression 9.8 MB/sec

$ time gunzip pages_current.xml.gz
real    0m14.064s
user    0m12.357s
sys     0m1.604s
decompression 51 MB/sec

Given the numbers above, either the data used here performs very differently from the benchmarked data, or the claimed compression speed is 1 Mbit/s rather than 1 MByte/s. This test used the LZMA SDK 4.27. Note also that memory usage seems considerably higher than gzip/bzip2: with default options the compressor uses 80 MB of RAM. 80.15.144.16 23:58, 23 September 2005 (UTC)
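
For reference, the rates quoted above follow directly from the file size and the real times:

 716524090 bytes / 2566 s (42m46s) = ~280 KB/s   lzma compression
 716524090 bytes / 42.0 s          = ~17 MB/s    lzma decompression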

  • Try LZMA SDK 4.32 or later with the -a0 -D20 -fb64 -mfhc4 options. It should be faster than 1 MB/s, with only a slightly lower compression ratio. (A sketch of the hash-chain match finder selected by -mfhc4 follows below.)
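
The "hc4" in -mfhc4 selects a hash-chain match finder keyed on 4 bytes, which is faster but finds slightly worse matches than the default binary-tree finder; that is where the speed comes from. Here is a minimal C sketch of the idea (sizes and names are illustrative, not the SDK's; a real match finder uses a sliding window rather than a whole-input array):

 #include <stdint.h>
 #include <string.h>
 
 #define HASH_BITS 16
 #define HASH_SIZE (1u << HASH_BITS)
 #define NIL       UINT32_MAX         /* end-of-chain marker */
 #define MAX_CHAIN 64                 /* candidate cap: the speed/ratio knob */
 
 /* head[] must be initialized to NIL (e.g. memset(head, 0xFF, sizeof head));
    this sketch assumes the whole input is in memory and is under 1 MiB. */
 static uint32_t head[HASH_SIZE];     /* newest position seen per hash value */
 static uint32_t prev[1 << 20];       /* next-older position with the same hash */
 
 static uint32_t hash4(const uint8_t *p)
 {
     uint32_t x;
     memcpy(&x, p, 4);                /* hash the next 4 input bytes */
     return (x * 2654435761u) >> (32 - HASH_BITS);
 }
 
 /* Longest match for data[pos..end) among earlier occurrences with the
    same 4-byte hash, walking the chain from newest to oldest.
    Caller must guarantee at least 4 bytes remain at pos. */
 static size_t find_match(const uint8_t *data, size_t pos, size_t end,
                          size_t *match_pos)
 {
     uint32_t h = hash4(data + pos);
     uint32_t cand = head[h];
     size_t best = 0;
     int tries = MAX_CHAIN;
 
     while (cand != NIL && tries-- > 0) {
         size_t len = 0;
         while (pos + len < end && data[cand + len] == data[pos + len])
             len++;
         if (len > best) { best = len; *match_pos = cand; }
         cand = prev[cand];           /* follow the chain to an older position */
     }
     prev[pos] = head[h];             /* link the current position in as the */
     head[h]   = (uint32_t)pos;       /* new head of its chain */
     return best;
 }

Capping the chain walk (MAX_CHAIN here) is what keeps a big-dictionary search "usefully fast", per the discussion in the section above; raising the cap finds better matches but costs time.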

It's also unclear whether that 1MB/s refers to 1MB of input consumed per second or 1MB of output produced per second.

For compression, the rate should refer to input consumed, and for decompression, output produced. In other words, it's always in reference to the uncompressed data. (Otherwise, the rating would be largely determined by the compression ratio itself, which is not very useful.) --Piet Delport 00:50, 17 April 2006 (UTC)
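
For example, with the figures from the benchmark above (716524090 bytes in, 141582811 bytes out, roughly 5:1), 1 MB/s of input consumed would correspond to only about 0.2 MB/s of compressed output produced, so an output-based rating would mostly reflect the ratio rather than the speed.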

Comparison of compression ratio

I did a quick comparison, using default options, of gzip, bzip2, lzma (from LZMA Utils) and rar:

 -rw-r--r--  1 root root 199M 2005-12-04 20:26 linux-2.6.11.tar
 -rw-r--r--  1 root root  36M 2005-12-04 20:28 linux-2.6.11.tar.bz2
 -rw-r--r--  1 root root  45M 2005-12-04 20:29 linux-2.6.11.tar.gz
 -rw-r--r--  1 root root  30M 2005-12-04 21:15 linux-2.6.11.tar.lzma
 -rw-r--r--  1 root root  34M 2005-12-04 21:24 linux-2.6.11.tar.rar

Quick decompression bench:

 # lzma d -so linux-2.6.11.tar.lzma |pv > /dev/null
 199MB 0:00:11 [17.2MB/s]
 # cat linux-2.6.11.tar.bz2| bunzip2 |pv > /dev/null
 199MB 0:00:20 [9.52MB/s]
 # cat linux-2.6.11.tar.gz| gunzip |pv > /dev/null
 199MB 0:00:03 [63.8MB/s]
The previous section talks about compression speed, not decompression speed; any numbers on that? phe 12:56, 4 December 2005 (UTC)
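Compression speed could be measured in the same style (untested sketch; pv here reports how fast the uncompressed input is consumed, which matches the convention discussed in the section above, and the output path is illustrative):

 # pv linux-2.6.11.tar | gzip > /dev/null
 # pv linux-2.6.11.tar | bzip2 > /dev/null
 # time lzma e linux-2.6.11.tar /tmp/linux-2.6.11.tar.lzma

For the lzma line, divide the 199 MB input by the reported real time; the file-to-file form is the one used in the compression-speed section above.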
  • Had it confirmed by Igor Pavlov that the acronym LZMA does indeed stand for Lempel-Ziv-Markov chain algorithm, and not Lempel-Ziv Modified Algorithm as previously asserted. My apologies.