Zip, Jpeg, Streaming: How Does Data Compression Work?

Whether or not you’ve discovered the wonderful world of data compression by following HBO’s Silicon Valley series, we’ve written a guide to understanding this important area of technology.

If you tried to pit the Silicon Valley series as “the trials and tribulations of a data compression startup,” it wouldn’t sound very sexy. But if you follow the adventures of Pied Piper as we do, you know that compression is actually a vital area of technology with ubiquitous applications.

But first of all, let’s inflict a first disillusionment: no, the “Weissman score” does not really exist. There is no universal and absolute method to quantify the power of a compression algorithm; only benchmarks, similar to microprocessors. There, it’s said. Time for real compression.

Entropy and Information

During the Second World War in the United States a young PhD student in mathematics at Bell Labs was tackling what was to become one of the pillars of computer science The American Claude Shannon built his information theory on a concept called entropy, which here simply means “quantity of information”. You toss a coin, you note the result which is either heads or tails: that’s one bit of entropy. You do the same with two pieces: two bits of entropy. You throw a coin and cheat by noting twice its result: it remains a bit.

The string “yyyyyyy” has a very low entropy rate, as it is “16 times to “. On the contrary, “1f08c6e13b954f80” has a high rate of entropy because each character is completely random. If you’ve done physics, you may remember that entropy refers to the “amount of disorder”, and you’ll notice that the random string, by definition completely disordered, is the one with the most entropy. When there is little entropy, there is order and logic in the form of predictable patterns.

Detail of the Mandelbrot fractal, whose Kolmogorov complexity is very low, as it is summarized in an equation of Terminal S level. CC Wikimedia

From this idea emerges the complexity of Kolmogorov, which represents the shortest length of text (or code) to obtain a given result. If the shortest way to write “aaaaaaaaaaaaaaaaaaaaaaaaaa” is “16 times to “, this string is, according to Kolmogorov, much less complex than “1f08c6e13b954f80”, whose shortest way to write it is … “1f08c6e13b954f80”.

You may see what we’re getting at: compressing a file means writing it in the shortest possible way, looking for logical patterns. It’s easy to see that there’s a limit to lossless compression. If one wants to keep all the information (entropy) without losing a crumb, the compressed result should not have less total entropy (information) than the original data had. And since Kolmogorov complexity is the shortest way to write a file, a compressed file cannot be smaller than its Kolmogorov complexity. It’s all silly, isn’t it?

Compressing a file means writing it in the shortest possible way

An indescribable feeling…

The problem with Kolmogorov’s complexity is thatit is impossible to systematically determine. In other words – and fortunately for our fictitious friends at Pied Piper – you can’t usually look at a compressed file and tell without making a mistake whether or not it can be compressed further. This is demonstrated in Berry Paradox, described by an Oxford bookseller at the turn of the 20th century.

Consider the sentence “the smallest positive integer that cannot be defined in less than thirteen words”, which happens to define a certain positive integer. Think about it. You see the problem?

CC Simon Q

CC Simon Q

” Le1 plus2 small3 integer4 positive5 no6 definable7 into8 minus9 from10 thirteen11 words12 has just been defined in twelve words. How many words would Mr. Kolmogorov compress it into? Twelve or at least thirteen? Admit it’s a bunch of crap. It is like the expression “indescribable feeling”: how can this feeling be indescribable when it has just been described with the adjective “indescribable”?

Even the best algorithm in the world won’t be able to all compress; there will always be files that will get bigger once they’ve gone through the mill. That makes sense: you could otherwise ask the algorithm to compress the file several times in a row, and make the file shrink until it disappears – all while keeping all the information inside, of course. The “magic compression algorithm” does not exist, and in general, compressing a file twice in a row gives a larger result than the original file.

There are generalist compression algorithms, such as ZIP, that work relatively well, regardless of the file they compress. But to achieve better performance, is best chosen with specializedalgorithms; an algorithm calibrated to compress audio might have difficulty doing the same with video. In the same way, the same algorithms will not necessarily be used from one video to another: a sports broadcast needs to faithfully retranscribe the movements, but less the colours, whereas a wildlife documentary needs to display beautiful colours rather than being precise in its movements.

Loss or no loss?

There are two main types of compression: lossless (lossless) or lossy (lossy). Lossless compression is the one we studied with Kolmogorov: when you zip and unzip a file, you want the file to be whole on arrival and not missing a piece. All the entropy must be there. But for image, sound or video, it is not so vital to preserve every little detail that will not be perceived by the human senses anyway. In still images, JPEG is alossy compression codec, but it allows a file to be reduced to one-tenth of its original size without any visible change in quality. PNG, on the other hand, is a lossless codec that makes larger files.

But what is a codec, you may ask? Codec is the abbreviation for “coder-decoder”: as its name suggests, it is a small piece of software whose function is to encode or decode a data stream. The role of audio codecs is to take an audio file (written in bits) and convert it to an analogue signal (i.e. “real sound”), and vice versa – usually by (de)compressing it in passing.

CC Brett Lessard

CC Brett Lessard

To continue in the audio examples, one of the main lossless codecs is FLAC (Free Lossless Audio Codec), which is capable of 50-60% compression and has the merit of being royalty-free. The ALAC (Apple Lossless Audio Codec) is identical in fact, except that it is generally not supported by Android. There are also Microsoft’s famous WMA codecs, three of which are lossy and one lossless.

But codec is not synonymous with format. A Microsoft AVI file contains both audio and video, which are handled by two different codecs. AVI is thus what is called a container format, just like WAV (also from Microsoft), FLV (from Adobe), MP4 (from the Moving Picture Experts Group) or the Ogg (from the librarians).

One of the most famous compressed formats, the ZIP most often uses the DEFLATE algorithm. We will study it as a compression algorithm without loss. DEFLATE works in two steps: first it performs dictionary compression, and then it does entropy coding.

The ZIP, part 1: Dictionary compression with Lempel-Ziv

This is the ABC of compression, so much so that, like Mr Jourdain, we do it every day without knowing it. The run-length encoding (RLE) is the most basic: if we encounter the sequence “aaaaaaaaaaaaaaaaaaaaaaaaa” from earlier, instead of writing each of the sixteen to, we write “16a” and we gain thirteen characters of place. Similarly, an old image compression algorithm could say “50 black pixels” instead of writing “black pixel” fifty times in a row – we say “old” because the images found on today’s Full HD screens almost always have variations in hue from one pixel to the next, which makes an ELA unattractive.

CC Jonathan Cohen

CC Jonathan Cohen

The same principle is found in a more elaborate way in the LZ77 and LZ78 algorithms deposited by researchers Lempel and Ziv in 1977 and 1978 respectively. These are the basis of all dictionary compression. When you say “UE” instead of “European Union”, “NASA” for “National Aeronautics and Space Administration ” or “MP3” for “MPEG-2 Audio Layer III”, you are doing more or less LZ78 compression: Instead of repeating these redundant names in their full length, you store them in a small dictionary (in this case in your head) and refer to them with unique and simple “hints” (in this case acronyms). In practice, LZ78 builds up its dictionary as it is compressed and notes everything and anything in it; but the principle is the same.

LZ77, on the other hand, does not have a separate dictionary, but works with a “sliding window” of data that allows it to remember when it last saw a particular pattern. He then says “it’s the same pattern as the one that is X bits ago and lasts Y bits”, and it compresses. This is the LZ77 that DEFLATE uses.

The ZIP, part 2: entropy coding with our friend Huffman

Dictionary compression being a bit easy, we’re going to get serious with entropy coding. This term refers to a whole bunch of lossless compression techniques that we would call “generalist”, which work well in all cases.

The first entropy coding technique is the Huffmancoding, developed by a MIT PhD student in 1952 and used in ZIP files. To simplify the explanation, we assume that we want to compress a text written in ASCII characters (i.e. where each character corresponds to one byte, or eight bits). It’s a bit silly to have to rewrite each of the same eight bits “01100101” each time you want to write a frequent letter like and. Isn’t there any way to make it shorter?

What Huffman did was take all the letters of the text and put them in drawers numbered from top to bottom, with the rarest letters at the bottom of the cabinet and the most frequent ones at the top. In real life, it’s supposed to be what we call a Huffman tree, but we’ll pass you the construction details. When the compression algorithm sees the letter and, it will simply write the drawer number: « 1 ». And when he sees the letter x, he will write for example “10110” (or “22” in binary).


In real life, a drawer cabinet is called a “tree”, is pyramidal, and the “1” and “0” indicate whether the drawer is on the left or right. Source: Wikimedia

More popular than Huffman coding, arithmetic coding is a bit longer to implement, but often compresses better. The general principle is the same, but instead of coding each symbol separately, it’s as if all possible and imaginable texts were theoretically indexed on a huge sundial like this one:


Source: Wikimedia

It is “enough” to construct the text on the dial, point the dial needle at it, and read the angle; this corresponds to the compressed text. This concept being already more complicated, just remember Huffman’s one, it’s more than enough to understand. Finally, interval coding is very similar to arithmetic coding, except that it does not necessarily work in binary base, but in the base that is most convenient for it (randomly, a base 28 to compress all the letters of the alphabet plus space and period).

Lossy compression: JPEG

The JPEG will serve as an example for compression with loss. First, the JPEG algorithm decreases the resolution of the chrominance, which the human eye does not pay too much attention to. It’s called subsampling.

The algorithm then slices the image into 8-pixel squares and applies a transformed into a discrete cosine. To speak poetically, it is as if the piece of file is transformed into sound waves or ripples on the water. The next step, quantization, is where a higher or lower degree of compression can be applied. The “waves” are mapped in a simplified way: instead of redrawing the curves exactly, only a few points are recorded. The fewer points, the less precise it is, the more compressed it is.

Once the quantization is complete, the JPEG takes its card and goes through two steps of entropy coding (which now holds no secrets for you): a RLE beast – but which reads the zigzagged end of the card, let’s not laugh – and a Huffman coding. There, it’s settled!

Most lossy compression algorithms, including those working in audio, use a form of direct cosine transform. Despite the artifacts that it generates and that we have all seen on a poor quality GIF or video, lossy compression can sometimes have desirable effects. In audio, it tends to soften volume peaks – so that, for example in a piece of classical music softly played piano passages can still be heard even in the background noise of a car.

In practice: streaming video

Video streaming, one of Pied Piper’s main fictitious businesses, made its debut on the real Internet on June 24, 1993. Garage rock band Severe Tire Damage, from Palo Alto, Silicon Valley, was performing at the famous Xerox PARC research facility. At that time, the researchers were discussing the retransmission technology they had just developed. The idea was obvious: why not showcase the results of Xerox PARC’s research by broadcasting the concert live on the Internet?

The result – a video of 152 by 67 pixels jerky at 8-12 frames per second, with sound quality worthy of “at best [of a] bad telephone link” – would have “sucked up half the bandwidth of the entire Internet”.

Audio and video streams are encoded with the appropriate codecs, passed through a so-called “bitstream” container such as MP4 or FLV, sent through the pipes of the Internet using a transport protocol such as Adobe’s RTP, and decoded at the user’s home with the same codecs. For example, if Apple’s keynotes refuse to be retransmitted on a browser other than Safari or Microsoft Edge, it is because of the transport protocol used : HLS (HTTP Live Streaming), although open source and implemented by a range of video players (including VLC), but that neither the desktop version of Google Chrome nor Mozilla Firefox natively support.

The file is split into several blocks, duplicated and compressed to different degrees

If your 4G connection fails while you’re watching a video on YouTube rather than freezing (or even stuttering) the video until the network returns, you can partially compensate by sending a more compressed video stream. This is called adaptive bitrate streaming: before sending, the file is split into several blocks, duplicated and compressed to different degrees, so that the protocol can choose blocks whose compression is adapted to the connection.

On video streams, and even on video files for that matter, you have to manage a time-to-memory trade-off. The video compression algorithm may be excellent, but it won’t be usable if decompression takes too long to be done on the fly, in which case the image won’t keep up as it would with a bad network connection.

Propose a new pivot strategy!

It was explained to you that the principle of lossless compression was to try to find a logic in the content of the file. An excellent compression algorithm would also have another particularly useful function, which has absolutely nothing to do with making things smaller, faster or more efficient. Which one do you think it is?

Alexa, Amazon

Alexa, the Amazon merchant AI

An excellent compression algorithm is also an excellent artificial intelligence, and vice versa. Although the algorithms we have seen in this article are not very smart, machine learning has the same purpose of pattern detection and prediction. A text compression algorithm, which would push its analysis to the point of understanding all the semantic variations and grammatical flexions of its data, could theoretically be reconverted into a chatbot. On the other hand, Google has been working on an AI for image compression, and Netflix has also developed an AI to compress its movies.

Now that you have discovered that you are (almost) a living compression algorithm, your express training is over. All you have to do is drop your CV in Pied Piper’s mailbox and propose them to switch to the AI domain in case their initiative of the moment ends in an industrial disaster.

For those who would be interested in reading a real scientific article written especially for the realization of Silicon Valley, the “Optimal Tip-to-Tip Efficiency” paper by mathematician Vinith Misra explains very seriously in 12 pages, with supporting equations and diagrams, the brilliant finale of season 1. You are warned first of all about the spoiler, and also that it is NSFW (not safe for work) content – not really suitable for all ages.

You May Also Like