  # Shannon Informational Entropy in PHP

What is Shannon Information Entropy?

Imagine that you have a text file that contains the following ten characters:

AAAAAAAABA

We could, if we wanted to, fairly reliably predict the next letter in the file (we’ll get it wrong just once). Each subsequent character doesn’t give us a lot of new information that we couldn’t guess. It has low informational entropy.

The relationship with compression

If each subsequent character doesn’t necessarily pass a whole lot of new information, then that tells us something about redundancy in the file. It tells us that we could represent the file in a different way to pass more information with each character. i.e. low informational entropy = highly compressible.

Let’s say we compress it with a compression program that uses a simple scheme where it writes the number of repeats of each character. So:

(8)A(1)B(1)A

Our ten character file is now 6 characters. We have increased the amount of information we are conveying with each character, so the file h as a higher informational entropy.

Applying it in practice

Each of our characters in our file are representable by 1s and 0s, known as bits. Shannon informational entropy gives us the possibility to calculate how many bits the file could be compressed down to.

Take the example. To start with we had 10 characters. Each character is represented in 8 bits, so 80 bits of information. We then compressed it to 6 characters, represented by 8 bits. So, 48 bits. How do we know if we’ve done a good job?

Using the shannon entropy formula, we can estimate that we should need 0.469 bits per character. With our original ten characters, that means 4.69 bits. We can’t end with a fraction of a bit, so in practice we would have to round that up to 5. We’re a long way off at 48 bits, so we should keep trying.

Applying against a pre-existing file

Not only can we use it as a target, but the calculation also enables us to see how well compressed a file is. For example, if we calculate the shannon entropy for a file (based on 8 bit characters) and the value comes out low, then we know this file is very compressible. If it comes out close to 8, then we know it is not very compressible.