Shannon Informational Entropy in PHP

Shannon information entropy attempts to predict the smallest possible average size of lossless encoding of data. This is useful when writing compression software as it provides a target estimate to which we can attempt to compress a file without losing any of the information contained inside it when it is subsequently decompressed.

Download from Github

Download

What is Shannon Information Entropy?

Imagine that you have a text file that contains the following ten characters:

AAAAAAAABA

We could, if we wanted to, fairly reliably predict the next letter in the file (we’ll get it wrong just once). Each subsequent character doesn’t give us a lot of new information that we couldn’t guess. It has low informational entropy.

The relationship with compression

If each subsequent character doesn’t necessarily pass a whole lot of new information, then that tells us something about redundancy in the file. It tells us that we could represent the file in a different way to pass more information with each character. i.e. low informational entropy = highly compressible.

Let’s say we compress it with a compression program that uses a simple scheme where it writes the number of repeats of each character. So:

(8)A(1)B(1)A

Our ten character file is now 6 characters. We have increased the amount of information we are conveying with each character, so the file h as a higher informational entropy.

Applying it in practice

Each of our characters in our file are representable by 1s and 0s, known as bits. Shannon informational entropy gives us the possibility to calculate how many bits the file could be compressed down to.

Take the example. To start with we had 10 characters. Each character is represented in 8 bits, so 80 bits of information. We then compressed it to 6 characters, represented by 8 bits. So, 48 bits. How do we know if we’ve done a good job?

Using the shannon entropy formula, we can estimate that we should need 0.469 bits per character. With our original ten characters, that means 4.69 bits. We can’t end with a fraction of a bit, so in practice we would have to round that up to 5. We’re a long way off at 48 bits, so we should keep trying.

Applying against a pre-existing file

Not only can we use it as a target, but the calculation also enables us to see how well compressed a file is. For example, if we calculate the shannon entropy for a file (based on 8 bit characters) and the value comes out low, then we know this file is very compressible. If it comes out close to 8, then we know it is not very compressible.

PHP and Download

I code a lot in PHP, therefore when I do compression work I tend to prototype algorithms in PHP first and convert later. Therefore, I wanted a calculator that worked in PHP. You can download it from Github here.

Share on facebook
Facebook
Share on twitter
Twitter
Share on linkedin
LinkedIn
Share on reddit
Reddit
Share on stumbleupon
StumbleUpon
Share on whatsapp
WhatsApp
Share on email
Email
Share on print
Print