This blog post is written by Kenny Chung, a lawyer at Synch. Kenny is passionate about privacy issues beyond the ordinary. Read his thoughts about Personal Data and 33 bits of Entropy.
The concept of personal data revisited – the mathematical approach of identifying a person
For a lawyer working with data protection and the GDPR, one of the most fundamental and profound questions is: what is personal data? If your company or organisation has started implementing the groundwork necessary to comply with the GDPR, you have surely pondered the same question.
Which information constitutes personal data and how much information is required to identify a person? As it turns out, there is an objective way to approach this question.
First, let us look at the basics. According to the GDPR, “personal data means any information relating to an identified or identifiable natural person […]”. While this provides some guidance on how to determine whether specific identifiers, alone or jointly, constitute information from which the identity of a unique individual can be correctly deduced, it is clear that the definition still leaves ample room for interpretation. Whether the identifiers presented are enough to identify a single person depends heavily on the context in which they are used. For example, one of the most common identifiers is the name of a person. If you use the name Oscar, which is a relatively common name, it may not always be enough to identify a specific individual. However, if you add further identifiers such as an address or a telephone number, the chances of it being anyone other than a single person quickly decrease. It is also possible to use other identifiers to identify an individual. If I say that a certain person works at Amazon, this in itself is not sufficient to identify anyone, as Amazon has over 500 000 employees. If I tell you that a certain person is a CEO, this is also not sufficient, as there are literally millions of companies with CEOs in the world. However, combine the two facts, that the person works at Amazon and is also the CEO there, and it is easy to deduce that I am talking about Jeff Bezos.
Usually, situations are not as clear-cut, and there are many complex cases subject to interpretation where bits and pieces of information are only sporadically available. While each piece of information might be partially revealing about a person, one might wonder whether it is possible to measure exactly how much information one would need in order to identify someone. Determining such a thing, one could argue, would resemble determining how many grains of sand you need to build a sandcastle.
Well, it turns out that there is a way to measure the exact amount of information you need, and the answer hides behind 33 bits of entropy.
33 bits of entropy
There is a mathematical quantity called entropy, which is measured in bits (if you are a lawyer, like me, you might be squirming uncomfortably in your seat right now). Entropy can be thought of as the number of yes-or-no questions needed to single out one possibility among many equally likely ones. If there are two possibilities, the entropy is one bit. If there are four possibilities, the entropy is two bits, and the number of possibilities doubles with each bit of entropy added. As there are more than seven billion people on this planet, the entropy is around 33 bits (i.e. 2 to the power of 33 gives us roughly 8.6 billion possibilities, enough to cover everyone). In plain language, this means that you need 33 bits of entropy (footnote 1) to objectively and definitively identify a specific individual. In the same way, identifiers such as name, address and birthday carry with them bits of entropy that may be partially revealing about a person’s identity. By using a mathematical formula (footnote 2), you can work out how many bits of information a given fact provides. Someone’s birthday is worth 8.51 bits of information, while a ZIP code might be worth 10-20 bits depending on the area it covers. According to mathematical theory, if the pieces of information are independent of one another, their bits can be added together, and once the total reaches 33 bits it is possible to identify a specific person without fail.
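For readers who prefer to see the arithmetic, here is a minimal sketch in Python of how these figures can be reproduced. The function name and the ZIP code of 20,000 inhabitants are my own illustrative assumptions, and adding the bits together only works if the two facts are independent of each other.

```python
import math


def self_information_bits(probability: float) -> float:
    """Bits gained by learning a fact that is true with the given probability."""
    if not 0.0 < probability <= 1.0:
        raise ValueError("probability must be in the interval (0, 1]")
    return -math.log2(probability)


WORLD_POPULATION = 7_700_000_000  # population figure used in footnote 1

# Bits needed to single out one person among everyone on the planet.
bits_needed = math.log2(WORLD_POPULATION)

# A birthday (day and month), assuming all 365 days are equally likely.
birthday_bits = self_information_bits(1 / 365)

# Hypothetical ZIP code shared by 20,000 people (an assumed, illustrative figure).
zip_bits = self_information_bits(20_000 / WORLD_POPULATION)

# Assuming birthday and ZIP code are independent, their bits simply add up.
remaining_bits = bits_needed - (birthday_bits + zip_bits)

print(f"Bits needed:    {bits_needed:.2f}")
print(f"Birthday:       {birthday_bits:.2f}")
print(f"ZIP code:       {zip_bits:.2f}")
print(f"Still missing:  {remaining_bits:.2f}")
print(f"Candidates left: about {2 ** remaining_bits:.0f} people")
```

Under these assumptions, knowing the birthday and the ZIP code still leaves a group of a few dozen candidates, so further identifiers would be needed to reach one person.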
With that said, information that adds nothing new cannot be counted towards the bits of entropy. For example, if you already know that someone lives in Stockholm, the fact that they live in Sweden does not constitute new information.
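The Stockholm example can be sketched with a toy population. The eight records below are made up purely for illustration, and the helper name narrow_down is hypothetical. The idea is to measure each fact by how much it shrinks the group of remaining candidates, rather than by looking it up in isolation.

```python
import math

# Toy population of eight made-up people (illustrative data only).
people = [
    {"id": 1, "country": "Sweden", "city": "Stockholm"},
    {"id": 2, "country": "Sweden", "city": "Stockholm"},
    {"id": 3, "country": "Sweden", "city": "Gothenburg"},
    {"id": 4, "country": "Sweden", "city": "Gothenburg"},
    {"id": 5, "country": "Norway", "city": "Oslo"},
    {"id": 6, "country": "Norway", "city": "Oslo"},
    {"id": 7, "country": "Norway", "city": "Oslo"},
    {"id": 8, "country": "Norway", "city": "Oslo"},
]


def narrow_down(candidates, key, value):
    """Return (bits gained, remaining candidates) after learning one fact."""
    remaining = [p for p in candidates if p[key] == value]
    if not remaining:
        raise ValueError("no candidate matches this fact")
    return math.log2(len(candidates) / len(remaining)), remaining


# Learning the country on its own halves the group: 1 bit.
country_alone_bits, _ = narrow_down(people, "country", "Sweden")

# Learning the city first cuts the group from eight to two: 2 bits.
city_bits, in_stockholm = narrow_down(people, "city", "Stockholm")

# Learning the country afterwards removes nobody: 0 bits.
country_after_city_bits, _ = narrow_down(in_stockholm, "country", "Sweden")

print(country_alone_bits, city_bits, country_after_city_bits)
```

Naively adding the country's 1 bit to the city's 2 bits would suggest 3 bits, which is enough to pick out one person in eight, although two candidates actually remain.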
Is it really that simple?
Based on the above, it seems possible to simply gather different bits of information, insert them into a mathematical formula and get an answer as to whether the accumulated information is enough to identify an individual. Well, it turns out it is not that simple. In theory, there is no dispute that this is how you can effectively identify someone. In practice, however, there are several concerns that must be addressed. One is that it is difficult to know how much information a certain identifier actually provides. The example above, that the city of Stockholm is in Sweden and hence does not bring forth new information, presumes that specific knowledge. It is therefore not easy to distinguish already known information from new information, which can lead to an incorrect estimate of the information provided.
It is also necessary to understand that the above-mentioned approach must be put in a legal context and therefore be distinguished from a purely mathematical approach. According to the GDPR, in determining whether an individual is identifiable, account should be taken of all the means reasonably likely to be used by the controller or by any other person. This includes factors such as cost, the amount of time required and the available technical means, amongst other things. The objective way of identifying a person and the relative way of doing the same therefore give two different results. While blood, fingerprints and other types of unique biological samples might contain all the bits of entropy required to objectively identify a person, in most contexts there is simply no way of identifying the specific person behind the biological sample. Hence, although identifiers may contain all the information necessary on an objective level to identify a specific individual, in most cases they would not, legally speaking, count as personal data.
With that said, it seems that privacy lawyers need to be around for a while longer in order to strike the correct balance between what constitutes personal data and what does not. Within a legal context, that is.
Footnote 1: The number is closer to 32.84 in reality, as the population today is 7.7 billion, but for simplicity’s sake we will round it up to 33.
Footnote 2: ΔS = -log2 Pr(X = x), where ΔS is the reduction in entropy and Pr(X = x) is the probability of a fact being true. For example, the probability of someone having a particular birthday is 1/365, which gives log2(365) ≈ 8.51 bits.