When systems are designed to be differentially private, they allow companies to gather the data they need to train their algorithms, while helping to keep the data subjects anonymous. As privacy concerns grow, differential privacy could be a key concept in how our societies move past the current era of invasive surveillance.
If you’re worried about data collection from major tech companies, the good news is that concepts like differential privacy are starting to become more prominent. In certain situations, differential privacy can help to protect us by providing a compromise between the interests of people and those of data collectors.
The wider role of differential privacy
Differential privacy is actually a much broader idea that can be applied in a multitude of fields outside of training algorithms. It was developed as a response to problems of privacy in data analysis. Under normal circumstances, if your data is included in a database, it can lead to breaches of your privacy.
Even if your data has been anonymized and had your identifiers stripped away, it can still be connected back to your identity through statistical analysis. The underlying idea behind differential privacy is that you cannot breach a person’s privacy if their data is not in the database.
With this in mind, differential privacy aims to grant individuals included in a database the same degree of privacy as if their data was completely left out. A system is differentially private when the data is structured in such a way that you cannot tell whether a particular subject participated or not.
If something fulfills this requirement, the data cannot be linked back to individuals, which protects their privacy. In this sense, differential privacy is really a definition rather than a technique.
Cynthia Dwork, one of the researchers who introduced the term, described differential privacy as a promise from the data holder to the data subject, that:
“You will not be affected, adversely or otherwise, by allowing your data to be used in any study or analysis, no matter what other studies, data sets, or information sources, are available.”
One of the most common misconceptions is that differential privacy is a specific technique. It’s not – it’s a property that many different techniques can achieve. When companies say they use differential privacy, they mean that they apply various techniques to ensure that the data they collect or release is differentially private.
For example, in Apple’s word and emoji suggestion algorithms, the company has established a system that adds noise to what users type to keep the information private. Assuming that there are no flaws in the system, the database is differentially private.
Differential privacy is achieved through a range of complicated techniques that involve a lot of statistics. In essence, they add a calculated amount of noise (random data) to the database. This obscures the relationship between the individual and the data points, but because it is done in a controlled manner, the data is still accurate enough to be useful in many situations.
The amount of noise needed will depend on the number of people in the database. To keep individual information private, the database cannot rely too much on a single person. The fewer people in a database, the more noise needs to be added to protect them.
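To make that concrete, here’s a minimal sketch of one standard noise-adding technique, the Laplace mechanism, applied to an average query. The salary range, group sizes and the epsilon privacy parameter below are purely illustrative assumptions, not anyone’s production settings:

```python
import random

def laplace_noise(scale: float) -> float:
    """Sample Laplace noise as the difference of two exponential draws."""
    return random.expovariate(1 / scale) - random.expovariate(1 / scale)

def dp_average(values, lower, upper, epsilon):
    """A differentially private average of values assumed to lie in [lower, upper].

    One person can shift the average by at most (upper - lower) / n, so that is
    the "sensitivity" the noise has to cover. Fewer people means relatively
    more noise for the same privacy level (epsilon).
    """
    n = len(values)
    clipped = [min(max(v, lower), upper) for v in values]
    true_average = sum(clipped) / n
    sensitivity = (upper - lower) / n
    return true_average + laplace_noise(sensitivity / epsilon)

# A big group's answer barely moves; a tiny group's answer is swamped by noise.
big_town = [random.uniform(20_000, 120_000) for _ in range(200)]
tiny_group = [random.uniform(20_000, 120_000) for _ in range(5)]
print(dp_average(big_town, 0, 150_000, epsilon=0.5))
print(dp_average(tiny_group, 0, 150_000, epsilon=0.5))
```

Run it a few times and the large group’s answer stays close to the truth, while the small group’s answer swings wildly – which is the intuition behind needing relatively more noise when fewer people are involved.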
We will spare you the math overload to make this article more digestible and easier to understand, but you can check out Dwork’s paper linked above if you want to take a look at the mathematical underpinnings of differential privacy.
If you’re already a little overwhelmed, don’t worry, because we’ll start by taking a few steps back. First, we’ll examine privacy and data in a more general sense. Then we’ll dive in and cover differential privacy at a deeper level, before focusing on how it can be used in machine learning for less-invasive data analysis.
We’ll look at the potential differential privacy has in machine learning, its current applications, and also its limitations. By the time you’re through, you should have a good understanding of its real-world ramifications without having to drown yourself in the math behind it.
Square one: Data & privacy
Data is good – at least in certain situations. It helps us understand what’s really happening and allows us to make better decisions for the future. Without its collection and analysis, we would have made little of our scientific progress, and the world would be far more chaotic.
Let’s say your country is planning how to spend its budget next year. What do you think would lead to better, more equitable outcomes:
- If it planned out its distribution according to a mix of guesswork and intuition; or
- If it planned out its distribution based on the detailed collection and analysis of information, including how many people there were, where they were, their ages, incomes, education levels and many other aspects.
If you chose the second option, congratulations, you’ve just invented censuses, which are just one example of how data analysis can help to make our lives better. Censuses involve collecting and analyzing data, which governments then use for a range of tasks, including allocation of resources. As you can probably guess, they’re able to do a much better job with this information than without it.
In practical terms, this means that every few years, most of us fill out a very detailed questionnaire and send it off to the government. But doing so has the potential to breach the privacy of those who answer it, which can lead to severe consequences.
It’s not unreasonable for individuals to be wary of censuses, particularly as the world slowly wakes up to the mass data collection and privacy invasions that are so prevalent. But censuses also offer us incredibly valuable insights, which are important for the future successes of our countries.
This leaves us in a conundrum. Participating in the census could lead to privacy violations, but if everyone rejects the census, we lose all of this valuable information.
A competent census bureau will allay these fears by introducing security and privacy mechanisms that help to protect the individual information, while still granting us insights about overall groups. When done properly, it’s a good compromise.
The only one
Let’s say a small logging and farming town is conducting its own little census. It wants to find out which industries are bringing in the most money to the town so that it knows which areas to expand on, and which need government support.
The town hall asks each business to respond to a survey that includes questions about revenue and many other details. Most companies are happy to share because they know that the information will help the town as well as their own businesses. They also trust the town to collate the information and strip away their identifiers, keeping the data anonymous.
If the data is collected and averaged out before being released to the public, then the farmers and the sawmills won’t have to worry about anyone else in town finding out how much money they are making. The individual figures will get lost in the averages because there are so many farmers and loggers.
But what if you owned the only hotel in town? Let’s say it made $500,000 in revenues. If the town collects and averages the figures for the hotel industry, then releases them as part of a graph alongside all of the other industries, the graph will say that the hotel industry made $500,000 in revenues.
Now, the people in town will realize that there is only one hotel, so they can deduce that the hotel made $500,000 in revenues. If there is only one of something, the individual data can’t get lost in the average.
This is a problem, because private companies aren’t normally required to publicly announce their financial records. As the owner, perhaps you don’t want the rest of the town to know just how much you made.
You’re left with the choice of either lying to the town and skewing the figures, perhaps eventually leading to poorer decisions, or breaching your own privacy.
Of course, this is not a good situation. What we’ve just demonstrated shows how even when data has been anonymized and had the identifiers stripped away, it may not really be so anonymous after all.
If the town statistician were savvy enough, there are a few things she could do to protect your privacy as the hotel owner. She could simply leave the hotel industry out of the publication, or perhaps roll the hotel industry in with a bunch of other businesses and title the results “miscellaneous industries.”
As the hotel owner, you would want to know exactly how the statistics are going to be used before you fill out the survey so that your privacy would not be breached.
Comparing two sets of data
Let’s consider another example of how individual data can be exposed even when it has been anonymized. Let’s say a company does a yearly report that includes the total cost of wages for each department.
If the petting zoo department had a budget for wages of $1,000,000 in 2019, and it was shared between 20 employees, all you can really tell from that data is that the average wage was $50,000. You don’t know how much the manager was making, or how badly some employees were getting underpaid.
Now, let’s say that by the time the 2020 report came around, there had been no wage increases, but a company executive’s son had been attached to the department as an assistant manager, whose main role seemed to be taking long, boozy lunches.
If the new budget for wages was $1,200,000, and no raises had been granted, what does that tell us? That the possibly alcoholic son is pulling in a cool $200,000 for doing pretty much nothing.
This demonstrates another way that sensitive information can be extracted from supposedly anonymous data. Since the other members of the department would be outraged if they found out, it’s in the interests of company leadership to keep this information from being pulled out of the data.
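The arithmetic behind this kind of comparison (sometimes called a differencing attack) is trivial, which is exactly what makes it dangerous. A quick sketch using the figures above:

```python
# Two yearly reports, each publishing only an aggregate wage figure.
wages_2019 = {"petting_zoo": 1_000_000}   # 20 employees
wages_2020 = {"petting_zoo": 1_200_000}   # 21 employees, no raises granted

# Anyone comparing the two "anonymous" aggregates can recover the newcomer's salary.
new_hire_salary = wages_2020["petting_zoo"] - wages_2019["petting_zoo"]
print(f"The assistant manager earns ${new_hire_salary:,}")  # $200,000
```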
Machine learning
Most of the recent publicity surrounding differential privacy has been in the machine learning sphere, so that’s what we’re going to focus on. First, we need to cover some of the basics.
According to science fiction writer Arthur C. Clarke, any sufficiently advanced technology is indistinguishable from magic. He first published that law in the 1970s, and it’s easy to believe that if you transported someone from that time period to the present, they would scream witchcraft or trickery at some of our technological developments.
We have constantly updated, perfectly curated news feeds that keep us entertained, traffic-rerouting apps like Waze that seem to magically know the fastest way through a city, and the ability to find almost any information we want with a few taps on a keyboard.
All of these tasks are completed with algorithms, which are far more boring than magic. Algorithms are essentially sets of instructions or formulas that compute a desired result or solve a problem.
Our lives are full of them – from Twitter to your email spam filter, to looking up flights. Unless you’re a Luddite or you specifically go out of your way to avoid algorithms, much of your information and many of your life decisions are probably made with their assistance. In a way, algorithms control our lives.
There is a range of benefits to this setup – it makes it easy to pick a restaurant, and finding an address is far simpler now than in the days of paper maps. Despite these benefits, algorithms also leave us open to manipulation and other negative effects, but those subjects are a little outside the scope of this article.
What we’re more concerned about is how these algorithms achieve such accurate results, and how they constantly improve themselves.
Why do we use machine learning to improve algorithms?
A significant portion of the process is done via machine learning, which is a field within the sphere of artificial intelligence. Under machine learning, data is collected and analyzed, with the algorithms taking what they learn and then altering their processes to accomplish their tasks more effectively.
The impressive thing about this type of artificial intelligence is that machine learning algorithms can improve themselves and their results without a human developer having to explicitly reprogram them.
To make up a simple example, let’s say a company making a chat app wants its emojis placed in the most convenient positions for users. First, it would need an algorithm to count which emojis are used most often, so that it could put the popular ones in the easiest-to-reach spots.
Emoji use can change over time, so what was once a conveniently placed emoji may end up barely being touched. If this happens, it’s just taking up space and making the user’s task take a fraction more time to accomplish.
If the company wants to make life as easy as possible for its users, it will use a machine-learning algorithm to collect data on these trends, analyze it, and then update the placement to make sure that the current popular emojis are easy to reach.
You may not care much about emojis, but what about your search results? When you Googled something 15 or 20 years ago, you would often have to go through pages and pages of results or try a number of different search terms to get what you really wanted. By comparison, it’s amazing how accurate the current results are.
How about the predictive typing engines on your phone? If you remember back to when platforms first began suggesting the next word, it was far less useful than Gboard or the iPhone keyboard are now. These days, the technology can pick up more of the context from what you are typing, which makes it pretty good at predicting the correct word.
If you appreciate the ease and simplicity that comes from these technologies, you owe some thanks to machine learning algorithms. But data collection isn’t always so benevolent, and sometimes it can harm the subjects by leading to cybercrime or invasive monitoring from the data collectors.
While the harm that can come from these practices may seem obvious, the dangers that come from anonymized data are more subtle.
Netflix “anonymizes” user data
Netflix New Icon by Netflix Inc. licensed under CC0.
Let’s look at a real-life example that shows just how serious the issue can be. In the late 2000s, the video streaming service Netflix thought it would be a good idea to outsource some of its development to the public, and opened up a competition to see if anyone could come up with a better algorithm for recommending movies to users.
To facilitate the competition, Netflix unveiled a $1,000,000 prize and released a bunch of its data. This included more than 100 million movie ratings compiled by almost half a million of the company’s subscribers.
In an FAQ, Netflix assured its users that there was no need to keep the data in these releases private because “…all customer identifying information has been removed; all that remains are ratings and dates.” This sounds like a good thing, as though the company was actually trying to protect its users, rather than moving forward with blatant disregard for their privacy.
Unfortunately for Netflix, it didn’t consider that removing identifying data does not necessarily make the data truly anonymous. Two researchers from the University of Texas at Austin began investigating the competition under the presumption that it would only take a small amount of information to deanonymize the data and identify users.
Using complex statistics, they discovered that they could deanonymize 99 percent of the records with just a few points of data. All the researchers needed was eight separate movie ratings and the dates on which they were made. This level of accuracy even accounted for a 14-day error in the rating dates, as well as the possibility that two of the ratings were completely wrong.
They also found that with just two pairs of ratings and dates, they could deanonymize 68 percent of the records, although in this case, the time error could be a maximum of two days.
In essence, almost the whole database could be matched back up to the identities of those in the data release. All the researchers had to do was know which eight movies a data subject had rated, and roughly when.
This type of information isn’t that hard to find out – a colleague or a supposed friend could easily extract information about when you watched eight separate movies in casual conversation. You would never even consider that they were up to something nefarious. It’s not like they would be asking for your credit card details, it’s just normal, casual conversation.
Bad actors could also easily find out this information through IMDb, if the target used both services. It’s likely that an individual’s ratings on IMDb are similar to their ratings on Netflix, which would make it easy to deanonymize the data.
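To get a feel for how such linkage works in principle, here’s a toy sketch: score each “anonymous” record against a handful of (movie, approximate date) observations about a target and pick the best match. This is only the intuition – it is nothing like the researchers’ actual statistical technique, and the usernames, records and dates below are invented.

```python
from datetime import date

# "Anonymized" ratings data: each record is a set of (title, rating date) pairs.
anonymous_records = {
    "user_0412": {("Fahrenheit 9/11", date(2004, 7, 2)),
                  ("Jesus of Nazareth", date(2004, 8, 15))},
    "user_9931": {("Fahrenheit 9/11", date(2004, 7, 9)),
                  ("The Gospel of John", date(2004, 9, 1))},
}

# What an acquaintance (or a public IMDb profile) might reveal about the target.
auxiliary_info = [("Fahrenheit 9/11", date(2004, 7, 3)),
                  ("Jesus of Nazareth", date(2004, 8, 20))]

def match_score(record, observations, tolerance_days=14):
    """Count observations that match a rating in the record within the tolerance."""
    score = 0
    for title, seen_on in observations:
        for rec_title, rated_on in record:
            if title == rec_title and abs((rated_on - seen_on).days) <= tolerance_days:
                score += 1
                break
    return score

best = max(anonymous_records,
           key=lambda uid: match_score(anonymous_records[uid], auxiliary_info))
print(f"Most likely identity of the target: {best}")
```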
Now we get to the key question: Why should anyone care about their Netflix history being matched up to their identity – they’re just movies, right?
As the researchers noted in an example in their paper, when they investigated the ratings of a user, they were able to deduce his political and religious opinions, based on his scores of movies such as Power and Terror: Noam Chomsky in Our Times, and Fahrenheit 9/11, or Jesus of Nazareth and The Gospel of John, respectively.
It’s also likely that you could find strong correlations between a person’s viewing history and their sexuality, or a number of other aspects of their lives that many people prefer to keep private.
As the researchers so astutely pointed out:
The issue is not “Does the average Netflix subscriber care about the privacy of his movie viewing history?,” but “Are there any Netflix subscribers whose privacy can be compromised by analyzing the Netflix Prize dataset?”
The answer is clearly yes, as the researchers showed that they could deduce several different kinds of sensitive information just from the person’s Netflix history.
This isn’t just academics proving a point, it is a practical attack that threat actors can use to figure out private information about individuals, even if a database has supposedly been anonymized. Netflix was even sued and settled a case over the issue.
The underlying problem extends far beyond Netflix and films. Terrifying volumes of data are collected on us, and it is often anonymized either for more secure storage, or so that it can be publicly released for various purposes.
But what happens if something like your medical records had supposedly been anonymized, and then either released publicly or accessed by a hacker? If the data could be deanonymized as in the Netflix example, it would absolutely shatter your privacy and could lead to a host of crimes committed against you, such as identity theft or insurance fraud.
Does data collection have to be dangerous?
We can’t deny that algorithms are convenient and offer numerous benefits. However, it is still reasonable to be worried about their potential downsides. The good news is that in certain situations, we can get the benefits that come from data collection and machine learning algorithms, without the invasive breaches of our privacy.
To give credit where credit is due, there have been many promising steps in privacy reform from major companies over the past few years, although we still have a long way to go. Among the more promising techniques are federated learning and, of course, our main focus, differential privacy.
Sidetracking through the social sciences: The randomized response technique & differential privacy
The easiest way to explain differential privacy is to look at something that is essentially a much simpler version of it. It’s known as the randomized response technique.
If scientists are investigating something sensitive, perhaps people’s criminal or sexual histories, how can they know that people will be honest in their surveys? For a multitude of reasons, many of us are not willing to be truthful about such private matters to a random person in a lab coat.
We don’t want permanent records of our intimate moments or indiscretions, nor are we comfortable telling someone we just met our darkest secrets. This makes it incredibly difficult to gather data in these sensitive areas.
In 1965, S. L. Warner came up with a solution. Let’s say he wanted to know whether people had ever stolen candy from a baby. Because people would be ashamed to admit such a thing, Warner knew that he couldn’t rely on their answers.
If 99 out of 100 people denied it, was that the real truth? How could he figure out what percentage of people were lying?
He didn’t. Instead, Warner came up with a way to help people be more comfortable telling the truth. The randomized response technique was expanded into a number of different methods over the following years. One of the simplest involves coin flips.
A researcher will approach a person and explain what they are doing. They tell the participant that they will be asking them a sensitive question, but to protect their privacy, first they will ask the respondent to flip a coin and keep the result hidden from the researcher.
If the respondent flips a heads, they are to answer yes, no matter what the true answer is. If it lands on tails, they are to answer truthfully.
When the researcher asks the question, “Have you ever stolen candy from a baby?” and the respondent says “Yes”, the researcher has no way of knowing whether the respondent really did steal candy from a baby.
The respondent may be saying yes because the rules demanded it, or they may be admitting the truth. In theory, this protects the respondent, so they should be more willing to tell the truth when confronted with sensitive questions.
Let’s say that the researcher got 100 responses to the question, 75 of which were “Yes”. Knowing the 50/50 split of a coin toss, they can deduce that roughly 50 of those “Yeses” were forced by a heads flip, while the remaining 25 came from the 50 people who answered truthfully. Since the other 25 truth-tellers said “No”, the researcher can conclude that about 50 percent of people steal candy from babies.
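The arithmetic generalizes: the observed yes-rate should be about 0.5 + 0.5 × (true rate), so the researcher can simply invert that formula. A short simulation (with a made-up population in which 50 percent really are candy thieves) shows the estimate landing close to the true figure:

```python
import random

def randomized_response(truth: bool) -> bool:
    """Answer a sensitive question using a coin-flip randomized response.

    Heads (probability 0.5): answer "yes" regardless of the truth.
    Tails: answer truthfully.
    """
    return True if random.random() < 0.5 else truth

def estimate_true_rate(answers) -> float:
    """Invert yes_rate = 0.5 + 0.5 * true_rate to recover the underlying rate."""
    yes_rate = sum(answers) / len(answers)
    return 2 * (yes_rate - 0.5)

# Simulate 100,000 respondents, 50% of whom really did steal candy from a baby.
population = [random.random() < 0.5 for _ in range(100_000)]
answers = [randomized_response(person) for person in population]
print(f"Estimated share of candy thieves: {estimate_true_rate(answers):.2%}")
```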
There are a few assumptions involved in this method, and the results aren’t overly accurate in the social sciences, so other techniques are often employed instead. But that’s not the point.
The main takeaway is that the coin flip is a simple way to inject random data (the forced heads) into the database, and doing so protects the information that the respondents give.
Respondents don’t have to worry about their information being misused or being made public, because they have plausible deniability. Even if they answered truthfully that they are evil candy stealers, it doesn’t matter.
No one who reads the survey results will be able to tell whether they are truly one of the candy stealers, or were just forced to answer yes based on the coin flip.
In essence, this is how differential privacy techniques work. However, they are much more complex, and are able to give more accurate results than a simple coin flip.
If you’re not mathematically inclined, you can think of the differential privacy algorithms we actually use as extremely complicated versions of the above. If you are, you can feast on some of the equations in this paper by Cynthia Dwork.
Regardless, the basic theory still holds up – if we add randomness into the data, we can protect the private information of individuals, while still having a useful set of data that we can analyze.
Models of differential privacy
Differentially private algorithms have the potential to protect our data while still enabling reasonably accurate machine learning. The two most common models are global differential privacy and local differential privacy.
Global differential privacy
Under the global differential privacy model, the raw data of individuals is collected and analyzed by some central body, which would often be a tech company. The differential privacy algorithms are applied to the data in aggregate. While the private individual information may never be publicly released, it has been collected somewhere in its raw form.
This doesn’t have to be too much of a concern if the organization is trusted and has high levels of security in place. However, if either of these conditions isn’t met, differential privacy cannot keep individual information safe.
If the company publicly releases the differentially private database, your information won’t be able to be deanonymized from it. However, the global model does make it possible for the company to misuse your raw data. Hackers may also be able to access the raw data and use your private information to commit a range of crimes.
Local differential privacy
In contrast to global differential privacy, local differential privacy starts with the assumption that you cannot trust any party with your raw personal information. Instead of transferring your raw personal data to some central server for analysis, you keep it on your own device to eliminate the possibility of it being exposed or misused by either companies or hackers.
Under the local differential privacy model, you never send your data anywhere. Instead, the algorithm comes to your device. When the algorithm wants to learn from your data, it essentially asks your device questions. Your device then adds random noise to obscure the real private data in the answers, before sending them to the central server.
The central server then aggregates the obscured data from all of its subjects. In aggregate, the random noise largely cancels out, allowing the algorithm to learn from the private information without ever having had access to any one individual’s raw data.
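A heavily simplified sketch of that flow might look like the following: each device flips its true answer with some probability before reporting it, and the server corrects for the expected number of flips in aggregate. The flip probability and the 30 percent feature-usage rate are illustrative assumptions, not any real system’s settings.

```python
import random

def local_report(uses_feature: bool, flip_probability: float = 0.25) -> bool:
    """Runs ON THE DEVICE: flip the true answer with some probability before sending it."""
    return (not uses_feature) if random.random() < flip_probability else uses_feature

def server_estimate(reports, flip_probability: float = 0.25) -> float:
    """Runs ON THE SERVER: undo the expected effect of the flips in aggregate."""
    reported_rate = sum(reports) / len(reports)
    # reported_rate = true_rate * (1 - p) + (1 - true_rate) * p, solved for true_rate
    return (reported_rate - flip_probability) / (1 - 2 * flip_probability)

# 200,000 devices, 30% of which really use the feature. The server only ever
# sees the perturbed reports, never any individual's true answer.
truths = [random.random() < 0.3 for _ in range(200_000)]
reports = [local_report(t) for t in truths]
print(f"Estimated usage: {server_estimate(reports):.2%}")
```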
This model offers a greater degree of privacy because it eliminates the possibility of the raw personal data being misused by the central body, and of it being stolen by cybercriminals.
The limitations of differential privacy
Differential privacy is an exciting concept that could help to move us away from a world where almost every moment of our lives seems to be tracked. However, it’s not a miracle cure, and it does have a number of limitations.
Accuracy vs privacy
At the heart of differential privacy is a trade-off between accuracy and privacy. We’ll use an analogy to explain how this can cause complications. Let’s say you are a researcher who wanted to determine how a person’s financial success impacts how attractive they are perceived to be by others.
To do this, you created an online app where participants can look at a person’s picture alongside statistics on their income, wealth and suburb of residence, and then rate how attractive they seem.
Of course, including all of this information besides their pictures could be seen as a huge privacy violation – the participants may recognize some of the true identities of the subjects, which would end up revealing private financial data.
To combat this, you could blur the photos to obscure each person’s identity. Blurring the photos is analogous to adding random noise in differential privacy. If you only blurred the images slightly, survey participants would still be able to recognize them, so the same privacy problems would exist.
However, if you blurred them enough to hide their identities, participants wouldn’t be able to judge how attractive the subjects are. In instances like this, where a high degree of accuracy is important, differential privacy may not be an effective approach. It may lead to either inadequate privacy protection or results so inaccurate that they’re useless.
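In differential privacy proper, the “blurriness” knob is the parameter usually called epsilon: a smaller epsilon means stronger privacy but noisier answers. A quick sketch of how the average error of a simple noisy count grows as epsilon shrinks (the values below are purely illustrative):

```python
import random

def noisy_count(true_count: int, epsilon: float) -> float:
    """A count query with Laplace noise; the sensitivity of a count is 1."""
    scale = 1.0 / epsilon
    noise = random.expovariate(1 / scale) - random.expovariate(1 / scale)
    return true_count + noise

true_count = 1_000
for epsilon in (2.0, 0.5, 0.1, 0.01):
    errors = [abs(noisy_count(true_count, epsilon) - true_count) for _ in range(1_000)]
    avg_error = sum(errors) / len(errors)
    print(f"epsilon={epsilon:<5} average error ~ {avg_error:,.0f}")
```

Loosening privacy (a larger epsilon) gives tight answers; tightening it eventually buries the signal in noise.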
Although differential privacy may not be suitable for protecting private information in small groups and in various other scenarios, it still has a number of uses. As we’ve already seen from the examples above, there are a range of situations where data doesn’t have to be perfectly accurate, which lets us obtain worthwhile insights without significant privacy breaches.
Privacy budget
The more queries you ask of a database, the closer the privacy of the data subjects comes to being violated. Think of it as a game of 20 questions. Your first question might be something very general, such as “Am I human?” Even if the answer is “Yes”, it’s extremely unlikely that you would be able to guess who it is.
However, as you ask more and more questions, you get closer and closer to the answer. Once you get down to a question like “Am I the President?” it becomes much easier to guess the correct answer. Similarly, when a differentially private database is repeatedly queried, more and more information is revealed.
Over time, this can lead to the deanonymization of the data. This happens because the level of anonymization is diminished with each query. The more a database has been queried, the easier it is to use those query results to filter out the random noise and reconstruct the original private data.
To compensate for this, implementations of differential privacy include what’s called a privacy budget. This is essentially a control of how much data can be extracted through queries before it risks deanonymizing the data. Once this level has been reached, the data curator stops answering queries to protect the privacy of the data subjects.
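In formal treatments, the budget is usually expressed in terms of that same epsilon parameter. A toy curator that enforces one might look like this (the total budget and per-query epsilon values are arbitrary examples):

```python
import random

class PrivateDatabase:
    """Toy curator that refuses to answer once the privacy budget is spent."""

    def __init__(self, values, total_budget: float):
        self._values = values
        self._remaining_budget = total_budget

    def noisy_count(self, predicate, epsilon: float) -> float:
        if epsilon > self._remaining_budget:
            raise RuntimeError("Privacy budget exhausted; no more queries allowed.")
        self._remaining_budget -= epsilon
        true_count = sum(1 for v in self._values if predicate(v))
        # Laplace noise for a count query (sensitivity 1, scale 1/epsilon)
        noise = random.expovariate(epsilon) - random.expovariate(epsilon)
        return true_count + noise

db = PrivateDatabase(values=[17, 23, 35, 41, 52, 68], total_budget=1.0)
print(db.noisy_count(lambda age: age > 30, epsilon=0.4))
print(db.noisy_count(lambda age: age > 50, epsilon=0.4))
print(db.noisy_count(lambda age: age > 60, epsilon=0.4))  # raises: budget exhausted
```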
The exact amount varies according to a number of other parameters; however, privacy budgets are generally quite conservative and are calculated based on worst-case scenarios.
Real-world applications of differential privacy
Differential privacy isn’t just some theoretical idea that we are hoping we can use in the future. It has already been adopted in a range of different tasks.
The US Census
Seal of the United States Census by Mysid licensed under CC0.
Every 10 years, the US conducts a census to give it insight into the demographics and other happenings inside the country. This information is invaluable in planning for the future. The 2020 Census was the first that could be widely completed online.
Collecting this much personal data raises serious fears over security and just how the information will be kept private. To combat the risks, the US Census Bureau implements differential privacy into its process.
Census data is usually only published in an anonymized and aggregated form, but as we discussed earlier, it isn’t necessarily complicated to deanonymize this kind of data. Following the 2010 Census, the Census Bureau was able to reidentify data from 17 percent of the US population. This is worrying for anyone with privacy concerns, so the move toward differential privacy is a positive step.
For the 2020 Census, the Census Bureau carefully balanced the trade-off between accuracy and privacy. Completely eliminating privacy risks involves more noise in the data, which decreases its accuracy and usefulness. On the other hand, a high level of accuracy would require no data noise, which greatly increases the privacy risks.
As part of this trade-off, data from smaller communities will be more affected by inaccuracy than larger populations. This includes rural areas and smaller racial groups.
RAPPOR
In 2014, researchers from Google and the University of Southern California released a paper called RAPPOR: Randomized Aggregatable Privacy-Preserving Ordinal Response. In it, they outlined a system for anonymously crowd-sourcing statistics.
As described in the paper, RAPPOR is differentially private, allowing “the forest of client data to be studied, without permitting the possibility of looking at individual trees.” RAPPOR uses a local differential privacy model, where the data stays on the device rather than being collected on a central server.
It is set up to give individuals strong plausible deniability, while still enabling organizations to gather useful statistics such as histograms, frequencies and category information.
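To give a flavor of the mechanism, here is a heavily simplified, RAPPOR-inspired sketch: each client one-hot encodes its value, flips each bit with some probability before reporting, and the server corrects the observed frequencies. The real RAPPOR design uses Bloom filters, cohorts and a two-stage (permanent plus instantaneous) randomization, none of which appear here; the candidate list and parameters are invented for illustration.

```python
import random

CANDIDATE_HOMEPAGES = ["example.com", "search-hijacker.biz", "news.site", "other"]

def client_report(true_value: str, flip_probability: float = 0.25):
    """Runs on the client: one-hot encode the value, then flip each bit
    independently with some probability before it ever leaves the device."""
    bits = [1 if candidate == true_value else 0 for candidate in CANDIDATE_HOMEPAGES]
    return [bit ^ 1 if random.random() < flip_probability else bit for bit in bits]

def estimate_histogram(reports, flip_probability: float = 0.25):
    """Runs on the server: correct each bit position's observed frequency for the flips."""
    n = len(reports)
    histogram = {}
    for i, candidate in enumerate(CANDIDATE_HOMEPAGES):
        observed = sum(report[i] for report in reports) / n
        # observed = true * (1 - p) + (1 - true) * p, solved for true
        histogram[candidate] = (observed - flip_probability) / (1 - 2 * flip_probability)
    return histogram

truths = random.choices(CANDIDATE_HOMEPAGES, weights=[60, 10, 25, 5], k=100_000)
reports = [client_report(t) for t in truths]
print(estimate_histogram(reports))
```

No single report reveals what any one user’s home page actually is, yet the recovered histogram tracks the true popularity of each candidate quite closely.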
Google has deployed RAPPOR as an opt-in mechanism for Chrome users. It gathers data on the sites people set as their home pages, so that Google can better understand the malware that tries to change them. About 14 million users participated in the study, and RAPPOR allowed them to do so without compromising their privacy.
RAPPOR is built into Chromium, which is the open-source component of the browser. This is a positive step from a privacy perspective, because anyone can take a look at the RAPPOR source code. If they have the right background knowledge, they can see what the code is actually doing.
This allows developers to analyze software for security weaknesses, and although RAPPOR’s implementation in Chrome isn’t perfect, it’s certainly a move in the right direction for privacy.
RAPPOR has also been released under an open-source license “so that anybody can test its reporting and analysis mechanisms, and help develop the technology”. Firefox developers have expressed interest in using RAPPOR to safely collect telemetry data, but it has not been implemented at this stage.
Apple
Apple has deployed differentially private mechanisms in a number of its features, including:
- QuickType suggestions
- Lookup Hints
- Emoji suggestions
- Health Type Usage
- Safari Crashing Domains
- Safari Energy Draining Domains
- Safari Autoplay Intent Detection
Ostensibly, as with most other implementations of differential privacy, the company’s goal is to harvest data that helps make its products more effective, without breaching the privacy of its users.
Like Google’s RAPPOR, Apple’s features deploy local differential privacy and add noise to user data before it is shared with the central servers. The company does not store any identifying data alongside the data it uses to train its algorithms, which is a good sign that it takes the process seriously.
Apple also has measures in place that prevent an attacker from being able to discern information from correlated metrics. While Apple has done well in certain areas of its systems, differential privacy researchers have also criticized it for some of the parameters it uses, and the length of time it stores data for.
Apple disputed these assertions, arguing that its system has greater levels of protection than the researchers acknowledged. In the Wired article linked above, one of the study’s authors, University of Southern California professor Aleksandra Korolova responded to Apple’s defenses by highlighting that the point of differential privacy is to make sure that a system is secure, even if the company in control of the system engages in the worst behaviors.
Essentially, the system should be set up in a way where users don’t have to trust the company to do the right thing – so that it, its employees or hackers can’t deanonymize data even if they want to.
The other major issue with Apple’s approach is that it withholds more information than Google’s open-source RAPPOR. As an example, it took researchers months to figure out a key parameter that was critical for assessing the privacy of the system. The company could simply have published it for all to see.
While Apple’s approach isn’t perfect, it’s still a welcome move forward. Hopefully, other major tech companies will follow in its footsteps and develop similar privacy mechanisms.
Differential privacy & coronavirus
Coronavirus Disease 2019 by US State Department licensed under CC0.
Amid the coronavirus pandemic, many tech companies are also stepping up to do their part. One example is Google’s Covid-19 Community Mobility Reports, which draw on aggregated data from users who have turned on Location History – the same data Google Maps uses to show how busy certain places are.
It’s hoped that the Covid-19 Community Mobility reports will “provide insights into what has changed in response to work from home, shelter in place, and other policies aimed at flattening the curve of this pandemic.”
This data could help officials make effective decisions for combating the pandemic. For example, if a city finds that certain bus stops are too crowded to effectively social distance, it could increase the number of services it offers to help reduce contact between individuals.
Under normal circumstances, this may seem like a worrying development, so there are a few things that we should clear up.
People who have Location History on are already having their location tracked. The only difference now is that this information will be part of the aggregate that is published in the reports.
While some may want to help officials in any way they can, others may be concerned about their data being used. The good news is that the Covid-19 Community Mobility Reports don’t involve the collection of raw individual data.
Instead, they use differential privacy to collect data that grants useful insights into the group, without compromising the privacy of individuals.
Although Google’s differential privacy isn’t perfect, the company does seem to be committed to protecting individuals while it hands over data to combat coronavirus. If you’re still concerned, Location History is an opt-in service, and unless it has been turned on, your data will not be collected as part of the reports.
If you want to help out in whichever small ways you can, having your Location History turned on will contribute to making the results a little more accurate. However, doing so means that Google will also use your location information for other purposes.
The promise of differential privacy
Some of the ideas behind differential privacy have been around since the sixties, but it wasn’t until the mid-2000s that its defining paper was released. Even then, it lurked mainly in the realm of academia until 2014, when Google released RAPPOR.
Although the concept still hasn’t seen much widespread implementation, there is some promise for its future, as well as the future of our collective privacy. A range of tech companies, both large and small, are already developing services that rely on the concept.
As we discussed earlier, differential privacy has also gotten more coverage during the coronavirus crisis because it offers us a way that we can collect valuable data that helps to control the spread, without causing significant privacy breaches.
Moreover, we are all starting to become aware of the large-scale data collection that is taking place, as well as how it can harm our privacy. In 2018, Europe began enforcing the GDPR, which is a landmark set of regulations for protecting people and their data.
Around the same time, companies began to pivot, and major data collectors such as Google and Facebook began emphasizing privacy in their products and marketing, as well as offering users more options in their privacy settings.
As Zuckerberg said at 2019’s F8 conference, “The future is private.” While his track record may make it hard to believe him, we can still hope that concepts like differential privacy can lead us to a more private future. If data collection and machine learning can be effective without having to invade people’s privacy, everyone wins.