What the GDPR Means for Data Science

The General Data Protection Regulation (GDPR) is no longer “on the way.” It’s here, and it’s influencing data science in critical ways. Right now, data scientists are asking questions, such as:

What kind of data is subject to the GDPR?
What are new limits on data profiling and processing?
How does the GDPR impact the day-to-day of a practicing data scientist?
What’s the bottom-line impact of the GDPR?

As the Director of Online Program in Data Science at Notre Dame, I had the opportunity to speak with our legal, business and data science experts to answer these crucial questions during our recent GDPR webinar.

Watch or listen to our recording to learn more about how the GDPR is impacting data science.

Webinar Highlights

Here are some highlights from the Notre Dame faculty featured in our GDPR webinar.

What kind of data is subject to the GDPR?

Mark McKenna, Professor of Law, The Law School: “What this applies to is personally identifiable data. So data that is or can be associated with a unique individual. Many companies use several kinds of data that are not personally identifiable and that really won’t be impacted in a significant way by this. But anything that can be connected to a particular person is potentially subject to the GDPR.

The biggest open question from a legal perspective is what the GDPR considers in terms of compliance and in terms of things that fall into legitimate interests of the business. The major question is how broadly will that be interpreted.

Some companies are going to argue that all of their functions are legitimate interests and I think that’s probably the route we’re going to see Facebook go. We’ll see how aggressive European regulators are going to be about pursuing that and how broadly it gets interpreted.”

What is considered “legitimate interest”?

McKenna: “We might see some divergence in how this term is interpreted and applied for academic institutions versus businesses.

The pre-existing law prior to GDPR made it pretty clear that businesses couldn’t claim there was legitimate interest just because it made them more money or made their product better. That wasn’t enough to say there was a legitimate interest.

But for an academic institution, they are going to claim legitimate interest for research purposes.

It wouldn’t surprise me to see interpretations of legitimate interest diverge depending on who is involved.

But, again, legitimate interest doesn’t just mean more money for you. What the regulators are looking for is that this data makes the user experience better.”

What are new limits on data profiling and processing?

Fang Liu, Associate Professor, Applied and Computational Mathematics and Statistics: “Now some level of anonymization needs to be applied. To do that, you have to first define the data and information you are sharing and releasing under two categories: restricted and unrestricted data.

If the data has highly sensitive information, I think you are required to do something so your sensitive information is restricted and not disclosed to the public.”

McKenna: “The GDPR is fundamentally about getting consent for use of data so you are entitled to use the data that someone freely consents to give you. One important difference is that the threshold for what’s considered consent in Europe is higher than it is in the US.

The consent for the data has to be specific to the purpose of the data usage, and if someone who is using data wants to reuse it for a different purpose than the one it was originally gathered for, the regulation would restrict their ability to do so.

You need to anonymize away things that you don’t have a meaningful need to use because that just puts risk into the data that doesn’t need to be there for purposes of what you’re doing.”
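
As a rough illustration of the "remove what you don't need" advice above, here is a minimal Python sketch of data minimization. The field names and the `NEEDED_FIELDS` set are hypothetical, chosen only for the example; which fields a given purpose really requires is a judgment call for each project.

```python
# Hypothetical example: the field names below are assumptions, not from the webinar.
# Fields that the stated analysis purpose actually requires.
NEEDED_FIELDS = {"age_band", "region", "purchase_total"}


def minimize(record: dict, needed: set = NEEDED_FIELDS) -> dict:
    """Return a copy of the record that keeps only the needed fields."""
    return {key: value for key, value in record.items() if key in needed}


raw = {
    "name": "Jane Doe",
    "email": "jane@example.com",
    "age_band": "30-39",
    "region": "EU-West",
    "purchase_total": 120.50,
}

minimal = minimize(raw)
print(minimal)  # name and email are no longer carried through the pipeline
```

Dropping direct identifiers like this reduces risk, but it does not by itself make a dataset anonymous, since the remaining fields can sometimes still be combined to single out a person.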

Are we going to see more questions asking us, “Is it okay to use your data?”

McKenna: “That’s possible. I think that one thing companies are going to have to weigh is how much they want to “annoy” their customers by constantly asking if they can reuse these things. I think the other thing we’re going to see is:

European regulators being more suspicious of American-style privacy policies that just say, “Click this box and we can do whatever we want with your data.” That’s not going to fly nearly as well in Europe.

The business strategy question is: Do you want to keep going back and asking for permission or do you just change what you’re doing as a business?”

How does the GDPR impact the day-to-day of practicing data scientists?

Scott Nestler, Associate Teaching Professor, Mendoza College of Business: “We tend to think and teach [that] more data is always better.

Consider sample size determination and power calculations, things about which, in this era of Big Data, people have said, “Oh, we don’t need to do that. Let’s just go get all the data.” Well, we might have to go back to using some of the more traditional methods from statistics up front in data science projects.

Pseudonymization is going to create challenges when we go to integrate different data sources.

I believe integrating data is where a lot of the insights out there exist. For example, in working with student-athletes, we collect data from wearable sensors and also do more laboratory measures.

It’s by combining and melding these sources that we truly unlock some insights.

It doesn’t always exist in one system and to do that you have to link records across different databases with some unique identifiers.”
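
To make Nestler's earlier point about sample size determination and power calculations concrete, here is a minimal sketch using only the Python standard library. It assumes a two-sided comparison of two group means with equal group sizes and uses the normal approximation; the function name and the default values for `alpha` and `power` are illustrative choices, not something specified in the webinar.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_group(effect_size: float,
                          alpha: float = 0.05,
                          power: float = 0.80) -> int:
    """Approximate subjects needed per group for a two-sample comparison of means.

    effect_size is the standardized difference (Cohen's d). Uses the normal
    approximation n = 2 * ((z_(1 - alpha/2) + z_power) / d) ** 2.
    """
    if effect_size <= 0:
        raise ValueError("effect_size must be positive")
    standard_normal = NormalDist()
    z_alpha = standard_normal.inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_power = standard_normal.inv_cdf(power)          # quantile for the target power
    return ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)


n = sample_size_per_group(0.5)
print(n)  # 63 per group under the normal approximation
```

An exact calculation based on the t distribution gives a slightly larger number, but the idea is the same: decide up front how much data the question needs instead of collecting everything available.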

Liu: “If everything is anonymized you really cannot identify a person. It’s really hard to merge different datasets together.

You might study and collect a bunch of attributes from a set of subjects and do another study with a different set of attributes, but without unique identifiers, it’s going to be really hard to merge these two sets of data together.

This is really a challenge going forward.”
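
One common way to keep records linkable without storing the raw identifier is a keyed hash. The sketch below is only an illustration of that idea, not the approach used by the speakers: the athlete IDs, field names, and `SECRET_KEY` are hypothetical, and it assumes both datasets are pseudonymized with the same key.

```python
import hashlib
import hmac

# Hypothetical key for illustration; in practice it would be stored securely
# and kept separate from the pseudonymized data.
SECRET_KEY = b"replace-with-a-securely-stored-key"


def pseudonymize(identifier: str, key: bytes = SECRET_KEY) -> str:
    """Map an identifier to a stable pseudonym with HMAC-SHA256."""
    return hmac.new(key, identifier.encode("utf-8"), hashlib.sha256).hexdigest()


def replace_ids(rows: list) -> list:
    """Return copies of the rows with 'athlete_id' replaced by a pseudonym 'pid'."""
    result = []
    for row in rows:
        row = dict(row)  # copy so the input is not modified
        row["pid"] = pseudonymize(row.pop("athlete_id"))
        result.append(row)
    return result


# Hypothetical records from two separate systems.
wearable = [{"athlete_id": "A17", "avg_heart_rate": 142},
            {"athlete_id": "B03", "avg_heart_rate": 155}]
lab = [{"athlete_id": "B03", "vo2_max": 58.1},
       {"athlete_id": "A17", "vo2_max": 61.4}]

wearable_p = replace_ids(wearable)
lab_p = replace_ids(lab)

# Link the two sources on the pseudonym instead of the raw identifier.
lab_by_pid = {row["pid"]: row for row in lab_p}
linked = [{**w, **lab_by_pid[w["pid"]]} for w in wearable_p if w["pid"] in lab_by_pid]
print(linked)
```

The same identifier always maps to the same pseudonym, so the merge still works. Note that under the GDPR pseudonymized data is still treated as personal data, because anyone holding the key can re-link it to a person.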

What are the more practical and technical implications of the GDPR?

Liu: “To protect the individual-level data, what I do is create a synthetic dataset by creating a statistical model with pseudo records. Researchers and the public can still use surrogate data sets to do research to answer questions, but there’s no real person in those pseudo records.

What we do as data scientists and what interests me is the population level, the aggregate statistics to find what’s the pattern in the whole data set rather than just focusing on the individual data set.

If we can find a balance, where we don’t disclose any individual-level data but maintain the utility of the dataset at the population level, I think there is a sweet spot to go after.”
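
The general idea Liu describes, fitting a statistical model and releasing records drawn from it, can be sketched in a few lines. This is a deliberately simple illustration and not Liu's actual method: the "real" records are made up, the model is an assumed bivariate normal, and the sketch offers no formal privacy guarantee.

```python
import random
from math import sqrt
from statistics import mean, stdev

# Hypothetical "real" records: (age, resting heart rate).
real = [(23, 61), (35, 68), (41, 72), (29, 64),
        (52, 77), (47, 74), (31, 66), (38, 70)]

ages = [age for age, _ in real]
rates = [rate for _, rate in real]

# Population-level summaries that the synthetic data should preserve.
mu_a, mu_r = mean(ages), mean(rates)
sd_a, sd_r = stdev(ages), stdev(rates)
n = len(real)
rho = sum((a - mu_a) * (r - mu_r) for a, r in real) / ((n - 1) * sd_a * sd_r)


def synthesize(count: int, seed: int = 0) -> list:
    """Draw pseudo records from a bivariate normal fitted to the real data."""
    rng = random.Random(seed)
    residual = sqrt(max(0.0, 1 - rho ** 2))  # guard against rounding error
    rows = []
    for _ in range(count):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        age = mu_a + sd_a * z1
        rate = mu_r + sd_r * (rho * z1 + residual * z2)
        rows.append((age, rate))
    return rows


synthetic = synthesize(5000)
print(round(mean(a for a, _ in synthetic), 1), round(mu_a, 1))
```

None of the synthetic rows corresponds to a real person, yet the means, spreads, and correlation stay close to those of the original data, which is the population-level utility the quote refers to. With a model fitted this closely to a small dataset, individual information can still leak, which is why the balance between disclosure risk and utility is the hard part.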

What is the good news for data scientists under the GDPR?

Nestler: “The big one is it’s not enough to do good data science or to do good data analysis without being able to communicate that to [the] others who were not involved.

In particular, I think it’s going to increase the demand and need for those who can communicate how algorithms work, how a particular analysis was done, and why certain results were obtained to non-technical audiences.

Those who can communicate the results of technical analyses to those who don’t have a background or experience in it, there’s going to be an increased demand for that.”

McKenna: “A slight take on that point. One of the things this might push is a more open dialogue between the companies and the protection authorities to create best practice frameworks so that everybody is not in the dark saying, “I don’t know which data I can use from before.”

Things like, here are some principles on good data practices that we can all agree to.”

What is the legal accountability and risk for a data scientist?

McKenna: “The regulation is not aimed at the individual data scientist. It would be unusual for a data scientist to be held personally responsible.

My impression is that no one is anticipating this enforcement being directed at individual data scientists.”

What’s the bottom-line impact of the GDPR?

Liu: “For the general public, I think you should feel better with the GDPR because now your information is better protected and regulated. If you no longer want to share your data, you have a right to be forgotten.”

Nestler: “There are four components data scientists should think about: 1) Is it technically possible? 2) Is it legal? 3) Is it something that’s in my or my organization’s best interest? 4) Is it ethical and something we should do? The question is how we get from wherever we are with the GDPR situation, and that [question] is hopefully at the intersection of all four of these.”

McKenna: “One of the things that lawyers get concerned about with big regulations is that this just gets turned into a box-checking exercise. It would be a fantastic outcome if people see that individual data scientists and organizations are thinking hard about what they are doing when they collect data.”

Conclusion

Sometimes, as data scientists, we forget that the vast majority of the population is not well informed on some of these issues. If the GDPR leads us to a better-informed population that knows not just what can happen with their data but also what their rights are with data and what it means to not accept a privacy policy, transparency might actually be a very good part of this.

We’ve focused so much on the technical side of data science, but the world is evolving. At Notre Dame, we believe in developing three-dimensional data scientists who can do the technical aspects, but also communicate effectively and act ethically. The acting ethically part is not easy, but now we’re moving this to the forefront and not just focusing on the technical side of data science.