*Differential Privacy Basics Series Conclusion and Important list of resources.*

*Summary: This is the sixth and FINAL blog post of the “Differential Privacy Basics Series,” summarizing all of the previous posts. For more posts like these on Differential Privacy, follow **Shaistha Fathima** on Twitter.*

# Differential Privacy Basics Series

Before we head towards the conclusion — let’s have a look at some of the properties of Differential Privacy.

# Qualitative Properties of Differential Privacy (DP)

- *Protection against arbitrary risks*, moving beyond protection against re-identification.
- *Automatic neutralization of linkage attacks*, including all those attempted with all past, present, and future datasets and other forms and sources of auxiliary information.

Linking attacks: A linking attack involves combining auxiliary data with de-identified data to re-identify individuals. In the simplest case, a linking attack can be performed via a join of two tables containing these datasets.

Simple linking attacks are surprisingly effective:

1- Just a single data point is sufficient to narrow things down to a few records.

2- The narrowed-down set of records helps suggest additional auxiliary data which might be helpful.

3- Two data points are often good enough to re-identify a huge fraction of the population in a particular dataset.

4- Three data points (gender, ZIP code, date of birth) uniquely identify 87% of people in the US.
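The join described above can be sketched in a few lines of plain Python. This is a toy example with made-up records and field names: a “de-identified” medical table joined against a public voter roll on the three quasi-identifiers.

```python
# A minimal sketch of a linking attack on hypothetical toy data:
# a "de-identified" medical table and a public voter roll.

# De-identified dataset: names removed, quasi-identifiers kept.
medical = [
    {"zip": "47677", "dob": "1965-07-22", "sex": "F", "diagnosis": "heart disease"},
    {"zip": "47602", "dob": "1971-02-13", "sex": "M", "diagnosis": "diabetes"},
]

# Auxiliary (public) dataset that still contains names.
voters = [
    {"name": "Alice", "zip": "47677", "dob": "1965-07-22", "sex": "F"},
    {"name": "Bob", "zip": "47602", "dob": "1971-02-13", "sex": "M"},
]

QUASI_IDS = ("zip", "dob", "sex")

def key(row):
    # Build the join key from the quasi-identifiers.
    return tuple(row[k] for k in QUASI_IDS)

# The attack is just a join on (zip, dob, sex).
index = {key(v): v["name"] for v in voters}
reidentified = {index[key(m)]: m["diagnosis"] for m in medical if key(m) in index}
print(reidentified)  # each patient's name linked back to their diagnosis
```

Note that nothing here is sophisticated: the entire attack is a dictionary lookup, which is why simple linking attacks are so effective.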

*Quantification of privacy loss*: Differential privacy is not a binary concept; it has a measure of privacy loss. This permits comparisons among different techniques:

(i) For a fixed bound on privacy loss, which technique provides better accuracy?

(ii) For a fixed accuracy, which technique provides better privacy?

- *Composition*: The quantification of loss also permits the analysis and control of cumulative privacy loss over multiple computations. Understanding the behavior of differentially private mechanisms under composition enables the design and analysis of complex differentially private algorithms from simpler differentially private building blocks.
- *Group Privacy*: DP permits the analysis and control of privacy loss incurred by groups, such as families.
- *Closure Under Post-Processing*: DP is immune to post-processing. A data analyst, without additional knowledge about the private database, cannot compute a function of the output of a differentially private algorithm M and make it less differentially private. That is, a data analyst cannot increase privacy loss, either under the formal definition or even in any intuitive sense, simply by sitting in a corner and thinking about the output of the algorithm, no matter what auxiliary information is available.
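Two of these properties can be sketched in a few lines. The snippet below is a toy illustration, not taken from any DP library: two Laplace-mechanism queries compose to a total privacy loss of ε₁ + ε₂ (basic sequential composition), and rounding the noisy output (post-processing) costs nothing extra.

```python
# A toy sketch of sequential composition and closure under post-processing.
import math
import random

def laplace_noise(scale):
    # Inverse-CDF sampling of Laplace noise using only the stdlib.
    u = random.random() - 0.5
    return -scale * math.copysign(1, u) * math.log(1 - 2 * abs(u))

def laplace_mechanism(true_value, sensitivity, epsilon):
    """Release true_value with noise scaled to sensitivity / epsilon."""
    return true_value + laplace_noise(sensitivity / epsilon)

# Two counting queries (sensitivity 1) against the same database.
eps1, eps2 = 0.5, 0.5
noisy_count_a = laplace_mechanism(42, sensitivity=1.0, epsilon=eps1)
noisy_count_b = laplace_mechanism(17, sensitivity=1.0, epsilon=eps2)

# Basic sequential composition: the total privacy loss is the sum.
total_epsilon = eps1 + eps2  # 1.0

# Closure under post-processing: rounding the noisy output (or applying any
# other function that never touches the raw data) costs no extra privacy.
rounded = round(noisy_count_a)
```

The key point is that the privacy accounting (`total_epsilon`) depends only on how the noisy answers were produced, not on what is done with them afterwards.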

# Granularity of Privacy: A final remark on the DP definition

*(Note: Granularity with respect to data means the level of detail in a set of data.)*

Differential privacy promises that the behavior of an algorithm will be roughly unchanged even if a single entry in the database is modified. *But what constitutes a single entry in the database?*

For example, consider a **database that takes the form of a graph**. Such a database might encode a social network: each individual i ∈ [n] is represented by a vertex in the graph, and friendships between individuals are represented by edges.

This brings us to two situations:

(i) *DP at a level of granularity corresponding to individuals.*

This would require DP algorithms to be insensitive to the addition or removal of any vertex from the graph. This gives a *strong privacy guarantee*, *but might in fact be stronger than we need.*

The addition or removal of a single vertex could, after all, add or remove up to n − 1 edges in the graph. Depending on what it is we hope to learn from the graph, insensitivity to that many edge removals might be an impossible constraint to meet.

(ii) *DP at a level of granularity corresponding to edges.*

This would require DP algorithms to be insensitive *only* to the addition or removal of single, or small numbers of, edges from the graph. This is of course a weaker guarantee, but might still be sufficient for some purposes.

That is, if we promise ε-differential privacy at the level of a single edge, then no data analyst should be able to conclude anything about the existence of any subset of (1/ε) edges in the graph.
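The gap between the two granularities shows up directly in how much noise a query needs. Below is a minimal sketch (toy graph, illustrative names) of an edge-count query: under edge-level DP the sensitivity is 1, while under vertex-level DP it can be as large as n − 1.

```python
# A minimal sketch contrasting edge-level and vertex-level sensitivity
# for a simple edge-count query on a toy social graph.
import math
import random

def laplace_noise(scale):
    # Inverse-CDF sampling of Laplace noise using only the stdlib.
    u = random.random() - 0.5
    return -scale * math.copysign(1, u) * math.log(1 - 2 * abs(u))

# Vertices are users; undirected edges are friendships.
n = 4
edges = {(1, 2), (1, 3), (2, 3), (3, 4)}

# Edge-level DP: adding or removing one edge changes the count by 1.
edge_sensitivity = 1
# Vertex-level DP: removing one vertex can remove up to n - 1 edges.
vertex_sensitivity = n - 1

epsilon = 1.0
noisy_edge_level = len(edges) + laplace_noise(edge_sensitivity / epsilon)
noisy_vertex_level = len(edges) + laplace_noise(vertex_sensitivity / epsilon)
# For the same epsilon, the vertex-level release needs noise three times
# larger here, so edge-level DP yields a more accurate answer.
```

This is exactly the trade-off described above: vertex privacy is stronger, but for the same ε it forces much noisier answers.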

In some circumstances, **large groups of social contacts might not be considered sensitive information.** For example, an individual might not feel the need to hide the fact that the majority of his contacts are with individuals in his city or workplace, because where he lives and where he works are public information.

Similarly, there might be a **small number of social contacts whose existence is highly sensitive.** For example, a prospective new employer, or an intimate friend.

*In this case, edge privacy should be sufficient to protect sensitive information, while still allowing a fuller analysis of the data than vertex privacy.*

**Edge privacy will protect such an individual’s sensitive information provided that he has fewer than (1/ε) such friends.**

Another example: a **differentially private movie recommendation system** can be designed to protect the data in the training set at the “event” level of single movies (hiding the viewing/rating of any single movie, but not, say, an individual’s enthusiasm for cowboy westerns or gore), or at the “user” level of an individual’s entire viewing and rating history.

# Summarizing — answers to the what, why, when, where, and how

(Note: These are based on my current understanding, please post a comment if you would like to have a discussion on any of it)

# What is Differential Privacy?

Differential privacy is a system, or framework, proposed for better data privacy. It is **not a property of databases, but a property of queries**. The intuition behind it is that we bound how much the output can change if we change the data of a single individual in the database.

That is, if adding or removing an individual’s data has a large effect on the output of the query, then the data has high sensitivity, and the chances are high that an adversary could combine the output with auxiliary information to learn about that individual. In other words, privacy is compromised!

To avoid such leaks, we add a controlled amount of statistical noise to obscure the contributions of individuals in the data set.

When training an AI model, noise is added while ensuring that the model still gains insight into the overall population, and thus provides predictions that are accurate enough to be useful, while at the same time making it tough for an adversary to make any sense of the data queried!
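The intuition above can be sketched concretely. In this toy example (the ages and all parameter values are made up), a noisy mean over many individuals stays accurate because the sensitivity of a mean shrinks as 1/n, even though any single person’s exact value is obscured.

```python
# A toy sketch of "noise hides the individual, not the population".
import math
import random

def laplace_noise(scale):
    # Inverse-CDF sampling of Laplace noise using only the stdlib.
    u = random.random() - 0.5
    return -scale * math.copysign(1, u) * math.log(1 - 2 * abs(u))

random.seed(0)
ages = [random.randint(18, 80) for _ in range(10_000)]
true_mean = sum(ages) / len(ages)

# For a mean of values bounded in [18, 80], changing one person's data
# moves the result by at most (80 - 18) / n.
epsilon = 0.5
sensitivity = (80 - 18) / len(ages)
noisy_mean = true_mean + laplace_noise(sensitivity / epsilon)
# The noise scale shrinks as 1/n, so the population-level answer stays
# useful even though any individual's exact age is hidden.
```

With 10,000 people the released mean lands very close to the truth, which is precisely why DP works well for population-level statistics and poorly for questions about one person.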

# Why do we use Differential Privacy?

In the current world, privacy is one of the major concerns. With all the data science and AI models being implemented the chances of user privacy leak have increased.

Sometimes, AI models can memorize details about the data they’ve trained on and could ‘leak’ these details later on. Differential privacy is a framework (using math) for measuring this leakage and reducing the possibility of it happening.

# When and Where can we use Differential Privacy?

# How can we use Differential Privacy?

PATE (Private Aggregation of Teacher Ensembles) analysis is one approach to implementing DP.

The PATE approach to providing differential privacy in machine learning is based on a simple intuition: if two different classifiers, trained on two different datasets with no training examples in common, agree on how to classify a new input example, then that decision does not reveal information about any single training example. The decision could have been made with or without any single training example, because both the model trained with that example and the model trained without it reached the same conclusion.
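The aggregation step of that intuition can be sketched as a noisy vote among teachers. This is a simplified illustration of PATE’s noisy-argmax aggregation, not code from the PATE paper or any library; the vote counts and parameters are made up.

```python
# A minimal sketch of PATE-style noisy aggregation of teacher votes.
import math
import random
from collections import Counter

def laplace_noise(scale):
    # Inverse-CDF sampling of Laplace noise using only the stdlib.
    u = random.random() - 0.5
    return -scale * math.copysign(1, u) * math.log(1 - 2 * abs(u))

def noisy_argmax(teacher_votes, num_classes, epsilon):
    """Pick the class whose Laplace-noised vote count is largest."""
    counts = Counter(teacher_votes)
    noisy = [counts.get(c, 0) + laplace_noise(1.0 / epsilon)
             for c in range(num_classes)]
    return max(range(num_classes), key=lambda c: noisy[c])

random.seed(0)
# 10 teachers, each trained on a disjoint data partition, vote on one input.
votes = [1, 1, 1, 1, 1, 1, 1, 1, 0, 2]
label = noisy_argmax(votes, num_classes=3, epsilon=2.0)
# With 8 of 10 teachers agreeing, the noise almost never flips the winner.
```

When the teachers strongly agree, the noisy winner matches the plurality vote with high probability, and the released label reveals little about any one teacher’s training data — exactly the intuition described above.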

For better theoretical understanding and explanation: Privacy and machine learning: two unexpected allies?

For practical code example: Detecting an Invisible Enemy With Invisible Data!

# Some Great Resources!!

- Exposed! A Survey of Attacks on Private Data
- Differential Privacy: A Primer for a Non-technical Audience
- Optimal Noise Adding Mechanisms for Approximate Differential Privacy
- Differential privacy and machine learning: Calculating sensitivity with generated data sets
- A Case Study on Differential Privacy
- Differential privacy: a comparison of libraries — IBM/differential-privacy-library (Python) vs google/differential-privacy (C++) vs brubinstein/diffpriv (R)
- Why differential privacy is awesome
- A Simpler Explanation of Differential Privacy

**Books**

- The Algorithmic Foundations of Differential Privacy
- Differential Privacy from Theory to Practice
- Mathematical Foundations of DP

**Comic**

- A brief introduction to differential privacy: A data protection plan for the 2020 census
- PATE Framework

**Videos**

- 5 part Lecture on Differential Privacy by Cynthia Dwork
- Differential Privacy and the People’s Data
- Privacy preserving AI — Lecture by Andrew Trask
- Hacking Deep Learning: Differential Privacy and Collaborative Learning- Anand Sarwate
- A short tutorial on differential privacy: Dr Borja Balle, Amazon Research
- Differential Privacy for Growing Databases

This is not directly related, but you might find it interesting — the podcasts by The Changelog, especially Practical AI.

# Overall References for this series:

- https://kth.diva-portal.org/smash/get/diva2:1112478/FULLTEXT01.pdf
- https://www.cis.upenn.edu/~aaroth/Papers/privacybook.pdf
- http://www.cleverhans.io/privacy/2018/04/29/privacy-and-machine-learning.html
- https://www.seas.upenn.edu/~cis399/files/lecture/l21.pdf
- https://github.com/ZumrutMuftuoglu/OM-Study-Group/blob/master/privacybook.pdf

Thanks for following along to the end of this series. Feel free to post any comments or start a discussion about differential privacy concepts. You may also check out the other series I have written before this: