Plan today
• An introduction to differential privacy

What to do?
“The future of privacy is lying” – (April 10 2013, , )

A Simple Example
• Negative data survey – ask people to lie, and then make inferences based on the aggregate answers

Warm up

Negative data surveys
• Participants select a choice that does not fit their situation
• Providing more choices provides more privacy
• May be challenging to design appropriate questions
• Reliance on honesty of the respondents
• This is an example of a local type of privacy: each person is responsible for adding noise to their own data
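The inference step of a negative survey can be sketched as follows. This is a minimal illustration, not a prescribed method: it assumes each participant picks uniformly at random among the choices that do not fit them, and the function names and the toy population are hypothetical.

```python
import random

def negative_answer(true_choice, choices, rng):
    """Return a uniformly random choice that is NOT the participant's true one."""
    return rng.choice([c for c in choices if c != true_choice])

def estimate_true_counts(reported_counts, total):
    """Estimate the true counts from the negative answers.

    With m choices, a participant whose truth is not c reports c with
    probability 1/(m-1), so E[reported_c] = (total - true_c) / (m - 1),
    which rearranges to true_c = total - (m - 1) * reported_c.
    """
    m = len(reported_counts)
    return {c: total - (m - 1) * r for c, r in reported_counts.items()}

rng = random.Random(0)  # fixed seed so the run is repeatable
choices = ["A", "B", "C", "D"]
# Hypothetical population: 400 A, 300 B, 200 C, 100 D
truth = ["A"] * 400 + ["B"] * 300 + ["C"] * 200 + ["D"] * 100

reported = {c: 0 for c in choices}
for t in truth:
    reported[negative_answer(t, choices, rng)] += 1

estimates = estimate_true_counts(reported, len(truth))
print(estimates)  # noisy estimates of the true counts
```

No individual answer reveals the participant's true choice, yet the aggregate estimates approach the true counts as the number of participants grows.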

Differential privacy: Local and global

Global: We have a sensitive dataset, a trusted data owner Alice and a researcher Bob. Alice does the analysis on the raw data, adds noise to the answers, and reports the (noisy) answers to Bob.
Local: Each person is responsible for adding noise to their own data. Classic survey example: each person has to answer the question “Do you use drugs?”
• They flip a coin in secret and answer “Yes” if it comes up heads, but tell the truth otherwise.
• Plausible deniability about a “Yes” answer
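The coin-flip survey can be simulated to see how the analyst still recovers the overall rate. This is a sketch under stated assumptions: a fair coin, a hypothetical population in which 20% would truthfully answer “Yes”, and illustrative function names.

```python
import random

def randomized_response(truth, rng):
    """Heads: answer Yes regardless of the truth; tails: answer truthfully."""
    heads = rng.random() < 0.5
    return True if heads else truth

def estimate_true_rate(answers):
    """Invert the coin flip: P(Yes) = 0.5 + 0.5 * p, so p = 2 * P(Yes) - 1."""
    yes_fraction = sum(answers) / len(answers)
    return 2 * yes_fraction - 1

rng = random.Random(1)  # fixed seed so the run is repeatable
# Hypothetical population of 10,000 people; the first 2,000 are true "Yes"
population = [i < 2000 for i in range(10_000)]
answers = [randomized_response(t, rng) for t in population]

print(round(estimate_true_rate(answers), 3))  # close to the true rate 0.2
```

A single “Yes” says little about the person who gave it, because about half of all respondents answer “Yes” purely because of the coin.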
We will next look further at the global case (global systems are generally more accurate, and less noise is needed).

Differential privacy: Where?
• Since its introduction in 2006:
– US Census Bureau in 2012: On The Map project
• Where people are employed and where they live
– Apple in 2016: iOS 10
• User data collection, e.g. for emoji suggestions
• https://images.apple.com/privacy/docs/Differential_Privacy_Overview.pdf
– NSW Department of Transport open release of 2016 Opal ticketing system data
• https://opendata.transport.nsw.gov.au/sites/default/files/resources/Open%20Opal%20Data%20Documentation%20170728.pdf

Global differential privacy: Our focus
[Figure: Original Data → k-anonymity / l-diversity → Anonymous Data; Original Data → Differential privacy → Privatized Analysis]

What is being protected?
• Imagine a survey is asking you:
– Are you a smoker?
• Result: Number of smokers will be reported
• Would you take part in it?

What is being protected?
I would feel safe submitting the survey if:
I know the chance that the privatized result would be 𝑹 is nearly the same, whether or not I take part in the survey.
• Does this mean that an individual’s answer has no impact on the released result?

Overview of the process: Global differential privacy
[Figure: Original Data → Privatized Analysis]
• The privatized analysis comprises two steps:
– Query the data and obtain the real result, e.g., how many female students are in the survey?
– Add random noise to hide the presence/absence of any individual. Release noisy result to the user.

The released results will be different each time (a different amount of noise is added)
• Query: How many females in the dataset? (true result = 32)
• Generate some random values, according to a distribution with mean value 0: {1,2,-2,-1,0,-3,1,0}, add to true result and release
– 1st query: Released result=33 (32+1)
– 2nd query: Released result=34 (32+2)
– 3rd query: Released result=30 (32-2)
– 4th query: Released result=31 (32-1)
– 5th query: Released result=32 (32+0)
– 6th query: Released result=29 (32-3)
– 7th query: Released result=33 (32+1)
– 8th query: Released result=32 (32+0)
– …
• On average, the released result will be 32, but observing a single released result doesn’t give the adversary exact knowledge
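The arithmetic of this example can be reproduced in a few lines. The noise values are the eight sample draws listed in the example; the variable names are illustrative only.

```python
true_result = 32
noise_values = [1, 2, -2, -1, 0, -3, 1, 0]  # the eight sample noise draws from the example

# Each query releases the true result plus one fresh noise value
released = [true_result + n for n in noise_values]

print(released)                       # [33, 34, 30, 31, 32, 29, 33, 32]
print(sum(released) / len(released))  # 31.75
```

The average of these eight releases is 31.75 rather than exactly 32 because only eight draws were taken; with more draws from a mean-0 distribution the average moves closer to the true result.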

Emoji scenario and use of differential privacy
• A developer wants to understand which emojis are popular, in order to make better recommendations. There is a database like:
[Table: Emoji used today]
• Query from developer: How many times was 🥺 used today?
• System will release a noisy result to the developer, to protect customer privacy

The promise of differential privacy
• The chance that the noisy released result will be 𝑅 is nearly the same, whether or not an individual participates in the dataset.
– A = Probability that result is R, in the possible world where I participate
– B = Probability that result is R, in the possible world where I do not participate
• If we can guarantee A≅B (A is very close to B), then no one can guess which possible world resulted in R.

The promise of differential privacy
• Does this mean that the attacker cannot learn anything sensitive about individuals from the released results?

Differential privacy: How?
• How much noise should we add to the result? This depends on
– Privacy loss budget: How private we want the result to be (how hard for the attacker to guess the true result)
– Global sensitivity: How much difference the presence or absence of an individual could make to the result.

Global sensitivity
• Global sensitivity of a query Q is the maximum difference in answers that adding or removing any individual from the dataset can cause (maximum effect of an individual)
• Intuitively, we want to consider the worst case scenario
• If asking multiple queries together, the global sensitivity is equal to the sum of the individual queries' sensitivities

Global sensitivity
• QUERY: How many people in the dataset are female? Global sensitivity = 1
– Possible world where I participate: X+1 people are female
– Possible world where I do not participate: X people are female

Global sensitivity
• QUERY: How many people in the dataset are smokers? Global sensitivity = 1
– Possible world where I participate: X+1 people are smokers
– Possible world where I do not participate: X people are smokers

Global sensitivity
• QUERY: How many people in the dataset are female? And how many people are smokers? Global sensitivity = 1+1=2
– Possible world where I participate: X+1 people are smokers, and M+1 males and F females OR M males and F+1 females
– Possible world where I do not participate: X people are smokers, and M males and F females
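The effect of one individual on these counting queries can be checked on a toy dataset. This is a sketch only: the records and function names are hypothetical, and it compares a single pair of possible worlds, whereas global sensitivity is the maximum over all such pairs.

```python
def count_female(people):
    """QUERY: how many people in the dataset are female?"""
    return sum(1 for p in people if p["sex"] == "F")

def count_smokers(people):
    """QUERY: how many people in the dataset are smokers?"""
    return sum(1 for p in people if p["smoker"])

def total_change(queries, without_me, with_me):
    """Sum over the queries of |answer with me - answer without me|."""
    return sum(abs(q(with_me) - q(without_me)) for q in queries)

# Hypothetical toy dataset: the world where I do not participate
without_me = [
    {"sex": "F", "smoker": False},
    {"sex": "M", "smoker": True},
    {"sex": "M", "smoker": False},
]
me = {"sex": "F", "smoker": True}  # worst case: I change both counts
with_me = without_me + [me]        # the world where I participate

print(total_change([count_female], without_me, with_me))                 # 1
print(total_change([count_smokers], without_me, with_me))                # 1
print(total_change([count_female, count_smokers], without_me, with_me))  # 2
```

One person can change each count by at most 1, so asking both counts together gives a worst-case total change of 2, matching the sum 1+1.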

Privacy loss budget = k
• We want the presence or absence of a user in the dataset to have no considerable effect on the released result
– A = Probability that result is R, in the possible world where I participate
– B = Probability that result is R, in the possible world where I do not participate
• Privacy loss budget = k (k ≥ 0): choose k to guarantee that A ≤ 2^k × B

Privacy loss budget = k
• A = Probability that result is R, in the possible world where I participate
• B = Probability that result is R, in the possible world where I do not participate
• Privacy loss budget = k (k ≥ 0): choose k to guarantee that A ≤ 2^k × B
– k=0: No privacy loss (A=B), low utility
– k=high: Larger privacy loss, higher utility
– k=low: Low privacy loss, lower utility
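To get a feel for the guarantee A ≤ 2^k × B, the bound can be tabulated for a few budgets. This is an illustrative sketch: the function name `max_ratio` and the value chosen for B are assumptions, not part of the original material.

```python
def max_ratio(k):
    """Largest allowed value of A / B under the guarantee A <= 2**k * B."""
    if k < 0:
        raise ValueError("privacy loss budget k must be >= 0")
    return 2 ** k

B = 0.10  # hypothetical probability of result R when I do not participate
for k in [0, 0.1, 1, 3]:
    # Upper bound on A, the probability of R when I do participate
    print(f"k={k}: A <= {max_ratio(k) * B:.3f}")
```

At k=0 the two worlds are indistinguishable, while each extra unit of budget doubles how far A may exceed B.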

Differential privacy: How?
• How much noise should we add to the result? This depends on
– Privacy loss budget (k): How private we want the result to be (how hard for the attacker to guess the true result)
– Global sensitivity (G): How much difference the presence or absence of an individual could make to the result.
• Strategy: Add noise to the result according to
– Released result = True result + noise
• Where noise is a number randomly sampled from a distribution having
– average value = 0 (μ)
– spread (scale parameter) = G/k (b); the standard deviation is √2 × b
• Details about the distribution are beyond the scope of our study (it is called the Laplace distribution)

Example Code
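A minimal sketch of the noise-adding strategy, using only the Python standard library. The function names `laplace_noise` and `release` are hypothetical, and the sampling trick (the difference of two exponential draws) is one common way to obtain Laplace noise, chosen here for illustration.

```python
import random

def laplace_noise(scale, rng):
    """Sample from a Laplace distribution with mean 0 and the given scale b.

    The difference of two independent exponential variables, each with
    mean b, follows a Laplace(0, b) distribution.
    """
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)

def release(true_result, sensitivity, budget, rng):
    """Released result = true result + noise, with spread b = G / k."""
    if budget <= 0:
        raise ValueError("budget k must be positive to compute G / k")
    return true_result + laplace_noise(sensitivity / budget, rng)

rng = random.Random(42)  # fixed seed so the run is repeatable
true_count = 32          # e.g. the number of females in the dataset
G = 1                    # global sensitivity of a single counting query

for k in [0.1, 1.0, 10.0]:
    samples = [release(true_count, G, k, rng) for _ in range(5)]
    print(f"k={k}: " + ", ".join(f"{s:.1f}" for s in samples))
```

A small k gives widely scattered releases (more privacy, less utility) and a large k gives releases close to 32. Note that Laplace noise with scale G/k bounds the ratio A/B by e^k; to match a bound of exactly 2^k, the scale would be G/(k × ln 2).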


• Differential privacy guarantees that the presence or absence of a user cannot be revealed after releasing the query result
– It does not prevent attackers from drawing conclusions about individuals from the aggregate results over the population
