Side Notes on Differential Privacy

In the following set of (side) notes, I shed light on aspects of differential privacy that I believe could benefit from clearer explanations and/or exploration.

Outline

0. A Brief Introduction
1. Understanding the Formal Definition
2. Global Model vs Local Model
3. Accuracy Bounds
  4. Attacks on Local Differential Privacy
5. Frequency Estimation and Application to Voting

0. A Brief Introduction

While the history of differential privacy (in the form of randomized response, first proposed by Warner) dates back to 1965, its formal definition was given in 2006 by Dwork, McSherry, Nissim, and Smith, and it is only within the past decade that practical applications of differential privacy have appeared at scale: consider Google’s RAPPOR, Apple’s collection of emoji data, Microsoft’s telemetry collection, and the US 2020 Census.

Informally, differential privacy allows one to compute statistics over an entire population without necessarily learning the individual data points.

A quick example (involving a querier Q and a differentially private mechanism M)

  • Q: “How many people in the room are Korean?”
  • M: “42!”
  • Q: “Is Kevin Korean?”
  • M: “I don’t know (for sure)!”

The above only conveys the gist, but it should already be clear how such use of differential privacy can enhance individuals’ privacy, especially in today’s setting where our data reside in numerous databases around the world.

1. Understanding the Formal Definition

The cited definition of differential privacy (i.e. ε-differential privacy) is usually of the following form (where M denotes a mechanism satisfying ε-differential privacy, x and x’ denote neighboring databases, and S denotes any subset of outputs):

    Pr[M(x) ∈ S] ≤ e^ε · Pr[M(x') ∈ S]

Nonetheless, this definition isn’t quite an intuitive one. For a better intuition, one can note three things.

  1. Recall that M(x) can be seen as a function that “injects random noise” into the real answer f(x) to a query, so that the output is probabilistic. For example, f(x) could be 10 while M(x) could be 8, 9, 10, 11, or 12, each with probability 1/5.
  2. For the purpose of understanding, one can simply fix S to be a single value (e.g. let S = {12}) rather than a larger set of values.
  3. What is e doing there? Note that the inequality can be rearranged in the following equivalent ways, which could be helpful:

    Pr[M(x) ∈ S] / Pr[M(x') ∈ S] ≤ e^ε
    ln( Pr[M(x) ∈ S] / Pr[M(x') ∈ S] ) ≤ ε

Effectively, the log ratio of the two probabilities being bounded by ε is exactly what ε-differential privacy signifies. As a result, it is mathematically impossible (as long as ε is not infinity) to recover the input x from the output M(x) with 100% certainty, since there always exists the possibility that a different input x' produced the same output, i.e. that M(x') = M(x). This is exactly what differential privacy achieves.

In simpler terms, this allows one to plausibly deny the fact that an input x has been used to produce the value y = M(x) when only y is published. This benefit of differential privacy is sometimes called plausible deniability.

Example

Say a mechanism M satisfies ε-differential privacy where ε = ln(3). That means the ratio of the two probabilities is bounded above by 3. Intuitively, this allows a scenario like the following (a small code sketch follows after the exchange).

  • Challenger (to privacy): “I see your output equals y. From this, I infer that your input must be value x.”
  • M: “Well, perhaps you are correct with probability 75%, but actually it could be something else like x’ with probability 25%. You will never know for sure!”
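
As a minimal sketch of one such mechanism M (classic randomized response on a single yes/no bit, with p = 0.75 being my own illustrative choice):

    import math
    import random

    def randomized_response(true_bit: int, p: float = 0.75) -> int:
        """Report the true bit with probability p, and the flipped bit otherwise."""
        return true_bit if random.random() < p else 1 - true_bit

    p = 0.75
    # Worst-case ratio of output probabilities over the two possible inputs is p / (1 - p) = 3,
    # so this mechanism satisfies epsilon-differential privacy for epsilon = ln(3).
    epsilon = math.log(p / (1 - p))
    print(f"epsilon = {epsilon:.4f}, e^epsilon = {p / (1 - p):.1f}")

    # Pr[report = 1 | bit = 1] = 0.75 while Pr[report = 1 | bit = 0] = 0.25:
    # a report of 1 is three times more likely under bit = 1, but never conclusive.
    print(randomized_response(1))

Under a uniform prior over the two possible inputs, a single report leaves at least 25% posterior probability on the other input, which is exactly the plausible deniability in the exchange above.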

2. Global Model vs Local Model

Whether a given discussion concerns the global model of differential privacy or the local model of differential privacy is a crucial distinction to make (though it is at times left unclear in the literature). The essential difference between the global and the local models is that x and x' (from the aforementioned formal definition of differential privacy) represent neighboring databases in the global model, whereas they represent two individual data points in the local model.

Examples

  • Global model: x (database) → f(x) (query to database) → M(x) (adds noise to f(x) and outputs result)
  • Perhaps, f(x) is 40 when the query asks for the number of Koreans in the room while M(x) probabilistically outputs 42.
  • Local model: x (individual data point) → M(x) (perturbed data point)
  • Perhaps, everyone’s ethnicity is “perturbed” with some probability (e.g. Bob’s data which show he is American may be perturbed to indicate that Bob is in fact Korean when a locally differentially private M is applied) before the sum of those perturbed data points outputs 42 as the answer to the same query.

The practical difference is that the global model of differential privacy assumes a trusted data curator (who has access to everyone’s raw data in the database) while the local model removes trust from the equation such that even the curator (e.g. Google in the case of RAPPOR) is only able to see the noisy data points but not the raw data from the collection process.

In other words, noise is added to the output of a query in the global model, whereas it is added to the individual inputs themselves in the local model.
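
To make this concrete, here is a minimal Python sketch of the same counting query answered both ways (the function names and parameter values are illustrative assumptions of mine, not any deployed system): the global version adds Laplace noise once to the true count, while the local version flips each person's bit with probability 1 - p before summing and debiasing.

    import random

    def laplace_noise(scale: float) -> float:
        """Sample Laplace(0, scale) noise as the difference of two exponentials."""
        return random.expovariate(1 / scale) - random.expovariate(1 / scale)

    def global_count(bits, epsilon=1.0):
        """Global model: a trusted curator computes the true count, then adds noise to it."""
        true_count = sum(bits)                     # f(x); a counting query has sensitivity 1
        return true_count + laplace_noise(1 / epsilon)

    def local_count(bits, p=0.75):
        """Local model: each bit is flipped with probability 1 - p before leaving the device."""
        noisy_bits = [b if random.random() < p else 1 - b for b in bits]
        # Debias: E[sum(noisy_bits)] = n * (1 - p) + (2p - 1) * true_count.
        n = len(bits)
        return (sum(noisy_bits) - n * (1 - p)) / (2 * p - 1)

    room = [1] * 40 + [0] * 60    # 40 Koreans among 100 people, say
    print(global_count(room, epsilon=1.0))   # roughly 40, plus Laplace noise
    print(local_count(room, p=0.75))         # roughly 40, after debiasing the flipped bits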

3. Accuracy Bounds

(Accuracy theorems for the Laplace and exponential mechanisms. Source: The Algorithmic Foundations of Differential Privacy)

I. Knowing which accuracy bounds to use

Having made the above distinction, one should note that the global model is the one that has historically enjoyed richer theoretical results, as it was formally studied first and is the main focus of the seminal textbook The Algorithmic Foundations of Differential Privacy by Dwork and Roth. The theorems above, for instance, give the accuracy bounds for the two most well-known mechanisms in the global model: the Laplace mechanism and the exponential mechanism. Accuracy bounds in the local model, on the other hand, tend to depend heavily on the type of query being made and on the perturbation applied to the individual data points in the first place, so there is no overall “textbook” formula that one can consult from the outset.
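
For reference, the textbook's accuracy (utility) theorem for the Laplace mechanism can be stated roughly as follows (only a rough restatement here; see the book for the precise version): if M answers k queries by adding independent Laplace(Δf/ε) noise to each coordinate of f(x), where Δf is the ℓ1-sensitivity of f, then for any δ ∈ (0, 1], with probability at least 1 - δ,

    max_i | f(x)_i - M(x)_i | ≤ ln(k/δ) · (Δf / ε).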

Now, what should be avoided as a result is mixing up which model of differential privacy one is working with and consequently employing the wrong accuracy bounds. For example, if one applies the Laplace mechanism locally to data points (i.e. to preserve ε-local differential privacy), then one should not be using the accuracy bounds meant for the Laplace mechanism in the global model.

II. Clarifying the accuracy bound itself

When proposing a new differentially private mechanism, one should be able to show that the mechanism is accurate enough to be used in practice, not merely that it satisfies ε-differential privacy. To be clear, merely claiming ε-differential privacy is not too helpful in the first place, as ε could be larger than 10 or 100, in which case not much privacy is guaranteed. And even when ε-differential privacy can be achieved for low values of ε, we still desire mechanisms with high utility, i.e. those whose accuracy is reasonable.

To play this accuracy-privacy game well, a differentially private mechanism should always come with a closed-form accuracy bound for practical purposes. There have already been numerous protocols that claim ε-differential privacy without one.

4. Attacks on Local Differential Privacy

I. Taking the “mean”

Say you wish to report a value x = 10 which gets perturbed as per some locally differentially private mechanism M such that M(x) outputs a random value centered around 10 every time it is queried, e.g. it may output 12, 7, and 9 when queried 3 times. Now, the issue here is that perhaps a listening attacker can simply take the mean of (say) 20 outputs of M(x) to retrieve a pretty accurate estimate of the true value x.
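
A minimal Python sketch of this averaging attack, using fresh Laplace noise as a stand-in for “some locally differentially private mechanism M” (my own illustrative choice):

    import random
    import statistics

    def noisy_report(x: float, scale: float = 3.0) -> float:
        """Report x plus fresh Laplace(0, scale) noise on every call."""
        return x + random.expovariate(1 / scale) - random.expovariate(1 / scale)

    true_value = 10
    reports = [noisy_report(true_value) for _ in range(20)]   # what a listening attacker collects
    print(statistics.mean(reports))   # concentrates around the true value 10 as reports pile up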

II. Linkability

To defend against the above attack, we can alter the process so that we memoize the first output of M(x) and return this memoized value (say 12) every time M(x) is called. The issue now is a more subtle one. What if a listening attacker sees the value 12 (the memoized value) coming from 2 different IP addresses, corresponding to your home and work IP addresses? Especially in a setting where only you output the value 12 (e.g. if the range of M is huge), it may be possible for an attacker to “link” your two IP addresses and perform such linkability analysis on every value he sees.

III. Proposed solution (ft. RAPPOR)

Hence, Google’s RAPPOR attempts to defend against this linkability attack by introducing two sequential levels of randomized response: the Permanent randomized response and the Instantaneous randomized response.

Idea and example

  • The idea is that the first output of M(x) (in the setup phase) is memoized as the Permanent randomized response, while future outputs of M(x) (in the report phase) are Instantaneous randomized responses centered around the Permanent value (as opposed to around x); a simplified sketch follows after the example below.
  • True value x = 10 → Permanent random M(x) = 12 → Instantaneous random_1 M(x) = 18 → Instantaneous random_2 M(x) = 8 → Instantaneous random_3 M(x) = 16 → …
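
Below is a deliberately simplified Python sketch of this two-level idea, using made-up uniform noise in place of RAPPOR's actual Bloom-filter-based randomized responses:

    import random

    class TwoLevelReporter:
        """Memoize one permanent noisy value, then randomize around it on every report."""

        def __init__(self, true_value: float, permanent_spread: float = 3.0,
                     instantaneous_spread: float = 8.0):
            # Setup phase: the Permanent randomized response is drawn once and memoized.
            self.permanent = true_value + random.uniform(-permanent_spread, permanent_spread)
            self.instantaneous_spread = instantaneous_spread

        def report(self) -> float:
            # Report phase: fresh noise is centered on the memoized value, not on the true value.
            return self.permanent + random.uniform(-self.instantaneous_spread,
                                                   self.instantaneous_spread)

    reporter = TwoLevelReporter(true_value=10)
    print([round(reporter.report(), 1) for _ in range(3)])
    # Averaging many reports recovers only the permanent value (e.g. ~12), never the true 10,
    # and no two reports repeat the memoized value verbatim, which mitigates the linkability above.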

IV. Trackability

Source: Can Two Walk Together: Privacy Enhancing Methods and Preventing Tracking of Users

Formalized only after RAPPOR had been deployed in practice for years, the notion of trackability is an interesting one that captures the intuition of linkability in formal terms resembling those that define differential privacy. Roughly speaking, it requires that the log ratio of two probabilities (that a given series of reports originated from one person, versus that it originated from two people) be bounded by ε.
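
Schematically (this is only shorthand for the intuition just described, not the paper's exact definition), the condition resembles

    | ln( Pr[series of reports | generated by one person] / Pr[series of reports | generated by two people] ) | ≤ ε.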

It turns out that RAPPOR is actually not immune to being trackable under the above definition (after about 5 reports), although intuitively it should be. Whether we will see versions of RAPPOR (or other mechanisms) optimized under trackability constraints in the future is still open.

5. Frequency Estimation and Application to Voting

In local differential privacy, current research directions include the following:

  • Frequency estimation: estimating the frequency of data points (e.g. comparing frequencies of emojis, browser settings, etc.)
  • Heavy hitter identification (subtask of frequency estimation): outputting the most frequent data point (e.g. most popular emoji? bookmarked site?)
  • Itemset mining: association rule learning with noisy data
  • Spatial data collection: pertains to geolocation data

Among the numerous applications of the above, we briefly examine one application to voting (which essentially translates to heavy hitter identification) with actual numbers. The setting is as follows.

  • d: number of candidates (such that each vote is a d-bit vector)
  • x: a voter’s true vote (e.g. of the form [0, 0, 1, 0] to signify voting for the third candidate)
  • y: a voter’s noisy vote (where each bit of the true vote is flipped with probability 1-p)
  • p: probability that each bit of a true vote will not be flipped to yield a noisy vote
  • c[i]: the sum of 1’s on the i-th bit (i.e. i-th candidate) from all noisy votes
  • t[i]: estimated true count of votes won by the i-th candidate, recovered from c[i] by debiasing: t[i] = (c[i] - n (1-p)) / (2p - 1)
  • f[i]: estimated true frequency of votes won by the i-th candidate, or simply t[i] / n
  • n: total number of voters
  • f_real[i]: real frequency of votes won by the i-th candidate had the election not employed local differential privacy

Then the key results are as follows.

  1. ε-local differential privacy is achieved for ε = 2 ln(p / (1-p)).
  2. The expected value E[f[i]] = f_real[i], i.e. f[i] is an unbiased estimator.
  3. Var[f[i]] = p (1-p) / (n (2p-1)²)
  4. With 95% confidence, we can claim the following:

    | f[i] - f_real[i] | ≤ 1.96 · sqrt( p (1-p) / (n (2p-1)²) )

Fixing n = 100,000 and p = 0.63, we reach two results. First, we achieve ε-local differential privacy where e^ε is about 2.9 — meaning that every voter is guaranteed plausible deniability of at least 25% (i.e. any accusation of a vote would be wrong with probability at least 25%).

Second, we bound our true election result within a “radius” of about 1.15%, e.g. if our estimate shows candidate A has won 51.15% of all the votes, then the true proportion of votes won by candidate A (had the election not employed differential privacy) would be between 50% and 52.3%. If an election involves more than 100,000 people, then this error radius would indeed decrease (which is desirable). Thus, the more voters we have for an election employing differential privacy, the better.
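
These figures can be reproduced with a few lines of Python (the plausible-deniability line uses the same 1 / (1 + e^ε) reading as the ln(3) example earlier):

    import math

    n = 100_000     # number of voters
    p = 0.63        # probability that each bit of a vote is NOT flipped

    # Privacy: two one-hot votes differ in exactly 2 bits, each flipped independently.
    epsilon = 2 * math.log(p / (1 - p))
    print(f"e^epsilon = {math.exp(epsilon):.2f}")                  # ~2.90
    print(f"deniability >= {1 / (1 + math.exp(epsilon)):.1%}")     # ~25.6%

    # Accuracy: 95% confidence radius around each estimated frequency f[i].
    radius = 1.96 * math.sqrt(p * (1 - p) / (n * (2 * p - 1) ** 2))
    print(f"95% radius = {radius:.2%}")                            # ~1.15%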

With regard to the non-flipping probability p, however, the challenge is to balance it as a parameter: a higher p achieves better accuracy while sacrificing privacy, whereas a lower p achieves better privacy while sacrificing accuracy.
