From a506f32cd0a375f0ef2386dab6cf93a54542d2ac Mon Sep 17 00:00:00 2001 From: fria <138676274+friadev@users.noreply.github.com> Date: Tue, 1 Jul 2025 07:09:32 -0500 Subject: [PATCH 01/40] update(blog)!: Privacy-Enhancing Technologies Series: Differential Privacy --- blog/posts/differential-privacy.md | 26 ++++++++++++++++++++++++++ 1 file changed, 26 insertions(+) create mode 100644 blog/posts/differential-privacy.md diff --git a/blog/posts/differential-privacy.md b/blog/posts/differential-privacy.md new file mode 100644 index 00000000..8e48a540 --- /dev/null +++ b/blog/posts/differential-privacy.md @@ -0,0 +1,26 @@ +--- +date: + created: 2025-07-01T17:30:00Z +categories: + - Explainers +authors: + - fria +tags: + - Privacy Enhancing Technologies + - Differential Privacy +license: BY-SA +schema_type: BackgroundNewsArticle +description: | + Differential privacy lets us collect data from a large group of people while still protecting each individual's privacy. Let's look at how it works. +--- +# Privacy-Enhancing Technologies Series: Differential Privacy + +Is it possible to collect data from a large group of people but protect each individual's privacy? In this entry of my series on privacy-enhancing technologies, we'll discuss differential privacy and how it can do just that. + +## Problem + +It's useful to collect data from a large group of people. You can see trends in a population. But it requires a lot of individual people to give up personally identifiable information. Even things that seem innocuous like your gender can help identify you. 87% of Americans can be identified by three pieces of information: + +## History + +Most of the concepts I write about seem to come from the 70's and 80's, but differential privacy is a relatively new concept. It was first introduced in a paper from 2006 called [*Calibrating Noise to Sensitivity in Private Data Analysis*](https://desfontain.es/PDFs/PhD/CalibratingNoiseToSensitivityInPrivateDataAnalysis.pdf) \ No newline at end of file From 83be6545dae6d37c584e88c67680d4eef22c6df7 Mon Sep 17 00:00:00 2001 From: fria <138676274+friadev@users.noreply.github.com> Date: Tue, 1 Jul 2025 07:16:28 -0500 Subject: [PATCH 02/40] add more info to the problem --- blog/posts/differential-privacy.md | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/blog/posts/differential-privacy.md b/blog/posts/differential-privacy.md index 8e48a540..12d7c972 100644 --- a/blog/posts/differential-privacy.md +++ b/blog/posts/differential-privacy.md @@ -19,7 +19,9 @@ Is it possible to collect data from a large group of people but protect each ind ## Problem -It's useful to collect data from a large group of people. You can see trends in a population. But it requires a lot of individual people to give up personally identifiable information. Even things that seem innocuous like your gender can help identify you. 87% of Americans can be identified by three pieces of information: +It's useful to collect data from a large group of people. You can see trends in a population. But it requires a lot of individual people to give up personally identifiable information. Even things that seem innocuous like your gender can help identify you. + +Latanya Sweeney in a [paper](https://dataprivacylab.org/projects/identifiability/paper1.pdf) from 2000 used U.S. Census data to try and re-identify people solely based on the metrics available to her. She found that 87% of Americans could be identified based on only 3 metrics: ZIP code, date of birth, and sex. 
## History From f1a36ef966cecc434f2aa1b74f627e2faef390ec Mon Sep 17 00:00:00 2001 From: fria <138676274+friadev@users.noreply.github.com> Date: Tue, 1 Jul 2025 08:06:55 -0500 Subject: [PATCH 03/40] add info about noise --- blog/posts/differential-privacy.md | 7 ++++++- 1 file changed, 6 insertions(+), 1 deletion(-) diff --git a/blog/posts/differential-privacy.md b/blog/posts/differential-privacy.md index 12d7c972..f554e33f 100644 --- a/blog/posts/differential-privacy.md +++ b/blog/posts/differential-privacy.md @@ -23,6 +23,11 @@ It's useful to collect data from a large group of people. You can see trends in Latanya Sweeney in a [paper](https://dataprivacylab.org/projects/identifiability/paper1.pdf) from 2000 used U.S. Census data to try and re-identify people solely based on the metrics available to her. She found that 87% of Americans could be identified based on only 3 metrics: ZIP code, date of birth, and sex. +Obviously, being able to identify individuals based on publicly available data is a huge privacy issue. + ## History -Most of the concepts I write about seem to come from the 70's and 80's, but differential privacy is a relatively new concept. It was first introduced in a paper from 2006 called [*Calibrating Noise to Sensitivity in Private Data Analysis*](https://desfontain.es/PDFs/PhD/CalibratingNoiseToSensitivityInPrivateDataAnalysis.pdf) \ No newline at end of file +Most of the concepts I write about seem to come from the 70's and 80's, but differential privacy is a relatively new concept. It was first introduced in a paper from 2006 called [*Calibrating Noise to Sensitivity in Private Data Analysis*](https://desfontain.es/PDFs/PhD/CalibratingNoiseToSensitivityInPrivateDataAnalysis.pdf). + +The paper introduces the idea of adding noise to data to achieve privacy. Of course, adding noise to the dataset reduces its accuracy. Ɛ defines the amount of noise added to the dataset, with a small Ɛ meaning more privacy but less accurate data and vice versa. + From 7a62a90097cba90a94b56e567850713c294673c2 Mon Sep 17 00:00:00 2001 From: fria <138676274+friadev@users.noreply.github.com> Date: Tue, 1 Jul 2025 08:20:20 -0500 Subject: [PATCH 04/40] add more info on history --- blog/posts/differential-privacy.md | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/blog/posts/differential-privacy.md b/blog/posts/differential-privacy.md index f554e33f..848d6eeb 100644 --- a/blog/posts/differential-privacy.md +++ b/blog/posts/differential-privacy.md @@ -29,5 +29,6 @@ Obviously, being able to identify individuals based on publicly available data i Most of the concepts I write about seem to come from the 70's and 80's, but differential privacy is a relatively new concept. It was first introduced in a paper from 2006 called [*Calibrating Noise to Sensitivity in Private Data Analysis*](https://desfontain.es/PDFs/PhD/CalibratingNoiseToSensitivityInPrivateDataAnalysis.pdf). -The paper introduces the idea of adding noise to data to achieve privacy. Of course, adding noise to the dataset reduces its accuracy. Ɛ defines the amount of noise added to the dataset, with a small Ɛ meaning more privacy but less accurate data and vice versa. +The paper introduces the idea of adding noise to data to achieve privacy. Of course, adding noise to the dataset reduces its accuracy. Ɛ defines the amount of noise added to the dataset, with a small Ɛ meaning more privacy but less accurate data and vice versa. It's also referred to as the "privacy loss parameter". 
+Importantly, differential privacy adds noise *before* it's analyzed. k-anonymity relies on trying to anonymize data *after* it's collected, so it leaves the possibility that not enough parameters are removed to ensure each indidual cannot be identified. \ No newline at end of file From 21ecc1a9c7ac5390d9a17b683bfac44f2ac5791d Mon Sep 17 00:00:00 2001 From: fria <138676274+friadev@users.noreply.github.com> Date: Tue, 1 Jul 2025 08:20:42 -0500 Subject: [PATCH 05/40] typo --- blog/posts/differential-privacy.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/blog/posts/differential-privacy.md b/blog/posts/differential-privacy.md index 848d6eeb..962318ee 100644 --- a/blog/posts/differential-privacy.md +++ b/blog/posts/differential-privacy.md @@ -31,4 +31,4 @@ Most of the concepts I write about seem to come from the 70's and 80's, but diff The paper introduces the idea of adding noise to data to achieve privacy. Of course, adding noise to the dataset reduces its accuracy. Ɛ defines the amount of noise added to the dataset, with a small Ɛ meaning more privacy but less accurate data and vice versa. It's also referred to as the "privacy loss parameter". -Importantly, differential privacy adds noise *before* it's analyzed. k-anonymity relies on trying to anonymize data *after* it's collected, so it leaves the possibility that not enough parameters are removed to ensure each indidual cannot be identified. \ No newline at end of file +Importantly, differential privacy adds noise *before* it's analyzed. k-anonymity relies on trying to anonymize data *after* it's collected, so it leaves the possibility that not enough parameters are removed to ensure each individual cannot be identified. \ No newline at end of file From 91cc1bb0b783a748fca4c94e437f3b7e40b0bf7a Mon Sep 17 00:00:00 2001 From: fria <138676274+friadev@users.noreply.github.com> Date: Wed, 2 Jul 2025 10:13:16 -0500 Subject: [PATCH 06/40] add annotation --- blog/posts/differential-privacy.md | 11 ++++++++++- 1 file changed, 10 insertions(+), 1 deletion(-) diff --git a/blog/posts/differential-privacy.md b/blog/posts/differential-privacy.md index 962318ee..2b41a4e8 100644 --- a/blog/posts/differential-privacy.md +++ b/blog/posts/differential-privacy.md @@ -31,4 +31,13 @@ Most of the concepts I write about seem to come from the 70's and 80's, but diff The paper introduces the idea of adding noise to data to achieve privacy. Of course, adding noise to the dataset reduces its accuracy. Ɛ defines the amount of noise added to the dataset, with a small Ɛ meaning more privacy but less accurate data and vice versa. It's also referred to as the "privacy loss parameter". -Importantly, differential privacy adds noise *before* it's analyzed. k-anonymity relies on trying to anonymize data *after* it's collected, so it leaves the possibility that not enough parameters are removed to ensure each individual cannot be identified. \ No newline at end of file +Importantly, differential privacy adds noise *before* it's analyzed. k-anonymity (1) relies on trying to anonymize data *after* it's collected, so it leaves the possibility that not enough parameters are removed to ensure each individual cannot be identified. +{ .annotate } + +1. k-anonymity means that for each row, at least k-1 other rows are identical. 
+ + +### Google RAPPOR + +In 2014, Google introduced [Randomized Aggregatable Privacy-Preserving Ordinal Response](https://arxiv.org/pdf/1407.6981) (RAPPOR), their [open source](https://github.com/google/rappor) implementation of differential privacy, with a few improvements. + From 8bcf668ddae8e3ceeaa46350c08902fa3e36bfee Mon Sep 17 00:00:00 2001 From: fria <138676274+friadev@users.noreply.github.com> Date: Thu, 3 Jul 2025 10:55:32 -0500 Subject: [PATCH 07/40] add strava heatmap --- blog/posts/differential-privacy.md | 27 ++++++++++++++++++++++++--- 1 file changed, 24 insertions(+), 3 deletions(-) diff --git a/blog/posts/differential-privacy.md b/blog/posts/differential-privacy.md index 2b41a4e8..7cb4b0e6 100644 --- a/blog/posts/differential-privacy.md +++ b/blog/posts/differential-privacy.md @@ -27,17 +27,38 @@ Obviously, being able to identify individuals based on publicly available data i ## History +### Before Differential Privacy + +Previous attempts at anonymizing data have proven highly vulnerable to reidentification attacks. + +#### AOL Search Log Release + +A famous example is the AOL search log release. AOL had been logging its users' searches for research purposes. When they released the data, they only replaced the users' real names with an identifier. Researchers were able to identify [user 4417749](https://archive.nytimes.com/www.nytimes.com/learning/teachers/featured_articles/20060810thursday.html) as Thelma Arnold based on the identifying details of her searches. + +#### Strava Heatmap Incident + +In 2018, the fitness app Strava announced a major update to its heatmap, showing the workout patterns of users of fitness trackers like Fitbit. + +Analyst [Nathan Ruser](https://x.com/Nrg8000/status/957318498102865920) indicated that these patterns can reveal military bases and troop movement patterns. This is obviously a huge op-sec problem and can endanger the lives of troops. + +Since movement patterns are fairly unique, + ### Dawn of Differential Privacy Most of the concepts I write about seem to come from the 70's and 80's, but differential privacy is a relatively new concept. It was first introduced in a paper from 2006 called [*Calibrating Noise to Sensitivity in Private Data Analysis*](https://desfontain.es/PDFs/PhD/CalibratingNoiseToSensitivityInPrivateDataAnalysis.pdf). The paper introduces the idea of adding noise to data to achieve privacy. Of course, adding noise to the dataset reduces its accuracy. Ɛ defines the amount of noise added to the dataset, with a small Ɛ meaning more privacy but less accurate data and vice versa. It's also referred to as the "privacy loss parameter". -Importantly, differential privacy adds noise *before* it's analyzed. k-anonymity (1) relies on trying to anonymize data *after* it's collected, so it leaves the possibility that not enough parameters are removed to ensure each individual cannot be identified. -{ .annotate } - -1. k-anonymity means that for each row, at least k-1 other rows are identical. +Importantly, differential privacy adds noise *before* it's analyzed. k-anonymity relies on trying to anonymize data *after* it's collected, so it leaves the possibility that not enough parameters are removed to ensure each individual cannot be identified. +### Problems with k-anonymity +k-anonymity means that for each row, at least k-1 other rows are identical. 
+| Age | ### Google RAPPOR In 2014, Google introduced [Randomized Aggregatable Privacy-Preserving Ordinal Response](https://arxiv.org/pdf/1407.6981) (RAPPOR), their [open source](https://github.com/google/rappor) implementation of differential privacy, with a few improvements. + + From 4a6a15d213a019d084971494936f86a1f99715e4 Mon Sep 17 00:00:00 2001 From: fria <138676274+friadev@users.noreply.github.com> Date: Thu, 3 Jul 2025 11:09:07 -0500 Subject: [PATCH 08/40] add link to blog about strava --- blog/posts/differential-privacy.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/blog/posts/differential-privacy.md b/blog/posts/differential-privacy.md index 7cb4b0e6..811a9b3f 100644 --- a/blog/posts/differential-privacy.md +++ b/blog/posts/differential-privacy.md @@ -41,7 +41,7 @@ In 2018, the fitness app Strava announced a major update to its heatmap, showing Analyst [Nathan Ruser](https://x.com/Nrg8000/status/957318498102865920) indicated that these patterns can reveal military bases and troop movement patterns. This is obviously a huge op-sec problem and can endanger the lives of troops. -Since movement patterns are fairly unique, +It was also possible to [deanonymize](https://steveloughran.blogspot.com/2018/01/advanced-denanonymization-through-strava.html) individual users in some circumstances. ### Dawn of Differential Privacy From 9a4f49a366cecc434f2aa1b74627e2faef390ec Mon Sep 17 00:00:00 2001 From: fria <138676274+friadev@users.noreply.github.com> Date: Mon, 7 Jul 2025 10:39:55 -0500 Subject: [PATCH 09/40] add intro before differential privacy --- blog/posts/differential-privacy.md | 8 ++++++++ 1 file changed, 8 insertions(+) diff --git a/blog/posts/differential-privacy.md b/blog/posts/differential-privacy.md index 811a9b3f..99683778 100644 --- a/blog/posts/differential-privacy.md +++ b/blog/posts/differential-privacy.md @@ -29,6 +29,12 @@ Obviously, being able to identify individuals based on publicly available data i ## History ### Before Differential Privacy +Being able to collect aggregate data is essential for research. It's what the U.S. Census does every 10 years. + +Usually we're more interested in the data as a whole and not the data of individual people, as it can show trends and overall patterns in groups of people. However, in order to get that data we must collect it from individuals. + +It was thought at first that simply removing names and other obviously identifying details from the data was enough to prevent re-identification, but [Latanya Sweeney](https://latanyasweeney.org/JLME.pdf) (a name that will pop up a few more times) proved in 1997 that even without names, a significant portion of individuals can be re-identified from a dataset by cross-referencing external data. + Previous attempts at anonymizing data have proven highly vulnerable to reidentification attacks. #### AOL Search Log Release @@ -43,6 +49,8 @@ Analyst [Nathan Ruser](https://x.com/Nrg8000/status/957318498102865920) indicate It was also possible to [deanonymize](https://steveloughran.blogspot.com/2018/01/advanced-denanonymization-through-strava.html) individual users in some circumstances. +#### + ### Dawn of Differential Privacy Most of the concepts I write about seem to come from the 70's and 80's, but differential privacy is a relatively new concept. It was first introduced in a paper from 2006 called [*Calibrating Noise to Sensitivity in Private Data Analysis*](https://desfontain.es/PDFs/PhD/CalibratingNoiseToSensitivityInPrivateDataAnalysis.pdf). 
From f6c344393fffcf3006a2b763847f341451a527d9 Mon Sep 17 00:00:00 2001 From: fria <138676274+friadev@users.noreply.github.com> Date: Mon, 7 Jul 2025 11:29:05 -0500 Subject: [PATCH 10/40] add link --- blog/posts/differential-privacy.md | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/blog/posts/differential-privacy.md b/blog/posts/differential-privacy.md index 99683778..04b883d9 100644 --- a/blog/posts/differential-privacy.md +++ b/blog/posts/differential-privacy.md @@ -33,7 +33,7 @@ Being able to collect aggregate data is essential for research. It's what the U. Usually we're more interested in the data as a whole and not data of individual people as it can show trends and overall patterns in groups of people. However, in order to get that data we must collect it from individuals. -It was thought at first that simply removing names and other obviously identifying details from the data was enough to prevent re-identification, but [Latanya Sweeney](https://latanyasweeney.org/JLME.pdf) (a name that will pop up a few more times) proved in 1997 that even without names, a significant portion of individuals can be re-identified from a dataset by cross-referencing external data. +It was thought at first that simply r[emoving names and other obviously identifying details](https://simons.berkeley.edu/news/differential-privacy-issues-policymakers#:~:text=Prior%20to%20the%20line%20of%20research%20that%20led%20to%20differential%20privacy%2C%20it%20was%20widely%20believed%20that%20anonymizing%20data%20was%20a%20relatively%20straightforward%20and%20sufficient%20solution%20to%20the%20privacy%20challenge.%20Statistical%20aggregates%20could%20be%20released%2C%20many%20people%20thought%2C%20without%20revealing%20underlying%20personally%20identifiable%20data.%20Data%20sets%20could%20be%20released%20to%20researchers%20scrubbed%20of%20names%2C%20but%20otherwise%20with%20rich%20individual%20information%2C%20and%20were%20thought%20to%20have%20been%20anonymized.) from the data was enough to prevent re-identification, but [Latanya Sweeney](https://latanyasweeney.org/JLME.pdf) (a name that will pop up a few more times) proved in 1997 that even without names, a significant portion of individuals can be re-identified from a dataset by cross-referencing external data. Previous attempts at anonymizing data have relied on been highly vulnerable to reidentification attacks. @@ -51,6 +51,11 @@ It was also possible to [deanonymize](https://steveloughran.blogspot.com/2018/01 #### +#### Problems with k-anonymity + +k-anonymity means that for each row, at least k-1 other rows are identical. +| Age | + ### Dawn of Differential Privacy Most of the concepts I write about seem to come from the 70's and 80's, but differential privacy is a relatively new concept. It was first introduced in a paper from 2006 called [*Calibrating Noise to Sensitivity in Private Data Analysis*](https://desfontain.es/PDFs/PhD/CalibratingNoiseToSensitivityInPrivateDataAnalysis.pdf). @@ -59,11 +64,6 @@ The paper introduces the idea of adding noise to data to achieve privacy. Of cou Importantly, differential privacy adds noise *before* it's analyzed. k-anonymity relies on trying to anonymize data *after* it's collected, so it leaves the possibility that not enough parameters are removed to ensure each individual cannot be identified. -### Problems with k-anonymity - -k-anonymity means that for each row, at least k-1 other rows are identical. 
-| Age | - ### Google RAPPOR In 2014, Google introduced [Randomized Aggregatable Privacy-Preserving Ordinal Response](https://arxiv.org/pdf/1407.6981) (RAPPOR), their [open source](https://github.com/google/rappor) implementation of differential privacy, with a few improvements. From fc2c321beec35be850bfff178db81b3b1f5ae9a9 Mon Sep 17 00:00:00 2001 From: fria <138676274+friadev@users.noreply.github.com> Date: Tue, 8 Jul 2025 09:11:30 -0500 Subject: [PATCH 11/40] add example Table --- blog/posts/differential-privacy.md | 15 ++++++++++++++- 1 file changed, 14 insertions(+), 1 deletion(-) diff --git a/blog/posts/differential-privacy.md b/blog/posts/differential-privacy.md index 04b883d9..0ad3c366 100644 --- a/blog/posts/differential-privacy.md +++ b/blog/posts/differential-privacy.md @@ -49,7 +49,20 @@ Analyst [Nathan Ruser](https://x.com/Nrg8000/status/957318498102865920) indicate It was also possible to [deanonymize](https://steveloughran.blogspot.com/2018/01/advanced-denanonymization-through-strava.html) individual users in some circumstances. -#### +#### Randomized Response + +One of the earliest ideas for anonymizing data was [randomized response](https://uvammm.github.io/docs/randomizedresponse.pdf), first introduced all the way back in 1965 in a paper by Stanley L. Warner. The idea behind it is quite clever. + +For certain questions like "have you committed tax fraud?", respondents will likely be hesitant to answer truthfully. The solution? Have the respondent flip a coin. If the coin is tails, answer yes. If the coin lands on heads, answer truthfully. + +| Respondent | Answer | Coin Flip (not included in the actual dataset just here for illustration) | +| --- | --- | +| 1 | Yes | Tails (Answer Yes) | +| 2 | No | Heads (Answer Truthfully) | +| 3 | Yes | Heads (Answer Truthfully) | +| 4 | Yes | Tails (Answer Yes) | +| 5 | No | Heads (Answer Truthfully) | + #### Problems with k-anonymity From e3db04c23ad3d35f247b87704f9aab5950b0f675 Mon Sep 17 00:00:00 2001 From: fria <138676274+friadev@users.noreply.github.com> Date: Tue, 8 Jul 2025 09:13:11 -0500 Subject: [PATCH 12/40] add title to table --- blog/posts/differential-privacy.md | 2 ++ 1 file changed, 2 insertions(+) diff --git a/blog/posts/differential-privacy.md b/blog/posts/differential-privacy.md index 0ad3c366..77877646 100644 --- a/blog/posts/differential-privacy.md +++ b/blog/posts/differential-privacy.md @@ -55,6 +55,8 @@ One of the earliest ideas for anonymizing data was [randomized response](https:/ For certain questions like "have you committed tax fraud?", respondents will likely be hesitant to answer truthfully. The solution? Have the respondent flip a coin. If the coin is tails, answer yes. If the coin lands on heads, answer truthfully. +Have you committed tax fraud? 
+ | Respondent | Answer | Coin Flip (not included in the actual dataset just here for illustration) | | --- | --- | | 1 | Yes | Tails (Answer Yes) | From b9c64a75e0a6dbe89204ff06c1b6935b2691cfd2 Mon Sep 17 00:00:00 2001 From: fria <138676274+friadev@users.noreply.github.com> Date: Sat, 12 Jul 2025 06:11:59 -0500 Subject: [PATCH 13/40] fix table --- blog/posts/differential-privacy.md | 13 ++++++++++++- 1 file changed, 12 insertions(+), 1 deletion(-) diff --git a/blog/posts/differential-privacy.md b/blog/posts/differential-privacy.md index 77877646..ca945808 100644 --- a/blog/posts/differential-privacy.md +++ b/blog/posts/differential-privacy.md @@ -58,13 +58,24 @@ For certain questions like "have you committed tax fraud?", respondents will lik Have you committed tax fraud? | Respondent | Answer | Coin Flip (not included in the actual dataset just here for illustration) | -| --- | --- | +| ---- | ---- | | 1 | Yes | Tails (Answer Yes) | | 2 | No | Heads (Answer Truthfully) | | 3 | Yes | Heads (Answer Truthfully) | | 4 | Yes | Tails (Answer Yes) | | 5 | No | Heads (Answer Truthfully) | +Because we know the exact probability that a "Yes" answer is fake, 50%, we can remove it and give a rough estimate of how many respondents answered "Yes" truthfully. + +Randomized Response would lay the groundwork for differential privacy, but it wouldn't truly be realized for many decades. + +#### Unrelated Question Randomized Response + +A variation used later in a [paper](https://www.jstor.org/stable/2283636) by Greenberg et al. called **unrelated question randomized response** would present each respondent with either a sensitive question or a banal question like "is your birthday in January?" to increase the likelihood of people answering honestly, since the researcher doesn't know which question was asked. + + + + #### Problems with k-anonymity From 5c03af3753c6c62c3ed7ef82531f28d1bb6f0489 Mon Sep 17 00:00:00 2001 From: fria <138676274+friadev@users.noreply.github.com> Date: Sat, 12 Jul 2025 06:28:18 -0500 Subject: [PATCH 14/40] add table --- blog/posts/differential-privacy.md | 10 +++++++--- 1 file changed, 7 insertions(+), 3 deletions(-) diff --git a/blog/posts/differential-privacy.md b/blog/posts/differential-privacy.md index ca945808..c9fddeb8 100644 --- a/blog/posts/differential-privacy.md +++ b/blog/posts/differential-privacy.md @@ -73,9 +73,13 @@ Randomized Response would lay the groundwork for differential privacy, but it wo A variation used later in a [paper](https://www.jstor.org/stable/2283636) by Greenberg et al. called **unrelated question randomized response** would present each respondent with either a sensitive question or a banal question like "is your birthday in January?" to increase the likelihood of people answering honestly, since the researcher doesn't know which question was asked. - - - +| Respondent | Question (not visible to researcher) | Answer | +| --- | --- | --- | +| 1 | Have you ever committed tax evasion? | No | +| 2 | Is your birthday in January? | Yes | +| 3 | Is your birthday in January? | No | +| 4 | Have you ever committed tax evasion? | Yes | +| 5 | Have you ever committed tax evasion? 
| No | #### Problems with k-anonymity From b59993f2b478a0d9a292581c66897f9ae5072792 Mon Sep 17 00:00:00 2001 From: fria <138676274+friadev@users.noreply.github.com> Date: Sat, 12 Jul 2025 06:28:56 -0500 Subject: [PATCH 15/40] remove unfinished table --- blog/posts/differential-privacy.md | 1 - 1 file changed, 1 deletion(-) diff --git a/blog/posts/differential-privacy.md b/blog/posts/differential-privacy.md index c9fddeb8..4d8d7ab1 100644 --- a/blog/posts/differential-privacy.md +++ b/blog/posts/differential-privacy.md @@ -84,7 +84,6 @@ A variation used later in a [paper](https://www.jstor.org/stable/2283636) by Gre #### Problems with k-anonymity k-anonymity means that for each row, at least k-1 other rows are identical. -| Age | From 48b5fcab7eb8553a5d63c6d939592d7b5b90eff8 Mon Sep 17 00:00:00 2001 From: fria <138676274+friadev@users.noreply.github.com> Date: Sat, 12 Jul 2025 06:45:27 -0500 Subject: [PATCH 16/40] add k-anonymity --- blog/posts/differential-privacy.md | 18 ++++++++++++++++-- 1 file changed, 16 insertions(+), 2 deletions(-) diff --git a/blog/posts/differential-privacy.md b/blog/posts/differential-privacy.md index 4d8d7ab1..a0692da0 100644 --- a/blog/posts/differential-privacy.md +++ b/blog/posts/differential-privacy.md @@ -58,7 +58,7 @@ For certain questions like "have you committed tax fraud?", respondents will lik Have you committed tax fraud? | Respondent | Answer | Coin Flip (not included in the actual dataset just here for illustration) | -| ---- | ---- | +| --- | --- | --- | | 1 | Yes | Tails (Answer Yes) | | 2 | No | Heads (Answer Truthfully) | | 3 | Yes | Heads (Answer Truthfully) | | 4 | Yes | Tails (Answer Yes) | | 5 | No | Heads (Answer Truthfully) | @@ -81,9 +81,23 @@ A variation used later in a [paper](https://www.jstor.org/stable/2283636) by Gre | 4 | Have you ever committed tax evasion? | Yes | | 5 | Have you ever committed tax evasion? | No | +#### k-Anonymity + +Latanya Sweeney and + +It's interesting that even all the way back in 1998 concerns about constant data collection were already relevant. + +>Most actions in daily life are recorded on some computer somewhere That information in turn is often shared exc hanged and sold. Many people may not care that the local grocer keeps track of which items they purchase but shared information can be quite damaging to individuals or organizations. Improp er disclosure of medical information financial information or matters of national security can have alarming ramications and many abuses have been cited. + +k-anonymity means that for each row, at least k-1 other rows are identical. In practical terms, no individual row is unique, so no one can be uniquely identified by the data. + +##### Generalization + +This is achieved through a few techniques, one of which is generalization. Generalization is reducin #### Problems with k-anonymity 
+ ### Dawn of Differential Privacy From 1bb13346f663bbcde3ae59cc0f1bb8f3d2800712 Mon Sep 17 00:00:00 2001 From: fria <138676274+friadev@users.noreply.github.com> Date: Sat, 12 Jul 2025 07:07:19 -0500 Subject: [PATCH 17/40] add generalization and suppression --- blog/posts/differential-privacy.md | 16 +++++++++++++--- 1 file changed, 13 insertions(+), 3 deletions(-) diff --git a/blog/posts/differential-privacy.md b/blog/posts/differential-privacy.md index a0692da0..b0b96a25 100644 --- a/blog/posts/differential-privacy.md +++ b/blog/posts/differential-privacy.md @@ -83,17 +83,27 @@ A variation used later in a [paper](https://www.jstor.org/stable/2283636) by Gre #### k-Anonymity -Latanya Sweeney and +Latanya Sweeney and Pierangela Samarati introduced [k-anonymity](https://dataprivacylab.org/dataprivacy/projects/kanonymity/paper3.pdf) to the world back in 1998. It's interesting that even all the way back in 1998 concerns constant data collection were already relevant. >Most actions in daily life are recorded on some computer somewhere That information in turn is often shared exc hanged and sold. Many people may not care that the local grocer keeps track of which items they purchase but shared information can be quite damaging to individuals or organizations. Improp er disclosure of medical information financial information or matters of national security can have alarming ramications and many abuses have been cited. -k-anonymity means that for each row, at least k-1 other rows are identical. In practical terms, no individual row is unique, so no one can be uniquely identified by the data. +In a dataset, you might have removed names and other obviously identifying information, but there might be other data such as birthday, ZIP code, etc that might be unique to one person in the dataset. If someone were to crossreference this data with outside data, it could be possible to deanonymize individuals. + +k-anonymity means that for each row, at least k-1 other rows are identical. So for a k of 2, at least one other row is identical to each row. ##### Generalization -This is achieved through a few techniques, one of which is generalization. Generalization is reducin +This is achieved through a few techniques, one of which is generalization. Generalization is reducing the precision of data so that it's not as unique. + +For example, instead of recording an exact age, you might give a range like 20-30. You've probably noticed this on surveys you've taken before. Data like this that's not directly identifiable but could be used to re-identify someone is refered to as *quasi-identifiers*. + +##### Suppression + +Sometimes even with generalization, you might have outliers that don't satisfy the k-anonymity requirements. + +In these cases, you can simply remove the row entirely. #### Problems with k-anonymity From cedf8fe53a8058810a5d1e16076e6ef41904b9e4 Mon Sep 17 00:00:00 2001 From: fria <138676274+friadev@users.noreply.github.com> Date: Sat, 12 Jul 2025 07:17:26 -0500 Subject: [PATCH 18/40] remove problems with k-anonymity --- blog/posts/differential-privacy.md | 4 ---- 1 file changed, 4 deletions(-) diff --git a/blog/posts/differential-privacy.md b/blog/posts/differential-privacy.md index b0b96a25..2e3e49b6 100644 --- a/blog/posts/differential-privacy.md +++ b/blog/posts/differential-privacy.md @@ -105,10 +105,6 @@ Sometimes even with generalization, you might have outliers that don't satisfy t In these cases, you can simply remove the row entirely. 
#### Problems with k-anonymity From cedf8fe53a8058810a5d1e16076e6ef41904b9e4 Mon Sep 17 00:00:00 2001 From: fria <138676274+friadev@users.noreply.github.com> Date: Sat, 12 Jul 2025 07:17:26 -0500 Subject: [PATCH 18/40] remove problems with k-anonymity --- blog/posts/differential-privacy.md | 4 ---- 1 file changed, 4 deletions(-) diff --git a/blog/posts/differential-privacy.md b/blog/posts/differential-privacy.md index b0b96a25..2e3e49b6 100644 --- a/blog/posts/differential-privacy.md +++ b/blog/posts/differential-privacy.md @@ -105,10 +105,6 @@ In these cases, you can simply remove the row entirely. -#### Problems with k-anonymity - - ### Dawn of Differential Privacy From 3c8957f11d3e568de6ceb14bdae2ff255247832a Mon Sep 17 00:00:00 2001 From: fria <138676274+friadev@users.noreply.github.com> Date: Sat, 12 Jul 2025 07:28:05 -0500 Subject: [PATCH 19/40] fix typo --- blog/posts/differential-privacy.md | 10 ++++++++-- 1 file changed, 8 insertions(+), 2 deletions(-) diff --git a/blog/posts/differential-privacy.md b/blog/posts/differential-privacy.md index 2e3e49b6..40770313 100644 --- a/blog/posts/differential-privacy.md +++ b/blog/posts/differential-privacy.md @@ -87,7 +87,7 @@ Latanya Sweeney and Pierangela Samarati introduced [k-anonymity](https://datapri It's interesting that even all the way back in 1998 concerns about constant data collection were already relevant. ->Most actions in daily life are recorded on some computer somewhere That information in turn is often shared exc hanged and sold. Many people may not care that the local grocer keeps track of which items they purchase but shared information can be quite damaging to individuals or organizations. Improp er disclosure of medical information financial information or matters of national security can have alarming ramications and many abuses have been cited. +>Most actions in daily life are recorded on some computer somewhere. That information in turn is often shared, exchanged, and sold. Many people may not care that the local grocer keeps track of which items they purchase but shared information can be quite damaging to individuals or organizations. Improper disclosure of medical information, financial information, or matters of national security can have alarming ramifications and many abuses have been cited. In a dataset, you might have removed names and other obviously identifying information, but there might be other data such as birthday, ZIP code, etc. that might be unique to one person in the dataset. If someone were to cross-reference this data with outside data, it could be possible to deanonymize individuals. @@ -111,7 +111,13 @@ Most of the concepts I write about seem to come from the 70's and 80's, but diff The paper introduces the idea of adding noise to data to achieve privacy. Of cou -Importantly, differential privacy adds noise *before* it's analyzed. k-anonymity relies on trying to anonymize data *after* it's collected, so it leaves the possibility that not enough parameters are removed to ensure each individual cannot be identified. +#### Central Differential Privacy + +This early form of differential privacy relied on adding noise to the data *after* it was already collected, meaning you still have to trust a central authority with the raw data. + +#### Local Differential Privacy + +In many later implementations of differential privacy, noise is added to data on-device before it's sent off to any server. This removes the need to trust the central authority to handle your raw data. 
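As a sketch of the local model (an illustrative example of randomized response on a single yes/no bit, not any company's actual mechanism), each device randomizes its answer before it's ever sent, and the server can still recover the overall rate from the noisy reports:

```python
import math
import random

def randomize_bit(true_bit: int, epsilon: float) -> int:
    # On-device: report the truth with probability e^eps / (1 + e^eps),
    # otherwise report the opposite. No raw answer ever leaves the device.
    p_truth = math.exp(epsilon) / (1 + math.exp(epsilon))
    return true_bit if random.random() < p_truth else 1 - true_bit

def estimate_rate(reports: list[int], epsilon: float) -> float:
    # Server side: debias the aggregate of noisy reports.
    p = math.exp(epsilon) / (1 + math.exp(epsilon))
    observed = sum(reports) / len(reports)
    # E[observed] = p * rate + (1 - p) * (1 - rate), solved for rate:
    return (observed - (1 - p)) / (2 * p - 1)

random.seed(42)
truth = [1] * 300 + [0] * 700  # 30% of users truly answer "yes"
reports = [randomize_bit(b, epsilon=1.0) for b in truth]
print(estimate_rate(reports, epsilon=1.0))  # comes out near 0.3
```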
### Google RAPPOR From 52daa0c5a3f1b3d9e6c04d73695ff93de7bdc139 Mon Sep 17 00:00:00 2001 From: fria <138676274+friadev@users.noreply.github.com> Date: Sat, 12 Jul 2025 07:28:25 -0500 Subject: [PATCH 20/40] remove table title --- blog/posts/differential-privacy.md | 2 -- 1 file changed, 2 deletions(-) diff --git a/blog/posts/differential-privacy.md b/blog/posts/differential-privacy.md index 40770313..b3ddd1e7 100644 --- a/blog/posts/differential-privacy.md +++ b/blog/posts/differential-privacy.md @@ -55,8 +55,6 @@ One of the earliest ideas for anonymizing data was [randomized response](https:/ For certain questions like "have you committed tax fraud?", respondents will likely be hesitant to answer truthfully. The solution? Have the respondent flip a coin. If the coin is tails, answer yes. If the coin lands on heads, answer truthfully. -Have you committed tax fraud? - | Respondent | Answer | Coin Flip (not included in the actual dataset just here for illustration) | | --- | --- | --- | | 1 | Yes | Tails (Answer Yes) | From 30ff6cdc52b32d0bc6d608cda64aa37fb0f47816 Mon Sep 17 00:00:00 2001 From: fria <138676274+friadev@users.noreply.github.com> Date: Sat, 12 Jul 2025 07:29:01 -0500 Subject: [PATCH 21/40] fix typo --- blog/posts/differential-privacy.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/blog/posts/differential-privacy.md b/blog/posts/differential-privacy.md index b3ddd1e7..d1ad3c35 100644 --- a/blog/posts/differential-privacy.md +++ b/blog/posts/differential-privacy.md @@ -95,7 +95,7 @@ k-anonymity means that for each row, at least k-1 other rows are identical. So f This is achieved through a few techniques, one of which is generalization. Generalization is reducing the precision of data so that it's not as unique. -For example, instead of recording an exact age, you might give a range like 20-30. You've probably noticed this on surveys you've taken before. Data like this that's not directly identifiable but could be used to re-identify someone is refered to as *quasi-identifiers*. +For example, instead of recording an exact age, you might give a range like 20-30. You've probably noticed this on surveys you've taken before. Data like this that's not directly identifiable but could be used to re-identify someone is referred to as *quasi-identifiers*. From e6603cde4b5628f9d35e9eb3d8d10822f90d3a20 Mon Sep 17 00:00:00 2001 From: fria <138676274+friadev@users.noreply.github.com> Date: Sat, 12 Jul 2025 07:30:06 -0500 Subject: [PATCH 22/40] add more detail --- blog/posts/differential-privacy.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/blog/posts/differential-privacy.md b/blog/posts/differential-privacy.md index d1ad3c35..96839902 100644 --- a/blog/posts/differential-privacy.md +++ b/blog/posts/differential-privacy.md @@ -107,7 +107,7 @@ In these cases, you can simply remove the row entirely. Most of the concepts I write about seem to come from the 70's and 80's, but differential privacy is a relatively new concept. It was first introduced in a paper from 2006 called [*Calibrating Noise to Sensitivity in Private Data Analysis*](https://desfontain.es/PDFs/PhD/CalibratingNoiseToSensitivityInPrivateDataAnalysis.pdf). -The paper introduces the idea of adding noise to data to achieve privacy. Of course, adding noise to the dataset reduces its accuracy. Ɛ defines the amount of noise added to the dataset, with a small Ɛ meaning more privacy but less accurate data and vice versa. It's also referred to as the "privacy loss parameter". +The paper introduces the idea of adding noise to data to achieve privacy. Of course, adding noise to the dataset reduces its accuracy. Ɛ defines the amount of noise added to the dataset, with a small Ɛ meaning more privacy but less accurate data and vice versa. It's also referred to as the "privacy loss parameter" or "privacy budget". 
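The core mechanism from the 2006 paper can be sketched in a few lines. This is a simplified illustration for a single counting query; real deployments also have to track a cumulative privacy budget across queries:

```python
import numpy as np

def noisy_count(true_count: int, epsilon: float) -> float:
    # A counting query changes by at most 1 when one person is added or
    # removed (sensitivity 1), so Laplace noise with scale 1/epsilon
    # is enough for epsilon-differential privacy.
    return true_count + np.random.laplace(loc=0.0, scale=1.0 / epsilon)

for eps in (0.1, 1.0, 10.0):
    # Smaller epsilon -> wider noise -> more privacy, less accuracy.
    print(eps, round(noisy_count(1000, eps), 1))
```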
+The paper introduces the idea of adding noise to data to achieve privacy. Of course, adding noise to the dataset reduces its accuracy. Ɛ defines the amount of noise added to the dataset, with a small Ɛ meaning more privacy but less accurate data and vice versa. It's also referred to as the "privacy loss parameter" or "privacy budget". #### Central Differential Privacy From e79526612fde0d42bd9ac9602a2d3a954b14068a Mon Sep 17 00:00:00 2001 From: fria <138676274+friadev@users.noreply.github.com> Date: Sat, 12 Jul 2025 07:37:18 -0500 Subject: [PATCH 23/40] move local differential privacy --- blog/posts/differential-privacy.md | 7 +++---- 1 file changed, 3 insertions(+), 4 deletions(-) diff --git a/blog/posts/differential-privacy.md b/blog/posts/differential-privacy.md index 96839902..6b312a90 100644 --- a/blog/posts/differential-privacy.md +++ b/blog/posts/differential-privacy.md @@ -113,13 +113,12 @@ The paper introduces the idea of adding noise to data to achieve privacy. Of cou This early form of differential privacy relied on adding noise to the data *after* it was already collected, meaning you still have to trust a central authority with the raw data. -#### Local Differential Privacy - -In many later implementations of differential privacy, noise is added to data on-device before it's sent off to any server. This removes the need to trust the central authority to handle your raw data. - ### Google RAPPOR In 2014, Google introduced [Randomized Aggregatable Privacy-Preserving Ordinal Response](https://arxiv.org/pdf/1407.6981) (RAPPOR), their [open source](https://github.com/google/rappor) implementation of differential privacy, with a few improvements. +#### Local Differential Privacy + +In Google's implementation, noise is added to data on-device before it's sent off to any server. This removes the need to trust the central authority to handle your raw data, an important step in achieving truly anonymous data collection. From 77e30648edeb10638ed14107e604c65153c2378f Mon Sep 17 00:00:00 2001 From: fria <138676274+friadev@users.noreply.github.com> Date: Sat, 12 Jul 2025 07:40:01 -0500 Subject: [PATCH 24/40] fix typo --- blog/posts/differential-privacy.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/blog/posts/differential-privacy.md b/blog/posts/differential-privacy.md index 6b312a90..40ab29e2 100644 --- a/blog/posts/differential-privacy.md +++ b/blog/posts/differential-privacy.md @@ -33,7 +33,7 @@ Being able to collect aggregate data is essential for research. It's what the U. Usually we're more interested in the data as a whole and not data of individual people as it can show trends and overall patterns in groups of people. However, in order to get that data we must collect it from individuals. -It was thought at first that simply r[emoving names and other obviously identifying details](https://simons.berkeley.edu/news/differential-privacy-issues-policymakers#:~:text=Prior%20to%20the%20line%20of%20research%20that%20led%20to%20differential%20privacy%2C%20it%20was%20widely%20believed%20that%20anonymizing%20data%20was%20a%20relatively%20straightforward%20and%20sufficient%20solution%20to%20the%20privacy%20challenge.%20Statistical%20aggregates%20could%20be%20released%2C%20many%20people%20thought%2C%20without%20revealing%20underlying%20personally%20identifiable%20data.%20Data%20sets%20could%20be%20released%20to%20researchers%20scrubbed%20of%20names%2C%20but%20otherwise%20with%20rich%20individual%20information%2C%20and%20were%20thought%20to%20have%20been%20anonymized.) 
from the data was enough to prevent re-identification, but [Latanya Sweeney](https://latanyasweeney.org/JLME.pdf) (a name that will pop up a few more times) proved in 1997 that even without names, a significant portion of individuals can be re-identified from a dataset by cross-referencing external data. +It was thought at first that simply [removing names and other obviously identifying details](https://simons.berkeley.edu/news/differential-privacy-issues-policymakers#:~:text=Prior%20to%20the%20line%20of%20research%20that%20led%20to%20differential%20privacy%2C%20it%20was%20widely%20believed%20that%20anonymizing%20data%20was%20a%20relatively%20straightforward%20and%20sufficient%20solution%20to%20the%20privacy%20challenge.%20Statistical%20aggregates%20could%20be%20released%2C%20many%20people%20thought%2C%20without%20revealing%20underlying%20personally%20identifiable%20data.%20Data%20sets%20could%20be%20released%20to%20researchers%20scrubbed%20of%20names%2C%20but%20otherwise%20with%20rich%20individual%20information%2C%20and%20were%20thought%20to%20have%20been%20anonymized.) from the data was enough to prevent re-identification, but [Latanya Sweeney](https://latanyasweeney.org/JLME.pdf) (a name that will pop up a few more times) proved in 1997 that even without names, a significant portion of individuals can be re-identified from a dataset by cross-referencing external data. Previous attempts at anonymizing data have proven highly vulnerable to reidentification attacks. From a49677e36a6570deefb378e899e5c6ec211dac20 Mon Sep 17 00:00:00 2001 From: fria <138676274+friadev@users.noreply.github.com> Date: Mon, 14 Jul 2025 06:11:53 -0500 Subject: [PATCH 25/40] add more info about apple --- blog/posts/differential-privacy.md | 65 ++++++++++++++++++++++++++++-- 1 file changed, 62 insertions(+), 3 deletions(-) diff --git a/blog/posts/differential-privacy.md b/blog/posts/differential-privacy.md index 40ab29e2..260d7294 100644 --- a/blog/posts/differential-privacy.md +++ b/blog/posts/differential-privacy.md @@ -113,12 +113,71 @@ The paper introduces the idea of adding noise to data to achieve privacy. Of cou This early form of differential privacy relied on adding noise to the data *after* it was already collected, meaning you still have to trust a central authority with the raw data. -### Google RAPPOR +## Google RAPPOR -In 2014, Google introduced [Randomized Aggregatable Privacy-Preserving Ordinal Response](https://arxiv.org/pdf/1407.6981) (RAPPOR), their [open source](https://github.com/google/rappor) implementation of differential privacy, with a few improvements. +In 2014, Google introduced [Randomized Aggregatable Privacy-Preserving Ordinal Response](https://arxiv.org/pdf/1407.6981) (RAPPOR), their [open source](https://github.com/google/rappor) implementation of differential privacy. + +Google RAPPOR implements and builds on previous techniques such as randomized response and adds significant improvements on top. + -#### Local Differential Privacy +### Local Differential Privacy In Google's implementation, noise is added to data on-device before it's sent off to any server. This removes the need to trust the central authority to handle your raw data, an important step in achieving truly anonymous data collection. +### Bloom Filters + +Google RAPPOR makes use of a clever technique called bloom filters that saves space and improves privacy. 
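Here's a minimal sketch of the idea to go with the walk-through below (the filter size and the way hash positions are derived are illustrative assumptions, not RAPPOR's actual parameters):

```python
import hashlib

SIZE = 16        # illustrative filter size
NUM_HASHES = 3   # illustrative number of hash functions

def positions(value: str) -> list[int]:
    # Derive NUM_HASHES bit positions by hashing the value with a salt.
    return [
        int.from_bytes(hashlib.sha256(f"{i}:{value}".encode()).digest()[:4], "big") % SIZE
        for i in range(NUM_HASHES)
    ]

def add(bits: list[int], value: str) -> None:
    for pos in positions(value):
        bits[pos] = 1

def might_contain(bits: list[int], value: str) -> bool:
    # True may be a false positive; False is always definitive.
    return all(bits[pos] for pos in positions(value))

bloom = [0] * SIZE
add(bloom, "apple")
print(might_contain(bloom, "apple"))   # True
print(might_contain(bloom, "banana"))  # False (almost certainly)
```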
+ +Bloom filters work by starting out with an array of all 0's + +`[0, 0, 0, 0, 0, 0, 0, 0, 0]` + +Then, you run data such as the word "apple" through a hashing algorithm, which will give 1's in specific positions, say positions 1, 3, and 5. + +`[0, 1, 0, 1, 0, 1, 0, 0, 0]` + +When you want to check if data is present, you run the data through the hashing algorithm and check if the corresponding positions are 1's. If they are, the data *might* be present (other data might have flipped those same bits at some point). If any of the 1's are 0's, then you know for sure that the data is not in the set. + +### Permanent Randomized Response + +A randomization step is performed, flipping some of the bits randomly. This response is then "memoized" so that the same random values are used for future reporting. This protects against an "averaging" attack where an attacker sees multiple responses from the same user and can eventually recover the real value by averaging them out over time. + +### Instantaneous Randomized Response + +On top of the permanent randomized data, another randomization step is performed. This time, different randomness is added on top of the permanent randomness so that every response sent is unique. This prevents an attacker from identifying a user by seeing the same randomized pattern over and over again. + +Both the permanent and instantaneous randomized response steps can be fine-tuned for the desired privacy. + +### Chrome + +Google first used differential privacy in their Chrome browser for detection of [malware](https://blog.chromium.org/2014/10/learning-statistics-with-privacy-aided.html). + +Differential privacy is also used in Google's [Privacy Sandbox](https://privacysandbox.google.com/private-advertising/aggregation-service/privacy-protection-report-strategy). + +### Maps + +Google Maps uses DP for its [place busyness](https://safety.google/privacy/data/#:~:text=To%20offer%20features%20like%20place%20busyness%20in%20Maps%2C%20we%20apply%20an%20advanced%20anonymization%20technology%20called%20differential%20privacy%20that%20adds%20noise%20to%20your%20information%20so%20it%20can%E2%80%99t%20be%20used%20to%20personally%20identify%20you.) feature, allowing Maps to show you how busy an area is without revealing the movements of individual people. + +### Google Fi + +[Google Fi](https://opensource.googleblog.com/2019/09/enabling-developers-and-organizations.html) uses differential privacy as well to improve the service. + +## OpenDP + +[OpenDP](https://opendp.org) is a community effort to build open source and trustworthy tools for differential privacy. Their members consist of academics from prestigious universities like Harvard and employees at companies like Microsoft. + +There's been an effort from everyone to make differential privacy implementations open source, which is a breath of fresh air from companies that typically stick to closed source for their products. + +## Apple + +[Apple](https://www.apple.com/privacy/docs/Differential_Privacy_Overview.pdf) uses local differential privacy for much of its services, similar to what Google does. They add noise before sending any data off device, enabling them to collect aggregate data without harming the privacy of any individual user. + +They limit the number of contributions any one user can make via a *privacy budget*, confusingly also represented by epsilon, so you won't have to worry about your own contributions being averaged out over time and revealing your own trends. 
Some of the things they use differential privacy for include - QuickType suggestions - Emoji suggestions - Lookup Hints - Safari Energy Draining Domains - Safari Autoplay Intent Detection - Safari Crashing Domains - Health Type Usage From e4250e0e589b51a7a2dd2b2434393205788476fd Mon Sep 17 00:00:00 2001 From: fria <138676274+friadev@users.noreply.github.com> Date: Mon, 14 Jul 2025 06:36:47 -0500 Subject: [PATCH 26/40] add sketch matrix --- blog/posts/differential-privacy.md | 7 +++++++ 1 file changed, 7 insertions(+) diff --git a/blog/posts/differential-privacy.md b/blog/posts/differential-privacy.md index 260d7294..8d18d356 100644 --- a/blog/posts/differential-privacy.md +++ b/blog/posts/differential-privacy.md @@ -181,3 +181,10 @@ They limit the number of contributions any one user can make via a *privacy budg - Safari Crashing Domains - Health Type Usage +That's just based on their initial whitepaper; they've likely increased their use of DP since then. + +### Sketch Matrix + +Apple uses a similar method to Google, with a matrix initialized with all zeros. The input for the matrix is encoded with the SHA-256 hashing algorithm, and then bits are flipped randomly at a probability dependent on the epsilon value. + +Apple only sends a random row from this matrix instead of the entire thing in order to stay within their privacy budget. \ No newline at end of file From 6555000d37c4c808929aff3fdce9b85fc9719ea4 Mon Sep 17 00:00:00 2001 From: fria <138676274+friadev@users.noreply.github.com> Date: Mon, 14 Jul 2025 07:01:52 -0500 Subject: [PATCH 27/40] add us census --- blog/posts/differential-privacy.md | 16 +++++++++++++++- 1 file changed, 15 insertions(+), 1 deletion(-) diff --git a/blog/posts/differential-privacy.md b/blog/posts/differential-privacy.md index 8d18d356..beeb2167 100644 --- a/blog/posts/differential-privacy.md +++ b/blog/posts/differential-privacy.md @@ -187,4 +187,18 @@ That's just based on their initial whitepaper; they've likely increased their us ### Sketch Matrix Apple uses a similar method to Google, with a matrix initialized with all zeros. The input for the matrix is encoded with the SHA-256 hashing algorithm, and then bits are flipped randomly at a probability dependent on the epsilon value. -Apple only sends a random row from this matrix instead of the entire thing in order to stay within their privacy budget. +Apple only sends a random row from this matrix instead of the entire thing in order to stay within their privacy budget. + +### See What's Sent + +You can see data sent with differential privacy in iOS under Settings > Privacy > Analytics > Analytics Data. It will begin with DifferentialPrivacy. On macOS, you can see these logs in the Console. + +## U.S. Census + +Differential privacy isn't just used by big corporations. In 2020, the U.S. Census famously used DP to protect the data of U.S. citizens for the first time. + +As a massive collection of data from a large number of U.S. citizens, it's important for the Census Bureau to protect the privacy of census participants while still preserving the overall aggregate data. + +### Impetus + +Since the 1990's, the U.S. 
Census used a less formal injection of statistical noise into their data \ No newline at end of file From fae748bdb1e1b9311945d42fd123ffd8b09bd9b9 Mon Sep 17 00:00:00 2001 From: fria <138676274+friadev@users.noreply.github.com> Date: Mon, 14 Jul 2025 07:07:08 -0500 Subject: [PATCH 28/40] add impetus --- blog/posts/differential-privacy.md | 7 ++++++- 1 file changed, 6 insertions(+), 1 deletion(-) diff --git a/blog/posts/differential-privacy.md b/blog/posts/differential-privacy.md index beeb2167..4a2782db 100644 --- a/blog/posts/differential-privacy.md +++ b/blog/posts/differential-privacy.md @@ -201,4 +201,9 @@ As a massive collection of data from a large number of U.S. citizens, it's impor ### Impetus -Since the 1990's, the U.S. Census used a less formal injection of statistical noise into their data \ No newline at end of file +Since the 1990's, the U.S. Census used a less formal injection of statistical noise into their data, which they did all the way through 2010. + +After the 2010 census, the bureau tried to reidentify individuals in the census data. + +>The experiment resulted in reconstruction of a dataset of more than 300 million individuals. The Census Bureau then used that dataset to match the reconstructed records to four commercially available data sources, to attempt to identify the age, sex, race, and Hispanic origin of people in more than six million blocks in the 2010 Census. From d67e8fe1d5b21762f53c8ced4ec9a5943c4a772d Mon Sep 17 00:00:00 2001 From: fria <138676274+friadev@users.noreply.github.com> Date: Mon, 14 Jul 2025 07:12:31 -0500 Subject: [PATCH 29/40] add info about reidentification on the US census --- blog/posts/differential-privacy.md | 6 +++++- 1 file changed, 5 insertions(+), 1 deletion(-) diff --git a/blog/posts/differential-privacy.md b/blog/posts/differential-privacy.md index 4a2782db..7e57a196 100644 --- a/blog/posts/differential-privacy.md +++ b/blog/posts/differential-privacy.md @@ -203,7 +203,11 @@ As a massive collection of data from a large number of U.S. citizens, it's impor -After the 2010 census, the bureau tried to reidentify individuals in the census data. +After the 2010 census, the bureau tried to [reidentify individuals](https://www2.census.gov/library/publications/decennial/2020/census-briefs/c2020br-03.pdf) in the census data. 
+ From 8873db8e5acd629501cb252b1a148e8fee701027 Mon Sep 17 00:00:00 2001 From: fria <138676274+friadev@users.noreply.github.com> Date: Mon, 14 Jul 2025 07:29:53 -0500 Subject: [PATCH 30/40] add last part of census --- blog/posts/differential-privacy.md | 1 + 1 file changed, 1 insertion(+) diff --git a/blog/posts/differential-privacy.md b/blog/posts/differential-privacy.md index 7e57a196..0e96da53 100644 --- a/blog/posts/differential-privacy.md +++ b/blog/posts/differential-privacy.md @@ -211,3 +211,4 @@ Considering 309 million people lived in the U.S. in 2010, that's a devastating b >Nationwide, roughly 150 million individuals—almost one-half of the population, have a unique combination of sex and single year of age at the block level. +They could keep adding noise until these attacks are impossible, but that would make the data nigh unusable. Instead, differential privacy offers a mathematically regiorous method to protect the data from future reidentification attacks without ruining the data by adding too much noise. They can be sure thanks to the mathematical guarantees of DP. \ No newline at end of file From a1c690dd236d4d61db7e6b427c52615f87296417 Mon Sep 17 00:00:00 2001 From: fria <138676274+friadev@users.noreply.github.com> Date: Mon, 14 Jul 2025 07:30:19 -0500 Subject: [PATCH 31/40] add codeblock --- blog/posts/differential-privacy.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/blog/posts/differential-privacy.md b/blog/posts/differential-privacy.md index 0e96da53..9e4c2597 100644 --- a/blog/posts/differential-privacy.md +++ b/blog/posts/differential-privacy.md @@ -191,7 +191,7 @@ Apple only sends a random row from this matrix instead of the entire thing in or ### See What's Sent -You can see data sent with differential privacy in iOS under Settings > Privacy > Analytics > Analytics Data, it will begin with DifferentialPrivacy. On macOS, you can see these logs in the Console. +You can see data sent with differential privacy in iOS under Settings > Privacy > Analytics > Analytics Data, it will begin with `DifferentialPrivacy`. On macOS, you can see these logs in the Console. ## U.S. Census From c15c837d2c8a5ed3de33cb2750cf13db5ddbab39 Mon Sep 17 00:00:00 2001 From: fria <138676274+friadev@users.noreply.github.com> Date: Mon, 14 Jul 2025 07:39:29 -0500 Subject: [PATCH 32/40] fix typo --- blog/posts/differential-privacy.md | 8 ++++++-- 1 file changed, 6 insertions(+), 2 deletions(-) diff --git a/blog/posts/differential-privacy.md b/blog/posts/differential-privacy.md index 9e4c2597..dbbe94e1 100644 --- a/blog/posts/differential-privacy.md +++ b/blog/posts/differential-privacy.md @@ -171,7 +171,11 @@ There's been an effort from everyone to make differential privacy implementation [Apple](https://www.apple.com/privacy/docs/Differential_Privacy_Overview.pdf) uses local differential privacy for much of its services, similar to what Google does. They add noise before sending any data off device, enabling them to collect aggregate data without harming the privacy of any individual user. -They limit the number of contributions any one user can make via a *privacy budget*, confusingly also represented by epsilon, so you won't have to worry about your own contributions being averaged out over time and revealing your own trends. 
-Some of the things they use differential privacy for include
+They limit the number of contributions any one user can make via a *privacy budget*, confusingly also represented by epsilon, so you won't have to worry about your own contributions being averaged out over time and revealing your own trends.
+
+This allows them to find new words that people use that aren't included by default in the dictionary, or find which emojis are the most popular.
+
+Some of the things they use differential privacy for include:
 
 - QuickType suggestions
 - Emoji suggestions
@@ -211,4 +215,4 @@ Considering 309 million people lived in the U.S. in 2010, that's a devastating b
 
 >Nationwide, roughly 150 million individuals—almost one-half of the population, have a unique combination of sex and single year of age at the block level.
 
-They could keep adding noise until these attacks are impossible, but that would make the data nigh unusable. Instead, differential privacy offers a mathematically regiorous method to protect the data from future reidentification attacks without ruining the data by adding too much noise. They can be sure thanks to the mathematical guarantees of DP.
\ No newline at end of file
+They could keep adding noise until these attacks are impossible, but that would make the data nigh unusable. Instead, differential privacy offers a mathematically rigorous method to protect the data from future reidentification attacks without ruining the data by adding too much noise. They can be sure thanks to the mathematical guarantees of DP.

From 4ef46c72eae37fdfd34a8e0eadcbc637e6f5e69d Mon Sep 17 00:00:00 2001
From: fria <138676274+friadev@users.noreply.github.com>
Date: Mon, 14 Jul 2025 08:03:24 -0500
Subject: [PATCH 33/40] add future conclusion

---
 blog/posts/differential-privacy.md | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/blog/posts/differential-privacy.md b/blog/posts/differential-privacy.md
index dbbe94e1..259df17c 100644
--- a/blog/posts/differential-privacy.md
+++ b/blog/posts/differential-privacy.md
@@ -215,4 +215,10 @@ Considering 309 million people lived in the U.S. in 2010, that's a devastating b
 
 >Nationwide, roughly 150 million individuals—almost one-half of the population, have a unique combination of sex and single year of age at the block level.
 
-They could keep adding noise until these attacks are impossible, but that would make the data nigh unusable. Instead, differential privacy offers a mathematically rigorous method to protect the data from future reidentification attacks without ruining the data by adding too much noise. They can be sure thanks to the mathematical guarantees of DP.
+They could keep adding noise until these attacks are impossible, but that would make the data nigh unusable. Instead, differential privacy offers a mathematically rigorous method to protect the data from future reidentification attacks without ruining the data by adding too much noise. They can be sure of this thanks to the mathematical guarantees of DP.
+
+## Future of Differential Privacy
+
+Differential privacy unlocks the potential for data collection with minimal risk of data exposure for any individual. Already, DP has allowed software developers to improve their software and opened up new possibilities for research in the health sector and in government organizations.
+
+Adoption of scientifically and mathematically rigorous methods of data collection allows for
\ No newline at end of file

From 73329f0f46ff061d413f5f6597347f6e65c3000e Mon Sep 17 00:00:00 2001
From: fria <138676274+friadev@users.noreply.github.com>
Date: Mon, 14 Jul 2025 08:13:27 -0500
Subject: [PATCH 34/40] add conclusion

---
 blog/posts/differential-privacy.md | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/blog/posts/differential-privacy.md b/blog/posts/differential-privacy.md
index 259df17c..fb9c5be4 100644
--- a/blog/posts/differential-privacy.md
+++ b/blog/posts/differential-privacy.md
@@ -221,4 +221,6 @@ They could keep adding noise until these attacks are impossible, but that would
 
 Differential privacy unlocks the potential for data collection with minimal risk of data exposure for any individual. Already, DP has allowed software developers to improve their software and opened up new possibilities for research in the health sector and in government organizations.
 
-Adoption of scientifically and mathematically rigorous methods of data collection allows for
\ No newline at end of file
+Adoption of scientifically and mathematically rigorous methods of data collection will allow organizations to collect aggregate data while increasing public trust in those organizations and, subsequently, the potential for research that will result in improvements to our everyday lives.
+
+I think for there to be more public trust, there needs to be greater public outreach. That's my goal with this series: I'm hoping to at least increase awareness of some of the technology being deployed to protect your data, especially since so much of the news we hear is negative. Armed with the knowledge of what's available, we can also demand that companies and organizations use these tools if they aren't already.
\ No newline at end of file

From ea37b0f759f151fdfa0224dbc305d7600f2580b7 Mon Sep 17 00:00:00 2001
From: fria <138676274+friadev@users.noreply.github.com>
Date: Mon, 14 Jul 2025 08:17:08 -0500
Subject: [PATCH 35/40] add example table

---
 blog/posts/differential-privacy.md | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/blog/posts/differential-privacy.md b/blog/posts/differential-privacy.md
index fb9c5be4..d8fe6b63 100644
--- a/blog/posts/differential-privacy.md
+++ b/blog/posts/differential-privacy.md
@@ -63,6 +63,12 @@ For certain questions like "have you committed tax fraud?", respondents will lik
 | 4 | Yes | Tails (Answer Yes) |
 | 5 | No | Heads (Answer Truthfully) |
 
+| Method | Description |
+| ----------- | ------------------------------------ |
+| `GET` | :material-check: Fetch resource |
+| `PUT` | :material-check-all: Update resource |
+| `DELETE` | :material-close: Delete resource |
+
 Because we know the exact probability that a "Yes" answer is fake, 50%, we can remove it and give a rough estimate of how many respondents answered "Yes" truthfully.
 
 Randomized Response would lay the groundwork for differential privacy, but it wouldn't truly be realized for many decades.
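+
+As a rough illustration (made-up numbers, not from Warner's paper), here's a short simulation showing how the true rate is recovered from the noisy answers:
+
+```python
+import random
+
+def randomized_response(truth: bool) -> bool:
+    # Tails: always answer "Yes". Heads: answer truthfully.
+    if random.random() < 0.5:  # tails
+        return True
+    return truth  # heads
+
+random.seed(42)
+true_rate = 0.10  # pretend 10% of respondents really committed fraud
+answers = [
+    randomized_response(random.random() < true_rate) for _ in range(100_000)
+]
+
+observed = sum(answers) / len(answers)
+# P(Yes) = 0.5 + 0.5 * true_rate, so invert it to estimate the true rate:
+estimated = 2 * observed - 1
+print(f"observed Yes rate: {observed:.3f}, estimated true rate: {estimated:.3f}")
+```
+
+No individual "Yes" proves anything, since it had a 50% chance of being forced by the coin, yet the aggregate estimate lands close to the real rate.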

From d43be006b22f1efe64b2748fbebdcae0cb960941 Mon Sep 17 00:00:00 2001
From: fria <138676274+friadev@users.noreply.github.com>
Date: Mon, 14 Jul 2025 08:25:08 -0500
Subject: [PATCH 36/40] try to fix table

---
 blog/posts/differential-privacy.md | 12 +++---------
 1 file changed, 3 insertions(+), 9 deletions(-)

diff --git a/blog/posts/differential-privacy.md b/blog/posts/differential-privacy.md
index d8fe6b63..4800a40a 100644
--- a/blog/posts/differential-privacy.md
+++ b/blog/posts/differential-privacy.md
@@ -56,18 +56,12 @@ One of the earliest ideas for anonymizing data was [randomized response](https:/
 
 For certain questions like "have you committed tax fraud?", respondents will likely be hesitant to answer truthfully. The solution? Have the respondent flip a coin. If the coin is tails, answer yes. If the coin lands on heads, answer truthfully.
 
 | Respondent | Answer | Coin Flip (not included in the actual dataset, just here for illustration) |
-| --- | --- |
+| --- | --- | --- |
 | 1 | Yes | Tails (Answer Yes) |
 | 2 | No | Heads (Answer Truthfully) |
-| 3 | Yes | Heads (Answer Truthfully) |
+| 3 | Yes | Tails (Answer Yes) |
 | 4 | Yes | Tails (Answer Yes) |
-| 5 | No | Heads (Answer Truthfully) |
-
-| Method | Description |
-| ----------- | ------------------------------------ |
-| `GET` | :material-check: Fetch resource |
-| `PUT` | :material-check-all: Update resource |
-| `DELETE` | :material-close: Delete resource |
+| 5 | No | Heads (Answer Truthfully) |
 
 Because we know the exact probability that a "Yes" answer is fake, 50%, we can remove it and give a rough estimate of how many respondents answered "Yes" truthfully.
 

From 5c40201840ffe2179ff0005ccfc12803d2a640cb Mon Sep 17 00:00:00 2001
From: fria <138676274+friadev@users.noreply.github.com>
Date: Mon, 14 Jul 2025 09:26:02 -0500
Subject: [PATCH 37/40] add description

---
 blog/posts/differential-privacy.md | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/blog/posts/differential-privacy.md b/blog/posts/differential-privacy.md
index 4800a40a..543e7496 100644
--- a/blog/posts/differential-privacy.md
+++ b/blog/posts/differential-privacy.md
@@ -10,8 +10,7 @@ tags:
   - Differential Privacy
 license: BY-SA
 schema_type: BackgroundNewsArticle
-description: |
-  Privacy Pass is a new way to privately authenticate with a service. Let's look at how it could change the way we use services.
+description: Differential privacy is a mathematically rigorous framework for adding a controlled amount of noise to a dataset so that no individual can be reidentified. Learn how this technology is being implemented to protect you.
 ---
 # Privacy-Enhancing Technologies Series: Differential Privacy
 

From bc04e9e4632146e5a063f9ca32a9383570700184 Mon Sep 17 00:00:00 2001
From: fria <138676274+friadev@users.noreply.github.com>
Date: Mon, 14 Jul 2025 09:28:03 -0500
Subject: [PATCH 38/40] fix typo

---
 blog/posts/differential-privacy.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/blog/posts/differential-privacy.md b/blog/posts/differential-privacy.md
index 543e7496..5667ddaa 100644
--- a/blog/posts/differential-privacy.md
+++ b/blog/posts/differential-privacy.md
@@ -124,7 +124,7 @@ In Google's implementation, noise is added to data on-device before it's sent of
 
 ### Bloom Filters
 
-Google RAPPOR makes use of a clever technique caled bloom filters that saves space and improves privacy.
+Google RAPPOR makes use of a clever technique called bloom filters that saves space and improves privacy.
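+
+For readers who want to see the mechanics, here's a minimal generic sketch of a bloom filter (toy parameters, not RAPPOR's actual size or hash functions):
+
+```python
+import hashlib
+
+SIZE = 32          # number of bits in the filter
+NUM_HASHES = 2     # how many hash functions set bits per item
+bits = [0] * SIZE  # start with an array of all 0's
+
+def bit_positions(item: str):
+    # Derive NUM_HASHES positions from salted SHA-256 digests.
+    for salt in range(NUM_HASHES):
+        digest = hashlib.sha256(f"{salt}:{item}".encode()).digest()
+        yield int.from_bytes(digest, "big") % SIZE
+
+def add(item: str) -> None:
+    for pos in bit_positions(item):
+        bits[pos] = 1
+
+def might_contain(item: str) -> bool:
+    # All relevant bits set means "possibly present": false positives
+    # can happen, which is part of what gives RAPPOR deniability.
+    return all(bits[pos] for pos in bit_positions(item))
+
+add("hello")
+print(might_contain("hello"))    # True
+print(might_contain("goodbye"))  # False (with high probability)
+```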
Bloom filters work by starting out with an array of all 0's

From a427b6723d82fa5a2bb157428d27e42756158706 Mon Sep 17 00:00:00 2001
From: fria <138676274+friadev@users.noreply.github.com>
Date: Mon, 14 Jul 2025 09:33:17 -0500
Subject: [PATCH 39/40] add conclusion

---
 blog/posts/differential-privacy.md | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/blog/posts/differential-privacy.md b/blog/posts/differential-privacy.md
index 5667ddaa..2b38e8ec 100644
--- a/blog/posts/differential-privacy.md
+++ b/blog/posts/differential-privacy.md
@@ -222,4 +222,6 @@ Differential privacy unlocks the potential for data collection with minimal risk
 
 Adoption of scientifically and mathematically rigorous methods of data collection will allow organizations to collect aggregate data while increasing public trust in those organizations and, subsequently, the potential for research that will result in improvements to our everyday lives.
 
-I think for there to be more public trust, there needs to be greater public outreach. That's my goal with this series: I'm hoping to at least increase awareness of some of the technology being deployed to protect your data, especially since so much of the news we hear is negative. Armed with the knowledge of what's available, we can also demand that companies and organizations use these tools if they aren't already.
\ No newline at end of file
+I think for there to be more public trust, there needs to be greater public outreach. That's my goal with this series: I'm hoping to at least increase awareness of some of the technology being deployed to protect your data, especially since so much of the news we hear is negative. Armed with the knowledge of what's available, we can also demand that companies and organizations use these tools if they aren't already.
+
+It's heartening to see the level of openness and collaboration in the research. You can see a clear improvement over time as each paper takes the previous research and builds on it. I wish we saw the same attitude with all software.
\ No newline at end of file

From 7e1a09d18f6adbcc3d8c8c54f0444ca54aa3a87b Mon Sep 17 00:00:00 2001
From: fria <138676274+friadev@users.noreply.github.com>
Date: Mon, 14 Jul 2025 09:44:16 -0500
Subject: [PATCH 40/40] grammar

---
 blog/posts/differential-privacy.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/blog/posts/differential-privacy.md b/blog/posts/differential-privacy.md
index 2b38e8ec..3006664f 100644
--- a/blog/posts/differential-privacy.md
+++ b/blog/posts/differential-privacy.md
@@ -52,7 +52,7 @@ It was also possible to [deanonymize](https://steveloughran.blogspot.com/2018/01
 
 One of the earliest ideas for anonymizing data was [randomized response](https://uvammm.github.io/docs/randomizedresponse.pdf), first introduced all the way back in 1965 in a paper by Stanley L. Warner. The idea behind it is quite clever.
 
-For certain questions like "have you committed tax fraud?", respondents will likely be hesitant to answer truthfully. The solution? Have the respondent flip a coin. If the coin is tails, answer yes. If the coin lands on heads, answer truthfully.
+For certain questions like "have you committed tax fraud?" respondents will likely be hesitant to answer truthfully. The solution? Have the respondent flip a coin. If the coin is tails, answer yes. If the coin lands on heads, answer truthfully.
 
 | Respondent | Answer | Coin Flip (not included in the actual dataset, just here for illustration) |
 | --- | --- | --- |