Data Mining - Privacy (Anonymization)

1 - Purpose

The purpose of data mining is to discriminate …

  • who gets the loan
  • who gets the special offer

Certain kinds of discrimination are unethical and illegal

  • racial, sexual, religious, …

But it depends on the context

  • sexual discrimination is usually illegal (except for doctors, who are expected to take gender into account)
  • information that appears innocuous may not be: ZIP code correlates with race, and membership of certain organizations correlates with gender
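
One way to check whether an innocuous-looking attribute acts as a proxy is to see how much better a protected attribute can be guessed once the attribute is known. The sketch below is only an illustration: the rows are made-up toy data, and the names `proxy_accuracy`, `baseline_accuracy`, `zip`, and `group` are hypothetical.

```python
from collections import Counter, defaultdict

def baseline_accuracy(rows, protected):
    """Accuracy of always guessing the most common protected value."""
    counts = Counter(row[protected] for row in rows)
    return counts.most_common(1)[0][1] / len(rows)

def proxy_accuracy(rows, proxy, protected):
    """Accuracy of guessing `protected` by majority vote within each proxy value."""
    by_value = defaultdict(Counter)
    for row in rows:
        by_value[row[proxy]][row[protected]] += 1
    # For each proxy value, the majority guess is right for its largest group.
    correct = sum(c.most_common(1)[0][1] for c in by_value.values())
    return correct / len(rows)

# Made-up toy data: two ZIP codes, two groups "A" and "B".
rows = (
    [{"zip": "11111", "group": "A"}] * 3
    + [{"zip": "11111", "group": "B"}] * 1
    + [{"zip": "22222", "group": "A"}] * 1
    + [{"zip": "22222", "group": "B"}] * 3
)

print(baseline_accuracy(rows, "group"))      # guess without the proxy
print(proxy_accuracy(rows, "zip", "group"))  # guess with the proxy
```

If the second number is clearly higher than the first, a model that uses the proxy can discriminate on the protected attribute even though that attribute was never collected.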

3 - Information privacy laws

3.1 - Europe

  • A purpose must be stated for any personal information collected
  • Such information must not be disclosed to others without consent
  • Records kept on individuals must be accurate and up to date
  • To ensure accuracy, individuals should be able to review data about themselves
  • Data must be deleted when it is no longer needed for the stated purpose
  • Personal information must not be transmitted to locations where equivalent data protection cannot be assured
  • Some data (e.g., sexual orientation, religion) is too sensitive to be collected, except in extreme circumstances

4 - Anonymization

Anonymization is harder than you think

4.1 - Medical records

In the mid-1990s, Massachusetts released medical records summarizing every state employee's hospital record. The governor gave a public assurance that the data had been anonymized by removing all identifying information such as name, address, and social security number.

He was surprised to receive his own health records (which included diagnoses and prescriptions) in the mail.

4.2 - Re-identification

Using publicly available records:

  • 50% of Americans can be identified from city, birth date, and sex
  • 85% can be identified if you include the 5-digit ZIP code as well
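
The effect can be measured on any table by counting how many records are unique on a chosen set of attributes (quasi-identifiers). The following is a minimal sketch on made-up toy records; the function name `fraction_unique` and the field names are assumptions for illustration, and the toy numbers say nothing about the percentages above.

```python
from collections import Counter

def fraction_unique(records, quasi_identifiers):
    """Share of records whose combination of quasi-identifiers occurs only once."""
    keys = [tuple(r[q] for q in quasi_identifiers) for r in records]
    counts = Counter(keys)
    unique = sum(1 for k in keys if counts[k] == 1)
    return unique / len(keys) if keys else 0.0

# Made-up toy records: no names, yet some rows are still unique.
records = [
    {"city": "Boston", "zip": "02108", "birth_date": "1960-03-14", "sex": "F"},
    {"city": "Boston", "zip": "02116", "birth_date": "1960-03-14", "sex": "F"},
    {"city": "Boston", "zip": "02108", "birth_date": "1975-07-02", "sex": "M"},
    {"city": "Boston", "zip": "02108", "birth_date": "1975-07-02", "sex": "M"},
    {"city": "Salem", "zip": "01970", "birth_date": "1982-11-30", "sex": "F"},
]

without_zip = fraction_unique(records, ["city", "birth_date", "sex"])
with_zip = fraction_unique(records, ["city", "zip", "birth_date", "sex"])
print(without_zip, with_zip)
```

Adding an attribute can only split groups, never merge them, so the unique share never goes down as more quasi-identifiers are included.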

4.3 - Netflix

Netflix movie database: 100 million records of movie ratings (1–5)

  • Can identify 99% of people in the database if you know their ratings for 6 movies and approximately when they saw the movies (± one week)
  • Can identify 70% if you know their ratings for 2 movies and roughly when they saw them
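
The matching idea behind these figures can be sketched as a filter: keep only the users whose records agree with every known (movie, rating, approximate date) clue. This is a simplified illustration on made-up data, not the method used in the published study; `matching_users`, the user ids, and the movie titles are hypothetical.

```python
from datetime import date, timedelta

def matching_users(ratings, known, tolerance_days=7):
    """Return user ids consistent with every (movie, rating, approx_date) clue."""
    tolerance = timedelta(days=tolerance_days)
    candidates = None
    for movie, rating, approx_date in known:
        hits = {
            user
            for (user, m, r, d) in ratings
            if m == movie and r == rating and abs(d - approx_date) <= tolerance
        }
        # Intersect with the users that matched all earlier clues.
        candidates = hits if candidates is None else candidates & hits
    return candidates or set()

# Made-up "anonymized" ratings: (user id, movie, rating 1-5, date rated).
ratings = [
    (101, "Movie A", 5, date(2005, 3, 1)),
    (101, "Movie B", 2, date(2005, 3, 20)),
    (102, "Movie A", 5, date(2005, 3, 3)),
    (102, "Movie B", 4, date(2005, 3, 21)),
    (103, "Movie A", 3, date(2005, 6, 1)),
]

one_clue = matching_users(ratings, [("Movie A", 5, date(2005, 3, 2))])
two_clues = matching_users(
    ratings,
    [("Movie A", 5, date(2005, 3, 2)), ("Movie B", 2, date(2005, 3, 18))],
)
print(one_clue, two_clues)
```

Each extra clue shrinks the candidate set, which is why a handful of ratings with rough dates can be enough to single out one person.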

4.4 - AOL search engine queries

In 2006, a text file containing 20,000,000 search engine queries made by 650,000 users over a 3-month period was released on the web for research purposes. The file had been anonymized by replacing user names with random numbers, one per user. However, some of the queries contained clues to the user's identity. The New York Times was able to identify an individual from these supposedly anonymized search records by cross-referencing them with phonebook listings.

Look up this well-known example of re-identification and read about it. What is the name of the user identified by the New York Times?
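
Replacing each user name with one random number per user can be sketched as follows. This is an assumed reconstruction of the general technique, not the code used for the actual release; the `pseudonymize` helper, the user names, and the queries are made up.

```python
import secrets
from collections import defaultdict

def pseudonymize(log):
    """Replace each user name with a random number, one per user."""
    pseudonyms = {}
    used = set()
    released = []
    for user, query in log:
        if user not in pseudonyms:
            pid = secrets.randbelow(10**7)
            while pid in used:  # avoid giving two users the same number
                pid = secrets.randbelow(10**7)
            used.add(pid)
            pseudonyms[user] = pid
        released.append((pseudonyms[user], query))
    return released

# Made-up search log: (user name, query).
log = [
    ("alice", "plumbers in springfield"),
    ("bob", "cheap flights"),
    ("alice", "homes sold on elm street springfield"),
    ("alice", "smith family reunion"),
]

released = pseudonymize(log)

# The names are gone, but each user's queries remain linked by the number.
profiles = defaultdict(list)
for pid, query in released:
    profiles[pid].append(query)
print(dict(profiles))
```

Because the number is the same for every query by a given user, the release still contains a complete search history per person, and the text of the queries themselves can point to a town, a street, or a family name.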

4.5 - NSA - Metadata Match (Stanford)

5 - Open Person Directory

  • Yelp
  • Google Places
  • Facebook directories

6 - Tools

7 - Documentation / Reference

data_mining/privacy.txt · Last modified: 2017/05/24 19:57 by gerardnico