Data Linkage and Privacy
• If data matching is being conducted within a single organisation and is using databases within the organisation, privacy/confidentiality is generally not a concern.
– Can assume individuals doing the matching are authorised, aware of policies and don’t have malicious intent
– E.g. University of Melbourne: an administrator matching the student academic results database against the database of applicants for PhD study
• On the other hand, problems can arise if
– Matched data is being passed to another organisation or being made public
– Data matching is being conducted across databases from different organisations

Example 1: Need for privacy in public health
• A research team is investigating the effects of car accidents on the public health system. Research questions:
– What are the most common injuries for each type of car accident?
– When and where did accidents occur, what were the road and weather conditions at the time, and what was the health of the people involved, both at the time of the accident and two years later?
• Data needed
– Hospital data on patients
– Private health insurance data
– Police
– Road traffic authorities
• These organisations can’t share all their data with the research team.

Example 2: Need for privacy – national security
• National crime investigation unit analysing crimes of national significance (significance to all of Australia)
• Wants to link its own database about suspicious individuals to different databases around Australia
– Tax
– Law enforcement
– Financial institutions
• Only linked records should be available to the unit
– It should not get access from the bank to financial data about non-suspicious individuals
– It should not get access to tax records about non-suspicious individuals

Privacy Preserving Data Linkage: Problem Statement
• How can we perform data linkage for two databases, each from a different organisation
– Without revealing any information about individuals who do not get linked across the databases (i.e. individuals who occur in one database and not in the other)
• We will need
– Methods for computing similarity of records, without revealing the record values
• Hashing: an important tool

Hashing
A hash function H maps a data item of arbitrary size to a data item of fixed size
• A hash function H_mod3(x) = x mod 3
– H_mod3(32) = 2
– H_mod3(20) = 2
– H_mod3(6) = 0
– H_mod3(7) = 1
• A hash function H_c1
– H_c1(James) = 10
– H_c1(Kate) = 11
– H_c1(Tim) = 20
– H_c1(The quick brown fox jumped over the lazy dog) = 20
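
The two toy hash functions above can be sketched in Python. The mod 3 function follows directly from the listed values. The definition of the second function (labelled c1) is not spelled out on the slide; the sketch assumes it returns the alphabet position of the first character, which is consistent with all four listed values. The names `h_mod3` and `h_c1` are made up for this illustration.

```python
def h_mod3(x: int) -> int:
    """Map any integer to one of three fixed values: 0, 1 or 2."""
    return x % 3


def h_c1(s: str) -> int:
    """Assumed reading of the c1 hash: alphabet position (a=1 ... z=26)
    of the first character, ignoring case."""
    return ord(s[0].lower()) - ord("a") + 1


print([h_mod3(x) for x in (32, 20, 6, 7)])           # [2, 2, 0, 1]
print([h_c1(s) for s in ("James", "Kate", "Tim")])   # [10, 11, 20]
```

Both functions produce collisions easily (32 and 20 share a value, as do "Tim" and the fox sentence), which shows that a fixed-size output cannot be unique for every input.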

Non-invertible (one-way) hash function
• Non-invertible hash function: given the output H(X), it is extremely hard to reconstruct X. Examples:
– MD5 hash function (produces a 32-digit hexadecimal number, equal to a string of 128 bits)
• H(James)= d52e32f3a96a64786814ae9b5279fbe5
• H(I love data wrangling)= 614416fa9d994aa8225ebd7c50f22060
• H(12345678)= 25d55ad283aa400af464c76d713c07ad
– SHA-3-512 hash function (produces a 128-digit hexadecimal number, equal to a string of 512 bits)
• H(James)= 02c56351888fa73ff825ffd65526b264ebefe7916fa5d8d5c58e766bfdd1de8e85b68bf12599b9d21eca6683d4abfa8616acfa6834e7c478e394374a7b015898
• H(12345678)= 8a56bac869374c669443a1626ff0967af258123f83faf6b55e31dd541e6bbd90308a3385713294bf2e8861bc8cf8f8feda41f9c4db19d5811a6b5de85eac9870
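
Digests like these can be computed with Python's standard `hashlib` module. This is a minimal sketch, assuming the input strings are UTF-8 encoded with no trailing newline; the helper names `md5_hex` and `sha3_512_hex` are made up for this illustration.

```python
import hashlib


def md5_hex(text: str) -> str:
    """MD5 digest of the UTF-8 encoded text, as 32 hexadecimal digits."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def sha3_512_hex(text: str) -> str:
    """SHA3-512 digest of the UTF-8 encoded text, as 128 hexadecimal digits."""
    return hashlib.sha3_512(text.encode("utf-8")).hexdigest()


print(md5_hex("12345678"))  # 25d55ad283aa400af464c76d713c07ad
print(len(md5_hex("James")), len(sha3_512_hex("James")))  # 32 128
```

The output length is fixed by the hash function, regardless of how long the input is.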

Hash encoding for exact matching: 2 party protocol
• Each organisation
– Applies a (one way) hash function to the attribute used to join the databases
– Shares its hashed values with the other organisation. Each checks which ones match. These are the linked records.
Org. A
Name    H(Name)
Jill    8347
Jane    6992

Org. B
Name    H(Name)
Bob     2332
Jane    6992
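
The two-party protocol can be sketched as follows. This is an illustration only: it assumes SHA3-512 as the one-way hash, so the digests are long hexadecimal strings rather than the four-digit values in the tables, and the helper names `h` and `hashed_view` are made up.

```python
import hashlib


def h(value: str) -> str:
    """One-way hash of the join attribute (SHA3-512 chosen for this sketch)."""
    return hashlib.sha3_512(value.encode("utf-8")).hexdigest()


def hashed_view(names):
    """Map each digest back to the local name. Only the digests (the keys)
    are shared with the other organisation."""
    return {h(name): name for name in names}


org_a = hashed_view(["Jill", "Jane"])
org_b = hashed_view(["Bob", "Jane"])

# Each side receives only the other side's digests and checks for matches.
received_from_b = set(org_b)
received_from_a = set(org_a)
linked_at_a = [org_a[d] for d in org_a if d in received_from_b]
linked_at_b = [org_b[d] for d in org_b if d in received_from_a]

print(linked_at_a, linked_at_b)  # ['Jane'] ['Jane']
```

Organisation A never sees the name "Bob" in plain text, only its digest, and likewise B never sees "Jill".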

Small changes in input, large change in output
• Disadvantage 1: What about single character differences in the original value? E.g. MD5 hash function
– H(James)= d52e32f3a96a64786814ae9b5279fbe5
– H(Jamex)= c3bfa7fa6ad2b987619bb4c932e65b4a
– A single character difference results in a completely different output. This is generally true for one-way hash functions such as MD5 and the SHA family.
• Advantage?
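
The effect of a single character change can be measured by counting how many of the 128 output bits differ between two MD5 digests. The helper names `md5_bits` and `differing_bits` are made up for this sketch.

```python
import hashlib


def md5_bits(text: str) -> int:
    """MD5 digest of the text, interpreted as a 128-bit integer."""
    return int.from_bytes(hashlib.md5(text.encode("utf-8")).digest(), "big")


def differing_bits(a: str, b: str) -> int:
    """Number of bit positions at which the two MD5 digests differ."""
    return bin(md5_bits(a) ^ md5_bits(b)).count("1")


print(differing_bits("James", "James"))  # 0
print(differing_bits("James", "Jamex"))  # out of 128 bits
```

For a well-behaved hash function, roughly half of the output bits are expected to change, so the digests carry no usable information about how similar the inputs were.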

Dictionary attack
• Disadvantage 2: An organisation could mount a dictionary attack to “invert” the hash function. E.g. Organisation A generates a hash dictionary by computing hashes for all words of length 4
– H(aaaa)=…
– H(aaab)=…
– H(aaac)=…
– H(aaad)=…
– …
– H(zzzz)=…
• Organisation A then scans the hashed values received from Organisation B and checks if any match its hash dictionary. If yes, privacy is lost for those items.
• Could also generate a dictionary for all known words, pairs of words, … [up to some limit of feasibility]
• Example: d077f244def8a70e5ea758bd8352fcd8
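
A dictionary attack of this kind can be sketched in a few lines. The sketch assumes MD5 as the hash function and uses words of length 3 rather than 4 to keep the demonstration fast (26^3 = 17,576 entries instead of 26^4 = 456,976). The function names and the sample values are made up for this illustration.

```python
import hashlib
from itertools import product
from string import ascii_lowercase


def h(value: str) -> str:
    return hashlib.md5(value.encode("utf-8")).hexdigest()


def build_dictionary(length: int, alphabet: str = ascii_lowercase) -> dict:
    """Precompute digest -> word for every word of the given length."""
    words = ("".join(chars) for chars in product(alphabet, repeat=length))
    return {h(word): word for word in words}


def invert(received_hashes, dictionary: dict) -> dict:
    """Look up each received digest; hits reveal the original value."""
    return {d: dictionary[d] for d in received_hashes if d in dictionary}


dictionary = build_dictionary(3)
received = [h("bob"), h("Jane")]   # digests sent by the other organisation
recovered = invert(received, dictionary)

print(sorted(recovered.values()))  # ['bob']
```

"Jane" is not recovered here only because it falls outside this small dictionary (four characters, one upper case); a larger dictionary would cover it.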

Hash encoding for exact matching: 3 party protocol
• Possible solution
– Involve a trusted 3rd party (Organisation C)
– Organisations A and B send their hashed values to Organisation C, who then checks for matches.
– What if Organisation C is malicious?
• Organisation C could mount a dictionary attack and guess the hashed values
• Solution: A and B perform “dictionary attack resistant” hashing

3rd Party Protocol using salt
A
Name               H(Name)
Jill+SECRET_KEY    1112
Jane+SECRET_KEY    9341

B
Name               H(Name)
Bob+SECRET_KEY     2996
Jane+SECRET_KEY    9341
• Organisations A and B concatenate a secret word (known as a salt) to every name field in their data before hashing. Organisation C does not know this word and thus can’t perform a dictionary attack to “reverse” the hashed values it receives.
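
The salted third-party protocol can be sketched as follows, assuming SHA3-512 as the hash function and simple string concatenation of name and key, as in the tables. The key value and the names `salted_hash`, `sent_by_a`, `sent_by_b` and `matches_at_c` are hypothetical.

```python
import hashlib


def salted_hash(name: str, secret_key: str) -> str:
    """Hash the name concatenated with the shared secret (the salt)."""
    return hashlib.sha3_512((name + secret_key).encode("utf-8")).hexdigest()


# Hypothetical secret, agreed between A and B and never shown to C.
SECRET_KEY = "hypothetical-shared-secret"

sent_by_a = {salted_hash(n, SECRET_KEY) for n in ["Jill", "Jane"]}
sent_by_b = {salted_hash(n, SECRET_KEY) for n in ["Bob", "Jane"]}

# Organisation C sees only digests and intersects them.
matches_at_c = sent_by_a & sent_by_b

# C's dictionary entry for "Jane", built without the secret, matches nothing.
c_guess = hashlib.sha3_512("Jane".encode("utf-8")).hexdigest()

print(len(matches_at_c), c_guess in (sent_by_a | sent_by_b))  # 1 False
```

Because both organisations use the same secret, equal names still produce equal digests, so C can find the matches without being able to precompute a dictionary.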

Hashing and salting
• In June 2012 dating site eHarmony was hacked
– 1.5 million password hashes publicly released
• In June 2012 social networking site LinkedIn was hacked
– 117 million hashed passwords stolen and publicly released
• In April 2018 textbook rental service Chegg was hacked
– 40 million password hashes publicly released
• None of these companies used a salt when hashing the passwords
– Many passwords were thus susceptible to a brute force dictionary attack on the hashed values

Frequency attack
• This third party scheme prevents a dictionary attack, but may still be susceptible to a frequency attack.
– 3rd party compares the distribution of hashed values to some known distribution
• E.g. distribution of surname frequencies in a public database versus distribution of hash values
• May be able to guess some of the hashed values!
• Organisations A and B could prevent this by adding random records to manipulate the frequency distribution
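
A frequency attack can be illustrated with a small sketch in which the third party pairs the most common hashed value with the most common public surname, the second most common with the second, and so on. The surname counts below are made up for the illustration, and the names `salted_hash` and `frequency_guess` are hypothetical.

```python
import hashlib
from collections import Counter


def salted_hash(name: str, secret_key: str) -> str:
    return hashlib.sha3_512((name + secret_key).encode("utf-8")).hexdigest()


def frequency_guess(hashed_values, public_names) -> dict:
    """Pair digests with names by frequency rank (most common first)."""
    hash_rank = [d for d, _ in Counter(hashed_values).most_common()]
    name_rank = [n for n, _ in Counter(public_names).most_common()]
    return dict(zip(hash_rank, name_rank))


key = "hypothetical-shared-secret"  # unknown to the third party

# Made-up data: private surnames held by A, and a public surname list.
private_surnames = ["Smith"] * 5 + ["Jones"] * 3 + ["Nguyen"] * 1
public_surnames = ["Smith"] * 50 + ["Jones"] * 31 + ["Nguyen"] * 12

received = [salted_hash(s, key) for s in private_surnames]
guesses = frequency_guess(received, public_surnames)

print(guesses[salted_hash("Smith", key)])  # Smith
```

The third party never learns the salt, yet the ranking alone is enough to guess the common values. Adding random records changes the observed counts and weakens this kind of matching.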

Privacy preserving exact-match scheme
• Organisations A and B can determine which records in the two databases are an exact match in a privacy preserving manner
– using a trusted third party C, and
– using one-way hash functions with a salt, and
– adding random records
• A reasonably private scheme (depending on how much the third party is trusted)
[Diagram: A and B each send their private IDs to C. C performs exact matching of the private IDs and returns the linked record pair IDs to A and B.]
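
The whole scheme can be sketched end to end: A and B encode their records with a salted hash, add dummy records, and send (record ID, digest) pairs to C, which returns the linked record pair IDs. This is a sketch under stated assumptions. The slides do not say how the random records should be chosen, so the dummies here are simply random strings, and the record IDs are assumed to be opaque labels. All names (`encode`, `link_at_c`, the key) are hypothetical.

```python
import hashlib
import secrets


def salted_hash(value: str, secret_key: str) -> str:
    return hashlib.sha3_512((value + secret_key).encode("utf-8")).hexdigest()


def encode(records: dict, secret_key: str, n_dummies: int, prefix: str) -> list:
    """Return (record_id, digest) pairs for C, padded with dummy records.

    Dummies hash random strings (an assumption of this sketch) and get IDs
    that continue the same numbering as the real records.
    """
    encoded = [(rid, salted_hash(name, secret_key)) for rid, name in records.items()]
    for i in range(n_dummies):
        dummy_id = f"{prefix}{len(records) + i + 1}"
        encoded.append((dummy_id, salted_hash(secrets.token_hex(16), secret_key)))
    return encoded


def link_at_c(encoded_a: list, encoded_b: list) -> list:
    """Run by the third party C: exact matching on digests only."""
    index_b = {}
    for rid, digest in encoded_b:
        index_b.setdefault(digest, []).append(rid)
    return [(rid_a, rid_b)
            for rid_a, digest in encoded_a
            for rid_b in index_b.get(digest, [])]


key = "hypothetical-shared-secret"   # known to A and B only
a = {"a1": "Jill", "a2": "Jane"}
b = {"b1": "Bob", "b2": "Jane"}

pairs = link_at_c(encode(a, key, 2, "a"), encode(b, key, 2, "b"))
print(pairs)  # [('a2', 'b2')]
```

C learns only which opaque IDs match. A and B can map the returned IDs back to their own records, and neither learns anything about the other's unmatched records.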