程序代写代做代考 data structure Data Linkage and Privacy – Approximate matching – cscodehelp代写

Data Linkage and Privacy – Approximate matching

Privacy preserving approximate-match
• The hash based technique using the 3rd party, can only compute exact match between strings in a privacy preserving manner.
• What if we wish to compute approximate match between two strings in a privacy preserving manner?
– Representing a record field (string) in a privacy preserving manner (so that it is difficult to reverse engineer its value)
– Computing approximate similarity of two record fields (strings) that have been represented in privacy preserving manner.
• One method:
– Set-based string similarity + bloom filters.

Bloom Filters
• Data structure: a ‘bit array’ (or a ‘bit string’) – An array of binary bits (0s or 1s).
• A bloom filter is a ‘bit array’ with I bits
– Initialise bits to 0s
– Turn bits on using k number of hash functions 𝐻 ⋯ 𝐻 1𝑘
– Each hash function maps input to an index in the bit array. – Indexrange 0⋯𝑰−1
1
0
1
1
0
0
1
0
0
0
0
0
0
0
0
0

Bloom Filters for storing strings
• Storing a string x in a bloom filter
– choose 𝐻 ⋯ 𝐻 that takes a string as input
1𝑘
– each 𝐻 (𝑥) returns a index to the bloom filter (bit array) 𝑖
– set the value of that index to 1
• Example: a bloom filter 𝐼 = 25, 𝑘 = 2, string s = ‘Yana White’
– 𝐻 𝑥 : the alphabet order of the first letter mod 25 1
– 𝐻 𝑥 : the alphabet order of the last letter mod 25 2
– 𝐻 𝑌𝑎𝑛𝑛𝑎𝑊h𝑖𝑡𝑒 =25𝑚𝑜𝑑25=0 1
– 𝐻 𝑌𝑎𝑛𝑛𝑎𝑊h𝑖𝑡𝑒 =5𝑚𝑜𝑑25=5 2
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
1
0
0
0
0
1
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
• For a set of strings, repeat for each string.

Checking if a string is present in a bloom filter
• For any string X, hashing X using the k hash functions 𝐻 ⋯ 𝐻 . 1𝑘
– If at least one bit at the index Hi(X) is 0 X is not in the bloom filter
– If all of the bits at the indexes Hi(X) are set to 1, then X could be a member of the bloom filter.
However, it might be a false positive.
– Question: Why isn’t X a member of the bloom filter with certainty, if all bits at index Hi(X) are set to 1?

Bloom filter: comparing two strings for (approximate-) similarity
• Given two strings
– Create a set of smaller strings into a set of smaller strings, e.g. 2-grams
_smith_→𝑆 : _𝑠,𝑠𝑚,𝑚𝑖,𝑖𝑡,𝑡h,h_ 1
_smyth_→𝑆2:{_𝑠,𝑠𝑚,𝑚𝑦,𝑦𝑡,𝑡h,h_} – Store 𝑆 in one bloom filter 𝐵
11
– Store 𝑆2 in another bloom filter 𝐵2
– Both 𝐵 and 𝐵 are the same length 𝑰 and use the same 𝒌 12
hash functions.
• If two strings are similar→ many common 2-grams.
– 𝐵 and 𝐵 have many identical bits (bit positions) set to 1. 12

Bloom filter: comparing two strings for (approximate-) similarity cont.
• Use Dice similarity
–𝑠𝑖𝑚 𝐵,𝐵 =2×𝑏1∩𝑏2 where
𝑑𝑖𝑐𝑒 1 2 𝑏1 + 𝑏2
𝑏 ∩ 𝑏 is the number of bits set to 1 in both 𝐵 and 𝐵 12 12
𝑏 is the number of bits set to 1 in bloom filter B1 1
𝑏2 is the number of bits set to 1 in bloom filter B2 • Or Jaccard similarity formula

Similarity between _SMITH_ and _SMYTH_ with Bloom filters
𝐵 1
𝐵2
𝑆 1
𝑆2
has 11 bits set to 1; 𝐵
• 𝐵 and𝐵 have8commonbitssetto1.
has 10 bits set to 1 • 𝑠𝑖𝑚 𝐵,𝐵 =2×𝑏1∩𝑏2 =2×8 =0.762
• 𝐵 12
12
𝑑𝑖𝑐𝑒 1 2 𝑏1 + 𝑏2 11+10
• Approximating𝑠𝑖𝑚 𝑆,𝑆 .Why? 𝑑𝑖𝑐𝑒 1 2

Privacy preserving record linkage with bloom filters
AB
Bloom filters,
one for each record
C
Bloom filters,
one for each record
Approximate similarity comparisons of bloom filters

Usage considerations
• A and B agree on a bit array length l and k hash functions – A and B also agree on a salt and use it in their hash
functions
• A and B may also add dummy records (and their corresponding bloom filters) to lessen the chance of C mounting a frequency attack
• May be problems with using bloom filters to represent short strings (not very secure)

Summary: Record Linkage
• Summary
– Challenges involved with record linkage
– Blocking: How & why to do it
– How to assess similarities between records
– How to evaluate effectiveness of blocking & record linkage – How to ensure privacy for exact matching
– How to ensure privacy for inexact matching
– Possible attacks: Dictionary attacks, frequency attacks

Leave a Reply

Your email address will not be published. Required fields are marked *