计算机代考程序代写 database Example: Bing’s “Movies Vertical” – cscodehelp代写
Example: Bing’s “Movies Vertical”
adding
entity actions to entity cards
Need to link movie records from Bing and Netflix
Easy problem only if there’s an attribute with unique value per record
• If there’s a “key” then use “database join”
Not so fast! The linkage challenge
No keys
Noisy values Missing attributes
Usually remove diacritics (why?) déjà → deja
“Unrelated works with same titles” http://www.imdb.com/list/ls000397020/
Concatenate “Title, Year”: surely a key?
” ,2006″
” ,1997″
“Unrelated works with same titles” http://www.imdb.com/list/ls000397020/
Concatenate “Title, Year”: surely a key?
http://www.imdb.com
Pre-processing
Prep Block Score Match Merge
Simple pre-processing to clean records
Title
Year
Directors
Cast
Runtime
…
come fly with me
2004
peter kagan
michael buble
63
…
michael jordan come fly with me
1989
michael jordan, jay thomas
42
…
Pairwise comparison
• Need to compare two records to determine if they are to be linked
• Calculate similarity between a pair of records If similar:
link pair
• High complexity – we will need to compare/”score” many pairs of records for similarity.
Prep Block Score Match Merge
Complexity – pairwise comparison
• Two databases A (m records) and B (n records) • How many comparisons? 𝑚 × 𝑛
• One MSFT problem had 1b records
• Complexity 𝑂(𝑛2)
• If𝑚==𝑛==1𝑏 109 records
• 109 × 109 number of comparisons;
• It will take a long time to compute 1018 pairs
• Naïve whole-database all-pairs comparison doesn’t scale. Can we do better?
Blocking – scalable record linkage
• Maps complex records to simple values (blocks).
• The simple values are blocking-keys
• Same blocking methods applied to both A and B.
Each record is allocated to
one or more of the blocks
b1
b2 B b3
a1 a2 a3
A
b1 b4
a1 a3
b1 b2 b3
a2 a5
b3
a2 a5
b4 a4
b5 a5
Prep Block
Score
Match
Merge
b1 b5
a4
Blocking – cont.
• Record-pairs assigned to the same blocks are potential matches.
• Withinblockpairwise comparisons only
• For example, b5 is only compared with a4, and no other records in A
2×1
2×2
b1 b5
a4
b1 – a4 b5 – a4
b1 – a1 b1 – a3 b4 – a1 b4 – a3
b1 – a2 b1 – a5 b2 – a2 b2 – a5 b3 – a2 b3 – a5
b1 b4
a1 a3
b1 b2 b3
a2 a5
3×2
b3
a2 a5
1×2
b3 – a2 b3 – a5
Scalable record linkage with blocking
Blocking example
come fly jordan michael
Prep Block Score Match Merge
Represent complex records as simple values (blocks); Only score records with simple value (block) in common.
Title
come fly with me
michael jordan come fly with me
…
…
…
Blocking example
A movie record: <“ghost busters”, 2016>
• On one function: release year, result in one block
• 2016
• On one function: concatenated values of release year and one title-
word: results in two blocks • 2016 ghost
• 2016 busters
• Movies with a difference in release year by one; three functions: • release year,
• release year + 1,
• release year – 1; results in three blocks:
• 2015, 2016, and 2017
• Redundant blocks, why?
Common blocking methods for strings
• Token/word (‘come fly with me’) {come, fly, with, me}
• Phrases/n-words (‘come fly with me’, n=2)
{’_ come’, ’come fly’, ’fly with’, ‘with me’, ‘me _’}
• N-grams (‘Jamie’, n=2)
• {_j, ja, am, mi, ie, e_}
• blocking-keys are the 2-grams; ‘jamie’ are in 6 blocks
• Prefix, suffix
• soundex
• Pauline: P450, Paulina: P450, • Chris: C620
• Kris: K620
Blocking methods– design
• Blocking methods should be efficient, e.g. hashing • What would be a bad blocking method?
• Trade-off between block sizes: true matches being missed vs computational efficiency
• Can filter out large blocks
• Blocks can overlap but avoid redundant blocks • Need to ensure that recall does not suffer
Measuring blocking method
Given the ground truth (matching record-pairs)
False positives (fp): non-matching pairs in the same block. False negatives(fn): matching pairs never in the same block. True positives (tp): matching pairs in the same block.
True negative (tn): non-matching pairs never in the same block. Total number of pairs with no blocking: 𝑚 × 𝑛 ( == 𝑓𝑝 + 𝑓𝑛 + 𝑡𝑝 + 𝑡𝑛)
Blocking measures:
• Pair-completeness (PC): 𝑡𝑝Τ(𝑡𝑝 + 𝑓𝑛)
• Reduction-ratio (RR): 1 − (𝑓𝑝 + 𝑡𝑝)Τ(𝑓𝑝 + 𝑓𝑛 + 𝑡𝑝 + 𝑡𝑛)