Assess pairwise similarity
Multiple scoring functions on different parts of the records
Prep → Block → Score → Match → Merge
Comparing two records: assess their similarity

Attribute   Record A           Record B                          Score
Title       come fly with me   michael jordan come fly with me   0.82
Year        2004               1989                              15
Directors   peter kagan        (none given)                      0
Cast        michael buble      michael jordan, jay thomas        0
Runtime     63                 42                                21
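The per-attribute comparison above can be sketched in code. This is only an illustration under stated assumptions: the slide does not say which measure produced the title score of 0.82 or how the name lists are scored, so `token_dice` and `jaccard` are hypothetical stand-ins, and the empty director list for the second record is assumed from the example. Year and runtime are scored as absolute differences, which matches the 15 and 21 shown.

```python
def token_dice(a, b):
    """Dice coefficient over whitespace tokens (a stand-in title measure)."""
    ta, tb = set(a.split()), set(b.split())
    if not ta and not tb:
        return 1.0
    return 2 * len(ta & tb) / (len(ta) + len(tb))


def jaccard(a, b):
    """Set overlap in [0, 1]; 0.0 when either side is empty (assumed rule)."""
    a, b = set(a), set(b)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def compare_records(r1, r2):
    """Return one score per attribute: (title, year, directors, cast, runtime)."""
    return (
        token_dice(r1["title"], r2["title"]),
        abs(r1["year"] - r2["year"]),          # difference, not similarity
        jaccard(r1["directors"], r2["directors"]),
        jaccard(r1["cast"], r2["cast"]),
        abs(r1["runtime"] - r2["runtime"]),    # difference, not similarity
    )


record_a = {"title": "come fly with me", "year": 2004,
            "directors": ["peter kagan"], "cast": ["michael buble"],
            "runtime": 63}
record_b = {"title": "michael jordan come fly with me", "year": 1989,
            "directors": [], "cast": ["michael jordan", "jay thomas"],
            "runtime": 42}

vector = compare_records(record_a, record_b)
print(vector)  # (0.8, 15, 0.0, 0.0, 21)
```

The stand-in title measure gives 0.8 rather than the 0.82 on the slide, which suggests the slide used a different string measure. Note also that the vector mixes a similarity (title) with raw differences (year, runtime), so the scores are not yet on a common scale.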


Scoring similarities – text
What is the similarity score for a text attribute between a pair of records?
Revision on text processing!

Exercise: set-based string similarity
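• Compute the Dice coefficient similarity between $x$ and $y$ using 2-grams.
• $x$ = james
• $y$ = jamie
• $S_x = \{\_j, ja, am, me, es, s\_\}$
• $S_y = \{\_j, ja, am, mi, ie, e\_\}$
• $|S_x \cap S_y| = 3$, $|S_x| = 6$, $|S_y| = 6$
• $sim_{dice}(S_x, S_y) = \frac{2 \times |S_x \cap S_y|}{|S_x| + |S_y|} = \frac{2 \times 3}{6 + 6} = 0.5$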
• Why 2-grams? What about 1-gram or 10-grams?
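The exercise can be checked with a short sketch. The function names `qgrams` and `dice` and the `_` padding character are choices made here to mirror the slide's notation, not part of any particular library.

```python
def qgrams(s, q=2, pad="_"):
    """Set of q-grams of s, padded with q-1 pad characters on each side."""
    padded = pad * (q - 1) + s + pad * (q - 1)
    return {padded[i:i + q] for i in range(len(padded) - q + 1)}


def dice(x, y, q=2):
    """Dice coefficient between the q-gram sets of x and y."""
    sx, sy = qgrams(x, q), qgrams(y, q)
    if not sx and not sy:
        return 1.0  # two empty strings: treated as identical (assumed rule)
    return 2 * len(sx & sy) / (len(sx) + len(sy))


print(sorted(qgrams("james")))      # ['_j', 'am', 'es', 'ja', 'me', 's_']
print(dice("james", "jamie"))       # 0.5
print(dice("james", "jamie", q=1))  # 0.8
print(dice("listen", "silent", q=1))  # 1.0
```

This gives one way to think about the question on q. With 1-grams the character order is ignored, so anagrams such as "listen" and "silent" score 1.0. With a q close to or larger than the string length, a single differing character changes most of the grams, so few are shared and the score drops sharply.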

Exercise: direct string similarity
• Edit-distance similarity
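Minimum number of character insertions, deletions, and substitutions needed to go from s1 to s2
• “valuation” → “revolution”?
  valuation:   - - v a l u a t i o n
  revolution:  r e v o l u - t i o n
  (4 edits: insert r, insert e, substitute a → o, delete a)
• $sim(x, y) = 1 - \frac{d(x, y)}{\max(|x|, |y|)} = 1 - \frac{4}{10} = 0.6$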
• Jaro-Winkler similarity
• Based on edit-distance
• Favours matches at the start of the strings.
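Both direct string measures above can be sketched as follows. The edit distance uses the usual dynamic-programming recurrence with unit costs. The Jaro-Winkler part follows the commonly described formulation (matching window, transposition count, prefix bonus); the prefix cap of 4 and scaling factor 0.1 are the conventional defaults and are assumptions here, since the slide does not give them.

```python
def levenshtein(s1, s2):
    """Minimum number of insertions, deletions, substitutions from s1 to s2."""
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        curr = [i]
        for j, c2 in enumerate(s2, start=1):
            cost = 0 if c1 == c2 else 1
            curr.append(min(prev[j] + 1,          # delete c1
                            curr[j - 1] + 1,      # insert c2
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def edit_similarity(s1, s2):
    """1 - d(s1, s2) / max(|s1|, |s2|), as in the formula above."""
    longest = max(len(s1), len(s2))
    return 1.0 if longest == 0 else 1 - levenshtein(s1, s2) / longest


def jaro(s1, s2):
    """Jaro similarity based on matching characters and transpositions."""
    if s1 == s2:
        return 1.0
    if not s1 or not s2:
        return 0.0
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    used1, used2 = [False] * len(s1), [False] * len(s2)
    matches = 0
    for i, c in enumerate(s1):
        for j in range(max(0, i - window), min(len(s2), i + window + 1)):
            if not used2[j] and s2[j] == c:
                used1[i] = used2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    out_of_order, k = 0, 0
    for i in range(len(s1)):
        if used1[i]:
            while not used2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order // 2
    return (matches / len(s1) + matches / len(s2) + (matches - t) / matches) / 3


def jaro_winkler(s1, s2, p=0.1, max_prefix=4):
    """Jaro similarity boosted by the length of the common prefix."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:max_prefix], s2[:max_prefix]):
        if a != b:
            break
        prefix += 1
    return base + prefix * p * (1 - base)


print(levenshtein("valuation", "revolution"))      # 4
print(edit_similarity("valuation", "revolution"))  # 0.6
print(round(jaro_winkler("martha", "marhta"), 3))  # 0.961
```

The prefix bonus is what makes the measure favour matches at the start of the strings: "james" and "jamie" share the prefix "jam", so their Jaro-Winkler score is higher than their plain Jaro score.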

Scoring similarity
Combine the set of similarity scores → final score
( 0.82 , 15 , 0 , 0 , 21 ) → $f: \mathbb{R}^d \to [0,1]$ → 0.1
Need a good $f$
Score record pairs for similarity

More on score combination
Idea 1: sum up feature scores
Idea 2: +similarities, -dissimilarities
Idea 3: weighted sum
Idea 4: label examples of matching and non-matching record pairs, train a classifier using machine learning
The classifier will learn the weights
( 0.82 , 15 , 0 , 0 , 21 ) → $f: \mathbb{R}^d \to [0,1]$ → 0.1
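A minimal sketch of Ideas 2 and 3 follows: a weighted sum in which similarities get positive weights and dissimilarities negative ones, squashed into the unit interval with a logistic function so that the result behaves like an $f: \mathbb{R}^d \to [0,1]$. The weights below are hypothetical values picked for illustration; under Idea 4 they would instead be learned from labelled pairs. The sketch does not attempt to reproduce the 0.1 shown on the slide.

```python
import math


def combine(scores, weights, bias=0.0):
    """Weighted sum of feature scores mapped into (0, 1) by a logistic function."""
    if len(scores) != len(weights):
        raise ValueError("scores and weights must have the same length")
    z = bias + sum(w * s for w, s in zip(weights, scores))
    return 1 / (1 + math.exp(-z))


# Feature order: title similarity, year difference, director overlap,
# cast overlap, runtime difference.
features = (0.82, 15, 0, 0, 21)

# Hypothetical weights: positive for similarities, negative for differences.
weights = (2.0, -0.1, 1.0, 1.0, -0.05)

final_score = combine(features, weights)
print(final_score)
```

With these made-up weights the large year and runtime differences outweigh the title similarity, so the final score falls below 0.5. A plain unweighted sum (Idea 1) would be dominated by whichever feature has the largest numeric range, which is one reason to weight or rescale the features.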

Match ‘sufficiently’ similar records
pairs compared → sufficiently similar pairs
Threshold θ
Match when final score > θ, e.g. with threshold θ = 0.5, match when final score > 0.5
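The thresholding step can be sketched in a few lines. The pair identifiers and scores below are hypothetical, used only to show the strict comparison against θ.

```python
def match_pairs(scored_pairs, theta=0.5):
    """Return the pairs whose final score is strictly greater than theta."""
    return [pair for pair, score in scored_pairs.items() if score > theta]


# Hypothetical final scores for three candidate pairs.
scored_pairs = {
    ("a1", "b1"): 0.91,
    ("a2", "b5"): 0.50,
    ("a3", "b2"): 0.10,
}

print(match_pairs(scored_pairs))             # [('a1', 'b1')]
print(match_pairs(scored_pairs, theta=0.05))
```

Lowering θ accepts more pairs as matches, which tends to trade false negatives for false positives.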

Merge matched records
matched pairs
• Also needs to resolve conflicting attributes
• False positives and false negatives still exist
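One possible shape for the merge step is sketched below. The slides do not specify how conflicts are resolved, so the `resolve` callback and the "keep the longer string" rule in the usage example are assumptions made for illustration, as are the two example records.

```python
def merge_records(r1, r2, resolve=None):
    """Merge two matched records, attribute by attribute.

    Missing values (None) are filled from the other record. When both
    records have different values, `resolve(attr, a, b)` decides; the
    default keeps the value from r1.
    """
    if resolve is None:
        resolve = lambda attr, a, b: a
    merged = {}
    for attr in {**r1, **r2}:  # keys of r1 first, then any extra keys of r2
        a, b = r1.get(attr), r2.get(attr)
        if a is None:
            merged[attr] = b
        elif b is None or a == b:
            merged[attr] = a
        else:
            merged[attr] = resolve(attr, a, b)
    return merged


def prefer_longer(attr, a, b):
    """Hypothetical conflict rule: keep the longer string representation."""
    return a if len(str(a)) >= len(str(b)) else b


# Hypothetical matched pair describing the same film.
r1 = {"title": "come fly with me", "year": 2004, "runtime": None}
r2 = {"title": "michael buble come fly with me", "year": 2004, "runtime": 63}

print(merge_records(r1, r2, resolve=prefer_longer))
# {'title': 'michael buble come fly with me', 'year': 2004, 'runtime': 63}
```

Any such rule only resolves conflicts between records that were matched; if the match itself was a false positive, the merged record combines two different entities.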

Evaluation of record linkage results
False positives (fp): # true non-matching pairs that are incorrectly classified as matches.
False negatives (fn): # true matching pairs that are incorrectly classified as non-matches.
True positives (tp): # true matching pairs that are classified as matches.
True negatives (tn): # true non-matching pairs that are classified as non-matches.
• Precision: $tp / (tp + fp)$
Proportion of pairs classified as matches that are true matches.
• Recall: $tp / (tp + fn)$
Proportion of true matching pairs that are classified as matches.
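The two measures can be computed directly from the set of predicted matches and the set of true matches. The pair identifiers below are hypothetical, and returning 0.0 when a denominator is zero is a convention chosen here, not something the definitions prescribe.

```python
def precision_recall(predicted, truth):
    """Precision and recall of predicted matching pairs against true matches."""
    predicted, truth = set(predicted), set(truth)
    tp = len(predicted & truth)   # true matches that were found
    fp = len(predicted - truth)   # predicted matches that are wrong
    fn = len(truth - predicted)   # true matches that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical linkage result: 3 predicted matches, 4 true matches.
predicted = {("a1", "b1"), ("a2", "b2"), ("a3", "b7")}
truth = {("a1", "b1"), ("a2", "b2"), ("a4", "b4"), ("a5", "b5")}

precision, recall = precision_recall(predicted, truth)
print(precision, recall)  # tp=2, fp=1, fn=2: precision 2/3, recall 0.5
```

True negatives do not appear in either formula, which is convenient in record linkage because the number of non-matching pairs is usually far larger than the number of matches.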

Acknowledgements
• Based on presentation materials created by , , and others
