计算机代考程序代写 algorithm Semester 2 2021 – cscodehelp代写

Semester 2 2021
Lecture 4, Part 3: Unstructured Data – Text Preprocessing

String match and comparison –
cont.
N-gram Distance

N-grams
Another method for finding approximate string match
What is an (character) n-gram? A (character) substring of length n • bi-grams of crat: cr, ra, at; or
• bi-grams of crat: #c, cr, ra, at, t# (sometimes)
• tri-grams of crat: ##c, #cr, cra, rat, at#, t##
‘#’ is a padding character
A sequence of characters is converted into a set of n-grams.

N-gram distance
N-gram distance between x and y: |Gn(x)| + |Gn(y)| − 2 × |Gn(x) ∩ Gn(y)|
crat and cart:
|G2(crat)| + |G2(cart)| − 2 × |G2(crat) ∩ G2(cart)| = 5+5−2×2= 6
crat and arts:
|G2(crat)| + |G2(art)| − 2 × |G2(crat) ∩ G2(art)| = 5+5−2×0= 10
• bi-grams of crat: G2(crat) = #c, cr, ra, at, t# • bi-grams of cart: G2(cart) = #c, ca, ar, rt, t# • bi-grams of arts: G2(arts) = #a, ar, rt, ts, s#

N-gram distance
• Less sensitive to relative ordering of strings (matches can be anywhere!)
• More sensitive to long substring matches,
• Quite useless for very long strings and/or very small alphabets (Why?)
• Potentially useful for (approximate) prefixes / suffixes, e.g., Street → St; or smog → smoke

Jaccard similarity (set-based)
𝑆 ∩𝑆 12
𝑆 ∪𝑆 12
S1: G2(crat) = {#c, cr, ra, at, t#} S2: G2(cart) = {#c, ca, ar, rt, t#}
𝑠𝑖𝑚 𝑆,𝑆= 𝑗𝑎𝑐𝑐 1 2
𝑆 ∩ 𝑆 =2 12
𝑆 ∪ 𝑆 =8 12
= 2 = 0.25 8
𝑠𝑖𝑚 𝑐𝑟𝑎𝑡, 𝑐𝑎𝑟𝑡 𝑗𝑎𝑐𝑐

Sørensen-Dice similarity (set-based)
𝑠𝑖𝑚 𝑆,𝑆= 𝑑𝑖𝑐𝑒 1 2
2× 𝑆 ∩𝑆 12
𝑆+𝑆 12
S1: G2(crat) = {#c, cr, ra, at, t#} S2: G2(cart) = {#c, ca, ar, rt, t#}
𝑆 =5 1
𝑆 =5 2
𝑆 ∩ 𝑆 =2 12
𝑠𝑖𝑚𝑑𝑖𝑐𝑒 𝑐𝑟𝑎𝑡,𝑐𝑎𝑟𝑡 =
2×𝑆∩𝑆 4
1 2 = =0.4
𝑆+𝑆 10 12

Cosine similarity (vector-based)
bi-gram(crat) = (#c, cr, ra, at, t#) bi-gram(cart) = (#c, ca, ar, rt, t#) vector representation
#c
ar
at
ca
cr
ra
rt
t#
𝑋
1
0
1
0
1
1
0
1
𝑌
1
1
0
1
0
0
1
1
𝑐𝑜𝑠𝑋,𝑌 =
12
𝑋⋅𝑌
X⋅𝑌=σ𝑛 𝑋𝑌 = 1+0+0+0+0+0+0+1=2 𝑖=1 𝑖 𝑖
𝑋 = σ𝑛 𝑋2 = 12+12+12+12+12 = 5, 𝑖=1 𝑖
𝑌 = σ𝑛 𝑌2 = 12+12+12+12+12 = 5 𝑖=1 𝑖
𝑋𝑌
𝑐𝑜𝑠𝑋,𝑌 =
2 =0.4 5×5

Jaro-Winkler similarity
• Based on edit-distance (also character-based similarity)
• Give more weight to strings with matching prefixes
• Useful for matching (approximate) prefixes / suffixes.
• The details of the algorithm are not covered in the subject

Leave a Reply

Your email address will not be published. Required fields are marked *