计算机代考程序代写 algorithm Semester 2 2021 – cscodehelp代写

Semester 2 2021
Lecture 4, Part 2: Unstructured Data – Text Preprocessing

String match and comparison –
cont.
Minimum Edit Distance

Minimum edit distance
Transform a source string into the target string through three basic character operations.
• Insert, Delete, Replace
• Each operation incurs a cost or (a mismatch score);
“… proceed if you can see no ther option … “
other . .. otter … their … there

Levenshtein distance
A minimum edit distance algorithm.
Operations
• insert
• delete • replace
t h e r
|| | | o t h e r
t h e r
| | | | o tt e r

Levenshtein distance
A minimum edit distance algorithm.
Cost
• Insert →1
• Delete →1
• Replace →1
Alternative
• Insert →1
• Delete →1
• Replace →2
distance = 1
th e r
| | | | o t h e r
distance = 2
t h e r
||| | o tt e r

Levenshtein distance
• Efficient algorithm based on prefix comparison • Demonstrated with a matrix.
#
o
t
t
e
r
#
0
1
2
3
4
5
t
1
1
1
2
3
4
h
2
2
2
2
3
4
e
3
3
3
3
2
3
r
4
4
4
4
3
2

Levenshtein distance
j
(m = 4 n=5)
#
o
t
t
e
r
#
0
1
2
3
4
5
t
1
2, 2, 1
h
2
e
3
r
4
i

Levenshtein distance
(m = 4 n=5)
#
o
t
j
t
e
r
#
0
1
2
3
4
5
t
1
2, 2, 1
3, 2, 1
4, 2, 2
5, 3, 4
6, 4, 5
h
2
2, 3, 2
2, 3, 2
3, 3, 2
4, 3, 3
5, 4, 4
e
3
3, 4, 3
3, 4, 3
3, 4, 3
4, 4, 2
5, 3, 4
r
4
4, 5, 4
4, 5, 4
4, 5, 4
3, 5, 4
4, 4, 2
i

Levenshtein distance
A minimum edit distance algorithm.
Cost
• Insert →1
• Delete →1
• Replace →1
Alternative
• Insert →1
• Delete →1
• Replace →2
distance = 1
th e r
| | | | o t h e r
distance = 2
t h e r
||| | o tt e r

Try it!
#
l
e
c
t
u
r
e
#
0
1
2
3
4
5
6
7
r
1
2, 2, 1
e
2
2, 3, 2
c
3
o
4
r
5
d
6

Edit-distance to similarity
Normalised distance: The distance is divided by the length of the
longer string
𝑑 𝑆1,𝑆2 max 𝑆1 , 𝑆2
𝑠𝑖𝑚𝑒𝑑𝑖𝑡 𝑠1, 𝑠2
• ther → otter?
= 1.0 −
𝑑 𝑆1,𝑆2 max 𝑆1 , 𝑆2
t
h
e
r
o
t
t
e
r
𝑠𝑖𝑚𝑒𝑑𝑖𝑡 𝑡h𝑒𝑟, 𝑜𝑡𝑡𝑒𝑟
= 1.0 − 2 = 0.6 5

Leave a Reply

Your email address will not be published. Required fields are marked *