Semester 2, 2021
Lecture 4, Part 7: Unstructured Data – Text Representations

Raw Frequencies
• What are the problems?
• What are the alternatives?
SPORTS: play, grace, crowd
ARTS: play, grace, audience
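
One way to see the problem the SPORTS and ARTS example points at is to count words directly. The sketch below is only an illustration: the whitespace tokenisation and the variable names (`docs`, `counts`, `shared`, `distinct`) are assumptions made for this example, not part of the lecture.

```python
from collections import Counter

# Toy documents taken from the SPORTS and ARTS example above.
docs = {
    "SPORTS": "play grace crowd",
    "ARTS": "play grace audience",
}

# Raw term frequencies: count each whitespace-separated token.
counts = {label: Counter(text.split()) for label, text in docs.items()}

# Words with the same raw count in both documents cannot separate the classes.
shared = sorted(
    word
    for word in counts["SPORTS"]
    if counts["SPORTS"][word] == counts["ARTS"][word]
)

# Words that occur in only one of the two documents carry the signal.
distinct = sorted(set(counts["SPORTS"]) ^ set(counts["ARTS"]))

print("same count in both:", shared)
print("appear in only one:", distinct)
```

With raw counts, the shared words weigh exactly as much as the distinguishing ones, which is the motivation for re-weighting terms.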

TF-IDF
Discourse on Floating Bodies – Galileo Galilei
Treatise on Light – Christiaan Huygens
Experiments with Alternate Currents of High Potential and High Frequency – Nikola Tesla
Relativity: The Special and General Theory

TF-IDF
• TF-IDF stands for Term Frequency-Inverse Document Frequency.
• Each text document is represented as a numeric vector:
  • each dimension corresponds to a specific word from the corpus.
• It combines two metrics to weight a term (word):
  • term frequency (tf): how often a given word appears within a document;
  • inverse document frequency (idf): down-weights words that appear in many documents.
• Main idea: reduce the weight of terms that are frequent across the corpus and increase the weight of rare, indicative ones.

TF-IDF
Term frequency (TF):
• $tf(t, d)$ = the raw count of term $t$ in document $d$.
Inverse document frequency (IDF):
• $idf(t) = \ln\frac{1 + N}{1 + df_t} + 1$ or $idf(t) = \ln\frac{N}{df_t} + 1$
  • $N$ is the number of documents in the collection,
  • $df_t$ is the document frequency, the number of documents containing the term $t$.
TF-IDF (L2 normalised):
• $tf\_idf(t, d) = \frac{v_t}{\sqrt{\sum_{t' \in d} v_{t'}^2}}$, where $v_t = tf(t, d) \times idf(t)$
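
The definitions above can be sketched in a few lines of plain Python. This is a minimal illustration rather than a reference implementation: the function names `idf` and `tf_idf`, the `smooth` flag, and the toy fruit documents are all assumptions introduced for this example.

```python
import math
from collections import Counter


def idf(term, docs, smooth=True):
    """Inverse document frequency of `term` over a list of token lists."""
    n = len(docs)
    df = sum(1 for doc in docs if term in doc)
    if smooth:
        # ln((1 + N) / (1 + df_t)) + 1
        return math.log((1 + n) / (1 + df)) + 1
    # ln(N / df_t) + 1; undefined when the term occurs in no document.
    return math.log(n / df) + 1


def tf_idf(doc, docs):
    """L2-normalised TF-IDF weights for one document (a list of tokens)."""
    tf = Counter(doc)  # raw counts, tf(t, d)
    v = {t: tf[t] * idf(t, docs) for t in tf}  # v_t = tf(t, d) * idf(t)
    norm = math.sqrt(sum(w * w for w in v.values()))
    return {t: w / norm for t, w in v.items()}


# Hypothetical toy corpus, already tokenised.
docs = [["apple", "apple", "pear"], ["pear", "plum"]]
weights = tf_idf(docs[0], docs)
print(weights)
```

In this toy corpus "apple" appears twice in the first document and nowhere else, so it receives a larger weight than "pear", which appears in both documents.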

Example TF-IDF
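Two documents:
A – ‘the car is driven on the road’
B – ‘the truck is driven on the highway’

| word | $tf(t,d)$ A | $tf(t,d)$ B | $idf(t) = \ln\frac{1+N}{1+df_t} + 1$ | $v_t$ A | $v_t$ B | $tf\_idf(t,d)$ A | $tf\_idf(t,d)$ B |
|---|---|---|---|---|---|---|---|
| car | 1 | 0 | $\ln\frac{3}{2} + 1 = 1.405$ | 1.405 | 0 | 0.632 | 0 |
| driven | 1 | 1 | $\ln\frac{3}{3} + 1 = 1$ | 1 | 1 | 0.449 | 0.449 |
| road | 1 | 0 | $\ln\frac{3}{2} + 1 = 1.405$ | 1.405 | 0 | 0.632 | 0 |
| truck | 0 | 1 | $\ln\frac{3}{2} + 1 = 1.405$ | 0 | 1.405 | 0 | 0.632 |
| highway | 0 | 1 | $\ln\frac{3}{2} + 1 = 1.405$ | 0 | 1.405 | 0 | 0.632 |

Normalisation term for both A and B: $\sqrt{\sum_{t' \in d} v_{t'}^2} = 2.225$
* stop words removed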
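
The worked example can be checked numerically. The sketch below assumes the token lists left after stop-word removal are exactly the five words in the table; the variable names are chosen for this illustration only.

```python
import math

# Documents A and B after stop-word removal (token lists assumed from the table).
docs = {
    "A": ["car", "driven", "road"],
    "B": ["truck", "driven", "highway"],
}
vocab = ["car", "driven", "road", "truck", "highway"]
n_docs = len(docs)

# Smoothed idf: ln((1 + N) / (1 + df_t)) + 1
idf = {}
for term in vocab:
    df = sum(term in tokens for tokens in docs.values())
    idf[term] = math.log((1 + n_docs) / (1 + df)) + 1

tfidf = {}
for name, tokens in docs.items():
    # v_t = tf(t, d) * idf(t), then divide by the L2 norm of the document vector.
    v = [tokens.count(term) * idf[term] for term in vocab]
    norm = math.sqrt(sum(x * x for x in v))
    tfidf[name] = [round(x / norm, 3) for x in v]
    print(name, "norm =", round(norm, 3), "tf-idf =", tfidf[name])
```

The printed values should agree with the table: a norm of 2.225 for each document, and weights of 0.632 and 0.449.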

Example TF-IDF – cont.
• Two documents, A and B.
  A. ‘the car is driven on the road’
  B. ‘the truck is driven on the highway’
  * stop words removed
• Text features for machine learning

| document | car | driven | road | truck | highway |
|---|---|---|---|---|---|
| A | 0.632 | 0.449 | 0.632 | 0 | 0 |
| B | 0 | 0.449 | 0 | 0.632 | 0.632 |

TRY THIS!
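• 3 documents:
A: ‘the car is driven on roads’
B: ‘the truck is driven on a highway’
C: ‘a bike can not be ridden on a highway’

| word | $tf(t,d)$ A | $tf(t,d)$ B | $tf(t,d)$ C | $idf(t) = \ln\frac{1+N}{1+df_t} + 1$ | $v_t$ A | $v_t$ B | $v_t$ C | $tf\_idf(t,d)$ A | $tf\_idf(t,d)$ B | $tf\_idf(t,d)$ C |
|---|---|---|---|---|---|---|---|---|---|---|
| car | 1 | 0 | 0 | $\ln\frac{4}{2} + 1 =$ | | | | | | |
| driven | 1 | 1 | 0 | | | | | | | |
| road | 1 | 0 | 0 | | | | | | | |
| truck | 0 | 1 | 0 | | | | | | | |
| highway | 0 | 1 | 1 | | | | | | | |
| bike | 0 | 0 | 1 | | | | | | | |
| ridden | 0 | 0 | 1 | | | | | | | |

Normalisation term $\sqrt{\sum_{t' \in d} v_{t'}^2}$: A = ____, B = ____, C = ____
* stop words removed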
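
After working through the table by hand, the answers can be checked with a short script. This is a sketch under stated assumptions: the token lists below assume that ‘can’, ‘not’, ‘be’, and ‘a’ are treated as stop words and that ‘roads’ is reduced to ‘road’ (as the table's word column suggests); `tfidf_table` is a hypothetical helper name.

```python
import math

# Assumed token lists after stop-word removal and reducing 'roads' to 'road'.
docs = {
    "A": ["car", "driven", "road"],
    "B": ["truck", "driven", "highway"],
    "C": ["bike", "ridden", "highway"],
}


def tfidf_table(docs):
    """Return (idf, table): smoothed idf per term and L2-normalised TF-IDF per document."""
    vocab = sorted({t for tokens in docs.values() for t in tokens})
    n = len(docs)
    idf = {}
    for t in vocab:
        df = sum(t in tokens for tokens in docs.values())
        idf[t] = math.log((1 + n) / (1 + df)) + 1
    table = {}
    for name, tokens in docs.items():
        v = {t: tokens.count(t) * idf[t] for t in vocab}
        norm = math.sqrt(sum(x * x for x in v.values()))
        table[name] = {t: v[t] / norm for t in vocab}
    return idf, table


idf, table = tfidf_table(docs)
for name, row in table.items():
    print(name, {t: round(w, 3) for t, w in row.items()})
```

Comparing the printed rows with the hand-computed cells is a quick way to spot an arithmetic slip in the idf or normalisation step.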

Features from unstructured text
Features for structured data
Features for unstructured text

| document | car | driven | road | truck | highway |
|---|---|---|---|---|---|
| A | 0.632 | 0.449 | 0.632 | 0 | 0 |
| B | 0 | 0.449 | 0 | 0.632 | 0.632 |

Rank document similarity to a query
• Query q = ‘I saw a car and a truck on the highway’
• Query terms = [‘car’, ‘truck’, ‘highway’]
• Query vector [1, 0, 0, 1, 1], unit vector $v_q$ = [0.577, 0, 0, 0.577, 0.577]
• Cosine similarity to rank documents:
  $\cos(v_q, d_1) = 0.36$, $\cos(v_q, d_2) = 0.73$

| document | car | driven | road | truck | highway |
|---|---|---|---|---|---|
| d1 | 0.632 | 0.449 | 0.632 | 0 | 0 |
| d2 | 0 | 0.449 | 0 | 0.632 | 0.632 |
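
The ranking step can be sketched as follows. The document vectors are copied from the table, the query is encoded as a binary vector over the same vocabulary, and the helper name `cosine` is an assumption made for this illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


vocab = ["car", "driven", "road", "truck", "highway"]

# TF-IDF vectors for the two documents, taken from the table.
documents = {
    "d1": [0.632, 0.449, 0.632, 0, 0],
    "d2": [0, 0.449, 0, 0.632, 0.632],
}

# Binary query vector: 1 where a query term matches the vocabulary entry.
query_terms = ["car", "truck", "highway"]
query = [1 if term in query_terms else 0 for term in vocab]

scores = {name: cosine(query, vec) for name, vec in documents.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
for name in ranking:
    print(name, round(scores[name], 2))
```

Because cosine similarity divides by both vector lengths, the query does not need to be converted to a unit vector first; the scores come out the same either way, and d2 ranks above d1 since it matches two of the three query terms.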
