Semester 2, 2021
Lecture 4, Part 6: Unstructured Data – Text Representations

Features
Sepal length | Sepal width | Petal length | Petal width | Species (label)
4.9          | 3.0         | 1.4          | 0.2         | Iris setosa
7.0          | 3.2         | 4.7          | 1.4         | Iris versicolor
5.4          | 3.7         | 1.5          | 0.2         | Iris setosa
6.3          | 3.3         | 6.0          | 2.5         | Iris virginica

How To Represent Text?

Text Features
• Part-of-speech tagging
• She saw a bear. (bear: NOUN)
• Your efforts will bear fruit. (bear: VERB)
• bear_NN; bear_VB
Token | We   | value | curiosity | ,    | passion | and  | a   | desire | to | learn
Tag   | PRON | VERB  | ADJ       | PUNC | ADJ     | CONJ | DET | NOUN   | TO | VERB
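
The bear_NN / bear_VB idea can be sketched in a few lines. This is a minimal illustration, assuming the tags are already available: the Penn Treebank tags below are assigned by hand for the two example sentences, not produced by a tagger, and `tag_features` is a hypothetical helper name.

```python
def tag_features(tagged_tokens):
    """Join each lower-cased word with its POS tag, e.g. ("bear", "NN") -> "bear_NN"."""
    return [f"{word.lower()}_{tag}" for word, tag in tagged_tokens]

# Hand-assigned tags (an assumption for illustration, not tagger output).
sentence_1 = [("She", "PRP"), ("saw", "VBD"), ("a", "DT"), ("bear", "NN"), (".", ".")]
sentence_2 = [("Your", "PRP$"), ("efforts", "NNS"), ("will", "MD"),
              ("bear", "VB"), ("fruit", "NN"), (".", ".")]

features_1 = tag_features(sentence_1)
features_2 = tag_features(sentence_2)
print(features_1)
print(features_2)
```

The same surface word now yields two distinct features, so a model can treat the noun and the verb separately.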
• ngrams
• Unigrams: we, value, curiosity, ",", passion, and, a, desire, to, learn
• Bigrams: we_value, value_curiosity, curiosity_,, ,_passion, passion_and, and_a, a_desire, desire_to, to_learn
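
The n-gram features for this sentence can be generated with a sliding window over the token list. A minimal sketch, assuming the text is already tokenised and lower-cased; `ngrams` is a hypothetical helper name.

```python
def ngrams(tokens, n):
    """Return all n-grams of a token list, each joined with underscores."""
    return ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["we", "value", "curiosity", ",", "passion", "and", "a", "desire", "to", "learn"]

bigrams = ngrams(tokens, 2)
print(bigrams)
```

A sentence of 10 tokens produces 9 bigrams, and calling `ngrams(tokens, 1)` returns the original tokens.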

Text Representation – BoW
• Bag-of-words: simplest vector space representational model for text
• Disregards word order and grammatical features such as POS
• Each text document is represented as a numeric vector
• each dimension is a specific word from the corpus
• the value is either the word's frequency in the document or its occurrence (denoted by 1 or 0).
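
A small sketch of this representation, using only the standard library. The two toy documents are made up for illustration, and `build_vocabulary` and `bow_vector` are hypothetical helper names; the vocabulary is simply sorted alphabetically to fix the order of dimensions.

```python
from collections import Counter

def build_vocabulary(docs):
    """One dimension per distinct word in the corpus, in alphabetical order."""
    return sorted({tok for doc in docs for tok in doc})

def bow_vector(doc, vocab, binary=False):
    """Map a tokenised document to counts (or 1/0 occurrence) over the vocabulary."""
    counts = Counter(doc)
    if binary:
        return [1 if counts[w] > 0 else 0 for w in vocab]
    return [counts[w] for w in vocab]

# Toy corpus (assumed for illustration); each document is already tokenised.
docs = [
    ["school", "world", "school", "year"],
    ["world", "mind", "mind"],
]

vocab = build_vocabulary(docs)
count_vectors = [bow_vector(d, vocab) for d in docs]
binary_vectors = [bow_vector(d, vocab, binary=True) for d in docs]
print(vocab)
print(count_vectors)
print(binary_vectors)
```

Word order inside each document has no effect on the resulting vectors, which is the sense in which the model is a "bag".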

Prepare for BoW
• Word tokenisation
• Case-folding
• Abstraction of number (#num#, #year#)
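
These three steps can be sketched with regular expressions. This is a rough illustration under stated assumptions: the token pattern is a simple one, and treating any four-digit number starting with 19 or 20 as a year is a heuristic chosen here, not a rule from the lecture.

```python
import re

# Numbers (with optional decimal part), runs of word characters, or single punctuation marks.
TOKEN_RE = re.compile(r"\d+(?:\.\d+)?|\w+|[^\w\s]")

def abstract_number(token):
    """Replace numeric tokens with the placeholders #year# or #num#."""
    if re.fullmatch(r"(?:19|20)\d{2}", token):  # assumed year heuristic
        return "#year#"
    if re.fullmatch(r"\d+(?:\.\d+)?", token):
        return "#num#"
    return token

def preprocess(text):
    """Tokenise, case-fold, and abstract numbers."""
    tokens = TOKEN_RE.findall(text)
    return [abstract_number(tok.lower()) for tok in tokens]

result = preprocess("In 2021 the School enrolled 350 students.")
print(result)
```

After this step, "School" and "school" share one dimension, and all plain numbers collapse into a single #num# feature.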

Prepare for BoW
• Stop word removal
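
Stop word removal amounts to filtering tokens against a fixed list. A minimal sketch: the stop list below is a tiny hand-picked set for illustration only, whereas real stop lists are much longer and vary between toolkits.

```python
# Tiny illustrative stop list (an assumption, not a standard list).
STOP_WORDS = {"a", "an", "and", "the", "to", "of", "in", "we"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep only tokens that are not in the stop list."""
    return [tok for tok in tokens if tok not in stop_words]

tokens = ["we", "value", "curiosity", ",", "passion", "and", "a", "desire", "to", "learn"]
content_tokens = remove_stop_words(tokens)
print(content_tokens)
```

Punctuation survives this step because it is not in the stop list; it is handled separately.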

Prepare for BoW
• Stemming
• How would this look different if we lemmatised instead?
• Punctuation removal
• Word counts
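
The last three steps can be sketched together. Note the loud assumption: `toy_stem` is a deliberately crude suffix stripper written for this example, not the Porter stemmer or any library's algorithm, and it will mis-stem many English words.

```python
import string
from collections import Counter

# Suffixes tried in order; a toy rule set, assumed for illustration only.
SUFFIXES = ("ing", "ed", "es", "s", "e")

def toy_stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def is_punctuation(token):
    """True if the token consists only of ASCII punctuation characters."""
    return all(ch in string.punctuation for ch in token)

def count_stems(tokens):
    """Drop punctuation tokens, stem the rest, and count the stems."""
    return Counter(toy_stem(t) for t in tokens if not is_punctuation(t))

tokens = ["transforming", "schools", ",", "transformed", "futures", "future", "."]
counts = count_stems(tokens)
print(counts)
```

On the lemmatisation question: a lemmatiser returns dictionary forms, so "futures" would map to "future" rather than to a truncated stem such as "futur".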

Bag of Words
       | tranform | school | world | year | futur | foster | mind | cheese | transform
doc001 | 2        | 4      | 1     | 4    | 3     | 0      | 0    | 0      | 0
doc002 | 1        | 3      | 3     | 0    | 2     | 2      | 2    | 0      | 1
doc003 | 0        | 3      | 4     | 0    | 3     | 3      | 0    | 4      | 2

Term Frequency
• What if a rare word occurs in a document?
• e.g. ‘catarrh’ is less common than ‘mucus’ or ‘stuffiness’
• What if a word occurs in many documents?
• Maybe we want to avoid raw counts?
• Raw frequencies vary with document length
• Raw counts don’t capture important (rare) features that would be telling of a type of document
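
Both concerns can be made concrete with the counts from the Bag of Words table. The sketch below divides each count by the document's total (one simple way to remove the length effect) and counts how many documents contain each term. The term labels are copied as they appear in the table; `relative_frequencies` and `document_frequency` are hypothetical helper names.

```python
# Counts copied from the Bag of Words table; term labels kept exactly as shown there.
terms = ["tranform", "school", "world", "year", "futur",
         "foster", "mind", "cheese", "transform"]
counts = {
    "doc001": [2, 4, 1, 4, 3, 0, 0, 0, 0],
    "doc002": [1, 3, 3, 0, 2, 2, 2, 0, 1],
    "doc003": [0, 3, 4, 0, 3, 3, 0, 4, 2],
}

def relative_frequencies(row):
    """Divide each count by the document total so long documents do not dominate."""
    total = sum(row)
    return [c / total if total else 0.0 for c in row]

def document_frequency(table):
    """For each term (column), count the documents in which it occurs at least once."""
    rows = list(table.values())
    return [sum(1 for row in rows if row[j] > 0) for j in range(len(rows[0]))]

rel = {doc: relative_frequencies(row) for doc, row in counts.items()}
df = dict(zip(terms, document_frequency(counts)))

print(rel["doc003"])
print(df)
```

Here "school" occurs in all three documents, so it says little about any one of them, while "cheese" occurs only in doc003. Down-weighting terms with a high document frequency, as TF-IDF weighting does, is one common way to act on that observation.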
