程序代写代做代考 Hidden Markov Mode python computational biology deep learning chain lecture09.pptx

lecture09.pptx

LECTURE 9

Sequence Classifcaaon and Part-Of-Speech Tagging

Arkaitz Zubiaga, 5
th

February, 2018

 Sequence Classifcaaon

 Sequence Classifers:

 Hidden Markov Models (HMM).

 Maximum Entropy Markov Models (MEMM).

 Condiaonal Random Fields (CRF).

 Using Sequence Classifers for Part-of-Speech (POS) Tagging.

LECTURE 9: CONTENTS

 Someames, classifcaatin tif items in a sequence is dependent
tin tither sequence items:

 Part-tif-Speech (POS) tagging:
Assigning categtiries tti wtirds, e.g. adjecave, noun or verb

Example:
The man saw a cat
DET NN VB DET NN

SEQUENCE CLASSIFICATION

DET: determiner
NN: noun
VB: verb

 Why is classifcaaon dependent on other sequence items?

In our example, “The man saw a cat”:

‘saw’ can be:
a ntiun (if it refers to the tool for cutng)
a verb (if it’s simple past of ‘see’)

we can’t classify whether ‘saw’ is verb or noun by looking at the
word alone → need to look at context, the sequence

SEQUENCE CLASSIFICATION

 The man saw a cat

 ??? ??? ??? ??? ???

 The frst item, ‘the’, is easy tti classify, it’s always a determiner:

The man saw a cat

 DET ??? ??? ??? ???

SEQUENCE CLASSIFICATION

 The man saw a cat

 DET ??? ??? ??? ???

 Now ‘man’ is ambigutius: (1) ntiun, i.e. male person, or (2) verb,
i.e. ‘take charge of’? We can look at:

 P(‘man
NN

’) vs P(‘man
VB

’) → probability of the word as noun or verb

 P(NN | DET) vs P(VB | DET) → probability of a noun or a verb to be
preceded by a determiner

SEQUENCE CLASSIFICATION

 Twti prtibabiliaes to determine POS of a word:
[1] P(‘manNN’) vs P(‘manVB’)
& [2] P(‘manNN’ | DET) vs P(‘manVB’ | DET)

 Classifers we have seen so far (NB, MaxEnt, SVM) can handle [1].

 But they can’t handle [2].

SEQUENCE CLASSIFICATION

 Using training data, we can learn of word category POS
i
being

preceded by POS
j
.

SEQUENCE CLASSIFICATION

Prepositions can be followed by
nouns (.7) or determiners (.3),
never by verbs.

SEQUENCE CLASSIFICATION

Nouns can be followed by verbs,
prepositions, or another noun.

SEQUENCE CLASSIFICATION

Determiners are ALWAYS
followed by a noun.

 [1] P(‘manNN’) vs P(‘manVB’)

 [2] P(NN | DET) vs P(VB | DET)

The man saw a cat

DET ??? ??? ??? ???

SEQUENCE CLASSIFICATION

No matter the outcome of [1], we know it’s NN
thanks to [2].

 OK, and how do we achieve this?

 Hidden Markov Models (HMM)

 Maximum Entropy Markov Models (MEMM)

 Condiaonal Random Fields (CRF)

 Deep Learning:

 Recurrent Neural Networks (RNN).

 Long/Short-Term Memory Networks (LSTM).

 Convoluaonal Neural Networks (CNN).

SEQUENCE CLASSIFICATION

 OK, and how do we achieve this?

 Hidden Marktiv Mtidels (HMM)

 Maximum Entrtipy Marktiv Mtidels (MEMM)

 Ctindiatinal Randtim Fields (CRF)

 Deep Learning:

 Recurrent Neural Networks (RNN).

 Long/Short-Term Memory Networks (LSTM).

 Convoluaonal Neural Networks (CNN).

SEQUENCE CLASSIFICATION

HIDDEN MARKOV MODELS

 A Markov chain:

 Set of ntides ctinnected with prtibabiliaes , where weights on
all edges leaving a node sum to 1.

WHAT IS A MARKOV CHAIN?

MARKOV CHAINS

 In a First order Markov Chain, the probability of a paracular state
depends ONLY on the previous state.

 Remember: Lecture 3 on language models, Markov assumpaon:
P(library | I found two pounds in the) ≈ P(library | the)

MARKOV CHAINS

WHAT IS A HIDDEN MARKOV MODEL (HMM)?

HMM: TWO KINDS OF PROBABILITIES

 Transiatin prtibabiliaes: pairwise probabiliaes of tags being
preceded/followed by another tag.

e.g. determiner likely followed by noun or adjecave.

 Emissitin prtibabiliaes: probability for a paracular tag to be a
paracular word, e.g. verb is very likely to be “is”.

 Again, the Markov assumpaon, i.e. dependence only on the
previous state.

 There must be a dependency between sequence items, e.g.:

 Weather (dependency): this morning’s weather may
determine the probability of this afernoon’s weather.

 Independent events: my laptop having been broken doesn’t
determine the probability of my next laptop breaking.

TWO IMPORTANT ASSUMPTIONS IN HMM

 Limitaaons of HMM:

 Models dependencies between each state and only its
corresponding observaaon.

 Learns joint distribuaon of states and observaaons P(Y, X),
but not the condiaonal probability P(Y|X).

LIMITATIONS OF HMM

 NLTK HMM:
htp://www.nltk.org/_modules/nltk/tag/hmm.html

 hmmlearn:
htps://github.com/hmmlearn/hmmlearn

 Scikit HMM:
htp://scikit-learn.sourceforge.net/stable/modules/hmm.html

HMM IMPLEMENTATIONS

 Maximum Entropy Markov Models (MEMM): Models
dependencies between each state and the full tibservaatin
sequence.

 Learning objecave funcaon consistent with predicave funcaon:
P(Y|X).

MEMM: ALTERNATIVE TO HMM

MEMM: THE LABEL BIAS PROBLEM

 In principle, locally, state 1 prefers state 2 in 2 nd place.

MEMM: THE LABEL BIAS PROBLEM

 But if we look at the enare path (which MEMM does):

 P(1→1→1→1) = 0.4*0.45*0.5 = 0.09

 P(1→2→2→2) = 0.6*0.3*0.3 = 0.054

MEMM: THE LABEL BIAS PROBLEM

Despite 1→2 being more
likely locally,

the entire path probability
will favour 1→1

 Soluaon: don’t normalise probabiliaes locally, use local
potenaals.

SOLUTION TO LABEL BIAS

 CRF:

 Global ntirmaliser Z(x) tiverctimes label bias issue of MEMM.

 Models the dependency between each state and the enare
tibservaatin sequence (like MEMM).

CONDITIONAL RANDOM FIELDS

 Widely used for a range of sequence classifcaaon tasks:

Laferty, J., McCallum, A., & Pereira, F. C. (2001). Condiaonal
random felds: Probabilisac models for segmenang and labeling
sequence data.
htps://repository.upenn.edu/cgi/viewcontent.cgi?aracle=1162&context=cis_papers

CONDITIONAL RANDOM FIELDS

 Python-crfsuite:
htp://python-crfsuite.readthedocs.io/en/latest/

 PyStruct:
htps://pystruct.github.io/

IMPLEMENTATIONS OF CRF

 We’ve talked about classifer linear sequences.

 But someames we can have more complex sequences:

 Graph or tree-structured sequences.

BEYOND CLASSIFIER LINEAR SEQUENCES

 Post classifcaaon in forums, e.g. binary classifcaaon: does a post
answer the quesaon of the 1 st post?

 Post 1

 Post 1.1

 Post 1.1.1

 Post 1.1.2

 Post 1.2

 Post 1.2.1

 Post 1.2.1.1

 Post 1.2.2

WHERE DO WE FIND TREE SEQUENCES IN NLP?

HMM VS CRF

USING SEQUENCE CLASSIFIERS
FOR PART-OF-SPEECH (POS)
TAGGING

PART-OF-SPEECH TAGGING

 As we were saying, sequence can play an important role in
determining POS tags in a sentence:

The man saw a cat

DET NN VB DET NN

“man” can’t be a verb if it’s preceded by a determiner.

PART-OF-SPEECH TAGGING

 As we were saying, sequence can play an important role in
determining POS tags in a sentence:

The man saw a cat

DET NN VB DET NN

 With HMM, only looking at previous label, we can end up predicting
sequence of 5 nouns (NN, NN, NN, NN, NN).

 Looking at the entire sequence (MEMM, CRF), we avoid this, we
never have a sentence only made of 5 nouns.

PART-OF-SPEECH TAGGING

 The list of POS tag types is actually quite large!
Open class (lexical) words

Closed class (functional)

Nouns Verbs

Proper Common

Modals

Main

Adjectives

Adverbs

Prepositions

Particles

Determiners

Conjunctions

Pronouns

… more

IBM

Italy

cat / cats

snow
see

registered

can

had

old older oldest

slowly

to with

off up

the some

and or

he its

Numbers

122,312

one

Interjections Ow Eh

OPEN VS CLOSED CLASSES

 Open vs. Closed classes:
 Closed:

 determiners: a, an, the,…
 pronouns: she, he, I,…
 prepositions: on, under, over, near, by,…

 Open:
 Nouns, Verbs, Adjectives, Adverbs.

CHALLENGES IN POS TAGGING: AMBIGUITY

 Words often have more than one possible POS, e.g. back:
 The back door = JJ (adjective)
 On my back = NN (noun)
 Win the voters back = RB (adverb)
 Promised to back the bill = VB (verb)

 See list of POS tags:

http://rednoise.org/rita/reference/PennTags.html

PART-OF-SPEECH TAGGING

 POS tagging example:
 Input: Plays well with others

NNS UH IN NNS
VBZ JJ

NN
RB

 Output: Plays/VBZ well/RB with/IN others/NNS

1. List candidate labels for each
word.
2. Based on probabilities learnt from
training data, the classifier predicts
the most likely POS sequence.

NB: the fact that there are 2
unambiguous words (with & others)
is useful to first label them, and then
predict the other 2.

PART-OF-SPEECH TAGGING

 Uses of POS tagging in NLP:
 Text-to-speech.
 Phrase search through regular expressions, e.g. (Det) Adj* N+
 As input to or to speed up a full linguistic parser (later lectures)
 If you know the tag, you can back off to it, e.g. in lemmatisation,

saw → see, or saw → saw?
 In other fields:

 Computational biology.
 Prediction of series of data, e.g. weather forecasting.

POS TAGGING PERFORMANCE

 Current POS tagging system are very accurate:
 Baseline system that predicts the most common POS for a word

already gets 90% accuracy → owing to many words not being
ambiguous.

 Standard classifiers (no sequence) can achieve 93% accuracy.
 Sequence classifiers can achieve 97% accuracy.

REFERENCES

 List of software for POS tagging:
https://aclweb.org/aclwiki/Part-of-speech_tagging

ASSOCIATED READING

 Jurafsky, Daniel, and James H. Martin. 2009. Speech and Language
Processing: An Introduction to Natural Language Processing, Speech
Recognition, and Computational Linguistics. 3rd edition. Chapters 9-
10.

 Bird Steven, Ewan Klein, and Edward Loper. Natural Language
Processing with Python. O’Reilly Media, Inc., 2009. Chapter 6
Section 1.6.

Published by admin

Leave a Reply Cancel reply