ECE 657A: Classification
Lecture 5: Classification, Training, Validation and Estimation
Mark Crowley
January 24, 2016
Mark Crowley ECE 657A : Lecture 5 January 24, 2016 1 / 34
Class Admin Announcements
Today’s Class
Announcements
Supervised Learning / Classification
Project Pitch Session
Break
Naive Bayes and MAP estimation
Training, validation, error estimation
Mark Crowley ECE 657A : Lecture 5 January 24, 2016 2 / 34
Class Admin Announcements
Announcements
Today (Feb 1): Project Pitch Session.
Assignment 1: due Feb 10 at 11:59pm (do in groups)
Next week (Feb 8):
I’ll be travelling Feb 3-9
11:30am TA help session
Guest lecture on Data Analytics tools for R.
Oxt week (Feb 15): TA help session for the assignment or project
(11:30-12:50) and a guest lecture on Data Analytics tools for R by one
of Prof. Fischmeister’s students (1-2:20).
Mark Crowley ECE 657A : Lecture 5 January 24, 2016 3 / 34
http://www.urbandictionary.com/define.php?term=oxt
Class Admin Pitch Session
Pitch Session : Structure
Your group will be matched with two other groups – see table.
Your group will have 5-10 minutes to pitch your idea to two other
groups.
You can use any materials you want: I’ll have flip chart paper, you can
use the chalkboard or whiteboard, you could use a laptop, or you could
just talk.
Then 5-10 minutes of discussion and feedback.
Then repeat with the next group.
Mark Crowley ECE 657A : Lecture 5 January 24, 2016 4 / 34
Class Admin Pitch Session
Pitch Session : Grading
You will not be graded on your pitch quality or your topic.
You will be graded on your feedback for others. (1.5% out of 4% peer
review)
Peer review forms:
Explain the project idea: you need to provide a concise, fairly accurate
explanation of their idea (shows you listened)
Do you think the project is too easy, too hard or just right?
Give some substantial, constructive suggestion for their project.
Explain how your suggestion would improve the project’s ability to
teach the class about something new.
Submit form online : https://goo.gl/forms/YpMlLC2qAsC5YXJW2
Mark Crowley ECE 657A : Lecture 5 January 24, 2016 5 / 34
Supervised vs. Unsupervised Learning
Outline
Class Admin
Announcements
Pitch Session
Supervised vs. Unsupervised Learning
Classification
Training, Validation and Error Estimation
Dividing the Data
Confusion Matrix
ROC and AUC Curves
Other Performance Measures
Hypothesis Testing
Mark Crowley ECE 657A : Lecture 5 January 24, 2016 6 / 34
Supervised vs. Unsupervised Learning
Descriptive versus Inferential Analysis
We have data (samples). This data is a sample of a population (more
than just the measured or observed sample).
Descriptive Analysis is the type of analysis and measures that seek
to describe and summarize the data, i.e. the available samples. We
cannot, in general, use it to draw conclusions about unobserved data.
Inferential Analysis (predictive) is the type of analysis that can
describe measures over the population of data. That is observed and
unobserved.
Mark Crowley ECE 657A : Lecture 5 January 24, 2016 7 / 34
Supervised vs. Unsupervised Learning
Another way to look at it: Machine Learning [Murphy,
2012]
Supervised Learning Also called: predictive, inferential
Given inputs and outputs
Model based: model or distribution is known
Parameter Estimation: distribution known, fitting
parameters
Linear Regression, least squares estimation
output is continuous: regression, prediction
output is categorical (label): classification
Unsupervised Learning Also called: descriptive, knowledge discovery
Given inputs only
output is categorical : clustering
Reinforcement Learning Input and output given only when action
(prediction, choice, activity) provided.
Given feedback about utility of choice made.
No given model or labels ahead of time.
Learning is interactive.
Mark Crowley ECE 657A : Lecture 5 January 24, 2016 8 / 34
Supervised vs. Unsupervised Learning
Clustering vs. Classification
Clustering
Unsupervised
Uses unlabeled data
Organize patterns w.r.t. an optimization criterion
Notion of similarity
Hard to evaluate
Examples: K-means, Fuzzy C-means, Hierarchical
Classification
Supervised
Uses labeled data
Requires training phase
Domain sensitive
Easy to evaluate
Examples: Bayesian, KNN, SVM,
Decision Trees
Mark Crowley ECE 657A : Lecture 5 January 24, 2016 9 / 34
Classification
Classification Definition
Definition:
Classification is a learning method that uses training samples (with
known class labels) to learn how to assign samples to their proper
classes. The task of the classifier is to use the feature vector
provided by the feature extractor to assign the object (datapoint) to a
category.
This learning can then be used to label test samples (with unknown
labels).
Classification can be facilitated if the features (or a subset of them)
of samples in the same class share characteristics that discriminate
them from other classes.
It can be seen as the discrete form of prediction: you map the input
values to a discrete set of output values rather than to a continuous
number.
Mark Crowley ECE 657A : Lecture 5 January 24, 2016 10 / 34
Classification
Classification
How do we design a classifier that can deal with the variability in
feature values?
Designing or selecting a classifier for the data or application is not an
obvious task.
Mark Crowley ECE 657A : Lecture 5 January 24, 2016 11 / 34
Classification
Definition
Given a dataset X = {x1, x2, . . . , xn} where each datapoint xi ∈ X
contains F features denoted xi,f,
and given a set of classes C = {C1, . . . , CK},
the classification problem is to define a mapping δ(xi) : X → C
where each xi is assigned to one class.
A class, Cj , contains precisely those points mapped to it, no more
and no less.
Note: the classes are predefined, nonoverlapping and partition the
entire set of datapoints.
Mark Crowley ECE 657A : Lecture 5 January 24, 2016 12 / 34
Classification
Considerations for Designing a Classification Approach
1. Data: labelled or unlabelled, one class or multiclass, overlapping classes, missing values
2. Training and Error Estimation procedures
3. Classification methods
4. Performance measures
Mark Crowley ECE 657A : Lecture 5 January 24, 2016 13 / 34
Classification
Classifier Methods
Similarity Based Approach
Template matching
Nearest Mean classifier
Neural Network classifiers
Deep Learning
Other
Decision Trees
Random Forests
Probabilistic Approach
Bayesian Classification
Logistic regression
Decision Boundary Approach
Geometrical
Neural Networks
SVM
Mark Crowley ECE 657A : Lecture 5 January 24, 2016 14 / 34
Training, Validation and Error Estimation Dividing the Data
Training and Error Estimation
Resubstitution
Holdout Method
Leave-one-out
K-fold Cross Validation
Bootstrap
Mark Crowley ECE 657A : Lecture 5 January 24, 2016 15 / 34
Training, Validation and Error Estimation Dividing the Data
Error Estimation and Training Methods
Resubstitution Method: Uses all the available data for training and then
the same data is used for testing
Optimistically biased estimate, especially when the ratio
of sample size to dimensionality is small
Holdout Method: Uses half the data for training and the other half is used
for testing; training and test sets are independent
Pessimistically biased estimate
Different partitionings may give different results
Leave-one-out Method: A classifier is designed using (n − 1) samples and
evaluated on the remaining sample; this is repeated n
times with different training sets of size (n − 1)
Estimate is unbiased but it has a large variance
Large computational requirement because n different
classifiers have to be designed
Mark Crowley ECE 657A : Lecture 5 January 24, 2016 16 / 34
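As a concrete illustration of the holdout and leave-one-out procedures, here is a minimal sketch assuming scikit-learn is available; the iris dataset and the k-NN classifier are placeholder choices for illustration, not part of the lecture.

```python
# Sketch (illustrative only): holdout and leave-one-out error estimates.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, LeaveOneOut
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Holdout: half for training, half for testing (pessimistically biased).
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)
clf = KNeighborsClassifier(n_neighbors=3).fit(X_tr, y_tr)
print("holdout error:", 1 - clf.score(X_te, y_te))

# Leave-one-out: n classifiers, each tested on the single held-out sample.
errors = []
for tr_idx, te_idx in LeaveOneOut().split(X):
    clf = KNeighborsClassifier(n_neighbors=3).fit(X[tr_idx], y[tr_idx])
    errors.append(clf.predict(X[te_idx])[0] != y[te_idx][0])
print("leave-one-out error:", np.mean(errors))
```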
Training, Validation and Error Estimation Dividing the Data
Error Estimation Methods
K-fold cross validation (Rotation): divide the available samples into K
disjoint subsets; train on K − 1 subsets and test on the
remaining subset, rotating so each subset is used for
testing once. Can repeat the process with different partitionings.
Estimate has lower bias than the holdout method and is
cheaper to implement than the leave-one-out method.
Bootstrap Method: Generate many bootstrap sets of size n by sampling
with replacement.
Bootstrap estimates can have lower variance than the
leave-one-out method.
Computationally more demanding.
Useful in small sample size situations.
Mark Crowley ECE 657A : Lecture 5 January 24, 2016 17 / 34
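A similarly hedged sketch of K-fold cross validation and a bootstrap (out-of-bag) error estimate; the dataset, classifier, number of folds and number of resamples are all illustrative assumptions.

```python
# Sketch (illustrative only): K-fold CV and a bootstrap error estimate.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold
from sklearn.naive_bayes import GaussianNB
from sklearn.utils import resample

X, y = load_iris(return_X_y=True)
clf = GaussianNB()

# K-fold: train on K-1 folds, test on the remaining fold, rotate.
fold_errors = []
for tr_idx, te_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    clf.fit(X[tr_idx], y[tr_idx])
    fold_errors.append(1 - clf.score(X[te_idx], y[te_idx]))
print("5-fold CV error:", np.mean(fold_errors))

# Bootstrap: sample n points with replacement, test on the left-out points.
boot_errors, n = [], len(X)
for b in range(100):
    idx = resample(np.arange(n), replace=True, n_samples=n, random_state=b)
    out = np.setdiff1d(np.arange(n), idx)          # out-of-bag samples
    clf.fit(X[idx], y[idx])
    boot_errors.append(1 - clf.score(X[out], y[out]))
print("bootstrap (out-of-bag) error:", np.mean(boot_errors))
```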
Training, Validation and Error Estimation Dividing the Data
Validation
Sometimes we need to tune certain parameters of the classifier such
as k in the K Nearest Neighbours classifier or parameters that control
the architecture of a neural network.
This can be done using a validation set, leaving the test set to
evaluate the final, tuned classifier
So, the data is divided into 3 subsets: Training, Validation and
Testing
Mark Crowley ECE 657A : Lecture 5 January 24, 2016 18 / 34
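A minimal sketch of such a three-way split, again assuming scikit-learn; tuning k for a k-NN classifier on the validation set is just one example of a parameter you might tune.

```python
# Sketch (assumed workflow): train / validation / test split for tuning k.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
# 60% train, 20% validation, 20% test
X_tr, X_tmp, y_tr, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_te, y_val, y_te = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

best_k, best_acc = None, -1.0
for k in (1, 3, 5, 7, 9):                      # tune k on the validation set
    acc = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr).score(X_val, y_val)
    if acc > best_acc:
        best_k, best_acc = k, acc

# Only the final, tuned model touches the test set.
final = KNeighborsClassifier(n_neighbors=best_k).fit(X_tr, y_tr)
print("chosen k:", best_k, "test accuracy:", final.score(X_te, y_te))
```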
Training, Validation and Error Estimation Confusion Matrix
Classification Confidence
For a binary classification problem (i.e. C = {C0, C1}) our decision rule
could be
δ(x) = I(f(x) > τ)
where:
f(x) is a measure of confidence that x ∈ C1. This could be a
probability, a distance, or some other function
τ (“tau”) is a threshold parameter that is used to decide which class
to assign to each x.
For different values of τ we will get a different number of xi ∈ C1.
Mark Crowley ECE 657A : Lecture 5 January 24, 2016 19 / 34
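A tiny sketch of this decision rule; the confidence scores below are made up purely for illustration.

```python
# Minimal sketch of the decision rule d(x) = I(f(x) > tau), where f(x) is a
# hypothetical confidence score (e.g. a predicted probability for class C1).
import numpy as np

def decide(scores, tau):
    """Assign class 1 when the confidence exceeds the threshold tau."""
    return (np.asarray(scores) > tau).astype(int)

scores = np.array([0.9, 0.7, 0.55, 0.54, 0.52, 0.51, 0.38, 0.35, 0.33, 0.1])
print(decide(scores, tau=0.5))   # more points land in C1 ...
print(decide(scores, tau=0.8))   # ... than with a stricter threshold
```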
ECE 750 T17: Performance Measures
Example: 100 students, 30 Tall (P = 30) and 70 Not Tall (N = 70).
The classifier labels 65 students Tall (20 true positives, 45 false positives)
and 35 students Not Tall (25 true negatives, 10 false negatives).

                           Relevant (Tall)   Not Relevant (Not Tall)
Retrieved (Tall)           TP = 20           FP = 45                   65 retrieved
Not Retrieved (Not Tall)   FN = 10           TN = 25                   35 not retrieved
                           P = 30            N = 70                    P + N = 100

(The same table read as a retrieval classifier: Retrieved / Not Retrieved vs. Relevant / Not Relevant.)
Training, Validation and Error Estimation Confusion Matrix
Using a Confusion Matrix
For a particular value of τ we can build a confusion matrix:
                True Value
                1              0
Estimated  1    TP             FP
           0    FN             TN
Sum             N+ = TP + FN   N− = FP + TN
TP: True Positive
FP: False Positive (False alarm)
TN: True Negative
FN: False Negative (Missed detection)
Mark Crowley ECE 657A : Lecture 5 January 24, 2016 20 / 34
Training, Validation and Error Estimation Confusion Matrix
Using a Confusion Matrix
For a particular value of τ we can build a confusion matrix:
                True Value
                1              0
Estimated  1    TP             FP
           0    FN             TN
Sum             N+ = TP + FN   N− = FP + TN
N = TP + FP + FN + TN
N̂+ = TP + FP
N̂− = FN + TN
Mark Crowley ECE 657A : Lecture 5 January 24, 2016 20 / 34
Training, Validation and Error Estimation Confusion Matrix
Using a Confusion Matrix
For a particular value of τ we can build a confusion matrix:
                True Value
                1              0
Estimated  1    TP             FP
           0    FN             TN
Sum             N+ = TP + FN   N− = FP + TN
Notes:
For more than one class you could build this for each class, whether
the point is in the class or not.
Mark Crowley ECE 657A : Lecture 5 January 24, 2016 20 / 34
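A minimal sketch of tallying these cells from 0/1 label vectors (the labels below are made up); scikit-learn's confusion_matrix would return the same counts in a [[TN, FP], [FN, TP]] layout.

```python
# Sketch: confusion-matrix cells for a binary problem with 0/1 numpy arrays.
import numpy as np

y_true = np.array([1, 0, 1, 1, 0, 1, 1, 0, 0, 0])
y_pred = np.array([1, 1, 1, 1, 0, 1, 0, 0, 0, 0])

TP = np.sum((y_pred == 1) & (y_true == 1))
FP = np.sum((y_pred == 1) & (y_true == 0))
FN = np.sum((y_pred == 0) & (y_true == 1))
TN = np.sum((y_pred == 0) & (y_true == 0))

N_plus, N_minus = TP + FN, FP + TN            # true class sizes
N_hat_plus, N_hat_minus = TP + FP, FN + TN    # estimated class sizes
print(TP, FP, FN, TN, N_plus, N_minus, N_hat_plus, N_hat_minus)
```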
Training, Validation and Error Estimation Confusion Matrix
The False Positive vs False Negative Tradeoff
From this table we can also compute various success and error rates:
                True Value
                1                0
Estimated  1    TPR = TP/N+      FPR = FP/N−
           0    FNR = FN/N+      TNR = TN/N−

TPR: True Positive Rate (Sensitivity, Recall, Hit rate)
FPR: False Positive Rate (False alarm rate, Type I Error)
TNR: True Negative Rate (Specificity)
FNR: False Negative Rate (Miss rate, Type II Error)
Remember, this depends on τ , so how do we find the best value of τ?
Mark Crowley ECE 657A : Lecture 5 January 24, 2016 21 / 34
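The same four rates computed from raw counts; the counts here are the ones tallied in the earlier made-up example, so this is only a sketch.

```python
# Sketch: success/error rates from the confusion-matrix counts above.
TP, FP, FN, TN = 4, 1, 1, 4

TPR = TP / (TP + FN)   # sensitivity / recall / hit rate
FPR = FP / (FP + TN)   # false alarm rate (Type I error)
TNR = TN / (FP + TN)   # specificity
FNR = FN / (TP + FN)   # miss rate (Type II error)
print(f"TPR={TPR:.2f}  FPR={FPR:.2f}  TNR={TNR:.2f}  FNR={FNR:.2f}")
```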
ECE 750 T17: Performance Measures

Accuracy = (TP + TN) / (P + N) = (20 + 25) / 100 = 45%   (overall recognition rate)
Error = (FN + FP) / (P + N) = 55%   (misclassification rate)

tp rate = TP/P,   fn rate = FN/P,   fp rate = FP/N,   tn rate = TN/N

Sensitivity or Recall (tp rate) = TP / (TP + FN) = 20/30 ≈ 66.6%
  (how many of the true positives are classified positive; completeness)
fp rate = FP/N = 45/70 ≈ 64%   (false alarm rate)
Specificity (tn rate) = TN/N = 25/70 ≈ 36% = 1 − fp rate
Precision (positive predictive value) = TP / (TP + FP) = 20/65 ≈ 31%
  (how many of the classified positives are true positives; exactness)
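A quick numeric check of these figures for the tall-students example (the slide rounds the percentages):

```python
# Checking the slide's numbers: TP=20, FN=10, FP=45, TN=25, so P=30, N=70.
TP, FN, FP, TN = 20, 10, 45, 25
P, N = TP + FN, FP + TN

accuracy    = (TP + TN) / (P + N)      # 0.45  -> 45% overall recognition rate
error       = (FP + FN) / (P + N)      # 0.55  -> 55% misclassification rate
recall      = TP / P                   # 0.667 -> 66.6% sensitivity
specificity = TN / N                   # 0.357 -> ~36%
fp_rate     = FP / N                   # 0.643 -> ~64% false alarm rate
precision   = TP / (TP + FP)           # 0.308 -> ~31% positive predictive value
print(accuracy, error, recall, specificity, fp_rate, precision)
```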
ECE 750 T17: Performance Measures

F-measure = 2 / (1/Precision + 1/Recall) = 2 · Precision · Recall / (Precision + Recall)

The F-measure is the harmonic mean of Precision and Recall; a value close to 1 is better.
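And the corresponding F-measure for the same example, using the harmonic-mean form above:

```python
# Sketch: F-measure (harmonic mean of precision and recall) for the example.
precision, recall = 20 / 65, 20 / 30
f1 = 2 / (1 / precision + 1 / recall)          # equivalently 2PR / (P + R)
print(round(f1, 3))                            # ~0.421
```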
ECE 750 T17: Performance Measures

For multiclass problems with C classes there are two ways to calculate an overall average F-measure:

(a) Micro-averaged F-measure: calculate a global F-measure by pooling the counts over all classes:

Precision_micro = ( Σ_{i=1..C} TP_i ) / ( Σ_{i=1..C} (TP_i + FP_i) )
Recall_micro    = ( Σ_{i=1..C} TP_i ) / ( Σ_{i=1..C} (TP_i + FN_i) )
F_micro-averaged = 2 / (1/Precision_micro + 1/Recall_micro)
ECE 750 T17: Performance Measures

(b) Macro-averaged F-measure: compute the F-measure locally over each class and then average over all classes:

Precision_i = TP_i / (TP_i + FP_i),   Recall_i = TP_i / (TP_i + FN_i)
F_i = 2 · Precision_i · Recall_i / (Precision_i + Recall_i)
F_macro-averaged = (1/C) Σ_{i=1..C} F_i     (C = number of classes)
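A hedged sketch contrasting the two averaging schemes; the per-class counts below are made up purely for illustration (scikit-learn's f1_score with average='micro' or 'macro' implements the same idea from raw labels).

```python
# Sketch: micro- vs macro-averaged F-measure from per-class TP/FP/FN counts.
import numpy as np

TP = np.array([20, 50, 5])
FP = np.array([45, 10, 2])
FN = np.array([10, 15, 8])

# Micro: pool the counts over all classes, then compute one F-measure.
P_micro = TP.sum() / (TP.sum() + FP.sum())
R_micro = TP.sum() / (TP.sum() + FN.sum())
F_micro = 2 * P_micro * R_micro / (P_micro + R_micro)

# Macro: compute F per class, then average over the C classes.
P_i = TP / (TP + FP)
R_i = TP / (TP + FN)
F_i = 2 * P_i * R_i / (P_i + R_i)
F_macro = F_i.mean()

print(F_micro, F_macro)
```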
Training, Validation and Error Estimation ROC and AUC Curves
Receiver Operating Characteristic (ROC) Curve
ROC Originally conceived during WW II to assess the performance of
radar systems
If we apply our decision rule δ(x) for a range of τ then we can draw a
curve of any of the success/error rates.
If we plot TPR vs FPR we get the ROC Curve
Say we have two classifiers or distributions. One for C0 and another
for C1
Mark Crowley ECE 657A : Lecture 5 January 24, 2016 22 / 34
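A minimal sketch, assuming scikit-learn's roc_curve and roc_auc_score, of sweeping the threshold over classifier scores; the labels and scores are synthetic and only show the call pattern.

```python
# Sketch: ROC curve points (FPR, TPR) from scores, plus the area under it.
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=200)                  # placeholder labels
scores = y_true * 0.3 + rng.normal(0.5, 0.25, 200)     # noisy confidence f(x)

fpr, tpr, thresholds = roc_curve(y_true, scores)
print("AUC:", roc_auc_score(y_true, scores))
# plotting tpr (y-axis) against fpr (x-axis) draws the ROC curve
```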
Training, Validation and Error Estimation ROC and AUC Curves
ROC vs PR Curves
Line A is better than line B in both curves.
[Figure: ROC curve (tp rate vs. fp rate) and precision-recall curve (precision vs. recall) for two classifiers A and B.]
Mark Crowley ECE 657A : Lecture 5 January 24, 2016 23 / 34
ECE 750 T17: ROC Curve (Receiver Operating Characteristic)

T. Fawcett, An Introduction to ROC Analysis, Pattern Recognition Letters, 27, 2006, 861-874.

ROC was originally conceived during WW II to assess the performance of radar systems.

[Figure: ROC space with tp rate (TP/P) on the y-axis and fp rate (FP/N) on the x-axis; example classifiers A-E are marked. Conservative classifiers sit near the lower left, liberal classifiers (high tp but also high fp) near the upper right; anything below the diagonal is underperforming, i.e. worse than random.]

A discrete classifier (one that outputs only a class label) produces a single point in ROC space.
A is more conservative than B.
C is random guessing (tp rate = fp rate), i.e. the whole diagonal line.
ECE 750 T17: ROC Curve (Receiver Operating Characteristic)

Can create a curve from a scoring classifier by sweeping a threshold from extremely conservative to extremely liberal.

[Figure: ROC curve traced by thresholding the scores in the table below; two of the marked points correspond to roughly 70% and 60% accuracy.]
Instance True Class Score
1 p 0.9
2 n 0.7
3 p 0.55
4 p 0.54
5 n 0.52
6 p 0.51
7 p 0.38
8 n 0.35
9 n 0.33
10 n 0.1
Example: for the 10 instances above, classify an instance as p if the classifier’s score ≥ τ, otherwise n.
τ = 0.9: tp rate = 1/5, fp rate = 0
τ = 0.7: tp rate = 1/5, fp rate = 1/5
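The same sweep written out by hand for the ten scored instances above, reproducing the (tp rate, fp rate) points just listed:

```python
# Sketch: trace ROC points by sweeping tau over the 10 scored instances.
labels = ['p', 'n', 'p', 'p', 'n', 'p', 'p', 'n', 'n', 'n']
scores = [0.9, 0.7, 0.55, 0.54, 0.52, 0.51, 0.38, 0.35, 0.33, 0.1]
P = labels.count('p')          # 5 positives
N = labels.count('n')          # 5 negatives

for tau in sorted(set(scores), reverse=True):
    tp = sum(1 for l, s in zip(labels, scores) if l == 'p' and s >= tau)
    fp = sum(1 for l, s in zip(labels, scores) if l == 'n' and s >= tau)
    print(f"tau={tau:.2f}  tp rate={tp / P:.1f}  fp rate={fp / N:.1f}")
# e.g. tau=0.9 gives (tp rate, fp rate) = (0.2, 0.0); tau=0.7 gives (0.2, 0.2)
```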
ECE 750 T17: False Acceptance Rate vs. False Rejection Rate

[Figure: trade-off between the False Acceptance Rate (FAR, driven by false positives) and the False Rejection Rate (FRR, driven by false negatives); pushing for high accuracy might mean rejecting more.]
Training, Validation and Error Estimation ROC and AUC Curves
Another Confusion Matrix
We can also create a confusion matrix using our estimated count of
positives and negatives:
                True Value
                1                          0
Estimated  1    TP/N̂+ = PPV = precision    FP/N̂+ = FDP
           0    FN/N̂− = FOR                TN/N̂− = NPV

FDP: False Discovery Proportion; FOR: False Omission Rate; NPV: Negative Predictive Value
Precision measures what fraction of our detections are actually positive.
Recall measures what fraction of all the positives we actually
detected.
Mark Crowley ECE 657A : Lecture 5 January 24, 2016 24 / 34
Training, Validation and Error Estimation ROC and AUC Curves
Precision-Recall Curve
If we plot precision P vs recall R as we vary τ then we get a
precision-recall curve which can often show us different performance
behaviour from the ROC curve.
The P-R curve only uses statistics involving the positive class (TP, FP,
FN) and ignores TN, so it is useful when there are very few positive
cases or when the number of negatives could scale with some parameter.
The curve bends the other way compared to the ROC curve.
Mark Crowley ECE 657A : Lecture 5 January 24, 2016 25 / 34
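A short sketch, assuming scikit-learn's precision_recall_curve, applied to the same kind of scored output as in the ROC example (the labels and scores below are the ten instances from that table):

```python
# Sketch: precision-recall pairs as the threshold tau is swept over scores.
import numpy as np
from sklearn.metrics import precision_recall_curve

y_true = np.array([1, 0, 1, 1, 0, 1, 1, 0, 0, 0])
scores = np.array([0.9, 0.7, 0.55, 0.54, 0.52, 0.51, 0.38, 0.35, 0.33, 0.1])

precision, recall, thresholds = precision_recall_curve(y_true, scores)
for p, r in zip(precision, recall):
    print(f"precision={p:.2f}  recall={r:.2f}")
```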
Training, Validation and Error Estimation ROC and AUC Curves
ROC vs PR Curves
Line A is better than line B in both curves.
[Figure: ROC curve (tp rate vs. fp rate) and precision-recall curve (precision vs. recall) for two classifiers A and B.]
Mark Crowley ECE 657A : Lecture 5 January 24, 2016 26 / 34
Training, Validation and Error Estimation Other Performance Measures
Summary
Naive Bayes Classification
Cross Validation
Evaluation Metrics
Confusion Matrices and ROC curves
Mark Crowley ECE 657A : Lecture 5 January 24, 2016 27 / 34