IT enabled Business Intelligence, CRM, Database Applications
Sep-18
Testing
Prof. Vibs Abhishek
The Paul Merage School of Business
University of California, Irvine
BANA 273 Session 5
1
Agenda
Construction of test data set
Measuring accuracy
Assignments posted to Canvas
Review Assignment 1
2
What is Testing?
It is important to know how the decision support system is performing in real-world situations
“Real” testing is difficult
How do we test the negative decisions?
Was it right to turn down the loan application?
Was it correct that we did not invest in the other project?
Even for positive decisions, the eventual outcome may not be known
The loan that was approved has not defaulted yet, but we do not know if it would do so in the next 28 years
Testing
Use a small number of old cases to see how the system performs
3
Training versus Testing
It is not advisable to use the same set of cases both to train the model and to test it
The resulting performance estimate would be overly optimistic
The trained model would already have captured the idiosyncratic (stochastic) relationships between the features and the goal in the training data
As mentioned before, we partition the dataset into two subsets
Training set
Used to build the model
Testing set
Used to validate the performance of the model
4
Training and testing
Natural performance measure for classification problems: error rate
Success: instance’s class is predicted correctly
Error: instance’s class is predicted incorrectly
Error rate: proportion of errors made over the whole set of instances
Resubstitution error: error rate obtained from training data
Resubstitution error is (hopelessly) optimistic
Making the most of the data
Once evaluation is complete, all the data can be used to build the final classifier
Generally, the larger the training data the better the classifier (but returns diminish)
The larger the test data the more accurate the error estimate
Holdout procedure: method of splitting original data into training and test set
Dilemma: ideally both training set and test set should be large!
7
Holdout estimation
What to do if the amount of data is limited?
The holdout method reserves a certain amount for testing and uses the remainder for training
Usually: one third for testing, the rest for training
Problem: the samples might not be representative
Example: class might be missing in the test data
Advanced version uses stratification
Ensures that each class is represented with approximately equal proportions in both subsets
8
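A minimal sketch of the stratified holdout split described above, assuming scikit-learn is available; the Iris data and the fixed random seed are placeholders for illustration, not part of the course material.

    # Stratified one-third holdout: class proportions are preserved in both subsets.
    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split

    X, y = load_iris(return_X_y=True)          # placeholder dataset

    X_train, X_test, y_train, y_test = train_test_split(
        X, y,
        test_size=1/3,       # one third reserved for testing, the rest for training
        stratify=y,          # stratification, as described on the slide
        random_state=0,      # fixed seed so the split is reproducible
    )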
Repeated holdout method
Holdout estimate can be made more reliable by repeating the process with different subsamples
In each iteration, a certain proportion is randomly selected for training (possibly with stratification)
The error rates on the different iterations are averaged to yield an overall error rate
This is called the repeated holdout method
Still not optimum: the different test sets overlap
Can we prevent overlapping?
9
Cross-validation
Cross-validation avoids overlapping test sets
First step: split data into k subsets of equal size
Second step: use each subset in turn for testing, the remainder for training
Called k-fold cross-validation
Often the subsets are stratified before the cross-validation is performed
The error estimates are averaged to yield an overall error estimate
10
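A minimal sketch of stratified k-fold cross-validation with k = 10, again assuming scikit-learn; the Naive Bayes classifier and Iris data are stand-ins.

    from sklearn.datasets import load_iris
    from sklearn.model_selection import StratifiedKFold, cross_val_score
    from sklearn.naive_bayes import GaussianNB

    X, y = load_iris(return_X_y=True)                       # placeholder dataset

    # Each of the 10 stratified folds is used once for testing, the rest for training.
    folds = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
    accuracies = cross_val_score(GaussianNB(), X, y, cv=folds)

    error_estimate = 1 - accuracies.mean()                  # averaged over the folds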
More on cross-validation
Standard method for evaluation: stratified ten-fold cross-validation
Why ten?
Extensive experiments have shown that this is the best choice to get an accurate estimate
Even better: repeated stratified cross-validation
E.g. ten-fold cross-validation is repeated ten times and results are averaged
11
Leave-One-Out cross-validation
Leave-One-Out:
a particular form of cross-validation:
Set number of folds to number of training instances
I.e., for n training instances, build classifier n times
Makes best use of the data
Very computationally expensive
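Leave-one-out is simply k-fold cross-validation with k equal to the number of instances; a brief scikit-learn sketch with placeholder data and classifier:

    from sklearn.datasets import load_iris
    from sklearn.model_selection import LeaveOneOut, cross_val_score
    from sklearn.naive_bayes import GaussianNB

    X, y = load_iris(return_X_y=True)
    # One fold per instance: the classifier is rebuilt n times, which is what makes
    # leave-one-out so computationally expensive on large datasets.
    scores = cross_val_score(GaussianNB(), X, y, cv=LeaveOneOut())
    error_estimate = 1 - scores.mean()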
Accuracy Measure
Accuracy is the percentage of test cases where the predicted and actual goals are the same
The example test set shows 70% accuracy (7 of its 10 cases are predicted correctly)
Problem
Does it account for a bias towards a class?
Stratified accuracy
Accuracy for each class
Accuracy for Approve=no
4 out of 6 (66.7%)
Accuracy for Approve = yes
3 out of 4 (75%)
12
Confusion Matrix
A confusion matrix summarizes the result of running a classification model on a test dataset
                            Predicted class
                            No                          Yes
Actual class   No           True negative               False positive (Type 1)
               Yes          False negative (Type 2)     True positive
13
Confusion Matrix
Total number of test cases
905 + 23 + 12 + 323 = 1263
Number of correct classification
905 + 323 = 1228
Number of incorrect classification
23 + 12 = 35
Accuracy = 1228/1263 = 97.2%
Stratified accuracy
Accuracy for “a” = 905/(905+23) = 97.5%
Accuracy for “b” = 323/(12+323) = 96.4%
14
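The arithmetic on this slide can be reproduced directly from the four cell counts; a short sketch:

    # Confusion matrix counts from the slide: rows are actual classes "a" and "b".
    correct_a, wrong_a = 905, 23     # actual "a" classified as "a" / as "b"
    wrong_b, correct_b = 12, 323     # actual "b" classified as "a" / as "b"

    total = correct_a + wrong_a + wrong_b + correct_b      # 1263
    accuracy = (correct_a + correct_b) / total             # 1228 / 1263 = 0.972

    accuracy_a = correct_a / (correct_a + wrong_a)         # 905 / 928 = 0.975
    accuracy_b = correct_b / (wrong_b + correct_b)         # 323 / 335 = 0.964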
15
The bootstrap
CV uses sampling without replacement
The same instance, once selected, can not be selected again for a particular training/test set
The bootstrap uses sampling with replacement to form the training set
Sample a dataset of n instances n times with replacement to form a new dataset of n instances
Use this data as the training set
Use the instances from the original dataset that don’t occur in the new training set for testing
16
The 0.632 bootstrap
The 0.632 bootstrap
A particular instance has a probability of 1–1/n of not being picked
Thus its probability of ending up in the test data (i.e., of never being picked for training) is:
This means the training data will contain approximately 63.2% of the instances
17
Estimating error with the bootstrap
The error estimate on the test data will be pessimistic
Trained on ~63% of the instances
Therefore, combine it with the resubstitution error:
The resubstitution error gets less weight than the error on the test data
Repeat process several times with different replacement samples; average the results
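A minimal sketch of one round of the 0.632 bootstrap, assuming NumPy and scikit-learn; the dataset and classifier are placeholders, and in practice the round would be repeated with different replacement samples and the results averaged, as noted above.

    import numpy as np
    from sklearn.datasets import load_iris
    from sklearn.naive_bayes import GaussianNB

    X, y = load_iris(return_X_y=True)
    n = len(y)
    rng = np.random.default_rng(0)

    # Sample n instances with replacement for training; instances never picked
    # form the test set (roughly 36.8% of the data).
    train_idx = rng.integers(0, n, size=n)
    test_idx = np.setdiff1d(np.arange(n), train_idx)

    model = GaussianNB().fit(X[train_idx], y[train_idx])
    e_test = 1 - model.score(X[test_idx], y[test_idx])      # error on held-out instances
    e_train = 1 - model.score(X[train_idx], y[train_idx])   # resubstitution error

    err = 0.632 * e_test + 0.368 * e_train                  # the 0.632 combination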
Training, testing and validation data
The standard for computing accuracy of a model
Split data into 3 parts:
Training data to be used for model generation
Validation data to be used for model selection
Testing data to be used for determining the accuracy of the final model
Counting the cost
In practice, different types of classification errors often incur different costs
Examples:
Loan decisions
Promotional mailing
Fault diagnosis
19
Good boundary?
Better boundary?
Blue dots = good loans
Red dots = bad loans
20
Classification with costs
Default cost matrices:
Success rate is replaced by average cost per prediction
Cost is given by appropriate entry in the cost matrix
21
22
Cost-sensitive classification
Change classifier model to take account of cost of errors
Can take costs into account when making predictions
Basic idea: only predict high-cost class when very confident about prediction
Given: predicted class probabilities
Normally we just predict the most likely class
Here, we should make the prediction that minimizes the expected cost
Expected cost: dot product of vector of class probabilities and appropriate column in cost matrix
Changing the cutoff probability in Naïve Bayes
Example – Work out the cost of errors:
Consider a classifier problem where the class variable is {Accept, Analyze, Reject}
Suppose Naïve Bayes examines a test instance (row) and assigns the following probabilities:
Accept 50%, Analyze 30%, Reject 20%
Suppose the cost matrix is
Actual↓ Predicted→ Accept Analyze Reject
Accept 0 1 2
Analyze 1 0 1
Reject 3 1 0
23
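Working the example out with the rule from the previous slide (expected cost = dot product of the class probabilities with the relevant column of the cost matrix); a short NumPy sketch:

    import numpy as np

    classes = ["Accept", "Analyze", "Reject"]
    probs = np.array([0.50, 0.30, 0.20])        # Naive Bayes class probabilities

    # Cost matrix from the slide: rows = actual class, columns = predicted class.
    cost = np.array([[0, 1, 2],
                     [1, 0, 1],
                     [3, 1, 0]])

    expected_cost = probs @ cost                # one entry per candidate prediction
    # Accept: 0.9, Analyze: 0.7, Reject: 1.3
    best = classes[int(np.argmin(expected_cost))]   # "Analyze"
    # Minimizing expected cost predicts Analyze, even though Accept is the most
    # probable class.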
25
Cost-sensitive learning
So far we haven’t taken costs into account at training time
Most learning schemes do not perform cost-sensitive learning
They generate the same classifier no matter what costs are assigned to the different classes
Simple methods for cost-sensitive learning:
Thresholding: Adjust probability threshold for setting class labels
Rebalancing: Resampling of instances according to costs
Terminology
                              True Labels
                              Positive               Negative
Model's        Positive       TP (true positive)     FP (false positive)
Predictions    Negative       FN (false negative)    TN (true negative)
A hypothetical lift chart
40% of responses
for 10% of cost
80% of responses
for 40% of cost
Generating a lift chart
Sort instances according to predicted probability of being positive:
Lift Chart
x axis is sample size
y axis is number of true positives
Rank    Predicted probability    Actual class
1       0.95                     Yes
2       0.93                     Yes
3       0.93                     No
4       0.88                     Yes
…       …                        …
28
Binary Classification: Lift Curves
Sort test examples by their predicted score
For a particular threshold compute
(1) NTP = number of true positive examples detected by the model
(2) NTPR = number of true positive examples that would be detected by random ordering
Lift = NTP/NTPR
Lift curve = Lift as a function of number of examples above the threshold, as the threshold is varied
Expect that good models will start with high lift (and will eventually decay to 1)
From Chapter 8: Visualizing Model Performance, in Data Science for Business (O'Reilly, 2013), with permission from the authors, F. Provost and T. Fawcett
31
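A small sketch of the lift computation, assuming NumPy; the scores and labels extend the four-row example above with made-up values purely for illustration.

    import numpy as np

    # Predicted scores and actual classes (1 = positive); illustrative values only.
    scores = np.array([0.95, 0.93, 0.93, 0.88, 0.40, 0.30, 0.20, 0.10])
    actual = np.array([1,    1,    0,    1,    1,    0,    0,    0])

    order = np.argsort(-scores)              # sort test examples by predicted score
    cum_tp = np.cumsum(actual[order])        # NTP for each threshold position

    k = np.arange(1, len(actual) + 1)        # number of examples above the threshold
    ntpr = k * actual.mean()                 # true positives expected from random ordering
    lift = cum_tp / ntpr                     # Lift = NTP / NTPR; starts high, decays to 1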
Computing Profits using Lift charts
Example: promotional mailing to 1,000,000 households @ $0.50 each. Company earns, on average, $600 from each response
Mail to all; 0.1% respond (1000).
Total Profit = 600,000 – 500,000 = $100,000
Data mining tool identifies subset of 100,000 most promising, 0.4% of these respond (400)
Lift Ratio = 0.4 / 0.1 = 4
Total profit =
Identify subset of 400,000 most promising, 0.2% respond (800)
Lift Ratio = 0.2 / 0.1 = 2
Total profit =
A lift chart allows a visual comparison
32
Computing Profits using Lift charts
Example: promotional mailing to 1,000,000 households @ $0.50 each. Company earns, on average, $600 from each response
Mail to all; 0.1% respond (1000).
Total Profit = 600,000 – 500,000 = $100,000
Data mining tool identifies subset of 100,000 most promising, 0.4% of these respond (400)
Lift Ratio = 0.4 / 0.1 = 4
Total profit = (600*400) – 50,000 = $190,000
Identify subset of 400,000 most promising, 0.2% respond (800)
Lift Ratio = 0.2 / 0.1 = 2
Total profit = (600*800) – 200,000 = $280,000
A lift chart allows a visual comparison
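The profit figures on this slide follow from revenue per response minus mailing cost; a quick check:

    cost_per_mail = 0.50
    revenue_per_response = 600

    def profit(n_mailed, n_responses):
        # revenue from responders minus the cost of mailing everyone targeted
        return revenue_per_response * n_responses - cost_per_mail * n_mailed

    profit(1_000_000, 1000)   # mail to everyone:        $100,000
    profit(100_000, 400)      # top 100,000 (lift = 4):  $190,000
    profit(400_000, 800)      # top 400,000 (lift = 2):  $280,000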
Example of an Empirical “Profit Curve”
12:1 benefit/cost ratio
(more lucrative)
From Chapter 8: Visualizing Model Performance, in Data Science for Business (O'Reilly, 2013),
with permission from the authors, F. Provost and T. Fawcett
33
ROC curves
ROC curves are similar to lift charts
Stands for “receiver operating characteristic”
Used in signal detection to show tradeoff between hit rate and false alarm rate over noisy channel
Differences to lift chart:
y axis shows percentage of true positives in sample rather than absolute number
x axis shows percentage of false positives in sample rather than sample size
34
ROC Plots
                              True Labels
                              Positive               Negative
Model's        Positive       TP (true positive)     FP (false positive)
Predictions    Negative       FN (false negative)    TN (true negative)
TPR = True Positive Rate = TP / (TP + FN)
= ratio of correctly predicted positives to the actual number of positives
(same as recall, sensitivity, hit rate)
FPR = False Positive Rate = FP / (FP + TN)
= ratio of actual negatives incorrectly predicted as positive to the actual number of negatives
(same as false alarm rate)
Receiver Operating Characteristic: plots TPR versus FPR as threshold varies
As we decrease our threshold, both the TPR and FPR will increase, both ending at [1, 1]
From Chapter 8: Visualizing Model Performance, in Data Science for Business (O'Reilly, 2013),
with permission from the authors, F. Provost and T. Fawcett
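A brief sketch of how an ROC curve is computed from test-set scores, assuming scikit-learn; the scores and labels are the same illustrative values used in the lift sketch.

    import numpy as np
    from sklearn.metrics import roc_curve

    scores = np.array([0.95, 0.93, 0.93, 0.88, 0.40, 0.30, 0.20, 0.10])
    actual = np.array([1,    1,    0,    1,    1,    0,    0,    0])

    # roc_curve sweeps the decision threshold and returns the false positive rate
    # (x axis) and true positive rate (y axis) at each threshold.
    fpr, tpr, thresholds = roc_curve(actual, scores)
    # As the threshold decreases, both rates grow until they reach (1, 1).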
Example of an Actual ROC
36
In the following confusion matrix, the number of errors is
A: 123
B: 374
C: 99
D: 911
E: None of the above
37
A lift chart is useful for
A: Calculating Bayesian lift
B: Calculating the difference function
C: Calculating the optimal number of promotional mailings
D: Calculating the accuracy of Naïve Bayes
E: None of the above
38
Review Assignment 1
Weka Example – Classification using Naïve Bayes
Download file from EEE (session 9):
4bank-data-8.arff
Switch tab to “classify”
Select method: NaiveBayes
Verify class variable set to “pep”
Use 10 fold cross validation
Run classifier
Examine confusion matrix
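For reference, a rough scikit-learn parallel to these Weka steps; this is only a sketch under several assumptions: that 4bank-data-8.arff is in the working directory, that its attributes are nominal so a categorical Naive Bayes is appropriate, and that the class attribute is named pep as stated above.

    import pandas as pd
    from scipy.io import arff
    from sklearn.metrics import confusion_matrix
    from sklearn.model_selection import cross_val_predict
    from sklearn.naive_bayes import CategoricalNB
    from sklearn.preprocessing import OrdinalEncoder

    data, _ = arff.loadarff("4bank-data-8.arff")             # assumed local file
    df = pd.DataFrame(data).apply(
        lambda col: col.str.decode("utf-8") if col.dtype == object else col)

    X = OrdinalEncoder().fit_transform(df.drop(columns=["pep"]))   # encode nominal values
    y = df["pep"]

    predictions = cross_val_predict(CategoricalNB(), X, y, cv=10)  # 10-fold cross-validation
    print(confusion_matrix(y, predictions))                        # examine the matrix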
Next Session
Decision Tree based classification
41
Equations reconstructed from the bootstrap slides:

Probability that a particular instance ends up in the test data (is never sampled for training):
(1 - 1/n)^n ≈ e^(-1) ≈ 0.368

Combined error estimate:
err = 0.632 * e_test_data_set + 0.368 * e_training_data_set
[Scatter plot: MONTHLY INCOME (0–14,000) versus AGE (0–90)]