
STAT318 — Data Mining
Dr
University of Canterbury, Christchurch,
Some of the figures in this presentation are taken from “An Introduction to Statistical Learning, with applications in R” (Springer, 2013) with permission from the authors: G. James, D. Witten, T. Hastie and R. Tibshirani.
, University of Canterbury 2021
STAT318 — Data Mining ,1 / 31

Cross-Validation and the Bootstrap
In this section we discuss two important resampling methods: cross-validation and the bootstrap.
These methods use samples formed from the training data to obtain additional information about a fitted model or an estimator.
They can be used for estimating prediction error, determining appropriate model flexibility, estimating standard errors, …

Training error vs. test error
The training error is the average error that results from applying a statistical learning method to the observations used for training — a simple calculation.
The test error is the average error that results from applying a statistical learning technique to test observations that were not used for training — a simple calculation if test data exists, but we usually only have training data.
The training error tends to dramatically under-estimate the test error.

Performance

[Figure: prediction error versus model flexibility for the training sample and a testing sample. Low flexibility corresponds to high bias and low variance; high flexibility to low bias and high variance.]

Validation Set Approach
A very simple strategy is to randomly divide the training data into two sets:
1 Training Set: Fit the model using the training set.
2 Validation Set: Predict the response values for the observations in the validation set.
The validation set error provides an estimate of the test error (MSE for regression and error rate for classification).
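The two steps above can be sketched in a few lines (a minimal illustration on simulated data, not the Auto data used later in the slides; `np.polyfit` stands in for the statistical learning method):

```python
import numpy as np

rng = np.random.default_rng(0)

# simulated training data: quadratic signal plus noise
x = rng.uniform(-2, 2, size=200)
y = 1 + 2 * x - x**2 + rng.normal(0, 0.3, size=200)

# step 1: randomly divide the data into a training set and a validation set
idx = rng.permutation(len(x))
train, val = idx[:100], idx[100:]

# fit the model (degree-2 polynomial regression) on the training set only
coef = np.polyfit(x[train], y[train], deg=2)

# step 2: predict the responses in the validation set; the validation MSE
# is the estimate of the test error
val_mse = np.mean((y[val] - np.polyval(coef, x[val])) ** 2)
```

Because the split is random, re-running with a different seed gives a different `val_mse`, which is exactly the variability discussed below.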

Validation Set Approach
In this example, the training data are randomly split into two sets of approximately the same size. The blue set is used for training and the orange set for validation.

Example: auto data
[Figure: validation-set mean squared error (16-28) versus degree of polynomial (1-10).]

Find the best level of flexibility in polynomial regression using the validation set approach (50% used for training).

Example: auto data
[Figure: validation-set mean squared error (16-28) versus degree of polynomial (1-10), one curve for each of several random splits.]

The validation set approach using different training and validation sets (50% used for training).

Drawbacks of the validation set approach
The validation set estimate of the test error can be highly variable.
Only a subset of the observations is used to train the model. Hence, the validation set error will tend to over-estimate the test error.

K-fold cross-validation
1 The training data are divided into K groups (folds) of approximately equal size.
2 The first fold is used as a validation set and the remaining (K − 1) folds are used
for training. The test error is then computed on the validation set.
3 This procedure is repeated K times, using a different fold as the validation set each time.
4 The estimate of the test error is the average of the K test errors from each validation set.
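Steps 1-4 can be sketched as follows (a minimal illustration on simulated data; `np.polyfit` is a stand-in learner, and the fold assignment shown is one reasonable choice):

```python
import numpy as np

rng = np.random.default_rng(1)

def kfold_cv_mse(x, y, degree, K=5):
    """Estimate the test MSE of polynomial regression by K-fold cross-validation."""
    idx = rng.permutation(len(x))
    folds = np.array_split(idx, K)              # step 1: K groups of ~equal size
    errors = []
    for k in range(K):
        val = folds[k]                          # steps 2-3: fold k is the validation set
        tr = np.concatenate([folds[j] for j in range(K) if j != k])
        coef = np.polyfit(x[tr], y[tr], degree) # train on the other K-1 folds
        errors.append(np.mean((y[val] - np.polyval(coef, x[val])) ** 2))
    return float(np.mean(errors))               # step 4: average the K fold errors

# simulated data with a quadratic trend, so degree 2 should win
x = rng.uniform(-2, 2, size=200)
y = 1 + 2 * x - x**2 + rng.normal(0, 0.3, size=200)
cv_linear = kfold_cv_mse(x, y, degree=1)
cv_quadratic = kfold_cv_mse(x, y, degree=2)
```

Comparing `cv_linear` and `cv_quadratic` is exactly the model-flexibility selection illustrated on the Auto data slides.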

5-fold cross-validation
Each row corresponds to one iteration of the algorithm, where the orange set is used for validation and the blue set is used for training.
If K = n, we have leave-one-out cross-validation (LOOCV).

Cross-validation test error
We randomly divide the training data of n observations into K groups of approximately equal size, C1, C2, . . . , CK .
Regression Problems (average MSE):

\[ \mathrm{CV} = \frac{1}{n}\sum_{k=1}^{K}\sum_{i:\,x_i \in C_k} (y_i - \hat{y}_i)^2. \]

Classification Problems (average error rate):

\[ \mathrm{CV} = \frac{1}{n}\sum_{k=1}^{K}\sum_{i:\,x_i \in C_k} I(y_i \neq \hat{y}_i). \]

Cross-validation: auto data
[Figure: two panels of mean squared error (16-28) versus degree of polynomial (1-10); left, LOOCV; right, 10-fold CV.]

The right plot shows nine different 10-fold cross-validations.

K-fold cross-validation: classification
[Figure: simulated two-class data plotted against the predictors X1 and X2.]

The Bayes error rate is 0.133 in this example.

K-fold cross-validation: logistic regression
[Figure: two panels (Degree=1 and Degree=2) showing logistic regression decision boundaries on the simulated two-class data.]

The test error rates are 0.201 and 0.197, respectively.

K-fold cross-validation: logistic regression
[Figure: two panels (Degree=3 and Degree=4) showing logistic regression decision boundaries on the simulated two-class data.]

The test error rates are 0.160 and 0.162, respectively.

K-fold cross-validation: logistic regression and KNN
[Figure: error rate (0.12-0.20) versus order of polynomial used (left panel) and versus 1/K for KNN (right panel).]

The test error (orange), training error (blue) and 10-fold cross-validation error (black). The left plot shows logistic regression and the right plot shows KNN.

Comments
Since each training set has approximately (1 − 1/K )n observations, the cross-validation test error will tend to over-estimate the prediction error.
LOOCV minimizes this upward bias, but this estimate has high variance. K = 5 or 10 provides a good compromise for this bias-variance trade-off.

Cross-validation: right and wrong
Consider the following classifier for a two-class problem:
1 Starting with 1000 predictors and 50 observations, find the 20 predictors having the largest correlation with the response.
2 Apply a classifier using only these 20 predictors.
If we use cross-validation to estimate test error, can we simply apply it at step (2)?
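The answer is no: step (1) used the response on all observations, so the screening must be repeated inside every fold. A sketch of both versions on pure-noise data shows the leakage (the nearest-centroid classifier here is a hypothetical stand-in; any classifier at step (2) would do, and the screening score is an unstandardized covariance used as a correlation proxy):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, keep = 50, 1000, 20
X = rng.normal(size=(n, p))
y = rng.integers(0, 2, size=n)        # labels independent of X: true error rate is 0.5

def screen(X, y, k):
    """Indices of the k columns with the largest |covariance| with y."""
    score = np.abs((X - X.mean(0)).T @ (y - y.mean()))
    return np.argsort(score)[-k:]

def centroid_error(Xtr, ytr, Xte, yte):
    """Nearest-centroid classifier: assign each test point to the closer class mean."""
    m0, m1 = Xtr[ytr == 0].mean(0), Xtr[ytr == 1].mean(0)
    pred = (((Xte - m1) ** 2).sum(1) < ((Xte - m0) ** 2).sum(1)).astype(int)
    return float((pred != yte).mean())

def cv_error(X, y, K=5, screen_inside=True):
    folds = np.array_split(rng.permutation(n), K)
    if not screen_inside:
        cols = screen(X, y, keep)               # WRONG: screening sees the validation folds
    errs = []
    for f in folds:
        tr = np.setdiff1d(np.arange(n), f)
        if screen_inside:
            cols = screen(X[tr], y[tr], keep)   # RIGHT: screen on training folds only
        errs.append(centroid_error(X[tr][:, cols], y[tr], X[f][:, cols], y[f]))
    return float(np.mean(errs))

wrong = cv_error(X, y, screen_inside=False)
right = cv_error(X, y, screen_inside=True)
# the wrong version reports an optimistic error, well below the right version's ~0.5
```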

Bootstrap
The use of the term bootstrap derives from the phrase:
“to pull oneself up by one’s bootstraps”.
The bootstrap is a powerful statistical tool that can be used to quantify uncertainty associated with a statistical learning technique or a given estimator.

Example
Suppose we wish to invest a fixed sum of money in two financial assets that yield returns X and Y , respectively (X and Y are random variables).
We want to minimize the risk (variance) of our investment:

\[ \min_{\alpha} \mathrm{V}\big(\alpha X + (1-\alpha)Y\big), \qquad \text{where } 0 \le \alpha \le 1. \]
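Expanding the variance and setting the derivative with respect to α to zero gives the minimizer quoted on the next slide:

```latex
\begin{align*}
\mathrm{V}\!\left(\alpha X + (1-\alpha)Y\right)
  &= \alpha^2\sigma_X^2 + (1-\alpha)^2\sigma_Y^2 + 2\alpha(1-\alpha)\sigma_{XY}, \\
\frac{d}{d\alpha}\,\mathrm{V}
  &= 2\alpha\sigma_X^2 - 2(1-\alpha)\sigma_Y^2 + 2(1-2\alpha)\sigma_{XY} = 0 \\
\Longrightarrow\quad
\alpha &= \frac{\sigma_Y^2 - \sigma_{XY}}{\sigma_X^2 + \sigma_Y^2 - 2\sigma_{XY}}.
\end{align*}
```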

Example
The values of σ²_X, σ²_Y and σ_XY are unknown and hence need to be estimated from sample data.
We can then estimate the α value that minimizes the variance of our investment using

\[ \hat{\alpha} = \frac{\hat{\sigma}_Y^2 - \hat{\sigma}_{XY}}{\hat{\sigma}_X^2 + \hat{\sigma}_Y^2 - 2\hat{\sigma}_{XY}}. \]
αˆ is an estimator, but we don’t know its sampling distribution or its standard error.

Example: simulated returns for investments X and Y
[Figure: four scatter plots of simulated returns Y versus X, each from a different sample of the same population.]

Example: simulated sampling distribution for αˆ
[Figure: histogram (counts 0-200) of the α̂ estimates over the range 0.4-0.9.]

Example: statistics from 1000 observations
The sample mean is

\[ \bar{\alpha} = \frac{1}{1000}\sum_{i=1}^{1000}\hat{\alpha}_i = 0.5996, \]

which is very close to the true value, α = 0.6.

The standard deviation is

\[ \mathrm{sd}(\hat{\alpha}) = \sqrt{\frac{1}{1000-1}\sum_{i=1}^{1000}\left(\hat{\alpha}_i - \bar{\alpha}\right)^2} = 0.083, \]

which gives an approximate standard error of SE(α̂) = 0.083.

Real world
The procedure we have discussed cannot be applied in the real world because we cannot sample the original population many times.
The bootstrap comes to the rescue.
Rather than sampling the population many times directly, we repeatedly sample
the observed sample data using random sampling with replacement.
These bootstrap samples are the same size as the original sample (n observations) and will likely contain repeated observations.

Bootstrap
Original data (Z):

Obs   X     Y
1     4.3   2.4
2     2.1   1.1
3     5.3   2.8

Bootstrap sample Z*1:

Obs   X     Y
3     5.3   2.8
1     4.3   2.4
3     5.3   2.8

Bootstrap sample Z*2:

Obs   X     Y
2     2.1   1.1
3     5.3   2.8
1     4.3   2.4

Bootstrap sample Z*B:

Obs   X     Y
2     2.1   1.1
2     2.1   1.1
1     4.3   2.4

Each bootstrap sample Z*b is drawn from the original data Z by sampling rows with replacement, and each yields its own estimate α̂*b, giving α̂*1, α̂*2, ..., α̂*B.
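The resampling illustrated above takes only a couple of lines (the drawn row indices will differ from the illustration, since they are random):

```python
import numpy as np

rng = np.random.default_rng(5)

# the original data Z from the table above (columns X, Y)
Z = np.array([[4.3, 2.4],
              [2.1, 1.1],
              [5.3, 2.8]])

# one bootstrap sample: n = 3 rows drawn uniformly with replacement,
# so repeated observations are likely
Z_star = Z[rng.integers(0, len(Z), size=len(Z))]
```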

Example: bootstrap sampling distribution for αˆ
[Figure: histograms (counts 0-200) of the bootstrap estimates α̂*, over the ranges 0.4-0.9 and 0.3-0.9.]

Example: statistics from B bootstrap samples
Let Z*i and α̂*i denote the ith bootstrap sample and the ith bootstrap estimate of α, respectively.

We estimate the standard error of α̂ using

\[ \mathrm{SE}_B(\hat{\alpha}) = \sqrt{\frac{1}{B-1}\sum_{i=1}^{B}\left(\hat{\alpha}^{*i} - \bar{\alpha}^{*}\right)^2}, \]

where B is some large value (say 1000) and

\[ \bar{\alpha}^{*} = \frac{1}{B}\sum_{i=1}^{B}\hat{\alpha}^{*i}. \]

For our example, SE_B = 0.087.
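Putting the pieces together (a sketch only: the return-generating parameters below are assumptions chosen so that the true α is 0.6, and `alpha_hat` is the plug-in estimator from the earlier slide):

```python
import numpy as np

rng = np.random.default_rng(42)

def alpha_hat(x, y):
    # plug-in estimate (sigma_Y^2 - sigma_XY) / (sigma_X^2 + sigma_Y^2 - 2 sigma_XY)
    c = np.cov(x, y)
    return (c[1, 1] - c[0, 1]) / (c[0, 0] + c[1, 1] - 2 * c[0, 1])

# simulated returns; with these (assumed) parameters the true alpha is 0.6
n = 100
x, y = rng.multivariate_normal([0, 0], [[1.0, 0.5], [0.5, 1.25]], size=n).T

# B bootstrap samples, each the same size n as the data, each giving one estimate
B = 1000
boot = np.empty(B)
for i in range(B):
    idx = rng.integers(0, n, size=n)      # sample rows with replacement
    boot[i] = alpha_hat(x[idx], y[idx])

se_B = boot.std(ddof=1)                   # the SE_B estimate of the standard error
```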

The general picture
[Figure: step plot of Fn(x), rising from 0 to 1 over x ∈ (−1.5, 1.5).]

An empirical distribution function (EDF) for a sample of n = 20 standard normal random variables.
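The EDF places mass 1/n on each observation, so sampling from Fn is exactly sampling the data uniformly with replacement, which is what the bootstrap does. A small sketch:

```python
import numpy as np

rng = np.random.default_rng(3)
sample = rng.normal(size=20)          # n = 20 standard normal draws

def edf(data, t):
    """F_n(t): the fraction of observations less than or equal to t."""
    return float(np.mean(data <= t))

# F_n steps from 0 up to 1 as t passes the observations
lo = edf(sample, sample.min() - 1)    # below every observation: 0
hi = edf(sample, sample.max())        # at the largest observation: 1
```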

Comments
The bootstrap can also be used to approximate confidence intervals, the simplest method being the bootstrap percentile confidence interval.
For example, an approximate 90% confidence interval is given by the 5th and 95th percentiles of the B bootstrap estimates.
It is possible to use the bootstrap for estimating prediction error, but cross-validation is easier and gives similar results.
We will use the bootstrap when building decision trees.
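With the bootstrap estimates in hand, the percentile interval is a single `np.percentile` call (illustrated here on stand-in estimates drawn from a normal; in practice `boot` holds the B values of α̂*):

```python
import numpy as np

rng = np.random.default_rng(7)

# stand-in for B = 1000 bootstrap estimates of alpha (assumed values)
boot = rng.normal(loc=0.6, scale=0.083, size=1000)

# bootstrap percentile 90% CI: the 5th and 95th percentiles of the estimates
lo, hi = np.percentile(boot, [5, 95])
```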
