CSC 311: Introduction to Machine Learning
Lecture 3 – Linear Classification
Based on slides by Amir-massoud Farahmand & Emad A.M. Andrews
Last class, we discussed linear regression, and used a modular approach to machine learning algorithm design:
choose a model: y = f(x) = w⊤x + b
choose a loss: L(y, t) = ½(y − t)²
formulate an optimization problem: minimize Σ_{i=1}^N L(y^(i), t^(i)) over (w, b)
solve the minimization problem using one of two strategies
direct solution (set derivatives to zero)
gradient descent
vectorize the algorithm, i.e. represent it in terms of linear algebra
make a linear model more powerful using feature expansion
improve the generalization by adding a regularizer
Classification Setup
Classification: predicting a discrete-valued target
Binary classification: predicting a binary-valued target
Recall the notation:
Training data: (x^(1), t^(1)), (x^(2), t^(2)), …, (x^(N), t^(N))
x^(i) are the inputs
t^(i) are the (discrete-valued) targets
Examples:
predict whether a patient has a disease, given the presence or absence of various symptoms
classify e-mails as spam or non-spam
predict whether a financial transaction is fraudulent
The Binary Linear Classification Model
classification: predict a discrete-valued target
binary: predict a binary target t ∈ {0, 1}
Training examples with t = 1 are called positive examples, and training examples with t = 0 are called negative examples.
The choice of t ∈ {0, 1} vs. t ∈ {−1, +1} is a matter of computational convenience.
linear: the model is a linear function of x, followed by a threshold r:
z = w⊤x + b
y = 1 if z ≥ r, 0 if z < r
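As an illustration (a sketch added to these notes, not from the original slides), the thresholded linear model can be written in a few lines of NumPy; the function name predict and the default threshold r = 0 are choices made for this example.

import numpy as np

def predict(X, w, b, r=0.0):
    # Binary linear classifier: compute z = w^T x + b for each row of X,
    # then threshold at r to get a 0/1 prediction.
    z = X @ w + b
    return (z >= r).astype(int)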
Example: the NOT function
This is our “training set”:
x0  x1  t
1   0   1
1   1   0
What conditions are needed on w0, w1 to classify all examples?
When x1 = 0, need: z = w0x0 + w1x1 > 0 ⟺ w0 > 0
When x1 = 1, need: z = w0x0 + w1x1 < 0 ⟺ w0 + w1 < 0
Example solution: w0 = 1, w1 = −2. Is this the only solution?
Example: the AND function
This is our “training set”:
x0  x1  x2  t
1   0   0   0
1   0   1   0
1   1   0   0
1   1   1   1
z = w0x0 + w1x1 + w2x2
need: w0 < 0
need: w0 + w2 < 0
need: w0 + w1 < 0
need: w0 + w1 + w2 > 0
Example solution: w0 = −1.5, w1 = 1, w2 = 1
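As a quick sanity check (added here, not part of the slides), the example solutions for NOT and AND can be verified against their truth tables, mirroring the predict sketch above; the dummy feature x0 = 1 plays the role of the bias, so b = 0 below.

import numpy as np

# NOT example: w0 = 1, w1 = -2 (x0 is the dummy feature fixed at 1)
X_not = np.array([[1, 0], [1, 1]])
t_not = np.array([1, 0])
y_not = (X_not @ np.array([1.0, -2.0]) >= 0).astype(int)
print(np.array_equal(y_not, t_not))   # True

# AND example: w0 = -1.5, w1 = 1, w2 = 1
X_and = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]])
t_and = np.array([0, 0, 0, 1])
y_and = (X_and @ np.array([-1.5, 1.0, 1.0]) >= 0).astype(int)
print(np.array_equal(y_and, t_and))   # True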
The Geometric Picture
Input Space, or Data Space, for the NOT example
x0  x1  t
1   0   1
1   1   0
Training examples are points.
Weights (hypotheses) w can be represented by half-spaces H+ = {x : w⊤x ≥ 0}, H− = {x : w⊤x < 0}
The boundaries of these half-spaces pass through the origin (why?)
The boundary is the decision boundary: {x : w⊤x = 0}
In 2-D, it is a line, but think of it as a hyperplane.
If the training examples can be perfectly separated by a linear decision rule, we say the data is linearly separable.
Weight Space
Weights (hypotheses) w are points.
Each training example x specifies a half-space w must lie in to be correctly classified: w⊤x > 0 if t = 1 (and w⊤x < 0 if t = 0).
For the NOT example:
x0 = 1, x1 = 0, t = 1 ⟹ (w0, w1) ∈ {w : w0 > 0}
x0 = 1, x1 = 1, t = 0 ⟹ (w0, w1) ∈ {w : w0 + w1 < 0}
The region satisfying all the constraints is the feasible region; if this region is nonempty, the problem is feasible, otherwise it is infeasible.
The AND example requires three dimensions, including the dummy one. To visualize data space and weight space for a 3-D example, we can look at a 2-D slice.
The visualizations are similar.
The feasible set will always have a corner at the origin.
Visualizations of the AND example
Data Space
- slice for x0 = 1
- example solution: w0 = −1.5, w1 = 1, w2 = 1
- decision boundary: w0x0 + w1x1 + w2x2 = 0 ⟹ −1.5 + x1 + x2 = 0
Weight Space
- slice for w0 = −1.5 for the constraints
- w0 + w1 < 0
- w0 + w1 + w2 > 0
Some datasets are not linearly separable, e.g. XOR
Summary: binary linear classifiers
Binary linear classifiers. Targets t ∈ {0, 1}:
z = w⊤x + b
y = 1 if z ≥ 0, 0 if z < 0
If training set is separable, we can solve for w, b using linear programming
If it’s not separable, the problem is harder
data is almost never separable in real life.
What can we do instead?
(Spoiler Answer: Logistic Regression)
Let’s derive Logistic Regression by considering a few alternatives first.
Loss Functions
Instead: define a loss function, then try to minimize the resulting cost function.
Recall: the cost is the loss averaged (or summed) over the training set.
Seemingly obvious loss function: 0-1 loss
L_0-1(y, t) = 0 if y = t, 1 if y ≠ t
            = I[y ≠ t]
Attempt 1: 0-1 Loss
Usually, the cost J is the averaged loss over training examples; for 0-1 loss, this is the misclassification rate/error:
J = (1/N) Σ_{i=1}^N I[y^(i) ≠ t^(i)]
What is the issue with using this loss?
The step function (0-1 loss) has zero derivative almost everywhere.
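For concreteness, the misclassification rate is one line of NumPy (a sketch added here, not from the slides):

import numpy as np

def misclassification_rate(y, t):
    # Average 0-1 loss: the fraction of predictions that differ from the targets.
    return np.mean(y != t)

Because the hard-thresholded predictions are piecewise constant in the weights, so is this cost, which is exactly the zero-gradient problem noted above.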
Attempt 2: Linear Regression
Sometimes we can replace the loss function we care about with one that is easier to optimize. This is known as relaxation with a smooth surrogate loss function.
One problem with L_0-1 is that it is defined in terms of the final prediction, which inherently involves a discontinuity.
Instead, define the loss in terms of w⊤x + b directly.
Redo notation for convenience: z = w⊤x + b
We already know how to fit a linear regression model using the squared error loss. Can we use the same squared error loss instead?
z = w⊤x + b
L_SE(z, t) = ½(z − t)²
Doesn’t matter that the targets are actually binary. Treat them as continuous values.
For this loss function, it makes sense to make final predictions by thresholding z at ½ (why?)
The problem:
The loss function penalizes you when you make correct predictions with high confidence!
If t = 1, the loss is larger when z = 10 than when z = 0.
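A tiny numeric check of this point (a sketch added here, not from the slides): with t = 1, the squared error at z = 10 far exceeds the squared error at z = 0, even though z = 10 is a confidently correct prediction.

t = 1.0
loss_se = lambda z, t: 0.5 * (z - t) ** 2
print(loss_se(10.0, t))   # 40.5 -- confidently correct, heavily penalized
print(loss_se(0.0, t))    # 0.5  -- on the wrong side of the 1/2 threshold, yet a smaller loss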
Attempt 3: Logistic Activation Function
There’s obviously no reason to predict values outside [0, 1]. Let’s squash y into this interval.
The logistic function is a kind of sigmoid, or S-shaped function:
σ(z) = 1 / (1 + e^(−z))
σ^(−1)(y) = log(y / (1 − y)) is called the logit.
A linear model with a logistic nonlinearity is known as log-linear:
z = w⊤x + b
y = σ(z)
L_SE(y, t) = ½(y − t)²
Used in this way, σ is called an activation function.
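A minimal NumPy sketch of the logistic function and the log-linear model (the function names are choices made for this illustration):

import numpy as np

def sigmoid(z):
    # Logistic function: sigma(z) = 1 / (1 + exp(-z)).
    # (np.exp(-z) can overflow with a warning for very negative z; the
    #  numerically stable loss later in the lecture sidesteps this.)
    return 1.0 / (1.0 + np.exp(-z))

def log_linear(X, w, b):
    # Linear model followed by the logistic activation: y = sigma(w^T x + b).
    return sigmoid(X @ w + b)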
The problem:
(plot of L_SE as a function of z, assuming t = 1)
∂L/∂wj = (∂L/∂z) · (∂z/∂wj)
For z ≪ 0, we have σ(z) ≈ 0.
∂L/∂z ≈ 0 (check!) ⟹ ∂L/∂wj ≈ 0 ⟹ the derivative w.r.t. wj is small
⟹ wj behaves as if it were at a critical point
If the prediction is really wrong, you should be far from a critical point (which is your candidate solution).
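Working out the “check!” numerically (an added sketch, assuming the squared-error-on-sigmoid loss above): for L_SE(σ(z), t) = ½(σ(z) − t)², the chain rule gives dL/dz = (σ(z) − t) · σ(z) · (1 − σ(z)), which is vanishingly small for z ≪ 0 even though the prediction is badly wrong.

import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def dLse_dz(z, t):
    # Derivative of 0.5 * (sigmoid(z) - t)^2 with respect to z.
    y = sigmoid(z)
    return (y - t) * y * (1 - y)

print(dLse_dz(-10.0, 1.0))  # ~ -4.5e-05: tiny gradient despite a very wrong prediction
print(dLse_dz(0.0, 1.0))    # -0.125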
Logistic Regression
Because y ∈ [0, 1], we can interpret it as the estimated probability that t = 1.
The pundits who were 99% confident Clinton would win were much more wrong than the ones who were only 90% confident.
Cross-entropy loss (aka log loss) captures this intuition:
L_CE(y, t) = −log y if t = 1, −log(1 − y) if t = 0
           = −t log y − (1 − t) log(1 − y)
Logistic Regression:
z = w⊤x + b
y = σ(z)
L_CE = −t log y − (1 − t) log(1 − y)
(Plot is for target t = 1.)
Problem: what if t = 1 but you’re really confident it’s a negative example (z ≪ 0)?
If y is small enough, it may be numerically zero. This can cause very subtle and hard-to-find bugs.
y = σ(z) ⟹ y ≈ 0
L_CE = −t log y − (1 − t) log(1 − y) ⟹ computes log 0
Instead, we combine the activation function and the loss into a single logistic-cross-entropy function:
L_LCE(z, t) = L_CE(σ(z), t) = t log(1 + e^(−z)) + (1 − t) log(1 + e^z)
Numerically stable computation:
E = t * np.logaddexp(0, -z) + (1-t) * np.logaddexp(0, z)
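Fleshing out that one-liner into a self-contained sketch (the function name and the example values are assumptions added for illustration):

import numpy as np

def logistic_cross_entropy(z, t):
    # Numerically stable L_LCE(z, t) = t*log(1 + exp(-z)) + (1-t)*log(1 + exp(z)),
    # using log(1 + exp(a)) = np.logaddexp(0, a) to avoid overflow and log(0).
    return t * np.logaddexp(0, -z) + (1 - t) * np.logaddexp(0, z)

z = np.array([-50.0, 0.0, 50.0])
t = np.array([1.0, 1.0, 1.0])
print(logistic_cross_entropy(z, t))  # [50., 0.693..., ~0.] -- no log(0) anywhere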
Comparison of loss functions: (for t = 1)
Gradient Descent
How do we minimize the cost J in this case? No direct solution.
Taking derivatives of J w.r.t. w and setting them to 0 doesn’t have an explicit solution.
Now let’s see a second way to minimize the cost function which is more broadly applicable: gradient descent.
Gradient descent is an iterative algorithm, which means we apply an update repeatedly until some criterion is met.
We initialize the weights to something reasonable (e.g. all zeros) and repeatedly adjust them in the direction of steepest descent.
Gradient for Logistic Regression
Back to logistic regression:
L_CE(y, t) = −t log y − (1 − t) log(1 − y)
y = 1 / (1 + e^(−z)) and z = w⊤x + b
∂L_CE/∂wj = (∂L_CE/∂y) · (∂y/∂z) · (∂z/∂wj) = (−t/y + (1 − t)/(1 − y)) · y(1 − y) · xj = (y − t) xj
Exercise: Verify this!
Gradient descent (coordinatewise) update to find the weights of logistic regression:
wj ← wj − α ∂J/∂wj = wj − (α/N) Σ_{i=1}^N (y^(i) − t^(i)) xj^(i)
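Putting the pieces together, here is a vectorized batch-gradient-descent sketch for logistic regression (the function name, learning rate, and iteration count are assumptions made for this example):

import numpy as np

def fit_logistic_regression(X, t, alpha=0.1, num_iters=1000):
    # X: (N, D) inputs, t: (N,) binary targets in {0, 1}.
    N, D = X.shape
    w = np.zeros(D)
    b = 0.0
    for _ in range(num_iters):
        y = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # predictions y = sigma(w^T x + b)
        # Gradient of the averaged cross-entropy cost: (1/N) sum_i (y_i - t_i) x_i
        w -= alpha * (X.T @ (y - t)) / N
        b -= alpha * np.mean(y - t)
    return w, b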
Comparison of gradient descent updates:
Linear regression (verify!):
w ← w − (α/N) Σ_{i=1}^N (y^(i) − t^(i)) x^(i)
Logistic regression:
w ← w − (α/N) Σ_{i=1}^N (y^(i) − t^(i)) x^(i)
Not a coincidence! These are both examples of generalized linear models. But we won’t go into further detail.
Notice the 1/N in front of the sums, due to the averaged losses. This is why you need a smaller learning rate when optimizing the sum of losses instead (α′ = α/N).
Stochastic Gradient Descent
So far, the cost function J has been the average loss over the training examples:
J(θ) = (1/N) Σ_{i=1}^N L^(i) = (1/N) Σ_{i=1}^N L(y(x^(i), θ), t^(i)).
By linearity,
∂J/∂θ = (1/N) Σ_{i=1}^N ∂L^(i)/∂θ.
Computing the gradient requires summing over all of the training examples. This is known as batch training.
Batch training is impractical if you have a large dataset N ≫ 1 (e.g. millions of training examples)!
Stochastic gradient descent (SGD): update the parameters based on the gradient for a single training example:
1. Choose i uniformly at random
2. θ ← θ − α ∂L^(i)/∂θ
Cost of each SGD update is independent of N.
SGD can make significant progress before even seeing all the data!
Mathematical justification: if you sample a training example uniformly at random, the stochastic gradient is an unbiased estimate of the batch gradient:
E[∂L^(i)/∂θ] = (1/N) Σ_{i=1}^N ∂L^(i)/∂θ = ∂J/∂θ.
Variance in this estimate may be high.
If we only look at one training example at a time, we can’t exploit efficient vectorized operations.
Compromise approach: compute the gradients on a randomly chosen medium-sized set of training examples M ⊂ {1, …, N}, called a mini-batch.
Stochastic gradients computed on larger mini-batches have smaller variance. This is similar to bagging.
The mini-batch size |M| is a hyperparameter.
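To make the procedure concrete, a minimal mini-batch SGD loop for logistic regression (a sketch under assumed names and defaults; with batch_size = 1 it reduces to plain SGD, and with batch_size = N it behaves like batch gradient descent):

import numpy as np

def minibatch_sgd_logistic(X, t, alpha=0.1, batch_size=32, num_epochs=10, seed=0):
    # X: (N, D) inputs, t: (N,) binary targets in {0, 1}.
    rng = np.random.default_rng(seed)
    N, D = X.shape
    w, b = np.zeros(D), 0.0
    for _ in range(num_epochs):
        for _ in range(N // batch_size):
            # Choose a mini-batch M uniformly at random.
            idx = rng.choice(N, size=batch_size, replace=False)
            X_m, t_m = X[idx], t[idx]
            y_m = 1.0 / (1.0 + np.exp(-(X_m @ w + b)))
            # Stochastic estimate of the batch gradient, averaged over the mini-batch.
            w -= alpha * (X_m.T @ (y_m - t_m)) / batch_size
            b -= alpha * np.mean(y_m - t_m)
    return w, b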