UNIVERSITY COLLEGE LONDON Faculty of Engineering Sciences
Department of Computer Science
Problem Set: Classification
Dr. Dariush Hosseini (dariush.hosseini@ucl.ac.uk)
Notation
Inputs: x = [1, x1, x2, …, xm]T ∈ Rm+1
Outputs:
y ∈ R for regression problems
y ∈ {0, 1} for binary classification problems
Training Data:
\[
S = \{(x^{(i)}, y^{(i)})\}_{i=1}^{n}
\]
Input Training Data:
The design matrix, X, is defined as:
\[
X = \begin{bmatrix} \left(x^{(1)}\right)^T \\ \left(x^{(2)}\right)^T \\ \vdots \\ \left(x^{(n)}\right)^T \end{bmatrix}
  = \begin{bmatrix} 1 & x^{(1)}_1 & \cdots & x^{(1)}_m \\ 1 & x^{(2)}_1 & \cdots & x^{(2)}_m \\ \vdots & \vdots & & \vdots \\ 1 & x^{(n)}_1 & \cdots & x^{(n)}_m \end{bmatrix}
\]
Output Training Data:
\[
y = \begin{bmatrix} y^{(1)} \\ y^{(2)} \\ \vdots \\ y^{(n)} \end{bmatrix}
\]
Data-Generating Distribution:
The outcomes of S are drawn i.i.d. from a data-generating distribution, D
1. This problem focuses on generative approaches to classification. It begins by asking for basic statements and derivations pertaining to probabilistic classification, before asking you to consider a particular generative model. The model is not one which we discussed in lectures, but is very similar to Naive Bayes. It is known as ‘Linear Discriminant Analysis’ (LDA). You are asked to investigate the discriminant boundaries that emerge from this model. Following this you are asked to consider a slight generalisation of the model with fewer restrictions placed upon the class conditional covariances. This more general model is known as ‘Quadratic Discriminant Analysis’ (QDA). Finally you are asked to consider how these models differ from the Naive Bayes model which we discussed in the lectures. Note throughout how different model assumptions imply different discriminant boundaries and hence different classifiers.
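For concreteness (not needed to answer the questions below), here is a minimal sketch contrasting the two models, assuming scikit-learn and NumPy are available; the synthetic data and parameter values are purely illustrative:

import numpy as np
from sklearn.discriminant_analysis import (
    LinearDiscriminantAnalysis,
    QuadraticDiscriminantAnalysis,
)

rng = np.random.default_rng(0)

# Illustrative class-conditional Gaussians with unequal covariances.
X0 = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.3], [0.3, 1.0]], size=200)
X1 = rng.multivariate_normal([2.0, 2.0], [[2.0, -0.5], [-0.5, 0.5]], size=200)
X = np.vstack([X0, X1])
y = np.concatenate([np.zeros(200), np.ones(200)])

# LDA pools a single covariance across the classes, giving a linear
# discriminant boundary, exposed directly as coef_ (w) and intercept_ (b).
lda = LinearDiscriminantAnalysis().fit(X, y)
print("LDA w:", lda.coef_, "b:", lda.intercept_)

# QDA fits one covariance per class, giving a quadratic discriminant boundary.
qda = QuadraticDiscriminantAnalysis().fit(X, y)
print("QDA training accuracy:", qda.score(X, y))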
(a) [2 marks]
Describe the generative approach to classification. How does it differ from the discriminative approach?
(b) [3 marks]
Derive the Bayes Optimal Classifier for binary classification, assuming misclassification loss.
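For reference, recall that under misclassification (0–1) loss the Bayes optimal classifier predicts the most probable class a posteriori; the derivation should arrive at something of the form
\[
f^*(x) = \arg\max_{y \in \{0, 1\}} p_Y(y \mid x).
\]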
(c) [10 marks]
In a binary classification setting, assume that classes are distributed according to a Bernoulli random variable, Y, whose outcomes are y, i.e. y ∼ Bern(θ), where θ = pY(y = 1). Furthermore we model the class conditional probability distributions for the random variable, X, whose outcomes are given by instances of particular input attribute vectors, x = [x1, x2, …, xm]T ∈ Rm, as (note that here we will take care of the bias parameter explicitly, hence the absence of a leading ‘1’ in the attribute vector):
\[
x \mid (y = 0) \sim \mathcal{N}(\mu_0, \Sigma_0) \quad \text{where: } \mu_0 \in \mathbb{R}^m,\ \Sigma_0 \in \mathbb{R}^{m \times m},\ \Sigma_0^T = \Sigma_0,\ \Sigma_0 \succ 0
\]
\[
x \mid (y = 1) \sim \mathcal{N}(\mu_1, \Sigma_1) \quad \text{where: } \mu_1 \in \mathbb{R}^m,\ \Sigma_1 \in \mathbb{R}^{m \times m},\ \Sigma_1^T = \Sigma_1,\ \Sigma_1 \succ 0
\]
The off-diagonal elements of Σ0, Σ1 are not necessarily zero.
Assume that Σ = Σ0 = Σ1 and show that the discriminant boundaries between the classes can be described by the following expression (you should clearly express w and b):
\[
w \cdot x + b = 0 \quad \text{where: } w \in \mathbb{R}^m \text{ and } b \in \mathbb{R}
\]
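As a hint at the usual route (a sketch only, with θ = pY(y = 1) as defined above): the boundary is where the log posterior odds vanish, and with a shared covariance Σ the normalising constants cancel, leaving
\[
\log \frac{p_Y(y = 1 \mid x)}{p_Y(y = 0 \mid x)} = \log \frac{\theta}{1 - \theta} - \tfrac{1}{2}(x - \mu_1)^T \Sigma^{-1} (x - \mu_1) + \tfrac{1}{2}(x - \mu_0)^T \Sigma^{-1} (x - \mu_0) = 0,
\]
in which the quadratic terms in x cancel, so the condition rearranges into the stated linear form w · x + b = 0.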
(d) [2 marks]
What does this expression describe? Explain.
(e) [4 marks]
Now assume that Σ0 ≠ Σ1. What happens to the discriminant boundaries? Explain.
(f) [4 marks]
Explain how this approach differs from that of Naïve Bayes.
2. This problem focuses on discriminative classification. You begin by considering the Logistic Noise Latent Variable model and use it to motivate the Logistic Regression model, as we do in lectures. Following this you are asked to consider whether changing the parameterisation of the underlying logistic noise will imply a different classification model (it won’t!). Next you are asked to repeat this analysis but for a Gaussian Latent Variable model. The resulting classification model is known as ‘probit regression’. While it is similar in form to logistic regression, the probit function is less easy to manipulate than the logistic sigmoid, and furthermore has more sensitivity to outliers. Finally you are asked to consider a multinomial extension of the logistic regression model, and in particular to examine the form of the boundaries which exist between classes for this model.
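For intuition ahead of parts (d) and (e), a minimal numerical sketch (assuming NumPy and SciPy are available) comparing the logistic sigmoid with the standard Gaussian CDF used by probit regression; the function names are illustrative only:

import numpy as np
from scipy.stats import norm

def logistic_sigmoid(z):
    # Logistic(0, 1) CDF: P(eps <= z) = 1 / (1 + exp(-z)).
    return 1.0 / (1.0 + np.exp(-z))

def probit(z):
    # Standard Gaussian CDF Phi(z), the probit link.
    return norm.cdf(z)

# The Gaussian tail decays like exp(-z^2 / 2) while the logistic tail decays
# like exp(-z), so the probit model assigns far smaller probability to points
# lying deep on the "wrong" side of the boundary -- one way of seeing its
# greater sensitivity to outliers.
for z in [-8.0, -4.0, -2.0, 0.0, 2.0, 4.0, 8.0]:
    print(f"z = {z:5.1f}  sigmoid = {logistic_sigmoid(z):.2e}  probit = {probit(z):.2e}")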
(a) [2 marks]
Describe the discriminative approach to classification. How does it differ from the generative approach?
(b) [3 marks]
Recall that in binary logistic regression, we seek to learn a mapping characterised by the weight vector, w and drawn from the function class, F:
\[
\mathcal{F} = \left\{ f_w(x) = \mathbb{I}\left[p_Y(y = 1 \mid x) \geq 0.5\right] \;\middle|\; p_Y(y = 1 \mid x) = \frac{1}{1 + e^{-w \cdot x}},\ w \in \mathbb{R}^{m+1} \right\}
\]
Here pY(y|x) is the posterior output class probability associated with a data-generating distribution, D, which is characterised by the joint distribution pX,Y(x, y). Provide a motivation for this form of the posterior output class probability pY(y|x) by considering a Logistic Noise Latent Variable Model. Remember that the noise in such a model characterises a random variable ε, with outcomes, ε, which are drawn i.i.d. as follows:
\[
\varepsilon \sim \mathrm{Logistic}(a, b) \quad \text{where: } a = 0,\ b = 1
\]
The characteristic probability distribution function for such a variable is:
\[
p_\varepsilon(\varepsilon \mid a, b) = \frac{\exp\left(-\frac{\varepsilon - a}{b}\right)}{b \left(1 + \exp\left(-\frac{\varepsilon - a}{b}\right)\right)^2}
\]
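For orientation only, a sketch of the standard latent-variable argument (the latent outcome y* below is introduced purely for illustration): set y* = w · x + ε with y = I[y* ≥ 0]; then, using the Logistic(0, 1) CDF and its symmetry about zero,
\[
p_Y(y = 1 \mid x) = P(\varepsilon \geq -w \cdot x) = \frac{1}{1 + e^{-w \cdot x}},
\]
which is exactly the posterior appearing in the function class above.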
(c) [3 marks]
If we allow the Logistic parameters to take general values a ∈ R, b > 0, explain the effect that this has on the final logistic regression model.
(d) [4 marks]
Let us assume instead a Gaussian Noise Latent Variable Model. Now ε is drawn i.i.d. as follows:
ε ∼ N(0,1)
Derive an expression for the posterior output class probability pY(y|x) in this case.
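As a pointer to the expected shape of the answer (a sketch only, with Φ denoting the standard normal CDF): the same latent-variable argument with Gaussian noise, together with the symmetry of Φ, gives
\[
p_Y(y = 1 \mid x) = P(\varepsilon \geq -w \cdot x) = \Phi(w \cdot x),
\]
the ‘probit regression’ model mentioned in the preamble, for which no closed-form expression analogous to the logistic sigmoid exists.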
(e) [3 marks]
How will the treatment of outliers in the data differ for these two models? Explain.
(f) [2 marks]
For K-class multinomial regression, assuming misclassification loss, we can express a discriminative model for the posterior output class probability as:
\[
p_Y(y = j \mid x) = \frac{\exp(w_j \cdot x)}{\sum_{k=1}^{K} \exp(w_k \cdot x)},
\]
where now y ∈ {1, …, K}.
Demonstrate that this model reduces to logistic regression when K = 2.
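As a sketch of the expected manipulation (with the relabelling w = w1 − w2 introduced here for illustration):
\[
p_Y(y = 1 \mid x) = \frac{\exp(w_1 \cdot x)}{\exp(w_1 \cdot x) + \exp(w_2 \cdot x)} = \frac{1}{1 + \exp(-(w_1 - w_2) \cdot x)} = \frac{1}{1 + \exp(-w \cdot x)},
\]
which is the binary logistic regression posterior.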
(g) [3 marks]
For K > 2, derive an expression for the discriminant boundaries between classes. What does this expression describe?
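As a hint at the expected form (a sketch only): between any pair of classes j ≠ k the posteriors are equal exactly when
\[
(w_j - w_k) \cdot x = 0,
\]
so the pairwise discriminant boundaries are hyperplanes, linear in x (with the bias absorbed through the leading ‘1’ in x).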