Machine Learning and Data Mining in Business
Week 6 Tutorial
When studying these exercises, please keep in mind that they are about problem-solving techniques for machine learning. In general, they’re not about particular distributions or learning algorithms.
Question 1
Let $Y_1, Y_2, \dots, Y_n \sim \text{Poisson}(\lambda)$. Recall that the Poisson distribution has probability mass function
$$p(y; \lambda) = \frac{e^{-\lambda} \lambda^y}{y!}.$$
(a) Write down the likelihood for a sample y1, . . . , yn.
Solution: Likelihood function:
$$L(\lambda) = \prod_{i=1}^{n} p(y_i; \lambda) = \prod_{i=1}^{n} \frac{\exp(-\lambda)\,\lambda^{y_i}}{y_i!} \propto \prod_{i=1}^{n} \exp(-\lambda)\,\lambda^{y_i}.$$
The symbol $\propto$ means "proportional to". In practice, we automatically drop any terms that do not depend on the parameter, since they are irrelevant for optimisation.
(b) Derive a simple expression for the log-likelihood.
Solution: Log-likelihood function:
$$
\begin{aligned}
\ell(\lambda) = \log L(\lambda)
&= \log \prod_{i=1}^{n} \frac{\exp(-\lambda)\,\lambda^{y_i}}{y_i!} \\
&= \sum_{i=1}^{n} \Big[ \log \exp(-\lambda) + \log\big(\lambda^{y_i}\big) - \log(y_i!) \Big] \\
&= \sum_{i=1}^{n} \Big[ -\lambda + y_i \log(\lambda) - \log(y_i!) \Big] \\
&= -n\lambda + \log(\lambda) \sum_{i=1}^{n} y_i - \sum_{i=1}^{n} \log(y_i!).
\end{aligned}
$$
Essentially, all we did was repeatedly apply the laws of exponents and logarithms from the basic facts training and review notes provided at the start of the semester.
(c) Let the objective function for optimisation be the negative log-likelihood. Find the critical point of the cost function.
Solution: Objective function (dropping constant terms from the negative log-likelihood):
$$J(\lambda) = n\lambda - \sum_{i=1}^{n} y_i \log(\lambda).$$
First derivative with respect to the parameter:
$$\frac{dJ}{d\lambda} = n - \frac{1}{\lambda} \sum_{i=1}^{n} y_i.$$
First-order necessary condition:
$$n - \frac{1}{\lambda} \sum_{i=1}^{n} y_i = 0.$$
Solving the equation:
$$\hat{\lambda} = \frac{1}{n} \sum_{i=1}^{n} y_i = \bar{y}.$$
(d) Show that the critical point is the MLE.
Solution: The second derivative of the cost function is
$$\frac{d^2 J}{d\lambda^2} = \frac{1}{\lambda^2} \sum_{i=1}^{n} y_i,$$
which is positive as long as at least one of the data points $y_i$ is not zero (recall that $y \in \{0, 1, 2, \dots\}$ for the Poisson distribution).
Therefore, $J(\lambda)$ is strictly convex and we can conclude that $\hat{\lambda} = \bar{y}$ is the MLE.
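As a quick sanity check (not part of the original exercise), we can evaluate $J(\lambda)$ on a grid and confirm that the numerical minimiser matches the sample mean derived above. The data vector `y` below is made up purely for illustration.

```python
import math

def neg_log_lik(lam, y):
    # J(lambda) = n*lambda - (sum of y_i) * log(lambda); constant terms dropped
    return len(y) * lam - sum(y) * math.log(lam)

y = [2, 0, 3, 1, 4, 2]                      # illustrative Poisson counts
grid = [k / 1000 for k in range(1, 10001)]  # candidate lambda values in (0, 10]
lam_hat = min(grid, key=lambda lam: neg_log_lik(lam, y))

print(lam_hat)          # grid minimiser of the negative log-likelihood
print(sum(y) / len(y))  # sample mean: the analytical MLE
```

Both quantities agree, as the derivation predicts.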
(e) You can create as many additional exercises of this type as you like by picking any simple statistical distribution and answering the same questions.
Question 2
In addition to being good practice, this exercise derives results that will be very useful later.
Consider the model $Y_1, Y_2, \dots, Y_n \sim \text{Bernoulli}\big(\sigma(\beta)\big)$, where $\beta \in \mathbb{R}$ is a parameter and $\sigma$ is the sigmoid function
$$\sigma(\beta) = \frac{1}{1 + \exp(-\beta)}.$$
You can think of this model as a logistic regression that only has the intercept. Following the lecture, the optimisation problem for estimating this model is
$$\underset{\beta}{\text{minimise}} \;\; J(\beta) = \sum_{i=1}^{n} \Big[ -y_i \log\big(\sigma(\beta)\big) - (1 - y_i) \log\big(1 - \sigma(\beta)\big) \Big].$$
(a) Differentiate σ(β).
Solution: Starting from
$$\sigma(\beta) = \frac{1}{1 + \exp(-\beta)},$$
we have that
$$\sigma'(\beta) = \frac{\exp(-\beta)}{\big(1 + \exp(-\beta)\big)^2},$$
using the chain rule, the reciprocal rule, and the derivative of the exponential function.
(b) Show that σ′(β) = σ(β)(1 − σ(β)).
Solution: Noting that
$$1 - \sigma(\beta) = 1 - \frac{1}{1 + \exp(-\beta)} = \frac{1 + \exp(-\beta) - 1}{1 + \exp(-\beta)} = \frac{\exp(-\beta)}{1 + \exp(-\beta)},$$
we have
$$\sigma'(\beta) = \frac{\exp(-\beta)}{\big(1 + \exp(-\beta)\big)^2} = \frac{1}{1 + \exp(-\beta)} \cdot \frac{\exp(-\beta)}{1 + \exp(-\beta)} = \sigma(\beta)\big(1 - \sigma(\beta)\big).$$
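The identity $\sigma'(\beta) = \sigma(\beta)(1 - \sigma(\beta))$ is easy to verify numerically against a central finite-difference approximation of the derivative; the evaluation points below are chosen arbitrarily.

```python
import math

def sigmoid(b):
    return 1.0 / (1.0 + math.exp(-b))

h = 1e-6
for b in [-2.0, 0.0, 0.5, 3.0]:
    numeric = (sigmoid(b + h) - sigmoid(b - h)) / (2 * h)  # central difference
    analytic = sigmoid(b) * (1 - sigmoid(b))               # identity from part (b)
    assert abs(numeric - analytic) < 1e-8
```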
(c) Find the derivative of J(β) using the chain rule and the previous result.
Solution: The derivative is
$$
\begin{aligned}
J'(\beta)
&= \sum_{i=1}^{n} \left[ -y_i \frac{\sigma'(\beta)}{\sigma(\beta)} + (1 - y_i) \frac{\sigma'(\beta)}{1 - \sigma(\beta)} \right] \\
&= \sum_{i=1}^{n} \left[ -y_i \frac{\sigma(\beta)\big(1 - \sigma(\beta)\big)}{\sigma(\beta)} + (1 - y_i) \frac{\sigma(\beta)\big(1 - \sigma(\beta)\big)}{1 - \sigma(\beta)} \right] \\
&= \sum_{i=1}^{n} \Big[ -y_i + y_i \sigma(\beta) + \sigma(\beta) - y_i \sigma(\beta) \Big] \\
&= \sum_{i=1}^{n} \Big[ \sigma(\beta) - y_i \Big] = n\sigma(\beta) - \sum_{i=1}^{n} y_i.
\end{aligned}
$$
(d) Find the critical point of J(β).
Solution: The first-order necessary condition is
$$n\sigma(\beta) - \sum_{i=1}^{n} y_i = 0.$$
Rearranging,
$$\sigma(\beta) = \bar{y}.$$
We conclude that the critical point is
$$\hat{\beta} = \sigma^{-1}(\bar{y}),$$
where $\sigma^{-1}$ denotes the inverse of the sigmoid function, i.e. the logit function.
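We can also confirm the result $\hat{\beta} = \sigma^{-1}(\bar{y})$ empirically: running plain gradient descent on $J(\beta)$, using the derivative $J'(\beta) = n\sigma(\beta) - \sum_i y_i$ from part (c), should converge to the logit of the sample mean. The binary responses, step size, and iteration count below are illustrative choices, not part of the tutorial.

```python
import math

def sigmoid(b):
    return 1.0 / (1.0 + math.exp(-b))

y = [1, 0, 1, 1, 0, 1, 0, 1]  # illustrative binary responses
n = len(y)
ybar = sum(y) / n

# Gradient descent on J(beta) using J'(beta) = n*sigmoid(beta) - sum(y_i)
beta = 0.0
for _ in range(10000):
    beta -= 0.01 * (n * sigmoid(beta) - sum(y))

logit = math.log(ybar / (1 - ybar))  # sigma^{-1}(ybar), the analytical solution
print(beta, logit)
```

The two printed values coincide up to numerical tolerance, matching the derivation.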
(e) What is the second derivative of the cost function? Show that the objective function is convex.
Solution: Using the first derivative above,
$$\frac{d^2 J}{d\beta^2} = n\sigma'(\beta) = n\sigma(\beta)\big(1 - \sigma(\beta)\big).$$
This expression is strictly positive since $\sigma(\beta) \in (0, 1)$. Therefore, the cost function is strictly convex.
Question 3
Support vector machines (SVMs) were a major development in machine learning in the mid-1990s due to their state-of-the-art performance and novelty at the time. Since then, researchers have discovered that support vector machines can be reformulated as regularised estimation, establishing a deep connection to classical methods such as logistic regression.
In support vector classification (SVC), we consider a binary classification problem and encode the response as $y \in \{-1, 1\}$. The method is based on the linear decision function
$$f(x) = \beta_0 + \beta_1 x_1 + \dots + \beta_p x_p$$
and classification rule
$$\hat{y} = \operatorname{sign}\big(f(x)\big),$$
which means that y = 1 if f(x) > 0 and y = −1 if f(x) < 0.
The set {x : f (x) = 0} is the decision boundary. Thus, we can view |f (x)| as a measure of the learning algorithm’s confidence that the observation is correctly classified.
The support vector classifier learns the coefficients β0, β1, . . . , βp by regularised empirical risk minimisation based on the hinge loss
$$L\big(y, f(x)\big) = \max\big(0,\, 1 - y f(x)\big).$$
This figure from the ISL textbook plots the hinge loss and the cross-entropy loss (neg- ative log-likelihood loss) for y = 1. The figure calls the latter the logistic regression loss because in this formulation, the prediction f(x) in the loss function L(y,f(x)) is a prediction for the logit of the probability.
[Figure: the hinge loss and the logistic regression (cross-entropy) loss plotted against the margin $y_i(\beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip})$.]
(a) Write down the learning rule for a support vector classifier based on $\ell_2$ regularisation.

Solution:
$$\underset{\beta}{\text{minimise}} \;\; \sum_{i=1}^{n} \max\Big(0,\; 1 - y_i(\beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip})\Big) + \lambda \sum_{j=1}^{p} \beta_j^2.$$

(b) Consider the term $yf(x)$ from the hinge loss. What is the classification when $y_i f(x_i) > 0$ compared to $y_i f(x_i) < 0$?

Solution: We have that $yf(x) > 0$ if $y$ and $f(x)$ both have the same sign. Therefore, $yf(x) > 0$ occurs when the observation is correctly classified. Likewise, $yf(x) < 0$ occurs when $y$ and $f(x)$ have opposite signs, which means that the instance is incorrectly classified.

(c) Interpret the hinge loss function by considering the following cases:
1. $yf(x) > 1$
2. $0 \le yf(x) \le 1$
3. $yf(x) < 0$
Overall, the hinge and cross-entropy loss functions are quite similar. The logistic regression loss is smooth, while the hinge loss has a “kink” where 1 − yf (x) = 0. Informally, we can say that the logistic regression loss is almost like a smooth (differentiable) version of the hinge loss.
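To make the comparison concrete, the snippet below evaluates both losses for $y = 1$ at a few margin values (chosen arbitrarily): both penalise negative margins heavily, the hinge loss is exactly zero once $yf(x) \ge 1$, while the logistic loss stays smooth and positive everywhere.

```python
import math

def hinge(margin):
    # hinge loss: max(0, 1 - y*f(x))
    return max(0.0, 1.0 - margin)

def logistic_loss(margin):
    # cross-entropy loss for the logit prediction: log(1 + exp(-y*f(x)))
    return math.log(1.0 + math.exp(-margin))

for m in [-2.0, 0.0, 1.0, 2.0, 4.0]:
    print(f"yf = {m:5.1f}   hinge = {hinge(m):.4f}   logistic = {logistic_loss(m):.4f}")
```

Note the "kink": `hinge(1.0)` is exactly 0 and stays 0 for larger margins, whereas the logistic loss only decays towards 0.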