
Density Estimation with Gaussian Mixture Models
Liang National University

Motivation
• In practice, the Gaussian distribution has limited modeling capabilities.
• Below is a two-dimensional dataset that cannot be meaningfully represented by a single Gaussian.
• We can use mixture models for density estimation.
• Mixture models can be used to describe a distribution $p(\boldsymbol{x})$ by a convex combination of $K$ simple (base) distributions:
$$p(\boldsymbol{x}) = \sum_{k=1}^{K} \pi_k\, p_k(\boldsymbol{x}), \qquad 0 \le \pi_k \le 1, \quad \sum_{k=1}^{K} \pi_k = 1,$$
where the components $p_k$ are members of a family of basic distributions, e.g., Gaussians, Bernoullis, or Gammas, and the $\pi_k$ are mixture weights.

11.1 Gaussian Mixture Model
The Gaussian mixture distribution (black) is composed of a convex combination of Gaussian distributions and is more expressive than any individual component. Dashed lines represent the weighted Gaussian components.
$$p(x \mid \boldsymbol{\theta}) = 0.5\,\mathcal{N}(x \mid -2, \tfrac{1}{2}) + 0.2\,\mathcal{N}(x \mid 1, 2) + 0.3\,\mathcal{N}(x \mid 4, 1)$$
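As a quick numerical illustration (not part of the original slides), this mixture density can be evaluated directly; a minimal Python sketch using numpy and scipy.stats.norm, assuming the second argument of each $\mathcal{N}$ above denotes a variance:

```python
import numpy as np
from scipy.stats import norm

# Mixture from the slide: weights, means, and variances of the three components.
weights = np.array([0.5, 0.2, 0.3])
means = np.array([-2.0, 1.0, 4.0])
variances = np.array([0.5, 2.0, 1.0])

def gmm_pdf(x):
    """Evaluate p(x | theta) = sum_k pi_k * N(x | mu_k, sigma_k^2)."""
    x = np.atleast_1d(x)[:, None]                            # shape (M, 1)
    comp = norm.pdf(x, loc=means, scale=np.sqrt(variances))  # shape (M, K)
    return comp @ weights                                    # convex combination

print(gmm_pdf([-2.0, 0.0, 4.0]))
```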

11.1 Gaussian Mixture Model
• A Gaussian mixture model (GMM) is a density model where we combine a finite number of $K$ Gaussian distributions $\mathcal{N}(\boldsymbol{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$ so that
$$p(\boldsymbol{x} \mid \boldsymbol{\theta}) = \sum_{k=1}^{K} \pi_k\, \mathcal{N}(\boldsymbol{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k), \qquad 0 \le \pi_k \le 1, \quad \sum_{k=1}^{K} \pi_k = 1,$$
where we defined $\boldsymbol{\theta} := \{\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k, \pi_k : k = 1, \dots, K\}$ as the collection of all parameters of the GMM.
• GMM gives us significantly more flexibility for modeling complex densities than a simple Gaussian distribution.

11.2 Parameter Learning via Maximum Likelihood
• Assume we are given a dataset $\mathcal{X} = \{\boldsymbol{x}_1, \boldsymbol{x}_2, \dots, \boldsymbol{x}_N\}$, where the $\boldsymbol{x}_n$, $n = 1, \dots, N$, are drawn i.i.d. from an unknown distribution $p(\boldsymbol{x})$.
• Our objective is to find a good approximation/representation of this unknown distribution $p(\boldsymbol{x})$ by means of a GMM with $K$ components.

Example
• We consider a one-dimensional dataset 𝓧 = {−3, −2.5, −1, 0, 2, 4, 5} consisting of 7 data points and wish to find a GMM with 𝐾 = 3 components that models the density of the data.
• We initialize the mixture components as
$$p_1(x) = \mathcal{N}(x \mid -4, 1), \qquad p_2(x) = \mathcal{N}(x \mid 0, 0.2), \qquad p_3(x) = \mathcal{N}(x \mid 8, 3)$$
and assign them equal weights $\pi_1 = \pi_2 = \pi_3 = \tfrac{1}{3}$.
• We can view the corresponding model and the data points below.
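For later reference (variable names are my own), the dataset and this initialization can be written as plain numpy arrays; a minimal sketch, again treating the second parameter of each $\mathcal{N}$ as a variance:

```python
import numpy as np

X = np.array([-3.0, -2.5, -1.0, 0.0, 2.0, 4.0, 5.0])  # the 7 data points

K = 3
pi = np.full(K, 1.0 / K)           # equal mixture weights pi_1 = pi_2 = pi_3 = 1/3
mu = np.array([-4.0, 0.0, 8.0])    # initial means of p_1, p_2, p_3
var = np.array([1.0, 0.2, 3.0])    # initial variances of p_1, p_2, p_3
```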

• How do we obtain a maximum likelihood estimate $\boldsymbol{\theta}_{\text{ML}}$ of the model parameters $\boldsymbol{\theta}$?
• We start by writing down the likelihood, i.e., the predictive distribution of the training data given the parameters. We exploit our i.i.d. assumption, which leads to the factorized likelihood
$$p(\mathcal{X} \mid \boldsymbol{\theta}) = \prod_{n=1}^{N} p(\boldsymbol{x}_n \mid \boldsymbol{\theta}), \qquad p(\boldsymbol{x}_n \mid \boldsymbol{\theta}) = \sum_{k=1}^{K} \pi_k\, \mathcal{N}(\boldsymbol{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k),$$
where every individual likelihood term $p(\boldsymbol{x}_n \mid \boldsymbol{\theta})$ is a Gaussian mixture density ($\mathcal{X}$ is the observed data, the $\mathcal{N}(\boldsymbol{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$ are the mixture components, and the $\pi_k$ are the mixture proportions).
• Then we obtain the log-likelihood (loss function) as
$$\mathcal{L}(\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k, \pi_k) = \log p(\mathcal{X} \mid \boldsymbol{\theta}) = \sum_{n=1}^{N} \log p(\boldsymbol{x}_n \mid \boldsymbol{\theta}) = \sum_{n=1}^{N} \log \sum_{k=1}^{K} \pi_k\, \mathcal{N}(\boldsymbol{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k).$$
• We aim to find parameters $\boldsymbol{\theta}^*$ (including $\boldsymbol{\mu}_k^*$, $\boldsymbol{\Sigma}_k^*$, $\pi_k^*$) that maximize the log-likelihood $\mathcal{L}$ defined above.
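To make the loss concrete, here is a minimal sketch (the helper name `log_likelihood` is my own) that evaluates the log-likelihood of the one-dimensional toy example, computing the inner sum over components directly rather than in log-space, which is fine at this scale:

```python
import numpy as np
from scipy.stats import norm

def log_likelihood(X, pi, mu, var):
    """L = sum_n log sum_k pi_k * N(x_n | mu_k, var_k) for a 1-D GMM."""
    dens = norm.pdf(X[:, None], loc=mu, scale=np.sqrt(var))  # dens[n, k]
    return np.sum(np.log(dens @ pi))

# Toy example with the initialization from the earlier slide.
X = np.array([-3.0, -2.5, -1.0, 0.0, 2.0, 4.0, 5.0])
pi = np.full(3, 1.0 / 3.0)
mu = np.array([-4.0, 0.0, 8.0])
var = np.array([1.0, 0.2, 3.0])
print(log_likelihood(X, pi, mu, var))
```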

• We obtain the following necessary conditions when we optimize the log-likelihood with respect to the GMM parameters $\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k, \pi_k$:
$$\frac{\partial \mathcal{L}}{\partial \boldsymbol{\mu}_k} = \boldsymbol{0}^\top \iff \sum_{n=1}^{N} \frac{\partial \log p(\boldsymbol{x}_n \mid \boldsymbol{\theta})}{\partial \boldsymbol{\mu}_k} = \boldsymbol{0}^\top,$$
$$\frac{\partial \mathcal{L}}{\partial \boldsymbol{\Sigma}_k} = \boldsymbol{0} \iff \sum_{n=1}^{N} \frac{\partial \log p(\boldsymbol{x}_n \mid \boldsymbol{\theta})}{\partial \boldsymbol{\Sigma}_k} = \boldsymbol{0},$$
$$\frac{\partial \mathcal{L}}{\partial \pi_k} = 0 \iff \sum_{n=1}^{N} \frac{\partial \log p(\boldsymbol{x}_n \mid \boldsymbol{\theta})}{\partial \pi_k} = 0.$$
• For all three necessary conditions, by applying the chain rule, we require partial derivatives of the form
$$\frac{\partial \log p(\boldsymbol{x}_n \mid \boldsymbol{\theta})}{\partial \boldsymbol{\theta}} = \frac{1}{p(\boldsymbol{x}_n \mid \boldsymbol{\theta})} \frac{\partial p(\boldsymbol{x}_n \mid \boldsymbol{\theta})}{\partial \boldsymbol{\theta}},$$
where $\boldsymbol{\theta} = \{\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k, \pi_k : k = 1, \dots, K\}$ are the model parameters and
$$\frac{1}{p(\boldsymbol{x}_n \mid \boldsymbol{\theta})} = \frac{1}{\sum_{j=1}^{K} \pi_j\, \mathcal{N}(\boldsymbol{x}_n \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)}.$$

11.2.1 Responsibilities
• We define the quantity
$$r_{nk} := \frac{\pi_k\, \mathcal{N}(\boldsymbol{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}{\sum_{j=1}^{K} \pi_j\, \mathcal{N}(\boldsymbol{x}_n \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)}$$
as the responsibility of the $k$th mixture component for the $n$th data point.
• We can see that $r_{nk}$ is proportional to the likelihood $p(\boldsymbol{x}_n \mid \pi_k, \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) = \pi_k\, \mathcal{N}(\boldsymbol{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$ of the $k$th mixture component given the data point.
• The responsibility $r_{nk}$ represents the posterior probability that $\boldsymbol{x}_n$ has been generated by the $k$th mixture component.
• Note that $\boldsymbol{r}_n := [r_{n1}, \dots, r_{nK}]^\top \in \mathbb{R}^K$ is a (normalized) probability vector, i.e., $\sum_k r_{nk} = 1$ with $r_{nk} \ge 0$.
• This probability vector distributes probability mass among the $K$ mixture components, and we can think of $\boldsymbol{r}_n$ as a "soft assignment" of $\boldsymbol{x}_n$ to the $K$ mixture components.
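In code, the responsibilities are just the weighted component densities normalized per data point; a minimal 1-D sketch (the function name is my own), applied to the toy example so the result can be compared with the matrix on the next slide:

```python
import numpy as np
from scipy.stats import norm

def responsibilities(X, pi, mu, var):
    """r[n, k] = pi_k N(x_n | mu_k, var_k) / sum_j pi_j N(x_n | mu_j, var_j)."""
    weighted = pi * norm.pdf(X[:, None], loc=mu, scale=np.sqrt(var))
    return weighted / weighted.sum(axis=1, keepdims=True)

X = np.array([-3.0, -2.5, -1.0, 0.0, 2.0, 4.0, 5.0])
pi = np.full(3, 1.0 / 3.0)
mu = np.array([-4.0, 0.0, 8.0])
var = np.array([1.0, 0.2, 3.0])

r = responsibilities(X, pi, mu, var)
print(np.round(r, 3))   # each row sums to 1 (soft assignment of x_n)
print(r.sum(axis=0))    # column sums give N_k, the total responsibility per component
```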

Example – Responsibilities
• From the figure below, we compute the responsibilities
$$\boldsymbol{r} = \begin{bmatrix} 1.0 & 0.0 & 0.0 \\ 1.0 & 0.0 & 0.0 \\ 0.057 & 0.943 & 0.0 \\ 0.001 & 0.999 & 0.0 \\ 0.0 & 0.066 & 0.934 \\ 0.0 & 0.0 & 1.0 \\ 0.0 & 0.0 & 1.0 \end{bmatrix} \in \mathbb{R}^{N \times K}.$$
• The $n$th row tells us the responsibilities of all mixture components for $\boldsymbol{x}_n$.
• The sum of all $K$ responsibilities for a data point (the sum of every row) is 1.
• The $k$th column gives us an overview of the responsibility of the $k$th mixture component.
• The third mixture component (third column) is not responsible for any of the first four data points, but takes much of the responsibility for the remaining data points.
• The sum of all entries of a column gives us the value $N_k$, i.e., the total responsibility of the $k$th mixture component. In our example, we get $N_1 = 2.058$, $N_2 = 2.008$, $N_3 = 2.934$.
• We will determine the updates of the model parameters $\boldsymbol{\mu}_k$, $\boldsymbol{\Sigma}_k$, and $\pi_k$ for given responsibilities.

11.3 EM Algorithm
• In a GMM, we first initialize the parameters $\boldsymbol{\mu}_k$, $\boldsymbol{\Sigma}_k$, and $\pi_k$ and then alternate until convergence between the following two steps:
• E-step: Evaluate the responsibilities $r_{nk}$ (the probability of data point $n$ belonging to mixture component $k$).
• M-step: Use the updated responsibilities to re-estimate the parameters $\boldsymbol{\mu}_k$, $\boldsymbol{\Sigma}_k$, and $\pi_k$.

Check your understanding
• Given a dataset generated by a mixture of 3 Gaussians, when we randomly sample a data point, it has a probability of 1/3 of belonging to each Gaussian.
• A GMM is a linear combination of several Gaussian distributions.
• In GMM, K (number of Gaussians) is a hyperparameter.
• If a dataset is not generated by Gaussian distributions, it cannot be modeled by GMM.

11.3 EM Algorithm
• Initialize $\boldsymbol{\mu}_k$, $\boldsymbol{\Sigma}_k$, $\pi_k$ (below is an example):
• $\pi_k = 1/K$ for all $k$
• $\boldsymbol{\mu}_k$: centroids from the $k$-means algorithm, or randomly chosen data points
• $\boldsymbol{\Sigma}_k$: the sample variance of the data, for all $k$
• E-step: Evaluate the responsibilities $r_{nk}$ for every data point $\boldsymbol{x}_n$ using the current parameters $\pi_k, \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k$:
$$r_{nk} = \frac{\pi_k\, \mathcal{N}(\boldsymbol{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}{\sum_{j} \pi_j\, \mathcal{N}(\boldsymbol{x}_n \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)}$$
• M-step: Re-estimate the parameters $\pi_k, \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k$ using the current responsibilities $r_{nk}$ (from the E-step), with $N_k := \sum_{n=1}^{N} r_{nk}$:
$$\boldsymbol{\mu}_k = \frac{1}{N_k} \sum_{n=1}^{N} r_{nk}\, \boldsymbol{x}_n, \qquad \boldsymbol{\Sigma}_k = \frac{1}{N_k} \sum_{n=1}^{N} r_{nk} (\boldsymbol{x}_n - \boldsymbol{\mu}_k)(\boldsymbol{x}_n - \boldsymbol{\mu}_k)^\top, \qquad \pi_k = \frac{N_k}{N}$$
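Putting the two steps together, here is a minimal 1-D EM sketch for the toy example (variable names and the fixed iteration count are my own choices; a production implementation would also monitor the log-likelihood for convergence):

```python
import numpy as np
from scipy.stats import norm

def em_gmm_1d(X, pi, mu, var, n_iter=20):
    """Run EM for a 1-D GMM and return the updated (pi, mu, var)."""
    for _ in range(n_iter):
        # E-step: responsibilities r[n, k] under the current parameters.
        weighted = pi * norm.pdf(X[:, None], loc=mu, scale=np.sqrt(var))
        r = weighted / weighted.sum(axis=1, keepdims=True)

        # M-step: re-estimate the parameters from the responsibilities.
        N_k = r.sum(axis=0)                                    # total responsibilities
        mu = (r * X[:, None]).sum(axis=0) / N_k                # weighted means
        var = (r * (X[:, None] - mu) ** 2).sum(axis=0) / N_k   # weighted variances
        pi = N_k / len(X)                                      # mixture weights
    return pi, mu, var

X = np.array([-3.0, -2.5, -1.0, 0.0, 2.0, 4.0, 5.0])
pi, mu, var = em_gmm_1d(X, np.full(3, 1.0 / 3.0),
                        np.array([-4.0, 0.0, 8.0]),
                        np.array([1.0, 0.2, 3.0]))
# The result should be close to the final GMM reported on a later slide.
print(np.round(pi, 2), np.round(mu, 2), np.round(var, 2))
```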

11.3 EM Algorithm

(Figure: the GMM fit at successive EM iterations; panels (c)–(f).)

• The dataset is colored according to the responsibilities of the mixture components when EM converges.
• A single mixture component is highly responsible for the data on the left.
• The overlap of the two data clusters on the right could have been generated
by two mixture components.
• It becomes clear that there are data points that cannot be uniquely assigned to a single component (either blue or yellow), such that the responsibilities of these two clusters for those points are around 0.5.

11.3 EM Algorithm
• The final GMM is given as
$$p(x) = 0.29\,\mathcal{N}(x \mid -2.75, 0.06) + 0.28\,\mathcal{N}(x \mid -0.50, 0.25) + 0.43\,\mathcal{N}(x \mid 3.64, 1.63)$$
(Figures: the final GMM fit — after five iterations, the EM algorithm converges and returns this GMM — and the negative log-likelihood as a function of the EM iterations.)

11.2.2 Updating the Means
• The update of the mean parameters $\boldsymbol{\mu}_k$, $k = 1, \dots, K$, of the GMM is given by
$$\boldsymbol{\mu}_k^{\text{new}} = \frac{\sum_{n=1}^{N} r_{nk}\, \boldsymbol{x}_n}{\sum_{n=1}^{N} r_{nk}}.$$
• Proof: Calculate the gradient of the log-likelihood with respect to $\boldsymbol{\mu}_k$.
• Considering
$$\mathcal{L}(\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k, \pi_k) = \log p(\mathcal{X} \mid \boldsymbol{\theta}) = \sum_{n=1}^{N} \log p(\boldsymbol{x}_n \mid \boldsymbol{\theta}), \qquad p(\boldsymbol{x}_n \mid \boldsymbol{\theta}) = \sum_{k=1}^{K} \pi_k\, \mathcal{N}(\boldsymbol{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k),$$
we have
$$\frac{\partial p(\boldsymbol{x}_n \mid \boldsymbol{\theta})}{\partial \boldsymbol{\mu}_k} = \sum_{j=1}^{K} \pi_j \frac{\partial \mathcal{N}(\boldsymbol{x}_n \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)}{\partial \boldsymbol{\mu}_k} = \pi_k \frac{\partial \mathcal{N}(\boldsymbol{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}{\partial \boldsymbol{\mu}_k}.$$
• Recall our knowledge of the multivariate Gaussian distribution and vector calculus:
$$p(\boldsymbol{x} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}) = (2\pi)^{-D/2} |\boldsymbol{\Sigma}|^{-1/2} \exp\!\big(-\tfrac{1}{2}(\boldsymbol{x} - \boldsymbol{\mu})^\top \boldsymbol{\Sigma}^{-1} (\boldsymbol{x} - \boldsymbol{\mu})\big), \qquad \frac{\partial\, \boldsymbol{x}^\top \boldsymbol{B} \boldsymbol{x}}{\partial \boldsymbol{x}} = \boldsymbol{x}^\top (\boldsymbol{B} + \boldsymbol{B}^\top).$$
• With these, we obtain
$$\frac{\partial p(\boldsymbol{x}_n \mid \boldsymbol{\theta})}{\partial \boldsymbol{\mu}_k} = \pi_k (\boldsymbol{x}_n - \boldsymbol{\mu}_k)^\top \boldsymbol{\Sigma}_k^{-1}\, \mathcal{N}(\boldsymbol{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k).$$

11.2.2 Updating the Means
• The desired partial derivative of $\mathcal{L}$ with respect to $\boldsymbol{\mu}_k$ is given as
$$\frac{\partial \mathcal{L}}{\partial \boldsymbol{\mu}_k} = \sum_{n=1}^{N} \frac{\partial \log p(\boldsymbol{x}_n \mid \boldsymbol{\theta})}{\partial \boldsymbol{\mu}_k} = \sum_{n=1}^{N} \frac{1}{p(\boldsymbol{x}_n \mid \boldsymbol{\theta})} \frac{\partial p(\boldsymbol{x}_n \mid \boldsymbol{\theta})}{\partial \boldsymbol{\mu}_k} = \sum_{n=1}^{N} (\boldsymbol{x}_n - \boldsymbol{\mu}_k)^\top \boldsymbol{\Sigma}_k^{-1} \underbrace{\frac{\pi_k\, \mathcal{N}(\boldsymbol{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}{\sum_{j=1}^{K} \pi_j\, \mathcal{N}(\boldsymbol{x}_n \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)}}_{=\, r_{nk}} = \sum_{n=1}^{N} r_{nk} (\boldsymbol{x}_n - \boldsymbol{\mu}_k)^\top \boldsymbol{\Sigma}_k^{-1}.$$
• We now solve the above gradient for $\boldsymbol{\mu}_k^{\text{new}}$ so that $\frac{\partial \mathcal{L}(\boldsymbol{\mu}_k^{\text{new}})}{\partial \boldsymbol{\mu}_k} = \boldsymbol{0}^\top$ and obtain
$$\sum_{n=1}^{N} r_{nk}\, \boldsymbol{x}_n = \sum_{n=1}^{N} r_{nk}\, \boldsymbol{\mu}_k^{\text{new}} \iff \boldsymbol{\mu}_k^{\text{new}} = \frac{\sum_{n=1}^{N} r_{nk}\, \boldsymbol{x}_n}{\sum_{n=1}^{N} r_{nk}} = \frac{1}{N_k} \sum_{n=1}^{N} r_{nk}\, \boldsymbol{x}_n,$$
where we define
$$N_k := \sum_{n=1}^{N} r_{nk}$$
as the total responsibility of the $k$th mixture component for the entire dataset.
• This concludes the proof.

11.2.2 Updating the Means
$$\boldsymbol{\mu}_k^{\text{new}} = \frac{\sum_{n=1}^{N} r_{nk}\, \boldsymbol{x}_n}{\sum_{n=1}^{N} r_{nk}}$$
• This is an importance-weighted Monte Carlo estimate of the mean.
• The importance weight of data point $\boldsymbol{x}_n$ is $r_{nk}$.
• Mean update in the example. Initialization: $\mathcal{X} = \{-3, -2.5, -1, 0, 2, 4, 5\}$, $\pi_1 = \pi_2 = \pi_3 = \tfrac{1}{3}$, $p_1(x) = \mathcal{N}(x \mid -4, 1)$, $p_2(x) = \mathcal{N}(x \mid 0, 0.2)$, $p_3(x) = \mathcal{N}(x \mid 8, 3)$, together with the responsibility matrix computed earlier. The means update as
$$\mu_1: -4 \to -2.7, \qquad \mu_2: 0 \to -0.4, \qquad \mu_3: 8 \to 3.7,$$
where, for example,
$$-2.7 = \frac{-3 \times 1 - 2.5 \times 1 - 1 \times 0.057 - 0 \times 0.001}{1 + 1 + 0.057 + 0.001}.$$
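The update of $\mu_1$ above can be checked with a few lines of arithmetic; a small sketch using the first column of the responsibility matrix:

```python
import numpy as np

X = np.array([-3.0, -2.5, -1.0, 0.0, 2.0, 4.0, 5.0])
r1 = np.array([1.0, 1.0, 0.057, 0.001, 0.0, 0.0, 0.0])  # responsibilities of component 1

mu1_new = (r1 * X).sum() / r1.sum()   # importance-weighted mean
print(round(mu1_new, 1))              # -2.7, matching the slide
```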

11.2.3 Updating the Covariances
• The update of the covariance parameters $\boldsymbol{\Sigma}_k$, $k = 1, \dots, K$, is given by
$$\boldsymbol{\Sigma}_k^{\text{new}} = \frac{1}{N_k} \sum_{n=1}^{N} r_{nk} (\boldsymbol{x}_n - \boldsymbol{\mu}_k)(\boldsymbol{x}_n - \boldsymbol{\mu}_k)^\top.$$
• Proof: We compute the partial derivatives of the log-likelihood $\mathcal{L}$ with respect to the covariances $\boldsymbol{\Sigma}_k$, set them to $\boldsymbol{0}$, and solve for $\boldsymbol{\Sigma}_k$. We start with
$$\frac{\partial \mathcal{L}}{\partial \boldsymbol{\Sigma}_k} = \sum_{n=1}^{N} \frac{\partial \log p(\boldsymbol{x}_n \mid \boldsymbol{\theta})}{\partial \boldsymbol{\Sigma}_k} = \sum_{n=1}^{N} \frac{1}{p(\boldsymbol{x}_n \mid \boldsymbol{\theta})} \frac{\partial p(\boldsymbol{x}_n \mid \boldsymbol{\theta})}{\partial \boldsymbol{\Sigma}_k}.$$
• We already know $1/p(\boldsymbol{x}_n \mid \boldsymbol{\theta})$. To obtain $\partial p(\boldsymbol{x}_n \mid \boldsymbol{\theta})/\partial \boldsymbol{\Sigma}_k$, we have
$$\frac{\partial p(\boldsymbol{x}_n \mid \boldsymbol{\theta})}{\partial \boldsymbol{\Sigma}_k} = \frac{\partial}{\partial \boldsymbol{\Sigma}_k} \Big[ \pi_k (2\pi)^{-D/2} \det(\boldsymbol{\Sigma}_k)^{-1/2} \exp\!\big(-\tfrac{1}{2} (\boldsymbol{x}_n - \boldsymbol{\mu}_k)^\top \boldsymbol{\Sigma}_k^{-1} (\boldsymbol{x}_n - \boldsymbol{\mu}_k)\big) \Big]$$
$$= \pi_k (2\pi)^{-D/2} \Big[ \frac{\partial \det(\boldsymbol{\Sigma}_k)^{-1/2}}{\partial \boldsymbol{\Sigma}_k} \exp\!\big(-\tfrac{1}{2} (\boldsymbol{x}_n - \boldsymbol{\mu}_k)^\top \boldsymbol{\Sigma}_k^{-1} (\boldsymbol{x}_n - \boldsymbol{\mu}_k)\big) + \det(\boldsymbol{\Sigma}_k)^{-1/2} \frac{\partial}{\partial \boldsymbol{\Sigma}_k} \exp\!\big(-\tfrac{1}{2} (\boldsymbol{x}_n - \boldsymbol{\mu}_k)^\top \boldsymbol{\Sigma}_k^{-1} (\boldsymbol{x}_n - \boldsymbol{\mu}_k)\big) \Big].$$

11.2.3 Updating the Covariances
• From vector calculus, we have the following identities:
$$\frac{\partial}{\partial \boldsymbol{\Sigma}_k} \det(\boldsymbol{\Sigma}_k)^{-1/2} = -\tfrac{1}{2} \det(\boldsymbol{\Sigma}_k)^{-1/2}\, \boldsymbol{\Sigma}_k^{-1},$$
$$\frac{\partial}{\partial \boldsymbol{\Sigma}_k} (\boldsymbol{x}_n - \boldsymbol{\mu}_k)^\top \boldsymbol{\Sigma}_k^{-1} (\boldsymbol{x}_n - \boldsymbol{\mu}_k) = -\boldsymbol{\Sigma}_k^{-1} (\boldsymbol{x}_n - \boldsymbol{\mu}_k)(\boldsymbol{x}_n - \boldsymbol{\mu}_k)^\top \boldsymbol{\Sigma}_k^{-1}.$$
• We obtain the desired partial derivative
$$\frac{\partial p(\boldsymbol{x}_n \mid \boldsymbol{\theta})}{\partial \boldsymbol{\Sigma}_k} = \pi_k\, \mathcal{N}(\boldsymbol{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \cdot \Big[ -\tfrac{1}{2} \big( \boldsymbol{\Sigma}_k^{-1} - \boldsymbol{\Sigma}_k^{-1} (\boldsymbol{x}_n - \boldsymbol{\mu}_k)(\boldsymbol{x}_n - \boldsymbol{\mu}_k)^\top \boldsymbol{\Sigma}_k^{-1} \big) \Big].$$
• Thus, the partial derivative of the log-likelihood with respect to $\boldsymbol{\Sigma}_k$ is given by
$$\frac{\partial \mathcal{L}}{\partial \boldsymbol{\Sigma}_k} = \sum_{n=1}^{N} \frac{\partial \log p(\boldsymbol{x}_n \mid \boldsymbol{\theta})}{\partial \boldsymbol{\Sigma}_k} = \sum_{n=1}^{N} \frac{1}{p(\boldsymbol{x}_n \mid \boldsymbol{\theta})} \frac{\partial p(\boldsymbol{x}_n \mid \boldsymbol{\theta})}{\partial \boldsymbol{\Sigma}_k}$$
$$= \sum_{n=1}^{N} \underbrace{\frac{\pi_k\, \mathcal{N}(\boldsymbol{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}{\sum_{j=1}^{K} \pi_j\, \mathcal{N}(\boldsymbol{x}_n \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)}}_{=\, r_{nk}} \cdot \Big[ -\tfrac{1}{2} \big( \boldsymbol{\Sigma}_k^{-1} - \boldsymbol{\Sigma}_k^{-1} (\boldsymbol{x}_n - \boldsymbol{\mu}_k)(\boldsymbol{x}_n - \boldsymbol{\mu}_k)^\top \boldsymbol{\Sigma}_k^{-1} \big) \Big]$$
$$= -\tfrac{1}{2} \sum_{n=1}^{N} r_{nk} \big( \boldsymbol{\Sigma}_k^{-1} - \boldsymbol{\Sigma}_k^{-1} (\boldsymbol{x}_n - \boldsymbol{\mu}_k)(\boldsymbol{x}_n - \boldsymbol{\mu}_k)^\top \boldsymbol{\Sigma}_k^{-1} \big)$$
$$= -\tfrac{1}{2} \boldsymbol{\Sigma}_k^{-1} \sum_{n=1}^{N} r_{nk} + \tfrac{1}{2} \boldsymbol{\Sigma}_k^{-1} \Big( \sum_{n=1}^{N} r_{nk} (\boldsymbol{x}_n - \boldsymbol{\mu}_k)(\boldsymbol{x}_n - \boldsymbol{\mu}_k)^\top \Big) \boldsymbol{\Sigma}_k^{-1} = -\tfrac{1}{2} N_k \boldsymbol{\Sigma}_k^{-1} + \tfrac{1}{2} \boldsymbol{\Sigma}_k^{-1} \Big( \sum_{n=1}^{N} r_{nk} (\boldsymbol{x}_n - \boldsymbol{\mu}_k)(\boldsymbol{x}_n - \boldsymbol{\mu}_k)^\top \Big) \boldsymbol{\Sigma}_k^{-1}.$$

11.2.3 Updating the Covariances
• Setting this partial derivative to $\boldsymbol{0}$, we obtain the necessary optimality condition
$$N_k \boldsymbol{\Sigma}_k^{-1} = \boldsymbol{\Sigma}_k^{-1} \Big( \sum_{n=1}^{N} r_{nk} (\boldsymbol{x}_n - \boldsymbol{\mu}_k)(\boldsymbol{x}_n - \boldsymbol{\mu}_k)^\top \Big) \boldsymbol{\Sigma}_k^{-1}$$
$$\iff N_k \boldsymbol{I} = \Big( \sum_{n=1}^{N} r_{nk} (\boldsymbol{x}_n - \boldsymbol{\mu}_k)(\boldsymbol{x}_n - \boldsymbol{\mu}_k)^\top \Big) \boldsymbol{\Sigma}_k^{-1}.$$
• By solving for $\boldsymbol{\Sigma}_k$, we obtain
$$\boldsymbol{\Sigma}_k^{\text{new}} = \frac{1}{N_k} \sum_{n=1}^{N} r_{nk} (\boldsymbol{x}_n - \boldsymbol{\mu}_k)(\boldsymbol{x}_n - \boldsymbol{\mu}_k)^\top.$$
• This gives us a simple update rule for $\boldsymbol{\Sigma}_k$ for $k = 1, \dots, K$ and proves our theorem.
• This update is the weighted covariance of the data points $\boldsymbol{x}_n$ associated with the $k$th component.
• The weights are the responsibilities $r_{nk}$.
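In code, this update is a responsibility-weighted outer product of the centered data; a minimal sketch for the general multivariate case (array names are my own):

```python
import numpy as np

def update_covariances(X, r, mu):
    """Sigma_k_new = (1 / N_k) * sum_n r_nk (x_n - mu_k)(x_n - mu_k)^T.

    X: (N, D) data, r: (N, K) responsibilities, mu: (K, D) updated means.
    Returns an array of shape (K, D, D).
    """
    N_k = r.sum(axis=0)                               # (K,)
    diff = X[:, None, :] - mu[None, :, :]             # (N, K, D) centered data
    # Weighted outer products (x_n - mu_k)(x_n - mu_k)^T, summed over n.
    Sigma = np.einsum('nk,nkd,nke->kde', r, diff, diff)
    return Sigma / N_k[:, None, None]
```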

11.2.3 Updating the Covariances
• In the example, the variances update as
$$\sigma_1^2: 1 \to 0.14, \qquad \sigma_2^2: 0.2 \to 0.44, \qquad \sigma_3^2: 3 \to 1.53.$$

11.2.4 Updating the Mixture Weights
• The mixture weights of the GMM are updated as
$$\pi_k^{\text{new}} = \frac{N_k}{N}, \qquad k = 1, \dots, K,$$
where $N$ is the number of data points.
• Proof: We calculate the partial derivative of the log-likelihood with respect to the weight parameters $\pi_k$, $k = 1, \dots, K$.
• We have the constraint $\sum_k \pi_k = 1$.
• Using Lagrange multipliers (will not be covered in this course), we form the Lagrangian
$$\mathfrak{L} = \mathcal{L} + \lambda \Big( \sum_{k=1}^{K} \pi_k - 1 \Big) = \sum_{n=1}^{N} \log \sum_{k=1}^{K} \pi_k\, \mathcal{N}(\boldsymbol{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) + \lambda \Big( \sum_{k=1}^{K} \pi_k - 1 \Big).$$

• We obtain the partial derivative of the Lagrangian with respect to $\pi_k$ as
$$\frac{\partial \mathfrak{L}}{\partial \pi_k} = \sum_{n=1}^{N} \frac{\mathcal{N}(\boldsymbol{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}{\sum_{j=1}^{K} \pi_j\, \mathcal{N}(\boldsymbol{x}_n \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)} + \lambda = \frac{1}{\pi_k} \underbrace{\sum_{n=1}^{N} \frac{\pi_k\, \mathcal{N}(\boldsymbol{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}{\sum_{j=1}^{K} \pi_j\, \mathcal{N}(\boldsymbol{x}_n \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)}}_{=\, N_k} + \lambda = \frac{N_k}{\pi_k} + \lambda.$$
• The partial derivative with respect to the Lagrange multiplier $\lambda$ is
$$\frac{\partial \mathfrak{L}}{\partial \lambda} = \sum_{k=1}^{K} \pi_k - 1.$$
• Setting both partial derivatives to $0$ yields the system of equations
$$\pi_k = -\frac{N_k}{\lambda}, \qquad 1 = \sum_{k=1}^{K} \pi_k.$$

• Using these two equations, we obtain
$$\sum_{k=1}^{K} \pi_k = 1 \iff -\sum_{k=1}^{K} \frac{N_k}{\lambda} = 1 \iff -\frac{N}{\lambda} = 1 \iff \lambda = -N.$$
• This allows us to substitute $\lambda = -N$ in $\pi_k = -\frac{N_k}{\lambda}$ to obtain
$$\pi_k^{\text{new}} = \frac{N_k}{N},$$
which gives us the update for the weight parameters $\pi_k$ and proves the theorem.
• Weight update in the example: using the responsibility matrix computed earlier, with $N_1 = 2.058$, $N_2 = 2.008$, $N_3 = 2.934$ and $N = 7$,
$$\pi_1: \tfrac{1}{3} \to 0.29, \qquad 0.29 = \frac{1 + 1 + 0.057 + 0.001}{7},$$
$$\pi_2: \tfrac{1}{3} \to 0.29, \qquad \pi_3: \tfrac{1}{3} \to 0.42.$$
• We see that the third component gets more weight/importance, while the other components become slightly less important.
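The same numbers follow from the update rule in one line; a small check (the $N_k$ values are copied from the responsibility example):

```python
import numpy as np

N_k = np.array([2.058, 2.008, 2.934])  # total responsibilities from the example
N = 7                                  # number of data points

print(np.round(N_k / N, 2))            # [0.29, 0.29, 0.42], matching the slide
```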

Generating a new dataset with GMM
• For a given GMM with parameters $\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k, \pi_k$, $k = 1, \dots, K$, we want to generate a dataset with $N$ data points.
• We sample an index $k$ from $\{1, 2, \dots, K\}$ with probabilities $\pi_1, \dots, \pi_K$.
• We generate $N \pi_k$ data points for the $k$th component.
• For the $k$th component, every data point is sampled as $\boldsymbol{x} \sim \mathcal{N}(\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$.
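A minimal 1-D sampling sketch (the parameter values are the converged GMM from the earlier slide; the sample size and seed are arbitrary choices of mine): first draw a component index per sample with probabilities $\pi$, then draw from the chosen Gaussian.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

pi = np.array([0.29, 0.28, 0.43])      # mixture weights (sum to 1)
mu = np.array([-2.75, -0.50, 3.64])    # component means
var = np.array([0.06, 0.25, 1.63])     # component variances

N = 1000
ks = rng.choice(len(pi), size=N, p=pi)                     # component index per sample
samples = rng.normal(loc=mu[ks], scale=np.sqrt(var[ks]))   # x ~ N(mu_k, Sigma_k)

# On average, roughly N * pi_k samples come from component k.
print(np.bincount(ks) / N)
```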

Comparing GMM with k-Means
Algorithms.
1. k-Means
a. Given hard labels, compute centroids
b. Given centroids, compute hard labels
2. GMM
a. Given soft labels, compute Gaussians
b. Given Gaussians, compute soft labels
• Like k-means, GMM may get stuck in local minima.
• Unlike k-means, the local minima are more favorable because soft labels allow points to move between clusters slowly.
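To make the contrast concrete, here is a small side-by-side sketch (toy centroids and variances, my own names): the k-means assignment step produces one hard label per point, while the GMM E-step produces a row of responsibilities that sums to 1.

```python
import numpy as np
from scipy.stats import norm

X = np.array([-3.0, -2.5, -1.0, 0.0, 2.0, 4.0, 5.0])
centers = np.array([-2.0, 0.5, 4.0])   # k-means centroids / GMM means (toy values)
var = np.array([1.0, 1.0, 1.0])        # GMM variances (toy values)
pi = np.full(3, 1.0 / 3.0)

# k-means: hard labels = index of the nearest centroid.
hard = np.argmin(np.abs(X[:, None] - centers), axis=1)

# GMM: soft labels = responsibilities, one probability vector per point.
weighted = pi * norm.pdf(X[:, None], loc=centers, scale=np.sqrt(var))
soft = weighted / weighted.sum(axis=1, keepdims=True)

print(hard)
print(np.round(soft, 2))
```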

Check your understanding
• If 𝐾 takes a greater value, the likelihood becomes greater after convergence.
• Assume we have 𝑁 data points. The maximum likelihood will be achieved if we set 𝐾 = 𝑁.
• In GMM, the EM algorithm gives us the global minimum, because we can update $\pi_k$, $\boldsymbol{\mu}_k$, and $\boldsymbol{\Sigma}_k$ through closed-form solutions.
• GMM has a higher computational complexity than k-means.
• When the 𝑁 data points are close to each other in the feature space, we should set 𝐾 to a small value.
