LECTURE 6

(1)

GAUSSIAN MIXTURE MODELS

ELEC801 Pattern Recognition, Fall 2018, KNU

Instructor: Gil-Jin Jang

Slide credits: Hak-Yong Han; Andrew W. Moore, CMU; Tae-Kyun Kim, Imperial College London

(2)

GAUSSIAN MIXTURE MODEL (GMM)

(Also called mixture of Gaussians, MoG)

MoG modeling (GMM)

Learning GMM parameters by maximum likelihood estimation

EM algorithm for GMM

(3)

Mixture Modeling

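A formalism for modeling a probability density function as a weighted sum of parameterized functions:

$$ p(\mathbf{x}) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(\mathbf{x} \mid \mu_k, \Sigma_k) $$

– 𝐱: observations

– K: number of hidden components

– π_k: class weights (class prior probabilities, multinomial)

– N(𝐱 | μ_k, Σ_k): multivariate Normal with Normal parameters μ_k and Σ_k

Normal = Gaussian

The Gaussian is the most commonly used parametric distribution for the components.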

(4)

• There are K components

– The k-th component, C_k, is associated with mean vector μ_k and covariance matrix Σ_k

– Each component generates data from a Gaussian with mean μ_k and covariance Σ_k

• Assume that each sample is generated as follows:

– Pick a component at random: choose component k with probability π_k

– Sample 𝐱 ~ N(μ_k, Σ_k)
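The generative procedure above can be sketched in NumPy as follows. This is only an illustration under stated assumptions: the function name `sample_gmm` and the example weights, means and covariances are hypothetical and not taken from the lecture.

```python
import numpy as np

def sample_gmm(n, weights, means, covs, seed=0):
    """Draw n samples from a Gaussian mixture (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    weights = np.asarray(weights, dtype=float)
    means = np.asarray(means, dtype=float)
    covs = np.asarray(covs, dtype=float)
    # Step 1: choose component k with probability pi_k for every sample.
    labels = rng.choice(len(weights), size=n, p=weights)
    # Step 2: sample x ~ N(mu_k, Sigma_k) from the chosen component.
    samples = np.empty((n, means.shape[1]))
    for k in range(len(weights)):
        idx = np.flatnonzero(labels == k)
        if idx.size:
            samples[idx] = rng.multivariate_normal(means[k], covs[k], size=idx.size)
    return samples, labels

# Illustrative parameters (assumed, not from the slides).
weights = [0.3, 0.7]
means = [[0.0, 0.0], [5.0, 5.0]]
covs = [np.eye(2), 0.5 * np.eye(2)]
X, z = sample_gmm(1000, weights, means, covs)
print(X.shape, np.bincount(z) / len(z))
```

The returned `z` holds the hidden component index of each sample; in a real data set only `X` would be observed.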

(5)

GMM Explanation by Graphical Models

Denote z as a 1-of-K representation: z_k ∈ {0, 1} and Σ_k z_k = 1.

We define the joint distribution p(x, z) by a marginal distribution p(z) and a conditional distribution p(x|z).

p(z) p(x|z) = p(x, z)

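[Figure: graphical model with an edge from z to x. z is the hidden variable, with distribution p(z); x is the observable variable (the data), with distribution p(x|z).]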

(6)

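The marginal distribution over z is written in terms of the mixing coefficients π_k:

$$ p(z_k = 1) = \pi_k $$

where

$$ 0 \le \pi_k \le 1, \qquad \sum_{k=1}^{K} \pi_k = 1 $$

The marginal distribution is in the form of

$$ p(\mathbf{z}) = \prod_{k=1}^{K} \pi_k^{z_k} $$

Similarly,

$$ p(\mathbf{x} \mid z_k = 1) = \mathcal{N}(\mathbf{x} \mid \mu_k, \Sigma_k), \qquad p(\mathbf{x} \mid \mathbf{z}) = \prod_{k=1}^{K} \mathcal{N}(\mathbf{x} \mid \mu_k, \Sigma_k)^{z_k} $$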

(7)

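The marginal distribution of x is

$$ p(\mathbf{x}) = \sum_{\mathbf{z}} p(\mathbf{z}) \, p(\mathbf{x} \mid \mathbf{z}) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(\mathbf{x} \mid \mu_k, \Sigma_k) $$

which is a linear superposition of Gaussians.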

(8)

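The conditional probability p(z_k = 1|x), denoted by γ(z_k), is obtained by Bayes' theorem:

$$ \gamma(z_k) \equiv p(z_k = 1 \mid \mathbf{x}) = \frac{p(z_k = 1) \, p(\mathbf{x} \mid z_k = 1)}{\sum_{j=1}^{K} p(z_j = 1) \, p(\mathbf{x} \mid z_j = 1)} = \frac{\pi_k \, \mathcal{N}(\mathbf{x} \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(\mathbf{x} \mid \mu_j, \Sigma_j)} $$

We view π_k as the prior probability of z_k = 1, and γ(z_k) as the corresponding posterior probability once x has been observed.

γ(z_k) is the responsibility that component k takes for explaining the observation x.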
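As a small numerical illustration of the responsibility γ(z_k), the sketch below applies Bayes' theorem to a single observation. The helper names `gaussian_pdf` and `responsibilities` and the two-component parameters are assumptions made for this example only.

```python
import numpy as np

def gaussian_pdf(x, mean, cov):
    """Density of N(x | mean, cov) at a single point x."""
    d = len(mean)
    diff = x - mean
    norm = np.sqrt((2.0 * np.pi) ** d * np.linalg.det(cov))
    return np.exp(-0.5 * diff @ np.linalg.solve(cov, diff)) / norm

def responsibilities(x, weights, means, covs):
    """Posterior gamma(z_k) = p(z_k = 1 | x) for every component k."""
    # Numerator of Bayes' theorem: prior pi_k times likelihood N(x | mu_k, Sigma_k).
    joint = np.array([w * gaussian_pdf(x, m, c)
                      for w, m, c in zip(weights, means, covs)])
    # Denominator: the marginal p(x), i.e. the sum over all components.
    return joint / joint.sum()

# Two equally weighted unit-covariance components (illustrative values).
weights = np.array([0.5, 0.5])
means = np.array([[0.0, 0.0], [4.0, 0.0]])
covs = np.array([np.eye(2), np.eye(2)])

gamma_mid = responsibilities(np.array([2.0, 0.0]), weights, means, covs)
gamma_left = responsibilities(np.array([0.0, 0.0]), weights, means, covs)
print(gamma_mid, gamma_left)
```

A point halfway between the two means is shared equally by both components, while a point sitting on the first mean is explained almost entirely by the first component.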

(9)

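Maximum Likelihood Estimation

Given a data set X = {x_1, …, x_N}, the log of the likelihood function is

$$ \ln p(\mathbf{X} \mid \pi, \mu, \Sigma) = \sum_{n=1}^{N} \ln \left\{ \sum_{k=1}^{K} \pi_k \, \mathcal{N}(\mathbf{x}_n \mid \mu_k, \Sigma_k) \right\} $$

which is maximised with respect to π, μ and Σ

s.t.

$$ \sum_{k=1}^{K} \pi_k = 1 $$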
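The log likelihood of a Gaussian mixture, ln p(X|π, μ, Σ) = Σ_n ln Σ_k π_k N(x_n|μ_k, Σ_k), can be evaluated as in the following sketch. Working in the log domain with a log-sum-exp step is a design choice of this example (it avoids underflow for points far from every component), not something prescribed by the slides; the function names are hypothetical.

```python
import numpy as np

def log_gaussian(X, mean, cov):
    """Row-wise ln N(x_n | mean, cov) for data X of shape (N, D)."""
    d = X.shape[1]
    diff = X - mean
    _, logdet = np.linalg.slogdet(cov)
    # Squared Mahalanobis distance of every row to the mean.
    maha = np.einsum("nd,nd->n", diff, np.linalg.solve(cov, diff.T).T)
    return -0.5 * (d * np.log(2.0 * np.pi) + logdet + maha)

def gmm_log_likelihood(X, weights, means, covs):
    """ln p(X) = sum_n ln sum_k pi_k N(x_n | mu_k, Sigma_k)."""
    log_joint = np.stack([np.log(w) + log_gaussian(X, m, c)
                          for w, m, c in zip(weights, means, covs)], axis=1)
    # Log-sum-exp over the components, then sum over the data points.
    top = log_joint.max(axis=1, keepdims=True)
    log_px = top[:, 0] + np.log(np.exp(log_joint - top).sum(axis=1))
    return float(log_px.sum())

# One-dimensional toy data with a single standard normal component.
X = np.array([[0.0], [1.0]])
ll_single = gmm_log_likelihood(X, [1.0], [np.array([0.0])], [np.eye(1)])
# Splitting the same Gaussian into two identical halves must not change p(x).
ll_split = gmm_log_likelihood(X, [0.5, 0.5],
                              [np.array([0.0])] * 2, [np.eye(1)] * 2)
print(ll_single, ll_split)
```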

(10)

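Setting the derivatives of ln p(X|π, μ, Σ) with respect to μ_k to zero, we obtain

$$ \mu_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk}) \, \mathbf{x}_n, \qquad N_k = \sum_{n=1}^{N} \gamma(z_{nk}) $$

where γ(z_nk) is the responsibility of component k for the data point x_n, and N_k is the effective number of points assigned to component k.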

(11)

Objective function: f(x)

Constraints: g(x)

max f(x) s.t. g(x) = 0  ⇒  max f(x) + λ g(x)

Refer to Optimisation course or http://en.wikipedia.org/wiki/Lagrange_multiplier

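Setting the derivatives of ln p(X|π, μ, Σ) with respect to Σ_k to zero, we obtain

$$ \Sigma_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk}) \, (\mathbf{x}_n - \mu_k)(\mathbf{x}_n - \mu_k)^{\mathsf{T}}, \qquad N_k = \sum_{n=1}^{N} \gamma(z_{nk}) $$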

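Finally, we maximise ln p(X|π, μ, Σ) with respect to the mixing coefficients π_k. We use a Lagrange multiplier λ to enforce the constraint Σ_k π_k = 1 and maximise

$$ \ln p(\mathbf{X} \mid \pi, \mu, \Sigma) + \lambda \left( \sum_{k=1}^{K} \pi_k - 1 \right) $$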

(12)

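which gives

$$ 0 = \sum_{n=1}^{N} \frac{\mathcal{N}(\mathbf{x}_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(\mathbf{x}_n \mid \mu_j, \Sigma_j)} + \lambda $$

we find λ = −N and

$$ \pi_k = \frac{N_k}{N}, \qquad N_k = \sum_{n=1}^{N} \gamma(z_{nk}) $$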
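The value λ = −N follows in two short steps from the stationarity condition for π_k. The derivation below is a reconstruction of the standard argument, using N_k = Σ_n γ(z_nk) for the sum of responsibilities of component k.

```latex
\begin{align*}
0 &= \sum_{n=1}^{N} \frac{\mathcal{N}(\mathbf{x}_n \mid \mu_k, \Sigma_k)}
        {\sum_{j=1}^{K} \pi_j \, \mathcal{N}(\mathbf{x}_n \mid \mu_j, \Sigma_j)} + \lambda
  && \text{(derivative w.r.t. } \pi_k \text{ set to zero)} \\
0 &= \sum_{n=1}^{N} \gamma(z_{nk}) + \lambda \pi_k = N_k + \lambda \pi_k
  && \text{(multiply by } \pi_k \text{)} \\
0 &= \sum_{k=1}^{K} N_k + \lambda \sum_{k=1}^{K} \pi_k = N + \lambda
  && \text{(sum over } k \text{, using } \textstyle\sum_k \pi_k = 1 \text{)} \\
\lambda &= -N, \qquad \pi_k = -\frac{N_k}{\lambda} = \frac{N_k}{N}
\end{align*}
```

The third line uses the fact that the responsibilities of each data point sum to one over the components, so Σ_k N_k = N.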

(13)

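EM (Expectation Maximisation) for Gaussian Mixtures

1. Initialise the means μ_k, covariances Σ_k and mixing coefficients π_k.

2. E step: Evaluate the responsibilities (membership probabilities) using the current parameter values

$$ \gamma(z_{nk}) = \frac{\pi_k \, \mathcal{N}(\mathbf{x}_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(\mathbf{x}_n \mid \mu_j, \Sigma_j)} $$

3. M step: Re-estimate the parameters using the current responsibilities

$$ \mu_k^{\text{new}} = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk}) \, \mathbf{x}_n, \qquad \Sigma_k^{\text{new}} = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk}) \, (\mathbf{x}_n - \mu_k^{\text{new}})(\mathbf{x}_n - \mu_k^{\text{new}})^{\mathsf{T}}, \qquad \pi_k^{\text{new}} = \frac{N_k}{N} $$

where N_k = Σ_n γ(z_nk).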

(14)

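EM (Expectation Maximisation) for Gaussian Mixtures

4. Evaluate the log likelihood

$$ \ln p(\mathbf{X} \mid \pi, \mu, \Sigma) = \sum_{n=1}^{N} \ln \left\{ \sum_{k=1}^{K} \pi_k \, \mathcal{N}(\mathbf{x}_n \mid \mu_k, \Sigma_k) \right\} $$

and check for convergence of either the parameters or the log likelihood. If the convergence criterion is not satisfied, return to step 2.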
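The four EM steps can be put together in a compact NumPy sketch. Several choices here are assumptions of this example rather than part of the lecture: the caller supplies the initial means, the covariances start as identity matrices, the weights start uniform, a small ridge `reg` is added to each covariance for numerical stability, and convergence is tested on the log likelihood only. The names `fit_gmm_em` and `log_gaussian` are hypothetical.

```python
import numpy as np

def log_gaussian(X, mean, cov):
    """Row-wise ln N(x_n | mean, cov) for X of shape (N, D)."""
    D = X.shape[1]
    diff = X - mean
    _, logdet = np.linalg.slogdet(cov)
    maha = np.einsum("nd,nd->n", diff, np.linalg.solve(cov, diff.T).T)
    return -0.5 * (D * np.log(2.0 * np.pi) + logdet + maha)

def fit_gmm_em(X, init_means, n_iter=100, tol=1e-6, reg=1e-6):
    X = np.asarray(X, dtype=float)
    N, D = X.shape
    # Step 1: initialise (assumed: given means, identity covariances, uniform weights).
    means = np.array(init_means, dtype=float)
    K = means.shape[0]
    covs = np.stack([np.eye(D)] * K)
    weights = np.full(K, 1.0 / K)
    prev_ll = ll = -np.inf
    for _ in range(n_iter):
        # Step 2 (E step): responsibilities gamma[n, k], computed in the log domain.
        log_joint = np.stack(
            [np.log(weights[k]) + log_gaussian(X, means[k], covs[k])
             for k in range(K)], axis=1)
        top = log_joint.max(axis=1, keepdims=True)
        log_px = top[:, 0] + np.log(np.exp(log_joint - top).sum(axis=1))
        gamma = np.exp(log_joint - log_px[:, None])
        # Step 4: log likelihood of the parameters that entered this E step.
        ll = float(log_px.sum())
        if ll - prev_ll < tol:
            break
        prev_ll = ll
        # Step 3 (M step): re-estimate means, covariances and weights.
        Nk = gamma.sum(axis=0)
        means = gamma.T @ X / Nk[:, None]
        for k in range(K):
            diff = X - means[k]
            covs[k] = (gamma[:, [k]] * diff).T @ diff / Nk[k] + reg * np.eye(D)
        weights = Nk / N
    return weights, means, covs, ll

# Synthetic two-cluster data (illustrative, not the lecture's data set).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(200, 2)),
               rng.normal(6.0, 1.0, size=(200, 2))])
weights, means, covs, ll = fit_gmm_em(X, init_means=[[1.0, 1.0], [5.0, 5.0]])
print(weights, means, ll)
```

EM only finds a local maximum of the likelihood, so the result depends on the initialisation; in practice several restarts or a k-means initialisation are common.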

(15)

Gaussian Mixture Example: Start

(Advance apologies: in black and white this example will be incomprehensible.)

(16)

After 1st iteration

(17)

After 2nd iteration

(18)

After 3rd iteration

(19)

After 4th iteration

(20)

After 5th iteration

(21)

After 6th iteration

(22)

After 20th iteration

(23)

EM Example: Old Faithful Geyser

(24)

Hard clustering: a data point is assigned to exactly one cluster.

Soft clustering: a data point is explained probabilistically by a mix of multiple Gaussians.

Two standard methods are k-means (hard) and the Gaussian Mixture Model, GMM (soft).

K-means assigns each data point to its nearest cluster centre, while GMM represents the data by multiple Gaussian densities.
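The difference between hard and soft clustering can be seen on three points lying between two cluster centres. In this sketch the soft assignment assumes spherical Gaussians with a shared variance `var` and uniform weights by default; these simplifications, the function names and the example coordinates are assumptions for illustration.

```python
import numpy as np

def hard_assign(X, centers):
    """k-means style: index of the nearest centre for every point."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def soft_assign(X, centers, var=1.0, weights=None):
    """GMM style: responsibilities under spherical Gaussians N(mu_k, var * I)."""
    K = len(centers)
    weights = np.full(K, 1.0 / K) if weights is None else np.asarray(weights, dtype=float)
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    # The Gaussian normaliser is identical for all components, so it cancels.
    log_joint = np.log(weights) - 0.5 * d2 / var
    log_joint -= log_joint.max(axis=1, keepdims=True)
    joint = np.exp(log_joint)
    return joint / joint.sum(axis=1, keepdims=True)

centers = np.array([[0.0, 0.0], [4.0, 0.0]])
X = np.array([[0.5, 0.0], [1.9, 0.0], [3.5, 0.0]])

labels = hard_assign(X, centers)   # one cluster index per point
gamma = soft_assign(X, centers)    # one probability per point and cluster
print(labels)
print(gamma.round(3))
```

The middle point is assigned outright to the first cluster by the hard rule, whereas the soft rule splits it roughly 60/40 between the two components, which reflects how close it is to the decision boundary.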

(25)

Example: Density Estimation

(26)

(27)

Some Bio Assay data

(28)

Clustering of the assay data

(29)

Resulting Density Estimator

(30)

Three classes of assay (each learned with its own mixture model)

(Sorry, this will again be semi-useless in black and white)

(31)

Resulting Bayes Classifier

(32)
