• 검색 결과가 없습니다.

LECTURE 6

N/A
N/A
Protected

Academic year: 2022

Share "LECTURE 6"

Copied!
35
0
0

로드 중.... (전체 텍스트 보기)

전체 글

(1)

LECTURE 6

GAUSSIAN MIXTURE MODELS

ELEC801 Pattern Recognition, Fall 2018, KNU Instructor: Gil-Jin Jang

Slide credits:

Hak-Yong Han; Andrew W. Moore, CMU; Tae-Kyun Kim, Imperial College

London

(2)

GAUSSIAN MIXTURE MODEL (GMM)

(Also called mixture of Gaussians (MOG)) MoG modeling (GMM)

Learning GMM parameters by maximum likelihood estimation

EM algorithm for GMM

(3)

Mixture Modeling

Class weight, class prior probability, multinomial

Multivariate Normal Number of hidden

components Normal parameters

Observations Class weights

Normal = Gaussian

A formalism for modeling a

probability density function as a sum of parameterized functions.

Gaussian is most general parametric

distribution.

(4)

 There are 𝐾 components

– The 𝑘 𝑡ℎ component, 𝐶 𝑘 , is associated with mean

vector 𝜇 𝑘 and covariance matrix Σ 𝑘

– Each component generates data from a Gaussian with mean 𝜇 𝑘 and covariance Σ 𝑘

 Assume that each sample is generated as follows:

– Pick a component at random.

– Choose component 𝑘 with probability 𝜋 𝑘

– Sample 𝐱~𝑁(𝜇 𝑘 , Σ 𝑘 )

(5)

GMM Explanation by Graphical Models

Denote z as 1-of-K representation: z k ∈ {0, 1} and Σ k z k = 1.

We define the joint distribution p(x, z) by a multa marginal distribution p(z) and a conditional distribution p(x|z).

p(z) p(x|z) = p(x, z)

Hidden variable

Observable variable: data

p(z)

p(x|z)

(6)

The marginal distribution over z is written by the mixing coefficients π k

where

The marginal distribution is in the form of

Similarly,

(7)

The marginal distribution of x is

, which is as a linear superposition of Gaussians.

(8)

The conditional probability p(z k = 1|x) denoted by γ(z k ) is obtained by Bayes' theorem,

We view π k as the prior probability of z k = 1, and γ(z k ) as the posterior probability.

γ(z k ) is the responsibility that k-component takes for

explaining the observation x.

(9)

Maximum Likelihood Estimation

s.t.

Given a data set of X = {x 1,…, x N }, the log of the likelihood

function is

(10)

Setting the derivatives of ln p(X|π, μ, Σ) with respect to μ k

to zero, we obtain

(11)

objective function f(x)

constraints g(x)

max f(x) s.t. g(x)=0 max f(x) + 𝜆g(x)

Refer to Optimisation course or http://en.wikipedia.org/wiki/Lagrange_multiplier

Setting the derivatives of ln p(X|π, μ, Σ) with respect to Σ k to zero, we obtain

Finally, we maximise ln p(X|π, μ, Σ) with respect to the

mixing coefficients π k . We use a Lagrange multiplier

(12)

which gives

we find λ = -N and

(13)

1. Initialise the means μ k , covariances Σ k and mixing coefficients π k .

2. Ε step: Evaluate the membership probabilities using the current parameter values

3. M step: RE-estimate the parameters using the current responsibilities

EM (Expectation Maximisation) for Gaussian Mixtures

(14)

4. Evaluate the log likelihood

and check for convergence of either the parameters or the log likelihood. If the convergence criterion is not satisfied, return to step 2.

Mixtures

(15)

Gaussian Mixture Example:

Start

Advance apologies: in Black and White this example will be

incomprehensible

(16)

After first

iteration

(17)

After 2nd

iteration

(18)

After 3rd

iteration

(19)

After 4th

iteration

(20)

After 5th

iteration

(21)

After 6th

iteration

(22)

After 20th

iteration

(23)

EM Example: Old Faithful Geyser

(24)

Hard clustering: a data point is

assigned a cluster.

Soft clustering: a data point is

explained by a mix of multiple

Gaussians probabilistically.

Two standard methods are k-means and Gaussian Mixture Model (GMM).

K-means assigns data points to the nearest clusters, while GMM represents data

by multiple Gaussian densities.

(25)

Example: Density Estimation

(26)
(27)

Some Bio

Assay data

(28)

clustering of the assay

data

(29)

Resulting

Density

Estimator

(30)

Three

classes of assay

(each learned with it’s own mixture model)

(Sorry, this will again be semi-

useless in black and white)

(31)

Resulting Bayes

Classifier

(32)

 GMM optimality

– Can we find optimal GMM in polynomial time?

– Finding optimal GMM is NP-hard

 Good initialization

– Use LBG k-means results for initial estimates – Use splitting criteria like LBG

– Do PCA, and then find initial conditions

 In GMM we need to estimate the parameters, which all are real numbers

– Number of parameters: M+M(D) + M(D(D-1)/2)

(33)

 K-means

 Hierarchical clustering

 PCA (principal component analysis)

 Non-linear PCA

– Neural Auto-Associators – Locally weighted PCA – Others…

Other unsupervised learning methods

(34)

Matrix and vector derivatives are obtained first by element-wise derivatives

and then reforming them into matrices and vectors.

(35)

Appendix: Matrix and Vector Derivatives

참조

관련 문서

 Communication cost = input file size + 2 × (sum of the sizes of all files passed from Map processes to Reduce processes) + the sum of the output sizes of the Reduce

 Because output is spread across R files (each reduce task creates one file in DFS).. Task

 Step 2: label each node by the # of shortest paths from the root E..

 Data stream consists of a universe of elements chosen from a set of size N.  Maintain a count of the number of distinct elements

 In fact, the minimum solution is given by y = 1 vector (the smallest eigenvector w/ eigenvalue 0); however, this does not say anything about the partition.  Thus, we find

 Main idea: Recommend items to customer x similar to previous items rated highly by x.  Andy enjoyed

 New Problem: Given a stream, how can we find recent frequent items (= which appear more. than s times in

 The Google solution for spider traps: At each time step, the random surfer has two options!.