Introduction to Data Mining

(1)

Introduction to Data Mining

Lecture #21: Dimensionality Reduction U Kang

Seoul National University

(2)

In This Lecture

Understand the motivation and applications of dimensionality reduction

Learn the definition and properties of SVD, one of the most important tools in data mining

Learn how to interpret the results of SVD, and how to use it for dimensionality reduction

(3)

Outline

Overview

Dim. Reduction with SVD

(4)

Dimensionality Reduction

Assumption: Data lies on or near a low d-dimensional subspace

Axes of this subspace are effective representation of the data

(5)

Dimensionality Reduction

Compress / reduce dimensionality:

10^6 rows; 10^3 columns; no updates

Random access to any cell(s); small error: OK

The example matrix is really "2-dimensional": every row can be reconstructed by scaling either [1 1 1 0 0] or [0 0 0 1 1]

(6)

Rank of a Matrix

Q: What is rank of a matrix A?

A: Number of linearly independent columns of A

For example:

Matrix A =
[ 1  2  1]
[-2 -3  1]
[ 3  5  0]
has rank r = 2.

Why? The first two rows are linearly independent, so the rank is at least 2; but all three rows are linearly dependent (the first equals the sum of the second and third), so the rank must be less than 3.

Why do we care about low rank?

We can write A using two "basis" vectors: [1 2 1] and [-2 -3 1]

and the new coordinates of its rows: [1 0], [0 1], [1 -1]
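A minimal NumPy sketch of this example (the third row [3 5 0] is implied by the basis vectors and coordinates above, so it is an assumption of the sketch):

import numpy as np

# The rank-2 example matrix; the first row equals the sum of the other two
A = np.array([[ 1,  2, 1],
              [-2, -3, 1],
              [ 3,  5, 0]], dtype=float)
print(np.linalg.matrix_rank(A))                # 2

# Two "basis" vectors plus one coordinate pair per row reproduce A exactly
basis  = np.array([[ 1,  2, 1],
                   [-2, -3, 1]], dtype=float)  # 2 x 3
coords = np.array([[1,  0],
                   [0,  1],
                   [1, -1]], dtype=float)      # 3 x 2
print(np.allclose(coords @ basis, A))          # True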

(7)

Rank is “Dimensionality”

Cloud of points in 3D space:

Think of point positions as a matrix:

We can rewrite coordinates more efficiently!

Old basis vectors: [1 0 0] [0 1 0] [0 0 1]

New basis vectors: [1 2 1] [-2 -3 1]

Then A has new coordinates: [1 0], B: [0 1], C: [1 -1]

Notice: We reduced the number of coordinates!

1 row per point:
A = [ 1  2  1]
B = [-2 -3  1]
C = [ 3  5  0]

(8)

Dimensionality Reduction

The goal of dimensionality reduction is to discover the axes of the data!

Rather than representing every point with 2 coordinates, we represent each point with 1 coordinate (its position along the red line in the figure).

By doing this we incur a bit of error as the points do not exactly lie on the line

(9)

Why Reduce Dimensions?

Why reduce dimensions?

Discover hidden correlations/topics

Words that occur commonly together

Remove redundant and noisy features

Not all words are useful

Interpretation and visualization

Easier storage and processing of the data

(10)

SVD (Singular Value Decomposition)

A [m x n] = U [m x r] Σ [r x r] (V [n x r])^T

A: Input data matrix

m x n matrix (e.g., m documents, n terms)

U: Left singular vectors

m x r matrix (m documents, r concepts)

Σ: Singular values

r x r diagonal matrix (strength of each ‘concept’) (r : rank of the matrix A)

V: Right singular vectors

n x r matrix (n terms, r concepts)
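As a quick illustration (not part of the original slides), NumPy's reduced SVD returns factors with exactly these shapes:

import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 4))          # m = 6 "documents", n = 4 "terms"

# Reduced SVD: U is m x min(m, n), s holds the singular values, Vt is V^T
U, s, Vt = np.linalg.svd(A, full_matrices=False)
print(U.shape, s.shape, Vt.shape)        # (6, 4) (4,) (4, 4)

# A is recovered exactly from the three factors
print(np.allclose(U @ np.diag(s) @ Vt, A))   # True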

(11)

SVD (Singular Value Decomposition)

[Diagram: the m x n matrix A factored into U (m x r), Σ (r x r), and V^T (r x n)]

(12)

SVD (Singular Value Decomposition)

A [m x n] = σ1 u1 v1^T + σ2 u2 v2^T + ...

σi … scalar, ui … vector, vi … vector

Also called “spectral decomposition”

(13)

SVD - Properties

It is always possible to decompose a real matrix A into A = U Σ V^T, where

U, Σ, V: unique

U, V: column orthonormal

U^T U = I; V^T V = I (I: identity matrix)

(Columns are orthogonal unit vectors)

Σ: diagonal

Entries (singular values) are positive,

and sorted in decreasing order (σ1 ≥ σ2 ≥ ... ≥ 0)

Nice proof of uniqueness: http://www.mpi-inf.mpg.de/~bast/ir-seminar-ws04/lecture2.pdf
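These properties are easy to check numerically; a small sketch with an arbitrary real matrix:

import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((5, 3))
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Column-orthonormal factors: U^T U = I and V^T V = I
print(np.allclose(U.T @ U, np.eye(3)))    # True
print(np.allclose(Vt @ Vt.T, np.eye(3)))  # True (rows of Vt are the columns of V)

# Singular values are non-negative and sorted in decreasing order
print(np.all(s >= 0), np.all(np.diff(s) <= 0))   # True True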

(14)

SVD - Example

A = U Σ V^T (example). The rows of A are documents (the first four CS, the last three MD); the columns are the terms data, inf., retrieval, brain, lung:

A =
1 1 1 0 0
2 2 2 0 0
1 1 1 0 0
5 5 5 0 0
0 0 0 2 2
0 0 0 3 3
0 0 0 1 1

U =
0.18 0
0.36 0
0.18 0
0.90 0
0    0.53
0    0.80
0    0.27

Σ =
9.64 0
0    5.29

V^T =
0.58 0.58 0.58 0    0
0    0    0    0.71 0.71
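The numbers above can be reproduced with a few lines of NumPy (a sketch; the signs of the singular vectors may be flipped depending on the implementation):

import numpy as np

A = np.array([[1, 1, 1, 0, 0],    # rows: documents (4 CS, 3 MD)
              [2, 2, 2, 0, 0],    # columns: data, inf., retrieval, brain, lung
              [1, 1, 1, 0, 0],
              [5, 5, 5, 0, 0],
              [0, 0, 0, 2, 2],
              [0, 0, 0, 3, 3],
              [0, 0, 0, 1, 1]], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
print(np.round(s[:2], 2))      # [9.64 5.29]; the remaining values are ~0 since rank(A) = 2
print(np.round(U[:, :2], 2))   # doc-to-concept columns, matching the slide up to sign
print(np.round(Vt[:2], 2))     # term-to-concept rows: ~[0.58 0.58 0.58 0 0] and ~[0 0 0 0.71 0.71]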

(15)

SVD - Example

A = U Σ V^T (example, continued):

The two concepts found by SVD are a CS-concept and an MD-concept: the first column of U and the first row of V^T describe the CS-concept, the second column of U and the second row of V^T describe the MD-concept.

(16)

SVD - Example

A = U Σ V^T (example, continued):

U is the doc-to-concept similarity matrix: each row shows how strongly a document relates to the CS-concept and to the MD-concept.

(17)

SVD - Example

A = U Σ V^T (example, continued):

The diagonal of Σ gives the 'strength' of each concept: 9.64 is the strength of the CS-concept, 5.29 the strength of the MD-concept.

(18)

SVD - Example

A = U Σ V^T (example, continued):

V is the term-to-concept similarity matrix: each row of V^T shows how strongly the terms data, inf., retrieval, brain, and lung relate to a concept (the first row is the CS-concept, the second the MD-concept).


(20)

Outline

Overview

Dim. Reduction with SVD

(21)

SVD - Interpretation #1

‘documents’, ‘terms’ and ‘concepts’:

U: document-to-concept similarity matrix

V: term-to-concept similarity matrix

Σ: diagonal matrix; its diagonal elements give the 'strength' of each concept

(22)

SVD – Interpretation #1

‘documents’, ‘terms’ and ‘concepts’:

Q: if A is the document-to-term matrix, what is A^T A?

A:

Q: A A^T? A:

(23)

SVD – Interpretation #1

‘documents’, ‘terms’ and ‘concepts’:

Q: if A is the document-to-term matrix, what is A^T A?

A: term-to-term ([n x n]) similarity matrix

Q: A A^T?

A: document-to-document ([m x m]) similarity matrix
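Both facts follow directly from the decomposition, since A^T A = V Σ² V^T and A A^T = U Σ² U^T; a small numerical check (illustration only):

import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((6, 4))                    # 6 documents x 4 terms

U, s, Vt = np.linalg.svd(A, full_matrices=False)

# A^T A is the n x n term-to-term matrix; A A^T is the m x m doc-to-doc matrix
print((A.T @ A).shape, (A @ A.T).shape)            # (4, 4) (6, 6)
print(np.allclose(A.T @ A, Vt.T @ np.diag(s**2) @ Vt))   # True
print(np.allclose(A @ A.T, U @ np.diag(s**2) @ U.T))     # True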

(24)

SVD - Interpretation #2

best axis to project on: (‘best’ = min sum of squares of projection errors)

(25)

SVD - Motivation

(26)

SVD - interpretation #2

SVD gives the best axis to project on: the first singular vector v1, which minimizes the RMS projection error.

(27)

SVD - Interpretation #2

(28)

SVD - Interpretation #2

A = U Σ V^T (example):

v1 = [0.58 0.58 0.58 0 0], the first row of V^T, is the first singular vector: the best single axis to project the documents onto.

(29)

SVD - Interpretation #2

A = U Σ V^T (example, continued):

The first singular value, 9.64, measures the variance ('spread') of the documents along the v1 axis.

(30)

SVD - Interpretation #2

A = U Σ V^T (example, continued):

U Σ gives the coordinates of the points along the projection axes: each document's two coordinates say how strongly it lies along the CS-concept axis and the MD-concept axis.
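A sketch of this projection on the running example (again up to sign):

import numpy as np

A = np.array([[1, 1, 1, 0, 0],
              [2, 2, 2, 0, 0],
              [1, 1, 1, 0, 0],
              [5, 5, 5, 0, 0],
              [0, 0, 0, 2, 2],
              [0, 0, 0, 3, 3],
              [0, 0, 0, 1, 1]], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)

# U * Sigma: one row per document, giving its coordinates on the concept axes
coords = U[:, :2] * s[:2]
print(np.round(coords, 2))

# Equivalent view: project the rows of A onto the first two right singular vectors
print(np.allclose(coords, A @ Vt[:2].T))   # True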

(31)

SVD - Interpretation #2

More details

Q: how exactly is dim. reduction done?


(32)

SVD - Interpretation #2

More details

Q: how exactly is dim. reduction done?

A: set the smallest singular values to zero:

In the example, σ2 = 5.29 is the smallest singular value, so it is the one set to zero.

(33)

SVD - Interpretation #2

A ≈ U × diag(9.64, 0) × V^T, with U and V^T as before and σ2 = 5.29 set to zero.

(34)

SVD - Interpretation #2

With σ2 = 0, the second column of U and the second row of V^T no longer contribute, so they can be dropped as well.

(35)

SVD - Interpretation #2

A ≈ u1 · 9.64 · v1^T, keeping only the first column of U and the first row of V^T:

u1 = [0.18 0.36 0.18 0.90 0 0 0]^T
v1^T = [0.58 0.58 0.58 0 0]

(36)

SVD - Interpretation #2

A =
1 1 1 0 0
2 2 2 0 0
1 1 1 0 0
5 5 5 0 0
0 0 0 2 2
0 0 0 3 3
0 0 0 1 1

≈

1 1 1 0 0
2 2 2 0 0
1 1 1 0 0
5 5 5 0 0
0 0 0 0 0
0 0 0 0 0
0 0 0 0 0
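A short sketch that reproduces this rank-1 approximation by zeroing the smaller singular value:

import numpy as np

A = np.array([[1, 1, 1, 0, 0],
              [2, 2, 2, 0, 0],
              [1, 1, 1, 0, 0],
              [5, 5, 5, 0, 0],
              [0, 0, 0, 2, 2],
              [0, 0, 0, 3, 3],
              [0, 0, 0, 1, 1]], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)

s_trunc = s.copy()
s_trunc[1:] = 0.0                        # keep only sigma_1 = 9.64
A1 = U @ np.diag(s_trunc) @ Vt           # rank-1 approximation of A
print(np.round(A1, 2))
# The CS rows (1-4) come back essentially unchanged; the MD rows (5-7)
# collapse to zeros, exactly as on the slide.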

(37)

SVD – Best Low Rank Approx.

A = U Σ V^T: the full decomposition.

B = U S V^T, where S keeps only the largest singular values of Σ and zeroes the rest.

B is the best approximation of A among matrices of that rank.

(38)

SVD – Best Low Rank Approx.

Theorem:

Let A = U Σ V^T and B = U S V^T, where S is the diagonal r x r matrix with s_ii = σ_i for i = 1, ..., k and s_ii = 0 otherwise. Then B is the best rank-k approximation to A.

What do we mean by "best"?

B is a solution to min_B ||A - B||_F subject to rank(B) = k, where the Frobenius norm is

||A - B||_F = sqrt( Σ_ij (A_ij - B_ij)² )
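A small numerical illustration of the theorem (a sanity check, not a proof): the rank-k truncation of the SVD has Frobenius error sqrt(σ_{k+1}² + ... + σ_r²), and another rank-k matrix, e.g. a random projection, does no better.

import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((8, 6))
k = 2

U, s, Vt = np.linalg.svd(A, full_matrices=False)
B = U[:, :k] @ np.diag(s[:k]) @ Vt[:k]             # rank-k SVD truncation

# Its Frobenius error equals the energy in the discarded singular values
print(np.isclose(np.linalg.norm(A - B, 'fro'), np.sqrt(np.sum(s[k:] ** 2))))   # True

# Compare against another rank-k matrix: projection onto a random k-dim subspace
Q, _ = np.linalg.qr(rng.standard_normal((8, k)))
C = Q @ (Q.T @ A)                                  # rank <= k
print(np.linalg.norm(A - C, 'fro') >= np.linalg.norm(A - B, 'fro'))            # True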

(39)

SVD - Interpretation #2

Equivalent:

‘spectral decomposition’ of the matrix:

A = [u1 u2] × diag(σ1, σ2) × [v1 v2]^T, illustrated on the example matrix

1 1 1 0 0
3 3 3 0 0
4 4 4 0 0
5 5 5 0 0
0 2 0 4 4
0 0 0 5 5
0 1 0 2 2

(40)

SVD - Interpretation #2

Equivalent:

‘spectral decomposition’ of the matrix

A = σ1 u1 v1^T + σ2 u2 v2^T + ... (k terms; each ui is an m x 1 column vector and each vi^T a 1 x n row vector)

Assume: σ1 ≥ σ2 ≥ σ3 ≥ ... ≥ 0

Why is setting the small σi to 0 the right thing to do?

The vectors ui and vi have unit length, so it is σi that scales their contribution; zeroing the smallest σi therefore introduces the least error.
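A sketch of this view: rebuilding the example matrix from the previous slide term by term and watching the Frobenius error shrink, which is why dropping the last (smallest) terms costs the least.

import numpy as np

A = np.array([[1, 1, 1, 0, 0],
              [3, 3, 3, 0, 0],
              [4, 4, 4, 0, 0],
              [5, 5, 5, 0, 0],
              [0, 2, 0, 4, 4],
              [0, 0, 0, 5, 5],
              [0, 1, 0, 2, 2]], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)

approx = np.zeros_like(A)
for i in range(len(s)):
    approx += s[i] * np.outer(U[:, i], Vt[i])      # add the term sigma_i * u_i * v_i^T
    print(i + 1, round(np.linalg.norm(A - approx, 'fro'), 3))
# The first one or two terms capture nearly all of A because the sigma_i are sorted.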

(41)

SVD - Interpretation #2

Q: How many σs to keep?

A: Rule of thumb: keep enough of the leading singular values (σ1 ≥ σ2 ≥ ...) to retain 80-90% of the 'energy' Σi σi².
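A sketch of this rule of thumb (here asking for at least 90% of the energy; the threshold is a free choice):

import numpy as np

def choose_k(singular_values, energy_fraction=0.90):
    """Smallest k whose leading singular values retain the requested energy."""
    energy = singular_values ** 2
    cumulative = np.cumsum(energy) / np.sum(energy)
    k = int(np.searchsorted(cumulative, energy_fraction)) + 1
    return min(k, len(singular_values))

A = np.array([[1, 1, 1, 0, 0],
              [3, 3, 3, 0, 0],
              [4, 4, 4, 0, 0],
              [5, 5, 5, 0, 0],
              [0, 2, 0, 4, 4],
              [0, 0, 0, 5, 5],
              [0, 1, 0, 2, 2]], dtype=float)

s = np.linalg.svd(A, compute_uv=False)
print(choose_k(s))     # 2: two concepts already hold well over 90% of the energy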

(42)

SVD - Complexity

To compute SVD (for n x m matrix):

O(nm²) or O(n²m) (whichever is less)

But:

Less work, if we just want singular values

or if we want first k singular vectors

or if the matrix is sparse

Implemented in linear algebra packages like

LINPACK, Matlab, SPlus, Mathematica ...
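When only the first k singular vectors are needed, a truncated solver avoids the full decomposition; a minimal sketch using SciPy's svds (one common option among several):

import numpy as np
from scipy.sparse import random as sparse_random
from scipy.sparse.linalg import svds

# A large, sparse 1000 x 200 matrix (1% of the entries are non-zero)
A = sparse_random(1000, 200, density=0.01, format='csr', random_state=4)

# Compute only the top k = 10 singular triplets
U, s, Vt = svds(A, k=10)
print(U.shape, s.shape, Vt.shape)     # (1000, 10) (10,) (10, 200)
# Note: svds returns the singular values in ascending order, unlike np.linalg.svd.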

(43)

Conclusion

SVD: A = U Σ V^T: unique

U: document-to-concept similarities

V: term-to-concept similarities

Σ : strength of each concept

Dimensionality reduction:

keep the few largest singular values (80-90% of ‘energy’)

SVD: picks up linear correlations

(44)

Questions?
