• 검색 결과가 없습니다.

Large Scale Data Analysis Using Deep Learning

N/A
N/A
Protected

Academic year: 2024

Share "Large Scale Data Analysis Using Deep Learning"

Copied!
24
0
0

로드 중.... (전체 텍스트 보기)

전체 글

(1)

U Kang 1

Large Scale Data Analysis Using Deep Learning

Numerical Computation U Kang

Seoul National University

(2)

In This Lecture

Condition number and its impact on optimization

Gradient descent

(3)

U Kang 3

Overflow and Underflow

𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠 (𝒙𝒙)

𝑖𝑖

=

exp(𝑥𝑥𝑖𝑖)

𝑗𝑗=1𝑛𝑛 exp(𝑥𝑥𝑗𝑗)

The exponentiation can underflow when the

argument is very negative, or overflow when it is very positive

A solution: softmax( 𝐳𝐳 ) where

𝐳𝐳 = 𝒙𝒙 − 𝒎𝒎𝒎𝒎𝒙𝒙𝒊𝒊𝒙𝒙𝒊𝒊

Why?

(4)

Condition Number

Conditioning: how rapidly a function changes with respect to small changes in its inputs

Function that changes rapidly when their inputs are perturbed slightly can be problematic for scientific computation because rounding errors in the inputs can result in large changes in the output

Consider 𝑠𝑠 𝑠𝑠 = 𝐴𝐴

−1

𝑠𝑠 . When A has an

eigendecomposition, its condition number is max

𝑖𝑖,𝑗𝑗

|

𝜆𝜆𝜆𝜆𝑖𝑖

𝑗𝑗

|

(5)

U Kang 5

Condition Number

Condition number for the equation Ax=b: the

maximum ratio of the relative error in x divided by the relative error in b

Let e be the error in b, then the condition number is

𝐴𝐴−1𝑒𝑒 𝐴𝐴−1𝑏𝑏

𝑒𝑒 𝑏𝑏

=

𝐴𝐴−1𝑒𝑒 𝑒𝑒 𝐴𝐴−1𝑏𝑏 𝑏𝑏

whose maximum value is 𝑠𝑠𝑠𝑠𝑠𝑠

𝑒𝑒,𝑏𝑏≠0

𝐴𝐴

−1

𝑒𝑒

𝑒𝑒

𝑏𝑏 𝐴𝐴

−1

𝑏𝑏

= 𝑠𝑠𝑠𝑠𝑠𝑠

𝑒𝑒≠0

𝐴𝐴

−1

𝑒𝑒

𝑒𝑒 𝑠𝑠𝑠𝑠𝑠𝑠

𝑏𝑏≠0

𝑏𝑏

𝐴𝐴

−1

𝑏𝑏 = | 𝜆𝜆

𝑚𝑚𝑚𝑚𝑥𝑥

𝜆𝜆

𝑚𝑚𝑖𝑖𝑖𝑖

|

(6)

Gradient Based Optimization

Goal: minimize a function y = f(x)

Derivative f’(x) (=

𝑑𝑑𝑑𝑑𝑑𝑑𝑥𝑥

)of f(x) gives the slope of f(x) at point x

𝑠𝑠(𝑠𝑠 + 𝜖𝜖) ≈ 𝑠𝑠(𝑠𝑠) + 𝜖𝜖𝑠𝑠𝑓(𝑠𝑠)

Tells us how to change x in order to make a small improvement in y

E.g., 𝑠𝑠(𝑠𝑠 − 𝜖𝜖 𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠(𝑠𝑠 𝑠𝑠 )) is less than f(x) for small enough 𝜖𝜖

Gradient descent: reduce f(x) by moving x in small

steps with opposite sign of the derivative

(7)

U Kang 7

Gradient Descent

(8)

Critical Points

Critical points or stationary points: f’(x) = 0

(9)

U Kang 9

Approximate Optimization

Global minimum vs local minimum

(10)

Gradient Descent for Multiple Input

Functions with multiple inputs: f : 𝑅𝑅

𝑖𝑖

→ 𝑅𝑅

Partial derivative

𝜕𝜕𝑥𝑥𝜕𝜕

𝑖𝑖

𝑠𝑠 (𝒙𝒙) measures how f changes as only the variable 𝑠𝑠

𝑖𝑖

increases at point 𝒙𝒙 .

Gradient 𝛻𝛻

𝒙𝒙

𝑠𝑠(𝒙𝒙) of f is the vector containing all the partial derivatives

Critical points: every element of the gradient is equal

to 0

(11)

U Kang 11

Gradient Descent for Multiple Input

Directional derivative in direction u (a unit vector):

the slope of the function f in direction u

Derivative of the function 𝑠𝑠(𝒙𝒙 + 𝛼𝛼𝒖𝒖) with respect to 𝛼𝛼, evaluated at 𝛼𝛼 = 0: 𝜕𝜕𝛼𝛼𝜕𝜕 𝑠𝑠 𝒙𝒙 + 𝛼𝛼𝒖𝒖 = 𝑢𝑢𝑇𝑇𝛻𝛻𝒙𝒙𝑠𝑠 𝒙𝒙

Find the direction in which f decreases the fastest

𝑠𝑠𝑠𝑠𝑠𝑠𝑢𝑢,𝑢𝑢𝑇𝑇𝑢𝑢=1𝑢𝑢𝑇𝑇𝛻𝛻𝒙𝒙𝑠𝑠 𝒙𝒙 = 𝑠𝑠𝑠𝑠𝑠𝑠𝑢𝑢,𝑢𝑢𝑇𝑇𝑢𝑢=1| 𝑢𝑢 |2| 𝛻𝛻𝒙𝒙𝑠𝑠 𝒙𝒙 |2𝑐𝑐𝑠𝑠𝑠𝑠𝜃𝜃

This is minimized when u points in the opposite direction as the gradient: 𝒙𝒙 = 𝒙𝒙 − 𝜖𝜖𝛻𝛻𝒙𝒙𝑠𝑠 𝒙𝒙

This method is called the steepest descent or gradient descent

𝜖𝜖 (called step size) can be set to a small constant, to a value that minimizes f(x), or to a value that results in the smallest f(x) among several values (line search)

(12)

Beyond the Gradient: Jacobian and Hessian

Jacobian : a matrix containing partial derivatives of a function whose input and output are both vectors

For a function f: 𝑅𝑅𝑚𝑚 → 𝑅𝑅𝑖𝑖, Jacobian 𝐽𝐽 ∈ 𝑅𝑅𝑖𝑖×𝑚𝑚 is defined such that 𝐽𝐽𝑖𝑖,𝑗𝑗 = 𝜕𝜕𝑥𝑥𝜕𝜕

𝑗𝑗 𝑠𝑠(𝒙𝒙)𝑖𝑖

Second derivative: derivative of a derivative

For a function f: 𝑅𝑅𝑚𝑚 → 𝑅𝑅, the derivative with respect to 𝑠𝑠𝑖𝑖 of the derivative of f with respect to 𝑠𝑠𝑗𝑗 is denote by 𝜕𝜕𝑥𝑥𝜕𝜕2

𝑖𝑖𝜕𝜕𝑥𝑥𝑗𝑗 𝑠𝑠

In a single dimension, second derivative is denote by f’’(x)

Tells us whether a gradient step will cause as much of an improvement as we would expect based on the gradient alone

(13)

U Kang 13

Curvature

(14)

Hessian

Second derivatives of a function f: 𝑅𝑅

𝑖𝑖

→ 𝑅𝑅 with multiple input dimensions

Hessian matrix 𝐻𝐻(𝑠𝑠)(𝒙𝒙) is defined such that

𝐻𝐻(𝑠𝑠)(𝒙𝒙)𝑖𝑖,𝑗𝑗 = 𝜕𝜕𝑥𝑥𝜕𝜕2

𝑖𝑖𝜕𝜕𝑥𝑥𝑗𝑗 𝑠𝑠(𝒙𝒙)

The differential operators are commutative: 𝜕𝜕𝑥𝑥𝜕𝜕2

𝑖𝑖𝜕𝜕𝑥𝑥𝑗𝑗 𝑠𝑠 𝒙𝒙 =

𝜕𝜕2

𝜕𝜕𝑥𝑥𝑗𝑗𝜕𝜕𝑥𝑥𝑖𝑖 𝑠𝑠(𝒙𝒙). Thus, H is symmetric.

The second derivative in a specific direction represented by a unit vector 𝑑𝑑 is given by 𝑑𝑑𝑇𝑇𝐻𝐻𝑑𝑑.

(pf) Note that the first derivative is given by 𝑑𝑑𝑇𝑇𝛻𝛻𝒙𝒙𝑠𝑠 𝒙𝒙 . The second derivative is given by 𝑑𝑑𝑇𝑇𝛻𝛻𝒙𝒙 𝑑𝑑𝑇𝑇𝛻𝛻𝒙𝒙𝑠𝑠 𝒙𝒙 = 𝑑𝑑𝑇𝑇 𝐻𝐻𝑇𝑇𝑑𝑑 = 𝑑𝑑𝑇𝑇𝐻𝐻𝑑𝑑

(15)

U Kang 15

Aside: Chain Rule of Calculus

Let x be a real number, and f and g be functions from R to R. Suppose y = g(x), and z = f(g(x)) = f(y). Then the

chain rule states that

𝑑𝑑𝑥𝑥𝑑𝑑𝑑𝑑

=

𝑑𝑑𝑑𝑑𝑑𝑑𝑑𝑑 𝑑𝑑𝑑𝑑𝑑𝑑𝑥𝑥

Suppose 𝑠𝑠 ∈ 𝑅𝑅

𝑚𝑚

, 𝑦𝑦 ∈ 𝑅𝑅

𝑖𝑖

, 𝑠𝑠: 𝑅𝑅

𝑚𝑚

→ 𝑅𝑅

𝑖𝑖

, 𝑠𝑠: 𝑅𝑅

𝑖𝑖

→ 𝑅𝑅 . If 𝒚𝒚 = 𝑠𝑠(𝒙𝒙) and 𝑧𝑧 = 𝑠𝑠(𝒚𝒚) , then

𝜕𝜕𝑥𝑥𝜕𝜕𝑑𝑑

𝑖𝑖

= ∑

𝑗𝑗 𝜕𝜕𝑑𝑑𝜕𝜕𝑑𝑑

𝑗𝑗

𝜕𝜕𝑑𝑑𝑗𝑗

𝜕𝜕𝑥𝑥𝑖𝑖

. In

vector notation: 𝛻𝛻

𝒙𝒙

𝑧𝑧 = (

𝜕𝜕𝒚𝒚𝜕𝜕𝒙𝒙

)

𝑇𝑇

𝛻𝛻

𝒚𝒚

𝑧𝑧 , where

𝜕𝜕𝒚𝒚𝜕𝜕𝒙𝒙

is the n x m Jacobian matrix of g

E.g., suppose 𝑧𝑧 = 𝑑𝑑𝑇𝑇𝒚𝒚, and 𝐲𝐲 = 𝛻𝛻𝒙𝒙𝑠𝑠 𝒙𝒙 . Then 𝛻𝛻𝒙𝒙𝑧𝑧 = (𝜕𝜕𝒚𝒚𝜕𝜕𝒙𝒙)𝑇𝑇𝛻𝛻𝒚𝒚𝑧𝑧 = 𝐻𝐻𝑇𝑇𝑑𝑑

(16)

Hessian

If d is an eigenvector of H, the directional second derivative

𝑑𝑑𝑇𝑇𝐻𝐻𝑑𝑑 is the corresponding eigenvalue. For other direction, the directional second derivative is a weighted average of all of the eigenvalues

The second derivative tells us how well we can expect a gradient descent step to perform

𝑠𝑠 𝑠𝑠 ≈ 𝑠𝑠 𝑠𝑠 0 + 𝑠𝑠 − 𝑠𝑠 0 𝑇𝑇𝑠𝑠 + 12 𝑠𝑠 − 𝑠𝑠 0 𝑇𝑇𝐻𝐻 𝑠𝑠 − 𝑠𝑠 0 where g is the gradient and H is the Hessian at 𝑠𝑠 0 .

Using a learning rate of 𝜖𝜖, then the new point x will be given by 𝑠𝑠 0 − 𝜖𝜖g.

Then, 𝑠𝑠 𝑠𝑠 0 − 𝜖𝜖g ≈ 𝑠𝑠 𝑠𝑠 0 − 𝜖𝜖𝑠𝑠𝑇𝑇𝑠𝑠 + 12𝜖𝜖2𝑠𝑠𝑇𝑇𝐻𝐻𝑠𝑠

The original improvement due correction to account

(17)

U Kang 17

Hessian

The term 12

𝜖𝜖

2

𝑠𝑠

𝑇𝑇

𝐻𝐻𝑠𝑠

If 𝑠𝑠𝑇𝑇𝐻𝐻𝑠𝑠 is too large, the gradient descent step can actually move uphill

If 𝑠𝑠𝑇𝑇𝐻𝐻𝑠𝑠 is 0 or negative, increasing 𝜖𝜖 would decrease f

When 𝑠𝑠𝑇𝑇𝐻𝐻𝑠𝑠 is positive, the optimal 𝜖𝜖 is given by 𝑔𝑔𝑔𝑔𝑇𝑇𝑇𝑇𝐻𝐻𝑔𝑔𝑔𝑔

If f is approximated well by a quadratic function, the eigenvalues of the Hessian determine the scale of the learning rate

(18)

Hessian

Second derivative determines whether a critical point is a local maximum, a local minimum, or a saddle point

f’(x) = 0 and f’’(x) > 0: x is a local minimum

f’(x) = 0 and f’’(x) < 0: x is a local maximum

f‘(x) = 0 and f’’(x) = 0: x may be a saddle point or a part of a flat region

In multiple dimension: need to examine all of the second derivatives

At a critical point where 𝛻𝛻𝒙𝒙𝑠𝑠 𝒙𝒙 = 𝟎𝟎, we can examine eigenvalues of the Hessian to determine whether the critical point is a local maximum, local minimum, or a saddle point

If H is positive definite, then x is a local minimum

If H is negative definite, then x is a local maximum

If H has both positive and negative eigenvalues, then x is a saddle point

(19)

U Kang 19

Saddle Points

(20)

Gradient Descent and Poor Conditioning

The condition number of the Hessian measures how much the second derivative vary

When the Hessian has a poor condition number, gradient descent performs poorly

In one direction, the derivative increases rapidly, while in another direction, it increases slowly

Gradient descent is unaware of this change in the derivative so it does not know that it needs to explore preferentially in the direction where the derivative remains negative for longer

This makes it difficult to choose a good step size: the step size must be small enough to avoid overshooting, but this means the step size is too small to make significant progress in other direction with less curvature

(21)

U Kang 21

Gradient Descent and Poor

Conditioning

(22)

Gradient Descent and Poor Conditioning

The problem of the Hessian with a poor condition number can be solved by using the second-order method

First-order method: use only the gradient to find the optimum

Gradient Descent

Second-order method: use both the gradient and the Hessian

Newton’s method

(23)

U Kang 23

What you need to know

Gradient descent

Useful method for optimization

Multiple inputs: Jacobian and Hessian

Second derivative gives useful information for optimization

For a function with poorly conditioned Hessian,

gradient descent performs badly

(24)

Questions?

참조

관련 문서

Applying existing statistical methods to a distributed research network (DRN) creates the opportunity to conduct research using data from multiple

Internal environment measurement data collected from IoT sensors, device control data collected from device control actuators, and user judgment data are learned

(2004), “A Proposal of the Evaluation Method for Rock Slope Stability Using Logistic Regression Analysis”, Journal of Korean Society of Rock Mechanics, Vol.12, No.2, pp8.

Unlike them, by combining set covering and clustering techniques, we developed a new method that could deal with total number of variables and consider the combinatorial effects

In this paper, we introduced a positioning technology using deep learning based on LTE Channel State Information-Reference Signal (CSI-RS) data, and confirmed the possibility

In this study, we propose a controller to control bicycle using DDPG (Deep Deterministic Policy Gradient) algorithm which is the latest deep

We designed, fabricated the Knudsen pump and analyzed pressure gradient efficiency of membrane according to Knudsen number under vacuum condition...

본 논문에서는 영어로 된 리뷰 상품 데이터를 기반으로 3 개의 딥러닝 알고리즘 모델(Word Count, TFIDF(Term Frequency Inverse Document Frequency),