(1)

Advanced Deep Learning

Deep Feedforward Networks - 2

U Kang

(2)


In This Lecture

■ Back propagation
  ❑ Motivation
  ❑ Main idea
  ❑ Procedure

(3)

Computational Graphs

■ Formalizes computation
■ Each node indicates a variable
■ Each edge indicates an operation

(4)


Chain Rule of Calculus

■ Let x be a real number, and f and g be functions from R to R. Suppose y = g(x) and z = f(g(x)) = f(y). Then the chain rule states that

  dz/dx = (dz/dy)(dy/dx)

■ Suppose x ∈ R^m, y ∈ R^n, g: R^m → R^n, and f: R^n → R. If y = g(x) and z = f(y), then

  ∂z/∂x_i = Σ_j (∂z/∂y_j)(∂y_j/∂x_i)

  In vector notation: ∇_x z = (∂y/∂x)^T ∇_y z, where ∂y/∂x is the n x m Jacobian matrix of g

  ❑ E.g., suppose z = d^T y and y = ∇_x g(x). Then ∇_x z = (∂y/∂x)^T ∇_y z = H^T d, where H is the Hessian matrix of g

(5)

Repeated Subexpressions

■ Computing the same subexpression many times would be wasteful
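As a deliberately tiny illustration of the point above (this example is mine, not the slide's), suppose x = f(w), y = f(x), z = f(y) with f = exp, so dz/dw = f'(y) f'(x) f'(w). The naive expression recomputes f(w) repeatedly; storing the intermediates lets every factor be read from memory.

    import math

    def f(a):
        return math.exp(a)      # f'(a) = exp(a) = f(a), which keeps the demo short

    w = 0.5

    # Naive: every factor is expanded back to w, so f(w) is recomputed repeatedly.
    dz_dw_naive = f(f(f(w))) * f(f(w)) * f(w)

    # With stored intermediates: each subexpression is computed exactly once.
    x = f(w)
    y = f(x)
    z = f(y)
    dz_dw_cached = z * y * x    # f'(y) * f'(x) * f'(w), read from memory

    assert abs(dz_dw_naive - dz_dw_cached) < 1e-9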

(6)


Backpropagation - Overview

■ Backpropagation is an algorithm to compute partial derivatives efficiently in neural networks
■ The main idea is dynamic programming
  ❑ Many of the required derivative computations share common sub-computations
  ❑ Store the common computations in memory, and read them from memory when needed, without re-computing them from scratch

(7)

Backpropagation - Overview

■ Assume a computational graph that computes a single scalar u^(n)
  ❑ E.g., this can be the loss
■ Our goal is to compute the partial derivatives ∂u^(n)/∂u^(i) for all i ∈ {1, 2, ..., n_i}, where u^(1) to u^(n_i) are the parameters of the model
■ We assume that nodes are ordered such that we compute their outputs one after the other, starting at u^(n_i + 1) and going up to u^(n)
■ Each node u^(i) is associated with an operation f^(i) and is computed by u^(i) = f^(i)(A^(i)), where A^(i) is the set of all parents of u^(i)
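A minimal sketch of this bookkeeping in plain Python (the class and field names are my choices, not the lecture's): each node stores its parents A^(i), its operation f^(i), and the value u^(i).

    from dataclasses import dataclass, field
    from typing import Callable, List, Optional

    @dataclass(eq=False)   # eq=False keeps identity hashing, so nodes can be dict keys
    class Node:
        op: Optional[Callable] = None                         # f^(i); None for inputs/parameters
        parents: List["Node"] = field(default_factory=list)   # A^(i)
        value: float = 0.0                                     # u^(i), filled by forward propagation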

(8)


Forward Propagation

■ This procedure performs the computations mapping the n_i inputs u^(1), ..., u^(n_i) to the output u^(n)
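The procedure itself appears as a figure on the original slide; below is a hedged sketch over the Node class above, assuming the node list is topologically ordered with the n_i inputs/parameters first.

    def forward(nodes):
        """Forward propagation over topologically ordered nodes; returns u^(n)."""
        for node in nodes:
            if node.op is not None:                            # inputs/parameters already hold values
                node.value = node.op(*[p.value for p in node.parents])
        return nodes[-1].value                                 # u^(n), e.g. the loss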

(9)

Back-Propagation

■ This procedure computes the partial derivatives of u^(n) with respect to the variables u^(1), ..., u^(n_i)
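Again, the slide's procedure is given as a figure; the sketch below shows the dynamic-programming idea with an explicit table grad of stored derivatives. local_grad(node, parent) is a hypothetical helper, assumed here, that returns the local derivative ∂u^(i)/∂u^(parent).

    def backward(nodes, local_grad):
        """Back-propagation: returns a table of ∂u^(n)/∂u^(i) for every node."""
        grad = {nodes[-1]: 1.0}                                # ∂u^(n)/∂u^(n) = 1
        for node in reversed(nodes):                           # reverse topological order
            g_node = grad.get(node, 0.0)                       # ∂u^(n)/∂u^(i), stored earlier
            for parent in node.parents:
                # Chain rule: accumulate (∂u^(n)/∂u^(i)) * (∂u^(i)/∂u^(parent)),
                # re-using g_node instead of recomputing it from scratch.
                grad[parent] = grad.get(parent, 0.0) + g_node * local_grad(node, parent)
        return grad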

(10)


Back-Propagation Example

■ Back-propagation procedure (the shared computations, highlighted in blue on the original slide, are computed once and re-used; a numerical instance follows below)
  ❑ Compute ∂z/∂y and ∂z/∂x
  ❑ ∂z/∂w ← (∂y/∂w)(∂z/∂y) + (∂x/∂w)(∂z/∂x)
  ❑ ∂z/∂v ← (∂w/∂v)(∂z/∂w)

[Figure: computational graph with nodes v, w, x, y, z; v feeds w, w feeds both x and y, and x and y feed z]
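To make the re-use concrete, here is a small numerical instance of this graph; the operations w = 2v, x = w², y = sin(w), z = xy are my own choices, since the slide does not specify them.

    import math

    v = 0.7
    w = 2.0 * v
    x = w ** 2
    y = math.sin(w)
    z = x * y

    # Local partial derivatives along each edge.
    dz_dx = y                 # ∂z/∂x
    dz_dy = x                 # ∂z/∂y
    dx_dw = 2.0 * w           # ∂x/∂w
    dy_dw = math.cos(w)       # ∂y/∂w
    dw_dv = 2.0               # ∂w/∂v

    # Re-use dz_dx and dz_dy (computed once) for ∂z/∂w, then re-use ∂z/∂w for ∂z/∂v.
    dz_dw = dy_dw * dz_dy + dx_dw * dz_dx
    dz_dv = dw_dv * dz_dw
    print(dz_dw, dz_dv)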

(11)

Cost of Back-Propagation

■ The amount of computation scales linearly with the number of edges in the computational graph
  ❑ Computation for each edge: computing a partial derivative, one multiplication, and one addition

[Figure: the same graph as on the previous slide (w feeds x and y; x and y feed z), annotated with the re-used computations ∂z/∂y, ∂z/∂x, ∂z/∂w ← (∂y/∂w)(∂z/∂y) + (∂x/∂w)(∂z/∂x), and ∂z/∂v ← (∂w/∂v)(∂z/∂w)]

(12)


Back-Propagation in Fully Connected MLP

■ Forward propagation

(13)

Back-Propagation in Fully Connected MLP

■ Backward computation (a sketch covering both slides follows below)
  ❑ ⊙: element-wise product
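Since the algorithm boxes on these two slides are figures that did not survive the text extraction, here is a hedged numpy sketch of what the forward and backward computations look like for one hidden layer; the layer sizes, the sigmoid activation, and the squared-error loss are my assumptions.

    import numpy as np

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    rng = np.random.default_rng(0)
    x = rng.normal(size=(4,))          # input
    W1 = rng.normal(size=(3, 4))       # hidden-layer weights
    W2 = rng.normal(size=(1, 3))       # output-layer weights
    y = np.array([1.0])                # target

    # Forward propagation: cache the intermediate values for the backward pass.
    a1 = W1 @ x
    h1 = sigmoid(a1)
    y_hat = W2 @ h1
    J = float(np.sum((y_hat - y) ** 2))

    # Backward computation: propagate gradients layer by layer.
    g = 2.0 * (y_hat - y)              # dJ/dy_hat
    dW2 = np.outer(g, h1)              # dJ/dW2
    g = W2.T @ g                       # dJ/dh1
    g = g * h1 * (1.0 - h1)            # dJ/da1; the ⊙ step, element-wise with σ'(a1)
    dW1 = np.outer(g, x)               # dJ/dW1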

(14)


Stochastic Gradient Descent (SGD)

■ A recurring problem in ML: large training sets are necessary for good generalization, but large training sets are more computationally expensive
■ Cost function using cross-entropy:
  ❑ J(θ) = E_{(x,y)~p̂_data} L(x, y, θ) = (1/m) Σ_{i=1}^m L(x^(i), y^(i), θ), where L is the per-example loss L(x, y, θ) = −log p(y | x; θ)
  ❑ Gradient descent requires computing ∇_θ J(θ) = (1/m) Σ_{i=1}^m ∇_θ L(x^(i), y^(i), θ)
  ❑ The computational cost of gradient descent is O(m), which can take long for large m
■ Insight of SGD: the gradient is an expectation, which can be estimated approximately using a small set of samples

(15)

Stochastic Gradient Descent (SGD)

■ SGD
  ❑ We sample a minibatch of examples B = {x^(1), ..., x^(m')} drawn uniformly from the training set
  ❑ The minibatch size m' is typically a small number (1 to a few hundred)
  ❑ m' is usually held fixed as the training set size m grows
  ❑ The estimate of the gradient is g = (1/m') Σ_{i=1}^{m'} ∇_θ L(x^(i), y^(i), θ)
  ❑ Then the gradient descent update is θ ← θ − εg (see the sketch after this list)
■ SGD works well in practice
  ❑ I.e., it often finds a very low value of the cost function quickly enough to be useful
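A minimal runnable sketch of the update above, assuming a linear-regression model with squared error in place of the unspecified model; the data X, Y, the learning rate ε = 0.05, and m' = 32 are illustrative choices.

    import numpy as np

    rng = np.random.default_rng(0)
    m, d = 1000, 5
    X = rng.normal(size=(m, d))                        # training inputs
    theta_true = rng.normal(size=(d,))
    Y = X @ theta_true + 0.1 * rng.normal(size=(m,))   # training targets

    theta = np.zeros(d)
    eps, m_prime = 0.05, 32                            # learning rate and minibatch size

    for step in range(500):
        idx = rng.choice(m, size=m_prime, replace=False)   # uniform minibatch B
        x_b, y_b = X[idx], Y[idx]
        # g = (1/m') Σ ∇_θ L(x^(i), y^(i), θ) for the squared-error loss
        g = 2.0 / m_prime * x_b.T @ (x_b @ theta - y_b)
        theta = theta - eps * g                        # θ ← θ − εg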

(16)


Minibatch Processing

■ SGD (recap)
  ❑ We sample a minibatch of examples B = {x^(1), ..., x^(m')} drawn uniformly from the training set
  ❑ The estimate of the gradient is g = (1/m') Σ_{i=1}^{m'} ∇_θ L(x^(i), y^(i), θ)
  ❑ Then the gradient descent update is θ ← θ − εg
■ SGD using back propagation
  ❑ For each instance (x^(i), y^(i)), where i = 1, ..., m', we compute the gradient ∇_θ J^(i) using back propagation, where J^(i) = L(x^(i), y^(i), θ) is the i-th loss
  ❑ The final gradient is g = (1/m') Σ_{i=1}^{m'} ∇_θ J^(i)
  ❑ Update θ ← θ − εg (a sketch follows this list)
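A hedged sketch of one such step; backprop_gradient is a hypothetical stand-in for "run back propagation on example (x^(i), y^(i)) and return ∇_θ J^(i)", and the toy squared-error gradient in the usage lines is only for illustration.

    import numpy as np

    def sgd_step(theta, minibatch, backprop_gradient, eps=0.01):
        """One update θ ← θ − εg, with g averaged over per-example backprop gradients."""
        grads = [backprop_gradient(x_i, y_i, theta) for x_i, y_i in minibatch]
        g = np.mean(grads, axis=0)        # g = (1/m') Σ_i ∇_θ J^(i)
        return theta - eps * g

    # Usage with a toy model where ∇_θ L = 2(θᵀx − y)x (assumed here, not from the slide):
    toy_grad = lambda x, y, theta: 2.0 * (theta @ x - y) * x
    theta = np.zeros(3)
    batch = [(np.array([1.0, 0.0, 2.0]), 1.0), (np.array([0.5, 1.0, 0.0]), -1.0)]
    theta = sgd_step(theta, batch, toy_grad)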

(17)

Example

■ A feedforward neural network with one hidden layer
■ Forward propagation (a numerical sketch follows this list)
  ❑ a ← Wx
  ❑ h ← σ(a) (elementwise)
  ❑ ŷ ← v^T h
  ❑ J ← L(ŷ, y) + λΩ(W, v) = (ŷ − y)² + λ(‖W‖_F² + ‖v‖₂²)

(18)


Example

■ Back propagation (continued in the sketch after this list)
  ❑ g ← ∇_ŷ J = 2(ŷ − y)
  ❑ ∇_v J ← ∇_v[(ŷ − y)² + λ(‖W‖_F² + ‖v‖₂²)] = gh + 2λv
  ❑ g ← ∇_h J = ∇_h[(ŷ − y)² + λ(‖W‖_F² + ‖v‖₂²)] = gv
  ❑ g ← ∇_a J = g ⊙ σ'(a) (elementwise)
  ❑ ∇_W J ← ∇_W[(ŷ − y)² + λ(‖W‖_F² + ‖v‖₂²)] = gx^T + 2λW

[Figure: computational graph x → a → h → ŷ, with weight matrix W on the edge x → a and weight vector v on the edge h → ŷ; a ← Wx, h ← σ(a), ŷ ← v^T h]
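Continuing the forward sketch after the previous slide (this snippet reuses x, y, W, v, lam, a, h, y_hat defined there), the same gradients computed in the order listed above; σ'(a) = σ(a)(1 − σ(a)) for the logistic sigmoid assumed earlier.

    g = 2.0 * (y_hat - y)                     # g ← ∇_ŷ J
    grad_v = g * h + 2.0 * lam * v            # ∇_v J = gh + 2λv
    g = g * v                                 # g ← ∇_h J = gv
    g = g * h * (1.0 - h)                     # g ← ∇_a J = g ⊙ σ'(a)
    grad_W = np.outer(g, x) + 2.0 * lam * W   # ∇_W J = gx^T + 2λW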

(19)

What You Need to Know

■ Backpropagation is an algorithm to compute partial derivatives efficiently in neural networks
  ❑ Reuse partial derivatives
■ Procedure of backpropagation
■ Use of backpropagation in SGD

(20)


Questions?
