Advanced Deep Learning
Deep Feedforward Networks - 2
U Kang
In This Lecture
■ Back propagation
  ❑ Motivation
  ❑ Main idea
  ❑ Procedure
Computational Graphs
■ Formalizes computation
■ Each node indicates a variable
■ Each edge indicates an operation
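As a small illustrative example (not from the slides), the expression z = (x + y) * w can be recorded as such a graph, with one node per variable and the producing operation attached to the edges into each computed node:

```python
# Illustrative only: z = (x + y) * w as a tiny computational graph
nodes = ["x", "y", "w", "u", "z"]                # one node per variable
edges = [("x", "u", "add"), ("y", "u", "add"),   # u = x + y
         ("u", "z", "mul"), ("w", "z", "mul")]   # z = u * w
```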
Chain Rule of Calculus
■ Let x be a real number, and let f and g be functions from R to R. Suppose y = g(x) and z = f(g(x)) = f(y). Then the chain rule states that
  dz/dx = (dz/dy)(dy/dx)
■ Suppose x ∈ R^m, y ∈ R^n, g: R^m → R^n, and f: R^n → R. If y = g(x) and z = f(y), then
  ∂z/∂x_i = Σ_j (∂z/∂y_j)(∂y_j/∂x_i)
  In vector notation, ∇_x z = (∂y/∂x)ᵀ ∇_y z, where ∂y/∂x is the n × m Jacobian matrix of g
  ❑ E.g., suppose z = bᵀy and y = Wᵀx. Then ∇_x z = (∂y/∂x)ᵀ ∇_y z = Wb
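A minimal numeric check of the vector chain rule on the example above; the shapes and random values are illustrative choices:

```python
import numpy as np

# Check ∇_x z = (∂y/∂x)ᵀ ∇_y z for z = bᵀy, y = Wᵀx
m, n = 4, 3
rng = np.random.default_rng(0)
W = rng.normal(size=(m, n))      # y = Wᵀx maps R^m → R^n
b = rng.normal(size=n)
x = rng.normal(size=m)

grad_chain = W @ b               # (∂y/∂x)ᵀ ∇_y z = (Wᵀ)ᵀ b = W b

# Finite-difference gradient of z(x) = bᵀ(Wᵀx)
z = lambda v: b @ (W.T @ v)
eps = 1e-6
grad_fd = np.array([(z(x + eps * e) - z(x - eps * e)) / (2 * eps)
                    for e in np.eye(m)])
print(np.allclose(grad_chain, grad_fd))  # True
```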
Repeated Subexpressions
■ Computing the same subexpression many times would be wasteful
Backpropagation - Overview
■ Backpropagation is an algorithm to compute partial derivatives efficiently in neural networks
■ The main idea is dynamic programming
  ❑ Many of the required computations share common sub-computations
  ❑ Store the common computations in memory, and read them from memory when needed, without re-computing them from scratch
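A toy illustration of the repeated-subexpression issue and of storing intermediate results (my own example, not the lecture's algorithm): for the nested expression z = f(f(f(w))), the expanded chain-rule formula re-evaluates inner subexpressions, whereas storing them computes each one only once.

```python
import math

# f(x) = e^x, so f'(x) = e^x; the chain rule gives
# dz/dw = f'(f(f(w))) * f'(f(w)) * f'(w) for z = f(f(f(w)))
f, df = math.exp, math.exp

def grad_naive(w):
    # Expanded formula: the subexpression f(w) is re-evaluated rather than stored
    return df(f(f(w))) * df(f(w)) * df(w)

def grad_stored(w):
    x = f(w)                     # compute and store each subexpression once
    y = f(x)
    return df(y) * df(x) * df(w)

print(grad_naive(0.5), grad_stored(0.5))  # same value, fewer evaluations of f
```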
Backpropagation - Overview
■ Assume a computational graph to compute a single scalar u^(n)
  ❑ E.g., this can be the loss
■ Our goal is to compute the partial derivatives ∂u^(n)/∂u^(i) for all i ∈ {1, 2, …, n_i}, where u^(1) to u^(n_i) are the parameters of the model
■ We assume that nodes are ordered such that we compute their outputs one after the other, starting at u^(n_i+1) and going up to u^(n)
■ Each node u^(i) is associated with an operation f^(i) and is computed by u^(i) = f^(i)(A^(i)), where A^(i) is the set of all parents of u^(i)
Forward Propagation
■ This procedure performs the computations mapping the n_i inputs u^(1), …, u^(n_i) to the output u^(n)
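A minimal sketch of this forward pass over a topologically ordered graph; the function and variable names are illustrative, not the lecture's notation:

```python
# Minimal sketch of forward propagation over a computational graph whose
# nodes are already in topological order.
def forward(inputs, nodes):
    """inputs: values of u^(1)..u^(n_i); nodes: (parent_indices, op) for the rest."""
    u = list(inputs)                        # u[i] holds the value of node i+1
    for parents, op in nodes:
        u.append(op(*[u[p] for p in parents]))
    return u                                # u[-1] is the scalar output u^(n)

# Toy expression z = (x + y) * x with x = 2, y = 3
values = forward([2.0, 3.0],
                 [([0, 1], lambda a, b: a + b),    # u3 = x + y
                  ([2, 0], lambda a, b: a * b)])   # u4 = u3 * x  →  z = 10
print(values[-1])  # 10.0
```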
Back-Propagation
■ This procedure computes the partial derivatives of u^(n) with respect to the variables u^(1), …, u^(n_i)
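A matching sketch of the backward pass, assuming each computed node also provides the partial derivative with respect to each of its parents (again, names are illustrative):

```python
# Minimal sketch of back-propagation on a computational graph. Each non-input
# node lists its parents and one partial-derivative function per parent:
# partials[k](*parent_values) = ∂u^(j)/∂(parent k).
def backward(u, nodes, n_inputs):
    grad = [0.0] * len(u)
    grad[-1] = 1.0                          # ∂u^(n)/∂u^(n) = 1
    for j in reversed(range(len(nodes))):   # visit nodes in reverse topological order
        parents, partials = nodes[j]
        vals = [u[p] for p in parents]
        for k, p in enumerate(parents):     # chain rule: accumulate over children
            grad[p] += grad[n_inputs + j] * partials[k](*vals)
    return grad[:n_inputs]                  # gradients of u^(n) w.r.t. the inputs

# Same toy expression as the forward sketch: z = (x + y) * x with x = 2, y = 3,
# so dz/dx = (x + y) + x = 7 and dz/dy = x = 2
u = [2.0, 3.0, 5.0, 10.0]                   # node values from the forward pass
nodes = [([0, 1], [lambda a, b: 1.0, lambda a, b: 1.0]),   # u3 = x + y
         ([2, 0], [lambda a, b: b,   lambda a, b: a])]     # u4 = u3 * x
print(backward(u, nodes, n_inputs=2))       # [7.0, 2.0]
```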
Back-Propagation Example
■ Back-propagation procedure (re-use the previously computed partial derivatives instead of recomputing them)
  ❑ Compute ∂z/∂y and ∂z/∂x
  ❑ ∂z/∂w ← (∂y/∂w)(∂z/∂y) + (∂x/∂w)(∂z/∂x)
  ❑ ∂z/∂v ← (∂w/∂v)(∂z/∂w)
  (Computational graph: v → w, w → x, w → y, and x, y → z)
Cost of Back-Propagation
■ The amount of computation scales linearly with the number of edges in the computational graph
  ❑ Computation for each edge: computing a partial derivative, one multiplication, and one addition
  (Illustrated on the same v, w, x, y, z example graph as above.)
Back-Propagation in Fully Connected MLP
■ Forward propagation
Back-Propagation in Fully Connected MLP
■ Backward computation
  ❑ ⊙: element-wise product
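A minimal sketch of the forward and backward passes in a fully connected MLP, assuming tanh activations and a squared-error loss without regularization (illustrative choices, not necessarily the slide's exact algorithm):

```python
import numpy as np

def mlp_forward_backward(x, y, Ws, bs):
    """Forward and backward passes for a fully connected MLP (illustrative)."""
    f = np.tanh
    f_prime = lambda a: 1.0 - np.tanh(a) ** 2

    # Forward propagation: a_k = W_k h_{k-1} + b_k, h_k = f(a_k)
    hs, pre = [x], []
    for W, b in zip(Ws, bs):
        a = W @ hs[-1] + b
        pre.append(a)
        hs.append(f(a))
    y_hat = hs[-1]
    loss = float(np.sum((y_hat - y) ** 2))

    # Backward computation: propagate g from the output back to the input,
    # taking the element-wise product (⊙) with f'(a_k) at each layer
    g = 2.0 * (y_hat - y)
    grads_W, grads_b = [None] * len(Ws), [None] * len(Ws)
    for k in reversed(range(len(Ws))):
        g = g * f_prime(pre[k])           # g ← g ⊙ f'(a_k)
        grads_W[k] = np.outer(g, hs[k])   # gradient w.r.t. W_k is g h_{k-1}ᵀ
        grads_b[k] = g.copy()
        g = Ws[k].T @ g                   # gradient w.r.t. the layer below
    return loss, grads_W, grads_b

# Tiny usage example with random weights
rng = np.random.default_rng(0)
Ws = [rng.normal(size=(4, 3)), rng.normal(size=(2, 4))]
bs = [np.zeros(4), np.zeros(2)]
loss, gW, gb = mlp_forward_backward(rng.normal(size=3), np.ones(2), Ws, bs)
print(loss, gW[0].shape, gW[1].shape)  # scalar loss, (4, 3), (2, 4)
```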
Stochastic Gradient Descent (SGD)
■ A recurring problem in ML: large training sets are necessary for good generalization, but large training sets are also more computationally expensive
■ Cost function using cross-entropy:
  ❑ J(θ) = E_{x,y ∼ p̂_data} L(x, y, θ) = (1/m) Σ_{i=1}^m L(x^(i), y^(i), θ), where L is the per-example loss L(x, y, θ) = −log p(y | x; θ)
  ❑ Gradient descent requires computing ∇_θ J(θ) = (1/m) Σ_{i=1}^m ∇_θ L(x^(i), y^(i), θ)
  ❑ The computational cost of gradient descent is O(m), which can take a long time for large m
■ Insight of SGD: the gradient is an expectation, which can be approximately estimated using a small set of samples
Stochastic Gradient Descent (SGD)
■ SGD
  ❑ We sample a minibatch of examples B = {x^(1), …, x^(m′)} drawn uniformly from the training set
  ❑ The minibatch size m′ is typically a small number (1 to a few hundred)
  ❑ m′ is usually held fixed as the training set size m grows
  ❑ The estimate of the gradient is g = (1/m′) Σ_{i=1}^{m′} ∇_θ L(x^(i), y^(i), θ)
  ❑ The gradient descent update is then θ ← θ − εg
■ SGD works well in practice
  ❑ I.e., it often finds a very low value of the cost function quickly enough to be useful
Minibatch Processing
■ SGD (recap)
  ❑ We sample a minibatch of examples B = {x^(1), …, x^(m′)} drawn uniformly from the training set
  ❑ The estimate of the gradient is g = (1/m′) Σ_{i=1}^{m′} ∇_θ L(x^(i), y^(i), θ)
  ❑ The gradient descent update is then θ ← θ − εg
■ SGD using back propagation (see the sketch below)
  ❑ For each instance (x^(i), y^(i)), where i = 1, …, m′, we compute the gradient ∇_θ J^(i) using back propagation, where J^(i) = L(x^(i), y^(i), θ) is the i-th loss
  ❑ The final gradient is g = (1/m′) Σ_{i=1}^{m′} ∇_θ J^(i)
  ❑ Update θ ← θ − εg
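A minimal sketch of this SGD step with per-example gradients averaged over a minibatch; the linear model and squared-error per-example loss are illustrative stand-ins for a network whose per-example gradients would come from back propagation:

```python
import numpy as np

# One SGD step: sample a minibatch B, compute each per-example gradient,
# average them to get g, and update θ ← θ − εg.
def sgd_step(theta, X, Y, m_prime, eps, rng):
    idx = rng.choice(len(X), size=m_prime, replace=False)   # minibatch B
    grads = []
    for i in idx:
        y_hat = theta @ X[i]                     # per-example forward pass
        grads.append(2 * (y_hat - Y[i]) * X[i])  # per-example gradient ∇_θ J^(i)
    g = np.mean(grads, axis=0)                   # g = (1/m′) Σ_i ∇_θ J^(i)
    return theta - eps * g                       # θ ← θ − εg

# Usage: recover the parameters of a noiseless linear model
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
true_theta = np.arange(5.0)
Y = X @ true_theta
theta = np.zeros(5)
for _ in range(2000):
    theta = sgd_step(theta, X, Y, m_prime=32, eps=0.01, rng=rng)
print(np.round(theta, 2))  # approaches [0. 1. 2. 3. 4.]
```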
Example
■ A feedforward neural network with one hidden layer
■ Forward propagation
  ❑ a ← Wx
  ❑ h ← f(a) (elementwise)
  ❑ ŷ ← wᵀh
  ❑ J ← L(ŷ, y) + λΩ(W, w) = (ŷ − y)² + λ(‖W‖_F² + ‖w‖₂²)
Example
■ Back propagation
  ❑ g ← ∇_ŷ J = 2(ŷ − y)
  ❑ ∇_w J ← ∇_w[(ŷ − y)² + λ(‖W‖_F² + ‖w‖₂²)] = gh + 2λw
  ❑ g ← ∇_h J = ∇_h[(ŷ − y)² + λ(‖W‖_F² + ‖w‖₂²)] = gw
  ❑ g ← ∇_a J = g ⊙ f′(a) (elementwise)
  ❑ ∇_W J ← ∇_W[(ŷ − y)² + λ(‖W‖_F² + ‖w‖₂²)] = gxᵀ + 2λW
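A minimal numeric check of the example above, assuming f = tanh (the slides leave f generic); the sizes, random values, and λ are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
n_in, n_hid = 3, 4
W = rng.normal(size=(n_hid, n_in))
w = rng.normal(size=n_hid)
x = rng.normal(size=n_in)
y, lam = 0.7, 0.01
f, f_prime = np.tanh, lambda a: 1 - np.tanh(a) ** 2

# Forward propagation
a = W @ x
h = f(a)
y_hat = w @ h
J = (y_hat - y) ** 2 + lam * (np.sum(W ** 2) + np.sum(w ** 2))

# Back propagation, following the formulas above
g = 2 * (y_hat - y)                      # g ← ∇_ŷ J
grad_w = g * h + 2 * lam * w             # ∇_w J = g h + 2λw
g = g * w                                # g ← ∇_h J = g w
g = g * f_prime(a)                       # g ← ∇_a J = g ⊙ f'(a)
grad_W = np.outer(g, x) + 2 * lam * W    # ∇_W J = g xᵀ + 2λW

# Finite-difference check on one entry of W
eps = 1e-6
Wp = W.copy(); Wp[0, 0] += eps
Jp = (w @ f(Wp @ x) - y) ** 2 + lam * (np.sum(Wp ** 2) + np.sum(w ** 2))
print(np.isclose((Jp - J) / eps, grad_W[0, 0], atol=1e-4))  # True
```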
What You Need to Know
■ Backpropagation is an algorithm to compute partial derivatives efficiently in neural networks
  ❑ Reuse partial derivatives
■ Procedure of backpropagation
■ Use of backpropagation in SGD