Advanced Deep Learning
Deep Feedforward Networks - 2
U Kang
In This Lecture
■ Back propagation
  ❑ Motivation
  ❑ Main idea
  ❑ Procedure
Computational Graphs
■ Formalizes computation
■ Each node indicates a variable
■ Each edge indicates an operation
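As a small illustrative example (not from the slides), the expression z = (x + y) * w can be recorded as such a graph, with one node per variable and the producing operation attached to the edges into each computed node:

```python
# Illustrative only: z = (x + y) * w as a tiny computational graph
nodes = ["x", "y", "w", "u", "z"]                # one node per variable
edges = [("x", "u", "add"), ("y", "u", "add"),   # u = x + y
         ("u", "z", "mul"), ("w", "z", "mul")]   # z = u * w
```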
Chain Rule of Calculus
■ Let x be a real number, and let f and g be functions from R to R. Suppose y = g(x) and z = f(g(x)) = f(y). Then the chain rule states that
  dz/dx = (dz/dy)(dy/dx)
■ Suppose x ∈ R^m, y ∈ R^n, g: R^m → R^n, and f: R^n → R. If y = g(x) and z = f(y), then
  ∂z/∂x_i = Σ_j (∂z/∂y_j)(∂y_j/∂x_i)
  In vector notation, ∇_x z = (∂y/∂x)ᵀ ∇_y z, where ∂y/∂x is the n × m Jacobian matrix of g
  ❑ E.g., suppose z = bᵀy and y = Wᵀx. Then ∇_x z = (∂y/∂x)ᵀ ∇_y z = Wb
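A minimal numeric check of the vector chain rule on the example above; the shapes and random values are illustrative choices:

```python
import numpy as np

# Check ∇_x z = (∂y/∂x)ᵀ ∇_y z for z = bᵀy, y = Wᵀx
m, n = 4, 3
rng = np.random.default_rng(0)
W = rng.normal(size=(m, n))      # y = Wᵀx maps R^m → R^n
b = rng.normal(size=n)
x = rng.normal(size=m)

grad_chain = W @ b               # (∂y/∂x)ᵀ ∇_y z = (Wᵀ)ᵀ b = W b

# Finite-difference gradient of z(x) = bᵀ(Wᵀx)
z = lambda v: b @ (W.T @ v)
eps = 1e-6
grad_fd = np.array([(z(x + eps * e) - z(x - eps * e)) / (2 * eps)
                    for e in np.eye(m)])
print(np.allclose(grad_chain, grad_fd))  # True
```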
Repeated Subexpressions
■ Computing the same subexpression many times would be wasteful
Backpropagation - Overview
■ Backpropagation is an algorithm to compute partial derivatives efficiently in neural networks
■ The main idea is dynamic programming
  ❑ Many of the required computations share common sub-computations
  ❑ Store the common computations in memory, and read them from memory when needed, without re-computing them from scratch
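A toy illustration of the repeated-subexpression issue and of storing intermediate results (my own example, not the lecture's algorithm): for the nested expression z = f(f(f(w))), the expanded chain-rule formula re-evaluates inner subexpressions, whereas storing them computes each one only once.

```python
import math

# f(x) = e^x, so f'(x) = e^x; the chain rule gives
# dz/dw = f'(f(f(w))) * f'(f(w)) * f'(w) for z = f(f(f(w)))
f, df = math.exp, math.exp

def grad_naive(w):
    # Expanded formula: the subexpression f(w) is re-evaluated rather than stored
    return df(f(f(w))) * df(f(w)) * df(w)

def grad_stored(w):
    x = f(w)                     # compute and store each subexpression once
    y = f(x)
    return df(y) * df(x) * df(w)

print(grad_naive(0.5), grad_stored(0.5))  # same value, fewer evaluations of f
```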
Backpropagation - Overview
■ Assume a computational graph to compute a single scalar u^(n)
  ❑ E.g., this can be the loss
■ Our goal is to compute the partial derivatives ∂u^(n)/∂u^(i) for all i ∈ {1, 2, …, n_i}, where u^(1) to u^(n_i) are the parameters of the model
■ We assume that nodes are ordered such that we compute their outputs one after the other, starting at u^(n_i+1) and going up to u^(n)
■ Each node u^(i) is associated with an operation f^(i) and is computed by u^(i) = f^(i)(A^(i)), where A^(i) is the set of all parents of u^(i)
Forward Propagation
■ This procedure performs the computations mapping the n_i inputs u^(1), …, u^(n_i) to the output u^(n)
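A minimal sketch of this forward pass over a topologically ordered graph; the function and variable names are illustrative, not the lecture's notation:

```python
# Minimal sketch of forward propagation over a computational graph whose
# nodes are already in topological order.
def forward(inputs, nodes):
    """inputs: values of u^(1)..u^(n_i); nodes: (parent_indices, op) for the rest."""
    u = list(inputs)                        # u[i] holds the value of node i+1
    for parents, op in nodes:
        u.append(op(*[u[p] for p in parents]))
    return u                                # u[-1] is the scalar output u^(n)

# Toy expression z = (x + y) * x with x = 2, y = 3
values = forward([2.0, 3.0],
                 [([0, 1], lambda a, b: a + b),    # u3 = x + y
                  ([2, 0], lambda a, b: a * b)])   # u4 = u3 * x  →  z = 10
print(values[-1])  # 10.0
```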
Back-Propagation
■ This procedure computes the partial derivatives of u^(n) with respect to the variables u^(1), …, u^(n_i)
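A matching sketch of the backward pass, assuming each computed node also provides the partial derivative with respect to each of its parents (again, names are illustrative):

```python
# Minimal sketch of back-propagation on a computational graph. Each non-input
# node lists its parents and one partial-derivative function per parent:
# partials[k](*parent_values) = ∂u^(j)/∂(parent k).
def backward(u, nodes, n_inputs):
    grad = [0.0] * len(u)
    grad[-1] = 1.0                          # ∂u^(n)/∂u^(n) = 1
    for j in reversed(range(len(nodes))):   # visit nodes in reverse topological order
        parents, partials = nodes[j]
        vals = [u[p] for p in parents]
        for k, p in enumerate(parents):     # chain rule: accumulate over children
            grad[p] += grad[n_inputs + j] * partials[k](*vals)
    return grad[:n_inputs]                  # gradients of u^(n) w.r.t. the inputs

# Same toy expression as the forward sketch: z = (x + y) * x with x = 2, y = 3,
# so dz/dx = (x + y) + x = 7 and dz/dy = x = 2
u = [2.0, 3.0, 5.0, 10.0]                   # node values from the forward pass
nodes = [([0, 1], [lambda a, b: 1.0, lambda a, b: 1.0]),   # u3 = x + y
         ([2, 0], [lambda a, b: b,   lambda a, b: a])]     # u4 = u3 * x
print(backward(u, nodes, n_inputs=2))       # [7.0, 2.0]
```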
Back-Propagation Example
■ Back-propagation procedure (re-use the previously computed partial derivatives instead of recomputing them)
  ❑ Compute ∂z/∂y and ∂z/∂x
  ❑ ∂z/∂w ← (∂y/∂w)(∂z/∂y) + (∂x/∂w)(∂z/∂x)
  ❑ ∂z/∂v ← (∂w/∂v)(∂z/∂w)
  (Computational graph: v → w, w → x, w → y, and x, y → z)
Cost of Back-Propagation
■ The amount of computation scales linearly with the number of edges in the computational graph
  ❑ Computation for each edge: computing a partial derivative, one multiplication, and one addition
  (Illustrated on the same v, w, x, y, z example graph as above.)
Back-Propagation in Fully Connected MLP
■ Forward propagation
Back-Propagation in Fully Connected MLP
■ Backward computation
  ❑ ⊙: element-wise product
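A minimal sketch of the forward and backward passes in a fully connected MLP, assuming tanh activations and a squared-error loss without regularization (illustrative choices, not necessarily the slide's exact algorithm):

```python
import numpy as np

def mlp_forward_backward(x, y, Ws, bs):
    """Forward and backward passes for a fully connected MLP (illustrative)."""
    f = np.tanh
    f_prime = lambda a: 1.0 - np.tanh(a) ** 2

    # Forward propagation: a_k = W_k h_{k-1} + b_k, h_k = f(a_k)
    hs, pre = [x], []
    for W, b in zip(Ws, bs):
        a = W @ hs[-1] + b
        pre.append(a)
        hs.append(f(a))
    y_hat = hs[-1]
    loss = float(np.sum((y_hat - y) ** 2))

    # Backward computation: propagate g from the output back to the input,
    # taking the element-wise product (⊙) with f'(a_k) at each layer
    g = 2.0 * (y_hat - y)
    grads_W, grads_b = [None] * len(Ws), [None] * len(Ws)
    for k in reversed(range(len(Ws))):
        g = g * f_prime(pre[k])           # g ← g ⊙ f'(a_k)
        grads_W[k] = np.outer(g, hs[k])   # gradient w.r.t. W_k is g h_{k-1}ᵀ
        grads_b[k] = g.copy()
        g = Ws[k].T @ g                   # gradient w.r.t. the layer below
    return loss, grads_W, grads_b

# Tiny usage example with random weights
rng = np.random.default_rng(0)
Ws = [rng.normal(size=(4, 3)), rng.normal(size=(2, 4))]
bs = [np.zeros(4), np.zeros(2)]
loss, gW, gb = mlp_forward_backward(rng.normal(size=3), np.ones(2), Ws, bs)
print(loss, gW[0].shape, gW[1].shape)  # scalar loss, (4, 3), (2, 4)
```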
Stochastic Gradient Descent (SGD)
■ A recurring problem in ML: large training sets are necessary for good generalization, but large training sets are also more computationally expensive
■ Cost function using cross-entropy:
  ❑ J(θ) = E_{x,y ∼ p̂_data} L(x, y, θ) = (1/m) Σ_{i=1}^m L(x^(i), y^(i), θ), where L is the per-example loss L(x, y, θ) = −log p(y | x; θ)
  ❑ Gradient descent requires computing ∇_θ J(θ) = (1/m) Σ_{i=1}^m ∇_θ L(x^(i), y^(i), θ)
  ❑ The computational cost of gradient descent is O(m), which can take a long time for large m
■ Insight of SGD: the gradient is an expectation, which can be approximately estimated using a small set of samples
Stochastic Gradient Descent (SGD)
■ SGD
  ❑ We sample a minibatch of examples B = {x^(1), …, x^(m′)} drawn uniformly from the training set
  ❑ The minibatch size m′ is typically a small number (1 to a few hundred)
  ❑ m′ is usually held fixed as the training set size m grows
  ❑ The estimate of the gradient is g = (1/m′) Σ_{i=1}^{m′} ∇_θ L(x^(i), y^(i), θ)
  ❑ The gradient descent update is then θ ← θ − εg
■ SGD works well in practice
  ❑ I.e., it often finds a very low value of the cost function quickly enough to be useful
Minibatch Processing
■ SGD (recap)
  ❑ We sample a minibatch of examples B = {x^(1), …, x^(m′)} drawn uniformly from the training set
  ❑ The estimate of the gradient is g = (1/m′) Σ_{i=1}^{m′} ∇_θ L(x^(i), y^(i), θ)
  ❑ The gradient descent update is then θ ← θ − εg
■ SGD using back propagation (see the sketch below)
  ❑ For each instance (x^(i), y^(i)), where i = 1, …, m′, we compute the gradient ∇_θ J^(i) using back propagation, where J^(i) = L(x^(i), y^(i), θ) is the i-th loss
  ❑ The final gradient is g = (1/m′) Σ_{i=1}^{m′} ∇_θ J^(i)
  ❑ Update θ ← θ − εg
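A minimal sketch of this SGD step with per-example gradients averaged over a minibatch; the linear model and squared-error per-example loss are illustrative stand-ins for a network whose per-example gradients would come from back propagation:

```python
import numpy as np

# One SGD step: sample a minibatch B, compute each per-example gradient,
# average them to get g, and update θ ← θ − εg.
def sgd_step(theta, X, Y, m_prime, eps, rng):
    idx = rng.choice(len(X), size=m_prime, replace=False)   # minibatch B
    grads = []
    for i in idx:
        y_hat = theta @ X[i]                     # per-example forward pass
        grads.append(2 * (y_hat - Y[i]) * X[i])  # per-example gradient ∇_θ J^(i)
    g = np.mean(grads, axis=0)                   # g = (1/m′) Σ_i ∇_θ J^(i)
    return theta - eps * g                       # θ ← θ − εg

# Usage: recover the parameters of a noiseless linear model
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
true_theta = np.arange(5.0)
Y = X @ true_theta
theta = np.zeros(5)
for _ in range(2000):
    theta = sgd_step(theta, X, Y, m_prime=32, eps=0.01, rng=rng)
print(np.round(theta, 2))  # approaches [0. 1. 2. 3. 4.]
```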
Example
■ A feedforward neural network with one hidden layer
■ Forward propagation
  ❑ a ← Wx
  ❑ h ← f(a) (elementwise)
  ❑ ŷ ← wᵀh
  ❑ J ← L(ŷ, y) + λΩ(W, w) = (ŷ − y)² + λ(‖W‖_F² + ‖w‖₂²)
Example
■ Back propagation
  ❑ g ← ∇_ŷ J = 2(ŷ − y)
  ❑ ∇_w J ← ∇_w[(ŷ − y)² + λ(‖W‖_F² + ‖w‖₂²)] = gh + 2λw
  ❑ g ← ∇_h J = ∇_h[(ŷ − y)² + λ(‖W‖_F² + ‖w‖₂²)] = gw
  ❑ g ← ∇_a J = g ⊙ f′(a) (elementwise)
  ❑ ∇_W J ← ∇_W[(ŷ − y)² + λ(‖W‖_F² + ‖w‖₂²)] = gxᵀ + 2λW
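A minimal numeric check of the example above, assuming f = tanh (the slides leave f generic); the sizes, random values, and λ are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
n_in, n_hid = 3, 4
W = rng.normal(size=(n_hid, n_in))
w = rng.normal(size=n_hid)
x = rng.normal(size=n_in)
y, lam = 0.7, 0.01
f, f_prime = np.tanh, lambda a: 1 - np.tanh(a) ** 2

# Forward propagation
a = W @ x
h = f(a)
y_hat = w @ h
J = (y_hat - y) ** 2 + lam * (np.sum(W ** 2) + np.sum(w ** 2))

# Back propagation, following the formulas above
g = 2 * (y_hat - y)                      # g ← ∇_ŷ J
grad_w = g * h + 2 * lam * w             # ∇_w J = g h + 2λw
g = g * w                                # g ← ∇_h J = g w
g = g * f_prime(a)                       # g ← ∇_a J = g ⊙ f'(a)
grad_W = np.outer(g, x) + 2 * lam * W    # ∇_W J = g xᵀ + 2λW

# Finite-difference check on one entry of W
eps = 1e-6
Wp = W.copy(); Wp[0, 0] += eps
Jp = (w @ f(Wp @ x) - y) ** 2 + lam * (np.sum(Wp ** 2) + np.sum(w ** 2))
print(np.isclose((Jp - J) / eps, grad_W[0, 0], atol=1e-4))  # True
```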
What You Need to Know
■ Backpropagation is an algorithm to compute partial derivatives efficiently in neural networks
  ❑ Reuse partial derivatives
■ Procedure of backpropagation
■ Use of backpropagation in SGD