Deep learning for Natural Language Processing and Machine Translation
2015.10.16
Seung-Hoon Na
Contents
Introduction: Neural network, deep learning
Deep learning for Natural language processing
Neural network for classification
Word embedding
General architecture for sequential labeling
Recent advances
Neural machine translation
Future plan
Deep learning: Motivation
• Feature extraction in conventional machine learning
Uses handcrafted features
– The feature extraction step is not automated
– Features must be refined continuously
– Performance improvement and tuning are repeatedly required
𝒇(𝐱; 𝐰) = argmax_y 𝐰 ⋅ 𝜳(𝐱, 𝐲), where 𝜳(𝐱, 𝐲) is the hand-crafted feature vector
Deep learning
Removes or simplifies the feature extraction step
Sophisticated features are embedded inside the nonlinear learning process
Multi-layer perceptron
Upper hidden layers model nonlinear abstractions of the outputs of lower hidden layers
A multi-layer NN architecture can thus embed abstract features
This simplifies the feature-tuning procedure
Conventional machine learning: Raw data → Hand-crafted features → Trainable classifier
Deep learning: Raw data → Trainable features (abstraction) → Trainable classifier
Neural Network
A neuron
A general computational unit that takes n inputs and produces a single output
Sigmoid & binary logistic regression unit
The most popular neurons
Takes an n-dimensional input vector 𝑥 and produces the scalar activation (output) 𝑎
a = σ(w^T x + b) = 1 / (1 + exp(−(w^T x + b)))
[Figure: inputs x_1, …, x_n are weighted by w_1, …, w_n, summed together with the bias b, and passed through σ to produce the activation a]
Neural network: Setting for Classification
Input layer: Feature values
Output layer: Scores of labels
Softmax layer: Normalization of output values
– To get probabilities of output labels given the input
– K: the number of labels
[Figure: input layer → hidden layer → output layer y_1, y_2, …, y_{K−1}, y_K → softmax layer]
Softmax: P(i|x) = exp(y_i) / ∑_t exp(y_t)
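As an illustration (not from the slides), a numerically stable softmax over the K output scores can be sketched in Python/NumPy as follows; names and values are illustrative:

```python
import numpy as np

def softmax(y):
    """Normalize output-layer scores y (length K) into label probabilities."""
    y = y - np.max(y)      # subtract the max for numerical stability
    e = np.exp(y)
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1])   # toy output-layer scores for K = 3 labels
print(softmax(scores))               # probabilities summing to 1
```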
Neural network: Training
Training data: Tr = {(x^1, g^1), …, (x^N, g^N)}
x^i: i-th input feature vector
g^i ∈ {1, …, K}: i-th target label
Objective function
Negative log-likelihood (NLL)
L = − ∑_{(x,g)∈Tr} log P(g|x)
Neural network: Training
Stochastic gradient method
1. Randomly sample (x, g) from the training data
2. Define the NLL for (x, g): L = − log P(g|x)
3. Compute the gradients ∂L/∂W for each weight matrix W ∈ θ
4. Update each weight matrix: W ← W − η ∂L/∂W
Iterate the above procedure
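A minimal sketch of this loop, assuming a hypothetical grad_fn(x, g, params) that returns the per-example NLL gradients {name: ∂L/∂W} as derived on the following backpropagation slides:

```python
import numpy as np

def sgd(train_data, params, grad_fn, lr=0.1, num_iters=10000, seed=0):
    """train_data: list of (x, g) pairs; params: dict of weight matrices."""
    rng = np.random.default_rng(seed)
    for _ in range(num_iters):
        x, g = train_data[rng.integers(len(train_data))]   # 1. sample (x, g)
        grads = grad_fn(x, g, params)                       # 2-3. per-example gradients
        for name, dW in grads.items():                      # 4. update each weight matrix
            params[name] -= lr * dW
    return params
```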
Neural network: Backpropagation
Output layer
Compute the delta for the i-th output node
δ_i^o = ∂L/∂y_i = δ(i, g) − exp(y_i)/∑_j exp(y_j) = δ(i, g) − P(i|x)
Vector form: δ^o = 1_g − [P(1|x), …, P(K|x)]^T
where L = log P(g|x) = log( exp(y_g) / ∑_i exp(y_i) ) = y_g − log ∑_i exp(y_i)
[Figure: softmax over the output layer y_1, y_2, …, y_{K−1}, y_K]
Neural network: Backpropagation
Output weight matrix W
Compute the gradient of w_ij
∂L/∂w_ij = (∂L/∂y_i)(∂y_i/∂w_ij) = δ_i^o h_j
Matrix form: ∂L/∂W = δ^o h^T
[Figure: hidden layer h_1, …, h_j, …, h_m (h_j = g(z_j)) connected to the softmax output layer y_1, …, y_K through weights w_ij, with output deltas δ_i^o and L = y_g − log ∑_i exp(y_i)]
Neural network: Backpropagation
Hidden layer
Compute the delta for the j-th hidden node
δ_j^h = ∂L/∂z_j = (∂L/∂h_j)(∂h_j/∂z_j) = (∂h_j/∂z_j) ∑_i (∂L/∂y_i)(∂y_i/∂h_j) = g′(z_j) ∑_i δ_i^o w_ij
Vector form: δ^h = g′(z) ∘ (W^T δ^o)
[Figure: hidden layer h_1, …, h_j, …, h_m (h_j = g(z_j)) connected to the softmax output layer y_1, …, y_K through weights w_ij, with output deltas δ_i^o]
Neural network: Backpropagation
Hidden weight matrix U
Compute the gradient of u_jk
∂L/∂u_jk = (∂L/∂z_j)(∂z_j/∂u_jk) = δ_j^h x_k
Matrix form: ∂L/∂U = δ^h x^T
[Figure: input node x_k connected to hidden node h_j = g(z_j) through weight u_jk, with hidden delta δ_j^h]
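Putting the three backpropagation slides together, a sketch of the full gradient computation for a single example (x, g) could look like this; it assumes tanh hidden units (the slides leave g unspecified), omits bias terms as the slides do, and computes gradients of L = log P(g|x):

```python
import numpy as np

def softmax(y):
    e = np.exp(y - np.max(y))
    return e / e.sum()

def grads_one_example(x, g, U, W):
    """U: hidden weights (m x n), W: output weights (K x m), g: target label index."""
    # Forward pass
    z = U @ x                    # hidden pre-activations z_j
    h = np.tanh(z)               # h_j = g(z_j), assuming g = tanh
    y = W @ h                    # output scores
    p = softmax(y)               # P(i|x)
    # Backward pass for L = log P(g|x)
    delta_o = -p                 # delta^o = 1_g - P(.|x)
    delta_o[g] += 1.0
    dW = np.outer(delta_o, h)    # dL/dW = delta^o h^T
    delta_h = (1.0 - h**2) * (W.T @ delta_o)   # delta^h = g'(z) ∘ (W^T delta^o)
    dU = np.outer(delta_h, x)    # dL/dU = delta^h x^T
    return dW, dU
```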
Deep learning for NLP
Step 1 (unsupervised): learning the word embedding matrix
Raw corpus → NN for word embedding → initializes the lookup table
Step 2 (supervised): application-specific neural network
Annotated corpus → application-specific NN; the lookup table is further fine-tuned
Word embedding:
Distributed representation
Distributed representation
n-dimensional latent vector for a word
Semantically similar words are closely located in vector space
Word embedding matrix:
Lookup Table
L ∈ R^{d×|V|}: word embedding matrix (lookup table); d: embedding dimension, |V|: vocabulary size
e: one-hot vector of a word (|V|-dimensional vector)
The word vector x is obtained from the one-hot vector e by referring to the lookup table: x = L e
[Figure: each column of L is the d-dimensional vector of a word, e.g. "the", "cat", "mat", …]
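In practice the product x = L e is implemented as a column lookup rather than a matrix-vector product; a toy sketch (vocabulary and dimensions are illustrative):

```python
import numpy as np

vocab = {"the": 0, "cat": 1, "mat": 2}      # toy vocabulary, |V| = 3
d = 4                                        # embedding dimension
L = np.random.randn(d, len(vocab)) * 0.01    # lookup table L in R^{d x |V|}

def word_vector(word):
    # Equivalent to x = L e for the one-hot vector e of `word`,
    # but implemented as a simple column lookup.
    return L[:, vocab[word]]

x = word_vector("cat")
```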
Word embedding matrix:
context input layer
Word sequence w_1 ⋯ w_n → input layer
Each of the n−1 context words w_{t−n+1}, …, w_{t−2}, w_{t−1} is mapped by the lookup table to a d-dimensional vector L e_w (e_w: one-hot vector of word w)
The input layer is the concatenation of these vectors: an (n−1)·d-dimensional input vector [L e_{w_{t−n+1}}; …; L e_{w_{t−2}}; L e_{w_{t−1}}]
Neural Probabilistic Language Model (Bengio ’03)
Language models
The probability of a sequence of words
N-gram model
P(w_i|w_1, …, w_{i−1}) ≈ P(w_i|w_{i−(n−1)}, ⋯, w_{i−1})
Neural probabilistic language model
Estimate P(w_i|w_{i−(n−1)}, ⋯, w_{i−1}) with a neural network for classification
x: concatenated input features (input layer)
y = U tanh(d + Hx) (hidden layer only)
y = Wx + b + U tanh(d + Hx) (with direct input-to-output connections)
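A sketch of the NPLM forward computation for one context window, following the equations above (d and b are the hidden and output biases; shapes are illustrative):

```python
import numpy as np

def softmax(v):
    e = np.exp(v - np.max(v))
    return e / e.sum()

def nplm_forward(context_ids, L, H, d, U, W, b):
    """context_ids: indices of the n-1 context words (most recent last);
    L: lookup table (emb_dim x |V|); H, d: hidden weights/bias;
    U: hidden-to-output weights; W, b: direct connections and output bias."""
    x = np.concatenate([L[:, i] for i in context_ids])   # concatenated embeddings
    hidden = np.tanh(d + H @ x)                          # hidden layer
    y = W @ x + b + U @ hidden                           # output scores, one per word
    return softmax(y)                                    # P(w_t | context)
```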
Neural probabilistic language model
[Figure: NPLM architecture. The context words w_{t−n+1}, …, w_{t−2}, w_{t−1} are mapped through the lookup table to the input layer, followed by a hidden layer (weights H), the output layer y_1, y_2, …, y_{|V|} (weights U, with direct connections W from the input layer), and a softmax layer]
Softmax layer: P(w_t | w_{t−n+1}, ⋯, w_{t−1}) = exp(y_{w_t}) / ∑_{w′} exp(y_{w′})
Neural Probabilistic Language Model
[Figure: the same architecture, annotated]
Softmax layer: P(w_t | w_{t−n+1}, ⋯, w_{t−1}) = exp(y_{w_t}) / ∑_{w′} exp(y_{w′}) — normalization to obtain a probability value
Hidden layer: tanh units
Output layer: each node indicates a specific word
The input layer is also directly connected to the output layer
NPLM: Discussion
Limitation: Computational complexity
Softmax layer requires computing scores over all vocabulary words
Vocabulary size is very large
Ranking Approach for Word Embedding
Idea: Sampling a negative example (Collobert & Weston '08)
s: a given sequence of words (in the training data)
s′: a negative example in which the last word of s is replaced with another (sampled) word
f(s): score of the sequence s
Goal: make the score difference f(s) − f(s′) large
Various loss functions are possible
Hinge loss: max(0, 1 − 𝑓 𝒔 + 𝑓 𝒔′ )
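A sketch of one step of this ranking criterion; the scoring network f is left abstract here (it is the network shown on the next slide), and the corruption routine is illustrative:

```python
import numpy as np

def hinge_loss(f_pos, f_neg):
    # max(0, 1 - f(s) + f(s')): zero once the true window outscores
    # the corrupted window by a margin of 1.
    return max(0.0, 1.0 - f_pos + f_neg)

def corrupt_last_word(window, vocab_size, rng):
    # Build the negative example s' by replacing the last word with a random word id.
    neg = list(window)
    neg[-1] = int(rng.integers(vocab_size))
    return neg

rng = np.random.default_rng(0)
window = [3, 17, 42, 8]                                   # word ids of a window s
neg_window = corrupt_last_word(window, vocab_size=10000, rng=rng)
# loss = hinge_loss(f(window), f(neg_window))             # f: scoring network
```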
Ranking Approach for Word Embedding (Collobert and Weston '08)
[Figure: two copies of the network share the lookup table and the hidden weights H. The positive example (w_{t−n+1}, ⋯, w_{t−1}, w_t) and the negative example (w_{t−n+1}, ⋯, w_{t−1}, w_t′), whose last word w_t′ is sampled, are fed through the input layers x and x′]
f(w_{t−n+1}, ⋯, w_{t−1}, w_t) = w^T tanh(Hx + b) + d
f(w_{t−n+1}, ⋯, w_{t−1}, w_t′) = w^T tanh(Hx′ + b) + d
Recurrent Neural Network (RNN) based Language Model
NPLM: conditions on only the (n−1) previous words
P(w_i|w_{i−(n−1)}, ⋯, w_{i−1})
RNNLM: conditions on all previous words
P(w_i|w_1, …, w_{i−1})
Recurrent neural network
Recurrent Neural Network
[Figure: the recurrent network unrolled over time steps t−1, t, t+1, with inputs x_t, hidden states h_t, outputs y_t, and shared weights V (input-to-hidden), W (hidden-to-hidden), U (hidden-to-output)]
z_t = W h_{t−1} + V x_t
h_t = g(z_t)
y_t = U h_t
In short, h_t = f(x_t, h_{t−1})
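A minimal sketch of one recurrent step, matching the equations above (assuming g = tanh):

```python
import numpy as np

def rnn_step(x_t, h_prev, W, V, U):
    z_t = W @ h_prev + V @ x_t     # z_t = W h_{t-1} + V x_t
    h_t = np.tanh(z_t)             # h_t = g(z_t)
    y_t = U @ h_t                  # y_t = U h_t
    return h_t, y_t
```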
Recurrent Neural Network:
Backpropagation Through Time
Objective function: L = ∑_t L(t)
[Figure: the network unrolled from time t to T, with parameters U, V, W shared across time steps]
∂L/∂U = ∑_t δ_t^o h_t^T
∂L/∂W = ∑_t δ_t^h h_{t−1}^T
∂L/∂V = ∑_t δ_t^h x_t^T
δ_t^h = g′(z_{t+1}) ∘ (W^T δ_{t+1}^h) + U^T δ_t^o
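A sketch of BPTT for this network in the standard formulation (deltas taken w.r.t. z_t); the per-step loss is a placeholder squared error purely to keep the example self-contained, since the slides leave L(t) unspecified:

```python
import numpy as np

def bptt(xs, ts, U, V, W):
    """xs: list of input vectors, ts: list of targets; returns dU, dV, dW."""
    T = len(xs)
    h = np.zeros(W.shape[0])
    hs, deltas_o = [], []
    for t in range(T):                       # forward pass, storing hidden states
        h = np.tanh(W @ h + V @ xs[t])
        hs.append(h)
        deltas_o.append(U @ h - ts[t])       # dL(t)/dy_t for a squared-error loss
    dU = np.zeros_like(U); dV = np.zeros_like(V); dW = np.zeros_like(W)
    dh_next = np.zeros_like(h)               # gradient flowing back from step t+1
    for t in reversed(range(T)):             # backward pass through time
        dU += np.outer(deltas_o[t], hs[t])
        dh = U.T @ deltas_o[t] + dh_next     # dL/dh_t
        dz = (1.0 - hs[t] ** 2) * dh         # tanh'(z_t) = 1 - h_t^2
        h_before = hs[t - 1] if t > 0 else np.zeros_like(hs[t])
        dW += np.outer(dz, h_before)
        dV += np.outer(dz, xs[t])
        dh_next = W.T @ dz                   # propagate the delta to step t-1
    return dU, dV, dW
```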
Recurrent Neural Network: BPTT
Objective function: L = ∑_t L(t)
Given a specific time t, compute the gradient of L(t) w.r.t. the parameters:
∂L(t)/∂U = δ_t^o h_t^T
∂L(t)/∂W = δ_{t,t}^h h_{t−1}^T + δ_{t−1,t}^h h_{t−2}^T + ⋯ + δ_{1,t}^h h_0^T
∂L(t)/∂V = δ_{t,t}^h x_t^T + ⋯ + δ_{1,t}^h x_1^T
∂L(t)/∂C(w) = ∑_{t′≤t: w_{t′}=w} δ_{t′,t}^x   (C(w): lookup-table column of word w)
where δ_{t,t}^h = U^T δ_t^o
δ_{t−1,t}^h = g′(z_t) ∘ W^T δ_{t,t}^h
δ_{t−2,t}^h = g′(z_{t−1}) ∘ W^T δ_{t−1,t}^h
[Figure: the network unrolled over time steps t−1, t, t+1 with shared U, V, W and the lookup table]
Recurrent Neural Network: BPTT
Vanishing gradient & gradient explosion problems
∂L(t)/∂W = δ_{t,t}^h h_{t−1}^T + δ_{t−1,t}^h h_{t−2}^T + ⋯ + δ_{1,t}^h h_0^T
δ_{t−1,t}^h = g′(z_t) ∘ W^T δ_{t,t}^h
δ_{t−2,t}^h = g′(z_{t−1}) ∘ W^T δ_{t−1,t}^h = g′(z_{t−1}) ∘ g′(z_t) ∘ W^T W^T δ_{t,t}^h
In general, δ_{k,t}^h involves the products g′(z_k) ∘ ⋯ ∘ g′(z_{t−1}) ∘ g′(z_t) and W^T ⋯ W^T W^T,
which can easily become a very small or a very large number!
Recurrent Neural Network: BPTT
Solutions to the exploding and vanishing gradients
1. Instead of initializing W randomly, start off from an identity matrix initialization
2. Use the Rectified linear units (ReLU) instead of the sigmoid function
The derivative for the ReLU is either 0 or 1
Experiments: NPLM
NPLM
Experiments: RNN LM
Long Short Term Memory (LSTM)
RNN: in practice it is very hard to train long-term dependencies, due to exploding & vanishing gradients
LSTM: makes it easier for RNNs to capture long-term dependencies by using gated units
Traditional LSTM (Hochreiter and Schmidhuber '97)
– Introduces input gate & output gate
– Limitation: The output is close to zero as long as the output gate is closed.
Modern LSTM: Uses forget gate (Gers et al ‘00)
Variants of LSTM
– Add peephole connections (Gers et al ‘02)
• Allow all gates to inspect the current cell state even when the output gate is closed.
Long Short Term Memory (LSTM)
i^(t) = σ(W^(i) x^(t) + U^(i) h^(t−1))   (input gate)
f^(t) = σ(W^(f) x^(t) + U^(f) h^(t−1))   (forget gate)
o^(t) = σ(W^(o) x^(t) + U^(o) h^(t−1))   (output/exposure gate)
c̃^(t) = tanh(W^(c) x^(t) + U^(c) h^(t−1))   (new memory cell)
c^(t) = f^(t) ∘ c^(t−1) + i^(t) ∘ c̃^(t)   (final memory cell)
h^(t) = o^(t) ∘ tanh(c^(t))
In short, h^(t) = f(x^(t), h^(t−1))
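A sketch of one LSTM step following these equations (bias terms omitted, as on the slide; the parameter dictionary p is an illustrative packaging of the weight matrices):

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def lstm_step(x, h_prev, c_prev, p):
    """p holds W_i, U_i, W_f, U_f, W_o, U_o, W_c, U_c."""
    i = sigmoid(p["W_i"] @ x + p["U_i"] @ h_prev)          # input gate
    f = sigmoid(p["W_f"] @ x + p["U_f"] @ h_prev)          # forget gate
    o = sigmoid(p["W_o"] @ x + p["U_o"] @ h_prev)          # output/exposure gate
    c_tilde = np.tanh(p["W_c"] @ x + p["U_c"] @ h_prev)    # new memory cell
    c = f * c_prev + i * c_tilde                           # final memory cell
    h = o * np.tanh(c)                                     # hidden state
    return h, c
```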
Long Short Term Memory (LSTM)
LSTM: Memory cell
Long Short Term Memory (LSTM)
Input gate: whether the input is worth preserving, and thus whether it is used to gate the new memory
Forget gate: whether the past memory cell is useful for computing the current memory cell
Final memory generation: takes the advice of the forget and input gates to produce the final memory
Output/Exposure gate: which parts of the memory need to be exposed in the hidden state
Its purpose is to separate the final memory from the hidden state; the final memory contains a lot of information that does not necessarily need to be kept in the hidden state
Gated Recurrent Units (Cho et al ‘14)
Alternative architecture to handle long-term dependencies
z^(t) = σ(W^(z) x^(t) + U^(z) h^(t−1))   (update gate)
r^(t) = σ(W^(r) x^(t) + U^(r) h^(t−1))   (reset gate)
h̃^(t) = tanh(r^(t) ∘ (U h^(t−1)) + W x^(t))   (new memory)
h^(t) = (1 − z^(t)) ∘ h̃^(t) + z^(t) ∘ h^(t−1)   (hidden state)
In short, h^(t) = f(x^(t), h^(t−1))
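A corresponding sketch of one GRU step (again without biases; p is an illustrative parameter dictionary):

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def gru_step(x, h_prev, p):
    """p holds W_z, U_z, W_r, U_r, W, U."""
    z = sigmoid(p["W_z"] @ x + p["U_z"] @ h_prev)            # update gate
    r = sigmoid(p["W_r"] @ x + p["U_r"] @ h_prev)            # reset gate
    h_tilde = np.tanh(r * (p["U"] @ h_prev) + p["W"] @ x)    # new memory
    return (1.0 - z) * h_tilde + z * h_prev                  # hidden state
```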
Gated Recurrent Units
Gated Recurrent Units
New memory generation: a new memory h̃^(t) is the consolidation of the new input word x^(t) with the past hidden state h^(t−1)
Reset gate: determines how important h^(t−1) is to the summarization h̃^(t); it can completely diminish the past hidden state if it is irrelevant to the computation of the new memory
Update gate: determines how much of h^(t−1) should be carried forward to the next state
Hidden state: the hidden state h^(t) is generated from the past hidden state h^(t−1) and the newly generated memory h̃^(t), with the advice of the update gate
LSTM vs. GRU
Neural Machine Translation:
RNN Encoder-Decoder (Cho et al ‘14)
Computes the log translation probability log P(y|x) with two RNNs
Neural Machine Translation
RNN Encoder-Decoder
→ RNN Search: RNN Encoder-Decoder with attention
→ RNN Search with large target vocabulary
Sequence-to-Sequence Model (LSTM)
→ Rare word model
Neural Machine Translation:
RNN Encoder-Decoder (Cho et al ‘14)
RNN for the encoder: input x → summary vector c
h_t = f(h_{t−1}, x_t)
– Encodes a variable-length sequence into a fixed-length vector representation
– Reads each symbol of the input sequence x sequentially
– After reading the end of the sequence, we obtain a summary c of the whole input sequence: c = h_T
RNN Encoder-Decoder (Cho et al ‘14)
RNN for the decoder: summary vector c → output y
𝒉𝑡 = 𝑓 𝒉𝑡−1, 𝑦𝑡−1, 𝒄
𝑃(𝑦𝑡|𝑦𝑡−1, ⋯ , 𝑦1, 𝒄) = 𝑔 𝒉𝑡, 𝑦𝑡−1, 𝒄
– Decode a given fixed-length vector representation back into a variable-length sequence
– Generate the output sequence by predicting the next symbol 𝑦𝑡 given the previous hidden state
– Both 𝑦𝑡 and 𝒉𝑡 are conditioned on 𝑦𝑡−1
How to define f?
Gated recurrent unit (GRU) is used to capture long-term dependencies
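A high-level sketch of the encoder and of greedy decoding, reusing the gru_step sketch above; embed, output_dist, and the way c is fed to the decoder are simplified placeholders, not the paper's exact parameterization:

```python
import numpy as np

def encode(source_vectors, p_enc, hidden_dim):
    # Read the source symbols sequentially; the final hidden state is the summary c.
    h = np.zeros(hidden_dim)
    for x_t in source_vectors:
        h = gru_step(x_t, h, p_enc)
    return h                                   # c = h_T

def decode_greedy(c, p_dec, embed, output_dist, eos_id, max_len=50):
    # Generate symbols one by one; h_t and y_t are conditioned on y_{t-1} and c.
    # Here c is simply concatenated into the decoder input as a simplification.
    h, y_prev, out = c.copy(), eos_id, []
    for _ in range(max_len):
        x_t = np.concatenate([embed(y_prev), c])
        h = gru_step(x_t, h, p_dec)
        y_prev = int(np.argmax(output_dist(h, y_prev, c)))   # g(h_t, y_{t-1}, c)
        if y_prev == eos_id:
            break
        out.append(y_prev)
    return out
```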
Using RNN Encoder-Decoder
for Statistical Machine Translation
Statistical machine translation (e: source sentence, f: target translation)
Generative model: P(f|e) ∝ p(e|f) P(f)
– p(e|f): translation model, p(f): language model
Log-linear model: log P(f|e) = ∑_n w_n f_n(f, e) + log Z(e)
– In phrase-based SMT
• log P(f|e) is factorized into the translation probabilities of matching phrases in the source and target sentences
Scoring phrase pairs with the RNN Encoder-Decoder
Train the RNN Encoder-Decoder on a table of phrase pairs and use its scores as additional features
– Ignore the normalized frequencies of each phrase pair
• The existing translation probabilities in the table already reflect the frequencies of the phrase pairs
RNN Encoder-Decoder: Experiments
Training
– # hidden units: 1,000; word embedding dimension: 100
– Adadelta & stochastic gradient descent
– Controlled vocabulary setting
• The most frequent 15,000 words for both the source & target languages
– All out-of-vocabulary words are mapped to UNK
English-to-French translation
Neural Machine Translation:
Sequence-to-sequence (Sutskever et al 14)
• <EOS>: the end-of-sentence token
1) An LSTM reads the input sentence x and produces the last hidden state v
2) A standard LSTM-LM computes the probability of y_1, ⋯, y_{T′} given the initial hidden state v
– P(y_1, ⋯, y_{T′} | x_1, ⋯, x_T) = ∏_t P(y_t | v, y_1, ⋯, y_{t−1})
– P(y_t | v, y_1, ⋯, y_{t−1}): computed by a softmax over all vocabulary words
Can the LSTM reconstruct the input sentence?
Can this scheme learn the identity function?
Answer: it can, and it does so effortlessly and perfectly.
Slide from http://www.cs.tau.ac.il/~wolf/deeplearningmeeting/pdfs/Sup+LSTM.pdf
Sequence-to-Sequence Model (Sutskever et al 14)
Two different LSTMs
one for the input sentence
Another for the output sentence
Deep LSTM (with 4 layers)
1,000 cells at each layer, 1,000-dimensional word embeddings
The deep LSTM uses 8,000 numbers to represent a sentence
Reversing the source sentence
…, C′, B′, A′ → A, B, C, …
Backprop can notice the short-term dependencies first, and slowly extend them to long range dependencies
Ensemble of deep LSTMs
LSTMs trained from different initializations
Sequence-to-Sequence Model :
End-to-end Translation Experiments
English-to-French translation
No SMT is used for reference model
Input/Output vocabulary size: 160,000/80,000
Training objective: maximizing ∑_{(S,T)} log P(T|S)
Decoding: Beam search
Sequence-to-Sequence Model : Re-scoring Experiments
Rescoring the 1000-best list of the baseline system
Sequence-to-Sequence Model : Learned representation
2-dimensional PCA projection of the LSTM hidden states
Sequence-to-Sequence Model:
Learned representation
2-dimensional PCA projection of the LSTM hidden states
Sequence-to-Sequence Learning:
Performance vs Sentence Length
Sequence-to-Sequence Learning:
Performance on rare words
RNN Search: Jointly Learning to Align and Translate (Bahdanau et al ‘15)
RNN Search: Extended architecture of RNN Encoder-Decoder
Encodes the input sentence into a seq of vectors using a bidirectional RNN
Context vector c
Previously, the last hidden state (Cho et al ‘14)
Now, a mixture of the hidden states of the input sentence, computed when generating each target word
– During decoding, chooses a subset of these vectors adaptively
RNN Search: Components
p(y_i | y_1, ⋯, y_{i−1}) = g(y_{i−1}, s_i, c_i)
s_i = f(s_{i−1}, y_{i−1}, c_i)   (an RNN hidden state)
c_i = ∑_j α_ij h_j   (context vector)
α_ij = exp(e_ij) / ∑_k exp(e_ik)
e_ij = a(s_{i−1}, h_j)   (alignment model)
RNN Search: Encoder
Bidirectional RNN
RNN Search: Decoder
At each target position: 1) compute the alignment, 2) compute the context, 3) generate the new output, 4) compute the new decoder state
RNN Search: Alignment model
𝑒𝑖𝑗 = 𝑣𝑇 tanh 𝑊𝑠𝑖−1 + 𝑉ℎ𝑗
Directly computes a soft alignment
α_ij = exp(e_ij) / ∑_k exp(e_ik)
Expected annotation
Alignment model is jointly trained with all other components
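A sketch of computing the soft alignment α_ij and the context vector c_i for one decoder position, following the formulas above (matrix shapes are illustrative):

```python
import numpy as np

def attention_context(s_prev, hs, W, V, v):
    """s_prev: previous decoder state s_{i-1}; hs: encoder annotations h_j, shape (T, dim)."""
    e = np.array([v @ np.tanh(W @ s_prev + V @ h_j) for h_j in hs])   # e_ij
    alpha = np.exp(e - np.max(e))
    alpha = alpha / alpha.sum()                                       # softmax over source positions j
    c_i = (alpha[:, None] * hs).sum(axis=0)                           # c_i = sum_j alpha_ij h_j
    return c_i, alpha
```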
RNN Search:
Experiment – English-to-French
Model
RNN Search, 1,000 units
Baseline
RNN Encoder-Decoder, 1000 units
Data
English to French translation, 348 million words
30000 words + UNK token for the networks, all words for Moses
Training
Stochastic gradient descent (SGD) with a minibatch of 80
Decoding
Beam search (Sutskever et al '14)
RNN Search:
Experiment – English-to-French
RNN Search: Large Target Vocabulary ( Jean et al ‘15)
Decoding in RNN search
P(y_t | y_<t, x) = (1/Z) exp( w_t^T φ(y_{t−1}, z_t, c_t) + b_t )
Normalization constant
Z = ∑_{k: y_k∈V} exp( w_k^T φ(y_{t−1}, z_t, c_t) + b_k )
– Computationally inefficient
Idea: Use only small V’ of target vocabulary to approximate Z
Based on importance sampling (Bengio and Senecal, ‘08)
RNN Search: Large Target Vocabulary (Jean et al ‘15)
∇ log p(y_t | y_<t, x) = ∇E(y_t) − ∑_{k: y_k∈V} P(y_k | y_<t, x) ∇E(y_k)
E(y_j) = w_j^T φ(y_{j−1}, z_j, c_j) + b_j   (energy function)
The second term is E_P[∇E(y)], the expected gradient of the energy
Approximation by importance sampling:
∑_{k: y_k∈V′} ( ω_k / ∑_{k′: y_{k′}∈V′} ω_{k′} ) ∇E(y_k)
ω_k = exp{ E(y_k) − log Q(y_k) }   (Q: proposal distribution)
RNN Search LV: Experiments
Setting (Vocabulary size)
RNN Search: 30,000 for En-Fr, 50,000 for En-De
RNN Search LV: 500,000 source and target words
Sequence-to-Sequence Model:
Rare Word Models (Luong et al ‘15)
Extends the LSTM for NMT (Sutskever et al '14)
Translation-specific approach
For each OOV word, we make a pointer to its corresponding word in the source sentence
The pointer information is later utilized in a post- processing step
Directly translates each OOV word using a dictionary,
or with the identity translation if no dictionary translation is found
Rare Word Models (Luong et al ‘15)
Rare words are replaced with UNK tokens
Positional information (a pointer to the corresponding source word) is stored in the target sentence
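A rough sketch of the post-processing idea: each unknown-word token in the output carries a relative pointer to its source word, which is then translated with a dictionary or copied as-is. The token format and helper here are illustrative, not the paper's exact scheme:

```python
def replace_unks(target_tokens, source_tokens, dictionary):
    """Post-process the translation: resolve tokens like '<unk,-1>' whose
    offset points (relative to the current target position) at a source word."""
    out = []
    for pos, tok in enumerate(target_tokens):
        if tok.startswith("<unk"):
            offset = int(tok[len("<unk"):-1].lstrip(","))       # e.g. '<unk,-1>' -> -1
            src_word = source_tokens[pos + offset]
            out.append(dictionary.get(src_word, src_word))       # dictionary or identity copy
        else:
            out.append(tok)
    return out
```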
Rare Word Models: Experiments
Grammar as Translation (Vinyals et al ‘15)
Extended Sequence-to-Sequence Model
Two LSTMs + Attention (Bahdanau et al ‘15)
Linearizing parse trees: depth-first traversal
Grammar as Translation:
Linearizing Parse Trees
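A small sketch of linearizing a parse tree by a depth-first traversal into a bracketed token sequence; the exact token conventions of Vinyals et al (e.g. how tags are normalized) are not reproduced here:

```python
def linearize(tree):
    """tree: (label, children) for internal nodes, or a word string for leaves."""
    if isinstance(tree, str):
        return [tree]
    label, children = tree
    tokens = ["(" + label]
    for child in children:            # depth-first traversal
        tokens.extend(linearize(child))
    tokens.append(")" + label)
    return tokens

# (S (NP John) (VP has (NP a dog)))  ->  "(S (NP John )NP (VP has (NP a dog )NP )VP )S"
example = ("S", [("NP", ["John"]), ("VP", ["has", ("NP", ["a", "dog"])])])
print(" ".join(linearize(example)))
```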
Sequence-to-Sequence Model:
Parsing - Experiment Results
Neural Network Transition-Based Parsing (Weiss et al ‘15)
Structured learning!
NN Transition-Based Parsing: Experiments
Neural Network Graph-Based Parsing (Pei et al ‘15)
Neural Network Graph-Based Parsing (Pei et al ‘15)
NN Graph-Based Parsing: Experiments
Discussion: Research Topics
Neural Machine Translation
DNNs represent words, phrases, and sentences in continuous space
How to utilize more syntactic knowledge?
Recursive recurrent NN (Liu et al ‘14)
A deep architecture may approximately represent the input structure
Neural Dependency Parsing
Neural Language Model
Reference
Mikolov Tomáš, Karafiát Martin, Burget Lukáš, Černocký Jan, Khudanpur Sanjeev: Recurrent neural network based language model, In: Proceedings of the 11th Annual Conference of the International Speech Communication Association (INTERSPEECH 2010)
Yoshua Bengio, Jerome Louradour, Ronan Collobert and Jason Weston, Curriculum Learning, ICML '09
Yoshua Bengio, Réjean Ducharme and Pascal Vincent, A Neural Probabilistic Language Model, NIPS '01
Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, Yoshua Bengio, Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation , 2014
S. Hochreiter, J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735-1780, 1997.
F. A. Gers, N. N. Schraudolph, J. Schmidhuber, Learning Precise Timing with LSTM Recurrent Networks, JMLR, 2002
Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, Yoshua Bengio, Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling, 2014
Ilya Sutskever, Oriol Vinyals, Quoc V. Le, Sequence to Sequence Learning with Neural Networks, NIPS '14
N. Kalchbrenner and Phil Blunsom, Recurrent continuous translation models, EMNLP ‘13
Dzmitry Bahdanau, KyungHyun Cho, Yoshua Bengio, Neural Machine Translation by Jointly Learning to Align and Translate, ICLR ‘15
Minh-Thang Luong, Ilya Sutskever, Quoc V. Le, Oriol Vinyals, Wojciech Zaremba, Addressing the Rare Word Problem in Neural Machine Translation, ACL '15
Sébastien Jean, Kyunghyun Cho, Roland Memisevic, and Yoshua Bengio. 2015. On using very large target vocabulary for neural machine translation. In ACL.
Oriol Vinyals, Lukasz Kaiser, Terry Koo, Slav Petrov, Ilya Sutskever, Geoffrey Hinton, Grammar as a Foreign Language, 15
David Weiss, Chris Alberti, Michael Collins, Slav Petrov, Structured Training for Neural Network Transition-Based Parsing, ACL '15
Wenzhe Pei, Tao Ge, Baobao Chang, An Effective Neural Network Model for Graph-based Dependency Parsing, EMNLP '15
Shujie Liu, Nan Yang, Mu Li, and Ming Zhou, A Recursive Recurrent Neural Network for Statistical Machine Translation, ACL '14
Jiajun Zhang and Chengqing Zong, Deep Neural Networks in Machine Translation: An Overview, IEEE Intelligent Systems ‘15