Vanilla RNN
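- Prone to vanishing/exploding gradients
- Why?
- To find the gradient at $$$h_1$$$, you must apply the chain rule back through every time step, so the gradient is a product of many factors that shrinks toward zero or blows up as the sequence gets longer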
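A minimal sketch of the chain-rule product, assuming a scalar RNN $$$h_t = \tanh(w h_{t-1} + x_t)$$$. The weight `w`, the inputs, the `linear` switch, and the function name `gradient_wrt_h1` are illustrative assumptions, not part of the notes.

```python
import math

def gradient_wrt_h1(w, inputs, h1=0.5, linear=False):
    """Return d h_T / d h_1 for a scalar RNN unrolled over `inputs`.

    With linear=False the step is h_t = tanh(w * h_{t-1} + x_t);
    with linear=True the tanh is dropped (hypothetical variant used
    only to show the exploding case).
    """
    h, grad = h1, 1.0
    for x in inputs:
        pre = w * h + x
        h = pre if linear else math.tanh(pre)
        # Chain rule: one local factor d h_t / d h_{t-1} per time step
        local = w if linear else w * (1.0 - h * h)
        grad *= local
    return grad

short = gradient_wrt_h1(w=0.5, inputs=[0.1] * 5)
vanishing = gradient_wrt_h1(w=0.5, inputs=[0.1] * 50)
exploding = gradient_wrt_h1(w=1.5, inputs=[0.1] * 50, linear=True)
print(short, vanishing, exploding)
```

Each factor is below 1 in the first two calls, so the product shrinks as the sequence grows. In the last call each factor is 1.5, so the product grows instead.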
How to solve "gradient vanishing/exploding"?
- Use an LSTM, which has "gates"
- The "gates" add "previous information" and "current information" together
- Because the cell state is updated by addition, the gradient can flow well across many time steps
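The additive update can be sketched as below, assuming a scalar LSTM cell state $$$c_t = f_t c_{t-1} + i_t g_t$$$. As a simplification, the gate pre-activations are passed in directly as numbers. In a real LSTM they are computed from $$$x_t$$$ and $$$h_{t-1}$$$. The function name `lstm_cell_step` and all values are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_cell_step(c_prev, forget_pre, input_pre, candidate_pre):
    """One scalar cell-state update; returns the new cell state and forget gate."""
    f = sigmoid(forget_pre)        # forget gate: how much previous information to keep
    i = sigmoid(input_pre)         # input gate: how much current information to add
    g = math.tanh(candidate_pre)   # candidate value built from the current step
    c = f * c_prev + i * g         # previous and current information are added
    return c, f

c, path_grad = 0.0, 1.0
for _ in range(50):
    c, f = lstm_cell_step(c, forget_pre=5.0, input_pre=0.0, candidate_pre=0.1)
    # Along the direct cell-state path, d c_t / d c_{t-1} = f_t
    path_grad *= f
print(path_grad)
```

With the forget gate close to 1, the product over 50 steps stays well away from zero. This only tracks the direct path through the cell state and ignores the indirect dependence of the gates on earlier states.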
Vanilla Seq2Seq
- Encoder: encodes the input sentence into a vector
- Decoder: converts the vector into the target sentence
- Encoder and Decoder: both are LSTMs
Weakness of vanilla Seq2Seq
- The longer the input sentence, the deeper the (unrolled) encoder
- The encoder must compress "too much information" into a single vector
- The encoder therefore causes "loss of information"
- Attention mechanism: the decoder pays attention to the most meaningful parts of the input
- Bidirectional network: reads the input sentence from L to R and from R to L
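A rough sketch of the bidirectional idea, assuming a plain scalar tanh RNN in place of an LSTM to keep it short. The weights and the names `rnn_states` and `bidirectional_states` are hypothetical.

```python
import math

def rnn_states(xs, w_h, w_x=1.0):
    """Run a scalar tanh RNN over xs and return the state at every position."""
    h, states = 0.0, []
    for x in xs:
        h = math.tanh(w_h * h + w_x * x)
        states.append(h)
    return states

def bidirectional_states(xs):
    """Pair each position's left-to-right state with its right-to-left state."""
    forward = rnn_states(xs, w_h=0.5)               # reads L to R
    backward = rnn_states(xs[::-1], w_h=0.3)[::-1]  # reads R to L, realigned to original order
    return list(zip(forward, backward))

states = bidirectional_states([1.0, 0.0, -1.0])
print(states)
```

Each position ends up with a pair of states: one summarizing everything to its left and one summarizing everything to its right.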
BiLSTM + LSTM cell in decoder + attention
Why the BiLSTM model performs well
- End-to-end learning: when reducing the loss wrt the predicted output,
all trainable params are updated at the same time
- Uses distributed representations:
Relationships between "words" and "phrases" are encoded in the word vectors
- Uses LSTM and attention:
even with long sentences, performance doesn't degrade
Stanford Attentive Reader (2016)
- It finds the "answer" to a "given question" in the context of a text
- It uses BiLSTM with attention
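$$$q$$$: question vector, $$$p_i$$$: vector of the $$$i$$$-th word in text_paragraph, $$$W$$$: learned weight matrix
$$$\alpha_i$$$: attention score of the $$$i$$$-th word in text_paragraph
$$$\alpha_i = \text{softmax}_i(q^T W p_i)$$$
Output vector $$$o$$$: attention-weighted sum of the word vectors
$$$o = \alpha_1 p_1 + \alpha_2 p_2 + \cdots$$$
$$$= \sum\limits_{i} \alpha_i p_i$$$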
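The attention formulas above can be sketched in plain Python as follows. The toy vectors, the identity matrix used for $$$W$$$, and the names `bilinear_attention` and `o` (for the weighted sum $$$\sum_i \alpha_i p_i$$$) are assumptions made for illustration.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def bilinear_attention(q, W, P):
    """q: question vector, W: weight matrix (len(q) rows), P: list of word vectors."""
    # q^T W, computed once and reused for every word
    qW = [sum(q[r] * W[r][c] for r in range(len(q))) for c in range(len(W[0]))]
    # Score of word i: q^T W p_i
    scores = [sum(a * b for a, b in zip(qW, p)) for p in P]
    alpha = softmax(scores)
    # Weighted sum of word vectors: sum_i alpha_i * p_i
    o = [sum(a * p[k] for a, p in zip(alpha, P)) for k in range(len(P[0]))]
    return alpha, o

q = [1.0, 0.0]
W = [[1.0, 0.0], [0.0, 1.0]]
P = [[1.0, 0.0], [0.0, 1.0], [2.0, 0.0]]
alpha, o = bilinear_attention(q, W, P)
print(alpha, o)
```

In this toy input the third word vector is most aligned with the question, so it receives the largest attention score and dominates the weighted sum.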
Stanford Attentive Reader (2017)
- Same as Stanford Attentive Reader (2016)
- Added part: extracting the relevant "paragraph" from the long text
Copy augmented S2S (2017)
- Copy the words that have a high attention score $$$\alpha$$$
- Paste the copied words into the "decoder" output
Example: "Italian" has a high attention score
"Italian" is pasted into decoder_sentence
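A simplified sketch of the copy step, assuming a hard rule: copy the source word with the highest attention score when that score passes a threshold. The threshold, the example sentence, the scores, and the name `copy_step` are all hypothetical. The actual model may instead score source words jointly with the vocabulary rather than use a fixed cutoff.

```python
def copy_step(source_tokens, attention, vocab_token, copy_threshold=0.5):
    """Return the copied source word if attention is high enough, else the vocabulary word."""
    best = max(range(len(source_tokens)), key=lambda i: attention[i])
    if attention[best] >= copy_threshold:
        return source_tokens[best]   # paste the source word into the decoder output
    return vocab_token               # otherwise fall back to the decoder's own prediction

source = ["find", "me", "an", "Italian", "restaurant"]
attention = [0.05, 0.05, 0.05, 0.80, 0.05]
decoder_word = copy_step(source, attention, vocab_token="<unk>")
print(decoder_word)
```

Here the attention score of "Italian" passes the assumed threshold, so it is the word pasted into the decoder output.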
Tree based models (2015)
Dozat & Manning (2017) apply "BiLSTM with attention" to dependency parsing