================================================================================
https://ratsgo.github.io/natural%20language%20processing/2017/10/22/manning/
/mnt/1T-5e7/mycodehtml/NLP/Bi_LSTM/Ratsgo/main.html
================================================================================
Vanilla RNN
- Suffers from gradient vanishing/exploding
- Why?
- To find the gradient at $$$h_1$$$, you have to apply the chain rule back through every time step
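A minimal sketch of why, assuming a standard tanh RNN cell $$$h_t = \tanh(W_{hh} h_{t-1} + W_{xh} x_t)$$$ (the weight names are just the usual notation, not from the original post):
$$$\dfrac{\partial L}{\partial h_1} = \dfrac{\partial L}{\partial h_T} \prod\limits_{t=2}^{T} \dfrac{\partial h_t}{\partial h_{t-1}}$$$
Every factor contains $$$W_{hh}$$$ and a tanh derivative, so over a long sequence the product tends to shrink toward 0 (vanishing) or blow up (exploding).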
================================================================================
How to solve "gradient vanishing/exploding"?
- LSTM, which has "gates"
- The "gates" blend "previous information" (the cell state) and "current information" (the new input) additively
- Then the gradient can flow well
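For reference, the standard LSTM cell equations (the usual textbook notation, not taken from the lecture slides):
$$$f_t = \sigma(W_f [h_{t-1}, x_t] + b_f)$$$ (forget gate)
$$$i_t = \sigma(W_i [h_{t-1}, x_t] + b_i)$$$ (input gate)
$$$\tilde{c}_t = \tanh(W_c [h_{t-1}, x_t] + b_c)$$$ (candidate cell)
$$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$$$ (additive cell-state update)
$$$o_t = \sigma(W_o [h_{t-1}, x_t] + b_o)$$$, $$$h_t = o_t \odot \tanh(c_t)$$$
Because $$$c_t$$$ is updated additively, the gradient has a path that is not squashed at every step, which is what lets it flow well.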
================================================================================
Vanilla Seq2Seq
- Encoder: compresses the input sentence into a fixed-size vector
- Decoder: converts that vector into the target sentence
- Both encoder and decoder: LSTM
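A minimal PyTorch sketch of this encoder-decoder setup (the class, sizes, and dummy inputs are my own illustration, not the lecture's code):

# Vanilla Seq2Seq sketch: the whole source sentence is squeezed into one (h, c) state
import torch
import torch.nn as nn

class VanillaSeq2Seq(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, emb_dim=64, hid_dim=128):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, emb_dim)
        self.tgt_emb = nn.Embedding(tgt_vocab, emb_dim)
        self.encoder = nn.LSTM(emb_dim, hid_dim, batch_first=True)
        self.decoder = nn.LSTM(emb_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, tgt_vocab)

    def forward(self, src_ids, tgt_ids):
        # Encoder: compress the source sentence into the final (h, c) state
        _, (h, c) = self.encoder(self.src_emb(src_ids))
        # Decoder: generate the target conditioned only on that fixed-size state
        dec_out, _ = self.decoder(self.tgt_emb(tgt_ids), (h, c))
        return self.out(dec_out)            # (batch, tgt_len, tgt_vocab) logits

model = VanillaSeq2Seq(src_vocab=1000, tgt_vocab=1000)
logits = model(torch.randint(0, 1000, (2, 7)), torch.randint(0, 1000, (2, 5)))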
================================================================================
Weakness of vanilla Seq2Seq
- Longer input sentence
- Deeper encoder
- The encoder has to compress "too much information" into a single vector
- So the encoder causes "loss of information"
- The attention mechanism fixes this: the decoder pays attention to the most meaningful parts of the input at each step
================================================================================
- Bidirectional network: reads the input sentence from L to R and from R to L, and combines both views
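A small PyTorch illustration (sizes are arbitrary): each position's representation is the concatenation of the left-to-right and right-to-left hidden states.

# Bidirectional LSTM: forward and backward hidden states are concatenated per position
import torch
import torch.nn as nn

bilstm = nn.LSTM(input_size=64, hidden_size=128, batch_first=True, bidirectional=True)
x = torch.randn(2, 10, 64)      # (batch, seq_len, emb_dim), dummy input
out, _ = bilstm(x)              # out: (2, 10, 256) = 128 (L-to-R) + 128 (R-to-L)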
================================================================================
Architecture: BiLSTM encoder + LSTM cell in the decoder + attention
================================================================================
The reason that Bi-LSTM has good performance
- End-to-end learning: when reducing the loss w.r.t. the predicted output,
  all trainable params are updated at the same time
- Uses distributed representations:
  the relationship between a "word" and a "phrase" is encoded into the word vectors
- Uses LSTM and attention:
  even with long sentences, performance doesn't degrade
================================================================================
Stanford Attentive Reader (2016)
- It finds the "answer" to a "given question" within the context of a text passage
- It uses a BiLSTM with attention
================================================================================
p_vector = Bidirectional_encoder1(text_paragraph)
q_vector = Bidirectional_encoder2(question)
$$$\alpha_i$$$: attention score of the i-th word in text_paragraph
$$$\alpha_i = \mathrm{softmax}_i(q^\top W p_i)$$$
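A small numpy sketch of this bilinear attention score (the shapes and variable names are my own illustration):

# alpha_i = softmax_i(q^T W p_i)
import numpy as np

T, d = 6, 4                        # 6 paragraph words, hidden size 4 (made-up sizes)
P = np.random.randn(T, d)          # p_i: BiLSTM output for each paragraph word
q = np.random.randn(d)             # question vector
W = np.random.randn(d, d)          # trainable bilinear weight

scores = P @ W.T @ q               # scores[i] = q^T W p_i, shape (T,)
alpha = np.exp(scores - scores.max())
alpha /= alpha.sum()               # softmax over the T paragraph positions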
================================================================================
output_vector_o = Decoder(alpha, p)
$$$o = \alpha_1 p_1 + \alpha_2 p_2 + \cdots = \sum\limits_{i} \alpha_i p_i$$$
================================================================================
loss_val = loss_func(output_vector_o, gt)
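Continuing the sketch: the attended output and a cross-entropy-style loss (the answer-scoring matrix W_a and the candidate set are my assumptions about how o is turned into a prediction):

# o = sum_i alpha_i p_i, then a loss against the gold answer index
import numpy as np

T, d, n_candidates = 6, 4, 3
P = np.random.randn(T, d)                  # paragraph word vectors p_i
alpha = np.random.dirichlet(np.ones(T))    # attention weights from the previous step
gt = 1                                     # index of the gold answer candidate

o = alpha @ P                              # weighted sum, shape (d,)
W_a = np.random.randn(n_candidates, d)     # hypothetical answer-scoring layer
logits = W_a @ o
probs = np.exp(logits - logits.max())
probs /= probs.sum()
loss_val = -np.log(probs[gt])              # cross-entropy against the gold answer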
================================================================================
Stanford Attentive Reader (2017)
- Same as Stanford Attentive Reader (2016)
- Added part: a step that first extracts the relevant "paragraph" from the long text
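One simple way to realize that added extraction step is TF-IDF retrieval; this is my own illustration (scikit-learn), not necessarily the retriever the 2017 system actually uses:

# Pick the paragraph most similar to the question by TF-IDF cosine similarity
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

paragraphs = ["Pizza originated in Italy.",
              "The Great Wall is in China.",
              "Sushi is a Japanese dish."]
question = "Which country does pizza come from?"

vec = TfidfVectorizer()
M = vec.fit_transform(paragraphs + [question])   # last row is the question
sims = cosine_similarity(M[-1], M[:-1])          # similarity to each paragraph
best_paragraph = paragraphs[sims.argmax()]       # "Pizza originated in Italy."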
================================================================================
Copy-augmented Seq2Seq (2017)
- Copy words which have a high attention score $$$\alpha$$$
- Paste the copied words into the "decoder" output
out_vec_enco, words_with_high_attention_scores = Copy_augmented_S2S_2017_Encoder(question)
out_sentence = Copy_augmented_S2S_2017_Decoder(out_vec_enco, words_with_high_attention_scores)
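A sketch of one common way to implement the copy step (a pointer/generator-style mixture; the gate p_gen and the mixing scheme are my illustration, not necessarily the exact 2017 formulation):

# Mix the decoder's vocabulary distribution with a "copy from source" distribution driven by attention
import numpy as np

vocab = ["<unk>", "what", "is", "the", "italian", "dish"]
src_tokens = ["italian", "dish"]             # source words the decoder may copy
src_ids = [vocab.index(w) for w in src_tokens]

vocab_dist = np.random.dirichlet(np.ones(len(vocab)))   # generation distribution
attn = np.array([0.9, 0.1])                             # attention over source words
p_gen = 0.3                                             # low p_gen => prefer copying

final_dist = p_gen * vocab_dist
for i, idx in enumerate(src_ids):
    final_dist[idx] += (1 - p_gen) * attn[i]            # add copy probability mass

next_word = vocab[final_dist.argmax()]                  # "italian" gets copied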
================================================================================
Example
- In the question, "Italian" has a high attention score
- So "Italian" is copied and pasted into the decoder's output sentence
================================================================================
Tree-based models (2015)
- (Sequential) LSTM: composes the sentence left to right
- Tree-based model: composes the sentence along its tree structure (e.g., a parse tree); see the sketch below
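A numpy sketch of one concrete 2015 tree-based model, the Child-Sum Tree-LSTM node update (weight names follow that formulation; biases are omitted and values are random, for illustration only):

# Child-Sum Tree-LSTM: compose a node from its input and its children's (h, c) states
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

d = 4                                                   # hidden size
rng = np.random.default_rng(0)
W = {g: rng.standard_normal((d, d)) for g in "ifou"}    # input -> gate weights
U = {g: rng.standard_normal((d, d)) for g in "ifou"}    # child hidden -> gate weights

def tree_lstm_node(x, child_h, child_c):
    h_sum = sum(child_h) if child_h else np.zeros(d)
    i = sigmoid(W["i"] @ x + U["i"] @ h_sum)            # input gate
    o = sigmoid(W["o"] @ x + U["o"] @ h_sum)            # output gate
    u = np.tanh(W["u"] @ x + U["u"] @ h_sum)            # candidate cell
    c = i * u
    # one forget gate per child, so each child's memory is kept or dropped separately
    for hk, ck in zip(child_h, child_c):
        f_k = sigmoid(W["f"] @ x + U["f"] @ hk)
        c = c + f_k * ck
    h = o * np.tanh(c)
    return h, c

# two leaf nodes, then their parent composed over the tree
h1, c1 = tree_lstm_node(rng.standard_normal(d), [], [])
h2, c2 = tree_lstm_node(rng.standard_normal(d), [], [])
h_root, c_root = tree_lstm_node(rng.standard_normal(d), [h1, h2], [c1, c2])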
================================================================================
Dozat & Manning (2017) apply "BiLSTM with attention" to dependency parsing
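Their parser scores head-dependent arcs with biaffine attention on top of BiLSTM states; a numpy sketch of that scoring step (the sizes and variable names are my own illustration):

# Biaffine arc scoring: score[i, j] = how likely word j is the head of word i
import numpy as np

n, d = 5, 8                                    # 5 words, hidden size 8
rng = np.random.default_rng(0)
H_dep = rng.standard_normal((n, d))            # per-word "dependent" representations
H_head = rng.standard_normal((n, d))           # per-word "head" representations
U = rng.standard_normal((d, d))                # bilinear weight
u = rng.standard_normal(d)                     # bias term favoring certain heads

scores = H_dep @ U @ H_head.T + H_head @ u     # (n, n); the head-bias vector is added to every row
pred_heads = scores.argmax(axis=1)             # greedy head choice for each word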