https://www.slideshare.net/deepseaswjh/rnn-bert
================================================================================
RNN
- $$$x_0, x_1, x_2, \cdots$$$: input tokens like I, go, to, ...
- Two arrows leave each cell marked by "A": the outputs (one goes up to the output layer, one feeds the next cell)
- Meaning
- The output from the previous cell continuously affects the next step of training
================================================================================
There are 2 trainable weight matrices: W (for the recurrent connection) and U (for the input layer)
https://medium.com/deep-math-machine-learning-ai/chapter-10-deepnlp-recurrent-neural-networks-with-math-c4a6846a50a2
================================================================================
$$$a^{(t)} = b+W h^{(t-1)} + Ux^{(t)}$$$
$$$a^{(t)}$$$: pre-activation at time t (the cell state before the activation function)
$$$b$$$: bias
$$$W$$$: trainable parameter for recurrent layers
$$$h^{(t-1)}$$$: output from previous cell
$$$x^{(t)}$$$: input data at time t
$$$U$$$: trainable parameter for input layers
$$$h^{(t)}=\tanh(a^{(t)})$$$
$$$\tanh()$$$: activation function
$$$h^{(t)}$$$: output from the cells
$$$o^{(t)}=c+Vh^{(t)}$$$
$$$o^{(t)}$$$: output in the last output layer
$$$c$$$: bias of the output layer
$$$V$$$: trainable parameter for the output layer
$$$y^{(t)}=\text{softmax}(o^{(t)})$$$: output in the last output layer after softmax
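Below is a minimal sketch of one RNN time step in NumPy that follows the equations above; the dimensions (input_dim=4, hidden_dim=3, vocab=5) and the random initialization are assumptions chosen only for illustration.

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

input_dim, hidden_dim, vocab = 4, 3, 5          # toy sizes (assumption)
rng = np.random.default_rng(0)
U = rng.normal(size=(hidden_dim, input_dim))    # input weights U
W = rng.normal(size=(hidden_dim, hidden_dim))   # recurrent weights W
V = rng.normal(size=(vocab, hidden_dim))        # output weights V
b, c = np.zeros(hidden_dim), np.zeros(vocab)    # biases b, c

h_prev = np.zeros(hidden_dim)                   # h^(t-1), output of the previous cell
x_t = rng.normal(size=input_dim)                # x^(t), current input

a_t = b + W @ h_prev + U @ x_t                  # a^(t) = b + W h^(t-1) + U x^(t)
h_t = np.tanh(a_t)                              # h^(t) = tanh(a^(t))
o_t = c + V @ h_t                               # o^(t) = c + V h^(t)
y_t = softmax(o_t)                              # y^(t) = softmax(o^(t))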
================================================================================
Various RNN structures
- Red: input structure
- Green: hidden structure
- Blue: output structure
================================================================================
LSTM (Long Short-Term Memory)
- In a plain RNN, the information from early time steps fades (becomes vague) as the sequence gets longer
================================================================================
LSTM keeps the "important data" in its memory until the end of the sequence
================================================================================
- LSTM is a "specially designed RNN cell"
- LSTM has 8 trainable weight matrices (4 for the recurrent path + 4 for the input path: one pair each for the forget, input, and output gates plus the candidate cell state), as in the sketch below
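A minimal sketch of one LSTM step in NumPy showing those 8 weight matrices; all dimensions and values are assumptions for illustration.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

d_in, d_h = 4, 3                                        # toy sizes (assumption)
rng = np.random.default_rng(0)
W = {k: rng.normal(size=(d_h, d_h)) for k in "fiog"}    # 4 recurrent weight matrices
U = {k: rng.normal(size=(d_h, d_in)) for k in "fiog"}   # 4 input weight matrices
b = {k: np.zeros(d_h) for k in "fiog"}                  # biases

h_prev, c_prev = np.zeros(d_h), np.zeros(d_h)           # previous hidden state and cell state
x_t = rng.normal(size=d_in)                             # current input

f = sigmoid(W["f"] @ h_prev + U["f"] @ x_t + b["f"])    # forget ("removing") gate
i = sigmoid(W["i"] @ h_prev + U["i"] @ x_t + b["i"])    # input gate
o = sigmoid(W["o"] @ h_prev + U["o"] @ x_t + b["o"])    # output gate
g = np.tanh(W["g"] @ h_prev + U["g"] @ x_t + b["g"])    # candidate cell state

c_t = f * c_prev + i * g     # cell state carries the "important data" across time steps
h_t = o * np.tanh(c_t)       # hidden state passed to the next cell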
================================================================================
Weakness of RNN
- RNN considers only the "previous state" + the "current state"
- RNN doesn't consider the "entire context" during training
================================================================================
Seq2Seq to overcome weakness of RNN
- Seq2Seq = Encoder_LSTM + Decoder_LSTM
================================================================================
Structure of Seq2Seq
- Step 1: the encoder processes the "input sentence" and creates a "feature vector" (context vector)
- Step 2: the decoder processes the "feature vector" and creates the "output sentence" (see the sketch below)
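A minimal PyTorch sketch of this encoder-decoder structure; all sizes, the variable names, and teacher forcing in the decoder are assumptions, not the exact model from the slides.

import torch
import torch.nn as nn

vocab, emb, hid = 100, 32, 64                   # toy sizes (assumption)

class Seq2Seq(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb)
        self.encoder = nn.LSTM(emb, hid, batch_first=True)
        self.decoder = nn.LSTM(emb, hid, batch_first=True)
        self.out = nn.Linear(hid, vocab)

    def forward(self, src, tgt):
        # Step 1: the encoder compresses the input sentence into (h, c),
        # the "feature vector" handed to the decoder.
        _, (h, c) = self.encoder(self.embed(src))
        # Step 2: the decoder generates the output sentence from that state
        # (teacher forcing: the known target tokens are fed in during training).
        dec_out, _ = self.decoder(self.embed(tgt), (h, c))
        return self.out(dec_out)                # logits over the vocabulary

model = Seq2Seq()
src = torch.randint(0, vocab, (1, 7))           # input sentence (token ids)
tgt = torch.randint(0, vocab, (1, 5))           # output sentence tokens
logits = model(src, tgt)                        # shape: (1, 5, vocab)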
================================================================================
Weakness of LSTM-driven Seq2Seq
- Its capability is limited: it only has "gates" which adjust the flow of information
- "Gates": "forget (removing) gate", "input gate", "output gate"
- Due to this limitation of the gates,
  the LSTM-Seq2Seq model becomes confused as the input sentence gets longer
================================================================================
Attention
The attention module focuses on the important parts of the input sentence
The "important parts" are passed directly to the "decoder"
================================================================================
- When the decoder creates output words,
  the "attention layer" decides what information the "decoder" should take
================================================================================
Transformer
- LSTM is not used
- The "Encoder-Decoder" model is built from attention modules
================================================================================
Transformer does "self-attention"
================================================================================
How "transformer" does the "self-attention"?
- Use Scaled Dot-Product Attention module
- Each word is converted into Q,K,V
- Perform: (each word $$$Q$$$) $$$\cdot$$$ (other word $$$K$$$)
- SoftMax: select important word which transformer will pay attention to
- (import word vector after softmaxt) $$$\cdot$$$ (original word information $$$V$$$)
================================================================================
- Illustration of how the "Scaled Dot-Product Attention module" works
- Input: 2 words
- Embedding: a feature vector for each of the 2 words
- Create Q, K, V vectors from the 2 feature vectors
- Score: $$$Q \cdot K$$$
- Divide by 8: the "scale" step ($$$8=\sqrt{d_k}$$$ with $$$d_k=64$$$)
- Softmax: 2 softmaxed numbers for the 2 words
  It turns out "Thinking" is the important word, so the transformer will pay attention to "Thinking" (see the numeric sketch below)
================================================================================
Encoder and Decoder structure in Transformer
- Multiple encoders
- Multiple decoders
- Fully connected layer and Softmax layer
- Entire view
================================================================================
BERT (Bidirectional Encoder Representations from Transformers)
- A deep-learning language model which uses pretraining to acquire general language understanding
- BERT is composed of Transformer (encoder) blocks
================================================================================
Transfer learning (or fine-tuning)
- Dataset: ImageNet
- Randomly initialize the weights in the neural network
- Train the trainable parameters of the network
  by using the ImageNet dataset (1000 classes)
- Save the trained parameters into a file
- Prepare new data which has more classes
- Append a new layer which has the new number of output classes
- Load the trained parameters and fill the loaded values into the network
- Perform training steps over the new dataset (see the sketch below)
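A minimal sketch of this recipe with PyTorch/torchvision, assuming a recent torchvision; the model choice (ResNet-18), the new class count, and the data loader name new_loader are assumptions for illustration.

import torch
import torch.nn as nn
from torchvision import models

model = models.resnet18(weights="IMAGENET1K_V1")   # load ImageNet-pretrained parameters
num_new_classes = 20                               # assumed class count of the new task
model.fc = nn.Linear(model.fc.in_features, num_new_classes)   # new output layer

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

# Training step over the new dataset
# (new_loader is a hypothetical DataLoader over the new data):
# for images, labels in new_loader:
#     loss = criterion(model(images), labels)
#     optimizer.zero_grad(); loss.backward(); optimizer.step()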
================================================================================
Transfer-learning NLP models in 2018
================================================================================
Structure of BERT
- Use only "encoder network" from "transformer"
- BERT Base: 12 Transformer (encoder) blocks
- BERT Large: 24 Transformer (encoder) blocks
================================================================================
Pretrain BERT to give it general language understanding
Dataset
- BooksCorpus: 800 million words
- English Wikipedia: 2,500 million words
================================================================================
How to pretrain the BERT network?
Method 1: mask some words (masked language model)
* Details
- Input sentence with special tokens like CLS, SEP
- Randomly mask 15% of the words of the input sentence
  (e.g. the word "improvisation" is replaced with the "MASK" token)
- Max length of input data: 512 tokens
- Get an output vector from the BERT network for each of the (up to 512) input positions
- Pass the output vector at each masked position into an FFNN+Softmax layer
  to create one probability value per word in the vocabulary
  (the vocabulary covers English words from Aardvark to Zyzzyva)
- If the masked word gets a high probability value,
  the BERT model predicted the "masked word" correctly,
  which means the BERT model has acquired a general understanding of the language
  (see the sketch below)
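A minimal sketch of the 15% masking rule; the real BERT recipe also replaces some chosen tokens with random words or keeps them unchanged, that refinement is omitted here, and the token list is a toy assumption.

import random

tokens = ["CLS", "the", "man", "went", "to", "the", "store", "SEP"]   # toy input
maskable = [i for i, t in enumerate(tokens) if t not in ("CLS", "SEP")]

random.seed(0)
chosen = random.sample(maskable, k=max(1, int(0.15 * len(maskable))))  # mask ~15% of the words

labels = [None] * len(tokens)            # only masked positions contribute to the loss
for i in chosen:
    labels[i] = tokens[i]                # the model must predict the original word
    tokens[i] = "MASK"

print(tokens)                            # one of the words is now "MASK"
print(labels)                            # the hidden original word at that position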
================================================================================
How to pretrain the BERT network?
Method 2: predict the next sentence
* Details
- Input data: 2 sentences (sentence A + sentence B)
  The input has CLS (starting position), MASK (masked words), and SEP (separator between the 2 sentences) tokens
- Perform tokenization on the input sentences
- Max length of the input is 512 tokens
- Run the BERT network
- Get the output vector at the CLS position
- Pass the output vector into an FFNN+Softmax layer
- Predict: are the given 2 sentences "consecutive sentences"? (see the sketch below)
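A minimal sketch of building one next-sentence-prediction example: 50% of the time sentence B really follows A ("IsNext"), otherwise B is a random other sentence ("NotNext"); the toy corpus is an assumption.

import random

corpus = [                                     # toy corpus (assumption)
    "the man went to the store .",
    "he bought a gallon of milk .",
    "penguins are flightless birds .",
]

def make_nsp_example(idx, rng=random):
    sent_a = corpus[idx]
    if rng.random() < 0.5 and idx + 1 < len(corpus):
        sent_b, label = corpus[idx + 1], "IsNext"          # B really follows A
    else:
        others = [s for j, s in enumerate(corpus) if j != idx + 1]
        sent_b, label = rng.choice(others), "NotNext"      # B is a random other sentence
    text = "CLS " + sent_a + " SEP " + sent_b + " SEP"
    return text, label

print(make_nsp_example(0))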
================================================================================
Embedding input data for BERT
Embedding_for_BERT = token_embedding + segment_embedding + position_embedding
* Details
- Raw input data before "BERT embedding"
- CLS: starting position, or class
- SEP: to separate the 2 sentences
- Apply "token embedding" to the input tokens
- Apply "segment embedding" (sentence A vs sentence B)
- Apply "position embedding" (position of each token); the three embeddings are summed, as in the sketch below
================================================================================
Use transfer learning on the pretrained BERT network
- Classify 2 sentences
  - CLS token
  - Sentence 1
  - SEP token
  - Sentence 2
- Preprocess the input data
- Pass the input data into the BERT network
- Get the class from C (the output vector at the CLS position), as in the sketch below
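A minimal sketch of sentence-pair classification with the Hugging Face transformers library, assuming it is installed; the checkpoint name, the example sentences, and the label count are assumptions, and the classification head still needs fine-tuning on labeled pairs before the prediction is meaningful.

import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# The tokenizer builds "[CLS] sentence1 [SEP] sentence2 [SEP]" automatically.
inputs = tokenizer("The cat sat on the mat.", "A cat is on a mat.", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits      # class scores taken from the [CLS] output (C)
pred = logits.argmax(dim=-1)             # predicted class for the sentence pair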
================================================================================
- Q and A model
  - CLS token
  - Sentence as the question
  - SEP token
  - Paragraph containing the answer
- Preprocess the input data
- Pass the input data into the BERT network (see the sketch below)
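A minimal sketch of the question-answering setup with the same transformers library; the checkpoint name and example text are assumptions, and without fine-tuning on SQuAD the predicted span is meaningless.

import torch
from transformers import BertTokenizer, BertForQuestionAnswering

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForQuestionAnswering.from_pretrained("bert-base-uncased")

question = "Where did the man go?"
paragraph = "The man went to the store and bought a gallon of milk."
inputs = tokenizer(question, paragraph, return_tensors="pt")   # [CLS] question [SEP] paragraph [SEP]

with torch.no_grad():
    out = model(**inputs)
start = out.start_logits.argmax().item()     # predicted start of the answer span
end = out.end_logits.argmax().item()         # predicted end of the answer span
answer = tokenizer.decode(inputs["input_ids"][0][start:end + 1])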
================================================================================
- Classify "one given sentence"
================================================================================
- POS tagging on a single sentence
================================================================================
SQuAD: Stanford Question Answering Dataset
- Given paragraph
- Question
- Ground-truth answer which BERT should produce
================================================================================
BERT works better than other models
================================================================================