- $$$x_0, x_1, x_2, \cdots$$$: input tokens like I, go, to, ...
- 2 arrows from each cell marked "A": outputs
- Meaning
- The output of the previous cell is fed into the next cell, so earlier tokens keep affecting later steps
There are 2 trainable weight matrices: W (for recurrent layers) and U (for input layers)
$$$a^{(t)} = b+W h^{(t-1)} + Ux^{(t)}$$$
$$$a^{(t)}$$$: pre-activation of the cell at time t (before tanh is applied)
$$$b$$$: bias
$$$W$$$: trainable parameter for recurrent layers
$$$h^{(t-1)}$$$: output from previous cell
$$$x^{(t)}$$$: input data at time t
$$$U$$$: trainable parameter for input layers
$$$tanh()$$$: activation function
$$$h^{(t)} = tanh(a^{(t)})$$$: output from the cell at time t
$$$o^{(t)}$$$: output of the last output layer, before softmax
$$$y^{(t)}=softmax(o^{(t)})$$$: output in the last output layer after softmax
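The definitions above can be traced in a few lines of code. The following is a minimal sketch in plain Python, not a reference implementation. It assumes an output layer of the form $$$o^{(t)} = c + V h^{(t)}$$$; the names `V` and `c`, the toy sizes and all weight values are assumptions made for illustration and do not appear in the notes.

```python
import math

def matvec(M, v):
    """Multiply matrix M (a list of rows) by vector v."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]

def softmax(z):
    """Numerically stable softmax over a list of scores."""
    m = max(z)
    exps = [math.exp(s - m) for s in z]
    total = sum(exps)
    return [e / total for e in exps]

def rnn_step(x_t, h_prev, U, W, b, V, c):
    """One time step of the plain RNN cell described above."""
    # a^(t) = b + W h^(t-1) + U x^(t)
    a_t = [b_i + wh_i + ux_i
           for b_i, wh_i, ux_i in zip(b, matvec(W, h_prev), matvec(U, x_t))]
    # h^(t) = tanh(a^(t))
    h_t = [math.tanh(a_i) for a_i in a_t]
    # o^(t) = c + V h^(t)   (V and c are assumed output-layer parameters)
    o_t = [c_i + vh_i for c_i, vh_i in zip(c, matvec(V, h_t))]
    # y^(t) = softmax(o^(t))
    return h_t, softmax(o_t)

# Toy sizes (assumed): 2-dim input, 3-dim hidden state, 2 output classes.
U = [[0.1, 0.2], [0.0, -0.1], [0.3, 0.1]]
W = [[0.5, 0.0, 0.0], [0.0, 0.5, 0.0], [0.0, 0.0, 0.5]]
b = [0.0, 0.0, 0.0]
V = [[1.0, 0.0, -1.0], [0.0, 1.0, 0.0]]
c = [0.0, 0.0]

h = [0.0, 0.0, 0.0]          # initial hidden state
outputs = []
for x in [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]:   # x^(0), x^(1), x^(2)
    h, y = rnn_step(x, h, U, W, b, V, c)          # h carries over to the next step
    outputs.append(y)
```

The same `W` and `U` are reused at every time step, which is why only 2 weight matrices are trained for the recurrent part.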
Various RNN structures
- Red: input structure
- Green: hidden structure
- Blue: output structure
LSTM (Long Short-Term Memory)
- In a plain RNN, information from early inputs becomes vague as the sequence goes on
LSTM memorizes "important data" up to the end of the sequence
- LSTM means "specially designed RNN cell"
- LSTM has 8 trainable weight matrices (4 for recurrent layers + 4 for input layers)
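To make the "4 recurrent + 4 input" count concrete, here is a small sketch of one LSTM step in plain Python. It follows the commonly used LSTM formulation with a forget gate, an input gate, a candidate value and an output gate; the gate names, the bias terms, the toy sizes and the constant weight values are assumptions for illustration only.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def affine(W, h, U, x, b):
    """Compute b + W h + U x, where W and U are lists of rows."""
    return [b_i
            + sum(w * h_j for w, h_j in zip(W_row, h))
            + sum(u * x_j for u, x_j in zip(U_row, x))
            for W_row, U_row, b_i in zip(W, U, b)]

def gate(params, name, h_prev, x_t, activation):
    W, U, b = params[name]
    return [activation(z) for z in affine(W, h_prev, U, x_t, b)]

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM step: returns the new hidden state and the new cell state."""
    f = gate(params, "forget", h_prev, x_t, sigmoid)        # what to erase
    i = gate(params, "input", h_prev, x_t, sigmoid)         # what to write
    g = gate(params, "candidate", h_prev, x_t, math.tanh)   # new content
    o = gate(params, "output", h_prev, x_t, sigmoid)        # what to expose
    c_t = [f_k * c_k + i_k * g_k for f_k, c_k, i_k, g_k in zip(f, c_prev, i, g)]
    h_t = [o_k * math.tanh(c_k) for o_k, c_k in zip(o, c_t)]
    return h_t, c_t

def const_matrix(rows, cols, value):
    return [[value] * cols for _ in range(rows)]

hidden_size, input_size = 2, 3   # toy sizes (assumed)
params = {
    name: (const_matrix(hidden_size, hidden_size, 0.1),   # W: recurrent weights
           const_matrix(hidden_size, input_size, 0.2),    # U: input weights
           [0.0] * hidden_size)                           # b: bias
    for name in ("forget", "input", "candidate", "output")
}
n_weight_matrices = 2 * len(params)   # 4 recurrent + 4 input

h, c = [0.0] * hidden_size, [0.0] * hidden_size
for x in [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]:
    h, c = lstm_step(x, h, c, params)
```

The cell state `c` is the part that lets the cell carry "important data" across many steps, because the forget gate can leave it almost unchanged.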
Weakness of RNN
- RNN considers only the "previous state" + the "current input"
- RNN doesn't consider "entire context" in the training
Seq2Seq to overcome weakness of RNN
- Seq2Seq = Encoder_LSTM + Decoder_LSTM
Structure of Seq2Seq
- Step1: process "input sentence" by encoder, creating "feature vector"
- Step2: process "feature vector" by decoder, creating "output sentence"
Weakness of LSTM-driven Seq2Seq
- "Gates", which only adjust the flow of information, are not capable enough on their own
- "Gates": "forget gate", "input gate", "output gate"
- Due to the above limitation of gates,
if the input sentence becomes longer, LSTM-Seq2Seq becomes confused
The attention module focuses on the important parts of the sentence
The "important part" is passed directly to the "decoder"
- When the decoder creates output words,
the "attention layer" decides what information the "decoder" should take
- LSTM is not used
- The attention module consists of an "Encoder-Decoder" model
Transformer does "self-attention"
How does the "transformer" do "self-attention"?
- Use the Scaled Dot-Product Attention module
- Each word is converted into Q, K, V (query, key, value) vectors
- Perform: (each word $$$Q$$$) $$$\cdot$$$ (other word $$$K$$$)
- Softmax: select the important words which the transformer will pay attention to
- (importance weights after softmax) $$$\cdot$$$ (original word information $$$V$$$)
- Illustration of how "Scaled Dot-Product Attention module" works
- Input: 2 words
- Embedding: feature vectors from the 2 words
- Create Q, K, V vectors from the 2 feature vectors
- Score: $$$Q \cdot K$$$
- Divide by 8: this is the "scale" step, where 8 = $$$\sqrt{d_k}$$$ for key vectors of dimension $$$d_k = 64$$$
- Softmax: 2 softmaxed numbers, one per word
It turns out "Thinking" is the important word, so the transformer will pay attention to "Thinking"
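The illustration above can be reproduced numerically. The sketch below implements scaled dot-product attention for a single query in plain Python. The vectors are made-up values with dimension 64, chosen so that the raw scores come out as 112 and 96; the names `k_other`, `v_thinking` and `v_other` are hypothetical and stand for the second word and the value vectors.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def softmax(z):
    m = max(z)
    exps = [math.exp(s - m) for s in z]
    total = sum(exps)
    return [e / total for e in exps]

def attention(q, keys, values):
    """Scaled dot-product attention for one query vector."""
    d_k = len(q)
    # Score: Q . K, then divide by sqrt(d_k) (8 when d_k = 64)
    scores = [dot(q, k) / math.sqrt(d_k) for k in keys]
    # Softmax turns the scores into attention weights that sum to 1
    weights = softmax(scores)
    # Weighted sum of the value vectors V
    output = [sum(w * v[j] for w, v in zip(weights, values))
              for j in range(len(values[0]))]
    return weights, output

d_k = 64
q_thinking = [1.0] * d_k
k_thinking = [1.75] * d_k    # q . k = 112  ->  112 / 8 = 14
k_other = [1.5] * d_k        # q . k = 96   ->   96 / 8 = 12
v_thinking = [1.0, 0.0]      # tiny value vectors, easy to read in the output
v_other = [0.0, 1.0]

weights, z = attention(q_thinking, [k_thinking, k_other], [v_thinking, v_other])
# weights is roughly [0.88, 0.12]: most attention goes to "Thinking" itself
```

Without the division by 8 the two scores would differ by 16 instead of 2, and the softmax would put almost all of the weight on one word.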
Encoder and Decoder structure in Transformer
- Multiple encoders
- Multiple decoders
- Fully connected layer and Softmax layer
- Entire view
BERT (Bidirectional Encoder Representations from Transformers)
- Deep learning language model which uses pretraining to have general language understanding
- BERT is composed of transformers
Transfer learning (or fine tuning)
- Dataset: ImageNet
- Randomly initialize the weights of the neural network
- The trainable parameters of the network are trained
by using the ImageNet dataset, which has 1000 classes
- Save the trained parameters to a file
- Prepare new data which has more classes
- Append a new layer which has more output classes
- Load the trained parameters and fill the loaded parameter values into the network
- Perform the training step over the new dataset
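The save / load / replace-the-last-layer steps above can be sketched without any deep learning framework. The example below is only an illustration under assumptions: the network is reduced to two weight matrices stored in a dict, the layer names `features` and `head`, the JSON file format and the class counts (3 and 5) are all hypothetical, and the actual training loops are left out.

```python
import json
import os
import random
import tempfile

def init_layer(n_out, n_in, rng):
    """Randomly initialised weight matrix (a list of rows)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]

rng = random.Random(0)

# 1) Source network: a feature layer plus a head for the original classes.
source = {"features": init_layer(4, 8, rng), "head": init_layer(3, 4, rng)}
# ... training on the source dataset would update `source` here ...

with tempfile.TemporaryDirectory() as tmp:
    # 2) Save the trained parameters to a file.
    path = os.path.join(tmp, "pretrained.json")
    with open(path, "w") as f:
        json.dump(source, f)
    # 3) Later: load the saved parameters again.
    with open(path) as f:
        loaded = json.load(f)

# 4) New network: same feature layer shape, new head with more output classes.
target = {"features": init_layer(4, 8, rng), "head": init_layer(5, 4, rng)}

# 5) Fill in every loaded layer except the old head, which no longer fits.
for name, weights in loaded.items():
    if name != "head":
        target[name] = weights

# 6) Fine-tuning would now train `target` on the new dataset.
```

Only the new head starts from random values; the reused feature layer is what carries the knowledge over from the first dataset.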
Transfer learning NLP models in 2018
Structure of BERT
- Use only "encoder network" from "transformer"
- BERT Base: 12 transformer blocks
- BERT Large: 24 transformer blocks
Pretrain BERT to give it general language understanding
- BooksCorpus: 800 million words
- Wikipedia: 2,500 million words
How to pretrain BERT network?
Method1: mask some words
* Details
- Input sentence with special tokens like CLS, SEP
- Randomly mask 15% of the words of the input sentence
- e.g. the word "improvisation" is replaced with the "MASK" token
- Max length of input data: 512 tokens
- Create output vectors from the BERT network (one vector per token position, up to 512 positions)
- Pass the output vector of the masked position into a FFNN+Softmax layer
to create probability values (one probability per vocabulary entry; BERT's WordPiece vocabulary has about 30,000 tokens, not 2,500 million)
- The list in the figure stands for all words in the vocabulary,
from Aardvark to Zyzzyva
- If the masked word gets a high probability value,
it means the BERT model could predict the "masked word" correctly,
and it means the BERT model has learned a general language model
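The masking step can be sketched as follows. This is a simplified illustration, not BERT's exact recipe: it always substitutes the mask token, whereas the original BERT procedure replaces a chosen word with the mask token only 80% of the time (10% random word, 10% unchanged). The function name `mask_tokens`, the fixed seed and the example sentence are assumptions.

```python
import random

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Replace about mask_prob of the ordinary tokens with [MASK].

    Returns the masked sequence and a dict {position: original token}
    holding the words the model has to predict.
    """
    rng = random.Random(seed)
    # Special tokens are never masked.
    candidates = [i for i, t in enumerate(tokens) if t not in ("[CLS]", "[SEP]")]
    n_mask = max(1, round(len(candidates) * mask_prob))
    positions = rng.sample(candidates, n_mask)

    masked = list(tokens)
    targets = {}
    for i in positions:
        targets[i] = tokens[i]     # ground truth for the prediction task
        masked[i] = "[MASK]"
    return masked, targets

tokens = ["[CLS]", "let's", "stick", "to", "improvisation",
          "in", "this", "skit", "[SEP]"]
masked, targets = mask_tokens(tokens)
```

During pretraining, the loss is computed only at the positions stored in `targets`, by comparing the softmax output at each masked position with the original word.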
How to pretrain BERT network?
Method2: predict next sentence
* Details
- Input data: 2 sentences (sentence A + sentence B)
Input data has CLS (starting position), MASK (masked words), SEP (separation between the 2 sentences) tokens
- Perform tokenization on the input sentences
- Max length of the input is 512 tokens
- Use BERT network
- Get output vectors (one per token position, up to 512 positions)
- Pass the output vector of the CLS position into a FFNN+Softmax layer
- Are the given 2 sentences "consecutive sentences"? (yes/no classification)
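A training example for this task could be built roughly as below. The sketch assumes the usual setup in which sentence B is the true next sentence half of the time and a random other sentence otherwise. Whitespace splitting stands in for the real tokenizer, and the function name `make_nsp_example` and the sample sentences are hypothetical.

```python
import random

def make_nsp_example(sentences, index, rng):
    """Build one next-sentence-prediction example from a list of sentences."""
    sent_a = sentences[index]
    if rng.random() < 0.5:
        # Positive example: B really follows A.
        sent_b, is_next = sentences[index + 1], True
    else:
        # Negative example: B is some other sentence from the corpus.
        choices = [s for j, s in enumerate(sentences) if j not in (index, index + 1)]
        sent_b, is_next = rng.choice(choices), False
    # [CLS] sentence A [SEP] sentence B [SEP]
    tokens = ["[CLS]"] + sent_a.split() + ["[SEP]"] + sent_b.split() + ["[SEP]"]
    return tokens, is_next

sentences = [
    "the man went to the store",
    "he bought a gallon of milk",
    "penguins are flightless birds",
    "the sky is blue today",
]
rng = random.Random(0)
tokens, is_next = make_nsp_example(sentences, 0, rng)
```

The label `is_next` is what the FFNN+Softmax layer on top of the CLS output is trained to predict.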
Embedding input data for BERT
Embedding_for_BERT = token_embedding + segment_embedding + position_embedding
* Details
- Raw input data before "BERT embedding"
- CLS: starting position, or class
- SEP: to separate the 2 sentences
- Apply "token embedding" into input data
- Apply "segment embedding"
- Apply "position embedding"
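The sum of the three embeddings can be written out directly. In the sketch below the lookup tables hold small made-up numbers instead of learned values, and the embedding size of 4 is a toy choice; the names `token_table`, `segment_table` and `position_table` and the example sentence pair are assumptions for illustration.

```python
def embed(tokens, segment_ids, token_table, segment_table, position_table):
    """Sum token, segment and position embeddings for each input position."""
    embedded = []
    for pos, (tok, seg) in enumerate(zip(tokens, segment_ids)):
        t = token_table[tok]         # which word is it
        s = segment_table[seg]       # sentence A (0) or sentence B (1)
        p = position_table[pos]      # where in the input it sits
        embedded.append([a + b + c for a, b, c in zip(t, s, p)])
    return embedded

tokens = ["[CLS]", "my", "dog", "[SEP]", "he", "likes", "[SEP]"]
segment_ids = [0, 0, 0, 0, 1, 1, 1]

dim = 4                                           # toy embedding size
vocab = sorted(set(tokens))
token_table = {tok: [float(i)] * dim for i, tok in enumerate(vocab)}
segment_table = {0: [0.0] * dim, 1: [100.0] * dim}
position_table = [[pos / 10.0] * dim for pos in range(len(tokens))]

vectors = embed(tokens, segment_ids, token_table, segment_table, position_table)
```

Because the three parts are added element by element, all three tables must use the same embedding size.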
Use transfer learning on pretrained BERT network
- Classify 2 sentences
- CLS: token
- Sentence1
- SEP: token
- Sentence2
- Preprocess input data
- Pass input data into BERT network
- Get the class from C (the output vector at the CLS position)
- Q and A model
- CLS: token
- Sentence as question
- SEP: token
- Sentence as answer paragraph
- Preprocess input data
- Pass input data into BERT network
- Classify "one given sentence"
- POS tagging on single sentence
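For the sentence-pair classification case, the step of getting the class from the CLS output can be sketched as one linear layer followed by softmax. The vector `C`, the weights `W`, the bias `b` and the two-class setup below are made-up toy values; in real fine-tuning `C` comes out of the pretrained BERT network and `W` is learned on the new task.

```python
import math

def softmax(z):
    m = max(z)
    exps = [math.exp(s - m) for s in z]
    total = sum(exps)
    return [e / total for e in exps]

def classify(C, W, b):
    """Class probabilities from C, the output vector at the [CLS] position."""
    logits = [b_k + sum(w * c for w, c in zip(row, C)) for row, b_k in zip(W, b)]
    return softmax(logits)

C = [0.2, -0.1, 0.4, 0.0]            # toy 4-dim stand-in for the BERT output
W = [[1.0, 0.0, 1.0, 0.0],           # one row of weights per class
     [0.0, 1.0, 0.0, 1.0]]
b = [0.0, 0.0]

probs = classify(C, W, b)
predicted = max(range(len(probs)), key=probs.__getitem__)   # index of the top class
```

Only this small layer is new when fine-tuning; the rest of the network starts from the pretrained weights.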
SQuAD: Stanford Question Answering Dataset
Given paragraph
Ground truth answer which BERT should produce
BERT works better than other models