https://www.slideshare.net/deepseaswjh/rnn-bert
================================================================================
RNN
- $$$x_0, x_1, x_2, \cdots$$$: input tokens like I, go, to, ...
- The 2 arrows leaving each cell marked "A": its outputs
- Meaning: the output of the previous cell continuously affects the following steps of training
================================================================================
There are 2 trainable weight matrices: $$$W$$$ (for the recurrent connection) and $$$U$$$ (for the input)
https://medium.com/deep-math-machine-learning-ai/chapter-10-deepnlp-recurrent-neural-networks-with-math-c4a6846a50a2
================================================================================
$$$a^{(t)} = b + Wh^{(t-1)} + Ux^{(t)}$$$
- $$$a^{(t)}$$$: pre-activation at time t
- $$$b$$$: bias
- $$$W$$$: trainable weight matrix for the recurrent connection
- $$$h^{(t-1)}$$$: hidden state (output) of the previous cell
- $$$x^{(t)}$$$: input data at time t
- $$$U$$$: trainable weight matrix for the input
$$$h^{(t)} = \tanh(a^{(t)})$$$
- $$$\tanh()$$$: activation function
- $$$h^{(t)}$$$: output (hidden state) of the cell
$$$o^{(t)} = c + Vh^{(t)}$$$
- $$$o^{(t)}$$$: output of the final output layer
- $$$c$$$: bias of the output layer
$$$y^{(t)} = \text{softmax}(o^{(t)})$$$
- $$$y^{(t)}$$$: output of the final output layer after softmax
================================================================================
Various RNN structures
- Red: input structure
- Green: hidden structure
- Blue: output structure
================================================================================
LSTM (Long Short-Term Memory)
- In a plain RNN, the influence of earlier inputs fades away as training moves along the sequence
================================================================================
LSTM keeps the "important data" in its cell state up to the end of the sequence
================================================================================
- LSTM means a "specially designed RNN cell"
- LSTM has 8 trainable weight matrices (4 for the recurrent connections + 4 for the inputs: one pair per gate plus one pair for the candidate cell state)
================================================================================
Weakness of RNN
- An RNN considers only the "previous state" + "current input"
- An RNN does not consider the "entire context" during training
================================================================================
Seq2Seq to overcome the weakness of RNN
- Seq2Seq = Encoder_LSTM + Decoder_LSTM
================================================================================
Structure of Seq2Seq
- Step 1: the encoder processes the "input sentence", creating a "feature vector"
- Step 2: the decoder processes the "feature vector", creating the "output sentence"
================================================================================
Weakness of LSTM-driven Seq2Seq
- Its capacity is limited: information flow is adjusted only by "gates"
- "Gates": "forget gate", "input gate", "output gate"
- Because of this limitation of the gates, when the input sentence gets longer, LSTM-Seq2Seq starts to lose track of the content
================================================================================
Attention
- The attention module focuses on the important parts of the sentence
- The "important parts" are passed directly to the "decoder"
================================================================================
- When the decoder creates output words, the "attention layer" decides what information the "decoder" should take
================================================================================
Transformer
- LSTM is not used
- It is an "encoder-decoder" model built from attention modules
================================================================================
Transformer does "self-attention"
================================================================================
How does the "transformer" do "self-attention"?
- It uses the Scaled Dot-Product Attention module
- Each word is converted into Q, K, V vectors
- Compute: (each word's $$$Q$$$) $$$\cdot$$$ (the other words' $$$K$$$)
- Softmax: select the important words that the transformer will pay attention to
- (softmaxed importance weights) $$$\cdot$$$ (original word information $$$V$$$)
================================================================================
Illustration of how the "Scaled Dot-Product Attention module" works
- Input: 2 words
- Embedding: a feature vector for each of the 2 words
- Create Q, K, V vectors from the 2 feature vectors
- Score: $$$Q \cdot K$$$
- Divide by 8: the "Scale" step ($$$\sqrt{d_k}$$$ with $$$d_k = 64$$$ in the original Transformer)
- Softmax: 2 softmaxed numbers, one per word
- It turns out "Thinking" is the important word, so the transformer will pay attention to "Thinking"
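================================================================================
To make the scaled dot-product attention above concrete, here is a minimal numpy sketch (for illustration only, not code from the slides): a single attention head, toy 4-dimensional vectors for 2 words, and the learned projection matrices that normally produce Q, K, V are omitted.

```python
# Toy single-head scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # "Score", then "Divide by sqrt(d_k)"
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax over the words
    return weights @ V, weights                       # weighted sum of the V vectors

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(2, 4)) for _ in range(3))  # 2 words, e.g. "Thinking", "Machines"
output, attn = scaled_dot_product_attention(Q, K, V)
print(attn)   # row i: how strongly word i attends to each of the 2 words
```

Dividing by $$$\sqrt{d_k}$$$ keeps the dot products from growing with the vector size, which would otherwise push the softmax into a very peaked distribution.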
================================================================================
Encoder and Decoder structure in Transformer
- Multiple encoders
- Multiple decoders
- Fully connected layer and softmax layer
- Entire view
================================================================================
BERT (Bidirectional Encoder Representations from Transformers)
- A deep-learning language model which uses pretraining to acquire general language understanding
- BERT is composed of transformer blocks
================================================================================
Transfer learning (or fine-tuning)
- Dataset: ImageNet
- Randomly initialize the weights of the neural network
- Train the trainable parameters of the network on the ImageNet dataset, which has 100 classes
- Save the trained parameters to a file
- Prepare new data which has more classes
- Append a new layer which has more output classes
- Load the trained parameters and fill the loaded values into the network
- Perform the training step over the new dataset
================================================================================
Transfer-learning NLP models in 2018
================================================================================
Structure of BERT
- Uses only the "encoder network" from the "transformer"
- BERT Base: 12 transformer (encoder) blocks
- BERT Large: 24 transformer (encoder) blocks
================================================================================
Pretrain BERT to give it general language understanding
Dataset
- BooksCorpus: 800 million words
- English Wikipedia: 2,500 million words
================================================================================
How to pretrain the BERT network?
Method 1: mask some words
* Details
- Input sentence with special tokens like CLS and SEP
- Randomly mask 15% of the words of the input sentence
- In the illustration, the word "improvisation" is replaced with the MASK token
- Max length of the input: 512 tokens
- The BERT network creates an output vector for each of the (up to 512) input positions
- Pass the output vector at the masked position into an FFNN+Softmax layer to create probability values, one per vocabulary word (illustrated as all English words from Aardvark to Zyzzyva)
- If the masked word gets a high probability value, the BERT model predicted the "masked word" correctly, which means it has learned a general language model
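================================================================================
A toy sketch of the 15% masking step in Method 1 (the helper function and token list here are made up for illustration; real BERT pretraining also replaces some of the selected tokens with random words or leaves them unchanged rather than always inserting MASK):

```python
import random

def mask_tokens(tokens, mask_rate=0.15, mask_token="[MASK]"):
    """Return the masked token list plus {position: original word} to predict."""
    masked, targets = list(tokens), {}
    for i, tok in enumerate(tokens):
        if tok in ("[CLS]", "[SEP]"):
            continue                      # never mask the special tokens
        if random.random() < mask_rate:
            targets[i] = tok              # label the FFNN+Softmax head must recover
            masked[i] = mask_token
    return masked, targets

tokens = ["[CLS]", "the", "man", "went", "to", "the", "store", "[SEP]"]
print(mask_tokens(tokens))
```

During pretraining, only the positions stored in `targets` contribute to the masked-language-model loss.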
================================================================================
How to pretrain the BERT network?
Method 2: predict the next sentence
* Details
- Input data: 2 sentences (sentence A + sentence B)
- The input data has CLS (starting position), MASK (masked words), and SEP (separator between the 2 sentences) tokens
- Perform tokenization on the input sentences
- Max length of the input is 512 tokens
- Use the BERT network
- Get the output vectors (one per input position); the vector at the CLS position is used
- Pass it into an FFNN+Softmax layer
- Are the 2 given sentences a "continuous" (consecutive) pair?
================================================================================
Embedding input data for BERT
Embedding_for_BERT = token_embedding + segment_embedding + position_embedding
* Details
- Raw input data before the "BERT embedding"
- CLS: starting position, or class
- SEP: separates the 2 sentences
- Apply "token embedding" to the input data
- Apply "segment embedding"
- Apply "position embedding"
================================================================================
Use transfer learning on the pretrained BERT network
- Classify 2 sentences
- CLS: token
- Sentence 1
- SEP: token
- Sentence 2
- Preprocess the input data
- Pass the input data into the BERT network
- Get the class from C (the output vector at the CLS position)
================================================================================
- Q and A model
- CLS: token
- Sentence as the question
- SEP: token
- Sentence as the answer paragraph
- Preprocess the input data
- Pass the input data into the BERT network
================================================================================
- Classify "one given sentence"
================================================================================
- POS tagging on a single sentence
================================================================================
SQuAD: Stanford Question Answering Dataset
- Given paragraph
- Question
- Ground-truth answer which BERT should produce
================================================================================
BERT works better than the other models
================================================================================
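The following is a minimal sketch of the "classify 2 sentences" fine-tuning setup described above, using the Hugging Face transformers library (an assumption, not part of the slides; it requires PyTorch, transformers >= 4.x, and downloading the pretrained weights). The classification head on top of C is randomly initialized, so the printed probabilities only become meaningful after fine-tuning on labeled sentence pairs.

```python
# Sentence-pair classification with a pretrained BERT encoder (sketch).
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased",
                                                      num_labels=2)

# The tokenizer builds "[CLS] sentence A [SEP] sentence B [SEP]" and the
# matching segment ids (token_type_ids) automatically.
inputs = tokenizer("The weather is nice.", "Let's go for a walk.",
                   return_tensors="pt", truncation=True, max_length=512)

with torch.no_grad():
    logits = model(**inputs).logits      # shape (1, num_labels), computed from the C ([CLS]) vector
print(torch.softmax(logits, dim=-1))     # class probabilities (head untrained, so not yet meaningful)
```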