https://ai3040.tistory.com/2
================================================================================
Embedding at the word level
- Word2Vec: words with similar meanings are located near each other in the vector space (see the sketch after this list)
- FastText: uses subword (character n-gram) information, which makes training faster and handles rare words better
Embedding at the sentence (contextual) level
- CoVe: the animal "bat" and the baseball "bat" are distinguished through sentence context
- ELMo: contextual embeddings from a bidirectional language model
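A minimal sketch of the word-level idea, assuming the gensim library (not mentioned in the post): train a tiny Word2Vec model on a toy corpus and look up a word's nearest neighbors in the learned vector space.

```python
from gensim.models import Word2Vec

# toy corpus, pre-tokenized sentences
corpus = [
    ["the", "bat", "flew", "out", "of", "the", "cave"],
    ["the", "player", "swung", "the", "bat", "at", "the", "ball"],
    ["the", "ball", "was", "hit", "by", "the", "bat"],
]
model = Word2Vec(corpus, vector_size=32, window=3, min_count=1, epochs=50)
print(model.wv.most_similar("bat", topn=3))   # nearest words in the learned vector space

# Note: plain Word2Vec gives "bat" a single vector, so the animal sense and the
# baseball sense are not separated; that is what contextual embeddings fix.
```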
================================================================================
Transformer (Google, 2017)
- Encoder driven by self-attention
- Overcomes the limitations of the Seq2Seq model (fixed-size context vector, strictly sequential processing)
================================================================================
Self-attention (see the sketch after this list)
- Learns trainable weight matrices that project the input into Q, K, V
- Computes attention scores from Q and K
- Applies the scores to V to produce the output Z
- Uses many (Q, K, V) pairs in parallel (multi-head attention)
- Uses masks
- Encoder mask: optional, masks some input tokens (this is how BERT masks tokens)
- Decoder mask: hides future tokens so the decoder predicts the next token only from what it has already generated (not used in BERT)
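A minimal NumPy sketch of single-head self-attention following the steps above; the weight matrices W_q, W_k, W_v and the toy shapes are illustrative, not taken from the post.

```python
import numpy as np

def self_attention(X, W_q, W_k, W_v, mask=None):
    """Scaled dot-product self-attention over a sequence X of shape (seq_len, d_model)."""
    Q = X @ W_q                      # queries
    K = X @ W_k                      # keys
    V = X @ W_v                      # values
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                # attention scores from Q and K
    if mask is not None:
        scores = np.where(mask, scores, -1e9)      # masked positions get ~zero weight
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax over the keys
    Z = weights @ V                  # apply the scores to V to get Z
    return Z, weights

# toy example: 4 tokens, d_model = 8, d_k = d_v = 4
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
W_q, W_k, W_v = (rng.normal(size=(8, 4)) for _ in range(3))
Z, A = self_attention(X, W_q, W_k, W_v)
print(Z.shape, A.shape)   # (4, 4) (4, 4)
```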
================================================================================
Effect of self-attention
- The model captures how each word relates to every other word in the sequence
================================================================================
Transformer structure (one encoder block, sketched below)
input --is passed into--> self-attention module --> output_from_self_attention
output_from_self_attention --is passed into--> feed-forward network
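A sketch of one encoder block in PyTorch (an assumed choice of framework) showing the self-attention then feed-forward flow; the residual connections and layer normalization are standard transformer details not spelled out in the notes.

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """One transformer encoder block: self-attention, then a feed-forward network,
    each wrapped in a residual connection and layer normalization."""
    def __init__(self, d_model=64, n_heads=4, d_ff=256):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        attn_out, _ = self.attn(x, x, x)   # self-attention: Q = K = V come from x
        x = self.norm1(x + attn_out)       # residual connection + layer norm
        x = self.norm2(x + self.ff(x))     # feed-forward, residual + layer norm
        return x

x = torch.randn(2, 10, 64)        # (batch, seq_len, d_model)
print(EncoderBlock()(x).shape)    # torch.Size([2, 10, 64])
```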
================================================================================
BERT
How to use BERT
- Feature-based usage
  The pretrained model already produces optimized feature vectors
  You use those feature vectors as input to your own task-specific model
- Fine-tuning-based usage
  You load the pretrained model
  You retrain (fine-tune) it on your task's data
================================================================================
Structure of BERT
BERT Base: 12 stacked transformer encoder layers
BERT Large: 24 stacked transformer encoder layers
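For reference, the standard published sizes of the two models; the hidden size, head count, and parameter figures come from the BERT paper, not from these notes.

```python
# Published hyperparameters of the two BERT sizes ("layers" = stacked transformer encoder blocks).
BERT_CONFIGS = {
    "BERT-Base":  {"layers": 12, "hidden_size": 768,  "attention_heads": 12, "parameters": "~110M"},
    "BERT-Large": {"layers": 24, "hidden_size": 1024, "attention_heads": 16, "parameters": "~340M"},
}
```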
================================================================================
Learning method
- Masked language model (MLM)
- It uses context in a bidirectional way
- Randomly mask some tokens and train the model to recover them (see the sketch after this list)
- Because the [MASK] token never appears at fine-tuning time, some of the selected tokens are left unchanged or replaced with random tokens instead of [MASK]
- Next sentence prediction (NSP)
- Task of deciding whether the second sentence actually follows the first
- The model must distinguish the true next sentence from a random one
- Training pairs: in 50% of the pairs the second sentence really follows the first, in the other 50% it is an unrelated (random) sentence
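A plain-Python sketch of the MLM masking step; the 80% / 10% / 10% split is the recipe from the BERT paper, and the tiny vocabulary is made up for illustration. For NSP, the sentence pairs are simply labeled 1 (second sentence really follows) or 0 (random sentence).

```python
import random

def mask_tokens(tokens, vocab, mask_prob=0.15):
    """BERT-style masking: select ~15% of tokens; of those, 80% become [MASK],
    10% become a random token, 10% stay unchanged (so the model cannot rely on
    always seeing [MASK], which never appears during fine-tuning)."""
    masked, labels = [], []
    for tok in tokens:
        if random.random() < mask_prob:
            labels.append(tok)                       # model must recover this token
            r = random.random()
            if r < 0.8:
                masked.append("[MASK]")
            elif r < 0.9:
                masked.append(random.choice(vocab))  # random replacement
            else:
                masked.append(tok)                   # keep the original token
        else:
            masked.append(tok)
            labels.append(None)                      # position is not predicted
    return masked, labels

vocab = ["the", "bat", "flew", "ball", "player", "hit"]
print(mask_tokens(["the", "bat", "flew", "at", "night"], vocab))
```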
================================================================================
Input data for BERT
- Concatenate the 2 sentences with special tokens such as [CLS] and [SEP]
- The concatenated sequence must be at most 512 tokens, the capacity of the transformer
- Tokenize the sentences with the WordPiece tokenizer, which splits rare words into subword pieces
- Input embedding = token embedding + segment embedding + position embedding
- Zero padding fills the empty positions (see the sketch after this list)
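A hand-built illustration of the input layout (tokens, segment ids, positions, and zero padding); the sentences and the max length of 16 are toy values, real BERT pads up to 512 tokens.

```python
# Illustrative input construction for a two-sentence pair.
max_len = 16                                   # real BERT allows up to 512 tokens
tokens   = ["[CLS]", "the", "bat", "flew", "[SEP]", "it", "was", "dark", "[SEP]"]
segments = [0] * 5 + [1] * 4                   # 0 = first sentence, 1 = second sentence
positions = list(range(len(tokens)))           # position index of each token

pad = max_len - len(tokens)                    # zero padding for the empty slots
tokens    += ["[PAD]"] * pad
segments  += [0] * pad
positions += [0] * pad
attention_mask = [1] * (max_len - pad) + [0] * pad   # 1 = real token, 0 = padding

# final embedding of each position = token embedding + segment embedding + position embedding
print(tokens, segments, attention_mask, sep="\n")
```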
================================================================================
How to use BERT in fine-tuning usage
- You load the pretrained BERT model and retrain it on your task
- When you use 2 sentences, you put a [SEP] token between them in the concatenated sequence (see the sketch after this list)
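A sketch of the two-sentence input for fine-tuning, assuming the Hugging Face transformers library (not named in the post); passing the two sentences as a pair makes the tokenizer insert [CLS] and both [SEP] tokens and set the segment ids.

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Two sentences passed as a pair: [CLS] sentence_1 [SEP] sentence_2 [SEP]
enc = tokenizer("Is a bat an animal?", "A bat is a flying mammal.")
print(tokenizer.convert_ids_to_tokens(enc["input_ids"]))  # [CLS] ... [SEP] ... [SEP]
print(enc["token_type_ids"])   # segment ids: 0 for the first sentence, 1 for the second
```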
================================================================================
How to use BERT in feature-based usage
- You feed in a sentence
- You get a contextual vector (hidden state) for each token of the sentence
- You feed those per-token vectors into any downstream application (see the sketch after this list)
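A feature-based sketch, again assuming Hugging Face transformers: run the frozen pretrained model once and take the per-token hidden states as feature vectors for a downstream application.

```python
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()                                   # the encoder is used as-is, not retrained

inputs = tokenizer("The bat flew out of the cave.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

token_vectors = outputs.last_hidden_state[0]   # one 768-dim contextual vector per token
print(token_vectors.shape)                     # e.g. torch.Size([10, 768])
# token_vectors can now be fed into any downstream model (tagger, classifier, ...)
```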
================================================================================
Preprocessing for BERT
- FullTokenizer = BasicTokenizer + WordpieceTokenizer
- Input data: [CLS] + Q_sentence + [SEP] + context_sentence + [SEP] (see the sketch below)
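A sketch of this preprocessing, assuming Hugging Face's BertTokenizer (which runs basic tokenization followed by WordPiece, like the FullTokenizer in the original BERT code); the question/context pair and the subword example are illustrative.

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# BasicTokenizer step: lowercase and split on whitespace/punctuation.
# WordpieceTokenizer step: split rare words into subword pieces.
print(tokenizer.tokenize("embeddings"))   # e.g. ['em', '##bed', '##ding', '##s']

# QA-style layout built by hand: [CLS] + question + [SEP] + context + [SEP]
question = tokenizer.tokenize("Where did the bat fly?")
context  = tokenizer.tokenize("The bat flew out of the cave at night.")
tokens = ["[CLS]"] + question + ["[SEP]"] + context + ["[SEP]"]
input_ids = tokenizer.convert_tokens_to_ids(tokens)
print(tokens)
print(input_ids)
```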