011. Word2Vec (1)
@
Word2Vec has several algorithms, such as CBOW and skip-gram
@
We will learn natural language processing
We / will / learn / natural / language / processing
w0 / w1 / w2 / w3 / w4 / w5
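As a trivial Python sketch of that split (variable names are mine):
sentence = "We will learn natural language processing"
tokens = sentence.split(" ")            # ['We', 'will', 'learn', 'natural', 'language', 'processing']
for i, w in enumerate(tokens):
    print("w%d = %s" % (i, w))          # w0 = We, w1 = will, ..., w5 = processing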
@
There is a hypothesis about vocabulary size
With 8,000 words, you can communicate
With 20,000 words, you can read newspapers and books
With 30,000 words, you know most practical words
The vocabulary of an average person is between 8,000 and 20,000 words
Let's suppose our dictionary has 50,000 words
@
One hot encoding
[0 0 0 1 ... 0 0 0] is a one hot vector: exactly one element is 1 and all the others are 0
@
Let's create a 50,000 dimension vector for each word
You can arrange the words in various ways: by frequency, in alphabetical order, ascending or descending, etc
In this case, let's arrange the words in alphabetical order
We / will / learn / natural / language / processing
language / learn / natural / processing / We / will
language = [1 0 0 0 0 0]
learn = [0 1 0 0 0 0]
natural = [0 0 1 0 0 0]
processing = [0 0 0 1 0 0]
We = [0 0 0 0 1 0]
will = [0 0 0 0 0 1]
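A minimal NumPy sketch of the six one hot vectors above (names are mine; key=str.lower makes "We" sort alphabetically instead of by its uppercase letter):
import numpy as np

words = ["We", "will", "learn", "natural", "language", "processing"]
vocab = sorted(words, key=str.lower)    # ['language', 'learn', 'natural', 'processing', 'We', 'will']
one_hot = np.eye(len(vocab))            # row i is the one hot vector of vocab[i]
for i, w in enumerate(vocab):
    print(w, one_hot[i])                # language [1. 0. 0. 0. 0. 0.], learn [0. 1. 0. ...], ...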
@
We input the vector into a neural net
The neural net outputs a vector
@
storing the dictionary itself as a list of words [a, aa, ...] has size $1\times 50,000$
but storing all 50,000 words as one hot vectors [0 0 1 ... 0], ..., [0 0 0 ... 0] takes a $50,000\times 50,000$ matrix
@
When a vector like [0 0 1 0 0] goes into a neural net, its elements should be float type so that arithmetic, rather than logical operations, can be performed on them
This also increases the size of the vector
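As a rough calculation (assuming 4-byte float32 elements), the one hot representation of the whole dictionary would take about $50,000\times 50,000\times 4 \approx 10$ GB, while the 50,000 raw strings themselves fit in well under 1 MB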
@
It's too wasteful when the vector contains so many 0s
This way of storing words as vectors turns out to be too inefficient
@
So, we mark an index on each word
To do this, we create a kind of lookup table
language 0
learn 1
natural 2
processing 3
We 4
will 5
And we store those index numbers instead of full vectors
If you have space to store the strings themselves (language, learn, ...), you can store all 50,000 words this way
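A minimal Python sketch of such a table (just a dict; names are mine):
vocab = ["language", "learn", "natural", "processing", "We", "will"]
word_to_index = {w: i for i, w in enumerate(vocab)}       # {'language': 0, 'learn': 1, ..., 'will': 5}
index_to_word = {i: w for w, i in word_to_index.items()}
print(word_to_index["natural"], index_to_word[4])         # 2 We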
@
Let's apply this approach to the given sentence
We / will / learn / natural / language / processing
[4,5,1,2,0,3] means "We will learn natural language processing"
This step can be called indexing, encoding, or parsing
We use [4,5,1,2,0,3] as the dataset
Suppose the entire text gives us a sequence of millions of indices [4,5,1,2,0,3,.....]
We put batch-sized chunks (30, 50, or 100 indices) into the neural net
In this case, suppose we use a batch size of 3
Then, we one hot encode [4,5,1]
Then, the one hot encoded batch has size $3\times 50,000$
We put the $3\times 50,000$ matrix into the neural net
This way is more efficient and saves memory
It means we create the one hot vectors in real time, batch by batch
@
How to make one hot vectors
import numpy as np
word_index = 4            # e.g. the index of "We" in the dictionary
vec = np.zeros(50000)     # tf.zeros() returns an immutable tensor, so NumPy is used for the in-place assignment
vec[word_index] = 1       # set the element at the word's index to 1
We one hot encode [4, 5, 1]
index 4 -> 50,000 dimension vector [0 0 0 0 1 0 ...]
index 5 -> 50,000 dimension vector [0 0 0 0 0 1 ...]
index 1 -> 50,000 dimension vector [0 1 0 0 0 0 ...]
Now, we made a $3\times 50,000$ matrix
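The whole batch can also be built in one call; a minimal sketch, assuming TensorFlow 2.x eager execution (a plain NumPy version is included for comparison):
import numpy as np
import tensorflow as tf

batch_indices = [4, 5, 1]                                      # "We will learn" as dictionary indices
vocab_size = 50000
batch_one_hot = tf.one_hot(batch_indices, depth=vocab_size)    # shape (3, 50000)

# the same matrix in plain NumPy
np_one_hot = np.zeros((len(batch_indices), vocab_size), dtype=np.float32)
np_one_hot[np.arange(len(batch_indices)), batch_indices] = 1.0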
@
When you make a batch, you take a slice of the entire sequence of index values
You divide the entire dataset of index values into batches
Each batch is one hot encoded and then fed into the neural net
@
But the $3\times 50,000$ matrix is still too huge
And it doesn't even train well
because this representation is too poor at expressing meaning
@
In rule based processing, we analyze grammar, morphemes, etc
Then, we figure out by probability which word should come after another word
We just decide that word1 grammatically and probabilistically follows word2
But in deep learning, we put the meaning into the vector
See the vector representations of words tutorial on the tensorflow site
@
We embed a word and its meaning, characteristics, etc into a vector
This task is called "word embedding"
Each vector in the resulting collection corresponds to one word
One embedded vector is called an "embedding"
The collection of vectors is called "embeddings"
@
A vector in which most elements are 0 is called sparse
A sparse vector generated from words is not good for machine learning
Word embedding converts a sparse word vector into a dense word vector
@
You can think of Word2Vec as preprocessing for nlp
A raw word is represented as characters ("char")
We perform indexing on the raw word, which gives a one hot vector represented by an integer index
We process the one hot vector with Word2Vec, which gives a vector of dense "float" values
We put the dense float vector into a neural net (seq2seq, LSTM, RNN) and get the output
raw word $\overset{indexing}\rightarrow$ one hot vector : simple preprocessing
one hot vector $\overset{Word2Vec}\rightarrow$ vector in dense float : preprocessing using nn
vector in dense float $\overset{nn, rnn, LSTM, seq2seq}\rightarrow$ output : nlp model
nlp = preprocessing using nn + nlp model
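One possible way to wire this pipeline up in Keras, as a sketch (the layer sizes are my own assumptions; the Embedding layer plays the Word2Vec role and could be initialized from pretrained vectors):
import tensorflow as tf

vocab_size, embed_dim = 50000, 200
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embed_dim),          # index -> dense float vector (preprocessing using nn)
    tf.keras.layers.LSTM(128),                                  # the nlp model part (RNN/LSTM/seq2seq goes here)
    tf.keras.layers.Dense(vocab_size, activation="softmax"),    # output
])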
@
How to perform Word2Vec
You collect every word in the world
Suppose we have 50,000 words
You perform indexing on the 50,000 words
You arrange the words by frequency, alphabetical order, etc, marking an index number on each word
For example, suppose this sentence
We will learn natural language processing
Suppose the following index information
We 1028
will 356
learn 3
natural 2680
language 170
processing 394
one hot vector for "We" = [0 0 .. 1 ... 0], where vector[1028] is 1
$1\times 50000$ VectorForWe $\cdot$ $50000\times 200$ WeightMatrix = $1\times 200$ Vector
We can arbitrarily choose the embedding size (the 200 here) from around 200 to 300
We use a second weight matrix WeightMatrix' of size $200\times 50000$, the transposed shape of WeightMatrix
$1\times 200$ Vector $\cdot$ $200\times 50000$ WeightMatrix' = $1\times 50000$ Vector
The $1\times 50000$ Vector should be "will"
one hot vector for "will" = [0 ... 1 0 0 ... 0], where vector[356] is 1
We consider the "$1\times 50000$ Vector" as the output
We consider "[0 ... 1 0 0 ... 0]" as the label
Then, we calculate the nce (noise contrastive estimation) loss
We put the output and the label into the nce loss function and get the loss
Then, we perform gradient descent and back propagation and update both weight matrices
img 218a759d-9560-41c5-827b-a6bda38c4b0d
In summary, "We" is converted into a one hot vector
1*50000 one hot vector of "We" x 50000*200 weight = 1*200 vector
1*200 vector x 200*50000 weight' = 1*50000 prediction vector for "will"
We make one hot vector for "will" to use it as label
We calculate loss with 1*50000 prediction vector for "will" and 1*50000 label vector of "will"
We perform "gradient descent" and "back propagation" to update weight and weight'
The row of weight at index 1028 is the vector that represents "We"
The row of weight at index 356 is the vector that represents "will"
Those row vectors, one per word, are the "embeddings"
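A minimal NumPy sketch of just this forward pass (toy random weights, and plain softmax cross-entropy instead of the nce loss, to keep it short; the indices 1028 and 356 are from the table above):
import numpy as np

vocab_size, embed_dim = 50000, 200
rng = np.random.default_rng(0)
W = rng.normal(size=(vocab_size, embed_dim)).astype(np.float32)        # 50000*200 weight
W_prime = rng.normal(size=(embed_dim, vocab_size)).astype(np.float32)  # 200*50000 weight'

we_idx, will_idx = 1028, 356
one_hot_we = np.zeros((1, vocab_size), dtype=np.float32)
one_hot_we[0, we_idx] = 1.0
label_will = np.zeros((1, vocab_size), dtype=np.float32)
label_will[0, will_idx] = 1.0

hidden = one_hot_we @ W          # 1*200 dense vector (identical to W[we_idx])
logits = hidden @ W_prime        # 1*50000 prediction vector for "will"
probs = np.exp(logits - logits.max()) / np.exp(logits - logits.max()).sum()
loss = float(-(label_will * np.log(probs)).sum())   # loss to minimize by gradient descent / back propagation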
@
word sparse one hot vector x 50000*200 W -> dense float vector, then dense float vector x 200*50000 W' -> prediction vector --- label one hot vector
In "word one hot vector x W", let's ask whether the multiplication is really meaningful, because it mostly multiplies by 0
The result is a 1*200 vector
img 7ad276d9-28ba-49c6-a315-4bfe6fa2cbde
The multiplication does nothing but pick out one entire row of W, and that row index is the word's index
W[word_index]
Since this just extracts one row, we don't need to perform the multiplication at all
params : the weight matrix
ids : a list of word indices
tf.nn.embedding_lookup(params, ids, partition_strategy='mod', name=None, validate_indices=True, max_norm=None)
By using tf.nn.embedding_lookup(), the model becomes simpler
1*200 dense vector x 200*50000 Weight' -> 1*50000 output vector --- label
if the word index is 137, we do this
tf.nn.embedding_lookup(Weight, [137]) returns a 1*200 dense vector
This lookup step, "tf.nn.embedding_lookup(Weight, [137]) returns a 1*200 dense vector", is called "projection"
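A small sketch checking that the lookup really is the same thing as the one hot multiplication (NumPy for the comparison; tf.nn.embedding_lookup as above, assuming eager execution):
import numpy as np
import tensorflow as tf

vocab_size, embed_dim = 50000, 200
W = np.random.randn(vocab_size, embed_dim).astype(np.float32)

one_hot_137 = np.zeros((1, vocab_size), dtype=np.float32)
one_hot_137[0, 137] = 1.0
by_matmul = one_hot_137 @ W                                   # 1*200, mostly multiplying zeros
by_indexing = W[137]                                          # the same 200 numbers, no multiplication at all
looked_up = tf.nn.embedding_lookup(tf.constant(W), [137])     # shape (1, 200) -- the "projection"

print(np.allclose(by_matmul[0], by_indexing), looked_up.shape)   # True (1, 200)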
@
How to train?
We use either CBOW, which is not used much because of its lower performance, or skip-gram
CBOW is filling in the blank
We will learn
We w( )ll learn
the training feature (the surrounding characters) becomes [W e w l]
[W e w l] x Weight = output vector --- label vector
the output vector should become the vector for "i"
the label vector is the vector for "i"
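A small sketch that builds CBOW-style (context, blank) pairs at the character level, as in the example above (the window of 2 characters on each side is my own assumption, so the exact windows differ slightly from the note):
text = "We will learn"
window = 2
cbow_pairs = []
for i, center in enumerate(text):
    left = list(text[max(0, i - window):i])
    right = list(text[i + 1:i + 1 + window])
    cbow_pairs.append((left + right, center))   # (surrounding characters, character to fill in)

print(cbow_pairs[4])   # ([' ', 'w', 'l', 'l'], 'i') -- infer the 'i' of "will" from its neighbours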
@
skip-gram is the opposite: if I know some character, I infer the characters around it
W( ) le( )rn l( )ng( )age
Seeing only 'e' or 'r', I should infer the characters around them
Following is CBOW's dataset
W( ) le( )rn l( )ng( )age
Train dataset feature $\begin{bmatrix} [W, l, e, r] \\ [l, e, r, n] \\ [e, r, n, l] \\ [r, n, l, n] \\ [n, l, n, g] \\ ... \end{bmatrix}$
Train dataset label $\begin{bmatrix} e \\ a \\ a \\ a \\ a \end{bmatrix}$
Following is skip-gram's dataset
W( ) le( )rn l( )ng( )age
$\begin{bmatrix} [e] \\ [e] \\ [e] \\ [e] \\ [e] \\ [e] \\ ... \end{bmatrix} \begin{bmatrix} [w] \\ [l] \\ [e] \\ [a] \\ [r] \\ [n] \\ ... \end{bmatrix}$
Train dataset feature --- Train dataset label
Each pair such as ([e], [w]) is one training example, so each example stays small
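And the mirror-image sketch for skip-gram pairs (same assumptions: character level, window of 2):
text = "We learn language"
window = 2
skipgram_pairs = []
for i, center in enumerate(text):
    for j in range(max(0, i - window), min(len(text), i + window + 1)):
        if j != i:
            skipgram_pairs.append((center, text[j]))   # (known character, one surrounding character)

print(skipgram_pairs[:4])   # [('W', 'e'), ('W', ' '), ('e', 'W'), ('e', ' ')] -- each pair is one small training example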