https://ratsgo.github.io/machine%20learning/2017/04/28/tSNE/
================================================================================
t-SNE is for "dimensionality reduction" and "visualization"
================================================================================
t-SNE visualizes "high-dimensional data such as word vectors"
================================================================================
y=tSNE(x)
x: data in high dimension
y: data in low dimension
tSNE: a mapping learned (optimized) for each dataset
$$$|x-x_{\text{neighbors}}|$$$: distances to neighbors; these should be preserved as much as possible
The distance $$$|x-x_{\text{neighbors}}|$$$ is represented stochastically, as the probability of picking each neighbor
================================================================================
$$$p_{j|i}$$$: probability that $$$x_j$$$ is picked as a neighbor when $$$x_i$$$ is given
$$$x_j$$$: a candidate neighbor of $$$x_i$$$
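Concretely, SNE models this with a Gaussian centered at $$$x_i$$$ with per-point bandwidth $$$\sigma_i$$$ (the standard definition from the SNE/t-SNE papers):
$$$p_{j|i}=\frac{\exp(-\|x_i-x_j\|^2 / 2\sigma_i^2)}{\sum_{k\neq i}\exp(-\|x_i-x_k\|^2 / 2\sigma_i^2)}$$$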
================================================================================
$$$q_{j|i}$$$: probability that $$$y_j$$$ is picked as a neighbor when $$$y_i$$$ is given
$$$y_j$$$: a candidate neighbor of $$$y_i$$$
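In the low-dimensional space the same construction is used, with the Gaussian variance fixed to 1/2 (again the standard definition):
$$$q_{j|i}=\frac{\exp(-\|y_i-y_j\|^2)}{\sum_{k\neq i}\exp(-\|y_i-y_k\|^2)}$$$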
================================================================================
Goal of SNE: reduce the difference between p and q
================================================================================
# KL divergence between two discrete probability distributions: 0 when they
# match, and it grows without bound (not toward 1) as they diverge
import numpy as np

def kl_divergence(p, q):
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return np.sum(p * np.log(p / q))

kl_divergence([0.5, 0.5], [0.5, 0.5])  # identical distributions -> 0.0
kl_divergence([0.9, 0.1], [0.5, 0.5])  # different distributions -> ~0.368 (> 0)
================================================================================
So, you can use the Kullback-Leibler divergence as the cost function for SNE
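Written out, the SNE cost is the sum over all points of the KL divergence between the conditional distributions in high and low dimension (the standard definition from the papers):
$$$C=\sum_i KL(P_i \| Q_i)=\sum_i \sum_j p_{j|i}\log\frac{p_{j|i}}{q_{j|i}}$$$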
================================================================================
Tricks to speed up SNE training
- Omit $$$\sigma_i$$$ and use a constant instead
- $$$\sigma_i$$$: the per-point Gaussian bandwidth; it keeps the probabilities from being distorted,
because each point's neighbors are picked with different probabilities (different local density)
- Solve the equations assuming that the following 2 probabilities are the same (symmetrization)
"probability of j-data occurring when i-data is given"
"probability of i-data occurring when j-data is given"
================================================================================
By using the above 2 tricks, you can rewrite p and q
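In the original t-SNE paper the symmetrized versions are:
$$$p_{ij}=\frac{p_{j|i}+p_{i|j}}{2n},\qquad q_{ij}=\frac{\exp(-\|y_i-y_j\|^2)}{\sum_{k\neq l}\exp(-\|y_k-y_l\|^2)}$$$
where n is the number of data points.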
================================================================================
Therefore, you can write the cost function using the new p and q
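With the symmetric p and q, the cost becomes a single KL divergence between the joint distributions:
$$$C=KL(P \| Q)=\sum_i \sum_j p_{ij}\log\frac{p_{ij}}{q_{ij}}$$$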
================================================================================
Calculate the partial derivative (gradient) of the cost $$$C$$$ with respect to each low-dimensional point $$$y_i$$$
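For symmetric SNE this gradient has a simple closed form (given in the t-SNE paper):
$$$\frac{\partial C}{\partial y_i}=4\sum_j (p_{ij}-q_{ij})(y_i-y_j)$$$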
================================================================================
Note that your goal is to find "y" in the low dimension
SNE finds y by gradient descent (a minimal sketch follows this list)
- Initialize y (e.g., with small random values)
- Repeatedly update y using the direction and magnitude of the gradient
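A minimal numpy sketch of this loop, assuming the symmetrized p_ij matrix P as input and the symmetric-SNE q and gradient given above; the learning rate and iteration count are illustrative choices, not values from the post:

import numpy as np

def sne_gradient_descent(P, n_dims=2, lr=0.1, n_iters=500):
    # P: (n, n) matrix of symmetrized joint probabilities p_ij
    n = P.shape[0]
    Y = 1e-4 * np.random.randn(n, n_dims)          # initialize y near the origin
    for _ in range(n_iters):
        diff = Y[:, None, :] - Y[None, :, :]       # pairwise differences y_i - y_j
        D = np.sum(diff ** 2, axis=-1)             # squared pairwise distances
        E = np.exp(-D)                             # Gaussian kernel (symmetric SNE)
        np.fill_diagonal(E, 0.0)
        Q = E / E.sum()                            # joint probabilities q_ij
        # dC/dy_i = 4 * sum_j (p_ij - q_ij) * (y_i - y_j)
        grad = 4.0 * np.einsum('ij,ijk->ik', P - Q, diff)
        Y -= lr * grad                             # step along the negative gradient
    return Y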
================================================================================
SNE assumes the "Gaussian (normal) distribution"
The crowding problem is caused by the Gaussian,
because the Gaussian has thin (light) tails
Crowding problem:
prob1 = probability of a moderately far point j occurring when point i is given
prob2 = probability of a nearby point j occurring when point i is given
prob1 $$$\sim$$$ prob2, so moderately far and nearby neighbors get crushed together in the low-dimensional map
================================================================================
How to solve the "crowding problem"?
Use a distribution with heavier (thicker) tails than the Gaussian: the Student's t-distribution
This is t-SNE
================================================================================
t-SNE
$$$p_{ij}$$$: same as in symmetric SNE
$$$q_{ij}$$$: use the t-distribution to make the tails heavy
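Concretely, the t-SNE paper uses a Student's t-distribution with one degree of freedom for the low-dimensional affinities:
$$$q_{ij}=\frac{(1+\|y_i-y_j\|^2)^{-1}}{\sum_{k\neq l}(1+\|y_k-y_l\|^2)^{-1}}$$$
which changes the gradient to:
$$$\frac{\partial C}{\partial y_i}=4\sum_j (p_{ij}-q_{ij})(y_i-y_j)(1+\|y_i-y_j\|^2)^{-1}$$$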
================================================================================
Usecases of t-SNE (see the sketch after this list)
1. Visualize word embeddings
embedding_vectors=word2vec(tokenized_sentences)
visualize: tSNE(embedding_vectors)
2. Visualize clustered documents
clustered_docs=clustering(many_documents)
visualize: tSNE(clustered_docs)
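A minimal sketch of usecase 1 with scikit-learn, assuming sklearn and matplotlib are available; the random 300-dimensional array is a stand-in for real word2vec output, and the perplexity/init values are illustrative choices:

import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# stand-in for word2vec embeddings: 500 "words", 300 dimensions each
embedding_vectors = np.random.randn(500, 300)

# map to 2 dimensions for plotting
y = TSNE(n_components=2, perplexity=30.0, init='pca').fit_transform(embedding_vectors)

plt.scatter(y[:, 0], y[:, 1], s=5)
plt.show()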
================================================================================