https://www.youtube.com/watch?v=7GBXCD-B6fo
================================================================================
Which model (model A or model B) makes more accurate predictions?
================================================================================
Example of a perfect prediction model:
The difference between the two distributions (the label distribution and the prediction distribution) is 0
================================================================================
Let's express this "difference of distributions" as a single numerical value
---> KL divergence
================================================================================
True label distribution: $$$P$$$
Model A's predicted probability distribution: $$$Q_A$$$
Model B's predicted probability distribution: $$$Q_B$$$
================================================================================
$$$D_{KL}(P||P) = 0.0$$$
$$$D_{KL}(P||Q_A) = 0.25$$$
$$$D_{KL}(P||Q_B) = 1.85$$$
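A minimal NumPy sketch (not from the video; the distributions P, Q_A, Q_B below are hypothetical, so the numbers will differ from 0.25 and 1.85) showing how such D_KL values could be computed:

import numpy as np

def kl_divergence(p, q):
    # D_KL(P || Q) = sum_i p_i * log2(p_i / q_i), in bits
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0                      # terms with p_i = 0 contribute nothing
    return np.sum(p[mask] * np.log2(p[mask] / q[mask]))

# Hypothetical example: true distribution P and two models' predictions
P   = np.array([0.7, 0.2, 0.1])
Q_A = np.array([0.6, 0.25, 0.15])    # close to P
Q_B = np.array([0.1, 0.3, 0.6])      # far from P

print(kl_divergence(P, P))    # 0.0 -> perfect prediction
print(kl_divergence(P, Q_A))  # small value -> model A is close to P
print(kl_divergence(P, Q_B))  # much larger value -> model B is worse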
================================================================================
Relative entropy (intuition):
Criterion: 90 points
Your score: 93 points
You have 3 points more than the criterion score
Relative entropy
= KL divergence
= the extra amount relative to a reference ("3 points more than the criterion score")
================================================================================
Cross_Entropy(P,Q)
= exact_bits + extra_bits needed to store the information
  (exact_bits = Entropy(P); extra_bits = the penalty for encoding with Q instead of P)
Relative_Entropy
= D(P||Q)
= Cross_Entropy(P,Q) - Entropy(P)
= $$$-\sum\limits_{i=1}^{n} (p_i * \log_2{q_i}) + \sum\limits_{i=1}^{n} (p_i * \log_2{p_i})$$$
P: true label distribution
Q: predicted distribution (the model's output)
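A quick numerical check (hypothetical P and Q, plain NumPy) that the identity above holds, i.e. Cross_Entropy(P,Q) - Entropy(P) gives the same value as computing D(P||Q) directly:

import numpy as np

def entropy(p):
    # H(P) = -sum_i p_i * log2(p_i)
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def cross_entropy(p, q):
    # H(P, Q) = -sum_i p_i * log2(q_i)
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return -np.sum(p[mask] * np.log2(q[mask]))

P = np.array([0.7, 0.2, 0.1])    # true label distribution (hypothetical)
Q = np.array([0.6, 0.25, 0.15])  # predicted distribution (hypothetical)

d_kl_direct   = np.sum(P * np.log2(P / Q))         # direct KL formula
d_kl_identity = cross_entropy(P, Q) - entropy(P)   # Cross_Entropy - Entropy
print(np.isclose(d_kl_direct, d_kl_identity))      # True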
================================================================================
$$$D(P||Q) \ge 0$$$
$$$D(P||Q) \ne D(Q||P)$$$
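Both properties can be checked with a tiny hypothetical example (the distributions are made-up values):

import numpy as np

def kl(p, q):
    m = p > 0
    return np.sum(p[m] * np.log2(p[m] / q[m]))   # D(P || Q) in bits

P = np.array([0.7, 0.2, 0.1])   # hypothetical distributions
Q = np.array([0.1, 0.3, 0.6])

print(kl(P, Q) >= 0 and kl(Q, P) >= 0)   # True  (non-negativity)
print(np.isclose(kl(P, Q), kl(Q, P)))    # False (KL is not symmetric)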
================================================================================
Why "KL divergence" is not used as loss function in deep learning?
- As training goes, weight values in the model changes
- Prediction Q will also change
- which means Cross_Entropy(P,Q) changes as training goes
- Entropy(P) is used as "constant"
- Therefore, there is no need to use "KL divergence" as loss function
- You can get same effect but with simpler loss function
- by using "cross entropy loss function"
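A hypothetical sketch of this argument: for a fixed true distribution P, Cross_Entropy(P,Q) and D(P||Q) differ only by the constant Entropy(P), so minimizing one minimizes the other (P and the sequence of predictions Q below are made-up values):

import numpy as np

def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def cross_entropy(p, q):
    m = p > 0
    return -np.sum(p[m] * np.log2(q[m]))

def kl_divergence(p, q):
    return cross_entropy(p, q) - entropy(p)

P = np.array([0.7, 0.2, 0.1])   # fixed true labels (hypothetical)

# Pretend Q improves over three training steps (hypothetical predictions)
for Q in [np.array([0.2, 0.3, 0.5]),
          np.array([0.5, 0.3, 0.2]),
          np.array([0.68, 0.21, 0.11])]:
    ce, d = cross_entropy(P, Q), kl_divergence(P, Q)
    # CE - KL is always Entropy(P): both losses rank the predictions identically
    print(f"CE = {ce:.4f}  KL = {d:.4f}  CE - KL = {ce - d:.4f}")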