003. Week 01. Motivations and Basics - 03. MAP (maximum a posteriori probability)
Bayes says MLE is not everything.
He also asks: when you throw the thumbtack, do you really think the probability of heads is definitely 60%?
Don't you think the probability of heads should be 50%?
% ==================================================================
Bayes says you can add prior knowledge into the step of inferring the parameter $$$\hat{\theta}$$$
% ==================================================================
Bayes says: I created this formula before
$$$P(\theta|D)=\frac{P(D|\theta)P(\theta)}{P(D)}$$$
$$$P(D)$$$: probability of the data occurring; the normalizing constant
$$$P(\theta)$$$: probability of $$$\theta$$$; your prior knowledge about $$$\theta$$$
$$$P(D|\theta)$$$: probability of the data occurring given $$$\theta$$$; the likelihood
$$$P(\theta|D)$$$: probability of $$$\theta$$$ given the data; the posterior
$$$P(D)$$$ is not our focus, because our interest is not the data itself but the parameter $$$\theta$$$
Through many observations, you'd like to inspect the force and factor hidden inside the data
That force and factor is $$$\theta$$$
Data: many observations of thumbtack throws, or of people's actions
Force and factor: why does the thumbtack give this probability of heads and tails?
Why do people act like this?
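A minimal numeric sketch of Bayes' rule at work (all numbers are made up for illustration: two candidate $$$\theta$$$ values, a uniform prior over them, and counts of 3 heads and 2 tails):
```python
# Illustrative only: a tiny discrete version of Bayes' rule.
# Assume (made up) that theta can only be 0.5 or 0.6, each equally
# likely a priori, and that we observed 3 heads and 2 tails.
a_H, a_T = 3, 2

def likelihood(theta):
    # P(D|theta), the Bernoulli/binomial likelihood used in this lecture
    return theta**a_H * (1 - theta)**a_T

prior = {0.5: 0.5, 0.6: 0.5}                                  # P(theta)
evidence = sum(likelihood(t) * p for t, p in prior.items())   # P(D)

for t, p in prior.items():
    posterior = likelihood(t) * p / evidence                  # P(theta|D)
    print(f"P(theta={t}|D) = {posterior:.3f}")
```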
% ==================================================================
You already defined $$$P(D|\theta)$$$ through the Bernoulli trial and the binomial distribution
$$$P(D|\theta)=\theta^{a_{H}}(1-\theta)^{a_{T}}$$$
$$$P(\theta)$$$: your guess that, when you throw the thumbtack, heads and tails will come out 50:50 can be the prior knowledge $$$P(\theta)$$$
Bayes says: since you defined $$$P(D|\theta)=\theta^{a_{H}}(1-\theta)^{a_{T}}$$$, now you only need to add the prior knowledge $$$P(\theta)$$$
Then you can find the posterior $$$P(\theta|D)$$$, which is shaped by the prior knowledge $$$P(\theta)$$$, the normalizing constant $$$P(D)$$$, and the likelihood $$$P(D|\theta)$$$
% ==================================================================
From $$$P(\theta|D) = \frac{P(D|\theta)P(\theta)}{P(D)}$$$, look at $$$P(D)$$$
$$$P(D)$$$ is what already happened, a given fact
You can't do anything about it; it's a constant, which is why you treat $$$P(D)$$$ as a normalizing constant
That is, the normalizing constant $$$P(D)$$$ is a factor that is not affected by changing $$$\theta$$$
So you can drop $$$P(D)$$$, but then you can't keep the equal sign; you use the proportionality sign instead
$$$P(\theta|D) \propto P(D|\theta)P(\theta)$$$
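A quick sketch of why dropping the constant is safe when you only care about the maximizing $$$\theta$$$ (the counts and the grid are illustrative assumptions; uses numpy):
```python
import numpy as np

a_H, a_T = 3, 2
thetas = np.linspace(0.001, 0.999, 999)

# Unnormalized posterior shape (with a flat prior, this is just the likelihood)
unnormalized = thetas**a_H * (1 - thetas)**a_T
# Dividing by the constant "P(D)" scales every value equally...
normalized = unnormalized / unnormalized.sum()

# ...so the argmax is unchanged.
assert np.argmax(unnormalized) == np.argmax(normalized)
print(thetas[np.argmax(unnormalized)])  # ~0.6 = a_H / (a_H + a_T)
```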
% ==================================================================
$$$P(\theta|D) \propto P(D|\theta)P(\theta)$$$
$$$P(D|\theta)=\theta^{a_{H}}(1-\theta)^{a_{T}}$$$
Now, what is the best way to express $$$P(\theta)$$$?
Just saying 50:50? That would not be a good choice
Where did $$$P(D|\theta)$$$ come from? It came from the binomial distribution
So, to calculate and represent $$$P(\theta)$$$, you also need to rely on some probability distribution
% ==================================================================
There are various probability distributions, but the recommended one is the beta distribution
The beta distribution is confined to the interval $$$[0, 1]$$$: its CDF (cumulative distribution function) rises from 0 at $$$\theta=0$$$ to 1 at $$$\theta=1$$$
So the beta distribution represents the characteristics of a probability (a value between 0 and 1) nicely
% /home/young/사진/IlChul_Moon-Basic_ML/2018_07_07_09:18:01.png
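A small check of that support property (assumes scipy is available; the hyperparameters $$$\alpha=3, \beta=2$$$ are an arbitrary example):
```python
from scipy.stats import beta

alpha, b = 3, 2          # example hyperparameters (arbitrary choice)
dist = beta(alpha, b)

print(dist.cdf(0.0))     # 0.0 -- the CDF starts at 0
print(dist.cdf(1.0))     # 1.0 -- and ends at 1
print(dist.pdf(-0.1), dist.pdf(1.1))   # 0.0 outside [0, 1]: all mass lives on [0, 1]
```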
% ==================================================================
The beta distribution is represented as follows
The PDF (probability density function) of $$$P(\theta)$$$ is
$$$P(\theta)=\frac{\theta^{\alpha-1}(1-\theta)^{\beta-1}}{B(\alpha, \beta)}$$$
B part: $$$B(\alpha, \beta)=\frac{\Gamma(\alpha)\Gamma(\beta)}{\Gamma(\alpha+\beta)}$$$
$$$\Gamma$$$ part: $$$\Gamma(\alpha)=(\alpha-1)!$$$ (for positive integer $$$\alpha$$$)
The order is reversed:
you define the $$$\Gamma$$$ function,
with the $$$\Gamma$$$ function you express $$$B(\alpha, \beta)$$$,
and with $$$B(\alpha, \beta)$$$ and $$$\alpha, \beta$$$ you can create $$$P(\theta)$$$
That is, the parameters the beta distribution needs are $$$\alpha, \beta$$$
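A direct transcription of those formulas into code (the check values $$$\alpha=3, \beta=2$$$ and the grid size are illustrative assumptions):
```python
import math

def beta_fn(alpha, b):
    # B(alpha, beta) = Gamma(alpha) * Gamma(beta) / Gamma(alpha + beta)
    return math.gamma(alpha) * math.gamma(b) / math.gamma(alpha + b)

def beta_pdf(theta, alpha, b):
    # P(theta) = theta^(alpha-1) * (1-theta)^(beta-1) / B(alpha, beta)
    return theta**(alpha - 1) * (1 - theta)**(b - 1) / beta_fn(alpha, b)

# For positive integer alpha, Gamma(alpha) = (alpha - 1)!
assert math.gamma(5) == math.factorial(4)

# Midpoint Riemann-sum check: the density integrates to ~1 over [0, 1]
n = 10_000
total = sum(beta_pdf((i + 0.5) / n, 3, 2) for i in range(n)) / n
print(total)   # ~1.0
```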
% ==================================================================
The binomial distribution, likewise, uses the observed counts $$$a_{H}, a_{T}$$$
Using those, you estimated the parameter $$$\theta$$$
% ==================================================================
Into $$$P(\theta|D) \propto P(D|\theta)P(\theta)$$$, you insert the following two expressions
$$$P(D|\theta)=\theta^{a_{H}}(1-\theta)^{a_{T}}$$$
$$$P(\theta)=\frac{\theta^{\alpha-1}(1-\theta)^{\beta-1}}{B(\alpha, \beta)}$$$
$$$P(\theta|D) \propto \theta^{a_{H}}(1-\theta)^{a_{T}}\theta^{\alpha-1}(1-\theta)^{\beta-1}$$$
From $$$P(\theta)=\frac{\theta^{\alpha-1}(1-\theta)^{\beta-1}}{B(\alpha, \beta)}$$$, what happened to $$$B(\alpha, \beta)$$$?
That part is constant because $$$\alpha, \beta$$$ are already determined
That is, $$$B(\alpha, \beta)$$$ is not a term depending on $$$\theta$$$
So only the part $$$\theta^{\alpha-1}(1-\theta)^{\beta-1}$$$ that depends on $$$\theta$$$ is inserted, and you can keep using the proportionality sign
% ==================================================================
From $$$P(\theta|D) \propto \theta^{a_{H}}(1-\theta)^{a_{T}}\theta^{\alpha-1}(1-\theta)^{\beta-1}$$$,
there are $$$\theta^{a_{H}}$$$ and $$$\theta^{\alpha-1}$$$,
and there are $$$(1-\theta)^{a_{T}}$$$ and $$$(1-\theta)^{\beta-1}$$$
For each matching base, you can add the exponents together
$$$P(\theta|D) \propto \theta^{a_{H}}(1-\theta)^{a_{T}}\theta^{\alpha-1}(1-\theta)^{\beta-1}$$$
$$$P(\theta|D) \propto \theta^{a_{H}+\alpha-1}(1-\theta)^{a_{T}+\beta-1}$$$
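Reading off the exponents, this is again the kernel of a beta distribution; this is the well-known conjugate-prior property of the beta prior with this likelihood:
$$$\theta|D \sim \text{Beta}(a_{H}+\alpha,\; a_{T}+\beta)$$$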
% ==================================================================
You can see an interesting point here
$$$\theta^{a_{H}+\alpha-1}(1-\theta)^{a_{T}+\beta-1}$$$ has the same shape as the likelihood $$$\theta^{a_{H}}(1-\theta)^{a_{T}}$$$
% ==================================================================
In the case of MLE, you found $$$\hat{\theta}=arg\;\underset{\theta}{max}\;P(D|\theta)$$$
With $$$P(D|\theta) = \theta^{a_{H}}(1-\theta)^{a_{T}}$$$,
by using derivatives, you found
$$$\hat{\theta}=\frac{a_{H}}{a_{H}+a_{T}}$$$
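For reference, the skipped derivative step: maximize the log-likelihood (which has the same maximizer), differentiate, and set to zero:
$$$\frac{d}{d\theta}\ln P(D|\theta) = \frac{a_{H}}{\theta}-\frac{a_{T}}{1-\theta}=0 \;\Rightarrow\; \hat{\theta}=\frac{a_{H}}{a_{H}+a_{T}}$$$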
% ==================================================================
MAP is:
in $$$\hat{\theta}=arg\;\underset{\theta}{max}\;P(D|\theta)$$$, you replace the likelihood $$$P(D|\theta)$$$ with the posterior $$$P(\theta|D)$$$, giving $$$\hat{\theta}=arg\;\underset{\theta}{max}\;P(\theta|D)$$$
% ==================================================================
$$$P(\theta|D) \propto \theta^{a_{H}+\alpha-1}(1-\theta)^{a_{T}+\beta-1}$$$
Maximizing this with the same derivative trick as in MLE:
$$$\hat{\theta}=\frac{a_{H}+\alpha-1}{a_{H}+\alpha-1 + (a_{T}+\beta-1)}$$$
$$$\hat{\theta}=\frac{a_{H}+\alpha-1}{a_{H}+a_{T}+\alpha+\beta-2}$$$
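A quick numeric sketch of the two estimators side by side (counts and prior hyperparameters are made up):
```python
# Made-up numbers: 3 heads, 2 tails, and a prior Beta(alpha=2, beta=2)
# encoding "probably near 50:50".
a_H, a_T = 3, 2
alpha, b = 2, 2

theta_mle = a_H / (a_H + a_T)
theta_map = (a_H + alpha - 1) / (a_H + a_T + alpha + b - 2)

print(theta_mle)   # 0.6
print(theta_map)   # 4/7 ~ 0.571 -- pulled toward 0.5 by the prior
```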
% ==================================================================
With many trials, the influence of the prior knowledge $$$\alpha, \beta$$$ fades away
The $$$a_{H}$$$ and $$$a_{H}+a_{T}$$$ terms become dominant, because $$$\alpha, \beta$$$ are fixed finite numbers while the counts keep growing
In conclusion, with many trials, MLE and MAP become the same
With a small number of trials, prior knowledge plays an important role
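A small simulation sketch of this convergence (the true $$$\theta$$$, the seed, and the prior are assumed for illustration; uses numpy):
```python
import numpy as np

rng = np.random.default_rng(0)
true_theta = 0.6
alpha, b = 2, 2                       # mild near-50:50 prior (assumed)

for n in (10, 100, 10_000):
    flips = rng.random(n) < true_theta
    a_H = int(flips.sum())
    mle = a_H / n
    map_ = (a_H + alpha - 1) / (n + alpha + b - 2)
    # The gap between MLE and MAP shrinks as n grows
    print(f"n={n:6d}  MLE={mle:.4f}  MAP={map_:.4f}")
```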
% ==================================================================
How do you choose $$$\alpha, \beta$$$?
MAP uses the prior knowledge $$$\alpha, \beta$$$, which reflects your own assumptions
Prior knowledge can be useful, but you can get a bad result when you choose a bad prior
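A sketch of that failure mode (all numbers made up): with few trials, a badly chosen prior dominates and drags the estimate far from the data:
```python
# A deliberately bad prior: Beta(alpha=50, beta=2) insists heads is very likely.
# With only 10 flips, it overwhelms what the data says.
a_H, a_T = 4, 6                # small sample (made-up counts)
alpha, b = 50, 2

theta_mle = a_H / (a_H + a_T)
theta_map = (a_H + alpha - 1) / (a_H + a_T + alpha + b - 2)
print(theta_mle)   # 0.4 -- what the data alone suggests
print(theta_map)   # 53/60 ~ 0.883 -- dominated by the bad prior
```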
% ==================================================================