003. Week 01. Motivations and Basics - 03. MAP (maximum a posteriori probability)

Bayes says MLE is not everything: when you throw a thumbtack, do you really think the probability of getting a head will definitely be 60%? Don't you think it should be around 50%?

% ==================================================================

Bayes says you can add prior knowledge into the step of inferring the parameter $$$\hat{\theta}$$$

% ==================================================================

Bayes says: I created this formula

$$$P(\theta|D)=\frac{P(D|\theta)P(\theta)}{P(D)}$$$

$$$P(D)$$$: probability of observing the data, a normalizing constant
$$$P(\theta)$$$: probability of $$$\theta$$$, your prior knowledge about $$$\theta$$$
$$$P(D|\theta)$$$: probability of observing the data when $$$\theta$$$ is given, the likelihood
$$$P(\theta|D)$$$: probability of $$$\theta$$$ when the data is given, the posterior

$$$P(D)$$$ is not our main focus because our interest is not the data itself but the parameter $$$\theta$$$. Through many observations you'd like to inspect the force and factor inside the data, and that force and factor is $$$\theta$$$.

Data: many observations of thumbtack throws, or of how people act
Force and factor: why does the thumbtack produce this probability of head and tail? why do people act like this?

% ==================================================================

You already defined $$$P(D|\theta)$$$ through Bernoulli trials and the binomial distribution:

$$$P(D|\theta)=\theta^{a_{H}}(1-\theta)^{a_{T}}$$$

$$$P(\theta)$$$: your guess that the probability of head and tail will be 50:50 when you throw a thumbtack can serve as the prior knowledge $$$P(\theta)$$$

Bayes says: since you defined $$$P(D|\theta)=\theta^{a_{H}}(1-\theta)^{a_{T}}$$$, now you only need to add the prior knowledge $$$P(\theta)$$$. Then you can find the posterior $$$P(\theta|D)$$$, which is shaped by the prior knowledge $$$P(\theta)$$$, the normalizing constant $$$P(D)$$$, and the likelihood $$$P(D|\theta)$$$.

% ==================================================================

In $$$P(\theta|D) = \frac{P(D|\theta)P(\theta)}{P(D)}$$$, the term $$$P(D)$$$ describes data that already happened, a fact that was given. You can't do anything about it; it is a constant, which is why you treat $$$P(D)$$$ as a normalizing constant. That is, the normalizing constant $$$P(D)$$$ is a factor that is not affected by changing $$$\theta$$$. So you can drop $$$P(D)$$$, but then you can no longer use the equals sign:

$$$P(\theta|D) \propto P(D|\theta)P(\theta)$$$
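To see concretely that the normalizing constant $$$P(D)$$$ only rescales the posterior and never moves its peak, here is a minimal numeric sketch in Python; the counts $$$a_{H}=3, a_{T}=2$$$, the uniform prior, and the grid resolution are illustrative assumptions, not values from the lecture.

# Hypothetical observations: 3 heads, 2 tails (illustrative values only)
a_H, a_T = 3, 2

step = 0.001
thetas = [i * step for i in range(1, 1000)]   # grid over (0, 1)

prior = 1.0                                   # uniform prior P(theta)
unnormalized = [t**a_H * (1 - t)**a_T * prior for t in thetas]

# P(D) is the total mass of likelihood * prior: a single number
# that does not depend on theta
P_D = sum(u * step for u in unnormalized)
posterior = [u / P_D for u in unnormalized]

# Dividing by the constant P(D) cannot move the peak
peak = max(range(len(thetas)), key=lambda i: unnormalized[i])
print(thetas[peak])                           # ~0.6 = a_H / (a_H + a_T)
print(peak == max(range(len(posterior)), key=lambda i: posterior[i]))  # True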
% ==================================================================

$$$P(\theta|D) \propto P(D|\theta)P(\theta)$$$
$$$P(D|\theta)=\theta^{a_{H}}(1-\theta)^{a_{T}}$$$

Now, which way is best to express $$$P(\theta)$$$? Just saying 50:50 would not be a good choice. Where did $$$P(D|\theta)$$$ come from? It came from the binomial distribution. So, to calculate and represent $$$P(\theta)$$$, you also need to rely on some probability distribution.

% ==================================================================

There can be various probability distributions, but the recommended one is the beta distribution. The beta distribution is defined on the interval between 0 and 1, which is exactly the range a probability can take, so it represents the characteristics of the probability $$$\theta$$$ nicely.

% /home/young/사진/IlChul_Moon-Basic_ML/2018_07_07_09:18:01.png

% ==================================================================

The PDF (probability density function) of the beta distribution is

$$$P(\theta)=\frac{\theta^{\alpha-1}(1-\theta)^{\beta-1}}{B(\alpha, \beta)}$$$

B part: $$$B(\alpha, \beta)=\frac{\Gamma(\alpha)\Gamma(\beta)}{\Gamma(\alpha+\beta)}$$$
$$$\Gamma$$$ part: $$$\Gamma(\alpha)=(\alpha-1)!$$$ when $$$\alpha$$$ is a positive integer

The definitions are given in reverse order: first you define the $$$\Gamma$$$ function, then $$$B(\alpha, \beta)$$$ is expressed using $$$\Gamma$$$, and with $$$\alpha, \beta$$$ you can create $$$P(\theta)$$$. That is, the parameters the beta distribution needs are $$$\alpha, \beta$$$.
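As a small sanity check of these definitions, here is a minimal Python sketch that builds $$$\Gamma$$$, then $$$B(\alpha, \beta)$$$, then $$$P(\theta)$$$ in exactly that order; the values $$$\alpha=3, \beta=2$$$ are hypothetical examples, not from the lecture.

import math

def B(alpha, beta):
    # B(alpha, beta) = Gamma(alpha) Gamma(beta) / Gamma(alpha + beta)
    return math.gamma(alpha) * math.gamma(beta) / math.gamma(alpha + beta)

def beta_pdf(theta, alpha, beta):
    # P(theta) = theta^(alpha-1) (1-theta)^(beta-1) / B(alpha, beta)
    return theta**(alpha - 1) * (1 - theta)**(beta - 1) / B(alpha, beta)

# For a positive integer n, Gamma(n) = (n-1)!
assert math.gamma(5) == math.factorial(4)

# The density integrates to 1 over [0, 1] (crude Riemann-sum check)
alpha, beta = 3.0, 2.0          # hypothetical prior parameters
step = 0.0001
total = sum(beta_pdf(i * step, alpha, beta) * step for i in range(1, 10000))
print(round(total, 3))          # ~1.0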
% ==================================================================

Just as the beta distribution needs the parameters $$$\alpha, \beta$$$, the binomial likelihood is driven by the counts $$$a_{H}, a_{T}$$$; using those counts you can estimate the parameter $$$\theta$$$.

% ==================================================================

Into $$$P(\theta|D) \propto P(D|\theta)P(\theta)$$$, you will insert the following two expressions:

$$$P(D|\theta)=\theta^{a_{H}}(1-\theta)^{a_{T}}$$$
$$$P(\theta)=\frac{\theta^{\alpha-1}(1-\theta)^{\beta-1}}{B(\alpha, \beta)}$$$

$$$P(\theta|D) \propto \theta^{a_{H}}(1-\theta)^{a_{T}}\theta^{\alpha-1}(1-\theta)^{\beta-1}$$$

From $$$P(\theta)=\frac{\theta^{\alpha-1}(1-\theta)^{\beta-1}}{B(\alpha, \beta)}$$$, what happened to $$$B(\alpha, \beta)$$$? That part is a constant because $$$\alpha, \beta$$$ are already determined; $$$B(\alpha, \beta)$$$ is not a term that depends on $$$\theta$$$. So only the part $$$\theta^{\alpha-1}(1-\theta)^{\beta-1}$$$ that depends on $$$\theta$$$ is inserted, and you can keep using the proportionality operator.

% ==================================================================

In $$$P(\theta|D) \propto \theta^{a_{H}}(1-\theta)^{a_{T}}\theta^{\alpha-1}(1-\theta)^{\beta-1}$$$, there are $$$\theta^{a_{H}}$$$ and $$$\theta^{\alpha-1}$$$, and there are $$$(1-\theta)^{a_{T}}$$$ and $$$(1-\theta)^{\beta-1}$$$. Those pairs can be combined by summing the exponents:

$$$P(\theta|D) \propto \theta^{a_{H}+\alpha-1}(1-\theta)^{a_{T}+\beta-1}$$$

% ==================================================================

You can see an interesting point here: $$$\theta^{a_{H}+\alpha-1}(1-\theta)^{a_{T}+\beta-1}$$$ has the same shape as the likelihood $$$\theta^{a_{H}}(1-\theta)^{a_{T}}$$$, only with shifted exponents.

% ==================================================================

In the case of MLE, you found $$$\hat{\theta}=arg\;\underset{\theta}{max}\;P(D|\theta)$$$. By taking the derivative of $$$P(D|\theta) = \theta^{a_{H}}(1-\theta)^{a_{T}}$$$ and setting it to zero, you found

$$$\hat{\theta}=\frac{a_{H}}{a_{H}+a_{T}}$$$

% ==================================================================

MAP is: $$$\hat{\theta}=arg\;\underset{\theta}{max}\;P(\theta|D)$$$; that is, you replace the likelihood $$$P(D|\theta)$$$ with the posterior $$$P(\theta|D)$$$.

% ==================================================================

$$$P(\theta|D) \propto \theta^{a_{H}+\alpha-1}(1-\theta)^{a_{T}+\beta-1}$$$

$$$\hat{\theta}=\frac{a_{H}+\alpha-1}{a_{H}+\alpha-1 + (a_{T}+\beta-1)}$$$
$$$\hat{\theta}=\frac{a_{H}+\alpha-1}{a_{H}+a_{T}+\alpha+\beta-2}$$$

% ==================================================================

With many trials, the prior knowledge carried by $$$\alpha, \beta$$$ fades away: $$$\alpha, \beta$$$ are fixed finite numbers, so the count terms $$$a_{H}$$$ and $$$a_{H}+a_{T}$$$ become dominant. In conclusion, with many trials, MLE and MAP become the same; under a small number of trials, the prior knowledge plays an important role (see the numeric sketch at the end of this note).

% ==================================================================

How do you choose $$$\alpha, \beta$$$? MAP uses the prior knowledge $$$\alpha, \beta$$$, which is your own assumption. Prior knowledge can be useful, but you can get a bad result when you choose bad prior knowledge.

% ==================================================================
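To make the convergence point above concrete, here is a minimal sketch comparing the two estimators. The prior Beta($$$\alpha=5, \beta=5$$$), expressing a rough 50:50 belief, and the counts are invented for illustration.

def mle(a_H, a_T):
    # maximum likelihood: theta_hat = a_H / (a_H + a_T)
    return a_H / (a_H + a_T)

def map_estimate(a_H, a_T, alpha, beta):
    # MAP under a Beta(alpha, beta) prior:
    # theta_hat = (a_H + alpha - 1) / (a_H + a_T + alpha + beta - 2)
    return (a_H + alpha - 1) / (a_H + a_T + alpha + beta - 2)

alpha, beta = 5, 5   # hypothetical prior expressing a rough 50:50 belief

# Few trials: 3 heads, 2 tails -- the prior pulls MAP toward 0.5
print(mle(3, 2), map_estimate(3, 2, alpha, beta))          # 0.6 vs ~0.538

# Many trials with the same 60% head ratio -- MLE and MAP nearly coincide
print(mle(600, 400), map_estimate(600, 400, alpha, beta))  # 0.6 vs ~0.599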