002. Week 01. Motivations and Basics - 02. MLE (maximum likelihood estimation)
% ========================================================================
The formula for finding a probability, $$$\text{probability}=\frac{\text{occurred cases}}{\text{entire cases}}$$$, is actually not trivially obtained
To derive the above formula, you need to understand a few concepts first, for example, the binomial distribution
% ========================================================================
The binomial distribution is a probability distribution over 2 discrete events (like head or tail, unlike continuous outcomes)
Bernoulli experiment: an experiment whose result is one of 2 kinds
% ========================================================================
When you throw a thumbtack, you already suppose iid (independent events, identically distributed according to the binomial distribution)
Independent events: the first throw and the second throw are independent, which means the two events are not related
Identically distributed according to the binomial distribution: there is no damage to the thumbtack, so every trial has the same probability distribution
% ========================================================================
Let's call the probability of observing H $$$\theta$$$, as in $$$P(H)=\theta$$$
So, the probability of observing T is $$$P(T)=1-\theta$$$
% ========================================================================
Under the iid condition, let's throw the thumbtack 5 times
And suppose you observed HHTHT
$$$P(HHTHT)=\theta \cdot \theta \cdot (1-\theta) \cdot \theta \cdot (1-\theta)$$$
$$$P(HHTHT)=\theta^{3}(1-\theta)^{2}$$$
which is a very small number
% ========================================================================
Let's describe the above experiment in more detail
Call the data D
D=H,H,T,H,T
$$$n=5$$$, number of trials
$$$k=a_{H}=3$$$, number of heads
$$$p=\theta=P(H)$$$, probability of a head
When $$$\theta$$$ is given, the probability of observing D is
$$$P(D|\theta)=\theta^{a_{H}}(1-\theta)^{a_{T}}$$$
With the above information, how can you reach the probabilities $$$\frac{3}{5}, \frac{2}{5}$$$?
% ========================================================================
MLE (Maximum Likelihood Estimation)
$$$P(D|\theta)=\theta^{a_{H}}(1-\theta)^{a_{T}}$$$
Data: the observed sequence D, summarized by $$$a_{H}$$$ and $$$a_{T}$$$
Let's assume the following hypothesis
1. The gambling result with the thumbtack follows a binomial distribution with parameter $$$\theta$$$
From this, how can you make that hypothesis a stronger, more credible statement?
1. You can look for a better distribution that explains your observation
This can be done, but you need more rationale for it
2. You accept that the result follows a binomial distribution with parameter $$$\theta$$$, and then optimize $$$\theta$$$ to make your hypothesis stronger
In other words: with which $$$\theta$$$ will you best explain the data? (a small numerical sketch of this question follows below)
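To see this question concretely, here is a minimal Python sketch (my own illustration, not from the lecture; the helper name `likelihood` is an assumption) that evaluates $$$P(D|\theta)=\theta^{a_{H}}(1-\theta)^{a_{T}}$$$ for the observed sequence HHTHT at a few candidate $$$\theta$$$ values:

```python
# Evaluate P(D|theta) = theta^a_H * (1 - theta)^a_T for HHTHT
# at a few candidate theta values.

def likelihood(theta, a_h, a_t):
    """P(D|theta) for a_h heads and a_t tails under the binomial model."""
    return theta ** a_h * (1 - theta) ** a_t

a_h, a_t = 3, 2  # D = H,H,T,H,T

for theta in [0.3, 0.5, 0.6, 0.7]:
    print(f"theta={theta:.1f}  P(D|theta)={likelihood(theta, a_h, a_t):.5f}")

# theta=0.6 gives the largest value (~0.03456), which matches
# a_H / (a_H + a_T) = 3/5 derived later in this note.
```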
Finding the best candidate $$$\theta$$$ is the core idea here
% ========================================================================
MLE: probabilistic estimation
MLE is about finding the $$$\theta$$$ which maximizes the probability of the observed data occurring
Mathematically, you should find the argument $$$\theta$$$ which maximizes $$$P(D|\theta)$$$
And you call the found $$$\theta$$$ $$$\hat{\theta}$$$
$$$\hat{\theta}=\arg\max_{\theta} P(D|\theta)$$$
$$$P(D|\theta)$$$: when $$$\theta$$$ is given, the conditional probability of observing the data D
$$$\hat{\theta}$$$: the $$$\theta$$$ that maximizes $$$P(D|\theta)$$$
% ========================================================================
Let's apply MLE to the "throwing thumbtack gambling"
You can define $$$\hat{\theta}$$$ as follows:
$$$\hat{\theta} = \arg\max_{\theta} P(D|\theta)$$$
$$$\hat{\theta} = \arg\max_{\theta} \theta^{a_{H}}(1-\theta)^{a_{T}}$$$
The above expression is rather complicated to work with directly because it contains powers
% ========================================================================
You can use the log function, which increases monotonically
% /home/young/사진/IlChul_Moon-Basic_ML/2018_07_06_14:30:51.png
Even if you map the probability function $$$P(D|\theta)$$$ through the log function, the point where it becomes maximum stays the same
Using ln:
$$$\hat{\theta} = \arg\max_{\theta} \ln{P(D|\theta)}$$$
$$$\hat{\theta} = \arg\max_{\theta} \ln{\theta^{a_{H}}(1-\theta)^{a_{T}}}$$$
$$$\hat{\theta} = \arg\max_{\theta} \{a_{H}\ln{\theta}+a_{T}\ln{(1-\theta)}\}$$$
% ========================================================================
Now, it becomes a maximization problem
From $$$\{a_{H}\ln{\theta}+a_{T}\ln{(1-\theta)}\}$$$, you need to optimize $$$\theta$$$
That is, you need to find the $$$\theta$$$ which maximizes $$$\{a_{H}\ln{\theta}+a_{T}\ln{(1-\theta)}\}$$$
Which method will you use? (a numerical check that the log transform preserves the maximizer is sketched below)
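Before moving to the analytic solution, here is a minimal sketch (my own illustration, assuming a simple grid search over $$$\theta$$$) confirming that the log transform preserves the maximizer: the $$$\theta$$$ that maximizes $$$P(D|\theta)$$$ also maximizes $$$\ln P(D|\theta)$$$:

```python
# Grid search over theta in (0, 1): the argmax of the likelihood
# and the argmax of the log-likelihood coincide.
import math

a_h, a_t = 3, 2
thetas = [i / 1000 for i in range(1, 1000)]  # grid over (0, 1)

lik = [t ** a_h * (1 - t) ** a_t for t in thetas]
log_lik = [a_h * math.log(t) + a_t * math.log(1 - t) for t in thetas]

argmax_lik = thetas[lik.index(max(lik))]
argmax_log = thetas[log_lik.index(max(log_lik))]

print(argmax_lik, argmax_log)  # both print 0.6
```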
You can use the derivative set to zero
You take the derivative with respect to $$$\theta$$$, then find the point where the derivative is 0
$$$\frac{d}{d\theta} \{ a_{H} \ln{\theta} + a_{T} \ln{(1-\theta)} \} = 0$$$
$$$\frac{a_{H}}{\theta} - \frac{a_{T}}{1 - \theta} = 0$$$
You rearrange with respect to $$$\theta$$$
$$$\theta=\frac{a_{H}}{a_{T}+a_{H}}$$$
$$$a_{H}$$$: number of heads
$$$a_{T}+a_{H}$$$: number of throws
Results from the probability formula like $$$\frac{3}{5}$$$ and $$$\frac{2}{5}$$$ come from these steps: the "binomial distribution" assumption, "MLE", and the "optimization process"
% ========================================================================
When $$$\theta=\frac{a_{H}}{a_{T}+a_{H}}$$$, $$$\theta$$$ becomes the best candidate from the MLE perspective
So, you can write:
$$$\hat{\theta}=\frac{a_{H}}{a_{T}+a_{H}}$$$
The above $$$\theta$$$ is the inferred, optimized parameter $$$\hat{\theta}$$$ in terms of MLE
% ========================================================================
So, you can now explain where $$$\frac{3}{5}$$$ and $$$\frac{2}{5}$$$ come from
So, you can say $$$\theta = 0.6$$$ is the best $$$\theta$$$
% ========================================================================
Now, suppose you throw the thumbtack 50 times
And suppose 30 heads and 20 tails occurred
Then, can you say 3H,2T and 30H,20T are the same situation, because the estimated probability is 0.6 in both cases?
% ========================================================================
No, they are actually different
The more trials you perform, the smaller the error of the inferred parameter $$$\hat{\theta}$$$ with respect to the actual parameter $$$\theta$$$ becomes
Now, you found $$$\hat{\theta}=\frac{a_{H}}{a_{H}+a_{T}}$$$ via MLE, where $$$N=a_{H}+a_{T}$$$
The true parameter $$$\theta^{*}$$$ is what you would obtain without any error, so it's an ideal quantity
Since there are always trial errors, $$$\hat{\theta} \neq \theta^{*}$$$
% ========================================================================
$$$P(|\hat{\theta}-\theta^{*}|\geq \epsilon) \leq 2e^{-2N\epsilon^{2}}$$$ (this is Hoeffding's inequality)
$$$\hat{\theta}$$$: inferred $$$\theta$$$
$$$\theta^{*}$$$: true $$$\theta$$$ which the thumbtack has
$$$\epsilon$$$: a specific error bound
$$$N$$$: number of trials
If the allowed error bound $$$\epsilon$$$ becomes larger, $$$2e^{-2N\epsilon^{2}}$$$ becomes smaller
That is, the larger the error bound, the smaller the probability of exceeding it, $$$P(|\hat{\theta}-\theta^{*}|\geq \epsilon)$$$
Likewise, the larger N becomes, the smaller the probability of exceeding the error bound, $$$P(|\hat{\theta}-\theta^{*}|\geq \epsilon)$$$
So, with increased N, you can guarantee that $$$P(|\hat{\theta}-\theta^{*}|\geq \epsilon)$$$ becomes smaller
% ========================================================================
Making $$$\hat{\theta}$$$ satisfy such a condition by increasing N is PAC (Probably Approximately Correct) learning (a numerical sketch of this bound appears at the end of this note)
% ========================================================================
You inferred $$$\hat{\theta}$$$ by using MLE
And you figured out why $$$\hat{\theta}$$$ becomes a result of PAC learning
% ========================================================================
Next time, you will see another perspective for inferring $$$\theta$$$, instead of MLE
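As a closing numerical check of the PAC statement above, here is a minimal Python sketch (my own illustration, not from the lecture; the true parameter $$$\theta^{*}=0.6$$$, $$$\epsilon=0.1$$$, and the number of simulation runs are all assumptions) comparing the Hoeffding bound $$$2e^{-2N\epsilon^{2}}$$$ for N=5 versus N=50 with an empirically estimated error probability:

```python
# Compare the Hoeffding bound 2*exp(-2*N*eps^2) for N = 5 vs N = 50
# with an empirical estimate of P(|theta_hat - theta*| >= eps).
import math
import random

random.seed(0)
theta_star, eps, runs = 0.6, 0.1, 20_000

for n in (5, 50):
    bound = 2 * math.exp(-2 * n * eps ** 2)
    # Simulate `runs` experiments of n throws each; theta_hat = a_H / n.
    errors = 0
    for _ in range(runs):
        a_h = sum(random.random() < theta_star for _ in range(n))
        if abs(a_h / n - theta_star) >= eps:
            errors += 1
    print(f"N={n:2d}  bound={bound:.3f}  empirical={errors / runs:.3f}")

# The bound shrinks from ~1.81 (vacuous) at N=5 to ~0.74 at N=50,
# and the empirical error probability also drops as N grows:
# 3H,2T and 30H,20T give the same theta_hat but different confidence.
```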