002. Week 01. Motivations and Basics - 02. MLE (Maximum Likelihood Estimation)
% ========================================================================
The formula for probability, $$$\text{probability}=\frac{\text{number of favorable cases}}{\text{number of all cases}}$$$, is actually not simply given; it can be derived
To derive the formula above, you need to understand the following concepts, starting with the binomial distribution
% ========================================================================
The binomial distribution is a probability distribution over two discrete outcomes (like head or tail), as opposed to continuous ones
Bernoulli experiment: an experiment whose result is one of two kinds
% ========================================================================
When you throw a thumbtack, you already assume iid trials (independent events, identically distributed according to the binomial distribution)
Independent events: the first throw and the second throw are independent, meaning the two outcomes are not related
Identically distributed according to the binomial distribution: the thumbtack takes no damage, so every trial follows the same probability distribution
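For example, here is a minimal Python sketch of iid Bernoulli trials (the value 0.6 for $$$\theta$$$ is an arbitrary illustration, not a measured thumbtack probability):

import random

random.seed(0)
theta = 0.6  # assumed P(H), fixed across all trials (identically distributed)

# each trial draws a fresh random number, ignoring all previous outcomes (independent)
trials = ["H" if random.random() < theta else "T" for _ in range(5)]
print(trials)  # one possible sequence of 5 outcomes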
% ========================================================================
Let's call the probability of getting H $$$\theta$$$, that is, $$$P(H)=\theta$$$
Then, the probability of getting T is $$$P(T)=1-\theta$$$
% ========================================================================
Under the iid assumption, let's throw the thumbtack 5 times
Suppose you observed HHTHT
$$$P(HHTHT)={\theta}{\theta}{(1-\theta)}{\theta}{(1-\theta)}$$$
$$$P(HHTHT)={\theta^{3}}{(1-\theta)}^{2}$$$
which is a very small number
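To see how small, a quick check in Python (the values of $$$\theta$$$ are arbitrary illustration choices):

def likelihood(theta):
    # P(HHTHT) = theta^3 * (1 - theta)^2
    return theta**3 * (1 - theta)**2

print(likelihood(0.5))  # 0.03125, i.e. 1/32
print(likelihood(0.6))  # 0.03456, slightly larger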
% ========================================================================
Let's describe the experiment above in more detail
Call the data D
D=H,H,T,H,T
$$$n=5$$$, the number of trials
$$$k=a_{H}=3$$$, the number of heads
$$$p=\theta=P(H)$$$, the probability of getting head
When $$$\theta$$$ is given, the probability of observing D is $$$P(D|\theta)=\theta^{a_{H}}(1-\theta)^{a_{T}}$$$
Given this information, how can you arrive at the probabilities $$$\frac{3}{5}$$$ and $$$\frac{2}{5}$$$?
% ========================================================================
MLE (Maximum Likelihood Estimation)
$$$P(D|\theta)=\theta^{a_{H}}(1-\theta)^{a_{T}}$$$
Data: the observed sequence D, summarized by the counts $$$a_{H}$$$ and $$$a_{T}$$$
Let's state a hypothesis:
1. The gambling result with the thumbtack follows a binomial distribution with parameter $$$\theta$$$
Given this, how can you make the hypothesis more credible?
1. You can look for a better distribution that explains your observation
This is possible, but you would need more rationale
2. You can accept that the result follows a binomial distribution with parameter $$$\theta$$$
Then, you can optimize $$$\theta$$$ to make your hypothesis stronger,
which means: with which $$$\theta$$$ will you best explain the data?
Finding the best candidate $$$\theta$$$ is the core idea of MLE
% ========================================================================
MLE: Probabilistic estimation
MLE is about finding the $$$\theta$$$ that maximizes the probability of the observed data
Mathematically,
you should find the argument $$$\theta$$$ that maximizes $$$P(D|\theta)$$$
and call the $$$\theta$$$ you find $$$\hat{\theta}$$$
$$$\hat{\theta}=\arg\max_{\theta} P(D|\theta)$$$
$$$P(D|\theta)$$$: the conditional probability of observing the data D when $$$\theta$$$ is given
$$$\hat{\theta}$$$: the $$$\theta$$$ that maximizes $$$P(D|\theta)$$$
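A minimal numeric sketch of this argmax, assuming the observed counts $$$a_{H}=3$$$ and $$$a_{T}=2$$$ from above (the grid resolution is an arbitrary choice):

def likelihood(theta, a_H=3, a_T=2):
    # P(D | theta) = theta^{a_H} * (1 - theta)^{a_T}
    return theta**a_H * (1 - theta)**a_T

# brute-force search: keep the candidate theta with the largest P(D | theta)
candidates = [i / 1000 for i in range(1001)]
theta_hat = max(candidates, key=likelihood)
print(theta_hat)  # 0.6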
% ========================================================================
Let's apply MLE to the thumbtack-throwing gamble
You can define $$$\hat{\theta}$$$ as follows:
$$$\hat{\theta} = \arg\max_{\theta} P(D|\theta)$$$
$$$\hat{\theta} = \arg\max_{\theta} \theta^{a_{H}}(1-\theta)^{a_{T}}$$$
The expression above is awkward to optimize directly because it contains powers
% ========================================================================
You can use the log function, which increases monotonically
% /home/young/사진/IlChul_Moon-Basic_ML/2018_07_06_14:30:51.png
Even if you map the probability function $$$P(D|\theta)$$$ through the log function, the point where P reaches its maximum stays the same
Using ln:
$$$\hat{\theta} = \arg\max_{\theta} \ln{P(D|\theta)}$$$
$$$\hat{\theta} = \arg\max_{\theta} \ln{\theta^{a_{H}}(1-\theta)^{a_{T}}}$$$
$$$\hat{\theta} = \arg\max_{\theta} \{a_{H}\ln{\theta}+a_{T}\ln{(1-\theta)}\}$$$
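A quick numeric check that the log transform preserves the argmax (same grid idea as before; the endpoints are excluded because ln 0 is undefined):

import math

def log_likelihood(theta, a_H=3, a_T=2):
    # a_H * ln(theta) + a_T * ln(1 - theta)
    return a_H * math.log(theta) + a_T * math.log(1 - theta)

candidates = [i / 1000 for i in range(1, 1000)]  # avoid theta = 0 and theta = 1
print(max(candidates, key=log_likelihood))  # 0.6, the same maximizer as before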
% ========================================================================
Now, it becomes a maximization problem
From $$$\{a_{H}\ln{\theta}+a_{T}\ln{(1-\theta)}\}$$$, you need to optimize $$$\theta$$$
That is, you need to find the $$$\theta$$$ which maximizes $$$\{a_{H}\ln{\theta}+a_{T}\ln{(1-\theta)}\}$$$
Which method will you use?
You can set the derivative to zero
You differentiate with respect to $$$\theta$$$, then find the point where the derivative is 0
$$$\frac{d}{d\theta} \{ a_{H} \ln{\theta} + a_{T} \ln{(1-\theta)} \} = 0$$$
$$$\frac{a_{H}}{\theta} - \frac{a_{T}}{1 - \theta} = 0$$$
Rearranging with respect to $$$\theta$$$:
$$$a_{H}(1-\theta) = a_{T}\theta$$$
$$$\theta=\frac{a_{H}}{a_{T}+a_{H}}$$$
$$$a_{H}$$$: the number of heads
$$$a_{T}+a_{H}$$$: the number of throws
Results of the probability formula, like $$$\frac{3}{5}$$$ and $$$\frac{2}{5}$$$, come from these steps: the binomial distribution, MLE, and the optimization process
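As a sanity check, here is a short sketch comparing the closed-form answer $$$\frac{a_{H}}{a_{T}+a_{H}}$$$ with a generic numeric optimizer (this assumes SciPy is available; minimize_scalar minimizes, so the log-likelihood is negated):

import math
from scipy.optimize import minimize_scalar

a_H, a_T = 3, 2

def neg_log_likelihood(theta):
    return -(a_H * math.log(theta) + a_T * math.log(1 - theta))

# numeric maximization of the log-likelihood on the open interval (0, 1)
result = minimize_scalar(neg_log_likelihood, bounds=(1e-9, 1 - 1e-9), method="bounded")
print(result.x)           # approximately 0.6
print(a_H / (a_H + a_T))  # 0.6, the closed-form MLE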
% ========================================================================
When $$$\theta=\frac{a_{H}}{a_{T}+a_{H}}$$$, $$$\theta$$$ is the best candidate from the MLE perspective
So, you can write:
$$$\hat{\theta}=\frac{a_{H}}{a_{T}+a_{H}}$$$
This $$$\theta$$$ is the optimized parameter $$$\hat{\theta}$$$ inferred via MLE
% ========================================================================
So, you can now explain where $$$\frac{3}{5}$$$ and $$$\frac{2}{5}$$$ come from
And you can say $$$\hat{\theta} = 0.6$$$ is the best $$$\theta$$$
% ========================================================================
Now, suppose you throw the thumbtack 50 times
and observe 30 heads and 20 tails
Then, since $$$\hat{\theta}=0.6$$$ in both cases, can you say that 3H/2T and 30H/20T are the same?
% ========================================================================
No, they are actually different
The more trials you perform, the smaller the error of the inferred parameter $$$\hat{\theta}$$$ relative to the actual parameter $$$\theta$$$
You found $$$\hat{\theta}=\frac{a_{H}}{a_{H}+a_{T}}$$$ via MLE, where $$$N=a_{H}+a_{T}$$$
The true parameter $$$\theta^{*}$$$ is the ideal, error-free value
Since there are always trial errors, in general $$$\hat{\theta} \neq \theta^{*}$$$
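A minimal simulation sketch of this effect (the true value $$$\theta^{*}=0.6$$$ and the repetition count are assumed illustration choices): estimate $$$\hat{\theta}$$$ many times from 5 trials and from 50 trials, and compare the spread of the estimates.

import random
import statistics

random.seed(42)
theta_star = 0.6  # assumed true parameter

def mle_estimate(n):
    heads = sum(random.random() < theta_star for _ in range(n))
    return heads / n  # theta_hat = a_H / (a_H + a_T)

for n in (5, 50):
    estimates = [mle_estimate(n) for _ in range(10000)]
    print(n, statistics.stdev(estimates))
# the spread (standard deviation) of theta_hat shrinks as n grows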
% ========================================================================
$$$P(|\hat{\theta}-\theta^{*}|\geq \epsilon) \leq 2e^{-2N\epsilon^{2}}$$$
$$$\hat{\theta}$$$: the inferred $$$\theta$$$
$$$\theta^{*}$$$: the true $$$\theta$$$ of the thumbtack
$$$\epsilon$$$: a chosen error bound
$$$N$$$: the number of trials
This bound is Hoeffding's inequality
If the allowed error bound $$$\epsilon$$$ becomes larger,
$$$2e^{-2N\epsilon^{2}}$$$ becomes smaller
That is, the larger the error bound, the smaller the probability $$$P(|\hat{\theta}-\theta^{*}|\geq \epsilon)$$$ of exceeding it
Likewise, the larger N becomes, the smaller $$$P(|\hat{\theta}-\theta^{*}|\geq \epsilon)$$$ becomes
So, by increasing N, you can make $$$P(|\hat{\theta}-\theta^{*}|\geq \epsilon)$$$ as small as you like
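A sketch comparing the empirical error probability with the bound $$$2e^{-2N\epsilon^{2}}$$$ (the values $$$\theta^{*}=0.6$$$ and $$$\epsilon=0.1$$$ are assumed for illustration):

import math
import random

random.seed(0)
theta_star, eps, repeats = 0.6, 0.1, 10000

for N in (10, 50, 200):
    errors = 0
    for _ in range(repeats):
        heads = sum(random.random() < theta_star for _ in range(N))
        if abs(heads / N - theta_star) >= eps:
            errors += 1
    bound = 2 * math.exp(-2 * N * eps**2)
    print(N, errors / repeats, bound)
# the empirical probability stays below the bound, and both shrink as N grows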
% ========================================================================
Making $$$\hat{\theta}$$$ satisfy such a condition by increasing N is the idea of PAC (Probably Approximately Correct) learning
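For example, you can invert the bound to ask how many trials are enough: requiring $$$2e^{-2N\epsilon^{2}} \leq \delta$$$ gives $$$N \geq \frac{\ln{(2/\delta)}}{2\epsilon^{2}}$$$. A tiny sketch (the tolerance $$$\epsilon=0.1$$$ and failure probability $$$\delta=0.01$$$ are illustration values):

import math

eps, delta = 0.1, 0.01  # assumed error tolerance and failure probability
N = math.log(2 / delta) / (2 * eps**2)
print(math.ceil(N))  # 265 trials are enough for this (eps, delta) pair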
% ========================================================================
You inferred $$$\hat{\theta}$$$ by using MLE
And you saw why $$$\hat{\theta}$$$ can be viewed as a result of PAC learning
% ========================================================================
Next time, you will see another perspective for inferring $$$\theta$$$, other than MLE