033. Week 06. Training, Testing, Regularization - 02. Trade-off relation between bias and variance

@
After deploying a machine learning application, you rarely get a chance to modify it

In fact, it's best not to touch it after deployment

@
There are two sources of error in machine learning

The first, $E_{in}$, is the part of the data set we could not explain even with our best effort
It's called the "inevitable" approximation error

The second, $\Omega$, is the error that comes from generalization
Some error between the model's generalization and the variability of unseen incoming data is inevitable

@
The entire error $E_{out}$ is bounded by the sum of these two
$E_{out} \leq E_{in} + \Omega$

@
Suppose the following elements
f is the true target function which we want to learn
g is the predicting model which we are in the process of learning
We genuinely can't have the entire data
Even if we supposed we could, data is generated continuously, so the entire data can never be collected
$g^{(D)}$ is the predicting model configured with parameters inferred from D
D is an available dataset drawn from the real world
Now, suppose the absolute entirety of data exists in the world, and data is generated continuously
And suppose we could obtain that entire data by drawing datasets an infinite number of times
Each time we draw a dataset D, we can build the corresponding $g^{(D)}$
Now, suppose we build an infinite number of such $g^{(D)}$
Averaging this infinite collection of $g^{(D)}$ gives the average hypothesis function
So $\bar{g}$ denotes the average hypothesis induced from the infinite number of $g^{(D)}$ via the above process
We can express this as $\bar{g}(x) = E_{D}[g^{(D)}(x)]$
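The averaging process above can be sketched numerically. Below is a minimal Python sketch; the toy target f(x) = sin(pi x), the linear hypothesis set, and all names are illustrative assumptions, not something fixed by the notes:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy true target f (an assumption for illustration only)
def f(x):
    return np.sin(np.pi * x)

def fit_g(D_x, D_y):
    # Least-squares line fit on one dataset D -> parameters of g^(D)
    return np.polyfit(D_x, D_y, 1)

# Approximate g_bar by averaging g^(D) over many drawn datasets D
n_datasets, n_points = 10_000, 5
params = []
for _ in range(n_datasets):
    D_x = rng.uniform(-1, 1, n_points)
    params.append(fit_g(D_x, f(D_x)))
a_bar, b_bar = np.mean(params, axis=0)

def g_bar(x):
    # g_bar(x) ~ E_D[g^(D)(x)]; for a linear model, averaging the
    # parameters is the same as averaging the predictions
    return a_bar * x + b_bar
```

Each draw of D yields a different $g^{(D)}$; $\bar{g}$ is simply what those fits average out to.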

@
We have this formula
$E_{out} \leq E_{in} + \Omega$

Let's consider the error of a single instance of dataset D
$E_{out}(g^{(D)}(x)) = E_{X}[(g^{(D)}(x) - f(x))^{2}]$

Then, let's take the expected error over an infinite number of datasets D
$E_{D}[E_{out}(g^{(D)}(x))] = E_{D}[E_{X}[(g^{(D)}(x) - f(x))^{2}]]$
We can swap $E_{X}$ and $E_{D}$ because both are just averaging (summation) operations, so the order does not matter
$E_{D}[E_{out}(g^{(D)}(x))] = E_{X}[E_{D}[(g^{(D)}(x) - f(x))^{2}]]$
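The swap can be sanity-checked numerically, assuming a finite grid standing in for X and a finite collection of datasets standing in for D:

```python
import numpy as np

rng = np.random.default_rng(1)

# err[d, x] plays the role of (g^(D)(x) - f(x))^2 on a finite grid
# (values are arbitrary non-negative numbers for illustration)
err = rng.uniform(size=(100, 50))

E_X_of_E_D = err.mean(axis=0).mean()  # E_X[E_D[...]]
E_D_of_E_X = err.mean(axis=1).mean()  # E_D[E_X[...]]

# Both orders of averaging give the same number
assert np.isclose(E_X_of_E_D, E_D_of_E_X)
```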
We will simplify $E_{D}[(g^{(D)}(x) - f(x))^{2}]$

We can't go further with $E_{D}[(g^{(D)}(x) - f(x))^{2}]$ as it is
So, we use the trick of inserting $-\bar{g}(x)+\bar{g}(x)$
$E_{D}[(g^{(D)}(x) - f(x))^{2}] = E_{D}[(g^{(D)}(x) -\bar{g}(x) + \bar{g}(x) - f(x))^{2}]$
We group this into two parts
$E_{D}[(g^{(D)}(x) - f(x))^{2}] = E_{D}[((g^{(D)}(x) -\bar{g}(x)) + (\bar{g}(x) - f(x)))^{2}]$
We use $(a+b)^{2} = a^{2}+2ab+b^{2}$
$E_{D}[(g^{(D)}(x) - f(x))^{2}] = E_{D}[(g^{(D)}(x) -\bar{g}(x))^{2} + (\bar{g}(x) - f(x))^{2} +2(g^{(D)}(x) -\bar{g}(x))(\bar{g}(x) - f(x))]$
We can pull any term that does not depend on D outside of the expectation
$(g^{(D)}(x) -\bar{g}(x))^{2}$ depends on D, so it stays inside $E_{D}$
$(\bar{g}(x) - f(x))^{2}$ does not depend on D, so it can move outside
This gives the following
$E_{D}[(g^{(D)}(x) - f(x))^{2}] = E_{D}[(g^{(D)}(x) -\bar{g}(x))^{2}] + (\bar{g}(x) - f(x))^{2} + E_{D}[2(g^{(D)}(x) -\bar{g}(x))(\bar{g}(x) - f(x))]$
The cross term $E_{D}[2(g^{(D)}(x) -\bar{g}(x))(\bar{g}(x) - f(x))]$ becomes 0
because the expectation over infinitely many samplings of D satisfies $E_{D}[g^{(D)}(x)] = \bar{g}(x)$ by the definition of the average hypothesis
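Written out, since $(\bar{g}(x) - f(x))$ does not depend on D, the cross term factors as
$E_{D}[2(g^{(D)}(x) -\bar{g}(x))(\bar{g}(x) - f(x))] = 2(\bar{g}(x) - f(x))\,E_{D}[g^{(D)}(x) -\bar{g}(x)]$
$= 2(\bar{g}(x) - f(x))(\bar{g}(x) - \bar{g}(x)) = 0$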

Dropping the vanishing cross term simplifies $E_{D}[(g^{(D)}(x) - f(x))^{2}]$
$E_{D}[(g^{(D)}(x) - f(x))^{2}] = E_{D}[(g^{(D)}(x) -\bar{g}(x))^{2}] + (\bar{g}(x) - f(x))^{2}$
Then we plug this back under the outer $E_{X}$
$E_{D}[E_{out}(g^{(D)}(x))] = E_{X}[E_{D}[(g^{(D)}(x) - f(x))^{2}]]$
$E_{D}[E_{out}(g^{(D)}(x))] = E_{X}[E_{D}[(g^{(D)}(x) -\bar{g}(x))^{2}] + (\bar{g}(x) - f(x))^{2}]$

We will call $E_{D}[(g^{(D)}(x) -\bar{g}(x))^{2}]$ the Variance(x)
This corresponds to the generalization error $\Omega$

We will call $(\bar{g}(x) - f(x))^{2}$ the $Bias^{2}(x)$
This corresponds to the approximation error $E_{in}$, which comes from the limitation of our predicting model
$\bar{g}(x)$ is the average hypothesis obtained from infinite data
So there is no issue with the data set; the issue comes purely from the limitation of the approximation
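The decomposition can be checked numerically at a single query point. A minimal sketch assuming the same toy setup (target sin(pi x), linear fits; every name and number here is illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)

def f(x):
    return np.sin(np.pi * x)   # toy true target (assumption)

x0 = 0.3                        # a single query point
n_datasets, n_points = 2_000, 5

# Prediction of each g^(D) at x0
preds = np.empty(n_datasets)
for i in range(n_datasets):
    D_x = rng.uniform(-1, 1, n_points)
    a, b = np.polyfit(D_x, f(D_x), 1)   # linear hypothesis g^(D)
    preds[i] = a * x0 + b

expected_err = np.mean((preds - f(x0)) ** 2)  # E_D[(g^(D)(x0) - f(x0))^2]
g_bar_x0 = preds.mean()                       # g_bar(x0)
variance = np.mean((preds - g_bar_x0) ** 2)   # Variance(x0)
bias_sq = (g_bar_x0 - f(x0)) ** 2             # Bias^2(x0)

# The identity E_D[(g - f)^2] = Variance + Bias^2 holds exactly
assert np.isclose(expected_err, variance + bias_sq)
```

Note the identity is algebraic: once $\bar{g}(x_0)$ is taken as the empirical mean of the predictions, the cross term cancels exactly.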

If we take $\bar{g}(x)$ to be a linear function while trying to match the true target function f(x), $\bar{g}(x)$ carries a large bias (the strong assumption of linearity)
As we increase the degree, making the model more complex and nearer to f(x), the bias decreases
However, in this case we run into an issue with variance
The more complex we make the model, the farther each $g^{(D)}$ drifts from the average hypothesis $\bar{g}(x)$ created from the infinitely varied datasets

In conclusion, there is a trade-off relationship between variance and bias
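The trade-off can be made visible by comparing a low-degree and a high-degree fit on small noisy datasets. A sketch under the same illustrative assumptions (target sin(pi x), noise level 0.3; all numbers are arbitrary choices, not from the notes):

```python
import numpy as np

rng = np.random.default_rng(3)

def f(x):
    return np.sin(np.pi * x)   # toy true target (assumption)

def bias_var_at_point(degree, x0=0.3, n_datasets=5_000, n_points=10, noise=0.3):
    # Fit many g^(D) of the given degree and measure Bias^2/Variance at x0
    preds = np.empty(n_datasets)
    for i in range(n_datasets):
        D_x = rng.uniform(-1, 1, n_points)
        D_y = f(D_x) + rng.normal(0.0, noise, n_points)
        preds[i] = np.polyval(np.polyfit(D_x, D_y, degree), x0)
    g_bar = preds.mean()
    return (g_bar - f(x0)) ** 2, preds.var()

bias1, var1 = bias_var_at_point(degree=1)   # simple model: high bias, low variance
bias5, var5 = bias_var_at_point(degree=5)   # complex model: low bias, high variance
```

A line cannot bend toward sin(pi x), so its bias dominates; the degree-5 fit tracks the target on average but swings much more from one dataset to the next.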

@
Because of the limitation of the dataset, we can't train a model that matches the average hypothesis $\bar{g}(x)$, so the Variance never fully vanishes

@
We can't make the average hypothesis $\bar{g}(x)$ match the real-world function f(x) on the bias side alone, because we're confined and limited by the machine learning model we're using

@
So, how can we reduce variance and bias?

The way to reduce variance is to collect more data, so that the trained model approaches the average hypothesis $\bar{g}(x)$

The way to reduce bias is to make the model more complex

@
We can reduce both variance and bias
But as stated before, the two have a trade-off relationship with each other
If we reduce bias by making the model more complex, the chance of overfitting rises, resulting in high variance