This is notes which I wrote based on the lecture originated from https://datascienceschool.net/view-notebook/ea4584bde04140368950ee50ef038833/ ================================================================================ - Suppose you know probability distribution function p(X) of random variable - You will learn how to generate samples which follow that probability distribution function using following methods * uniform distribution * inverse transform * rejection sampling * importance sampling ================================================================================ Python's random number generator is actually pseudo random number generator which uses uniform distribution ================================================================================ Basic probability distribution functions have equations. So, you can use inverse transform to generate samples. - Suppose monotonous increasing function f(x) - You generate uniform distribution samples - You put samples into f(x) and get ys - Let's say Y which has ys - Y's cdf is $$$f^{-1}$$$ $$$F_Y(y)=f^{-1}(x)$$$ ================================================================================ - Suppose you have target probability distribution p(x) which you want to have - Supposee there is probability distribution q(x) which is similar to p(x) But q(x) is easier to generate samples than q(x) - Generate samples from q(x) - You decide whether you throw extracted sample away or not, by using probability $$$\frac{p(z)}{kq(z)}$$$ - k is scaling constant to make $$$kq(x)\ge p(x)$$$ ================================================================================ Inference expectation value You can inference "expectation value" of probability distribution. It's the reason why you want to create samples from probability distribution. ================================================================================ Suppose you have probability distribution p(X) which you want to analyze Expectation value of it is $$$\text{E}[f(X)] = \int f(x)p(x)dx$$$ How to calculate (or inference) expectation value? Use monte carlo integration when you have N samples $$$\{ x_1, \ldots, x_N \}$$$ $$$\text{E}[f(X)] \approx \dfrac{1}{N} \sum_{i=1}^N f(x_i)$$$ Error becomes smaller as you have larger N ================================================================================ Importance sampling - Suppose your goal is the only one to generate samples via expectation value - Then you can use importance sampling where "generate samples" and "monte carlo integration for calculating expectation value" are merged ================================================================================ Step - Create similar distribution q which satisfies $$$kq>p$$$ - Calculate expectation value $$$\text{E}[f(X)] = \int f(x)p(x)dx \\ = \int f(x)\dfrac{p(x)}{q(x)} q(x) dx \\ \approx \dfrac{1}{N} \sum_{i=1}^N f(x_i)\dfrac{p(x_i)}{q(x_i)}$$$ ================================================================================ $$$\dfrac{p(x_i)}{q(x_i)}$$$ is called importance because it is used as weight for samples There is no samples thrown away like rejection sampling which threw away 80% of samples in above example. Using importance sampling, you can calculate expectation value wihtout having thrown sample. ================================================================================ Markov chain - Suppose time series probability procedure - Suppose that time series probability procedure has K finite states - If probability distribution $$$p(x_t) depends on p(x_{t-1}) and p(x_t|x_{t-1})$$$ this time series probability procedure is called "Markov chain" - Note that "Markov chain" is different from "linear chain Markov network" ================================================================================ Let's apply "law of the total probability" on discrete state Markov chain Then, following is true. $$$p(x_{t}) = \sum_{x_{t-1}} p(x_t, x_{t-1}) = \sum_{x_{t-1}} p(x_t | x_{t-1}) p(x_{t-1})$$$ ================================================================================ $$$p(x_t)$$$ is category distribution, you can express it as row vector $$$p_t$$$ then, you can convert $$$p(x_{t}) = \sum_{x_{t-1}} p(x_t, x_{t-1}) = \sum_{x_{t-1}} p(x_t | x_{t-1}) p(x_{t-1})$$$ to following matrix form $$$p_t=p_{t-1}T$$$ $$$T$$$: transition matrix, $$$K\times K$$$ matrix, conditional probability $$$T_{ij}=P(x_t=j|x_{t-1}=i)$$$ ================================================================================ If transition matrix is symmetric matrix, Markov chain is called "reversible Markov chain" or you say Markov chain satisfies "detailed balance condition" ================================================================================ These Markov chain always converges into same distribution $$$p_{\infty}$$$ regardless of initial condition $$$p_0$$$ as time t goes by. $$$p_{t'} = p_{t'+1} = p_{t'+2} = p_{t'+3} = \cdots = p_{\infty}$$$ $$$t'$$$: time after convergence After reaching to convergence state, samples $$$\{x_{t'}, x_{t'+1}, x_{t'+1}, \ldots, x_{t'+N} \}$$$ are independent values which come from same probability distribution ================================================================================ MCMC Markov Chain Monte Carlo generates samples using Markov chain. You first create Markov chain for Markov chain's convergence distribution to have p(x) Then, you run this Markov chain for $$$t^{'}$$$ hours, after that, you can get samples in distribution you want. ================================================================================ Metropolis-Hastings Sampling Metropolis-Hastings Sampling is used to generate sample as part method of MCMC. Metropolis-Hastings Sampling is similar to rejection sampling but MH uses conditional distribution $$$q(x^*|x_t)$$$ instead of unconditional distribution $$$q(x)$$$ for calculation distribution. MH generally uses Gaussian distribution which has expectation value $$$x_t$$$ $$$q(x^*|x_t)=N(x^*|x_t,\sigma^2)$$$ ================================================================================ How to generate samples using Metropolis-Hastings - if $$$t=0$$$, randomly generate $$$x_t$$$ - generate sampe $$$x^*$$$ from distribution $$$q(x^*|x_t=x_t)$$$ where samples can be generated - According to following probability p, select x_{t+1} for x^* Following probability p is called Metropolis-Hastings criterion $$$p = \min \left( 1, \dfrac{p(x^{\ast}) q(x_t \mid x^{\ast})}{p(x_t) q(x^{\ast} \mid x_t)} \right)$$$ If none is selected, newly selected value should be thrown away, you use previous value, $$$x_{t+1}=x_t$$$ - You iterate above step until you have enough number of samples. ================================================================================ If calculation distribution $$$q(x^*|x_t)$$$ is Gaussian normal distribution, if condition x_t and result $$$x^*$$$ are changed, probability is same. $$$q(x^{\ast} \mid x_t) = q(x_t \mid x^{\ast})$$$ ================================================================================ In this case, criterion becomes $$$p = \min \left( 1, \dfrac{p(x^{\ast})}{p(x_t)} \right)$$$ which is called "Metropolis criterion" ================================================================================ According to Metropolis criterion, it tries to make $$$p(x^*)$$$ bigger than $$$p(x_t)$$$ ================================================================================ PyMC3 is used to generate MCMC samples and is used to perform Bayesian inference. PyMC3 can use GPU ================================================================================
# @ Code
# If you already installed MKL library, you first need to execute this
import os
os.environ["MKL_THREADING_LAYER"]="GNU"

# ================================================================================
import pymc3 as pm
================================================================================ Step of usage: - Create Model class. Instance of this class is used like session in TensorFlow - Create symbolic random variable. - Create sampler instance like Metropolis, HamiltonianMC, NUTS - Generate sample by using sampe() ================================================================================
# @ Code
# c cov: covariance matrix
cov=np.array([[1.,1.5],[1.5,4]])
# c mu: mean value
mu=np.array([1,-1])

# ================================================================================
# Create Model scope
with pm.Model() as model:
  # c x: instance of multivariate Gaussian normal distribution
  x = pm.MvNormal('x',mu=mu,cov=cov,shape=(1,2))

  # c step: instance of Metropolis
  step = pm.Metropolis()

  # @ Execute generating 1000 sample
  trace = pm.sample(1000, step)

# ================================================================================
import warnings
warnings.simplefilter("ignore")

pm.traceplot(trace)
plt.show()

plt.scatter(trace['x'][:, 0, 0], trace['x'][:, 0, 1])
plt.show()
================================================================================ Bayesian Estimation using MCMC $$$P(\theta \mid x_{1},\ldots,x_{N}) \propto P(x_{1},\ldots,x_{N} \mid \theta) P(\theta)$$$ $$$P(\theta)$$$: Beta $$$P(x_1,\cdots,x_N|\theta)$$$: Binomial
# @ Code
theta0=0.7

# @ Set seed number
np.random.seed(0)

# c x_data1: generated 10 data which follow bernoulli distribution
x_data1=sp.stats.bernoulli(theta0).rvs(10)
print(x_data1)
# array([1, 0, 1, 1, 1, 1, 1, 0, 0, 1])

# @ create model scope
with pm.Model() as model:
  theta=pm.Beta('theta',alpha=1,beta=1)

  x=pm.Bernoulli('x',p=theta,observed=x_data1)

  start=pm.find_MAP()

  step=pm.NUTS()

  trace1=pm.sample(2000,step=step,start=start)

pm.traceplot(trace1)
plt.show()

pm.plot_posterior(trace1)
plt.xlim(0, 1)
plt.show()
================================================================================ Bayesian Linear Regression using MCMC
# @ Code
from sklearn.datasets import make_regression

# @ Create data for regression task
# 100 samples, 1 feature
x,y_data,coef=make_regression(n_samples=100,n_features=1,bias=0,noise=20,coef=True,random_state=1)
x = x.flatten()

# ================================================================================
print(coef)
# array(80.71051956)

# ================================================================================
plt.scatter(x, y_data)
plt.show()

# ================================================================================
# Create Model scope
with pm.Model() as m:

  w=pm.Normal('w',mu=0,sd=50)
  b=pm.Normal('b',mu=0,sd=50)
  esd=pm.HalfCauchy('esd',5)
  y=pm.Normal('y',mu=w*x+b,sd=esd,observed=y_data)

  start = pm.find_MAP()
  step = pm.NUTS()
  trace1 = pm.sample(10000, step=step, start=start)

# logp = -448.82, ||grad|| = 6.8486: 100%|██████████| 37/37 [00:00<00:00, 1599.20it/s]  
# Multiprocess sampling (4 chains in 4 jobs)
# NUTS: [esd, b, w]
# Sampling 4 chains: 100%|██████████| 42000/42000 [00:16<00:00, 2477.73draws/s]

# ================================================================================
pm.traceplot(trace1)
plt.show()

# ================================================================================
pm.summary(trace1)