https://datascienceschool.net/view-notebook/c2934b8a2f9140e4b8364b266b1aa0d8/
================================================================================
* Suppose a categorical random variable Z.
* The probability mass function of Z is $$$p(z=k)=\pi_k$$$
* Suppose a random variable X which outputs real numbers.
* Suppose that, as the sampled category k of Z varies, the conditional mean $$$\mu_k$$$ and covariance $$$\Sigma_k$$$ of X vary:
$$$p(x\mid z=k) = \mathcal{N}(x\mid\mu_k,\Sigma_k)$$$
* You can merge the two expressions above:
$$$p(x) = \sum\limits_{z} p(z)p(x\mid z) = \sum\limits_{k=1}^{K} \pi_k \mathcal{N}(x \mid \mu_k, \Sigma_k)$$$
================================================================================
output_from_random_variable_Z=K_class_categorical_random_variable_Z(result_from_trial)
output_from_random_variable_X=random_variable_X_which_outputs_real_number(
    result_from_trial,output_from_random_variable_Z)
mean,var=mean_var(output_from_random_variable_X)

output_from_random_variable_Z varies --> mean,var vary:
- X is composed of multiple Gaussian normal distributions
- X follows a Gaussian mixture model
================================================================================
with scope:Gaussian_mixture_model
    you can't observe output_from_random_variable_Z

A model which contains a "latent random variable" is called a "latent variable model"

with scope:Mixture_model
    categorical_value=latent_random_variable(result_from_trial)
with scope:Other_models
    float_value=latent_random_variable(result_from_trial)
================================================================================
Bernoulli-Gaussian mixture model: the number of categories is 2

output_from_bernoulli_random_variable=bernoulli_random_variable(result_from_trial)
# output_from_bernoulli_random_variable={category1,category2}
output_from_Gaussian_normal_distribution=Gaussian_normal_distribution(
    result_from_trial,output_from_bernoulli_random_variable)

output_from_bernoulli_random_variable_case1 output_from_Gaussian_normal_distribution_case1
output_from_bernoulli_random_variable_case2 output_from_Gaussian_normal_distribution_case2
output_from_bernoulli_random_variable_case3 output_from_Gaussian_normal_distribution_case3
output_from_bernoulli_random_variable_case4 output_from_Gaussian_normal_distribution_case4
output_from_bernoulli_random_variable_case5 output_from_Gaussian_normal_distribution_case5
================================================================================
Bernoulli (2 categories) + Gaussian (2D) mixture model
* Code

import numpy as np
import matplotlib.pyplot as plt
from numpy.random import randn

n_samples=500
# c mu1: mean of the first component
mu1=np.array([0,0])
# c mu2: mean of the second component
mu2=np.array([-6,3])
sigma1=np.array([[0.,-0.1],[1.7,.4]])
sigma2=np.eye(2)
np.random.seed(0)
X=np.r_[1.0*np.dot(randn(n_samples,2),sigma1)+mu1,
        0.7*np.dot(randn(n_samples,2),sigma2)+mu2]

plt.scatter(X[:,0],X[:,1],s=5)
plt.xlabel("x1")
plt.ylabel("x2")
plt.title("Samples of Bernoulli-Gaussian mixture model")
plt.show()
================================================================================
How to infer the parameters of a Gaussian mixture model

estimated_parameters=estimate_parameters_of_Gaussian_mixture_model()
print("estimated_parameters",estimated_parameters)
# PDF of categorical distribution
# PDF of Gaussian normal distribution at category1
# PDF of Gaussian normal distribution at category2
# PDF of Gaussian normal distribution at category3

The log-likelihood of a Gaussian mixture model has a complicated shape, so its parameters cannot be found in closed form by linear-algebra methods.
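* As a quick illustration of the merged density $$$p(x) = \sum_k \pi_k \mathcal{N}(x \mid \mu_k, \Sigma_k)$$$, the following minimal sketch evaluates it with SciPy. The weights, means, and covariances here are arbitrary illustrative values, not estimates fitted to the data above.

import numpy as np
from scipy.stats import multivariate_normal

pis=[0.6,0.4]                                      # categorical probabilities pi_k
mus=[np.array([0.,0.]),np.array([-6.,3.])]         # component means mu_k
sigmas=[np.eye(2),np.eye(2)]                       # component covariances Sigma_k

def gmm_pdf(x,pis,mus,sigmas):
    # p(x) = sum_k p(z=k) p(x|z=k)
    return sum(pi*multivariate_normal.pdf(x,mean=mu,cov=sigma)
               for pi,mu,sigma in zip(pis,mus,sigmas))

print(gmm_pdf(np.array([1.0,-0.5]),pis,mus,sigmas))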
================================================================================
The probability distribution of a random variable X observed as N data points is as follows:
$$$p(x) = \prod\limits_{i=1}^N p(x_i) = \prod\limits_{i=1}^N \sum\limits_{z_i} p(x_i,z_i) = \prod\limits_{i=1}^N \sum\limits_{z_i} p(z_i)p(x_i\mid z_i) = \prod\limits_{i=1}^N \sum\limits_{k=1}^{K} \pi_k \mathcal{N}(x_i\mid \mu_k, \Sigma_k)$$$
* You can take the log:
$$$\log p(x) = \sum\limits_{i=1}^N \log \left( \sum\limits_{k=1}^{K} \pi_k \mathcal{N}(x_i\mid \mu_k, \Sigma_k) \right)$$$
* For both equations, you can't easily find the parameters which make the derivatives equal to 0.
================================================================================
* If you knew which "category $$$z_i$$$" each "data point $$$x_i$$$" belongs to, you could collect all the data points $$$x_i$$$ which belong to the same category.
Then you could estimate the "categorical probability distribution $$$\pi_k$$$" as well as the parameters "$$$\mu_k$$$" and "$$$\Sigma_k$$$" of each "Gaussian normal distribution".
* However, you can't observe the "category $$$z_i$$$" of each "data point $$$x_i$$$".
So you have to find the $$$\pi_k, \mu_k, \Sigma_k$$$ which maximize $$$p(x)$$$ by non-linear optimization.
================================================================================
* From the perspective of a probabilistic graphical network model, the Gaussian mixture model is a simple model in which random variable $$$Z_i$$$ affects random variable $$$X_i$$$.
* But since $$$Z_i$$$ affects $$$X_i$$$ for every i from 1 to N, you should draw this relationship repeated N times.
* The graph above can be expressed as the following plate model.
================================================================================
EM (Expectation-Maximization)
* When you infer the "parameters" of a "mixture model", the "conditional probability $$$p(z|x)$$$", which tells you which category z a data point x belongs to, is important.
* The "conditional probability $$$p(z|x)$$$" used for this purpose is called the "responsibility".
* The responsibility $$$\pi_{ik}$$$ is written as follows:
$$$\pi_{ik} = p(z_i=k\mid x_i) = \dfrac{p(z_i=k)p(x_i\mid z_i=k)}{p(x_i)} = \dfrac{p(z_i=k)p(x_i\mid z_i=k)}{\sum\limits_{j=1}^K p(x_i,z_i=j)} = \dfrac{p(z_i=k)p(x_i\mid z_i=k)}{\sum\limits_{j=1}^K p(z_i=j)p(x_i\mid z_i=j)}$$$
* The responsibility $$$\pi_{ik}$$$ for the Gaussian mixture model is written as follows:
$$$\pi_{ik} = \dfrac{\pi_k \mathcal{N}(x_i\mid \mu_k, \Sigma_k)}{\sum\limits_{j=1}^K \pi_j \mathcal{N}(x_i\mid \mu_j, \Sigma_j)}$$$
================================================================================
* The equation above infers the responsibility $$$\pi_{ik}$$$ from the parameters $$$\pi_k,\mu_k,\Sigma_k$$$
$$$(\pi_k, \mu_k, \Sigma_k)$$$ --inference--> $$$\pi_{ik}$$$
* $$$\pi_{ik}$$$: the probability that the "ith data point $$$x_i$$$" was generated from "category k"
================================================================================
* Now, you will maximize the log of the joint probability of the data, $$$\log p(x)$$$ (the log-likelihood).
* First, you differentiate $$$\log p(x)$$$ with respect to $$$\mu_k$$$ and set the derivative to 0:
$$$0 = - \sum\limits_{i=1}^N \dfrac{p(z_i=k)p(x_i\mid z_i=k)}{\sum\limits_{j=1}^K p(z_i=j)p(x_i\mid z_i=j)} \Sigma_k^{-1} (x_i - \mu_k )$$$
================================================================================
* Using the definition of the responsibility $$$\pi_{ik}$$$ and multiplying through by $$$\Sigma_k$$$, you can simplify it:
$$$\sum\limits_{i=1}^N \pi_{ik} (x_i - \mu_k ) = 0 $$$
$$$\mu_k = \dfrac{1}{N_k} \sum\limits_{i=1}^N \pi_{ik} x_i$$$
================================================================================
From the equation above, you can see that $$$N_k = \sum_{i=1}^N \pi_{ik}$$$
* $$$N_k$$$: the effective (approximate) number of data points which belong to the k-th category
* $$$\mu_k$$$: approximately the sample mean of the data points which belong to the k-th category
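* As a quick numerical illustration of the responsibility formula and the $$$\mu_k$$$ update above, the following sketch computes $$$\pi_{ik}$$$ for the data array X generated earlier. The mixture parameters are arbitrary illustrative values, not fitted estimates.

import numpy as np
from scipy.stats import multivariate_normal

pis=np.array([0.5,0.5])                              # pi_k
mus=[np.array([0.,0.]),np.array([-6.,3.])]
sigmas=[np.eye(2),np.eye(2)]

# pi_ik = pi_k N(x_i|mu_k,Sigma_k) / sum_j pi_j N(x_i|mu_j,Sigma_j)
num=np.column_stack([pis[k]*multivariate_normal.pdf(X,mean=mus[k],cov=sigmas[k])
                     for k in range(2)])
resp=num/num.sum(axis=1,keepdims=True)

print(np.allclose(resp.sum(axis=1),1.0))             # each row sums to 1
Nk=resp.sum(axis=0)                                  # N_k = sum_i pi_ik
print(Nk,Nk.sum())                                   # the N_k sum to N over k
mu_k=(resp.T@X)/Nk[:,None]                           # mu_k = (1/N_k) sum_i pi_ik x_i
print(mu_k)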
================================================================================
Secondly, you differentiate $$$\log p(x)$$$ with respect to $$$\Sigma_k$$$, set the derivative to 0, and solve for the parameter $$$\Sigma_k$$$ which maximizes it:
$$$\Sigma_k = \dfrac{1}{N_k} \sum_{i=1}^N \pi_{ik} (x_i-\mu_k)(x_i-\mu_k)^T$$$
================================================================================
Lastly, you differentiate $$$\log p(x)$$$ with respect to $$$\pi_k$$$ and find the parameter $$$\pi_k$$$ which maximizes it.
* Because the parameters of a categorical random variable are constrained to sum to 1, you need to add a Lagrange multiplier term:
$$$\log p(x) + \lambda \left(\sum_{k=1}^K \pi_k - 1 \right)$$$
* Then you differentiate the expression above with respect to $$$\pi_k$$$ and find the value of $$$\pi_k$$$ which makes the derivative 0:
$$$\pi_k = \dfrac{N_k}{N}$$$
================================================================================
* The 3 equations above calculate the parameters from the "responsibility":
$$$\pi_{ik}$$$ --is used to calculate--> $$$(\pi_k, \mu_k, \Sigma_k )$$$
================================================================================
* Originally, you had to estimate both the parameters $$$(\pi_k, \mu_k, \Sigma_k )$$$ and the responsibility $$$\pi_{ik}$$$.
* But if you already know the responsibility $$$\pi_{ik}$$$, the equations for the parameters $$$(\pi_k, \mu_k, \Sigma_k )$$$ become easy to solve.
* So you can use an iterative method called EM to estimate the parameters $$$(\pi_k, \mu_k, \Sigma_k )$$$.
* The EM method infers the "parameters" and the "responsibility" in turn, increasing the precision of the estimates at each iteration (see the sketch below).
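* Below is a minimal NumPy/SciPy sketch of one EM iteration, implementing the responsibility formula (E step) and the three update equations above (M step). It reuses the data array X generated earlier; the initial parameter values and the number of iterations are arbitrary illustrative choices.

import numpy as np
from scipy.stats import multivariate_normal

def em_step(X,pis,mus,sigmas):
    N,K=len(X),len(pis)
    # E step: pi_ik = pi_k N(x_i|mu_k,Sigma_k) / sum_j pi_j N(x_i|mu_j,Sigma_j)
    dens=np.column_stack([pis[k]*multivariate_normal.pdf(X,mean=mus[k],cov=sigmas[k])
                          for k in range(K)])
    resp=dens/dens.sum(axis=1,keepdims=True)
    # M step: update (pi_k, mu_k, Sigma_k) from the responsibilities
    Nk=resp.sum(axis=0)                              # N_k = sum_i pi_ik
    pis_new=Nk/N                                     # pi_k = N_k / N
    mus_new=(resp.T@X)/Nk[:,None]                    # mu_k = (1/N_k) sum_i pi_ik x_i
    sigmas_new=[]
    for k in range(K):
        d=X-mus_new[k]
        sigmas_new.append((resp[:,k,None]*d).T@d/Nk[k])  # Sigma_k
    return pis_new,mus_new,sigmas_new

# illustrative initial guesses (not fitted values)
pis=np.array([0.5,0.5])
mus=np.array([[1.0,1.0],[-5.0,2.0]])
sigmas=[np.eye(2),np.eye(2)]
for _ in range(20):
    pis,mus,sigmas=em_step(X,pis,mus,sigmas)
print(pis)
print(mus)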
So, you "inference responsibility" which represents data is involved in which category by using assumed parameters $$$(\pi_k,\mu_k,\Sigma_k)$$$ --are used to inference--> $$$\pi_{ik}$$$ ================================================================================ * M step * You assume "responsibility" you know is precise So, you inference "parameters of Gaussian mixture model" by using assumed responsibility $$$\pi_{ik}$$$ --is used to inference--> $$$(\pi_k,\mu_k,\Sigma_k)$$$ ================================================================================ * If you iterate above EM, you can make parameter and responsibility better gradually ================================================================================ * K-means clustering $$$\subset$$$ EM If you can know "responsibility" of each data, you can find maximal "responsibility" in which each data is involved It means that you can perform "clustering" $$$k_i = \arg_k\max \pi_{ik}$$$ K-means clustering is special case of EM algorithm ================================================================================ * GaussianMixture from Scikit-Learn from sklearn.mixture import GaussianMixture from sklearn.utils.testing import ignore_warnings from sklearn.exceptions import ConvergenceWarning def plot_gaussianmixture(n): # c model: GaussianMixture model=GaussianMixture( n_components=2,init_params='random',random_state=0,tol=1e-9,max_iter=n) with ignore_warnings(category=ConvergenceWarning): model.fit(X) pi=model.predict_proba(X) plt.scatter(X[:,0],X[:,1],s=50,linewidth=1,edgecolors="b",cmap=plt.cm.binary,c=pi[:,0]) plt.title("iteration:{}".format(n)) plt.figure(figsize=(8, 12)) plt.subplot(411) plot_gaussianmixture(1) plt.subplot(412) plot_gaussianmixture(5) plt.subplot(413) plot_gaussianmixture(10) plt.subplot(414) plot_gaussianmixture(15) plt.tight_layout() plt.show() ================================================================================ General EM algorithm * EM algorithm inferences probability distribution p(x) when there is "random variable x" which is dependent on latent random variable z and when latent random variable z is not observable but random variable x is observable output_from_Z=random_variable_Z(result_from_trial) ret=is output_from_Z observable print(ret) # False output_from_X=random_variable_X(result_from_trial,output_from_Z) ret=is output_from_X observable print(ret) # True estimated_p(X)=EM_algorithm_to_inference_p(X)(output_from_X) scope:network_model ret=conditional_probability_distribution_p(X|Z,theta) ret1=is ret determined by parameter theta? print(ret1) # True ret2=you already know conditional_probability_distribution_p(X|Z,theta)? print(ret2) # True scope:mixture_model ret=is z discrete random variable? print(ret) # True $$$p(x \mid \theta) = \sum_z p(x, z \mid \theta) = \sum_z q(z) p(x\mid z, \theta)$$$ Your goal in EM algorithm is to find 1. latent random variable distribution $$$q(z)$$$ 2. 
================================================================================
General EM algorithm
* The EM algorithm infers the probability distribution p(x) when there is a "random variable x" which depends on a latent random variable z, and when the latent random variable z is not observable but the random variable x is observable.

output_from_Z=random_variable_Z(result_from_trial)
ret=is output_from_Z observable
print(ret)
# False

output_from_X=random_variable_X(result_from_trial,output_from_Z)
ret=is output_from_X observable
print(ret)
# True

estimated_p(X)=EM_algorithm_to_infer_p(X)(output_from_X)

scope:network_model
    ret=conditional_probability_distribution_p(X|Z,theta)
    ret1=is ret determined by parameter theta?
    print(ret1)
    # True
    ret2=do you already know conditional_probability_distribution_p(X|Z,theta)?
    print(ret2)
    # True

scope:mixture_model
    ret=is z a discrete random variable?
    print(ret)
    # True

$$$p(x \mid \theta) = \sum_z p(x, z \mid \theta) = \sum_z q(z) p(x\mid z, \theta)$$$

Your goal in the EM algorithm is to find, for the given data x:
1. the latent random variable distribution $$$q(z)$$$
2. the parameter $$$\theta$$$ which maximizes the likelihood $$$p(x\mid\theta)$$$
================================================================================
* First, you need to prove the following equation:
$$$\log p(x) = \sum\limits_z q(z) \log \left(\dfrac{p(x, z \mid \theta)}{q(z)}\right) - \sum\limits_z q(z) \log \left(\dfrac{p(z\mid x, \theta)}{q(z)}\right)$$$
================================================================================
* Proof
$$$\sum\limits_z q(z) \log \left(\dfrac{p(x, z \mid \theta)}{q(z)}\right) \\ =\sum\limits_z q(z) \log \left(\dfrac{p(z \mid x, \theta)p(x \mid \theta)}{q(z)}\right) \\ =\sum\limits_z \left( q(z) \log \left(\dfrac{p(z \mid x, \theta)}{q(z)}\right) + q(z) \log p(x \mid \theta) \right) \\ =\sum\limits_z q(z) \log \left(\dfrac{p(z \mid x, \theta)}{q(z)}\right) + \sum\limits_z q(z) \log p(x \mid \theta) \\ =\sum\limits_z q(z) \log \left(\dfrac{p(z \mid x, \theta)}{q(z)}\right) + \log p(x \mid \theta)$$$
================================================================================
* From the proven equation
$$$\log p(x) = \sum\limits_z q(z) \log \left(\dfrac{p(x, z \mid \theta)}{q(z)}\right) - \sum\limits_z q(z) \log \left(\dfrac{p(z\mid x, \theta)}{q(z)}\right)$$$
* you can define $$$L(q,\theta)$$$ and $$$KL(q \mid p)$$$:
$$$L(q, \theta) = \sum_z q(z) \log \left(\dfrac{p(x, z \mid \theta)}{q(z)}\right)$$$
$$$KL(q \mid p) = -\sum_z q(z) \log \left(\dfrac{p(z\mid x, \theta)}{q(z)}\right)$$$
$$$\log p(x) = L(q, \theta) + KL(q \mid p)$$$
================================================================================
* $$$L(q,\theta)$$$ is a "functional": it outputs a "number" when it is given the distribution function $$$q(z)$$$ (and the parameter $$$\theta$$$).
================================================================================
* $$$KL(q \mid p)$$$ is the Kullback-Leibler divergence, which measures the difference between the "probability distribution function $$$q(z)$$$" and the "probability distribution function $$$p(z|x,\theta)$$$".
* KL divergence $$$\ge$$$ 0
* Therefore, $$$L(q,\theta)$$$ is a lower bound of $$$\log{p(x)}$$$.
* So L is called the ELBO (evidence lower bound):
$$$\log{p(x)} \ge L(q,\theta)$$$
* Conversely, $$$\log{p(x)}$$$ is an upper bound of $$$L(q,\theta)$$$.
================================================================================
* If you maximize $$$L(q,\theta)$$$, $$$\log{p(x)}$$$ is (indirectly) maximized.
* The EM algorithm finds the $$$q$$$ and $$$\theta$$$ values in turn, by an iterative method, to maximize $$$L(q,\theta)$$$.
================================================================================
* EM algorithm
(1) E step
* You fix $$$\theta$$$ to $$$\theta_{\text{old}}$$$.
* You find the $$$q_{\text{new}}$$$ which maximizes $$$L(q,\theta_{\text{old}})$$$.
* If you find the correct $$$q_{\text{new}}$$$, $$$L(q_{\text{new}},\theta_{\text{old}})$$$ becomes equal to its upper bound $$$\log{p(x)}$$$.
* That is, $$$KL(q_{\text{new}}\mid p)$$$ becomes 0:
$$$L(q_{\text{new}},\theta_{\text{old}})=\log{p(x)}$$$
$$$KL(q_{\text{new}}\mid p)=0$$$
$$$q_{\text{new}}=p(z|x,\theta_{\text{old}})$$$
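* The decomposition $$$\log p(x) = L(q,\theta) + KL(q \mid p)$$$ and the E-step claim above can be checked numerically. The following sketch uses a two-component 1-D Gaussian mixture with a single observation; all parameter values and the initial q(z) are arbitrary illustrative choices.

import numpy as np
from scipy.stats import norm

x=1.5                                   # one observed data point
pi=np.array([0.4,0.6])                  # p(z=k)
mu=np.array([-1.0,2.0])                 # component means (theta)
px_given_z=norm.pdf(x,loc=mu,scale=1.0) # p(x|z=k,theta)
pxz=pi*px_given_z                       # p(x,z=k|theta)
log_px=np.log(pxz.sum())                # log p(x|theta)

def elbo_and_kl(q):
    post=pxz/pxz.sum()                  # p(z|x,theta)
    L=np.sum(q*np.log(pxz/q))           # ELBO L(q,theta)
    KL=-np.sum(q*np.log(post/q))        # KL(q | p(z|x,theta))
    return L,KL

q=np.array([0.5,0.5])                   # arbitrary q(z)
L,KL=elbo_and_kl(q)
print(np.isclose(log_px,L+KL))          # True: log p(x) = L + KL

# E step: choose q(z) = p(z|x,theta_old); then KL = 0 and L = log p(x)
q_new=pxz/pxz.sum()
L_new,KL_new=elbo_and_kl(q_new)
print(np.isclose(KL_new,0.0),np.isclose(L_new,log_px))  # True True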
================================================================================
(2) M step
* You fix $$$q$$$ to the current function $$$q_{\text{new}}$$$.
* You find the $$$\theta_{\text{new}}$$$ which maximizes $$$L(q_{\text{new}},\theta)$$$.
* Since you performed a maximization, $$$L(q_{\text{new}},\theta_{\text{new}})$$$ becomes larger than the previous value:
$$$L(q_{\text{new}},\theta_{\text{new}}) > L(q_{\text{new}},\theta_{\text{old}})$$$
* And since $$$p(z|x,\theta_{\text{new}})$$$ is different from the previous value $$$p(z|x,\theta_{\text{old}})$$$, $$$q_{\text{new}}$$$ is no longer equal to $$$p(z|x,\theta_{\text{new}})$$$.
* Then the KL divergence becomes larger than 0:
$$$q_{\text{new}} \ne p(z|x,\theta_{\text{new}})$$$
$$$KL(q_{\text{new}}\mid p) > 0$$$
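* Putting the E step and the M step together, a short chain of inequalities (using only the quantities defined above) makes explicit why one full EM iteration never decreases the log-likelihood:
$$$\log p(x\mid\theta_{\text{new}}) = L(q_{\text{new}},\theta_{\text{new}}) + KL(q_{\text{new}}\mid p(z\mid x,\theta_{\text{new}})) \ge L(q_{\text{new}},\theta_{\text{new}}) \ge L(q_{\text{new}},\theta_{\text{old}}) = \log p(x\mid\theta_{\text{old}})$$$
* The first inequality holds because the KL divergence is $$$\ge 0$$$, the second because $$$\theta_{\text{new}}$$$ maximizes $$$L(q_{\text{new}},\theta)$$$, and the final equality is the result of the E step ($$$KL(q_{\text{new}}\mid p(z\mid x,\theta_{\text{old}}))=0$$$).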