This is a personal study note.
Copyright and original reference: https://www.youtube.com/watch?v=EPlrYy9KuNU&list=PLbhbGI_ppZISMV4tAWHlytBqNq1-lb8bz&index=58
================================================================================
The following algorithms are important for HMM:
- Forward probability
- Backward probability
================================================================================
$$$\pi$$$: parameter which determines the "first latent factor $$$z_1$$$"
$$$a$$$: transition probability between "states"
$$$b$$$: emission probability of an "observation" from a "state"
$$$X$$$: observations, the dataset generated according to the emission probability
================================================================================
Q1. Evaluation question
- You have $$$\pi,a,b,X$$$
- You can build the HMM structure
- From the HMM, observation X is generated
- How likely is that observation? (evaluate the observation)
================================================================================
Q2. Decoding question
- You have $$$\pi,a,b,X$$$
- That is, you have observation X and HMM structure M
- Find the sequence Z of the most likely latent factors z
- Suppose you have data (an observation)
- Which latent factor z (or sequence of z) explains the observation best?
================================================================================
Q3. Learning question
- You only have X (observations, like a professor's facial expressions)
- You don't know the parameters $$$\pi, a, b$$$
- Find the parameters $$$\pi, a, b$$$ which maximize the probability of the given data X
- That is, infer (learn) the parameters
================================================================================
In GMM, you also had only X (data),
and you inferred the "mixture components" and the "probability of selecting each mixture".
That is the "learning question".
================================================================================
In HMM, Z is invisible, so Z should be marginalized out.
Therefore, as in the GMM case, you should use the "EM algorithm" for HMM as well.
================================================================================
Decoding question as supervised learning
- All parameters are given (like a true, labeled dataset)
- Given the X dataset, find the most likely latent sequence Z
Decoding question as unsupervised learning
- like dynamic clustering
================================================================================
When you have "training data" (X and model M, with the latent sequence Z observed),
how can you find the parameters $$$\pi,a,b$$$?
---> Essentially, it's a counting method.
================================================================================
L: loaded die (an unfair die)
F: fair die
LLLLLFFF...: "latent factors z" which only the casino dealer knows in his mind
1234253...: "observations x" which you can see
dataset=create_dataset(above_information)
perform_supervised_learning(dataset)
================================================================================
$$$P(\text{L} \rightarrow \text{L}) = P(z_{t+1}=\text{L} | z_t = \text{L}) = a$$$
Transition probability: state L at time t ---> state L at time t+1
================================================================================
$$$P(x_{t}=1 | z_t = \text{L}) = b$$$
Emission probability: observation 1 at time t, given that the state is L at time t
= emit_observation(state is L at time t)
================================================================================
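As a rough sketch of the two pseudocode calls above, the Python below samples a (Z, X) sequence from assumed loaded/fair-die parameters (L→L = 0.7, F→F = 0.5, loaded-die emissions 1/10 for faces 1-5 and 5/10 for 6, initial probability 1/2 each; the same numbers used for the examples later in this note) and then recovers a and b by simple counting. The names create_dataset and perform_supervised_learning are just the pseudocode names above, not library functions.

import numpy as np

# Assumed casino-dice HMM parameters (the numbers used later in this note)
pi = np.array([0.5, 0.5])                       # initial state probability (index 0 = L, 1 = F)
A = np.array([[0.7, 0.3],                       # transition probability a: L stays L with 0.7
              [0.5, 0.5]])                      #                           F stays F with 0.5
B = np.array([[0.1, 0.1, 0.1, 0.1, 0.1, 0.5],   # emission probability b for L (faces 1..6)
              [1/6, 1/6, 1/6, 1/6, 1/6, 1/6]])  # emission probability b for F

def create_dataset(pi, A, B, T=10000, seed=0):
    # Sample latent states Z (only the dealer knows them) and observations X
    rng = np.random.default_rng(seed)
    Z = np.empty(T, dtype=int)
    X = np.empty(T, dtype=int)
    Z[0] = rng.choice(2, p=pi)
    for t in range(T):
        if t > 0:
            Z[t] = rng.choice(2, p=A[Z[t - 1]])
        X[t] = rng.choice(6, p=B[Z[t]])         # face index 0..5 stands for die face 1..6
    return Z, X

def perform_supervised_learning(Z, X):
    # Counting method: estimate a and b from a fully observed (Z, X) sequence
    A_hat = np.zeros((2, 2))
    B_hat = np.zeros((2, 6))
    for t in range(len(Z) - 1):
        A_hat[Z[t], Z[t + 1]] += 1              # count state transitions
    for t in range(len(Z)):
        B_hat[Z[t], X[t]] += 1                  # count emissions per state
    A_hat /= A_hat.sum(axis=1, keepdims=True)   # normalize counts into probabilities
    B_hat /= B_hat.sum(axis=1, keepdims=True)
    return A_hat, B_hat

Z, X = create_dataset(pi, A, B)
A_hat, B_hat = perform_supervised_learning(Z, X)
print(np.round(A_hat, 2))                       # close to the true transition matrix A
print(np.round(B_hat, 2))                       # close to the true emission matrix B
================================================================================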
How to calculate the probabilities a and b?
Use MLE or MAP, by counting events.
Then, you can find $$$\pi,a,b$$$ when you have X (and Z) and M.
================================================================================
What you saw:
- When the dataset (X) and model M are given
- You can calculate the parameters $$$\pi, a, b$$$
================================================================================
HMM can be viewed as a "Bayesian network"
================================================================================
Suppose you have X (observation) and Z (latent factor or state):
pi, a, b = calculate_parameters(method=[MLE or MAP], observation=X, latent_factor=Z)
================================================================================
* If you have the "full joint probability" $$$P(X,Z)$$$, you can use the "Bayesian network" in a useful way
================================================================================
* In a Bayesian network, you can use "factorization" based on the "Bayesian network's topology"
================================================================================
* Factorize $$$P(X,Z)$$$
$$$P(X,Z) \\
= P(x_1,\cdots,x_t,z_1,\cdots,z_t) \\
= P(z_1)\times P(x_1|z_1) \times P(z_2|z_1) \times P(x_2|z_2) \times P(z_3|z_2) \times P(x_3|z_3) \times \cdots$$$
$$$P(x_1|z_1)$$$: emission probability of observation $$$x_1$$$ from latent factor $$$z_1$$$
$$$P(z_2|z_1)$$$: transition probability from current latent factor $$$z_1$$$ to next latent factor $$$z_2$$$
================================================================================
Meaning: $$$P(X,Z)$$$ is nicely factorized into "emission probabilities" and "transition probabilities"
================================================================================
You can write the above factorization using the parameters $$$\pi,a,b$$$:
$$$= \pi_{idx(z_1=1)} \times b_{idx(z_1=1),idx(x_1=1)} \times a_{idx(z_1=1),idx(z_2=1)} \times \cdots$$$
================================================================================
Suppose you know the parameters $$$\pi,a,b$$$:
- When using L, the probability of rolling each of 1,2,3,4,5 is $$$\dfrac{1}{10}$$$
- When using L, the probability of rolling 6 is $$$\dfrac{5}{10}$$$
- $$$L\rightarrow L$$$: 70%, i.e., after using L the casino dealer uses L again 70% of the time
================================================================================
Example of calculating a "joint probability"
$$$P(166,LLL)$$$: how likely is this situation?
$$$166$$$: observation X
$$$LLL$$$: Z
$$$= \frac{1}{2} \times \frac{1}{10} \times \frac{7}{10} \times \frac{5}{10} \times \frac{7}{10} \times \frac{5}{10}$$$
$$$\frac{1}{2}$$$: $$$\pi_L$$$, initial state probability (just assumed)
$$$\frac{1}{10}$$$: probability of rolling "1" with L
$$$\frac{7}{10}$$$: keep using L after L
$$$\frac{5}{10}$$$: probability of rolling "6" with L
$$$= 0.0061$$$
================================================================================
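To check the hand calculation above (and the $$$P(166,FFF)$$$ one in the next section), here is a minimal sketch that just multiplies the $$$\pi$$$, b, a factors along a given state sequence, assuming the same parameters ($$$\pi_L=\pi_F=\frac{1}{2}$$$, L→L = 0.7, F→F = 0.5).

pi = {"L": 0.5, "F": 0.5}                       # assumed initial state probabilities
a = {("L", "L"): 0.7, ("L", "F"): 0.3,          # transition probabilities
     ("F", "F"): 0.5, ("F", "L"): 0.5}
b = {"L": [0.1, 0.1, 0.1, 0.1, 0.1, 0.5],       # emission probabilities for faces 1..6
     "F": [1/6] * 6}

def joint_prob(X, Z):
    # P(X,Z) = pi_{z1} * b_{z1,x1} * a_{z1,z2} * b_{z2,x2} * a_{z2,z3} * b_{z3,x3} * ...
    p = pi[Z[0]] * b[Z[0]][X[0] - 1]
    for t in range(1, len(X)):
        p *= a[(Z[t - 1], Z[t])] * b[Z[t]][X[t] - 1]
    return p

print(joint_prob([1, 6, 6], list("LLL")))       # 0.006125    (~0.0061)
print(joint_prob([1, 6, 6], list("FFF")))       # 0.00057870  (~5.787e-04)
================================================================================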
$$$P(166,FFF)$$$: how likely is this situation?
$$$166$$$: observation X
$$$FFF$$$: Z
$$$= \frac{1}{2} \times \frac{1}{6} \times \frac{1}{2} \times \frac{1}{6} \times \frac{1}{2} \times \frac{1}{6}$$$
$$$\frac{1}{2}$$$ (first factor): initial state probability (just assumed)
$$$\frac{1}{6}$$$: probability of rolling "1" (or "6") with F
$$$\frac{1}{2}$$$: keep using F after F
$$$= 5.787 \times 10^{-4}$$$: very low probability
================================================================================
Conclusion: the situation $$$(166,LLL)$$$ is more likely than $$$(166,FFF)$$$
================================================================================
But the number of possible latent scenarios is $$$2^3 = 8$$$ (2 states at each of the 3 time steps),
and you evaluated only 2 of them.
Evaluating all cases takes too much time, so you need another method.
================================================================================
- Now the question is: given the "observation X", infer the most likely latent sequence Z
- This is the decoding question
================================================================================
Ultimately, in the real case, you only know the observation data X; you don't know the latent factors Z
================================================================================
In GMM: marginalize_out.wrt(Z)
You marginalize out $$$Z$$$:
$$$P(X|\theta) = \sum\limits_{Z} P(X,Z|\theta)$$$
================================================================================
In HMM: marginalize_out.wrt(Z)
In HMM, you can also marginalize out $$$Z$$$:
$$$P(X|\pi,a,b) = \sum\limits_{Z} P(X,Z|\pi,a,b)$$$
================================================================================
$$$P(X) = \sum\limits_{Z} P(X,Z)$$$
# To remove Z: perform_summation(P(X,Z)).wrt(Z)
- Z is a set of multiple random variables (multiple latent factors) $$$= \{z_1,z_2,\cdots,z_T\}$$$
================================================================================
Full joint distribution form:
$$$= \sum\limits_{z_1} \cdots \sum\limits_{z_T} P(x_1,\cdots,x_T,z_1,\cdots,z_T)$$$
Perform factorization:
$$$= \sum\limits_{z_1} \cdots \sum\limits_{z_T} \pi_{z_1} \prod_{t=2}^{T} a_{z_{t-1},z_t} \prod_{t=1}^{T} b_{z_t,x_t}$$$
================================================================================
Such a factorization is possible even without any constraint (chain rule):
$$$P(A,B,C) = P(A)P(B|A)P(C|A,B)$$$
================================================================================
By using the above "factorization", this complex form can be changed into a simple recursive form.
================================================================================
First, consider a "partial joint probability" rather than the "full joint probability":
$$$P(x_1,\cdots,x_t,z_t^k=1)$$$
- $$$x_1,\cdots,x_t$$$: all observations up to time t are in the joint distribution
- $$$z_t^k=1$$$: only the latent factor at time t is in the joint distribution
The remaining latent factors are marginalized out.
================================================================================
From the above expression, introduce just one more latent factor, $$$z_{t-1}$$$, via the sum $$$\sum\limits_{z_{t-1}}$$$:
$$$P(x_1,\cdots,x_t,z_t^k=1) = \sum\limits_{z_{t-1}} P(x_1,\cdots,x_{t-1},x_t,z_{t-1},z_t^k=1)$$$
================================================================================
Apply the above "factorization" to it.
================================================================================
Conclusion
The complicated form $$$P(x_1,\cdots,x_t,z_t^k=1)$$$ becomes the simple recursive form
$$$P(x_1,\cdots,x_t,z_t^k=1) = \alpha_t^k = b_{k,x_t} \sum\limits_{i} \alpha_{t-1}^i a_{i,k}$$$
================================================================================
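Finally, a minimal sketch of the forward algorithm that this recursion defines, again assuming the casino parameters used above: $$$\alpha_1^k = \pi_k b_{k,x_1}$$$, then $$$\alpha_t^k = b_{k,x_t} \sum_i \alpha_{t-1}^i a_{i,k}$$$, and $$$P(X) = \sum_k \alpha_T^k$$$. A brute-force sum over all $$$2^T$$$ latent sequences is included only to confirm that the recursion gives the same answer.

import itertools
import numpy as np

pi = np.array([0.5, 0.5])                       # initial state probability (index 0 = L, 1 = F)
a = np.array([[0.7, 0.3],                       # transition probabilities a_{i,k}
              [0.5, 0.5]])
b = np.array([[0.1, 0.1, 0.1, 0.1, 0.1, 0.5],   # emission probabilities b_{k,x} for faces 1..6
              [1/6, 1/6, 1/6, 1/6, 1/6, 1/6]])

def forward(X):
    # alpha[t] stores alpha_{t+1}^k = P(x_1..x_{t+1}, z_{t+1}=k); returns P(X) = sum_k alpha_T^k
    T, K = len(X), len(pi)
    alpha = np.zeros((T, K))
    alpha[0] = pi * b[:, X[0] - 1]                      # alpha_1^k = pi_k * b_{k,x_1}
    for t in range(1, T):
        alpha[t] = b[:, X[t] - 1] * (alpha[t - 1] @ a)  # alpha_t^k = b_{k,x_t} * sum_i alpha_{t-1}^i * a_{i,k}
    return alpha[-1].sum()

def brute_force(X):
    # P(X) = sum over all 2^T latent sequences Z of P(X, Z)
    T, total = len(X), 0.0
    for Z in itertools.product(range(2), repeat=T):
        p = pi[Z[0]] * b[Z[0], X[0] - 1]
        for t in range(1, T):
            p *= a[Z[t - 1], Z[t]] * b[Z[t], X[t] - 1]
        total += p
    return total

print(forward([1, 6, 6]))                       # ~0.0185
print(brute_force([1, 6, 6]))                   # same value, but costs O(2^T) instead of O(T * K^2)
================================================================================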