Note from the original reference: https://datascienceschool.net/view-notebook/cba6d8e7667646de9c27e8f9d75f040c/

* Suppose 3 random variables A, B, C
* Each random variable can take a value in $$$\{0,1,2\}$$$
* Joint probability distribution of A, B, C

A    B    C    P(A,B,C)
0    0    0    P(A=0,B=0,C=0)
0    0    1    P(A=0,B=0,C=1)
0    0    2    P(A=0,B=0,C=2)
...
2    2    1    P(A=2,B=2,C=1)
2    2    2    P(A=2,B=2,C=2)

* The number of parameters of the joint probability distribution of A, B, C is $$$3^3-1=26$$$, which means you need 26 storage slots for these parameters
================================================================================
In the real world, the case where only several specific random variables affect each other is more common than the case where all random variables affect each other
* Graphical probability model: you express the relationships among only those several random variables, out of all the random variables, by using a graph structure
================================================================================
* Bayesian network model $$$\subset$$$ graphical probability model
* Bayesian network model
1. The cause-result relationship is clear
2. So, you can use an "arrow" to express each relationship
================================================================================
This is also called a "directed graph" or "Bayesian network model"
Circle (node, vertex): random variable
Arrow (edge, link): relationship
================================================================================
"directed acyclic graph"
"directed cyclic graph"
================================================================================
* Cause -> Result
* Conditional probability: $$$P(B|A)$$$
================================================================================
* Joint probability as a multiplication of conditional probabilities
$$$P(A,B,C) = P(A)\times P(B|A)\times P(C|B)$$$
================================================================================
* Caution
* There can be "no direct causal relationship" between A (like health status) and B (like test score)
* There can be a student who has good health but a low test score
* But there can be a "correlation relationship" between A (like health status) and B (like test score)
* Generally, when health is good, the test score is good
================================================================================
* Factors which make up the joint probability $$$P(A,B,C) = P(A)\times P(B|A)\times P(C|B)$$$

* Factor on A

Event A    P(A), probability of event A occurring
A=0        P(A=0)
A=1        P(A=1)
A=2        P(A=2)

* Factor on B

Event B    P(B|A=0)      P(B|A=1)      P(B|A=2)
B=0        P(B=0|A=0)    P(B=0|A=1)    P(B=0|A=2)
B=1        P(B=1|A=0)    P(B=1|A=1)    P(B=1|A=2)
B=2        P(B=2|A=0)    P(B=2|A=1)    P(B=2|A=2)

* Factor on C

Event C    P(C|B=0)      P(C|B=1)      P(C|B=2)
C=0        P(C=0|B=0)    P(C=0|B=1)    P(C=0|B=2)
C=1        P(C=1|B=0)    P(C=1|B=1)    P(C=1|B=2)
C=2        P(C=2|B=0)    P(C=2|B=1)    P(C=2|B=2)

* To specify the above joint probabilistic model, you need to know 14 parameters
* For $$$P(A)$$$: $$$3-1=2$$$
* For $$$P(B|A)$$$: $$$(3-1)\times 3=6$$$ (2 free parameters for each of the 3 values of A)
* For $$$P(C|B)$$$: $$$(3-1)\times 3=6$$$
* Originally, you needed 26 parameters
* But by adding the "information about the relationships between random variables", the number of parameters you need to specify the probabilistic model is reduced (a counting check follows below)
================================================================================
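As a sanity check on the counts above, here is a minimal Python sketch, with illustrative function names, that counts the free parameters of the full joint table versus the chain factorization $$$P(A)\times P(B|A)\times P(C|B)$$$:

```python
# Minimal sketch: count free parameters of a discrete joint distribution
# versus its chain factorization P(X1) * P(X2|X1) * ... * P(Xn|Xn-1).
# Function names are illustrative, not from the original note.

def full_joint_params(k, n_vars):
    # A full joint table over n_vars k-ary variables has k**n_vars entries,
    # minus 1 because all entries must sum to 1.
    return k**n_vars - 1

def chain_params(k, n_vars):
    # P(X1) needs k-1 parameters; each P(X_i | X_{i-1}) is a k x k table
    # whose columns each sum to 1, so it needs (k-1)*k parameters.
    return (k - 1) + (n_vars - 1) * (k - 1) * k

print(full_joint_params(3, 3))  # 26 parameters for the full joint of A,B,C
print(chain_params(3, 3))       # 14 parameters for P(A)P(B|A)P(C|B)
```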
Joint probability distribution of a Bayesian network
* How to create a "Bayesian network"
1. Create nodes for the random variables which you want to inspect
2. Create nodes which have a causal relationship to the "above nodes"
3. Draw "arrows"
================================================================================
* Once you create the Bayesian network, the joint probability distribution of these random variables can be written as:
$$$P(X_1,\cdots,X_N) \\ = P(X_1|Pa(X_1)) \times P(X_2|Pa(X_2)) \times \cdots \times P(X_N|Pa(X_N)) \\ = \prod\limits_{i=1}^N P(X_i|Pa(X_i))$$$
$$$Pa(X_i)$$$: the parent nodes of node $$$X_i$$$, the causes
$$$X_i$$$: child node, the result
================================================================================
* Example of a Bayesian network
Joint probability distribution of the random variables $$$(X_1,X_2,X_3,X_4,X_5,X_6,X_7)$$$
$$$P(X_1, X_2, X_3, X_4, X_5, X_6, X_7) \\ = P(X_1) P(X_2) P(X_3 | X_1) P(X_4| X_2, X_3) P(X_5|X_4) P(X_6|X_4) P(X_7|X_2)$$$
================================================================================
The important point when you create a Bayesian network is that the "conditional independence relationships" between the "random variables" should show in the graph
================================================================================
* "Conditional independence" is defined with respect to a random variable used as the condition
================================================================================
* "Independence" between random variables A and B
$$$P(A,B)=P(A)\times P(B)$$$
$$$A\perp B|\phi$$$
================================================================================
* "Conditional independence"
$$$P(A,B|C)=P(A|C)\times P(B|C)$$$
* C: the random variable used as the condition
* When C is given, A and B are independent
* $$$A\perp B|C$$$
================================================================================
* Directional separation (d-separation) is the way you can inspect whether 2 random variables are "conditionally independent" or not
* To use this, you should know the following 3 relationships (a numerical check of conditional independence follows this list)
1. Tail-tail binding
2. Tail-head binding
3. Head-head binding
================================================================================
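Before going through the three cases, here is a minimal Python sketch, with made-up probability values, that checks the conditional-independence definition $$$P(A,B|C)=P(A|C)\times P(B|C)$$$ numerically on a tail-head chain A -> C -> B:

```python
# Minimal sketch: numerically verify A ⟂ B | C on a tail-head chain
# A -> C -> B with binary variables. All probability values are made up.
import itertools

p_a = {0: 0.6, 1: 0.4}                                     # P(A)
p_c_given_a = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.2, 1: 0.8}}   # P(C|A)
p_b_given_c = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.5, 1: 0.5}}   # P(B|C)

# Joint distribution from the Bayesian-network factorization P(A)P(C|A)P(B|C)
joint = {(a, c, b): p_a[a] * p_c_given_a[a][c] * p_b_given_c[c][b]
         for a, c, b in itertools.product([0, 1], repeat=3)}

def marginal(keep):
    # Sum the joint over the positions not listed in `keep` (0=A, 1=C, 2=B)
    out = {}
    for assignment, p in joint.items():
        key = tuple(assignment[i] for i in keep)
        out[key] = out.get(key, 0.0) + p
    return out

p_c = marginal([1])
p_ac = marginal([0, 1])
p_cb = marginal([1, 2])

for a, c, b in itertools.product([0, 1], repeat=3):
    lhs = joint[(a, c, b)] / p_c[(c,)]                               # P(A,B|C)
    rhs = (p_ac[(a, c)] / p_c[(c,)]) * (p_cb[(c, b)] / p_c[(c,)])    # P(A|C)P(B|C)
    assert abs(lhs - rhs) < 1e-12
print("A and B are conditionally independent given C")
```

The same helper can be reused for the tail-tail and head-head cases below by swapping in the corresponding factorization of the joint.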
* Tail-tail binding
* C is a tail-tail binding
* A and B are not independent
* A and B are conditionally independent wrt C
$$$P(A,B|C)\\ = \dfrac{P(A, B, C)}{P(C)}\\ = \dfrac{P(A|C)P(B|C)P(C)}{P(C)}\\ = P(A|C)P(B|C)$$$
You can describe this status as "C blocks the path between A and B"
================================================================================
* Tail-head binding
* A and B have a causal relationship
* Between A and B, C is inserted
* There is a meeting of a tail and a head at C
* A and B are not independent
* A and B are conditionally independent wrt C
$$$P(A,B|C) \\ = \dfrac{P(A, B, C)}{P(C)}\\ = \dfrac{P(A)P(C|A)P(B|C)}{P(C)}\\ = \dfrac{P(A,C)P(B|C)}{P(C)}\\ = P(A|C)P(B|C)$$$
You can describe this status as "C blocks the path between A and B"
================================================================================
Head-head binding (V structure)
* Parent nodes of C: A and B
* A and B are independent
$$$P(A,B,C) = P(A)P(B)P(C|A,B)$$$
$$$P(A,B) = \sum_c P(A)P(B)P(C|A,B) = P(A)P(B)$$$
* But A and B are NOT conditionally independent given C
For example, suppose A: overslept, B: traffic jam, C: lateness
Oversleeping and a traffic jam are independent
* When C is given, A and B have a negative correlation relationship
That is, when lateness occurs, if you didn't oversleep, the probability that a traffic jam occurred increases
* This situation is called "explaining away"
================================================================================
Head-head binding which has a further descendant
* It has the same characteristics as head-head binding: conditioning on a descendant of C also unblocks the path between A and B
================================================================================
D-separation
* If A and B are conditionally independent wrt C, the following should be satisfied:
* C is a "tail-tail binding" or "tail-head binding" between A and B (C blocks A and B)
* C shouldn't be a "head-head binding" (nor a descendant of one) between A and B
================================================================================
* Bayesian network model $$$\subset$$$ graphical probability model
* Markov network $$$\subset$$$ graphical probability model
* A Markov network is an "undirected graph"
* In the example, there are 9 random variables
* Any 2 adjacent random variables have a "relationship"
================================================================================
* A "Markov network" is composed of "cliques"
* A "clique" is composed of "random variables"
* The "distribution of the random variables" in a clique is represented by a "potential function" or "factor"
* A "factor" is a function which is proportional to the "joint probability distribution" up to a "positive constant"
* There is no requirement (in a Markov network) that the values of a factor sum to 1
================================================================================
The "joint probability distribution" of a "Markov network" is represented by the multiplication of the "factors of all cliques"
$$$P(X) \\ = \dfrac{1}{Z} \{\psi_1(X_1) \times \psi_2(X_2) \times \cdots \times \psi_C(X_C)\} \\ = \dfrac{1}{Z} \prod\limits_{\{C\}}\psi_C(X_C)$$$
$$$C$$$: one clique
$$$X_C$$$: the random variables in each clique
$$$\psi_C$$$: the factor of clique C
$$$\{C\}$$$: the set of all cliques
$$$Z$$$: partition function, the normalizing constant that makes the probabilities sum to 1
================================================================================
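Here is a minimal Python sketch of this formula on a tiny Markov network: three binary variables in a chain X1 - X2 - X3 with two pairwise cliques. The factor values are arbitrary positive numbers chosen for illustration.

```python
# Minimal sketch: joint distribution of a tiny Markov network with
# three binary variables X1 - X2 - X3 and two pairwise cliques.
import itertools

def psi_12(x1, x2):
    return 2.0 if x1 == x2 else 0.5   # clique factor psi(X1, X2), made up

def psi_23(x2, x3):
    return 3.0 if x2 == x3 else 1.0   # clique factor psi(X2, X3), made up

# Unnormalized measure: product of all clique factors
unnorm = {(x1, x2, x3): psi_12(x1, x2) * psi_23(x2, x3)
          for x1, x2, x3 in itertools.product([0, 1], repeat=3)}

# Partition function Z: sum of the unnormalized measure over all configurations
Z = sum(unnorm.values())

# Normalized joint distribution P(X) = (1/Z) * prod_C psi_C(X_C)
P = {x: v / Z for x, v in unnorm.items()}
print(sum(P.values()))  # 1.0 up to floating-point error
```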
* Suppose you have a 3x3 image
A 3x3 image has 9 random variables (one per pixel)
* The joint probability distribution of the above 9 random variables can be represented by using a Markov network with one pairwise clique per pair of adjacent pixels
$$$P(X_{11}, \ldots, X_{33}) = \dfrac{1}{Z} \psi(X_{11}, X_{12}) \psi(X_{11}, X_{21}) \psi(X_{12}, X_{13}) \cdots \psi(X_{23}, X_{33}) \psi(X_{32}, X_{33})$$$
================================================================================
* Factor functions
$$$\psi(X) = \exp(-E(X))$$$
* $$$E(X)$$$: energy function
* The higher the probability, the lower the value of the energy function
================================================================================
* Bernoulli random variables $$$X_1, X_2$$$, which can take the value 0 or 1
* Let's express $$$X_1, X_2$$$ by using an energy function
$$$E(X_1, X_2) = -3(2X_1 - 1)(2X_2 - 1)$$$
* Let's calculate the values of the factor
$$$\psi(X_1=1,X_2=1)=e^{3}$$$
$$$\psi(X_1=0,X_2=0)=e^{3}$$$
$$$\psi(X_1=1,X_2=0)=e^{-3}$$$
$$$\psi(X_1=0,X_2=1)=e^{-3}$$$
* The probability that $$$X_1$$$ and $$$X_2$$$ have the same value > the probability that $$$X_1$$$ and $$$X_2$$$ have different values
* That is, $$$X_1$$$ and $$$X_2$$$ have a positive correlation relationship
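The factor values above can be checked directly; here is a short Python sketch that evaluates the energy function and the resulting normalized probabilities:

```python
# Minimal sketch: factor values and probabilities for the Bernoulli
# energy-function example E(X1, X2) = -3(2*X1 - 1)(2*X2 - 1).
import math
import itertools

def energy(x1, x2):
    return -3 * (2 * x1 - 1) * (2 * x2 - 1)

def psi(x1, x2):
    return math.exp(-energy(x1, x2))   # psi(X) = exp(-E(X))

Z = sum(psi(x1, x2) for x1, x2 in itertools.product([0, 1], repeat=2))

for x1, x2 in itertools.product([0, 1], repeat=2):
    print(f"psi(X1={x1}, X2={x2}) = e^{-energy(x1, x2)} "
          f"-> P = {psi(x1, x2) / Z:.4f}")
# Equal values (0,0) and (1,1) each get probability ~0.4988;
# unequal values each get ~0.0012, so X1 and X2 are positively correlated.
```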