025-001. Bernoulli distribution
@
If a trial has only two possible outcomes, success or failure, it is called a Bernoulli trial
A coin toss is an example of a Bernoulli trial
@
To represent the result of a Bernoulli trial with a random variable X, success is denoted by X=1 and failure by X=0
Sometimes failure is denoted by X=-1 instead
@
Since a Bernoulli random variable takes only the values 0 or 1, it is a discrete random variable
Therefore, the Bernoulli distribution can be defined by a probability mass function (pmf)
@
The probability mass function of the Bernoulli distribution is defined as follows
$Bern(x;\theta) = \theta$ if x = 1
$Bern(x;\theta) = 1 - \theta$ if x = 0
The Bernoulli distribution has one parameter $\theta$, which is the probability that 1 occurs
The independent variable x and the parameter $\theta$ are separated by ";"
The probability that 0 occurs is $1-\theta$
@
The two cases above can be written as a single expression
$Bern(x;\theta) = \theta^{x} (1 - \theta)^{(1-x)}$
@
Practice 1
1. In the single-expression formula above, substitute x=1 and x=0, and check that you recover the pmf of the Bernoulli distribution
@
If the Bernoulli random variable takes the values 1 or -1, the Bernoulli distribution should be written as follows
$Bern(x;\theta) = \theta^{\frac{(1+x)}{2}} (1-\theta)^{\frac{(1-x)}{2}}$
@
If a random variable X is generated by a Bernoulli distribution, we say that X follows the Bernoulli distribution
We write this as follows
$X\sim Bern(x;\theta)$
@
Simulating the Bernoulli distribution with SciPy
We can use the bernoulli class in the stats subpackage of SciPy for the Bernoulli distribution
The argument p sets the parameter $\theta$
% Imports used by the code in this section
import numpy as np
import scipy as sp
import scipy.stats
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
% Set p = 0.6
theta = 0.6
rv = sp.stats.bernoulli(theta)
@
% You can calculate the probability mass function with pmf()
% These are the values that X can take
xx = [0, 1]
% Plot the probability mass function of the Bernoulli distribution
plt.bar(xx, rv.pmf(xx))
plt.xlim(-1, 2)
plt.ylim(0, 1)
% Label the tick at 0 as "x=0" and the tick at 1 as "x=1"
plt.xticks([0, 1], ["x=0", "x=1"])
plt.xlabel("Sample values")
plt.ylabel("P(x)")
plt.title("pmf of Bernoulli distribution")
plt.show()
#img 1195e5a1-8724-472e-990b-338c6a8daa6b
@
% You can use rvs() to simulate trials
% 100 trials
x = rv.rvs(100, random_state=0)
x
% array([1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1,
%        0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0,
%        1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0,
%        1, 0, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0,
%        1, 0, 1, 1, 1, 1, 0, 1])
@
% You can visualize the simulated result with countplot() of the seaborn package
% Pass the simulated data x by keyword (recent seaborn versions do not accept it positionally)
sns.countplot(x=x)
plt.title("Simulated result of Bernoulli random variable X")
plt.xlabel("Sample values")
plt.show()
#img d5ea3c80-4331-4fae-b846-6dcc569b84e5
@
% You can use the following code to show both the theoretical probability distribution and the sample-based probability distribution
% x = rv.rvs(100, random_state=0)
y = np.bincount(x, minlength=2) / float(len(x))
% counts of 0 and 1, divided by 100.0
% array([38, 62], dtype=int64) / 100.0
% array([0.38, 0.62])
% Build a DataFrame (columns listed in the order shown in the output below)
df = pd.DataFrame({"Sample based": y, "Theoretical": rv.pmf(xx)})
df
%    Sample based  Theoretical
% 0          0.38          0.4
% 1          0.62          0.6
% You can visualize the result above with barplot() of seaborn
% First, stack the columns into rows and reset the index of df
df2 = df.stack().reset_index()
% Rename the columns
df2.columns = ["Sample value", "Type", "Ratio"]
df2
%    Sample value          Type  Ratio
% 0             0  Sample based   0.38
% 1             0   Theoretical   0.40
% 2             1  Sample based   0.62
% 3             1   Theoretical   0.60
% Draw a grouped bar plot, one bar per type for each sample value
sns.barplot(x="Sample value", y="Ratio", hue="Type", data=df2)
plt.show()
#img 908c0b45-5892-4504-8c88-ce5c940d3a1e
@
The moments of the Bernoulli distribution are as follows
1. Expectation of X
$E[X]=\theta$
(proof)
$E[X] = \sum x_{i}P(x_{i})$
$E[X] = 1\cdot\theta + 0\cdot(1-\theta)$
$E[X] = \theta$
2. Variance of X
$Var[X] = \theta(1-\theta)$
(proof)
$Var[X] = \sum(x_{i}-\mu)^{2} P(x_{i})$
$Var[X] = (1-\theta)^{2}\cdot\theta+(0-\theta)^{2}\cdot(1-\theta)$
$Var[X] = \theta(1-\theta)$
In the example above, $\theta = 0.6$
So the theoretical expectation and variance of X are
$E[X]=0.6$
$Var[X]=0.6\cdot(1-0.6)=0.24$
@
% You can calculate the sample mean and the sample variance as follows
np.mean(x)
% 0.62
% ddof=1 gives the unbiased sample variance (division by N-1)
np.var(x, ddof=1)
% 0.23797979797979804
@
% You can also use describe() of SciPy
s = sp.stats.describe(x)
% number of observations, (min, max)
s[0], s[1]
% (100, (0, 1))
% mean of x, variance of x
s[2], s[3]
% (0.62, 0.23797979797979804)
@
Practice 2
1. Find the expectation and variance of X with parameter $\theta = 0.5$
Draw a count plot and compare it with the pmf
Repeat the calculation with 10 samples and with 1000 samples
2. Find the expectation and variance of X with parameter $\theta = 0.9$
Draw a count plot and compare it with the pmf
Repeat the calculation with 10 samples and with 1000 samples
@
We can find the parameter from sample data
This step is called parameter estimation
For the Bernoulli distribution, the parameter can be estimated as follows
$\hat{\theta} = \frac{\sum\limits_{i=1}^{N}x_{i}}{N}$
$\hat{\theta} = \frac{N_{1}}{N}$
$\hat{\theta}$ : estimated parameter
N : the number of samples
$N_{1}$ : the number of samples in which 1 occurred
@
% You can apply the Bernoulli distribution in the following cases
% 1. When the output of a classification problem is a categorical value with two possible values, you can use the Bernoulli distribution to represent which category is more likely
% 2. When the input data is a categorical value with two possible values, you can use the Bernoulli distribution to represent the proportion in which each value appears
@
% Suppose you built a spam filter that distinguishes ham from spam
% Suppose you receive 10 emails
% Suppose 6 are spam and 4 are ham
% We can then estimate that an incoming email is spam with probability 60%
% This case can be represented by a Bernoulli distribution with $\theta=0.6$
$P(Y)=Bern(y;\theta=0.6)$
% The random variable Y represents whether an arriving email is spam or not
% Y=1 means the email is spam
% A spam email is likely to contain specific words and keywords
% If you have several keywords for distinguishing spam from ham, you can represent them as a bag-of-words (BOW) encoded vector
% In this case, we suppose the spam keywords consist of 4 words
$\begin{bmatrix} 1\\0\\1\\0 \end{bmatrix}$
% The vector above means that this email contains the 1st and 3rd spam keywords
% If you have 6 spam emails, you can represent them like this, one row per email
$\begin{bmatrix} 1&0&1&0 \\ 1&1&1&0 \\ 1&1&0&1 \\ 0&0&1&1 \\ 1&1&0&0 \\ 1&1&0&1 \end{bmatrix}$
In this case, we can represent the characteristics of spam email with 4 Bernoulli distributions ($X_{1}, X_{2}, X_{3}, X_{4}$)
$X_{1} \sim Bern(x_{1} ; \theta_{1})$ : $\theta_{1}$ is the probability that a spam email contains the 1st keyword
$X_{2} \sim Bern(x_{2} ; \theta_{2})$ : $\theta_{2}$ is the probability that a spam email contains the 2nd keyword
$X_{3} \sim Bern(x_{3} ; \theta_{3})$ : $\theta_{3}$ is the probability that a spam email contains the 3rd keyword
$X_{4} \sim Bern(x_{4} ; \theta_{4})$ : $\theta_{4}$ is the probability that a spam email contains the 4th keyword
We take the parameter of each Bernoulli distribution to be the fraction of the 6 emails above that contain the corresponding keyword
$\theta_{1} = \frac{5}{6}$
$\theta_{2} = \frac{4}{6}$
$\theta_{3} = \frac{3}{6}$
$\theta_{4} = \frac{3}{6}$
@
% Practice 3
% If the ham emails are represented as follows, how can you represent the characteristics of ham email?
$\begin{bmatrix} 0&0&1&1 \\ 0&1&1&1 \\ 0&0&1&1 \\ 1&0&0&1 \end{bmatrix}$
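@
The spam keyword parameters $\theta_{1}, \theta_{2}, \theta_{3}, \theta_{4}$ given earlier can be checked by applying the estimator $\hat{\theta} = \frac{N_{1}}{N}$ to each keyword column of the 6-email spam matrix. The following is only a minimal sketch, assuming NumPy is available; the names spam_bow and estimate_bernoulli_params are hypothetical and do not come from the original notes.

```python
import numpy as np

# Rows are the 6 spam emails from the text, columns are the 4 spam keywords.
spam_bow = np.array([
    [1, 0, 1, 0],
    [1, 1, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 1],
    [1, 1, 0, 0],
    [1, 1, 0, 1],
])


def estimate_bernoulli_params(bow):
    """Hypothetical helper: return theta_hat = N_1 / N for each column.

    N is the number of rows (emails) and N_1 is the number of 1s
    in the column (emails that contain the keyword).
    """
    bow = np.asarray(bow)
    n_samples = bow.shape[0]
    n_ones = bow.sum(axis=0)
    return n_ones / n_samples


theta_hat = estimate_bernoulli_params(spam_bow)
# Each entry should agree with 5/6, 4/6, 3/6, 3/6 from the text.
print(theta_hat)
```

Because the estimator is just a column mean, spam_bow.mean(axis=0) would give the same values. The same helper could be applied to a ham matrix of the same layout.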