025-001. Bernoulli distribution
A trial whose result is either success or failure is called a Bernoulli trial
A coin toss is an example of a Bernoulli trial
To represent the result of a Bernoulli trial with a random variable X,
success is denoted by X=1,
failure is denoted by X=0
Sometimes failure is denoted by X=-1 instead
Since a Bernoulli random variable takes only the values 0 and 1, it is a discrete random variable
Therefore, the Bernoulli distribution can be defined by a probability mass function (pmf)
The probability mass function of the Bernoulli distribution is defined as follows
$Bern(x;\theta) = \theta$ if x = 1
$Bern(x;\theta) = 1 - \theta$ if x = 0
The Bernoulli distribution has one parameter $\theta$, which is the probability that 1 occurs
The independent variable x and the parameter $\theta$ are separated by ";"
The probability that 0 occurs is $1-\theta$
The formula above can be written as a single expression
$Bern(x;\theta) = \theta^{x} (1 - \theta)^{(1-x)}$
Practice 1
1. In the single-expression formula above, substitute x=1 and x=0 and check that you recover the pmf of the Bernoulli distribution
If the Bernoulli random variable takes the values 1 or -1,
the Bernoulli distribution should be written as follows
$Bern(x;\theta) = \theta^{\frac{(1+x)}{2}} (1-\theta)^{\frac{(1-x)}{2}}$
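As a quick numerical check of the two single-expression forms above, the sketch below evaluates each of them at the values x can take. The function names `bern_pmf_01` and `bern_pmf_pm1` are illustrative and not part of any library, and $\theta = 0.6$ is an arbitrary choice.

```python
import math


def bern_pmf_01(x, theta):
    """Single-expression pmf for outcomes coded as 1 (success) / 0 (failure)."""
    return theta**x * (1 - theta) ** (1 - x)


def bern_pmf_pm1(x, theta):
    """Single-expression pmf for outcomes coded as 1 (success) / -1 (failure)."""
    return theta ** ((1 + x) / 2) * (1 - theta) ** ((1 - x) / 2)


theta = 0.6  # arbitrary value chosen for illustration

# Success has probability theta under both codings
assert math.isclose(bern_pmf_01(1, theta), theta)
assert math.isclose(bern_pmf_pm1(1, theta), theta)

# Failure has probability 1 - theta under both codings
assert math.isclose(bern_pmf_01(0, theta), 1 - theta)
assert math.isclose(bern_pmf_pm1(-1, theta), 1 - theta)
```

In both codings the exponent of one factor becomes 0 and the other becomes 1, so exactly one of $\theta$ and $1-\theta$ survives.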
If a random variable X is generated by a Bernoulli distribution,
we say that X follows the Bernoulli distribution
We write this as follows
$X\sim Bern(x;\theta)$
Simulating the Bernoulli distribution with SciPy
The bernoulli class in the stats subpackage of SciPy implements the Bernoulli distribution
Use the argument p to set the parameter $\theta$
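% Import the packages used in the examples below
import numpy as np
import pandas as pd
import scipy as sp
import scipy.stats
import matplotlib.pyplot as plt
import seaborn as sns
% Set p = 0.6
theta = 0.6
rv = sp.stats.bernoulli(theta)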
% You can calculate the probability mass function with pmf()
% These are the values that X can take
xx = [0, 1]
% Plot the probability mass function of the Bernoulli distribution
plt.bar(xx, rv.pmf(xx))
plt.xlim(-1, 2)
plt.ylim(0, 1)
% Label the tick at 0 as x=0 and the tick at 1 as x=1
plt.xticks([0, 1], ["x=0", "x=1"])
plt.xlabel("Sample values")
plt.title("pmf of bernoulli distribution")
#img 1195e5a1-8724-472e-990b-338c6a8daa6b
% You can use rvs() to simulate trials
% 100 trials
x = rv.rvs(100, random_state=0)
% array([1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1,
% 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0,
% 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0,
% 1, 0, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0,
% 1, 0, 1, 1, 1, 1, 0, 1])
% Visualize the simulated result with countplot() from the seaborn package
% Pass the simulated data x
sns.countplot(x=x)
plt.title("Simulated result of bernoulli random variable X")
plt.xlabel("Sample values")
#img d5ea3c80-4331-4fae-b846-6dcc569b84e5
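The idea behind simulating a Bernoulli trial can be sketched without SciPy: draw a uniform number in [0, 1) and report 1 when it falls below $\theta$. This is one common way to generate such samples and is not necessarily how `rvs()` works internally. The helper name `bernoulli_samples`, the sample size, and the seed are assumptions made for illustration.

```python
import random


def bernoulli_samples(theta, n, seed=None):
    """Draw n Bernoulli(theta) samples by thresholding uniform random numbers."""
    rng = random.Random(seed)
    # rng.random() is uniform on [0, 1), so it is below theta with probability theta
    return [1 if rng.random() < theta else 0 for _ in range(n)]


samples = bernoulli_samples(theta=0.6, n=100_000, seed=0)

# For a large sample, the fraction of 1s should be close to theta
ratio_of_ones = sum(samples) / len(samples)
print(ratio_of_ones)
```

Fixing the seed makes the run repeatable, which plays the same role as `random_state=0` in the SciPy call.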
% The following code shows the theoretical probability distribution and the sample-based distribution side by side
% x = rv.rvs(100, random_state=0)
y = np.bincount(x, minlength=2) / float(len(x))
% (count of 0s, count of 1s) / 100.0
% array([38, 62], dtype=int64) / 100.0
% array([0.38, 0.62])
% Build a DataFrame from both distributions
df = pd.DataFrame({"Theoretical":rv.pmf(xx), "Sample based":y})
%    Sample based  Theoretical
% 0          0.38          0.4
% 1          0.62          0.6
% You can visualize the result above with barplot() from seaborn
% First, stack the columns into rows and reset the index of df
df2 = df.stack().reset_index()
% Name the columns with a list of labels
df2.columns = ["Sample value", "Type", "Ratio"]
%    Sample value          Type  Ratio
% 0             0  Sample based   0.38
% 1             0   Theoretical   0.40
% 2             1  Sample based   0.62
% 3             1   Theoretical   0.60
% Draw a grouped bar plot with one group per sample value and one bar per type
sns.barplot(x="Sample value", y="Ratio", hue="Type", data=df2)
#img 908c0b45-5892-4504-8c88-ce5c940d3a1e
The moments of the Bernoulli distribution are as follows
1. Expectation of X
$E[X] = \sum x_{i}P(x_{i})$
$E[X] = 1\cdot\theta + 0\cdot(1-\theta)$
$E[X] = \theta$
2. Variance of X
$Var[X] = \sum(x_{i}-\mu)^{2} P(x_{i})$, where $\mu = E[X] = \theta$
$Var[X] = (1-\theta)^{2}\cdot\theta+(0-\theta)^{2}\cdot(1-\theta)$
$Var[X] = \theta(1-\theta)\{(1-\theta)+\theta\}$
$Var[X] = \theta(1-\theta)$
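The expectation and variance above can be checked by evaluating the defining sums directly. The sketch below does this for a few arbitrary values of $\theta$; `bernoulli_moments` is a hypothetical helper written for this note.

```python
import math


def bernoulli_moments(theta):
    """Return (mean, variance) of a Bernoulli(theta) variable from the defining sums."""
    pmf = {1: theta, 0: 1 - theta}
    mean = sum(x * p for x, p in pmf.items())
    variance = sum((x - mean) ** 2 * p for x, p in pmf.items())
    return mean, variance


for theta in (0.1, 0.5, 0.6, 0.9):
    mean, variance = bernoulli_moments(theta)
    # Both sums agree with the closed forms theta and theta * (1 - theta)
    assert math.isclose(mean, theta)
    assert math.isclose(variance, theta * (1 - theta))
```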
In the example above, $\theta = 0.6$
So the theoretical expectation and variance of X are as follows
$E[X] = 0.6$
$Var[X] = 0.6\cdot(1-0.6) = 0.24$
% You can calculate the sample mean and the sample variance as follows
np.mean(x)
% 0.62
np.var(x, ddof=1)
% 0.23797979797979804
% You can also use describe() from scipy.stats
s = sp.stats.describe(x)
% number of samples, (minimum, maximum)
s[0], s[1]
% (100, (0, 1))
% sample mean of x, sample variance of x
s[2], s[3]
% (0.62, 0.23797979797979804)
Practice 2
1. Find the expectation and variance of X with parameter $\theta = 0.5$
Draw a countplot and compare it with the pmf
Repeat the calculation with 10 samples and with 1000 samples
2. Find the expectation and variance of X with parameter $\theta = 0.9$
Draw a countplot and compare it with the pmf
Repeat the calculation with 10 samples and with 1000 samples
We can find the parameter from sample data
This step is called parameter estimation
For the Bernoulli distribution, the parameter can be estimated as follows
$\hat{\theta} = \frac{\sum\limits_{i=1}^{N}x_{i}}{N}$
$\hat{\theta} = \frac{N_{1}}{N}$
$\hat{\theta}$ : estimated parameter
N : the number of samples
$N_{1}$ : the number of samples equal to 1
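As a small illustration of the estimator above, the sketch below applies both forms to a hand-written sample of ten outcomes. The data are made up for this example.

```python
# Hypothetical sample of N = 10 Bernoulli outcomes (made up for illustration)
x = [1, 0, 1, 1, 0, 1, 0, 1, 1, 1]

N = len(x)       # number of samples
N1 = x.count(1)  # number of samples equal to 1

theta_hat_sum = sum(x) / N  # first form: sum of x_i divided by N
theta_hat_count = N1 / N    # second form: N_1 divided by N

# Both forms agree because each x_i is either 0 or 1
print(theta_hat_sum, theta_hat_count)  # 0.7 0.7
```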
% You can apply the Bernoulli distribution in the following cases
% 1. When the output of a classification problem is a categorical value with 2 possible values
% You can use the Bernoulli distribution to represent which category is more likely
% 2. When the input data is a categorical value with 2 possible values, you can use the Bernoulli distribution to represent the ratio at which each value appears
% Suppose you build a spam filter that distinguishes ham from spam
% Suppose you receive 10 emails
% Suppose 6 are spam and 4 are ham
% Then we can estimate that an incoming email is spam with probability 60%
% This case can be represented by a Bernoulli distribution with $\theta=0.6$
% The random variable Y represents whether an arrived email is spam or not
% Y=1 means the email is spam
% A spam email is likely to contain specific words and keywords
% If you have several keywords for distinguishing spam from ham, you can encode whether each keyword appears as a BOW (bag of words) vector
% In this example, we suppose there are 4 spam keywords
$\begin{bmatrix} 1\\0\\1\\0 \end{bmatrix}$
% The vector above means that this email contains the 1st and 3rd spam keywords
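The encoding step can be sketched as follows. The source only says that there are 4 spam keywords, so the keyword list, the sample email, and the helper name `encode_bow` are all hypothetical and chosen only to reproduce the vector shown above.

```python
# Hypothetical spam keywords; the notes only state that there are 4 of them
SPAM_KEYWORDS = ["free", "winner", "money", "click"]


def encode_bow(text, keywords=SPAM_KEYWORDS):
    """Return a 0/1 vector marking which keywords occur in the text."""
    words = set(text.lower().split())
    return [1 if keyword in words else 0 for keyword in keywords]


# Made-up email that contains the 1st and 3rd keywords
email = "Get free money today"
print(encode_bow(email))  # [1, 0, 1, 0]
```

Splitting on whitespace is a deliberate simplification; a real filter would also need to handle punctuation and word variants.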
% If you have 6 spam emails, you can represent them like this, one row per email
$\begin{bmatrix} 1&0&1&0 \\ 1&1&1&0 \\ 1&1&0&1 \\ 0&0&1&1 \\ 1&1&0&0 \\ 1&1&0&1 \end{bmatrix}$
In this case, we can represent the characteristics of spam email with 4 Bernoulli distributions ($X_{1}, X_{2}, X_{3}, X_{4}$)
$X_{1} \sim Bern(x_{1} ; \theta_{1})$ : $\theta_{1}$ is the probability that a spam email has the 1st keyword
$X_{2} \sim Bern(x_{2} ; \theta_{2})$ : $\theta_{2}$ is the probability that a spam email has the 2nd keyword
$X_{3} \sim Bern(x_{3} ; \theta_{3})$ : $\theta_{3}$ is the probability that a spam email has the 3rd keyword
$X_{4} \sim Bern(x_{4} ; \theta_{4})$ : $\theta_{4}$ is the probability that a spam email has the 4th keyword
Each parameter is estimated as the fraction of the 6 spam emails that contain the corresponding keyword
$\theta_{1} = \frac{5}{6}$
$\theta_{2} = \frac{4}{6}$
$\theta_{3} = \frac{3}{6}$
$\theta_{4} = \frac{3}{6}$
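The four estimates above are the column averages of the spam matrix. A minimal sketch that recomputes them with exact fractions is given below; the variable names are illustrative.

```python
from fractions import Fraction

# Rows are the 6 spam emails, columns are the 4 keywords (matrix from the text)
spam_bow = [
    [1, 0, 1, 0],
    [1, 1, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 1],
    [1, 1, 0, 0],
    [1, 1, 0, 1],
]

n_emails = len(spam_bow)

# For each keyword (column): theta_hat = emails containing it / all emails
theta_hat = [Fraction(sum(column), n_emails) for column in zip(*spam_bow)]

for k, theta in enumerate(theta_hat, start=1):
    print(f"theta_{k} = {theta}")
```

`Fraction` reduces its result, so 4/6 is printed as 2/3 and 3/6 as 1/2. The same code can be reused for another matrix by replacing `spam_bow`.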
% Practice 3
% If the ham emails are represented as follows, how can you represent the characteristics of ham email?
$\begin{bmatrix} 0&0&1&1 \\ 0&1&1&1 \\ 0&0&1&1 \\ 1&0&0&1 \end{bmatrix}$