048-002. naive bayes classification model
@ The naive Bayes classification model is one of the well-known "generative models".
The target variable y takes one of the classes $\{C_{1}, \ldots, C_{K}\}$, and there is an independent variable x.
If we can model the likelihood $P(x|y=C_{k})$, we can estimate the posterior $P(y=C_{k}|x)$ for every class.
Among the multiple estimated posteriors we choose the maximal $P(y=C_{k}|x)$, and the class k that attains this maximum is the final prediction.
This procedure is called the "naive Bayes classification model".

@ We can calculate $P(y=C_{k}|x)$ by using Bayes' rule:
$P(y=C_{k}|x) = \frac{P(x|y=C_{k}) P(y=C_{k})}{P(x)}$
We do not need the marginal probability $P(x)$, because we only compare the probabilities across the classes k and $P(x)$ does not depend on k.
So we can drop $P(x)$ and write:
$P(y=C_{k}|x) \propto P(x|y=C_{k}) P(y=C_{k})$
The prior probability $P(y=C_{k})$ is easy to estimate:
$P(y=C_{k}) \approx \frac{\text{number of samples with } y=C_{k}}{\text{number of all samples}}$
The likelihood $P(x|y=C_{k})$ is found with the following steps, under the assumption of a specific model such as the Gaussian normal distribution or the Bernoulli distribution:
1. Suppose that $P(x|y=C_{k})$ follows a specific probability distribution model.
2. Find the parameters of this model by using the training data $\{x_{1}, \ldots, x_{N}\}$.
3. Since the parameters of the model are now known, $P(x|y=C_{k})$ can be evaluated for any new value of x.

@ If the independent variable x is multi-dimensional, $x = (x_{1}, \ldots, x_{n})$, the likelihood above is the joint probability $P(x_{1}, \ldots, x_{n}|y=C_{k})$ over all the $x_{i}$.
Since this joint probability is hard to obtain directly, we assume that $x_{1}, \ldots, x_{n}$ are all independent of each other.
We call this the naive assumption.
Under the naive assumption, the joint probability is the product of the individual probabilities:
$P(x|y=C_{k}) = P(x_{1}, \ldots, x_{n}|y=C_{k}) = \prod\limits_{i=1}^{n} P(x_{i}|y=C_{k})$
Combining this with the relation we already have,
$P(y=C_{k}|x) \propto P(x|y=C_{k}) P(y=C_{k})$
we can write:
$P(y=C_{k}|x) \propto \prod\limits_{i=1}^{n} P(x_{i}|y=C_{k}) \, P(y=C_{k})$
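The decision rule above can be illustrated in a few lines of NumPy. This is a minimal sketch of my own, not part of the original notes: the arrays likelihoods and priors are made-up numbers standing in for $P(x_{i}|y=C_{k})$ and $P(y=C_{k})$.

import numpy as np

# Hypothetical per-feature likelihoods P(x_i | y=C_k) for 2 classes and 3 features,
# evaluated at one new observation x. The numbers are made up for illustration.
likelihoods = np.array([[0.7, 0.2, 0.5],   # class C_1
                        [0.1, 0.6, 0.4]])  # class C_2
priors = np.array([0.4, 0.6])              # P(y=C_k), also hypothetical

# Naive assumption: the joint likelihood is the product over the features.
# Work in log space for numerical stability, as real implementations do.
log_posterior = np.log(likelihoods).sum(axis=1) + np.log(priors)

# Normalize to obtain P(y=C_k | x); the marginal P(x) cancels in the normalization.
posterior = np.exp(log_posterior)
posterior /= posterior.sum()
print(posterior)           # posterior probability of each class
print(posterior.argmax())  # index of the predicted class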
@ The distributions that are commonly used as models of the likelihood are the following:
1. Bernoulli distribution: x can have only the value 0 or 1, and the probability of x being 1 is fixed. Example: a model which finds which coin was tossed, based on the result of the coin toss.
$P(x_{i}|y=C_{k}) = \theta_{k}^{x_{i}}(1-\theta_{k})^{(1-x_{i})}$
2. Multinomial distribution: $(x_{1}, \ldots, x_{n})$ consists of zero or positive integers (counts). Example: a model which finds which die was thrown, based on the result of throwing the die.
$P(x_{1}, \ldots, x_{n}|y=C_{k}) = \prod\limits_{i} \theta_{k,i}^{x_{i}}$
3. Gaussian normal distribution: x is a real number in a certain range of values. Example: a model which finds which student it was, based on the result of an exam.
$P(x_{i}|y=C_{k}) = \frac{1}{\sqrt{2\pi\sigma_{k}^{2}}} \exp\left(-\frac{(x_{i}-\mu_{k})^{2}}{2\sigma_{k}^{2}}\right)$

@ The subpackage naive_bayes of scikit-learn provides three naive Bayes classification model classes.
The BernoulliNB class is the Bernoulli-distribution naive Bayes classification model.
The MultinomialNB class is the multinomial-distribution naive Bayes classification model.
The GaussianNB class is the Gaussian-normal-distribution naive Bayes classification model.

@ $P(y=C_{k}|x) = \frac{P(x|y=C_{k}) P(y=C_{k})}{P(x)}$
The classes above have the following attributes and methods.

@ Attributes related to the prior probability $P(y=C_{k})$:
classes_ : the labels of y
class_count_ : the number of sample data for each value of y
class_prior_ : the unconditional probability distribution $P(y)$ of y (Gaussian normal distribution only)
class_log_prior_ : the log $\log{P(y)}$ of the unconditional probability distribution of y (Bernoulli or multinomial distribution only)

@ Attributes that estimate the likelihood $P(x|y=C_{k})$:
theta_, sigma_ : the estimated mean $\mu$ and variance $\sigma^{2}$ for the Gaussian normal distribution
feature_count_ : the occurrence count of each independent variable x, for the Bernoulli or multinomial distribution
feature_log_prob_ : the log of the parameter vector of the Bernoulli or multinomial distribution,
$\log{\theta} = (\log{\theta_{1}}, \ldots, \log{\theta_{K}}) = \left(\log{\frac{N_{1}}{N}}, \ldots, \log{\frac{N_{K}}{N}}\right)$
In the formula above, K is the number of values that x can take, N is the total number of trials, and $N_{i}$ is the number of times the i-th value occurred.

@ If you have a small sample, you can use "smoothing":
$\hat{\theta} = \frac{N_{i}+\alpha}{N+\alpha K}$
Here $\alpha$ is the smoothing hyperparameter (the alpha argument of BernoulliNB and MultinomialNB); $\alpha = 1$ is called Laplace smoothing.
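A minimal sketch of the smoothing formula, using hypothetical counts of my own; it simply compares the raw relative frequencies with the smoothed estimate.

import numpy as np

N_i = np.array([0, 2, 8])   # hypothetical counts of each of the K = 3 values
N = N_i.sum()               # total number of trials
K = len(N_i)
alpha = 1.0                 # smoothing strength; alpha = 1 is Laplace smoothing

theta_raw = N_i / N                             # unsmoothed estimate; an unseen value gets probability 0
theta_smooth = (N_i + alpha) / (N + alpha * K)  # smoothed estimate; no value has probability 0
print(theta_raw)     # [0.  0.2 0.8]
print(theta_smooth)  # [0.07692308 0.23076923 0.69230769]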
@ Implementing the naive Bayes classification model with the Gaussian normal distribution

import numpy as np
import scipy as sp
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
import statsmodels.stats.api as sms
import sklearn as sk
import matplotlib as mpl
mpl.use('Agg')
import matplotlib.pylab as plt
from mpl_toolkits.mplot3d import Axes3D
import seaborn as sns
sns.set()
sns.set_style("whitegrid")
sns.set_color_codes()
%matplotlib inline

np.random.seed(0)
X0 = sp.stats.norm(-2, 1).rvs(40)
X1 = sp.stats.norm(+2, 1).rvs(60)
X = np.hstack([X0, X1])[:, np.newaxis]
y0 = np.zeros(40)
y1 = np.ones(60)
y = np.hstack([y0, y1])

sns.distplot(X0, rug=True, kde=False, norm_hist=True, label="class 0")
sns.distplot(X1, rug=True, kde=False, norm_hist=True, label="class 1")
plt.legend()
plt.xlim(-6, 6)
plt.show()
#img 048-002-001

from sklearn.naive_bayes import GaussianNB
clf_norm = GaussianNB().fit(X, y)

clf_norm.classes_
% array([0., 1.])

clf_norm.class_count_
% array([40., 60.])

clf_norm.class_prior_
% array([0.4, 0.6])

clf_norm.theta_, clf_norm.sigma_
% (array([[-1.68745753],
%         [ 1.89131838]]),
%  array([[1.13280656],
%         [0.8668681 ]]))

xx = np.linspace(-6, 6, 100)
p0 = sp.stats.norm(clf_norm.theta_[0], np.sqrt(clf_norm.sigma_[0])).pdf(xx)
p1 = sp.stats.norm(clf_norm.theta_[1], np.sqrt(clf_norm.sigma_[1])).pdf(xx)
sns.distplot(X0, rug=True, kde=False, norm_hist=True, color="r", label="class 0 histogram")
sns.distplot(X1, rug=True, kde=False, norm_hist=True, color="b", label="class 1 histogram")
plt.plot(xx, p0, c="r", label="class 0 est. pdf")
plt.plot(xx, p1, c="b", label="class 1 est. pdf")
plt.legend()
plt.show()
#img 048-002-002

x_new = -1
clf_norm.predict_proba([[x_new]])
% array([[0.98327446, 0.01672554]])

px = sp.stats.norm(clf_norm.theta_, np.sqrt(clf_norm.sigma_)).pdf(x_new)
px
% array([[0.30425666],
%        [0.00345028]])

p = px.flatten() * clf_norm.class_prior_
p
% array([0.12170266, 0.00207017])

clf_norm.class_prior_
% array([0.4, 0.6])

p / p.sum()
% array([0.98327446, 0.01672554])

Practice 1
Solve the iris classification problem by using the naive Bayes classification model and find the following:
1. confusion matrix
2. classification report
3. ROC curve
4. AUC

@ Using the naive Bayes classification model with the Bernoulli distribution
In the Bernoulli model, not only the target variable y but also the independent variable x must take the value 0 or 1.
With the Bernoulli distribution you can model whether specific words are contained in a document or not, so you can apply the Bernoulli model to building a spam filter.

np.random.seed(0)
X = np.random.randint(2, size=(10, 4))
y = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
print(X)
print(y)
% [[0 1 1 0]
%  [1 1 1 1]
%  [1 1 1 0]
%  [0 1 0 0]
%  [0 0 0 1]
%  [0 1 1 0]
%  [0 1 1 1]
%  [1 0 1 0]
%  [1 0 1 1]
%  [0 1 1 0]]
% [0 0 0 0 1 1 1 1 1 1]

from sklearn.naive_bayes import BernoulliNB
clf_bern = BernoulliNB().fit(X, y)

clf_bern.classes_
% array([0, 1])

clf_bern.class_count_
% array([ 4.,  6.])

np.exp(clf_bern.class_log_prior_)
% array([ 0.4,  0.6])

fc = clf_bern.feature_count_
fc
% array([[ 2.,  4.,  3.,  1.],
%        [ 2.,  3.,  5.,  3.]])

fc / np.repeat(clf_bern.class_count_[:, np.newaxis], 4, axis=1)
% array([[ 0.5       ,  1.        ,  0.75      ,  0.25      ],
%        [ 0.33333333,  0.5       ,  0.83333333,  0.5       ]])

theta = np.exp(clf_bern.feature_log_prob_)
theta
% array([[ 0.5       ,  0.83333333,  0.66666667,  0.33333333],
%        [ 0.375     ,  0.5       ,  0.75      ,  0.5       ]])

x_new = np.array([1, 1, 0, 0])
clf_bern.predict_proba([x_new])
% array([[ 0.72480181,  0.27519819]])

p = ((theta ** x_new) * (1 - theta) ** (1 - x_new)).prod(axis=1) * np.exp(clf_bern.class_log_prior_)
p / p.sum()
% array([ 0.72480181,  0.27519819])

x_new = np.array([0, 0, 1, 1])
clf_bern.predict_proba([x_new])
% array([[ 0.09530901,  0.90469099]])

p = ((theta ** x_new) * (1 - theta) ** (1 - x_new)).prod(axis=1) * np.exp(clf_bern.class_log_prior_)
p / p.sum()
% array([ 0.09530901,  0.90469099])
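In practice the independent variables are rarely 0/1 from the start. The sketch below is my own illustration, with a made-up continuous matrix Xc and threshold, of two equivalent ways to binarize: applying sklearn.preprocessing.Binarizer before fitting, or passing the binarize argument to BernoulliNB so the class binarizes the features internally. The same mechanism is used in the next practice.

import numpy as np
from sklearn.preprocessing import Binarizer
from sklearn.naive_bayes import BernoulliNB

np.random.seed(0)
Xc = np.random.rand(10, 4) * 8        # hypothetical continuous features in [0, 8)
yc = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])

# Way 1: binarize explicitly, then fit BernoulliNB on the 0/1 matrix.
Xb = Binarizer(threshold=4.0).fit_transform(Xc)
clf1 = BernoulliNB().fit(Xb, yc)

# Way 2: let BernoulliNB binarize internally with the same threshold.
clf2 = BernoulliNB(binarize=4.0).fit(Xc, yc)

# Both models estimate the same parameters.
print(np.allclose(clf1.feature_log_prob_, clf2.feature_log_prob_))  # True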
Practice 2
From the MNIST digit classification problem, convert the x values into 0 or 1 by using a Binarizer, and then solve the problem by using the naive Bayes classification model with the Bernoulli distribution.
Additionally, solve the same problem by using the binarize argument of the BernoulliNB class instead of a separate Binarizer.

@ Using the naive Bayes classification model with the multinomial distribution

np.random.seed(0)
X0 = np.random.multinomial(10, [0.3, 0.5, 0.1, 0.1], size=4)
X1 = np.random.multinomial(8, [0.1, 0.1, 0.35, 0.45], size=6)
X = np.vstack([X0, X1])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
print(X)
print(y)
% [[3 4 1 2]
%  [3 5 1 1]
%  [3 3 0 4]
%  [3 4 1 2]
%  [1 2 1 4]
%  [0 0 5 3]
%  [1 2 4 1]
%  [1 1 4 2]
%  [0 1 2 5]
%  [2 1 2 3]]
% [0 0 0 0 1 1 1 1 1 1]

from sklearn.naive_bayes import MultinomialNB
clf_mult = MultinomialNB().fit(X, y)

clf_mult.classes_
% array([0, 1])

clf_mult.class_count_
% array([ 4.,  6.])

fc = clf_mult.feature_count_
fc
% array([[ 12.,  16.,   3.,   9.],
%        [  5.,   7.,  18.,  18.]])

fc / np.repeat(fc.sum(axis=1)[:, np.newaxis], 4, axis=1)
% array([[ 0.3       ,  0.4       ,  0.075     ,  0.225     ],
%        [ 0.10416667,  0.14583333,  0.375     ,  0.375     ]])

clf_mult.alpha
% 1.0

(fc + clf_mult.alpha) / (np.repeat(fc.sum(axis=1)[:, np.newaxis], 4, axis=1) + clf_mult.alpha * X.shape[1])
% array([[ 0.29545455,  0.38636364,  0.09090909,  0.22727273],
%        [ 0.11538462,  0.15384615,  0.36538462,  0.36538462]])

theta = np.exp(clf_mult.feature_log_prob_)
theta
% array([[ 0.29545455,  0.38636364,  0.09090909,  0.22727273],
%        [ 0.11538462,  0.15384615,  0.36538462,  0.36538462]])

x_new = np.array([10, 10, 10, 10])
clf_mult.predict_proba([x_new])
% array([[ 0.38848858,  0.61151142]])

p = (theta ** x_new).prod(axis=1) * np.exp(clf_mult.class_log_prior_)
p / p.sum()
% array([ 0.38848858,  0.61151142])

Practice 3
1. Solve the MNIST digit classification problem by using the naive Bayes classification model with the multinomial distribution.
2. Can you apply the naive Bayes classification model with the multinomial distribution when x is a real number, not an integer?

@ Let's apply the naive Bayes classification model to the "20 Newsgroups" data.

from sklearn.datasets import fetch_20newsgroups
news = fetch_20newsgroups(subset="all")
X = news.data
y = news.target

from sklearn.feature_extraction.text import TfidfVectorizer, HashingVectorizer, CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

model1 = Pipeline([
    ('vect', CountVectorizer()),
    ('clf', MultinomialNB()),
])
model2 = Pipeline([
    ('vect', TfidfVectorizer()),
    ('clf', MultinomialNB()),
])
model3 = Pipeline([
    ('vect', TfidfVectorizer(stop_words="english")),
    ('clf', MultinomialNB()),
])
model4 = Pipeline([
    ('vect', TfidfVectorizer(stop_words="english",
                             token_pattern=r"\b[a-z0-9_\-\.]+[a-z][a-z0-9_\-\.]+\b")),
    ('clf', MultinomialNB()),
])

%%time
from sklearn.model_selection import cross_val_score, KFold
for i, clf in enumerate([model1, model2, model3, model4]):
    scores = cross_val_score(clf, X, y, cv=5)
    print("Model{0:d}: Mean score: {1:.3f}".format(i + 1, np.mean(scores)))
% Model1: Mean score: 0.855
% Model2: Mean score: 0.856
% Model3: Mean score: 0.883
% Model4: Mean score: 0.888
% Wall time: 2min 50s

A usage sketch showing how such a fitted pipeline classifies a new document follows after the practice below.

Practice 4
How would you solve the problem if x is composed of mixed types, for example $x_{1}$ is a real number, $x_{2}$ is 0 or 1, and so on?
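As mentioned above, here is a minimal usage sketch of my own, assuming the objects model3, X, y and news defined earlier are still available and that fitting on the full data set is acceptable; the example document is made up.

# Fit one of the pipelines above on the whole data set
# and predict the newsgroup of a new, made-up document.
model3.fit(X, y)
doc_new = ["The rocket engine fired and the satellite finally reached orbit."]
pred = model3.predict(doc_new)
print(news.target_names[pred[0]])           # predicted newsgroup name
print(model3.predict_proba(doc_new).max())  # posterior probability of the predicted class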