https://datascienceschool.net/view-notebook/58269d7f52bd49879965cdc4721da42d/
================================================================================
* Regression analysis
* Regression analysis quantitatively finds the relationship between a D-dimensional independent variable vector x and a scalar dependent variable y
* Regression analysis comes in two kinds: the deterministic model and the probabilistic model
================================================================================
* Deterministic model
* A deterministic model should find f(x) satisfying the following
$$$\hat{y} = f(x) \approx y$$$
* If $$$f(x)$$$ is a linear function like the following, the model is called linear regression
$$$\hat{y} \\ = w_0 + w_1 x_1 + w_2 x_2 + \cdots + w_D x_D \\ = w_0 + w^Tx$$$
* $$$w_0,\cdots, w_D$$$: parameters (coefficients) of the linear regression model $$$f(x)$$$
================================================================================
* If the linear regression model has a "constant term", you can insert that constant into the feature vector
* This is called "bias augmentation"
$$$x_i = \begin{bmatrix} x_{i1} \\ x_{i2} \\ \vdots \\ x_{iD} \end{bmatrix}$$$
Insert the constant term 1
$$$x_{i,a} = \begin{bmatrix} 1 \\ x_{i1} \\ x_{i2} \\ \vdots \\ x_{iD} \end{bmatrix}$$$
================================================================================
* Use "bias augmentation" on a matrix of feature vectors
$$$X = \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1D} \\ x_{21} & x_{22} & \cdots & x_{2D} \\ \vdots & \vdots & \ddots & \vdots \\ x_{N1} & x_{N2} & \cdots & x_{ND} \end{bmatrix}$$$
Insert the constant term 1 into all feature vectors as the bias column
$$$X_a = \begin{bmatrix} 1 & x_{11} & x_{12} & \cdots & x_{1D} \\ 1 & x_{21} & x_{22} & \cdots & x_{2D} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_{N1} & x_{N2} & \cdots & x_{ND} \end{bmatrix}$$$
================================================================================
* The linear regression function f(x) with a constant term looks like
$$$f(x) = w_0 + w_1 x_1 + w_2 x_2 + \cdots + w_D x_D$$$
* It can be expressed by matrix multiplication
$$$= \begin{bmatrix} 1 & x_1 & x_2 & \cdots & x_D \end{bmatrix} \begin{bmatrix} w_0 \\ w_1 \\ w_2 \\ \vdots \\ w_D \end{bmatrix}$$$
* You can simplify the above matrix multiplication as
$$$= x_a^T w_a = w_a^T x_a$$$
================================================================================
* Code
# ================================================================================
from sklearn.datasets import make_regression
import numpy as np
import statsmodels.api as sm
# ================================================================================
X0,y,coef=make_regression(
  n_samples=100,n_features=2,bias=100,noise=10,coef=True,random_state=1)

# ori_feat_vecs: original feature vectors
ori_feat_vecs=X0[:5]
# print("ori_feat_vecs",ori_feat_vecs)
# [[ 0.0465673   0.80186103]
#  [-2.02220122  0.31563495]
#  [-0.38405435 -0.3224172 ]
#  [-1.31228341  0.35054598]
#  [-0.88762896 -0.19183555]]
# ================================================================================
# Create constant terms
one_vecs=np.ones((X0.shape[0],1))
# print("one_vecs",one_vecs)
# [[1.]
#  [1.]
#  ...
# ================================================================================
# Perform "bias augmentation" (inserting the constant term into the feature vectors)
stacked_arr=np.hstack((one_vecs,X0))
# print("stacked_arr",stacked_arr)
# [[ 1.          0.0465673   0.80186103]
#  [ 1.         -2.02220122  0.31563495]
#  [ 1.         -0.38405435 -0.3224172 ]
#  ...
# ================================================================================
# You can also perform "bias augmentation" by using statsmodels.api.add_constant()
X = sm.add_constant(X0)
# print("X",X)
# [[ 1.          0.0465673   0.80186103]
#  [ 1.         -2.02220122  0.31563495]
#  [ 1.         -0.38405435 -0.3224172 ]
#  ...
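# ================================================================================
# * A minimal sketch (not in the original notes) to check $$$\hat{y} = X_a w_a$$$ numerically:
#   make_regression() above was called with bias=100 and returned the true weights in coef,
#   so one matrix multiplication of the augmented matrix with the true parameter vector
#   reproduces y up to the added noise. The names w_a_true and y_hat are introduced here
#   only for this illustration.
w_a_true=np.concatenate(([100.0],coef))  # [bias, w_1, w_2], taken from the make_regression call above
y_hat=np.dot(stacked_arr,w_a_true)       # all N predictions at once
# print(np.max(np.abs(y-y_hat)))         # remaining differences come only from the noise term (noise=10)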
================================================================================
* Ordinary Least Squares (OLS)
* Basic method for the deterministic linear regression model
* It finds the parameters which minimize the "Residual Sum of Squares" by using matrix differentiation
* Linear regression model: $$$\hat{y}=Xw$$$
* $$$X$$$: feature matrix
* $$$w$$$: trainable parameters
* Residual vector e:
$$$e \\ = y-\hat{y} \\ = y-Xw$$$
* Residual Sum of Squares:
$$$\text{RSS} \\ = e^Te \\ = (y-Xw)^T(y-Xw) \\ = y^Ty-2y^TXw+w^TX^TXw$$$
================================================================================
* You want the minimum of RSS, so differentiate it with respect to w
$$$\dfrac{d \text{RSS}}{d w} = -2 X^T y + 2 X^TX w$$$
* Set the derivative to 0 to find the minimum
$$$\dfrac{d \text{RSS}}{d w} = 0$$$
$$$X^TX w^{\ast} = X^T y$$$
================================================================================
* If the inverse matrix of $$$X^TX$$$ exists, you can find the optimal parameter $$$w^{\ast}$$$ in closed form
$$$w^{\ast} = (X^TX)^{-1} X^T y$$$
================================================================================
* Normal equation
* The equation obtained by setting the gradient to 0
$$$\dfrac{d \text{RSS}}{d w} = 0$$$
$$$X^TX w^{\ast} = X^T y$$$
$$$X^T y - X^TX w^{\ast} = 0$$$
================================================================================
* Code
* Linear regression analysis by using NumPy
from sklearn.datasets import make_regression
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt

bias=100
X0,y,coef=make_regression(
  n_samples=100,n_features=1,bias=bias,noise=10,coef=True,random_state=1)
# print("X0",X0)
# [[-0.61175641]
#  [-0.24937038]
#  ...
# print("y",y)
# [ 47.5431146  84.60560134 ...
# print("coef",coef)
# 80.71051956187792

X=sm.add_constant(X0)
# print("X",X)
# [[ 1.         -0.61175641]
#  [ 1.         -0.24937038]
#  ...

y=y.reshape(len(y),1)
# print("y",y)
# [[ 47.5431146 ]
#  [ 84.60560134]
#  ...
# ================================================================================
# Relationship between x and y in this linear regression model
# $$$y = \text{bias} + wx + \epsilon$$$
# $$$y = 100 + 80.7105 x + \epsilon$$$
# ================================================================================
# * Get the bias (about 100) and the weight (about 80.7) manually via the normal equation

# $$$X^TX$$$
sqr_mat=np.dot(X.T,X)
# print("sqr_mat",sqr_mat)
# [[100.          6.05828521]
#  [  6.05828521 78.71718049]]

# $$$(X^TX)^{-1}$$$
inv_of_sqr_mat=np.linalg.inv(sqr_mat)
# print("inv_of_sqr_mat",inv_of_sqr_mat)
# [[ 0.01004684 -0.00077323]
#  [-0.00077323  0.01276322]]

# $$$(X^TX)^{-1}X^T$$$
pinv_mat=np.dot(inv_of_sqr_mat,X.T)
# print("pinv_mat",pinv_mat)
# [[ 0.01051987 0.01023967 0.00966911 0.00945763 0.00887167 0.0097549
#    0.00965023 0.01056587 0.01112666 0.00980279 0.01053939 0.01035363 ...
# print("pinv_mat",pinv_mat.shape)
# (2, 100)

# $$$w^{\ast} = ((X^TX)^{-1}X^T)y$$$
X_mul_y=np.dot(pinv_mat,y)
# print("X_mul_y",X_mul_y)
# [[102.02701439]
#  [ 81.59750943]]
# \hat{y} = 102.02701439 + 81.59750943x
# ================================================================================
# \hat{y} = Xw: find w (constant term and weight) with a least squares solver
sol=np.linalg.lstsq(X,y,rcond=None)
# print("sol",sol)
# (array([[102.02701439],
#         [ 81.59750943]]),
#  array([8314.18911692]),
#  2,
#  array([10.07986548,  8.78142884]))
# ================================================================================
x_new = np.linspace(np.min(X0), np.max(X0), 100)
X_new = sm.add_constant(x_new)  # bias augmentation (add constant term)
# y_new = np.dot(X_new, X_mul_y)
y_new = np.dot(X_new, sol[0])

plt.scatter(X0, y, label="data")
plt.plot(x_new, y_new, 'r-', label="Result from regression analysis")
plt.xlabel("x")
plt.ylabel("y")
plt.title("Example of linear regression analysis")
plt.legend()
plt.show()
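# ================================================================================
# * A minimal sketch (not in the original notes): instead of forming the explicit inverse
#   of X^T X as above, you can hand the normal equation X^T X w = X^T y directly to
#   np.linalg.solve(), which avoids the explicit matrix inversion and is the numerically
#   preferred way; it is expected to reproduce the same values as X_mul_y and sol[0].
#   The name w_normal_eq is introduced only for this illustration.
w_normal_eq=np.linalg.solve(np.dot(X.T,X),np.dot(X.T,y))
# print("w_normal_eq",w_normal_eq)
# expected to match [[102.02701439], [81.59750943]]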
# ================================================================================
# * Linear regression analysis by using scikit-learn
# 1. Create a LinearRegression object
#    model=LinearRegression(fit_intercept=True)
#    * fit_intercept
#      True: the model contains a constant term
#      False: the model contains no constant term
# 2. Estimate the parameters by using fit(); the constant term ("bias augmentation") is handled internally
#    model=model.fit(X,y)
#    Returns the LinearRegression object itself, now holding the fitted values:
#    * coef_: estimated weight vector
#    * intercept_: estimated constant term
# 3. Make a prediction by using predict()
#    y_pred=model.predict(x_new)
# ================================================================================
# * Code
# * Linear regression analysis by using scikit-learn
from sklearn.datasets import load_boston
from sklearn.linear_model import LinearRegression

boston=load_boston()

X=boston.data
# print("X",X)
# [[6.3200e-03 1.8000e+01 2.3100e+00 ... 1.5300e+01 3.9690e+02 4.9800e+00]
#  [2.7310e-02 0.0000e+00 7.0700e+00 ... 1.7800e+01 3.9690e+02 9.1400e+00]
#  ...
y=boston.target
# print("y",y)
# [24.  21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 15.  18.9 21.7 20.4
#  18.2 19.9 23.1 17.5 20.2 18.2 13.6 19.6 15.2 14.5 15.6 13.9 16.6 14.8
#  ...

model_boston=LinearRegression().fit(X,y)
# print("model_boston.coef_",model_boston.coef_)
# [-1.08011358e-01  4.64204584e-02  2.05586264e-02  2.68673382e+00
#  -1.77666112e+01  3.80986521e+00  6.92224640e-04 -1.47556685e+00
#  ...
# Meaning:
# if the number of rooms (RM) increases by 1, the predicted price y increases by about 3.81,
# i.e. about 3,810 dollars (the target MEDV is measured in units of 1,000 dollars)
# print("boston.feature_names",boston.feature_names)
# ['CRIM' 'ZN' 'INDUS' 'CHAS' 'NOX' 'RM' 'AGE' 'DIS' 'RAD' 'TAX' 'PTRATIO' 'B' 'LSTAT']
# print("model_boston.intercept_",model_boston.intercept_)
# 36.459488385090054
# ================================================================================
X_data=boston.data
y_true_price=boston.target
y_pred_price=model_boston.predict(X_data)
# plt.scatter(y_true_price,y_pred_price)
# plt.xlabel("True price")
# plt.ylabel("Pred price")
# plt.title("Relationship between true price and predicted price")
# plt.show()
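# ================================================================================
# * A minimal sketch (not in the original notes): LinearRegression also provides score(),
#   which returns the coefficient of determination R^2 of the fit; on the full Boston data
#   it should agree with the R-squared value (0.741) reported in the statsmodels summary below.
#   The name r_squared is introduced only for this illustration.
r_squared=model_boston.score(X_data,y_true_price)
# print("r_squared",r_squared)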
# ================================================================================
# * Linear regression analysis by using statsmodels
# 1. Create an OLS (Ordinary Least Squares) model
#    If you need a constant term, use add_constant() to add it first
#    model=sm.OLS(y,X)
# 2. Fit the model
#    result=model.fit()
#    result: RegressionResults object
# 3. Create the report and make predictions
#    print("result.summary()",result.summary())
#    y_pred=result.predict(x_new)
#    If you performed "bias augmentation" on the training data,
#    you also should perform "bias augmentation" on x_new
# ================================================================================
# * Code
import pandas as pd

X_data=boston.data
X_data_feat_names=boston.feature_names
X_wo_const_term=pd.DataFrame(X_data,columns=X_data_feat_names)
X_w_const_term=sm.add_constant(X_wo_const_term)

y_data=boston.target
y_data=pd.DataFrame(y_data,columns=["MEDV"])

lin_reg_model=sm.OLS(y_data,X_w_const_term)
trained_lin_reg_model=lin_reg_model.fit()
# print("trained_lin_reg_model.summary()",trained_lin_reg_model.summary())
#                             OLS Regression Results
# ==============================================================================
# Dep. Variable:                   MEDV   R-squared:                       0.741
# Model:                            OLS   Adj. R-squared:                  0.734
# Method:                 Least Squares   F-statistic:                     108.1
# Date:                Tue, 07 May 2019   Prob (F-statistic):          6.72e-135
# Time:                        11:31:01   Log-Likelihood:                -1498.8
# No. Observations:                 506   AIC:                             3026.
# Df Residuals:                     492   BIC:                             3085.
# Df Model:                          13
# Covariance Type:            nonrobust
# ==============================================================================
#                  coef    std err          t      P>|t|      [0.025      0.975]
# ------------------------------------------------------------------------------
# const         36.4595      5.103      7.144      0.000      26.432      46.487
# CRIM          -0.1080      0.033     -3.287      0.001      -0.173      -0.043
# ZN             0.0464      0.014      3.382      0.001       0.019       0.073
# INDUS          0.0206      0.061      0.334      0.738      -0.100       0.141
# CHAS           2.6867      0.862      3.118      0.002       0.994       4.380
# NOX          -17.7666      3.820     -4.651      0.000     -25.272     -10.262
# RM             3.8099      0.418      9.116      0.000       2.989       4.631
# AGE            0.0007      0.013      0.052      0.958      -0.025       0.027
# DIS           -1.4756      0.199     -7.398      0.000      -1.867      -1.084
# RAD            0.3060      0.066      4.613      0.000       0.176       0.436
# TAX           -0.0123      0.004     -3.280      0.001      -0.020      -0.005
# PTRATIO       -0.9527      0.131     -7.283      0.000      -1.210      -0.696
# B              0.0093      0.003      3.467      0.001       0.004       0.015
# LSTAT         -0.5248      0.051    -10.347      0.000      -0.624      -0.425
# ==============================================================================
# Omnibus:                      178.041   Durbin-Watson:                   1.078
# Prob(Omnibus):                  0.000   Jarque-Bera (JB):              783.126
# Skew:                           1.521   Prob(JB):                    8.84e-171
# Kurtosis:                       8.281   Cond. No.                     1.51e+04
# ==============================================================================
# Warnings:
# [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
# [2] The condition number is large, 1.51e+04. This might indicate that there are
# strong multicollinearity or other numerical problems.
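# ================================================================================
# * A minimal sketch (not in the original notes): RegressionResults.predict() returns the
#   fitted values when called on the bias-augmented training data, so you can reproduce the
#   earlier true-vs-predicted scatter plot with the statsmodels result as well. The name
#   y_fitted is introduced only for this illustration.
y_fitted=trained_lin_reg_model.predict(X_w_const_term)
# plt.scatter(y_data["MEDV"],y_fitted)
# plt.xlabel("True price")
# plt.ylabel("Pred price (statsmodels OLS)")
# plt.show()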
# ================================================================================
# print("X_wo_const_term",X_wo_const_term.shape)
# (506, 13)
# print("X_wo_const_term",X_wo_const_term)
#       CRIM    ZN  INDUS  CHAS  ...    TAX  PTRATIO       B  LSTAT
# 0  0.00632  18.0   2.31   0.0  ...  296.0     15.3  396.90   4.98
# 1  0.02731   0.0   7.07   0.0  ...  242.0     17.8  396.90   9.14

mean_of_X_wo_const_term=X_wo_const_term.mean().values
# print("mean_of_X_wo_const_term",mean_of_X_wo_const_term.shape)
# (13,)
# print("mean_of_X_wo_const_term",mean_of_X_wo_const_term)
# [3.61352356e+00 1.13636364e+01 1.11367787e+01 6.91699605e-02
#  5.54695059e-01 6.28463439e+00 6.85749012e+01 3.79504269e+00
#  9.54940711e+00 4.08237154e+02 1.84555336e+01 3.56674032e+02
#  1.26530632e+01]
================================================================================
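# * A minimal sketch (not in the original notes) of one use of the feature means computed above:
#   prepend the constant term 1 to the mean vector ("bias augmentation") and pass it to predict().
#   Because an OLS fit with an intercept passes through the point of means, the prediction at the
#   mean feature vector equals the mean of MEDV (about 22.53). The names x_mean_aug and
#   pred_at_mean are introduced only for this illustration.
x_mean_aug=np.concatenate(([1.0],mean_of_X_wo_const_term))   # bias augmentation of the mean vector
pred_at_mean=trained_lin_reg_model.predict(x_mean_aug.reshape(1,-1))
# print("pred_at_mean",pred_at_mean)
# print("mean of MEDV",y_data["MEDV"].mean())
================================================================================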