045-001. K-Means clustering
@ In clustering you do not predict a target value y from the data. Instead, you group the samples into clusters using only the features of the independent variable x. This is called clustering.

@ The k-means clustering algorithm is one of the simplest and fastest clustering algorithms. It defines an objective function and, for a given number of clusters K, keeps updating the cluster assignments and the centroid $\mu_{k}$ of each cluster until the objective function reaches its lowest value.

Objective function:

$J = \sum\limits_{k=1}^{K} \sum\limits_{i \in C_{k}} d(x_{i}, \mu_{k})$

Here $d(x_{i}, \mu_{k})$ is a distance (dissimilarity) between a sample and a centroid, usually the squared Euclidean distance:

$d(x_{i}, \mu_{k}) = \|x_{i} - \mu_{k}\|^{2}$

@ The detailed steps of the k-means clustering algorithm are as follows (a minimal NumPy sketch appears right after this list):

1. Choose random initial centroids $\mu_{k}$ (generally by picking random samples from the data).
2. Calculate the distance between each individual sample and every centroid.
3. Assign each sample to its nearest centroid, and update the cluster boundaries accordingly.
4. Recompute each centroid from its newly updated cluster, and iterate steps 2 to 4 until the assignments no longer change.
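@ The following is a minimal sketch of these four steps in plain NumPy, assuming squared Euclidean distance; the function name naive_kmeans and its arguments are illustrative only, not part of any library:

import numpy as np

def naive_kmeans(X, k, n_iter=100, seed=0):
    # Minimal k-means sketch; empty clusters are not handled here.
    rng = np.random.default_rng(seed)
    # Step 1: pick k random samples as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # Step 2: distances from every sample to every centroid, shape (n, k).
        dist = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        # Step 3: assign each sample to its nearest centroid.
        labels = dist.argmin(axis=1)
        # Step 4: recompute each centroid as the mean of its cluster.
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):  # assignments have stabilized
            break
        centroids = new_centroids
    return centroids, labels

With the toy data X defined in the next cell, naive_kmeans(X, 2) produces a two-cluster split comparable to the one scikit-learn's KMeans finds below.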
@ Create a small two-dimensional toy data set and plot it:

import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt

X = np.array([[7, 5], [5, 7], [7, 7], [4, 4], [4, 6], [1, 4],
              [0, 0], [2, 2], [8, 7], [6, 8], [5, 5], [3, 7]])
plt.scatter(X[:, 0], X[:, 1], s=100)
plt.show()

#img 045-001-001

@ First, fit k-means with max_iter=1 so that only a single update step is performed:

from sklearn.cluster import KMeans

model = KMeans(n_clusters=2, init="random", n_init=1, max_iter=1, random_state=1).fit(X)
c0, c1 = model.cluster_centers_
print(c0, c1)
plt.scatter(X[model.labels_ == 0, 0], X[model.labels_ == 0, 1], s=100, marker='v', c='r')
plt.scatter(X[model.labels_ == 1, 0], X[model.labels_ == 1, 1], s=100, marker='^', c='b')
plt.scatter(c0[0], c0[1], s=100, c="r")
plt.scatter(c1[0], c1[1], s=100, c="b")
plt.show()

#img 045-001-002

% [6.6 6.8] [2.71428571 4. ]

def kmeans_df(c0, c1):
    df = pd.DataFrame(np.hstack([X,
                                 np.linalg.norm(X - c0, axis=1)[:, np.newaxis],
                                 np.linalg.norm(X - c1, axis=1)[:, np.newaxis],
                                 model.labels_[:, np.newaxis]]),
                      columns=["x0", "x1", "d0", "d1", "c"])
    return df

kmeans_df(c0, c1)

%        x0   x1        d0        d1    c
% 0     7.0  5.0  6.708204  1.978277  0.0
% 1     5.0  7.0  6.403124  0.895806  0.0
% 2     7.0  7.0  7.810250  1.739164  0.0
% 3     4.0  4.0  3.605551  2.650413  1.0
% 4     4.0  6.0  5.000000  1.461438  1.0
% 5     1.0  4.0  2.000000  4.969040  1.0
% 6     0.0  0.0  2.236068  8.267891  1.0
% 7     2.0  2.0  1.000000  5.448978  1.0
% 8     8.0  7.0  8.602325  2.671292  0.0
% 9     6.0  8.0  7.810250  1.862562  0.0
% 10    5.0  5.0  5.000000  1.300522  0.0
% 11    3.0  7.0  5.385165  2.565199  1.0

print(X[model.labels_ == 0, 0].mean(), X[model.labels_ == 0, 1].mean())
print(X[model.labels_ == 1, 0].mean(), X[model.labels_ == 1, 1].mean())

% 6.333333333333333 6.5
% 2.3333333333333335 3.8333333333333335

model.score(X)

% -63.00408163265304

@ Now fit again, this time allowing two iterations (note that a different random_state is used here); the score improves:

model = KMeans(n_clusters=2, init="random", n_init=1, max_iter=2, random_state=0).fit(X)
c0, c1 = model.cluster_centers_
print(c0, c1)
plt.scatter(X[model.labels_ == 0, 0], X[model.labels_ == 0, 1], s=100, marker='v', c='r')
plt.scatter(X[model.labels_ == 1, 0], X[model.labels_ == 1, 1], s=100, marker='^', c='b')
plt.scatter(c0[0], c0[1], s=100, c="r")
plt.scatter(c1[0], c1[1], s=100, c="b")
kmeans_df(c0, c1)

#img 045-001-003

% [1. 2.] [5.44444444 6.22222222]
%        x0   x1        d0        d1    c
% 0     7.0  5.0  6.708204  1.978277  1.0
% 1     5.0  7.0  6.403124  0.895806  1.0
% 2     7.0  7.0  7.810250  1.739164  1.0
% 3     4.0  4.0  3.605551  2.650413  1.0
% 4     4.0  6.0  5.000000  1.461438  1.0
% 5     1.0  4.0  2.000000  4.969040  0.0
% 6     0.0  0.0  2.236068  8.267891  0.0
% 7     2.0  2.0  1.000000  5.448978  0.0
% 8     8.0  7.0  8.602325  2.671292  1.0
% 9     6.0  8.0  7.810250  1.862562  1.0
% 10    5.0  5.0  5.000000  1.300522  1.0
% 11    3.0  7.0  5.385165  2.565199  1.0

model.score(X)

% -45.777777777777715

@ The score returned by model.score is the negative of the within-cluster sum of squared distances, which can be verified by hand:

(np.linalg.norm(X[model.labels_ == 0] - c0, axis=1)**2).sum() + \
(np.linalg.norm(X[model.labels_ == 1] - c1, axis=1)**2).sum()

% 45.77777777777778

@ With max_iter=100 the algorithm runs until convergence. The centroids and cluster assignments are the same as after two iterations, so it had in fact already converged:

model = KMeans(n_clusters=2, init="random", n_init=1, max_iter=100, random_state=0).fit(X)
c0, c1 = model.cluster_centers_
print(c0, c1)
plt.scatter(X[model.labels_ == 0, 0], X[model.labels_ == 0, 1], s=100, marker='v', c='r')
plt.scatter(X[model.labels_ == 1, 0], X[model.labels_ == 1, 1], s=100, marker='^', c='b')
plt.scatter(c0[0], c0[1], s=100, c="r")
plt.scatter(c1[0], c1[1], s=100, c="b")
kmeans_df(c0, c1)

#img 045-001-004

% [1. 2.] [5.44444444 6.22222222]
%        x0   x1        d0        d1    c
% 0     7.0  5.0  6.708204  1.978277  1.0
% 1     5.0  7.0  6.403124  0.895806  1.0
% 2     7.0  7.0  7.810250  1.739164  1.0
% 3     4.0  4.0  3.605551  2.650413  1.0
% 4     4.0  6.0  5.000000  1.461438  1.0
% 5     1.0  4.0  2.000000  4.969040  0.0
% 6     0.0  0.0  2.236068  8.267891  0.0
% 7     2.0  2.0  1.000000  5.448978  0.0
% 8     8.0  7.0  8.602325  2.671292  1.0
% 9     6.0  8.0  7.810250  1.862562  1.0
% 10    5.0  5.0  5.000000  1.300522  1.0
% 11    3.0  7.0  5.385165  2.565199  1.0

@ K-Means++ is an algorithm for choosing the initial centroids (a small NumPy sketch follows this list):

1. Prepare a set M in which the chosen centroids will be stored.
2. First of all, choose one centroid $\mu_{0}$ at random and store it in M.
3. For every sample $x_{i}$ not in M, calculate its distance $d(M, x_{i})$ to the set M, defined as the minimum of $d(\mu_{k}, x_{i})$ over all centroids $\mu_{k}$ already stored in M.
4. Choose the next centroid $\mu$ at random, with probability proportional to $d(M, x_{i})$.
5. Repeat steps 3 and 4 until k centroids have been chosen.
6. Run the ordinary k-means algorithm with these initial centroids.
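@ As a rough illustration of steps 1 to 5, here is a minimal NumPy sketch of the k-means++ seeding, using the squared Euclidean distance as d; the function name kmeanspp_init is illustrative and not a scikit-learn API:

import numpy as np

def kmeanspp_init(X, k, seed=0):
    # Sketch of k-means++ seeding: spread the initial centroids apart.
    rng = np.random.default_rng(seed)
    # Step 2: the first centroid is a uniformly random sample.
    centroids = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # Step 3: squared distance from each sample to its nearest chosen centroid.
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centroids], axis=0)
        # Step 4: pick the next centroid with probability proportional to that distance.
        centroids.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centroids)

In scikit-learn you normally do not implement this yourself: KMeans(init="k-means++") (the default) performs the same kind of seeding, and KMeans also accepts an explicit ndarray of initial centroids through the init argument.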
@ Let's apply k-means clustering to the Iris data, comparing 3 clusters with 8 clusters:

from mpl_toolkits.mplot3d import Axes3D
from sklearn.cluster import KMeans
from sklearn import datasets

np.random.seed(5)

centers = [[1, 1], [-1, -1], [1, -1]]  # not used below
iris = datasets.load_iris()
X = iris.data
y = iris.target

estimators = {'k_means_iris_3': KMeans(n_clusters=3),
              'k_means_iris_8': KMeans(n_clusters=8)}

fignum = 1
for name, est in estimators.items():
    fig = plt.figure(fignum)
    plt.clf()
    ax = Axes3D(fig, rect=[0, 0, .95, 1], elev=48, azim=134)
    plt.cla()
    est.fit(X)
    labels = est.labels_

    ax.scatter(X[:, 3], X[:, 0], X[:, 2],
               c=labels.astype(float), s=100, cmap=mpl.cm.jet)

    ax.w_xaxis.set_ticklabels([])
    ax.w_yaxis.set_ticklabels([])
    ax.w_zaxis.set_ticklabels([])
    ax.set_xlabel('Petal width')
    ax.set_ylabel('Sepal length')
    ax.set_zlabel('Petal length')
    plt.title(name)
    fignum = fignum + 1

plt.show()

#img 045-001-005
#img 045-001-006

@ Let's cluster the digit images with k-means using the k-means++ initialization:

from sklearn.datasets import load_digits
from sklearn.preprocessing import scale

digits = load_digits()
data = scale(digits.data)

def print_digits(images, labels):
    f = plt.figure(figsize=(10, 2))
    plt.subplots_adjust(top=1, bottom=0, hspace=0, wspace=0.05)
    i = 0
    while (i < 10 and i < images.shape[0]):
        ax = f.add_subplot(1, 10, i + 1)
        ax.imshow(images[i], cmap=plt.cm.bone)
        ax.grid(False)
        ax.set_title(labels[i])
        ax.xaxis.set_ticks([])
        ax.yaxis.set_ticks([])
        plt.tight_layout()
        i += 1

print_digits(digits.images, range(10))

#img 1195e5a1-8724-472e-990b-338c6a8daa6b

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test, images_train, images_test = \
    train_test_split(data, digits.target, digits.images,
                     test_size=0.25, random_state=42)

from sklearn.cluster import KMeans

clf = KMeans(init="k-means++", n_clusters=10, random_state=42)
clf.fit(X_train)
print_digits(images_train, clf.labels_)

#img 2eeb3735-ee7b-4c9e-912d-87a4b611f4ed

y_pred = clf.predict(X_test)

def print_cluster(images, y_pred, cluster_number):
    images = images[y_pred == cluster_number]
    y_pred = y_pred[y_pred == cluster_number]
    print_digits(images, y_pred)

for i in range(10):
    print_cluster(images_test, y_pred, i)

#img 7fe8cafd-06bd-44e9-9ebc-9f24846465d7

from sklearn.metrics import confusion_matrix

confusion_matrix(y_test, y_pred)

% array([[ 0,  0, 43,  0,  0,  0,  0,  0,  0,  0],
%        [20,  0,  0,  7,  0,  0,  0, 10,  0,  0],
%        [ 5,  0,  0, 31,  0,  0,  0,  1,  1,  0],
%        [ 1,  0,  0,  1,  0,  1,  4,  0, 39,  0],
%        [ 1, 50,  0,  0,  0,  0,  1,  2,  0,  1],
%        [ 1,  0,  0,  0,  1, 41,  0,  0, 16,  0],
%        [ 0,  0,  1,  0, 44,  0,  0,  0,  0,  0],
%        [ 0,  0,  0,  0,  0,  1, 34,  1,  0,  5],
%        [21,  0,  0,  0,  0,  3,  1,  2, 11,  0],
%        [ 0,  0,  0,  0,  0,  2,  3,  3, 40,  0]], dtype=int64)

@ Note that the cluster labels are assigned arbitrarily, so the columns of this matrix do not correspond to the digit values; it only shows how the samples of each true digit are spread over the ten clusters.

@ You can evaluate the performance of a clustering with the following criteria (a short example using them on the digits clustering follows this list):

1. When you know the true answer (the number of clusters and the cluster each sample belongs to), you can use
   - Adjusted Rand index
   - Adjusted Mutual Information
   - Homogeneity, completeness and V-measure
   - Fowlkes-Mallows scores
2. When you do not know the true answer, you can use
   - Silhouette Coefficient
   - Calinski-Harabasz Index
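@ Since the true digit labels are available here, the criteria in item 1 can be computed directly with sklearn.metrics. A brief sketch, reusing y_test and y_pred from the digits example above (the printed values depend on the data split and are not shown):

from sklearn import metrics

# External (label-aware) criteria: compare the cluster labels with the true digits.
print("ARI  :", metrics.adjusted_rand_score(y_test, y_pred))
print("AMI  :", metrics.adjusted_mutual_info_score(y_test, y_pred))
print("H/C/V:", metrics.homogeneity_completeness_v_measure(y_test, y_pred))
print("FMI  :", metrics.fowlkes_mallows_score(y_test, y_pred))

# Internal (label-free) criterion: the silhouette uses only the data and the labels.
print("Silhouette:", metrics.silhouette_score(X_test, y_pred))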
@ Example of the silhouette coefficient:

$s = \frac{b - a}{\max(a, b)}$

a : the average distance between a sample and the other points in its own cluster
b : the average distance between a sample and the points in the nearest cluster it does not belong to

A value of s close to 1 means the sample lies well inside its own cluster, while a value near 0 means it lies on the boundary between two clusters. A small hand-computed sketch follows this paragraph; after it comes the full silhouette-analysis example from
http://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_silhouette_analysis.html
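@ The following sketch computes a, b and s by hand for one sample and checks the result against silhouette_samples; the tiny two-group data set and the index i = 0 are arbitrary choices for illustration:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples

pts = np.array([[0, 0], [0, 1], [1, 0], [8, 8], [8, 9], [9, 8]], dtype=float)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(pts)

i = 0                                               # sample to inspect
same = (labels == labels[i])
dist = np.linalg.norm(pts - pts[i], axis=1)

a = dist[same & (np.arange(len(pts)) != i)].mean()  # mean distance within own cluster
b = dist[~same].mean()                              # mean distance to the other cluster
                                                    # (with two clusters it is automatically the nearest one)
s = (b - a) / max(a, b)

print(s, silhouette_samples(pts, labels)[i])        # the two values should agree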
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score
import matplotlib.cm as cm

# Generating the sample data from make_blobs
# This particular setting has one distinct cluster and 3 clusters placed close
# together.
X, y = make_blobs(n_samples=500,
                  n_features=2,
                  centers=4,
                  cluster_std=1,
                  center_box=(-10.0, 10.0),
                  shuffle=True,
                  random_state=1)  # For reproducibility

range_n_clusters = [2, 3, 4, 5, 6]

for n_clusters in range_n_clusters:
    # Create a subplot with 1 row and 2 columns
    fig, (ax1, ax2) = plt.subplots(1, 2)
    fig.set_size_inches(18, 7)

    # The 1st subplot is the silhouette plot
    # The silhouette coefficient can range from -1, 1 but in this example all
    # lie within [-0.1, 1]
    ax1.set_xlim([-0.1, 1])
    # The (n_clusters+1)*10 is for inserting blank space between silhouette
    # plots of individual clusters, to demarcate them clearly.
    ax1.set_ylim([0, len(X) + (n_clusters + 1) * 10])

    # Initialize the clusterer with n_clusters value and a random generator
    # seed of 10 for reproducibility.
    clusterer = KMeans(n_clusters=n_clusters, random_state=10)
    cluster_labels = clusterer.fit_predict(X)

    # The silhouette_score gives the average value for all the samples.
    # This gives a perspective into the density and separation of the formed
    # clusters
    silhouette_avg = silhouette_score(X, cluster_labels)
    print("For n_clusters =", n_clusters,
          "The average silhouette_score is :", silhouette_avg)

    # Compute the silhouette scores for each sample
    sample_silhouette_values = silhouette_samples(X, cluster_labels)

    y_lower = 10
    for i in range(n_clusters):
        # Aggregate the silhouette scores for samples belonging to
        # cluster i, and sort them
        ith_cluster_silhouette_values = \
            sample_silhouette_values[cluster_labels == i]

        ith_cluster_silhouette_values.sort()

        size_cluster_i = ith_cluster_silhouette_values.shape[0]
        y_upper = y_lower + size_cluster_i

        # cm.spectral was removed from newer matplotlib; nipy_spectral is the current name.
        color = cm.nipy_spectral(float(i) / n_clusters)
        ax1.fill_betweenx(np.arange(y_lower, y_upper),
                          0, ith_cluster_silhouette_values,
                          facecolor=color, edgecolor=color, alpha=0.7)

        # Label the silhouette plots with their cluster numbers at the middle
        ax1.text(-0.05, y_lower + 0.5 * size_cluster_i, str(i))

        # Compute the new y_lower for next plot
        y_lower = y_upper + 10  # 10 for the 0 samples

    ax1.set_title("The silhouette plot for the various clusters.")
    ax1.set_xlabel("The silhouette coefficient values")
    ax1.set_ylabel("Cluster label")

    # The vertical line for average silhouette score of all the values
    ax1.axvline(x=silhouette_avg, color="red", linestyle="--")

    ax1.set_yticks([])  # Clear the yaxis labels / ticks
    ax1.set_xticks([-0.1, 0, 0.2, 0.4, 0.6, 0.8, 1])

    # 2nd Plot showing the actual clusters formed
    colors = cm.nipy_spectral(cluster_labels.astype(float) / n_clusters)
    ax2.scatter(X[:, 0], X[:, 1], marker='.', s=30, lw=0, alpha=0.7, c=colors)

    # Labeling the clusters
    centers = clusterer.cluster_centers_
    # Draw white circles at cluster centers
    ax2.scatter(centers[:, 0], centers[:, 1],
                marker='o', c="white", alpha=1, s=200)

    for i, c in enumerate(centers):
        ax2.scatter(c[0], c[1], marker='$%d$' % i, alpha=1, s=50)

    ax2.set_title("The visualization of the clustered data.")
    ax2.set_xlabel("Feature space for the 1st feature")
    ax2.set_ylabel("Feature space for the 2nd feature")

    plt.suptitle(("Silhouette analysis for KMeans clustering on sample data "
                  "with n_clusters = %d" % n_clusters),
                 fontsize=14, fontweight='bold')

plt.show()

#img d5ea3c80-4331-4fae-b846-6dcc569b84e5
#img 908c0b45-5892-4504-8c88-ce5c940d3a1e
#img 968c5d9c-bf48-49b9-9a90-92c1b3c470bf
#img f64b257f-2f44-45f1-b29c-c2e19d2bf180
#img 01c4ef47-f039-4e22-b245-5fb0efe62522
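@ The Calinski-Harabasz index mentioned above can be used in the same way to compare different numbers of clusters; higher values are better. A minimal sketch, assuming a reasonably recent scikit-learn (in older versions the function is named calinski_harabaz_score):

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import calinski_harabasz_score

X, _ = make_blobs(n_samples=500, centers=4, random_state=1)

# The index is the ratio of between-cluster to within-cluster dispersion,
# so a good choice of n_clusters should stand out as a clear peak.
for k in [2, 3, 4, 5, 6]:
    labels = KMeans(n_clusters=k, n_init=10, random_state=10).fit_predict(X)
    print(k, calinski_harabasz_score(X, labels))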