Enhanced K-means++: The Intelligent Evolution of K-means
Introduction to K-means Clustering
The K-means clustering algorithm is widely recognized among machine learning enthusiasts and is often one of the first algorithms introduced in academic courses.
In brief, K-means clustering is a vector quantization technique that originated in signal processing. Its goal is to partition n observations into k clusters, assigning each observation to the cluster whose mean, which serves as the cluster's representative (or centroid), is closest.
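More formally, given observations x and cluster means μ_i, K-means seeks the partition S = {S_1, ..., S_k} that minimizes the within-cluster sum of squares:

\arg\min_{S} \sum_{i=1}^{k} \sum_{x \in S_i} \lVert x - \mu_i \rVert^2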
Challenges with the Standard K-means Algorithm
However, this beloved algorithm has its drawbacks. K-means is particularly sensitive to how the initial centroids (the starting mean points) are chosen. A centroid initialized far from the data may end up with no points assigned to it, while several distinct groups of points may be lumped together under a single centroid, so the algorithm can converge to a poor local optimum.
Enter K-means++
This is where K-means++ comes into play! It retains the core K-means procedure but initializes the centroids more intelligently: the first centroid is chosen uniformly at random from the data, and each subsequent centroid is chosen from the remaining points with probability proportional to its squared distance from the nearest centroid already selected. Spreading the initial centroids apart in this way typically improves both the quality of the final clustering and the speed of convergence.
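To see the difference in practice, a quick experiment can compare the within-cluster sum of squares (inertia) obtained from a single random initialization against a single K-means++ initialization. This is a minimal sketch assuming scikit-learn is installed; the blob parameters and seeds are arbitrary choices for illustration:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data: four Gaussian blobs (illustrative parameters)
X, _ = make_blobs(n_samples=400, centers=4, cluster_std=1.2, random_state=0)

# n_init=1 so each run depends entirely on a single initialization
random_init = KMeans(n_clusters=4, init='random', n_init=1, random_state=42).fit(X)
plus_init = KMeans(n_clusters=4, init='k-means++', n_init=1, random_state=42).fit(X)

print('Inertia with random init:   ', random_init.inertia_)
print('Inertia with k-means++ init:', plus_init.inertia_)

Over repeated runs with different seeds, the K-means++ initialization tends to produce lower and far less variable inertia.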
Implementation of K-means++ Algorithm
Here’s a practical implementation of the K-means++ algorithm:
# Importing necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sys
# Generating synthetic data
mean_01 = np.array([0.0, 0.0])
cov_01 = np.array([[1, 0.3], [0.3, 1]])
dist_01 = np.random.multivariate_normal(mean_01, cov_01, 100)
mean_02 = np.array([6.0, 7.0])
cov_02 = np.array([[1.5, 0.3], [0.3, 1]])
dist_02 = np.random.multivariate_normal(mean_02, cov_02, 100)
mean_03 = np.array([7.0, -5.0])
cov_03 = np.array([[1.2, 0.5], [0.5, 1.3]])
dist_03 = np.random.multivariate_normal(mean_03, cov_03, 100)
mean_04 = np.array([2.0, -7.0])
cov_04 = np.array([[1.2, 0.5], [0.5, 1.3]])
dist_04 = np.random.multivariate_normal(mean_04, cov_04, 100)
# Combining data
data = np.vstack((dist_01, dist_02, dist_03, dist_04))
np.random.shuffle(data)
# Function to visualize selected centroids
def plot(data, centroids):
    plt.scatter(data[:, 0], data[:, 1], marker='.', color='gray', label='Data Points')
    plt.scatter(centroids[:-1, 0], centroids[:-1, 1], color='black', label='Previously Selected Centroids')
    plt.scatter(centroids[-1, 0], centroids[-1, 1], color='red', label='Next Centroid')
    plt.title('Select %d-th Centroid' % (centroids.shape[0]))
    plt.legend()
    plt.xlim(-5, 12)
    plt.ylim(-10, 15)
    plt.show()
# Function to calculate the squared Euclidean distance between two points
def distance(p1, p2):
    return np.sum((p1 - p2)**2)
# K-means++ initialization: pick k well-separated starting centroids
def initialize(data, k):
    centroids = []
    # Choose the first centroid uniformly at random from the data
    centroids.append(data[np.random.randint(data.shape[0]), :])
    plot(data, np.array(centroids))
    # Choose the remaining k - 1 centroids
    for c_id in range(k - 1):
        # For every point, compute the squared distance to its nearest chosen centroid
        dist = []
        for i in range(data.shape[0]):
            point = data[i, :]
            d = sys.maxsize
            for j in range(len(centroids)):
                temp_dist = distance(point, centroids[j])
                d = min(d, temp_dist)
            dist.append(d)
        dist = np.array(dist)
        # Greedy variant: take the point farthest from all chosen centroids
        # (the original K-means++ samples proportionally to these squared distances)
        next_centroid = data[np.argmax(dist), :]
        centroids.append(next_centroid)
        plot(data, np.array(centroids))
    return centroids
centroids = initialize(data, k=4)
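One detail worth noting: the loop above picks the next centroid deterministically with np.argmax, always taking the point farthest from the centroids chosen so far. The canonical K-means++ algorithm instead samples the next centroid at random, with probability proportional to those squared distances, which makes it less sensitive to outliers. A minimal sketch of how the selection step inside that loop would change:

# Sample the next centroid with probability proportional to the squared
# distance to the nearest already-chosen centroid (canonical K-means++)
probabilities = dist / dist.sum()
next_idx = np.random.choice(data.shape[0], p=probabilities)
next_centroid = data[next_idx, :]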
In the video "Implementing k-Means Clustering from Scratch," you will find a comprehensive, step-by-step guide to implementing the K-means algorithm, including the nuances of its mechanics.
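For completeness, once the centroids are initialized, standard K-means (Lloyd's algorithm) alternates between assigning every point to its nearest centroid and recomputing each centroid as the mean of its assigned points. A minimal sketch of those iterations, reusing the data array and the centroids returned by initialize above (and assuming no cluster ever becomes empty):

# Standard K-means (Lloyd) iterations after K-means++ initialization
centroids = np.array(centroids)
for _ in range(10):  # fixed number of iterations for illustration
    # Assignment step: index of the nearest centroid for every point
    distances = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
    labels = np.argmin(distances, axis=1)
    # Update step: move each centroid to the mean of its assigned points
    centroids = np.array([data[labels == j].mean(axis=0) for j in range(len(centroids))])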
Exploring K-means Clustering Theory
To further understand the theory behind K-means and its enhanced version, K-means++, let's watch this insightful video.
In this second video, "k-means clustering theory algorithm implementation scaling," the discussion revolves around the theoretical foundations of the K-means algorithm, delving into its scalability and practical applications.