K-Means Clustering and Implementation with sklearn

This article will explain the principles and applications of the K-Means clustering method from both mathematical and coding perspectives.

Cluster analysis allows us to find groups of similar samples or features, where the objects within a group are more strongly correlated with one another than with objects outside it. Common uses include grouping samples by gene expression pattern, or grouping genes by how they behave across different classes of samples.
This article will introduce the k-means clustering algorithm:

  • Basic concepts of k-means clustering
  • Mathematical principles behind the k-means algorithm
  • Advantages and disadvantages of k-means
  • Implementation using the scikit-learn package
  • Visualization of clustering
  • Selecting the optimal k

Basic Concepts of k-means Clustering

k-means is an efficient unsupervised clustering method that originated in signal processing. It aims to partition n observations into k clusters, where each observation belongs to the cluster with the nearest mean (cluster center, or centroid).
It is easy to confuse k-means with k-nearest neighbors (KNN); despite the similar name, KNN is a supervised classification method, not a clustering method, so care should be taken.

Mathematical Principles Behind the k-means Algorithm

Given a set of observations \((x_1, x_2, \ldots, x_n)\), where each observation is a d-dimensional real vector, k-means clustering aims to partition the n observations into k (≤ n) sets \(S = \{S_1, S_2, \ldots, S_k\}\) so as to minimize the within-cluster sum of squares (WCSS, i.e., the within-cluster variance) and thereby maximize between-cluster differences. Formally:

\[
\underset{S}{\operatorname{arg\,min}} \sum_{i=1}^{k} \sum_{x \in S_i} \lVert x - \mu_i \rVert^2
\]

where \(\mu_i\) is the mean of the points in \(S_i\).
This objective is equivalent to minimizing the pairwise squared distances between points within the same cluster:

\[
\underset{S}{\operatorname{arg\,min}} \sum_{i=1}^{k} \frac{1}{|S_i|} \sum_{x, y \in S_i} \lVert x - y \rVert^2
\]
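
Minimizing this objective exactly is computationally hard, so in practice k-means uses Lloyd's algorithm: alternately assign each point to its nearest centroid, then recompute each centroid as the mean of its assigned points. Below is a minimal NumPy sketch of this iteration (the function name and structure are illustrative, not taken from any library):

import numpy as np

def lloyd_kmeans(points, k, n_iter=100, seed=0):
    # Minimal k-means via Lloyd's algorithm (illustrative sketch)
    rng = np.random.default_rng(seed)
    # Initialize centroids with k distinct data points chosen at random
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: label each point with its nearest centroid
        dists = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of its assigned points
        # (empty clusters are not handled here, for brevity)
        new_centroids = np.array([points[labels == j].mean(axis=0)
                                  for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break  # means stopped moving, so the algorithm has converged
        centroids = new_centroids
    return centroids, labels

Each full iteration can only decrease the WCSS, which guarantees convergence, though only to a local optimum (a point revisited in the disadvantages below).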

Advantages and Disadvantages of k-means

Advantages:

  1. Simple principle and fast implementation
  2. Good clustering performance
  3. High interpretability of the algorithm
  4. Only requires tuning the number of clusters k

Disadvantages:

  1. The value of k is difficult to choose
  2. Performs poorly on non-convex cluster shapes
  3. Uses iterative refinement, which only guarantees a local optimum (a common mitigation is sketched after this list)
  4. Sensitive to noise and outliers
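
On the local-optimum point: the result depends on the initial centroids, and the standard mitigations are smarter seeding plus multiple restarts. scikit-learn exposes both through the init='k-means++' and n_init parameters of KMeans. The snippet below is a minimal sketch of the comparison; the toy data is generated only for this demonstration:

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Toy data for this comparison only
x_demo, _ = make_blobs(n_samples=150, n_features=2, centers=3,
                       cluster_std=0.5, random_state=0)

# Compare purely random initialization against k-means++ seeding;
# n_init=10 reruns each setting 10 times and keeps the best result
for init in ('random', 'k-means++'):
    km = KMeans(n_clusters=3, init=init, n_init=10, random_state=0)
    print(init, '-> final WCSS:', km.fit(x_demo).inertia_)

On well-separated blobs like these, both settings usually reach the same optimum; the difference shows up on harder, overlapping data.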

Implementation Using scikit-learn

First, randomly generate sample data with known groups using make_blobs, then cluster it with the KMeans class:

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

# from 春江暮客

x, y = make_blobs(n_samples=150, n_features=2, centers=3,
                  cluster_std=0.5, shuffle=True, random_state=1024)
km = KMeans(n_clusters=3, init='random', n_init=10,
            max_iter=300, tol=1e-04, random_state=1024)
y_km = km.fit_predict(x)
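
After fitting, the model exposes the learned centroids and the final WCSS, which gives a quick way to sanity-check the result:

print(km.cluster_centers_)  # coordinates of the three learned centroids
print(km.inertia_)          # final within-cluster sum of squares (WCSS)
print(y_km[:10])            # cluster labels assigned to the first 10 samples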

Visualization of Clustering

The KMeans clustering results can be visualized using matplotlib or seaborn.

First, using seaborn:

# seaborn

df = pd.DataFrame(x, columns=["x", "y"])
df["color"] = y_km
sns.scatterplot(data=df, x="x", y="y", hue="color", style="color")
# Plot cluster centers
plt.scatter(
    km.cluster_centers_[:, 0], km.cluster_centers_[:, 1],
    s=250, marker='*',
    c='red', edgecolor='black',
    label='center'
)
plt.show()

Result: a scatter plot of the three clusters, one color and marker per cluster, with the centroids drawn as red stars.

Next, directly using matplotlib:

# matplotlib
plt.scatter(
    x[y_km == 0, 0], x[y_km == 0, 1],
    s=50, c='green',
    marker='s', edgecolor='black',
    label='cluster1'
)

plt.scatter(
    x[y_km == 1, 0], x[y_km == 1, 1],
    s=50, c='orange',
    marker='o', edgecolor='black',
    label='cluster2'
)

plt.scatter(
    x[y_km == 2, 0], x[y_km == 2, 1],
    s=50, c='blue',
    marker='v', edgecolor='black',
    label='cluster3'
)

# Plot cluster centers
plt.scatter(
    km.cluster_centers_[:, 0], km.cluster_centers_[:, 1],
    s=250, marker='*',
    c='red', edgecolor='black',
    label='center'
)
plt.legend(scatterpoints=1)
plt.grid()
plt.show()

Result: the same three clusters, drawn with one scatter call per cluster, again with the centroids as red stars.

Selecting the Optimal k

For our generated data, k-means clustering works very well. However, as mentioned earlier, k-means is strongly influenced by the choice of k, and in multidimensional data the number of clusters is not easy to judge by eye. Theoretically, the within-cluster variance keeps decreasing as k increases, so the objective alone cannot tell us where to stop. Evaluating the best k is therefore an important step; here we use the elbow method, plotting the distortion (the WCSS, exposed by scikit-learn as inertia_) against k and looking for the point where the curve bends.

distortions = []
for i in range(1, 11):
    km = KMeans(
        n_clusters=i, init='random',
        n_init=10, max_iter=300,
        tol=1e-04, random_state=1024
    )
    km.fit(x)
    distortions.append(km.inertia_)

# Plot
plt.plot(range(1, 11), distortions, marker='o')
plt.xlabel('Number of clusters')
plt.ylabel('Distortion')
plt.show()

Result: the distortion curve drops steeply up to k = 3 and flattens afterwards, forming an elbow.

You can see that once k reaches 3, the distortion no longer changes drastically as k increases, so k = 3 is the ideal number of clusters for this data.
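
Reading the elbow off the plot is subjective. As a rough heuristic (an illustrative addition, not a scikit-learn feature), you can print how much each extra cluster reduces the distortion and watch where the gain collapses:

# Distortion drop gained by each additional cluster
drops = [distortions[i] - distortions[i + 1] for i in range(len(distortions) - 1)]
for k, d in zip(range(2, 11), drops):
    print(f"k={k}: distortion drops by {d:.1f}")
# The drop falls off sharply once k exceeds the true cluster count (here, 3)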

Summary

This article started from the principles of k-means, performed clustering step by step with sklearn, visualized the results with matplotlib and seaborn, and finally presented the elbow method for finding the optimal number of clusters.
