K-Means clustering is an unsupervised learning algorithm: unlike supervised learning, it works on data that has no labels. It divides objects into clusters so that objects in the same cluster are similar to one another and dissimilar to objects in other clusters.
The ‘K’ is the number of clusters, and you have to choose it before running the algorithm. For example, K = 2 produces two clusters. The best value of K for a given dataset can be estimated, for instance with the elbow method.
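To make the idea of grouping similar objects concrete, here is a minimal plain-Python sketch of the usual K-Means loop (often called Lloyd's algorithm): assign each point to its nearest centroid, then move each centroid to the mean of its points. The function name `kmeans`, the toy `points`, and the parameters `n_iter` and `seed` are made up for illustration; this is not how scikit-learn implements it internally.

```python
import random

def kmeans(points, k, n_iter=20, seed=0):
    """Plain-Python K-Means on 2-D points (illustrative sketch only)."""
    rng = random.Random(seed)
    # Start from k distinct data points chosen at random
    centroids = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid
        labels = [
            min(
                range(k),
                key=lambda c: (p[0] - centroids[c][0]) ** 2
                + (p[1] - centroids[c][1]) ** 2,
            )
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = (
                    sum(m[0] for m in members) / len(members),
                    sum(m[1] for m in members) / len(members),
                )
    return labels, centroids

# Hypothetical toy data: two well-separated groups of three points each
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
labels, centroids = kmeans(points, k=2)
print(labels)
print(centroids)
```

With K = 2 the first three points should end up sharing one label and the last three the other; which group gets label 0 depends on the random starting centroids.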
import pandas as pd
import seaborn as sns
df = pd.read_csv('https://raw.githubusercontent.com/theleadio/datascience_demo/master/rfm-agg.csv')
df
sns.scatterplot(data=df, x='Frequency', y='AmountSpent')
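# Number of clusters to create
k=4
from sklearn.cluster import KMeans
# Columns used as clustering features
X=['Frequency','AmountSpent']
kmeans=KMeans(n_clusters=k).fit(df[X])
# Store each row's cluster label as a string so seaborn treats it as a category
df['cluster'] = kmeans.labels_.astype(str)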
df
sns.scatterplot(data=df, x='Frequency', y='AmountSpent', hue='cluster')
df.to_csv('cluster.csv')
distortions = []
K = range(1,10)
for k in K:
    kmeanModel = KMeans(n_clusters=k)
    # Fit on the same two feature columns as above, not the whole frame
    kmeanModel.fit(df[['Frequency','AmountSpent']])
    distortions.append(kmeanModel.inertia_)
distortions
import matplotlib.pyplot as plt
plt.figure(figsize=(16,8))
plt.plot(K, distortions, 'bx-')
plt.xlabel('k')
plt.ylabel('Distortion')
plt.title('The Elbow Method showing the optimal k')
plt.show()
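The value plotted above, `inertia_`, is the sum of squared distances from each sample to its closest cluster center. The sketch below recomputes that quantity by hand on a tiny made-up dataset to show why it drops as k grows; the helper name `inertia`, the points, and the centroids are all hypothetical and chosen so the arithmetic is easy to check.

```python
def inertia(points, centroids):
    """Sum of squared distances from each point to its nearest centroid."""
    total = 0.0
    for x, y in points:
        total += min((x - cx) ** 2 + (y - cy) ** 2 for cx, cy in centroids)
    return total

# Hypothetical toy data: two pairs of points far apart on the x axis
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]

# One centroid in the middle versus one centroid per pair
one_cluster = inertia(points, [(5.0, 1.0)])
two_clusters = inertia(points, [(0.0, 1.0), (10.0, 1.0)])
print(one_cluster, two_clusters)  # 104.0 4.0
```

Going from one cluster to two cuts the value sharply here because the data really has two groups. Adding more clusters after that can only shave off a little, and that flattening is the "elbow" to look for in the plot.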