K-Means clustering is an unsupervised learning algorithm. There is no labeled data for this clustering, unlike in supervised learning. K-Means performs the division of objects into clusters that share similarities and are dissimilar to the objects belonging to another cluster.

The term ‘K’ is a number. You need to tell the system how many clusters you need to create. For example, K = 2 refers to two clusters. There is a way of finding out what is the best or optimum value of K for a given data.

In [ ]:
import pandas as pd
import seaborn as sns
In [ ]:
df = pd.read_csv('https://raw.githubusercontent.com/theleadio/datascience_demo/master/rfm-agg.csv')
In [ ]:
df
Out[ ]:
CustomerID Frequency AmountSpent
0 12347 27 75.54
1 12348 5 81.13
2 12349 16 53.57
3 12350 3 3.95
4 12352 17 109.70
... ... ... ...
4089 18280 3 12.25
4090 18281 2 2.07
4091 18282 2 19.50
4092 18283 132 227.15
4093 18287 15 20.37

4094 rows × 3 columns

In [ ]:
sns.scatterplot(data=df, x='Frequency', y='AmountSpent')
Out[ ]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f2845b77e10>
In [ ]:
k=4 
from sklearn.cluster import KMeans
X=['Frequency','AmountSpent']
kmeans=KMeans(n_clusters=k).fit(df[X])
In [ ]:
df['cluster'] = kmeans.labels_.astype(str)
In [ ]:
df
Out[ ]:
CustomerID Frequency AmountSpent cluster
0 12347 27 75.54 0
1 12348 5 81.13 0
2 12349 16 53.57 0
3 12350 3 3.95 0
4 12352 17 109.70 0
... ... ... ... ...
4089 18280 3 12.25 0
4090 18281 2 2.07 0
4091 18282 2 19.50 0
4092 18283 132 227.15 3
4093 18287 15 20.37 0

4094 rows × 4 columns

In [ ]:
sns.scatterplot(data=df, x='Frequency', y='AmountSpent', hue='cluster')
Out[ ]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f28429eddd0>
In [ ]:
df.to_csv('cluster.csv')
In [ ]:
distortions = []
K = range(1,10)
for k in K:
    kmeanModel = KMeans(n_clusters=k)
    kmeanModel.fit(df)
    distortions.append(kmeanModel.inertia_)
In [ ]:
distortions
Out[ ]:
[12423553328.08133,
 3352212193.438077,
 1632058999.1078076,
 1046448933.0939847,
 759111212.95807,
 561101201.1299284,
 418427812.44608366,
 327194165.33339405,
 272561781.22020674]
In [ ]:
import matplotlib.pyplot as plt
plt.figure(figsize=(16,8))
plt.plot(K, distortions, 'bx-')
plt.xlabel('k')
plt.ylabel('Distortion')
plt.title('The Elbow Method showing the optimal k')
plt.show()
In [ ]: