Silhouette cluster validation






We saw how the k-means algorithm works in my previous post. There we determined the optimal number of clusters (the k value) using the elbow method, which plots the number of clusters against the sum of squared errors. Silhouette cluster validation is another technique for finding the optimal number of clusters, and it gives both a graphical and a numerical measure of how well the data has been clustered.

In this technique, the silhouette value tells us how similar a data point is to its own cluster compared to other clusters. The silhouette value can be calculated with any distance metric, such as the Euclidean or Manhattan distance (which we discussed in an earlier post).

The silhouette value ranges from -1 to +1. The closer the value is to +1, the better the data point matches its own cluster. The closer it is to -1, the worse the match, which suggests the point belongs in a neighbouring cluster instead. A value near 0 means the point lies on the boundary between two clusters. The following steps are used to calculate the silhouette value.

Step 1:
Calculate the mean distance from a data point i to all the other points within the same cluster:

$$a(i) = \frac{1}{|C_i| - 1} \sum_{j \in C_i,\; j \neq i} d(i, j)$$

Ci is the cluster that contains i, and |Ci| is the number of points in it.

i is the data point from which the distance is calculated, and j ranges over the other data points in the same cluster.

d(i,j) is the distance between data points i and j in the cluster Ci.
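Step 1 can be sketched in a few lines of Python. This is only an illustration: the function name, the toy points and the labels are made up for the example, Euclidean distance is assumed, and returning 0 for a single-point cluster is a convention chosen here.

```python
from math import dist  # Euclidean distance between two points (Python 3.8+)


def mean_intra_cluster_distance(i, points, labels):
    """a(i): mean distance from point i to the other points in its own cluster."""
    same = [j for j in range(len(points)) if labels[j] == labels[i] and j != i]
    if not same:
        return 0.0  # single-point cluster: a(i) is taken as 0 in this sketch
    return sum(dist(points[i], points[j]) for j in same) / len(same)


# Hypothetical toy data: two small clusters on a line.
points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (10.0, 0.0), (11.0, 0.0)]
labels = [0, 0, 0, 1, 1]

a0 = mean_intra_cluster_distance(0, points, labels)
print(a0)  # (1 + 2) / 2 = 1.5
```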


Step 2:

Calculate the mean distance from the data point i in cluster Ci to all the data points in another cluster Ck (say). This is the mean dissimilarity of the data point i to the cluster Ck. Repeat this for every cluster other than Ci and take the smallest value:

$$b(i) = \min_{C_k \neq C_i} \frac{1}{|C_k|} \sum_{j \in C_k} d(i, j)$$

The cluster with the smallest mean dissimilarity is the neighboring cluster of i, and also the next best fit cluster for point i.
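A minimal Python sketch of Step 2 follows. The function name and toy data are hypothetical, Euclidean distance is assumed, and the code expects at least two clusters to exist.

```python
from math import dist  # Euclidean distance between two points (Python 3.8+)


def mean_nearest_cluster_distance(i, points, labels):
    """Return (smallest mean distance to another cluster, label of that cluster)."""
    distances_by_cluster = {}
    for j, label in enumerate(labels):
        if label != labels[i]:  # only look at clusters other than i's own
            distances_by_cluster.setdefault(label, []).append(dist(points[i], points[j]))
    means = {label: sum(d) / len(d) for label, d in distances_by_cluster.items()}
    nearest = min(means, key=means.get)  # the neighboring cluster of point i
    return means[nearest], nearest


# Hypothetical toy data: three clusters on a line.
points = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0), (6.0, 0.0), (20.0, 0.0)]
labels = [0, 0, 1, 1, 2]

b0, neighbour = mean_nearest_cluster_distance(0, points, labels)
print(b0, neighbour)  # 5.5 1
```

For point 0, the mean distance to cluster 1 is (5 + 6) / 2 = 5.5 and to cluster 2 it is 20, so cluster 1 is the neighboring cluster.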


Step 3:

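Finally, we can calculate the silhouette value of data point i. Let a(i) be the mean distance from i to the other points in its own cluster (Step 1) and b(i) the smallest mean distance from i to the points of any other cluster (Step 2). Then

$$s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}, \quad \text{if } |C_i| > 1$$

and s(i) = 0 if |Ci| = 1, that is, if i is the only point in its cluster.

This can also be written as:

$$s(i) = \begin{cases} 1 - a(i)/b(i), & \text{if } a(i) < b(i) \\ 0, & \text{if } a(i) = b(i) \\ b(i)/a(i) - 1, & \text{if } a(i) > b(i) \end{cases}$$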
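The two forms of the silhouette value can be checked against each other with a short Python sketch. The function names and input numbers are made up for illustration; a and b stand for the mean distance to the point's own cluster and the smallest mean distance to another cluster.

```python
def silhouette_value(a, b, cluster_size):
    """s(i) = (b - a) / max(a, b), with s(i) = 0 for a single-point cluster."""
    if cluster_size == 1 or a == b:
        return 0.0  # a == b gives 0 anyway; the guard also avoids 0 / 0
    return (b - a) / max(a, b)


def silhouette_piecewise(a, b):
    """The same value written case by case."""
    if a < b:
        return 1 - a / b
    if a > b:
        return b / a - 1
    return 0.0


# A point that sits much closer to its own cluster than to the next one.
print(silhouette_value(1.5, 10.5, cluster_size=3))  # approximately 0.857
# A point that sits closer to the neighboring cluster gets a negative value.
print(silhouette_value(8.0, 2.0, cluster_size=3))   # -0.75
```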


To get the code and step by step explanation for Silhouette validation click here
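As a rough sketch of how the silhouette value can be used to pick k, the snippet below runs k-means for several values of k with scikit-learn and keeps the k with the highest mean silhouette value. The synthetic data, the range of k and the random seeds are assumptions made for this example and are not taken from the linked code.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data (an assumption for illustration): three well-separated blobs.
X, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=1.0,
    random_state=0,
)

scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    # Mean silhouette value over all points, using Euclidean distance.
    scores[k] = silhouette_score(X, labels, metric="euclidean")

best_k = max(scores, key=scores.get)
for k, score in scores.items():
    print(f"k={k}: mean silhouette = {score:.3f}")
print("best k:", best_k)
```

The silhouette value needs at least two clusters, which is why the loop starts at k = 2.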


 
