Posts

Showing posts from April, 2021

k-nearest neighbor

KNN

The k-nearest neighbor algorithm is a supervised machine learning algorithm that can be used to solve both classification and regression problems.

Classification usage

Let us understand the algorithm using the figure below, where we have two classes of data points (A and B). Source: www.kdnuggets.com. The first step is to determine the k value ('k' being the number of nearest data points considered around the new data point). Let us take k = 3, as in the figure. We can see that 2 out of 3 neighbors are from class B, so we go with the majority vote, i.e., the new data point is classified as class B. We can use either Euclidean or Manhattan distance to find the nearest neighbors.

Regression usage

The figure below shows the difference between the regression and classification usage; we will focus on the left figure for regression under this topic. Source: www.jeremyjordan.me. Similar to the classification problem, here also we need to come up with a k value. If the k val…
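A minimal sketch of the idea using scikit-learn (the library is my assumption; the post does not name one). The two-class points below are made up to mirror the class A / class B figure, and `p=2` selects Euclidean distance while `p=1` would give Manhattan.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

# Hypothetical 2-D points: the first four belong to class A, the last four to class B.
X = np.array([[1, 2], [2, 1], [1, 1], [2, 2],
              [6, 7], [7, 6], [6, 6], [7, 7]])
y = np.array(["A", "A", "A", "A", "B", "B", "B", "B"])

# k = 3 neighbours; p=2 is Euclidean distance, p=1 would be Manhattan.
clf = KNeighborsClassifier(n_neighbors=3, p=2)
clf.fit(X, y)
print(clf.predict([[5, 5]]))   # class decided by majority vote of the 3 nearest neighbours

# For regression the prediction is the mean of the k nearest targets
# instead of a majority vote (illustrative target values below).
reg = KNeighborsRegressor(n_neighbors=3)
reg.fit(X, [1.0, 1.2, 0.9, 1.1, 6.8, 7.1, 7.0, 6.9])
print(reg.predict([[5, 5]]))
```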

Silhouette cluster validation

Silhouette cluster validation

We learnt how the k-means algorithm works in my previous post, where we determined the optimal number of clusters (the k value) using the elbow method, a plot of the number of clusters against the sum-of-squared-errors value. Now we have a method that gives both a graphical and a mathematical view of the optimal cluster count, called the Silhouette cluster validation technique. In this technique, the silhouette value tells us how similar a data point is to its own cluster compared to the other clusters. The silhouette value can be calculated with distances such as the Euclidean and Manhattan distances (which we have discussed earlier here). The silhouette value ranges from -1 to +1: the closer the value is to +1, the better the data point matches its own cluster, and the closer it is to -1, the worse (or absent) the match with its cluster. The steps below are followed to calcu…
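A minimal sketch of how the silhouette value can be used to pick k, assuming scikit-learn (the post does not name a library); the blob data is synthetic and only for illustration.

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic data with four natural groups (illustrative only).
X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

# For each candidate k, report the mean silhouette value (computed here with
# Euclidean distance); the k whose score is closest to +1 is preferred.
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    print(k, round(silhouette_score(X, labels, metric="euclidean"), 3))
```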

K means clustering

K means clustering

K-means is a non-parametric clustering method where we pre-define the number of clusters (non-parametric here meaning the computational complexity depends on the number of samples). It is an unsupervised machine learning technique. Source: miro.medium.com. From the above figure, the algorithm finds the similarity between points and groups them into clusters (green, blue and red in this case). Let us see what steps are required and how the algorithm arrives at the 3 clusters above.

Determine the K value: K is simply how many centroids are needed for our data to form the best clusters. Here we can take any value to begin with.

Initialize K points (randomly) as cluster centers in the plane (K = 3 in this case).

Find the distance from the points to the nearest centroids: here we can calculate the distance using either Euclidean or Manhattan distance (click here to know more about these distances). Based on the shortest distance from the centroids, clusters are formed.

Sele…
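The (truncated) steps above can be sketched directly in NumPy. This is only an illustration of the assign-then-recompute loop under my own assumptions: it ignores edge cases such as a centroid losing all of its points, and the toy data is made up.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Initialize k randomly chosen points as the cluster centers.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign every point to its nearest centroid (Euclidean distance here).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of the points assigned to it.
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break  # centroids stopped moving, so the clusters are stable
        centroids = new_centroids
    return labels, centroids

# Toy data: three rough groups, so K = 3.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=c, size=(50, 2)) for c in [(0, 0), (5, 5), (0, 5)]])
labels, centroids = kmeans(X, k=3)
print(centroids)
```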

Euclidean Distance and Manhattan Distance

Euclidean and Manhattan Distance

Euclidean distance and Manhattan distance are the two distances that we generally use to measure how far apart two data points are. They appear in algorithms such as KNN, k-means and hierarchical clustering, and they let us check how similar two data points are.

Euclidean distance

Euclidean distance is the straight-line distance between two points, and hence the smallest possible distance between them. It is calculated using the Pythagoras theorem, as shown in the figure below. Source: cdn-images-1.medium.com. The figure shows the 2-D case; for higher dimensions we simply add a term for each additional dimension, so the same formula works for any number of dimensions. Euclidean distance is used when we want the distance between two points along a straight line, irrespective of whether there are any other data points between them. For example, when we google the distance between Cochin and Delhi by flight, we get a straight point from…
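Both distances are one-liners in NumPy; the points below are arbitrary examples of my own, and the same code works for any number of dimensions.

```python
import numpy as np

p = np.array([1.0, 2.0, 3.0])   # arbitrary 3-D points; any dimension works
q = np.array([4.0, 6.0, 8.0])

# Euclidean distance: square root of the summed squared differences (Pythagoras).
euclidean = np.sqrt(np.sum((p - q) ** 2))   # equivalently np.linalg.norm(p - q)

# Manhattan distance: sum of the absolute differences along each axis.
manhattan = np.sum(np.abs(p - q))           # equivalently np.linalg.norm(p - q, ord=1)

print(euclidean, manhattan)
```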

Ensemble Technique 2 : AdaBoost

AdaBoost

AdaBoost, or Adaptive Boosting, is a boosting technique used as an ensemble method in machine learning. In this algorithm, instance weights are re-assigned at every iteration, with higher weights given to the wrongly classified instances. The main motive behind the algorithm is to combine weak learners into a strong learner. Source: miro.medium.com

Boosting Technique

As we can see from the above figure, suppose the original data is D1 and the trained classifier shown below it is a base learner, BL1 (say). BL1 can be any model/algorithm trained on a sample of the data. After training, when the original data is used to validate the model, the incorrectly classified records are treated further by passing them to the next base learner (BL2). BL2 is then trained on those wrongly classified records, and if it in turn misclassifies some more records, another base learner (BL3) is created, and this process continues till we specify some limits to th…
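A minimal AdaBoost sketch assuming scikit-learn (not named in the post): each base learner is a decision stump, and `n_estimators` plays the role of the limit on how many base learners are chained. Note that recent scikit-learn versions use the `estimator` argument; older releases call it `base_estimator`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary-classification data, purely illustrative.
X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each base learner (BL1, BL2, ...) is a one-level tree; misclassified rows
# get a higher weight before the next learner is fitted.
ada = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),   # base_estimator in older versions
    n_estimators=50,
    random_state=0,
)
ada.fit(X_train, y_train)
print(ada.score(X_test, y_test))
```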

Ensemble Technique 1 : Random Forest

Random Forest

An ensemble technique is a technique where we combine multiple models, and random forest is an ensemble technique built from multiple decision trees. Random forest comes under the bagging technique (also known as bootstrap aggregation).

Bagging Technique

Source: Wikipedia

Consider the above picture, where we have the original data, D (say), and multiple base learners or base models or classifiers, M1, M2, ..., Mn (say). Each model is provided with a sample of the data (D'), and the sampling for each model is row sampling with replacement. Each model gets trained on its sample of data. Once the models are trained, test data is provided to validate each model. As each model comes up with an output, the majority output (a voting classifier) is taken as the final output for classification. For example, from the above figure, if the majority of the classifiers give blue as the output, we take blue as our final output. In this case, bootstr…
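A minimal random-forest sketch assuming scikit-learn and its bundled iris data (both my assumptions): `bootstrap=True` gives the row sampling with replacement described above, and the final prediction is the majority vote across the trees.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 100 decision trees, each trained on a bootstrap sample (rows drawn with
# replacement); max_features limits the columns considered at each split.
rf = RandomForestClassifier(n_estimators=100, bootstrap=True,
                            max_features="sqrt", random_state=0)
rf.fit(X_train, y_train)
print(rf.score(X_test, y_test))   # prediction = majority vote of the trees
```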

Decision Tree

Decision Tree

A decision tree is a decision support tool that uses a tree-like model of decisions and their possible consequences, including chance event outcomes, resource costs, and utility. (Source: Wikipedia)

A decision tree can be used to solve classification as well as regression problems, though in most cases we use it for classification. In this algorithm the model generates a tree-structured classifier. In a decision tree, there are two types of nodes to consider: decision nodes (nodes that branch further into other nodes) and leaf nodes (the last nodes in the tree, which provide the final class labels). The goal of the algorithm is to reach the leaf nodes as quickly as possible. The entire data is provided to the root node (a decision node), where the splitting happens. To construct a decision tree, there is an algorithm called ID3 (Iterative Dichotomiser 3). The first step is to select the right attribute/feature for splitting the decision tree. For this we need to use…
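A minimal sketch assuming scikit-learn and its iris data (my assumptions): `criterion="entropy"` picks the splitting feature by information gain, which is the idea behind ID3, although scikit-learn's implementation is CART-based rather than a literal ID3.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)

# "entropy" selects the attribute with the highest information gain at each
# decision node; the leaf nodes carry the final class labels.
tree = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=0)
tree.fit(X, y)

# Print the learnt decision nodes and leaf nodes as text.
print(export_text(tree, feature_names=["sepal_len", "sepal_wid",
                                       "petal_len", "petal_wid"]))
```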