Posts

Hyperparameter tuning: Grid Search CV vs Randomized Search CV

Hyperparameters are the variables defined by the developer of the model in order to get the best model possible, for example, max_depth and n_estimators in Random Forest. There are multiple ways to set these hyperparameters; two basic methods are Grid Search and Random Search.

Grid Search
The way Grid Search works is that it tries every combination of the hyperparameter values we decide to use. It is a very effective method for finding the best possible hyperparameters, but since every combination is evaluated along with cross validation, it is a time-consuming and expensive business. (Figure source: encrypted-tbn0.gstatic)

Some important arguments of Grid Search CV:
1. estimator – A scikit-learn model.
2. param_grid – A dictionary with parameter names as keys and lists of parameter values.
3. scoring – The performance measure. For e…
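As a minimal sketch (assuming scikit-learn's GridSearchCV and illustrative parameter values of my own choosing, not recommendations), the three arguments above come together like this:

# Minimal sketch of Grid Search CV on a random forest.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=200, random_state=42)

param_grid = {
    "n_estimators": [50, 100],   # every combination of these
    "max_depth": [3, 5, None],   # values will be tried
}

search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_grid=param_grid,
    scoring="accuracy",  # the performance measure
    cv=5,                # 5-fold cross validation
)
search.fit(X, y)
print(search.best_params_, search.best_score_)

With 2 x 3 = 6 combinations and 5 folds, 30 models are fitted here, which is why the cost grows quickly as the grid gets larger.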

Forecast Accuracy Metrics

We will discuss here the most commonly used forecast accuracy metrics mentioned below:
- Forecast bias
- MAD
- MAPE
- MSE
- RMSE
- MPE

1. Forecast Bias
This metric helps us understand whether the model has overestimated or underestimated. If the value is positive it is an overestimation, and if it is negative it is an underestimation. In order to get the bias as a % of sales, we have Forecast Bias %. (Figure source: relexsolutions) If the value is more than 100% it is an over-forecast, and if it is less than 100% it is an under-forecast. It is an important metric in demand forecasting, as it tells us the over- or under-supply at the central warehouse or distribution centers. It does not give information on the quality of the forecast at a detailed level. The target is to achieve 1 (or 100%), and the deviation from that number, + or -, tells us the direction of the error.

2. Mean Absolute Deviation (MAD)
This metric…
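A rough sketch of the two bias measures (the sample numbers are hypothetical, and Forecast Bias % is taken here as total forecast over total actuals, consistent with the 100% target above):

import numpy as np

actual = np.array([100.0, 120.0, 90.0, 110.0])    # hypothetical sales
forecast = np.array([105.0, 115.0, 95.0, 120.0])  # hypothetical forecast

bias = np.sum(forecast - actual)                    # positive => overestimation
bias_pct = np.sum(forecast) / np.sum(actual) * 100  # > 100% => over-forecast

print(bias, bias_pct)  # 15.0 and ~103.6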

Support Vector Machines

Support Vector Machines is a supervised machine learning approach that helps in solving both classification and regression problems. In the figure below, we have two classes of data (red squares and blue circles). The center line is called the hyperplane and it helps in classifying the two classes. The hyperplane creates two margin lines at a distance from it. These margin lines are parallel lines that pass through the nearest point(s) (the support vectors) of each class (a line can pass through any number of points if the nearest points are equidistant from the hyperplane). There can be any kind of hyperplane (vertical/horizontal), but what drives the hyperplane selection is the marginal distance that is formed. Here the dotted lines mentioned are called margin lines, and the distance between them is called the marginal distance (the larger the distance, the better the model). So in this case we can say that the classes are easily separable using the hyperplane, that is, they are linearly sepa…
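A minimal sketch (assuming scikit-learn's SVC with a linear kernel on synthetic blob data) of fitting such a hyperplane and inspecting the support vectors:

# Fit a linear SVM on two linearly separable classes.
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=60, centers=2, random_state=7)

model = SVC(kernel="linear")  # linear kernel => a separating hyperplane
model.fit(X, y)

# The support vectors are the nearest points that the margin lines pass through.
print(model.support_vectors_)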

k-nearest neighbor

The k-nearest neighbor algorithm is a supervised machine learning algorithm that can be used to solve both classification and regression problems.

Classification usage
Let us understand the algorithm using the figure below, where we have two classes of data points (A and B). Source: www.kdnuggets.com The first step is to determine the K value ('k' here is the number of nearest data points from the new data point). Let us take k = 3 as per the figure above. We can see that 2 out of the 3 neighbors are from class B. So, in this case we go with the majority vote, i.e., the new data point will be classified as class B. We can use either Euclidean or Manhattan distance to get the nearest neighbors.

Regression usage
Below we can see the difference between the regression and classification usage. We will focus on the left figure for regression under this topic. Source: www.jeremyjordan.me Similar to the classification problem, here also we need to come up with a k value. If the k val…
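A minimal sketch of the classification case (assuming scikit-learn's KNeighborsClassifier; the points are made up to mirror the two-class figure):

# k-NN classification with k = 3 and Euclidean distance.
from sklearn.neighbors import KNeighborsClassifier

X = [[1, 1], [1, 2], [2, 1],   # class A (hypothetical points)
     [6, 6], [6, 7], [7, 6]]   # class B
y = ["A", "A", "A", "B", "B", "B"]

knn = KNeighborsClassifier(n_neighbors=3, metric="euclidean")
knn.fit(X, y)

# Majority vote among the 3 nearest neighbors decides the class.
print(knn.predict([[5, 6]]))  # -> ['B']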

Silhouette cluster validation

We have learnt how the k-means algorithm works in my previous post. In that we learnt how to determine the optimal cluster number (k value) using the elbow method, which was a pictorial representation of the number of clusters against the sum of squared errors. Now we have a method which gives both a graphical and a mathematical representation of the optimal cluster value, called the Silhouette cluster validation technique. In this technique, the silhouette value tells us how similar a data point is to its own cluster compared to other clusters. The silhouette value can be calculated with a distance metric such as Euclidean or Manhattan distance (which we have discussed earlier here). The silhouette value ranges from -1 to +1. The closer the value is to +1, the more similar that data point is to its cluster; likewise, the closer the value is to -1, the worse the match (or no match) of that data point with its cluster. The steps below are followed to calcu…
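A minimal sketch (assuming scikit-learn's silhouette_score on synthetic blob data) of scoring several candidate k values:

# Mean silhouette value for k-means with different cluster counts.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=150, centers=3, random_state=0)

for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    # Score lies in [-1, +1]; values closer to +1 indicate a better fit.
    print(k, silhouette_score(X, labels, metric="euclidean"))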

K-means clustering

K-means is a non-parametric method of clustering where we pre-define the number of clusters (non-parametric here meaning that the computational complexity depends on the number of samples). It is an unsupervised machine learning technique. Source: miro.medium.com From the figure above, the algorithm finds the similarity between points and groups them into clusters (green, blue and red in this case). Let us see what steps are required and how the algorithm arrived at the 3 clusters above; a short sketch follows the list.
1. Determine the K value: the K value is basically how many centroids are needed for our data to form the best clusters. Here we can take any value to begin with.
2. Initialize K points (randomly) as cluster centers in the plane (K = 3 in this case).
3. Find the distance between the points and the centroids: here we can calculate the distance using either Euclidean or Manhattan distance (click here to know more about these distances). Based on the shortest distance from the centroids, clusters are formed.
4. Sele…
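A minimal sketch (assuming scikit-learn's KMeans on synthetic blob data) of the procedure outlined above:

# k-means with K = 3, mirroring the steps in the list.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)  # choose K
labels = kmeans.fit_predict(X)  # assign points, update centroids iteratively

print(kmeans.cluster_centers_)  # final centroids of the 3 clusters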

Euclidean Distance and Manhattan Distance

Euclidean distance and Manhattan distance are the two distances that we generally use to find the distance between two data points. Use of these distances can be seen in algorithms like KNN, K-means and hierarchical clustering. We can check whether there are any similarities between two data points with the help of these distances.

Euclidean distance
The smallest possible distance between any two points is called the Euclidean distance. This distance is calculated based on the Pythagorean theorem, as shown in the figure below. Source: cdn-images-1.medium.com This is the case for 2-D data; for higher dimensions we add a term for each additional dimension, so the same formula works for any number of dimensions. Euclidean distance is used where we calculate the distance between any two points in a straight line, irrespective of whether there are any other data points between them. For example, when we Google the distance between Cochin and Delhi by flight, we get a straight line from…
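A minimal worked example (with two hypothetical 2-D points of my own) comparing the two distances:

# Euclidean vs Manhattan distance between two 2-D points.
import math

p = (1.0, 2.0)  # hypothetical points
q = (4.0, 6.0)

euclidean = math.sqrt((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2)  # straight line
manhattan = abs(p[0] - q[0]) + abs(p[1] - q[1])                 # grid path

print(euclidean, manhattan)  # 5.0 and 7.0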