Ensemble Technique 1: Random Forest

An ensemble technique combines multiple models into one stronger model, and a random forest is an ensemble of multiple decision trees. Random forest comes under the bagging technique (also known as bootstrap aggregation).


Bagging Technique


(Figure: the bagging workflow. Source: Wikipedia)


Consider the picture above, where we have an original dataset, say D, and multiple base learners (also called base models or classifiers), say M1, M2, ..., Mn. Each model is provided with a sample of the data (D'), and the sampling for each model is row sampling with replacement. Each model is then trained on its own sample.
Once the models are trained, test data is provided to validate each of them. Each model produces an output, and a voting classifier takes the majority output as the final classification. For example, in the figure above, if the majority of the classifiers output blue, we take blue as our final output. Here, bootstrapping means generating multiple samples (and hence multiple models) from a single dataset, and aggregation means combining their outputs (in this case, taking the majority vote).
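To make bootstrapping and aggregation concrete, here is a minimal sketch of bagging written from scratch. It assumes scikit-learn's DecisionTreeClassifier as the base learner and the bundled breast-cancer dataset; the number of base models and the random seeds are illustrative choices, not part of the description above.

# A minimal sketch of bagging: bootstrap sampling + majority-vote aggregation.
# The base learner, dataset, n_models and seeds are illustrative assumptions.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

n_models = 10
rng = np.random.default_rng(0)
models = []
for _ in range(n_models):
    # Row sampling with replacement (bootstrapping) from the training data.
    idx = rng.integers(0, len(X_train), size=len(X_train))
    models.append(DecisionTreeClassifier().fit(X_train[idx], y_train[idx]))

# Aggregation: each model votes, and the majority class is the final output.
votes = np.array([m.predict(X_test) for m in models])   # shape: (n_models, n_test)
majority = np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)
print("Bagged accuracy:", (majority == y_test).mean())

Each row of votes is one base model's predictions; taking the most frequent label column by column is exactly the voting classifier described above.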


Random Forest
Now that we know what the bagging technique means: a random forest is a bagging technique whose base models are all decision trees.




Suppose we have a dataset D with d records and m columns, and multiple decision trees as our base models (M1, M2, M3, ..., Mk). To build the sample D' for each decision tree we do both row sampling with replacement (RS) and feature sampling (FS) together (in practice the features are usually sampled without replacement). One thing to keep in mind: if the original dataset has d records and the sample has d' records, then d > d', and the same holds for the features (m > m', where m and m' are the number of columns of the original data and of the sample respectively).
When we get the test data, each decision tree gives us an output, and these outputs are combined with a voting classifier; this combining step is the aggregation.
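In practice we rarely wire the trees together by hand; scikit-learn's RandomForestClassifier performs the row sampling, the feature sampling and the vote aggregation internally. The sketch below assumes that library and an illustrative dataset and parameter values.

# A short sketch using scikit-learn's RandomForestClassifier. The dataset and
# parameter values are illustrative, not prescriptive.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(
    n_estimators=100,     # number of decision trees (M1 ... Mk)
    max_features="sqrt",  # feature sampling: each split sees a random subset of columns
    bootstrap=True,       # row sampling with replacement for each tree
    random_state=0,
)
forest.fit(X_train, y_train)
print("Random forest accuracy:", forest.score(X_test, y_test))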

Let us analyse why we use a random forest instead of a single decision tree model.

Reason 1
A decision tree (grown to its full depth) has two properties:
  1. Low bias: If the decision tree is grown to its complete depth, the model fits the training dataset very closely, so the training error is low.
  2. High variance: Decision tree models tend to produce a much larger error when validated on test data. This is the issue we call overfitting.
In a random forest there are multiple such decision trees, each with high variance: every (RS, FS) sample makes its base decision tree an expert on that particular sample, which is exactly what drives the variance up. But when we combine all the models with majority-vote aggregation, the high variance of the individual trees is converted into a low variance for the forest as a whole.
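A small experiment along these lines: train one fully grown decision tree and one random forest on the same split and compare their train and test scores. The dataset and seeds below are arbitrary illustrative choices and the exact numbers will vary, but the single tree typically shows a larger gap between train and test accuracy than the forest.

# Sketch for Reason 1: single fully grown tree vs. aggregated forest.
# Dataset and random seeds are illustrative assumptions.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)  # grown to full depth
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

print("Single tree   train:", tree.score(X_train, y_train), "test:", tree.score(X_test, y_test))
print("Random forest train:", forest.score(X_train, y_train), "test:", forest.score(X_test, y_test))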


Reason 2
Suppose the original data has, say, 1000 records and we change, say, 200 of them. These changed records get distributed roughly evenly across the bootstrap samples of the different decision trees, so no single tree sees all of the changes and there is not much impact (in terms of accuracy) on the random forest as a whole.


Note: In the case of a regression problem, the aggregation takes either the mean or the median of the decision tree outputs, depending on the distribution of the data.
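As a sketch of that regression note, assuming scikit-learn's RandomForestRegressor and its bundled diabetes dataset, we can pull out each tree's prediction and aggregate with either the mean (which is what the library itself returns) or the median.

# Sketch of regression aggregation: mean or median of the per-tree outputs.
# Dataset and parameters are illustrative assumptions.
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_train, y_train)

# Collect each tree's prediction, then aggregate.
per_tree = np.array([tree.predict(X_test) for tree in forest.estimators_])
mean_pred = per_tree.mean(axis=0)        # this is what forest.predict(X_test) returns
median_pred = np.median(per_tree, axis=0)  # more robust if the outputs are skewed
print(mean_pred[:3], median_pred[:3])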


References:
  1. YouTube Channel - Krish Naik
  2. Wikipedia
