Handling Missing Values with Random Forest

 

random forest vs decision tree


Missing values have always been a concern for any statistical analysis. They significantly reduce the study’s statistical powers, which may lead to faulty conclusions. 

Most of the algorithms used in statistical modellings such as Linear regression, Logistic Regression, SVM and even Neural Networks are prone to these missing values. So, knowledge of handling lost data is a must. Several methods are popularly used to manage missing data. This article will discuss the Random Forest method of filling missing values and see how it fares compared to another popular technique, K-Nearest Neighbour.

What are the Missing Data?

Missing data is the data value not stored for a variable in the observation of interest. There are four ways the missing values could occur in a dataset. And those are

  • Structurally missing data,
  • MCAR(missing completely at random),
  • MAR(Missing at random) and
  • NMAR(Not missing at random).

Structurally missing data:

  These are missing because they are not supposed to exist. For example, the age of the youngest kid of a couple having no kids is best addressed by excluding the observations from any analysis of the variables.

MCAR (Missing Completely At Random): These are the missing values that occur entirely at random. There is no pattern, and each missing value is unrelated. For example, a weighing machine that ran out of batteries. Some of the data will be lost just because of bad luck, and the probability of each missing data is the same throughout.

MAR (Missing At Random): The assumption here is that the missing values are somewhat related to the other observations in the data. We can predict the missing values by using information from other variables, such as indicating a person’s missing height value from age, gender, and weight. This can be handled using specific imputation techniques or any sophisticated statistical analysis.

NMAR(Not Missing At Random): If the data cannot be classified under the above three, the missing value falls in this category. The missing values here originated not at random but deliberately. For example, people in a survey slowly do not answer some questions.

Imputation Techniques

There are many imputation techniques we can employ to tackle missing values. For example, imputing means for continuous data is the most routine matter in the case of categorical data. Or we can use machine learning algorithms like KNN and Random Forests to address the missing data problems. For this article, we will be discussing Random Forest methods, Miss Forest, and Mice Forest to handle missing values and compare them with the KNN imputation method.

Random Forest for Missing Values

Random Forest for data imputation is an exciting and efficient way of imputation, and it has almost every quality of being the best imputation technique. The Random Forests are pretty capable of scaling to significant data settings, and these are robust to the non-linearity of data and can handle outliers. Random Forests can hold mixed-type of data ( both numerical and categorical). On top of that, they have a built-in feature selection technique. These distinctive qualities of Random Forests can easily give it an upper hand over KNN or any other methods.

KNN, on the other hand, involves the calculation of Euclidean distance of data points, thus making it prone to outliers. It cannot handle categorical data, so data transformation is needed, and it requires the data to be scaled to perform better. All these things can be bypassed by using Random Forest-based imputation methods.

For this article, we will discuss two such methods: Miss Forest and the other is Mice Forest. We will explore how they work and their python implementation.

Miss Forest

Arguably the best imputation method. If you need precision, then this is what you must use. An iterative imputation technique powered by Random Forest to precisely impute data. So, how does it work?

Step-1: First, the missing values are filled by the mean of respective columns for continuous and most frequent data for categorical data.

Step-2: The dataset is divided into two parts: training data consisting of the observed variables and the other is missing data used for prediction. These training and prediction sets are then fed to Random Forest, and subsequently, the predicted data is imputed at appropriate places. After imputing all the values, one iteration gets completed.

Step-3: The above step is repeated until a stopping condition is reached. The iteration process ensures the algorithm operates on better quality data in subsequent iterations. The process continues until the sum of squared differences between the current and previous imputation increases or a specific iteration limit is reached. Usually, it takes 5-6 iterations to attribute the data well.

 

 

 

Enregistrer un commentaire

Plus récente Plus ancienne