
KNN Imputation


When it comes to multivariate imputation, one of the most common techniques, familiar to every data scientist, is KNN imputation. Though "KNN Impute" might be a new term, KNN itself is familiar to everyone in this field. Even if it is new to you, don't worry: we define it in the next section.

Let's define KNN and make it familiar to new aspirants.

What is KNN?

KNN, short for K-Nearest Neighbours, is a way of making a prediction for a data point by looking at the 'K' points that lie closest to it, measured by a distance metric such as Euclidean distance.
Let's explain it with an example. Suppose Lego pieces from 4 coloured sets are scattered across the floor, roughly piled by colour, and we find a stray piece whose colour we can't make out in the dim light. To decide which pile it belongs to, we look at, say, the 5 pieces lying closest to it (K=5) and assign it the colour that appears most often among them. Once we have done this for every stray piece, we have successfully sorted our Lego.

This process from the example is how KNN is performed. We can vary the value of K as per our need, but it should be chosen carefully: too small a K makes the prediction sensitive to noise, while too large a K blurs the distinction between the groups.

*Note:- KNN is often confused with K-Means clustering, where the computer randomly guesses K centroids (4 different-coloured Lego pieces) from the entire data space to start, then repeatedly shifts the centroids to minimise distances and group the data. KNN itself has no centroids: it simply computes the distance from the query point to every known point and keeps the K closest.

This is a very common machine learning algorithm when we want to classify points or predict values from a heap of labelled examples.
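A toy sketch of this idea using scikit-learn's KNeighborsClassifier; the points and colour labels below are made up purely for illustration:

    # Two loose "piles" of points, labelled by colour (made-up data).
    from sklearn.neighbors import KNeighborsClassifier

    X = [[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]]
    y = ['red', 'red', 'red', 'blue', 'blue', 'blue']

    # Classify a new point by the majority colour of its 3 nearest neighbours.
    knn = KNeighborsClassifier(n_neighbors=3)
    knn.fit(X, y)
    print(knn.predict([[1.5, 1.5]]))  # -> ['red']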

We can read more about KNN HERE.

KNN Imputation

Now, we may wonder: if the KNN algorithm is used to classify data points, how can it be used for imputation?

Relax, it's quite simple. Just as we matched a stray Lego piece to its nearest neighbours by position, in the case of missing data the KNN algorithm finds the observations most similar to the one with missing data and predicts the missing values from those nearest neighbours.

We can easily perform KNN imputation using KNNImputer from Python's scikit-learn library. It uses Euclidean distance (via the nan_euclidean metric, which ignores missing coordinates) to find the nearest neighbours, and the neighbours' values are then averaged, either uniformly or weighted by distance, to fill in the gap. If an observation has multiple missing values, the neighbours used to impute each variable may differ. When fewer than n_neighbors neighbours have a defined distance to the training set, the training-set average for that feature is used instead.


Time to roll up our sleeves and see how we can achieve this in practice.

Code

1. Importing the libraries and data

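The original post showed this step as a screenshot; below is a minimal sketch of the same step. The file name 'data.csv' is a placeholder, since the original dataset isn't shown.

    # Import the libraries and load the dataset ('data.csv' is a placeholder).
    import numpy as np
    import pandas as pd
    from sklearn.impute import KNNImputer

    df = pd.read_csv('data.csv')
    df.head()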

2. Checking Missing Data

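A sketch of how the missing-data check might look:

    # Percentage of missing values per column, highest first.
    missing_pct = df.isnull().mean() * 100
    print(missing_pct.sort_values(ascending=False))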

3. Loading KNN Imputer

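A sketch of initialising the imputer with its defaults, as used in this example:

    # Default imputer: 5 neighbours, uniform weights, nan_euclidean metric.
    imputer = KNNImputer(n_neighbors=5)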


4. Transforming Data -- KNN Imputation

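A sketch of the transformation step. Since KNNImputer works only on numeric data, we restrict it to the numeric columns here (an assumption about the dataset):

    # fit_transform returns a NumPy array; assign it back to the DataFrame.
    num_cols = df.select_dtypes(include=np.number).columns
    df[num_cols] = imputer.fit_transform(df[num_cols])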


5. Verifying the Imputation

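A sketch of the verification step:

    # Every numeric column should now report zero missing values.
    print(df[num_cols].isnull().sum())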


6. Parameters in KNNImputer

While initialising the KNNImputer, we are given several parameters that we can use to change the behaviour of our imputer. In our example we used the defaults; now let's dive in and see what parameters are available (a short sketch after the list shows some of them in use).

A. missing_values:- The placeholder for missing values. It can be an int, float, str, np.nan or None. By default, if we do not specify a value, it is np.nan.

B. n_neighbors:- Specifies the number of neighbouring samples used for imputation. By default, 5 neighbours are used.

C. weights:- Defines the weight function used when averaging the neighbours' values. By default it is 'uniform'. It can take the following values --

    i. 'uniform': all neighbouring points contribute equally to the prediction.

    ii. 'distance': neighbours are weighted by the inverse of their distance. The closer the neighbour, the higher its influence, and vice versa.

    iii. callable: used when we want to define our own weighting. It takes an array of distances and returns an array of weights of the same size.

D. metric:- The metric used for searching neighbours. It can take 2 values: 'nan_euclidean' or a callable, i.e. a user-defined distance function. By default, 'nan_euclidean' is used.

E. copy:- True or False; specifies whether a copy of the dataset is created before imputation is performed. Default - True.

F. add_indicator:- True or False; whether to append an indicator column flagging which observations had missing data. Default - False.
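A sketch of a non-default imputer using some of these parameters (num_cols as defined in the transformation step above):

    # 3 neighbours, closer neighbours weighted more heavily, and an extra
    # indicator column appended for each feature that had missing values.
    imputer = KNNImputer(n_neighbors=3, weights='distance', add_indicator=True)
    imputed = imputer.fit_transform(df[num_cols])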

Limitation

Just like other techniques, this one also has some limitations:

1. This technique can only be used for numerical variables.
2. On huge datasets with many missing values, the process can be time-consuming, since distances must be computed against the entire training set.

Advantages

1. Although the process can be time-consuming, it generally provides a more accurate imputation than simple techniques such as mean or median filling.

Summary

To conclude, we studied a better and more accurate technique for imputing missing data: KNN imputation.

This technique uses the well-known K-Nearest Neighbours machine learning algorithm to impute the missing values in a dataset.

In this article, we covered both the definition and a practical implementation of KNN imputation.

That's all from here. Until then... this is the Quick Data Science Team signing off.

Comment below to get the complete notebook and dataset.



