Discretisation -- Decision Tree

One of the most favourite processes in Data Science is using Tree-based algorithms to find & predict the values. Trees are so popular in this field because they work over binary answers {Yes: No, 1: 0, True: False}, and when we can provide a clear difference between Yes or No, it becomes easy to better analyse things.

Thus, in the field of Data, when we have way too much inflow of data, it is always preferred to get cut to point answers to the questions.

So, why not use the same technique with Discretisation...!!!

How to use Decision Tree?

To find the optimal bins for the data Decision Tree creates a tree. Just like any other Decision Tree, here also a tree is created which is used to assign the values to one of the 'N' end leaves, thus segregating them to 'N' different bins.

A quick way to remember How to use this technique:-

1) Train a decision tree of limited depth (2, 3 or 4) using the variable we want to discretise and the target.

2) Replace the values with the output returned by the tree.

This process is quite useful as it does not focus on creating bins based on the widths or heights of the observation rather focuses on creating a more logical group/bins.

Let's analyse more the advantages and disadvantages of using this technique.

Advantages

The output returned by the decision tree is monotonically related to the target.
The tree end nodes or bins in the discretised variable show decreased entropy: that is, the observations within each bin are more similar among themselves than to those of other bins.

Disadvantages

Prone over-fitting, Since, there is no perfect way to find out exact no. of bins, higher bins can lead to Over-fitting.
Time Consuming. Since, some tuning of the tree parameters is needed to obtain the optimal number of splits (e.g., tree depth, the minimum number of samples in one partition, the maximum number of partitions, and a minimum information gain).

Practical

We will be using the feature-engine library of python for demo purposes.

1. Importing the Libraries

Importing Decision Tree Discretiser, libraries and data

2. Data Visualization

Since, we will be using only "Age" column of the data set for Discretisation, let's see how it looks when we visualise it.

Data Visuals before Binning

We can see here that the labels on X-axis are very close and way too many to be read and easily understood. Also, there are too many bars closely packed in the graph, which makes it more difficult to analyse.

3. Cleaning the data

Since, data has way too many NULL data, so we would be trying to fill all these NULL with '0'.

Cleaning Data

4. Using the Decision Tree Discretiser

Initializing the Decision Tree Discretiser

In first line we have initialized the Decision Tree Discretiser, then used to fit it on "Age" column.

Once we are done with it, using the "binner_dict_" function we can see that the function has created bins automatically.

5. Visualizing the End Result

Decision Tree Discretisation Result

The Decision tree has grouped all the data into 5 bins, reflecting the original shape of graph, with high in middle and lower at ends.

Thus, we can say we were able to successfully perform binning using the Decision Tree.

Resources

Please comment below to get the complete dataset and libraries.

Learn to install Anaconda Here.

Learn about the Feature-Engine library Here.

Summary

In this Quick Reads, we studied a technique, Decision Tree Discretisation that is commonly used for Discretisation. We had a quick overview of the technique, understood how and why to use it and some quick Points to Remember.

We also performed a practical demo of this technique using a famous python library "feature-engine".

QuickDataScience | Quick & Easy Data Science

Search This Blog