The process of converting analogue or continuous variables/data into discrete variables/data is known as discretisation.
Discretisation transforms continuous variables into discrete ones by creating a set of contiguous intervals that span the range of the variable's values. It is also called binning, where a bin is another name for an interval.
Not sure why we are introducing this term here? Let's find out in the next section.
What is Discretisation?
Let's look at an example to see why we need discretisation.
Suppose we are asked to find how physical activity relates to the weight of students in a college (demo data shown below).
Demo Discretisation Data
In the above example, we have data for 10 students: their weights and their Physical Activity Score (ranging from 0 to 10, with 0 the least active and 10 the most).
We haven't used the age of the students here, because the age range for college students is almost fixed (16-21) with only rare deviations. Weight, on the other hand, is a continuous quantity and can take any value from 30 Kg to 100 Kg (an assumed range; the actual one may vary), i.e. every student can have a unique weight. So how do we handle this, given that a different value for each observation will not lead to any meaningful result?
In this case, we use discretisation: we divide the weights into ranges, say 0-30 Kg, 31-50 Kg, 51-70 Kg and so on. Doing so reduces the N different values in the observations to a small number of groups, say 10 at most.
The modified data will then look like the following.
Demo Discretisation Data Modified
Here, we can see that what were at first 10 different values have been reduced to 2 distinct groups, which makes the data easier to analyse and model.
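The grouping described above can be sketched in a few lines of Python. This is only a minimal illustration: the weights are made-up values (not the demo data from the figure), and `upper_edges`, `labels` and `weight_bin` are hypothetical names chosen for this example. It also assumes every weight is at most 100 Kg and treats each upper edge as inclusive.

```python
from bisect import bisect_left

# Hypothetical weights in Kg, invented for illustration only
weights = [42.5, 55.0, 61.2, 48.9, 70.0, 66.3, 51.0, 58.7, 45.1, 69.4]

# Upper edge of each range (inclusive) and the matching label.
# Assumption: no weight exceeds the last edge (100 Kg).
upper_edges = [30, 50, 70, 100]
labels = ["0-30 Kg", "31-50 Kg", "51-70 Kg", "71-100 Kg"]


def weight_bin(weight):
    """Return the label of the first range whose upper edge is >= weight."""
    return labels[bisect_left(upper_edges, weight)]


binned = [weight_bin(w) for w in weights]

for w, b in zip(weights, binned):
    print(f"{w:5.1f} Kg -> {b}")

# Number of distinct groups left after binning
print(len(set(binned)), "distinct groups")
```

With these made-up weights, the ten different values collapse into two groups. Libraries such as pandas offer the same operation through `pandas.cut`, which is usually the more convenient choice on real data.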
Why Discretisation?
Let's look at some important reasons to understand it better.
- Some ML models work better on discrete data than on continuous data.
- Continuous data can take infinitely many possible values, so almost every observation can be unique.
- As shown above, raw continuous values may relate only weakly to the target, and grouping them can make the relationship easier to see.
- If the data contains some noise, grouping reduces its effect, since small fluctuations fall into the same bin.
- Outliers fall into the lowest or highest bin, which limits their influence.
When Discretisation?
Now that we have seen what discretisation is and why we need it, another important thing to know is when to use it.
The answer is quite simple once we understand the logic behind it.
Discretisation needs to be performed for continuous variables.
Methods
What we saw above was one way of discretising continuous variables. Other techniques that we can use for discretisation are:
- Dividing the data into intervals of equal width.
- Dividing the data into intervals of equal frequency (each holding roughly the same number of observations).
- Using a decision tree.
- Using K-Means clustering.
- Arbitrary discretisation (intervals chosen by hand).
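To make the first two methods in the list concrete, here is a minimal sketch in plain Python. The function names `equal_width_bins` and `equal_frequency_bins` and the sample values are hypothetical, chosen only for this illustration. Both functions return a bin index per value, and the equal-frequency version simply splits the sorted order, so tied values may end up in different bins.

```python
def equal_width_bins(values, n_bins):
    """Assign each value a bin index using n_bins intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        # All values are identical, so everything goes into one bin
        return [0] * len(values)
    # The maximum would compute to index n_bins, so clamp it into the last bin
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def equal_frequency_bins(values, n_bins):
    """Assign bin indices so each bin holds roughly the same number of values."""
    # Indices of the values in ascending order of value
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * n_bins // len(values)
    return bins


# Hypothetical weights in Kg, invented for illustration only
weights = [30, 35, 40, 45, 50, 60, 70, 80, 90, 100]

print(equal_width_bins(weights, 2))      # two intervals, each 35 Kg wide
print(equal_frequency_bins(weights, 2))  # two intervals, five values each
```

Equal width keeps the interval sizes the same but can leave bins unevenly filled, while equal frequency balances the counts at the cost of uneven interval sizes. In pandas, `pandas.cut` and `pandas.qcut` provide these two behaviours respectively.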