
Variable Encoding


Introduction 

Computers are one of humanity's best creations. Once a luxury item, they are now so powerful, useful and common that they are built into everything from watches and cars to spaceships. Imagining life without them is like going back to the 'Stone Age'.

These computerised systems may be great, but they have one serious limitation: they work only on Numerical Data, more specifically Binary Data, i.e. 1s and 0s. The data we see around us, however, can be Numerical, Alphabetical, Categorical, Visual, Audible and more.

Now, coming to the point: Machine Learning, Data Science, Deep Learning and Artificial Intelligence all work on data, i.e. they use data to deliver results. But as we know, a data set can be a mixture of Numerical, Alphabetical and Categorical values (let's ignore Audio and Visual data for now). Numerical data is not an issue for computers, as the binary conversion of numerical values is easy and preserves order, i.e. computers can easily identify that 2 < 3 or 10000 > 100, and thus the chance of computational error is very small.

The problem lies with Alphabetical and Categorical variables. First, we do not know what the binary conversion of these values will be, and even if we do, we don't know whether the computer will treat A, B and C in the same order. Also, once these values are converted to numbers, the numbers impose a hierarchy on them, yet in the majority of cases the categories all sit at the same level, i.e. we cannot say that A is greater than B or that C is smaller than A.
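As a small illustration of this point, the sketch below uses Python's built-in ord() to look at the numbers a computer already stores behind the letters A, B and C. The letters stand in for hypothetical category labels; the ordering check shows a hierarchy that exists only in the numbers, not in the categories themselves.

```python
# "A", "B", "C" are hypothetical category labels used only for illustration.
labels = ["A", "B", "C"]

# ord() returns the Unicode code point the computer stores for a character.
code_points = {label: ord(label) for label in labels}

# The same code points written as 8-bit binary strings.
binary_forms = {label: format(code, "08b") for label, code in code_points.items()}

# Comparing the numbers suggests A < B < C, even though the categories
# themselves have no such ranking.
implied_order = code_points["A"] < code_points["B"] < code_points["C"]

print(code_points)
print(binary_forms)
print(implied_order)
```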

So, to keep this from causing big issues at the end of our analysis, we would like to treat such data at an early stage. The process used to do this is known as "Encoding".

Thus, we can define Encoding as the process of converting categorical or non-numerical fields to numerical ones for easy and predictable calculation by our system. In simple terms, in the process of encoding we assign numerical values to the non-numerical fields so that the system can process them like any other numeric field.
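The definition above can be sketched in a few lines of plain Python. This is only a minimal illustration, not a recommended method: the encode_labels helper and the colours column are hypothetical names made up for this example, and the codes are simply assigned in alphabetical order.

```python
def encode_labels(values):
    """Map each distinct value to an integer code and encode the list."""
    # Sort the distinct values so the mapping is the same on every run.
    mapping = {value: code for code, value in enumerate(sorted(set(values)))}
    encoded = [mapping[value] for value in values]
    return mapping, encoded


# Hypothetical "colour" column, used only for illustration.
colours = ["red", "green", "blue", "green"]
mapping, encoded = encode_labels(colours)

print(mapping)
print(encoded)
```

Note that these integer codes still carry an order (0 < 1 < 2), so a simple scheme like this suits illustration better than it suits every real data set.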

Wikipedia defines character encoding as:

In computing, data storage, and data transmission, the character encoding is used to represent a repertoire of characters by some kind of encoding system that assigns a number to each character for digital representation.

Encoding Demo

*Note: We have talked about non-numerical data here only to explain the concept of encoding; this does not imply that encoding can't be used for other fields.

Why Encoding? 

A few main reasons why we need Variable Encoding in Data Science, Deep Learning and Machine Learning are:

  • Most of the time, Python is used for data pre-processing, and one major drawback is that many of the Python libraries used for this work accept only numeric data. Thus, it becomes important to convert all non-numeric data to numeric before we begin our work.
  • Computers are more compatible with numeric data; they do not understand categorical or non-numerical values directly.
  • Most Machine Learning models cannot take non-numeric data as input.
  • Processing is also generally faster for numerical data.
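To make the reasons above concrete, here is a small sketch of what a numeric-only routine runs into when it meets raw text. The to_numeric helper and the sample values are hypothetical; Python's built-in float() stands in for any code that expects numbers.

```python
def to_numeric(values):
    """Try to convert each value to float; collect the ones that fail."""
    numbers, rejected = [], []
    for value in values:
        try:
            numbers.append(float(value))
        except ValueError:
            # float() raises ValueError for text that is not a number.
            rejected.append(value)
    return numbers, rejected


# Hypothetical mixed column: two numeric strings and two category labels.
numbers, rejected = to_numeric(["3", "10.5", "red", "B"])

print(numbers)
print(rejected)
```

The rejected values are exactly the ones that would need encoding before a numeric-only step could use them.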

Encoding Methods


Encoding Methods





