Windows Functions in SQL

Introduction

Data and Data, all we can see around us is a world completely surmounted by data, huge or small doesn't matter much. Everywhere we can see is data running our world. Some legends have rightly said

"Data is the Fuel of Future"

If we sit down and analyse the world around us right now, we can notice he was truly a legend who said these lines. From the mobile phone, we use to the Television, roads, vehicles, etc. etc everything is using the data and extensively using it. From the recommender systems to the automated machines everything consumes the data that we once generated and is now delivering services through it.

But such a huge amount of data gives birth to many problems like storage & analysis. Though we have traditional technologies like SQL & the latest technologies like Hive, Spark, Pig to handle these Big data, its important to know how they work.

Let me tell you an amazing fact about these techs also uses SQL under the hood & deliver some amazing results. Yes, we can say we are being sold the same old SQL in the new packing of Hive, Spark or Pig (pun intended).

So, let's see some hidden, less explored and less talked functions of SQL --- The Windows Functions.

What is Windows Function?

This is the question that I first asked when I heard that SQL has something like this too. Trust me, its true and these are some real bunch of powerful functions in SQL that can reduce our work. Let's get back to track now.

Windows Function is used to perform calculations on a particular set of rows returned by the query, similar to the aggregate functions in SQL but it maintains the identity of each row. Windows functions are uniquely identified by the OVER clause used in them. This OVER clause provides the windows functions its special power to process each row while maintaining its identity. Another clause that is mostly used with the windows function is PARTITION BY, used to partition the queried data based on some column.

A technical definition of Windows Function is as follows:-

"In SQL, a window function or analytic function is a function which uses values from one or multiple rows to return a value for each row. (This contrasts with an aggregate function, which returns a single value for multiple rows.) Window functions have an OVER clause; any function without an OVER clause is not a window function, but rather an aggregate or single-row (scalar) function."

Let's have an example to understand them better before seeing these functions separately.

Eg.1:- Calculating the total salary paid to employees.

Normal Sum Query

Normal Sum Resultset

Eg.2:- Calculate total salaries based on department.

One way to do this is by using the GROUP BY clause.

Total Salary based on department using Group By

The second way is using the Window function.

Total Salary based on department using Window Function

Eg.3:- Calculate salary based on departments and the total salary paid to employees.

Total Salary & Department wise salary using Window Function

Amazing... isn't it, instead of using the ORDER BY clause we can also generate ranks for these departments based on the salary paid. There is a lot more that we can achieve using these windows functions. We will completely cover most of these functions in the next section.

List of Functions

List of Windows Functions

Time to learn the functions and their uses, but before that a quick note about our dataset.

We will be using an 'Employee' table with the following columns

Select from Employee Table

Also, we will be using MySQL for demo purposes.

1. Aggregate Functions:-

Functions such as SUM(), AVG(), MIN(), MAX(), COUNT(), COUNT(DISTINCT) which are used to aggregate the resultset can also be used as windows functions. These are simple functions, we would be covering them in a short, quick and easy way.

A. SUM():- It is used to get the sum of a particular set of data. We have seen how we can use the SUM function as a window function in the above examples.

B. COUNT():- It is used to count the number of rows based on a particular query. Like we used the SUM function to get the sum of salaries based on department, similarly, we can use COUNT to get the number of employees in each department.

Number of employees in each department using windows function

C. AVG():- It is used to calculate the average of resultset from a given query.

Average Salary in each department using windows function

D. MIN():- to get the Minimum value from the resultset from a given query.

Minimum Salary in each department using windows function

E. MAX():- to get the Maximum value from the resultset from a given query.

Maximum Salary in each department using windows function

2. Ranking Functions:-

Another common problem that we are asked in general to solve through MySQL is to RANK the values based on some columns or get the Nth highest/lowest value etc. All these problems can be easily solved by using Ranking functions.

A. RANK():- It is used to get the rank of the current row within its partition. Another important feature of the RANK function is that if two or more rows are having equal values, then they are assigned the same RANK, the value after that is given the next rank. i.e. 1,2,2,4.

RANK() Windows Function

In the above query, we have ranked all the employees based on their salary.

* The ranking doesn't look proper because of the table structure, having salary column as String.

B. DENSE_RANK():- This function is similar to the RANK() function, the only difference here is that, unlike RANK, wherein the case of same ranks, next rank was skipped, here ranks are not skipped and follow a continuous pattern. i.e. 1,2,2,3..

Let's see an example...

Ranking the departments

Dense Ranking the departments

We have executed the same queries using RANK & DENSE_RANK functions and could notice that, in the case of RANK, the next department has the RANK of a higher number(or we can say the number of employees in the previous department). Whereas in DENSE_RANK we have a simpler 1-9 rank for the departments.

This skipping of the next value is usually known as 'GAP' in RANK. So we can say the RANK function leaves GAP whereas DENSE_RANK we can not see any Gaps.

Now, we might doubt which is better and which is not. Then let me tell you both have different uses and both can be equally used as they take almost equal time in execution.

RANK VS DENSE RANK Performance

RANK function is used with ORDER BY clause to sort the resultset, without ORDER BY, all rows are treated as same.

C. PERCENT_RANK():- It is used to get the percentage of partition values less than the value in the current row, excluding the highest value. The values returned are relative rank and are in range from 0 to 1.

Percent rank window function

D. ROW_NUMBER():- As the name suggest, it is used to assign or create unique row numbers to each row in the dataset or more precisely the rows selected via the query. This is helpful in case if we do not have sequential ordering rows in our dataset.

ROW_NUMBER() window function

E. NTH_VALUE():- it returns the Nth value from the window frame. In case if there is no value at the specified Nth value, then it will return NULL.

NTH_VALUE Window Function

F. NTILE():- this is a tricky function as per the name, it is used to divide the resultset into N groups/buckets and returns the group/bucket number for each row. It can be thought of as Quantiles, the difference here is user can define its own number of quantiles(group/bucket).

NTILE window function

G. LEAD() & LAG():- a quite simple and powerful functions, they are used to get the preceding or succeeding values for a particular row or we can say Leading or Lagging values.

LEAD & LAG window function

H. FIRST_VALUE() & LAST_VALUE():- these functions are used to return the first and last values from the partition of the resultset.

FIRST_VALUE() & LAST_VALUE() window function

I. CUME_DIST():- it is used to get the cumulative distribution of a value within a group of values. It represents the number of rows preceding or peer with the current row in the window divided by the total number of rows in the window. The values returned are in the range from 0 to 1.

CUME_DIST() window function

Summary

We learned the some very important and powerful windows functions in SQL. We have seen how to use them and learned the practical aspects as well.

So don't just read, practice along with us and let us know the issues and challenges if you face any.

Please comment below and get the complete dataset and SQL queries.

QuickDataScience | Quick & Easy Data Science

Search This Blog