Introduction
Data and Data, all we can see around us is a world completely surmounted by data, huge or small doesn't matter much. Everywhere we can see is data running our world. Some legends have rightly said
"Data is the Fuel of Future"
If we sit down and analyse the world around us right now, we can notice he was truly a legend who said these lines. From the mobile phone, we use to the Television, roads, vehicles, etc. etc everything is using the data and extensively using it. From the recommender systems to the automated machines everything consumes the data that we once generated and is now delivering services through it.
But such a huge amount of data gives birth to many problems like storage & analysis. Though we have traditional technologies like SQL & the latest technologies like Hive, Spark, Pig to handle these Big data, its important to know how they work.
Let me tell you an amazing fact about these techs also uses SQL under the hood & deliver some amazing results. Yes, we can say we are being sold the same old SQL in the new packing of Hive, Spark or Pig (pun intended).
So, let's see some hidden, less explored and less talked functions of SQL --- The Windows Functions.
What is Windows Function?
This is the question that I first asked when I heard that SQL has something like this too. Trust me, its true and these are some real bunch of powerful functions in SQL that can reduce our work. Let's get back to track now.
Windows Function is used to perform calculations on a particular set of rows returned by the query, similar to the aggregate functions in SQL but it maintains the identity of each row. Windows functions are uniquely identified by the OVER clause used in them. This OVER clause provides the windows functions its special power to process each row while maintaining its identity. Another clause that is mostly used with the windows function is PARTITION BY, used to partition the queried data based on some column.
A technical definition of Windows Function is as follows:-
"In SQL, a window function or analytic function is a function which uses values from one or multiple rows to return a value for each row. (This contrasts with an aggregate function, which returns a single value for multiple rows.) Window functions have an OVER clause; any function without an OVER clause is not a window function, but rather an aggregate or single-row (scalar) function."
Let's have an example to understand them better before seeing these functions separately.
Eg.1:- Calculating the total salary paid to employees.
Normal Sum Query |
Normal Sum Resultset |
Eg.2:- Calculate total salaries based on department.
Total Salary based on department using Group By |
Total Salary based on department using Window Function |
Eg.3:- Calculate salary based on departments and the total salary paid to employees.
Total Salary & Department wise salary using Window Function |
Amazing... isn't it, instead of using the ORDER BY clause we can also generate ranks for these departments based on the salary paid. There is a lot more that we can achieve using these windows functions. We will completely cover most of these functions in the next section.
List of Functions
List of Windows Functions |
Select from Employee Table |
1. Aggregate Functions:-
Number of employees in each department using windows function |
Average Salary in each department using windows function |
E. MAX():- to get the Maximum value from the resultset from a given query.
Maximum Salary in each department using windows function |
2. Ranking Functions:-
Another common problem that we are asked in general to solve through MySQL is to RANK the values based on some columns or get the Nth highest/lowest value etc. All these problems can be easily solved by using Ranking functions.
A. RANK():- It is used to get the rank of the current row within its partition. Another important feature of the RANK function is that if two or more rows are having equal values, then they are assigned the same RANK, the value after that is given the next rank. i.e. 1,2,2,4.
RANK() Windows Function |
In the above query, we have ranked all the employees based on their salary.
* The ranking doesn't look proper because of the table structure, having salary column as String.
B. DENSE_RANK():- This function is similar to the RANK() function, the only difference here is that, unlike RANK, wherein the case of same ranks, next rank was skipped, here ranks are not skipped and follow a continuous pattern. i.e. 1,2,2,3..
Let's see an example...
Ranking the departments |
Dense Ranking the departments |
We have executed the same queries using RANK & DENSE_RANK functions and could notice that, in the case of RANK, the next department has the RANK of a higher number(or we can say the number of employees in the previous department). Whereas in DENSE_RANK we have a simpler 1-9 rank for the departments.
This skipping of the next value is usually known as 'GAP' in RANK. So we can say the RANK function leaves GAP whereas DENSE_RANK we can not see any Gaps.
Now, we might doubt which is better and which is not. Then let me tell you both have different uses and both can be equally used as they take almost equal time in execution.
RANK VS DENSE RANK Performance |
RANK function is used with ORDER BY clause to sort the resultset, without ORDER BY, all rows are treated as same.
C. PERCENT_RANK():- It is used to get the percentage of partition values less than the value in the current row, excluding the highest value. The values returned are relative rank and are in range from 0 to 1.
Percent rank window function |
D. ROW_NUMBER():- As the name suggest, it is used to assign or create unique row numbers to each row in the dataset or more precisely the rows selected via the query. This is helpful in case if we do not have sequential ordering rows in our dataset.
ROW_NUMBER() window function |
E. NTH_VALUE():- it returns the Nth value from the window frame. In case if there is no value at the specified Nth value, then it will return NULL.
NTH_VALUE Window Function |
F. NTILE():- this is a tricky function as per the name, it is used to divide the resultset into N groups/buckets and returns the group/bucket number for each row. It can be thought of as Quantiles, the difference here is user can define its own number of quantiles(group/bucket).
NTILE window function |
G. LEAD() & LAG():- a quite simple and powerful functions, they are used to get the preceding or succeeding values for a particular row or we can say Leading or Lagging values.
LEAD & LAG window function |
H. FIRST_VALUE() & LAST_VALUE():- these functions are used to return the first and last values from the partition of the resultset.
FIRST_VALUE() & LAST_VALUE() window function |
I. CUME_DIST():- it is used to get the cumulative distribution of a value within a group of values. It represents the number of rows preceding or peer with the current row in the window divided by the total number of rows in the window. The values returned are in the range from 0 to 1.
CUME_DIST() window function |
Summary
We learned the some very important and powerful windows functions in SQL. We have seen how to use them and learned the practical aspects as well.
So don't just read, practice along with us and let us know the issues and challenges if you face any.
Comments
Post a Comment