Calculate Median in SQL
Overview
Median is a term widely used in statistical theory to calculate the middle value after sorting a list of numbers. It is used to separate small valued numbers from large numbers in a dataset. As the median splits the dataset into two halves, it is collectively used in finding the central tendency and to derive better results from a large dataset. In SQL also, the median is particularly used to find the middle range of column values in a table. For example, we can use the median to measure the central average value for income distributions, marks of students, and stock fluctuations.
SQL has no direct function for computing the intermediate value of a list of numbers. Therefore, in this article, we will discuss various methods used to find the median value for a given dataset.
Introduction to SQL MEDIAN
Median is a highly used term in the case of large data sets to find the middlemost number in a sorted list of values. If the dataset contains an odd number of values, then the median will be the middle value of the list of numbers and if the dataset contains an even number of values, then, the median will be the average of the two middle values of the list of numbers.
If the number of values is odd, the median can be calculated as:
We can understand this with the help of an example, suppose, we have 5 numbers in a list, {6, 89, 19, 3, 5}. After sorting, the list will be changed to {3, 5, 6, 19, 89}. Now, the middle element of the list is the median, i.e 6. We can also calculate it using the above formula (5 + 1) / 2 = 3rd item of the sorted list, that is 6.
If the number of values is even, the median can be calculated as:
Where n is the number of values in the dataset.
To understand this, let's say we have 10 numbers in a list, {7, 1, 3, 5, 9, 0, 2, 99, 12, 100}. After sorting this list, we will get {0, 1, 2, 3, 5, 7, 9, 12, 99, 100}.
Now, we can split this list into two halves {0, 1, 2, 3} and {9, 12, 99, 100}. The middle elements left are 5 and 7, the average of them will be the median of our list of numbers. So, the median is (5 + 7) / 2 = 6. We can also do it using the above formula for, ((10/2)th element + (10/2 + 1)th element)/2 = (5th element + 6th element)/2 = (5 + 7)/2 = 6.
As we know, in SQL, we have pre-defined aggregate functions used to obtain multiple results from a given set of values, like, the SUM method to find the sum of all the values in a list, the MIN function to find the minimum value in a set of numbers, MAX function to find the maximum value in a set of numbers, and the AVG function to calculate the average of numbers in a list.
But, unfortunately, there is no predefined function to calculate the median of numbers. So, developers need to write queries to find the median. In this article, we will look at different approaches to do the same.
Why is Calculating Median Hard?
It is considered hard to calculate the median in SQL, as there is no direct pre-defined function to calculate the median value of a given set of numbers in a column, and so developers have to write queries using several in-built techniques in SQL like MAX, MIN, SUM, AVG.
For the standard aggregate functions (like min, max, count, etc.) it’s possible to get the aggregated result in a single traversal over a set of data. Median cannot work like that because it’s algorithmically much more complicated as so there is no direct method of finding it. The median is the middlemost value in a sorted list of items. So, the data needs to be sorted so that the middle value can be found.
Calculate Median Value Using a Simple SQL Query
To calculate the median of values, let’s first create a sample table. Suppose you have a student_details table, containing roll_number, and marks as the fields and some data has been inserted in these fields.
We have now created a table student_details, let’s insert some data into our fields:
We have now inserted some data in our fields of the student_details table. Now, let’s see what our table looks like, by using the SELECT query:
The output of the above query would be:
Now, let's calculate the median for the values in the marks column:
Let's now understand how this query is getting executed:
-
Firstly, we have initialized a variable named index in SQL having a single @ as the first character with the statement SET @index := -1.
-
Then, we can begin with the internal subquery.
In which we can sort the student_detail table in ascending order for the marks field, and then increment the index for each roll number and select the index with the corresponding marks.
The result of the above query would be:
-
We have kept the result of the internal subquery with the alias as m.
-
Now, we can check if the index we received as a result of the internal subquery is the middle index or not using the WHERE clause with the statement (WHERE m.i IN (FLOOR(@index / 2), CEIL(@index / 2));). For the case, when the number of elements in the list is even, we have taken the floor and the ceiling value of the number.
-
The FLOOR and the CEIL are the pre-defined functions in SQL, where FLOOR is used to return the largest integer value that is smaller than or equal to a number, and CEIL is used to return the smallest integer value that is bigger than or equal to a number.
-
In this way, the outer query will fetch the middle items of the given list of numbers.
-
Then, the SELECT clause of the outer query (SELECT AVG(m.marks) as Median) will return the average of those two middle values (selected in step 4) in the case of an even number of elements, and in the case of an odd number of elements, those two middle numbers will be the same.
-
The average is calculated using the built-in function in SQL, i.e AVG.
The output of the above complete query would be:
Calculate Median Value Using Transact SQL
Transact SQL (T SQL) is the extended and advanced version of SQL. It is the product of Microsoft, which is used for performing the operations of procedural programming, string analyzing, manipulation, mathematics, and declaring local variables.
One of the most important features of T SQL is the stored procedure, which can be compiled and stored and then can later be executed when called. User-defined functions in SQL are a part of T SQL.
As we know that it is difficult to find the value of the median in a given list of items by using Transact Structured Query Language as there is no predefined built-in function available, and so we need to design our functions to calculate the value.
In the upcoming sections, will see different query methods in T SQL to calculate the value of the median in a dataset.
Calculate Median Value Using the Ranking Function and Common Table Expressions
This method is the easiest method to find the Median. We can start by separating the dataset into two halves, the lower half and the upper half. To apply this, we can run the following SQL query on the above example of the student_details table, having roll_number and marks as the fields:
As we can see this query works by separating and grouping the list into 50 percent highest and 50 percent lowest values.
The output of the above query would be:
Calculate Median Value Using PERCENTILE_CONT
The PERCENTILE_CONT function is used to find the value at a specific percentile in a given list of values. If there is no value at the exact location obtained, then PERCENTILE_CONT interpolates the answer.
The PERCENTILE_CONT function assumes a continuous distribution between the values of the expression and the sort specification. It tries to find the value at the exact location, but if no value was obtained, then it interpolates the value of that expression at the given percentile, performing linear interpolation.
The syntax for calculating PERCENTILE_CONT is:
Here, only a single column is allowed in the ORDER BY clause, and this column is the one for which the percentile is to be calculated. By default ORDER BY clause sorts the data in ascending order. The OVER() clause is used to define how to partition the input set. In our case, we want to have a single partition (as a whole list of values) and so, no PARTITION BY clause is specified. The percentile value which is taken as a parameter is a numeric value between 0 and 1.
To compute the median value using PERCENTILE_COUNT, we have to use the percentile value as 0.5.
To see the use of the PERCENTILE_COUNT function to calculate the median, let's see an example:
Suppose you have a stocks table, containing stock_id, and price as the fields, and some data has been inserted in these fields.
We have now created a table stocks, let’s insert some data into our fields:
We have now inserted some data in our fields of the stocks table. Now, let’s see what our table looks like, by using the SELECT query:
The output of the above query would be:
Now, let's calculate the median for the values in the price column:
The output of the above query is:
Creating a Function to Calculate the Median Value
We can create a function in transact SQL to find the value of the median in a given list of numbers. It takes a table name and the column name (column for which the median is to be calculated) as a parameter. Then, it creates a temporary table with another name and stores the column values in a sorted manner. Then, it uses a dynamic scrollbar to move to the middle of the list of items and gives the middle value as a median.
We can create the function as:
Now, we can compile and save this procedure. Then, we’ll execute this procedure to get the median value of the Marks column from the student details Table. To run the procedure, execute the following query:
The median for the above student details table example is:
Conclusion
- Median is the middlemost value in a sorted list of numbers.
- IN SQL, we don't have a direct function to calculate the medium of a list of numbers.
- If a dataset contains an odd number of values, then the median will be the middle value of the list of numbers and if the dataset contains an even number of values, then, the median will be the average of the two middle values in the list of numbers.
- Calculating the median is considered hard, as there is no direct pre-defined function to calculate the median value of a given set of numbers in a column in SQL, and so developers have to write queries using several in-built techniques in SQL like MAX, MIN, SUM, and AVG.
- Median can be calculated by writing a Simple SQL Query, along with the use of built-in functions in SQL.
- Median can be calculated using Transact SQL, like by the PERCENTILE_CONT method, Ranking Function, and Common Table Expressions.
- PERCENTILE_CONT is an inverse distribution function. It takes numbers as a continuous distribution between sets of values. Then, it interpolates the value of that expression at the given percentile, if the exact value can't be returned.
- For finding the median, the percentile value in the PERCENTILE_CONT function is taken as 0.5.
- In Transact Structured Query Language, stored procedures can be created, which can be compiled and stored and then can later be executed when called.
- Ranking method is the simplest method that works by separating and grouping the list into 50 percent highest and 50 percent lowest values.