T-SQL Ninja #33

PERCENTILE_DISC

What is PERCENTILE_DISC?

Like the CUME_DIST function introduced last week, PERCENTILE_DISC is a statistical function. It was introduced in SQL Server 2012 and calculates quantiles. If you want to use the 0.5-quantile of a set (or ‘sample’ if we want to stick to the usual terminology used in statistics) you understand the value where half of all values are smaller and the other half are larger than this value. Accordingly, the 0.1-quantile is the value where 10% of the values are smaller and 90% of the values are larger than it. You can calculate quantiles with PERCENTILE_DISC. As with CUME_DIST, a window with sorting is specified and the value of the ORDER BYcolumns is returned, where the distribution function, i.e. CUME_DIST in the same window is greater than the passed value for the first time.

How can you work with PERCENTILE_DISC?

Let’s make a query in the WideWorldImporters database, looking at the distribution of orders in the year:

SELECT 
     InvoiceDate
    ,CUME_DIST() OVER (PARTITION BY DATEPART(YEAR, InvoiceDate) ORDER BY InvoiceDate) as SalesDistribution
FROM Sales.Invoices

Let’s assume that we now want to know on which day of the year 90% of the orders were received. Then we calculate the 0.9 quantile of orders as follows:

SELECT 
     InvoiceDate
    ,CUME_DIST() OVER (PARTITION BY DATEPART(YEAR, InvoiceDate) ORDER BY InvoiceDate) as SalesDistribution
    ,PERCENTILE_DISC(0.9) WITHIN GROUP (ORDER BY InvoiceDate) OVER (PARTITION BY DATEPART(YEAR, InvoiceDate)) as BreakevenPoint
FROM Sales.Invoices
ORDER BY InvoiceDate

It is important to note that PERCENTILE_DISC does not have the ORDER BY condition in the normal window, i.e. in the OVER clause, but in a special WITHIN GROUP clause. This is because, unlike the OVER clause, we cannot specify multiple sort columns here, but are restricted to a single column for sorting, which is also the return value of the function.

Now you can see on the line of each order when in the year of the order 90% of the orders were received. In this granularity this might be helpful in an application, but of course we can also do this calculation in a CTE, whose results we then summarize:

WITH CTE_Percentile AS (
    SELECT 
         InvoiceDate
        ,CUME_DIST() OVER (PARTITION BY DATEPART(YEAR, InvoiceDate) ORDER BY InvoiceDate) as SalesDistribution
        ,PERCENTILE_DISC(0.9) WITHIN GROUP (ORDER BY InvoiceDate) OVER (PARTITION BY DATEPART(YEAR, InvoiceDate)) as BreakevenPoint
    FROM Sales.Invoices
)
SELECT DISTINCT 
     DATEPART(YEAR, InvoiceDate) as InvoiceYear
    ,BreakevenPoint 
FROM CTE_Percentile

It is important to note that PERCENTILE_DISC will always return a concrete value that exists in your table, i.e. it will not interpolate or anything, but will return the value where CUME_DIST is above the searched quantile for the first time. Therefore the name of the function: PERCENTILE_DISC calculates on the discrete set of rows exactly the value that fulfills the condition.

References

Ninja-Notebooks @ GitHub

Leave a Reply

Your email address will not be published. Required fields are marked *