T-SQL Ninja #34

PERCENTILE_CONT

What does PERCENTILE_CONT do?

The last time you worked in the Dojo you used the PERCENTILE_DISC function. This function always returned you the exact value in your table where a certain quantile was reached. That is, it operated on the discrete set of data and picked a specific value from that set, but what if you wanted to have an interpolated quantile instead, one that works on a continuous distribution function, i.e. one that doesn’t work with discrete values like number or day but with continuous values? This is what the PERCENTILE_CONT function does, which will be your Randori partner today.

How can you work with PERCENTILE_CONT?

Like last week, today’s dojo will be rather short, as the syntax is not surprisingly the same as for PERCENTILE_DISC. So let’s take a data set where we can compare both functions. For this we go to the WideWorldImporters database and look at the prices of the stock items:

SELECT 
     UnitPrice
    ,StockItemID
FROM [Warehouse].[StockItems]

From this data we would now like to find out at which prices the 0.05 and 0.95 quantiles are. This means that we determine the price range in which 90% of the products are located, which can help us to find outliers. For this we use both the PERCENTILE_DISC function:

SELECT 
     UnitPrice
    ,StockItemID
    ,PERCENTILE_DISC(0.05) 
        WITHIN GROUP (ORDER BY UnitPrice) 
        OVER (PARTITION BY 1) 
        AS Lower5Percent
    ,PERCENTILE_DISC(0.95) 
        WITHIN GROUP (ORDER BY UnitPrice) 
        OVER (PARTITION BY 1) 
        AS Upper5Percent
FROM [Warehouse].[StockItems]

We now want to compare these values with the continuous version of the quantiles function:

SELECT 
     UnitPrice
    ,StockItemID
    ,PERCENTILE_DISC(0.05) 
        WITHIN GROUP (ORDER BY UnitPrice) 
        OVER (PARTITION BY 1) 
        AS Lower5PercentDisc
    ,PERCENTILE_DISC(0.95) 
        WITHIN GROUP (ORDER BY UnitPrice) 
        OVER (PARTITION BY 1) 
        AS Upper5PercentDisc
    ,PERCENTILE_CONT(0.05) 
        WITHIN GROUP (ORDER BY UnitPrice) 
        OVER (PARTITION BY 1) 
        AS Lower5PercentCont
    ,PERCENTILE_CONT(0.95) 
        WITHIN GROUP (ORDER BY UnitPrice) 
        OVER (PARTITION BY 1) 
        AS Upper5Percent
FROM [Warehouse].[StockItems]

You can see that the discrete variant gives a value of 2.55 for the lower 0.5 quantile, while the continuous variant gives a value of 2.595. We now check these values by counting how many lines have prices that are less than or equal to these values:

SELECT COUNT(*), 'Count DISC'
FROM [Warehouse].[StockItems]
WHERE UnitPrice <= 2.55
UNION ALL 
SELECT COUNT(*), 'Count CONT'
FROM [Warehouse].[StockItems]
WHERE UnitPrice <= 2.595

In both cases we see 12 lines that meet the criterion. The difference between the two functions is that PERCENTILE_DISC just returns the discrete (i.e. the concrete) value of a row, while PERCENTILE_CONTinterpolates between the values of the rows and calculates and returns the real value for the quantile, because while there are rows with a UnitPrice of 2.55, there are none with a UnitPrice of 2.595:

SELECT StockItemID, 'Price DISC'
FROM [Warehouse].[StockItems]
WHERE UnitPrice = 2.55
UNION ALL 
SELECT StockItemID, 'Price CONT'
FROM [Warehouse].[StockItems]
WHERE UnitPrice = 2.595

This short Randori hopefully explained to you what the difference is between the two variations of the 'Percentile' function. Which of the variants you should use will depend on how you use it. If you want to end up referencing concrete records that are exactly on top of the quantile, then the PERCENTILE_DISCfunction is of course preferable, but if you want to specify a threshold value as accurately as possible, then PERCENTILE_CONT is certainly the right choice. The good thing is that you have the choice.

References

Ninja-Notebooks @ GitHub

Leave a Reply

Your email address will not be published. Required fields are marked *