What does PERCENTILE_CONT do?
The last time you worked in the Dojo you used the PERCENTILE_DISC
function. This function always returned you the exact value in your table where a certain quantile was reached. That is, it operated on the discrete set of data and picked a specific value from that set, but what if you wanted to have an interpolated quantile instead, one that works on a continuous distribution function, i.e. one that doesn’t work with discrete values like number or day but with continuous values? This is what the PERCENTILE_CONT
function does, which will be your Randori partner today.
How can you work with PERCENTILE_CONT?
Like last week, today’s dojo will be rather short, as the syntax is not surprisingly the same as for PERCENTILE_DISC
. So let’s take a data set where we can compare both functions. For this we go to the WideWorldImporters
database and look at the prices of the stock items:
SELECT UnitPrice ,StockItemID FROM [Warehouse].[StockItems]
From this data we would now like to find out at which prices the 0.05 and 0.95 quantiles are. This means that we determine the price range in which 90% of the products are located, which can help us to find outliers. For this we use both the PERCENTILE_DISC
function:
SELECT UnitPrice ,StockItemID ,PERCENTILE_DISC(0.05) WITHIN GROUP (ORDER BY UnitPrice) OVER (PARTITION BY 1) AS Lower5Percent ,PERCENTILE_DISC(0.95) WITHIN GROUP (ORDER BY UnitPrice) OVER (PARTITION BY 1) AS Upper5Percent FROM [Warehouse].[StockItems]
We now want to compare these values with the continuous version of the quantiles function:
SELECT UnitPrice ,StockItemID ,PERCENTILE_DISC(0.05) WITHIN GROUP (ORDER BY UnitPrice) OVER (PARTITION BY 1) AS Lower5PercentDisc ,PERCENTILE_DISC(0.95) WITHIN GROUP (ORDER BY UnitPrice) OVER (PARTITION BY 1) AS Upper5PercentDisc ,PERCENTILE_CONT(0.05) WITHIN GROUP (ORDER BY UnitPrice) OVER (PARTITION BY 1) AS Lower5PercentCont ,PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY UnitPrice) OVER (PARTITION BY 1) AS Upper5Percent FROM [Warehouse].[StockItems]
You can see that the discrete variant gives a value of 2.55 for the lower 0.5 quantile, while the continuous variant gives a value of 2.595. We now check these values by counting how many lines have prices that are less than or equal to these values:
SELECT COUNT(*), 'Count DISC' FROM [Warehouse].[StockItems] WHERE UnitPrice <= 2.55 UNION ALL SELECT COUNT(*), 'Count CONT' FROM [Warehouse].[StockItems] WHERE UnitPrice <= 2.595
In both cases we see 12 lines that meet the criterion. The difference between the two functions is that PERCENTILE_DISC
just returns the discrete (i.e. the concrete) value of a row, while PERCENTILE_CONT
interpolates between the values of the rows and calculates and returns the real value for the quantile, because while there are rows with a UnitPrice of 2.55, there are none with a UnitPrice of 2.595:
SELECT StockItemID, 'Price DISC' FROM [Warehouse].[StockItems] WHERE UnitPrice = 2.55 UNION ALL SELECT StockItemID, 'Price CONT' FROM [Warehouse].[StockItems] WHERE UnitPrice = 2.595
This short Randori hopefully explained to you what the difference is between the two variations of the 'Percentile' function. Which of the variants you should use will depend on how you use it. If you want to end up referencing concrete records that are exactly on top of the quantile, then the PERCENTILE_DISC
function is of course preferable, but if you want to specify a threshold value as accurately as possible, then PERCENTILE_CONT
is certainly the right choice. The good thing is that you have the choice.