What is CUME_DIST?
Suppose you have a table with certain values and you want to know what percentage of records are below each value. This kind of question can be answered by the CUME_DIST
function. CUME_DIST
is short for “Cumulative Distribution” and is a statistical function that calculates for a given value and distribution the probability of a record being below the value.
How can you work with CUME_DIST?
So now that we have clarified what CUME_DIST
does, let’s take a look at how you can use the function. It is clear that we are dealing with distributions in CUME_DIST
, so the function will return a floating point number between 0 and 1.
To get some experience with CUME_DIST
, we will first write a query that matches the StockItems
of the WideWorldImporters
database with their item groups. Since there are items that are assigned to several item groups, we do not use a join for this, but always assign these items to the first group:
SELECT UnitPrice ,SG.StockGroupName ,StockItemID OUTER APPLY ( SELECT TOP 1 StockGroupName FROM [Warehouse].[StockItemStockGroups] SISG LEFT JOIN [Warehouse].[StockGroups] SG ON SISG.StockGroupID = SG.StockGroupID WHERE SI.StockItemID = SISG.StockItemID ) SG ORDER BY UnitPrice
Already here we get a feeling for the data, when we scroll through the result set, for example, we intuitively see that about half of the products cost less than 20 dollars. Now let’s summarize some products by creating price segments, rounding each price to $10 and counting the products in each group:
SELECT CAST(ROUND(UnitPrice/10, 0)*10 AS int) as PriceRange ,COUNT(*) AS NumberOfProducts FROM [Warehouse].[StockItems] SI OUTER APPLY ( SELECT TOP 1 StockGroupName FROM [Warehouse].[StockItemStockGroups] SISG LEFT JOIN [Warehouse].[StockGroups] SG ON SISG.StockGroupID = SG.StockGroupID WHERE SI.StockItemID = SISG.StockItemID ) SG GROUP BY CAST(ROUND(UnitPrice/10, 0)*10 AS int)
We can also see here that about 50% of the products are in price segments up to $20 (you can also see the distribution graphically by clicking on the chart icon next to the result set in Azure Data Studio). But now we want to know exactly: and look at the information on product level, for this we use the CUME_DIST
function to look at the distribution on product level:
SELECT StockItemID ,UnitPrice ,CUME_DIST() OVER ( ORDER BY UnitPrice ) PercentageOfProeuctsBelow FROM [Warehouse].[StockItems] SI ORDER BY UnitPrice
It is important to note that products that have the same UnitPrice are assigned the same value of the distribution function. Since the distribution function cannot differentiate between the products on the basis of prices, they are treated the same here, similar to RANK
. If you wanted to change this, you would have to add a column as tie-breaker:
SELECT StockItemID ,UnitPrice ,CUME_DIST() OVER ( ORDER BY UnitPrice, StockItemID ) PercentageOfProeuctsBelow FROM [Warehouse].[StockItems] SI ORDER BY UnitPrice
But we can also go a step further with CUME_DIST
. If you are now interested in how the price should be classified in the respective product group, we give the window of the CUME_DIST
function a PARTITION BY
argument:
SELECT StockItemID ,SG.StockGroupName ,UnitPrice ,CUME_DIST() OVER ( PARTITION BY SG.StockGroupName ORDER BY UnitPrice, StockItemID ) PercentageOfProeuctsBelow FROM [Warehouse].[StockItems] SI OUTER APPLY ( SELECT TOP 1 StockGroupName FROM [Warehouse].[StockItemStockGroups] SISG LEFT JOIN [Warehouse].[StockGroups] SG ON SISG.StockGroupID = SG.StockGroupID WHERE SI.StockItemID = SISG.StockItemID ) SG ORDER BY UnitPrice
For each product we get the information how many percent of the items in the respective group are cheaper and equally expensive as the product you are looking at.
CUME_DIST
is similar to the recently introduced FIRST_VALUE
, LAST_VALUE
, LEAD
and LAG
a tool that can help you to collect and analyze a data set by means of statistics and thus gain an understanding of the data or provide your users or applications with corresponding added value without leaving the SQL Server. Sure, CUME_DIST
is not one of the most frequently used T-SQL functions, but as real ninjas you can show off your skills when it comes to determining the distribution of data and you have CUME_DIST
in mind…