CUME_DIST

What is CUME_DIST?

Suppose you have a table with certain values and you want to know what percentage of records are below each value. This kind of question can be answered by the CUME_DIST function. CUME_DIST is short for “Cumulative Distribution” and is a statistical function that calculates for a given value and distribution the probability of a record being below the value.

How can you work with CUME_DIST?

So now that we have clarified what CUME_DIST does, let’s take a look at how you can use the function. It is clear that we are dealing with distributions in CUME_DIST, so the function will return a floating point number between 0 and 1.

To get some experience with CUME_DIST, we will first write a query that matches the StockItems of the WideWorldImporters database with their item groups. Since there are items that are assigned to several item groups, we do not use a join for this, but always assign these items to the first group:

SELECT 
     UnitPrice
    ,SG.StockGroupName
    ,StockItemID
OUTER APPLY (
    SELECT TOP 1 
        StockGroupName 
    FROM [Warehouse].[StockItemStockGroups] SISG 
    LEFT JOIN [Warehouse].[StockGroups] SG
        ON SISG.StockGroupID = SG.StockGroupID
    WHERE SI.StockItemID = SISG.StockItemID
) SG 
ORDER BY UnitPrice

Already here we get a feeling for the data, when we scroll through the result set, for example, we intuitively see that about half of the products cost less than 20 dollars. Now let’s summarize some products by creating price segments, rounding each price to $10 and counting the products in each group:

SELECT 
     CAST(ROUND(UnitPrice/10, 0)*10 AS int) as PriceRange
    ,COUNT(*) AS NumberOfProducts
FROM [Warehouse].[StockItems] SI
OUTER APPLY (
    SELECT TOP 1 
        StockGroupName 
    FROM [Warehouse].[StockItemStockGroups] SISG 
    LEFT JOIN [Warehouse].[StockGroups] SG
        ON SISG.StockGroupID = SG.StockGroupID
    WHERE SI.StockItemID = SISG.StockItemID
) SG 
GROUP BY
     CAST(ROUND(UnitPrice/10, 0)*10 AS int) 

We can also see here that about 50% of the products are in price segments up to $20 (you can also see the distribution graphically by clicking on the chart icon next to the result set in Azure Data Studio). But now we want to know exactly: and look at the information on product level, for this we use the CUME_DISTfunction to look at the distribution on product level:

SELECT
     StockItemID
    ,UnitPrice
    ,CUME_DIST() OVER (
        ORDER BY UnitPrice 
     ) PercentageOfProeuctsBelow
FROM [Warehouse].[StockItems] SI
ORDER BY UnitPrice

It is important to note that products that have the same UnitPrice are assigned the same value of the distribution function. Since the distribution function cannot differentiate between the products on the basis of prices, they are treated the same here, similar to RANK. If you wanted to change this, you would have to add a column as tie-breaker:

SELECT
     StockItemID
    ,UnitPrice
    ,CUME_DIST() OVER (
        ORDER BY UnitPrice, StockItemID 
     ) PercentageOfProeuctsBelow
FROM [Warehouse].[StockItems] SI
ORDER BY UnitPrice

But we can also go a step further with CUME_DIST. If you are now interested in how the price should be classified in the respective product group, we give the window of the CUME_DIST function a PARTITION BYargument:

SELECT
     StockItemID
    ,SG.StockGroupName
    ,UnitPrice
    ,CUME_DIST() OVER (
        PARTITION BY SG.StockGroupName
        ORDER BY UnitPrice, StockItemID 
     ) PercentageOfProeuctsBelow
FROM [Warehouse].[StockItems] SI
OUTER APPLY (
    SELECT TOP 1 
        StockGroupName 
    FROM [Warehouse].[StockItemStockGroups] SISG 
    LEFT JOIN [Warehouse].[StockGroups] SG
        ON SISG.StockGroupID = SG.StockGroupID
    WHERE SI.StockItemID = SISG.StockItemID
) SG 
ORDER BY UnitPrice

For each product we get the information how many percent of the items in the respective group are cheaper and equally expensive as the product you are looking at.

CUME_DIST is similar to the recently introduced FIRST_VALUELAST_VALUELEAD and LAG a tool that can help you to collect and analyze a data set by means of statistics and thus gain an understanding of the data or provide your users or applications with corresponding added value without leaving the SQL Server. Sure, CUME_DIST is not one of the most frequently used T-SQL functions, but as real ninjas you can show off your skills when it comes to determining the distribution of data and you have CUME_DIST in mind…

References

Ninja-Notebooks @ GitHub

Leave a Reply

Your email address will not be published.