## What is CUME_DIST?

Suppose you have a table with certain values and you want to know what percentage of records are below each value. This kind of question can be answered by the `CUME_DIST`

function. `CUME_DIST`

is short for “Cumulative Distribution” and is a statistical function that calculates for a given value and distribution the probability of a record being below the value.

## How can you work with CUME_DIST?

So now that we have clarified what `CUME_DIST`

does, let’s take a look at how you can use the function. It is clear that we are dealing with distributions in `CUME_DIST`

, so the function will return a floating point number between 0 and 1.

To get some experience with `CUME_DIST`

, we will first write a query that matches the `StockItems`

of the `WideWorldImporters`

database with their item groups. Since there are items that are assigned to several item groups, we do not use a join for this, but always assign these items to the first group:

SELECT UnitPrice ,SG.StockGroupName ,StockItemID OUTER APPLY ( SELECT TOP 1 StockGroupName FROM [Warehouse].[StockItemStockGroups] SISG LEFT JOIN [Warehouse].[StockGroups] SG ON SISG.StockGroupID = SG.StockGroupID WHERE SI.StockItemID = SISG.StockItemID ) SG ORDER BY UnitPrice

Already here we get a feeling for the data, when we scroll through the result set, for example, we intuitively see that about half of the products cost less than 20 dollars. Now let’s summarize some products by creating price segments, rounding each price to $10 and counting the products in each group:

SELECT CAST(ROUND(UnitPrice/10, 0)*10 AS int) as PriceRange ,COUNT(*) AS NumberOfProducts FROM [Warehouse].[StockItems] SI OUTER APPLY ( SELECT TOP 1 StockGroupName FROM [Warehouse].[StockItemStockGroups] SISG LEFT JOIN [Warehouse].[StockGroups] SG ON SISG.StockGroupID = SG.StockGroupID WHERE SI.StockItemID = SISG.StockItemID ) SG GROUP BY CAST(ROUND(UnitPrice/10, 0)*10 AS int)

We can also see here that about 50% of the products are in price segments up to $20 (you can also see the distribution graphically by clicking on the chart icon next to the result set in Azure Data Studio). But now we want to know exactly: and look at the information on product level, for this we use the `CUME_DIST`

function to look at the distribution on product level:

SELECT StockItemID ,UnitPrice ,CUME_DIST() OVER ( ORDER BY UnitPrice ) PercentageOfProeuctsBelow FROM [Warehouse].[StockItems] SI ORDER BY UnitPrice

It is important to note that products that have the same UnitPrice are assigned the same value of the distribution function. Since the distribution function cannot differentiate between the products on the basis of prices, they are treated the same here, similar to `RANK`

. If you wanted to change this, you would have to add a column as tie-breaker:

SELECT StockItemID ,UnitPrice ,CUME_DIST() OVER ( ORDER BY UnitPrice, StockItemID ) PercentageOfProeuctsBelow FROM [Warehouse].[StockItems] SI ORDER BY UnitPrice

But we can also go a step further with `CUME_DIST`

. If you are now interested in how the price should be classified in the respective product group, we give the window of the `CUME_DIST`

function a `PARTITION BY`

argument:

SELECT StockItemID ,SG.StockGroupName ,UnitPrice ,CUME_DIST() OVER ( PARTITION BY SG.StockGroupName ORDER BY UnitPrice, StockItemID ) PercentageOfProeuctsBelow FROM [Warehouse].[StockItems] SI OUTER APPLY ( SELECT TOP 1 StockGroupName FROM [Warehouse].[StockItemStockGroups] SISG LEFT JOIN [Warehouse].[StockGroups] SG ON SISG.StockGroupID = SG.StockGroupID WHERE SI.StockItemID = SISG.StockItemID ) SG ORDER BY UnitPrice

For each product we get the information how many percent of the items in the respective group are cheaper and equally expensive as the product you are looking at.

`CUME_DIST`

is similar to the recently introduced `FIRST_VALUE`

, `LAST_VALUE`

, `LEAD`

and `LAG`

a tool that can help you to collect and analyze a data set by means of statistics and thus gain an understanding of the data or provide your users or applications with corresponding added value without leaving the SQL Server. Sure, `CUME_DIST`

is not one of the most frequently used T-SQL functions, but as real ninjas you can show off your skills when it comes to determining the distribution of data and you have `CUME_DIST`

in mind…