T-SQL Ninja #14

GROUP BY … HAVING

What’s GROUP BY?

GROUP BY and HAVING belong to the absolute basics of SQL syntax, but every ninja should also know the basics and not only be familiar with Jutsu, so let’s have a look at the construct. A query that aggregates data, for example by making a sum, doesn’t always have to make the sum over all rows, it can also make the sum over certain subsets of the table. What these subsets look like is determined by the GROUP BY construct.

How can you work with GROUP BY?

To write a query with GROUP BY, we take a table from the Wide World Importers database. Suppose we want to find financial figures from the database and we use the CustomerTransactions table. We can see a sum of the sales as follows:

Um eine Abfrage mit GROUP BY zu schreiben, nehmen wir uns eine Tabelle der Wide World Importers-Datenbank vor. Angenommen, wir möchten Finanzkennzahlen aus der Datenbank herausfinden und knüpfen uns dafür die CustomerTransactions-Tabelle vor. Eine Summe der Umsätze können wir wie folgt sehen:

SELECT
       SUM(TransactionAmount)
      ,SUM(TaxAmount)
      ,SUM(OutstandingBalance)
FROM Sales.CustomerTransactions

The result of this query is nice, but not very helpful, because we don’t know yet in which period of time the transactions were made, so we have to add the TransactionDate:

SELECT
       SUM([TransactionAmount])
      ,SUM([TaxAmount])
      ,SUM([OutstandingBalance])
      ,[TransactionDate]
  FROM [Sales].[CustomerTransactions]

Here, the database immediately informs us that the TransactionDate is not allowed in this query because it does not occur in an aggregating function or a GROUP BY:

Column 'Sales.CustomerTransactions.TransactionDate' is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause.

What does this mean? The server uses the SUM aggregation to combine the remaining columns into a single line. Of course we could aggregate the date using the MIN or the MAX function, but that’s not what we want. Instead, we want to aggregate the sales per day. We can then do this using GROUP BY as follows:

SELECT
       SUM([TransactionAmount])
      ,SUM([TaxAmount])
      ,SUM([OutstandingBalance])
      ,[TransactionDate]
  FROM [Sales].[CustomerTransactions]
  GROUP BY [TransactionDate]

Here we now receive, as requested, the turnover broken down to individual days. We can also sort by days:

SELECT
       SUM([TransactionAmount])
      ,SUM([TaxAmount])
      ,SUM([OutstandingBalance])
      ,[TransactionDate]
  FROM [Sales].[CustomerTransactions]
  GROUP BY [TransactionDate]
  ORDER BY [TransactionDate]

Of course, we could also sort only by columns that are contained in the output and thus either in an aggregating function or in a GROUP BY construct. What is possible, however, is to display only the year from the TransactionDate:

 SELECT
       SUM([TransactionAmount]) as [TransactionAmount]
      ,SUM([TaxAmount]) as [TaxAmount]
      ,SUM([OutstandingBalance]) as [OutstandingBalance]
      ,DATEPART(YEAR, [TransactionDate]) as [Year]
  FROM [Sales].[CustomerTransactions]
  GROUP BY [TransactionDate]
  ORDER BY [TransactionDate]

However, the result here is not quite the desired one. A line still appears for each day of the year, but the year column only contains the year of the respective date. To get year totals here, we have to group (and sort) by the years:

SELECT
       SUM([TransactionAmount]) as [TransactionAmount]
      ,SUM([TaxAmount]) as [TaxAmount]
      ,SUM([OutstandingBalance]) as [OutstandingBalance]
      ,DATEPART(YEAR, [TransactionDate]) as [Year]
  FROM [Sales].[CustomerTransactions]
  GROUP BY DATEPART(YEAR, [TransactionDate])
  ORDER BY DATEPART(YEAR, [TransactionDate])

So what’s the “having” about? That’s also relatively simple. Let’s start by calculating monthly sums:

SELECT
       SUM([TransactionAmount]) as [TransactionAmount]
      ,SUM([TaxAmount]) as [TaxAmount]
      ,SUM([OutstandingBalance]) as [OutstandingBalance]
      ,DATEPART(MONTH, [TransactionDate]) as [Month] 
      ,DATEPART(YEAR, [TransactionDate]) as [Year]
  FROM [Sales].[CustomerTransactions]
  GROUP BY DATEPART(YEAR, [TransactionDate]), DATEPART(MONTH, [TransactionDate])
  ORDER BY DATEPART(YEAR, [TransactionDate]), DATEPART(MONTH, [TransactionDate])

In the query you will see that it is possible to group by multiple columns. With this query, it is not possible to omit one of the two groupings…

If you now want to restrict this query to see only the year 2016, you can do so with a WHERE condition:

SELECT
       SUM([TransactionAmount]) as [TransactionAmount]
      ,SUM([TaxAmount]) as [TaxAmount]
      ,SUM([OutstandingBalance]) as [OutstandingBalance]
      ,DATEPART(MONTH, [TransactionDate]) as [Month] 
      ,DATEPART(YEAR, [TransactionDate]) as [Year]
  FROM [Sales].[CustomerTransactions]
  WHERE DATEPART(YEAR, [TransactionDate]) = 2016
  GROUP BY DATEPART(YEAR, [TransactionDate]), DATEPART(MONTH, [TransactionDate])
  ORDER BY DATEPART(YEAR, [TransactionDate]), DATEPART(MONTH, [TransactionDate])

But what if you only want to find the months where the OutstandingBalance is not zero? This no longer works with the WHERE condition, since aggregating functions are not allowed in the WHERE condition:

SELECT
       SUM([TransactionAmount]) as [TransactionAmount]
      ,SUM([TaxAmount]) as [TaxAmount]
      ,SUM([OutstandingBalance]) as [OutstandingBalance]
      ,DATEPART(MONTH, [TransactionDate]) as [Month] 
      ,DATEPART(YEAR, [TransactionDate]) as [Year]
  FROM [Sales].[CustomerTransactions]
  WHERE SUM([OutstandingBalance]) <> 0 
  GROUP BY DATEPART(YEAR, [TransactionDate]), DATEPART(MONTH, [TransactionDate])
  ORDER BY DATEPART(YEAR, [TransactionDate]), DATEPART(MONTH, [TransactionDate])

The server now reports: 

An aggregate may not appear in the WHERE clause unless it is in a subquery contained in a HAVING clause or a select list, and the column being aggregated is an outer reference.

The error message immediately gives the appropriate hint here: HAVING is the magic word we need:

SELECT
       SUM([TransactionAmount]) as [TransactionAmount]
      ,SUM([TaxAmount]) as [TaxAmount]
      ,SUM([OutstandingBalance]) as [OutstandingBalance]
      ,DATEPART(MONTH, [TransactionDate]) as [Month] 
      ,DATEPART(YEAR, [TransactionDate]) as [Year]
  FROM [Sales].[CustomerTransactions]
  GROUP BY DATEPART(YEAR, [TransactionDate]), DATEPART(MONTH, [TransactionDate])
  HAVING SUM([OutstandingBalance]) <> 0 
  ORDER BY DATEPART(YEAR, [TransactionDate]), DATEPART(MONTH, [TransactionDate])

For the old, experienced ninjas among you, this may come as little surprise, but we have seen many young warriors fail to formulate an aggregated query correctly… They should also remember the order of SELECT … FROM … GROUP BY … HAVING … ORDER BY

References

Ninja-Notebooks @ GitHub

2 thoughts on “GROUP BY … HAVING

Leave a Reply

Your email address will not be published. Required fields are marked *