What’s GROUP BY?
GROUP BY
and HAVING
belong to the absolute basics of SQL syntax, but every ninja should also know the basics and not only be familiar with Jutsu, so let’s have a look at the construct. A query that aggregates data, for example by making a sum, doesn’t always have to make the sum over all rows, it can also make the sum over certain subsets of the table. What these subsets look like is determined by the GROUP BY
construct.
How can you work with GROUP BY?
To write a query with GROUP BY
, we take a table from the Wide World Importers database. Suppose we want to find financial figures from the database and we use the CustomerTransactions
table. We can see a sum of the sales as follows:
Um eine Abfrage mit GROUP BY
zu schreiben, nehmen wir uns eine Tabelle der Wide World Importers-Datenbank vor. Angenommen, wir möchten Finanzkennzahlen aus der Datenbank herausfinden und knüpfen uns dafür die CustomerTransactions
-Tabelle vor. Eine Summe der Umsätze können wir wie folgt sehen:
SELECT SUM(TransactionAmount) ,SUM(TaxAmount) ,SUM(OutstandingBalance) FROM Sales.CustomerTransactions
The result of this query is nice, but not very helpful, because we don’t know yet in which period of time the transactions were made, so we have to add the TransactionDate
:
SELECT SUM([TransactionAmount]) ,SUM([TaxAmount]) ,SUM([OutstandingBalance]) ,[TransactionDate] FROM [Sales].[CustomerTransactions]
Here, the database immediately informs us that the TransactionDate
is not allowed in this query because it does not occur in an aggregating function or a GROUP BY
:
Column 'Sales.CustomerTransactions.TransactionDate' is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause.
What does this mean? The server uses the SUM
aggregation to combine the remaining columns into a single line. Of course we could aggregate the date using the MIN
or the MAX
function, but that’s not what we want. Instead, we want to aggregate the sales per day. We can then do this using GROUP BY
as follows:
SELECT SUM([TransactionAmount]) ,SUM([TaxAmount]) ,SUM([OutstandingBalance]) ,[TransactionDate] FROM [Sales].[CustomerTransactions] GROUP BY [TransactionDate]
Here we now receive, as requested, the turnover broken down to individual days. We can also sort by days:
SELECT SUM([TransactionAmount]) ,SUM([TaxAmount]) ,SUM([OutstandingBalance]) ,[TransactionDate] FROM [Sales].[CustomerTransactions] GROUP BY [TransactionDate] ORDER BY [TransactionDate]
Of course, we could also sort only by columns that are contained in the output and thus either in an aggregating function or in a GROUP BY
construct. What is possible, however, is to display only the year from the TransactionDate
:
SELECT SUM([TransactionAmount]) as [TransactionAmount] ,SUM([TaxAmount]) as [TaxAmount] ,SUM([OutstandingBalance]) as [OutstandingBalance] ,DATEPART(YEAR, [TransactionDate]) as [Year] FROM [Sales].[CustomerTransactions] GROUP BY [TransactionDate] ORDER BY [TransactionDate]
However, the result here is not quite the desired one. A line still appears for each day of the year, but the year column only contains the year of the respective date. To get year totals here, we have to group (and sort) by the years:
SELECT SUM([TransactionAmount]) as [TransactionAmount] ,SUM([TaxAmount]) as [TaxAmount] ,SUM([OutstandingBalance]) as [OutstandingBalance] ,DATEPART(YEAR, [TransactionDate]) as [Year] FROM [Sales].[CustomerTransactions] GROUP BY DATEPART(YEAR, [TransactionDate]) ORDER BY DATEPART(YEAR, [TransactionDate])
So what’s the “having” about? That’s also relatively simple. Let’s start by calculating monthly sums:
SELECT SUM([TransactionAmount]) as [TransactionAmount] ,SUM([TaxAmount]) as [TaxAmount] ,SUM([OutstandingBalance]) as [OutstandingBalance] ,DATEPART(MONTH, [TransactionDate]) as [Month] ,DATEPART(YEAR, [TransactionDate]) as [Year] FROM [Sales].[CustomerTransactions] GROUP BY DATEPART(YEAR, [TransactionDate]), DATEPART(MONTH, [TransactionDate]) ORDER BY DATEPART(YEAR, [TransactionDate]), DATEPART(MONTH, [TransactionDate])
In the query you will see that it is possible to group by multiple columns. With this query, it is not possible to omit one of the two groupings…
If you now want to restrict this query to see only the year 2016, you can do so with a WHERE
condition:
SELECT SUM([TransactionAmount]) as [TransactionAmount] ,SUM([TaxAmount]) as [TaxAmount] ,SUM([OutstandingBalance]) as [OutstandingBalance] ,DATEPART(MONTH, [TransactionDate]) as [Month] ,DATEPART(YEAR, [TransactionDate]) as [Year] FROM [Sales].[CustomerTransactions] WHERE DATEPART(YEAR, [TransactionDate]) = 2016 GROUP BY DATEPART(YEAR, [TransactionDate]), DATEPART(MONTH, [TransactionDate]) ORDER BY DATEPART(YEAR, [TransactionDate]), DATEPART(MONTH, [TransactionDate])
But what if you only want to find the months where the OutstandingBalance is not zero? This no longer works with the WHERE
condition, since aggregating functions are not allowed in the WHERE
condition:
SELECT SUM([TransactionAmount]) as [TransactionAmount] ,SUM([TaxAmount]) as [TaxAmount] ,SUM([OutstandingBalance]) as [OutstandingBalance] ,DATEPART(MONTH, [TransactionDate]) as [Month] ,DATEPART(YEAR, [TransactionDate]) as [Year] FROM [Sales].[CustomerTransactions] WHERE SUM([OutstandingBalance]) <> 0 GROUP BY DATEPART(YEAR, [TransactionDate]), DATEPART(MONTH, [TransactionDate]) ORDER BY DATEPART(YEAR, [TransactionDate]), DATEPART(MONTH, [TransactionDate])
The server now reports:
An aggregate may not appear in the WHERE clause unless it is in a subquery contained in a HAVING clause or a select list, and the column being aggregated is an outer reference.
The error message immediately gives the appropriate hint here: HAVING
is the magic word we need:
SELECT SUM([TransactionAmount]) as [TransactionAmount] ,SUM([TaxAmount]) as [TaxAmount] ,SUM([OutstandingBalance]) as [OutstandingBalance] ,DATEPART(MONTH, [TransactionDate]) as [Month] ,DATEPART(YEAR, [TransactionDate]) as [Year] FROM [Sales].[CustomerTransactions] GROUP BY DATEPART(YEAR, [TransactionDate]), DATEPART(MONTH, [TransactionDate]) HAVING SUM([OutstandingBalance]) <> 0 ORDER BY DATEPART(YEAR, [TransactionDate]), DATEPART(MONTH, [TransactionDate])
For the old, experienced ninjas among you, this may come as little surprise, but we have seen many young warriors fail to formulate an aggregated query correctly… They should also remember the order of SELECT
… FROM
… GROUP BY
… HAVING
… ORDER BY
2 thoughts on “GROUP BY … HAVING”
echt stark!
Danke.