T-SQL Ninja #19

Batches

What are batches?

Sometimes you need to run operations on very large tables. In these cases, the transactional behavior of your database can exhaust the space in TempDB or the transaction log. This is because relational databases follow the ACID principles, which require operations to be atomic: a `DELETE` statement, for example, is either executed completely or rolled back entirely; intermediate states in which only some records have been deleted cannot occur.
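To illustrate the problem, consider deleting millions of rows in a single statement. The statement runs as one implicit transaction, so the server must keep enough log space to roll the whole thing back (the table and column names here are just placeholders):

```sql
-- A single statement like this is one atomic operation:
-- every deleted row must be logged before the transaction
-- can commit, so the transaction log grows with the row count.
DELETE FROM dbo.SomeHugeTable      -- hypothetical table
WHERE CreatedAt < '2016-04-01';
-- If the log (or TempDB) runs out of space mid-way,
-- the entire statement is rolled back.
```

Batching breaks this one giant transaction into many small ones, so log space can be reused between batches.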

How can you work with batches?

To work with batches, you need WHILE statements, which we introduced last week. The basic idea is: delete the next N records until none are left. For this exercise, let's first find the largest table in our WideWorldImporters database by querying its metadata:

SELECT 
     SCHEMA_NAME(A.schema_id) + '.' + A.name AS [TableName]
    ,B.rows AS [RowCount]
FROM sys.objects A
INNER JOIN sys.partitions B 
    ON A.object_id = B.object_id
WHERE A.type = 'U'
  AND B.index_id IN (0, 1) -- heap or clustered index only, so rows are not counted once per nonclustered index
ORDER BY B.[rows] DESC

The largest table, with over 3.5 million records, is Warehouse.ColdRoomTemperatures_Archive. If we take a look at this table, we can see that it contains records from the period 2015-12-20 to 2016-05-31:

SELECT 
 MIN(RecordedWhen)
,MAX(RecordedWhen) 
FROM Warehouse.ColdRoomTemperatures_Archive 

From this table we now want to delete all records older than 2016-04-01. Counting them shows that there are 2.2 million records:

SELECT 
    COUNT(*)
FROM Warehouse.ColdRoomTemperatures_Archive 
WHERE RecordedWhen < '2016-04-01'

However, since this is the history table of a temporal table (system-versioned table), we cannot delete from it directly, so we copy it into a temporary table where we can delete at will:

DROP TABLE IF EXISTS #WarehouseDeleteTest
SELECT * INTO #WarehouseDeleteTest FROM Warehouse.ColdRoomTemperatures_Archive

To delete these records while going easy on TempDB and the transaction log, we can now apply the pattern described above. First we declare a variable that stores how many records the last step deleted. Then we repeatedly delete 100,000 records from the table until this number reaches zero, i.e. no matching records remain. Each batch runs in its own transaction, so the transaction log is not overloaded:

DECLARE @deletedRows int = 1

WHILE @deletedRows > 0
BEGIN
    BEGIN TRANSACTION
    DELETE TOP (100000) 
    FROM #WarehouseDeleteTest
    WHERE RecordedWhen < '2016-04-01'
    
    SET @deletedRows = @@ROWCOUNT
    COMMIT TRANSACTION
END

The output shows several lines telling us that 100,000 records were deleted (100000 rows affected), then one line where fewer than 100,000 records are affected (96324 rows affected), and finally one line where no records are left to delete (0 rows affected). Counting again, only 1.3 million records remain in our temporary table:

SELECT COUNT(*) 
FROM #WarehouseDeleteTest

And finally, if we check the minimum and maximum dates of the remaining records, we find that, as expected, all records from before April 1, 2016 have been deleted:

SELECT 
 MIN(RecordedWhen)
,MAX(RecordedWhen)
FROM #WarehouseDeleteTest

This pattern is useful whenever you need to process large amounts of data and transactional integrity is not so critical that everything must happen in a single atomic step. In data warehousing, for example, it can be used when merging from a staging table into a DWH table (we described MERGE statements in week 15) and the source table contains so many records that a single MERGE statement would exhaust your hardware resources. The important point is to be aware that you are giving up atomicity across the operation as a whole, so you must take care of appropriate error handling and restartability yourself.
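A batched staging-to-DWH load could be sketched as follows. All table and column names here are hypothetical, and the TRY/CATCH block only marks where your own error handling and restart logic would go:

```sql
DECLARE @batchSize int = 100000;
DECLARE @affectedRows int = 1;

WHILE @affectedRows > 0
BEGIN
    BEGIN TRY
        BEGIN TRANSACTION;

        -- Merge one batch from staging into the DWH table
        -- (dbo.StagingOrders / dbo.DwhOrders are placeholder names).
        MERGE dbo.DwhOrders AS T
        USING (
            SELECT TOP (@batchSize) OrderID, Amount
            FROM dbo.StagingOrders
            WHERE Processed = 0
            ORDER BY OrderID
        ) AS S
            ON T.OrderID = S.OrderID
        WHEN MATCHED THEN
            UPDATE SET T.Amount = S.Amount
        WHEN NOT MATCHED THEN
            INSERT (OrderID, Amount) VALUES (S.OrderID, S.Amount);

        SET @affectedRows = @@ROWCOUNT;

        -- Flag the batch as processed so the loop makes progress
        -- and a failed run can restart where it left off.
        UPDATE S SET Processed = 1
        FROM (
            SELECT TOP (@batchSize) Processed
            FROM dbo.StagingOrders
            WHERE Processed = 0
            ORDER BY OrderID
        ) AS S;

        COMMIT TRANSACTION;
    END TRY
    BEGIN CATCH
        IF @@TRANCOUNT > 0 ROLLBACK TRANSACTION;
        THROW; -- surface the error; the Processed flag keeps the job restartable
    END CATCH
END
```

Committing after each batch keeps the transaction log small, and the `Processed` flag is one simple way to make the load restartable after a failure.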

References

Ninja-Notebooks @ GitHub
