What is COMPRESS?
If you have a large amount of data in a database that you no longer need, such as deleted records that you want to archive but are no longer used in your application, then it would be nice if this data was stored in compressed form. How you can achieve this with the help of the COMPRESS
function, you can see today in our Dojo.
How can you use COMPRESS?
What does COMPRESS do? The COMPRESS function converts the input value into a byte array of the type VARBINARY(MAX)
, compressed with the GZIP algorithm. This means that you can pass a value, such as a string, to this function, and that value will be compressed and returned. How can you use this? To demonstrate this, we use the WideWorldImporters database and work with the Sales Invoicelines. First we create a table where we can archive deleted data:
DROP TABLE IF EXISTS [Sales].[DeletedInvoiceLines] SELECT TOP 0 [InvoiceLineID] ,[InvoiceID] ,[StockItemID] ,[Description] ,[PackageTypeID] ,[Quantity] ,[UnitPrice] ,[TaxRate] ,[TaxAmount] ,[LineProfit] ,[ExtendedPrice] ,[LastEditedBy] ,[LastEditedWhen] INTO [Sales].[DeletedInvoiceLines] FROM [Sales].[InvoiceLines]
Of course, this is not the best way to create an archive table, as you should normally worry about filegroups and locations, but for our current demonstration, this table will suffice.
If you want to delete data and archive it in the new table, you can do so as follows:
DELETE FROM [Sales].[InvoiceLines] OUTPUT Deleted.[InvoiceLineID] ,Deleted.[InvoiceID] ,Deleted.[StockItemID] ,Deleted.[Description] ,Deleted.[PackageTypeID] ,Deleted.[Quantity] ,Deleted.[UnitPrice] ,Deleted.[TaxRate] ,Deleted.[TaxAmount] ,Deleted.[LineProfit] ,Deleted.[ExtendedPrice] ,Deleted.[LastEditedBy] ,Deleted.[LastEditedWhen] INTO Sales.DeletedInvoiceLines WHERE InvoiceID < 50000
So now you have deleted the data from the productive table but not compressed it, how can we manage to compress the data? First, we copy the archived data back into the original table:
INSERT INTO Sales.InvoiceLines SELECT * FROM Sales.DeletedInvoiceLines
Unfortunately, it is now the case that not all data types are suitable for compression with COMPRESS
. Therefore, the COMPRESS
function only allows values of the following data types as arguments:
binary(n)
char(n)
nchar(n)
nvarchar(max)
nvarchar(n)
varbinary(max)
varbinary(n)
varchar(max)
How can we take advantage of this? First, we convert the data of the InvoiceLines table into JSON objects. With a SELECT
statement, for example, this would work as follows:
SELECT CONCAT('{', ' "InvoiceLineID" : ', [InvoiceLineID], ',"InvoiceID" : ', [InvoiceID], ',"StockItemID" : ', [StockItemID], ',"Description" : ', '"', [Description], '"', ',"PackageTypeID" : ', [PackageTypeID], ',"Quantity" : ', [Quantity], ',"UnitPrice" : ', [UnitPrice], ',"TaxRate" : ', [TaxRate], ',"TaxAmount" : ', [TaxAmount], ',"LineProfit" : ', [LineProfit], ',"ExtendedPrice" : ', [ExtendedPrice], ',"LastEditedBy" : ', [LastEditedBy], ',"LastEditedWhen" : ', '"', [LastEditedWhen], '"', '}' ) FROM Sales.InvoiceLines
This definition is not exactly elegant, but it serves its purpose. Now let's work with these objects in our delete statement by compressing this column and writing it to an archive table. We create this archive table as follows:
Now we write the deleted data into this table:
DELETE FROM [Sales].[InvoiceLines] OUTPUT COMPRESS( CONCAT('{', ' "InvoiceLineID" : ', Deleted.[InvoiceLineID], ',"InvoiceID" : ', Deleted.[InvoiceID], ',"StockItemID" : ', Deleted.[StockItemID], ',"Description" : ', '"', Deleted.[Description], '"', ',"PackageTypeID" : ', Deleted.[PackageTypeID], ',"Quantity" : ', Deleted.[Quantity], ',"UnitPrice" : ', Deleted.[UnitPrice], ',"TaxRate" : ', Deleted.[TaxRate], ',"TaxAmount" : ', Deleted.[TaxAmount], ',"LineProfit" : ', Deleted.[LineProfit], ',"ExtendedPrice" : ', Deleted.[ExtendedPrice], ',"LastEditedBy" : ', Deleted.[LastEditedBy], ',"LastEditedWhen" : ', '"', Deleted.[LastEditedWhen], '"', '}' )) INTO Sales.DeletedInvoiceLinesCompressed WHERE InvoiceID < 50000
For this table, the chosen procedure is not very efficient, as you can easily confirm by using the sp_spaceused
function, since many of the columns, namely all numerical values, are not compressed here and the overhead caused by the JSON format is enormous. But if you do this with tables that contain many VARCHAR
fields, it looks different. As an alternative approach, let's just choose to compress the description column. For this we create the following table:
SELECT TOP 0 [InvoiceLineID] ,[InvoiceID] ,[StockItemID] ,COMPRESS([Description]) as DescriptionCompressed ,[PackageTypeID] ,[Quantity] ,[UnitPrice] ,[TaxRate] ,[TaxAmount] ,[LineProfit] ,[ExtendedPrice] ,[LastEditedBy] ,[LastEditedWhen] INTO Sales.DeletedInvoiceLinesCompressed_V2 FROM Sales.InvoiceLines
Now we adjust the DELETE
statement as follows
DELETE FROM [Sales].[InvoiceLines] OUTPUT Deleted.[InvoiceLineID] ,Deleted.[InvoiceID] ,Deleted.[StockItemID] ,COMPRESS(Deleted.[Description]) ,Deleted.[PackageTypeID] ,Deleted.[Quantity] ,Deleted.[UnitPrice] ,Deleted.[TaxRate] ,Deleted.[TaxAmount] ,Deleted.[LineProfit] ,Deleted.[ExtendedPrice] ,Deleted.[LastEditedBy] ,Deleted.[LastEditedWhen] INTO Sales.DeletedInvoiceLinesCompressed_V2 WHERE InvoiceID < 50000
If you now compare the three tables with sp_spaceused
, you will see that the data occupies 28,360 KB of memory in the original variant. In the version with the full JSON and a lot of overhead, however, you get 53,544 KB of space, which is caused by the poor compression of the numerical values and the overhead of the JSON structure. The variant in which only the description is compressed, however, comes to 27,920 KB of storage space for the data. Here a reduction of the data by about 2% has taken place by COMPRESS
.
To retrieve the compressed data again, we can use the DECOMPRESS
function, but be careful: since DECOMPRESS
does not know which value is "hidden" in the compressed byte array, you will get a VARBINARY(MAX)
back here, which you have to CAST into the correct data value if you want to use it further:
SELECT [InvoiceId] ,[InvoiceLineId] ,[DescriptionCompressed] ,CAST(DECOMPRESS([DescriptionCompressed]) as nvarchar(100)) FROM Sales.DeletedInvoiceLinesCompressed_V2 WHERE InvoiceID = 1
In the Dojo you had seen that a COMPRESS
is not always and everywhere useful. Furthermore you probably noticed at the running time of the examples that you buy a reduction of the memory and the IO-load of your system here with CPU-load, because calling COMPRESS
line by line doesn't make your queries "sporty". So consider whether the reduction in memory is worth the overhead of the query.