T-SQL Ninja #47

COMPRESS

What is COMPRESS?

If you have a large amount of data in a database that you no longer need, such as deleted records that you want to archive but are no longer used in your application, then it would be nice if this data was stored in compressed form. How you can achieve this with the help of the COMPRESS function, you can see today in our Dojo.

How can you use COMPRESS?

What does COMPRESS do? The COMPRESS function converts the input value into a byte array of the type VARBINARY(MAX), compressed with the GZIP algorithm. This means that you can pass a value, such as a string, to this function, and that value will be compressed and returned. How can you use this? To demonstrate this, we use the WideWorldImporters database and work with the Sales Invoicelines. First we create a table where we can archive deleted data:

DROP TABLE IF EXISTS [Sales].[DeletedInvoiceLines]

SELECT TOP 0 [InvoiceLineID]
      ,[InvoiceID]
      ,[StockItemID]
      ,[Description]
      ,[PackageTypeID]
      ,[Quantity]
      ,[UnitPrice]
      ,[TaxRate]
      ,[TaxAmount]
      ,[LineProfit]
      ,[ExtendedPrice]
      ,[LastEditedBy]
      ,[LastEditedWhen]
INTO [Sales].[DeletedInvoiceLines]
FROM [Sales].[InvoiceLines]

Of course, this is not the best way to create an archive table, as you should normally worry about filegroups and locations, but for our current demonstration, this table will suffice.

If you want to delete data and archive it in the new table, you can do so as follows:

DELETE 
FROM [Sales].[InvoiceLines]
OUTPUT 
     Deleted.[InvoiceLineID]
    ,Deleted.[InvoiceID]
    ,Deleted.[StockItemID]
    ,Deleted.[Description]
    ,Deleted.[PackageTypeID]
    ,Deleted.[Quantity]
    ,Deleted.[UnitPrice]
    ,Deleted.[TaxRate]
    ,Deleted.[TaxAmount]
    ,Deleted.[LineProfit]
    ,Deleted.[ExtendedPrice]
    ,Deleted.[LastEditedBy]
    ,Deleted.[LastEditedWhen]
INTO Sales.DeletedInvoiceLines
WHERE InvoiceID < 50000

So now you have deleted the data from the productive table but not compressed it, how can we manage to compress the data? First, we copy the archived data back into the original table:

INSERT INTO Sales.InvoiceLines 
SELECT * FROM Sales.DeletedInvoiceLines

Unfortunately, it is now the case that not all data types are suitable for compression with COMPRESS. Therefore, the COMPRESS function only allows values of the following data types as arguments:

  • binary(n)
  • char(n)
  • nchar(n)
  • nvarchar(max)
  • nvarchar(n)
  • varbinary(max)
  • varbinary(n)
  • varchar(max)

How can we take advantage of this? First, we convert the data of the InvoiceLines table into JSON objects. With a SELECT statement, for example, this would work as follows:

SELECT
    CONCAT('{',
        ' "InvoiceLineID" : ', [InvoiceLineID], 
        ',"InvoiceID" : ', [InvoiceID],
        ',"StockItemID" : ', [StockItemID],
        ',"Description" : ', '"', [Description], '"',
        ',"PackageTypeID" : ', [PackageTypeID],
        ',"Quantity" : ', [Quantity],
        ',"UnitPrice" : ', [UnitPrice],
        ',"TaxRate" : ', [TaxRate],
        ',"TaxAmount" : ', [TaxAmount],
        ',"LineProfit" : ', [LineProfit], 
        ',"ExtendedPrice" : ', [ExtendedPrice], 
        ',"LastEditedBy" : ', [LastEditedBy], 
        ',"LastEditedWhen" : ', '"', [LastEditedWhen], '"',
        '}'
    )
FROM Sales.InvoiceLines

This definition is not exactly elegant, but it serves its purpose. Now let's work with these objects in our delete statement by compressing this column and writing it to an archive table. We create this archive table as follows:

Now we write the deleted data into this table:

DELETE 
FROM [Sales].[InvoiceLines]
OUTPUT 
     COMPRESS(
       CONCAT('{',
        ' "InvoiceLineID" : ', Deleted.[InvoiceLineID], 
        ',"InvoiceID" : ', Deleted.[InvoiceID],
        ',"StockItemID" : ', Deleted.[StockItemID],
        ',"Description" : ', '"', Deleted.[Description], '"',
        ',"PackageTypeID" : ', Deleted.[PackageTypeID],
        ',"Quantity" : ', Deleted.[Quantity],
        ',"UnitPrice" : ', Deleted.[UnitPrice],
        ',"TaxRate" : ', Deleted.[TaxRate],
        ',"TaxAmount" : ', Deleted.[TaxAmount],
        ',"LineProfit" : ', Deleted.[LineProfit], 
        ',"ExtendedPrice" : ', Deleted.[ExtendedPrice], 
        ',"LastEditedBy" : ', Deleted.[LastEditedBy], 
        ',"LastEditedWhen" : ', '"', Deleted.[LastEditedWhen], '"',
        '}'
    ))
INTO Sales.DeletedInvoiceLinesCompressed
WHERE InvoiceID < 50000

For this table, the chosen procedure is not very efficient, as you can easily confirm by using the sp_spaceused function, since many of the columns, namely all numerical values, are not compressed here and the overhead caused by the JSON format is enormous. But if you do this with tables that contain many VARCHAR fields, it looks different. As an alternative approach, let's just choose to compress the description column. For this we create the following table:

SELECT TOP 0
     [InvoiceLineID]
    ,[InvoiceID]
    ,[StockItemID]
    ,COMPRESS([Description]) as DescriptionCompressed
    ,[PackageTypeID]
    ,[Quantity]
    ,[UnitPrice]
    ,[TaxRate]
    ,[TaxAmount]
    ,[LineProfit]
    ,[ExtendedPrice]
    ,[LastEditedBy]
    ,[LastEditedWhen]
INTO Sales.DeletedInvoiceLinesCompressed_V2
FROM Sales.InvoiceLines

Now we adjust the DELETE statement as follows

DELETE 
FROM [Sales].[InvoiceLines]
OUTPUT 
     Deleted.[InvoiceLineID]
    ,Deleted.[InvoiceID]
    ,Deleted.[StockItemID]
    ,COMPRESS(Deleted.[Description])
    ,Deleted.[PackageTypeID]
    ,Deleted.[Quantity]
    ,Deleted.[UnitPrice]
    ,Deleted.[TaxRate]
    ,Deleted.[TaxAmount]
    ,Deleted.[LineProfit]
    ,Deleted.[ExtendedPrice]
    ,Deleted.[LastEditedBy]
    ,Deleted.[LastEditedWhen]
INTO Sales.DeletedInvoiceLinesCompressed_V2
WHERE InvoiceID < 50000

If you now compare the three tables with sp_spaceused, you will see that the data occupies 28,360 KB of memory in the original variant. In the version with the full JSON and a lot of overhead, however, you get 53,544 KB of space, which is caused by the poor compression of the numerical values and the overhead of the JSON structure. The variant in which only the description is compressed, however, comes to 27,920 KB of storage space for the data. Here a reduction of the data by about 2% has taken place by COMPRESS.

To retrieve the compressed data again, we can use the DECOMPRESS function, but be careful: since DECOMPRESS does not know which value is "hidden" in the compressed byte array, you will get a VARBINARY(MAX) back here, which you have to CAST into the correct data value if you want to use it further:

SELECT 
     [InvoiceId]
    ,[InvoiceLineId]
    ,[DescriptionCompressed]
    ,CAST(DECOMPRESS([DescriptionCompressed]) as nvarchar(100))
FROM Sales.DeletedInvoiceLinesCompressed_V2
WHERE InvoiceID = 1

In the Dojo you had seen that a COMPRESS is not always and everywhere useful. Furthermore you probably noticed at the running time of the examples that you buy a reduction of the memory and the IO-load of your system here with CPU-load, because calling COMPRESSline by line doesn't make your queries "sporty". So consider whether the reduction in memory is worth the overhead of the query.

References

Ninja-Notebooks @ GitHub

Leave a Reply

Your email address will not be published. Required fields are marked *