ColumnStores vs. RowStores: I/O Performance and Compression

One of the major benefits associated with ColumnStore  indexes is increased I/O performance.  Below are two of the more common explanations for this bump in I/O performance…

ColumnStores only retrieve data (from storage) necessary to satisfy the query

This notion is inherent in the column-oriented architecture…data pages in column-oriented storage only contain data from a single column.  Therefore, data pages containing only the columns referenced by the query need to be pulled from storage.  In row-oriented storage, data pages contain entire table rows.  So the data pages containing data for all columns in the row have to be retrieved from storage.  Then, the subset of columns referenced by the query can be extracted (which has an additional cost…no free lunches here) and operated on.  So to put it another way…the I/O for column oriented data isn’t “faster”, but just more efficient.

ColumnStores have higher compression ratios than RowStore counterparts

This explanation stems from the assumption that column-oriented data pages contain data of the same nature/type and are therefore better suited for compression algorithms, such as run-length encoding.  Below is a brief illustration of the differences in compression…

Here’s a screenshot of the setup:

There are two identical table structures:

  1. NC_MediumTable – used for base case (traditional NonClustered index)
  2. CS_MediumTable – used for test case (new ColumnStoreIndex)

Each table has 1 million rows of dummy data and a clusterd PK on [Id].  Columns [k1] through [k3] represent FKs  (think: star-schema fact table) with 100, 10000, and 100000 unique values respectively…and [c1] is just a random column I had in there with 10 unique values.

Next I created a traditional NonClustered index on NC_MediumTable and a ColumnStore index on CS_MediumTable:

[sql]CREATE NONCLUSTERED INDEX NCIDX_NC_MediumTable_001
ON dbo.NC_MediumTable (k1,k2,k3) INCLUDE (c1)
;
CREATE NONCLUSTERED COLUMNSTORE INDEX NCCSIDX_CS_MediumTable
ON dbo.CS_MediumTable (id, k1, k2, k3, c1)
;[/sql]

And now, here’s a quick look at the index sizes…

As you can see in the image above, the ColumnStore is MUCH smaller than the traditional NonClustered index.  And here’s what it looks like when the base table is a heap instead of a clustered index…

As you can see, both indexes increased in size…nearly 2x for the ColumnStore!

But what if we compress the row-store?  Let’s add 2 more indexes to the NC_MediumTable; one with page compression, the other with row compression:

[sql]CREATE NONCLUSTERED INDEX NCIDX_NC_MediumTable_001_PC
ON dbo.NC_MediumTable (k1,k2,k3) INCLUDE (c1)
WITH (DATA_COMPRESSION = PAGE)
;
CREATE NONCLUSTERED INDEX NCIDX_NC_MediumTable_001_RC
ON dbo.NC_MediumTable (k1,k2,k3) INCLUDE (c1)
WITH (DATA_COMPRESSION = ROW)
;[/sql]

Not much gained from the row-compressed index, but wow – check out the size of the page-compressed index.  It’s almost as small as the ColumnStore index.  That will definitely help close the gap on I/O performance, but there’s still going to be the cost associated with extracting the subset of columns after the row-oriented page is decompressed.

While this wasn’t by any measure, an exhaustive comparison, I hope  you found the compression demo interesting as we will be revisting it down the road.

Leave a Reply