Columnstore: A massive boost for data warehousing and reporting

February 20, 2012

One of the most exciting features in the upcoming SQL Server 2012 is most definitely ColumnStore indexes. This feature uses technology initially developed for PowerPivot to deliver a huge performance boost to queries running off a data warehouse, and I do mean massive!

If we look at a typical data warehouse workload, you have large amounts of data, especially in the fact tables. Apart from the ETL window, you are normally querying rather than updating these tables, so it is essentially a read-only workload. This is perfect for the current implementation of ColumnStore.

The ColumnStore feature actually brings us not just the new index, where data is stored by column (and potentially compressed), but also a new vector-based query execution capability called “batch processing”. The latter is also responsible in a big way for the stunning performance gains.

So, as always, I’ve built a little demo using AdventureWorks so that you can try this yourself. I increased the size of the fact table to about 13 million rows by duplicating the data. I brought in only 4 dimension tables, which after testing was sufficient for demonstrating the difference in performance when running the queries.
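If you want to reproduce the setup, here is a minimal sketch of enlarging the fact table. The original post does not show this step, and FactInternetSalesBig is my placeholder name for the enlarged copy:

-- Make a heap copy of the fact table (no primary key, so duplicates are fine).
SELECT * INTO dbo.FactInternetSalesBig FROM dbo.FactInternetSales;
GO
-- Double the row count on each run; GO n is the SSMS batch-repeat shortcut.
INSERT INTO dbo.FactInternetSalesBig
SELECT * FROM dbo.FactInternetSalesBig;
GO 8  -- adjust the repeat count until the table is big enough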

The sample query
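The query itself appears only as a screenshot in the original post, but a representative star-join aggregation over the enlarged fact table would look something like this (dimension and column names follow the AdventureWorksDW schema; treat this as a sketch, not the exact query):

SELECT  d.CalendarYear,
        p.EnglishProductName,
        SUM(f.SalesAmount)   AS TotalSales,
        SUM(f.OrderQuantity) AS TotalQuantity
FROM    dbo.FactInternetSalesBig AS f
JOIN    dbo.DimDate    AS d ON f.OrderDateKey = d.DateKey
JOIN    dbo.DimProduct AS p ON f.ProductKey   = p.ProductKey
WHERE   d.CalendarYear = 2007
GROUP BY d.CalendarYear, p.EnglishProductName;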

I then created a copy of the fact table. I wanted to create the columnstore index on the copy and then compare query execution times against the original table.

The index itself is very easy to create. You can script it or use the GUI, just as you would a normal index. For this example, I simply created the columnstore index on all columns of the fact table.

Using the GUI makes it really simple
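If you prefer a script, the syntax mirrors a normal nonclustered index. A sketch, with an illustrative index name and column list rather than the exact ones from the post:

-- Create the columnstore index; include every column your queries touch.
CREATE NONCLUSTERED COLUMNSTORE INDEX NCCI_FactInternetSalesBig
ON dbo.FactInternetSalesBig
(
    ProductKey, OrderDateKey, DueDateKey, ShipDateKey,
    CustomerKey, PromotionKey, CurrencyKey, SalesTerritoryKey,
    OrderQuantity, UnitPrice, SalesAmount, TaxAmt, Freight
);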

So I ran the first query, which you can see in the first image. I ran 2 DBCC commands before the query. The first, DBCC FREEPROCCACHE, flushes out already-cached query plans. The second, DBCC DROPCLEANBUFFERS, clears pages from the buffer pool. Running these ensures a more scientific analysis of query performance when testing.
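For reference, the two commands (run these only on a test server, as they flush caches server-wide):

DBCC FREEPROCCACHE;     -- remove all cached query plans
DBCC DROPCLEANBUFFERS;  -- remove clean pages from the buffer pool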
The results are as follows:

Query 1

Query                   Execution time   CPU time
Normal (1st run)        1:19             4891 ms
Normal (2nd run)        1:13             3890 ms
ColumnStore (1st run)   0:04             1264 ms
ColumnStore (2nd run)   0:04             1218 ms

Your eyes do not deceive you. The query execution time in this example went from 79 seconds down to an amazing 4 seconds! The CPU time is roughly a quarter of that of the first query, which perhaps illustrates that CPU was not the big bottleneck in the first query; reading the data from disk was more of one. It also illustrates the advantage that “batch processing” brings.

So, I then modified the query, removing the entire WHERE clause. I wanted to see the effect of less filtering and a larger data set. The results:

Query 2

Query                   Execution time   CPU time
Normal (1st run)        1:10             27375 ms
Normal (2nd run)        1:10             27141 ms
ColumnStore (1st run)   0:04             3233 ms
ColumnStore (2nd run)   0:04             3139 ms

The query off the “normal” table improves slightly in execution time, with less filtering to do (but a lot more CPU being used), while the query off the ColumnStore table holds steady at about 4 seconds.

Now, if you set up a test scenario like this, there is a whole host of further tests you can do. Some things I found interesting: for queries with lots of aggregations (e.g. SUM), ColumnStore delivered particularly fast results compared to the normal table. Also very interesting: there were queries where the benefit was very small, or where the engine decided not to use the ColumnStore index at all. But overall, I was blown away by the performance gains, and really, this is one of the big features of SQL 2012 for me.
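As a sketch of both observations, here is the kind of aggregation query where I saw the biggest gains, alongside the SQL Server 2012 hint that forces the optimizer to bypass the columnstore index, which is handy for side-by-side timing on the same table (table name is still my placeholder):

-- Heavy aggregation: the sweet spot for columnstore and batch processing.
SELECT  f.SalesTerritoryKey,
        SUM(f.SalesAmount) AS TotalSales,
        COUNT_BIG(*)       AS RowCnt
FROM    dbo.FactInternetSalesBig AS f
GROUP BY f.SalesTerritoryKey;

-- The same query, forced back to row-mode execution for comparison.
SELECT  f.SalesTerritoryKey,
        SUM(f.SalesAmount) AS TotalSales,
        COUNT_BIG(*)       AS RowCnt
FROM    dbo.FactInternetSalesBig AS f
GROUP BY f.SalesTerritoryKey
OPTION (IGNORE_NONCLUSTERED_COLUMNSTORE_INDEX);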

Now, currently there is one limitation to using ColumnStore: the table becomes read-only when the index is built. Is this a problem? Well, the current implementation of this feature is intended for data warehousing workloads. In a typical scenario, you would drop the index, perform the ETL, then rebuild the index. Creating the index on my table above took on average about 2 minutes, roughly the same time as creating a clustered index. I don’t really see a problem with this in most scenarios; alternatively, you can use partition switching to add data to a fact table with a ColumnStore index.
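The drop/load/rebuild pattern looks like this, again using my illustrative index and table names from above:

-- The table is read-only while the columnstore index exists, so drop it first.
DROP INDEX NCCI_FactInternetSalesBig ON dbo.FactInternetSalesBig;

-- ... run the ETL load here; the table is writable again ...

-- Rebuild the index once the load completes.
CREATE NONCLUSTERED COLUMNSTORE INDEX NCCI_FactInternetSalesBig
ON dbo.FactInternetSalesBig
(
    ProductKey, OrderDateKey, DueDateKey, ShipDateKey,
    CustomerKey, PromotionKey, CurrencyKey, SalesTerritoryKey,
    OrderQuantity, UnitPrice, SalesAmount, TaxAmt, Freight
);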

Of course, you can’t use this on an OLTP database, but perhaps in future we will be allowed to do so. Now that would really be something to look forward to. In the meantime, however, I predict that many companies will enjoy performance gains by using ColumnStore on their data marts and data warehouses. What do you think?

UPDATE: For news on xVelocity on OLTP, see here.

Categories: SQL Server