As you will have heard by now, some exciting announcements were made at the SQL PASS Summit. One of those was Hekaton, the new in-memory database engine for SQL Server. I've found some more information on Hekaton.
Before I move on to that though, remember that xVelocity ColumnStore debuted in SQL Server 2012, and the performance results are already impressive – I did a blog post on ColumnStore here.
But that currently only applies to Data Warehouse workloads. Now we have "Hekaton", which promises to bring the xVelocity technology to OLTP workloads. What we've heard at PASS about Hekaton:
- Will ship with the next major version of SQL Server
- Will be integrated into SQL Server – no separate interface or programming model
- Basically allows you to place your most frequently used OLTP tables in memory, in order to boost reads and writes (that's a very simplified explanation of it)
- You can keep your other OLTP tables on normal storage, so a database application can be designed to run with a hybrid storage mechanism.
I would imagine that at some interval the in-memory table data is written to disk. According to Microsoft, some customers on TAP are testing this at the moment, and are seeing anywhere between 5x and 50x throughput gains. While I don't like repeating such claims, we did see evidence of similar speed boosts with ColumnStore, so this is very exciting.
In the meantime, start ordering more RAM for your servers… :)
One of the most exciting features in the upcoming SQL Server 2012 is most definitely ColumnStore Indexes. This feature uses technology initially developed for PowerPivot to deliver a massive performance boost to queries running against a data warehouse, and I do mean massive!
If we look at your typical Data Warehouse workload, you have large amounts of data, especially in the fact tables. Apart from the ETL window, you are normally querying these tables rather than updating them, so it is essentially a read-only workload. This is perfect for the current implementation of ColumnStore.
The ColumnStore feature actually brings us not just the new index, where data is stored by column (and potentially compressed), but also a new vector-based query execution capability called "batch processing". The latter is responsible in a big way for the stunning performance gains.
So, as always, I've built a little demo using AdventureWorks, so that you can try this yourself. I increased the size of the fact table to about 13 million rows by duplicating the data. I brought in only 4 dimension tables, which after testing proved sufficient for demonstrating the difference in performance when running the queries.
I then created a copy of the fact table, so that I could create the columnstore index on the copy and compare query execution times against the original table.
The index itself is very easy to create – you can script it or use the GUI, just as you would a normal index. For this example, I simply created the columnstore index on all columns of the fact table.
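If you'd rather script it, here's a minimal sketch of what that looks like (the table and column names are illustrative stand-ins for my enlarged copy of the AdventureWorks fact table – substitute your own):

```sql
-- Create a nonclustered columnstore index covering all columns of the fact table.
-- Table and column names are illustrative; adjust to your own schema.
CREATE NONCLUSTERED COLUMNSTORE INDEX NCCI_FactInternetSalesBig
ON dbo.FactInternetSalesBig
(
    ProductKey,
    OrderDateKey,
    CustomerKey,
    SalesTerritoryKey,
    OrderQuantity,
    UnitPrice,
    SalesAmount
);
```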
So I ran the first query. I'm running 2 DBCC commands before the query. The first, DBCC FREEPROCCACHE, flushes out already-cached query plans. The second, DBCC DROPCLEANBUFFERS, clears buffers from the buffer pool. Running these ensures that we get a more scientific analysis of query performance when testing.
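In script form:

```sql
-- Flush cached query plans and clear the buffer pool so each test run starts cold.
DBCC FREEPROCCACHE;
DBCC DROPCLEANBUFFERS;
```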
The results are as follows:
| Query | Execution time | CPU Time |
| --- | --- | --- |
| Normal (1st run) | 1:19 | 4891 ms |
| Normal (2nd run) | 1:13 | 3890 ms |
| ColumnStore (1st run) | 0:04 | 1264 ms |
| ColumnStore (2nd run) | 0:04 | 1218 ms |
Your eyes do not deceive you. The query execution time in this example went from 79 seconds down to an amazing 4 seconds! The CPU time is roughly a quarter of that of the first query. This perhaps illustrates that CPU was not a huge bottleneck in the first query; reading the data from disk was more of one. It also illustrates the advantage that "batch processing" brings.
So, I then modified the query, removing the entire WHERE clause. I wanted to see the effect of less filtering and a larger data set. The results:
| Query | Execution time | CPU Time |
| --- | --- | --- |
| Normal (1st run) | 1:10 | 27375 ms |
| Normal (2nd run) | 1:10 | 27141 ms |
| ColumnStore (1st run) | 0:04 | 3233 ms |
| ColumnStore (2nd run) | 0:04 | 3139 ms |
The query off the "normal" table improves slightly in execution time, with less filtering to do (but a lot more CPU being used), while the query off the ColumnStore table holds steady at about 4 seconds.
Now if you set up a test scenario like this, there is a whole host of further tests that you can do. Some things that I found interesting: queries with lots of aggregations (e.g. SUM) delivered particularly fast results with ColumnStore compared to the normal table. Also very interesting: there were queries where the benefit was very small, or where the engine decided not to use the ColumnStore index at all. But overall, I was blown away by the performance gains, and really, this is one of the big features of SQL 2012 for me.
Now currently there is one limitation to using ColumnStore – the table becomes read-only once the index is built. Is this a problem? Well, the current implementation of this feature is aimed at Data Warehousing workloads. In a typical scenario, you would drop the index, perform the ETL, then rebuild the index. Creating the index on my table above took on average about 2 minutes, roughly the same time as creating a clustered index. I don't really see a problem with this in most scenarios; alternatively, you can use partition switching to add data to a fact table with a ColumnStore index.
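In script form, the drop/load/rebuild pattern might look something like this (index and table names as in the earlier sketch, so still illustrative):

```sql
-- Make the fact table writable again by dropping the columnstore index...
DROP INDEX NCCI_FactInternetSalesBig ON dbo.FactInternetSalesBig;

-- ...perform the ETL load here...
-- INSERT INTO dbo.FactInternetSalesBig (...) SELECT ... FROM <staging table>;

-- ...then rebuild the columnstore index.
CREATE NONCLUSTERED COLUMNSTORE INDEX NCCI_FactInternetSalesBig
ON dbo.FactInternetSalesBig
(ProductKey, OrderDateKey, CustomerKey, SalesTerritoryKey,
 OrderQuantity, UnitPrice, SalesAmount);
```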
Of course, you can't use this on an OLTP database, but perhaps in future we will be allowed to do so. Now that would really be something to look forward to. In the meantime, however, I predict that many companies will enjoy performance gains by using ColumnStore on their Data Marts and Data Warehouses. What do you think?
UPDATE: For news on xVelocity on OLTP, see here.
We're only days away from Tech Ed Africa. I've been busy with preparations (I'm doing 4 sessions this year), so I haven't been blogging much lately. I'll pick that up after Tech Ed – in fact, there's plenty of my Tech Ed content that I want to share here. Of course, SQL Denali will be the focus of everything.
With Denali CTP 3 out, there isn't a moment to waste – there are new features to try out! I ran the download over the weekend – it came in at about 2.3GB. There seems to be a download manager as part of the process. When you run the .EXE, it will extract all the files into the usual format we know from previous SQL installations. The installation program is the familiar one that we have had since SQL Server 2005.
The process is similar as well: you choose the "Evaluation" Edition and off you go. All is normal until you come to the "Feature Selection" screen, where SQL enthusiasts will suddenly become quite excited. Here there are some goodies that we haven't seen before.
There are also 2 new services that will be installed with Denali, for the Distributed Replay feature. Apparently this feature allows you to perform analysis and testing of your SQL apps before upgrading – sounds good and I’ll blog about it once I’ve played with it.
If you've never seen Denali before, the first surprise that might wake you up a bit is the Analysis Services configuration screen. There's a new option called "Tabular Mode", for when you want to deploy BISM models to the Analysis Server. You can only choose one mode, so choose wisely…
Funny thing, I didn't even have to reboot to start using it (see, Linux geeks?). The first thing that you'll notice is that Management Studio is a bit… blue.
Of course, you now get a version of Visual Studio 2010 with Denali. It's the new BIDS. Right, off to play with the new stuff. The first thing I want to tackle is "Column Based Storage"…
Here's a quick tip I thought I'd share with you. Those of you who are using PowerPivot will know that slicers are one of the best features of Excel 2010 and PowerPivot.
I find that some people aren't aware of the slicer settings tab. Here you can tweak the look and feel of the slicer pane, but there is also a setting that can improve performance. To access the tab, click on the slicer you want to edit, preferably on the name of the slicer itself. You will then notice that the "Slicer Tools" option appears on the Office Ribbon at the top.
Now an avalanche of options falls out of the ribbon, ranging from styles and sizing to renaming the slicer. The two that you might be interested in are the "Slicer Settings" and "PivotTable Connections" buttons, both on the left. When you click on PivotTable Connections, you get a little pop-up box. Here you can define which source tables the slicer loads data from. This is useful when your back-end data has changed.
When you click on Slicer Settings, there are a few more interesting options. You can rename and re-caption the slicer, and set the sort order. The one that is really interesting, however, is the "Visually indicate items with no data" option.
When you have a large number of sizeable slicers, you may experience some performance issues (even with the mighty VertiPaq engine). You can therefore uncheck this option, and it should help with the performance of your workbook when slicing and dicing.
Since SQL Server 2008 launched, we've had an excellent tool called Resource Governor, which has allowed us to control how much system resource is being used. This isn't limited to resources used by the SQL Server instance itself – you can actually create resource pools, which are then linked to applications or even users.
I don’t want to go into details of how to set all of this up – for that you can simply go here.
What I did find out the other day was quite interesting, though. I had a customer who was running a DBCC CHECKDB command as part of a batch script. This was checking the integrity of a very large database, and this one command would use up all the RAM on the server.
The DBA had enabled Resource Governor, created the resource pool correctly, and set the maximum memory to be used to an amount lower than the system maximum. So this shouldn't have been happening.
Scanning the documentation around Resource Governor, I found the following extract:
“Examples in SQL Server 2008 that are not managed by the Resource Governor are extended stored procedures, linked servers, COM objects, and certain SQLCLR objects (depending on garbage collection). In addition, the Resource Governor does not manage memory used by the buffer pool for data pages”
A Potential Solution
So the DBCC command falls outside the scope of what Resource Governor controls (as I understand it). However, we still had the older option of memory limits available to us: the 'max server memory' setting. This allows you to set the maximum amount of server memory an instance of SQL Server can utilize. It's very easy to set up as well; an example would be:
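```sql
-- Cap this instance's memory at 4 GB. The value is illustrative; size it
-- for your own server, leaving headroom for the OS and other processes.
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'max server memory (MB)', 4096;
RECONFIGURE;
```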
Keep in Mind
When you set this limit, SQL Server will not use all of it immediately, but gradually. If you are running multiple instances of SQL Server on a machine, you can use this setting in every instance to control memory usage, in conjunction with the min server memory setting.
Just keep in mind the following, from MSDN: "Note that the max server memory option only limits the size of the SQL Server buffer pool, and does not limit a remaining unreserved memory area that SQL Server leaves for allocations of other components such as extended stored procedures, COM objects, non-shared DLLs, EXEs, and MAPI components."
On YouTube you'll now find a dedicated channel with videos on Business Intelligence. Check it out here:
WARNING : Some videos may contain scenes of seriously dodgy humour…
TempDB is not normally something that SQL developers think about – after all, it's a "background" feature of SQL Server that has nothing to do with the T-SQL code or stored procedure that they are working on.
However, that assessment is incorrect. TempDB is a vital cog in the engine (or should that be gearbox?) of SQL Server, and if your code places too much strain on TempDB, it will definitely lead to performance and scalability issues later on.
While you might not be too concerned with the setup of TempDB as a developer (e.g. whether there are multiple data files on multiple disks), some of your code will have a direct impact on it, the most obvious case being when you use temp tables. If you are the type of developer who decides that it's a good thing to create temporary tables whenever possible, remember that the performance you get in the development environment is not the performance you will get in production. On the production server, TempDB will be shared with other databases.
Not only will this impact your code, but if you are using TempDB inefficiently, it will impact other production applications too.
Things to Consider
Before you create temp tables:
- Don't use temp tables unnecessarily. Sometimes a clever statement (like a CTE) will do the job – see the sketch after this list. As far as I know, a CTE won't be materialised in TempDB if the amount of data is small enough to fit in RAM. Another option for small data sets is the table variable, although once again, if the set of data is very large, even a table variable will write to TempDB.
- If you have a set of data that will be referenced by multiple queries, why not persist it to a proper table? You can always truncate it when it's not needed anymore, and it will be stored on the disk that the database uses, not the potentially high-impact disk that TempDB uses.
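For the first bullet, a minimal sketch of the CTE alternative (assuming the AdventureWorks Sales.SalesOrderHeader table; the filter is illustrative):

```sql
-- A CTE keeps a one-off intermediate result inline instead of in a temp table.
WITH RecentSales AS
(
    SELECT SalesOrderID, OrderDate
    FROM Sales.SalesOrderHeader
    WHERE OrderDate >= '20080101'   -- illustrative filter
)
SELECT COUNT(*) AS OrderCount
FROM RecentSales;
```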
If you have to use temp tables (a sketch combining these guidelines follows the list):
- Always drop the temp table after you have used it.
- You don't have to put all the columns and rows from the permanent table into the temp table.
- Try not to use SELECT INTO to create the temp table. It takes locks on system objects while it figures out how to create the temp table.
- Consider placing a clustered index on the temp table. There will be a cost to populating it, but if it is a large data set being referenced by many operations, it might be worthwhile (in this case, however, I still prefer a permanent table).
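Putting those guidelines together, a minimal sketch (again assuming AdventureWorks; all names are illustrative):

```sql
-- Only the columns and rows we need, with an explicit CREATE rather than SELECT INTO.
CREATE TABLE #RecentSales
(
    SalesOrderID int  NOT NULL,
    OrderDate    date NOT NULL
);

INSERT INTO #RecentSales (SalesOrderID, OrderDate)
SELECT SalesOrderID, OrderDate
FROM Sales.SalesOrderHeader
WHERE OrderDate >= '20080101';

-- Optional: a clustered index if a large set will be referenced by many operations.
CREATE CLUSTERED INDEX CX_RecentSales ON #RecentSales (SalesOrderID);

-- ... work with #RecentSales ...

DROP TABLE #RecentSales;   -- always drop it when you're done
```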
The third tip from my Tech Ed deck that I'll share with you concerns the definition of tables in a SQL database. Most developers focus on writing code that either just gets the job done or that's optimized in some way. However, I've noticed that many SQL developers don't pay attention to the actual table definition, lazily using standard column definitions. Most of you might think that it's not really a big deal – how much space do you really save by narrowing columns in a SQL table definition? The answer might surprise you.
Let's take a look at a table that I created. It's very simple, holding just a primary key, a number column, and a date column.
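The definition looked something like this (a sketch – the actual names don't matter):

```sql
-- The "big" version: habitual, wider column choices.
CREATE TABLE dbo.BigTable
(
    ID        int IDENTITY(1,1) PRIMARY KEY,
    Number    int      NOT NULL,
    EntryDate datetime NOT NULL
);
```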
So an alternative would be the following definition.
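Again as a sketch:

```sql
-- The "small" version: narrower types for exactly the same data.
CREATE TABLE dbo.SmallTable
(
    ID        int IDENTITY(1,1) PRIMARY KEY,
    Number    tinyint       NOT NULL,  -- values never exceed 255
    EntryDate smalldatetime NOT NULL   -- minute precision is enough here
);
```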
The primary key stays the same. The number column changes from INT to TINYINT, since I know that the values I will insert are never greater than 255. The DATETIME field becomes a SMALLDATETIME, since I don't need the extra precision.
Now what exactly will all of this achieve? Surely this is being obsessive? Well, I wrote a script (sketched below) to populate both tables with a million rows each. The script was a WHILE loop, so the data generated would be exactly the same in each table, except that the second table would possibly have a slightly later date value (which shouldn't affect the result). The results are very interesting. Without any compression enabled, the "big" table used up 24.188MB, while the "small" table came in at 17.406MB, which is about 72% of the size, or roughly a 28% saving. Now I don't know about you, but a nearly 30% saving before compression is something I'll take any day. And this is just a table with 2 columns; it might be more on a "wider" table. If you implement this on a fact table which will grow to a billion rows, the savings become really meaningful.
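The population script was along these lines (a sketch; the modulo and date are illustrative), with sp_spaceused to compare sizes afterwards:

```sql
-- Populate both tables with identical data, one million rows each.
DECLARE @i int = 0;
WHILE @i < 1000000
BEGIN
    INSERT INTO dbo.BigTable   (Number, EntryDate) VALUES (@i % 200, GETDATE());
    INSERT INTO dbo.SmallTable (Number, EntryDate) VALUES (@i % 200, GETDATE());
    SET @i += 1;
END;

-- Compare the space used by each table.
EXEC sp_spaceused 'dbo.BigTable';
EXEC sp_spaceused 'dbo.SmallTable';
```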
Other things to consider:
- More memory is used when larger fields are queried
- More data is transported across the network
- If all tables in a database are larger, this means larger backups
- Larger backups mean longer restore times, which means longer recovery times in disasters
The NVARCHAR issue
The other issue when it comes to field definitions lies with the NVARCHAR and NCHAR types. Some developers don't know what the difference is between VARCHAR and NVARCHAR, so they just use NVARCHAR. Well, NVARCHAR stores Unicode – so if you have a user-captured field that might contain European or Asian language characters, then you need NVARCHAR. However, for text fields where you know that the data is stored in English (or is otherwise constrained), you are simply wasting a lot of space by defining the field as NVARCHAR. How much space? Try twice as much – yep, NVARCHAR uses double the space of VARCHAR – definitely something to think about.
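You can see the difference for yourself with DATALENGTH (a quick sketch):

```sql
-- NVARCHAR stores 2 bytes per character; VARCHAR stores 1 byte for this data.
DECLARE @v varchar(50)  = 'Hello';
DECLARE @n nvarchar(50) = N'Hello';

SELECT DATALENGTH(@v) AS VarcharBytes,   -- 5
       DATALENGTH(@n) AS NvarcharBytes;  -- 10
```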
Following up on my previous post on scalability, we now turn our attention to cursors. You can probably find lots of info on the web about cursors and where not to use them, but I included the topic in the presentation for the benefit of junior developers, and I'll mention here the demo I did during the presentation.
So what are the problems with cursors?
1) They are procedural. This means that programmers coming from other languages find them perfectly logical, without realising that SQL is set-based.
2) Some types of cursors lock records in the database and prevent other users from changing them. Other types create an additional copy of all the records and then work with that.
3) A cursor means a lot more code to maintain. You'll find that set-based code is normally shorter.
Let's move on to an example.
Suppose that we wanted to update a table based on values from another table. Yes, I know this is a simple example, but it will be easier to understand if you're coming from a .NET background into T-SQL development. You're probably thinking about some kind of lookup at this point, and in code terms you're thinking about implementing something like the following:
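Here's a sketch of the cursor version (the schema, table, and column names are illustrative, standing in for my modified AdventureWorksDW tables in the cursors schema):

```sql
-- Row-by-row cursor approach: one lookup and one update per customer.
DECLARE @CustomerKey int, @City nvarchar(30);

DECLARE cust_cursor CURSOR FOR
    SELECT CustomerKey FROM cursors.DimCustomers;

OPEN cust_cursor;
FETCH NEXT FROM cust_cursor INTO @CustomerKey;

WHILE @@FETCH_STATUS = 0
BEGIN
    -- One SELECT per customer to look up the geographical value...
    SELECT @City = g.City
    FROM cursors.DimGeography AS g
    JOIN cursors.DimCustomers AS c
        ON c.GeographyKey = g.GeographyKey
    WHERE c.CustomerKey = @CustomerKey;

    -- ...and one UPDATE per customer to write it back.
    UPDATE cursors.DimCustomers
    SET City = @City
    WHERE CustomerKey = @CustomerKey;

    FETCH NEXT FROM cust_cursor INTO @CustomerKey;
END;

CLOSE cust_cursor;
DEALLOCATE cust_cursor;
```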
I've used tables from AdventureWorksDW, although I modified the DimCustomers table to hold geographical information, and put them in a cursors schema. Now, the above is certainly not an example of best-practice coding. When I ran it on my machine, it took about 10 minutes to complete. To get an idea of what is actually going on, run a Profiler trace while the cursor is running. In the background, you're actually getting thousands of UPDATE and SELECT statements executing against the database. This is very inefficient compared to the set-based solution that we'll investigate in a moment. It also means a tremendous amount of IO going on.
Why this doesn’t scale
Well, typically what happens is that the cursor runs fine in development because the volumes of data are very small. In production, the volumes grow monthly and the cursor gets slower every month. Another reason is that when the cursor is running, it will cause blocking in the production environment. Let's take a look at a set-based solution. One way of writing this would be an UPDATE statement with a JOIN clause.
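Again as a sketch, using the same illustrative names:

```sql
-- One set-based UPDATE with a JOIN does the whole job in a single statement.
UPDATE c
SET c.City = g.City
FROM cursors.DimCustomers AS c
JOIN cursors.DimGeography AS g
    ON c.GeographyKey = g.GeographyKey;
```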
When I ran this on my machine, it took exactly 1 second to execute, but did the same job as the cursor. Now you see the power of set-based code. Because there is a join, the UPDATE statement can update every relevant row with the right piece of information in a single statement. That is why we see such a difference. Another thing you can do is compare the execution plans.
Once you get your head around set-based queries, you'll find it much easier to come up with set-based alternatives to cursors in your daily programming life.