Building an OLAP Farm in the Cloud with Azure Virtual Machines – Part 2
FYI: the features needed to implement the OLAP Farm architecture (Remote partitions, Remote linked measure groups, Linked dimensions) have been deprecated as of SQL 2014.
This is the second part of a series of posts on the “OLAP Farm” – which is an Analysis Services architecture that makes use of remote partitions on the backend and linked dimensions and linked measure groups on the front end in order to support solutions where datasets and latency requirements simply can’t be handled by a single processing server. It was discussed in more detail in the previous blog post – so if you’re not familiar, stop now and check it out before continuing.
Finally things are cooling down at work, wife is out of town on a “girls trip”, and I’ve got the house to myself – all the coffee, all the wings, and later all the beer. But most importantly all the time (or a lot of it) to get back to this “OLAP-Farm in the Cloud” blog series.
Quick note before diving into part 2, a bit of housekeeping…
Since we last left off, Microsoft released several new preconfigured images in the VM gallery, one of which is a SQL Server 2014 image that is “Optimized for Data Warehousing Loads” – more information here. So I have (temporarily) gotten rid of AZVM-SQL01 and replaced it with AZVM-SQLDW. One caveat to this is that the relational database server is now eating up 8 cores of the 20 at my disposal via the MSDN subscription. So once we start looking at distributed query processing in the OLAP Farm, I will probably switch back to SQL01 (from SQLDW) in order to free up a few more CPU cores for the SSAS servers.
Just to recap – in the last post, we walked through the process of provisioning our infrastructure. We created a few servers and a VNet, installed SQL Server and Analysis Services instances on the various servers, and established connectivity between AzureVMs (i.e. Site-to-Site) and from our on-premise workstation to the AzureVMs (i.e. Point-to-Site)…
- AZVM-DC01: domain controller for lab environment
- AZVM-SQLDW: SQL Server containing data source for SSAS distributed database
- AZVM-SSAS-MSTR: master SSAS instance
- AZVM-SSAS-BE01: backend server for storing, processing, and scanning remote partitions
- AZVM-SSAS-BE02: backend server for storing, processing, and scanning remote partitions
- AZVM-SSAS-QS01: query server and target for user queries
Now in this post, we’re going to add our data and measure processing performance (loading data into SSAS database) using a single SSAS server – which will serve as our baseline from which to measure incremental improvements of distributing certain processing tasks across several backend servers (i.e. AZVM-SSAS-BE01 and AZVM-SSAS-BE02).
For this exercise, we’re going to use the good-old AdventureWorksDW data set and apply Bob Duffy’s (b | t) script (here) to enlarge it a bit. What we end up with is a standard AdventureWorksDW database but with a much larger product dimension (~10GB @ 1.2mm records) and internet sales fact table (~16GB @ 120mm records).
This is a bit closer to a production-sized dataset (though still on the small end) but more importantly, it is enough data to capture bottlenecks in a trace and to show off the benefits of some common SSAS processing-related performance tuning techniques.
In the rest of this blog series, we’ll be using a very simple cube consisting of 1 fact table + 3 dimensions…
- Fact Internet Sales
- Dim Date
- Dim Product
- Dim Customer
…from which we’ll create 3 measure groups:
- Internet Sales (base measures using sum aggregation)
- Internet Customers (distinct count of customers)
- Internet Orders (distinct count of orders)
Here’s a screenshot of what this looks like in the SSAS database project:
At this point, we are ready to deploy to AZVM-SSAS-MSTR – but first we need to make a few adjustments to the server.
SSAS Server Adjustments
Initially, AZVM-SSAS-MSTR was provisioned with only 1 additional data disk. This is pretty unrealistic from a production perspective (and would definitely cause a bottleneck) so I went ahead and provisioned 7 more disks and created a striped volume across all 8 disks.
Note: There are definitely other (more complicated) ways to configure the drives for SSAS servers depending on a number of scenario-specific factors (e.g. partitioning design, measure groups, expected/tested query access patterns).
Next we can update the SSAS server configuration (via SSMS or msmdsrv.ini) to use this new volume as the DataDir for storing SSAS data and deploy the SSAS database.
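For reference, the same change can be made directly in msmdsrv.ini (the instance needs a restart for it to take effect). A minimal sketch, assuming the striped volume landed on F: with an OLAPData folder (both are assumptions for this lab):

```xml
<!-- msmdsrv.ini fragment (sketch): point the instance's data directory
     at the new striped volume. F:\OLAPData is an assumed path. -->
<ConfigurationSettings>
  <DataDir>F:\OLAPData</DataDir>
</ConfigurationSettings>
```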
Baseline Processing Performance
Before we get ahead of ourselves and start distributing the SSAS database across backend SSAS servers, we need to take a baseline of processing on a single SSAS server (i.e. AZVM-SSAS-MSTR). This gives us an opportunity to discover and work out any immediate bottlenecks to maximize performance, and helps isolate the improvement (if any) that can be fully attributed to distributing the measure groups across multiple backend servers.
The following sections will cover the results and impact of various adjustments as we work to maximize processing performance on a single server. We will be using Performance Monitor to help track, interpret, and compare the results.
Test01: Process Full, No Partitions (~18.0 minutes)
With single partitions for each measure group and zero optimizations (other than the basics such as increasing network packet size to 32767 on the SSAS data source object), it took ~18 minutes to process (via Process Full) the SSAS database.
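For clarity, that packet-size tweak lives in the connection string of the SSAS data source object. A sketch of what it ends up looking like (server name from this lab; the provider and database name are assumptions):

```
Provider=SQLNCLI11;Data Source=AZVM-SQLDW;Initial Catalog=AdventureWorksDW;Integrated Security=SSPI;Packet Size=32767
```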
Memory: from the graph below, memory pressure does not seem to be a limiting factor – though we do see 2 spikes that might eventually pose an issue as data size increases.
CPU Utilization: from the graph below it looks like we still have some processor cycles to utilize. The humps are during the heavy lifting (i.e. processing the measure group data and indexes).
Processing Threads: The graph below shows processing thread utilization. In this case we’re only using 9 threads (at max) and ~3 threads (on average) to process data into the SSAS database. This can be better – much better.
Rows Read/Write/Converted: even though we weren’t fully utilizing our resources (i.e. CPU, memory, threads, etc) we still had pretty good throughput. Focusing just on the peak, we’re converting ~840k rows/sec on average (maxing out ~868k)…not too shabby, but still plenty of room for improvement.
Test02: Process in stages, No Partitions (~17.5 minutes)
This test is identical to the last – only difference being that we’ve chopped up the processing into 3 stages: dimensions (via Process Full), ProcessData, and finally ProcessIndexes. This isn’t a performance tweak (though we did seem to shave off a few seconds) – it is intended to help us more easily isolate the bottlenecks in the various stages of processing (e.g. maybe dimension processing is the major bottleneck and we need to eliminate attributes or build better attribute hierarchies). Once we’re done tuning, we can always go back and test a ProcessFull against the entire SSAS database to see how it stacks up against the staged processing scenario.
Note: while proofreading I realized the improvement might have come from having the data in the buffer cache of the source system (instead of having to pull it off of disk) – in all future tests I made a point to restart the SQL Server service (on AZVM-SQLDW) to wipe out the buffer cache.
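The three stages can be sketched in XMLA roughly as follows (the database and dimension IDs are assumptions based on this lab's object names; in practice the first batch would repeat the Process element for each dimension):

```xml
<!-- Stage 1: dimensions via ProcessFull (repeat <Process> for
     Dim Date, Dim Product, and Dim Customer) -->
<Batch xmlns="http://schemas.microsoft.com/analysisservices/2003/engine">
  <Parallel>
    <Process>
      <Object>
        <DatabaseID>AdventureWorksDW</DatabaseID>
        <DimensionID>Dim Product</DimensionID>
      </Object>
      <Type>ProcessFull</Type>
    </Process>
  </Parallel>
</Batch>

<!-- Stage 2: load measure group data only -->
<Batch xmlns="http://schemas.microsoft.com/analysisservices/2003/engine">
  <Parallel>
    <Process>
      <Object>
        <DatabaseID>AdventureWorksDW</DatabaseID>
      </Object>
      <Type>ProcessData</Type>
    </Process>
  </Parallel>
</Batch>

<!-- Stage 3: build indexes and aggregations -->
<Batch xmlns="http://schemas.microsoft.com/analysisservices/2003/engine">
  <Parallel>
    <Process>
      <Object>
        <DatabaseID>AdventureWorksDW</DatabaseID>
      </Object>
      <Type>ProcessIndexes</Type>
    </Process>
  </Parallel>
</Batch>
```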
Memory: no change of note.
CPU Utilization: no change of note.
Processing Threads: In the graph below, I’ve included the “Rows converted/sec” counter (dotted pink) simply to highlight the area where we’re churning through partition data. As you can see by the Processing Pool Busy non-IO threads counter (purple line) we’re pegged at 6 threads. That’s because we only have 3 partitions (1 partition per measure group) and SSAS uses 2 threads per partition.
Rows Read/Write/Converted: no change of note – but I’d like to point out something interesting related to the read vs write as they pertain to row conversion. In the graph below, we’re zoomed in on the following 3 counters:
- Rows written/sec: turquoise solid line
- Rows read/sec: red dotted line
- Rows converted/sec: green dotted line
…the reason it only looks like 2 lines is because [Rows read] and [Rows converted] are on top of each other while the [Rows written] is dancing around.
To understand why this is interesting we need to understand how SSAS processes data. From a high level, data is “read” from the data source into a memory buffer (Buffer-A). Once Buffer-A is full, a separate thread begins processing data from Buffer-A and eventually writes it to a file on disk. Meanwhile, the original reader-thread starts reading new data into a second memory buffer (Buffer-B). Once the processing-thread finishes with Buffer-A and the reader-thread finishes with Buffer-B, they swap. Now the processing-thread is processing data from Buffer-B to a file on disk while the reader-thread is reading new data into Buffer-A. And the process continues until the object being processed is complete.
With that in mind, the implication of the graph above is that the reader-thread is the bottleneck…the writer-thread is waiting on the reader-thread to finish loading the next buffer so they can swap. Hence the bursty spikes in the [Rows written] counter. Ultimately this ties the performance of [Rows converted] (a measure of overall processing throughput) to the performance of [Rows read].
The conclusion we should come to then, is that in order to speed up processing, we need to speed up the rate at which data is being read out of the source system and into the SSAS memory buffers. How we achieve that improvement depends on specifics of the environment (e.g. is there a network between the data source and the SSAS instance) and other performance counter observations (e.g. poor process thread utilization).
Test03: Process in stages, Partition by Month (~20 minutes)
The primary bottleneck in the previous test was suboptimal use of processing threads. A secondary bottleneck was the rate at which data was being read from the data source into the SSAS memory buffer (i.e. Rows read/sec counter). CPU utilization, while low, was not a bottleneck – it was merely a symptom of the suboptimal utilization of other resources.
Now, in order to overcome these bottlenecks, we are going to partition the measure groups (by month). This will help with processing thread utilization because partitions can be processed in parallel. So instead of limiting ourselves to 2 threads (i.e. reader-thread and processing-thread) per measure group, we’ll be able to allocate 2 threads per partition per measure group. In addition, we also need to increase the max number of connections per data source (from a default of 10) to some number higher (e.g. 50). These adjustments should also help the secondary bottleneck (reading data out of the data source) since we’ll be sending multiple queries to the source (1 per partition) instead of a single query.
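A sketch of bumping that data source setting via XMLA (all of the IDs and the connection string here are assumptions for this lab – in SSDT it is simply the “Maximum number of connections” property on the data source):

```xml
<!-- Raise MaxActiveConnections (default 10) on the SSAS data source -->
<Alter ObjectExpansion="ObjectProperties"
       xmlns="http://schemas.microsoft.com/analysisservices/2003/engine">
  <Object>
    <DatabaseID>AdventureWorksDW</DatabaseID>
    <DataSourceID>AdventureWorksDW</DataSourceID>
  </Object>
  <ObjectDefinition>
    <DataSource xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
                xsi:type="RelationalDataSource">
      <ID>AdventureWorksDW</ID>
      <Name>AdventureWorksDW</Name>
      <ConnectionString>Provider=SQLNCLI11;Data Source=AZVM-SQLDW;Initial Catalog=AdventureWorksDW;Integrated Security=SSPI;Packet Size=32767</ConnectionString>
      <MaxActiveConnections>50</MaxActiveConnections>
    </DataSource>
  </ObjectDefinition>
</Alter>
```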
Note: The data in this cube spans several years and partitioning by month means we’ll have a lot of partitions. This is not something we want to do manually – therefore I’ve taken this script by Teo Lachev (b | t) made a few adjustments, and used it to programmatically create the monthly partitions for each measure group.
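For a sense of what the script generates, here’s a sketch of the XMLA for a single monthly partition (the cube/measure group IDs and partition naming convention are assumptions; the query shape mirrors the partition query shown later in this post):

```xml
<!-- One generated monthly partition: Internet Sales, September 2011 -->
<Create xmlns="http://schemas.microsoft.com/analysisservices/2003/engine">
  <ParentObject>
    <DatabaseID>AdventureWorksDW</DatabaseID>
    <CubeID>Adventure Works</CubeID>
    <MeasureGroupID>Internet Sales</MeasureGroupID>
  </ParentObject>
  <ObjectDefinition>
    <Partition xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
      <ID>Internet Sales 201109</ID>
      <Name>Internet Sales 201109</Name>
      <Source xsi:type="QueryBinding">
        <DataSourceID>AdventureWorksDW</DataSourceID>
        <QueryDefinition>
          SELECT * FROM dbo.FactInternetSalesBig
          WHERE OrderDateKey BETWEEN 20110901 AND 20110930
        </QueryDefinition>
      </Source>
      <StorageMode>Molap</StorageMode>
    </Partition>
  </ObjectDefinition>
</Create>
```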
Memory: from the graph below we see that memory utilization is much lower than before – more specifically, instead of the 2 large spikes (up to ~4GB) we now have a fairly even distribution of memory usage that never spikes above 1GB. This makes sense because we’ve broken up 3 large partitions (1 per measure group) into a bunch of smaller partitions (>40 per measure group)…so instead of reading in a whole bunch of data in a single batch for each measure group, we’re breaking it up into many smaller batches. This is also more ideal from a future-scalability standpoint.
CPU Utilization: The graph below shows pretty decent CPU utilization – the sharp jags during partition processing are interesting and can partially be attributed to the breakup of a few large partitions into many small partitions – but there’s a bit more to it (which I’ll explain later). Then at the end we see steadily high CPU utilization during index and aggregation processing which are both CPU intensive tasks.
Processing Threads: The graph below looks much different – in a good way. Now we’re using way more threads from the processing thread pool – busy threads (purple) are hovering around 50 for the majority of processing.
Just for reference, here’s the same graph but before we partitioned the measure groups (purple line hovers around 6)…
Rows Read/Write/Converted: the graph below shows a much different picture than what we saw in the previous tests. In this case, we went from a smooth line (avg’ing ~840k rows converted/sec) to a jagged line showing “bursts” in throughput. It’s certainly possible that increasing the counter collection frequency (from every 15 seconds to every 3) might smooth this line out a bit. Another consideration is that maybe our partitions are too small – and the ratio of data processing to general setup/teardown overhead (e.g. memory buffers, processing threads, etc) is starting to slip. Possibly – but if that were true, I’d expect to see more steady CPU utilization, since setup/teardown of objects in memory requires CPU. At this point, there’s still lower-hanging fruit to focus on, so let’s stay out of the weeds with our theories.
So we eliminated the processing thread pool bottleneck, but it took us ~2 minutes longer to process the SSAS database – why?
The graph above, of Rows Read/Write/Converted, is where the simple answer lies. While our max throughput (i.e. Rows converted/sec) increased from ~868k to ~947k, the average throughput (zooming in on just the partition processing section) actually decreased from ~840k to ~480k.
Test03 – zoomed in:
Test02 – for reference:
Unfortunately I don’t have any traces or images to support this – so you’ll just have to take my word that I was watching Resource Monitor and sp_whoisactive – but the bottleneck was the source database (i.e. AZVM-SQLDW). Resource Monitor was showing CPU maxed out and sp_whoisactive was showing RESOURCE_SEMAPHORE waits (as high as 60+ seconds). This would explain the spikiness of our CPU utilization and Rows Read/Write/Converted graphs… if the SSAS processing server is waiting on data from the source system, the CPU will be idling.
Test04: Process In Stages, Partition by Month, Source System Indexing (~13 minutes)
In the previous test, we managed to overcome the processing thread bottleneck but at the same time added a new bottleneck on the source system. The net result was that we actually added 2 minutes (~10%) to the total processing duration…we want that trend to go in the other direction.
In order to address the resource consumption issues on the source system, let’s consider the following observations:
- we went from 3 large partition queries to >120 smaller queries so our concurrency went up
- these smaller queries now have a WHERE-clause filtering on a date range
- our table is stored as a HEAP
It should be pretty obvious that we might want to consider an indexing strategy.
Since we’re running SQL Server 2014, and this is a star-schema data model, and because this technology is awesome, we can add a clustered columnstore index. This should definitely help with getting data out of the source more efficiently for the following 3 reasons:
- segment elimination (for SSAS partition query WHERE-clause filter predicates)
- column elimination (only the columns needed to resolve the query are requested from disk)
- compression (more efficient IO)
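A minimal sketch of the index build (the index name is an assumption; on SQL Server 2014 a clustered columnstore index can be created directly on the heap, though any existing nonclustered indexes would have to be dropped first):

```sql
-- Convert the heap to a clustered columnstore index
CREATE CLUSTERED COLUMNSTORE INDEX [CCI_FactInternetSalesBig]
    ON [dbo].[FactInternetSalesBig];
```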
We can prove all of this by taking a query that would be issued for a single partition processing job and executing it with STATISTICS IO/TIME and reviewing the plans to compare costs…
SET STATISTICS IO ON;
SET STATISTICS TIME ON;

SELECT
     [dbo_FactInternetSalesBig].[salesordernumber]
    ,[dbo_FactInternetSalesBig].[productkey]
    ,[dbo_FactInternetSalesBig].[orderdatekey]
    ,[dbo_FactInternetSalesBig].[duedatekey]
    ,[dbo_FactInternetSalesBig].[customerkey]
    ,[dbo_FactInternetSalesBig].[shipdatekey]
FROM (
    SELECT *
    FROM [dbo].[factinternetsalesbig]
    WHERE orderdatekey BETWEEN 20110901 AND 20110930
) AS dbo_FactInternetSalesBig
ORDER BY [dbo_FactInternetSalesBig].[salesordernumber] ASC;

SET STATISTICS TIME OFF;
SET STATISTICS IO OFF;
Before, while the table exists as a heap, the above query took 136 seconds and 2mm logical reads…
Scan count 9, logical reads 2048419, physical reads 0, read-ahead reads 2048426, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times:
CPU time = 32919 ms, elapsed time = 136254 ms
…and after we add the clustered columnstore index, the query took 15 sec and required ~560k logical reads…
Scan count 8, logical reads 559865, physical reads 8, read-ahead reads 997560, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times:
CPU time = 5092 ms, elapsed time = 14928 ms
We are clearly getting more efficient IO on the source system which should translate to an increase in [Rows read/sec] SSAS performance counter. Another benefit of using a clustered columnstore index is the space savings due to compression:
…the table is now only about ~1.3GB where previously it was ~16 GB (about 90% smaller)
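A quick way to check those numbers yourself is sp_spaceused against the fact table:

```sql
-- Reports row count, reserved, data, and index sizes for the table
EXEC sp_spaceused N'dbo.FactInternetSalesBig';
```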
Memory: memory utilization is still holding steady – up to ~1.1 GB
CPU Utilization: is looking pretty good – averaging 80% across the entire processing cycle…
Processing Threads: good processing thread utilization, and we’re even starting to see some “queuing” (red line w/ peak of 9), which typically means we could add more processing threads. But given that we’re already seeing pretty good CPU utilization and [Processing pool idle non-I/O threads] doesn’t hit zero, adding more threads isn’t going to help.
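For completeness, if we ever did want to raise that ceiling, the processing thread pool is sized via the ThreadPool\Process\MaxThreads server property (SSMS or msmdsrv.ini). A sketch – in recent versions a value of 0 lets the server derive the limit from the core count, so treat any hard-coded value as something to test rather than a recommendation:

```xml
<!-- msmdsrv.ini fragment (sketch): processing thread pool ceiling -->
<ConfigurationSettings>
  <ThreadPool>
    <Process>
      <MaxThreads>0</MaxThreads>
    </Process>
  </ThreadPool>
</ConfigurationSettings>
```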
Rows Read/Write/Converted: last but certainly not least, we see substantial improvement in the throughput of partition processing…with [Rows converted / sec] poking just over ~1mm. [Rows read] (blue line) is still the drag since it is still completely covered by [Rows converted] (green line) and [Rows written] (red line) continues to dance around the trend with spikes as high as ~1.4mm rows/sec.
A quick zoom in on the meat of the processing work shows an average of ~800k rows converted per second.
While this figure (~800k) is still less than what we saw before partitioning the data (i.e. ~840k), it is much higher than the last test (Test03, where FactInternetSalesBig was still a HEAP and we were only getting ~480k)…plus overall processing duration has improved from the original 18 minutes down to 13 minutes (~30% improvement).
As you can see, tuning processing is a bit of a balancing act. We reached a higher average row-conversion throughput with fewer partitions, but it came at the expense of under-utilizing other resources (i.e. processing threads, CPU, etc) – the net result was that, across the whole processing window, we were getting less work done per unit of time. By adding partitions and source system indexing, we managed to improve processing duration by 30% even though we decreased our row-conversion throughput – and even managed to decrease our memory requirements.
Originally, I wasn’t planning on spending this much time/space discussing the art/science of tuning SSAS processing performance but hey, we all get a little carried away sometimes. In the next post, we’ll configure the backend SSAS servers (i.e. AZVM-SSAS-BE01 and AZVM-SSAS-BE02), reconfigure our project to use remote partitions, perform a bit of processing tuning, and eventually compare our results.