ADLC Step 5: Create Custom Aggregations

This is the sixth post in a series of posts covering the Aggregation Design Life-Cycle (ADLC).

  1. Introduction
  2. Create Initial Aggregation Designs
  3. Remove Ineffective Aggregations
  4. Capture Query Workload
  5. Run Usage-Based Optimization (UBO) Wizard
  6. Create Custom Aggregations
  7. Final Thoughts

In the last post, we discussed the Usage-Based Optimization (UBO) Wizard and some of the factors that might influence the process…or perhaps more accurately…how we should approach the process.  The end result is a set of aggregations, custom-tailored to the longest-running and most frequent queries (storage engine requests) hitting the cube, which is then merged into the existing aggregation design.

Now, in this post, we are going to address the final step in the ADLC, where we put the finishing touches on our aggregation designs by adding custom “hand-crafted” aggregations that, for one reason or another, were not created by the initial Aggregation Design Wizard or the UBO Wizard.

ADLC - Step 5

Why is this step necessary?

Despite doing a thorough job with the Capture Query Workload step of the ADLC…making sure to capture a fully representative query workload across all reports, dashboards, and other reporting items…there will still be a small handful of queries (usually generated during ad-hoc analysis) that we will want to improve.  And that is what this step is for.

The typical scenario plays out like this: a business analyst is slicing and dicing the data with an Excel pivot table and notices that when he or she adds a certain attribute (usually a very granular attribute on a large dimension) onto the rows field, the pivot table all of a sudden takes a few minutes to refresh.

In the following sections we’ll walk through the steps for capturing the query, analyzing the execution details, and adding a custom aggregation to the existing aggregation design to help improve the performance of the query.

Capturing the Query Details

If we are going to build a custom aggregation to target a specific query, the first step is capturing the query…right?  There are a number of ways to do this, and most involve coordinating a test with the user to capture the slow-performing query in a SQL Profiler trace.  The trace only needs to capture the Query End event shown below:

click to zoom

From the trace, we can use the Duration and TextData columns to retrieve the exact MDX for the longest running query in the trace.
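If you save the trace to a SQL Server table (Profiler’s Save-to-table option), pulling out the worst offenders is just a quick query away.  Here’s a rough sketch…the table name (dbo.ADLC_QueryTrace) is made up, and the EventClass ID used for Query End (10) should be verified against your own trace definition:

[sql]
-- Sketch: pull the 10 longest-running queries from a Profiler trace
-- that was saved to a table. Table name is hypothetical; verify that
-- EventClass 10 = Query End in your trace before trusting the filter.
SELECT TOP 10
    Duration,                                    -- query duration
    NTUserName,                                  -- handy for isolating a single analyst's session
    CAST(TextData AS NVARCHAR(MAX)) AS MDXQuery  -- the exact MDX we want to capture
FROM dbo.ADLC_QueryTrace
WHERE EventClass = 10                            -- Query End
ORDER BY Duration DESC;
[/sql]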

Another option, which works very well when your user base is using Excel pivot tables for ad-hoc analysis, is to have them install a handy (and FREE!) add-in called OLAP Pivot Table Extensions…which I previously blogged about here.  OLAP Pivot Table Extensions allows the user to collect the poorly performing MDX and saves everyone the trouble of having to coordinate schedules to reproduce the issue…uphill, in the snow, and possibly in production :-/

click to zoom

Analyzing the Query Details

Once we have the exact query syntax, we can begin a more in-depth analysis to determine whether an aggregation will actually help this query…and if so, which dimension attributes and measure group the aggregation needs to target.

Here’s the query we will use for the sake of an example…it’s a very basic query that already runs well under 100ms…so it’s not a very realistic example for this step in the ADLC…but by keeping the example simple, it will be easier for us to focus on the details in the SQL Profiler trace.

[sql]
WITH
MEMBER [Date].[Calendar].[QoQ Growth] AS
(
[Date].[Calendar].[Calendar Quarter].[Q4 CY 2007] -
[Date].[Calendar].[Calendar Quarter].[Q3 CY 2007]
)
SELECT
{
[Date].[Calendar].[Calendar Quarter].[Q3 CY 2007],
[Date].[Calendar].[Calendar Quarter].[Q4 CY 2007],
[Date].[Calendar].[QoQ Growth]
} ON 0,
NON EMPTY {
[Product].[Weight].AllMembers
} ON 1
FROM
[Adventure Works]
WHERE
[Measures].[Internet Sales Amount]
[/sql]

Now, before we start digging into the details, let’s take a high-level look at the query and see what objects (dimension attributes, measures) are referenced and then check the SSAS project to make sure the query will even be able to use an aggregation.  For example, if one of the measures is configured with a measure expression…an aggregation won’t help…so there’s no point in wasting any more time.

From there, the next step is to capture some basic summary information about the query execution which I like to do with MDX Studio:

Query summary on a cold cache (click to zoom)

This shows us the overall query duration, how many times the Storage Engine (SE) is being probed for data, number of cache hits, etc.

Next, we’ll zoom in on the details of the query execution via SQL Profiler which will provide us with some valuable information:

Cold Cache (click to zoom)

First, it tells us where the query is spending the bulk of its processing time: Formula Engine (FE) or Storage Engine (SE).  Adding aggregations only helps reduce time spent getting data from the Storage Engine (SE), so if the query is spending the majority of its time in the Formula Engine (FE), adding an aggregation will have little impact.  The quick-and-dirty way to see the breakdown between time spent in the FE vs. the SE is to add up the durations of all the Query Subcube events (which gives you the time spent in the SE) and then subtract that value from the total duration of the query, which lets us back into the FE duration.

Note: Another option, which I’ve seen recommended by Chris Webb (b | t), is to simply target all Query Subcube events > 500ms and build aggregations for those.
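Here’s a rough sketch of that arithmetic (plus the > 500ms rule of thumb from the note) against the same hypothetical trace table.  It assumes the trace was scoped to a single query execution, and the EventClass IDs (10 = Query End, 11 = Query Subcube) are assumptions to double-check against your own trace:

[sql]
-- Sketch: FE vs SE breakdown for a single traced query.
-- Assumes the trace only contains one query execution and that
-- EventClass 10 = Query End and 11 = Query Subcube (verify against your trace).
-- Query Subcube Verbose is ignored to avoid double counting (see tip #1 below).
DECLARE @SE BIGINT, @Total BIGINT;

SELECT @SE = ISNULL(SUM(Duration), 0)
FROM dbo.ADLC_QueryTrace
WHERE EventClass = 11;               -- Query Subcube = time spent in the SE

SELECT @Total = MAX(Duration)
FROM dbo.ADLC_QueryTrace
WHERE EventClass = 10;               -- Query End = total query duration

SELECT @Total       AS TotalDuration,
       @SE          AS StorageEngineDuration,
       @Total - @SE AS FormulaEngineDuration;

-- Candidate subcube requests per the "> 500ms" rule of thumb
SELECT Duration, CAST(TextData AS NVARCHAR(MAX)) AS SubcubeRequest
FROM dbo.ADLC_QueryTrace
WHERE EventClass = 11 AND Duration > 500
ORDER BY Duration DESC;
[/sql]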

The other bit of really valuable information captured in this trace is whether (and how many times) data is being pulled from disk (Query Subcube; Non-cache data).  These are the events we are targeting and want to optimize with aggregations. In the screenshot above, we see that data is being pulled from disk with a 9ms duration. Using the Query Subcube Verbose event, we can get the information we need to build an aggregation to target this data pull.

click to zoom

Here’s a quick excerpt from Identifying and Resolving MDX Query Performance Bottlenecks in SQL Server 2005 Analysis Services explaining how to interpret the TextData column for the Query Subcube Verbose event:

    • A 0 indicates that the value for the default member (generally the All level) is requested.
    • An * indicates that values for all members of an attribute are requested, rather than a slice of one or more members.
    • A non-zero number indicates that a single attribute is requested for the slice, with the number representing the data ID of the member requested.
    • A + indicates that a slice on more than one member of a set is requested (the members of the slice are not displayed in the trace).
    • A – indicates that a slice below granularity is requested (the members of the slice are not displayed in the trace). Below granularity means that the subcube being returned is filtered on a set that is not included in the subquery. This occurs when a set is included in a WHERE clause, when dimension security is involved, and a number of other scenarios.

So in our example (screenshot above) we see that the following members from the Product and Date dimensions are being used to extract the necessary data to satisfy the query:

  • [Product].[Weight]:*
  • [Date].[Calendar Quarter]:*
  • [Date].[Calendar Semester]:*
  • [Date].[Calendar Year]:[CY 2007]

Note: I found it interesting that the [Calendar Year] and [Calendar Semester] attributes were included.  The [Calendar Year] specifically targets the 2007 member…which is likely due to the engine recognizing and eliminating unnecessary partitions (the measure group is partitioned by year in this example).  But I can’t quite wrap my head around the [Calendar Semester] inclusion…other than that it is simply part of the attribute hierarchy defined in the Date dimension.  If you can explain why or point me in the right direction, it would be much appreciated.

Creating the Custom Aggregation

So now that we have:

  1. reviewed the components of the query and determined that there’s nothing in there (e.g. measure expressions) preventing us from leveraging aggregations
  2. broken down the components of the subcube request (via the Query Subcube Verbose event) and identified the attributes to include in the aggregation

…we are finally ready to switch back over to our SSAS project and build a custom aggregation.

There are a few ways to do this, but I recommend using BIDS Helper since the UI is much more intuitive. To get started, right-click on the cube (in Solution Explorer) and select Edit Aggregations:

click to zoom

Next, right-click on the Aggregation Design to which you want to add the new custom aggregation, and select Edit:

click to zoom

Note: In a more realistic scenario, we’d see a bunch of aggregations defined in the main window…one row per aggregation in the aggregation design.  But since this is just a lab environment, completely fabricated to demo the process, it’s blank.

Finally, right-click anywhere in the main window and select Add Aggregation. Now simply browse the tree structure laying out all the dimension attributes (based on the attribute relationships) and select the necessary attributes based on the details of the Query Subcube Verbose event from the profiler trace in the previous section.

click to zoom

In this case we’ve selected Weight (from the Product dimension) and Calendar Quarter (from the Date dimension)…which we see circled in the screenshot above (bottom right corner).  Finally, we can click OK to add the aggregation to the existing Aggregation Design.

Now…and this is a really nice and under-appreciated feature of BIDS Helper…we can go back to the Partitions tab of the Cube editor and there’s a button that allows us to deploy just the aggregation designs.

click to zoom

Once the updated Aggregation Design has been deployed, you can simply run a Process Index on the affected partitions and you’re ready to roll.

click to zoom

There’s no need to re-deploy and re-process the entire cube…in fact, that could be completely cost-prohibitive!

Measure the Impact of the Custom Aggregation

With the custom aggregation in place, we now want to test/measure the performance improvement…

click to zoom

In the trace above we see the Get Data From Aggregation event instead of the Query Subcube event.  Also, notice that the overall query duration has dropped by ~10ms (from 23ms to 13ms), which is just about the same amount of time as the Query Subcube event in the trace before adding the custom aggregation.  Retrieving data from an aggregation is near-instantaneous from a storage engine (SE) perspective…the same goes for retrieving from the SE cache.

click to zoom

The screenshot above shows the value in the TextData column of the Get Data From Aggregation event in the profiler trace…identifying the aggregation we just created.
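If you’d rather confirm the hit from a saved trace table than eyeball Profiler, a query along these lines does the trick…again, the table name is hypothetical and the EventClass ID used for Get Data From Aggregation (60) is an assumption to verify against your trace:

[sql]
-- Sketch: confirm the new aggregation is actually being used.
-- Table name is hypothetical; verify that EventClass 60 = Get Data From Aggregation
-- in your trace before relying on the filter.
SELECT
    Duration,
    CAST(TextData AS NVARCHAR(MAX)) AS AggregationName  -- TextData identifies the aggregation hit
FROM dbo.ADLC_QueryTrace
WHERE EventClass = 60                                    -- Get Data From Aggregation
ORDER BY Duration DESC;
[/sql]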

Considerations/Tips/Recommendations

  1. Be sure not to double count the Query Subcube and Query Subcube Verbose events when trying to back into the breakdown of SE duration vs FE duration for a particular query.  These are duplicate events…the verbose version is only there to make it easier for us to interpret.
  2. When tracing the execution of a single query (or handful of queries) it’s best to use a development/test environment.  If that’s not available, then at least wait until after hours to reduce impact on the system.  And when even that’s not possible…run a quick trace and use the username column in the profiler trace to filter out everything but your session.
  3. Depending on the frequency of your ADLC iterations, this step can be run as a one-off to improve the performance of a new report or ad-hoc analysis.
  4. Remember that not all queries are ideal for cubes.  There’s a fine line between ad-hoc analysis and data extraction.  Cubes are not intended for analysts to extract millions and millions of rows.  So if you receive a “slow” query from a user and you notice that the query is attempting to return granular data from the fact table, consider an alternative method…such as an SSRS report against the DW/DataMart.


This post closes out the Aggregation Design Life-Cycle…but you don’t want to miss the next post, where we cover some final thoughts on aggregations for performance improvements and the overall life-cycle…so stay tuned.
