Monitoring Azure Analysis Services Instances via Azure Portal
At some point in the past couple weeks/months, the Azure Analysis Services team managed to expose a few more metrics to the Azure AS Monitoring / Metrics blade.
Last I checked, there were only 2 metrics available… Memory and QPU… arguably two of the most important metrics to track as they indicate if/when it’s time to adjust the size of your instance. However, there’s much more involved in analyzing performance of an AS model so it’s nice to see the Azure AS team expanding access and insight in terms of measuring the workload.
From a high-level performance monitoring perspective, these counters fall into 1 of 2 categories:
Resource Utilization Metrics
- QPU: an abstract term used to compare query workload capacity… (similar to a “story point” in agile planning)… 100 QPUs is about the equivalent of 5 pretty fast CPU cores
- Memory: tracks how much memory is being used to store the data and satisfy the query/processing workloads (see this post for more details on workload memory usage).
- Command Pool Job Queue Length: in SSAS-2016, several new counters were added to the SSAS-Threads category in perfmon to track CPU thread utilization related to processing commands. Technically there are several other types of commands, but when it comes to performance, “processing” was the only one of concern. (the other “types” of commands were things like DDL-ish to add/alter a partition). This particular counter shows when a (processing) command is waiting on a thread to be allocated from the command pool… in other words, if you see a value > 1 here for an extended period of time, you have a bottleneck. In my experience, this is not a common place to see a bottleneck… however, I expect it could very likely become one if Azure AS becomes a junk yard for “auto-upgraded” Power BI solutions w/ auto-refresh capabilities.
- Processing Pool Job Queue Length: The processing thread pool allocates threads for processing the data structures that make up a tabular model. Same as the previous metric (and just like all the other “<x> pool job queue” perfmon counters), a value > 1 for an extended period of time indicates a bottleneck. However, this is a much more common place to find a bottleneck, especially in larger models that leverage partitions and parallel processing.
- Query Pool Busy Threads: this metric shows how many queries are (con)currently executing… a great metric to track over time and review on a recurring basis! In practice, the precision of this metric is not great, but it’s at least somewhat directional. By that I mean (and this assumes these metrics are pulling from the underlying perfmon counters… which I believe to be a safe assumption) if you’re only taking a snapshot of thread usage every 5 seconds, but the average query duration is only 1-2 seconds, then you could be missing the largest spikes between samples. Personally, I would have preferred to see the Queue Length version of this counter instead (or in addition)… as it’s a binary metric… you have a problem or you don’t.
- Current User Sessions: # of users… a pretty good indicator of user-concurrency.
- Successful Connections Per Sec: # of connections per second.
- Total Connection Requests: # of connection attempts since the service was last restarted.
- Total Connection Failures: # of failed connection attempts since the service was last restarted.
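To make the sampling concern on Query Pool Busy Threads a bit more concrete, here is a small Python sketch. It assumes (as I did above) that the metric is a point-in-time snapshot taken every 5 seconds. The workload is entirely made up for illustration: one long-running query plus a burst of eight 2-second queries that happens to land between two snapshots.

```python
def busy_threads_timeline(duration_s, queries):
    """Return the number of busy query threads for each second.

    queries is a list of (start_second, length_seconds) tuples,
    one thread per running query (a simplifying assumption).
    """
    timeline = [0] * duration_s
    for start, length in queries:
        for t in range(start, min(start + length, duration_s)):
            timeline[t] += 1
    return timeline


def sample(timeline, interval_s):
    """Take a snapshot every interval_s seconds, starting at t=0."""
    return timeline[::interval_s]


# Hypothetical workload: one 30-second query, plus a burst of
# eight 2-second queries that all start at t=11.
queries = [(0, 30)] + [(11, 2)] * 8

timeline = busy_threads_timeline(30, queries)
samples = sample(timeline, 5)  # snapshots at t=0, 5, 10, 15, 20, 25

true_peak = max(timeline)
sampled_peak = max(samples)

print("true peak busy threads:   ", true_peak)
print("peak seen by 5s snapshots:", sampled_peak)
```

In this contrived case the burst falls entirely between the snapshots at t=10 and t=15, so the sampled series never sees it. Real workloads won't line up this neatly, but it shows why the metric is better read as a trend than as an exact peak.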
Truth be told, I don’t expect to see too many people leveraging the metrics via the Azure Portal interface (shown above) to do much more than “keep an eye on” their Azure AS assets… and most of those folks are probably only doing it during the initial phases of an Azure AS implementation until they feel comfortable with the sizing choices and potential growth needs. More likely than not, the mature customers (or those with larger Azure footprints) are going to be using something like Azure Monitor to collect metrics from their Azure AS instances (as well as the corresponding metrics for all their other assets) and manage them in a centralized solution where they can more easily do things like set up and manage alerts. And that’s just scratching the surface. Take it a bit further and it’s not hard to imagine a scenario where metrics like those above in this blog post are being collected as inputs to determine when (and by how much) to scale up (or down) an Azure AS instance.
- Collect metrics from Azure AS instance(s)
- Create an Alert to detect QPU (or Memory) saturation
- Have the Alert call a webhook…
- where the webhook kicks off an Azure Automation Runbook that scales up the Azure AS instance
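The decision step in that workflow could look something like the Python sketch below. This only covers the "should we scale, and to what?" logic, not the call that actually resizes the instance. The tier list, QPU limits and the 80%/30% thresholds are illustrative assumptions (check the current Azure AS pricing tiers before relying on them), and `choose_tier` is a hypothetical helper, not part of any Azure SDK.

```python
# Illustrative tier table: (name, QPU limit), ordered smallest to largest.
# These values are assumptions for the sketch; verify against current docs.
TIERS = [("S0", 40), ("S1", 100), ("S2", 200), ("S4", 400)]


def choose_tier(current, avg_qpu, high=0.8, low=0.3):
    """Pick a target tier from the average QPU observed on the current tier.

    Scales up one step when utilization is at or above `high`, scales down
    one step when it is at or below `low` (and the smaller tier would still
    sit under `high`), and otherwise stays put.
    """
    names = [name for name, _ in TIERS]
    i = names.index(current)
    utilization = avg_qpu / TIERS[i][1]

    if utilization >= high and i + 1 < len(TIERS):
        return names[i + 1]

    if utilization <= low and i > 0:
        # Avoid flapping: only step down if the load fits the smaller tier.
        if avg_qpu / TIERS[i - 1][1] < high:
            return names[i - 1]

    return current


print(choose_tier("S1", 90))   # 90% of S1: step up
print(choose_tier("S1", 50))   # 50% of S1: stay
print(choose_tier("S2", 40))   # 20% of S2: step down
```

A runbook triggered by the alert's webhook would then take the returned tier name and apply it to the instance. The same function works for scheduled scale-downs outside business hours.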
If you’re going to build in the cloud, you might as well take full advantage, no?