70-463: Determine whether you need a star or snowflake schema…
If you’re a Kimball purist like myself, the title of this post should give you pause. I came across this phrase when reviewing the SkillsMeasured section for the 70-463 exam. At the time, the only scenarios I could come up with that would justify using a snowflake schema were the following:
Short on Disk Space?
Snowflake design removes some of the attribute redundancy in dimension tables – which technically saves *some* disk space. But from a size perspective, dimension tables typically pale in comparison to fact tables… by a few orders of magnitude. So really, how much disk is being saved?
If this is an add-on project where an other data mart is simply being added to the mix for a new business process and the preexisting data marts are already using snowflaked dimensions, then it would make sense to continue the trend in the name of conformed dimensions.
Neither of these scenarios were very satisfying at the time, so I went ahead and added it to the list of things to review in preparation for the exam and moved on.
Several days later, while re-reading The Data Warehouse Toolkit by Ralph Kimball on the flight up to my current client, I came to the part about outriggers…and that’s when the light bulb comes on.
On quick a side note, I’m happy to report that the Swype software on my Nexus 7 tablet (which kicks total ass) correctly identified “Kimball” when jotting down the reminder to write this blog post.
From The Data Warehouse Toolkit, Chapter 6, page 153:
…a dimension is said to be snowflaked when the low-cardinality columns in the dimension have been removed to separate normalized tables that then link back into the original dimension table. Generally, snowflaking is not recommended in a data warehouse environment because it almost always makes the user presentation more complex, in addition to having a negative impact on browsing performance. Despite this prohibition against snowflaking, there are some situations where you should build a dimension outrigger that has the appearance of a snowflaked table. Outriggers have special characteristics that cause them to be permissible snowflakes.
Here’s the example Kimball uses to explain the circumstances which call for the use of a snowflake schema:
In the image above, we see a fact table with a customer dimension. And hanging off of the customer dimension is an outrigger with a lot of demographic-based attributes pertaining to the county that the customer lives in.
This is considered an acceptable circumstance for leveraging a snowflake/outrigger design because there are so many county demographic attributes and they are all at such a higher granularity than the rest of the attributes in the customer dimension. Seeing as customer dimensions are notorious for being some of the largest dimensions found in data warehouses – think Amazon – this is one instance where the disk space savings could be non-negligible.
The other justification Kimball gives is that the data for the county demographic attributes is loaded at a different time from the rest of the customer attributes. Unfortunately, i have a bit of trouble relating this to a real world scenario…I have been fortunate enough to work on projects where all the source data was available during the same processing window.
For what its worth, Kimball does mention, in nearly the same breath, that a view can be placed on top of the dimension and outrigger tables so that they appear to the user as a single dimension via the client application.