Categories
Uncategorized

70-463: Determine whether you need a star or snowflake schema…

If you’re a Kimball purist like myself, the title of this post should give you pause.  I came across this phrase when reviewing the SkillsMeasured section for the 70-463 exam.  At the time, the only scenarios I could come up with that would justify using a snowflake schema were the following:

Short on Disk Space?
Snowflake design removes some of the attribute redundancy in dimension tables – which technically saves *some* disk space.  But from a size perspective, dimension tables typically pale in comparison to fact tables… by a few orders of magnitude.  So really, how much disk is being saved?

Standardization?
If this is an add-on project where an other data mart is simply being added to the mix for a new business process and the preexisting data marts are already using snowflaked dimensions, then it would make sense to continue the trend in the name of conformed dimensions.

Neither of these scenarios were very satisfying at the time, so I went ahead and added it to the list of things to review in preparation for the exam and moved on.

Several days later, while re-reading The Data Warehouse Toolkit by Ralph Kimball on the flight up to my current client, I came to the part about outriggers…and that’s when the light bulb comes on.

On quick a side note, I’m happy to report that the Swype software on my Nexus 7 tablet (which kicks total ass) correctly identified “Kimball” when jotting down the reminder to write this blog post.

From The Data Warehouse Toolkit, Chapter 6, page 153:

…a dimension is said to be snowflaked when the low-cardinality columns in the dimension have been removed to separate normalized tables that then link back into the original dimension table. Generally, snowflaking is not recommended in a data warehouse environment because it almost always makes the user presentation more complex, in addition to having a negative impact on browsing performance. Despite this prohibition against snowflaking, there are some situations where you should build a dimension outrigger that has the appearance of a snowflaked table. Outriggers have special characteristics that cause them to be permissible snowflakes.

Here’s the example Kimball uses to explain the circumstances which call for the use of a snowflake schema:

In the image above, we see a fact table with a customer dimension.  And hanging off of the customer dimension is an outrigger with a lot of demographic-based attributes pertaining to the county that the customer lives in.

This is considered an acceptable circumstance for leveraging a snowflake/outrigger design because there are so many county demographic attributes and they are all at such a higher granularity than the rest of the attributes in the customer dimension. Seeing as customer dimensions are notorious for being some of the largest dimensions found in data warehouses – think Amazon – this is one instance where the disk space savings could be non-negligible.

The other justification Kimball gives is that the data for the county demographic attributes is loaded at a different time from the rest of the customer attributes.  Unfortunately, i have a bit of trouble relating this to a real world scenario…I have been fortunate enough to work on projects where all the source data was available during the same processing window.

For what its worth, Kimball does mention, in nearly the same breath, that a view can be placed on top of the dimension and outrigger tables so that they appear to the user as a single dimension via the client application.

13 replies on “70-463: Determine whether you need a star or snowflake schema…”

Hey, came across your blog. Its nice to have this kind of approach

So, I didnt actually get if the example above is a purely snowflake or if it is supposed to be a mixed between the two. It looks like pure snowflake to me. So what would be the rule? If you have too many fields on the “dimension”, snowflake it, otherwise denormalize it into a star schema?

Like

Hi Diego – thanks for reading.

Yes, the example above (Figure 6.3) is a snowflake. In a purely denormalized dimensional model, the attributes in the “County Demographics Outrigger Dimension” table would be combined with those in the “Customer Dimension” table, forming a single Customer Dimension table.

The rule, based on my understanding of the text, is that you should consider snowflake when working with a large (in terms of the number of rows) and wide (in terms of the number of attributes) dimension where there is a “set” of attributes that exhibit one or more of the following characteristics:

a) come from a separate data source
b) have a separate update frequency
c) are at a higher granularity

…relative to the rest of the attributes in the dimension.

Like

sure, just responded.

FWIW, I plan on taking all the exams instead of simply upgrading for the reasons listed in my response to your question on stackexchange. Good luck!

Like

I think the difference between snowflake and outrigger is often blurred because snowflake is always a normalization whereas outrigger doesn’t have to be – it can be merely a separation of say customer attributes specific to the customer but changing at a different rate or its use is different, etc. Basically you are wanting to unburden and un-clutter the main dimension by using outrigger. I have used them but very rarely.

Like

Hi Gary, thanks for dropping by. I had to read your comment a few times before I picked up on the nuance… but yes, that does make sense… thanks for clearing that up!

Like

Is the many-2-many dimension considered as a snowflake or not? It sure isn’t a star in my opinion 🙂

Microsoft uses the snowflake diagram in the AdventureWorksDW for the product – subcategory – category dimensions, which seems this is an application of the outriggers rule.

Like

Hi Koen – thanks for reading!

You have an excellent point in regards to the many-2-many dimension scenario.

According to MSDN in reference to the Many-2-Many scenario: “Data structures do not always conform to the snowflake or star schema model where one fact is associated with a single dimension member.”
http://msdn.microsoft.com/en-us/library/ms345139(v=sql.90).aspx

In the past I’ve considered the many-2-many scenario a star-schema design by thinking of the bridge table as a (fact-less) fact table…so that the D-F-D-F pattern is maintained…and all is right in the Kimball universe. But maybe that logic is too simplified.

Either way – excellent point! Thanks for contributing.

Like

If I create a STAR Schema, I’m grappling with a Dimension table which has a 6 level hierarchy within it. Aside from using Self Joins and complicating the navigation, is there an easier technique when you require Multiple Hierarchies with deep levels on the same Dimension Table ?

Like

Is the 6-level hierarchy currently modeled as a parent-child? Also, is it always 6 levels or is it “ragged”?

the multiple hierarchies (assuming they are all modeled as Parent-Child) is interesting. That sort of complexity leads me to believe there may be room for additional discovery to see if it can be simplified in some way… or at least presented to the user in a more simplistic manner.

Like

Refer to Kimbal’s discussion re. hierarchies which is based on whether it’s ragged or not. If its fixed length then denormalize it – end of story. Bottom line is its never good idea to model parent-child in a star schema as BI tools hate that. Remember this is about making data easy to consume. Now if its truly ragged or the number of levels could change then there’s no good solution – it gets messy with either recursive and/or bridge table. Your end users will hate you. I avoid that at all costs. If its ragged – balance it. If the levels can change – try to get users to not do that which may take unraveling why they want to change levels which is often the hierarchy is overloaded to serve multiple business processes analysis and needs to be split into multiple hierarchies.

Like

Leave a comment