Dimensional Modeling: Junk vs Degenerate
Two commonly misunderstood dimensional modeling techniques are Junk Dimensions and Degenerate Dimensions. This post is aimed at clearing up the confusion and providing some context and use-cases for each.
There are certain scenarios where you will find that the source for a fact table contains a bunch of low-cardinality attributes that don’t really relate to any of the other attributes describing these facts. Some of the more common examples are bit/character based “flags” or “codes” which are useful to the end users for filtering and aggregating the facts. For example, imagine a user who wants to analyze orders from the order fact table that are flagged as “reprocessed”…they can either filter for facts with the reprocessed flag if they are only interested in that subset…or they can group by the “reprocessed” flag calculate things like the percent of orders that are “reprocessed”…
Instead of building a separate dimension for each of these individual attributes, another option is to combine them and build what’s known as a Junk Dimension based on the Cartesian product of each of these attributes and they’re corresponding range of values.
Note: This can technically be done in a view, but I don’t recommend it…instead, my suggestion is to materialize the values in a table and update the ETL solution to add new values if necessary.
This technique does 2 important things:
- Saves Disk Space
consider a single 4byte integer key linking to the junk dimension vs. a handful of 4byte integer keys each linking to a separate dimension. Might not sound like a lot on a per-record basis, but once you extrapolate out over a 100mm record fact table the savings really adds up.
- Improves End-User Experience
By keeping the total number of dimensions down to a manageable size it will be easier for your end-users to find the attributes they’re looking for during ad-hoc analysis. Kimball recommends <= 26 dimensions per fact table – of course there are always a few edge-case exceptions.
Here’s a before-and-after image to help you picture the change:
The Degenerate Dimension is another modeling technique for attributes found in the source transaction table. The main differences between these attributes and the ones that would fall into our Junk Dimension are as follows:
these are typically high-cardinality attributes – in some cases having a 1 to 1 relationship with the fact. These are likely to be the business keys of the fact table such as Purchase Order Number, Work Order Number, etc. Another potential candidate for the degenerate dimension is free-form comment fields.
- Use Case for End-Users
these attributes are not going to be used for filtering/aggregating facts. Instead, these are the types of attributes that are typically going to be used in drilldown or data mining scenarios (ex. Market Basket Analysis). For example, imagine a user who is analyzing purchase orders in the “delayed” status. After drilling down on the delayed POs for a certain supplier in a certain time period…the next step might be to pick up the Purchase Order Number which would allow this user to trace this small subset of PO’s back to the source system to find out why they are “delayed”.
Despite the name, these attributes typically remain in the fact table. There really isn’t much point in moving them out to an actual dimension – because of the high-cardinality there’s likely to be zero space savings…in fact it would probably cost you space due to the additional surrogate-keys. You’ll also likely be paying a heavy price on the join at query time.
Analysis Services Implementation
For Junk Dimensions, you will create a new dimension at the project-level pointing it to the table (or view) in the data warehouse that materializes the distinct combinations of values for the various junk-attributes. After configuring the dimension at the SSAS project-level, it can be added to the cube(s) and linked up to the measure group(s) via regular relationships (where appropriate).
For Degenerate Dimensions, the process is the same except you base the project-level dimension off of the fact table (or view). Once the project-level dimension is configured, it can be added to the cube(s) and linked up to the measure group(s) using “fact”-relationships (where appropriate).
There are a few additional caveats to be aware of when using degenerate dimensions in SSAS:
- It is typically a good idea (especially for larger cubes) to set the storage mode for Degenerate Dimensions to ROLAP – yes ROLAP is slow, but it will greatly reduce the amount of time and memory required during processing.
- Don’t forget to train your users so that they understand the use-case and limitations involved with large dimensions (ex. DON’T drag the attributes from this dimension into the rows/columns until you’ve applied a lot of filters to the measures) – otherwise you’ll be in support-mode hell when the complain about performance/timeouts when access the cube.
- If necessary, you can set the visibility of the degenerate dimension to FALSE and hide the degenerate dimension from the users. Then you can surface the information via drillthrough/actions for which you can control the rows returned…if a user is smart enough to access hidden dimension attributes via custom MDX statements, then they should be smart enough to understand how and when to use them…and not complain about performance ;-P
- Don’t forget to adjust the ROLAPDimensionProcessingEffort server property…you can read the details here…otherwise you’re likely to run into the “Maximum number of ROLAP rows to process in one query has been exceeded” error.