My notes about the Data Warehouse Toolkit (Kimball).
Atomic Star Schema → (optional) OLAP Cube (e.g.: product, market, date)
A measurement event in the physical world has a one-to-one relationship to a single row in the corresponding fact table.
Measure types:
Non-additive measures such as unit price must be computed as ratios of sums: SUM(extended_sales) / SUM(quantity).
You should not store redundant text in a fact table.
In transaction fact tables, if there is no activity, the row should be absent — do not fill with zeros.
Types:
| Type | Description | Zero-fill empty rows? |
|---|---|---|
| Transaction | One row per event at a point in time | No — absent row means no activity |
| Periodic Snapshot | One row per period, regardless of activity | Yes — maintain snapshot continuity |
| Accumulating Snapshot | One row per process/pipeline instance, updated as milestones occur | Yes — row exists from start to finish |
Rules:
Best Practices:
Answers: Who, What, Where, When, How, Why
Fact vs. Dimension attribute — when it’s ambiguous:
| Attribute | Description |
|---|---|
| Fact | A measurement that takes on many values and participates in calculations; a continuous value |
| Dimension | A discrete descriptive value |
The data is only useful if it is simple to understand.
Best Practices:
| SCD Type | Strategy | Use when |
|---|---|---|
| Type 1 | Overwrite the old value | History doesn’t matter |
| Type 2 | Add a new row with a new surrogate key | Full history must be preserved |
| Type 3 | Add a new column for the new value | Only current and previous value matter |
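The Type 2 strategy above can be sketched with SQLite. This is a minimal illustration, not a production ETL pattern; the table and column names (`customer_dim`, `region`, `is_current`) are hypothetical, and the effective/expiration date columns are omitted for brevity:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer_dim (
    customer_key INTEGER PRIMARY KEY,  -- surrogate key, new for each version
    customer_id  TEXT,                 -- durable natural key
    region       TEXT,
    is_current   TEXT
);
INSERT INTO customer_dim VALUES (1, 'C-001', 'East', 'Y');
""")

def scd2_update(conn, customer_id, new_region):
    """Type 2: expire the current row, insert a new row with a new surrogate key."""
    conn.execute(
        "UPDATE customer_dim SET is_current = 'N' "
        "WHERE customer_id = ? AND is_current = 'Y'", (customer_id,))
    conn.execute(
        "INSERT INTO customer_dim (customer_id, region, is_current) "
        "VALUES (?, ?, 'Y')", (customer_id, new_region))

scd2_update(conn, 'C-001', 'West')
rows = conn.execute(
    "SELECT customer_key, region, is_current FROM customer_dim "
    "ORDER BY customer_key").fetchall()
# The same entity now has two surrogate keys; only the newest row is current.
```

Note that the same `customer_id` now maps to two surrogate keys, which is exactly what lets old fact rows keep pointing at the historical version.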
A dimension is conformed when it means the same thing across multiple fact tables and data marts.
If two fact tables share a conformed dimension, a BI tool can query both and align the results correctly — this is the integration mechanism Kimball proposes instead of a normalized enterprise model.
ETL Best Practices:
Add audit metadata columns (e.g., load_date, source_system) on every table for traceability.
ETL systems should be off-limits to business users — we don’t want busy ETL professionals distracted by unpredictable BI queries.
Before finalizing a dimensional model, verify:
The grain defines what a single row in the fact table represents — always atomic.
Examples:
Answers: Who, What, Where, When, Why, How
Scenario: A grocery chain with 100 stores across 5 states, each carrying a full complement of departments (Grocery, Frozen, Dairy, Meat, etc.) and approximately 60,000 SKUs per store. Data is collected at cash registers (POS system).
| Design Step | Decision |
|---|---|
| Business Process | Purchases at POS system |
| Grain | One row per individual product scanned per sales transaction |
| Dimensions | Date, Store, Product, Promotion, Cashier, Payment Method |
| Facts | Sales quantity, unit prices, discounts, sales amount, cost amount |
If adding a new dimension would cause additional fact rows, either disqualify the dimension or revisit the grain.
Sales quantity, extended discount, sales, and cost dollar amounts are additive across all dimensions.
Derived facts (e.g., Extended Gross Profit = Extended Sales − Extended Cost) should generally be computed once and stored on disk, rather than recomputed in every query.
| Name | Description | Example |
|---|---|---|
| Regular Unit Price | Original price per unit before any discount | $1,000 |
| Discount Unit Price | Discount applied per unit | $100 |
| Net Unit Price | Final price per unit after discount | $900 |
| Extended Discount Dollar Amount | Total discount for all units (discount × quantity) | $200 |
| Extended Sales Dollar Amount | Total revenue after discount (net price × quantity) | $1,800 |
| Extended Cost Dollar Amount | Total cost for all units (cost × quantity) | $1,200 |
| Extended Gross Profit Dollar Amount | Total profit (sales − cost) | $600 |
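The figures in the table are mutually consistent if we assume 2 units were sold (the quantity itself is not shown in the table). A quick arithmetic check, with the unit cost of $600 inferred from the extended cost:

```python
# Reproduce the table's figures, assuming 2 units sold (not stated in the table).
quantity = 2
regular_unit_price = 1000
discount_unit_price = 100
unit_cost = 600  # inferred: extended cost 1200 / quantity 2

net_unit_price = regular_unit_price - discount_unit_price   # 900
extended_discount = discount_unit_price * quantity          # 200
extended_sales = net_unit_price * quantity                  # 1800
extended_cost = unit_cost * quantity                        # 1200
extended_gross_profit = extended_sales - extended_cost      # 600
```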
Non-Additive Facts:
Gross margin must be computed as SUM(gross_profit) / SUM(extended_sales), or equivalently SUM(extended_sales - extended_cost) / SUM(extended_sales).

The Date dimension is typically the first and most reused dimension; usually daily-grained.
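The difference between the correct ratio-of-sums and the tempting but wrong average-of-ratios can be shown with two invented fact rows:

```python
# Each tuple is (extended_sales, extended_cost) for one fact row.
# The numbers are illustrative only.
rows = [
    (1800, 1200),
    (1000, 900),
]

total_sales = sum(s for s, _ in rows)
total_cost = sum(c for _, c in rows)

# Correct: sum the additive components first, then divide.
margin = (total_sales - total_cost) / total_sales

# Wrong: averaging per-row margins weights every row equally,
# regardless of how large each sale was.
wrong = sum((s - c) / s for s, c in rows) / len(rows)
```

Here the correct margin is 25%, while the average of per-row margins is about 21.7% — the small low-margin sale is over-weighted.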
See kimballgroup.com → Tools → Utilities for a pre-built Date dimension.
Current and relative date attributes (e.g., is_current_date, is_current_month) may have an update lag of up to one day — it is better to compute these at query time than to store them.
Time-of-day is typically separated from the Date dimension to avoid row explosion.
Store both the full product description (e.g., "Tropicana OJ 64oz") and the individual brand, size, and flavor columns.

Numeric values — Attribute or Fact?
| Signal | Put it in… |
|---|---|
| Used primarily in calculations | Fact table |
| Used primarily to filter, group, or label | Dimension table |
| Serves both purposes | Both — with distinct meanings |
Example: Standard price stored in the fact captures the valuation at purchase time; stored in the dimension it indicates the current list price.
Rule of thumb:
Drilling Down / Up:
The Store dimension usually does not have a pre-existing source table — the team often builds it by combining multiple source systems.
Required attributes: city, country, city-state combination, state, zip code, selling square footage, total square footage, first open date, remodel date, store manager.
Best Practice: Create role-playing views of the Date dimension to represent different business events (e.g., order date, shipment date, delivery date). This avoids manually aliasing the same table in every query (JOIN dim_date d1 ... JOIN dim_date d2 ...) and gives each role distinct, readable column names.
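A minimal SQLite sketch of role-playing views. The view and column names (`order_date_dim`, `ship_date_dim`, etc.) are hypothetical; the point is that one physical Date table serves two roles with role-specific column names:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, full_date TEXT);
INSERT INTO dim_date VALUES (1, '2024-01-05'), (2, '2024-01-09');

-- One physical table, two role-playing views.
CREATE VIEW order_date_dim AS
    SELECT date_key AS order_date_key, full_date AS order_date FROM dim_date;
CREATE VIEW ship_date_dim AS
    SELECT date_key AS ship_date_key, full_date AS ship_date FROM dim_date;

CREATE TABLE orders_fact (order_date_key INTEGER, ship_date_key INTEGER, amount REAL);
INSERT INTO orders_fact VALUES (1, 2, 500.0);
""")

# Each join reads naturally; no d1/d2 aliases, no ambiguous column names.
row = conn.execute("""
    SELECT o.order_date, s.ship_date, f.amount
    FROM orders_fact f
    JOIN order_date_dim o USING (order_date_key)
    JOIN ship_date_dim  s USING (ship_date_key)
""").fetchone()
```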
Also called the Causal Dimension — it describes the factors believed to cause a change in product sales.
Must include: temporary price reductions, aisle displays, newspaper ads, coupons, etc. These are typically combined into a single dimension row.
Key business questions it enables:
| Scenario | Recommendation |
|---|---|
| Null FK in fact table | Never allow it — add a row in the dimension (e.g., “Not Applicable”) and point the FK there |
| Null attribute in dimension | Replace with a descriptive string: “Unknown”, “Not Applicable” |
| Null value in a fact column | Usually acceptable — aggregate functions (MIN, MAX, COUNT, AVG) handle NULLs correctly |
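The "Not Applicable" dimension-row pattern can be sketched in SQLite. The promotion key 0 and the sample amounts are invented; the point is that every fact row joins cleanly, so no sales silently drop out of an inner join or a GROUP BY:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE promotion_dim (promotion_key INTEGER PRIMARY KEY, promotion_name TEXT);
-- Key 0 is the explicit "no promotion" row; the fact FK is never NULL.
INSERT INTO promotion_dim VALUES (0, 'Not Applicable'), (1, 'Winter Sale');

CREATE TABLE sales_fact (promotion_key INTEGER, amount REAL);
INSERT INTO sales_fact VALUES (1, 100.0), (0, 40.0), (0, 60.0);
""")

rows = conn.execute("""
    SELECT p.promotion_name, SUM(f.amount)
    FROM sales_fact f
    JOIN promotion_dim p USING (promotion_key)
    GROUP BY p.promotion_name
    ORDER BY p.promotion_name
""").fetchall()
# Non-promoted sales show up under a descriptive label instead of vanishing.
```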
A single transaction may involve more than one payment method (e.g., cash + gift card), which would violate the grain of the sales fact table.
Solution: Create a separate Payment fact table with a finer grain — one row per payment method per transaction — rather than forcing multiple payment methods into the sales fact row.
A degenerate dimension is a dimension key that has no associated dimension table — it carries no descriptive attributes beyond the identifier itself (e.g., a POS transaction number, order number, or invoice number).
The dimensional model is designed to accommodate change gracefully:
| Change | How to handle |
|---|---|
| Add a new dimension (e.g., Frequent Shopper) | Add a new FK column to the fact table; backfill historical rows with a “Prior to Program” surrogate key |
| Add a new dimension attribute | Add a new column; fill historical rows with “Not Available” or equivalent |
| Add a new measured fact (same grain, same process) | Add a new column to the existing fact table |
| Add a new measured fact (different grain or process) | Create a new fact table |
Use case: Which products were on promotion but did not sell?
A Factless Fact Table contains only foreign keys — no numeric measures (or a dummy constant like 1). It records that an event could have happened but carries no measurement.
Promotion Coverage schema:
promotion_coverage_fact
├── date_key (FK)
├── product_key (FK)
├── store_key (FK)
└── promotion_key (FK)
Query pattern to find promoted-but-unsold products:
1. promotion_coverage_fact → all products on promotion
2. sales_fact → all products actually sold
3. (1) EXCEPT (2)

Surrogate keys are artificial keys — the DW system has no dependency on the source system’s key type or values, so changes to the source key have no impact.
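The EXCEPT pattern for "promoted but unsold" products, as a runnable SQLite sketch (toy keys, and most FK columns omitted for brevity):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE promotion_coverage_fact (
    date_key INTEGER, product_key INTEGER, store_key INTEGER, promotion_key INTEGER);
CREATE TABLE sales_fact (
    date_key INTEGER, product_key INTEGER, store_key INTEGER, quantity INTEGER);

-- Products 1 and 2 were on promotion; only product 1 actually sold.
INSERT INTO promotion_coverage_fact VALUES (1, 1, 10, 5), (1, 2, 10, 5);
INSERT INTO sales_fact VALUES (1, 1, 10, 3);
""")

# (1) products covered by a promotion, minus (2) products that sold.
unsold = conn.execute("""
    SELECT product_key FROM promotion_coverage_fact
    EXCEPT
    SELECT product_key FROM sales_fact
""").fetchall()
```

In practice the two SELECTs would also constrain date and store so the set difference is taken at the same grain.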
Fact table surrogate key — not useful for BI queries, but has ETL benefits:
Replacing UPDATE with INSERT + DELETE (safer in bulk ETL).

Snowflaking: avoid it. Snowflaking normalizes dimension tables into multiple related tables, which hurts query readability and BI tool usability.
One acceptable exception: outrigger dimensions — a secondary dimension attached to a dimension table (not to the fact table directly).
Example:
Product Dimension → Product Introduction Date Dimension (date, calendar month, year, fiscal year, …).

Even so, outriggers add joins and reduce readability — use them only when the size or volatility of the sub-dimension justifies it.
A Centipede Fact Table occurs when a fact table has an excessive number of dimension foreign keys.
Junk Dimension — combine several low-cardinality flags and indicators (that would otherwise each get their own FK column) into a single dimension table:
| Without Junk Dimension | With Junk Dimension |
|---|---|
| is_gift_flag (FK) | transaction_type_key (FK → junk_dim) |
| is_rush_order_flag (FK) | (all flags live as columns inside junk_dim) |
| payment_status_key (FK) | |
| return_flag (FK) | |
The junk dimension contains one row per unique combination of flag values. It reduces the number of FK columns in the fact table without losing any information.
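Building the junk dimension is just enumerating the cross product of the flag domains. A sketch with hypothetical flag names and values:

```python
from itertools import product

# Low-cardinality flag domains that would otherwise each need their own FK.
transaction_types = ['Sale', 'Return']
gift_flags = ['Y', 'N']
rush_flags = ['Y', 'N']

# One junk-dimension row per unique combination of flag values.
junk_dim = [
    {'junk_key': i, 'transaction_type': t, 'is_gift': g, 'is_rush': r}
    for i, (t, g, r) in enumerate(
        product(transaction_types, gift_flags, rush_flags), start=1)
]
# 2 * 2 * 2 = 8 rows; each fact row now carries one junk_key FK
# instead of three separate flag columns.
```

In practice many teams only materialize combinations that actually occur, since the full cross product can explode with many flags.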
| Practice | Why |
|---|---|
| Always declare the grain before choosing dimensions and facts | Prevents scope creep and mixed granularity |
| Flatten hierarchies into dimension rows | Enables drill-up/down without joins; improves BI tool usability |
| Replace NULL dimension attributes with “Unknown” or “N/A” | Avoids GROUP BY / WHERE surprises that confuse end users |
| Use descriptive text labels instead of Y/N flags | Reports are self-explanatory without lookup tables |
| Store derived facts (e.g., gross profit) on disk | Avoids repeated computation; ensures consistent results |
| Never sum non-additive facts (unit price, ratios) | Use SUM(numerator) / SUM(denominator) instead |
| Use a Factless Fact Table for promotion coverage | Enables “promoted but not sold” analysis |
| Store degenerate dimensions (order/transaction numbers) in the fact table | They have no attributes; a separate table would be empty |
| Pre-populate the Date dimension 10–20 years ahead | Avoids ETL failures on future dates |
| Separate time-of-day from the Date dimension | Prevents row explosion on the Date dimension |
Chapter 4 extends the retail dimensional model into the inventory supply chain, demonstrating how the same business pipeline spawns multiple fact tables — each with its own grain, metrics, and modeling pattern.
Core pipeline:
Purchase Order → Receive Warehouse Deliveries → Warehouse Product Inventory → Store Delivery → Store Product Inventory → Retail Sales
Key highlights:
quantity_on_hand is the canonical example of a semi-additive fact: valid to sum across products and stores, but meaningless to sum across time.
SQL AVG on semi-additive facts is a silent trap — the denominator includes all rows, not distinct dates; always aggregate per date first using a window function.
Levels of inventory are measured at regular intervals, each measurement placed as a separate row in the fact table.
Example: The most atomic level of detail provided by the operational inventory system is a daily inventory snapshot per product per store.
| Key | Columns |
|---|---|
| Dimensions | date_key, product_key, store_key |
| Facts | quantity_on_hand |
To reduce data volumes, it is common to either:
Semi-additive fact: quantity_on_hand cannot be aggregated over time, but can be summed across stores.
Caution with AVG: SQL AVG averages over all rows returned by the query — not just the number of distinct dates.
Example: 3 products × 4 stores × 7 dates → the denominator would be 84, not 7.
Correct approach: Use a window function to first aggregate per date, then average the results — this is the most robust solution.
SUM(quantity_on_hand) / COUNT(DISTINCT date_key) also works, but breaks silently on sparse data (if not every product has a snapshot for every date, the denominator will be wrong).
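The AVG trap and the fix can be shown concretely. This sketch aggregates per date with a subquery (a window function over the date achieves the same effect); the table and numbers are invented:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE inventory_fact (
    date_key INTEGER, product_key INTEGER, quantity_on_hand INTEGER);
-- 2 products x 2 dates = 4 snapshot rows.
INSERT INTO inventory_fact VALUES
    (1, 1, 10), (1, 2, 20),
    (2, 1, 30), (2, 2, 50);
""")

# Naive AVG divides by all 4 rows, not the 2 distinct dates.
naive = conn.execute(
    "SELECT AVG(quantity_on_hand) FROM inventory_fact").fetchone()[0]

# Correct: collapse to one total per date first, then average the daily totals.
correct = conn.execute("""
    SELECT AVG(daily_total) FROM (
        SELECT date_key, SUM(quantity_on_hand) AS daily_total
        FROM inventory_fact
        GROUP BY date_key
    )
""").fetchone()[0]
# Daily totals are 30 and 80, so the true average on-hand level is 55,
# while the naive AVG reports 27.5.
```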
The periodic snapshot alone is rarely sufficient — it is usually necessary to add metrics that provide meaning and insight:
Turns = qty_sold / qty_on_hand (note: the standard financial formula is COGS (Cost of Goods Sold) / Average Inventory — this quantity-based ratio is a simplified proxy).

Another way to model inventory is through inventory transactions: receive, place, release, inspect, bin, ship, return, remove, etc.
It is impractical to rely solely on the transaction model — it is too dense, though technically possible. (This was more true in Kimball’s era. Modern columnar stores like Snowflake, BigQuery, and Redshift handle transaction-grain tables well — the “too dense” argument is less compelling today.)
If measurements have different natural granularities or dimensionalities, they should belong to different fact tables.
Useful when you can identify each individual product (e.g., by serial number) to track its full lifecycle.
The Bus Architecture is the framework that integrates dimensional data marts across the entire organization into a cohesive enterprise warehouse — without requiring a single monolithic system.
The “bus” analogy comes from electrical engineering: multiple components plug into it independently but share a standard interface. In DW terms, conformed dimensions and conformed facts serve as that shared interface.
Key properties:
Drill-across: When two fact tables share a conformed dimension, a BI tool can query both and align results on that shared key — this is Kimball’s integration mechanism instead of a normalized enterprise model.
The Bus Matrix is the master planning document for the entire DW/BI system — the most important artifact to create before building anything.
Structure:
| Business Process | Date | Customer | Product | Employee | Store | Promotion |
|---|---|---|---|---|---|---|
| Sales Orders | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| Inventory | ✓ | ✓ | ✓ | | | |
| HR Payroll | ✓ | ✓ | ✓ | | | |
| Marketing Spend | ✓ | ✓ | ✓ | ✓ | | |
What it reveals:
Variations:
How to use it:
(Date and Customer almost always appear in row 1.)

A checkmark in the matrix is a commitment: this dimension will mean the same thing in this mart as everywhere else. That commitment is what makes drill-across queries possible.
Common Mistakes:
A shrunken rollup conformed dimension is a higher-grain version of a base conformed dimension that contains a subset of rows and a subset of attributes — and still conforms to it.
It exists to serve fact tables that operate at a coarser grain than the base dimension (e.g., a Marketing Spend fact table at brand level, while the base Product dimension is at SKU level).
Conformance rules — a shrunken dimension conforms when:
Renaming a column or redefining its meaning in the shrunken version breaks conformance — drill-across queries will produce incorrect results.
Example:
Base Product dimension (SKU-level grain):
| product_key | sku | product_name | brand | category | department |
|---|---|---|---|---|---|
| 1 | SKU-001 | Tropicana OJ 64oz | Tropicana | Juice | Beverages |
| 2 | SKU-002 | Minute Maid OJ 64oz | Minute Maid | Juice | Beverages |
Shrunken rollup Brand dimension (brand-level grain):
| brand_key | brand | category | department |
|---|---|---|---|
| 10 | Tropicana | Juice | Beverages |
| 11 | Minute Maid | Juice | Beverages |
Do not remove attributes from the base dimension — the shrunken rollup is an addition, not a replacement. The Product dimension keeps all its columns.
Physical table or view?
| Situation | Use |
|---|---|
| All rollup attributes come from the base dimension | View (CREATE VIEW brand_dim AS SELECT DISTINCT ...) |
| Rollup has its own attributes or its own source system | Physical table |
| Rollup is queried heavily at scale | Physical table (for performance) |
Prefer a view — it stays in sync automatically and guarantees conformance by definition. Only materialize into a physical table when there is a concrete reason to do so.
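A shrunken rollup implemented as a view, using the Product/Brand example above. Runnable SQLite sketch:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product_dim (
    product_key INTEGER PRIMARY KEY, sku TEXT, product_name TEXT,
    brand TEXT, category TEXT, department TEXT);
INSERT INTO product_dim VALUES
    (1, 'SKU-001', 'Tropicana OJ 64oz',  'Tropicana',   'Juice', 'Beverages'),
    (2, 'SKU-002', 'Minute Maid OJ 64oz', 'Minute Maid', 'Juice', 'Beverages');

-- Shrunken rollup as a view: same column names, subset of attributes,
-- one row per brand. It conforms by construction and never drifts.
CREATE VIEW brand_dim AS
    SELECT DISTINCT brand, category, department FROM product_dim;
""")

rows = conn.execute(
    "SELECT brand, category, department FROM brand_dim ORDER BY brand").fetchall()
```

Adding a new SKU for an existing brand changes nothing in `brand_dim`; adding a new brand surfaces there automatically, with no extra ETL.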
In the Bus Matrix, a shrunken rollup is annotated with a partial or shaded checkmark to indicate the business process uses a rolled-up version of a shared dimension, not the full atomic one.
| Practice | Why |
|---|---|
| Model each business process as its own fact table | Each process has a unique grain, timestamp, and metrics — mixing them corrupts the grain |
| Choose the right fact table type per process | Periodic Snapshot for regular level measurements; Transaction for discrete events; Accumulating Snapshot for pipeline lifecycles |
| Zero-fill periodic snapshots for inactive products | Maintains continuity for time-series calculations; absence of a row implies missing data, not zero inventory |
| Never sum quantity_on_hand across time | It is semi-additive — summing over dates produces meaningless totals |
| Never use SQL AVG directly on semi-additive facts | The denominator counts all rows, not distinct dates — aggregate per date first using a window function |
| Use SUM(qty) / COUNT(DISTINCT date_key) only on dense data | Breaks silently on sparse snapshots where not every product has a row for every date |
| Add enrichment metrics beyond quantity_on_hand | Turns, days of supply, extended cost at inventory — raw on-hand quantity alone provides little business insight |
| Use the correct Turns formula for financial reporting | COGS / Average Inventory, not qty_sold / qty_on_hand — the quantity ratio is only a simplified operational proxy |
| Do not rely solely on the transaction model for inventory | Too granular for most analytical queries; combine with a periodic snapshot for level-based analysis |
| Use the accumulating snapshot only for processes with a defined start, end, and milestones | It is updated in place — unsuitable for open-ended or unbounded processes |
| Build the Bus Matrix before writing any code | It is the master planning artifact — defines scope, surfaces conformance conflicts, and aligns business and IT |
| Derive Bus Matrix rows from business processes, not reports or IT systems | Report-centric rows produce a matrix that is hard to maintain and impossible to conform |
| Use the most granular grain in the Bus Matrix columns | One column per hierarchy level inflates the matrix; roll hierarchy levels into the dimension, not into separate columns |
| Implement shrunken rollup dimensions as views when possible | A view stays in sync with the base dimension automatically and guarantees conformance by definition |
| Never rename or redefine attributes in a shrunken rollup | Doing so breaks conformance — drill-across queries on that attribute will produce incorrect results |
Key business questions procurement data must answer:
Business processes in scope: Purchase Requisitions, Purchase Orders, Shipping Notifications, Warehouse Receipts, Vendor Invoices, Vendor Payments.
| Type | Name | Strategy | Key trait |
|---|---|---|---|
| Type 0 | Retain Original | Never overwrite | Attribute should never change (e.g., original contract date, surrogate key) |
| Type 1 | Overwrite | Replace the old value in place | History is lost; always reflects the most recent assignment; can affect aggregated reports |
| Type 2 | Add New Row | Insert a new row with a new surrogate key | Full history preserved; requires effective_date, expiration_date, and is_current flag; the same entity will have multiple surrogate keys |
| Type 3 | Add New Column | Add a column for the previous value | Only current and one prior value are accessible; rarely used |
| Type 4 | Mini-Dimension | Extract rapidly changing attributes into a separate dimension | Keeps the base dimension stable; mini-dimension has its own surrogate key |
| Type 5 | Mini-Dimension + Type 1 Outrigger | Type 4 mini-dimension + a Type 1 FK on the base dimension pointing to the current profile | Enables both historical (via fact table FK) and current-state (via base dimension outrigger) queries without exploding either table |
| Type 6 | Type 1 + 2 + 3 | Add a new row (Type 2) + overwrite a “current value” column on all rows (Type 1) + keep a “prior value” column (Type 3) | Both current and historical values are accessible on every row; most complex to maintain |
| Type 7 | Dual Type 1 + Type 2 | Maintain both a Type 1 (current-only) and a Type 2 (full history) view of the same dimension | Fact rows can join to either view depending on whether the query needs current or historical context |
Types 1 and 2 cover the vast majority of real-world cases. Types 4–7 are advanced techniques — reach for them only when Types 1 and 2 are clearly insufficient. Applying higher types prematurely adds ETL complexity with little benefit.
Type 1 attributes inside a Type 2 dimension: When a non-versioned attribute changes, the business must decide whether to update only the current row or all historical rows. There is no universal answer — it depends on whether historical reports should reflect the value at the time of the event or always the latest value.
Backfilling Type 2 history: It is possible to load historical rows into a Type 2 dimension on initial load, but it requires the source system to provide effective dates for each change. This is often difficult in practice — source systems rarely retain that history — but it is not impossible and should not be ruled out without investigation.
Type 5 = Type 4 (mini-dimension) + a Type 1 FK on the base dimension pointing to the customer’s current profile.
The problem it solves: Some dimension attributes change frequently (e.g., age band, income bracket, credit score tier). Applying Type 2 to a large dimension like Customer causes it to grow uncontrollably. Type 4 alone solves the size problem, and the fact table’s FK to the mini-dimension still preserves the profile at the time of each event — what it loses is the ability to ask “show me all customers currently in the Gold tier” without scanning the fact table at all. The Type 1 outrigger on the base dimension closes that specific gap.
Structure:
customer_dim (base)
├── customer_key (SK)
├── customer_id
├── name
├── ...stable attributes...
└── current_profile_key ← Type 1 outrigger — always overwritten to latest profile
customer_profile_dim (mini-dimension)
├── profile_key (SK)
├── age_band (e.g., "25–34")
├── income_bracket (e.g., "50k–75k")
├── purchase_frequency (e.g., "Weekly")
└── credit_score_tier (e.g., "Good")
sales_fact
├── date_key
├── customer_key ← FK to base customer_dim
├── customer_profile_key ← FK to mini-dimension (profile at time of sale)
└── sales_amount
Two query paths:
| Need | Join path |
|---|---|
| What profile did this customer have at purchase time? | sales_fact → customer_profile_dim via profile_key |
| What does this customer look like today? | customer_dim → customer_profile_dim via current_profile_key |
ETL behavior: When a customer’s profile changes, insert a new row into customer_profile_dim (new profile_key) and overwrite current_profile_key on the customer_dim row (Type 1). The fact table retains the old profile_key — history is preserved there automatically.
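The Type 5 ETL step described above, as a runnable SQLite sketch (schema follows the structure diagram; keys and profile values are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer_profile_dim (
    profile_key INTEGER PRIMARY KEY, age_band TEXT, income_bracket TEXT);
CREATE TABLE customer_dim (
    customer_key INTEGER PRIMARY KEY, customer_id TEXT, current_profile_key INTEGER);
CREATE TABLE sales_fact (
    customer_key INTEGER, customer_profile_key INTEGER, sales_amount REAL);

INSERT INTO customer_profile_dim VALUES (1, '25-34', '50k-75k');
INSERT INTO customer_dim VALUES (100, 'C-001', 1);
INSERT INTO sales_fact VALUES (100, 1, 99.0);  -- profile at time of sale
""")

# Profile change: insert a new mini-dimension row (new profile_key),
# then Type 1 overwrite of the outrigger FK on the base dimension.
conn.execute("INSERT INTO customer_profile_dim VALUES (2, '35-44', '75k-100k')")
conn.execute(
    "UPDATE customer_dim SET current_profile_key = 2 WHERE customer_id = 'C-001'")

# The fact row still points at the old profile; history is preserved there.
historical = conn.execute(
    "SELECT customer_profile_key FROM sales_fact").fetchone()[0]
current = conn.execute(
    "SELECT current_profile_key FROM customer_dim").fetchone()[0]
```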
Type 6 (1+2+3=6) adds both a “current value” column and a “prior value” column to every row of a Type 2 dimension, then overwrites the current value column across all rows whenever the attribute changes.
Structure (example: Customer Region):
customer_dim
├── customer_key (SK)
├── customer_id
├── region ← Type 2 versioned value (value at time this row was created)
├── current_region ← Type 1 overwritten on ALL rows (always reflects latest)
├── prior_region ← Type 3 prior value column
├── effective_date
├── expiration_date
└── is_current
Example — customer moves from East to West:
| customer_key | customer_id | region | current_region | prior_region | is_current |
|---|---|---|---|---|---|
| 201 | C-001 | East | West | East | N |
| 202 | C-001 | West | West | East | Y |
region on row 201 is still “East” — the historical value when that row was active.
current_region on both rows is “West” — overwritten everywhere (Type 1).
prior_region retains “East” (Type 3) — but only for this change; on a second change (e.g., West → North), prior_region on the new row would be “West”, not “East”. The Type 3 column holds only one prior value — it does not chain history across multiple changes.

Query flexibility:
| Analysis need | Column to use |
|---|---|
| What region was the customer in when they bought? | region (Type 2 versioned) |
| What region is the customer in now? | current_region (Type 1) |
| What was their previous region? | prior_region (Type 3) |
The cost: Every attribute change triggers an UPDATE across all historical rows for that entity to refresh current_region. This is the most ETL-intensive SCD type. It is elegant for querying but expensive to maintain at scale.
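A runnable sketch of one Type 6 change, reproducing the East → West example above. Note the final UPDATE that touches every historical row for the entity — the maintenance cost the text warns about:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer_dim (
    customer_key INTEGER PRIMARY KEY, customer_id TEXT,
    region TEXT, current_region TEXT, prior_region TEXT, is_current TEXT);
INSERT INTO customer_dim VALUES (201, 'C-001', 'East', 'East', NULL, 'Y');
""")

def type6_change(conn, customer_id, new_region):
    """Type 6 = Type 2 row insert + Type 1 current-value overwrite on ALL rows
    + Type 3 single prior value."""
    (old_region,) = conn.execute(
        "SELECT region FROM customer_dim WHERE customer_id = ? AND is_current = 'Y'",
        (customer_id,)).fetchone()
    # Type 2: expire the current row, add a new versioned row.
    conn.execute(
        "UPDATE customer_dim SET is_current = 'N' "
        "WHERE customer_id = ? AND is_current = 'Y'", (customer_id,))
    conn.execute(
        "INSERT INTO customer_dim (customer_id, region, current_region, prior_region, is_current) "
        "VALUES (?, ?, ?, ?, 'Y')", (customer_id, new_region, new_region, old_region))
    # Type 1 + Type 3: refresh current/prior columns on EVERY row for this
    # entity — this grows linearly with the entity's history.
    conn.execute(
        "UPDATE customer_dim SET current_region = ?, prior_region = ? "
        "WHERE customer_id = ?", (new_region, old_region, customer_id))

type6_change(conn, 'C-001', 'West')
rows = conn.execute(
    "SELECT customer_key, region, current_region, prior_region, is_current "
    "FROM customer_dim ORDER BY customer_key").fetchall()
```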
Type 7 maintains the Type 2 dimension as normal, but also exposes a Type 1 current-state view by filtering to only is_current = 'Y' rows. The fact table stores both the Type 2 surrogate key (historical) and the durable natural key (for joining to the current-state view).
Structure:
customer_dim (Type 2, full history)
├── customer_key (SK — changes with each new version)
├── customer_id (durable natural key — stable across versions)
├── region
├── effective_date
├── expiration_date
└── is_current
-- Type 1 current-state view (no ETL needed — just a filter)
CREATE VIEW customer_current_dim AS
SELECT * FROM customer_dim WHERE is_current = 'Y'
sales_fact
├── customer_key ← FK to Type 2 dimension (historical context)
├── customer_id ← durable key for joining to current-state view
└── ...
Two query paths:
| Need | Join |
|---|---|
| Historical — what was the customer’s region when they bought? | sales_fact.customer_key → customer_dim |
| Current — what region is the customer in today? | sales_fact.customer_id → customer_current_dim |
Advantage over Type 6: No UPDATE across historical rows — the current-state view is just a filter. The ETL complexity stays the same as a standard Type 2.
Disadvantage: BI tools must be aware of two keys (customer_key and customer_id) and two join paths — this adds semantic complexity for report writers.