Data Engineering: Notes of Data Warehouse Toolkit

21 March 2026, Carlos Pena

My notes about the Data Warehouse Toolkit (Kimball).


Chapter 1

Atomic Star Schema → (optional) OLAP Cube (e.g.: product, market, date)

A measurement event in the physical world has a one-to-one relationship to a single row in the corresponding fact table.

Measure types:

You should not store redundant text in a fact table.

In transaction fact tables, if there is no activity, the row should be absent — do not fill with zeros.


Fact Table

Types:

Type Description Zero-fill empty rows?
Transaction One row per event at a point in time No — absent row means no activity
Periodic Snapshot One row per period, regardless of activity Yes — maintain snapshot continuity
Accumulating Snapshot One row per process/pipeline instance, updated as milestones occur Yes — row exists from start to finish

Rules:

Best Practices:


Dimension Table

Answers: Who, What, Where, When, How, Why

Fact vs. Dimension attribute — when it’s ambiguous:

Attribute Description
Fact A measurement that takes on many values and participates in calculations; a continuous value
Dimension A discrete descriptive value

The data is only useful if it is simple to understand.

Best Practices:

SCD Type Strategy Use when
Type 1 Overwrite the old value History doesn’t matter
Type 2 Add a new row with a new surrogate key Full history must be preserved
Type 3 Add a new column for the new value Only current and previous value matter

Conformed Dimensions

A dimension is conformed when it means the same thing across multiple fact tables and data marts.

If two fact tables share a conformed dimension, a BI tool can query both and align the results correctly — this is the integration mechanism Kimball proposes instead of a normalized enterprise model.


Architecture

Operational Source Systems

ETL System

ETL Best Practices:

Data Presentation Area

ETL systems should be off-limits to business users — we don’t want busy ETL professionals distracted by unpredictable BI queries.


DW Myths

  1. “Dimensional models are only for summary data”
    • Summary data should complement granular detail to improve performance for common queries — not replace it.
  2. “Dimensional models are departmental, not enterprise”
    • Multiple business functions often want to analyze the same metrics.
  3. “Dimensional models are not scalable”
    • Fact tables with 2 trillion rows have been reported in production.
  4. “Dimensional models are only for predictable usage”
    • Design should center on the measurement process, not on predefined reports or analyses.
  5. “Dimensional models can’t be integrated”
    • Conformed dimensions and the Bus Architecture are precisely the integration mechanism.

Agile


Key Design Checklist

Before finalizing a dimensional model, verify:


Chapter 3 - Retail

Four-Step Dimensional Design Process

Step 1: Select the Business Process

Step 2: Declare the Grain

The grain defines what a single row in the fact table represents — always atomic.

Examples:

Step 3: Identify the Dimensions

Answers: Who, What, Where, When, Why, How

Step 4: Identify the Facts


Retail Case Study

Scenario: A grocery chain with 100 stores across 5 states, each carrying a full complement of departments (Grocery, Frozen, Dairy, Meat, etc.) and approximately 60,000 SKUs per store. Data is collected at cash registers (POS system).

Design Step Decision
Business Process Purchases at POS system
Grain One row per individual product scanned per sales transaction
Dimensions Date, Store, Product, Promotion, Cashier, Payment Method
Facts Sales quantity, unit prices, discounts, sales amount, cost amount

If a new dimension causes additional rows, it should be disqualified or the grain must be revisited.

Facts Detail

Sales quantity, extended discount, sales, and cost dollar amounts are additive across all dimensions.

Derived facts (e.g., Extended Gross Profit = Extended Sales − Extended Cost) are generally recommended to be stored on disk.

Name Description Example
Regular Unit Price Original price per unit before any discount $1,000
Discount Unit Price Discount applied per unit $100
Net Unit Price Final price per unit after discount $900
Extended Discount Dollar Amount Total discount for all units (discount × quantity) $200
Extended Sales Dollar Amount Total revenue after discount (net price × quantity) $1,800
Extended Cost Dollar Amount Total cost for all units (cost × quantity) $1,200
Extended Gross Profit Dollar Amount Total profit (sales − cost) $600

Non-Additive Facts:


Dimension Details

Date Dimension

The Date dimension is typically the first and most reused dimension; usually daily-grained.

See kimballgroup.com → Tools → Utilities for a pre-built Date dimension.

Current and relative date attributes (e.g., is_current_date, is_current_month) may have an update lag of up to one day — it is better to compute these at query time than to store them.

Time-of-Day

Time-of-day is typically separated from the Date dimension to avoid row explosion.

Product Dimension

Numeric values — Attribute or Fact?

Signal Put it in…
Used primarily in calculations Fact table
Used primarily to filter, group, or label Dimension table
Serves both purposes Both — with distinct meanings

Example: Standard price stored in the fact captures the valuation at purchase time; stored in the dimension it indicates the current list price.

Rule of thumb:

Drilling Down / Up:

Store Dimension

The Store dimension usually does not have a pre-existing source table — the team often builds it by combining multiple source systems.

Required attributes: city, country, city-state combination, state, zip code, selling square footage, total square footage, first open date, remodel date, store manager.

Best Practice: Create role-playing views of the Date dimension to represent different business events (e.g., order date, shipment date, delivery date). This avoids multiple joins to the same table (JOIN dim_date d1 ... JOIN dim_date d2 ...) and keeps queries readable.

Promotion Dimension

Also called the Causal Dimension — it describes the factors believed to cause a change in product sales.

Must include: temporary price reductions, aisle displays, newspaper ads, coupons, etc. These are typically combined into a single dimension row.

Key business questions it enables:

Null Foreign Keys, Null Attributes, and Null Facts

Scenario Recommendation
Null FK in fact table Never allow it — add a row in the dimension (e.g., “Not Applicable”) and point the FK there
Null attribute in dimension Replace with a descriptive string: “Unknown”, “Not Applicable”
Null value in a fact column Usually acceptable — aggregate functions (MIN, MAX, COUNT, AVG) handle NULLs correctly

Payment Method Dimension

A single transaction may involve more than one payment method (e.g., cash + gift card), which would violate the grain of the sales fact table.

Solution: Create a separate Payment fact table with a finer grain — one row per payment method per transaction — rather than forcing multiple payment methods into the sales fact row.

Degenerate Dimensions

A degenerate dimension is a dimension key that has no associated dimension table — it carries no descriptive attributes beyond the identifier itself (e.g., a POS transaction number, order number, or invoice number).

Extensibility

The dimensional model is designed to accommodate change gracefully:

Change How to handle
Add a new dimension (e.g., Frequent Shopper) Add a new FK column to the fact table; backfill historical rows with a “Prior to Program” surrogate key
Add a new dimension attribute Add a new column; fill historical rows with “Not Available” or equivalent
Add a new measured fact (same grain, same process) Add a new column to the existing fact table
Add a new measured fact (different grain or process) Create a new fact table

Factless Fact Table

Use case: Which products were on promotion but did not sell?

A Factless Fact Table contains only foreign keys — no numeric measures (or a dummy constant like 1). It records that an event could have happened but carries no measurement.

Promotion Coverage schema:

promotion_coverage_fact
├── date_key        (FK)
├── product_key     (FK)
├── store_key       (FK)
└── promotion_key   (FK)

Query pattern to find promoted-but-unsold products:

  1. Query promotion_coverage_fact → all products on promotion
  2. Query sales_fact → all products actually sold
  3. Return (1) EXCEPT (2)

Surrogate Keys

Surrogate keys are artificial keys — the DW system has no dependency on the source system’s key type or values, so changes to the source key have no impact.

Fact table surrogate key — not useful for BI queries, but has ETL benefits:

Snowflaking (Normalized Dimensions)

Avoid it. Snowflaking normalizes dimension tables into multiple related tables, which hurts query readability and BI tool usability.

One acceptable exception: outrigger dimensions — a secondary dimension attached to a dimension table (not to the fact table directly).

Example: Product DimensionProduct Introduction Date Dimension (date, calendar month, year, fiscal year, …)

Even so, outriggers add joins and reduce readability — use them only when the size or volatility of the sub-dimension justifies it.

Centipede Fact Table

A Centipede Fact Table occurs when a fact table has an excessive number of dimension foreign keys.

Junk Dimension — combine several low-cardinality flags and indicators (that would otherwise each get their own FK column) into a single dimension table:

Without Junk Dimension With Junk Dimension
is_gift_flag (FK) transaction_type_key (FK → junk_dim)
is_rush_order_flag (FK) (all flags live as columns inside junk_dim)
payment_status_key (FK)  
return_flag (FK)  

The junk dimension contains one row per unique combination of flag values. It reduces the number of FK columns in the fact table without losing any information.

Best Practices Summary — Chapter 3

Practice Why
Always declare the grain before choosing dimensions and facts Prevents scope creep and mixed granularity
Flatten hierarchies into dimension rows Enables drill-up/down without joins; improves BI tool usability
Replace NULL dimension attributes with “Unknown” or “N/A” Avoids GROUP BY / WHERE surprises and confuses end users
Use descriptive text labels instead of Y/N flags Reports are self-explanatory without lookup tables
Store derived facts (e.g., gross profit) on disk Avoids repeated computation; ensures consistent results
Never sum non-additive facts (unit price, ratios) Use SUM(numerator) / SUM(denominator) instead
Use a Factless Fact Table for promotion coverage Enables “promoted but not sold” analysis
Store degenerate dimensions (order/transaction numbers) in the fact table They have no attributes; a separate table would be empty
Pre-populate the Date dimension 10–20 years ahead Avoids ETL failures on future dates
Separate time-of-day from the Date dimension Prevents row explosion on the Date dimension

Chapter 4 - Inventory

Summary

Chapter 4 extends the retail dimensional model into the inventory supply chain, demonstrating how the same business pipeline spawns multiple fact tables — each with its own grain, metrics, and modeling pattern.

Core pipeline:

Purchase Order → Receive Warehouse Deliveries → Warehouse Product Inventory → Store Delivery → Store Product Inventory → Retail Sales

Key highlights:


Inventory Periodic Snapshot

Levels of inventory measurement taken at regular intervals, each placed as a separate row in the fact table.

Example: The most atomic level of detail provided by the operational inventory system is a daily inventory snapshot per product per store.

Key Columns
Dimensions date_key, product_key, store_key
Facts quantity_on_hand

To reduce data volumes, it is common to either:

Semi-additive fact: quantity_on_hand cannot be aggregated over time, but can be summed across stores.

Caution with AVG: SQL AVG averages over all rows returned by the query — not just the number of distinct dates.

Example: 3 products × 4 stores × 7 dates → the denominator would be 84, not 7.

Correct approach: Use a window function to first aggregate per date, then average the results — this is the most robust solution.

SUM(quantity_on_hand) / COUNT(DISTINCT date_key) also works, but breaks silently on sparse data (if not every product has a snapshot for every date, the denominator will be wrong).

Additional Inventory Metrics

The periodic snapshot alone is rarely sufficient — it is usually necessary to add metrics that provide meaning and insight:

Inventory Transactions

Another way to model inventory is through inventory transactions: receive, place, release, inspect, bin, ship, return, remove, etc.

It is impractical to rely solely on the transaction model — it is too dense, though technically possible. (This was more true in Kimball’s era. Modern columnar stores like Snowflake, BigQuery, and Redshift handle transaction-grain tables well — the “too dense” argument is less compelling today.)

If measurements have different natural granularities or dimensionalities, they should belong to different fact tables.

Inventory Accumulating Snapshot

Useful when you can identify each individual product (e.g., by serial number) to track its full lifecycle.

Enterprise Data Warehouse Bus Architecture

The Bus Architecture is the framework that integrates dimensional data marts across the entire organization into a cohesive enterprise warehouse — without requiring a single monolithic system.

The “bus” analogy comes from electrical engineering: multiple components plug into it independently but share a standard interface. In DW terms, conformed dimensions and conformed facts serve as that shared interface.

Key properties:

Drill-across: When two fact tables share a conformed dimension, a BI tool can query both and align results on that shared key — this is Kimball’s integration mechanism instead of a normalized enterprise model.

Enterprise Data Warehouse Bus Matrix

The Bus Matrix is the master planning document for the entire DW/BI system — the most important artifact to create before building anything.

Structure:

Business Process Date Customer Product Employee Store Promotion
Sales Orders
Inventory      
HR Payroll      
Marketing Spend    

What it reveals:

Variations:

How to use it:

  1. Workshop rows with business stakeholders — processes come from the business, not from IT systems
  2. Identify the highest-value row and build that mart first
  3. Lock down shared dimensions early (Date and Customer almost always appear in row 1)
  4. Revisit and expand as new processes are onboarded — the matrix is a living document

A checkmark in the matrix is a commitment: this dimension will mean the same thing in this mart as everywhere else. That commitment is what makes drill-across queries possible.

Common Mistakes:

Shrunken Rollup Conformed Dimension

A shrunken rollup conformed dimension is a higher-grain version of a base conformed dimension that contains a subset of rows and a subset of attributes — and still conforms to it.

It exists to serve fact tables that operate at a coarser grain than the base dimension (e.g., a Marketing Spend fact table at brand level, while the base Product dimension is at SKU level).

Conformance rules — a shrunken dimension conforms when:

  1. Attribute subset — every column in the shrunken dimension also exists in the base dimension with the same name and meaning
  2. Row subset — every row corresponds to one or more rows in the base dimension (it is a rollup, not an arbitrary filter)

Renaming a column or redefining its meaning in the shrunken version breaks conformance — drill-across queries will produce incorrect results.

Example:

Base Product dimension (SKU-level grain):

product_key sku product_name brand category department
1 SKU-001 Tropicana OJ 64oz Tropicana Juice Beverages
2 SKU-002 Minute Maid OJ 64oz Minute Maid Juice Beverages

Shrunken rollup Brand dimension (brand-level grain):

brand_key brand category department
10 Tropicana Juice Beverages
11 Minute Maid Juice Beverages

Do not remove attributes from the base dimension — the shrunken rollup is an addition, not a replacement. The Product dimension keeps all its columns.

Physical table or view?

Situation Use
All rollup attributes come from the base dimension View (CREATE VIEW brand_dim AS SELECT DISTINCT ...)
Rollup has its own attributes or its own source system Physical table
Rollup is queried heavily at scale Physical table (for performance)

Prefer a view — it stays in sync automatically and guarantees conformance by definition. Only materialize into a physical table when there is a concrete reason to do so.

In the Bus Matrix, a shrunken rollup is annotated with a partial or shaded checkmark to indicate the business process uses a rolled-up version of a shared dimension, not the full atomic one.


Best Practices Summary — Chapter 4

Practice Why
Model each business process as its own fact table Each process has a unique grain, timestamp, and metrics — mixing them corrupts the grain
Choose the right fact table type per process Periodic Snapshot for regular level measurements; Transaction for discrete events; Accumulating Snapshot for pipeline lifecycles
Zero-fill periodic snapshots for inactive products Maintains continuity for time-series calculations; absence of a row implies missing data, not zero inventory
Never sum quantity_on_hand across time It is semi-additive — summing over dates produces meaningless totals
Never use SQL AVG directly on semi-additive facts The denominator counts all rows, not distinct dates — aggregate per date first using a window function
Use SUM(qty) / COUNT(DISTINCT date_key) only on dense data Breaks silently on sparse snapshots where not every product has a row for every date
Add enrichment metrics beyond quantity_on_hand Turns, days of supply, extended cost at inventory — raw on-hand quantity alone provides little business insight
Use the correct Turns formula for financial reporting COGS / Average Inventory, not qty_sold / qty_on_hand — the quantity ratio is only a simplified operational proxy
Do not rely solely on the transaction model for inventory Too granular for most analytical queries; combine with a periodic snapshot for level-based analysis
Use the accumulating snapshot only for processes with a defined start, end, and milestones It is updated in place — unsuitable for open-ended or unbounded processes
Build the Bus Matrix before writing any code It is the master planning artifact — defines scope, surfaces conformance conflicts, and aligns business and IT
Derive Bus Matrix rows from business processes, not reports or IT systems Report-centric rows produce a matrix that is hard to maintain and impossible to conform
Use the most granular grain in the Bus Matrix columns One column per hierarchy level inflates the matrix; roll hierarchy levels into the dimension, not into separate columns
Implement shrunken rollup dimensions as views when possible A view stays in sync with the base dimension automatically and guarantees conformance by definition
Never rename or redefine attributes in a shrunken rollup Doing so breaks conformance — drill-across queries on that attribute will produce incorrect results

Chapter 5 - Procurement (Acquisition)

Key business questions procurement data must answer:

Business processes in scope: Purchase Requisitions, Purchase Orders, Shipping Notifications, Warehouse Receipts, Vendor Invoices, Vendor Payments.

Single vs. Multiple Transaction Fact Tables

Complementary Procurement Accumulating Snapshot

Slowly Changing Dimensions (SCD) — All Types

Type Name Strategy Key trait
Type 0 Retain Original Never overwrite Attribute should never change (e.g., original contract date, surrogate key)
Type 1 Overwrite Replace the old value in place History is lost; always reflects the most recent assignment; can affect aggregated reports
Type 2 Add New Row Insert a new row with a new surrogate key Full history preserved; requires effective_date, expiration_date, and is_current flag; the same entity will have multiple surrogate keys
Type 3 Add New Column Add a column for the previous value Only current and one prior value are accessible; rarely used
Type 4 Mini-Dimension Extract rapidly changing attributes into a separate dimension Keeps the base dimension stable; mini-dimension has its own surrogate key
Type 5 Mini-Dimension + Type 1 Outrigger Type 4 mini-dimension + a Type 1 FK on the base dimension pointing to the current profile Enables both historical (via fact table FK) and current-state (via base dimension outrigger) queries without exploding either table
Type 6 Type 1 + 2 + 3 Add a new row (Type 2) + overwrite a “current value” column on all rows (Type 1) + keep a “prior value” column (Type 3) Both current and historical values are accessible on every row; most complex to maintain
Type 7 Dual Type 1 + Type 2 Maintain both a Type 1 (current-only) and a Type 2 (full history) view of the same dimension Fact rows can join to either view depending on whether the query needs current or historical context

Types 1 and 2 cover the vast majority of real-world cases. Types 4–7 are advanced techniques — reach for them only when Types 1 and 2 are clearly insufficient. Applying higher types prematurely adds ETL complexity with little benefit.

Type 1 attributes inside a Type 2 dimension: When a non-versioned attribute changes, the business must decide whether to update only the current row or all historical rows. There is no universal answer — it depends on whether historical reports should reflect the value at the time of the event or always the latest value.

Backfilling Type 2 history: It is possible to load historical rows into a Type 2 dimension on initial load, but it requires the source system to provide effective dates for each change. This is often difficult in practice — source systems rarely retain that history — but it is not impossible and should not be ruled out without investigation.

SCD Type 5 — Mini-Dimension + Type 1 Outrigger

Type 5 = Type 4 (mini-dimension) + a Type 1 FK on the base dimension pointing to the customer’s current profile.

The problem it solves: Some dimension attributes change frequently (e.g., age band, income bracket, credit score tier). Applying Type 2 to a large dimension like Customer causes it to grow uncontrollably. Type 4 alone solves the size problem but still lets you query current state by joining the fact table to the mini-dimension — what it loses is the ability to ask “show me all customers currently in the Gold tier” without scanning the fact table at all. The Type 1 outrigger on the base dimension solves that specific gap.

Structure:

customer_dim (base)
├── customer_key          (SK)
├── customer_id
├── name
├── ...stable attributes...
└── current_profile_key   ← Type 1 outrigger — always overwritten to latest profile

customer_profile_dim (mini-dimension)
├── profile_key           (SK)
├── age_band              (e.g., "25–34")
├── income_bracket        (e.g., "50k–75k")
├── purchase_frequency    (e.g., "Weekly")
└── credit_score_tier     (e.g., "Good")

sales_fact
├── date_key
├── customer_key          ← FK to base customer_dim
├── customer_profile_key  ← FK to mini-dimension (profile at time of sale)
└── sales_amount

Two query paths:

Need Join path
What profile did this customer have at purchase time? sales_factcustomer_profile_dim via profile_key
What does this customer look like today? customer_dimcustomer_profile_dim via current_profile_key

ETL behavior: When a customer’s profile changes, insert a new row into customer_profile_dim (new profile_key) and overwrite current_profile_key on the customer_dim row (Type 1). The fact table retains the old profile_key — history is preserved there automatically.


SCD Type 6 — Type 1 + 2 + 3 Combined

Type 6 (1+2+3=6) adds both a “current value” column and a “prior value” column to every row of a Type 2 dimension, then overwrites the current value column across all rows whenever the attribute changes.

Structure (example: Customer Region):

customer_dim
├── customer_key      (SK)
├── customer_id
├── region            ← Type 2 versioned value (value at time this row was created)
├── current_region    ← Type 1 overwritten on ALL rows (always reflects latest)
├── prior_region      ← Type 3 prior value column
├── effective_date
├── expiration_date
└── is_current

Example — customer moves from East to West:

customer_key customer_id region current_region prior_region is_current
201 C-001 East West East N
202 C-001 West West East Y

Query flexibility:

Analysis need Column to use
What region was the customer in when they bought? region (Type 2 versioned)
What region is the customer in now? current_region (Type 1)
What was their previous region? prior_region (Type 3)

The cost: Every attribute change triggers an UPDATE across all historical rows for that entity to refresh current_region. This is the most ETL-intensive SCD type. It is elegant for querying but expensive to maintain at scale.


SCD Type 7 — Dual Type 1 + Type 2 Dimensions

Type 7 maintains the Type 2 dimension as normal, but also exposes a Type 1 current-state view by filtering to only is_current = 'Y' rows. The fact table stores both the Type 2 surrogate key (historical) and the durable natural key (for joining to the current-state view).

Structure:

customer_dim (Type 2, full history)
├── customer_key   (SK — changes with each new version)
├── customer_id    (durable natural key — stable across versions)
├── region
├── effective_date
├── expiration_date
└── is_current

-- Type 1 current-state view (no ETL needed — just a filter)
CREATE VIEW customer_current_dim AS
  SELECT * FROM customer_dim WHERE is_current = 'Y'

sales_fact
├── customer_key    ← FK to Type 2 dimension (historical context)
├── customer_id     ← durable key for joining to current-state view
└── ...

Two query paths:

Need Join
Historical — what was the customer’s region when they bought? sales_fact.customer_keycustomer_dim
Current — what region is the customer in today? sales_fact.customer_idcustomer_current_dim

Advantage over Type 6: No UPDATE across historical rows — the current-state view is just a filter. The ETL complexity stays the same as a standard Type 2.

Disadvantage: BI tools must be aware of two keys (customer_key and customer_id) and two join paths — this adds semantic complexity for report writers.