Data Engineering: Notes of Data Warehouse Toolkit

21 March 2026, Carlos Pena

My notes about the Data Warehouse Toolkit (Kimball).


Chapter 1

Atomic Star Schema → (optional) OLAP Cube (e.g.: product, market, date)

A measurement event in the physical world has a one-to-one relationship to a single row in the corresponding fact table.

Measure types:

You should not store redundant text in a fact table.

In transaction fact tables, if there is no activity, the row should be absent — do not fill with zeros.


Fact Table

Types:

Type Description Zero-fill empty rows?
Transaction One row per event at a point in time No — absent row means no activity
Periodic Snapshot One row per period, regardless of activity Yes — maintain snapshot continuity
Accumulating Snapshot One row per process/pipeline instance, updated as milestones occur Yes — row exists from start to finish

Rules:

Best Practices:


Dimension Table

Answers: Who, What, Where, When, How, Why

Fact vs. Dimension attribute — when it’s ambiguous:

Attribute Description
Fact A measurement that takes on many values and participates in calculations; a continuous value
Dimension A discrete descriptive value

The data is only useful if it is simple to understand.

Best Practices:

SCD Type Strategy Use when
Type 1 Overwrite the old value History doesn’t matter
Type 2 Add a new row with a new surrogate key Full history must be preserved
Type 3 Add a new column for the new value Only current and previous value matter

Conformed Dimensions

A dimension is conformed when it means the same thing across multiple fact tables and data marts.

If two fact tables share a conformed dimension, a BI tool can query both and align the results correctly — this is the integration mechanism Kimball proposes instead of a normalized enterprise model.


Architecture

Operational Source Systems

ETL System

ETL Best Practices:

Data Presentation Area

ETL systems should be off-limits to business users — we don’t want busy ETL professionals distracted by unpredictable BI queries.


DW Myths

  1. “Dimensional models are only for summary data”
    • Summary data should complement granular detail to improve performance for common queries — not replace it.
  2. “Dimensional models are departmental, not enterprise”
    • Multiple business functions often want to analyze the same metrics.
  3. “Dimensional models are not scalable”
    • Fact tables with 2 trillion rows have been reported in production.
  4. “Dimensional models are only for predictable usage”
    • Design should center on the measurement process, not on predefined reports or analyses.
  5. “Dimensional models can’t be integrated”
    • Conformed dimensions and the Bus Architecture are precisely the integration mechanism.

Agile


Key Design Checklist

Before finalizing a dimensional model, verify:


Chapter 3 - Retail

Four-Step Dimensional Design Process

Step 1: Select the Business Process

Step 2: Declare the Grain

The grain defines what a single row in the fact table represents — always atomic.

Examples:

Step 3: Identify the Dimensions

Answers: Who, What, Where, When, Why, How

Step 4: Identify the Facts


Retail Case Study

Scenario: A grocery chain with 100 stores across 5 states, each carrying a full complement of departments (Grocery, Frozen, Dairy, Meat, etc.) and approximately 60,000 SKUs per store. Data is collected at cash registers (POS system).

Design Step Decision
Business Process Purchases at POS system
Grain One row per individual product scanned per sales transaction
Dimensions Date, Store, Product, Promotion, Cashier, Payment Method
Facts Sales quantity, unit prices, discounts, sales amount, cost amount

If a new dimension causes additional rows, it should be disqualified or the grain must be revisited.

Facts Detail

Sales quantity, extended discount, sales, and cost dollar amounts are additive across all dimensions.

Derived facts (e.g., Extended Gross Profit = Extended Sales − Extended Cost) are generally recommended to be stored on disk.

Name Description Example
Regular Unit Price Original price per unit before any discount $1,000
Discount Unit Price Discount applied per unit $100
Net Unit Price Final price per unit after discount $900
Extended Discount Dollar Amount Total discount for all units (discount × quantity) $200
Extended Sales Dollar Amount Total revenue after discount (net price × quantity) $1,800
Extended Cost Dollar Amount Total cost for all units (cost × quantity) $1,200
Extended Gross Profit Dollar Amount Total profit (sales − cost) $600

Non-Additive Facts:


Dimension Details

Date Dimension

The Date dimension is typically the first and most reused dimension; usually daily-grained.

See kimballgroup.com → Tools → Utilities for a pre-built Date dimension.

Current and relative date attributes (e.g., is_current_date, is_current_month) may have an update lag of up to one day — it is better to compute these at query time than to store them.

Time-of-Day

Time-of-day is typically separated from the Date dimension to avoid row explosion.

Product Dimension

Numeric values — Attribute or Fact?

Signal Put it in…
Used primarily in calculations Fact table
Used primarily to filter, group, or label Dimension table
Serves both purposes Both — with distinct meanings

Example: Standard price stored in the fact captures the valuation at purchase time; stored in the dimension it indicates the current list price.

Rule of thumb:

Drilling Down / Up:

Store Dimension

The Store dimension usually does not have a pre-existing source table — the team often builds it by combining multiple source systems.

Required attributes: city, country, city-state combination, state, zip code, selling square footage, total square footage, first open date, remodel date, store manager.

Best Practice: Create role-playing views of the Date dimension to represent different business events (e.g., order date, shipment date, delivery date). This avoids multiple joins to the same table (JOIN dim_date d1 ... JOIN dim_date d2 ...) and keeps queries readable.

Promotion Dimension

Also called the Causal Dimension — it describes the factors believed to cause a change in product sales.

Must include: temporary price reductions, aisle displays, newspaper ads, coupons, etc. These are typically combined into a single dimension row.

Key business questions it enables:

Null Foreign Keys, Null Attributes, and Null Facts

Scenario Recommendation
Null FK in fact table Never allow it — add a row in the dimension (e.g., “Not Applicable”) and point the FK there
Null attribute in dimension Replace with a descriptive string: “Unknown”, “Not Applicable”
Null value in a fact column Usually acceptable — aggregate functions (MIN, MAX, COUNT, AVG) handle NULLs correctly

Payment Method Dimension

A single transaction may involve more than one payment method (e.g., cash + gift card), which would violate the grain of the sales fact table.

Solution: Create a separate Payment fact table with a finer grain — one row per payment method per transaction — rather than forcing multiple payment methods into the sales fact row.

Degenerate Dimensions

A degenerate dimension is a dimension key that has no associated dimension table — it carries no descriptive attributes beyond the identifier itself (e.g., a POS transaction number, order number, or invoice number).

Extensibility

The dimensional model is designed to accommodate change gracefully:

Change How to handle
Add a new dimension (e.g., Frequent Shopper) Add a new FK column to the fact table; backfill historical rows with a “Prior to Program” surrogate key
Add a new dimension attribute Add a new column; fill historical rows with “Not Available” or equivalent
Add a new measured fact (same grain, same process) Add a new column to the existing fact table
Add a new measured fact (different grain or process) Create a new fact table

Factless Fact Table

Use case: Which products were on promotion but did not sell?

A Factless Fact Table contains only foreign keys — no numeric measures (or a dummy constant like 1). It records that an event could have happened but carries no measurement.

Promotion Coverage schema:

promotion_coverage_fact
├── date_key        (FK)
├── product_key     (FK)
├── store_key       (FK)
└── promotion_key   (FK)

Query pattern to find promoted-but-unsold products:

  1. Query promotion_coverage_fact → all products on promotion
  2. Query sales_fact → all products actually sold
  3. Return (1) EXCEPT (2)

Surrogate Keys

Surrogate keys are artificial keys — the DW system has no dependency on the source system’s key type or values, so changes to the source key have no impact.

Fact table surrogate key — not useful for BI queries, but has ETL benefits:

Snowflaking (Normalized Dimensions)

Avoid it. Snowflaking normalizes dimension tables into multiple related tables, which hurts query readability and BI tool usability.

One acceptable exception: outrigger dimensions — a secondary dimension attached to a dimension table (not to the fact table directly).

Example: Product DimensionProduct Introduction Date Dimension (date, calendar month, year, fiscal year, …)

Even so, outriggers add joins and reduce readability — use them only when the size or volatility of the sub-dimension justifies it.

Centipede Fact Table

A Centipede Fact Table occurs when a fact table has an excessive number of dimension foreign keys.

Junk Dimension — combine several low-cardinality flags and indicators (that would otherwise each get their own FK column) into a single dimension table:

Without Junk Dimension With Junk Dimension
is_gift_flag (FK) transaction_type_key (FK → junk_dim)
is_rush_order_flag (FK) (all flags live as columns inside junk_dim)
payment_status_key (FK)  
return_flag (FK)  

The junk dimension contains one row per unique combination of flag values. It reduces the number of FK columns in the fact table without losing any information.

Best Practices Summary — Chapter 3

Practice Why
Always declare the grain before choosing dimensions and facts Prevents scope creep and mixed granularity
Flatten hierarchies into dimension rows Enables drill-up/down without joins; improves BI tool usability
Replace NULL dimension attributes with “Unknown” or “N/A” Avoids GROUP BY / WHERE surprises and confuses end users
Use descriptive text labels instead of Y/N flags Reports are self-explanatory without lookup tables
Store derived facts (e.g., gross profit) on disk Avoids repeated computation; ensures consistent results
Never sum non-additive facts (unit price, ratios) Use SUM(numerator) / SUM(denominator) instead
Use a Factless Fact Table for promotion coverage Enables “promoted but not sold” analysis
Store degenerate dimensions (order/transaction numbers) in the fact table They have no attributes; a separate table would be empty
Pre-populate the Date dimension 10–20 years ahead Avoids ETL failures on future dates
Separate time-of-day from the Date dimension Prevents row explosion on the Date dimension

Chapter 4 - Inventory

Purchase Order → Receive Warehouse Deliveries → Warehouse Product Inventory → Store Delivery → Store Product Inventory → Retail Sales

Inventory Periodic Snapshot

Levels of inventory measurement taken at regular intervals, each placed as a separate row in the fact table.

Example: The most atomic level of detail provided by the operational inventory system is a daily inventory snapshot per product per store.

Key Columns
Dimensions date_key, product_key, store_key
Facts quantity_on_hand

To reduce data volumes, it is common to either:

Semi-additive fact: quantity_on_hand cannot be aggregated over time, but can be summed across stores.

Caution with AVG: SQL AVG averages over all rows returned by the query — not just the number of distinct dates.

Example: 3 products × 4 stores × 7 dates → the denominator would be 84, not 7.

Correct approach: Use a window function to first aggregate per date, then average the results — this is the most robust solution.

SUM(quantity_on_hand) / COUNT(DISTINCT date_key) also works, but breaks silently on sparse data (if not every product has a snapshot for every date, the denominator will be wrong).

Additional Inventory Metrics

The periodic snapshot alone is rarely sufficient — it is usually necessary to add metrics that provide meaning and insight:

Inventory Transactions

Another way to model inventory is through inventory transactions: receive, place, release, inspect, bin, ship, return, remove, etc.

It is impractical to rely solely on the transaction model — it is too dense, though technically possible. (This was more true in Kimball’s era. Modern columnar stores like Snowflake, BigQuery, and Redshift handle transaction-grain tables well — the “too dense” argument is less compelling today.)

If measurements have different natural granularities or dimensionalities, they should belong to different fact tables.

Inventory Accumulating Snapshot

Useful when you can identify each individual product (e.g., by serial number) to track its full lifecycle.