My notes about the Data Warehouse Toolkit (Kimball).
Atomic Star Schema → (optional) OLAP Cube (e.g.: product, market, date)
A measurement event in the physical world has a one-to-one relationship to a single row in the corresponding fact table.
Measure types:
SUM(extended_sales) / SUM(quantity))You should not store redundant text in a fact table.
In transaction fact tables, if there is no activity, the row should be absent — do not fill with zeros.
Types:
| Type | Description | Zero-fill empty rows? |
|---|---|---|
| Transaction | One row per event at a point in time | No — absent row means no activity |
| Periodic Snapshot | One row per period, regardless of activity | Yes — maintain snapshot continuity |
| Accumulating Snapshot | One row per process/pipeline instance, updated as milestones occur | Yes — row exists from start to finish |
Rules:
Best Practices:
Answers: Who, What, Where, When, How, Why
Fact vs. Dimension attribute — when it’s ambiguous:
| Attribute | Description |
|---|---|
| Fact | A measurement that takes on many values and participates in calculations; a continuous value |
| Dimension | A discrete descriptive value |
The data is only useful if it is simple to understand.
Best Practices:
| SCD Type | Strategy | Use when |
|---|---|---|
| Type 1 | Overwrite the old value | History doesn’t matter |
| Type 2 | Add a new row with a new surrogate key | Full history must be preserved |
| Type 3 | Add a new column for the new value | Only current and previous value matter |
A dimension is conformed when it means the same thing across multiple fact tables and data marts.
If two fact tables share a conformed dimension, a BI tool can query both and align the results correctly — this is the integration mechanism Kimball proposes instead of a normalized enterprise model.
ETL Best Practices:
load_date, source_system) on every table for traceabilityETL systems should be off-limits to business users — we don’t want busy ETL professionals distracted by unpredictable BI queries.
Before finalizing a dimensional model, verify:
The grain defines what a single row in the fact table represents — always atomic.
Examples:
Answers: Who, What, Where, When, Why, How
Scenario: A grocery chain with 100 stores across 5 states, each carrying a full complement of departments (Grocery, Frozen, Dairy, Meat, etc.) and approximately 60,000 SKUs per store. Data is collected at cash registers (POS system).
| Design Step | Decision |
|---|---|
| Business Process | Purchases at POS system |
| Grain | One row per individual product scanned per sales transaction |
| Dimensions | Date, Store, Product, Promotion, Cashier, Payment Method |
| Facts | Sales quantity, unit prices, discounts, sales amount, cost amount |
If a new dimension causes additional rows, it should be disqualified or the grain must be revisited.
Sales quantity, extended discount, sales, and cost dollar amounts are additive across all dimensions.
Derived facts (e.g., Extended Gross Profit = Extended Sales − Extended Cost) are generally recommended to be stored on disk.
| Name | Description | Example |
|---|---|---|
| Regular Unit Price | Original price per unit before any discount | $1,000 |
| Discount Unit Price | Discount applied per unit | $100 |
| Net Unit Price | Final price per unit after discount | $900 |
| Extended Discount Dollar Amount | Total discount for all units (discount × quantity) | $200 |
| Extended Sales Dollar Amount | Total revenue after discount (net price × quantity) | $1,800 |
| Extended Cost Dollar Amount | Total cost for all units (cost × quantity) | $1,200 |
| Extended Gross Profit Dollar Amount | Total profit (sales − cost) | $600 |
Non-Additive Facts:
SUM(gross_profit) / SUM(extended_sales) or equivalently SUM(extended_sales - extended_cost) / SUM(extended_sales)The Date dimension is typically the first and most reused dimension; usually daily-grained.
See kimballgroup.com → Tools → Utilities for a pre-built Date dimension.
Current and relative date attributes (e.g., is_current_date, is_current_month) may have an update lag of up to one day — it is better to compute these at query time than to store them.
Time-of-day is typically separated from the Date dimension to avoid row explosion.
"Tropicana OJ 64oz" and individual brand, size, flavor columns)Numeric values — Attribute or Fact?
| Signal | Put it in… |
|---|---|
| Used primarily in calculations | Fact table |
| Used primarily to filter, group, or label | Dimension table |
| Serves both purposes | Both — with distinct meanings |
Example: Standard price stored in the fact captures the valuation at purchase time; stored in the dimension it indicates the current list price.
Rule of thumb:
Drilling Down / Up:
The Store dimension usually does not have a pre-existing source table — the team often builds it by combining multiple source systems.
Required attributes: city, country, city-state combination, state, zip code, selling square footage, total square footage, first open date, remodel date, store manager.
Best Practice: Create role-playing views of the Date dimension to represent different business events (e.g., order date, shipment date, delivery date). This avoids multiple joins to the same table (
JOIN dim_date d1 ... JOIN dim_date d2 ...) and keeps queries readable.
Also called the Causal Dimension — it describes the factors believed to cause a change in product sales.
Must include: temporary price reductions, aisle displays, newspaper ads, coupons, etc. These are typically combined into a single dimension row.
Key business questions it enables:
| Scenario | Recommendation |
|---|---|
| Null FK in fact table | Never allow it — add a row in the dimension (e.g., “Not Applicable”) and point the FK there |
| Null attribute in dimension | Replace with a descriptive string: “Unknown”, “Not Applicable” |
| Null value in a fact column | Usually acceptable — aggregate functions (MIN, MAX, COUNT, AVG) handle NULLs correctly |
A single transaction may involve more than one payment method (e.g., cash + gift card), which would violate the grain of the sales fact table.
Solution: Create a separate Payment fact table with a finer grain — one row per payment method per transaction — rather than forcing multiple payment methods into the sales fact row.
A degenerate dimension is a dimension key that has no associated dimension table — it carries no descriptive attributes beyond the identifier itself (e.g., a POS transaction number, order number, or invoice number).
The dimensional model is designed to accommodate change gracefully:
| Change | How to handle |
|---|---|
| Add a new dimension (e.g., Frequent Shopper) | Add a new FK column to the fact table; backfill historical rows with a “Prior to Program” surrogate key |
| Add a new dimension attribute | Add a new column; fill historical rows with “Not Available” or equivalent |
| Add a new measured fact (same grain, same process) | Add a new column to the existing fact table |
| Add a new measured fact (different grain or process) | Create a new fact table |
Use case: Which products were on promotion but did not sell?
A Factless Fact Table contains only foreign keys — no numeric measures (or a dummy constant like 1). It records that an event could have happened but carries no measurement.
Promotion Coverage schema:
promotion_coverage_fact
├── date_key (FK)
├── product_key (FK)
├── store_key (FK)
└── promotion_key (FK)
Query pattern to find promoted-but-unsold products:
promotion_coverage_fact → all products on promotionsales_fact → all products actually sold(1) EXCEPT (2)Surrogate keys are artificial keys — the DW system has no dependency on the source system’s key type or values, so changes to the source key have no impact.
Fact table surrogate key — not useful for BI queries, but has ETL benefits:
UPDATE with INSERT + DELETE (safer in bulk ETL)Avoid it. Snowflaking normalizes dimension tables into multiple related tables, which hurts query readability and BI tool usability.
One acceptable exception: outrigger dimensions — a secondary dimension attached to a dimension table (not to the fact table directly).
Example:
Product Dimension→Product Introduction Date Dimension(date, calendar month, year, fiscal year, …)Even so, outriggers add joins and reduce readability — use them only when the size or volatility of the sub-dimension justifies it.
A Centipede Fact Table occurs when a fact table has an excessive number of dimension foreign keys.
Junk Dimension — combine several low-cardinality flags and indicators (that would otherwise each get their own FK column) into a single dimension table:
| Without Junk Dimension | With Junk Dimension |
|---|---|
is_gift_flag (FK) |
transaction_type_key (FK → junk_dim) |
is_rush_order_flag (FK) |
(all flags live as columns inside junk_dim) |
payment_status_key (FK) |
|
return_flag (FK) |
The junk dimension contains one row per unique combination of flag values. It reduces the number of FK columns in the fact table without losing any information.
| Practice | Why |
|---|---|
| Always declare the grain before choosing dimensions and facts | Prevents scope creep and mixed granularity |
| Flatten hierarchies into dimension rows | Enables drill-up/down without joins; improves BI tool usability |
| Replace NULL dimension attributes with “Unknown” or “N/A” | Avoids GROUP BY / WHERE surprises and confuses end users |
| Use descriptive text labels instead of Y/N flags | Reports are self-explanatory without lookup tables |
| Store derived facts (e.g., gross profit) on disk | Avoids repeated computation; ensures consistent results |
| Never sum non-additive facts (unit price, ratios) | Use SUM(numerator) / SUM(denominator) instead |
| Use a Factless Fact Table for promotion coverage | Enables “promoted but not sold” analysis |
| Store degenerate dimensions (order/transaction numbers) in the fact table | They have no attributes; a separate table would be empty |
| Pre-populate the Date dimension 10–20 years ahead | Avoids ETL failures on future dates |
| Separate time-of-day from the Date dimension | Prevents row explosion on the Date dimension |
Purchase Order → Receive Warehouse Deliveries → Warehouse Product Inventory → Store Delivery → Store Product Inventory → Retail Sales
Levels of inventory measurement taken at regular intervals, each placed as a separate row in the fact table.
Example: The most atomic level of detail provided by the operational inventory system is a daily inventory snapshot per product per store.
| Key | Columns |
|---|---|
| Dimensions | date_key, product_key, store_key |
| Facts | quantity_on_hand |
To reduce data volumes, it is common to either:
Semi-additive fact: quantity_on_hand cannot be aggregated over time, but can be summed across stores.
Caution with
AVG:SQL AVGaverages over all rows returned by the query — not just the number of distinct dates.Example: 3 products × 4 stores × 7 dates → the denominator would be 84, not 7.
Correct approach: Use a window function to first aggregate per date, then average the results — this is the most robust solution.
SUM(quantity_on_hand) / COUNT(DISTINCT date_key)also works, but breaks silently on sparse data (if not every product has a snapshot for every date, the denominator will be wrong).
The periodic snapshot alone is rarely sufficient — it is usually necessary to add metrics that provide meaning and insight:
qty_sold / qty_on_hand (note: the standard financial formula is COGS (Cost of Goods Sold) / Average Inventory — this quantity-based ratio is a simplified proxy)Another way to model inventory is through inventory transactions: receive, place, release, inspect, bin, ship, return, remove, etc.
It is impractical to rely solely on the transaction model — it is too dense, though technically possible. (This was more true in Kimball’s era. Modern columnar stores like Snowflake, BigQuery, and Redshift handle transaction-grain tables well — the “too dense” argument is less compelling today.)
If measurements have different natural granularities or dimensionalities, they should belong to different fact tables.
Useful when you can identify each individual product (e.g., by serial number) to track its full lifecycle.