DataEng: Data Modeling

30 July 2025, Carlos Pena

Data Engineering

My notes about “DeepLearning.AI Data Engineering” (It will still be updated, I’m making a dump for easy access in the future)

C4: Data Modeling, Transformation, and Serving

🔹 Denormalized Form

Definition: Data with redundancy and often nested structures (e.g., JSON).
Use case: Faster reads, fewer joins, often used in analytics or document databases.
Trade-off: Storage waste + potential data inconsistency.

🔹 Normal Forms (Relational Modeling)

🔹 1NF – First Normal Form

Each column must be atomic (no arrays or JSON inside a column).
Table must have a Primary Key (PK).
No repeating groups of columns.
✅ Example: Instead of colors = ["red","blue"], explode into multiple rows.
🛠 Useful tools:
- pd.json_normalize → Flatten nested JSON.
- pd.explode → Convert list to multiple rows.
- pd.factorize → Encode categorical values.

🔹 2NF – Second Normal Form

Requirement: Already in 1NF.
Remove partial dependencies: no non-key column should depend only on part of a composite key.
Solution: Split into multiple tables.
✅ Example: In a sales table keyed by (order_id, product_id), the customer_name depends only on order_id. → Move customer data to a separate table.

🔹 3NF – Third Normal Form

Requirement: Already in 2NF.
Remove transitive dependencies: non-key column should not depend on another non-key column.
✅ Example: If city → state, and state → country, then city should not sit with country in the same table. Normalize into separate entities.
Goal: Reduce redundancy, improve referential integrity.

🔹 Star Schema (OLAP Modeling)

Goal: Optimize for analytical queries (BI dashboards, reporting).
Structure:
- Fact Table = Business event (append-only).
  - Grain: Prefer atomic (lowest-level detail).
  - Surrogate Key: Auto-generated, meaningless but unique.
- Dimension Tables = Descriptive context (Who, What, Where, When).
- Conformed Dimension: A dimension shared across multiple fact tables.

Steps to Build:

Choose business process: e.g., Sales Transactions.
- Questions:
  - Which products sell in which stores?
  - How do sales vary by store or brand?
Declare the grain: e.g., Individual item in an order.
Identify dimensions:
- dim_store (surrogate key, store info)
- dim_item (product details)
- dim_date (calendar attributes: day, month, quarter, weekday)
Define facts (order line):
- item_quantity, item_price
- Foreign Keys: store_id, item_id, date_id
- Natural keys: (order_id, line_number)
- Surrogate PK: e.g., MD5(order_id + line_number)

🔹 Data Vault Modeling

Approach: Agile, scalable modeling for Data Warehouses.
Layers:
1. Staging: Insert-only raw data from multiple sources.
2. Data Vault:
  - Hubs: Core business keys (customers, products, employees).
    - Columns: business key, hash key, load date, record source.
  - Links: Relationships between hubs (transactions, associations).
    - Columns: hub keys, hash key, load date, record source.
  - Satellites: Contextual attributes (descriptions, metrics).
    - Columns: descriptive fields, load date, record source.
3. Information Delivery: Denormalized presentation layer (often star schema).
Strengths:
- Flexible for change.
- Historical tracking built-in.
- Good fit for environments with fast-changing requirements and/or agile.

🔥 Apache Spark Overview

Apache Spark is a distributed computing framework for large-scale data processing.
It generalizes MapReduce by performing operations in-memory, drastically reducing I/O overhead.

Key Concepts

RDD (Resilient Distributed Dataset): Immutable distributed collection of data.
DataFrame: Higher-level abstraction with schema, built on RDDs.
Spark SQL: Allows querying DataFrames using SQL syntax.
Lazy Evaluation: Transformations execute only when an action (show(), collect(), write()) is triggered.

🧩 Typed Data and Schemas

Explicit schema definition improves:

Performance (avoids runtime type inference)
Data consistency
Integration with SQL/BI tools

# TODO: improve this example
from pyspark.sql.types import *

schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True)
])

test_df = spark.createDataFrame(list_of_tuples, schema)
test_df.show()

💾 Read/Writing Data to Relational Databases (JDBC)

test_df.write.jdbc(
    url=jdbc_url,
    table="test_schema.test_table",
    mode="overwrite",
    properties=jdbc_properties
)

customers_df = spark.read.jdbc(
    url=jdbc_url,
    table="classicmodels.customers",
    properties=jdbc_properties
)

# used to register a DataFrame as a temporary
# SQL-queryable view within the current Spark session
customers_df.createOrReplaceTempView("customers")

Custom SQL Functions (UDFs)

from pyspark.sql.types import StringType

def titleCase(text: str):
    return ' '.join(word.capitalize() for word in text.split())

spark.udf.register("titleUDF", titleCase, StringType())
spark.sql("SELECT book_id, titleUDF(book_name) AS title FROM books")

Query

dim_customers_df = spark.sql("""
    SELECT 
        CAST(customerNumber AS STRING) AS customer_number,
        ...
    FROM customers
""")

# Add columns

dim_customers_df = dim_customers_df.withColumn(
    "customer_key",
    surrogateUDF(array("customer_number"))
)

# 📆 Date Dimension Generation

from pyspark.sql.functions import (
    col, explode, sequence, year, month, dayofweek, 
    dayofmonth, dayofyear, weekofyear, date_format, lit
)
from pyspark.sql.types import DateType

start_date = "2003-01-01"
end_date = "2005-12-31"

date_range_df = spark.sql(f"""
    SELECT explode(sequence(to_date('{start_date}'), to_date('{end_date}'), interval 1 day)) AS date_day
""")

date_dim_df = date_range_df \
    .withColumn("day_of_week", dayofweek("date_day")) \
    .withColumn("day_of_month", dayofmonth("date_day")) \
    .withColumn("day_of_year", dayofyear("date_day")) \
    .withColumn("week_of_year", weekofyear("date_day")) \
    .withColumn("month_of_year", month("date_day")) \
    .withColumn("year_number", year("date_day")) \
    .withColumn("month_name", date_format("date_day", "MMMM")) \
    .withColumn("quarter_of_year", get_quarter_of_year_udf("date_day"))

date_dim_df.show()

🔁 Change Data Capture (CDC) Pipeline

The CDC pipeline ensures real-time synchronization between source and target systems through event streaming.

Flow:

SQL Database captures row-level changes (insert/update/delete).
Debezium streams these changes into Kafka topics.
Kafka brokers the change events.
Flink consumes from Kafka and writes updates to PostgreSQL.

Technologies:

Source: MySQL
CDC Engine: Debezium
Event Bus: Apache Kafka
Stream Processor: Apache Flink
Target: PostgreSQL

Layers Overview

Layer	Technology	Purpose
Source	MySQL	Transactional data source
CDC	Debezium	Change data capture
Stream	Kafka	Event streaming backbone
Processing	Flink	Real-time transformations
Serving	PostgreSQL	Analytical serving store