DataEng: Data Modeling

30 July 2025, Carlos Pena

Data Engineering

My notes about "DeepLearning.AI Data Engineering" (still being updated; I'm dumping these notes here for easy access later)

C4: Data Modeling, Transformation, and Serving

🔹 Denormalized Form

🔹 Normal Forms (Relational Modeling)

🔹 1NF – First Normal Form

🔹 2NF – Second Normal Form

🔹 3NF – Third Normal Form

🔹 Star Schema (OLAP Modeling)

Steps to Build:

  1. Choose business process: e.g., Sales Transactions.
    • Questions:
      • Which products sell in which stores?
      • How do sales vary by store or brand?
  2. Declare the grain: e.g., Individual item in an order.
  3. Identify dimensions:
    • dim_store (surrogate key, store info)
    • dim_item (product details)
    • dim_date (calendar attributes: day, month, quarter, weekday)
  4. Define facts (order line):
    • item_quantity, item_price
    • Foreign Keys: store_id, item_id, date_id
    • Natural keys: (order_id, line_number)
    • Surrogate PK: e.g., MD5(order_id + line_number) (see the sketch after this list)
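
A minimal PySpark sketch of this fact table (Spark itself is covered further below); order_lines_df and its column names are illustrative assumptions:

from pyspark.sql.functions import concat_ws, md5

# Fact table at the order-line grain; the surrogate PK hashes the natural key
fact_order_line_df = order_lines_df.select(
    md5(concat_ws("|", "order_id", "line_number")).alias("order_line_key"),  # surrogate PK
    "store_id",       # FK -> dim_store
    "item_id",        # FK -> dim_item
    "date_id",        # FK -> dim_date
    "item_quantity",
    "item_price",
)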

🔹 Data Vault Modeling

🔥 Apache Spark Overview

Apache Spark is a distributed computing framework for large-scale data processing.
It generalizes MapReduce by performing operations in-memory, drastically reducing I/O overhead.
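
The snippets that follow assume a SparkSession bound to the name spark. A minimal sketch for a local session (the app name is arbitrary):

from pyspark.sql import SparkSession

# Local session for experimenting; a cluster deployment would configure a master/resource manager
spark = SparkSession.builder \
    .appName("data-modeling-notes") \
    .getOrCreate()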

Key Concepts

🧩 Typed Data and Schemas

Explicit schema definition improves reliability and performance: Spark skips schema inference, and type mismatches surface immediately instead of at query time.

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Each StructField takes a column name, a data type, and a nullable flag
schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True)
])

# Sample rows matching the schema (illustrative data)
list_of_tuples = [("Alice", 34), ("Bob", 29)]

test_df = spark.createDataFrame(list_of_tuples, schema)
test_df.show()

💾 Reading and Writing Data to Relational Databases (JDBC)
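
The calls below assume jdbc_url and jdbc_properties are already defined; a sketch of what they might look like for a MySQL source (all values are placeholders):

# Hypothetical connection settings; host, database, credentials, and driver depend on the setup
jdbc_url = "jdbc:mysql://localhost:3306/classicmodels"
jdbc_properties = {
    "user": "spark_user",
    "password": "spark_password",
    "driver": "com.mysql.cj.jdbc.Driver",  # the driver jar must be on Spark's classpath
}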

test_df.write.jdbc(
    url=jdbc_url,
    table="test_schema.test_table",
    mode="overwrite",
    properties=jdbc_properties
)

customers_df = spark.read.jdbc(
    url=jdbc_url,
    table="classicmodels.customers",
    properties=jdbc_properties
)

# used to register a DataFrame as a temporary
# SQL-queryable view within the current Spark session
customers_df.createOrReplaceTempView("customers")

Custom SQL Functions (UDFs)

from pyspark.sql.types import StringType

def titleCase(text: str):
    # Guard against NULL values, which arrive in the UDF as None
    if text is None:
        return None
    return ' '.join(word.capitalize() for word in text.split())

# Register the function so it can be called from Spark SQL
spark.udf.register("titleUDF", titleCase, StringType())

# "books" is assumed to be a table or view registered in this session
spark.sql("SELECT book_id, titleUDF(book_name) AS title FROM books").show()
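
The same function can also be applied through the DataFrame API instead of SQL; books_df and its book_name column are assumed here for illustration:

from pyspark.sql.functions import col, udf
from pyspark.sql.types import StringType

# Wrap the Python function as a column expression
title_udf = udf(titleCase, StringType())

# books_df is a hypothetical DataFrame with a book_name column
books_df.withColumn("title", title_udf(col("book_name"))).show()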

Query

dim_customers_df = spark.sql("""
    SELECT 
        CAST(customerNumber AS STRING) AS customer_number,
        ...
    FROM customers
""")

# Add columns

from pyspark.sql.functions import array

# Derive a surrogate key from the natural key column(s)
dim_customers_df = dim_customers_df.withColumn(
    "customer_key",
    surrogateUDF(array("customer_number"))
)
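
surrogateUDF is used above but not defined in these notes; a plausible sketch, assuming the intent is an MD5 hash over the concatenated key columns (matching the star-schema surrogate key earlier):

import hashlib

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def surrogate_key(values):
    # values is the Python list produced by array(...); join and hash it
    joined = "|".join("" if v is None else str(v) for v in values)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()

surrogateUDF = udf(surrogate_key, StringType())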

📆 Date Dimension Generation

from pyspark.sql.functions import (
    col, explode, sequence, year, month, dayofweek, 
    dayofmonth, dayofyear, weekofyear, date_format, lit
)
from pyspark.sql.types import DateType

start_date = "2003-01-01"
end_date = "2005-12-31"

date_range_df = spark.sql(f"""
    SELECT explode(sequence(to_date('{start_date}'), to_date('{end_date}'), interval 1 day)) AS date_day
""")

date_dim_df = date_range_df \
    .withColumn("day_of_week", dayofweek("date_day")) \
    .withColumn("day_of_month", dayofmonth("date_day")) \
    .withColumn("day_of_year", dayofyear("date_day")) \
    .withColumn("week_of_year", weekofyear("date_day")) \
    .withColumn("month_of_year", month("date_day")) \
    .withColumn("year_number", year("date_day")) \
    .withColumn("month_name", date_format("date_day", "MMMM")) \
    .withColumn("quarter_of_year", get_quarter_of_year_udf("date_day"))

date_dim_df.show()
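
get_quarter_of_year_udf is referenced above but never defined in these notes; a plausible sketch is below (PySpark also ships a built-in quarter() function that would avoid the UDF entirely):

from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

def get_quarter_of_year(d):
    # d arrives as a datetime.date from the date_day column
    return None if d is None else (d.month - 1) // 3 + 1

get_quarter_of_year_udf = udf(get_quarter_of_year, IntegerType())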

๐Ÿ” Change Data Capture (CDC) Pipeline

The CDC pipeline ensures real-time synchronization between source and target systems through event streaming.

Flow:

  1. The source MySQL database records row-level changes (insert/update/delete).
  2. Debezium captures these changes and streams them into Kafka topics (see the sketch after this list).
  3. Kafka brokers the change events.
  4. Flink consumes from Kafka and writes updates to PostgreSQL.
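
As a rough illustration of steps 1-3, a Debezium MySQL source connector is usually registered through the Kafka Connect REST API. Everything below (host names, credentials, topic prefix) is a placeholder, and the exact config keys vary by Debezium version:

import json
import requests

# Hypothetical Debezium MySQL connector registration against a local Kafka Connect instance
connector = {
    "name": "classicmodels-connector",
    "config": {
        "connector.class": "io.debezium.connector.mysql.MySqlConnector",
        "database.hostname": "mysql",
        "database.port": "3306",
        "database.user": "debezium",
        "database.password": "debezium_password",
        "database.server.id": "184054",
        "database.include.list": "classicmodels",
        "topic.prefix": "classicmodels",  # Debezium 2.x; 1.x uses database.server.name
        "schema.history.internal.kafka.bootstrap.servers": "kafka:9092",
        "schema.history.internal.kafka.topic": "schema-changes.classicmodels",
    },
}

response = requests.post(
    "http://localhost:8083/connectors",  # Kafka Connect REST endpoint (placeholder host)
    headers={"Content-Type": "application/json"},
    data=json.dumps(connector),
)
response.raise_for_status()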

Layers Overview

Layer        Technology    Purpose
Source       MySQL         Transactional data source
CDC          Debezium      Change data capture
Stream       Kafka         Event streaming backbone
Processing   Flink         Real-time transformations
Serving      PostgreSQL    Analytical serving store