fakhrimhd

The Situation

A travel booking platform stores operational data across multiple tables: flights, hotels, customers, airlines, airports. The database works fine for running the business — processing bookings, managing inventory.

But when the analytics team asks simple questions:

  • “How many bookings did we get last week?”
  • “What’s the average ticket price trend this quarter?”
  • “Which routes are most profitable?”

The answers require complex queries joining 5+ tables, with no clear grain for aggregation. Every analyst writes their own version, getting slightly different numbers.

The ask: Build a data warehouse that makes these questions easy to answer.


Understanding the Source

The source database (pactravel) contains:

| Table | Description |
|---|---|
| customers | Customer profiles |
| airlines | Airline master data |
| aircrafts | Aircraft types and capacity |
| airports | Airport codes and locations |
| hotels | Hotel properties |
| flight_bookings | Flight reservation transactions |
| hotel_bookings | Hotel reservation transactions |

The core problem: OLTP schema ≠ OLAP schema.

OLTP (Online Transaction Processing) systems optimize for writes — fast inserts, normalized to avoid redundancy. Think booking systems, POS terminals, inventory updates.

OLAP (Online Analytical Processing) systems optimize for reads — fast aggregations, denormalized for fewer joins. Think data warehouses, BI dashboards, reporting.

This project transforms OLTP source data into an OLAP-friendly dimensional model.


Designing the Warehouse

Identifying Business Processes

Before touching any code, I mapped out what the business actually measures:

  1. Flight bookings — revenue, volume, routes, customer segments
  2. Hotel bookings — revenue, occupancy patterns, property performance

Each becomes a fact table — the central tables that store measurable events.

Defining the Grain

The grain answers: “What does one row represent?”

| Fact Table | Grain |
|---|---|
| fct_flight_bookings | One flight booking (one passenger, one flight) |
| fct_hotel_bookings | One hotel stay (one reservation, one property) |

Getting the grain right is critical. Too coarse (daily aggregates) and you lose detail. Too fine (every click) and queries become slow.

Building Dimensions

Dimensions are the “by” in “revenue by airline” or “bookings by month”:

Star Schema ERD

Why separate dim_date and dim_time?

Pre-calculating date attributes (day of week, quarter, fiscal year) once means every query can filter by WHERE weekend_flag = true without computing it on the fly. Same for time-of-day analysis.
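
As a sketch, those attributes are computed once when the dimension is built, one row per calendar day. A minimal Python example (`weekend_flag` follows the post; the other column names are illustrative assumptions):

```python
from datetime import date, timedelta

def build_dim_date(start: date, end: date) -> list[dict]:
    """Generate one dim_date row per day, pre-computing the attributes
    that queries would otherwise have to derive on the fly."""
    rows = []
    current = start
    while current <= end:
        rows.append({
            "date_id": int(current.strftime("%Y%m%d")),  # surrogate key, e.g. 20240101
            "date_actual": current.isoformat(),
            "day_of_week": current.strftime("%A"),
            "quarter": (current.month - 1) // 3 + 1,
            "weekend_flag": current.weekday() >= 5,      # Saturday=5, Sunday=6
        })
        current += timedelta(days=1)
    return rows

rows = build_dim_date(date(2024, 1, 1), date(2024, 1, 7))
```

Every downstream query then filters on the stored flag instead of recomputing the weekday from a timestamp.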

SCD Strategy

For this use case, I chose SCD Type 1 (overwrite on change):

  • Customer updates their email? Overwrite the old value.
  • Airline changes their name? Overwrite.

Type 1 is simpler and sufficient when you don’t need historical tracking of dimension changes. If we needed “what was the customer’s address when they booked?”, Type 2 would be necessary.
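
In code, Type 1 is just an upsert that overwrites attributes in place. A minimal Python sketch (the dictionary stands in for the dimension table; key and column names are illustrative):

```python
def scd_type1_upsert(dim: dict, natural_key, attrs: dict) -> None:
    """SCD Type 1: insert new keys, overwrite changed attributes in place.
    No history is kept -- the old value is simply lost."""
    row = dim.setdefault(natural_key, {})
    row.update(attrs)

# A customer updates their email: the old value is overwritten.
dim_customers = {}
scd_type1_upsert(dim_customers, 101, {"email": "old@example.com", "city": "Jakarta"})
scd_type1_upsert(dim_customers, 101, {"email": "new@example.com"})
```

A Type 2 version would instead close out the old row (e.g. with a valid_to date) and insert a new one, which is what dbt snapshots automate.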


Building the Pipeline

The ELT Pattern

I chose ELT (Extract-Load-Transform) over ETL because:

  1. Leverage the warehouse — PostgreSQL is powerful; let it do the heavy lifting
  2. Transformations in SQL — Easier to version, test, and debug than Python transformations
  3. dbt fits perfectly — Purpose-built for the “T” in ELT

Source DB ──▶ Extract ──▶ Load ──▶ Staging ──▶ dbt ──▶ Marts
                 │           │                  │
              Python       Python          SQL models
           (extract.py)  (load.py)    (staging → marts)
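
A minimal sketch of the E and L steps, using sqlite3 as a stand-in for PostgreSQL (the real extract.py and load.py would use a Postgres driver; table and function names here are illustrative):

```python
import sqlite3

def extract(conn, table):
    """Pull all rows from a source table (extract.py's job, simplified)."""
    cur = conn.execute(f"SELECT * FROM {table}")
    cols = [d[0] for d in cur.description]
    return cols, cur.fetchall()

def load(conn, table, cols, rows):
    """Bulk-insert raw rows into a landing table (load.py's job, simplified).
    No transformation happens here -- that is left to dbt."""
    conn.execute(f"CREATE TABLE IF NOT EXISTS {table} ({','.join(cols)})")
    placeholders = ",".join("?" * len(cols))
    conn.executemany(f"INSERT INTO {table} VALUES ({placeholders})", rows)

# Stand-in source and warehouse databases
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE flight_bookings (booking_id INTEGER, price REAL)")
source.executemany("INSERT INTO flight_bookings VALUES (?, ?)", [(1, 120.0), (2, 340.5)])

warehouse = sqlite3.connect(":memory:")
cols, rows = extract(source, "flight_bookings")
load(warehouse, "raw_flight_bookings", cols, rows)
```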

Why dbt?

dbt handles the transformation layer with features raw SQL doesn’t have:

  • ref() function — Automatic dependency resolution
  • Incremental models — Process only new data
  • Testing — Built-in data quality checks
  • Documentation — Lineage graphs and column descriptions
dbt Lineage Graph
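
As an illustration of the first point, ref() is what lets dbt infer lineage and build order (model and column names here are assumed, not taken from the project):

```sql
-- models/marts/dim_airlines.sql -- illustrative, not the project's actual model
SELECT
    airline_id   AS sk_airline_id,
    airline_name
FROM {{ ref('stg_airlines') }}  -- dbt resolves this to the staging model
                                -- and schedules stg_airlines to build first
```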

Orchestrating with Airflow

The pipeline has dependencies: you can’t transform before loading, can’t load before extracting. Something needs to manage this.

Airflow provides:

  • DAG visualization — See the entire pipeline at a glance
  • Retry logic — Automatic retries on transient failures
  • Scheduling — Run daily, hourly, or on triggers
  • Monitoring — Alerts when tasks fail

The DAG:

Airflow DAG
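
A DAG along these lines might look like the following (a sketch against the Airflow 2 API; the dag_id, task ids, and commands are assumptions, not the project's actual file):

```python
# dags/pactravel_elt.py -- illustrative sketch
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="pactravel_elt",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",   # run once per day
    catchup=False,
) as dag:
    extract = BashOperator(task_id="extract", bash_command="python extract.py", retries=2)
    load = BashOperator(task_id="load", bash_command="python load.py", retries=2)
    dbt_run = BashOperator(task_id="dbt_run", bash_command="dbt run")
    dbt_test = BashOperator(task_id="dbt_test", bash_command="dbt test")

    # Encode the dependencies: no transform before load, no load before extract.
    extract >> load >> dbt_run >> dbt_test
```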

Handling Incremental Loads

For fact tables that grow daily, rebuilding everything is wasteful. Instead:

{{
  config(
    materialized='incremental',
    unique_key='sk_flight_booking_id',
    incremental_strategy='merge'
  )
}}

SELECT ...
FROM {{ ref('stg_flight_bookings') }}

{% if is_incremental() %}
WHERE departure_date >= (
  SELECT MAX(departure_date) - INTERVAL '3 days'
  FROM {{ this }}
)
{% endif %}

Why 3-day lookback?

Late-arriving data is common — a booking made Friday might not appear in the source until Monday. Looking back 3 days catches these without reprocessing the entire history.
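
The same cutoff logic, expressed outside Jinja (a Python sketch with made-up dates; the 3-day window matches the model config):

```python
from datetime import date, timedelta

def incremental_cutoff(max_loaded: date, lookback_days: int = 3) -> date:
    """Reprocess everything from (latest loaded departure_date - lookback)."""
    return max_loaded - timedelta(days=lookback_days)

already_loaded = [date(2024, 3, 10), date(2024, 3, 11),
                  date(2024, 3, 12), date(2024, 3, 13)]
cutoff = incremental_cutoff(max(already_loaded))

# A booking that departed on the 11th but only reached the source today
# still falls inside the window, so the merge picks it up.
late_arrival = date(2024, 3, 11)
picked_up = late_arrival >= cutoff
```

Because the strategy is merge on the unique key, rows inside the window are updated rather than duplicated.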


The Result

What the Warehouse Enables

Before: Get last week’s booking count

SELECT COUNT(*)
FROM flight_bookings fb
JOIN customers c ON fb.customer_id = c.customer_id
JOIN airlines a ON fb.airline_id = a.airline_id
WHERE fb.booking_date >= CURRENT_DATE - INTERVAL '7 days';
-- Wait, is booking_date the right field? What timezone?

After:

SELECT COUNT(*)
FROM fct_flight_bookings f
JOIN dim_date d ON f.date_id = d.date_id
WHERE d.date_actual >= CURRENT_DATE - INTERVAL '7 days';

Clear grain. Consistent date handling. No ambiguity.

Revenue by Airline, by Month

SELECT
  d.month_name,
  a.airline_name,
  SUM(f.price) AS revenue,
  COUNT(*) AS bookings
FROM fct_flight_bookings f
JOIN dim_date d ON f.date_id = d.date_id
JOIN dim_airlines a ON f.sk_airline_id = a.sk_airline_id
GROUP BY d.month_name, a.airline_name
ORDER BY revenue DESC;

Two joins. Any analyst can write this.

Pipeline Stats

| Component | Count |
|---|---|
| Staging models | 7 |
| Mart models | 7 (5 dims + 2 facts) |
| dbt tests | 70+ |
| CI pipeline | GitHub Actions |

What I’d Improve

  1. Add snapshot tables — Track dimension history with dbt snapshots for Type 2 SCD where needed

  2. Partition fact tables — By month, for faster queries on recent data

  3. Add data freshness checks — Alert if source data stops flowing

  4. Build a semantic layer — Metrics definitions in dbt or a tool like Metabase

