fakhrimhd

The Situation

A travel booking platform stores operational data across multiple tables: flights, hotels, customers, airlines, airports. The database works fine for running the business — processing bookings, managing inventory.

But when the analytics team asks simple questions:

  • “How many bookings did we get last week?”
  • “What’s the average ticket price trend this quarter?”
  • “Which routes are most profitable?”

The answers require complex queries joining 5+ tables, with no clear grain for aggregation. Every analyst writes their own version, getting slightly different numbers.

The ask: Build a data warehouse that makes these questions easy to answer.


Understanding the Source

The source database (pactravel) contains:

| Table | Description |
|---|---|
| customers | Customer profiles |
| airlines | Airline master data |
| aircrafts | Aircraft types and capacity |
| airports | Airport codes and locations |
| hotels | Hotel properties |
| flight_bookings | Flight reservation transactions |
| hotel_bookings | Hotel reservation transactions |

The core problem: OLTP schema ≠ OLAP schema.

OLTP (Online Transaction Processing) systems optimize for writes — fast inserts, normalized to avoid redundancy. Think booking systems, POS terminals, inventory updates.

OLAP (Online Analytical Processing) systems optimize for reads — fast aggregations, denormalized for fewer joins. Think data warehouses, BI dashboards, reporting.

This project transforms OLTP source data into an OLAP-friendly dimensional model.


Designing the Warehouse

Identifying Business Processes

Before touching any code, I mapped out what the business actually measures:

  1. Flight bookings — revenue, volume, routes, customer segments
  2. Hotel bookings — revenue, occupancy patterns, property performance

Each becomes a fact table — the central tables that store measurable events.

Defining the Grain

The grain answers: “What does one row represent?”

| Fact Table | Grain |
|---|---|
| fct_flight_bookings | One flight booking (one passenger, one flight) |
| fct_hotel_bookings | One hotel stay (one reservation, one property) |

Getting the grain right is critical. Too coarse (daily aggregates) and you lose detail. Too fine (every click) and queries become slow.

Building Dimensions

Dimensions are the “by” in “revenue by airline” or “bookings by month”:

Star Schema ERD

Why separate dim_date and dim_time?

Pre-calculating date attributes (day of week, quarter, fiscal year) once means every query can filter by WHERE weekend_flag = true without computing it on the fly. Same for time-of-day analysis.
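
As a sketch, those attributes are computed once when the dimension is built, one row per calendar day. A minimal Python example (`weekend_flag` follows the post; the other column names are illustrative assumptions):

```python
from datetime import date, timedelta

def build_dim_date(start: date, end: date) -> list[dict]:
    """Generate one dim_date row per day, pre-computing the attributes
    that queries would otherwise have to derive on the fly."""
    rows = []
    current = start
    while current <= end:
        rows.append({
            "date_id": int(current.strftime("%Y%m%d")),  # surrogate key, e.g. 20240101
            "date_actual": current.isoformat(),
            "day_of_week": current.strftime("%A"),
            "quarter": (current.month - 1) // 3 + 1,
            "weekend_flag": current.weekday() >= 5,      # Saturday=5, Sunday=6
        })
        current += timedelta(days=1)
    return rows

rows = build_dim_date(date(2024, 1, 1), date(2024, 1, 7))
```

Every downstream query then filters on the stored flag instead of recomputing the weekday from a timestamp.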

SCD Strategy

For this use case, I chose SCD Type 1 (overwrite on change):

  • Customer updates their email? Overwrite the old value.
  • Airline changes their name? Overwrite.

Type 1 is simpler and sufficient when you don’t need historical tracking of dimension changes. If we needed “what was the customer’s address when they booked?”, Type 2 would be necessary.
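
In code, Type 1 is just an upsert that overwrites attributes in place. A minimal Python sketch (the dictionary stands in for the dimension table; key and column names are illustrative):

```python
def scd_type1_upsert(dim: dict, natural_key, attrs: dict) -> None:
    """SCD Type 1: insert new keys, overwrite changed attributes in place.
    No history is kept -- the old value is simply lost."""
    row = dim.setdefault(natural_key, {})
    row.update(attrs)

# A customer updates their email: the old value is overwritten.
dim_customers = {}
scd_type1_upsert(dim_customers, 101, {"email": "old@example.com", "city": "Jakarta"})
scd_type1_upsert(dim_customers, 101, {"email": "new@example.com"})
```

A Type 2 version would instead close out the old row (e.g. with a valid_to date) and insert a new one, which is what dbt snapshots automate.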


Building the Pipeline

The ELT Pattern

I chose ELT (Extract-Load-Transform) over ETL because:

  1. Leverage the warehouse — PostgreSQL is powerful; let it do the heavy lifting
  2. Transformations in SQL — Easier to version, test, and debug than Python transformations
  3. dbt fits perfectly — Purpose-built for the “T” in ELT

Source DB ──▶ Extract ──▶ Load ──▶ Staging ──▶ dbt ──▶ Marts
                 │           │                  │
              Python       Python          SQL models
           (extract.py)  (load.py)    (staging → marts)
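
A minimal sketch of the E and L steps, using sqlite3 as a stand-in for PostgreSQL (the real extract.py and load.py would use a Postgres driver; table and function names here are illustrative):

```python
import sqlite3

def extract(conn, table):
    """Pull all rows from a source table (extract.py's job, simplified)."""
    cur = conn.execute(f"SELECT * FROM {table}")
    cols = [d[0] for d in cur.description]
    return cols, cur.fetchall()

def load(conn, table, cols, rows):
    """Bulk-insert raw rows into a landing table (load.py's job, simplified).
    No transformation happens here -- that is left to dbt."""
    conn.execute(f"CREATE TABLE IF NOT EXISTS {table} ({','.join(cols)})")
    placeholders = ",".join("?" * len(cols))
    conn.executemany(f"INSERT INTO {table} VALUES ({placeholders})", rows)

# Stand-in source and warehouse databases
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE flight_bookings (booking_id INTEGER, price REAL)")
source.executemany("INSERT INTO flight_bookings VALUES (?, ?)", [(1, 120.0), (2, 340.5)])

warehouse = sqlite3.connect(":memory:")
cols, rows = extract(source, "flight_bookings")
load(warehouse, "raw_flight_bookings", cols, rows)
```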

Why dbt?

dbt handles the transformation layer with features raw SQL doesn’t have:

  • ref() function — Automatic dependency resolution
  • Incremental models — Process only new data
  • Testing — Built-in data quality checks
  • Documentation — Lineage graphs and column descriptions
dbt Lineage Graph
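
As an illustration of the first point, ref() is what lets dbt infer lineage and build order (model and column names here are assumed, not taken from the project):

```sql
-- models/marts/dim_airlines.sql -- illustrative, not the project's actual model
SELECT
    airline_id   AS sk_airline_id,
    airline_name
FROM {{ ref('stg_airlines') }}  -- dbt resolves this to the staging model
                                -- and schedules stg_airlines to build first
```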

Orchestrating with Airflow

The pipeline has dependencies: you can’t transform before loading, can’t load before extracting. Something needs to manage this.

Airflow provides:

  • DAG visualization — See the entire pipeline at a glance
  • Retry logic — Automatic retries on transient failures
  • Scheduling — Run daily, hourly, or on triggers
  • Monitoring — Alerts when tasks fail

The DAG:

Airflow DAG
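
A DAG along these lines might look like the following (a sketch against the Airflow 2 API; the dag_id, task ids, and commands are assumptions, not the project's actual file):

```python
# dags/pactravel_elt.py -- illustrative sketch
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="pactravel_elt",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",   # run once per day
    catchup=False,
) as dag:
    extract = BashOperator(task_id="extract", bash_command="python extract.py", retries=2)
    load = BashOperator(task_id="load", bash_command="python load.py", retries=2)
    dbt_run = BashOperator(task_id="dbt_run", bash_command="dbt run")
    dbt_test = BashOperator(task_id="dbt_test", bash_command="dbt test")

    # Encode the dependencies: no transform before load, no load before extract.
    extract >> load >> dbt_run >> dbt_test
```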

Handling Incremental Loads

For fact tables that grow daily, rebuilding everything is wasteful. Instead:

{{
  config(
    materialized='incremental',
    unique_key='sk_flight_booking_id',
    incremental_strategy='merge'
  )
}}

SELECT ...
FROM {{ ref('stg_flight_bookings') }}

{% if is_incremental() %}
WHERE departure_date >= (
  SELECT MAX(departure_date) - INTERVAL '3 days'
  FROM {{ this }}
)
{% endif %}

Why 3-day lookback?

Late-arriving data is common — a booking made Friday might not appear in the source until Monday. Looking back 3 days catches these without reprocessing the entire history.
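
The same cutoff logic, expressed outside Jinja (a Python sketch with made-up dates; the 3-day window matches the model config):

```python
from datetime import date, timedelta

def incremental_cutoff(max_loaded: date, lookback_days: int = 3) -> date:
    """Reprocess everything from (latest loaded departure_date - lookback)."""
    return max_loaded - timedelta(days=lookback_days)

already_loaded = [date(2024, 3, 10), date(2024, 3, 11),
                  date(2024, 3, 12), date(2024, 3, 13)]
cutoff = incremental_cutoff(max(already_loaded))

# A booking that departed on the 11th but only reached the source today
# still falls inside the window, so the merge picks it up.
late_arrival = date(2024, 3, 11)
picked_up = late_arrival >= cutoff
```

Because the strategy is merge on the unique key, rows inside the window are updated rather than duplicated.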


The Result

What the Warehouse Enables

Before: Get last week’s booking count

SELECT COUNT(*)
FROM flight_bookings fb
JOIN customers c ON fb.customer_id = c.customer_id
JOIN airlines a ON fb.airline_id = a.airline_id
WHERE fb.booking_date >= CURRENT_DATE - INTERVAL '7 days';
-- Wait, is booking_date the right field? What timezone?

After:

SELECT COUNT(*)
FROM fct_flight_bookings f
JOIN dim_date d ON f.date_id = d.date_id
WHERE d.date_actual >= CURRENT_DATE - INTERVAL '7 days';

Clear grain. Consistent date handling. No ambiguity.

Revenue by Airline, by Month

SELECT
  d.month_name,
  a.airline_name,
  SUM(f.price) AS revenue,
  COUNT(*) AS bookings
FROM fct_flight_bookings f
JOIN dim_date d ON f.date_id = d.date_id
JOIN dim_airlines a ON f.sk_airline_id = a.sk_airline_id
GROUP BY d.month_name, a.airline_name
ORDER BY revenue DESC;

Two joins. Any analyst can write this.

Pipeline Stats

| Component | Count |
|---|---|
| Staging models | 7 |
| Mart models | 7 (5 dims + 2 facts) |
| dbt tests | 70+ |
| CI pipeline | GitHub Actions |

What I’d Improve

  1. Add snapshot tables — Track dimension history with dbt snapshots for Type 2 SCD where needed

  2. Partition fact tables — By month, for faster queries on recent data

  3. Add data freshness checks — Alert if source data stops flowing

  4. Build a semantic layer — Metrics definitions in dbt or a tool like Metabase

