NYC Taxi Analytics: Medallion Data Pipeline

A complete medallion-style analytics pipeline for NYC taxi trips, built from Bronze, Silver, Gold, and EDA notebooks and delivered through a Power BI report for operational insight.

RoleData Engineer & BI Developer
TimelineApril 2026
ToolsSpark, Delta Lake, Fabric, Power BI, SQL
NYC Taxi medallion architecture diagram

Project Overview

This project transforms raw NYC taxi trip files into a curated analytics model through a classic medallion pipeline. The Bronze notebook ingests monthly parquet files from the NYC Open Data API, the Silver notebook validates and enriches the rows, the Gold notebook aggregates the data into business-friendly tables, and the EDA notebook validates the outputs and explores trip demand patterns before the Power BI report turns the pipeline into actionable insight.

Challenge

Raw taxi trip data is high-volume and noisy. The main challenge was to convert operational records into business-ready insights while keeping data quality high, preserving refresh reliability, and enabling non-technical users to explore trends such as daily demand, hourly peaks, and revenue behavior.

Bronze to Silver

Raw taxi files are downloaded, profiled, filtered to 2019, and cleaned for invalid dates, passengers, fares, and distances.

Silver to Gold

The cleansed dataset is shaped into daily, hourly, and zone-level aggregates for reporting and analysis.

EDA and Validation

Notebook checks verify row counts and trends, then compare demand patterns across dates, hours, and pickup zones.

Consumption Layer

The final Power BI report surfaces the pipeline as a business dashboard for trend exploration and decision support.

Data Flow (End-to-End)

1. Ingestion Layer

The Bronze notebook uses Python requests inside Fabric/Spark runtime to download monthly 2019 yellow taxi parquet files from the NYC Open Data source. Files are landed into the raw zone and persisted as bronze_nyc_taxi_trips.

2. Validation & Cleansing Layer

The Silver notebook applies quality rules: valid pickup/dropoff sequence, positive trip distance, and non-negative fare/total amounts. It also derives trip duration and temporal features (pickup hour, day, month) for analysis.

3. Aggregation Layer

The Gold notebook builds analytical marts for consumption: daily metrics, hourly demand, and zone-level performance. This decouples heavy transformation from reporting and improves dashboard performance.

4. Analytical Validation Layer

The EDA notebook validates business behavior in the curated tables, checks trend consistency, and confirms that metrics are decision-ready before visualization publishing.

5. BI Consumption Layer

Power BI consumes the Gold outputs to deliver KPI cards and trend visuals for total trips, total revenue, average fare, hourly demand, and date-based movement.

Medallion Workflow

Bronze Notebook

Downloads the 2019 yellow taxi parquet files, checks schema and volume, and writes the raw table.

  • bronze_nyc_taxi_trips
  • Raw landing zone
  • Source: NYC Open Data API

Silver Notebook

Filters bad records, keeps 2019 trips only, and adds trip duration plus time-based features.

  • silver_nyc_taxi_trips
  • Trip duration minutes
  • Pickup hour, day, month

Gold Notebook

Creates curated analytics tables for daily, hourly, and zone-level reporting.

  • gold_daily_trips
  • gold_hourly_demand
  • gold_zone_metrics

EDA Notebook

Explores trends from the curated tables and checks the reporting layer before dashboard delivery.

  • Daily trip trend
  • Hourly demand shape
  • Top pickup zones

Data Pipeline Architecture

NYC taxi medallion data pipeline architecture diagram

Source data moves from ingestion into Bronze, Silver, and Gold layers before reaching Power BI and exploratory analysis.

Notebook Outputs

Bronze Notebook Output

  • Raw rows loaded into bronze_nyc_taxi_trips
  • Schema and timestamp range validated
  • Invalid passenger, fare, and distance checks

Silver Notebook Output

  • Filtered to 2019 and positive trip durations
  • Derived pickup_hour, pickup_day, and pickup_month
  • Wrote cleaned silver_nyc_taxi_trips table

Gold Notebook Output

  • Daily trip, revenue, and average fare metrics
  • Hourly demand and zone performance tables
  • Ready for semantic modeling in Power BI

EDA Notebook Output

  • Trend checks for trip volume and revenue
  • Hourly demand analysis across all 24 hours
  • Pickup zone comparison for operational planning

Power BI Dashboard Preview

NYC Taxi Trips Dashboard 2019 showing KPI cards, hourly demand, average fare by hour, and date-based trend lines

Dashboard snapshot with KPI cards (82M trips, 1.55bn revenue, 19.00 avg fare), hourly demand and fare behavior, plus daily trip and revenue trends.

Technical Stack Used

Data Engineering

  • Microsoft Fabric Notebook
  • PySpark DataFrames
  • Delta tables (Bronze/Silver/Gold)

Processing & Modeling

  • SQL and Spark transformations
  • Feature engineering on timestamps
  • Curated marts for BI consumption

Analytics & Validation

  • EDA notebook checks on trends
  • Demand and revenue sanity checks
  • Zone-level demand prioritization

Visualization Layer

  • Power BI dashboard design
  • KPI cards and trend visuals
  • Time filters and interactive slicing

Need this project or need this type of service?

Get in touch if you want a similar medallion pipeline, Power BI report, or end-to-end data engineering solution.

Get in touch