VuBacktracking/airflow-data-ingestion

Airflow Data Ingestion Pipeline

Architecture

Tasks

  • Design overall data pipeline architecture
  • Define and configure Airflow DAG for orchestration
    • Extract and load raw taxi data to Amazon S3
    • Transform raw data into structured format
    • Convert transformed data to Delta format
    • Persist transformed data to PostgreSQL
  • Configure Trino to connect to Delta Lake on S3
  • Manage infrastructure with Terraform modules
    • Provision Amazon S3 bucket
    • Provision EC2 instance
  • Set up CI for Pull Requests (e.g., GitHub Actions)
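The four DAG steps listed above can be sketched as a dependency graph. This is a minimal illustration using only the Python standard library, not the repository's actual DAG definition: the task names are hypothetical, and the strictly linear ordering is an assumption taken from the order of the sub-items in the list.

```python
from graphlib import TopologicalSorter

# Each key maps a task to the set of tasks it depends on.
# Task names are hypothetical; the linear chain is an assumption
# based on the order of the sub-items in the Tasks list.
PIPELINE = {
    "extract_load_raw_to_s3": set(),
    "transform_raw_data": {"extract_load_raw_to_s3"},
    "convert_to_delta": {"transform_raw_data"},
    "persist_to_postgres": {"convert_to_delta"},
}


def execution_order(graph):
    """Return the tasks in an order that respects every dependency."""
    return list(TopologicalSorter(graph).static_order())


if __name__ == "__main__":
    # Print the chain in a form that resembles Airflow's `>>` notation.
    print(" >> ".join(execution_order(PIPELINE)))
```

In an Airflow DAG file, the same chain would typically be written by joining task objects with the `>>` operator.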

Pipeline

Prerequisites

Set Up Infrastructure

1. Set Up Python Environment

make install

Troubleshooting

    - ./.env:/opt/airflow/.env
    ~> dotenv_path = Path(__file__).resolve().parent.parent.parent / ".env"
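One reading of the note above is that the `.env` file is mounted at `/opt/airflow/.env`, so the module that loads it has to sit three directory levels below `/opt/airflow` for `parent.parent.parent` to land on the right folder. The sketch below shows that path arithmetic. The module location `/opt/airflow/dags/utils/config.py` is a hypothetical layout, not taken from the repository, and `PurePosixPath` is used instead of `Path(...).resolve()` so the example does not touch the filesystem.

```python
from pathlib import PurePosixPath


def env_path_for(module_file: str) -> PurePosixPath:
    """Mirror `Path(__file__).resolve().parent.parent.parent / ".env"`.

    `module_file` stands in for `__file__`; no filesystem access happens here.
    """
    return PurePosixPath(module_file).parent.parent.parent / ".env"


if __name__ == "__main__":
    # Hypothetical module location inside the Airflow container.
    print(env_path_for("/opt/airflow/dags/utils/config.py"))
```

If the loading module lives at a different depth, the number of `.parent` hops has to change accordingly, or the mount target in the compose file has to move.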

s3fs module not found

pip uninstall aiobotocore
pip install --upgrade botocore boto3 s3fs
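After reinstalling, a quick check can confirm that the packages are importable in the current environment. This is a small sketch using only the standard library; the helper name `missing_modules` is made up for illustration.

```python
from importlib.util import find_spec


def missing_modules(names):
    """Return the top-level module names that cannot be found."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(["botocore", "boto3", "s3fs"])
    if missing:
        print("Still missing:", ", ".join(missing))
    else:
        print("All modules found")
```

Run it with the same interpreter that Airflow uses, since a different virtual environment may report different results.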

References

About

Airflow Data Ingestion Project for practicing and reviewing
