This repository documents a data engineering pipeline built with Azure Data Factory (ADF), Azure Databricks, and Azure Data Lake Storage Gen2 (ADLS Gen2) using the Medallion Architecture. The pipeline automates extraction, loading, and transformation (ELT), with support for incremental data loads and advanced transformations. This repo will help freshers and newcomers understand core concepts, including slowly changing dimensions (SCD Types 1 and 2), the merge concept in a data lake, the Delta format, and much more.
- Introduction
- Tags
- Architecture Overview
- Pipeline Components
- Data Transformation
- Key Features
- Settings and Configurations
- Monitoring and Logs
- Setup and Deployment
- Future Enhancements
This pipeline automates the extraction, loading and transformation (ELT) process. It integrates Azure Data Factory and Databricks for scalable and efficient data processing. The design supports both batch and incremental data loads.
- Tools: Azure Data Factory, Azure Databricks, SQL Database, Data Lake
- Technologies: ELT, Change Data Capture (CDC), Incremental Data Processing, Big Data
- Features: Parallel Processing, Metadata-Driven Pipeline, Watermarking
- Source Systems: Data is ingested from a Git repository.
- Staging Area: Data is moved to a SQL database for preprocessing.
- Transformation: Data is cleaned and transformed using ADF and Databricks.
- Storage: Processed data is stored in a data lake and dimension/fact tables.
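To make the medallion layout concrete, here is a rough sketch of how the ADLS Gen2 layer paths might be organized. The storage account, container, and folder names are placeholders, not values taken from this repo:

```python
# Hypothetical ADLS Gen2 paths for the medallion layers; adjust the storage
# account, container, and folder names to match your own deployment.
STORAGE_ACCOUNT = "mydatalake"   # placeholder
CONTAINER = "medallion"          # placeholder

LAYER_PATHS = {
    "bronze": f"abfss://{CONTAINER}@{STORAGE_ACCOUNT}.dfs.core.windows.net/bronze",  # raw, as-ingested data
    "silver": f"abfss://{CONTAINER}@{STORAGE_ACCOUNT}.dfs.core.windows.net/silver",  # cleaned and conformed data
    "gold":   f"abfss://{CONTAINER}@{STORAGE_ACCOUNT}.dfs.core.windows.net/gold",    # dimension and fact tables
}
```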
- Copy activity - Purpose: Copies data from the Git repository to SQL staging.
- Lookup activity - Purpose: Reads metadata to determine the incremental data ranges.
- `ForEach` activity - Purpose: Iterates through datasets or tasks for parallel processing.
- Sub-Activities:
  - `src2landing`: Moves data to the landing zone.
  - `extractCDC`: Extracts incremental (CDC) data.
  - `updateWatermark`: Updates processing metadata.
  - `database2lake_bronze`: Moves structured data to the data lake's bronze layer.
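As a rough illustration of what the `extractCDC` step could look like inside a Databricks notebook, here is a minimal sketch. It assumes a watermark control table and column names (`control.watermark`, `table_name`, `last_processed_ts`, `modified_ts`) that are illustrative, not the repo's actual schema:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Read the last processed timestamp for this source table from the
# (assumed) watermark control table.
last_ts = (
    spark.table("control.watermark")
         .filter(F.col("table_name") == "sales_transactions")
         .select("last_processed_ts")
         .collect()[0][0]
)

# Pull only rows that changed since the last run (the CDC / incremental slice).
incremental_df = (
    spark.table("staging.sales_transactions")
         .filter(F.col("modified_ts") > F.lit(last_ts))
)

# Land the new/updated rows in the bronze layer in Delta format.
(incremental_df.write
    .format("delta")
    .mode("append")
    .save("abfss://medallion@mydatalake.dfs.core.windows.net/bronze/sales_transactions"))
```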
- Product_Dimension: Handles product-related data transformation to build a Type 2 slowly changing dimension (SCD2); see the sketch after this list.
- Branch_Dimension: Processes branch-specific data to handle SCD1.
- Sales_Fact: Creates fact tables for sales metrics.
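For readers new to SCD Type 2, the Product_Dimension logic usually boils down to a Delta Lake merge that closes out the current version of a changed product and appends a new version. The following is only a hedged sketch with assumed table and column names (`gold.product_dimension`, `product_id`, `product_name`, `is_current`, `effective_date`, `end_date`); the repo's notebooks may structure this differently:

```python
from delta.tables import DeltaTable
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Incoming product records from the bronze layer (assumed path and schema).
updates = spark.read.format("delta").load("/mnt/bronze/products")
current = spark.table("gold.product_dimension").filter("is_current = true")

# Rows that are brand new or whose tracked attribute changed.
changed_or_new = (
    updates.alias("s")
    .join(current.alias("t"), F.col("s.product_id") == F.col("t.product_id"), "left")
    .filter(F.col("t.product_id").isNull() |
            (F.col("s.product_name") != F.col("t.product_name")))
    .select("s.*")
)

# Step 1: close out the current version of every changed product.
dim = DeltaTable.forName(spark, "gold.product_dimension")
(dim.alias("t")
    .merge(changed_or_new.alias("s"),
           "t.product_id = s.product_id AND t.is_current = true")
    .whenMatchedUpdate(set={"is_current": "false", "end_date": "current_date()"})
    .execute())

# Step 2: append the new version with an open-ended validity window.
(changed_or_new
    .withColumn("is_current", F.lit(True))
    .withColumn("effective_date", F.current_date())
    .withColumn("end_date", F.lit(None).cast("date"))
    .write.format("delta").mode("append").saveAsTable("gold.product_dimension"))
```

A Type 1 dimension such as Branch_Dimension is simpler: the merge just overwrites the attributes in place (for example with `whenMatchedUpdateAll()`) instead of expiring the old row.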
- Incremental Data Processing: Processes only new or updated records.
- Parallel Execution: Speeds up execution using the `ForEach` activity.
- Watermarking: Tracks the last processed data for incremental loads (sketched after this list).
- Advanced Transformations: Uses Databricks for data modeling and analytics.
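The watermarking idea is simple: after a successful incremental load, record the highest modification timestamp that was just processed so the next run starts from there. In this pipeline the `updateWatermark` sub-activity handles that; the sketch below shows the equivalent idea from the Databricks side, assuming a Delta control table and column names (`control.watermark`, `last_processed_ts`, `modified_ts`) that are illustrative only:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Highest modification timestamp in the batch that was just loaded to bronze.
new_watermark = (
    spark.table("bronze.sales_transactions")
         .agg(F.max("modified_ts").alias("max_ts"))
         .collect()[0]["max_ts"]
)

# Persist it so the next run only picks up rows newer than this value.
spark.sql(f"""
    UPDATE control.watermark
    SET last_processed_ts = '{new_watermark}'
    WHERE table_name = 'sales_transactions'
""")
```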
- Source System: Define the data source (e.g., Git repo connection string).
- Watermark Configuration: `Last Processed Timestamp` tracks incremental loads.
- SQL Database: For staging and intermediate data processing.
- Data Lake: For storing processed data.
- Databricks Workspace: For advanced transformations.
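On the Databricks side, the staging SQL database referenced above is typically reached over JDBC. A minimal sketch of that connection follows; the server, database, table, user, and secret-scope names are placeholders, not this repo's actual configuration:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Placeholder connection details for the Azure SQL staging database.
jdbc_url = (
    "jdbc:sqlserver://my-sql-server.database.windows.net:1433;"
    "database=staging_db;encrypt=true"
)

staging_df = (
    spark.read.format("jdbc")
    .option("url", jdbc_url)
    .option("dbtable", "dbo.sales_transactions")
    .option("user", "etl_user")
    # dbutils is provided by the Databricks runtime; keep the password in a
    # secret scope rather than hard-coding it.
    .option("password", dbutils.secrets.get(scope="adf-pipeline", key="sql-password"))
    .load()
)
```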
- Uses AutoResolveIntegrationRuntime for data movement.
- Pipeline Status: Provides status, runtime, and duration for each activity.
- Error Logs: Captures failures for debugging.
- Monitoring Tools: Azure Monitor and Log Analytics can be integrated for advanced tracking.
- Azure Subscription with Data Factory enabled.
- Databricks Workspace configured.
- Required permissions for source and destination systems.
- Clone this repository.
- Create your own watermark and transactional tables (a sketch of the watermark table follows this list); you can skip this step, but you will then need to modify the pipeline slightly ;)
- Import the pipeline JSON into Azure Data Factory.
- Import the notebooks into Databricks.
- Configure the linked services.
- Run the pipeline with test data to verify the setup.
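If you create the watermark table yourself, it can be as small as one row per source table. Here is a hedged sketch using `pyodbc` against the staging Azure SQL database; the connection string, schema, and seed values are assumptions you should adapt:

```python
import pyodbc

# Placeholder connection string; point it at your staging SQL database.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 18 for SQL Server};"
    "SERVER=my-sql-server.database.windows.net;"
    "DATABASE=staging_db;UID=etl_user;PWD=<your-password>"
)
cursor = conn.cursor()

# One row per source table, tracking the last timestamp that was loaded.
cursor.execute("""
    CREATE TABLE dbo.watermark (
        table_name        NVARCHAR(128) NOT NULL PRIMARY KEY,
        last_processed_ts DATETIME2     NOT NULL
    )
""")
cursor.execute("""
    INSERT INTO dbo.watermark (table_name, last_processed_ts)
    VALUES ('sales_transactions', '1900-01-01')
""")
conn.commit()
conn.close()
```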
- Add real-time data ingestion using Event Hubs or Kafka.
- Enhance monitoring with detailed dashboards in Power BI or Grafana.
- Automate deployment with CI/CD pipelines (e.g., GitHub Actions or Azure DevOps).