This repository documents a data engineering pipeline built with Azure Data Factory (ADF), Azure Databricks, and Azure Data Lake Storage Gen2 (ADLS Gen2) using the Medallion Architecture. The pipeline automates extraction, loading, and transformation (ELT), with support for incremental data loads and advanced transformations. This repo will help freshers and newcomers understand core concepts, including slowly changing dimensions (SCD Types 1 and 2), the merge concept in a data lake, the Delta format, and much more.
- Introduction
- Tags
- Architecture Overview
- Pipeline Components
- Data Transformation
- Key Features
- Settings and Configurations
- Monitoring and Logs
- Setup and Deployment
- Future Enhancements
This pipeline automates the extraction, loading and transformation (ELT) process. It integrates Azure Data Factory and Databricks for scalable and efficient data processing. The design supports both batch and incremental data loads.
- Tools: Azure Data Factory, Azure Databricks, SQL Database, Data Lake
- Technologies: ELT, Change Data Capture (CDC), Incremental Data Processing, Big Data
- Features: Parallel Processing, Metadata-Driven Pipeline, Watermarking
- Source Systems: Data is ingested from a Git repository.
- Staging Area: Data is moved to a SQL database for preprocessing.
- Transformation: Data is cleaned and transformed using ADF and Databricks.
- Storage: Processed data is stored in a data lake and dimension/fact tables.
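To make the medallion layout concrete, here is a rough sketch of how the ADLS Gen2 layer paths might be organized. The storage account, container, and folder names are placeholders, not values taken from this repo:

```python
# Hypothetical ADLS Gen2 paths for the medallion layers; adjust the storage
# account, container, and folder names to match your own deployment.
STORAGE_ACCOUNT = "mydatalake"   # placeholder
CONTAINER = "medallion"          # placeholder

LAYER_PATHS = {
    "bronze": f"abfss://{CONTAINER}@{STORAGE_ACCOUNT}.dfs.core.windows.net/bronze",  # raw, as-ingested data
    "silver": f"abfss://{CONTAINER}@{STORAGE_ACCOUNT}.dfs.core.windows.net/silver",  # cleaned and conformed data
    "gold":   f"abfss://{CONTAINER}@{STORAGE_ACCOUNT}.dfs.core.windows.net/gold",    # dimension and fact tables
}
```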
- Copy activity - Purpose: Copies data from the Git repository to SQL staging.
- Lookup activity - Purpose: Reads metadata to determine the incremental data ranges.
- `ForEach` activity - Purpose: Iterates through datasets or tasks for parallel processing.
- Sub-Activities:
  - `src2landing`: Moves data to the landing zone.
  - `extractCDC`: Extracts incremental (CDC) data.
  - `updateWatermark`: Updates processing metadata.
  - `database2lake_bronze`: Moves structured data to the data lake's bronze layer.
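As a rough illustration of what the `extractCDC` step could look like inside a Databricks notebook, here is a minimal sketch. It assumes a watermark control table and column names (`control.watermark`, `table_name`, `last_processed_ts`, `modified_ts`) that are illustrative, not the repo's actual schema:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Read the last processed timestamp for this source table from the
# (assumed) watermark control table.
last_ts = (
    spark.table("control.watermark")
         .filter(F.col("table_name") == "sales_transactions")
         .select("last_processed_ts")
         .collect()[0][0]
)

# Pull only rows that changed since the last run (the CDC / incremental slice).
incremental_df = (
    spark.table("staging.sales_transactions")
         .filter(F.col("modified_ts") > F.lit(last_ts))
)

# Land the new/updated rows in the bronze layer in Delta format.
(incremental_df.write
    .format("delta")
    .mode("append")
    .save("abfss://medallion@mydatalake.dfs.core.windows.net/bronze/sales_transactions"))
```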
- Product_Dimension: Handles product-related data transformation to build a Type 2 slowly changing dimension (SCD2); see the sketch after this list.
- Branch_Dimension: Processes branch-specific data to handle SCD1.
- Sales_Fact: Creates fact tables for sales metrics.
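For readers new to SCD Type 2, the Product_Dimension logic usually boils down to a Delta Lake merge that closes out the current version of a changed product and appends a new version. The following is only a hedged sketch with assumed table and column names (`gold.product_dimension`, `product_id`, `product_name`, `is_current`, `effective_date`, `end_date`); the repo's notebooks may structure this differently:

```python
from delta.tables import DeltaTable
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Incoming product records from the bronze layer (assumed path and schema).
updates = spark.read.format("delta").load("/mnt/bronze/products")
current = spark.table("gold.product_dimension").filter("is_current = true")

# Rows that are brand new or whose tracked attribute changed.
changed_or_new = (
    updates.alias("s")
    .join(current.alias("t"), F.col("s.product_id") == F.col("t.product_id"), "left")
    .filter(F.col("t.product_id").isNull() |
            (F.col("s.product_name") != F.col("t.product_name")))
    .select("s.*")
)

# Step 1: close out the current version of every changed product.
dim = DeltaTable.forName(spark, "gold.product_dimension")
(dim.alias("t")
    .merge(changed_or_new.alias("s"),
           "t.product_id = s.product_id AND t.is_current = true")
    .whenMatchedUpdate(set={"is_current": "false", "end_date": "current_date()"})
    .execute())

# Step 2: append the new version with an open-ended validity window.
(changed_or_new
    .withColumn("is_current", F.lit(True))
    .withColumn("effective_date", F.current_date())
    .withColumn("end_date", F.lit(None).cast("date"))
    .write.format("delta").mode("append").saveAsTable("gold.product_dimension"))
```

A Type 1 dimension such as Branch_Dimension is simpler: the merge just overwrites the attributes in place (for example with `whenMatchedUpdateAll()`) instead of expiring the old row.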
- Incremental Data Processing: Processes only new or updated records.
- Parallel Execution: Speeds up execution using the `ForEach` activity.
- Watermarking: Tracks the last processed data for incremental loads (sketched after this list).
- Advanced Transformations: Uses Databricks for data modeling and analytics.
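The watermarking idea is simple: after a successful incremental load, record the highest modification timestamp that was just processed so the next run starts from there. In this pipeline the `updateWatermark` sub-activity handles that; the sketch below shows the equivalent idea from the Databricks side, assuming a Delta control table and column names (`control.watermark`, `last_processed_ts`, `modified_ts`) that are illustrative only:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Highest modification timestamp in the batch that was just loaded to bronze.
new_watermark = (
    spark.table("bronze.sales_transactions")
         .agg(F.max("modified_ts").alias("max_ts"))
         .collect()[0]["max_ts"]
)

# Persist it so the next run only picks up rows newer than this value.
spark.sql(f"""
    UPDATE control.watermark
    SET last_processed_ts = '{new_watermark}'
    WHERE table_name = 'sales_transactions'
""")
```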
- Source System: Define the data source (e.g., Git repo connection string).
- Watermark Configuration: `Last Processed Timestamp` tracks incremental loads.
- SQL Database: For staging and intermediate data processing.
- Data Lake: For storing processed data.
- Databricks Workspace: For advanced transformations.
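On the Databricks side, the staging SQL database referenced above is typically reached over JDBC. A minimal sketch of that connection follows; the server, database, table, user, and secret-scope names are placeholders, not this repo's actual configuration:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Placeholder connection details for the Azure SQL staging database.
jdbc_url = (
    "jdbc:sqlserver://my-sql-server.database.windows.net:1433;"
    "database=staging_db;encrypt=true"
)

staging_df = (
    spark.read.format("jdbc")
    .option("url", jdbc_url)
    .option("dbtable", "dbo.sales_transactions")
    .option("user", "etl_user")
    # dbutils is provided by the Databricks runtime; keep the password in a
    # secret scope rather than hard-coding it.
    .option("password", dbutils.secrets.get(scope="adf-pipeline", key="sql-password"))
    .load()
)
```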
- Uses AutoResolveIntegrationRuntime for data movement.
- Pipeline Status: Provides status, runtime, and duration for each activity.
- Error Logs: Captures failures for debugging.
- Monitoring Tools: Azure Monitor and Log Analytics can be integrated for advanced tracking.
- Azure Subscription with Data Factory enabled.
- Databricks Workspace configured.
- Required permissions for source and destination systems.
- Clone this repository.
- Create your own watermark and transactional tables (a sketch of the watermark table follows this list); you can skip this step, but you will then need to modify the pipeline slightly ;)
- Import the pipeline JSON into Azure Data Factory.
- Import the notebooks into Databricks.
- Configure the linked services.
- Run the pipeline with test data to verify the setup.
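If you create the watermark table yourself, it can be as small as one row per source table. Here is a hedged sketch using `pyodbc` against the staging Azure SQL database; the connection string, schema, and seed values are assumptions you should adapt:

```python
import pyodbc

# Placeholder connection string; point it at your staging SQL database.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 18 for SQL Server};"
    "SERVER=my-sql-server.database.windows.net;"
    "DATABASE=staging_db;UID=etl_user;PWD=<your-password>"
)
cursor = conn.cursor()

# One row per source table, tracking the last timestamp that was loaded.
cursor.execute("""
    CREATE TABLE dbo.watermark (
        table_name        NVARCHAR(128) NOT NULL PRIMARY KEY,
        last_processed_ts DATETIME2     NOT NULL
    )
""")
cursor.execute("""
    INSERT INTO dbo.watermark (table_name, last_processed_ts)
    VALUES ('sales_transactions', '1900-01-01')
""")
conn.commit()
conn.close()
```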
- Add real-time data ingestion using Event Hubs or Kafka.
- Enhance monitoring with detailed dashboards in Power BI or Grafana.
- Automate deployment with CI/CD pipelines (e.g., GitHub Actions or Azure DevOps).