feat: Add gitignore support to document reading pipeline #338

Dustyposa · 2025-09-04T09:50:04Z

🚀 Feature Overview

Added gitignore support to the document reading pipeline, allowing automatic exclusion of git-ignored files and directories when processing project documents.

📋 Key Changes

Core Functionality

New get_git_ignore_path_set() function: Runs the git ls-files command to collect the set of git-ignored paths
Extended read_all_documents() function: Added a use_gitignore parameter (default: True)
API compatible: Existing call sites need no code changes, but gitignore filtering is now enabled by default, so results can differ from earlier versions
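
The PR description gives only the signature of get_git_ignore_path_set(), so the following is a minimal sketch of how such a function might work, not the PR's actual code. The flags passed to git ls-files (--others --ignored --exclude-standard --directory) are an assumption, and the run parameter is a hypothetical addition that lets the git call be swapped out in tests.

```python
import subprocess


def get_git_ignore_path_set(repo_path: str, run=subprocess.run) -> set:
    """Return git-ignored paths under repo_path, or an empty set on failure.

    Sketch only: the flag choice and the `run` hook are assumptions,
    not taken from the PR.
    """
    cmd = [
        "git", "ls-files",
        "--others",            # untracked files only
        "--ignored",           # restrict to ignored entries
        "--exclude-standard",  # honor .gitignore, .git/info/exclude, global excludes
        "--directory",         # report a fully ignored directory as "name/"
    ]
    try:
        result = run(cmd, cwd=repo_path, capture_output=True, text=True, check=True)
    except (subprocess.CalledProcessError, OSError):
        # Not a git repository, git not installed, or repo_path missing.
        return set()
    # One path per line; skip blank lines.
    return {line for line in result.stdout.splitlines() if line}
```

Returning an empty set on failure means a non-git directory is simply read without filtering. Paths containing unusual characters are quoted by git unless -z is used; this sketch does not handle that case.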

Test Coverage

11 test cases: Cover basic pattern handling, edge cases, and integration with read_all_documents()
Error handling: Tests non-git directories, empty gitignore files, and other edge cases

🔧 Technical Implementation

# New function
def get_git_ignore_path_set(repo_path: str) -> set:
    """Extract git-ignored paths using git ls-files command"""
    
# Extended existing function
def read_all_documents(path: str, ..., use_gitignore: bool = True):
    """Now supports gitignore filtering"""

🧪 Test Verification

# Run all tests
uv run pytest test/ -v

# Run gitignore functionality tests
uv run pytest test/test_gitignore_functionality.py -v

All tests pass, and the suite runs in 1.34s.

💥 Breaking Changes

⚠️ Default behavior change: read_all_documents() now enables gitignore filtering by default

Migration Guide:

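To keep the original behavior, pass use_gitignore=False to read_all_documents()
In most cases the new default is preferable, since git-ignored paths are typically build artifacts, temporary files, and dependency directories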
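
As an illustration of the migration step, the sketch below shows two ways a caller might opt out of the new default. The read_all_documents() defined here is only a stand-in with the same keyword so the example runs on its own; in the project, the real function lives in api/data_pipeline.py. The path "./my_project" and the name read_all_documents_unfiltered are hypothetical.

```python
from functools import partial


# Stand-in with the same keyword as the PR's read_all_documents(); defined
# here only so the example runs without the project installed.
def read_all_documents(path: str, use_gitignore: bool = True) -> dict:
    return {"path": path, "use_gitignore": use_gitignore}


# Option 1: opt out at a single call site.
docs = read_all_documents("./my_project", use_gitignore=False)

# Option 2: bind the pre-PR behavior once and reuse it everywhere.
read_all_documents_unfiltered = partial(read_all_documents, use_gitignore=False)
legacy_docs = read_all_documents_unfiltered("./my_project")
```

The second form keeps the opt-out in one place, which makes it easier to remove later.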

📁 Files Changed

api/data_pipeline.py - Core functionality implementation
test/test_gitignore_functionality.py - New test file
test/README.md - New test documentation
pyproject.toml - Updated dependencies

🔍 Testing

The test suite includes:

Basic gitignore pattern matching
Edge cases (non-git repos, empty gitignore)
Integration with read_all_documents
Error handling and parameter validation
Mock testing for git command execution
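
The test file itself is not shown here, so the following is only a sketch of what mock-based tests for the git command could look like. The function list_ignored() is a hypothetical stand-in for the code under test, and the git flags are assumed.

```python
import subprocess
from unittest.mock import MagicMock, patch


def list_ignored(repo_path: str) -> set:
    """Hypothetical stand-in for the function under test."""
    try:
        result = subprocess.run(
            ["git", "ls-files", "--others", "--ignored",
             "--exclude-standard", "--directory"],
            cwd=repo_path, capture_output=True, text=True, check=True,
        )
    except (subprocess.CalledProcessError, OSError):
        return set()
    return {line for line in result.stdout.splitlines() if line}


def test_returns_paths_from_git_output():
    # Replace subprocess.run so no real git process is started.
    with patch("subprocess.run") as mock_run:
        mock_run.return_value = MagicMock(stdout="node_modules/\n.env\n")
        assert list_ignored("/repo") == {"node_modules/", ".env"}
        mock_run.assert_called_once()


def test_non_git_directory_returns_empty_set():
    # git exits non-zero outside a repository; check=True turns that into an error.
    error = subprocess.CalledProcessError(128, "git")
    with patch("subprocess.run", side_effect=error):
        assert list_ignored("/not-a-repo") == set()
```

Mocking the subprocess call keeps these tests independent of whether git is installed on the machine running them.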

All tests are designed to be fast, reliable, and comprehensive.

gemini-code-assist

Summary of Changes

Hello @Dustyposa, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request enhances the document reading pipeline by integrating gitignore support. The primary goal is to prevent the processing of files and directories that are typically ignored by Git, such as build artifacts, temporary files, and dependency modules. This change streamlines the document ingestion process, making it more efficient and ensuring that only relevant source code and documentation are considered.

Highlights

Gitignore Integration: Introduced a new get_git_ignore_path_set() function to identify git-ignored files and directories using git ls-files.
Enhanced Document Reading: The read_all_documents() function now includes a use_gitignore parameter (defaulting to True) to automatically exclude git-ignored paths during document processing.
Comprehensive Testing: Added a new test file (test/test_gitignore_functionality.py) with 11 test cases covering various scenarios, including basic patterns, edge cases, and error handling for git commands.
Default Behavior Change: The read_all_documents() function now enables gitignore filtering by default, which might require users to explicitly set use_gitignore=False to revert to the original behavior.

¹ Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

gemini-code-assist

Code Review

This pull request introduces a valuable feature by adding gitignore support to the document reading pipeline. The implementation is solid and accompanied by a comprehensive set of tests. My review focuses on enhancing the maintainability of the core logic and improving the robustness and clarity of the tests. Specifically, I've suggested refactoring how git-ignored paths are handled for better code clarity and strengthening several test assertions. I also noted some testing anti-patterns like using print statements and unnecessary try/except blocks. Finally, adding a true integration test for read_all_documents would further improve confidence in this new feature.

gemini-code-assist · 2025-09-04T09:52:13Z

api/data_pipeline.py

+        if use_gitignore:
+            final_excluded_dirs.update(get_git_ignore_path_set(path))


The current implementation adds all paths from get_git_ignore_path_set (both files and directories) into final_excluded_dirs. While this works due to the logic in should_process_file, it's confusing and could be difficult to maintain. It would be clearer to separate the ignored files and directories and add them to their respective exclusion lists (final_excluded_files and final_excluded_dirs).

Suggested change

-        if use_gitignore:
-            final_excluded_dirs.update(get_git_ignore_path_set(path))
+        if use_gitignore:
+            ignored_paths = get_git_ignore_path_set(path)
+            final_excluded_files.update(p for p in ignored_paths if not p.endswith('/'))
+            final_excluded_dirs.update(p for p in ignored_paths if p.endswith('/'))
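
The split suggested in this review relies on directories being reported with a trailing slash, which git ls-files does when called with --directory. Whether get_git_ignore_path_set() guarantees that format is an assumption here. A minimal, standalone sketch of the partitioning, using a hypothetical helper name and made-up sample paths:

```python
def partition_ignored_paths(ignored_paths):
    """Split ignored paths into (files, dirs) by the trailing-slash convention."""
    dirs = {p for p in ignored_paths if p.endswith("/")}
    files = set(ignored_paths) - dirs
    return files, dirs


# Sample data for illustration only.
files, dirs = partition_ignored_paths({"build/", ".env", "node_modules/", "notes.tmp"})
print(sorted(files))  # ['.env', 'notes.tmp']
print(sorted(dirs))   # ['build/', 'node_modules/']
```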

gemini-code-assist · 2025-09-04T09:52:13Z