Skip to content

Conversation

Dustyposa
Copy link
Contributor

🚀 Feature Overview

Added gitignore support to the document reading pipeline, allowing automatic exclusion of git-ignored files and directories when processing project documents.

📋 Key Changes

Core Functionality

  • New get_git_ignore_path_set() function: Uses git ls-files command to extract git-ignored path sets
  • Extended read_all_documents() function: Added use_gitignore parameter (default: True)
  • Backward compatible: Existing code requires no changes, gitignore support enabled by default

Test Coverage

  • 11 comprehensive test cases: Covers basic pattern handling, edge cases, and integration tests
  • Error handling: Tests non-git directories, empty gitignore files, and other edge cases

🔧 Technical Implementation

# New function
def get_git_ignore_path_set(repo_path: str) -> set:
    """Extract git-ignored paths using git ls-files command"""
    
# Extended existing function
def read_all_documents(path: str, ..., use_gitignore: bool = True):
    """Now supports gitignore filtering"""

🧪 Test Verification

# Run all tests
uv run pytest test/ -v

# Run gitignore functionality tests
uv run pytest test/test_gitignore_functionality.py -v

All tests pass with execution time of 1.34s, maintaining high performance.

💥 Breaking Changes

⚠️ Default behavior change: read_all_documents() now enables gitignore filtering by default

Migration Guide:

  • To maintain original behavior, set use_gitignore=False
  • In most cases, the new default behavior is more intuitive

📁 Files Changed

  • api/data_pipeline.py - Core functionality implementation
  • test/test_gitignore_functionality.py - New test file
  • test/README.md - New test documentation
  • pyproject.toml - Updated dependencies

🔍 Testing

The test suite includes:

  • Basic gitignore pattern matching
  • Edge cases (non-git repos, empty gitignore)
  • Integration with read_all_documents
  • Error handling and parameter validation
  • Mock testing for git command execution

All tests are designed to be fast, reliable, and comprehensive.

Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Summary of Changes

Hello @Dustyposa, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request enhances the document reading pipeline by integrating gitignore support. The primary goal is to prevent the processing of files and directories that are typically ignored by Git, such as build artifacts, temporary files, and dependency modules. This change streamlines the document ingestion process, making it more efficient and ensuring that only relevant source code and documentation are considered.

Highlights

  • Gitignore Integration: Introduced a new get_git_ignore_path_set() function to identify git-ignored files and directories using git ls-files.
  • Enhanced Document Reading: The read_all_documents() function now includes a use_gitignore parameter (defaulting to True) to automatically exclude git-ignored paths during document processing.
  • Comprehensive Testing: Added a new test file (test/test_gitignore_functionality.py) with 11 test cases covering various scenarios, including basic patterns, edge cases, and error handling for git commands.
  • Default Behavior Change: The read_all_documents() function now enables gitignore filtering by default, which might require users to explicitly set use_gitignore=False to revert to the original behavior.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point in your pull request via creating an issue comment (i.e. comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in issue comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a valuable feature by adding gitignore support to the document reading pipeline. The implementation is solid and accompanied by a comprehensive set of tests. My review focuses on enhancing the maintainability of the core logic and improving the robustness and clarity of the tests. Specifically, I've suggested refactoring how git-ignored paths are handled for better code clarity and strengthening several test assertions. I also noted some testing anti-patterns like using print statements and unnecessary try/except blocks. Finally, adding a true integration test for read_all_documents would further improve confidence in this new feature.

Comment on lines +211 to +212
if use_gitignore:
final_excluded_dirs.update(get_git_ignore_path_set(path))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The current implementation adds all paths from get_git_ignore_path_set (both files and directories) into final_excluded_dirs. While this works due to the logic in should_process_file, it's confusing and could be difficult to maintain. It would be clearer to separate the ignored files and directories and add them to their respective exclusion lists (final_excluded_files and final_excluded_dirs).

Suggested change
if use_gitignore:
final_excluded_dirs.update(get_git_ignore_path_set(path))
if use_gitignore:
ignored_paths = get_git_ignore_path_set(path)
final_excluded_files.update(p for p in ignored_paths if not p.endswith('/'))
final_excluded_dirs.update(p for p in ignored_paths if p.endswith('/'))

Comment on lines +85 to +95
# Check that ignored files are in the result
ignored_files = [path for path in ignored_paths if not path.endswith('/')]
ignored_dirs = [path for path in ignored_paths if path.endswith('/')]

# Log files should be ignored
assert any("debug.log" in path for path in ignored_files)
assert any("error.log" in path for path in ignored_files)

# Directories should be ignored
assert any("build/" in path for path in ignored_dirs)
assert any("node_modules/" in path for path in ignored_dirs)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The assertions using any("..." in path) are too weak and could lead to false positives (e.g., it would pass if the path was some/prefix/debug.log). Since get_git_ignore_path_set should return a precise set of relative paths, it's better to assert for exact set equality to make the test more robust.

Suggested change
# Check that ignored files are in the result
ignored_files = [path for path in ignored_paths if not path.endswith('/')]
ignored_dirs = [path for path in ignored_paths if path.endswith('/')]
# Log files should be ignored
assert any("debug.log" in path for path in ignored_files)
assert any("error.log" in path for path in ignored_files)
# Directories should be ignored
assert any("build/" in path for path in ignored_dirs)
assert any("node_modules/" in path for path in ignored_dirs)
# Check that ignored files and directories are in the result
expected_paths = {"debug.log", "error.log", "build/", "node_modules/"}
assert ignored_paths == expected_paths

assert any("build/" in path for path in ignored_dirs)
assert any("node_modules/" in path for path in ignored_dirs)

print(f"✓ Basic gitignore patterns test passed. Ignored paths: {ignored_paths}")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

print() statements should be removed from tests as they add noise to the test output and are not standard practice for automated testing. If you need to output information for debugging, consider using the logging module instead. This comment applies to other print() statements in this file as well (e.g., lines 114, 131, 151, etc.).

ignored_paths = get_git_ignore_path_set(self.test_dir)

# Verify results - should be empty or minimal
assert isinstance(ignored_paths, set)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

This assertion is too weak. With an empty .gitignore file, get_git_ignore_path_set should return an empty set. The test should verify this explicitly for correctness.

Suggested change
assert isinstance(ignored_paths, set)
assert ignored_paths == set()

ignored_paths = get_git_ignore_path_set(self.test_dir)

# Verify results
assert isinstance(ignored_paths, set)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

This assertion is too weak. When no .gitignore file exists, get_git_ignore_path_set should return an empty set. The test should verify this explicitly.

Suggested change
assert isinstance(ignored_paths, set)
assert ignored_paths == set()

Comment on lines +174 to +200
try:
# This should not raise an error
result = read_all_documents(
path=self.test_dir,
use_gitignore=True,
is_ollama_embedder=False
)
# Verify that get_git_ignore_path_set was called
mock_git_ignore.assert_called_once_with(self.test_dir)

# Verify the result structure and content
assert isinstance(result, list), "Result should be a list of Document objects"

# Extract file paths from the result
result_files = [doc.meta_data['file_path'] for doc in result]

# Check that ignored files are NOT in the result
assert "debug.log" not in result_files, "debug.log should be ignored"
assert "build/output.js" not in result_files, "build/output.js should be ignored"

# Check that non-ignored files ARE in the result
assert "README.md" in result_files, "README.md should be included"
assert "src/main.py" in result_files, "src/main.py should be included"

print(f"✓ read_all_documents with use_gitignore=True test passed. Files processed: {result_files}")
except Exception as e:
pytest.fail(f"read_all_documents with use_gitignore=True failed: {e}")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Using try...except Exception to call pytest.fail() is an anti-pattern in pytest. If an exception is not expected, it's better to let the test fail naturally. This makes the test code cleaner and the traceback more direct. This comment also applies to other tests in this file using the same pattern (e.g., lines 223-246, 262-271, 316-336).

            # This should not raise an error
            result = read_all_documents(
                path=self.test_dir,
                use_gitignore=True,
                is_ollama_embedder=False
            )
            # Verify that get_git_ignore_path_set was called
            mock_git_ignore.assert_called_once_with(self.test_dir)
            
            # Verify the result structure and content
            assert isinstance(result, list), "Result should be a list of Document objects"
            
            # Extract file paths from the result
            result_files = [doc.meta_data['file_path'] for doc in result]
            
            # Check that ignored files are NOT in the result
            assert "debug.log" not in result_files, "debug.log should be ignored"
            assert "build/output.js" not in result_files, "build/output.js should be ignored"
            
            # Check that non-ignored files ARE in the result
            assert "README.md" in result_files, "README.md should be included"
            assert "src/main.py" in result_files, "src/main.py should be included"
            
            print(f"✓ read_all_documents with use_gitignore=True test passed. Files processed: {result_files}")

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant