-
Notifications
You must be signed in to change notification settings - Fork 1.1k
feat: Add gitignore support to document reading pipeline #338
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
feat: Add gitignore support to document reading pipeline #338
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Summary of Changes
Hello @Dustyposa, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request enhances the document reading pipeline by integrating gitignore support. The primary goal is to prevent the processing of files and directories that are typically ignored by Git, such as build artifacts, temporary files, and dependency modules. This change streamlines the document ingestion process, making it more efficient and ensuring that only relevant source code and documentation are considered.
Highlights
- Gitignore Integration: Introduced a new get_git_ignore_path_set() function to identify git-ignored files and directories using git ls-files.
- Enhanced Document Reading: The read_all_documents() function now includes a use_gitignore parameter (defaulting to True) to automatically exclude git-ignored paths during document processing.
- Comprehensive Testing: Added a new test file (test/test_gitignore_functionality.py) with 11 test cases covering various scenarios, including basic patterns, edge cases, and error handling for git commands.
- Default Behavior Change: The read_all_documents() function now enables gitignore filtering by default, which might require users to explicitly set use_gitignore=False to revert to the original behavior.
Using Gemini Code Assist
The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.
Invoking Gemini
You can request assistance from Gemini at any point in your pull request via creating an issue comment (i.e. comment on the pull request page) using either /gemini <command>
or @gemini-code-assist <command>
. Below is a summary of the supported commands.
Feature | Command | Description |
---|---|---|
Code Review | /gemini review |
Performs a code review for the current pull request in its current state. |
Pull Request Summary | /gemini summary |
Provides a summary of the current pull request in its current state. |
Comment | @gemini-code-assist | Responds in comments when explicitly tagged, both in issue comments and review comments. |
Help | /gemini help |
Displays a list of available commands. |
Customization
To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/
folder in the base of the repository. Detailed instructions can be found here.
Limitations & Feedback
Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.
You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.
Footnotes
-
Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request introduces a valuable feature by adding gitignore
support to the document reading pipeline. The implementation is solid and accompanied by a comprehensive set of tests. My review focuses on enhancing the maintainability of the core logic and improving the robustness and clarity of the tests. Specifically, I've suggested refactoring how git-ignored paths are handled for better code clarity and strengthening several test assertions. I also noted some testing anti-patterns like using print
statements and unnecessary try/except
blocks. Finally, adding a true integration test for read_all_documents
would further improve confidence in this new feature.
if use_gitignore: | ||
final_excluded_dirs.update(get_git_ignore_path_set(path)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The current implementation adds all paths from get_git_ignore_path_set
(both files and directories) into final_excluded_dirs
. While this works due to the logic in should_process_file
, it's confusing and could be difficult to maintain. It would be clearer to separate the ignored files and directories and add them to their respective exclusion lists (final_excluded_files
and final_excluded_dirs
).
if use_gitignore: | |
final_excluded_dirs.update(get_git_ignore_path_set(path)) | |
if use_gitignore: | |
ignored_paths = get_git_ignore_path_set(path) | |
final_excluded_files.update(p for p in ignored_paths if not p.endswith('/')) | |
final_excluded_dirs.update(p for p in ignored_paths if p.endswith('/')) |
# Check that ignored files are in the result | ||
ignored_files = [path for path in ignored_paths if not path.endswith('/')] | ||
ignored_dirs = [path for path in ignored_paths if path.endswith('/')] | ||
|
||
# Log files should be ignored | ||
assert any("debug.log" in path for path in ignored_files) | ||
assert any("error.log" in path for path in ignored_files) | ||
|
||
# Directories should be ignored | ||
assert any("build/" in path for path in ignored_dirs) | ||
assert any("node_modules/" in path for path in ignored_dirs) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The assertions using any("..." in path)
are too weak and could lead to false positives (e.g., it would pass if the path was some/prefix/debug.log
). Since get_git_ignore_path_set
should return a precise set of relative paths, it's better to assert for exact set equality to make the test more robust.
# Check that ignored files are in the result | |
ignored_files = [path for path in ignored_paths if not path.endswith('/')] | |
ignored_dirs = [path for path in ignored_paths if path.endswith('/')] | |
# Log files should be ignored | |
assert any("debug.log" in path for path in ignored_files) | |
assert any("error.log" in path for path in ignored_files) | |
# Directories should be ignored | |
assert any("build/" in path for path in ignored_dirs) | |
assert any("node_modules/" in path for path in ignored_dirs) | |
# Check that ignored files and directories are in the result | |
expected_paths = {"debug.log", "error.log", "build/", "node_modules/"} | |
assert ignored_paths == expected_paths |
assert any("build/" in path for path in ignored_dirs) | ||
assert any("node_modules/" in path for path in ignored_dirs) | ||
|
||
print(f"✓ Basic gitignore patterns test passed. Ignored paths: {ignored_paths}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
print()
statements should be removed from tests as they add noise to the test output and are not standard practice for automated testing. If you need to output information for debugging, consider using the logging
module instead. This comment applies to other print()
statements in this file as well (e.g., lines 114, 131, 151, etc.).
ignored_paths = get_git_ignore_path_set(self.test_dir) | ||
|
||
# Verify results - should be empty or minimal | ||
assert isinstance(ignored_paths, set) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ignored_paths = get_git_ignore_path_set(self.test_dir) | ||
|
||
# Verify results | ||
assert isinstance(ignored_paths, set) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
try: | ||
# This should not raise an error | ||
result = read_all_documents( | ||
path=self.test_dir, | ||
use_gitignore=True, | ||
is_ollama_embedder=False | ||
) | ||
# Verify that get_git_ignore_path_set was called | ||
mock_git_ignore.assert_called_once_with(self.test_dir) | ||
|
||
# Verify the result structure and content | ||
assert isinstance(result, list), "Result should be a list of Document objects" | ||
|
||
# Extract file paths from the result | ||
result_files = [doc.meta_data['file_path'] for doc in result] | ||
|
||
# Check that ignored files are NOT in the result | ||
assert "debug.log" not in result_files, "debug.log should be ignored" | ||
assert "build/output.js" not in result_files, "build/output.js should be ignored" | ||
|
||
# Check that non-ignored files ARE in the result | ||
assert "README.md" in result_files, "README.md should be included" | ||
assert "src/main.py" in result_files, "src/main.py should be included" | ||
|
||
print(f"✓ read_all_documents with use_gitignore=True test passed. Files processed: {result_files}") | ||
except Exception as e: | ||
pytest.fail(f"read_all_documents with use_gitignore=True failed: {e}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using try...except Exception
to call pytest.fail()
is an anti-pattern in pytest
. If an exception is not expected, it's better to let the test fail naturally. This makes the test code cleaner and the traceback more direct. This comment also applies to other tests in this file using the same pattern (e.g., lines 223-246, 262-271, 316-336).
# This should not raise an error
result = read_all_documents(
path=self.test_dir,
use_gitignore=True,
is_ollama_embedder=False
)
# Verify that get_git_ignore_path_set was called
mock_git_ignore.assert_called_once_with(self.test_dir)
# Verify the result structure and content
assert isinstance(result, list), "Result should be a list of Document objects"
# Extract file paths from the result
result_files = [doc.meta_data['file_path'] for doc in result]
# Check that ignored files are NOT in the result
assert "debug.log" not in result_files, "debug.log should be ignored"
assert "build/output.js" not in result_files, "build/output.js should be ignored"
# Check that non-ignored files ARE in the result
assert "README.md" in result_files, "README.md should be included"
assert "src/main.py" in result_files, "src/main.py should be included"
print(f"✓ read_all_documents with use_gitignore=True test passed. Files processed: {result_files}")
🚀 Feature Overview
Added gitignore support to the document reading pipeline, allowing automatic exclusion of git-ignored files and directories when processing project documents.
📋 Key Changes
Core Functionality
get_git_ignore_path_set()
function: Usesgit ls-files
command to extract git-ignored path setsread_all_documents()
function: Addeduse_gitignore
parameter (default:True
)Test Coverage
🔧 Technical Implementation
🧪 Test Verification
All tests pass with execution time of 1.34s, maintaining high performance.
💥 Breaking Changes
read_all_documents()
now enables gitignore filtering by defaultMigration Guide:
use_gitignore=False
📁 Files Changed
api/data_pipeline.py
- Core functionality implementationtest/test_gitignore_functionality.py
- New test filetest/README.md
- New test documentationpyproject.toml
- Updated dependencies🔍 Testing
The test suite includes:
read_all_documents
All tests are designed to be fast, reliable, and comprehensive.