-
Notifications
You must be signed in to change notification settings - Fork 548
chore(types): Type-clean embeddings/ (25 errors) #1383
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: develop
Are you sure you want to change the base?
Conversation
|
||
def _init_model(self): | ||
"""Initialize the model used for computing the embeddings.""" | ||
# Provide defaults if not specified |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we move the defaults to constructor? at line 52
if not self._current_batch_finished_event: | ||
raise Exception("self._current_batch_finished_event not initialized") | ||
|
||
assert self._current_batch_finished_event is not None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
isn't this redundant? also it must not use assert.
return EmbeddingsCacheConfig( | ||
key_generator=self._key_generator.name, | ||
store=self._cache_store.name, | ||
key_generator=self._key_generator.name if self._key_generator else "sha256", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i find defining defaults here problematic. What Pyright did not like about it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There already is a default here in EmbeddingsCacheConfig
:
key_generator: str = Field(
default="sha256",
description="The method to use for generating the cache keys.",
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are some redundancies in default values defined in different files, and also redundant checks for None
. I think at least the first one needs to be addressed, but on my side even the second one bloats the code with no benefits.
def _init_model(self): | ||
"""Initialize the model used for computing the embeddings.""" | ||
# Provide defaults if not specified | ||
model = self.embedding_model or "sentence-transformers/all-MiniLM-L6-v2" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have some defaults in llmrails.py
therefore in "normal" usage these are never None
. We should at least the same defaults (e.g. FastEmbed
):
On my side, there are some type checking errors that might happen (for example, having None
here) by just using static analysis tools, but the actual "normal" usage flow in Guardrails makes it never happen.
if self._model is None: | ||
self._init_model() | ||
|
||
if not self._model: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We are already throwing an ValueError
in _init_model
if an error when initializing the model. Does this make sense to throw another one here?
|
||
# We check if we reached the max batch size | ||
if len(self._req_queue) >= self.max_batch_size: | ||
if not self._current_batch_full_event: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again this is redundant, self._current_batch_full_event
cannot be None
here as per earlier check and assertion.
return EmbeddingsCacheConfig( | ||
key_generator=self._key_generator.name, | ||
store=self._cache_store.name, | ||
key_generator=self._key_generator.name if self._key_generator else "sha256", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There already is a default here in EmbeddingsCacheConfig
:
key_generator: str = Field(
default="sha256",
description="The method to use for generating the cache keys.",
)
Description
Cleaned the colang/ directory with help from Cursor/Claude 4 Sonnet to get the low-hanging items.
Type Error Fixes Report - Embeddings Module
Overview
This report analyzes the type error fixes made in commit
5517ce81fe11bf22ab9787f2abb5bc488c30a41d
to address 31 type errors in the embeddings module. The changes focus on improving type safety, adding proper null checks, and handling optional dependencies correctly.Risk Assessment by Category
🔴 High Risk Fixes (2 fixes)
1. Property-based Attribute Access Pattern Change
Files:
basic.py
(lines 48-99),cache.py
(lines 38, 86, 47, 51, 106, 110)Original Errors:
Declaration "embedding_size" is obscured by a declaration of the same name (reportRedeclaration)
Declaration "cache_config" is obscured by a declaration of the same name (reportRedeclaration)
Declaration "embeddings" is obscured by a declaration of the same name (reportRedeclaration)
Fixed Code:
Fix Explanation: Removed class-level attribute declarations that were causing redeclaration errors and moved to instance-level attributes with proper type annotations. Added
name: str
class attribute declarations in abstract base classes.Alternative Fixes: Could have used
ClassVar
annotations to distinguish class vs instance attributes, but the implemented approach is cleaner and follows Python best practices.Risk Justification: High risk because this is a significant architectural change that moves from class-level to instance-level attribute definitions. While functionally equivalent, it changes the object model and could affect introspection or dynamic attribute access patterns.
2. Embedding Model Initialization with Null Checks
Files:
basic.py
(lines 125-140, 152-157)Original Errors:
"encode_async" is not a known attribute of "None" (reportOptionalMemberAccess)
Cannot assign to attribute "embedding_model" for class "BasicEmbeddingsIndex*" Type "Unknown | None" is not assignable to type "str"
Fixed Code:
Fix Explanation: Added explicit validation and default values for embedding model initialization, with explicit exceptions when model creation fails.
Alternative Fixes: Could have used assertions or type guards, but explicit exceptions provide better error messages for debugging.
Risk Justification: High risk because it changes the error handling behavior by adding explicit validation and raising exceptions earlier. This could affect existing error handling patterns and introduce new failure modes.
🟡 Medium Risk Fixes (5 fixes)
3. Async Event Initialization Guards
Files:
basic.py
(lines 203-206, 218-221, 264-270)Original Errors:
"wait" is not a known attribute of "None" (reportOptionalMemberAccess)
"set" is not a known attribute of "None" (reportOptionalMemberAccess)
Type "Event | None" is not assignable to declared type "Event"
Fixed Code:
Fix Explanation: Added runtime checks that will fail fast if the batching system is used incorrectly. The assertions help with type narrowing but could cause crashes in edge cases where the events aren't properly initialized.
Alternative Fixes: Could have used conditional attribute access with optional chaining, but explicit checks provide better error reporting.
Risk Justification: Medium risk because it adds runtime checks that will fail fast if the batching system is used incorrectly, potentially causing crashes in edge cases.
4. Optional Parameter Type Annotations
Files:
basic.py
(lines 52-57),cache.py
(lines 159, 221-223)Original Errors:
Expression of type "None" cannot be assigned to parameter of type "EmbeddingsCacheConfig | Dict[str, Any]"
Expression of type "None" cannot be assigned to parameter of type "float"
Fixed Code:
Fix Explanation: Added proper Optional type annotations to parameters that can accept None values, making the API contract more explicit.
Alternative Fixes: Could have provided default values instead of allowing None, but Optional maintains backward compatibility.
Risk Justification: Medium risk because it improves type safety without changing functionality, but makes the API contract more explicit which could reveal existing usage errors.
5. Cache Store Type Safety Improvements
Files:
cache.py
(lines 256-260, 285-288, 298-300)Original Errors:
Expression of type "None" cannot be assigned to parameter of type "KeyGenerator"
Expression of type "None" cannot be assigned to parameter of type "CacheStore"
Fixed Code:
Fix Explanation: Added defensive programming patterns that gracefully handle uninitialized cache components by returning None or early return instead of crashing.
Alternative Fixes: Could have required cache components to always be initialized, but graceful degradation is more robust.
Risk Justification: Medium risk because it adds defensive programming patterns that change behavior to return None instead of potentially crashing, which might mask configuration errors.
6. Class Attribute Name Detection
Files:
cache.py
(lines 47, 51, 106, 110)Original Errors:
Cannot access attribute "name" for class "type[KeyGenerator]*"
Cannot access attribute "name" for class "type[CacheStore]*"
Fixed Code:
Fix Explanation: Added
hasattr()
checks before accessing thename
attribute to prevent AttributeError when subclasses don't define the expected attribute.Alternative Fixes: Could have enforced that all subclasses define
name
through abstract properties, but defensive checking is more flexible.Risk Justification: Medium risk because it changes the behavior when subclasses don't have
name
attributes, potentially masking implementation errors.7. Store Config Type Validation
Files:
cache.py
(lines 232-235)Original Errors:
Argument expression after ** must be a mapping with a "str" key type
Fixed Code:
Fix Explanation: Added runtime type validation to ensure dictionary unpacking is safe by checking if the value is actually a dict before unpacking.
Alternative Fixes: Could have used try/catch around the unpacking, but explicit type checking is clearer.
Risk Justification: Medium risk because it changes behavior when invalid configuration data is provided, defaulting to empty dict instead of potentially crashing.
🟢 Low Risk Fixes (5 fixes)
8. Import Error Suppression with Type Comments
Files: All provider files (
fastembed.py
,openai.py
,sentence_transformers.py
,basic.py
)Original Errors:
Import "annoy" could not be resolved (reportMissingImports)
Import "redis" could not be resolved (reportMissingImports)
Import "langchain_nvidia_ai_endpoints" could not be resolved (reportMissingImports)
Import "torch" could not be resolved (reportMissingImports)
"TextEmbedding" is unknown import symbol (reportAttributeAccessIssue)
"AsyncOpenAI" is unknown import symbol (reportAttributeAccessIssue)
"SentenceTransformer" is unknown import symbol (reportAttributeAccessIssue)
Fixed Code:
Fix Explanation: Added
# type: ignore
comments to suppress type checker warnings for optional dependencies that may not be installed in all environments.Alternative Fixes: Could have used conditional imports with try/catch, but type comments are simpler for this use case.
Risk Justification: Low risk because these are pure type checking suppressions with no runtime impact. These are legitimate optional dependencies that should be ignored by type checkers when not available.
9. Redis Optional Import Pattern
Files:
cache.py
(lines 25-28, 202-205)Original Errors:
Import "redis" could not be resolved (reportMissingImports)
Fixed Code:
Fix Explanation: Implemented proper optional dependency pattern that gracefully handles missing packages with informative error messages when the functionality is actually needed.
Alternative Fixes: Could have made redis a required dependency, but optional dependencies provide better user experience.
Risk Justification: Low risk because this is a standard Python practice for optional dependencies with no functional changes to existing behavior.
10. Enhanced get_config Method Null Safety
Files:
cache.py
(lines 247-248)Original Errors:
Cannot access attribute "name" for class "KeyGenerator"
Fixed Code:
Fix Explanation: Added null checks before accessing
.name
attributes and provided sensible defaults for the configuration.Alternative Fixes: Could have required these components to always be initialized, but providing defaults makes the API more robust.
Risk Justification: Low risk because it adds safety checks with reasonable defaults, improving robustness without changing normal operation.
11. Singledispatchmethod Type Registration
Files:
cache.py
(lines 256, 267, 285, 293)Original Errors: Implicit type registration for singledispatchmethod
Fixed Code:
Fix Explanation: Made type registration explicit in singledispatchmethod decorators to improve type checker understanding and code clarity.
Alternative Fixes: Could have relied on implicit type inference, but explicit registration is clearer and more maintainable.
Risk Justification: Low risk because this only makes existing type dispatch behavior explicit without changing functionality.
12. OpenAI Version Check Type Annotation
Files:
openai.py
(line 56)Original Error:
"__version__" is not a known attribute of module "openai"
Fixed Code:
Fix Explanation: Added type ignore comment for version check that type checker can't verify but is valid at runtime.
Alternative Fixes: Could have used try/catch around the version check, but type ignore is simpler for this specific case.
Risk Justification: Low risk because it's just suppressing a type checker warning for a legitimate runtime check.
Summary
The majority of fixes are low-risk type safety improvements that don't change runtime behavior. The high-risk fixes involve architectural improvements (property-based attributes) and enhanced error handling that could affect existing usage patterns. Medium-risk fixes add defensive programming and explicit type contracts that improve robustness but could potentially change error handling behavior.
All fixes follow defensive programming practices and prefer explicit error handling over silent failures, which improves debugging and system reliability. The implemented solutions generally maintain backward compatibility while significantly improving type safety for better developer experience and reduced runtime errors.
Test Plan
Type-checking
Unit-tests
Local CLI check
Related Issue(s)
Top-level PR to merge into before develop-branch merge: #1367
Checklist