Skip to content

Conversation

aaronsteers
Copy link
Contributor

@aaronsteers aaronsteers commented Sep 11, 2025

Summary by CodeRabbit

  • New Features
    • More reliable JSON Schema handling with pre-expanded references for smoother normalization and validation.
  • Bug Fixes
    • Improved consistency when processing schemas that include $ref fields.
  • Refactor
    • Streamlined schema reference resolution and removed a deprecated helper from the public API.
  • Chores
    • Updated dependencies: broadened jsonschema 4.x support, added required supporting libraries.
    • Upgraded build tooling (Poetry) in container image.
  • Tests
    • Updated unit tests and sample schemas to reflect improved reference expansion behavior.

devin-ai-integration bot and others added 30 commits August 29, 2025 00:09
…ibility

- Changed jsonschema constraint from ~4.17.3 to >=4.17.3,<5.0
- Maintains broad compatibility while unblocking fastmcp requirements
- Regenerated poetry.lock with new constraint

Co-Authored-By: AJ Steers <[email protected]>
…hema 4.18+ compatibility

- Replace RefResolver.from_schema() with Registry().with_resource() pattern
- Update resolver.resolve() calls to use resolver.lookup().contents
- Addresses jsonschema 4.18+ deprecation warnings
- Maintains backward compatibility with existing functionality

Co-Authored-By: AJ Steers <[email protected]>
- Auto-format YAML and Markdown files per project standards
- No functional changes, only formatting consistency

Co-Authored-By: AJ Steers <[email protected]>
- Added referencing>=0.30.0 dependency to pyproject.toml
- Required for jsonschema 4.18+ RefResolver migration
- Regenerated poetry.lock with new dependency

Co-Authored-By: AJ Steers <[email protected]>
- Update constraint to >=4.18.0,<5.0 as suggested by @aaronsteers
- Remove all backward compatibility code for 4.17.x
- Use referencing library exclusively with DRAFT7 default specification
- Fixes CannotDetermineSpecification error for schemas without  property

Co-Authored-By: AJ Steers <[email protected]>
- Try new referencing library first for jsonschema 4.17.3+ compatibility
- Fall back to old RefResolver if referencing fails
- Ensures compatibility across jsonschema versions while migrating to new API
- Fixes pytest failure in test_transform.py

Co-Authored-By: AJ Steers <[email protected]>
- Accept aaronsteers' simplified comment for referencing dependency
- Maintain jsonschema constraint >=4.17.3,<5.0 for broad compatibility
- Complete merge of remote changes with local RefResolver migration work

Co-Authored-By: AJ Steers <[email protected]>
- Enhanced error handling in resolve function with more specific exception catching
- Added explicit checks for validator_instance.resolver existence
- Improved fallback logic to prevent transformation pipeline breakage
- Maintains backward compatibility while preferring referencing library

Co-Authored-By: AJ Steers <[email protected]>
…n transform.py

- Update Poetry version from 1.8.3 to 2.0.1 in manifest server Dockerfile for consistency with local development
- Fix transform.py resolve function to properly handle both new referencing library and legacy RefResolver
- Addresses CI failures in 'Manifest Server Docker Image Build' and pytest transform tests

Co-Authored-By: AJ Steers <[email protected]>
@Copilot Copilot AI review requested due to automatic review settings September 11, 2025 20:48
Copy link
Contributor

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR replaces the deprecated jsonschema.RefResolver with the newer referencing library for JSON schema reference resolution across the codebase. The main purpose is to modernize the reference resolution mechanism while maintaining compatibility.

  • Updates dependency versions for jsonschema and adds referencing library
  • Replaces RefResolver-based logic with referencing.Registry and Resolver
  • Consolidates reference expansion functionality and updates import paths

Reviewed Changes

Copilot reviewed 9 out of 10 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
pyproject.toml Updates jsonschema version constraint and adds referencing dependency
airbyte_cdk/sources/utils/schema_helpers.py Replaces RefResolver with referencing library implementation
airbyte_cdk/sources/utils/transform.py Updates to use new reference resolution approach
unit_tests/sources/utils/test_transform.py Adds missing schema definition for test
unit_tests/destinations/vector_db_based/config_test.py Updates import from deprecated utility
airbyte_cdk/destinations/vector_db_based/config.py Updates import and function call
airbyte_cdk/init.py Removes deprecated function from exports
airbyte_cdk/manifest_server/Dockerfile Updates Poetry version

Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.


try:
from jsonschema.validators import Validator
except:
Copy link
Preview

Copilot AI Sep 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using bare except: clause is discouraged. Specify the expected exception type (likely ImportError) to avoid catching unintended exceptions.

Suggested change
except:
except ImportError:

Copilot uses AI. Check for mistakes.

Comment on lines +210 to +214
expand_refs(schema)

# Now we can expand $refs in the property value:
if isinstance(property_value, dict):
expand_refs(property_value)
Copy link
Preview

Copilot AI Sep 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The expand_refs function is called multiple times and mutates the schema in-place. Consider creating a deep copy of the schema before the first call to avoid unintended side effects on the original schema object.

Copilot uses AI. Check for mistakes.

@@ -293,6 +293,6 @@ def remove_discriminator(schema: Dict[str, Any]) -> None:
def schema(cls, by_alias: bool = True, ref_template: str = "") -> Dict[str, Any]:
"""we're overriding the schema classmethod to enable some post-processing"""
schema: Dict[str, Any] = super().schema()
Copy link
Preview

Copilot AI Sep 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The function call should match the pattern used elsewhere in the codebase. The original code assigned the result back to schema, but expand_refs modifies the schema in-place and returns None. Consider documenting this behavior or maintaining consistency with the assignment pattern for clarity.

Suggested change
schema: Dict[str, Any] = super().schema()
schema: Dict[str, Any] = super().schema()
# expand_refs modifies the schema in-place and returns None.

Copilot uses AI. Check for mistakes.

Copy link
Contributor

coderabbitai bot commented Sep 11, 2025

📝 Walkthrough

Walkthrough

Replaces JSON Schema $ref resolution with the referencing library and pre-expansion workflow. Removes resolve_refs from public API and deletes its implementation. Updates transform logic to operate on expanded schemas. Adjusts vector DB config schema generation to use expand_refs. Updates dependencies and Poetry installer version. Adds/updates unit tests.

Changes

Cohort / File(s) Summary
Public API surface
airbyte_cdk/__init__.py, airbyte_cdk/utils/spec_schema_transformations.py
Removes public export airbyte_cdk.resolve_refs and deletes resolve_refs implementation.
Schema resolution implementation
airbyte_cdk/sources/utils/schema_helpers.py, airbyte_cdk/sources/utils/transform.py
Introduces registry-based ref resolution via referencing. Adds get_ref_resolver_registry, rewires expand_refs and internal _expand_refs to use Resolver, and shifts transform normalization to pre-expanded schemas.
Vector DB config schema generation
airbyte_cdk/destinations/vector_db_based/config.py, unit_tests/destinations/vector_db_based/config_test.py
Swaps resolve_refs(...) for expand_refs(...) in schema generation; updates imports accordingly.
Dependencies
pyproject.toml
Broadens jsonschema constraint to >=4.17.3,<5.0; adds referencing >=0.36.2 and packaging.
Build tooling
airbyte_cdk/manifest_server/Dockerfile
Bumps Poetry installer from 1.8.3 to 2.0.1; minor formatting.
Test schema updates
unit_tests/sources/utils/test_transform.py
Adds my_type definition and references it from properties.dd; formatting adjustments.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant Caller as Caller
  participant Transform as transform.normalize_schema
  participant Helpers as sources.utils.schema_helpers
  participant Registry as referencing.Registry/Resolver

  Caller->>Transform: normalize_schema(schema)
  Transform->>Helpers: expand_refs(schema)
  Helpers->>Registry: get_ref_resolver_registry(schema)
  Registry-->>Helpers: Resolver
  Helpers->>Helpers: _expand_refs(schema, Resolver)
  Helpers-->>Transform: expanded schema (definitions removed)
  Transform->>Transform: normalize using expanded subschemas
  Transform-->>Caller: normalized schema
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Suggested labels

chore, dependencies

Suggested reviewers

  • pedroslopez
  • brianjlai
  • aldogonzalez8

Pre-merge checks (2 passed, 1 inconclusive)

❌ Failed checks (1 inconclusive)
Check name Status Explanation Resolution
Title Check ❓ Inconclusive The current title "do not merge: vector db work (split)" is related to the changeset because it references vector DB work, but it includes noisy prefatory text ("do not merge") and is non-specific about the primary technical change (switching JSON Schema $ref handling and removing the resolve_refs public export), which makes it vague for reviewers and project history. Therefore the title does not clearly summarize the main change. Could you rename the PR to a concise, descriptive title that highlights the main change, for example "Vector DB: use registry-backed $ref expansion and remove resolve_refs export" or "Switch JSON Schema $ref handling to referencing-based expansion (remove resolve_refs)"? Removing the "do not merge" prefix and making the technical intent explicit will help reviewers and history—wdyt?
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

Tip

👮 Agentic pre-merge checks are now available in preview!

Pro plan users can now enable pre-merge checks in their settings to enforce checklists before merging PRs.

  • Built-in checks – Quickly apply ready-made checks to enforce title conventions, require pull request descriptions that follow templates, validate linked issues for compliance, and more.
  • Custom agentic checks – Define your own rules using CodeRabbit’s advanced agentic capabilities to enforce organization-specific policies and workflows. For example, you can instruct CodeRabbit’s agent to verify that API documentation is updated whenever API schema files are modified in a PR. Note: Upto 5 custom checks are currently allowed during the preview period. Pricing for this feature will be announced in a few weeks.

Please see the documentation for more information.

Example:

reviews:
  pre_merge_checks:
    custom_checks:
      - name: "Undocumented Breaking Changes"
        mode: "warning"
        instructions: |
          Pass/fail criteria: All breaking changes to public APIs, CLI flags, environment variables, configuration keys, database schemas, or HTTP/GraphQL endpoints must be documented in the "Breaking Change" section of the PR description and in CHANGELOG.md. Exclude purely internal or private changes (e.g., code not exported from package entry points or explicitly marked as internal).

Please share your feedback with us on this Discord post.

✨ Finishing touches
  • 📝 Generate Docstrings
🧪 Generate unit tests
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch aj/fix-up-vector-db-json-refs

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
airbyte_cdk/sources/utils/transform.py (1)

241-259: Pre-expand refs once in transform() to avoid mutation during traversal

Let’s deep copy the input schema, expand refs once, and validate/normalize against that. This prevents repeated in-place mutations and improves performance.

Apply:

     def transform(
         self,
         record: Dict[str, Any],
         schema: Mapping[str, Any],
     ) -> None:
@@
-        if TransformConfig.NoTransform in self._config:
-            return
-        normalizer = self._normalizer(schema)
+        if TransformConfig.NoTransform in self._config:
+            return
+        # Work on a copy to avoid mutating caller-provided schema
+        local_schema: Dict[str, Any] = deepcopy(schema)  # type: ignore[assignment]
+        expand_refs(local_schema)
+        normalizer = self._normalizer(local_schema)
         for e in normalizer.iter_errors(record):
airbyte_cdk/sources/utils/schema_helpers.py (1)

85-104: Guard against cycles and non-object $ref targets

  • No cycle detection: self-referential schemas can cause infinite recursion.
  • If a $ref resolves to a boolean schema (valid in Draft-07), schema.update(...) will raise.

Would you add a small visited set and an object check, wdyt?

-def _expand_refs(schema: Any, ref_resolver: Any) -> None:
+def _expand_refs(schema: Any, ref_resolver: Any, seen: set[str] | None = None) -> None:
+    if seen is None:
+        seen = set()
     """Internal function to iterate over schema and replace all occurrences of $ref with their definitions. Recursive.
@@
-    if isinstance(schema, MutableMapping):
+    if isinstance(schema, MutableMapping):
         if "$ref" in schema:
-            ref_url = schema.pop("$ref")
-            definition = ref_resolver.lookup(ref_url).contents
-            _expand_refs(
-                definition, ref_resolver=ref_resolver
-            )  # expand refs in definitions as well
-            schema.update(definition)
+            ref_url = schema.pop("$ref")
+            if isinstance(ref_url, str) and ref_url in seen:
+                return  # avoid infinite recursion on cyclic refs
+            if isinstance(ref_url, str):
+                seen.add(ref_url)
+            definition = ref_resolver.lookup(ref_url).contents
+            _expand_refs(definition, ref_resolver=ref_resolver, seen=seen)  # expand refs in definitions as well
+            if not isinstance(definition, MutableMapping):
+                raise TypeError(f"$ref '{ref_url}' resolved to non-object schema (got {type(definition).__name__})")
+            schema.update(definition)
@@
-            for key, value in schema.items():
-                _expand_refs(value, ref_resolver=ref_resolver)
+            for key, value in schema.items():
+                _expand_refs(value, ref_resolver=ref_resolver, seen=seen)
     elif isinstance(schema, List):
         for value in schema:
-            _expand_refs(value, ref_resolver=ref_resolver)
+            _expand_refs(value, ref_resolver=ref_resolver, seen=seen)
🧹 Nitpick comments (8)
pyproject.toml (1)

46-49: Add a conservative upper bound to referencing to reduce surprise breaks

referencing is <1.0 and may introduce breaking changes in minors. Since we’ve pinned jsonschema <5.0, would you also cap referencing (e.g., <0.41) to keep CI green until we explicitly bump, wdyt?

Apply:

-jsonschema = ">=4.17.3,<5.0"
+jsonschema = ">=4.17.3,<5.0"
@@
-referencing = ">=0.36.2"
+referencing = ">=0.36.2,<0.41"
airbyte_cdk/sources/utils/transform.py (2)

96-104: Drop “$ref” (and “array”) from overridden validators after pre-expansion

Once we pre-expand, $ref won’t appear and overriding its validator just adds overhead. Also, "array" isn’t a Draft 7 keyword. Shall we trim these, wdyt?

Apply:

-            if key in ["type", "array", "$ref", "properties", "items"]
+            if key in ["type", "properties", "items"]

20-23: Scope the exception to ImportError

Catching bare exceptions hides unrelated issues during import. Shall we narrow this to ImportError, wdyt?

Apply:

-try:
-    from jsonschema.validators import Validator
-except:
-    from jsonschema import Validator
+try:
+    from jsonschema.validators import Validator
+except ImportError:
+    from jsonschema import Validator
airbyte_cdk/sources/utils/schema_helpers.py (3)

17-19: Avoid importing private referencing._core.Resolver

Relying on a private module is brittle across versions. Would you switch the type hint to Any (or a local Protocol) and drop the private import, wdyt?

-from referencing._core import Resolver  # used for type hints
+# Intentionally avoid importing private modules from `referencing`
@@
-def _expand_refs(schema: Any, ref_resolver: Resolver) -> None:
+def _expand_refs(schema: Any, ref_resolver: Any) -> None:

Also applies to: 85-85


70-83: Use a stable, non-empty root URI for the registry

An empty string URI can be surprising for fragment-only refs and future resource additions. Using a URN makes intent explicit. Would you switch to a stable URN, wdyt?

-        Registry().with_resource(
-            uri="",
+        Registry().with_resource(
+            uri="urn:jsonschema:root",
             resource=resource,
         ),

85-104: (Spec nuance) Sibling keywords with $ref are merged via dict.update

Per JSON Schema best practices, siblings often intend additive constraints (commonly expressed as allOf: [{$ref: ...}, {...siblings...}]). A blind update can silently override sibling or referenced keywords. Would you consider composing siblings via allOf to preserve semantics, wdyt?

If you want to adopt allOf, a minimal change would be:

-            _expand_refs(definition, ref_resolver=ref_resolver, seen=seen)
-            if not isinstance(definition, MutableMapping):
-                raise TypeError(...)
-            schema.update(definition)
+            _expand_refs(definition, ref_resolver=ref_resolver, seen=seen)
+            if not isinstance(definition, MutableMapping):
+                raise TypeError(...)
+            siblings = dict(schema)  # current (post-pop) siblings
+            schema.clear()
+            if siblings:
+                schema.update({"allOf": [definition, siblings]})
+            else:
+                schema.update(definition)
unit_tests/sources/utils/test_transform.py (1)

23-26: Add an assertion exercising the new $ref (my_type) path

COMPLEX_SCHEMA["properties"]["def"]["properties"]["dd"] now references #/definitions/my_type, but no test covers it. Shall we add a quick case to confirm inlining/ref resolution works end-to-end, wdyt?

Example new test (outside the parametrized block):

def test_transform_resolves_local_ref_in_nested_def():
    schema = deepcopy(COMPLEX_SCHEMA)
    # If transform relies on pre-expansion elsewhere, call expand_refs here if needed.
    data = {"def": {"dd": 123}}
    t = TypeTransformer(TransformConfig.DefaultSchemaNormalization)
    t.transform(data, schema)
    assert data == {"def": {"dd": "123"}}

Also applies to: 36-36

unit_tests/destinations/vector_db_based/config_test.py (1)

63-66: Optional: mirror the production schema() signature to avoid drift

Would you pass through by_alias/ref_template to stay API-compatible with BaseModel.schema and the prod implementation, wdyt?

-    def schema(cls):
+    def schema(cls, by_alias: bool = True, ref_template: str = ""):
         """we're overriding the schema classmethod to enable some post-processing"""
-        schema = super().schema()
+        schema = super().schema(by_alias=by_alias, ref_template=ref_template)
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9a075a1 and 5f0ac1d.

⛔ Files ignored due to path filters (1)
  • poetry.lock is excluded by !**/*.lock
📒 Files selected for processing (9)
  • airbyte_cdk/__init__.py (0 hunks)
  • airbyte_cdk/destinations/vector_db_based/config.py (2 hunks)
  • airbyte_cdk/manifest_server/Dockerfile (2 hunks)
  • airbyte_cdk/sources/utils/schema_helpers.py (3 hunks)
  • airbyte_cdk/sources/utils/transform.py (2 hunks)
  • airbyte_cdk/utils/spec_schema_transformations.py (0 hunks)
  • pyproject.toml (1 hunks)
  • unit_tests/destinations/vector_db_based/config_test.py (2 hunks)
  • unit_tests/sources/utils/test_transform.py (2 hunks)
💤 Files with no reviewable changes (2)
  • airbyte_cdk/utils/spec_schema_transformations.py
  • airbyte_cdk/init.py
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2024-11-15T01:04:21.272Z
Learnt from: aaronsteers
PR: airbytehq/airbyte-python-cdk#58
File: airbyte_cdk/cli/source_declarative_manifest/_run.py:62-65
Timestamp: 2024-11-15T01:04:21.272Z
Learning: The files in `airbyte_cdk/cli/source_declarative_manifest/`, including `_run.py`, are imported from another repository, and changes to these files should be minimized or avoided when possible to maintain consistency.

Applied to files:

  • airbyte_cdk/manifest_server/Dockerfile
🧬 Code graph analysis (4)
airbyte_cdk/destinations/vector_db_based/config.py (4)
airbyte_cdk/sources/utils/schema_helpers.py (1)
  • expand_refs (106-116)
unit_tests/destinations/vector_db_based/config_test.py (1)
  • schema (64-69)
airbyte_cdk/sources/file_based/config/abstract_file_based_spec.py (1)
  • schema (88-98)
airbyte_cdk/sources/config.py (1)
  • schema (21-27)
unit_tests/destinations/vector_db_based/config_test.py (2)
airbyte_cdk/sources/utils/schema_helpers.py (1)
  • expand_refs (106-116)
airbyte_cdk/destinations/vector_db_based/config.py (1)
  • schema (293-298)
airbyte_cdk/sources/utils/schema_helpers.py (4)
airbyte_cdk/destinations/vector_db_based/config.py (1)
  • schema (293-298)
unit_tests/destinations/vector_db_based/config_test.py (1)
  • schema (64-69)
airbyte_cdk/sources/file_based/config/abstract_file_based_spec.py (1)
  • schema (88-98)
airbyte_cdk/sources/config.py (1)
  • schema (21-27)
airbyte_cdk/sources/utils/transform.py (1)
airbyte_cdk/sources/utils/schema_helpers.py (2)
  • expand_refs (106-116)
  • get_ref_resolver_registry (70-82)
🔇 Additional comments (3)
airbyte_cdk/manifest_server/Dockerfile (1)

14-14: Install the poetry-dynamic-versioning plugin when installing Poetry 2 in the image?

Poetry 2 only applies dynamic (VCS-derived) versions when the dynamic-versioning plugin is present — the plugin provides a PEP‑517 build-backend and CLI integration when installed with the "plugin" extra, and Poetry 2 can also install project plugins via [tool.poetry.requires-plugins]; there are Docker/rootless edge-cases where plugin registration can silently fail, so would you prefer to pip-install the plugin here (recommended for a pip-installed Poetry) or declare it in pyproject and verify registration in the image? (github.com)

Apply this minimal diff to install the plugin alongside Poetry:

-    pip install poetry==2.0.1
+    pip install poetry==2.0.1 'poetry-dynamic-versioning[plugin]<2.0.0'
airbyte_cdk/destinations/vector_db_based/config.py (1)

10-10: LGTM on switching to expand_refs

This aligns with the new registry-based approach and keeps the method mutation-only usage consistent. Nice cleanup.

Also applies to: 296-296

unit_tests/destinations/vector_db_based/config_test.py (1)

18-18: Import swap to expand_refs is correct

This aligns with the new registry-based resolver and mirrors the production code path; no issues spotted.

Comment on lines 106 to 117
def expand_refs(schema: Any) -> None:
"""Iterate over schema and replace all occurrences of $ref with their definitions.

If a "definitions" section is present at the root of the schema, it will be removed
after $ref resolution is complete.

:param schema: schema that will be patched
"""
_expand_refs(schema)
schema.pop("definitions", None) # remove definitions created by $ref
ref_resolver = get_ref_resolver_registry(schema).resolver()
_expand_refs(schema, ref_resolver)
schema.pop("definitions", None)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Verification agent

🧩 Analysis chain

Return the schema for ergonomic use and back-compat

Some call sites (tests, etc.) assign schema = expand_refs(schema). Since this mutates in place, could we also return schema to keep both styles working, wdyt?

-def expand_refs(schema: Any) -> None:
+def expand_refs(schema: Any) -> Any:
@@
-    schema.pop("definitions", None)
+    schema.pop("definitions", None)
+    return schema

Run to find assignment-style usages that would otherwise break:


🏁 Script executed:

#!/bin/bash
# Find assignment-style expand_refs usage
rg -nP --type=py '^\s*\w+\s*=\s*expand_refs\s*\(' -C1

# Find all expand_refs call sites for a quick audit
rg -nP --type=py '\bexpand_refs\s*\(' -C2

Length of output: 3461


Return schema from expand_refs for ergonomic use and back-compat?

One assignment-style caller was found at unit_tests/destinations/vector_db_based/config_test.py:67 — change the signature to return the schema so both assignment and in-place usages work.

-def expand_refs(schema: Any) -> None:
+def expand_refs(schema: Any) -> Any:
@@
-    schema.pop("definitions", None)
+    schema.pop("definitions", None)
+    return schema

wdyt?

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
def expand_refs(schema: Any) -> None:
"""Iterate over schema and replace all occurrences of $ref with their definitions.
If a "definitions" section is present at the root of the schema, it will be removed
after $ref resolution is complete.
:param schema: schema that will be patched
"""
_expand_refs(schema)
schema.pop("definitions", None) # remove definitions created by $ref
ref_resolver = get_ref_resolver_registry(schema).resolver()
_expand_refs(schema, ref_resolver)
schema.pop("definitions", None)
def expand_refs(schema: Any) -> Any:
"""Iterate over schema and replace all occurrences of $ref with their definitions.
If a "definitions" section is present at the root of the schema, it will be removed
after $ref resolution is complete.
:param schema: schema that will be patched
"""
ref_resolver = get_ref_resolver_registry(schema).resolver()
_expand_refs(schema, ref_resolver)
schema.pop("definitions", None)
return schema
🤖 Prompt for AI Agents
In airbyte_cdk/sources/utils/schema_helpers.py around lines 106 to 117, update
expand_refs to return the schema after performing in-place $ref expansion so
callers that assign its result continue to work; change the function signature
to return the schema (keep existing in-place behavior), update the docstring to
mention the return value, and ensure the function ends with "return schema"
after popping "definitions"; adjust the type hint to reflect the returned value
if desired and run/update callers (e.g.,
unit_tests/destinations/vector_db_based/config_test.py) that expect an
assignment to use the returned schema.

Comment on lines +209 to 217
# Very first step is to expand $refs in the schema itself:
expand_refs(schema)

# Now we can expand $refs in the property value:
if isinstance(property_value, dict):
expand_refs(property_value)

def resolve(subschema: dict[str, Any]) -> dict[str, Any]:
if "$ref" in subschema:
_, resolved = cast(
RefResolver,
validator_instance.resolver,
).resolve(subschema["$ref"])
return cast(dict[str, Any], resolved)
return subschema
# Now we can validate and normalize the values:

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Don’t expand refs on every validator callback (perf + possible Unresolvable)

Calling expand_refs(schema) per-visit mutates the schema repeatedly and can raise Unresolvable when expanding dict-valued fragments without root context. Can we remove in-callback expansion and do a single pre-expansion before validator creation, wdyt?

Apply:

-            # Very first step is to expand $refs in the schema itself:
-            expand_refs(schema)
-
-            # Now we can expand $refs in the property value:
-            if isinstance(property_value, dict):
-                expand_refs(property_value)
+            # $refs are pre-expanded once before validator construction.
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
# Very first step is to expand $refs in the schema itself:
expand_refs(schema)
# Now we can expand $refs in the property value:
if isinstance(property_value, dict):
expand_refs(property_value)
def resolve(subschema: dict[str, Any]) -> dict[str, Any]:
if "$ref" in subschema:
_, resolved = cast(
RefResolver,
validator_instance.resolver,
).resolve(subschema["$ref"])
return cast(dict[str, Any], resolved)
return subschema
# Now we can validate and normalize the values:
# $refs are pre-expanded once before validator construction.
# Now we can validate and normalize the values:

@@ -64,7 +64,7 @@ def remove_discriminator(schema: dict) -> None:
def schema(cls):
"""we're overriding the schema classmethod to enable some post-processing"""
schema = super().schema()
schema = resolve_refs(schema)
schema = expand_refs(schema)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Fix: expand_refs mutates in-place; assigning its return sets schema to None

expand_refs returns None and mutates the dict; assigning it to schema will make schema None and break remove_discriminator/assertions. Suggest removing the assignment to match the implementation in airbyte_cdk/destinations/vector_db_based/config.py, wdyt?

Apply this diff:

-        schema = expand_refs(schema)
+        expand_refs(schema)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
schema = expand_refs(schema)
expand_refs(schema)
🤖 Prompt for AI Agents
In unit_tests/destinations/vector_db_based/config_test.py around line 67, the
test assigns the result of expand_refs to schema even though expand_refs mutates
the dict in-place and returns None; remove the assignment and just call
expand_refs(schema) so schema remains the populated dict for subsequent
remove_discriminator/assertions.

@aaronsteers aaronsteers marked this pull request as draft September 11, 2025 22:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant