
Conversation

@aaronsteers (Contributor) commented Aug 6, 2025

Add JSON schema generation to build process for schemastore.org registration

Pairs with this PR, to be re-opened after ours merges:

Summary

This PR adds JSON schema generation to the Python CDK build process alongside the existing pydantic model generation. The build process now creates declarative_component_schema.json from the existing declarative_component_schema.yaml file, enabling registration with schemastore.org for enhanced schema validation and auto-completion support.

Key Changes:

  • Modified bin/generate_component_manifest_files.py to include JSON schema generation
  • Added generate_json_schema() function with custom DateTimeEncoder to handle datetime objects
  • Integrated JSON generation into the main build flow via poetry run poe build
  • Generated 238KB JSON schema file with proper $schema and $id fields
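For illustration, here is a minimal sketch of that conversion step (the paths and the encoder shape are inferred from the description above; the actual implementation lives in bin/generate_component_manifest_files.py):

import json
from pathlib import Path

import yaml

# Paths assumed from the PR description; the real script derives them from its own constants.
YAML_PATH = Path("airbyte_cdk/sources/declarative/declarative_component_schema.yaml")
JSON_PATH = Path("airbyte_cdk/sources/declarative/declarative_component_schema.json")


class DateTimeEncoder(json.JSONEncoder):
    """Serialize anything exposing isoformat() (e.g. datetime) as an ISO-8601 string."""

    def default(self, obj):
        if hasattr(obj, "isoformat"):
            return obj.isoformat()
        return super().default(obj)


def generate_json_schema() -> None:
    schema_data = yaml.safe_load(YAML_PATH.read_text())
    JSON_PATH.write_text(json.dumps(schema_data, indent=2, cls=DateTimeEncoder))
    print(f"Generated JSON schema: {JSON_PATH}")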

Maintenance:

This fits neatly into the existing build process, which already generates the Pydantic models from the same YAML file; we now generate two files instead of one.

This can be run locally with poe build or in PRs with the /poe build slash command. (The /poe build command is also documented in the welcome message, so the command name is easy to find.)

Notes

  • The YAML file already contains valid JSON Schema with $schema and $id fields, making conversion straightforward
  • Custom DateTimeEncoder was necessary due to datetime objects in the YAML that aren't JSON serializable
  • This enables broader tooling ecosystem integration beyond just Python pydantic models
  • JSON file will be committed to git as a build artifact, similar to existing pydantic models
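To illustrate the tooling-integration point above, a hedged sketch of how a downstream check might validate a manifest against the generated JSON file (the jsonschema dependency and both file paths are assumptions, not part of this PR):

import json

import jsonschema  # assumed to be installed; not added by this PR
import yaml

# Generated artifact path as described in this PR (assumption).
with open("airbyte_cdk/sources/declarative/declarative_component_schema.json") as schema_file:
    schema = json.load(schema_file)

# A hypothetical connector manifest to check.
with open("manifest.yaml") as manifest_file:
    manifest = yaml.safe_load(manifest_file)

# Raises jsonschema.ValidationError if the manifest does not conform.
jsonschema.validate(instance=manifest, schema=schema)
print("manifest.yaml conforms to the declarative component schema")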

Link to Devin run: https://app.devin.ai/sessions/5fc8aa08db104607acbe970add5f1002
Requested by: @aaronsteers (AJ Steers - [email protected])

Summary by CodeRabbit

  • New Features

    • Automatically generates a JSON version of the component schema from the existing YAML schema.
  • Chores

    • Improved handling of date/time during schema conversion for more consistent serialization.
    • Switched manifest generation to a local generation flow and updated the build task to invoke the new generator.
  • Documentation

    • Fixed README rendering by escaping an asterisk so the example displays correctly.

Add JSON schema generation to build process for schemastore.org registration

- Modified bin/generate_component_manifest_files.py to generate JSON schema alongside pydantic models
- Added generate_json_schema() function with custom DateTimeEncoder to handle datetime objects
- Generated airbyte_cdk/sources/declarative/declarative_component_schema.json from YAML source
- JSON file can now be registered with schemastore.org for schema validation and auto-completion

Co-Authored-By: AJ Steers <[email protected]>
Copilot AI review requested due to automatic review settings August 6, 2025 22:30

Original prompt from AJ Steers
@Devin - Can you please locate the 'build' step (also poe build) in the Python CDK? I specifically want to add a step so that in addition to building pydantic models from the yaml file (double-check I have that direction correct), we would also generate a json version of that file. The json version of the file can then be registered with schemastore.org.


🤖 Devin AI Engineer

I'll be helping with this pull request! Here's what you should know:

✅ I will automatically:

  • Address comments on this PR. Add '(aside)' to your comment to have me ignore it.
  • Look at CI failures and help fix them

Note: I can only respond to comments from users who have write access to this repository.

⚙️ Control Options:

  • Disable automatic comment and CI monitoring

Copilot AI left a comment

Copilot wasn't able to review any files in this pull request.


github-actions bot commented Aug 6, 2025

👋 Greetings, Airbyte Team Member!

Here are some helpful tips and reminders for your convenience.

Testing This CDK Version

You can test this version of the CDK using the following:

# Run the CLI from this branch:
uvx 'git+https://github.com/airbytehq/airbyte-python-cdk.git@devin/1754519222-add-json-schema-generation#egg=airbyte-python-cdk[dev]' --help

# Update a connector to use the CDK from this branch ref:
cd airbyte-integrations/connectors/source-example
poe use-cdk-branch devin/1754519222-add-json-schema-generation

Helpful Resources

PR Slash Commands

Airbyte Maintainers can execute the following slash commands on your PR:

  • /autofix - Fixes most formatting and linting issues
  • /poetry-lock - Updates poetry.lock file
  • /test - Runs connector tests with the updated CDK
  • /poe build - Regenerate git-committed build artifacts, such as the pydantic models which are generated from the manifest JSON schema in YAML.
  • /poe <command> - Runs any poe command in the CDK environment



coderabbitai bot commented Aug 6, 2025

📝 Walkthrough

Rewrote bin/generate_component_manifest_files.py from a container-based async pipeline to a local, synchronous pipeline: added generate_json_schema(), switched to subprocess datamodel-codegen calls, added post_process_codegen(), removed dagger/anyio and related container scripts, and updated pyproject assemble task. Minor README and header edits elsewhere.
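For orientation, a minimal sketch of the per-YAML subprocess call described here (the flag list is abbreviated and the helper name is assumed; the exact command is quoted verbatim in the line comments further down):

import subprocess
import sys
from pathlib import Path


def run_codegen(input_yaml: Path, output_py: Path) -> None:
    # One datamodel-codegen invocation per YAML file; the real script passes additional flags.
    cmd = [
        "datamodel-codegen",
        "--input", str(input_yaml),
        "--output", str(output_py),
        "--disable-timestamp",
        "--use-double-quotes",
    ]
    try:
        subprocess.run(cmd, check=True, capture_output=True, text=True)
        print(f"Generated {output_py}")
    except subprocess.CalledProcessError as exc:
        print(f"Error generating {output_py}: {exc.stderr}", file=sys.stderr)
        sys.exit(1)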

Changes

  • Manifest generator script (bin/generate_component_manifest_files.py): Rewrote the flow to run locally and synchronously: added generate_json_schema(), switched to per-YAML datamodel-codegen subprocess calls writing into a temporary generated/ directory, and added post_process_codegen() to adjust generated Python files and imports; removed container/dagger logic and related variables.
  • Removed wrapper script (bin/generate-component-manifest-dagger.sh): Deleted the container wrapper that previously installed dagger and ran the generator inside a Python image.
  • Build task (pyproject.toml): Changed the [tool.poe.tasks].assemble command from the deleted shell wrapper to uv run bin/generate_component_manifest_files.py.
  • Documentation fix (airbyte_cdk/manifest_migrations/README.md): Escaped an asterisk in markdown (changed * to \*).
  • Header cleanup (airbyte_cdk/sources/declarative/models/declarative_component_schema.py): Removed the copyright header line only.

Sequence Diagram(s)

sequenceDiagram
    participant Dev as Developer / CLI
    participant Script as generate_component_manifest_files.py
    participant FS as Local FS
    participant DMG as datamodel-codegen (subprocess)

    Dev->>Script: run script
    Note right of Script #DFF2E1: 1. generate_json_schema()\n(read YAML -> write JSON)
    Script->>FS: read declarative_component_schema.yaml
    FS-->>Script: YAML
    Script->>FS: write declarative_component_schema.json

    Note right of Script #FFF3D1: 2. per-YAML codegen\n(temp generated/)
    Script->>FS: create TemporaryDirectory / generated/
    Script->>FS: iterate YAML files
    loop per YAML
      Script->>DMG: run datamodel-codegen --input <yaml> --output generated/<name>.py ...
      DMG-->>Script: generated/<name>.py
    end

    Note right of Script #F2E7FF: 3. post-processing
    Script->>FS: post_process_codegen(generated/, generated_post_processed/)
    FS-->>Script: final post-processed files
    Script->>FS: write final output (LOCAL_OUTPUT_DIR_PATH)
    Script-->>Dev: print status / exit

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

Suggested labels

chore

Suggested reviewers

  • aldogonzalez8

Would you like a unit test added for generate_json_schema() to verify YAML-to-JSON conversion and datetime serialization, wdyt?
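If helpful, a rough pytest sketch of such a test (the importability of the bin/ script, the tmp_path wiring, and the monkeypatched LOCAL_YAML_DIR_PATH constant are all assumptions; paths follow the final generated/ layout in this PR):

import json
from datetime import datetime
from pathlib import Path

import yaml

# Hypothetical import; bin/generate_component_manifest_files.py would need to be importable
# (e.g. via a sys.path tweak or a conftest fixture) for a test written this way.
import generate_component_manifest_files as gen


def test_generate_json_schema_roundtrip(tmp_path: Path, monkeypatch) -> None:
    # Arrange: a tiny YAML schema containing a datetime to exercise the custom encoder.
    yaml_path = tmp_path / "declarative_component_schema.yaml"
    yaml_path.write_text(yaml.safe_dump({"title": "Test", "built_at": datetime(2025, 8, 6)}))
    monkeypatch.setattr(gen, "LOCAL_YAML_DIR_PATH", str(tmp_path))

    # Act
    gen.generate_json_schema()

    # Assert: the JSON file exists and the datetime is ISO-8601 encoded.
    out = tmp_path / "generated" / "declarative_component_schema.json"
    data = json.loads(out.read_text())
    assert data["title"] == "Test"
    assert data["built_at"] == "2025-08-06T00:00:00"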


coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (2)
bin/generate_component_manifest_files.py (2)

34-52: The JSON schema generation looks solid, but what about error handling?

The function correctly reads the YAML file, handles datetime serialization with a custom encoder, and writes the JSON output. However, I'm wondering if we should add some error handling for common failure scenarios? wdyt?

For instance:

  • What if the YAML file doesn't exist or is malformed?
  • What if the output directory isn't writable?
  • Should we validate that the generated JSON is actually valid JSON schema format?

Here's a more robust version with error handling:

 def generate_json_schema():
     """Generate JSON schema from the YAML file for schemastore.org registration."""
     yaml_file_path = f"{LOCAL_YAML_DIR_PATH}/declarative_component_schema.yaml"
     json_file_path = f"{LOCAL_YAML_DIR_PATH}/declarative_component_schema.json"

-    with open(yaml_file_path, "r") as yaml_file:
-        schema_data = yaml.safe_load(yaml_file)
+    try:
+        with open(yaml_file_path, "r") as yaml_file:
+            schema_data = yaml.safe_load(yaml_file)
+    except FileNotFoundError:
+        print(f"Error: YAML file not found at {yaml_file_path}")
+        raise
+    except yaml.YAMLError as e:
+        print(f"Error: Invalid YAML in {yaml_file_path}: {e}")
+        raise

     class DateTimeEncoder(json.JSONEncoder):
         def default(self, obj):
             if hasattr(obj, "isoformat"):
                 return obj.isoformat()
             return super().default(obj)

-    with open(json_file_path, "w") as json_file:
-        json.dump(schema_data, json_file, indent=2, cls=DateTimeEncoder)
+    try:
+        with open(json_file_path, "w") as json_file:
+            json.dump(schema_data, json_file, indent=2, cls=DateTimeEncoder)
+    except IOError as e:
+        print(f"Error: Could not write JSON file to {json_file_path}: {e}")
+        raise

     print(f"Generated JSON schema: {json_file_path}")

42-46: Smart datetime handling! Could we make the encoder name more specific?

The DateTimeEncoder correctly handles objects with isoformat method, which covers datetime objects. However, the class name might be a bit generic since it handles any object with isoformat, not just datetime objects. Maybe something like IsoFormatEncoder would be more precise? wdyt?

-    class DateTimeEncoder(json.JSONEncoder):
+    class IsoFormatEncoder(json.JSONEncoder):
         def default(self, obj):
             if hasattr(obj, "isoformat"):
                 return obj.isoformat()
             return super().default(obj)

     with open(json_file_path, "w") as json_file:
-        json.dump(schema_data, json_file, indent=2, cls=DateTimeEncoder)
+        json.dump(schema_data, json_file, indent=2, cls=IsoFormatEncoder)
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e1664ec and 544d623.

📒 Files selected for processing (1)
  • bin/generate_component_manifest_files.py (3 hunks)
🧰 Additional context used
🧠 Learnings (7)
📓 Common learnings
Learnt from: pnilan
PR: airbytehq/airbyte-python-cdk#0
File: :0-0
Timestamp: 2024-12-11T16:34:46.319Z
Learning: In the airbytehq/airbyte-python-cdk repository, the `declarative_component_schema.py` file is auto-generated from `declarative_component_schema.yaml` and should be ignored in the recommended reviewing order.
📚 Learning: in the airbytehq/airbyte-python-cdk repository, the `declarative_component_schema.py` file is auto-g...
Learnt from: pnilan
PR: airbytehq/airbyte-python-cdk#0
File: :0-0
Timestamp: 2024-12-11T16:34:46.319Z
Learning: In the airbytehq/airbyte-python-cdk repository, the `declarative_component_schema.py` file is auto-generated from `declarative_component_schema.yaml` and should be ignored in the recommended reviewing order.

Applied to files:

  • bin/generate_component_manifest_files.py
📚 Learning: the files in `airbyte_cdk/cli/source_declarative_manifest/`, including `_run.py`, are imported from ...
Learnt from: aaronsteers
PR: airbytehq/airbyte-python-cdk#58
File: airbyte_cdk/cli/source_declarative_manifest/_run.py:62-65
Timestamp: 2024-11-15T01:04:21.272Z
Learning: The files in `airbyte_cdk/cli/source_declarative_manifest/`, including `_run.py`, are imported from another repository, and changes to these files should be minimized or avoided when possible to maintain consistency.

Applied to files:

  • bin/generate_component_manifest_files.py
📚 Learning: in the airbytehq/airbyte-python-cdk repository, ignore all `__init__.py` files when providing a reco...
Learnt from: pnilan
PR: airbytehq/airbyte-python-cdk#0
File: :0-0
Timestamp: 2024-12-11T16:34:46.319Z
Learning: In the airbytehq/airbyte-python-cdk repository, ignore all `__init__.py` files when providing a recommended reviewing order.

Applied to files:

  • bin/generate_component_manifest_files.py
📚 Learning: in the project, the `run` function is defined in `airbyte_cdk/cli/source_declarative_manifest/_run.p...
Learnt from: aaronsteers
PR: airbytehq/airbyte-python-cdk#58
File: pyproject.toml:108-111
Timestamp: 2024-11-15T00:58:15.446Z
Learning: In the project, the `run` function is defined in `airbyte_cdk/cli/source_declarative_manifest/_run.py` and is imported into the module's `__init__.py`.

Applied to files:

  • bin/generate_component_manifest_files.py
📚 Learning: when building the docker image for the `source-declarative-manifest`, copying files from `site-packa...
Learnt from: ChristoGrab
PR: airbytehq/airbyte-python-cdk#90
File: Dockerfile:16-21
Timestamp: 2024-12-02T18:36:04.346Z
Learning: When building the Docker image for the `source-declarative-manifest`, copying files from `site-packages` ensures that the SDM code runs from an installed package context, maintaining import paths and module consistency.

Applied to files:

  • bin/generate_component_manifest_files.py
📚 Learning: when modifying the `yamldeclarativesource` class in `airbyte_cdk/sources/declarative/yaml_declarativ...
Learnt from: ChristoGrab
PR: airbytehq/airbyte-python-cdk#58
File: airbyte_cdk/sources/declarative/yaml_declarative_source.py:0-0
Timestamp: 2024-11-18T23:40:06.391Z
Learning: When modifying the `YamlDeclarativeSource` class in `airbyte_cdk/sources/declarative/yaml_declarative_source.py`, avoid introducing breaking changes like altering method signatures within the scope of unrelated PRs. Such changes should be addressed separately to minimize impact on existing implementations.

Applied to files:

  • bin/generate_component_manifest_files.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (10)
  • GitHub Check: Check: source-pokeapi
  • GitHub Check: Check: destination-motherduck
  • GitHub Check: Check: source-intercom
  • GitHub Check: Check: source-hardcoded-records
  • GitHub Check: Check: source-shopify
  • GitHub Check: Pytest (Fast)
  • GitHub Check: Pytest (All, Python 3.11, Ubuntu)
  • GitHub Check: SDM Docker Image Build
  • GitHub Check: Pytest (All, Python 3.10, Ubuntu)
  • GitHub Check: Analyze (python)
🔇 Additional comments (3)
bin/generate_component_manifest_files.py (3)

3-3: LGTM on the new imports!

The json and yaml imports are appropriately placed and necessary for the new JSON schema generation functionality.

Also applies to: 11-11


136-136: Perfect placement of the JSON schema generation!

Calling generate_json_schema() at the beginning of main() ensures the JSON schema is generated before the rest of the build process. This maintains the build flow as intended in the PR objectives.


36-37: Required schemastore.org fields are already present
The declarative schema at airbyte_cdk/sources/declarative/declarative_component_schema.yaml includes both $schema and $id, so the generated JSON will satisfy schemastore.org requirements without extra validation. Great to go! wdyt?
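(For reference, a quick local sanity check of those two fields, assuming the repository layout above; this is not something the PR adds:)

import yaml

with open("airbyte_cdk/sources/declarative/declarative_component_schema.yaml") as f:
    schema = yaml.safe_load(f)

# Both keys are expected per the comment above.
assert "$schema" in schema and "$id" in schema, "schema is missing $schema and/or $id"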


github-actions bot commented Aug 6, 2025

PyTest Results (Fast)

3 762 tests  ±0   3 751 ✅ ±0   7m 20s ⏱️ +11s
    1 suites ±0      11 💤 ±0 
    1 files   ±0       0 ❌ ±0 

Results for commit 49c5a48. ± Comparison against base commit eca065f.

♻️ This comment has been updated with latest results.


github-actions bot commented Aug 6, 2025

PyTest Results (Full)

3 765 tests   3 754 ✅  11m 16s ⏱️
    1 suites     11 💤
    1 files        0 ❌

Results for commit 49c5a48.

♻️ This comment has been updated with latest results.

devin-ai-integration bot changed the title from "Add JSON schema generation to build process for schemastore.org registration" to "feat: Add JSON schema generation to build process for schemastore.org registration" Aug 6, 2025
- Updated generate_json_schema() to create JSON file in generated/ subfolder
- Added directory creation logic to ensure generated/ folder exists
- Removed old JSON file from root declarative/ directory
- Improves organization by clearly marking auto-generated files

Co-Authored-By: AJ Steers <[email protected]>
github-actions bot added the enhancement (New feature or request) label Aug 6, 2025
- Fixed whitespace formatting to comply with Ruff format check
- Added proper blank line after import statement
- Removed trailing whitespace

Co-Authored-By: AJ Steers <[email protected]>
@aaronsteers (Contributor, Author) commented Aug 7, 2025

/poe build

Running poe build...

Link to job logs.

❌ Poe command build failed. Please inspect the logs.

@aaronsteers (Contributor, Author) commented Aug 7, 2025

/autofix

Auto-Fix Job Info

This job attempts to auto-fix any linting or formatting issues. If any fixes are made, those changes will be automatically committed and pushed back to the PR.

Note: This job can only be run by maintainers. On PRs from forks, this command requires
that the PR author has enabled the Allow edits from maintainers option.

PR auto-fix job started... Check job output.

❌ Job failed.

coderabbitai bot left a comment

Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
bin/generate_component_manifest_files.py (2)

103-120: Fix typo and make import injection robust/idempotent.

  • Typo: “warinings” → “warnings”.
  • Current insertion relies on “\n\n” and may skip import, breaking replaced bases.

Apply:

-    # update the imports to include the new base model with deprecation warinings
+    # update the imports to include the new base model with deprecation warnings
@@
-    if len(classes_with_deprecated_fields) > 0:
-        # Find where to insert the base model - after imports but before class definitions
-        imports_end = post_processed_content.find(
-            "\n\n",
-            post_processed_content.find("from pydantic.v1 import"),
-        )
-        if imports_end > 0:
-            post_processed_content = (
-                post_processed_content[:imports_end]
-                + "\n\n"
-                + "from airbyte_cdk.sources.declarative.models.base_model_with_deprecations import (\n"
-                + "    BaseModelWithDeprecations,\n"
-                + ")"
-                + post_processed_content[imports_end:]
-            )
+    if classes_with_deprecated_fields:
+        # Insert once, right after the first pydantic import (fallback to top).
+        if "BaseModelWithDeprecations" not in post_processed_content:
+            m = re.search(r"^from pydantic\.v1 import[^\n]*\n", post_processed_content, flags=re.M)
+            insert_at = m.end() if m else 0
+            injection = (
+                "from airbyte_cdk.sources.declarative.models.base_model_with_deprecations import (\n"
+                "    BaseModelWithDeprecations,\n"
+                ")\n"
+            )
+            post_processed_content = (
+                post_processed_content[:insert_at] + injection + post_processed_content[insert_at:]
+            )

wdyt?


142-146: Make “_parameters” rename indentation-agnostic.

Exact string match is brittle; use regex so any indent works.

Apply:

-            post_processed_content = original_content.replace(
-            " _parameters:", " parameters:"
-            ).replace("from pydantic", "from pydantic.v1")
+            post_processed_content = re.sub(r"(^\s+)_parameters:", r"\1parameters:", original_content, flags=re.M)
+            post_processed_content = post_processed_content.replace("from pydantic", "from pydantic.v1")

wdyt?

♻️ Duplicate comments (1)
bin/generate_component_manifest_files.py (1)

42-50: Docstring update looks good.

The schemastore phrasing from the prior suggestion has been applied as discussed.

🧹 Nitpick comments (2)
bin/generate_component_manifest_files.py (2)

35-39: Deterministic module export order.

Small: ensure stable star-import order.

Apply:

-    for module_name in get_all_yaml_files_without_ext():
+    for module_name in sorted(get_all_yaml_files_without_ext()):
         header += f"from .{module_name} import *\n"

wdyt?


193-201: Improve error messaging when codegen tool is absent.

Catching only CalledProcessError misses the common “module not found” case. Add a tailored hint.

Apply:

             try:
                 result = subprocess.run(cmd, check=True, capture_output=True, text=True)
                 print(f"Generated {output_py}")
-            except subprocess.CalledProcessError as e:
+            except subprocess.CalledProcessError as e:
                 print(f"Error generating {output_py}: {e}")
                 print(f"stdout: {e.stdout}")
                 print(f"stderr: {e.stderr}")
+                if "No module named 'datamodel_code_generator'" in (e.stderr or ""):
+                    print("Hint: install `datamodel-code-generator` (e.g., via `uv run` script deps or as a dev dependency) or adjust the assemble task.", file=sys.stderr)
                 sys.exit(1)

wdyt?

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between bd615ad and 49c5a48.

📒 Files selected for processing (4)
  • airbyte_cdk/sources/declarative/models/declarative_component_schema.py (0 hunks)
  • bin/generate-component-manifest-dagger.sh (0 hunks)
  • bin/generate_component_manifest_files.py (4 hunks)
  • pyproject.toml (1 hunks)
💤 Files with no reviewable changes (2)
  • airbyte_cdk/sources/declarative/models/declarative_component_schema.py
  • bin/generate-component-manifest-dagger.sh
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2024-11-15T01:04:21.272Z
Learnt from: aaronsteers
PR: airbytehq/airbyte-python-cdk#58
File: airbyte_cdk/cli/source_declarative_manifest/_run.py:62-65
Timestamp: 2024-11-15T01:04:21.272Z
Learning: The files in `airbyte_cdk/cli/source_declarative_manifest/`, including `_run.py`, are imported from another repository, and changes to these files should be minimized or avoided when possible to maintain consistency.

Applied to files:

  • bin/generate_component_manifest_files.py
📚 Learning: 2024-12-11T16:34:46.319Z
Learnt from: pnilan
PR: airbytehq/airbyte-python-cdk#0
File: :0-0
Timestamp: 2024-12-11T16:34:46.319Z
Learning: In the airbytehq/airbyte-python-cdk repository, the `declarative_component_schema.py` file is auto-generated from `declarative_component_schema.yaml` and should be ignored in the recommended reviewing order.

Applied to files:

  • bin/generate_component_manifest_files.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (10)
  • GitHub Check: Check: source-hardcoded-records
  • GitHub Check: Check: source-intercom
  • GitHub Check: Check: destination-motherduck
  • GitHub Check: Check: source-shopify
  • GitHub Check: Check: source-pokeapi
  • GitHub Check: Pytest (All, Python 3.11, Ubuntu)
  • GitHub Check: SDM Docker Image Build
  • GitHub Check: Manifest Server Docker Image Build
  • GitHub Check: Pytest (All, Python 3.10, Ubuntu)
  • GitHub Check: Pytest (Fast)

Comment on lines +51 to +71
yaml_file_path = f"{LOCAL_YAML_DIR_PATH}/declarative_component_schema.yaml"
json_file_path = f"{LOCAL_YAML_DIR_PATH}/generated/declarative_component_schema.json"

with open(yaml_file_path, "r") as yaml_file:
schema_data = yaml.safe_load(yaml_file)

class DateTimeEncoder(json.JSONEncoder):
def default(self, obj):
if hasattr(obj, "isoformat"):
return obj.isoformat()
return super().default(obj)

import os

os.makedirs(os.path.dirname(json_file_path), exist_ok=True)

with open(json_file_path, "w") as json_file:
json.dump(schema_data, json_file, indent=2, cls=DateTimeEncoder)

print(f"Generated JSON schema: {json_file_path}")


🛠️ Refactor suggestion

Harden JSON Schema generation: ensure $schema/$id, validate, and drop redundant import.

  • Add defaults for $schema and $id if missing.
  • Validate early with jsonschema to fail fast.
  • Remove the inner import os (already imported).

Apply:

     yaml_file_path = f"{LOCAL_YAML_DIR_PATH}/declarative_component_schema.yaml"
     json_file_path = f"{LOCAL_YAML_DIR_PATH}/generated/declarative_component_schema.json"

-    with open(yaml_file_path, "r") as yaml_file:
+    if not os.path.exists(yaml_file_path):
+        raise FileNotFoundError(f"Missing schema source: {yaml_file_path}")
+    with open(yaml_file_path, "r") as yaml_file:
         schema_data = yaml.safe_load(yaml_file)

     class DateTimeEncoder(json.JSONEncoder):
         def default(self, obj):
             if hasattr(obj, "isoformat"):
                 return obj.isoformat()
             return super().default(obj)

-    import os
-
-    os.makedirs(os.path.dirname(json_file_path), exist_ok=True)
+    # Ensure required fields for schemastore consumers
+    schema_data.setdefault("$schema", "https://json-schema.org/draft/2020-12/schema")
+    schema_data.setdefault("$id", "https://json.schemastore.org/airbyte-declarative-component.json")
+    os.makedirs(os.path.dirname(json_file_path), exist_ok=True)
+    # Validate and fail fast
+    try:
+        from jsonschema import Draft202012Validator
+        Draft202012Validator.check_schema(schema_data)
+    except Exception as e:
+        print(f"JSON Schema validation failed: {e}", file=sys.stderr)
+        sys.exit(1)

     with open(json_file_path, "w") as json_file:
-        json.dump(schema_data, json_file, indent=2, cls=DateTimeEncoder)
+        json.dump(schema_data, json_file, indent=2, ensure_ascii=False, cls=DateTimeEncoder)

wdyt?

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

Before:

    yaml_file_path = f"{LOCAL_YAML_DIR_PATH}/declarative_component_schema.yaml"
    json_file_path = f"{LOCAL_YAML_DIR_PATH}/generated/declarative_component_schema.json"

    with open(yaml_file_path, "r") as yaml_file:
        schema_data = yaml.safe_load(yaml_file)

    class DateTimeEncoder(json.JSONEncoder):
        def default(self, obj):
            if hasattr(obj, "isoformat"):
                return obj.isoformat()
            return super().default(obj)

    import os

    os.makedirs(os.path.dirname(json_file_path), exist_ok=True)

    with open(json_file_path, "w") as json_file:
        json.dump(schema_data, json_file, indent=2, cls=DateTimeEncoder)

    print(f"Generated JSON schema: {json_file_path}")

After:

    yaml_file_path = f"{LOCAL_YAML_DIR_PATH}/declarative_component_schema.yaml"
    json_file_path = f"{LOCAL_YAML_DIR_PATH}/generated/declarative_component_schema.json"

    if not os.path.exists(yaml_file_path):
        raise FileNotFoundError(f"Missing schema source: {yaml_file_path}")
    with open(yaml_file_path, "r") as yaml_file:
        schema_data = yaml.safe_load(yaml_file)

    class DateTimeEncoder(json.JSONEncoder):
        def default(self, obj):
            if hasattr(obj, "isoformat"):
                return obj.isoformat()
            return super().default(obj)

    # Ensure required fields for schemastore consumers
    schema_data.setdefault(
        "$schema", "https://json-schema.org/draft/2020-12/schema"
    )
    schema_data.setdefault(
        "$id", "https://json.schemastore.org/airbyte-declarative-component.json"
    )
    os.makedirs(os.path.dirname(json_file_path), exist_ok=True)
    # Validate and fail fast
    try:
        from jsonschema import Draft202012Validator
        Draft202012Validator.check_schema(schema_data)
    except Exception as e:
        print(f"JSON Schema validation failed: {e}", file=sys.stderr)
        sys.exit(1)

    with open(json_file_path, "w") as json_file:
        json.dump(
            schema_data,
            json_file,
            indent=2,
            ensure_ascii=False,
            cls=DateTimeEncoder,
        )

    print(f"Generated JSON schema: {json_file_path}")
🤖 Prompt for AI Agents
In bin/generate_component_manifest_files.py around lines 51 to 71, the JSON
schema generation should be hardened: remove the redundant inner "import os" and
instead rely on the module-level import, ensure the loaded YAML has defaults for
"$schema" and "$id" (set sensible defaults if missing), validate the resulting
schema object with jsonschema.validate (import jsonschema at module top and
raise/exit on validation errors to fail fast), then proceed to create the output
directory and dump using the DateTimeEncoder; make sure any mutation of
schema_data to add defaults happens before validation and before json.dump.

Comment on lines +173 to +191
            cmd = [
                "datamodel-codegen",
                "--input",
                input_yaml,
                "--output",
                output_py,
                "--disable-timestamp",
                "--enum-field-as-literal",
                "one",
                "--set-default-enum-member",
                "--use-double-quotes",
                "--remove-special-field-name-prefix",
                # allow usage of the extra key such as `deprecated`, etc.
                "--field-extra-keys",
                # account the `deprecated` flag provided for the field.
                "deprecated",
                # account the `deprecation_message` provided for the field.
                "deprecation_message",
            ]

🛠️ Refactor suggestion

⚠️ Potential issue

Prefer module invocation to avoid console-script PATH issues.

Calling via sys.executable -m datamodel_code_generator is more portable and fixes failures when the CLI shim isn’t present.

Apply:

-            cmd = [
-                "datamodel-codegen",
+            cmd = [
+                sys.executable,
+                "-m",
+                "datamodel_code_generator",
                 "--input",
                 input_yaml,
                 "--output",
                 output_py,

wdyt?

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In bin/generate_component_manifest_files.py around lines 173 to 191, the
subprocess invocation uses the console script "datamodel-codegen" which can fail
if the CLI shim is not on PATH; change the command to invoke the package module
via the current Python interpreter by replacing the first entry with
sys.executable and adding "-m" and "datamodel_code_generator" (i.e.
[sys.executable, "-m", "datamodel_code_generator", ...]); also add an import sys
at the top of the file if not present so the script uses the running interpreter
to run the tool.


# Build tasks
assemble = {cmd = "bin/generate-component-manifest-dagger.sh", help = "Generate component manifest files."}
assemble = {cmd = "uv run bin/generate_component_manifest_files.py", help = "Generate component manifest files."}

💡 Verification agent

🧩 Analysis chain

Build failure risk: uv may be unavailable in CI.

The job logs show build/autofix failing; likely uv isn’t on PATH. Two options:

  1. Run uv within Poetry’s venv and declare it as a dev dep.
  2. Or drop uv and rely on dev deps exclusively.

Apply option 1 minimal changes:

-assemble = {cmd = "uv run bin/generate_component_manifest_files.py", help = "Generate component manifest files."}
+assemble = {cmd = "poetry run uv run bin/generate_component_manifest_files.py", help = "Generate component manifest files."}

And add dev deps (near the existing [tool.poetry.group.dev.dependencies]):

+[tool.poetry.group.dev.dependencies]
+uv = "^0.5"
+datamodel-code-generator = "0.26.3"

Alternatively, option 2:

-assemble = {cmd = "uv run bin/generate_component_manifest_files.py", help = "Generate component manifest files."}
+assemble = {cmd = "poetry run python bin/generate_component_manifest_files.py", help = "Generate component manifest files."}

(then keep datamodel-code-generator in dev deps). wdyt?

Run to confirm the current CI failure mode:


🏁 Script executed:

#!/bin/bash
set -euo pipefail
echo "uv on PATH?"; command -v uv || echo "missing"
echo "datamodel-codegen on PATH?"; command -v datamodel-codegen || echo "missing"
rg -n '^assemble\s*=.*generate_component_manifest_files\.py' pyproject.toml -n

Length of output: 407


Install uv & datamodel-code-generator for the assemble script
CI logs confirm both uv and datamodel-codegen are missing, causing the task to fail. Two options:

  • Add them as dev dependencies and prefix with poetry run:
- assemble = {cmd = "uv run bin/generate_component_manifest_files.py", help = "Generate component manifest files."}
+ assemble = {cmd = "poetry run uv run bin/generate_component_manifest_files.py", help = "Generate component manifest files."}

Under [tool.poetry.group.dev.dependencies]:

uv = "^0.5"
datamodel-code-generator = "0.26.3"
  • Or invoke the script directly via Python:
- assemble = {cmd = "uv run bin/generate_component_manifest_files.py", help = "Generate component manifest files."}
+ assemble = {cmd = "poetry run python bin/generate_component_manifest_files.py", help = "Generate component manifest files."}

Which approach do you prefer? wdyt?

🤖 Prompt for AI Agents
In pyproject.toml around line 146, the assemble task uses the external tool "uv"
which is not installed in CI; update the assemble command to run the script with
Python under Poetry (e.g. change the cmd to invoke the Python script via "poetry
run python ...") so it no longer depends on the "uv" CLI, or alternatively add
"uv" and "datamodel-code-generator" to [tool.poetry.group.dev.dependencies] with
appropriate versions and keep the current command—pick one approach and apply it
consistently.
