Skip to content

Conversation

tenfyzhong
Copy link
Contributor

@tenfyzhong tenfyzhong commented Sep 25, 2025

What problem does this PR solve?

This PR introduces the necessary infrastructure to run integration tests against the TiDB "nextgen" architecture. Previously, our integration tests only supported the classic TiDB cluster setup. With the introduction of nextgen components and object storage (MinIO), a dedicated setup is required to validate TiCDC's compatibility and functionality with this new architecture.

Issue Number: close #2336 close #2561 close #1884

What is changed and how it works?

This PR refactors the TiDB cluster startup and teardown scripts to support both classic and nextgen TiDB architectures for integration testing.

  • New Health Check Utilities:

    • Added tests/integration_tests/_utils/check_pd_health to robustly check the health of a PD server using its /pd/api/v1/health endpoint.
    • Added tests/integration_tests/_utils/check_tidb_health to verify TiDB server readiness by executing a simple SELECT query via the mysql client.
  • Refactored TiDB Cluster Startup:

    • The tests/integration_tests/_utils/start_tidb_cluster script is now a dispatcher, conditionally invoking either start_tidb_cluster_classic or start_tidb_cluster_nextgen based on the NEXT_GEN environment variable.
    • The original logic for starting a classic TiDB cluster has been moved into a new script: tests/integration_tests/_utils/start_tidb_cluster_classic.
    • A new script tests/integration_tests/_utils/start_tidb_cluster_nextgen has been added to provision a TiDB cluster with nextgen components. This includes:
      • Starting a MinIO server to act as an S3-compatible object storage backend.
      • Starting pd-server-nextgen instances (supporting multiple upstream PDs).
      • Starting tikv-server-nextgen instances, configured to use MinIO for storage.
      • Starting two tidb-server-nextgen instances: one for the SYSTEM keyspace and another for a user-defined KEYSPACE_NAME.
      • Starting a classic pd-server, tikv-server, and tidb-server for the downstream cluster.
      • Utilizing the new check_pd_health and check_tidb_health scripts to ensure component readiness.
  • Updated TiDB Cluster Teardown:

    • tests/integration_tests/_utils/stop_tidb_cluster has been updated to correctly terminate tidb-server-nextgen, tikv-server-nextgen, pd-server-nextgen, and minio processes, along with their respective ports.
  • Enhanced Test Environment Preparation:

    • tests/integration_tests/_utils/test_prepare now includes additional environment variables for:
      • Nextgen TiDB system ports and status addresses (UP_TIDB_SYSTEM_PORT, UP_TIDB_SYSTEM_STATUS).
      • MinIO configuration (MINIO_API_PORT, MINIO_CONSOLE_PORT, MINIO_MC_ALIAS, MINIO_ROOT_USER, MINIO_ROOT_PASSWORD).
      • ANSI color codes for improved log readability.
  • Integration Test Runner Modifications:

    • tests/integration_tests/run.sh now sources _utils/test_prepare and sets export TICDC_NEWARCH=true and export CDC_BINARY=cdc.test to enable nextgen testing by default for relevant tests.

Check List

Tests

  • Integration test

Questions

Will it cause performance regression or break compatibility?

No, this PR primarily introduces new test infrastructure for a different TiDB architecture. It does not affect the performance or compatibility of TiCDC itself.

Do you need to update user documentation, design documentation or monitoring documentation?

No, these changes are internal to the integration test suite and do not require updates to user, design, or monitoring documentation.

Release note

None.

@ti-chi-bot ti-chi-bot bot added the release-note-none Denotes a PR that doesn't merit a release note. label Sep 25, 2025
Copy link

ti-chi-bot bot commented Sep 25, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign charlescheung96 for approval. For more information see the Code Review Process.
Please ensure that each of them provides their approval before proceeding.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot ti-chi-bot bot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Sep 25, 2025
Copy link

Summary of Changes

Hello @tenfyzhong, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly expands the integration testing capabilities by introducing full support for the next-generation TiDB cluster architecture. This allows TiCDC to be validated against the new components and object storage mechanisms, ensuring compatibility and functionality in modern deployments. The changes primarily involve refactoring existing test utilities and adding new scripts to manage the lifecycle of next-gen TiDB clusters within the testing environment.

Highlights

  • Next-Gen TiDB Cluster Support: Introduced comprehensive infrastructure to enable integration tests against the new 'next-gen' TiDB cluster architecture, which includes next-gen components and MinIO object storage.
  • Refactored Cluster Startup/Teardown: The start_tidb_cluster script has been refactored to act as a dispatcher, dynamically invoking either classic or next-gen cluster startup based on an environment variable. The stop_tidb_cluster script was updated to properly terminate next-gen components.
  • New Health Check Utilities: Added dedicated utility scripts, check_pd_health and check_tidb_health, to robustly verify the readiness and health of PD and TiDB servers, respectively, using their API endpoints and SQL queries.
  • MinIO Integration: The next-gen cluster setup now includes the automatic provisioning and configuration of a MinIO server to serve as an S3-compatible object storage backend for tikv-server-nextgen.
  • Enhanced Test Environment Preparation: The test_prepare script has been updated with new environment variables to support next-gen TiDB system ports, status addresses, and MinIO configuration, alongside the addition of ANSI color codes for improved log readability.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@tenfyzhong
Copy link
Contributor Author

/test all

Copy link

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces significant infrastructure to enable integration testing for the next-gen TiDB architecture, which is a valuable addition. The refactoring of the cluster startup scripts and the introduction of new health checks are well-structured. However, the primary new script, start_tidb_cluster_nextgen, contains several critical bugs related to shell syntax, incorrect command arguments, and faulty variable handling that will prevent the test environment from starting correctly. I have provided detailed comments and suggestions to address these critical issues. Additionally, there are opportunities to improve code quality by refactoring duplicated code in the new health check scripts and correcting syntax in environment variable setup. Addressing these points will ensure the new testing infrastructure is robust and maintainable.

…n tests

- Introduced `check_pd_health` and `check_tidb_health` scripts for robust component readiness checks.
- Refactored `start_tidb_cluster` to dynamically select between classic and next-gen cluster startup based on the `NEXT_GEN` environment variable.
- Moved existing cluster startup logic to `start_tidb_cluster_classic`.
- Added `start_tidb_cluster_nextgen` to provision a next-gen TiDB cluster, which includes:
    - Starting MinIO as an S3-compatible object storage for DFS.
    - Configuring and launching multiple upstream PD instances.
    - Deploying dedicated TiDB instances for the `SYSTEM` keyspace and user keyspaces.
    - Configuring TiKV with DFS settings.
- Updated `test_prepare` to export all cluster-related environment variables and include new variables for MinIO and next-gen TiDB components.

Signed-off-by: tenfyzhong <[email protected]>
- Add `tidb-server-nextgen`, `tikv-server-nextgen`, `pd-server-nextgen` to the process kill list to ensure all TiDB cluster components are stopped during test cleanup.
- Add `minio` to the process kill list to stop the MinIO service if running.
- Kill processes on `UP_TIDB_SYSTEM_PORT` and `UP_TIDB_SYSTEM_STATUS` to clean up any lingering port bindings.

Signed-off-by: tenfyzhong <[email protected]>
- Source `_utils/test_prepare` script to handle common test setup.
- Set `CDC_BINARY` environment variable to `cdc.test` to ensure tests use the correct binary.

Signed-off-by: tenfyzhong <[email protected]>
- Standardize indentation from spaces to tabs for consistency.

Signed-off-by: tenfyzhong <[email protected]>
- Ensure `UP_TIDB_SYSTEM_*` variables default to their specific values, not general `UP_TIDB_*` values.
- Update `MINIO_*` environment variable defaults to use standard `:-` bash syntax.
- Add `KEYSPACE_NAME` environment variable with a default value.

Signed-off-by: tenfyzhong <[email protected]>
… improve configuration

- Migrate from `WORK_DIR` to `OUT_DIR` for all output paths.
- Remove `--keyspace-name` CLI option; it's now expected as an environment variable.
- Introduce new CLI options for component configurations (`--tidb-config`, `--pd-config`, `--tikv-config`) and retry times (`--retry`).
- Remove PID tracking by deleting the `env` file generation.
- Add explicit `mkdir -p` calls for all component data and log directories.
- Introduce a second upstream keyspace TiDB instance for testing.
- Adjust TiDB configuration files to include `max-server-connections` under `[instance]`.
- Change upstream keyspace TiDB `--path` to connect directly to PD.
- Ensure system TiDB health before starting keyspace TiDB.

Signed-off-by: tenfyzhong <[email protected]>
- Provides a utility script to conditionally add `-k KEYSPACE_NAME` to `cdc cli changefeed` commands.
- This allows integration tests to adapt CLI arguments based on the `NEXT_GEN` environment variable.

Signed-off-by: tenfyzhong <[email protected]>
- Replaced direct `cdc cli changefeed create` call with `cdc_cli_changefeed` helper function.
- Made the `run.sh` script executable.

Signed-off-by: tenfyzhong <[email protected]>
- Changed indentation from 4 spaces to tabs for consistency within the script.

Signed-off-by: tenfyzhong <[email protected]>
- Replace direct `cdc cli changefeed` calls with the `cdc_cli_changefeed` helper function.
- Standardize command invocation across integration tests for improved consistency and maintainability.
- Update file permissions to ensure test scripts are executable.

Signed-off-by: tenfyzhong <[email protected]>
@tenfyzhong tenfyzhong force-pushed the feat-integration-kv-cdc branch from 6c5b5b2 to 059b140 Compare September 26, 2025 06:53
@tenfyzhong
Copy link
Contributor Author

/gemini review

Copy link

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces significant infrastructure changes to support integration testing for the next-generation TiDB cluster architecture. The refactoring of cluster startup/teardown scripts and the addition of new health checks are well-structured. However, I've identified a critical bug in the start_tidb_cluster_nextgen script where an incorrect configuration file is used, which would likely break the test setup. Additionally, there's a high-severity argument handling bug in the check_tidb_health script. I've also pointed out several instances of code duplication that could be refactored to improve maintainability and a minor code cleanup suggestion. Overall, the changes are a great step forward, but these issues should be addressed.

*) SINK_URI="mysql://normal:[email protected]:3306/" ;;
esac
cdc cli changefeed create --sink-uri="$SINK_URI"
# cdc_cli_changefeed create --sink-uri="$SINK_URI"

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

This commented-out line is redundant as the same command exists on the next line. It should be removed to improve code clarity.

@tenfyzhong
Copy link
Contributor Author

/test all

- The `check_tidb_health` script no longer accepts or processes a secondary TiDB port.
- Calls to `check_tidb_health` were updated to reflect the simplified signature, now performing separate checks for multiple ports if needed.
- Fixed a bug where `tidb-server-nextgen` for the secondary upstream TiDB instance was incorrectly using the primary instance's config file.
- Removed a redundant commented-out line in `autorandom/run.sh`.

Signed-off-by: tenfyzhong <[email protected]>
… tests

- This Dockerfile enables running next-gen integration tests in a consistent Docker environment.
- It utilizes a multi-stage build to:
    - Download Go 1.23.0 and necessary third-party binaries.
    - Build the `cdc` binary for integration testing.
- The base image is `rockylinux:8`, requiring a switch from `yum` to `dnf` for package management.
- Package installations were adjusted to:
    - Remove `musl-dev` as it's not available on Rocky 8.
    - Use the default `mysql` client from Rocky 8 repositories instead of `mysql 5.7`.

Signed-off-by: tenfyzhong <[email protected]>
- Upgrades the base image for the local integration test Dockerfile from Rocky Linux 8 to 9.3.
- Updates package installations for Rocky Linux 9:
    - Adds 'dnf update -y' for package refresh.
    - Includes 'nmap-ncat' as a new dependency.
    - Removes 'curl' as it's not directly used in the main stage's installations.
    - Adapts to the default MySQL client available on Rocky Linux 9.
- Comments out several build steps related to downloading and building integration test binaries.

Signed-off-by: tenfyzhong <[email protected]>
- The MinIO console is not required for integration tests.
- Simplifies the MinIO startup command.
- Avoids binding an unnecessary port.

Signed-off-by: tenfyzhong <[email protected]>
…tion tests

- The new function name is more descriptive, specifically for `cdc cli changefeed` commands.
- Improves consistency and readability across test scripts.

Signed-off-by: tenfyzhong <[email protected]>
- Replaced direct `cdc cli changefeed` calls with `run_cdc_cli changefeed`.
- This ensures consistency in how `cdc cli` commands are invoked within integration tests.
- Leverages a common utility function for CLI invocation.

Signed-off-by: tenfyzhong <[email protected]>
- The keyspace used in API v2 integration tests can now be configured via the `KEYSPACE_NAME` environment variable.
- This allows running tests against different keyspaces, useful for isolated testing environments.
- If `KEYSPACE_NAME` is not set, it defaults to "default" to maintain existing behavior.

Signed-off-by: tenfyzhong <[email protected]>
…ipts

- Removed `set -x` and `export PS4` from `cdc_cli_changefeed` and `start_tidb_cluster_nextgen`.
- This reduces log verbosity during integration test runs, making output cleaner.
- These debugging flags are not needed for normal test execution.

Signed-off-by: tenfyzhong <[email protected]>
- Remove `2>&1` from the `cdc_cli_changefeed create` command.
- Ensure that only standard output is processed when extracting the changefeed ID.
- This prevents potential parsing issues if stderr contains unexpected messages.

Signed-off-by: tenfyzhong <[email protected]>
- Disabled `set -x` to reduce verbose output in integration test logs.
- `set -x` prints every command executed, which can make logs noisy and harder to parse during normal test runs.
- The corresponding `set +x` was removed as it is no longer needed.

Signed-off-by: tenfyzhong <[email protected]>
@tenfyzhong
Copy link
Contributor Author

/test pull-cdc-mysql-integration-light

- Replaced `cdc_cli_changefeed query` with `cdc.test cli query` in integration tests.
- This change aligns the test utility with the current command structure for interacting with changefeeds.

Signed-off-by: tenfyzhong <[email protected]>
@tenfyzhong
Copy link
Contributor Author

/test pull-cdc-mysql-integration-light

- Replaced `run_cdc_cli changefeed` with `cdc.test cli changefeed` for consistency.
- This change aligns with the newer testing utility for interacting with the CDC CLI.

Signed-off-by: tenfyzhong <[email protected]>
@tenfyzhong
Copy link
Contributor Author

/test pull-cdc-mysql-integration-light

1 similar comment
@tenfyzhong
Copy link
Contributor Author

/test pull-cdc-mysql-integration-light

- The `cdc.test` command is being used in the integration tests for changefeed operations.
- This command is an alias or a wrapper that is not necessary for the actual execution of the `cdc cli changefeed` command.
- Removing the `.test` suffix simplifies the command and ensures that the actual `cdc cli changefeed` command is being invoked.

Signed-off-by: tenfyzhong <[email protected]>
@tenfyzhong
Copy link
Contributor Author

/test pull-cdc-mysql-integration-light

- When the kernel type is classic, the default keyspace value will be returned.
- This change ensures that the default keyspace is correctly handled for the classic kernel type in the API.

Signed-off-by: tenfyzhong <[email protected]>
@tenfyzhong
Copy link
Contributor Author

/test pull-cdc-mysql-integration-light

- The kill command in the `stop_minio` function was not handling cases where the process might have already exited.
- Adding `|| true` ensures that the script continues to run even if the `kill` command fails (e.g., if the PID is no longer valid).
- This prevents potential test failures due to non-zero exit codes from the `kill` command.

Signed-off-by: tenfyzhong <[email protected]>
- Introduce the `NEXT_GEN` environment variable to control the next-generation features in integration tests.
- Set a default value of `0` if `NEXT_GEN` is not explicitly provided.

Signed-off-by: tenfyzhong <[email protected]>
@tenfyzhong
Copy link
Contributor Author

/test all

- Set `KEYSPACE_NAME` to 'keyspace1' if `NEXT_GEN` is 1.
- Otherwise, set `KEYSPACE_NAME` to 'default'.

Signed-off-by: tenfyzhong <[email protected]>
@tenfyzhong
Copy link
Contributor Author

/test all

- Previously, errors during storage closure were logged unconditionally.
- This change ensures that errors are only logged if an actual error occurs during the `storage.Close()` operation.
- This reduces unnecessary log noise when storage closure is successful.

Signed-off-by: tenfyzhong <[email protected]>
@tenfyzhong
Copy link
Contributor Author

/test pull-cdc-mysql-integration-heavy

- Removed `export PS4='+$(basename ${BASH_SOURCE}):${LINENO}:'`
- Removed `set -x`

These lines were added for debugging purposes and are no longer needed. The `set -x` command makes the script overly verbose, and the custom `PS4` setting is also not required for the test's functionality.

Signed-off-by: tenfyzhong <[email protected]>
- Removed `set -x` and `export PS4` from multiple test scripts.
- These flags were likely left in from previous debugging sessions and are not needed for normal test execution.

Signed-off-by: tenfyzhong <[email protected]>
@tenfyzhong
Copy link
Contributor Author

/test pull-cdc-mysql-integration-heavy

- Adjust assertion for the number of running capture servers after owner resignation.
- The test now correctly expects only one capture server to remain running, as the second capture server was already running and did not exit.

Signed-off-by: tenfyzhong <[email protected]>
@tenfyzhong
Copy link
Contributor Author

/test all

- The test `test_owner_retryable_error` was checking if there was only one capture server running.
- This check is redundant because the test's setup ensures only one capture server is running.
- Removing this check simplifies the test and avoids potential false negatives if the test environment changes unexpectedly.

Signed-off-by: tenfyzhong <[email protected]>
@tenfyzhong
Copy link
Contributor Author

/test all

- Added debug output to `owner.sh` to help diagnose potential issues with owner key retrieval.
- The `result` variable now captures the output of `cdc cli capture list`, and this output is printed before parsing for the owner ID. This will help in understanding what the `cdc cli capture list` command is returning in case of failures.

Signed-off-by: tenfyzhong <[email protected]>
@tenfyzhong
Copy link
Contributor Author

/test pull-cdc-mysql-integration-heavy

- Capture the full output of `cdc cli capture list` before parsing.
- This change ensures that the correct owner PID is retrieved even if there are multiple capture clients listed.
- The previous implementation assumed the first capture client was always the owner, which is not always true in all test scenarios.

Signed-off-by: tenfyzhong <[email protected]>
Copy link

ti-chi-bot bot commented Oct 15, 2025

@tenfyzhong: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-cdc-mysql-integration-heavy 5b7d786 link true /test pull-cdc-mysql-integration-heavy
pull-error-log-review 5f4f0da link true /test pull-error-log-review

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@tenfyzhong
Copy link
Contributor Author

/test pull-cdc-mysql-integration-heavy

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

release-note-none Denotes a PR that doesn't merit a release note. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.

Projects

None yet

1 participant