[SPARK-39650][SS] Fix incorrect value schema in streaming deduplication with backward compatibility #37041


Closed
HeartSaVioR wants to merge 3 commits into apache:master from HeartSaVioR:SPARK-39650

Conversation

HeartSaVioR (Contributor)

What changes were proposed in this pull request?

This PR proposes to fix the incorrect value schema in streaming deduplication. The operator stores an empty row having a single column with null (using `NullType`), but the value schema is specified as all columns, which leads to incorrect behavior from the state store schema compatibility checker.

This PR proposes to set the value schema to `StructType(Array(StructField("__dummy__", NullType)))` to match the empty row. With this change, streaming queries creating their checkpoint after this fix will work smoothly.

To avoid breaking existing streaming queries that already have the incorrect value schema, this PR proposes to disable the value schema check for streaming deduplication. Disabling the value check already existed for format validation (we have two different checkers for the state store), but it has been missing for the state store schema compatibility check. To avoid adding another config, this PR leverages the existing config that format validation is using.
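
For illustration, here is a minimal sketch of the value schema described above and the single "empty" row it is meant to match. The schema literal is quoted from this PR; the `emptyValueRow` construction is an assumption about how such a row could be built, not necessarily the operator's exact code.

```
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.UnsafeProjection
import org.apache.spark.sql.types.{DataType, NullType, StructField, StructType}

// Value schema matching what is actually stored per deduplication key:
// a single column of NullType, rather than all output columns.
val dedupValueSchema = StructType(Array(StructField("__dummy__", NullType)))

// Assumed construction of the stored "empty" row: one null cell of NullType.
val emptyValueRow = UnsafeProjection.create(Array[DataType](NullType)).apply(InternalRow(null))
```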

Why are the changes needed?

This is a bug fix. Suppose the streaming query below:

```
// df has the columns `a`, `b`, `c`
val df = spark.readStream.format("...").load()
val query = df.dropDuplicates("a").writeStream.format("...").start()
```

While the query is running, `df` can produce a different set of columns (e.g. `a`, `b`, `c`, `d`) from the same source due to schema evolution. Since we only deduplicate rows by column `a`, the schema change should not matter for streaming deduplication, but before this fix the state store schema checker throws an error saying "value schema is not compatible".
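
To make the intended behavior concrete, the self-contained sketch below shows why skipping the value schema comparison lets the evolved schema pass. The `checkCompatibility` helper and its `ignoreValueSchema` flag are illustrative stand-ins, not the actual Spark checker.

```
import scala.util.Try

import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// Toy compatibility check: the key schema must match; the value schema is compared
// only when ignoreValueSchema is false. (Stand-in for the real checker.)
def checkCompatibility(
    storedKey: StructType, storedValue: StructType,
    newKey: StructType, newValue: StructType,
    ignoreValueSchema: Boolean): Unit = {
  require(storedKey == newKey, "key schema is not compatible")
  if (!ignoreValueSchema) {
    require(storedValue == newValue, "value schema is not compatible")
  }
}

val keySchema = StructType(Seq(StructField("a", StringType)))
val storedValue = StructType(Seq(
  StructField("a", StringType), StructField("b", IntegerType), StructField("c", IntegerType)))
// Schema evolution on the source adds a new column `d`.
val evolvedValue = storedValue.add(StructField("d", IntegerType))

// With the value check ignored (as for streaming deduplication), no error is raised.
val ret = Try(checkCompatibility(keySchema, storedValue, keySchema, evolvedValue,
  ignoreValueSchema = true)).toEither.fold(Some(_), _ => None)
assert(ret.isEmpty)
```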

Does this PR introduce any user-facing change?

No. This is basically a bug fix that end users wouldn't notice unless they had already encountered the bug.

How was this patch tested?

New tests.

HeartSaVioR (Contributor, Author)

cc. @zsxwing @viirya

HeartSaVioR (Contributor, Author)

This PR does not deal with overwriting the incorrect value schema file. If we want to leverage the schema file for understanding/reading state, ideally we should keep the schema file up to date. But we also don't overwrite the schema file when the schema is compatible.

We can track the effort as a separate JIRA.

viirya (Member) left a comment

Only one minor comment

```
@@ -515,7 +515,12 @@ object StateStore extends Logging {
val checker = new StateSchemaCompatibilityChecker(storeProviderId, hadoopConf)
// regardless of configuration, we check compatibility to at least write schema file
// if necessary
val ret = Try(checker.check(keySchema, valueSchema)).toEither.fold(Some(_), _ => None)
// if the format validation for value schema is disabled, we also disable the schema
```

viirya (Member)

Should we also add a code comment at `formatValidationCheckValue` in `StateStoreConf`?

HeartSaVioR (Contributor, Author)

OK, I'll leave a comment that the config is in effect for both checkers.
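
For reference, a rough sketch of what such a documented flag could look like; this is an assumption sketched outside the real `StateStoreConf` source, reusing only the field name mentioned in the review comment above.

```
// Sketch only: a simplified stand-in for StateStoreConf, not the real class.
class StateStoreConfSketch(extraOptions: Map[String, String] = Map.empty) {
  // Whether to validate the value format of state rows.
  // NOTE: this flag is in effect for both checkers - when an operator such as streaming
  // deduplication disables it, the state store schema compatibility checker skips the
  // value schema comparison as well.
  val formatValidationCheckValue: Boolean =
    extraOptions.getOrElse("formatValidationCheckValue", "true") == "true"
}
```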

viirya (Member) commented Jul 1, 2022

> This PR does not deal with overwriting the incorrect value schema file. If we want to leverage the schema file for understanding/reading state, ideally we should keep the schema file up to date. But we also don't overwrite the schema file when the schema is compatible.

Is that important? The state value already doesn't match the value schema, does it? I'm not sure it matters whether the schema is up to date.

HeartSaVioR (Contributor, Author) commented Jul 1, 2022

It's not important with the current features of Structured Streaming. It's more of a "future-proofing" concern - when we plan to build a feature like reading the state (I actually had a PR for that which was forgotten...), keeping the schema up to date would give the ideal UX to end users; otherwise they would see an outdated schema, or even fail to read the state.

HeartSaVioR (Contributor, Author)

Missed to cc @xuanyuanking, as he authored the other checker.

HeartSaVioR (Contributor, Author)

I'll keep this PR open for about a day to seek further reviews. If there are no outstanding comments, I'll merge it tomorrow.

HeartSaVioR (Contributor, Author) commented Jul 2, 2022

Thanks! Merging to master/3.3/3.2.

HeartSaVioR added a commit that referenced this pull request Jul 2, 2022
[SPARK-39650][SS] Fix incorrect value schema in streaming deduplication with backward compatibility

Closes #37041 from HeartSaVioR/SPARK-39650.

Authored-by: Jungtaek Lim <[email protected]>
Signed-off-by: Jungtaek Lim <[email protected]>
(cherry picked from commit fe53603)
Signed-off-by: Jungtaek Lim <[email protected]>
HeartSaVioR added a commit that referenced this pull request Jul 2, 2022
[SPARK-39650][SS] Fix incorrect value schema in streaming deduplication with backward compatibility

Closes #37041 from HeartSaVioR/SPARK-39650.

Authored-by: Jungtaek Lim <[email protected]>
Signed-off-by: Jungtaek Lim <[email protected]>
(cherry picked from commit fe53603)
Signed-off-by: Jungtaek Lim <[email protected]>
sunchao pushed a commit to sunchao/spark that referenced this pull request Jun 2, 2023
[SPARK-39650][SS] Fix incorrect value schema in streaming deduplication with backward compatibility

Closes apache#37041 from HeartSaVioR/SPARK-39650.

Authored-by: Jungtaek Lim <[email protected]>
Signed-off-by: Jungtaek Lim <[email protected]>
(cherry picked from commit fe53603)
Signed-off-by: Jungtaek Lim <[email protected]>