Skip to content

[SPARK-50767][SQL] Remove codegen of from_json #49411

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 4 commits into from

Conversation

panbingkun
Copy link
Contributor

@panbingkun panbingkun commented Jan 8, 2025

What changes were proposed in this pull request?

The pr aims to remove codegen of from_json.

Why are the changes needed?

Based on the discussion and testing with SubExprEliminationBenchmark #48466 (comment),
after implementing codegen for from_json, there is a performance regression in the withFilter scenario with subExprElimination = true, codegen = true
Let's remove it first and will submit it after we solve the above issue.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Pass GA & Manually test.

Was this patch authored or co-authored using generative AI tooling?

No.

@LuciferYang
Copy link
Contributor

Could you update the test results for SubExprEliminationBenchmark in this pr to ensure that the changes are as expected? @panbingkun

@panbingkun
Copy link
Contributor Author

panbingkun commented Jan 8, 2025

Could you update the test results for SubExprEliminationBenchmark in this pr to ensure that the changes are as expected? @panbingkun

Well, okay.

JDK17: https://github.com/panbingkun/spark/actions/runs/12666974432
JDK21: https://github.com/panbingkun/spark/actions/runs/12666979089

@dongjoon-hyun
Copy link
Member

dongjoon-hyun commented Jan 8, 2025

+1 for including SubExprEliminationBenchmark here. All the others are covered in the following.

@cloud-fan
Copy link
Contributor

Does it mean any time we add codegen support for some functions, there is a risk of perf regression? Are we sure from_json is the only one or it's simply because SubExprEliminationBenchmark tests from_json?

@panbingkun
Copy link
Contributor Author

Does it mean any time we add codegen support for some functions, there is a risk of perf regression? Are we sure from_json is the only one or it's simply because SubExprEliminationBenchmark tests from_json?

  • In the two scenarios tested in SubExprEliminationBenchmark , it has been found that the filter scenario has perf regression.
  • I can test other functions, such as those that have existed for a long time in history.

@LuciferYang
Copy link
Contributor

@panbingkun After giving it some more thought, we could try enabling -Xlog:compilation to test the corresponding case. Is it possible that the filter in this test case generated an extremely large method after enabling Codegen, which in turn affected the compilation optimization?

If it weren't a || filter involving all fields, would the impact be less severe?

@LuciferYang
Copy link
Contributor

I found the following content in the log:

19:22:37.212 main INFO CodeGenerator: Generated method too long to be JIT compiled: org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext is 58762 bytes

@panbingkun
Copy link
Contributor Author

panbingkun commented Jan 8, 2025

  • case 1 (3 filtering conditions with a data size of 1000000)
object FromJsonBenchmark extends SqlBasedBenchmark {
  import spark.implicits._

  def withFilter(rowsNum: Int, numIters: Int): Unit = {
    val benchmark = new Benchmark("from_json in Filter", rowsNum, output = output)

    withTempPath { path =>
      prepareDataInfo(benchmark)
      val numCols = 500
      val schema = writeWideRow(path.getAbsolutePath, rowsNum, numCols)

      val jsonValue = from_json($"value", schema)
      val predicate = jsonValue.getField(s"col0") >= lit(100000) ||
        jsonValue.getField(s"col50") >= lit(100000) ||
        jsonValue.getField(s"col123") >= lit(100000)

      val caseName = s"from_object, codegen: no"
      benchmark.addCase(caseName, numIters) { _ =>
        val df = spark.read
          .text(path.getAbsolutePath)
          .where(predicate)
        df.write.mode("overwrite").format("noop").save()
      }
      benchmark.run()
    }
  }

  override def runBenchmarkSuite(mainArgs: Array[String]): Unit = {
    val numIters = 3
    runBenchmark("Benchmark for performance of from_json codegen") {
      withFilter(1_000_000, numIters)
    }
  }
}
  • codegen for from_json
OpenJDK 64-Bit Server VM 17.0.10+7-LTS on Mac OS X 15.2
Apple M2
from_json in Filter:                      Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
from_object, codegen: yes                          61029          62195        1781          0.0       61028.8       1.0X

OpenJDK 64-Bit Server VM 17.0.10+7-LTS on Mac OS X 15.2
Apple M2
from_json in Filter:                      Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
from_object, codegen: yes                          61391          66201        7157          0.0       61391.2       1.0X

OpenJDK 64-Bit Server VM 17.0.10+7-LTS on Mac OS X 15.2
Apple M2
from_json in Filter:                      Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
from_object, codegen: yes                          60653          61195         481          0.0       60652.7       1.0X
  • non-codegen for from_json
OpenJDK 64-Bit Server VM 17.0.10+7-LTS on Mac OS X 15.2
Apple M2
from_json in Filter:                      Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
from_object, codegen: no                         61289          62508        1155          0.0       61288.6       1.0X

OpenJDK 64-Bit Server VM 17.0.10+7-LTS on Mac OS X 15.2
Apple M2
from_json in Filter:                      Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
from_object, codegen: no                          61219          61663         386          0.0       61218.9       1.0X

OpenJDK 64-Bit Server VM 17.0.10+7-LTS on Mac OS X 15.2
Apple M2
from_json in Filter:                      Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
from_object, codegen: no                          61056          61362         287          0.0       61055.9       1.0X

@panbingkun
Copy link
Contributor Author

panbingkun commented Jan 8, 2025

  • case 2 (1 filtering conditions with a data size of 100000)
object FromJsonBenchmark extends SqlBasedBenchmark {
  import spark.implicits._

  def withFilter(rowsNum: Int, numIters: Int): Unit = {
    val benchmark = new Benchmark("from_json in Filter", rowsNum, output = output)

    withTempPath { path =>
      prepareDataInfo(benchmark)
      val numCols = 500
      val schema = writeWideRow(path.getAbsolutePath, rowsNum, numCols)

      val jsonValue = from_json($"value", schema)
      val predicate = jsonValue.getField(s"col0") >= lit(100000)

      val caseName = s"from_object, codegen: no"
      benchmark.addCase(caseName, numIters) { _ =>
        val df = spark.read
          .text(path.getAbsolutePath)
          .where(predicate)
        df.write.mode("overwrite").format("noop").save()
      }
      benchmark.run()
    }
  }

  override def runBenchmarkSuite(mainArgs: Array[String]): Unit = {
    val numIters = 3
    runBenchmark("Benchmark for performance of from_json codegen") {
      withFilter(1_000_00, numIters)
    }
  }
}
  • codegen for from_json
OpenJDK 64-Bit Server VM 17.0.10+7-LTS on Mac OS X 15.2
Apple M2
from_json in Filter:                      Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
from_object, codegen: yes                          2325           2341          26          0.0       23249.6       1.0X

OpenJDK 64-Bit Server VM 17.0.10+7-LTS on Mac OS X 15.2
Apple M2
from_json in Filter:                      Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
from_object, codegen: yes                          2237           2266          36          0.0       22373.5       1.0X

OpenJDK 64-Bit Server VM 17.0.10+7-LTS on Mac OS X 15.2
Apple M2
from_json in Filter:                      Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
from_object, codegen: yes                          2317           2403          74          0.0       23172.3       1.0X
  • non-codegen for from_json
OpenJDK 64-Bit Server VM 17.0.10+7-LTS on Mac OS X 15.2
Apple M2
from_json in Filter:                      Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
from_object, codegen: no                           2264           2286          20          0.0       22639.3       1.0X


OpenJDK 64-Bit Server VM 17.0.10+7-LTS on Mac OS X 15.2
Apple M2
from_json in Filter:                      Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
from_object, codegen: no                           2475           3010         554          0.0       24752.2       1.0X


OpenJDK 64-Bit Server VM 17.0.10+7-LTS on Mac OS X 15.2
Apple M2
from_json in Filter:                      Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
from_object, codegen: no                           2315           2780         480          0.0       23150.6       1.0X

@panbingkun
Copy link
Contributor Author

From the above scenario, it seems that there is no a performance regression, and I am investigating other reasons.

@LuciferYang
Copy link
Contributor

before: code-gen-before.txt
after:after-code-gen.txt

It can be seen that after from_json enabling codegen, an huge processNext method is generated for filter. I suspect that this method is the reason why JIT cannot optimize it.

@LuciferYang
Copy link
Contributor

@cloud-fan @dongjoon-hyun @panbingkun How do we proceed with this issue?

I think the risk of generating huge filter methods has always existed, but it was hidden in this benchmark because from_json did not support code generation previously. So I believe the support of code generation in from_json is not the root cause.

As more functions come to support code generation, the probability of generating huge methods will increase, It should apply to more than just filters, right?.

Perhaps we need to find a more universal approach to split the generated methods in order to avoid this risk?

@panbingkun
Copy link
Contributor Author

Give me some time to look at the root cause.

@panbingkun
Copy link
Contributor Author

panbingkun commented Jan 9, 2025

In the withFilter scenario of SubExprEliminationBenchmark, the root cause as follows:

  val df = spark.read
              .text(path.getAbsolutePath)
              .where(predicate)
  df.write.mode("overwrite").format("noop").save()

Ultimately, optimize the 500 calls to from_json to only 1 call

there is no subexpressionElimination optimization here, 500 calls will ultimately be applied to JsonToStructs.

@panbingkun
Copy link
Contributor Author

panbingkun commented Jan 9, 2025

If we can implement subexpressionElimination optimization in the method FilterExec.doConsume, like ProjectExec.doConsume, that would be great.

override def doConsume(ctx: CodegenContext, input: Seq[ExprCode], row: ExprCode): String = {
val exprs = bindReferences[Expression](projectList, child.output)
val (subExprsCode, resultVars, localValInputs) = if (conf.subexpressionEliminationEnabled) {
// subexpression elimination
val subExprs = ctx.subexpressionEliminationForWholeStageCodegen(exprs)
val genVars = ctx.withSubExprEliminationExprs(subExprs.states) {
exprs.map(_.genCode(ctx))
}
(ctx.evaluateSubExprEliminationState(subExprs.states.values), genVars,
subExprs.exprCodesNeedEvaluate)
} else {
("", exprs.map(_.genCode(ctx)), Seq.empty)
}

cc @cloud-fan

@panbingkun
Copy link
Contributor Author

Does it mean any time we add codegen support for some functions, there is a risk of perf regression? Are we sure from_json is the only one or it's simply because SubExprEliminationBenchmark tests from_json?

So the answer to this question is:

In the SubExprEliminationBenchmark testing scenario, any expression that implements Codegen will have this issue, not just from_json, because our FilterExec is inconsistent in codegen and interpreted.

@cloud-fan
Copy link
Contributor

@panbingkun great investigation! +1 to implement subexpression elimination for FilterExec

@LuciferYang
Copy link
Contributor

@panbingkun great investigation! +1 to implement subexpression elimination for FilterExec

So this feature doesn't need to be revert, right?

@LuciferYang
Copy link
Contributor

any progress? @panbingkun

@panbingkun
Copy link
Contributor Author

any progress? @panbingkun

It is being implemented and will take some time.

@panbingkun
Copy link
Contributor Author

@LuciferYang
Copy link
Contributor

Got, although the performance result may degrade, I suggest submitting a pr to update the benchmark results of SubExprEliminationBenchmark for Apache Spark 4.0 and leave a TODO.

@dongjoon-hyun
Copy link
Member

Hi, @panbingkun , @cloud-fan , @LuciferYang . I ran the benchmark again as a part of regression check Today and I hit this still.

When I do the bisect, it also reached the original PR. So, although we did some follow-ups, is this still required?

2a1301133138ba0d5e2d969fc6428153903ffff1 is the first bad commit
commit 2a1301133138ba0d5e2d969fc6428153903ffff1
Author: panbingkun <[email protected]>
Date:   Wed Oct 16 09:10:06 2024 +0200

    [SPARK-49966][SQL] Codegen Support for JsonToStructs(`from_json`)

@LuciferYang
Copy link
Contributor

@panbingkun is currently on vacation. According to @panbingkun's intention, if FilterExec supports subexpressionElimination in the codegen scenario, this pr would not be needed. BUt, #49573 is still under development.

@dongjoon-hyun
Copy link
Member

Got it. Thank you for the info, @LuciferYang ~

dongjoon-hyun added a commit that referenced this pull request Jan 31, 2025
…Scala 2.13.16

### What changes were proposed in this pull request?

This PR aims to regenerate benchmark results of `master` after upgrading to Scala 2.13.16.
- #49478

### Why are the changes needed?

To check a regression again. Currently, there is only one known benchmark suite failure.
- #49411 (comment)

### Does this PR introduce _any_ user-facing change?

No, this updates only test results.

### How was this patch tested?

Manual review.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #49744 from dongjoon-hyun/bm_41.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
dongjoon-hyun added a commit that referenced this pull request Jan 31, 2025
…g to Scala 2.13.16

### What changes were proposed in this pull request?

This PR aims to regenerate benchmark results of `branch-4.0` after upgrading to Scala 2.13.16.
- #49478

### Why are the changes needed?

To check a regression again
- #49411 (comment)

### Does this PR introduce _any_ user-facing change?

No, this updates only test results.

### How was this patch tested?

Manual review.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #49741 from dongjoon-hyun/bm_40.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
@panbingkun panbingkun closed this Feb 5, 2025
@panbingkun
Copy link
Contributor Author

@panbingkun is currently on vacation. According to @panbingkun's intention, if FilterExec supports subexpressionElimination in the codegen scenario, this pr would not be needed. BUt, #49573 is still under development.

Yeah, that's right.
Currently, I want to focus on implementing subexpressionElimination for FilterExec in the WSC scenario.

Thanks @dongjoon-hyun @LuciferYang ❤️

MaxGekk pushed a commit that referenced this pull request Feb 18, 2025
### What changes were proposed in this pull request?

This reopens #49411 to fix the performance regression in 4.0.

### Why are the changes needed?

It's non-trivial to support CSE for Filter in whole stage codegen. We should not rush but revert the codegen support in 4.0 so that we have more time to get it right in 4.1.

Note: 4.0 also adds codegen support for a few other expressions, but `from_json` is special as it's quite expensive and the performance regression is very significant with it.

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

N/A

### Was this patch authored or co-authored using generative AI tooling?

no

Closes #49992 from cloud-fan/json.

Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Max Gekk <[email protected]>
MaxGekk pushed a commit that referenced this pull request Feb 18, 2025
### What changes were proposed in this pull request?

This reopens #49411 to fix the performance regression in 4.0.

### Why are the changes needed?

It's non-trivial to support CSE for Filter in whole stage codegen. We should not rush but revert the codegen support in 4.0 so that we have more time to get it right in 4.1.

Note: 4.0 also adds codegen support for a few other expressions, but `from_json` is special as it's quite expensive and the performance regression is very significant with it.

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

N/A

### Was this patch authored or co-authored using generative AI tooling?

no

Closes #49992 from cloud-fan/json.

Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Max Gekk <[email protected]>
(cherry picked from commit a8b694f)
Signed-off-by: Max Gekk <[email protected]>
Pajaraja pushed a commit to Pajaraja/spark that referenced this pull request Mar 6, 2025
### What changes were proposed in this pull request?

This reopens apache#49411 to fix the performance regression in 4.0.

### Why are the changes needed?

It's non-trivial to support CSE for Filter in whole stage codegen. We should not rush but revert the codegen support in 4.0 so that we have more time to get it right in 4.1.

Note: 4.0 also adds codegen support for a few other expressions, but `from_json` is special as it's quite expensive and the performance regression is very significant with it.

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

N/A

### Was this patch authored or co-authored using generative AI tooling?

no

Closes apache#49992 from cloud-fan/json.

Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Max Gekk <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants