
Spark integration causes significant slowdowns or even the entire job to run out of memory and fail #1245

@martimlobao

Description

Environment

How do you use Sentry?
Sentry SaaS (sentry.io)

Which SDK and version?
sentry-sdk@… using the Spark integration

Steps to Reproduce

This is essentially an MWE of what our setup looks like:

from pyspark import SparkConf
from pyspark.sql import SparkSession

import sentry_sdk
from sentry_sdk.integrations.spark import SparkIntegration

sentry_sdk.init(SENTRY_DSN, integrations=[SparkIntegration()])

def get_spark_context(job_name):
    conf = SparkConf().setAppName(job_name)
    # Route Python worker processes through the custom daemon so the
    # Sentry SDK is initialized on the workers as well.
    conf = conf.set("spark.python.use.daemon", "true")
    conf = conf.set("spark.python.daemon.module", "sentry_daemon")
    session = SparkSession.builder.config(conf=conf).getOrCreate()
    # Ship the daemon module to the executors (path elided).
    session.sparkContext.addPyFile(".../sentry_daemon.py")
    return session.sparkContext

sc = get_spark_context("my_job")

# batches and some_function are placeholders for brevity.
for batch in batches:
    sc.textFile(batch.input_path).map(some_function).saveAsTextFile(batch.output_path)

sc.stop()
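
For reference, the sentry_daemon module referenced above follows the worker-daemon pattern described in Sentry's Spark integration docs; this is a rough sketch, with SENTRY_DSN left as a placeholder just like in the MWE:

import sentry_sdk
from sentry_sdk.integrations.spark import SparkWorkerIntegration
import pyspark.daemon as original_daemon

if __name__ == "__main__":
    # Initialize Sentry in each worker process, then hand control to the
    # stock PySpark daemon loop.
    sentry_sdk.init(SENTRY_DSN, integrations=[SparkWorkerIntegration()])
    original_daemon.manager()
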
  1. I'm able to get Sentry to log exceptions properly using the sentry_daemon module and the configuration above.
  2. However, I noticed that each batch took progressively longer: without the Spark integration, each batch takes ~3 hours to run, but with the integration enabled, the first batch took 3 hours, the second took 6, the third 9, and so on.
  3. I was able to work around the slowdown by creating and stopping the Spark context within each batch instead of keeping a single context for the entire loop (see the sketch after this list).
  4. However, with this workaround the job eventually fails with an out-of-memory error after a few batches, even though we have plenty of resources and have never hit this issue at this stage of our pipeline before.
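
A minimal sketch of the workaround from step 3, reusing the placeholder names from the MWE above (get_spark_context, batches, some_function):

for batch in batches:
    # Create a fresh context per batch instead of sharing one across the loop.
    sc = get_spark_context("my_job")
    try:
        sc.textFile(batch.input_path).map(some_function).saveAsTextFile(batch.output_path)
    finally:
        # Always tear the context down before the next batch starts.
        sc.stop()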

Expected Result

The job runs normally with the Sentry Spark integration enabled.

Actual Result

The job either takes progressively longer to finish or eventually runs out of memory and fails.

This is the stdout from the EMR cluster:

# java.lang.OutOfMemoryError: Java heap space
# -XX:OnOutOfMemoryError="kill -9 %p"
#   Executing /bin/sh -c "kill -9 13796"...

My understanding is that the Spark integration is not actively maintained and is considered somewhat experimental. Any help here would be greatly appreciated, even if it's just a potential workaround rather than an actual fix.
