
Spark integration causes significant slowdowns or even the entire job to run out of memory and fail #1245

@martimlobao

Description

Environment

How do you use Sentry?
Sentry SaaS (sentry.io)

Which SDK and version?
sentry-sdk@… using the Spark integration

Steps to Reproduce

This is essentially an MWE of what our setup looks like:

from pyspark import SparkConf
from pyspark.sql import SparkSession

import sentry_sdk
from sentry_sdk.integrations.spark import SparkIntegration

sentry_sdk.init(SENTRY_DSN, integrations=[SparkIntegration()])

def get_spark_context(job_name):
    conf = SparkConf().setAppName(job_name)
    # Route Python worker processes through the custom daemon so the
    # Sentry SDK is initialized on the workers as well.
    conf = conf.set("spark.python.use.daemon", "true")
    conf = conf.set("spark.python.daemon.module", "sentry_daemon")
    session = SparkSession.builder.config(conf=conf).getOrCreate()
    # Ship the daemon module to the executors (path elided).
    session.sparkContext.addPyFile(".../sentry_daemon.py")
    return session.sparkContext

sc = get_spark_context("my_job")

# batches and some_function are placeholders for brevity.
for batch in batches:
    sc.textFile(batch.input_path).map(some_function).saveAsTextFile(batch.output_path)

sc.stop()
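
For reference, the sentry_daemon module referenced above follows the worker-daemon pattern described in Sentry's Spark integration docs; this is a rough sketch, with SENTRY_DSN left as a placeholder just like in the MWE:

import sentry_sdk
from sentry_sdk.integrations.spark import SparkWorkerIntegration
import pyspark.daemon as original_daemon

if __name__ == "__main__":
    # Initialize Sentry in each worker process, then hand control to the
    # stock PySpark daemon loop.
    sentry_sdk.init(SENTRY_DSN, integrations=[SparkWorkerIntegration()])
    original_daemon.manager()
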
  1. I'm able to get Sentry to log exceptions properly using the sentry_daemon module and the configuration above.
  2. However, I noticed that each batch took progressively longer: without the Spark integration, each batch takes ~3 hours to run, but with the integration enabled, the first batch took 3 hours, the second took 6, the third 9, and so on.
  3. I was able to work around the slowdown by creating and stopping the Spark context within each batch instead of keeping a single context for the entire loop (see the sketch after this list).
  4. However, with this workaround the job eventually fails with an out-of-memory error after a few batches, even though we have plenty of resources and have never hit this issue at this stage of our pipeline before.
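
A minimal sketch of the workaround from step 3, reusing the placeholder names from the MWE above (get_spark_context, batches, some_function):

for batch in batches:
    # Create a fresh context per batch instead of sharing one across the loop.
    sc = get_spark_context("my_job")
    try:
        sc.textFile(batch.input_path).map(some_function).saveAsTextFile(batch.output_path)
    finally:
        # Always tear the context down before the next batch starts.
        sc.stop()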

Expected Result

The job runs normally with the Sentry Spark integration enabled.

Actual Result

The job either takes progressively longer to finish or eventually runs out of memory and fails.

This is the stdout from the EMR cluster:

# java.lang.OutOfMemoryError: Java heap space
# -XX:OnOutOfMemoryError="kill -9 %p"
#   Executing /bin/sh -c "kill -9 13796"...

My understanding is that the Spark integration is not actively maintained and is considered somewhat experimental. Any help here would be greatly appreciated, even if it's just a potential workaround rather than an actual fix.
