
[Bug] createDataFrame in colab doesn't work #14555


Closed

psydok opened this issue Apr 26, 2025 · 1 comment

psydok commented Apr 26, 2025

Is there an existing issue for this?

  • I have searched the existing issues and did not find a match.

Who can help?

I keep getting this error no matter what I do or what examples from your documentation I use.

What are you working on?

I'm trying to run an NER model for quality assessment, but I don't understand where else to look for the cause of the error. I tried other versions and loaded a different dataset (as in the documentation, [(Alice, 1)]); the error is the same.

https://github.com/JohnSnowLabs/spark-nlp/blob/ac203f2906e0cc6ce7185cb929b90c708820ea9e/docs/_posts/maziyarpanahi/2020-02-03-wikiner_840B_300_it.md#how-to-use

Current Behavior

Traceback (most recent call last):
  File "/usr/local/lib/python3.11/dist-packages/pyspark/serializers.py", line 458, in dumps
    return cloudpickle.dumps(obj, pickle_protocol)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/pyspark/cloudpickle/cloudpickle_fast.py", line 73, in dumps
    cp.dump(obj)
  File "/usr/local/lib/python3.11/dist-packages/pyspark/cloudpickle/cloudpickle_fast.py", line 602, in dump
    return Pickler.dump(self, obj)
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/pyspark/cloudpickle/cloudpickle_fast.py", line 692, in reducer_override
    return self._function_reduce(obj)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/pyspark/cloudpickle/cloudpickle_fast.py", line 565, in _function_reduce
    return self._dynamic_function_reduce(obj)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/pyspark/cloudpickle/cloudpickle_fast.py", line 546, in _dynamic_function_reduce
    state = _function_getstate(func)
            ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/pyspark/cloudpickle/cloudpickle_fast.py", line 157, in _function_getstate
    f_globals_ref = _extract_code_globals(func.__code__)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/pyspark/cloudpickle/cloudpickle.py", line 334, in _extract_code_globals
    out_names = {names[oparg]: None for _, oparg in _walk_global_ops(co)}
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/pyspark/cloudpickle/cloudpickle.py", line 334, in <dictcomp>
    out_names = {names[oparg]: None for _, oparg in _walk_global_ops(co)}
                 ~~~~~^^^^^^^
IndexError: tuple index out of range

Expected Behavior

No errors.

Steps To Reproduce

# Install PySpark and Spark NLP
#! pip install -q pyspark==3.3.0 spark-nlp==4.2.8

# Install Spark NLP Display lib
#! pip install --upgrade -q spark-nlp-display

import json
import pandas as pd
import numpy as np

import sparknlp
import pyspark.sql.functions as F

from pyspark.ml import Pipeline
from pyspark.sql import SparkSession
from sparknlp.annotator import *
from sparknlp.base import *
from sparknlp.pretrained import PretrainedPipeline
from pyspark.sql.types import StringType, IntegerType

spark = sparknlp.start()  # start the Spark NLP session before creating DataFrames
empty_data = spark.createDataFrame([[('', 1)]], verifySchema=False)
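
The failure can apparently be reproduced without Spark NLP at all: on this setup (Python 3.11 with pyspark 3.3.0), any PySpark job that has to pickle a Python function seems to hit the same IndexError inside the bundled cloudpickle. A minimal sketch of that narrower reproduction (my assumption, not code from the original report):

# Sketch: trigger the same pickling failure without Spark NLP
# (assumes pyspark==3.3.0 running on Python 3.11, as in this report)
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("pickle-check").getOrCreate()

# map() forces the lambda to be serialized with the bundled cloudpickle,
# which is where the IndexError is raised
spark.sparkContext.parallelize([1, 2, 3]).map(lambda x: x + 1).collect()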

Spark NLP version and Apache Spark

Spark NLP version 4.2.8
Apache Spark version: 3.3.0

Type of Spark Application

Python Application

Java Version

No response

Java Home Directory

No response

Setup and installation

No response

Operating System and Version

No response

Link to your project (if available)

No response

Additional Information

Spark NLP version 4.2.8
Apache Spark version: 3.3.0

@psydok psydok changed the title createDataFrame in colab doesn't work. [Bug] createDataFrame in colab doesn't work Apr 26, 2025

prabod commented May 16, 2025

Hi @psydok,

This is due to Colab upgrading the Python version to 3.11. You need to upgrade to pyspark 3.4.0.

# Install PySpark and Spark NLP
! pip install -q pyspark==3.4.0 spark-nlp==6.0.0

# Install Spark NLP Display lib
! pip install --upgrade -q spark-nlp-display

import json
import pandas as pd
import numpy as np

import sparknlp
import pyspark.sql.functions as F

from pyspark.ml import Pipeline
from pyspark.sql import SparkSession
from sparknlp.annotator import *
from sparknlp.base import *
from sparknlp.pretrained import PretrainedPipeline
from pyspark.sql.types import StringType, IntegerType

spark = sparknlp.start()
empty_data = spark.createDataFrame([[('', 1)]], verifySchema=False)
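
As far as I can tell, the root cause is that Python 3.11 changed the LOAD_GLOBAL bytecode encoding, and the cloudpickle copy bundled with PySpark 3.3.x still indexes co_names with the raw oparg, which is what surfaces as the IndexError: tuple index out of range in _extract_code_globals above; the cloudpickle shipped with PySpark 3.4.0 handles the new bytecode. A quick sanity check you could run in the notebook before creating DataFrames (just a sketch, not an official check):

import sys
import pyspark
import sparknlp

# Print the runtime versions so mismatches are obvious in the notebook output
print("Python   :", sys.version.split()[0])
print("PySpark  :", pyspark.__version__)
print("Spark NLP:", sparknlp.version())

# PySpark < 3.4 is known to break on Python 3.11 (old bundled cloudpickle)
pyspark_major_minor = tuple(int(x) for x in pyspark.__version__.split(".")[:2])
if sys.version_info >= (3, 11) and pyspark_major_minor < (3, 4):
    raise RuntimeError("Upgrade to pyspark>=3.4.0 when running on Python 3.11+")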

@DevinTDHa Need to update the docs to set the pyspark version to 3.4.0 in Colab

@prabod prabod closed this as completed May 19, 2025