
[Bug] createDataFrame in colab doesn't work #14555


Closed

psydok opened this issue Apr 26, 2025 · 1 comment

psydok commented Apr 26, 2025

Is there an existing issue for this?

  • I have searched the existing issues and did not find a match.

Who can help?

I keep getting this error no matter what I do or what examples from your documentation I use.

What are you working on?

I'm trying to run an NER model for quality assessment, but I don't understand where else to look for the cause of the error. I tried other versions and loaded a different dataset (as in the documentation, [(Alice, 1)]); the error is the same.

https://github.com/JohnSnowLabs/spark-nlp/blob/ac203f2906e0cc6ce7185cb929b90c708820ea9e/docs/_posts/maziyarpanahi/2020-02-03-wikiner_840B_300_it.md#how-to-use

Current Behavior

Traceback (most recent call last):
  File "/usr/local/lib/python3.11/dist-packages/pyspark/serializers.py", line 458, in dumps
    return cloudpickle.dumps(obj, pickle_protocol)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/pyspark/cloudpickle/cloudpickle_fast.py", line 73, in dumps
    cp.dump(obj)
  File "/usr/local/lib/python3.11/dist-packages/pyspark/cloudpickle/cloudpickle_fast.py", line 602, in dump
    return Pickler.dump(self, obj)
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/pyspark/cloudpickle/cloudpickle_fast.py", line 692, in reducer_override
    return self._function_reduce(obj)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/pyspark/cloudpickle/cloudpickle_fast.py", line 565, in _function_reduce
    return self._dynamic_function_reduce(obj)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/pyspark/cloudpickle/cloudpickle_fast.py", line 546, in _dynamic_function_reduce
    state = _function_getstate(func)
            ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/pyspark/cloudpickle/cloudpickle_fast.py", line 157, in _function_getstate
    f_globals_ref = _extract_code_globals(func.__code__)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/pyspark/cloudpickle/cloudpickle.py", line 334, in _extract_code_globals
    out_names = {names[oparg]: None for _, oparg in _walk_global_ops(co)}
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/pyspark/cloudpickle/cloudpickle.py", line 334, in <dictcomp>
    out_names = {names[oparg]: None for _, oparg in _walk_global_ops(co)}
                 ~~~~~^^^^^^^
IndexError: tuple index out of range

Expected Behavior

No errors.

Steps To Reproduce

# Install PySpark and Spark NLP
#! pip install -q pyspark==3.3.0 spark-nlp==4.2.8

# Install Spark NLP Display lib
#! pip install --upgrade -q spark-nlp-display

import json
import pandas as pd
import numpy as np

import sparknlp
import pyspark.sql.functions as F

from pyspark.ml import Pipeline
from pyspark.sql import SparkSession
from sparknlp.annotator import *
from sparknlp.base import *
from sparknlp.pretrained import PretrainedPipeline
from pyspark.sql.types import StringType, IntegerType

spark = sparknlp.start()  # start the Spark NLP session before creating DataFrames
empty_data = spark.createDataFrame([[('', 1)]], verifySchema=False)
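
The failure can apparently be reproduced without Spark NLP at all: on this setup (Python 3.11 with pyspark 3.3.0), any PySpark job that has to pickle a Python function seems to hit the same IndexError inside the bundled cloudpickle. A minimal sketch of that narrower reproduction (my assumption, not code from the original report):

# Sketch: trigger the same pickling failure without Spark NLP
# (assumes pyspark==3.3.0 running on Python 3.11, as in this report)
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("pickle-check").getOrCreate()

# map() forces the lambda to be serialized with the bundled cloudpickle,
# which is where the IndexError is raised
spark.sparkContext.parallelize([1, 2, 3]).map(lambda x: x + 1).collect()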

Spark NLP version and Apache Spark

Spark NLP version 4.2.8
Apache Spark version: 3.3.0

Type of Spark Application

Python Application

Java Version

No response

Java Home Directory

No response

Setup and installation

No response

Operating System and Version

No response

Link to your project (if available)

No response

Additional Information

Spark NLP version 4.2.8
Apache Spark version: 3.3.0

@psydok psydok changed the title createDataFrame in colab doesn't work. [Bug] createDataFrame in colab doesn't work Apr 26, 2025

prabod commented May 16, 2025

Hi @psydok,

This is due to Colab upgrading the Python version to 3.11. You need to upgrade to pyspark 3.4.0.

# Install PySpark and Spark NLP
! pip install -q pyspark==3.4.0 spark-nlp==6.0.0

# Install Spark NLP Display lib
! pip install --upgrade -q spark-nlp-display

import json
import pandas as pd
import numpy as np

import sparknlp
import pyspark.sql.functions as F

from pyspark.ml import Pipeline
from pyspark.sql import SparkSession
from sparknlp.annotator import *
from sparknlp.base import *
from sparknlp.pretrained import PretrainedPipeline
from pyspark.sql.types import StringType, IntegerType

spark = sparknlp.start()
empty_data = spark.createDataFrame([[('', 1)]], verifySchema=False)
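
As far as I can tell, the root cause is that Python 3.11 changed the LOAD_GLOBAL bytecode encoding, and the cloudpickle copy bundled with PySpark 3.3.x still indexes co_names with the raw oparg, which is what surfaces as the IndexError: tuple index out of range in _extract_code_globals above; the cloudpickle shipped with PySpark 3.4.0 handles the new bytecode. A quick sanity check you could run in the notebook before creating DataFrames (just a sketch, not an official check):

import sys
import pyspark
import sparknlp

# Print the runtime versions so mismatches are obvious in the notebook output
print("Python   :", sys.version.split()[0])
print("PySpark  :", pyspark.__version__)
print("Spark NLP:", sparknlp.version())

# PySpark < 3.4 is known to break on Python 3.11 (old bundled cloudpickle)
pyspark_major_minor = tuple(int(x) for x in pyspark.__version__.split(".")[:2])
if sys.version_info >= (3, 11) and pyspark_major_minor < (3, 4):
    raise RuntimeError("Upgrade to pyspark>=3.4.0 when running on Python 3.11+")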

@DevinTDHa Need to update the docs to set the pyspark version to 3.4.0 in Colab

@prabod prabod closed this as completed May 19, 2025