Jsonlines export error #2615

Closed
TevenLeScao opened this issue Jul 9, 2021 · 10 comments · Fixed by #2617
Labels: bug (Something isn't working)

Comments

@TevenLeScao
Contributor

Describe the bug

When exporting large datasets to JSON Lines (c4 in my case), the created file has an error every 9999 lines: the 9999th and 10000th lines are concatenated, which breaks the JSON Lines format. This looks related to batching, which uses a batch size of 10000 by default.

Steps to reproduce the bug

This is what I'm running:

in python:

from datasets import load_dataset
ptb = load_dataset("ptb_text_only")
ptb["train"].to_json("ptb.jsonl")

then out of python:

head -10000 ptb.jsonl

Expected results

Properly separated lines

Actual results

The last line is a concatenation of two lines
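
Rather than inspecting the output of head by eye, the broken lines can be located with a small script. This is a minimal sketch using only the standard library; the helper name find_bad_jsonl_lines is hypothetical and not part of datasets. It relies on the fact that two JSON objects glued onto one line do not parse as a single JSON value.

```python
import json


def find_bad_jsonl_lines(path):
    """Return the 1-based numbers of lines that are not a single JSON value.

    Hypothetical helper for illustration; not part of the datasets library.
    """
    bad = []
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            try:
                json.loads(line)
            except json.JSONDecodeError:
                # Two concatenated objects fail here with "Extra data".
                bad.append(lineno)
    return bad
```

Calling find_bad_jsonl_lines("ptb.jsonl") on the exported file should list the affected line numbers, which makes it easy to check whether they fall on batch boundaries.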

Environment info

  • datasets version: 1.9.1.dev0
  • Platform: Linux-5.4.0-1046-gcp-x86_64-with-Ubuntu-18.04-bionic
  • Python version: 3.6.9
  • PyArrow version: 4.0.1
@TevenLeScao TevenLeScao added the bug Something isn't working label Jul 9, 2021
@albertvillanova
Member

Thanks for reporting @TevenLeScao! I'm having a look...

@TevenLeScao
Contributor Author

(not sure what just happened with the assignments, sorry)

@TevenLeScao
Contributor Author

For some reason this happens only on Python 3.6 and not on Python 3.8 (both datasets versions are on master).

@albertvillanova
Member

@TevenLeScao we are using pandas to serialize the dataset to JSON Lines. So it must be due to pandas. Could you please check the pandas version causing the issue?

@albertvillanova
Member

@TevenLeScao I have just checked it: this was a bug in pandas and it was fixed in version 1.2: pandas-dev/pandas#36898
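
To illustrate why a per-batch serializer can produce this symptom, here is a minimal standard-library sketch. It assumes the serializer omits the newline after the last record of each batch, which is consistent with the concatenation at batch boundaries reported above. The functions serialize_batch and write_jsonl are hypothetical stand-ins, not the datasets or pandas code, and the guard shown is not necessarily the change made in #2617.

```python
import json


def serialize_batch(rows, trailing_newline):
    """Hypothetical stand-in for a per-batch JSON Lines serializer."""
    text = "\n".join(json.dumps(r) for r in rows)
    return text + "\n" if trailing_newline else text


def write_jsonl(rows, path, batch_size=10000, trailing_newline=False, guard=True):
    """Write rows batch by batch, optionally guarding the batch boundary."""
    with open(path, "w", encoding="utf-8") as f:
        for start in range(0, len(rows), batch_size):
            chunk = serialize_batch(rows[start:start + batch_size], trailing_newline)
            # Without this guard, the last row of one batch and the first row
            # of the next batch end up on the same line.
            if guard and not chunk.endswith("\n"):
                chunk += "\n"
            f.write(chunk)


def count_lines(path):
    """Count the lines in a text file."""
    with open(path, encoding="utf-8") as f:
        return sum(1 for _ in f)
```

With five rows and batch_size=2, disabling the guard yields three lines instead of five, because each batch boundary merges two records. Enabling it restores one record per line regardless of what the serializer does.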

@lhoestq
Member

lhoestq commented Jul 9, 2021

Thanks! I'm creating a PR.

@albertvillanova
Member

Well, I thought it was me who had taken on this issue... 😅

@lhoestq
Member

lhoestq commented Jul 9, 2021

Sorry, I was also talking to teven offline so I already had the PR ready before noticing x)

@albertvillanova
Member

I was also already working on my PR... Never mind. Next time we should check whether somebody is (self-)assigned to an issue and still working on it before taking it over... 😄

@lhoestq
Member

lhoestq commented Jul 9, 2021

The fix is available on master, @TevenLeScao. Thanks for reporting!

3 participants