Skip to content

[WIP] Fix Arrow conversion for TokenSpanArray with multi-doc #182

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Mar 30, 2021

Conversation

BryanCutler
Copy link
Member

@BryanCutler BryanCutler commented Mar 30, 2021

This fixes Arrow conversion of a TokenSpanArray that possibly uses multiple SpanArrays with multiple documents. The conversion follows these steps:

  1. Get SpanArrays used in the TokenSpanArray and convert these to arrow, each is a struct with begins, ends, and dictionary array with the document
  2. Concat the arrow SpanArrays together to make an arrow list of them
  3. Build an arrow dictionary array that maps the index of the SpanArray used by each element of the TokenSpanArray to the SpanArray in the arrow list
  4. create a struct array with token begins, token ends, and the token dictionary array

From #179

@BryanCutler
Copy link
Member Author

@frreiss this is still a bit of a WIP, need to add some tests like actual multi-doc array. Please have a look at the implementation when you can. It's a little complicated since we have dictionary arrays that point to other dictionary arrays, but overall I think it's a better approach than before.

Some areas are rough due to some issues in pyarrow, that probably could be made into a JIRA to be improved upstream, but I'll have to look into it more. Also, it seems this only works with pyarrow >= 2.0.0, so we might want to bump up the minimum version.

@BryanCutler BryanCutler requested a review from frreiss March 30, 2021 23:01
Copy link
Member

@frreiss frreiss left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. I didn't know you could stuff structured data into Arrow dictionaries -- saves a lot of work :)

@frreiss frreiss merged commit 59d5361 into CODAIT:master Mar 30, 2021
@frreiss
Copy link
Member

frreiss commented Mar 30, 2021

Merging this in its current form, as we need to have at least a semi-functional version of serialization in master so we can cut another release. @BryanCutler can you make a second PR with those testing improvements when they're ready?

@BryanCutler
Copy link
Member Author

Thanks @frreiss . Yeah, I'll do a followup with testing requirements. What do you think of bumping the minimum pyarrow to 2.0.0? Alternatively, we could just raise an error if the user tries to serialize with a lesser version.

@BryanCutler BryanCutler deleted the arrow-TokenSpan-multidoc-179 branch March 31, 2021 16:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants