ENH: Integration with Hugging Face Hub #46000

@lvwerra

Hi Pandas devs and Pandas community 🤗

I am reaching out to see whether you would be interested in an integration with the Hugging Face Hub. We have been hosting datasets on the Hub for a while and are now close to 3000 public datasets, not counting the private ones.

In both the models and datasets areas of the Hugging Face ecosystem we use the push_to_hub functionality to upload datasets and models to the Hub in one line. Similarly, these assets can be loaded from the Hub in a single line with the load_dataset and from_pretrained functions, respectively.

We wanted to ask whether you would be interested in adding the huggingface_hub dependency so that any DataFrame could be pushed to and pulled from the Hub.

Here are a few use cases where such functionality would add value:

  • Save and document raw as well as processed datasets on the hub (also as backup)
    • Datasets on the Hub have a preview (see an example here)
    • Datasets on the Hub can be documented with a Readme and linked to models trained on them
    • Datasets on the Hub are versioned (using git-lfs in the background)
  • Share datasets with students for lectures or group projects
  • Share datasets within an organization (publicly or privately)

Here is what such an integration could look like:

# upload a DataFrame to the Hub:
df.push_to_hub("my_dataset", org="my_org")

# load a DataFrame from the Hub:
df = DataFrame.from_hub("my_dataset", org="my_org")

Here is the documentation on publishing files on the Hugging Face Hub using the huggingface_hub library:
https://github.com/huggingface/huggingface_hub/tree/main/src/huggingface_hub#publish-files-to-the-hub
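To make the idea more concrete, here is a minimal sketch of how the two proposed calls might map onto existing huggingface_hub functions. It is only an illustration under stated assumptions, not a proposed implementation: the helper names `build_repo_id`, `push_to_hub` and `from_hub`, the `data.csv` file name, and the choice of CSV as the storage format are all hypothetical. Running the upload also assumes you are already logged in to the Hub.

```python
from __future__ import annotations

from typing import TYPE_CHECKING, Optional

if TYPE_CHECKING:  # pandas is only needed here for type hints
    import pandas as pd

# Hypothetical file name used to store the frame inside the dataset repo.
DATA_FILE = "data.csv"


def build_repo_id(name: str, org: Optional[str] = None) -> str:
    """Map the proposed (name, org) arguments to a Hub repo id."""
    return f"{org}/{name}" if org else name


def push_to_hub(df: pd.DataFrame, name: str, org: Optional[str] = None) -> None:
    """Hypothetical helper: upload a DataFrame as a single CSV file."""
    from huggingface_hub import HfApi  # optional dependency, imported lazily

    repo_id = build_repo_id(name, org)
    api = HfApi()
    # Create the dataset repo if it does not exist yet.
    api.create_repo(repo_id=repo_id, repo_type="dataset", exist_ok=True)
    # upload_file accepts raw bytes, so no temporary file is needed.
    api.upload_file(
        path_or_fileobj=df.to_csv(index=False).encode("utf-8"),
        path_in_repo=DATA_FILE,
        repo_id=repo_id,
        repo_type="dataset",
    )


def from_hub(name: str, org: Optional[str] = None) -> pd.DataFrame:
    """Hypothetical helper: download the CSV file and load it."""
    import pandas as pd
    from huggingface_hub import hf_hub_download

    # Assumption: without an org, the caller may need to pass the full
    # "user/name" id as `name` for the download to resolve.
    local_path = hf_hub_download(
        repo_id=build_repo_id(name, org),
        filename=DATA_FILE,
        repo_type="dataset",
    )
    return pd.read_csv(local_path)
```

With these helpers, `push_to_hub(df, "my_dataset", org="my_org")` would upload to the dataset repo `my_org/my_dataset`, and `from_hub("my_dataset", org="my_org")` would read it back. A real integration would probably want a format that preserves dtypes better than CSV does.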

I am curious to hear what you think about this. Please let me know if I can clarify anything!

cc @osanseviero @julien-c
