Hi Pandas devs and Pandas community 🤗
I am reaching out to see whether you would be interested in an integration with the Hugging Face Hub. We have been hosting datasets on the Hub for a while and are now close to 3,000 public datasets, not counting private datasets.
In both the models and datasets areas of the Hugging Face ecosystem we use the `push_to_hub` functionality to upload datasets and models to the Hub in one line. Similarly, these assets can be loaded from the Hub in a single line with the `load_dataset` and `from_pretrained` functions, respectively.
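For context, here is a minimal sketch of that workflow with the `datasets` library today (the dataset and repository names below are placeholders I picked for illustration):

```python
from datasets import Dataset, load_dataset

# Load a public dataset from the Hub and convert it to a pandas DataFrame
df = load_dataset("imdb", split="train").to_pandas()

# Push a DataFrame back to the Hub by wrapping it in a datasets.Dataset
# (requires being logged in, e.g. via `huggingface-cli login`)
Dataset.from_pandas(df).push_to_hub("my_org/my_dataset")
```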
We wanted to ask whether you would be interested in adding the `huggingface_hub` dependency so that any `DataFrame` could be pushed to and pulled from the Hub.
Here are a few use cases where such functionality would add value:
- Save and document raw as well as processed datasets on the Hub (also as a backup)
- Datasets on the Hub have a preview (see an example here)
- Datasets on the Hub can be documented with a Readme and linked to models trained on them
- Datasets on the Hub are versioned (using `git-lfs` in the background)
- Share datasets with students for lectures or group projects
- Share datasets within an organization (publicly or privately)
Here is what such an integration could look like:
```python
# upload a DataFrame to the Hub:
df.push_to_hub("my_dataset", org="my_org")

# load a DataFrame from the Hub:
df = DataFrame.from_hub("my_dataset", org="my_org")
```
Here is the documentation on publishing files to the Hugging Face Hub with the `huggingface_hub` library:
https://github.com/huggingface/huggingface_hub/tree/main/src/huggingface_hub#publish-files-to-the-hub
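To make the idea a bit more concrete, here is a minimal sketch of what such methods could wrap internally, using only existing `huggingface_hub` utilities and a Parquet round-trip; the helper names, the Parquet format, and the `data.parquet` filename are assumptions on my side, not a committed design:

```python
import io

import pandas as pd
from huggingface_hub import HfApi, hf_hub_download


def push_dataframe_to_hub(df: pd.DataFrame, repo_id: str, filename: str = "data.parquet") -> None:
    """Serialize a DataFrame to Parquet and upload it to a Hub dataset repo.

    Assumes the user is already authenticated (e.g. via `huggingface-cli login`).
    """
    api = HfApi()
    api.create_repo(repo_id, repo_type="dataset", exist_ok=True)
    buffer = io.BytesIO()
    df.to_parquet(buffer)
    buffer.seek(0)
    api.upload_file(
        path_or_fileobj=buffer,
        path_in_repo=filename,
        repo_id=repo_id,
        repo_type="dataset",
    )


def load_dataframe_from_hub(repo_id: str, filename: str = "data.parquet") -> pd.DataFrame:
    """Download a Parquet file from a Hub dataset repo and read it into a DataFrame."""
    local_path = hf_hub_download(repo_id=repo_id, filename=filename, repo_type="dataset")
    return pd.read_parquet(local_path)
```

Under those assumptions, the proposed `push_to_hub` / `from_hub` methods would essentially be thin wrappers around these two helpers, plus authentication and error handling.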
I am curious to hear what you think about this, and please let me know if I can clarify anything!