
Conversation

@rmitsch (Collaborator) commented Sep 19, 2025

Description

Adds an example uv script for zero-shot classification using sieves. The script provides a complete workflow for classifying Hugging Face datasets, with automatic device detection, structured outputs, and dataset publishing.

Inspired by https://huggingface.co/datasets/uv-scripts/classification.

Caution

I don't have access to a pro/enterprise HF account yet, so I couldn't test running this with hf jobs. Maybe @davanstrien or @ivyleavedtoadflax can give it a shot?

Related Issues

-

Changes Made

  • New example script: examples/create_classification_dataset.py - A uv-compatible script for zero-shot text classification
  • Enhanced Classification task: Improved to_hf_dataset() method with proper label normalization and multi-label support
  • Package metadata: Updated setup.py with corrected package name

Key Features of the Example Script:

  • Dual model support: Handles both dedicated zero-shot classification models and general language models via Outlines
  • Multi-label classification: Optional multi-label mode with configurable thresholds
  • Comprehensive preprocessing: Text validation, length limits, and cleaning
  • Rich logging: Detailed statistics and distribution reporting
  • Label descriptions: Optional semantic descriptions for better classification
  • HF Hub integration: Automatic authentication and dataset publishing
  • uv compatibility: Uses PEP 723 inline script metadata for dependency management; see the sketch below
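
For reference, PEP 723 lets uv resolve a script's dependencies from an inline comment block at the top of the file. A minimal sketch of what such a header looks like (the dependency list shown here is an assumption for illustration, not necessarily the script's actual one):

# /// script
# requires-python = ">=3.10"
# dependencies = [
#     "sieves",
#     "datasets",
# ]
# ///

With a header like this in place, uv creates an ephemeral environment with those dependencies on `uv run`, so no manual setup is needed.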

Usage Examples:

# Single-label
uv run examples/create_classification_dataset.py \
  --input-dataset stanfordnlp/imdb \
  --column text \
  --labels "positive,ambivalent,negative" \
  --model MoritzLaurer/deberta-v3-large-zeroshot-v2.0 \
  --output-dataset your-username/imdb-classified

# Multi-label
uv run examples/create_classification_dataset.py \
  --input-dataset ag_news \
  --column text \
  --labels "world,sports,business,science" \
  --multi-label \
  --model MoritzLaurer/deberta-v3-large-zeroshot-v2.0 \
  --output-dataset your-username/agnews-multilabel
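
To illustrate the configurable thresholds mentioned under "Multi-label classification" above: in multi-label mode each label is scored independently, and every label above a cutoff is kept rather than only the single best one. A minimal Python sketch of that selection step; the function name, threshold default, and score format are assumptions, not the script's actual API:

def select_labels(scores: dict[str, float], threshold: float = 0.5) -> list[str]:
    # Multi-label mode: keep every label whose score clears the threshold.
    return [label for label, score in scores.items() if score >= threshold]

# Example: select_labels({"world": 0.81, "sports": 0.12, "business": 0.64})
# -> ["world", "business"]

In single-label mode, by contrast, only the top-scoring label would be kept.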

Checklist

  • Tests have been extended to cover changes in functionality
  • Existing and new tests succeed
  • Documentation updated (if applicable)
  • Related issues linked

Screenshots/Examples (if applicable)

The script provides comprehensive logging output showing:

  • Device selection (GPU/CPU); see the sketch after this list
  • Dataset loading and preprocessing statistics
  • Classification progress and results
  • Label distribution analysis
  • Success rates and error handling
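
The device selection mentioned in the first bullet typically follows a simple cascade. A minimal sketch of that logic, assuming a PyTorch backend (the script's actual implementation may differ):

import torch

def pick_device() -> str:
    # Prefer CUDA, then Apple's MPS backend, and fall back to CPU.
    if torch.cuda.is_available():
        return "cuda"
    if torch.backends.mps.is_available():
        return "mps"
    return "cpu"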

@rmitsch self-assigned this Sep 19, 2025
@rmitsch marked this pull request as draft September 19, 2025 19:08
codecov bot commented Sep 19, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #155   +/-   ##
=======================================
  Coverage   92.56%   92.56%           
=======================================
  Files          64       64           
  Lines        3281     3281           
=======================================
  Hits         3037     3037           
  Misses        244      244           

@rmitsch marked this pull request as ready for review September 19, 2025 19:57
@rmitsch changed the title from "feat: Add comprehensive zero-shot classification example script" to "feat: Add zero-shot classification example script" Sep 19, 2025
@rmitsch changed the title from "feat: Add zero-shot classification example script" to "feat: Add zero-shot classification example uv script" Sep 19, 2025
@ivyleavedtoadflax (Contributor) commented Sep 22, 2025

Managed to get it to run with:

hf jobs uv run examples/create_classification_dataset.py classify \
    --input-dataset imdb \
    --column text \
    --labels "positive,ambivalent,negative" \
    --model MoritzLaurer/deberta-v3-large-zeroshot-v2.0 \
    --output-dataset mattupson/imdb-classified-v2 \
    --max-samples 500 \
    --flavor a10g-small

but it seems to fail due to

 ImportError: libGL.so.1: cannot open shared object file: No such file or directory

  This error occurred when the script tried to import OpenCV (cv2):

  File "/root/.cache/uv/environments-v2/script-912247c0edd68a55/lib/python3.12/site-packag
  es/cv2/__init__.py", line 153, in bootstrap
      native_module = importlib.import_module("cv2")
  ImportError: libGL.so.1: cannot open shared object file: No such file or directory

Looks like libGL.so.1 (OpenGL) is not in the Docker container running the HF job.

@ivyleavedtoadflax (Contributor)

@davanstrien is there a way to use a custom Docker container?

@rmitsch (Collaborator, Author) commented Sep 22, 2025

@ivyleavedtoadflax You can use any image as a base, AFAIK, as long as it's on Docker Hub. Maybe give https://hub.docker.com/r/nvidia/cuda a shot?

@ivyleavedtoadflax (Contributor)

> @ivyleavedtoadflax You can use any image as a base, AFAIK, as long as it's on Docker Hub. Maybe give https://hub.docker.com/r/nvidia/cuda a shot?

Also failed. I started looking into using a custom Docker container: the same one that HF uses, just with the OpenCV deps added.

Note that this is happening because some of the OCR dependencies require it. If we moved the OCR dependencies to an extra, then we might be able to sidestep the problem entirely.

@rmitsch (Collaborator, Author) commented Oct 1, 2025

> Also failed. I started looking into using a custom Docker container: the same one that HF uses, just with the OpenCV deps added.
>
> Note that this is happening because some of the OCR dependencies require it. If we moved the OCR dependencies to an extra, then we might be able to sidestep the problem entirely.

@ivyleavedtoadflax Done in #163. Also updated to the new API design. Try again? It shouldn't require OpenCV anymore, and it should work with the default Docker image.

@ivyleavedtoadflax (Contributor)

Great, I will try again.

@ivyleavedtoadflax (Contributor)

Weirdly this does not solve the issue; it continues to ask for the ingestion dependencies.

@rmitsch (Collaborator, Author) commented Oct 12, 2025

@ivyleavedtoadflax Fixed and tested. I can run this with:

hf jobs uv run examples/create_classification_dataset.py classify \
    --input-dataset imdb \
    --column text \
    --labels "positive,ambivalent,negative" \
    --model MoritzLaurer/deberta-v3-large-zeroshot-v2.0 \
    --output-dataset {YOUR_USERNAME}/imdb-classified-v2 \
    --max-samples 5 \
    --batch-size 10 \
    --hf-token ...

LMK if this works for you too?

@davanstrien

Looks great! One quick suggestion: it would be better to load the HF token from the environment so it can be passed as a secret in the jobs command.

I.e., instead of the --hf-token flag, just grab it from the environment:

import os
from huggingface_hub import get_token, login

# Prefer the HF_TOKEN env var (injectable as a job secret), else the cached token.
token = os.environ.get("HF_TOKEN") or get_token()
if token:
    login(token=token)

Then in the HF Jobs example:

  hfjobs run --flavor l4x1 \
    -s HF_TOKEN \
    uv run https://github.com/raw/MantisAI/sieves/main/examples/create_classification_dataset.py \
    --input-dataset stanfordnlp/imdb \
    --column text \
    --labels "positive,negative" \
    --model HuggingFaceTB/SmolLM-360M-Instruct \
    --output-dataset your-username/imdb-classified

Otherwise, this looks very nice!

@rmitsch (Collaborator, Author) commented Oct 13, 2025

> Looks great! One quick suggestion: it would be better to load the HF token from the environment so it can be passed as a secret in the jobs command.

Good catch, thanks! Added in 2bbdc5e.

I'll leave this PR open for a bit longer in case @ivyleavedtoadflax has any further comments, then I'll merge.

@ivyleavedtoadflax (Contributor)

Weirdly, I can't get this to work. I've updated my HF_TOKEN.

[screenshot]
