
Conversation

@rmitsch (Collaborator) commented Sep 19, 2025

Description

Adds an example uv script for zero-shot classification using sieves. The script provides a complete workflow for classifying Hugging Face datasets, with automatic device detection, structured outputs, and dataset publishing.

Inspired by https://huggingface.co/datasets/uv-scripts/classification.

Caution

I don't have access to a pro/enterprise HF account yet, so I couldn't test running this with hf jobs. Maybe @davanstrien or @ivyleavedtoadflax can give it a shot?

Related Issues

-

Changes Made

  • New example script: examples/create_classification_dataset.py - A uv-compatible script for zero-shot text classification
  • Enhanced Classification task: Improved to_hf_dataset() method with proper label normalization and multi-label support
  • Package metadata: Updated setup.py with corrected package name

Key Features of the Example Script:

  • Dual model support: Handles both dedicated zero-shot classification models and general language models via Outlines
  • Multi-label classification: Optional multi-label mode with configurable thresholds
  • Comprehensive preprocessing: Text validation, length limits, and cleaning
  • Rich logging: Detailed statistics and distribution reporting
  • Label descriptions: Optional semantic descriptions for better classification
  • HF Hub integration: Automatic authentication and dataset publishing
  • uv compatibility: Uses PEP 723 inline script metadata for dependency management; see the sketch below
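
For reference, PEP 723 lets uv resolve a script's dependencies from an inline comment block at the top of the file. A minimal sketch of what such a header looks like (the dependency list shown here is an assumption for illustration, not necessarily the script's actual one):

# /// script
# requires-python = ">=3.10"
# dependencies = [
#     "sieves",
#     "datasets",
# ]
# ///

With a header like this in place, uv creates an ephemeral environment with those dependencies on `uv run`, so no manual setup is needed.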

Usage Examples:

# Single-label
uv run examples/create_classification_dataset.py \
  --input-dataset stanfordnlp/imdb \
  --column text \
  --labels "positive,ambivalent,negative" \
  --model MoritzLaurer/deberta-v3-large-zeroshot-v2.0 \
  --output-dataset your-username/imdb-classified

# Multi-label
uv run examples/create_classification_dataset.py \
  --input-dataset ag_news \
  --column text \
  --labels "world,sports,business,science" \
  --multi-label \
  --model MoritzLaurer/deberta-v3-large-zeroshot-v2.0 \
  --output-dataset your-username/agnews-multilabel
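
To illustrate the configurable thresholds mentioned under "Multi-label classification" above: in multi-label mode each label is scored independently, and every label above a cutoff is kept rather than only the single best one. A minimal Python sketch of that selection step; the function name, threshold default, and score format are assumptions, not the script's actual API:

def select_labels(scores: dict[str, float], threshold: float = 0.5) -> list[str]:
    # Multi-label mode: keep every label whose score clears the threshold.
    return [label for label, score in scores.items() if score >= threshold]

# Example: select_labels({"world": 0.81, "sports": 0.12, "business": 0.64})
# -> ["world", "business"]

In single-label mode, by contrast, only the top-scoring label would be kept.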

Checklist

  • Tests have been extended to cover changes in functionality
  • Existing and new tests succeed
  • Documentation updated (if applicable)
  • Related issues linked

Screenshots/Examples (if applicable)

The script provides comprehensive logging output showing:

  • Device selection (GPU/CPU); see the sketch after this list
  • Dataset loading and preprocessing statistics
  • Classification progress and results
  • Label distribution analysis
  • Success rates and error handling
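
The device selection mentioned in the first bullet typically follows a simple cascade. A minimal sketch of that logic, assuming a PyTorch backend (the script's actual implementation may differ):

import torch

def pick_device() -> str:
    # Prefer CUDA, then Apple's MPS backend, and fall back to CPU.
    if torch.cuda.is_available():
        return "cuda"
    if torch.backends.mps.is_available():
        return "mps"
    return "cpu"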

@rmitsch self-assigned this Sep 19, 2025
@rmitsch marked this pull request as draft September 19, 2025 19:08
codecov bot commented Sep 19, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #155   +/-   ##
=======================================
  Coverage   92.56%   92.56%           
=======================================
  Files          64       64           
  Lines        3281     3281           
=======================================
  Hits         3037     3037           
  Misses        244      244           

@rmitsch marked this pull request as ready for review September 19, 2025 19:57
@rmitsch changed the title from "feat: Add comprehensive zero-shot classification example script" to "feat: Add zero-shot classification example script" Sep 19, 2025
@rmitsch changed the title from "feat: Add zero-shot classification example script" to "feat: Add zero-shot classification example uv script" Sep 19, 2025
@ivyleavedtoadflax (Contributor) commented Sep 22, 2025

Managed to get it to run with:

hf jobs uv run examples/create_classification_dataset.py classify \
    --input-dataset imdb \
    --column text \
    --labels "positive,ambivalent,negative" \
    --model MoritzLaurer/deberta-v3-large-zeroshot-v2.0 \
    --output-dataset mattupson/imdb-classified-v2 \
    --max-samples 500 \
    --flavor a10g-small

but it seems to fail due to

 ImportError: libGL.so.1: cannot open shared object file: No such file or directory

  This error occurred when the script tried to import OpenCV (cv2):

  File "/root/.cache/uv/environments-v2/script-912247c0edd68a55/lib/python3.12/site-packag
  es/cv2/__init__.py", line 153, in bootstrap
      native_module = importlib.import_module("cv2")
  ImportError: libGL.so.1: cannot open shared object file: No such file or directory

Looks like libGL.so.1 (OpenGL) is not in the Docker container running the HF job.

@ivyleavedtoadflax (Contributor)

@davanstrien is there a way to use a custom Docker container?

@rmitsch (Collaborator, Author) commented Sep 22, 2025

@ivyleavedtoadflax You can use any image as a base, AFAIK, as long as it's on Docker Hub. Maybe give https://hub.docker.com/r/nvidia/cuda a shot?

@ivyleavedtoadflax (Contributor)

> @ivyleavedtoadflax You can use any image as a base, AFAIK, as long as it's on Docker Hub. Maybe give https://hub.docker.com/r/nvidia/cuda a shot?

Also failed. I started looking into using a custom Docker container: the same one that HF uses, just with the OpenCV deps added.

Note that this is happening because some of the OCR dependencies require it. If we moved the OCR dependencies to an extra, then we might be able to sidestep the problem entirely.

@rmitsch (Collaborator, Author) commented Oct 1, 2025

> Also failed. I started looking into using a custom Docker container: the same one that HF uses, just with the OpenCV deps added.
>
> Note that this is happening because some of the OCR dependencies require it. If we moved the OCR dependencies to an extra, then we might be able to sidestep the problem entirely.

@ivyleavedtoadflax Done in #163. Also updated to the new API design. Try again? It shouldn't require OpenCV anymore, and it should work with the default Docker image.

@ivyleavedtoadflax (Contributor)

Great, I will try again.

@ivyleavedtoadflax (Contributor)

Weirdly this does not solve the issue; it continues to ask for the ingestion dependencies.

@rmitsch (Collaborator, Author) commented Oct 12, 2025

@ivyleavedtoadflax Fixed and tested. I can run this with:

hf jobs uv run examples/create_classification_dataset.py classify \
    --input-dataset imdb \
    --column text \
    --labels "positive,ambivalent,negative" \
    --model MoritzLaurer/deberta-v3-large-zeroshot-v2.0 \
    --output-dataset {YOUR_USERNAME}/imdb-classified-v2 \
    --max-samples 5 \
    --batch-size 10 \
    --hf-token ...

LMK if this works for you too?

@davanstrien

Looks great! One quick suggestion: it would be better to load the HF token from the environment so it can be passed as a secret in the jobs command.

I.e., instead of the --hf-token flag, just grab it from the environment:

import os
from huggingface_hub import get_token, login

# Prefer the HF_TOKEN env var (injectable as a job secret), else the cached token.
token = os.environ.get("HF_TOKEN") or get_token()
if token:
    login(token=token)

Then in the HF Jobs example:

  hfjobs run --flavor l4x1 \
    -s HF_TOKEN \
    uv run https://github.com/raw/MantisAI/sieves/main/examples/create_classification_dataset.py \
    --input-dataset stanfordnlp/imdb \
    --column text \
    --labels "positive,negative" \
    --model HuggingFaceTB/SmolLM-360M-Instruct \
    --output-dataset your-username/imdb-classified

Otherwise, this looks very nice!

@rmitsch (Collaborator, Author) commented Oct 13, 2025

> Looks great! One quick suggestion: it would be better to load the HF token from the environment so it can be passed as a secret in the jobs command.

Good catch, thanks! Added in 2bbdc5e.

I'll leave this PR open for a bit longer in case @ivyleavedtoadflax has any further comments, then I'll merge.

@ivyleavedtoadflax (Contributor)

Weirdly, I can't get this to work. I've updated my HF_TOKEN.

[screenshot]
