Wrong classification outputs with WebOrganizer/FormatClassifier model based on gte-base-en-v1.5 #605

@WissamAntoun

System Info

text-embeddings-inference version 1.7 (cpu, volta, hopper)

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

Does the predict endpoint support classification models based on [gte-base-en-v1.5](https://huggingface.co/Alibaba-NLP/gte-base-en-v1.5), like WebOrganizer/FormatClassifier?

I ran the following, and the scores differ substantially from what HF Transformers returns for the same input:

model=WebOrganizer/FormatClassifier

docker run --gpus all --rm -p 4567:4567 ghcr.io/huggingface/text-embeddings-inference:cpu-1.7 \
--model-id $model \
--tokenization-workers 4 \
--dtype float16 \
--max-concurrent-requests 12000 \
--max-batch-tokens 4194304 \
--max-batch-requests 30 \
--payload-limit 100000000 \
--auto-truncate \
--port 4567

and

curl --request POST \
  --url http://localhost:4567/predict \
  --header 'content-type: application/json' \
  --header 'user-agent: vscode-restclient' \
  --data '{"inputs": "http://oceana.org/es/node/2845\n\nGiant Manta Ray\nGiant Manta Ray Manta birostris\nDivers often describe the experience of swimming beneath a manta ray as like being overtaken by a huge flying saucer. This ray is the biggest in the world, but like the biggest shark, the whale shark, it is a harmless consumer of plankton.\nWhen feeding, it swims along with its cavernous mouth wide open, beating its huge triangular wings slowly up and down. On either side of the mouth, which is at the front of the head, there are two long paddles, called cephalic lobes. These lobes help funnel plankton into the mouth. A stingerless whiplike tail trails behind.\nGiant manta rays tend to be found over high points like seamounts where currents bring plankton up to them. Small fish called remoras often travel attached to these giants, feeding on food scraps along the way. Giant mantas are ovoviviparous, so the eggs develop and hatch inside the mother. These rays can leap high out of the water, to escape predators, clean their skin of parasites or communicate.","raw_scores": false}'

Got:

[
  {
    "score": 0.042218093,
    "label": "Knowledge Article"
  },
  {
    "score": 0.04220779,
    "label": "Nonfiction Writing"
  },
  {
    "score": 0.042174313,
    "label": "Content Listing"
  },
  {
    "score": 0.042081743,
    "label": "News (Org.)"
  },
  {
    "score": 0.042049654,
    "label": "Product Page"
  },
  {
    "score": 0.042047087,
    "label": "Comment Section"
  },
  {
    "score": 0.04203041,
    "label": "Structured Data"
  },
  {
    "score": 0.041975286,
    "label": "Personal Blog"
  },
  {
    "score": 0.04192536,
    "label": "About (Org.)"
  },
  {
    "score": 0.04182121,
    "label": "News Article"
  },
  {
    "score": 0.04173575,
    "label": "Truncated"
  },
  {
    "score": 0.041673746,
    "label": "Customer Support"
  },
  {
    "score": 0.04149639,
    "label": "Tutorial"
  },
  {
    "score": 0.04145083,
    "label": "Spam / Ads"
  },
  {
    "score": 0.04142933,
    "label": "Audio Transcript"
  },
  {
    "score": 0.041401524,
    "label": "About (Pers.)"
  },
  {
    "score": 0.041401524,
    "label": "Creative Writing"
  },
  {
    "score": 0.04134849,
    "label": "User Review"
  },
  {
    "score": 0.041303087,
    "label": "Q&A Forum"
  },
  {
    "score": 0.04129553,
    "label": "Academic Writing"
  },
  {
    "score": 0.041285448,
    "label": "Documentation"
  },
  {
    "score": 0.041282926,
    "label": "Listicle"
  },
  {
    "score": 0.04121495,
    "label": "Legal Notices"
  },
  {
    "score": 0.041149598,
    "label": "FAQ"
  }
]
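
A quick sanity check on the scores above: with 24 labels, a perfectly uniform distribution would assign 1/24 (about 0.0417) to each one, and every returned score sits very close to that value. The sketch below measures the largest deviation from uniform. The helper name `max_deviation_from_uniform` is made up for illustration, and the check only describes the numbers; it does not establish why the server produces them.

```python
# Scores returned by the /predict call above, in the order they were returned.
tei_scores = [
    0.042218093, 0.04220779, 0.042174313, 0.042081743, 0.042049654, 0.042047087,
    0.04203041, 0.041975286, 0.04192536, 0.04182121, 0.04173575, 0.041673746,
    0.04149639, 0.04145083, 0.04142933, 0.041401524, 0.041401524, 0.04134849,
    0.041303087, 0.04129553, 0.041285448, 0.041282926, 0.04121495, 0.041149598,
]


def max_deviation_from_uniform(scores):
    """Return the largest absolute gap between any score and 1/len(scores)."""
    uniform = 1.0 / len(scores)
    return max(abs(score - uniform) for score in scores)


deviation = max_deviation_from_uniform(tei_scores)
print(f"labels: {len(tei_scores)}, max deviation from uniform: {deviation:.6f}")
```

A deviation this small would be consistent with nearly identical logits reaching the softmax, although that is only one possible reading of the output.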

Expected behavior

Expected output, obtained with HF Transformers for the same input:

[
  {
    "label": "Knowledge Article",
    "score": 0.9897820949554443
  },
  {
    "label": "Nonfiction Writing",
    "score": 0.003917765337973833
  },
  {
    "label": "Academic Writing",
    "score": 0.0027708113193511963
  },
  {
    "label": "About (Org.)",
    "score": 0.00043766063754446805
  },
  {
    "label": "Structured Data",
    "score": 0.00042599134030751884
  },
  {
    "label": "Tutorial",
    "score": 0.00035667812335304916
  },
  {
    "label": "News Article",
    "score": 0.0002632917312439531
  },
  {
    "label": "Truncated",
    "score": 0.000245161063503474
  },
  {
    "label": "Product Page",
    "score": 0.00022304613958112895
  },
  {
    "label": "News (Org.)",
    "score": 0.00017519851098768413
  },
  {
    "label": "Customer Support",
    "score": 0.0001508084824308753
  },
  {
    "label": "Creative Writing",
    "score": 0.00014316561282612383
  },
  {
    "label": "Documentation",
    "score": 0.00013527838746085763
  },
  {
    "label": "Personal Blog",
    "score": 0.00012406610767357051
  },
  {
    "label": "Q&A Forum",
    "score": 0.00012398010585457087
  },
  {
    "label": "About (Pers.)",
    "score": 0.00011963656288571656
  },
  {
    "label": "Legal Notices",
    "score": 0.00011266718502156436
  },
  {
    "label": "Listicle",
    "score": 0.00010732848750194535
  },
  {
    "label": "Audio Transcript",
    "score": 7.090720464475453e-05
  },
  {
    "label": "FAQ",
    "score": 7.070914580253884e-05
  },
  {
    "label": "Comment Section",
    "score": 6.661662337137386e-05
  },
  {
    "label": "User Review",
    "score": 6.555868458235636e-05
  },
  {
    "label": "Content Listing",
    "score": 6.322086846921593e-05
  },
  {
    "label": "Spam / Ads",
    "score": 4.835603613173589e-05
  }
]
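
To put a number on the gap between the two outputs, one option is to align them by label and compare scores. The sketch below is a minimal, hypothetical helper (`compare_predictions` is not part of text-embeddings-inference or Transformers); it assumes both inputs are lists of objects with `label` and `score` keys, as in the JSON shown in this issue. The sample data is a three-label excerpt of the two outputs, not the full lists.

```python
def compare_predictions(got, expected):
    """Align two lists of {"label", "score"} dicts by label and summarize the gap."""
    got_by_label = {p["label"]: p["score"] for p in got}
    expected_by_label = {p["label"]: p["score"] for p in expected}

    # Only compare labels that appear in both outputs.
    shared = sorted(set(got_by_label) & set(expected_by_label))
    diffs = {
        label: abs(got_by_label[label] - expected_by_label[label])
        for label in shared
    }

    top_got = max(got, key=lambda p: p["score"])["label"]
    top_expected = max(expected, key=lambda p: p["score"])["label"]
    worst_label = max(diffs, key=diffs.get)

    return {
        "same_top_label": top_got == top_expected,
        "worst_label": worst_label,
        "max_abs_diff": diffs[worst_label],
    }


# Three-label excerpt from the outputs in this issue.
tei_excerpt = [
    {"label": "Knowledge Article", "score": 0.042218093},
    {"label": "Nonfiction Writing", "score": 0.04220779},
    {"label": "Academic Writing", "score": 0.04129553},
]
transformers_excerpt = [
    {"label": "Knowledge Article", "score": 0.9897820949554443},
    {"label": "Nonfiction Writing", "score": 0.003917765337973833},
    {"label": "Academic Writing", "score": 0.0027708113193511963},
]

summary = compare_predictions(tei_excerpt, transformers_excerpt)
print(summary)
```

On this excerpt the top label agrees in both outputs, while the score for that label differs by more than 0.9.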
