
tokenize route returns mismatched tokens #525

@franklucky001

Description


System Info

{
  "model_id": "/data/BAAI/bge-m3",
  "model_sha": null,
  "model_dtype": "float16",
  "model_type": {
    "embedding": {
      "pooling": "cls"
    }
  },
  "max_concurrent_requests": 512,
  "max_input_length": 8192,
  "max_batch_tokens": 16384,
  "max_batch_requests": null,
  "max_client_batch_size": 32,
  "auto_truncate": false,
  "tokenization_workers": 48,
  "version": "1.6.0",
  "sha": "57d8fc8128ab94fcf06b4463ba0d83a4ca25f89b",
  "docker_label": "sha-57d8fc8"
}

Docker Compose

services:
  dense-embed:
    image: ghcr.io/huggingface/text-embeddings-inference:turing-1.6
    container_name: dense-embed
    env_file: .env
    command: --model-id ${DENSE_MODEL_ID}  --pooling cls
    ports:
      - "${DENSE_PORT:-8080}:80"
    volumes: 
      - "./data:/data"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

Model info

  • BAAI/bge-m3

Request to the /tokenize route:

{"inputs": ["这是一个文本向量化的测试句子"]}

Response:
[
    {
        "id": 0,
        "text": "<s>",
        "special": true,
        "start": null,
        "stop": null
    },
    {
        "id": 6,
        "text": "这是一",
        "special": false,
        "start": 0,
        "stop": 3
    },
    {
        "id": 100013,
        "text": "这是一个文本向量化的测试",
        "special": false,
        "start": 0,
        "stop": 12
    },
    {
        "id": 189061,
        "text": "句子",
        "special": false,
        "start": 12,
        "stop": 18
    },
    {
        "id": 2110,
        "text": "",
        "special": false,
        "start": 18,
        "stop": 21
    },
    {
        "id": 3272,
        "text": "",
        "special": false,
        "start": 21,
        "stop": 24
    },
    {
        "id": 41904,
        "text": "",
        "special": false,
        "start": 24,
        "stop": 30
    },
    {
        "id": 49125,
        "text": "",
        "special": false,
        "start": 30,
        "stop": 36
    },
    {
        "id": 27683,
        "text": "",
        "special": false,
        "start": 36,
        "stop": 39
    },
    {
        "id": 1344,
        "text": "",
        "special": false,
        "start": 39,
        "stop": 42
    },
    {
        "id": 2,
        "text": "</s>",
        "special": true,
        "start": null,
        "stop": null
    }
]
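
For reference, a minimal sketch of how the request above can be sent programmatically; the host and port are assumptions based on the DENSE_PORT default in the compose file:

import json
import requests

# Assumption: the TEI container is reachable on localhost:8080
# (the DENSE_PORT default from the compose file above).
resp = requests.post(
    "http://localhost:8080/tokenize",
    json={"inputs": ["这是一个文本向量化的测试句子"]},
)
resp.raise_for_status()

# Pretty-print the returned tokens (id, text, special, start, stop per token).
print(json.dumps(resp.json(), ensure_ascii=False, indent=4))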

Tokenizer output with the transformers API

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-m3")
encoded = tokenizer("这是一个文本向量化的测试句子")
tokens = tokenizer.convert_ids_to_tokens(encoded["input_ids"])

  • encoded: {'input_ids': [0, 6, 100013, 189061, 2110, 3272, 41904, 49125, 27683, 1344, 2], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
  • tokens: ['<s>', '▁', '这是一个', '文本', '向', '量', '化的', '测试', '句', '子', '</s>']

The token IDs are correct, but the token texts are mismatched: several text fields are empty and the non-empty ones do not line up with their IDs.

Expected behavior

The /tokenize route should return the same token texts as the transformers tokenizer: ['<s>', '▁', '这是一个', '文本', '向', '量', '化的', '测试', '句', '子', '</s>'].
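
As a concrete check, here is a minimal sketch of the comparison I would expect to pass, assuming the server started from the compose file above is reachable on localhost:8080 and returns the flat token list shown in the reproduction:

import requests
from transformers import AutoTokenizer

text = "这是一个文本向量化的测试句子"

# Tokens as reported by the TEI /tokenize route
# (assumption: server on localhost:8080, flat token list as in the reproduction above).
server = requests.post("http://localhost:8080/tokenize", json={"inputs": [text]}).json()

# Reference tokens from the transformers tokenizer.
tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-m3")
reference = tokenizer.convert_ids_to_tokens(tokenizer(text)["input_ids"])

print("TEI text fields:", [t["text"] for t in server])
print("transformers   :", reference)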
