
tokenize route returns mismatched tokens #525

@franklucky001

Description


System Info

{
  "model_id": "/data/BAAI/bge-m3",
  "model_sha": null,
  "model_dtype": "float16",
  "model_type": {
    "embedding": {
      "pooling": "cls"
    }
  },
  "max_concurrent_requests": 512,
  "max_input_length": 8192,
  "max_batch_tokens": 16384,
  "max_batch_requests": null,
  "max_client_batch_size": 32,
  "auto_truncate": false,
  "tokenization_workers": 48,
  "version": "1.6.0",
  "sha": "57d8fc8128ab94fcf06b4463ba0d83a4ca25f89b",
  "docker_label": "sha-57d8fc8"
}

Docker Compose

services:
  dense-embed:
    image: ghcr.io/huggingface/text-embeddings-inference:turing-1.6
    container_name: dense-embed
    env_file: .env
    command: --model-id ${DENSE_MODEL_ID}  --pooling cls
    ports:
      - "${DENSE_PORT:-8080}:80"
    volumes: 
      - "./data:/data"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

Model info

  • BAAI/bge-m3

Request to the /tokenize route:

{"inputs": ["这是一个文本向量化的测试句子"]}

Response:
[
    {
        "id": 0,
        "text": "<s>",
        "special": true,
        "start": null,
        "stop": null
    },
    {
        "id": 6,
        "text": "这是一",
        "special": false,
        "start": 0,
        "stop": 3
    },
    {
        "id": 100013,
        "text": "这是一个文本向量化的测试",
        "special": false,
        "start": 0,
        "stop": 12
    },
    {
        "id": 189061,
        "text": "句子",
        "special": false,
        "start": 12,
        "stop": 18
    },
    {
        "id": 2110,
        "text": "",
        "special": false,
        "start": 18,
        "stop": 21
    },
    {
        "id": 3272,
        "text": "",
        "special": false,
        "start": 21,
        "stop": 24
    },
    {
        "id": 41904,
        "text": "",
        "special": false,
        "start": 24,
        "stop": 30
    },
    {
        "id": 49125,
        "text": "",
        "special": false,
        "start": 30,
        "stop": 36
    },
    {
        "id": 27683,
        "text": "",
        "special": false,
        "start": 36,
        "stop": 39
    },
    {
        "id": 1344,
        "text": "",
        "special": false,
        "start": 39,
        "stop": 42
    },
    {
        "id": 2,
        "text": "</s>",
        "special": true,
        "start": null,
        "stop": null
    }
]
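
For reference, a minimal sketch of how the request above can be sent programmatically; the host and port are assumptions based on the DENSE_PORT default in the compose file:

import json
import requests

# Assumption: the TEI container is reachable on localhost:8080
# (the DENSE_PORT default from the compose file above).
resp = requests.post(
    "http://localhost:8080/tokenize",
    json={"inputs": ["这是一个文本向量化的测试句子"]},
)
resp.raise_for_status()

# Pretty-print the returned tokens (id, text, special, start, stop per token).
print(json.dumps(resp.json(), ensure_ascii=False, indent=4))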

Tokenizer output with the transformers API

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-m3")
encoded = tokenizer("这是一个文本向量化的测试句子")
tokens = tokenizer.convert_ids_to_tokens(encoded["input_ids"])

  • encoded: {'input_ids': [0, 6, 100013, 189061, 2110, 3272, 41904, 49125, 27683, 1344, 2], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
  • tokens: ['<s>', '▁', '这是一个', '文本', '向', '量', '化的', '测试', '句', '子', '</s>']

The token IDs are correct, but the token texts are mismatched: several text fields are empty and the non-empty ones do not line up with their IDs.

Expected behavior

The /tokenize route should return the same token texts as the transformers tokenizer: ['<s>', '▁', '这是一个', '文本', '向', '量', '化的', '测试', '句', '子', '</s>'].
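
As a concrete check, here is a minimal sketch of the comparison I would expect to pass, assuming the server started from the compose file above is reachable on localhost:8080 and returns the flat token list shown in the reproduction:

import requests
from transformers import AutoTokenizer

text = "这是一个文本向量化的测试句子"

# Tokens as reported by the TEI /tokenize route
# (assumption: server on localhost:8080, flat token list as in the reproduction above).
server = requests.post("http://localhost:8080/tokenize", json={"inputs": [text]}).json()

# Reference tokens from the transformers tokenizer.
tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-m3")
reference = tokenizer.convert_ids_to_tokens(tokenizer(text)["input_ids"])

print("TEI text fields:", [t["text"] for t in server])
print("transformers   :", reference)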
