Description
System Info
{
  "model_id": "/data/BAAI/bge-m3",
  "model_sha": null,
  "model_dtype": "float16",
  "model_type": {
    "embedding": {
      "pooling": "cls"
    }
  },
  "max_concurrent_requests": 512,
  "max_input_length": 8192,
  "max_batch_tokens": 16384,
  "max_batch_requests": null,
  "max_client_batch_size": 32,
  "auto_truncate": false,
  "tokenization_workers": 48,
  "version": "1.6.0",
  "sha": "57d8fc8128ab94fcf06b4463ba0d83a4ca25f89b",
  "docker_label": "sha-57d8fc8"
}
docker compose
services:
  dense-embed:
    image: ghcr.io/huggingface/text-embeddings-inference:turing-1.6
    container_name: dense-embed
    env_file: .env
    command: --model-id ${DENSE_MODEL_ID} --pooling cls
    ports:
      - "${DENSE_PORT:-8080}:80"
    volumes:
      - "./data:/data"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
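
For reference, once the stack is up (docker compose up -d), the System Info above can be fetched back from the server's /info endpoint. A minimal sketch (not part of the original report), assuming the DENSE_PORT default of 8080 from the compose file:

import json
import urllib.request

# Assumes the DENSE_PORT default of 8080 from the compose file above.
with urllib.request.urlopen("http://localhost:8080/info") as resp:
    info = json.load(resp)

print(info["version"], info["model_id"], info["model_type"])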
Information
- Docker
- The CLI directly
Tasks
- An officially supported command
- My own modifications
Reproduction
model info
- BAAI/bge-m3
POST /tokenize with body:
{"inputs": ["这是一个文本向量化的测试句子"]}
Response:
[
{
"id": 0,
"text": "<s>",
"special": true,
"start": null,
"stop": null
},
{
"id": 6,
"text": "这是一",
"special": false,
"start": 0,
"stop": 3
},
{
"id": 100013,
"text": "这是一个文本向量化的测试",
"special": false,
"start": 0,
"stop": 12
},
{
"id": 189061,
"text": "句子",
"special": false,
"start": 12,
"stop": 18
},
{
"id": 2110,
"text": "",
"special": false,
"start": 18,
"stop": 21
},
{
"id": 3272,
"text": "",
"special": false,
"start": 21,
"stop": 24
},
{
"id": 41904,
"text": "",
"special": false,
"start": 24,
"stop": 30
},
{
"id": 49125,
"text": "",
"special": false,
"start": 30,
"stop": 36
},
{
"id": 27683,
"text": "",
"special": false,
"start": 36,
"stop": 39
},
{
"id": 1344,
"text": "",
"special": false,
"start": 39,
"stop": 42
},
{
"id": 2,
"text": "</s>",
"special": true,
"start": null,
"stop": null
}
]
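
For convenience, the request above can be scripted; a sketch (not part of the original report) that assumes the same localhost:8080 mapping and prints the entries whose text field comes back empty:

import json
import urllib.request

# Hypothetical repro of the /tokenize call above (assumes localhost:8080).
payload = json.dumps({"inputs": ["这是一个文本向量化的测试句子"]}).encode("utf-8")
req = urllib.request.Request(
    "http://localhost:8080/tokenize",
    data=payload,
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    data = json.load(resp)

# Depending on the TEI version, the response may be nested one level per input.
tokens = data[0] if data and isinstance(data[0], list) else data

# List the non-special tokens whose text field comes back empty
# (ids 2110, 3272, 41904, 49125, 27683, 1344 in the response above).
for tok in tokens:
    if not tok["special"] and tok["text"] == "":
        print(tok["id"], tok["start"], tok["stop"])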
tokenizer with the transformers API
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-m3")
encoded = tokenizer("这是一个文本向量化的测试句子")
tokenizer.convert_ids_to_tokens(encoded['input_ids'])
- encoded {'input_ids': [0, 6, 100013, 189061, 2110, 3272, 41904, 49125, 27683, 1344, 2], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
- tokens ['<s>', '▁', '这是一个', '文本', '向', '量', '化的', '测试', '句', '子', '</s>']
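
Putting the two outputs side by side makes the mismatch concrete. A small self-contained check, with the TEI values copied from the /tokenize response above:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-m3")
encoded = tokenizer("这是一个文本向量化的测试句子")
hf_ids = encoded["input_ids"]
hf_tokens = tokenizer.convert_ids_to_tokens(hf_ids)

# Values copied from the TEI /tokenize response above.
tei_ids = [0, 6, 100013, 189061, 2110, 3272, 41904, 49125, 27683, 1344, 2]
tei_texts = ["<s>", "这是一", "这是一个文本向量化的测试", "句子",
             "", "", "", "", "", "", "</s>"]

print(tei_ids == hf_ids)        # True: the ids agree
print(tei_texts == hf_tokens)   # False: the token texts do not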
The token IDs are correct, but the token texts are mismatched. The start/stop values look like byte offsets into the UTF-8 input, while the text fields look as if those same offsets were applied as character indices, which would explain why the later tokens come back empty once the index runs past the 14-character string.
Expected behavior
The same token texts as the transformers result: ['<s>', '▁', '这是一个', '文本', '向', '量', '化的', '测试', '句', '子', '</s>']
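
Until this is fixed server-side, the expected texts can apparently be recovered from the offsets alone: the start/stop values in the response are consistent with byte offsets into the UTF-8 input (for example, id 189061 spans bytes 12..18, which decode to '文本'). A workaround sketch under that assumption, not part of the original report:

# Workaround sketch. Assumption (not confirmed by the report): the start/stop
# values returned by /tokenize are byte offsets into the UTF-8 input.
text = "这是一个文本向量化的测试句子"
raw = text.encode("utf-8")

def token_text(tok):
    if tok["start"] is None:      # special tokens (<s>, </s>) carry no offsets
        return tok["text"]
    return raw[tok["start"]:tok["stop"]].decode("utf-8")

# Example entry from the response above:
print(token_text({"id": 2110, "text": "", "special": False, "start": 18, "stop": 21}))
# -> "向", matching the transformers token, even though TEI returned "".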