Description
System Info
Hello,
When attempting to deploy TEI 1.6.1 images on AWS SageMaker GPU endpoints (e.g. ml.g5.2xlarge), various errors led to failed deployments, as summarized by the following CloudWatch logs:
- ghcr.io/huggingface/text-embeddings-inference:cuda-1.6.1 and ghcr.io/huggingface/text-embeddings-inference:cuda-sha-7d4d9ec
./entrypoint.sh: line 10: [: -eq: unary operator expected
./entrypoint.sh: line 13: [: too many arguments
./entrypoint.sh: line 16: [: -eq: unary operator expected
cuda compute cap is not supported
- ghcr.io/huggingface/text-embeddings-inference:86-1.6.1
error: unexpected argument 'serve' found
Usage: text-embeddings-router [OPTIONS]
For more information, try '--help'.
Each deployment referenced model artifacts from jinaai/jina-embeddings-v2-small-en supplied as an S3 archive, using two deployment strategies: (1) HuggingFaceModel, and (2) a SageMaker Model and associated Endpoint Config created with a boto3 SageMaker client (a sketch of this second approach is given below). Deployment similarly failed when instructing the endpoint to fetch the model artifacts from the Hub (the HuggingFaceModel approach).
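For reference, a minimal sketch of the boto3-based strategy (2); the image URI, role ARN, S3 path, and resource names below are placeholders assumed for illustration, not the exact values used:

import boto3

sm_client = boto3.client("sagemaker")

# Hypothetical placeholders for illustration only
image_uri = "<tei_image_uri>"
model_data_url = "s3://<bucket>/<prefix>/model.tar.gz"
role_arn = "<execution_role_arn>"

# Register the model: the TEI image plus the S3 model artifacts
sm_client.create_model(
    ModelName="my-tei-model",
    ExecutionRoleArn=role_arn,
    PrimaryContainer={
        "Image": image_uri,
        "ModelDataUrl": model_data_url,
        "Environment": {
            "HF_MODEL_ID": "/opt/ml/model",
            "HF_TASK": "feature-extraction",
        },
    },
)

# Endpoint config pinning the GPU instance type from the report
sm_client.create_endpoint_config(
    EndpointConfigName="my-tei-endpoint-config",
    ProductionVariants=[
        {
            "VariantName": "AllTraffic",
            "ModelName": "my-tei-model",
            "InstanceType": "ml.g5.2xlarge",
            "InitialInstanceCount": 1,
        }
    ],
)

# Create the endpoint; this is the step that fails with the logs above
sm_client.create_endpoint(
    EndpointName="jina-embeddings-tei",
    EndpointConfigName="my-tei-endpoint-config",
)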
Remark: When deploying TEI 1.4.0 using the HuggingFaceModel approach, with the image URI retrieved by the following code:
from sagemaker.huggingface import get_huggingface_llm_image_uri
tei_image_uri = get_huggingface_llm_image_uri("huggingface-tei", version="1.4.0")
the process completes without errors as long as model artifacts are fetched from the Hub. When supplying model artifacts in an S3 archive, deployment fails because the incorrect backend is initialized, as discussed here and addressed in #559
Information
- Docker
- The CLI directly
Tasks
- An officially supported command
- My own modifications
Reproduction
Deployment instructions using HuggingFaceModel:
- With model artifacts fetched from the Hub:
from sagemaker.huggingface import HuggingFaceModel

tei_image_uri = <image_uri>

emb_model = HuggingFaceModel(
    name="my-tei-model",
    role=role,
    # model_data=<s3_path_to_optional_model_artifacts>,
    sagemaker_session=<sm_session>,
    image_uri=tei_image_uri,
    env={
        "HF_TASK": "feature-extraction",
        "HF_MODEL_ID": "jinaai/jina-embeddings-v2-small-en",
    },
)
emb_predictor = emb_model.deploy(
initial_instance_count=1,
instance_type="ml.g5.2xlarge",
endpoint_name="jina-embeddings-tei"
)
- With model artifacts stored in S3:
Modify the above code such that (a sketch is given after this list):
- model_data points to an S3 tar.gz archive storing the model artifacts
- HF_MODEL_ID points to /opt/ml/model
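A minimal sketch of that variant, assuming a placeholder S3 path rather than the exact archive used:

from sagemaker.huggingface import HuggingFaceModel

# Hypothetical S3 path to the packaged jinaai/jina-embeddings-v2-small-en artifacts
model_data_s3 = "s3://<bucket>/<prefix>/model.tar.gz"

emb_model = HuggingFaceModel(
    name="my-tei-model",
    role=role,
    model_data=model_data_s3,
    sagemaker_session=<sm_session>,
    image_uri=tei_image_uri,
    env={
        "HF_TASK": "feature-extraction",
        # Point TEI at the artifacts unpacked inside the container
        "HF_MODEL_ID": "/opt/ml/model",
    },
)

emb_predictor = emb_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.2xlarge",
    endpoint_name="jina-embeddings-tei",
)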
Expected behavior
Deployment completes without errors
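Once the endpoint is InService, it should return embeddings for a request such as the following (a sketch; the payload shown assumes the default JSON serializer of the returned predictor):

# Sketch of the expected working behavior after a successful deployment
response = emb_predictor.predict({
    "inputs": "Deep learning is a subset of machine learning.",
})
print(response)  # expected: the embedding vector(s) for the input text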