Support Alibaba-NLP/gte-large-en-v1.5 on CPU/MPS #375

@tmostak

Description

Feature request

We'd like to run the Alibaba-NLP/gte-large-en-v1.5 model on a CPU text-embeddings-router server, but are hitting:

Caused by:
Could not start backend: GTE is only supported on Cuda devices in fp16 with flash attention enabled

Is there any way to implement/allow this model to run on CPU?
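For reference, the failure above can be reproduced by launching the router on a CPU-only host; the exact flags below (`--model-id`, `--port`) are assumptions based on the standard text-embeddings-router CLI, not taken from our deployment:

```shell
# Launch text-embeddings-router on a CPU-only host (no CUDA device present).
# The model id is the one from this issue; the port is an example value.
text-embeddings-router \
  --model-id Alibaba-NLP/gte-large-en-v1.5 \
  --port 8080
# Startup fails with:
#   Could not start backend: GTE is only supported on Cuda devices
#   in fp16 with flash attention enabled
```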

Motivation

For some of our clients we need to support a CPU-only embedding server, and we'd like to use the Alibaba-NLP/gte-large-en-v1.5 model to take advantage of its long 8192-token context length.

Your contribution

We'd be happy to test and run performance benchmarks if needed.
