This is an open source implementation of DirectAI's core service, provided both to give back to the open source community from which we benefited greatly and to allow our clients to continue to have service in the event that we can no longer host our API.
We host zero-shot image models so that clients can use computer vision at scale without having to collect and label large amounts of training data or train their own models. However, zero-shot models don't necessarily work out of the box for all cases. We introduce an algorithm for providing feedback to zero-shot models by extending the standard linear decision boundary in the model's embedding space into a two-stage nearest neighbors algorithm, which allows much finer-grained control over what the model considers to belong to a particular class, with minimal impact on runtime.
A hosted version of the Gradio frontend is available at sandbox.oss.directai.io, and a hosted version of the open source API is available at api.oss.directai.io, with auto-generated docs available at api.oss.directai.io/docs. WE MAKE NO GUARANTEES ABOUT UPTIME / AVAILABILITY OF THE HOSTED OPEN SOURCE IMPLEMENTATION. For a high-uptime implementation, see our commercial offering at api.alpha.directai.io/docs.
- Set your logging level preference in `directai_fastapi/.env`. See options in Python's logging documentation. An empty string input defaults to `logging.INFO`.

To build and launch the service, run:

```bash
docker compose build && docker compose up
```
To run the test suite:

```bash
docker compose -f testing-docker-compose.yml build && docker compose -f testing-docker-compose.yml up
```
This repository requires access to an Nvidia GPU with Ampere architecture, which the flash attention integration in the object detector depends on. It could, however, be modified to run on older Nvidia GPUs or on CPU. Feel free to submit a pull request or raise an issue if you need that support!
We've built infrastructure to make it easy to quickly run an arbitrary classifier against a dataset. If your images are organized like so:
```
/dataset_directory
│
├── image1.jpg
├── image2.jpg
├── image3.jpg
├── ...
└── imageN.jpg
```
and you have a JSON file at `classifier_config.json` defining the image classifier you'd like to run, you can dump classification labels to an `output.csv` via:

```bash
docker compose build && docker compose run local_fastapi python classify_directory.py --root=dataset_directory --classifier_json_file=classifier_config.json --output_file=output.csv
```
Make sure that all the files are mounted within the Docker container. You can do that either by modifying the volumes specified in `docker-compose.yml` or by placing them all within the `.cache` directory, which is mounted by default.
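For reference, a classifier config describes each class via positive and negative text examples. The exact schema is defined by the API (see the auto-generated docs); the field names below are a hypothetical sketch mirroring the meta class structure described later in this README:

```json
{
  "classifier_configs": [
    {
      "name": "dog",
      "examples_to_include": ["dog", "puppy"],
      "examples_to_exclude": ["wolf", "coyote"]
    },
    {
      "name": "cat",
      "examples_to_include": ["cat", "kitten"],
      "examples_to_exclude": []
    }
  ]
}
```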
If your images have labels and are organized like so:
```
/dataset_directory
│
├── /label1
│   ├── image1.jpg
│   ├── image2.jpg
│   └── ...
│
├── /label2
│   ├── image1.jpg
│   ├── image2.jpg
│   └── ...
│
└── /labelN
    ├── image1.jpg
    ├── image2.jpg
    └── ...
```
You can run an evaluation against the labels by running:

```bash
docker compose build && docker compose run local_fastapi python classify_directory.py --root=dataset_directory --classifier_json_file=classifier_config.json --eval_only=True
```
If you want to run classifications on a custom dataset, you can either use our API or build a custom Ray Dataset and use the utilities defined in `batch_processing.py`.
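For the Ray route, a rough sketch might look like the following, assuming a batch-level helper in `batch_processing.py` (the `classify_images` name here is hypothetical; substitute the utilities the file actually exposes):

```python
import ray

# Build a Ray Dataset of decoded images from a local directory.
ds = ray.data.read_images("dataset_directory")

def classify_batch(batch):
    # Hypothetical helper: maps a batch of decoded images to labels.
    from batch_processing import classify_images  # assumed name
    batch["label"] = classify_images(batch["image"])
    return batch

labeled = ds.map_batches(classify_batch)
# Drop the raw image column before writing tabular output.
labeled.drop_columns(["image"]).write_csv("output_dir")
```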
To launch a self-hosted version of this service in AWS, we'll spin up a fresh EC2 instance. Choose "Deep Learning OSS Nvidia Driver AMI GPU PyTorch 2.4 (Ubuntu 22.04)" as your AMI and g5.xlarge as your instance size. After that you should be able to just run:
```bash
git clone https://github.com/DirectAI/simple-data-free-model-server
cd simple-data-free-model-server
docker compose build && docker compose up
```
This repository presents the idea of semantic nearest neighbors for building custom decision boundaries with late-fusion zero-shot models. We use CLIP and OWL-ViT-ST as our base late-fusion zero-shot image classifier and object detector, respectively. See the code for implementation details.
The standard approach for a late-fusion zero-shot image classifier is to take a set of $n$ labels, embed them via the associated language model, and then stack those embeddings to form a linear classification layer on top of the image embedding from the associated image model. This head can also be interpreted as a nearest neighbors layer, as the predicted class is simply the class whose text embedding is most similar to the image embedding.
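A minimal sketch of that head in PyTorch, assuming the image and label embeddings have already been computed by CLIP's encoders:

```python
import torch
import torch.nn.functional as F

def zero_shot_predict(image_emb: torch.Tensor, label_embs: torch.Tensor) -> int:
    """image_emb: (d,) image embedding; label_embs: (n, d) label text embeddings."""
    # Normalize so the dot product is cosine similarity.
    image_emb = F.normalize(image_emb, dim=-1)
    label_embs = F.normalize(label_embs, dim=-1)
    # The "linear classification layer" is just the matrix of text embeddings:
    # the predicted class is the nearest neighbor under cosine similarity.
    scores = label_embs @ image_emb  # (n,)
    return int(scores.argmax())
```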
Let $x$ be the image embedding and let $t_1, \dots, t_n$ be the text embeddings of the $n$ labels. The predicted class is

$$\hat{y} = \arg\max_{i} \; x \cdot t_i$$
We define a meta class which contains both positive and negative examples. A score is computed for the meta class based on a goodness-of-fit estimate for the image given the positive and negative examples in the class. Then, the meta class with the highest score is predicted to be the true class.
In the simplest case of a single positive example per meta class, this is the same as the traditional zero-shot approach. To extend the paradigm to the case where there are multiple positive examples per meta class, we can run the traditional zero-shot approach over the set of all positive examples and then let the predicted class be the meta class which includes the highest-scoring positive example. This can be viewed as an n-way nearest neighbors problem where the samples are semantically-relevant text embeddings, and the correct class is the one with the most semantically-relevant example.
Let $T_i$ be the set of text embeddings of the positive examples for meta class $i$, and let $c(t)$ denote the meta class that example $t$ belongs to. The predicted class is

$$\hat{y} = c\left( \arg\max_{t \in \bigcup_i T_i} \; x \cdot t \right)$$
We can reinterpret the above as a two-stage process. Instead of running an n-way nearest neighbors problem, we can take for each meta class the max of the relevancy scores of the provided examples, and then take the argmax over this meta class level score.
Let $s_i = \max_{t \in T_i} \; x \cdot t$ be the relevancy score of meta class $i$. Then the predicted class is

$$\hat{y} = \arg\max_{i} \; s_i$$
To extend this to allow for negative examples, we can replace the first stage max with a two-class nearest neighbors boundary between the positive and negative examples. We then let the meta class score be the score of the most relevant example if that example is positive, and the negation of that score if that example is negative. In other words, a meta class whose most relevant examples are positive will receive a higher score, and a meta class whose most relevant examples are negative will receive a lower score.
Let $P_i$ and $N_i$ be the sets of text embeddings of the positive and negative examples for meta class $i$, and let $t_i^* = \arg\max_{t \in P_i \cup N_i} x \cdot t$ be the most relevant example for meta class $i$. The meta class score is

$$s_i = \begin{cases} x \cdot t_i^* & \text{if } t_i^* \in P_i \\ -\, x \cdot t_i^* & \text{if } t_i^* \in N_i \end{cases}$$

and the predicted class is again $\hat{y} = \arg\max_{i} \; s_i$.
Note that if there are no negative examples, this is the same as the previous case, and if there are no negative examples and exactly one positive example per meta class this is equivalent to the traditional method. Also note that if the most relevant example for all of the meta classes is a negative example, then this function attempts to choose the 'least irrelevant' prediction.
This is our final prediction function, a two-layered nearest neighbors problem that incorporates positive and negative semantic evidence into its decision boundary. This can be run efficiently with an optimized scatter max function.
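A vectorized sketch of the full prediction function, using `scatter_reduce` as the scatter max (illustrative, not the repository's exact implementation):

```python
import torch

def two_stage_nn_predict(
    image_emb: torch.Tensor,     # (d,) normalized image embedding
    example_embs: torch.Tensor,  # (m, d) normalized example text embeddings
    class_ids: torch.Tensor,     # (m,) int64 meta class index of each example
    is_negative: torch.Tensor,   # (m,) bool, True if the example is negative
    num_classes: int,
) -> int:
    sims = example_embs @ image_emb  # relevancy of each example, (m,)
    neg_inf = torch.full((num_classes,), float("-inf"))
    # First stage: most relevant positive and negative example per meta class.
    pos_best = neg_inf.scatter_reduce(
        0, class_ids[~is_negative], sims[~is_negative], reduce="amax"
    )
    neg_best = neg_inf.scatter_reduce(
        0, class_ids[is_negative], sims[is_negative], reduce="amax"
    )
    # The most relevant example is positive iff pos_best >= neg_best, so the
    # meta class score is pos_best in that case and -neg_best otherwise.
    scores = torch.where(pos_best >= neg_best, pos_best, -neg_best)
    # Second stage: argmax over the meta class scores.
    return int(scores.argmax())
```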
To extend late-fusion zero-shot object detectors to incorporate positive and negative examples, we first compute, for each meta class and each proposed bounding box, the meta class score $s_i$ defined above, using the box's region embedding in place of the image embedding. In other words, we first run a two-way object detection problem for each meta class between its positive and negative examples, and then run an $n$-way comparison over the resulting meta class scores for each box.
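The same scoring can be applied per box; a sketch (again illustrative), where each proposed box's region embedding plays the role of the image embedding:

```python
import torch

def meta_class_box_scores(
    box_embs: torch.Tensor,      # (b, d) normalized box region embeddings
    example_embs: torch.Tensor,  # (m, d) normalized example text embeddings
    class_ids: torch.Tensor,     # (m,) int64 meta class index of each example
    is_negative: torch.Tensor,   # (m,) bool, True if the example is negative
    num_classes: int,
) -> torch.Tensor:
    sims = box_embs @ example_embs.T  # (b, m) per-box example relevancies
    b = box_embs.shape[0]
    neg_inf = torch.full((b, num_classes), float("-inf"))
    # Two-way stage: best positive / negative example per box and meta class.
    pos_best = neg_inf.scatter_reduce(
        1, class_ids[~is_negative].expand(b, -1), sims[:, ~is_negative], reduce="amax"
    )
    neg_best = neg_inf.scatter_reduce(
        1, class_ids[is_negative].expand(b, -1), sims[:, is_negative], reduce="amax"
    )
    # Each box gets an n-way score vector; downstream thresholding and NMS
    # can then consume these per-class scores as usual.
    return torch.where(pos_best >= neg_best, pos_best, -neg_best)  # (b, n)
```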
- Make sure you run `pip install pre-commit` followed by `pre-commit install` before attempting to commit to the repo.
Special thanks to Neo for funding DirectAI! Thank you to OpenCLIP and Huggingface for providing the model implementations that we use here.
If you have any questions or comments, raise an issue or reach out to Isaac Robinson at [email protected].
If you're interested in contributing, raise an issue or email Isaac and we'll write a contributing guide!
If you find this useful for your work, please cite this repository!