A framework for evaluating audio models across multiple tasks relevant to the full-stack flow for Voice Assistants. See our blog post for more details!
```bash
conda create -n "cava" python=3.12 ipython -y
conda activate cava
pip install -e .
```
Create a `.env` file with the following format:
```
OPENAI_API_KEY=[KEY]
GEMINI_API_KEY=[KEY]
```
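To sanity-check that the keys are visible to Python before running an evaluation, a minimal sketch (assuming `python-dotenv` is installed; CAVA may load the `.env` file differently) is:

```python
import os

from dotenv import load_dotenv  # pip install python-dotenv

load_dotenv()  # reads key=value pairs from .env into the environment

# Fail fast if either key is missing.
for key in ("OPENAI_API_KEY", "GEMINI_API_KEY"):
    assert os.getenv(key), f"{key} is missing from the environment"
```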
To add a new dataset to the evaluation framework, follow these steps:
- Ensure your audio files are in a compatible format (WAV is recommended)
- Place them in a dedicated directory within the project structure, for example: `data/YourDatasetName/audio/`
Create a JSONL file with entries describing each audio file. Each line should be a valid JSON object with the following structure:
```
{
  "filename": "your_audio_file.wav",
  "field_name": "ground_truth_value",
  "sentence": "optional context or transcript",
  ...  # any additional relevant metadata
}
```
For example, for an emotion classification task:
```json
{
  "filename": "angry_speech_1.wav",
  "emotion": "anger",
  "sentence": "I can't believe you did that!",
  "voice_gender": "female"
}
```
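If you generate this file programmatically, a minimal sketch (record contents are illustrative, not part of CAVA) might look like:

```python
import json

# Illustrative records; use your own filenames, ground-truth field, and metadata.
records = [
    {"filename": "angry_speech_1.wav", "emotion": "anger",
     "sentence": "I can't believe you did that!", "voice_gender": "female"},
    {"filename": "happy_speech_1.wav", "emotion": "joy",
     "sentence": "This is the best news all week!", "voice_gender": "male"},
]

# JSONL: one JSON object per line, no enclosing array or trailing commas.
with open("data/YourDatasetName/audio_inputs.jsonl", "w", encoding="utf-8") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")
```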
You can convert audio datasets from HuggingFace to the CAVA format using the included conversion script. This allows you to leverage existing audio datasets without manual file preparation.
The `convert_from_hf.py` script converts any HuggingFace audio dataset to CAVA format:
```bash
python convert_from_hf.py \
    --dataset WillHeld/werewolf \
    --split train \
    --audio-dir Werewolf \
    --output audio_inputs.jsonl \
    --preserve-columns
```
This will:
- Download the specified dataset from HuggingFace
- Extract the audio files to `data/werewolf_data/`
- Create a JSONL file at `data/werewolf_data/audio_inputs.jsonl` with entries like:
{"filename": "0.wav", "werewolf": ["Justin", "Mike"], "PlayerNames": ["Justin", "Caitlynn", "Mitchell", "James", "Mike"], "endRoles": ["Werewolf", "Tanner", "Seer", "Robber", "Werewolf"], "votingOutcome": [3, 0, 3, 0, 0]}
You can then use this dataset like any other CAVA dataset by configuring a task with:
```
audio_dir: "werewolf_data/"
data_file: "audio_inputs.jsonl"
```
For more options and customization:
```bash
python convert_from_hf.py --help
```
Add a new task configuration in `src/cava/config.py` by updating the `create_task_configs()` function:
```python
def create_task_configs() -> Dict[str, TaskConfig]:
    return {
        # ... existing tasks ...
        "your_task": TaskConfig(
            name="your_task",
            prompt_template="Your task-specific prompt here",
            labels=["label1", "label2", "label3"],  # Optional, for classification tasks
            max_new_tokens=10,  # Adjust based on expected response length
            field_name="your_field_name",  # Field containing the ground truth
            audio_dir="your_dataset_directory/",  # Path to audio files
            data_file="your_dataset_inputs.jsonl",  # Formatted data file
        ),
    }
```
Assuming the data for your evaluation has been downloaded, run the evaluation with:
```bash
python src/cava/inference.py --task ${TASK_NAME}
```
Each task should have a unified script that either reproduces the data or downloads it from a long-term storage solution such as HuggingFace. These scripts live in the `run_scripts` directory.
For example, to download all Spoken Function Calling data, process it for use in CAVA, and run the evaluation:
```bash
bash run_scripts/run_function_calling.sh
```
Prompt templates are used to guide the model in performing the task.
Templates can include placeholders for dynamic content using the format `{placeholder_name}`.
For example:
prompt_template="Analyze the audio and determine if the speaker sounds {emotion_type}. Respond with only 'yes' or 'no'."
When no placeholders are used, the template is used as a prefix to the audio input.
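As a rough illustration of the substitution (the actual mechanics live in CAVA's inference code; this sketch just uses Python's built-in `str.format`, and the record fields are hypothetical):

```python
prompt_template = (
    "Analyze the audio and determine if the speaker sounds {emotion_type}. "
    "Respond with only 'yes' or 'no'."
)

# A record from the task's JSONL file; field names are illustrative.
record = {"filename": "angry_speech_1.wav", "emotion_type": "angry"}

# Extra keys like "filename" are ignored by str.format.
prompt = prompt_template.format(**record)
```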
After running an evaluation, results are saved to files named:
```
[data_file]_[model_name]_[task_name]
```
For example:
```
audio_inputs.jsonl_Qwen2-Audio-7B-Instruct_emotion
```
The output file will contain the original records with added prediction fields.
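As an example of downstream analysis, here is a minimal sketch for scoring a classification task. The `"prediction"` key is an assumption for illustration; check your output file for the exact field name CAVA writes:

```python
import json

correct = total = 0
with open("audio_inputs.jsonl_Qwen2-Audio-7B-Instruct_emotion", encoding="utf-8") as f:
    for line in f:
        rec = json.loads(line)
        total += 1
        # "emotion" is the ground-truth field; "prediction" is assumed here.
        correct += rec["emotion"].strip().lower() == rec["prediction"].strip().lower()

print(f"Accuracy: {correct / total:.3f}")
```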
For tasks that require evaluating a model's speech output (such as pronunciation or speech synthesis):
- Set the `speech_output` parameter to `True` in your task configuration
- Specify an `output_audio_dir` where generated audio will be saved
- Define an appropriate evaluation metric in the task configuration
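Put together, a hypothetical entry in `create_task_configs()` might look like the sketch below; the `metric` field is illustrative, so check `TaskConfig` in `src/cava/config.py` for the exact parameter names:

```python
"your_speech_task": TaskConfig(
    name="your_speech_task",
    prompt_template="Read the following sentence aloud: {sentence}",
    field_name="sentence",
    audio_dir="your_dataset_directory/",
    data_file="your_dataset_inputs.jsonl",
    speech_output=True,  # ask the model to produce audio rather than text
    output_audio_dir="outputs/your_speech_task/",  # where generated audio is saved
    # metric="wer",  # illustrative: pick a metric appropriate to the task
),
```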
```bibtex
@misc{cava2025,
  title        = {CAVA: Comprehensive Assessment of Voice Assistants},
  author       = {Held, Will and Ryan, Michael J. and Shrivastava, Aditya and Khan, Ali Sartaz and Ziems, Caleb and Li, Ella and Bartelds, Martijn and Sun, Michael and Li, Tan and Gan, Woody and Yang, Diyi},
  year         = {2025},
  url          = {https://talkarena.org/cava},
  howpublished = {\url{https://github.com/SALT-NLP/CAVA}},
  note         = {A benchmark for evaluating large audio model (LAM) capabilities across six domains: turn taking, instruction following, function calling, tone awareness, safety, and latency}
}
```