Description
Issue encountered
Evaluating large models (> 30B parameters) is hard, especially with limited hardware. When many metrics have to be computed on top of the generations, the time the large machine needs to stay up grows substantially. For example, when I evaluate a 70B model on a large dataset and then compute many LLM-judge metrics, it can occupy a 4xA100 machine for days, incurring significant cost. The GPUs are only actually active during the first few hours for inference; afterwards they sit idle.
Solution/Feature
Ideally, we would run inference with a single metric and save the results to the details files. In a second step, we would load the responses from the details files and run only the metrics, which can be done on a significantly smaller machine. Loading from the details files is being added in PR #488. However, to evaluate the metrics we currently still need to load the entire model into memory, defeating the purpose. Loading the model only right before it is actually run would alleviate this issue (see the sketch below).
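To illustrate the idea, here is a minimal sketch of deferring the weight loading until the model is first used. This is not lighteval's actual API; `LazyModel` and its methods are hypothetical, and only the `transformers` call is real.

```python
from functools import cached_property


class LazyModel:
    """Hypothetical wrapper that defers loading the weights until the model
    is actually needed. If only metrics are computed from saved details
    files, the underlying model is never materialized in memory."""

    def __init__(self, model_name: str):
        self.model_name = model_name

    @cached_property
    def model(self):
        # Loaded lazily on first access, e.g. right before generation starts.
        from transformers import AutoModelForCausalLM

        return AutoModelForCausalLM.from_pretrained(self.model_name)

    def generate(self, *args, **kwargs):
        return self.model.generate(*args, **kwargs)
```

With a wrapper like this, a metrics-only run that never calls `generate` would never touch the GPU memory at all.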
Possible alternatives
Alternatively, we could mock the model so that no weights are ever loaded when only metrics are computed from existing details files.
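A rough sketch of what such a mock could look like; the class name and behavior are hypothetical, not an existing lighteval component.

```python
class MockModel:
    """Hypothetical stand-in that never loads weights. Any attempt to run
    inference fails loudly, which is acceptable when all responses are
    re-loaded from previously saved details files."""

    def __init__(self, model_name: str):
        self.model_name = model_name

    def generate(self, *args, **kwargs):
        raise RuntimeError(
            "MockModel cannot generate; responses must come from details files."
        )
```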