The docstring appears to have been pasted from the image classification pipeline: https://github.com/huggingface/transformers/blob/main/src/transformers/pipelines/depth_estimation.py#L54-L83. According to https://huggingface.co/docs/transformers/main/tasks/monocular_depth_estimation, the correct output should be a dictionary containing an image and a tensor. cc @amyeroberts @ydshieh
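
For reference, here is a rough sketch of how the actual return value could be inspected. The key names `predicted_depth` (tensor) and `depth` (PIL image) are what I believe the pipeline returns, the `Intel/dpt-large` checkpoint is just an assumed example, and `summarize_depth_output` is a hypothetical helper written for this comment, not part of transformers:

```python
def summarize_depth_output(output):
    """Map each key of a pipeline result to the type name of its value.

    Hypothetical helper for illustration only; not part of transformers.
    """
    return {key: type(value).__name__ for key, value in output.items()}


def run_example():
    """Run the depth estimation pipeline once and print the output types.

    Not called automatically: it needs transformers, torch, Pillow and
    network access to download the checkpoint and the image.
    """
    from transformers import pipeline

    # Assumed checkpoint; any depth estimation checkpoint should work here.
    depth_estimator = pipeline("depth-estimation", model="Intel/dpt-large")
    output = depth_estimator(
        "http://images.cocodataset.org/val2017/000000039769.jpg"
    )
    # Expected (assumption): a dict with "predicted_depth" and "depth",
    # not the list of {"label", "score"} dicts the docstring describes.
    print(summarize_depth_output(output))
```

Calling `run_example()` should make the mismatch with the documented return type easy to see, assuming the keys above are right.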