Description
Due to memory concerns, some of my pipeline stages have persistent outputs that I manage from Python scripts. Each such stage receives a JSON file as input and outputs a folder of image files. Since there is no need to rewrite those files on every execution, the script checks whether each file is already present in the folder (a read-only check) and skips it if so. The dataset can get quite large, and disk usage becomes a concern because every dvc repro run has to unprotect all the files in the folder, which copies them from the cache into the workspace. If an output could be marked as safe, meaning the stage only adds or removes files in it and never modifies existing ones, the unprotect step could be skipped and the extra space usage avoided.
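To make the access pattern concrete, here is a minimal sketch of the skip-if-present logic described above. It is an illustration, not the actual stage script: `process_stage` and `render_image` are hypothetical names, and it assumes the input JSON maps output file names to per-image parameters.

```python
import json
from pathlib import Path


def render_image(spec: dict) -> bytes:
    # Hypothetical placeholder for the real, expensive image generation.
    return json.dumps(spec, sort_keys=True).encode("utf-8")


def process_stage(input_json: Path, out_dir: Path) -> list[str]:
    """Write only the images missing from out_dir and return their names."""
    # Assumed input layout: {"image_name.png": {...parameters...}, ...}
    specs = json.loads(input_json.read_text())
    out_dir.mkdir(parents=True, exist_ok=True)

    written = []
    for name, spec in specs.items():
        target = out_dir / name
        if target.exists():
            # Existing outputs are only checked for presence, never reopened
            # for writing, so they would not need to be unprotected.
            continue
        target.write_bytes(render_image(spec))
        written.append(name)
    return written
```

Running the function twice on the same input writes nothing the second time, which is the property that would make it safe to leave the existing files protected.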