
dvc exp run: experiment metrics are not reported when metric files are on another device than training code #7863

Closed
AlexandreRozier opened this issue Jun 8, 2022 · 14 comments
Labels
A: experiments (Related to dvc exp), bug (Did we break something?), p3-nice-to-have (It should be done this or next sprint)

Comments

@AlexandreRozier

AlexandreRozier commented Jun 8, 2022

Bug Report

Issue name

dvc exp run runs but does not store metrics.

Description

I'm running my training script on /dev/mapper/system-home and it writes its outputs (model checkpoints, metrics) to /data/.cache, which is located on another partition (/dev/sdb1). /dev/sdb1 is a purposely large partition where we are supposed to store large files. Running dvc exp run works fine, but after completion dvc exp show does not show any metrics (and neither does dvc metrics show).

When outputting metrics to a folder on the same partition as the training script (/dev/mapper/system-home), dvc exp show works perfectly and shows metrics.

When using verbose mode, I get the following errors:

2022-06-08 17:49:08,234 DEBUG: [Errno 95] no more link types left to try out: [Errno 95] 'reflink' is not supported by <class 'dvc.fs.local.LocalFileSystem'>: [Errno 18] Invalid cross-device link
------------------------------------------------------------
Traceback (most recent call last):
  File "/home/hx/miniconda3/envs/origami/lib/python3.8/site-packages/dvc/fs/utils.py", line 28, in _link
    func(from_path, to_path)
  File "/home/hx/miniconda3/envs/origami/lib/python3.8/site-packages/dvc/fs/base.py", line 263, in reflink
    return self.fs.reflink(from_info, to_info)
  File "/home/hx/miniconda3/envs/origami/lib/python3.8/site-packages/dvc/fs/local.py", line 156, in reflink
    return System.reflink(path1, path2)
  File "/home/hx/miniconda3/envs/origami/lib/python3.8/site-packages/dvc/system.py", line 112, in reflink
    System._reflink_linux(source, link_name)
  File "/home/hx/miniconda3/envs/origami/lib/python3.8/site-packages/dvc/system.py", line 96, in _reflink_linux
    fcntl.ioctl(d.fileno(), FICLONE, s.fileno())
OSError: [Errno 18] Invalid cross-device link

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/hx/miniconda3/envs/origami/lib/python3.8/site-packages/dvc/fs/utils.py", line 69, in _try_links
    return _link(link, from_fs, from_path, to_fs, to_path)
  File "/home/hx/miniconda3/envs/origami/lib/python3.8/site-packages/dvc/fs/utils.py", line 32, in _link
    raise OSError(
OSError: [Errno 95] 'reflink' is not supported by <class 'dvc.fs.local.LocalFileSystem'>

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/hx/miniconda3/envs/origami/lib/python3.8/site-packages/dvc/fs/utils.py", line 124, in _test_link
    _try_links([link], from_fs, from_file, to_fs, to_file)
  File "/home/hx/miniconda3/envs/origami/lib/python3.8/site-packages/dvc/fs/utils.py", line 77, in _try_links
    raise OSError(
OSError: [Errno 95] no more link types left to try out

The full traceback can be found here:
trace.Log

The "Invalid cross-device link" part seems to show that dvc cannot handle cross-device operations.
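For readers unfamiliar with Errno 18: the operating system raises it (as EXDEV) when a link is requested between two different filesystems. The traceback above comes from a reflink attempt; the sketch below uses a hardlink as a stand-in, since both kinds of link are refused across devices. This is not DVC's code. The function name link_or_copy and its link parameter are hypothetical and exist only for illustration.

```python
import errno
import os
import shutil


def link_or_copy(src, dst, link=os.link):
    """Try to link src to dst; fall back to a plain copy across devices.

    `link` is injectable (an assumption of this sketch) so the
    cross-device case can be exercised without a second partition.
    """
    try:
        link(src, dst)
        return "link"
    except OSError as exc:
        # errno.EXDEV is the code behind "Invalid cross-device link".
        if exc.errno != errno.EXDEV:
            raise
        # Copying works regardless of which device dst lives on.
        shutil.copy2(src, dst)
        return "copy"
```

A fallback of this kind only explains the error message; it says nothing about whether this error is the reason the metrics are missing.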

Reproduce

  1. Create a default project on partition /sda1/foo1 that trains and evaluates a model and writes its metrics to another device, /sdb1/foo2:
# train.py on /sda1/foo1
from dvclive import Live

live = Live("/data/metrics")  # /data is mounted on /sdb1/foo2
# `epochs` and the metrics computation are elided in this report
for epoch in epochs:
    metrics = ...
    for metric_name, value in metrics.items():
        live.log(metric_name, value)
    live.next_step()

Example of /data/metrics.json:

{
    "step": 1,
    "loss": 0.7107148170471191,
    "directed_f1_weighed": 0.0,
    "undirected_f1_weighed": 0.0,
    "oriented_acc": 0.8346456692913385,
    "officical_f1_macro": 0.0
}

Example of /data/metrics/scalar/loss.tsv:

timestamp	step	loss
1654703111346	0	0.8031530231237411
1654703334339	1	0.7107148170471191
  2. dvc exp show doesn't show any metrics column.
    [two screenshots omitted]

Expected

dvc metrics show should display the metrics columns.

Environment information

Python 3.8.13

Description: Ubuntu 20.04.3 LTS
Release: 20.04
dvclive 0.8.2

Output of dvc doctor:

$ dvc doctor
DVC version: 2.10.2 (pip)
---------------------------------
Platform: Python 3.8.13 on Linux-5.4.0-91-generic-x86_64-with-glibc2.17
Supports:
        hdfs (fsspec = 2022.5.0, pyarrow = 3.0.0),
        webhdfs (fsspec = 2022.5.0),
        http (aiohttp = 3.8.1, aiohttp-retry = 2.4.6),
        https (aiohttp = 3.8.1, aiohttp-retry = 2.4.6),
        s3 (s3fs = 2022.5.0, boto3 = 1.21.21)
Cache types: hardlink, symlink
Cache directory: ext4 on /dev/mapper/system-home
Caches: local
Remotes: s3
Workspace directory: ext4 on /dev/mapper/system-home
Repo: dvc, git

Additional Information (if any):

I think the error comes from missing support for cross-device copying (see https://stackoverflow.com/questions/42392600/oserror-errno-18-invalid-cross-device-link). Do you have any ideas? Thanks for this nice piece of software 👍

@AlexandreRozier AlexandreRozier changed the title Experiment metrics are not reported when metric files are on another partition Experiment metrics are not reported when metric files are on another device than training code Jun 8, 2022
@daavoo
Contributor

daavoo commented Jun 9, 2022

Hi @AlexandreRozier! Is that the full traceback? Are there any other exceptions logged above the traceback you shared?

@daavoo daavoo added A: experiments Related to dvc exp diff/show bug Did we break something? labels Jun 9, 2022
@dtrifiro dtrifiro changed the title Experiment metrics are not reported when metric files are on another device than training code dvc exp run: experiment metrics are not reported when metric files are on another device than training code Jun 9, 2022
@AlexandreRozier
Author

@daavoo Thanks for your answer, I added the full traceback in the issue description. You can also read it here:
trace.Log

@daavoo
Contributor

daavoo commented Jun 10, 2022

Hi @AlexandreRozier. It looks like the errors in the exp run -v traceback (the one you shared in the previous comment) are just coming from #7865, so (I think) there is nothing to worry about. It looks like the experiment completed successfully.

Could you share the output of dvc exp show --json -vv?

@dfossl

dfossl commented Jun 20, 2022

Hi @AlexandreRozier

I also have had a similar issue, but all on the same machine, actually. When I run dvc exp show in the dvc/git root directory I get the metrics, however when I run it in a subdirectory I do not.

Rolling back to version 2.10.2 fixes this for me.

Hopefully this provides some additional context that can help narrow down the issue.

Let me know if you also want my logging information.

@AlexandreRozier
Author

@daavoo You might be right, the experiments are indeed completing successfully, but unfortunately they don't show up in dvc exp show. You'll find the output of dvc exp show --json -vv here:
dvc_exp_show_json.log
In particular, one can see that the metrics value is empty (line 1430).

@dfossl I'm also working on a single machine, but across 2 different partitions. And dvc exp show does not work at the root directory level either. But thanks for your feedback!

@pmrowla
Contributor

pmrowla commented Jun 27, 2022

@AlexandreRozier is your metrics path inside your DVC repo? Or is it a completely external path?

From your sample code:

# train.py on /sda1/foo1 
from dvclive import Live
live = Live( "/data/metrics") # /data mounted on /sdb1/foo2

Where is your DVC project located (i.e. where is the .dvc directory)?

It sounds like /data/metrics is outside of the DVC project in which case DVC will not track it as an output - all output paths must be inside the DVC project root directory. Note that the path is what matters here, not the physical device a file is stored on.

So something like this would be allowed:

cd /foo
dvc init
mkdir bar
mount /dev/sdb2 bar
# write and track outputs in /foo/bar/... on /dev/sdb2

Since in this case, /foo/bar is inside the DVC repository located at /foo
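The rule described above (what matters is the path, not the device) can be sketched as a purely lexical check. This is only an illustration under stated assumptions: is_inside_repo is a hypothetical helper, not a DVC function, and it compares absolute paths without resolving symlinks, which may differ from what DVC actually does.

```python
import os


def is_inside_repo(path, repo_root):
    """Return True if `path` is lexically inside `repo_root` (or equal to it)."""
    path = os.path.abspath(path)
    repo_root = os.path.abspath(repo_root)
    # commonpath compares whole path components, so "/foobar" is not
    # mistaken for a child of "/foo".
    return os.path.commonpath([path, repo_root]) == repo_root


# /foo/bar may be a mount point for another device; the check does not care.
print(is_inside_repo("/foo/bar/metrics", "/foo"))  # True
# /data/metrics is outside a project rooted at /foo.
print(is_inside_repo("/data/metrics", "/foo"))     # False
```

No device information (such as st_dev) is consulted at any point, which mirrors the statement that a mounted directory inside the project is acceptable.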

@pmrowla
Contributor

pmrowla commented Jun 27, 2022

But rather than nesting mount points like that, I think what you may be looking for is to configure dvc cache dir so that it resides on /dev/sdb1, and then configure your DVC repository (on /dev/sda1) to use symbolic links. Then you would set your metrics path so that it is inside the DVC repository. The end result would be that all of your output/metrics data would be stored in /dev/sdb1, and your repository would contain symlinks to those files.

see:
https://dvc.org/doc/command-reference/cache/dir#cache-dir
https://dvc.org/doc/user-guide/large-dataset-optimization
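The resulting layout (data physically stored in a cache directory, symlinks left in the repository) can be illustrated with a small sketch. It is not how DVC implements its cache: DVC names cached files by content hash and is configured through dvc cache dir and the cache.type setting, as the linked docs describe. The helper move_to_cache_and_link is hypothetical and keeps the original file name for simplicity.

```python
import os
import shutil


def move_to_cache_and_link(workspace_file, cache_dir):
    """Move a file into `cache_dir` and leave a symlink at its old location."""
    cache_dir = os.path.abspath(cache_dir)
    os.makedirs(cache_dir, exist_ok=True)
    cached = os.path.join(cache_dir, os.path.basename(workspace_file))
    # shutil.move falls back to copy-and-delete when the destination is on
    # another filesystem, so the cache may live on a different device.
    shutil.move(workspace_file, cached)
    # Symlinks, unlike hardlinks and reflinks, may point across devices.
    os.symlink(cached, workspace_file)
    return cached
```

After the call, reading the workspace path still works, while the bytes live under the cache directory, for example on the large partition.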

@AlexandreRozier
Author

@pmrowla Thanks, I had not understood that all output paths had to be in the dvc repo.
For the sake of clarity:

  • The DVC project is located at /home/hx/ORIGAMI/tmp (on sda5)
  • The metrics are stored in /data/... (on sdb1)

Mounting /sdb1 somewhere in the DVC project is not possible since other users also work on this server, but I'll take a look at the symlinking approach you're advising.

In conclusion, it seems that my issue originates in the lack of support for output file paths outside of the DVC repo, and can be closed if that's a DVC design choice :)
Thanks again for your help.

@pmrowla
Contributor

pmrowla commented Jun 27, 2022

Right, normally the output paths are validated in commands like dvc add or dvc stage add and DVC will error out if the output is outside the repo. We probably need to add similar checks in dvclive

cc @daavoo

@daavoo
Contributor

daavoo commented Jun 27, 2022

> Right, normally the output paths are validated in commands like dvc add or dvc stage add and DVC will error out if the output is outside the repo. We probably need to add similar checks in dvclive
>
> cc @daavoo

Not sure if this belongs in DVCLive though. I think that it might be better if dvc repro / dvc exp run do a similar check to the one done in dvc stage add.

Problem is that dvc stage add just relies on the --external flag and dvc repro would need to actually check the cache config

@pmrowla
Contributor

pmrowla commented Jun 28, 2022

I'm not up to date on when/how the stage live output section is configured/injected into DVC, but whenever that happens we need to verify the path with

def check_stage_path(repo, path, is_wdir=False):

> Problem is that dvc stage add just relies on the --external flag and dvc repro would need to actually check the cache config

It seems to me that it shouldn't matter whether or not the output path is cached? Even uncached outputs still have to be inside the repo. Is there an actual use case where dvclive outputs would need to be external?

@daavoo
Contributor

daavoo commented Jun 28, 2022

> I'm not up to date on when/how the stage live output section is configured/injected into DVC, but whenever that happens we need to verify the path with
>
> def check_stage_path(repo, path, is_wdir=False):

We currently promote using --metrics and --plots flags instead of --live.
Regardless, running dvc stage add --live /external/path already raises a StageExternalOutputsError.

> Problem is that dvc stage add just relies on the --external flag and dvc repro would need to actually check the cache config

> It seems to me that it shouldn't matter whether or not the output path is cached? Even uncached outputs still have to be inside the repo. Is there an actual use case where dvclive outputs would need to be external?

What I mean is that dvc stage add raises an error depending on the --external flag. However, the --external flag doesn't persist the option anywhere in dvc.yaml nor .dvc/config.

If someone manually creates/edits the dvc.yaml to use an external path, dvc repro doesn't raise an error, and I was saying that in order to raise we would need to check the .dvc/config to verify the step in https://dvc.org/doc/user-guide/managing-external-data#setting-up-an-external-cache .
For non-cache outputs we wouldn't need to check the cache config.

@daavoo daavoo added this to DVC Jun 28, 2022
@daavoo daavoo moved this to Backlog in DVC Jun 28, 2022
@pmrowla
Contributor

pmrowla commented Jun 28, 2022

> If someone manually creates/edits the dvc.yaml to use an external path, dvc repro doesn't raise an error, and I was saying that in order to raise we would need to check the .dvc/config to verify the step in https://dvc.org/doc/user-guide/managing-external-data#setting-up-an-external-cache .

This isn't an error case though. External outputs with local fs paths are valid, and in this case there is no separate cache to configure. Local external outs still use the regular local cache (which defaults to .dvc/cache), so there isn't any additional config setting we can check to determine whether or not the user intended for the output to be external.

That's why I think that we need an explicit check for dvclive outputs. Assuming we do not want to support external outputs for dvclive metrics/plots, when a user runs Live("/foo/bar") we can explicitly error out here if the path is external. Whereas internally in DVC at stage runtime, it is much more difficult for us to determine whether or not /foo/bar is intentionally supposed to be an external output directory.
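The explicit check proposed here could look roughly like the sketch below. It is an assumption-laden illustration, not DVCLive's or DVC's actual code: find_repo_root, check_live_path and ExternalLivePathError are hypothetical names, the project root is located by looking for a .dvc directory, and paths are compared lexically without resolving symlinks.

```python
import os


class ExternalLivePathError(ValueError):
    """Hypothetical error for a live path outside the DVC project."""


def find_repo_root(start):
    """Walk upwards from `start` until a directory containing .dvc is found."""
    current = os.path.abspath(start)
    while True:
        if os.path.isdir(os.path.join(current, ".dvc")):
            return current
        parent = os.path.dirname(current)
        if parent == current:  # reached the filesystem root
            return None
        current = parent


def check_live_path(path, cwd="."):
    """Raise if `path` would be written outside the enclosing DVC project."""
    root = find_repo_root(cwd)
    if root is None:
        return  # not inside a DVC project, so there is nothing to validate
    # Relative paths are interpreted against cwd; absolute paths win.
    target = os.path.abspath(os.path.join(cwd, path))
    if os.path.commonpath([target, root]) != root:
        raise ExternalLivePathError(
            f"{target!r} is outside the DVC project at {root!r}"
        )
```

With such a check, a call equivalent to Live("/foo/bar") from inside a project rooted elsewhere would fail immediately instead of silently producing metrics that dvc exp show never reports.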

@skshetry skshetry added the p3-nice-to-have It should be done this or next sprint label Jun 28, 2022
@skshetry skshetry removed this from DVC Jun 28, 2022
@skshetry
Collaborator

skshetry commented Jun 28, 2022

This will be fixed when we work on #3920. Closing in favour of that issue.

@skshetry skshetry closed this as not planned Jun 28, 2022

5 participants