exp show: lockless issues #7693

@mattseddon

Bug Report

Description

After the recent changes to make exp show lockless, I am seeing intermittent issues with the data returned during experiment runs.

The issues that I have seen so far are as follows:

Running checkpoint experiments from the queue

  1. exp show fails when running experiments from the queue with dulwich.porcelain.DivergedBranches errors.
  2. @pmrowla tried to patch the above but the change led to this behaviour.

Please take a look at the above behaviour and let me know what you think. I do not anticipate there being an easy fix. IMO we should consider "dropping support" for live tracking of experiments run from the queue until the DVC mechanics have been updated.

Running checkpoint experiment in the workspace

  1. exp show returns a single dict for a set of checkpoints during an experiment. This happens intermittently and breaks our "live experiment" tracking.
  2. exp show shows running experiments as not running mid-run.
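Both workspace symptoms above can be checked mechanically by comparing consecutive snapshots of the exp show JSON output. The following is a minimal sketch, assuming each snapshot maps a baseline revision to a mapping of experiment revisions, where each entry has a `data` dict that may carry a `running` flag. That layout and the helper names `summarize` and `find_regressions` are assumptions for illustration, not a documented schema.

```python
from typing import Any, Dict, List

# Assumed shape: {baseline_rev: {exp_rev: {"data": {"running": bool, ...}}}}
Snapshot = Dict[str, Dict[str, Any]]


def summarize(snapshot: Snapshot) -> Dict[str, Dict[str, int]]:
    """Count all entries and running entries under each baseline revision."""
    summary = {}
    for baseline, entries in snapshot.items():
        running = sum(
            1
            for entry in entries.values()
            if isinstance(entry, dict) and (entry.get("data") or {}).get("running")
        )
        summary[baseline] = {"entries": len(entries), "running": running}
    return summary


def find_regressions(prev: Snapshot, curr: Snapshot) -> List[str]:
    """Report baselines whose entry count shrank or whose running flag vanished."""
    before, after = summarize(prev), summarize(curr)
    problems = []
    for baseline, old in before.items():
        new = after.get(baseline)
        if new is None:
            problems.append(f"{baseline}: missing from later snapshot")
            continue
        if new["entries"] < old["entries"]:
            problems.append(
                f"{baseline}: entries dropped {old['entries']} -> {new['entries']}"
            )
        if old["running"] and not new["running"]:
            problems.append(f"{baseline}: running flag disappeared")
    return problems
```

A running flag that disappears is only a symptom while the experiment is still in progress, so any report from `find_regressions` needs to be read against what the experiment was actually doing at the time.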

Reproduce

Run a checkpoint experiment and monitor the output of exp show.
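One way to monitor the output while the experiment runs is a small polling loop. This is a rough sketch, assuming `dvc exp show --show-json` is the JSON form of the command in this DVC version. The helper names `run_exp_show` and `poll`, the interval, and the poll count are hypothetical.

```python
import json
import subprocess
import time
from typing import Any, Callable, Iterator, Sequence

# Assumption: this flag produces JSON output for the installed DVC version.
DVC_CMD = ("dvc", "exp", "show", "--show-json")


def run_exp_show(cmd: Sequence[str] = DVC_CMD) -> str:
    """Run the command and return stdout; raises CalledProcessError on failure."""
    return subprocess.run(cmd, check=True, capture_output=True, text=True).stdout


def poll(
    fetch: Callable[[], str] = run_exp_show,
    interval: float = 1.0,
    count: int = 10,
) -> Iterator[Any]:
    """Yield the parsed output `count` times; yield the exception if a call fails."""
    for i in range(count):
        try:
            yield json.loads(fetch())
        except (subprocess.CalledProcessError, json.JSONDecodeError) as exc:
            # Keep polling so intermittent failures show up in the sequence.
            yield exc
        if i < count - 1:
            time.sleep(interval)
```

Run from the repository root in a second terminal while the checkpoint experiment is in progress, iterating over `poll()` gives a sequence of snapshots and errors that can be compared side by side.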

Expected

Environment information

Output of dvc doctor:

$ dvc doctor
DVC version: 2.10.2 (pip)
---------------------------------
Platform: Python 3.9.9 on macOS-12.3.1-x86_64-i386-64bit
Supports:
        webhdfs (fsspec = 2022.3.0),
        http (aiohttp = 3.8.1, aiohttp-retry = 2.4.6),
        https (aiohttp = 3.8.1, aiohttp-retry = 2.4.6)
Cache types: reflink, hardlink, symlink
Cache directory: apfs on /dev/disk1s5s1
Caches: local
Remotes: https
Workspace directory: apfs on /dev/disk1s5s1
Repo: dvc (subdir), git

Additional Information (if any):

I'll continue to add issues here as I find them. I have already discussed this with @pmrowla, but wanted to raise visibility by opening an issue to discuss possible mitigation and the priority for fixes.

Thanks

Labels

  1. A: experiments (Related to dvc exp)
  2. discussion (requires active participation to reach a conclusion)
  3. p1-important (Important, aka current backlog of things to do)
  4. product: VSCode (Integration with VSCode extension)

Projects

Status

Done
