Skip to content

output: use strings instead of PathInfo for performance reasons #3663

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Apr 23, 2020

Conversation

pmrowla
Copy link
Contributor

@pmrowla pmrowla commented Apr 22, 2020

  • ❗ I have followed the Contributing to DVC checklist.

  • πŸ“– If this PR requires documentation updates, I have created a separate PR (or issue, at least) in dvc.org and linked it here. If the CLI API is changed, I have updated tab completion scripts.

  • ❌ I will check DeepSource, CodeClimate, and other sanity checks below. (We consider them recommendatory and don't expect everything to be addressed. Please fix things that actually improve code or fix bugs.)

Thank you for the contribution - we'll try to review it as soon as possible. πŸ™

Partial fix for #3635.

@pmrowla pmrowla added the performance improvement over resource / time consuming tasks label Apr 22, 2020
@pmrowla pmrowla self-assigned this Apr 22, 2020
@pmrowla
Copy link
Contributor Author

pmrowla commented Apr 22, 2020

The optimization in OutputBase._collect_used_dir_cache improves the _cloud_status runtime from 938s to 337s in cProfile, for an S3 remote with 2M files.

There are still several places we can improve in RemoteBASE and RemoteLOCAL, since remote.checksum_to_path_info is slow when dealing with large numbers of files. The issue here would be that it looks like it may end up being a fairly substantial refactoring, since optimizing where pathinfo is currently used in RemoteLOCAL.status ends up affecting several other places in the upload/download pipeline which currently expect pathinfos. In testing, I was able to get _cloud_status runtime down to 221s (but with broken push/pull).

@efiop, should I continue with this or do we want to just leave it at only the used_dir_cache change for now?

@efiop
Copy link
Contributor

efiop commented Apr 22, 2020

@pmrowla Your call. I would be perfectly happy to merge this first though. Astonishing results!

@efiop
Copy link
Contributor

efiop commented Apr 22, 2020

@pmrowla About checksum_to_path_info. We used to have checksum_to_path not that long ago, so maybe we need to go back and just convert to PathInfo where we really need it. Or we could have both checksum_to_path and checksum_to_path_info and use where appropriate, not sure.

@pmrowla
Copy link
Contributor Author

pmrowla commented Apr 22, 2020

@efiop in that case, once I fix whatever is failing CI in windows I'll mark this PR as reviewable/mergeable. We can leave the original issue open and I'll keep looking into optimizing checksum_to_path_info

@pmrowla pmrowla changed the title [WIP] remote: use strings instead of PathInfo for performance reasons output: use strings instead of PathInfo for performance reasons Apr 23, 2020
@pmrowla pmrowla requested a review from efiop April 23, 2020 06:29
Copy link
Contributor

@efiop efiop left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @pmrowla !

@efiop efiop merged commit a2de48b into iterative:master Apr 23, 2020
@pmrowla pmrowla deleted the 3635-pathinfo branch April 23, 2020 11:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance improvement over resource / time consuming tasks
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants