Make data relocation at the end of workflow runs faster #850

Merged
mr-c merged 6 commits into common-workflow-language:master from the oswalk branch on Sep 3, 2018

Conversation

psafont
Contributor

@psafont psafont commented Aug 1, 2018

  • When moving files, only calculate the realpaths of symlinks instead of
    doing it for every file in the destination directory
  • Avoid calling stat on every file in the destination folder
  • Avoid calling stat on every folder in the destination folder
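
The idea in the list above can be sketched roughly as follows. This is a minimal illustration, not the code from this pull request: `find_symlinks` and `resolve_links` are hypothetical helper names, and it assumes Python 3.5+ where `os.scandir` is part of the standard library. The `DirEntry` methods can usually answer `is_symlink()` and `is_dir()` from the directory listing itself, so `os.path.realpath` is only called for entries that are actually links.

```python
import os


def find_symlinks(directory):
    """Yield the path of every symlink under `directory` (hypothetical helper)."""
    for entry in os.scandir(directory):
        if entry.is_symlink():
            yield entry.path
        elif entry.is_dir(follow_symlinks=False):
            # Recurse into real directories only; linked directories
            # were already reported as symlinks above.
            yield from find_symlinks(entry.path)


def resolve_links(directory):
    """Map each symlink under `directory` to its real path."""
    return {link: os.path.realpath(link) for link in find_symlinks(directory)}
```

Regular files are never passed to `os.path.realpath` here, which is the saving the first bullet describes.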

@psafont psafont changed the title from "WIP: performance: make data relocation faster" to "WIP: make data relocation faster at the end of the workflow" Aug 1, 2018

codecov bot commented Aug 1, 2018

Codecov Report

Merging #850 into master will decrease coverage by 1.1%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master     #850      +/-   ##
==========================================
- Coverage   73.82%   72.71%   -1.11%     
==========================================
  Files          34       34              
  Lines        6429     6429              
  Branches     1626     1626              
==========================================
- Hits         4746     4675      -71     
- Misses       1231     1300      +69     
- Partials      452      454       +2

Impacted Files                 Coverage Δ
cwltool/sandboxjs.py           55.12% <0%> (-24.36%) ⬇️
cwltool/pathmapper.py          86.88% <0%> (-1.1%) ⬇️
cwltool/process.py             82.6% <0%> (-0.84%) ⬇️
cwltool/command_line_tool.py   87.22% <0%> (-0.82%) ⬇️
cwltool/singularity.py         46.97% <0%> (-0.68%) ⬇️
cwltool/job.py                 69.85% <0%> (-0.6%) ⬇️
cwltool/provenance.py          78.83% <0%> (-0.25%) ⬇️
cwltool/docker.py              48.11% <0%> (+0.83%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update ffbe75d...086b7ea.

outdir = (runtimeContext.outdir or
          tempfile.mkdtemp(prefix=getdefault(runtimeContext.tmp_outdir_prefix,
                                             DEFAULT_TMP_PREFIX)))
outdir = fs_access.realpath(outdir)
Member

Why? I wouldn't call fs_access.realpath on the output of tempfile.mkdtemp..

Contributor Author

I did not mean to change the behaviour here, just the formatting. Having said that, my understanding was that fs_access.realpath was already being called on the output of tempfile.mkdtemp. Did I misunderstand the statement?

Member

Sorry, missed that this was a pure reformat. I'll investigate further, elsewhere.

Contributor Author

psafont commented Aug 3, 2018

I'm happy with the performance of the code right now, although it is still scanning the destination folder. I've tried to simplify the code so that further changes are easier to make if they are needed (for example, saving all the symlinks in the outputs and then changing only those, without walking every folder in the output folder).

The only thing I'm not sure about is whether I declared scandir correctly in the dependencies, since it's not needed on Python 3.5+.

I've not replaced the os.walk in pathmapper, since I don't think that adding code for an extra library to make it slightly faster would give a tangible performance improvement.
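
On the dependency question, one common pattern is sketched below. It is an assumption about how this could be wired up, not a copy of what this pull request does. The `scandir` backport on PyPI exposes the same `scandir()` function, so the import can fall back to it when `os.scandir` is missing, and a PEP 508 environment marker limits the requirement to older interpreters. The `install_requires` list is illustrative only.

```python
# Prefer the standard-library implementation (Python 3.5+); fall back to the
# PyPI backport of the same name on older interpreters.
try:
    from os import scandir
except ImportError:
    from scandir import scandir  # needs the third-party 'scandir' package

# Illustrative setup.py fragment: the marker after ';' asks pip to install
# the backport only where the standard library lacks os.scandir.
install_requires = [
    'scandir; python_version < "3.5"',
]
```

With this layout the rest of the module can call `scandir(path)` without caring which implementation was imported.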

mr-c previously requested changes Aug 3, 2018
@@ -287,20 +279,38 @@ def relocateOutputs(outputObj, # type: Union[Dict[Text, Any],List[Di
     if action not in ("move", "copy"):
         return outputObj

-    def moveIt(src, dst):
+    def _collectDirEntries(obj):
+        # type: (Union[Dict[Text, Any], List[Dict[Text, Any]]], List[Dict[Text, Any]]) -> Iterator[Dict[Text, Any]]
Member

error: Type signature has too many arguments

Contributor Author

Fixed! I also tried to fix the type problems caused by consolidating everything into a single temporary variable and then processing its contents, but couldn't make it work.
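
For context, mypy reports "Type signature has too many arguments" when a type comment lists more argument types than the function has parameters. The sketch below shows a one-parameter generator whose comment lists exactly one argument type. The function body is a simplified assumption written for illustration (the name `collect_dir_entries` and the filter on "Directory" are hypothetical), not the code from this pull request.

```python
from typing import Any, Dict, Iterator, List, Text, Union


def collect_dir_entries(obj):
    # type: (Union[Dict[Text, Any], List[Dict[Text, Any]]]) -> Iterator[Dict[Text, Any]]
    """Yield every Directory object nested in `obj` (simplified sketch)."""
    if isinstance(obj, dict):
        if obj.get("class") == "Directory":
            yield obj
        else:
            # Not a Directory itself: look inside its values.
            for value in obj.values():
                for entry in collect_dir_entries(value):
                    yield entry
    elif isinstance(obj, list):
        for item in obj:
            for entry in collect_dir_entries(item):
                yield entry
```

The parenthesised part of the comment has one entry for the single parameter `obj`; the text after `->` is the return type.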

Contributor Author

psafont commented Aug 3, 2018

The two failures might have been because I used the wrong variable in one specific place (strange that the normal tests do not complain).

cwltool/process.py:13: error: Module 'os' has no attribute 'scandir'
cwltool/process.py:15: error: Cannot find module named 'scandir'

Not sure why this happens in Travis' Python 3.5 when it is fine on Python 3.6+.

Member

mr-c commented Aug 3, 2018

"Not sure why this happens in Travis' Python 3.5 when it is fine on Python 3.6+"

We run mypy to do the type checking using Python 3.5 (in both py3 and py2 mode), as the results are presumed to be the same from any version. Perhaps we should check the types in each of py{3.5,6,7}?

The error

cwltool/process.py:13: error: Module 'os' has no attribute 'scandir'
cwltool/process.py:15: error: Cannot find module named 'scandir'

is due to the lack of type annotations for scandir.

Contributor Author

psafont commented Aug 6, 2018

Going back to the more complex flow with an inlined function makes it work. I tried to make the behaviour clearer by removing an indentation level.

Something is fishy, though: if the requested action is 'move', why are the files copied instead of actually being moved?

@psafont psafont force-pushed the oswalk branch 2 times, most recently from e4f3fca to 7dd073f Compare August 7, 2018 09:16
@psafont psafont changed the title from "WIP: make data relocation faster at the end of the workflow" to "Make data relocation at the end of workflow runs faster" Aug 7, 2018
@psafont psafont force-pushed the oswalk branch 2 times, most recently from e0f649a to 056d910 Compare August 7, 2018 09:43
Contributor Author

psafont commented Aug 7, 2018

Looking at the coverage report, the most complex parts of the function are not being exercised: mainly the folder merging in the relocate subfunction and the relinking of symlinks at the end of the process.

Is there any case where this functionality may be useful?
If there is, it may be worth adding it to the conformance tests, since it seems to be a rare corner case.
If there isn't, it may be worth rescoping the purpose of the function and the surrounding code to simplify it.

The code seems related to #317.

Member

tetron commented Aug 23, 2018

@psafont could you add some unit tests? These functions are pretty self-contained, so it should be straightforward to write unit tests for them. The solution to not enough coverage for the tricky bits is to improve the coverage :-)
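
A minimal sketch of what such a unit test could look like for the folder-merging case, in pytest style. `merge_move` is a hypothetical stand-in written for this example and is not cwltool's relocation function; the real tests would call the project's own code instead.

```python
import os
import shutil
import tempfile


def merge_move(src, dst):
    """Move `src` to `dst`, merging into `dst` if it is an existing directory.

    Hypothetical stand-in for the relocation logic; not cwltool's function.
    """
    if os.path.isdir(src) and os.path.isdir(dst):
        for name in os.listdir(src):
            merge_move(os.path.join(src, name), os.path.join(dst, name))
        os.rmdir(src)  # everything was moved out, so src is empty now
    else:
        shutil.move(src, dst)


def test_merge_move_merges_directories():
    root = tempfile.mkdtemp()
    src = os.path.join(root, "src")
    dst = os.path.join(root, "dst")
    os.makedirs(os.path.join(src, "shared"))
    os.makedirs(os.path.join(dst, "shared"))
    open(os.path.join(src, "shared", "new.txt"), "w").close()
    open(os.path.join(dst, "shared", "old.txt"), "w").close()

    merge_move(src, dst)

    # Files from both trees end up side by side, and the source is gone.
    assert sorted(os.listdir(os.path.join(dst, "shared"))) == ["new.txt", "old.txt"]
    assert not os.path.exists(src)
    shutil.rmtree(root)
```

A similar test with a symlink in the source tree would cover the relinking step mentioned earlier in the thread.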

Member

@tetron tetron left a comment

This seems like a good opportunity to add tests.

Contributor Author

psafont commented Aug 23, 2018

Since this seems to be part of an extension, I'll add tests to tests/test_ext.py.

@psafont psafont force-pushed the oswalk branch 5 times, most recently from 534644d to 5e4660c Compare August 24, 2018 13:03
@psafont psafont force-pushed the oswalk branch 6 times, most recently from bc80d40 to 4bdf862 Compare August 29, 2018 08:18
Contributor Author

psafont commented Aug 29, 2018

After re-enabling the tests on Windows with Docker, there are two types of failures, aside from the transient SSH ones:

  • Problems overwriting a file: in test_write_write_conflict[log] the file doesn't get updated, and in test_double_overwrite[log] it only gets updated once.
  • Problems creating folders in test_disable_dir_overwrite_without_ext:
Error collecting output for parameter 'out':
C:jenkins\workspace\cwltool_PR-850-5IQF7Z2DEV6462R5RK5SND5K4PBWDJ6EGRCQUCWDYNLCK5KY67ZA\tests\wf\updatedir.cwl:13:3: Traceback (most recent call last):
C:jenkins\workspace\cwltool_PR-850-5IQF7Z2DEV6462R5RK5SND5K4PBWDJ6EGRCQUCWDYNLCK5KY67ZA\tests\wf\updatedir.cwl:13:3: 
C:jenkins\workspace\cwltool_PR-850-5IQF7Z2DEV6462R5RK5SND5K4PBWDJ6EGRCQUCWDYNLCK5KY67ZA\tests\wf\updatedir.cwl:13:3:   File "c:\jenkins\workspace\cwltool_PR-850-5IQF7Z2DEV6462R5RK5SND5K4PBWDJ6EGRCQUCWDYNLCK5KY67ZA\cwltool\command_line_tool.py", line 601, in collect_output_ports
C:jenkins\workspace\cwltool_PR-850-5IQF7Z2DEV6462R5RK5SND5K4PBWDJ6EGRCQUCWDYNLCK5KY67ZA\tests\wf\updatedir.cwl:13:3:     compute_checksum=compute_checksum)
C:jenkins\workspace\cwltool_PR-850-5IQF7Z2DEV6462R5RK5SND5K4PBWDJ6EGRCQUCWDYNLCK5KY67ZA\tests\wf\updatedir.cwl:13:3: 
C:jenkins\workspace\cwltool_PR-850-5IQF7Z2DEV6462R5RK5SND5K4PBWDJ6EGRCQUCWDYNLCK5KY67ZA\tests\wf\updatedir.cwl:13:3:   File "c:\jenkins\workspace\cwltool_PR-850-5IQF7Z2DEV6462R5RK5SND5K4PBWDJ6EGRCQUCWDYNLCK5KY67ZA\cwltool\command_line_tool.py", line 766, in collect_output
C:jenkins\workspace\cwltool_PR-850-5IQF7Z2DEV6462R5RK5SND5K4PBWDJ6EGRCQUCWDYNLCK5KY67ZA\tests\wf\updatedir.cwl:13:3:     adjustFileObjs(r, revmap)
C:jenkins\workspace\cwltool_PR-850-5IQF7Z2DEV6462R5RK5SND5K4PBWDJ6EGRCQUCWDYNLCK5KY67ZA\tests\wf\updatedir.cwl:13:3: 
C:jenkins\workspace\cwltool_PR-850-5IQF7Z2DEV6462R5RK5SND5K4PBWDJ6EGRCQUCWDYNLCK5KY67ZA\tests\wf\updatedir.cwl:13:3:   File "c:\jenkins\workspace\cwltool_PR-850-5IQF7Z2DEV6462R5RK5SND5K4PBWDJ6EGRCQUCWDYNLCK5KY67ZA\cwltool\pathmapper.py", line 46, in adjustFileObjs
C:jenkins\workspace\cwltool_PR-850-5IQF7Z2DEV6462R5RK5SND5K4PBWDJ6EGRCQUCWDYNLCK5KY67ZA\tests\wf\updatedir.cwl:13:3:     visit_class(rec, ("File",), op)
C:jenkins\workspace\cwltool_PR-850-5IQF7Z2DEV6462R5RK5SND5K4PBWDJ6EGRCQUCWDYNLCK5KY67ZA\tests\wf\updatedir.cwl:13:3: 
C:jenkins\workspace\cwltool_PR-850-5IQF7Z2DEV6462R5RK5SND5K4PBWDJ6EGRCQUCWDYNLCK5KY67ZA\tests\wf\updatedir.cwl:13:3:   File "c:\jenkins\workspace\cwltool_PR-850-5IQF7Z2DEV6462R5RK5SND5K4PBWDJ6EGRCQUCWDYNLCK5KY67ZA\cwltool\utils.py", line 210, in visit_class
C:jenkins\workspace\cwltool_PR-850-5IQF7Z2DEV6462R5RK5SND5K4PBWDJ6EGRCQUCWDYNLCK5KY67ZA\tests\wf\updatedir.cwl:13:3:     visit_class(rec[d], cls, op)
C:jenkins\workspace\cwltool_PR-850-5IQF7Z2DEV6462R5RK5SND5K4PBWDJ6EGRCQUCWDYNLCK5KY67ZA\tests\wf\updatedir.cwl:13:3: 
C:jenkins\workspace\cwltool_PR-850-5IQF7Z2DEV6462R5RK5SND5K4PBWDJ6EGRCQUCWDYNLCK5KY67ZA\tests\wf\updatedir.cwl:13:3:   File "c:\jenkins\workspace\cwltool_PR-850-5IQF7Z2DEV6462R5RK5SND5K4PBWDJ6EGRCQUCWDYNLCK5KY67ZA\cwltool\utils.py", line 213, in visit_class
C:jenkins\workspace\cwltool_PR-850-5IQF7Z2DEV6462R5RK5SND5K4PBWDJ6EGRCQUCWDYNLCK5KY67ZA\tests\wf\updatedir.cwl:13:3:     visit_class(d, cls, op)
C:jenkins\workspace\cwltool_PR-850-5IQF7Z2DEV6462R5RK5SND5K4PBWDJ6EGRCQUCWDYNLCK5KY67ZA\tests\wf\updatedir.cwl:13:3: 
C:jenkins\workspace\cwltool_PR-850-5IQF7Z2DEV6462R5RK5SND5K4PBWDJ6EGRCQUCWDYNLCK5KY67ZA\tests\wf\updatedir.cwl:13:3:   File "c:\jenkins\workspace\cwltool_PR-850-5IQF7Z2DEV6462R5RK5SND5K4PBWDJ6EGRCQUCWDYNLCK5KY67ZA\cwltool\utils.py", line 208, in visit_class
C:jenkins\workspace\cwltool_PR-850-5IQF7Z2DEV6462R5RK5SND5K4PBWDJ6EGRCQUCWDYNLCK5KY67ZA\tests\wf\updatedir.cwl:13:3:     op(rec)
C:jenkins\workspace\cwltool_PR-850-5IQF7Z2DEV6462R5RK5SND5K4PBWDJ6EGRCQUCWDYNLCK5KY67ZA\tests\wf\updatedir.cwl:13:3: 
C:jenkins\workspace\cwltool_PR-850-5IQF7Z2DEV6462R5RK5SND5K4PBWDJ6EGRCQUCWDYNLCK5KY67ZA\tests\wf\updatedir.cwl:13:3:   File "c:\jenkins\workspace\cwltool_PR-850-5IQF7Z2DEV6462R5RK5SND5K4PBWDJ6EGRCQUCWDYNLCK5KY67ZA\cwltool\command_line_tool.py", line 165, in revmap_file
C:jenkins\workspace\cwltool_PR-850-5IQF7Z2DEV6462R5RK5SND5K4PBWDJ6EGRCQUCWDYNLCK5KY67ZA\tests\wf\updatedir.cwl:13:3:     u"file pass through." % (path, builder.outdir))
C:jenkins\workspace\cwltool_PR-850-5IQF7Z2DEV6462R5RK5SND5K4PBWDJ6EGRCQUCWDYNLCK5KY67ZA\tests\wf\updatedir.cwl:13:3: 
C:jenkins\workspace\cwltool_PR-850-5IQF7Z2DEV6462R5RK5SND5K4PBWDJ6EGRCQUCWDYNLCK5KY67ZA\tests\wf\updatedir.cwl:13:3: cwltool.errors.WorkflowException: Output file path C:/Users/Dan/AppData/Local/Temp/tmpwjra_6lk/inp/blurb must be within designated output directory (/var/spool/cwl) or an input file pass through.

Do we want to fix those in this merge request, or is it better to do so in a separate one, tracked with an issue? (All of them happen both before and after my changes.)

  • Change image for workflows needing python to make it compatible with
    Windows.
  • Changed style to pytest, as part of common-workflow-language#885
  • Uncomment all code, but still skip those tests
  • When moving files, only calculate realpaths of links instead of doing it
    for all the files in the destination directory
  • Also changed some names and reduced indentation levels
  • Remove two parameters that are always used with default values and the
    logic associated with dealing with non-default values

Member

mr-c commented Sep 3, 2018

@psafont you can create another issue/PR for those unrelated test failures. Thanks!

Member

mr-c commented Sep 3, 2018

@psafont Shall I merge this?

Contributor Author

psafont commented Sep 3, 2018

Go ahead!

@mr-c mr-c merged commit c27774b into common-workflow-language:master Sep 3, 2018
@psafont psafont deleted the oswalk branch September 3, 2018 08:28
3 participants