downloads remote inputs via HTTP(S) #466

Closed
mr-c opened this issue Jul 13, 2017 · 11 comments

Comments

@mr-c
Member

mr-c commented Jul 13, 2017

Expected Behavior

A URI should be accepted for inputs with type: File.

http://www.commonwl.org/v1.0/CommandLineTool.html#File

Actual Behavior

No such file or directory: '/home/michael/src/2017-cloud-workflows-misc/http://github.com/CancerCollaboratory/dockstore-tool-bamstats/raw/develop/rna.SRR948778.bam'
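
The path in the error message suggests the URL was treated as a relative local path and appended to the working directory. As a rough illustration only (this is not cwltool's code, and `is_remote` and `REMOTE_SCHEMES` are hypothetical names), a scheme check along these lines could separate remote locations from local paths before any path joining happens:

```python
from urllib.parse import urlsplit

# Assumption: only http and https locations count as remote inputs here.
REMOTE_SCHEMES = {"http", "https"}


def is_remote(location):
    """Return True when location is an http(s) URI rather than a local path."""
    return urlsplit(location).scheme.lower() in REMOTE_SCHEMES


if __name__ == "__main__":
    url = ("https://github.com/CancerCollaboratory/"
           "dockstore-tool-bamstats/raw/develop/rna.SRR948778.bam")
    print(is_remote(url))
    print(is_remote("rna.SRR948778.bam"))
```

Only locations for which the check is false would then be resolved against the base directory.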

Workflow Code

cwltool https://github.com/CancerCollaboratory/dockstore-tool-bamstats/raw/develop/Dockstore.cwl \
 --bam_input https://github.com/CancerCollaboratory/dockstore-tool-bamstats/raw/develop/rna.SRR948778.bam

Full Traceback

/home/michael/src/2017-cloud-workflows-misc/env/bin/cwltool 1.0.20170712193248
https://github.com/CancerCollaboratory/dockstore-tool-bamstats/raw/develop/Dockstore.cwl:3:1: unrecognized extension field `http://purl.org/dc/terms/creator`.  Did you include a $schemas section?
[job Dockstore.cwl] initializing from https://github.com/CancerCollaboratory/dockstore-tool-bamstats/raw/develop/Dockstore.cwl
[job Dockstore.cwl] {
    "bam_input": {
        "class": "File", 
        "location": "https://github.com/CancerCollaboratory/dockstore-tool-bamstats/raw/develop/rna.SRR948778.bam", 
        "basename": "rna.SRR948778.bam", 
        "nameroot": "rna.SRR948778", 
        "nameext": ".bam"
    }, 
    "mem_gb": 0
}
Got workflow error
Traceback (most recent call last):
  File "/home/michael/src/2017-cloud-workflows-misc/env/local/lib/python2.7/site-packages/cwltool/main.py", line 270, in single_job_executor
    for r in jobiter:
  File "/home/michael/src/2017-cloud-workflows-misc/env/local/lib/python2.7/site-packages/cwltool/draft2tool.py", line 323, in job
    builder.pathmapper = self.makePathMapper(reffiles, builder.stagedir, **make_path_mapper_kwargs)
  File "/home/michael/src/2017-cloud-workflows-misc/env/local/lib/python2.7/site-packages/cwltool/draft2tool.py", line 204, in makePathMapper
    return PathMapper(reffiles, kwargs["basedir"], stagedir)
  File "/home/michael/src/2017-cloud-workflows-misc/env/local/lib/python2.7/site-packages/cwltool/pathmapper.py", line 180, in __init__
    self.setup(dedup(referenced_files), basedir)
  File "/home/michael/src/2017-cloud-workflows-misc/env/local/lib/python2.7/site-packages/cwltool/pathmapper.py", line 228, in setup
    self.visit(fob, stagedir, basedir, copy=fob.get("writable"), staged=True)
  File "/home/michael/src/2017-cloud-workflows-misc/env/local/lib/python2.7/site-packages/cwltool/pathmapper.py", line 217, in visit
    self.visitlisting(obj.get("secondaryFiles", []), stagedir, basedir, copy=copy, staged=staged)
  File "/home/michael/src/2017-cloud-workflows-misc/env/local/lib/python2.7/site-packages/schema_salad/sourceline.py", line 152, in __exit__
    raise self.makeError(six.text_type(exc_value))
ValidationException: params.yaml:3:5: [Errno 2] No such file or directory: '/home/michael/src/2017-cloud-workflows-misc/http://github.com/CancerCollaboratory/dockstore-tool-bamstats/raw/develop/rna.SRR948778.bam'
Workflow error, try again with --debug for more information:
params.yaml:3:5: [Errno 2] No such file or directory:
                 '/home/michael/src/2017-cloud-workflows-misc/http://github.com/CancerCollaboratory/dockstore-tool-bamstats/raw/develop/rna.SRR948778.bam'
Traceback (most recent call last):
  File "/home/michael/src/2017-cloud-workflows-misc/env/local/lib/python2.7/site-packages/cwltool/main.py", line 886, in main
    **vars(args))
  File "/home/michael/src/2017-cloud-workflows-misc/env/local/lib/python2.7/site-packages/cwltool/main.py", line 285, in single_job_executor
    raise WorkflowException(Text(e))
WorkflowException: params.yaml:3:5: [Errno 2] No such file or directory: '/home/michael/src/2017-cloud-workflows-misc/http://github.com/CancerCollaboratory/dockstore-tool-bamstats/raw/develop/rna.SRR948778.bam'
mr-c added the bug label Jul 13, 2017
@tetron
Member

tetron commented Jul 13, 2017

Strictly speaking, this isn't a regression because it was never implemented for file inputs in the first place, only for document loading. But I agree it should fetch remote HTTP resources. The main challenge is that there are some caching issues to work out if you don't want to pull large inputs on every run.

@mr-c
Member Author

mr-c commented Jul 13, 2017

@tetron Thanks for the clarification. I thought it was implemented from the beginning.

mr-c changed the title from "regression: no longer downloads remote resources via HTTP(S)" to "downloads remote inputs via HTTP(S)" Jul 13, 2017
mr-c added the enhancement label and removed the bug label Jul 13, 2017
@denis-yuen
Member

denis-yuen commented Jul 13, 2017

re: file caching

possible inspiration? https://dockstore.org/docs/advanced-features#input-file-cache

@mr-c
Member Author

mr-c commented Jul 14, 2017

Thank you for the pointer, @denis-yuen. Yes, we should reuse cwltool's cachedir feature here.
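
To make the caching idea concrete, here is a minimal sketch of a download cache keyed by the URL. It is not cwltool's cachedir implementation: `cached_fetch` is a hypothetical helper, and the sketch assumes a URL's content never changes, so it does no invalidation (no ETag or Last-Modified checks).

```python
import hashlib
import os
import shutil
import tempfile
from urllib.request import urlopen


def cached_fetch(url, cachedir):
    """Download url into cachedir once; later calls reuse the cached copy."""
    os.makedirs(cachedir, exist_ok=True)
    # Key the cache entry on a hash of the URL (assumption: content is stable).
    key = hashlib.sha256(url.encode("utf-8")).hexdigest()
    target = os.path.join(cachedir, key)
    if os.path.exists(target):
        return target
    # Download to a temporary file in the same directory, then rename it,
    # so an interrupted download is never mistaken for a cache hit.
    fd, tmp = tempfile.mkstemp(dir=cachedir)
    try:
        with os.fdopen(fd, "wb") as out, urlopen(url) as resp:
            shutil.copyfileobj(resp, out)  # streams in chunks, not all in memory
        os.replace(tmp, target)
    except BaseException:
        os.unlink(tmp)
        raise
    return target
```

A second call with the same URL returns the same path without touching the network, which is the behavior wanted for large inputs that are reused across runs.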

@standage

Is this related, or should I open a separate thread?

$ cwl-runner --validate https://github.com/CancerCollaboratory/dockstore-tool-bamstats/raw/develop/Dockstore.cwl
/usr/local/bin/cwl-runner 1.0.20170713151519
Tool definition failed initialization:
(u'https://github.com/CancerCollaboratory/dockstore-tool-bamstats/raw/develop/Dockstore.cwl', AttributeError("'HTTPResponse' object has no attribute 'chunked'",))

@mr-c
Member Author

mr-c commented Jul 15, 2017

@standage Not related (and I can't reproduce with either 1.0.20170713151519 or the latest dev, 1.0.20170714133745). Can you open a separate issue with the output of pip freeze?

@standage

I can't reproduce either. :-)

I'll just chalk it up to transient environment config weirdness.

@kapilkd13
Contributor

Hi @mr-c, I am thinking of two ways to do this in PathMapper: https://github.com/common-workflow-language/cwltool/blob/master/cwltool/pathmapper.py#L219

  1. While creating a MapperEnt object, download the input over HTTP(S) into a temp file and use its path as the resolved path, creating a mapping like path(httplink) -> (temppath, targetPath).
  2. While creating a MapperEnt object, download the HTTP file content, assign it to the resolved path, and set the type to CreateFile, marking it as an input on the fly.

Is there a better way or place to do this?
Personally, I like the first one, as it allows us to implement caching of the downloaded file later.
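
Option 1 could look roughly like the sketch below. It is an illustration under stated assumptions, not the actual PathMapper change: `Entry` is a hypothetical stand-in for MapperEnt (its field names are assumed), and `map_remote_file` and its `fetch_schemes` parameter are invented for this example.

```python
import collections
import os
import shutil
import tempfile
from urllib.parse import urlsplit
from urllib.request import urlopen

# Hypothetical stand-in for cwltool's MapperEnt; the field names are assumed.
Entry = collections.namedtuple("Entry", ["resolved", "target", "type"])


def map_remote_file(location, stagedir, fetch_schemes=("http", "https")):
    """Download location to a temp file and map it to a path under stagedir."""
    parts = urlsplit(location)
    if parts.scheme not in fetch_schemes:
        raise ValueError("not a fetchable URI: %s" % location)
    basename = os.path.basename(parts.path)
    # Stream the response to disk so large inputs are never held in memory.
    with urlopen(location) as resp, \
            tempfile.NamedTemporaryFile(delete=False) as tmp:
        shutil.copyfileobj(resp, tmp)
    # resolved: where the bytes are now; target: where the tool should see them.
    return Entry(resolved=tmp.name,
                 target=os.path.join(stagedir, basename),
                 type="File")
```

Because the resolved path is an ordinary local file, the rest of the staging logic would not need to know the input came from a URL, and the temp file could later be swapped for a cached copy.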

@tetron
Member

tetron commented Jul 26, 2017

I think option 1 is the right one. CreateFile is for file literals and stores the data directly in memory, which won't work if the data is large. For comparison, arvados-cwl-runner does something similar, although it uploads local files to the server rather than downloading them locally; the principle is the same:

https://github.com/curoverse/arvados/blob/master/sdk/cwl/arvados_cwl/pathmapper.py#L136

@kapilkd13
Contributor

@mr-c Can we close this?

@mr-c
Member Author

mr-c commented Aug 12, 2017

Yep! To get an issue to automatically close when a PR is merged, end the Pull Request description with Closes: #NNN

mr-c closed this as completed Aug 12, 2017