-
Notifications
You must be signed in to change notification settings - Fork 87
AWS mirrored SRA file download #355
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: dev
Are you sure you want to change the base?
Conversation
Dev -> Master for v1.12.0 release
aws s3 cp \\ | ||
--region us-east-1 \\ | ||
--no-sign-request \\ | ||
${args} \\ | ||
s3://sra-pub-run-odp/sra/${run_accession}/${run_accession} \\ | ||
${prefix}.sra |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nextflow has native method of downloading from AWS using the SDK, do we think this will be more efficient than using that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using Nextflow:
- No worker processes
- No copying files around
- Lower latency
Using a process:
- Can offload downloading to a worker process
- Leave Nextflow 'head' process more available
- Could supply faster networking to worker nodes (which may be in a different env).
# Verify download | ||
if [ ! -f "${prefix}.sra" ]; then | ||
echo "ERROR: Failed to download ${run_accession} from AWS S3" | ||
exit 1 | ||
fi |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This shouldn't be necessary, aws cp
will exit with 1
if it fails.
--region us-east-1 \\ | ||
--no-sign-request \\ | ||
${args} \\ | ||
s3://sra-pub-run-odp/sra/${run_accession}/${run_accession} \\ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this is just doing aws cp
, why not use the file operator in native Nextflow, e.g. file/fromPath("s3://sra-pub-run-odp/sra/${run_accession}/${run_accession}")
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thinking about this, we can probably write a function to return meta, file("s3://etc")
and make that into a pseudo-process, then call it exactly the same way. This wouldn't actually copy the file, just a pointer to it so we would only ever move the file once, which will be infinitely more efficient than using a process (literally!). It would still copy the file to the publishDir via normal Nextflow mechanisms.
Per Issue #354, I've taken a stab at adapting my direct AWS download method for S3-mirrored SRA files. For ease of integration, I've so far just specified environment options with aws cli installed, but perhaps it would be more appropriate to try and instead use existing nextflow AWS integrations?
I've tested this on a couple datasets of various sizes I'm working with right now and it seems to be working quite reliably.
I did some very basic benchmarking of performance differences.


On a study of 28 samples with around 10 Gbp sequences, the AWS method was around 5 times faster for the download step than the SRA Prefetch approach. The bulk of the time was still taken up by unpacking, though.
PR checklist
Changes:
addition of aws cli download method for mirrored SRA files
If you've fixed a bug or added code that should be tested, add tests!
Tests added (plus nf-test snapshots) for the revised workflow, the new subworkflow, and the added module
Uses existing test data and approaches from sratools downloads.
nf-core lint
).There are linting errors in the dev branch, but no new linting errors introduced.
nextflow run . -profile test,docker --outdir <OUTDIR>
).Tests passing.
nextflow run . -profile debug,test,docker --outdir <OUTDIR>
).docs/usage.md
is updated.docs/output.md
is updated.CHANGELOG.md
is updated.README.md
is updated (including new tool citations and authors/contributors).