Skip to content

Conversation

tanaes
Copy link

@tanaes tanaes commented Aug 15, 2025

Per Issue #354, I've taken a stab at adapting my direct AWS download method for S3-mirrored SRA files. For ease of integration, I've so far just specified environment options with aws cli installed, but perhaps it would be more appropriate to try and instead use existing nextflow AWS integrations?

I've tested this on a couple datasets of various sizes I'm working with right now and it seems to be working quite reliably.

I did some very basic benchmarking of performance differences.
Screenshot 2025-08-15 at 1 31 49 AM
Screenshot 2025-08-15 at 1 31 58 AM

On a study of 28 samples with around 10 Gbp sequences, the AWS method was around 5 times faster for the download step than the SRA Prefetch approach. The bulk of the time was still taken up by unpacking, though.

PR checklist

  • This comment contains a description of changes (with reason).

Changes:

  • addition of aws cli download method for mirrored SRA files

  • If you've fixed a bug or added code that should be tested, add tests!

Tests added (plus nf-test snapshots) for the revised workflow, the new subworkflow, and the added module

  • If you've added a new tool - have you followed the pipeline conventions in the contribution docs
  • If necessary, also make a PR on the nf-core/fetchngs branch on the nf-core/test-datasets repository.

Uses existing test data and approaches from sratools downloads.

  • Make sure your code lints (nf-core lint).

There are linting errors in the dev branch, but no new linting errors introduced.

  • Ensure the test suite passes (nextflow run . -profile test,docker --outdir <OUTDIR>).

Tests passing.

  • Check for unexpected warnings in debug mode (nextflow run . -profile debug,test,docker --outdir <OUTDIR>).
  • Usage Documentation in docs/usage.md is updated.
  • Output Documentation in docs/output.md is updated.
  • CHANGELOG.md is updated.
  • README.md is updated (including new tool citations and authors/contributors).

Comment on lines +26 to +31
aws s3 cp \\
--region us-east-1 \\
--no-sign-request \\
${args} \\
s3://sra-pub-run-odp/sra/${run_accession}/${run_accession} \\
${prefix}.sra
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nextflow has native method of downloading from AWS using the SDK, do we think this will be more efficient than using that?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using Nextflow:

  • No worker processes
  • No copying files around
  • Lower latency

Using a process:

  • Can offload downloading to a worker process
  • Leave Nextflow 'head' process more available
  • Could supply faster networking to worker nodes (which may be in a different env).

Comment on lines +33 to +37
# Verify download
if [ ! -f "${prefix}.sra" ]; then
echo "ERROR: Failed to download ${run_accession} from AWS S3"
exit 1
fi
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This shouldn't be necessary, aws cp will exit with 1 if it fails.

--region us-east-1 \\
--no-sign-request \\
${args} \\
s3://sra-pub-run-odp/sra/${run_accession}/${run_accession} \\
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this is just doing aws cp, why not use the file operator in native Nextflow, e.g. file/fromPath("s3://sra-pub-run-odp/sra/${run_accession}/${run_accession}")

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking about this, we can probably write a function to return meta, file("s3://etc") and make that into a pseudo-process, then call it exactly the same way. This wouldn't actually copy the file, just a pointer to it so we would only ever move the file once, which will be infinitely more efficient than using a process (literally!). It would still copy the file to the publishDir via normal Nextflow mechanisms.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants