Skip to content

cli now supports decompressing to stdout #44

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Nov 8, 2021

Conversation

aabiddanda
Copy link
Contributor

Inspired by gzip where one can decompress to stdout I've added a flag to the cli -c that allows one to decompress the tree-sequence to stdout and redirect to a file of their choosing.
This was inspired by a use-case where I wanted to decompress a single simulation to multiple test cases.

I'm happy for this edit to be rewritten or a broader discussion of whether this belongs in the CLI.

Copy link
Member

@jeromekelleher jeromekelleher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a great addition, thanks @aabiddanda! It would be great to add the -c/--stdout semantics a-la gzip. In general, keeping things as gzip-like is a central goal, so this is a very nice addition.

There's a few things we'd need to do though I think, as we do want to make sure we're following the fill gzip "-c" semantics:

  1. We need to make sure we don't remove the input file, so remove_input should also check if --stdout has been set.
  2. We need to raise an error if --stdout is specified with --compress, since it's not totally straightforward to implement the Zarr zipstore on stdout. (I guess this should be logged as a follow-up issue)
  3. Make sure that this interacts with other options in a gzip-like way (I haven't gone through the details here, not sure what's involved)
  4. Add tests to make sure we're following the correct semantics. I.e., we'd want to test that the input file is left alone by copying the TestDecompressSemantics.test_keep test.
  5. Add a test to make sure that we're correctly outputting a stream of multiple tree sequences to stdout when we have multiple input files, which can be read by tskit. (tskit.load() should consume them one-by-one from a file).

How does that sound?

@jeromekelleher
Copy link
Member

Don't worry about the CI tests failing @aabiddanda, there's a bit of breakage in our CI which I'm sorting out in #45. Just get the tests passing locally and it'll be fine.

@codecov-commenter
Copy link

codecov-commenter commented Nov 4, 2021

Codecov Report

Merging #44 (db80be0) into main (55bc8c4) will decrease coverage by 0.59%.
The diff coverage is 75.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##             main      #44      +/-   ##
==========================================
- Coverage   97.70%   97.10%   -0.60%     
==========================================
  Files           6        6              
  Lines         305      311       +6     
  Branches       55       57       +2     
==========================================
+ Hits          298      302       +4     
- Misses          5        6       +1     
- Partials        2        3       +1     
Impacted Files Coverage Δ
tszip/cli.py 98.01% <75.00%> (-1.99%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 55bc8c4...db80be0. Read the comment docs.

@aabiddanda
Copy link
Contributor Author

Is there a clear way in python to obtain the path to file-like objects for stdout? Building the tests using outfile = pathlib.Path('/dev/stdout') was substantially more efficient to avoid fileno errors.

tszip/cli.py Outdated
if args.stdout:
args.keep = True
# NOTE: this is likely not compliant across different systems
outfile = pathlib.Path("/dev/stdout")
Copy link
Member

@jeromekelleher jeromekelleher Nov 4, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should be able to use sys.stdout, right? So rather than a path we pass a file-like object to ts.dump() which will do the right thing.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This certainly works when running in the command line (e.g. tszip -d -c tests/files/1.0.0.trees.tsz > test.trees). However when running in the pytest suite it seems to complain substantially about this (see image attached). I think the crux is that sys.stdout is not the same as a standard file-handle and does not have the fileno operation?

I find it odd that this only breaks in the pytest environment though and not directly on my commandline.

Screen Shot 2021-11-04 at 8 56 01 AM

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, yuck, this is just an artefact of pytest. Pytest is capturing stdout and it's replacing it with a "not real" file.

We do something similar in msprime for the msp ancestry command, which outputs to stdout by default. Perhaps there's something in the capture_output function which helps?

@jeromekelleher
Copy link
Member

I had a play with this @aabiddanda and it really is quite fiddly to get stdout diverted correctly here. This is what I came up with:

def capture_output(func, *args, binary=False, **kwargs):
    """
    Runs the specified function and arguments, and returns the
    tuple (stdout, stderr) as strings.
    """
    with tempfile.TemporaryDirectory() as tmpdir:
        stdout_path = pathlib.Path(tmpdir) / "stdout"
        stderr_path = pathlib.Path(tmpdir) / "stderr"
        mode = "wb+" if binary else "w+"
        saved_stdout = sys.stdout
        saved_stderr = sys.stderr
        with open(stdout_path, mode) as stdout, open(stderr_path, mode) as stderr:
            try:
                sys.stdout = stdout
                sys.stderr = stderr
                with mock.patch("signal.signal"):
                    func(*args, **kwargs)
                stdout.seek(0)
                stderr.seek(0)
                stdout_output = stdout.read()
                stderr_output = stderr.read()
            finally:
                sys.stdout = saved_stdout
                sys.stderr = saved_stderr
    return stdout_output, stderr_output

by using a real file, we avoid the fileno problem and so we can work directly with sys.stdout. We need the binary argument as otherwise we try to interpret the tskit output as unicode and it borks. There's probably a more elegant way to do it, but this works 😄

@aabiddanda
Copy link
Contributor Author

Main TODO left on the testing front is to address the following:
"""
Add a test to make sure that we're correctly outputting a stream of multiple tree sequences to stdout when we have multiple input files, which can be read by tskit. (tskit.load() should consume them one-by-one from a file).
"""

I'll be addressing this shortly

@aabiddanda
Copy link
Contributor Author

Upon some reflection/checking it seems like this type of reading multiple tree-sequences from a single file (compressed or otherwise) is a no-go and this is not a feature in tskit.load. Unless I am missing something very obvious here. I could see two ways to go:

  1. Put in an error if trying to decompress more than 1 file to stdout (saying that this is not currently supported)
  2. Wait till tskit.load supports loading multiple tree-sequences from the same file to merge this PR in.

The first seems more achievable quickly ...
Example code illustrating this issue:

import msprime
import tskit
import numpy as np
import os

print(f"tskit version: {tskit.__version__}")
print(f"msprime version: {msprime.__version__}")

# simulate two different tree sequences and dump to different text files
ts1 = msprime.simulate(10, mutation_rate=10, random_seed=1)
ts2 = msprime.simulate(20, mutation_rate=10, random_seed=2)
ts1.dump('t1.trees')
ts2.dump('t2.trees')

os.system('cat t1.trees t2.trees > combined.trees')
ts_joint = tskit.load('combined.trees')
# Error: it only matches the first entry ... not both
assert ts_joint.tables == ts1.tables

Based on the C API it seems like it is there but perhaps this feature has not been propagated to the Python API yet by wrapping the loadf function (at least as of tskit v0.3.7)? Would appreciate any thoughts on this as it seems like a relatively important design choice.

@jeromekelleher
Copy link
Member

It works all right @aabiddanda, there's just a slight difference in how you get the tree sequences files from the combined one:

os.system('cat t1.trees t2.trees > combined.trees')
with open("combined.trees", "r") as f:
    ts1a = tskit.load(f)
    ts2a = tskit.load(f)

assert ts1a == ts1
assert ts2a == ts2

So, tskit will read complete tree sequences sequentially from a stream, but if you point it to a file path it'll just read the first.

@jeromekelleher jeromekelleher mentioned this pull request Nov 7, 2021
Copy link
Member

@jeromekelleher jeromekelleher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great, thanks @aabiddanda! One minor suggestion and we're good to merge after a squash

@aabiddanda
Copy link
Contributor Author

Ok should be all set to go on this one and its a nice clean additional feature (the tutorial on squashing is very helpful for someone like me that tends to over-commit!). Thanks for the tips & tricks @jeromekelleher!

@mergify mergify bot merged commit ccaad01 into tskit-dev:main Nov 8, 2021
@jeromekelleher
Copy link
Member

Excellent, thanks @aabiddanda! Squashing is a real git super-power I think - makes it look like you get everything exactly right first time 😉

@aabiddanda aabiddanda deleted the cli_outfile_edits branch November 8, 2021 11:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants