cli now supports decompressing to stdout #44

aabiddanda · 2021-10-26T21:15:55Z

Inspired by gzip, where one can decompress to stdout, I've added a -c flag to the CLI that decompresses the tree sequence to stdout so it can be redirected to a file of the user's choosing.
This was inspired by a use-case where I wanted to decompress a single simulation to multiple test cases.

I'm happy for this edit to be rewritten, or to have a broader discussion of whether this belongs in the CLI.

jeromekelleher

This is a great addition, thanks @aabiddanda! It would be great to add the -c/--stdout semantics a la gzip. In general, keeping things as gzip-like as possible is a central goal, so this is a very nice addition.

There are a few things we'd need to do though, I think, as we do want to make sure we're following the full gzip "-c" semantics:

We need to make sure we don't remove the input file, so remove_input should also check if --stdout has been set.
We need to raise an error if --stdout is specified with --compress, since it's not totally straightforward to implement the Zarr zipstore on stdout. (I guess this should be logged as a follow-up issue)
Make sure that this interacts with other options in a gzip-like way (I haven't gone through the details here, not sure what's involved)
Add tests to make sure we're following the correct semantics. I.e., we'd want to test that the input file is left alone by copying the TestDecompressSemantics.test_keep test.
Add a test to make sure that we're correctly outputting a stream of multiple tree sequences to stdout when we have multiple input files, which can be read by tskit. (tskit.load() should consume them one-by-one from a file).

How does that sound?
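
A minimal sketch of the first two points above, assuming an argparse-based CLI. The parser, the `tszip-sketch` program name, and the `should_remove_input` helper are hypothetical and are not taken from the tszip source; only the -d, -k/keep and -c/--stdout options are mentioned in this thread.

```python
import argparse


def build_parser():
    # Hypothetical parser; option names mirror the ones discussed in this thread.
    parser = argparse.ArgumentParser(prog="tszip-sketch")
    parser.add_argument("files", nargs="+")
    parser.add_argument("-d", "--decompress", action="store_true")
    parser.add_argument("-k", "--keep", action="store_true")
    parser.add_argument("-c", "--stdout", action="store_true")
    return parser


def parse_args(argv):
    parser = build_parser()
    args = parser.parse_args(argv)
    # Writing compressed output to stdout is not supported in this sketch,
    # so --stdout is rejected unless we are decompressing.
    if args.stdout and not args.decompress:
        parser.error("--stdout is only supported together with --decompress")
    return args


def should_remove_input(args):
    # gzip-like behaviour: the input file is left alone when writing to stdout.
    return not (args.keep or args.stdout)
```

With this shape, `parse_args(["-d", "-c", "x.trees.tsz"])` leaves the input in place, while `parse_args(["-c", "x.trees"])` exits with a usage error.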

tszip/cli.py

jeromekelleher · 2021-10-27T11:36:00Z

Don't worry about the CI tests failing @aabiddanda, there's a bit of breakage in our CI which I'm sorting out in #45. Just get the tests passing locally and it'll be fine.

codecov-commenter · 2021-11-04T00:39:36Z

Codecov Report

Merging #44 (db80be0) into main (55bc8c4) will decrease coverage by 0.59%.
The diff coverage is 75.00%.

@@            Coverage Diff             @@
##             main      #44      +/-   ##
==========================================
- Coverage   97.70%   97.10%   -0.60%     
==========================================
  Files           6        6              
  Lines         305      311       +6     
  Branches       55       57       +2     
==========================================
+ Hits          298      302       +4     
- Misses          5        6       +1     
- Partials        2        3       +1

Impacted Files	Coverage Δ
tszip/cli.py	`98.01% <75.00%> (-1.99%)`	⬇️

aabiddanda · 2021-11-04T00:57:59Z

Is there a clear way in Python to obtain a path for file-like objects such as stdout? Building the tests using outfile = pathlib.Path('/dev/stdout') was a much simpler way to avoid fileno errors.

jeromekelleher · 2021-11-04T12:47:05Z

tszip/cli.py

+        if args.stdout:
+            args.keep = True
+            # NOTE: this is likely not compliant across different systems
+            outfile = pathlib.Path("/dev/stdout")


We should be able to use sys.stdout, right? So rather than a path we pass a file-like object to ts.dump() which will do the right thing.

aabiddanda · 2021-11-04

This certainly works when running in the command line (e.g. tszip -d -c tests/files/1.0.0.trees.tsz > test.trees). However, when running in the pytest suite it seems to complain substantially about this (see image attached). I think the crux is that sys.stdout is not the same as a standard file handle and does not have the fileno operation?

I find it odd that this only breaks in the pytest environment though and not directly on my commandline.

jeromekelleher · 2021-11-04

Ah, yuck, this is just an artefact of pytest. Pytest is capturing stdout and it's replacing it with a "not real" file.

We do something similar in msprime for the msp ancestry command, which outputs to stdout by default. Perhaps there's something in the capture_output function which helps?
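
To illustrate the fileno problem, here is a minimal sketch. It assumes the captured stream behaves like an in-memory buffer such as io.BytesIO, which is an assumption for illustration; pytest's actual capture object may differ. `has_real_fileno` is a hypothetical helper, not part of tszip or pytest.

```python
import io
import tempfile


def has_real_fileno(stream):
    """Return True if the stream is backed by an OS-level file descriptor."""
    try:
        stream.fileno()
    except (AttributeError, OSError):
        # In-memory streams raise io.UnsupportedOperation, a subclass of OSError.
        return False
    return True


# An in-memory stand-in for a captured stdout: no file descriptor behind it.
in_memory_ok = has_real_fileno(io.BytesIO())

# A real temporary file does have a descriptor, so fileno-based code can use it.
with tempfile.TemporaryFile() as real_file:
    real_file_ok = has_real_fileno(real_file)
```

Code that needs a file descriptor works with the second kind of stream but not the first, which would explain a failure that only shows up under a test harness.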

tests/test_cli.py

jeromekelleher · 2021-11-04T19:34:37Z

I had a play with this @aabiddanda and it really is quite fiddly to get stdout diverted correctly here. This is what I came up with:

import pathlib
import sys
import tempfile
from unittest import mock


def capture_output(func, *args, binary=False, **kwargs):
    """
    Runs the specified function and arguments, and returns the
    tuple (stdout, stderr) as strings, or as bytes if binary=True.
    """
    with tempfile.TemporaryDirectory() as tmpdir:
        stdout_path = pathlib.Path(tmpdir) / "stdout"
        stderr_path = pathlib.Path(tmpdir) / "stderr"
        mode = "wb+" if binary else "w+"
        saved_stdout = sys.stdout
        saved_stderr = sys.stderr
        # Real files on disk have a working fileno(), unlike captured streams.
        with open(stdout_path, mode) as stdout, open(stderr_path, mode) as stderr:
            try:
                sys.stdout = stdout
                sys.stderr = stderr
                # Stop the function under test from installing signal handlers.
                with mock.patch("signal.signal"):
                    func(*args, **kwargs)
                # Rewind so that we read back everything that was written.
                stdout.seek(0)
                stderr.seek(0)
                stdout_output = stdout.read()
                stderr_output = stderr.read()
            finally:
                # Always restore the original streams, even if func raised.
                sys.stdout = saved_stdout
                sys.stderr = saved_stderr
    return stdout_output, stderr_output

By using a real file, we avoid the fileno problem and so we can work directly with sys.stdout. We need the binary argument as otherwise we try to interpret the tskit output as unicode and it borks. There's probably a more elegant way to do it, but this works 😄

aabiddanda · 2021-11-05T16:28:07Z

Main TODO left on the testing front is to address the following:
"""
Add a test to make sure that we're correctly outputting a stream of multiple tree sequences to stdout when we have multiple input files, which can be read by tskit. (tskit.load() should consume them one-by-one from a file).
"""

I'll be addressing this shortly

aabiddanda · 2021-11-06T23:43:49Z

Upon some reflection/checking it seems like this type of reading multiple tree-sequences from a single file (compressed or otherwise) is a no-go and this is not a feature in tskit.load. Unless I am missing something very obvious here. I could see two ways to go:

Put in an error if trying to decompress more than 1 file to stdout (saying that this is not currently supported)
Wait till tskit.load supports loading multiple tree-sequences from the same file to merge this PR in.

The first seems more achievable quickly ...
Example code illustrating this issue:

import msprime
import tskit
import os

print(f"tskit version: {tskit.__version__}")
print(f"msprime version: {msprime.__version__}")

# simulate two different tree sequences and dump them to separate files
ts1 = msprime.simulate(10, mutation_rate=10, random_seed=1)
ts2 = msprime.simulate(20, mutation_rate=10, random_seed=2)
ts1.dump('t1.trees')
ts2.dump('t2.trees')

# concatenate the two files and load the result by path
os.system('cat t1.trees t2.trees > combined.trees')
ts_joint = tskit.load('combined.trees')
# Problem: the loaded result matches only the first tree sequence, not both
assert ts_joint.tables == ts1.tables

Based on the C API it seems like it is there but perhaps this feature has not been propagated to the Python API yet by wrapping the loadf function (at least as of tskit v0.3.7)? Would appreciate any thoughts on this as it seems like a relatively important design choice.

jeromekelleher · 2021-11-07T13:39:55Z

It works all right @aabiddanda, there's just a slight difference in how you get the tree sequence files from the combined one:

os.system('cat t1.trees t2.trees > combined.trees')
with open("combined.trees", "rb") as f:
    ts1a = tskit.load(f)
    ts2a = tskit.load(f)

assert ts1a == ts1
assert ts2a == ts2

So, tskit will read complete tree sequences sequentially from a stream, but if you point it to a file path it'll just read the first.
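
The stream-versus-path distinction is not specific to tskit. As a stand-in that runs with only the standard library, the sketch below uses pickle (not tskit) to show the same pattern: each load call on an open handle consumes exactly one record and leaves the handle positioned at the next, whereas reopening the path always starts again at the first record. `load_all` is a hypothetical helper, and stopping on EOFError is pickle's behaviour; whether tskit signals the end of a stream the same way is not stated in this thread.

```python
import os
import pickle
import tempfile


def load_all(stream):
    """Yield every pickled object from an open binary stream until EOF."""
    while True:
        try:
            yield pickle.load(stream)
        except EOFError:
            return


with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "combined.bin")

    # Write two independent records back to back, like `cat a b > combined`.
    with open(path, "wb") as f:
        pickle.dump({"name": "first"}, f)
        pickle.dump({"name": "second"}, f)

    # A single load from a freshly opened file sees only the first record.
    with open(path, "rb") as f:
        first_only = pickle.load(f)

    # Repeated loads from the same open handle walk through all records.
    with open(path, "rb") as f:
        everything = list(load_all(f))
```

The same loop shape is what a consumer of a multi-record stream on stdin would use: keep one handle open and load repeatedly from it.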

jeromekelleher

Looks great, thanks @aabiddanda! One minor suggestion and we're good to merge after a squash

tszip/cli.py

aabiddanda · 2021-11-07T20:08:24Z

Ok, should be all set to go on this one and it's a nice clean additional feature (the tutorial on squashing is very helpful for someone like me that tends to over-commit!). Thanks for the tips & tricks @jeromekelleher!

jeromekelleher · 2021-11-08T08:54:24Z

Excellent, thanks @aabiddanda! Squashing is a real git super-power I think - makes it look like you get everything exactly right first time 😉

jeromekelleher reviewed Oct 27, 2021

jeromekelleher reviewed Nov 4, 2021

aabiddanda commented Nov 4, 2021

jeromekelleher mentioned this pull request Nov 7, 2021

Compress to stdout #49

Closed

jeromekelleher approved these changes Nov 7, 2021

jeromekelleher mentioned this pull request Nov 7, 2021

Miscellaneous packaging stuff #50

Merged

aabiddanda force-pushed the cli_outfile_edits branch from c19e896 to 00e2a3e Compare November 7, 2021 20:02

cli now supports decompressing to stdout

fd49159

aabiddanda force-pushed the cli_outfile_edits branch from 00e2a3e to fd49159 Compare November 7, 2021 20:07

jeromekelleher added the AUTOMERGE-REQUESTED label Nov 8, 2021

mergify bot merged commit ccaad01 into tskit-dev:main Nov 8, 2021

mergify bot removed the AUTOMERGE-REQUESTED label Nov 8, 2021

jeromekelleher mentioned this pull request Nov 8, 2021

Update changelog for 0.2.0 #51

Closed

aabiddanda deleted the cli_outfile_edits branch November 8, 2021 11:45
