-
Notifications
You must be signed in to change notification settings - Fork 7
cli now supports decompressing to stdout #44
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a great addition, thanks @aabiddanda! It would be great to add the -c/--stdout semantics a-la gzip. In general, keeping things as gzip-like is a central goal, so this is a very nice addition.
There's a few things we'd need to do though I think, as we do want to make sure we're following the fill gzip "-c" semantics:
- We need to make sure we don't remove the input file, so remove_input should also check if --stdout has been set.
- We need to raise an error if
--stdout
is specified with--compress
, since it's not totally straightforward to implement the Zarr zipstore on stdout. (I guess this should be logged as a follow-up issue) - Make sure that this interacts with other options in a gzip-like way (I haven't gone through the details here, not sure what's involved)
- Add tests to make sure we're following the correct semantics. I.e., we'd want to test that the input file is left alone by copying the
TestDecompressSemantics.test_keep
test. - Add a test to make sure that we're correctly outputting a stream of multiple tree sequences to stdout when we have multiple input files, which can be read by tskit. (tskit.load() should consume them one-by-one from a file).
How does that sound?
Don't worry about the CI tests failing @aabiddanda, there's a bit of breakage in our CI which I'm sorting out in #45. Just get the tests passing locally and it'll be fine. |
Codecov Report
@@ Coverage Diff @@
## main #44 +/- ##
==========================================
- Coverage 97.70% 97.10% -0.60%
==========================================
Files 6 6
Lines 305 311 +6
Branches 55 57 +2
==========================================
+ Hits 298 302 +4
- Misses 5 6 +1
- Partials 2 3 +1
Continue to review full report at Codecov.
|
Is there a clear way in |
tszip/cli.py
Outdated
if args.stdout: | ||
args.keep = True | ||
# NOTE: this is likely not compliant across different systems | ||
outfile = pathlib.Path("/dev/stdout") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should be able to use sys.stdout
, right? So rather than a path we pass a file-like object to ts.dump()
which will do the right thing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This certainly works when running in the command line (e.g. tszip -d -c tests/files/1.0.0.trees.tsz > test.trees
). However when running in the pytest
suite it seems to complain substantially about this (see image attached). I think the crux is that sys.stdout
is not the same as a standard file-handle and does not have the fileno
operation?
I find it odd that this only breaks in the pytest
environment though and not directly on my commandline.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, yuck, this is just an artefact of pytest. Pytest is capturing stdout and it's replacing it with a "not real" file.
We do something similar in msprime for the msp ancestry
command, which outputs to stdout by default. Perhaps there's something in the capture_output
function which helps?
I had a play with this @aabiddanda and it really is quite fiddly to get stdout diverted correctly here. This is what I came up with: def capture_output(func, *args, binary=False, **kwargs):
"""
Runs the specified function and arguments, and returns the
tuple (stdout, stderr) as strings.
"""
with tempfile.TemporaryDirectory() as tmpdir:
stdout_path = pathlib.Path(tmpdir) / "stdout"
stderr_path = pathlib.Path(tmpdir) / "stderr"
mode = "wb+" if binary else "w+"
saved_stdout = sys.stdout
saved_stderr = sys.stderr
with open(stdout_path, mode) as stdout, open(stderr_path, mode) as stderr:
try:
sys.stdout = stdout
sys.stderr = stderr
with mock.patch("signal.signal"):
func(*args, **kwargs)
stdout.seek(0)
stderr.seek(0)
stdout_output = stdout.read()
stderr_output = stderr.read()
finally:
sys.stdout = saved_stdout
sys.stderr = saved_stderr
return stdout_output, stderr_output by using a real file, we avoid the fileno problem and so we can work directly with sys.stdout. We need the |
Main TODO left on the testing front is to address the following: I'll be addressing this shortly |
Upon some reflection/checking it seems like this type of reading multiple tree-sequences from a single file (compressed or otherwise) is a no-go and this is not a feature in tskit.load. Unless I am missing something very obvious here. I could see two ways to go:
The first seems more achievable quickly ...
Based on the C API it seems like it is there but perhaps this feature has not been propagated to the Python API yet by wrapping the |
It works all right @aabiddanda, there's just a slight difference in how you get the tree sequences files from the combined one: os.system('cat t1.trees t2.trees > combined.trees')
with open("combined.trees", "r") as f:
ts1a = tskit.load(f)
ts2a = tskit.load(f)
assert ts1a == ts1
assert ts2a == ts2 So, tskit will read complete tree sequences sequentially from a stream, but if you point it to a file path it'll just read the first. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great, thanks @aabiddanda! One minor suggestion and we're good to merge after a squash
c19e896
to
00e2a3e
Compare
00e2a3e
to
fd49159
Compare
Ok should be all set to go on this one and its a nice clean additional feature (the tutorial on squashing is very helpful for someone like me that tends to over-commit!). Thanks for the tips & tricks @jeromekelleher! |
Excellent, thanks @aabiddanda! Squashing is a real git super-power I think - makes it look like you get everything exactly right first time 😉 |
Inspired by
gzip
where one can decompress to stdout I've added a flag to the cli-c
that allows one to decompress the tree-sequence to stdout and redirect to a file of their choosing.This was inspired by a use-case where I wanted to decompress a single simulation to multiple test cases.
I'm happy for this edit to be rewritten or a broader discussion of whether this belongs in the CLI.