Compressing tree-sequences to stdout #53
Codecov Report
@@ Coverage Diff @@
## main #53 +/- ##
==========================================
- Coverage 97.76% 97.56% -0.20%
==========================================
Files 6 6
Lines 313 329 +16
Branches 62 65 +3
==========================================
+ Hits 306 321 +15
- Misses 5 6 +1
Partials 2 2
Thanks for following up on this @aabiddanda! Have you tried passing |
I've tried every permutation of trying to get |
What do you think of this version @aabiddanda?
tszip/compression.py
@@ -95,8 +97,13 @@ def compress(ts, path, variants_only=False):
with zarr.ZipStore(filename, mode="w") as store:
    root = zarr.group(store=store)
    compress_zarr(ts, root, variants_only=variants_only)
os.replace(filename, destination)
logging.info(f"Wrote {destination}")
if stdout:
Ah, looks like we're actually writing a temporary file anyway, so it's not that bad. How about this (untested!)
if isinstance(destination, (str, pathlib.Path)):
    os.replace(filename, destination)
    logging.info(f"Wrote {destination}")
else:
    # Assume that destination is a file-like object open in "wb" mode.
    with open(filename, "rb") as source:
        chunk_size = 2**20  # 1 MiB
        for chunk in iter(functools.partial(source.read, chunk_size), b""):
            destination.write(chunk)
That way we don't need the stdout option and can write to any suitable file-like object.
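A minimal, self-contained sketch of the idea in this suggestion, assuming the destination is either a path or a binary file-like object. The helper name `deliver` and the demo file names are hypothetical and are not part of tszip.

```python
import functools
import io
import os
import pathlib
import tempfile


def deliver(filename, destination, chunk_size=2**20):
    """Hypothetical helper: hand a finished temporary file to its destination."""
    if isinstance(destination, (str, pathlib.Path)):
        # A path: move the finished file into place in one step.
        os.replace(filename, destination)
    else:
        # Assumed to be a binary file-like object; copy in chunks (1 MiB default).
        with open(filename, "rb") as source:
            for chunk in iter(functools.partial(source.read, chunk_size), b""):
                destination.write(chunk)


# Demo: an in-memory buffer stands in for a binary output stream.
with tempfile.TemporaryDirectory() as tmpdir:
    filename = os.path.join(tmpdir, "payload.bin")
    with open(filename, "wb") as f:
        f.write(b"abc" * 1000)
    buffer = io.BytesIO()
    deliver(filename, buffer, chunk_size=1024)
    demo_size = len(buffer.getvalue())
```

When the target is standard output, the binary stream is `sys.stdout.buffer`, since `sys.stdout` itself is a text stream and rejects `bytes`.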
In general I like this solution, as it keeps the function signature the same. It needs a little more work, because constructing the in-place directory does not work reliably when the path being fed in is sys.stdout, but I think that can be worked around. I'll see what I can do to make it work with the various tests in place as well.
destination = str(path)
# Write the file into a temporary directory on the same file system so that
# we can write the output atomically.
destdir = os.path.dirname(os.path.abspath(destination))
with tempfile.TemporaryDirectory(dir=destdir, prefix=".tszip_work_") as tmpdir:
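The snippet above relies on a common pattern: write to a temporary location on the same file system, then move the result into place with `os.replace`, which on POSIX is an atomic rename when source and target share a file system. A minimal sketch of that pattern follows; the function name `write_atomically`, the `.work_` prefix, and the temporary file name are illustrative assumptions, not tszip code.

```python
import os
import tempfile


def write_atomically(path, data):
    """Hypothetical helper: write bytes to path without exposing a partial file."""
    destination = str(path)
    # Work next to the destination so the final rename stays on one file system.
    destdir = os.path.dirname(os.path.abspath(destination))
    with tempfile.TemporaryDirectory(dir=destdir, prefix=".work_") as tmpdir:
        filename = os.path.join(tmpdir, "tmp.bin")
        with open(filename, "wb") as f:
            f.write(data)
        # Readers see either the old file or the complete new one.
        os.replace(filename, destination)
```

This also shows why a non-path destination is awkward here: the working directory is derived from the destination path, so a stream has no natural place to stage the temporary file.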
After a brief hiatus - the issues with running |
c7faf57
to
693c669
Thanks for picking this up again @aabiddanda! I've added a few small tweaks, what do you think? Basically I don't like the idea of complicating the library code just to make the tests work - I think this is cleaner, even if the tests are a bit more complicated. If you're happy, then can you rebase and squash the commits down to one so we can merge please?
277a579
to
c3cb856
Should be squashed and good to go. However, I think that it should be noted that the behavior for compressing multiple files to stdout is a little bit different from
If we can only compress one tree-sequence file at a time on the command line, then should we raise a warning if a user tries to compress multiple tree-sequences to a single file?
Looks like we still have 14 commits @aabiddanda, so maybe you didn't force push or something? This guide should help, but I'm happy to do the squashing if you prefer. Re the multiple files thing - hmm, good catch. Maybe we should merge this much first, and open a separate issue to track that problem? We can then decide how to deal with it before releasing. (Just raising an error is fine if we need to IMO - it's an edge case)
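One way the "just raise an error" option could look is sketched below. This is only an illustration of the edge case under discussion: the function name `check_stdout_inputs` and its parameters are hypothetical, and it does not describe how the tszip command line actually behaves.

```python
def check_stdout_inputs(files, to_stdout):
    """Hypothetical guard: refuse to send several compressed files to one stream."""
    if to_stdout and len(files) > 1:
        raise ValueError(
            f"Cannot compress {len(files)} files to stdout; "
            "pass a single file when writing to stdout"
        )
```

Called before any compression starts, a check like this fails fast instead of writing several archives back to back into one output.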
author Arjun Biddanda <[email protected]> 1636730041 -0500 committer Arjun Biddanda <[email protected]> 1645130969 -0500 parent 340315a author Arjun Biddanda <[email protected]> 1636730041 -0500 committer Arjun Biddanda <[email protected]> 1645130930 -0500 feature: tszip compression to stdout
c3cb856
to
2b7ed66
Alright, I think that it's now squashed to one commit @jeromekelleher (I'm always amazed by the rebasing process). Re: multiple files, I agree with your suggestion of making a separate issue and potentially adding a test for that edge case.
This is great, thanks for taking the time to work this through @aabiddanda!
In response to #49, I've implemented a version (and some testing) that allows one to compress a single tree sequence to stdout and pipe it to a file of choice. The code is a bit clunky w.r.t. file handling, so I'm opening this as a draft PR initially to get some feedback.
There is one outstanding thing that I have not added: how to compress multiple tree sequences to the stdout stream and verify that we can read them in sequence (similar to sequential calls to tskit.load from a single file).