Compressing tree-sequences to stdout #53

Merged · 1 commit · Feb 18, 2022

Conversation

aabiddanda (Contributor)

In response to #49, I've implemented a version (plus some tests) that allows one to compress a single tree sequence to stdout and pipe it to a file of choice. The code is a bit clunky with respect to file handling, so I'm opening this as a draft PR initially to get some feedback.

There is one outstanding thing that I have not added: compressing multiple tree sequences to the stdout stream and verifying that we can read them back in sequence (similar to sequential calls to tskit.load on a single file).
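For illustration, here is a hedged sketch of the usage this PR is aiming for (the -c flag matches the tests later in this thread; the file names and the Python-level API shape are assumptions, not the PR's final code):

    # Shell usage being targeted (assumed): tszip -c example.trees > example.trees.tsz
    # Python-level equivalent, assuming compress() eventually accepts a binary stream:
    import sys

    import tskit
    import tszip

    ts = tskit.load("example.trees")       # hypothetical input file
    tszip.compress(ts, sys.stdout.buffer)  # stream the compressed bytes to stdout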

codecov-commenter commented Nov 12, 2021

Codecov Report

Merging #53 (693c669) into main (340315a) will decrease coverage by 0.19%.
The diff coverage is 96.15%.


@@            Coverage Diff             @@
##             main      #53      +/-   ##
==========================================
- Coverage   97.76%   97.56%   -0.20%     
==========================================
  Files           6        6              
  Lines         313      329      +16     
  Branches       62       65       +3     
==========================================
+ Hits          306      321      +15     
- Misses          5        6       +1     
  Partials        2        2              
Impacted Files         Coverage            Δ
tszip/cli.py           99.04% <83.33%>     (-0.96%) ⬇️
tszip/compression.py   98.96% <100.00%>    (+0.06%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

jeromekelleher (Member) commented Nov 12, 2021

Thanks for following up on this @aabiddanda!

Have you tried passing sys.stdout to zarr directly? It looks to me like the zipfile interface supports this, and the zarr code just passes the path argument through. So, ideally, we could pass the file handle through directly?

aabiddanda (Contributor, Author)

I've tried every permutation of passing sys.stdout and sys.stdout.buffer directly as a zipfile (which the plain ZipFile interface allows), but I think the broader issue is that zarr expects a full string path (see line 1500 here). This isn't something that can easily be solved by just passing in sys.stdout or even /dev/stdout (which leads to hanging on the file handle on my Mac). Would appreciate any suggestions @jeromekelleher if you have insights here.
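A minimal sketch of the failure mode being described, assuming zarr 2.x behaviour where ZipStore normalises its path argument to an absolute string before ever reaching zipfile.ZipFile:

    import sys
    import zarr

    try:
        # zipfile.ZipFile would accept this handle, but zarr never gets that
        # far: it normalises the "path" to an absolute string first.
        store = zarr.ZipStore(sys.stdout.buffer, mode="w")
    except TypeError as err:
        print(f"zarr rejects the handle: {err}", file=sys.stderr)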

jeromekelleher (Member) left a comment

What do you think of this version @aabiddanda?

@@ -95,8 +97,13 @@ def compress(ts, path, variants_only=False):
    with zarr.ZipStore(filename, mode="w") as store:
        root = zarr.group(store=store)
        compress_zarr(ts, root, variants_only=variants_only)
    os.replace(filename, destination)
    logging.info(f"Wrote {destination}")
    if stdout:
jeromekelleher (Member):

Ah, it looks like we're actually writing a temporary file anyway, so it's not that bad. How about this (untested!):

if isinstance(destination, (str, pathlib.Path)):
    os.replace(filename, destination)
    logging.info(f"Wrote {destination}")
else:
    # Assume that destination is a file-like object open in "wb" mode.
    with open(filename, "rb") as source:
        chunk_size = 2**20  # 1MiB
        for chunk in iter(functools.partial(source.read, chunk_size), b""):
            destination.write(chunk)

That way we don't need the stdout option and can write to any suitable file-like object.
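Assuming this lands as written, calling code could then treat paths and streams uniformly; a hypothetical usage sketch:

    import sys
    import tszip

    # ts: a tskit.TreeSequence loaded elsewhere.
    # Path destination: written via the atomic os.replace route, as before.
    tszip.compress(ts, "example.trees.tsz")

    # File-like destination: the temporary zip is copied into the stream in chunks.
    tszip.compress(ts, sys.stdout.buffer)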

aabiddanda (Contributor, Author):

In general I like this solution, as it keeps the function signature the same. It needs a little fiddling, though: constructing the temporary directory in place does not work reliably when the path being fed in is sys.stdout, but I think that can be worked around. I'll see what I can do to make this work with the various tests in place as well.

    destination = str(path)
    # Write the file into a temporary directory on the same file system so that
    # we can write the output atomically.
    destdir = os.path.dirname(os.path.abspath(destination))
    with tempfile.TemporaryDirectory(dir=destdir, prefix=".tszip_work_") as tmpdir:
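One possible shape for that workaround, as a hedged sketch (branching on the destination type before touching the filesystem; this is not the PR's actual code):

    import os
    import pathlib
    import tempfile

    def compress(ts, destination, variants_only=False):
        if isinstance(destination, (str, pathlib.Path)):
            # Same-filesystem temporary directory, so os.replace stays atomic.
            destdir = os.path.dirname(os.path.abspath(str(destination)))
        else:
            # File-like destination: the default temporary location is fine,
            # since we copy bytes into the stream instead of renaming a file.
            destdir = None
        with tempfile.TemporaryDirectory(dir=destdir, prefix=".tszip_work_") as tmpdir:
            ...  # write the ZipStore into tmpdir, then replace or stream it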

aabiddanda marked this pull request as ready for review on February 13, 2022.
aabiddanda (Contributor, Author)

After a brief hiatus, the issues with running tszip compression to stdout are now appropriately handled. Would appreciate any comments to help get this across the line.

jeromekelleher (Member)

Thanks for picking this up again @aabiddanda! I've added a few small tweaks; what do you think? Basically, I don't like the idea of complicating the library code just to make the tests work - I think this is cleaner, even if the tests are a bit more complicated.

If you're happy, then can you rebase and squash the commits down to one so we can merge please?

aabiddanda (Contributor, Author)

Should be squashed and good to go.

However, it should be noted that the behavior when compressing multiple files to stdout is a little different from gzip with the -c flag. When testing this, it seems that only the final compressed tree sequence is recoverable. For example, a test like this fails:

    def test_compress_stdout_multiple(self):
        self.assertTrue(self.trees_path.exists())
        tmp_file = pathlib.Path(self.tmpdir.name) / "stdout_mult.trees.tsz"
        with mock.patch("tszip.cli.get_stdout", wraps=get_stdout_for_pytest):
            stdout, stderr = self.run_tszip_stdout(
                ["-c", str(self.trees_path), str(self.trees_path2)]
            )
        with open(tmp_file, "wb+") as tmp:
            tmp.write(stdout)
        ts_mult1 = tszip.decompress(tmp_file)
        ts_mult2 = tszip.decompress(tmp_file)
        # We only get the last one?
        self.assertEqual(ts_mult1.tables, self.ts.tables)  # this line fails
        self.assertEqual(ts_mult2.tables, self.ts2.tables)

If we can only compress one tree-sequence file at a time on the command line, should we raise a warning when a user tries to compress multiple tree sequences to a single stdout stream?
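For example, a hypothetical argparse-level guard (the option names here are illustrative, not tszip's actual CLI flags):

    import argparse

    # Hypothetical CLI sketch; "files" and "-c/--stdout" are assumed names.
    parser = argparse.ArgumentParser(prog="tszip")
    parser.add_argument("files", nargs="+")
    parser.add_argument("-c", "--stdout", action="store_true")
    args = parser.parse_args()
    if args.stdout and len(args.files) > 1:
        parser.error("cannot compress more than one tree sequence to stdout")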

jeromekelleher (Member)

Looks like we still have 14 commits @aabiddanda, so maybe you didn't force push or something? This guide should help, but I'm happy to do the squashing if you prefer.

Re the multiple files thing - hmm, good catch. Maybe we should merge this much first, and open a separate issue to track that problem? We can then decide how to deal with it before releasing. (Just raising an error is fine if we need to IMO - it's an edge case)

feature: tszip compression to stdout
aabiddanda (Contributor, Author)

Alright, I think it's now squashed to one commit @jeromekelleher (I'm always amazed by the rebasing process).

Re: multiple files, I agree with your suggestion of opening a separate issue and potentially adding a test for that edge case.

jeromekelleher (Member) left a comment

This is great, thanks for taking the time to work this through @aabiddanda!
