Compressing tree-sequences to stdout #53

Merged · 1 commit · Feb 18, 2022

Conversation

aabiddanda (Contributor)

In response to #49, I've implemented a version (plus some tests) that allows one to compress a single tree sequence to stdout and pipe it to a file of choice. The code is a bit clunky with respect to file handling, so I'm opening this as a draft PR initially to get some feedback.

There is one outstanding thing that I have not added: compressing multiple tree sequences to the stdout stream and verifying that we can read them back in sequence (similar to sequential calls to tskit.load on a single file).
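For illustration, here is a hedged sketch of the usage this PR is aiming for (the -c flag matches the tests later in this thread; the file names and the Python-level API shape are assumptions, not the PR's final code):

    # Shell usage being targeted (assumed): tszip -c example.trees > example.trees.tsz
    # Python-level equivalent, assuming compress() eventually accepts a binary stream:
    import sys

    import tskit
    import tszip

    ts = tskit.load("example.trees")       # hypothetical input file
    tszip.compress(ts, sys.stdout.buffer)  # stream the compressed bytes to stdout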

codecov-commenter commented Nov 12, 2021

Codecov Report

Merging #53 (693c669) into main (340315a) will decrease coverage by 0.19%.
The diff coverage is 96.15%.


@@            Coverage Diff             @@
##             main      #53      +/-   ##
==========================================
- Coverage   97.76%   97.56%   -0.20%     
==========================================
  Files           6        6              
  Lines         313      329      +16     
  Branches       62       65       +3     
==========================================
+ Hits          306      321      +15     
- Misses          5        6       +1     
  Partials        2        2              
Impacted Files         Coverage            Δ
tszip/cli.py           99.04% <83.33%>     (-0.96%) ⬇️
tszip/compression.py   98.96% <100.00%>    (+0.06%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

jeromekelleher (Member) commented Nov 12, 2021

Thanks for following up on this @aabiddanda!

Have you tried passing sys.stdout to zarr directly? It looks to me like the zipfile interface supports this, and the zarr code just passes the path argument through. So, ideally, we could pass the file handle through directly?

aabiddanda (Contributor, Author)

I've tried every permutation of passing sys.stdout and sys.stdout.buffer directly as a zipfile (which the plain ZipFile interface allows), but I think the broader issue is that zarr expects a full string path (see line 1500 here). This isn't something that can easily be solved by just passing in sys.stdout or even /dev/stdout (which leads to hanging on the file handle on my Mac). Would appreciate any suggestions @jeromekelleher if you have insights here.
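A minimal sketch of the failure mode being described, assuming zarr 2.x behaviour where ZipStore normalises its path argument to an absolute string before ever reaching zipfile.ZipFile:

    import sys
    import zarr

    try:
        # zipfile.ZipFile would accept this handle, but zarr never gets that
        # far: it normalises the "path" to an absolute string first.
        store = zarr.ZipStore(sys.stdout.buffer, mode="w")
    except TypeError as err:
        print(f"zarr rejects the handle: {err}", file=sys.stderr)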

jeromekelleher (Member) left a comment

What do you think of this version @aabiddanda?

@@ -95,8 +97,13 @@ def compress(ts, path, variants_only=False):
    with zarr.ZipStore(filename, mode="w") as store:
        root = zarr.group(store=store)
        compress_zarr(ts, root, variants_only=variants_only)
    os.replace(filename, destination)
    logging.info(f"Wrote {destination}")
    if stdout:
jeromekelleher (Member):

Ah, it looks like we're actually writing a temporary file anyway, so it's not that bad. How about this (untested!):

if isinstance(destination, (str, pathlib.Path)):
    os.replace(filename, destination)
    logging.info(f"Wrote {destination}")
else:
    # Assume that destination is a file-like object open in "wb" mode.
    with open(filename, "rb") as source:
        chunk_size = 2**20  # 1MiB
        for chunk in iter(functools.partial(source.read, chunk_size), b""):
            destination.write(chunk)

That way we don't need the stdout option and can write to any suitable file-like object.
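Assuming this lands as written, calling code could then treat paths and streams uniformly; a hypothetical usage sketch:

    import sys
    import tszip

    # ts: a tskit.TreeSequence loaded elsewhere.
    # Path destination: written via the atomic os.replace route, as before.
    tszip.compress(ts, "example.trees.tsz")

    # File-like destination: the temporary zip is copied into the stream in chunks.
    tszip.compress(ts, sys.stdout.buffer)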

aabiddanda (Contributor, Author):

In general I like this solution, as it keeps the function signature the same. It needs a little fiddling, though: constructing the temporary directory in place does not work reliably when the path being fed in is sys.stdout, but I think that can be worked around. I'll see what I can do to make this work with the various tests in place as well.

    destination = str(path)
    # Write the file into a temporary directory on the same file system so that
    # we can write the output atomically.
    destdir = os.path.dirname(os.path.abspath(destination))
    with tempfile.TemporaryDirectory(dir=destdir, prefix=".tszip_work_") as tmpdir:
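One possible shape for that workaround, as a hedged sketch (branching on the destination type before touching the filesystem; this is not the PR's actual code):

    import os
    import pathlib
    import tempfile

    def compress(ts, destination, variants_only=False):
        if isinstance(destination, (str, pathlib.Path)):
            # Same-filesystem temporary directory, so os.replace stays atomic.
            destdir = os.path.dirname(os.path.abspath(str(destination)))
        else:
            # File-like destination: the default temporary location is fine,
            # since we copy bytes into the stream instead of renaming a file.
            destdir = None
        with tempfile.TemporaryDirectory(dir=destdir, prefix=".tszip_work_") as tmpdir:
            ...  # write the ZipStore into tmpdir, then replace or stream it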

aabiddanda marked this pull request as ready for review on February 13, 2022.
aabiddanda (Contributor, Author)

After a brief hiatus, the issues with running tszip compression to stdout are now appropriately handled. Would appreciate any comments to help get this across the line.

jeromekelleher (Member)

Thanks for picking this up again @aabiddanda! I've added a few small tweaks; what do you think? Basically, I don't like the idea of complicating the library code just to make the tests work - I think this is cleaner, even if the tests are a bit more complicated.

If you're happy, then can you rebase and squash the commits down to one so we can merge please?

aabiddanda (Contributor, Author)

Should be squashed and good to go.

However, it should be noted that the behavior when compressing multiple files to stdout is a little different from gzip with the -c flag. When testing this, it seems that only the final compressed tree sequence is recoverable. For example, a test like this fails:

    def test_compress_stdout_multiple(self):
        self.assertTrue(self.trees_path.exists())
        tmp_file = pathlib.Path(self.tmpdir.name) / "stdout_mult.trees.tsz"
        with mock.patch("tszip.cli.get_stdout", wraps=get_stdout_for_pytest):
            stdout, stderr = self.run_tszip_stdout(
                ["-c", str(self.trees_path), str(self.trees_path2)]
            )
        with open(tmp_file, "wb+") as tmp:
            tmp.write(stdout)
        ts_mult1 = tszip.decompress(tmp_file)
        ts_mult2 = tszip.decompress(tmp_file)
        # We only get the last one?
        self.assertEqual(ts_mult1.tables, self.ts.tables)  # this line fails
        self.assertEqual(ts_mult2.tables, self.ts2.tables)

If we can only compress one tree-sequence file at a time on the command line, should we raise a warning when a user tries to compress multiple tree sequences to a single stdout stream?
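For example, a hypothetical argparse-level guard (the option names here are illustrative, not tszip's actual CLI flags):

    import argparse

    # Hypothetical CLI sketch; "files" and "-c/--stdout" are assumed names.
    parser = argparse.ArgumentParser(prog="tszip")
    parser.add_argument("files", nargs="+")
    parser.add_argument("-c", "--stdout", action="store_true")
    args = parser.parse_args()
    if args.stdout and len(args.files) > 1:
        parser.error("cannot compress more than one tree sequence to stdout")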

jeromekelleher (Member)

Looks like we still have 14 commits @aabiddanda, so maybe you didn't force push or something? This guide should help, but I'm happy to do the squashing if you prefer.

Re the multiple files thing - hmm, good catch. Maybe we should merge this much first, and open a separate issue to track that problem? We can then decide how to deal with it before releasing. (Just raising an error is fine if we need to IMO - it's an edge case)

feature: tszip compression to stdout
aabiddanda (Contributor, Author)

Alright, I think it's now squashed to one commit @jeromekelleher (I'm always amazed by the rebasing process).

Re: multiple files, I agree with your suggestion of opening a separate issue and potentially adding a test for that edge case.

jeromekelleher (Member) left a comment

This is great, thanks for taking the time to work this through @aabiddanda!
