This repository was archived by the owner on Aug 25, 2024. It is now read-only.

source: file: Compression #15

Closed
4 tasks done
johnandersen777 opened this issue Mar 10, 2019 · 23 comments
Labels
enhancement (New feature or request), gsoc (Google Summer of Code related), project (Issues which will take a while to complete)

Comments

@johnandersen777

johnandersen777 commented Mar 10, 2019

DFFML is hoping to participate in Google Summer of Code (GSoC) under the Python Software Foundation umbrella. You can read all about what this means at http://python-gsoc.org/. This issue, and any others tagged gsoc and project, are not generally available bugs; they are project ideas for GSoC.

Project Idea: File Source Compression

Project description:

DFFML's initial release includes a FileSource which saves and loads data from files using the load_fd and dump_fd methods.

JSON Example

    async def load_fd(self, fd):
        repos = json.load(fd)
        self.mem = {src_url: Repo(src_url, data=data) \
                    for src_url, data in repos.items()}
        LOGGER.debug('%r loaded %d records', self, len(self.mem))

    async def dump_fd(self, fd):
        json.dump({repo.src_url: repo.dict() for repo in self.mem.values()}, fd)
        LOGGER.debug('%r saved %d records', self, len(self.mem))

For the open method of FileSource

    async def _open(self):
        if not os.path.exists(self.filename) \
                or os.path.isdir(self.filename):
            LOGGER.debug('%r is not a file, initializing memory to empty dict',
                         self.filename)
            self.mem = {}
            return
        with open(self.filename, 'r') as fd:
            await self.load_fd(fd)

Allow for reading and writing the following file formats, transparently (so without subclasses having to do anything) to any source which is a subclass of FileSource.

Skills: Python, git
Difficulty level: Easy

Related Readings/Links:

See https://docs.python.org/3/library/archiving.html for documentation

Potential mentors: @pdxjohnny

Getting Started: Figure out how to do one of the file types, probably gzip (as that probably is as simple as using https://docs.python.org/3/library/gzip.html#gzip.GzipFile if the filename ends in .gz) then move on to the rest. For now just make modifications directly to the FileSource class. We may have you split out the logic later, but don't worry about another class for now.
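
A minimal sketch of that gzip starting point, using only the standard library. The helper name `open_maybe_gzip` and the `roundtrip` function are hypothetical and not part of DFFML. The sketch assumes `gzip.open` in text mode is acceptable, so that `json.load` and `json.dump` receive the same kind of file object they would get from `open`.

```python
import gzip
import json
import os
import tempfile


def open_maybe_gzip(filename, mode="r"):
    """Hypothetical helper: use gzip.open for .gz files, else built-in open."""
    if filename.endswith(".gz"):
        # "rt" / "wt" make gzip decode and encode text, like open() in text mode
        return gzip.open(filename, mode + "t")
    return open(filename, mode)


def roundtrip(data, filename):
    """Write data as JSON to filename, then read it back."""
    with open_maybe_gzip(filename, "w") as fd:
        json.dump(data, fd)
    with open_maybe_gzip(filename, "r") as fd:
        return json.load(fd)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tempdir:
        for name in ("repos.json", "repos.json.gz"):
            path = os.path.join(tempdir, name)
            print(name, roundtrip({"a": {"features": {"x": 1}}}, path))
```

The caller never needs to know whether the file was compressed, which is the "transparent" behavior the project description asks for.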

What we want to see in your application: Describe how you intend to solve the problem, and give us some "stretch goals", maybe implement a remote file source which reads from URLs. Don't forget to include some time for building appropriate tests.

@johnandersen777 added the enhancement, gsoc, and project labels on Mar 10, 2019
@yashlamba
Contributor

Hey! I am Yash from Cluster Innovation Centre, University of Delhi, pursuing a BTech in Information Technology and Mathematical Innovations. I am interested in contributing to DFFML this summer. Can you suggest a potential starting point for this project?

@johnandersen777
Author

Hi Yash! Check out https://github.com/intel/dffml/wiki/DFFML-Ideas-Page-for-GSoC-2019#getting-started first. Make sure you can run the tests. Then I'd suggest looking at https://docs.python.org/3/library/archiving.html and making some tests which save and load those file types. After that, take what you've done and integrate it with FileSource

Note: All ideas are open to anyone until someone's proposal is chosen. See http://python-gsoc.org/students.html for more info

@yashlamba
Contributor

So I went through the steps and am able to run all the tests successfully. However, I have some doubts over how to contribute, can I mail you directly or is there some other means of contacting about doubts directly?

@johnandersen777
Author

Ya, you can email me: [email protected]. However, ideally all discussion is kept on GitHub (maybe in #12) so that if you ask a question others might have, my response is viewable to them as well.

@yashlamba
Contributor

Going by the gzip module, it is basically a compression module that reads and writes either str or bytes. Will we be writing JSON objects or something predefined, or should I use arbitrary objects just to implement it and write tests for now?

For using dictionaries, we still need to use json for encoding and decoding to bytes (https://stackoverflow.com/questions/39450065/python-3-read-write-compressed-json-objects-from-to-gzip-file)
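
A minimal sketch of that encode-then-compress step, assuming UTF-8 JSON as the serialization. The function names and the example record are hypothetical and used only for illustration.

```python
import gzip
import json


def dict_to_gzip_bytes(data):
    """Encode a dict as UTF-8 JSON, then gzip-compress the bytes."""
    return gzip.compress(json.dumps(data).encode("utf-8"))


def gzip_bytes_to_dict(blob):
    """Reverse of dict_to_gzip_bytes: decompress, decode, parse."""
    return json.loads(gzip.decompress(blob).decode("utf-8"))


if __name__ == "__main__":
    repos = {"https://example.com/repo": {"features": {"commits": 10}}}
    blob = dict_to_gzip_bytes(repos)
    assert gzip_bytes_to_dict(blob) == repos
    print(len(blob), "compressed bytes")
```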

I have referred mainly to
https://www.journaldev.com/19827/python-gzip-compress-decompress#python-gzip-module
and
https://docs.python.org/3/library/gzip.html

For the basics, I have implemented the following. Is this what is needed for now? If so, I'll open a WIP PR:

For reading:

    async def load_fd(self, fd):
        # fd is an already open binary file object, so pass it as fileobj
        with gzip.GzipFile(fileobj=fd, mode='rb') as f:
            repos = f.read()
        #LOGGER.debug('%r loaded %d records', self, len(self.mem))

For Writing:

    async def dump_fd(self, fd):
        data = b'data'
        # The with block closes f, so no explicit close() is needed
        with gzip.GzipFile(fileobj=fd, mode='wb') as f:
            f.write(data)

@johnandersen777
Author

    with open(self.filename, 'r') as fd:
        await self.load_fd(fd)

would be

    if self.filename.endswith('.gz'):
        # Filename ends with .gz, so open it through gzip in text mode.
        opener = gzip.open(self.filename, 'rt')
    else:
        # Otherwise just open the file.
        opener = open(self.filename, 'r')

    with opener as fd:
        await self.load_fd(fd)

@yashlamba
Contributor

That's the part that opens the file (and it integrates it into FileSource, which was to be done later), I get it. What about reading: what exactly would we be reading? This might sound silly, but it is really confusing to me. Which load_fd function will be called, and based on what data?

@johnandersen777
Author

Yes my bad, I said later, but since we're figuring things out as we go like this, we will just do it now. FileSource is an abstract base class. Which means that classes which inherit from it will have to define those methods.

    @abc.abstractmethod
    async def load_fd(self, fd):
        pass # pragma: no cover

    @abc.abstractmethod
    async def dump_fd(self, fd):
        pass # pragma: no cover

    async def load_fd(self, fd):
        repos = json.load(fd)
        self.mem = {src_url: Repo(src_url, data=data) \
                    for src_url, data in repos.items()}
        LOGGER.debug('%r loaded %d records', self, len(self.mem))

    async def dump_fd(self, fd):
        json.dump({repo.src_url: repo.dict() for repo in self.mem.values()}, fd)
        LOGGER.debug('%r saved %d records', self, len(self.mem))
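
The relationship between the abstract methods and a concrete subclass can be sketched in a self-contained way. `FileSourceSketch` and `JSONSketch` are hypothetical stand-ins, not the real DFFML classes; they leave out `Repo`, logging, and file handling, and only show that the subclass deals with the format while the base class is free to decide how the file object is opened.

```python
import abc
import asyncio
import io
import json


class FileSourceSketch(abc.ABC):
    """Stand-in for FileSource; only the abstract methods are modeled."""

    def __init__(self):
        self.mem = {}

    @abc.abstractmethod
    async def load_fd(self, fd):
        pass

    @abc.abstractmethod
    async def dump_fd(self, fd):
        pass


class JSONSketch(FileSourceSketch):
    """Concrete subclass: it knows about JSON, not about compression."""

    async def load_fd(self, fd):
        self.mem = json.load(fd)

    async def dump_fd(self, fd):
        json.dump(self.mem, fd)


async def demo():
    """Load from one in-memory file object and dump to another."""
    source = JSONSketch()
    await source.load_fd(io.StringIO('{"a": 1}'))
    out = io.StringIO()
    await source.dump_fd(out)
    return out.getvalue()


if __name__ == "__main__":
    try:
        FileSourceSketch()
    except TypeError as error:
        # Abstract classes cannot be instantiated directly
        print("TypeError:", error)
    print(asyncio.run(demo()))
```

Because `load_fd` and `dump_fd` only see a file-like object, any text-mode file object (plain or decompressing) can be handed to them unchanged.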

@yashlamba
Contributor

Got that! So ultimately, there would be a gzipsource.py that would have load_fd and dump_fd defined. But my issue with that is what data I would be reading. Do I write it as bytes-encoded JSON? This will be almost clear to me once I know what I would finally be reading or writing, as gzip only accepts either strings or bytes.

@johnandersen777
Author

johnandersen777 commented Mar 21, 2019

What needs to be done for GZip to be finished is to modify FileSource (For open that's: #15 (comment))

Then repeat for close (using the w flag for write instead of the r flag).

    if not self.readonly:
        with open(self.filename, 'w') as fd:
            await self.dump_fd(fd)

Edit: To clarify, the end result is that no new source classes will be added, just modifications to FileSource.
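
One possible shape for that modification, sketched outside of FileSource so it runs on its own. The `OPENERS` table and the `pick_opener`, `save`, and `load` names are assumptions for illustration; the extensions covered here (`.gz`, `.bz2`, `.xz`) are simply the ones whose standard library modules expose an `open(filename, mode)` function with text modes.

```python
import bz2
import gzip
import json
import lzma
import os
import tempfile

# Assumed extension-to-opener table; each opener accepts (filename, mode)
OPENERS = {".gz": gzip.open, ".bz2": bz2.open, ".xz": lzma.open}


def pick_opener(filename):
    """Return the opener for the filename's extension, defaulting to open()."""
    _, extension = os.path.splitext(filename)
    return OPENERS.get(extension, open)


def save(filename, data):
    """Write data as JSON; "wt" is text mode for all four openers."""
    with pick_opener(filename)(filename, "wt") as fd:
        json.dump(data, fd)


def load(filename):
    """Read JSON back; "rt" is text mode for all four openers."""
    with pick_opener(filename)(filename, "rt") as fd:
        return json.load(fd)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tempdir:
        for extension in ("", ".gz", ".bz2", ".xz"):
            path = os.path.join(tempdir, "repos.json" + extension)
            save(path, {"repo": {"commits": 5}})
            print(path, load(path))
```

The same lookup serves both open (read) and close (write), so the two code paths in FileSource would differ only in the mode they pass.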

@yashlamba
Contributor

@pdxjohnny You mentioned that gzip is the easiest to implement. If I start implementing the bz2 module, what differences might I face? I read the module documentation and it seems almost the same.

@johnandersen777
Author

I'm not sure. I'd say that you could start by creating a testcase, and using those modules to create files of those types with JSON or CSV data in them. Then see if JSONSource and CSVSource read the correct repo data from the files. They should throw errors until you implement the correct handling for GZip, etc., in the open and close methods of FileSource.
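
A rough sketch of such a testcase, using only the standard library. It does not import DFFML: plain `json.load` stands in for JSONSource, and `RECORDS` is made-up data, so treat the names here as placeholders. The first test shows the kind of error to expect before compression support exists.

```python
import bz2
import gzip
import json
import os
import tempfile
import unittest

# Made-up records standing in for repo data
RECORDS = {"repo0": {"features": {"commits": 3}}}


class TestCompressedJSON(unittest.TestCase):
    def setUp(self):
        self.tempdir = tempfile.TemporaryDirectory()
        self.addCleanup(self.tempdir.cleanup)

    def write(self, name, opener):
        """Create a compressed JSON file using the given module's open()."""
        path = os.path.join(self.tempdir.name, name)
        with opener(path, "wt") as fd:
            json.dump(RECORDS, fd)
        return path

    def test_plain_open_fails_on_gzip(self):
        # Without compression support, reading the raw bytes as JSON fails
        path = self.write("repos.json.gz", gzip.open)
        with self.assertRaises(ValueError):
            with open(path, "r") as fd:
                json.load(fd)

    def test_gzip_roundtrip(self):
        path = self.write("repos.json.gz", gzip.open)
        with gzip.open(path, "rt") as fd:
            self.assertEqual(json.load(fd), RECORDS)

    def test_bz2_roundtrip(self):
        path = self.write("repos.json.bz2", bz2.open)
        with bz2.open(path, "rt") as fd:
            self.assertEqual(json.load(fd), RECORDS)
```

Saved to a file, this can be run with `python -m unittest`. In the real test, the round-trip reads would go through JSONSource or CSVSource instead of calling `gzip.open` directly.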

@yashlamba
Contributor

Okay, Got it! Thank you so much for your help and patience.

@johnandersen777
Author

No problemo! Thank you for your contribution!

@yashlamba
Contributor

Should I start with bz2? I read the module and found that there's nothing much different.

@johnandersen777
Author

Ya go with whatever sounds good to you

@yashlamba
Contributor

Hey!
So would support for .tar files be useful? If we can wrap this up soon, it would be easy for me to document.

@johnandersen777
Author

Hi Yash! Sorry, I am still working on a reply to your email. I think this is pretty much done. I don't think tar support is needed right now. If you want to document what's been implemented in relation to this, that would be awesome. Thank you!

@yashlamba
Contributor

Okay, I'll start working on documenting this and the other source-related classes. I have spent the past couple of days understanding the code for them.
Thank you.

@johnandersen777
Author

johnandersen777 commented Mar 27, 2019 via email

@yashlamba
Contributor

Hey! So I have a couple questions:

  1. Are we planning to finalize the zip module?
  2. How detailed should the documentation be? I have written about the FileSource class and its subclasses along with the supported modules, but do I need to document the superclass (Source) and its functionality too?

@johnandersen777
Author

  1. Yes! (I've been caught up with df: Data Flow Facilitator #25)
  2. Do whatever you feel like, but it would probably be good to document Source.

@johnandersen777
Author

Closed via #38
