source: file: Compression #15
Comments
Hey! I am Yash from Cluster Innovation Centre, University of Delhi, pursuing a BTech in Information Technology and Mathematical Innovations. I am interested in contributing to DFFML this summer. Can you suggest a good starting point for this project? |
Hi Yash! Check out https://github.com/intel/dffml/wiki/DFFML-Ideas-Page-for-GSoC-2019#getting-started first. Make sure you can run the tests. Then I'd suggest looking at https://docs.python.org/3/library/archiving.html and making some tests which save and load those file types. After that, take what you've done and integrate it with Note: All ideas are open to anyone until someone's proposal is chosen. See http://python-gsoc.org/students.html for more info |
So I went through the steps and am able to run all the tests successfully. However, I have some doubts over how to contribute, can I mail you directly or is there some other means of contacting about doubts directly? |
Ya, you can email me: [email protected]. However, ideally all discussion is kept on GitHub (maybe in #12) so that if you ask a question others might have, my response is viewable to them as well. |
Going by the gzip module, it is basically a compression module that reads and writes either str or bytes. I wanted to ask whether we will be writing JSON objects or something pre-defined, or shall I take random objects just to implement it and write tests for now? For using dictionaries, we still need to use json for encoding and decoding to bytes (https://stackoverflow.com/questions/39450065/python-3-read-write-compressed-json-objects-from-to-gzip-file) I have referred mainly to For the basics, I have implemented the following. Is this what is needed for now? I'll open a WIP: PR if this is what is needed: For reading:
For Writing:
|
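A minimal sketch of the kind of read and write being asked about here, assuming JSON as the payload (the thread has not settled on a format at this point). The helper names `dump_json_gz` and `load_json_gz`, the file name, and the sample data are made up for illustration and are not part of DFFML. Opening the gzip stream in text mode means `json` can work with `str` directly, so no manual encoding to bytes is needed.

```python
import gzip
import json
import os
import tempfile


def dump_json_gz(path, data):
    # Hypothetical helper: "wt" opens the gzip stream in text mode, so
    # json.dump writes str and the wrapper encodes it to bytes.
    with gzip.open(path, "wt", encoding="utf-8") as fd:
        json.dump(data, fd)


def load_json_gz(path):
    # Hypothetical helper: "rt" decodes the decompressed bytes back to str.
    with gzip.open(path, "rt", encoding="utf-8") as fd:
        return json.load(fd)


with tempfile.TemporaryDirectory() as tempdir:
    example_path = os.path.join(tempdir, "repos.json.gz")
    example_data = {"repo": {"features": {"x": 1}}}  # made-up sample data
    dump_json_gz(example_path, example_data)
    loaded = load_json_gz(example_path)
```

The same two functions would work with binary mode ("wb"/"rb") plus explicit `json.dumps(...).encode()`, which is the approach in the linked Stack Overflow question; text mode is simply shorter.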
Lines 43 to 44 in dd8007d
would be
    # Requires "import gzip" at the top of the module.
    if self.filename.endswith('.gz'):
        # The filename ends with .gz, so open it through gzip
        # (text mode, so load_fd sees str just like a plain file).
        opener = gzip.open(self.filename, 'rt')
    else:
        # Otherwise just open the file.
        opener = open(self.filename, 'r')
    with opener as fd:
        await self.load_fd(fd) |
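To make the suggestion above concrete, here is a self-contained sketch of the same idea covering both reading and writing. `SketchFileSource` is a hypothetical stand-in, not the DFFML `FileSource`; the `close` method, the `_opener` helper, and the use of JSON in `load_fd` and `dump_fd` are assumptions made only so the example runs on its own.

```python
import asyncio
import gzip
import json
import os
import tempfile


class SketchFileSource:
    # Hypothetical stand-in for FileSource; not the DFFML class.
    def __init__(self, filename):
        self.filename = filename
        self.data = {}

    def _opener(self, mode):
        # Pick the opener from the extension. Text mode is used in both
        # branches so load_fd and dump_fd always work with str.
        if self.filename.endswith(".gz"):
            return gzip.open(self.filename, mode + "t")
        return open(self.filename, mode)

    async def open(self):
        with self._opener("r") as fd:
            await self.load_fd(fd)

    async def close(self):
        # Assumed counterpart to open(): write the data back out.
        with self._opener("w") as fd:
            await self.dump_fd(fd)

    async def load_fd(self, fd):
        self.data = json.load(fd)

    async def dump_fd(self, fd):
        json.dump(self.data, fd)


async def round_trip(filename, data):
    writer = SketchFileSource(filename)
    writer.data = data
    await writer.close()
    reader = SketchFileSource(filename)
    await reader.open()
    return reader.data


with tempfile.TemporaryDirectory() as tempdir:
    plain = asyncio.run(round_trip(os.path.join(tempdir, "a.json"), {"k": 1}))
    zipped = asyncio.run(round_trip(os.path.join(tempdir, "a.json.gz"), {"k": 1}))
```

Because the branching lives in one place, `load_fd` and `dump_fd` stay unaware of compression, which is the "transparent" behaviour the issue asks for.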
That's the part to open the file (and it is integrating it into FileSource, which was to be done later), I get it. What about reading, what exactly would we be reading? This might sound silly but is really confusing to me. Which load_fd function will be called and based on what data? |
Yes my bad, I said later, but since we're figuring things out as we go like this, we will just do it now. Lines 54 to 60 in dd8007d
Lines 19 to 27 in dd8007d
|
Got that! So ultimately, there would be a gzipsource.py that would have load_fd and dump_fd defined. But my issue with that is what data I would be reading. Do I write them as bytes encoded json format? This is actually almost clear to me if I get what finally I would be reading or writing as gzip only accepts either string or bytes. |
What needs to be done for GZip to be finished is to modify Then repeat for Lines 50 to 52 in dd8007d
Edit To clarify. The end result is that there will be no more source classes added. Just modifications of |
@pdxjohnny You mentioned that gzip is the easiest to implement. If I start implementing the bz2 module, what differences might I face? I read the module and it seems almost the same. |
I'm not sure. I'd say that you could start by creating a testcase, and using those modules to create files of those types with JSON or CSV data in them. Then see if |
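A rough sketch of the kind of testcase suggested above: create files of each compressed type with CSV data in them, then read them back. The test class name and the sample rows are made up, and including `lzma` alongside `gzip` and `bz2` is an assumption based on the modules listed on the Python archiving documentation page. It relies on `gzip.open`, `bz2.open`, and `lzma.open` all accepting a text mode and a `newline` argument.

```python
import bz2
import csv
import gzip
import io
import lzma
import os
import tempfile
import unittest

# Extension -> module-level open() function with a gzip.open-like signature.
OPENERS = {".gz": gzip.open, ".bz2": bz2.open, ".xz": lzma.open}


class TestCompressedCSV(unittest.TestCase):
    def test_round_trip(self):
        rows = [["feature", "value"], ["x", "1"]]  # made-up sample data
        with tempfile.TemporaryDirectory() as tempdir:
            for ext, opener in OPENERS.items():
                with self.subTest(ext=ext):
                    path = os.path.join(tempdir, "data.csv" + ext)
                    # Write the rows through the compressor in text mode.
                    with opener(path, "wt", newline="") as fd:
                        csv.writer(fd).writerows(rows)
                    # Read them back and compare with what was written.
                    with opener(path, "rt", newline="") as fd:
                        self.assertEqual(list(csv.reader(fd)), rows)


suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestCompressedCSV)
result = unittest.TextTestRunner(stream=io.StringIO()).run(suite)
```

If the three modules behave the same way here, that suggests bz2 support needs little beyond what gzip needs, which matches the impression that the modules are almost the same.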
Okay, Got it! Thank you so much for your help and patience. |
No problemo! Thank you for your contribution! |
Should I start with bz2? I read the module and found that there's nothing much different. |
Ya go with whatever sounds good to you |
Hey! |
Hi Yash! Sorry, I am still working on a reply to your email. I think this is pretty much done. I don't think tar support is needed right now. If you want to document what's been implemented with relation to this, that would be awesome. Thank you! |
Okay, I'll start working on documenting this and other source related classes. I have pretty much spent the past couple of days understanding the code for the same. |
Sweet! Just ping me if there's anywhere you need clarification.
|
Hey! So I have a couple questions:
|
|
Closed via #38 |
DFFML is hoping to participate in Google Summer of Code (GSoC) under the Python Software Foundation umbrella. You can read all about what this means at http://python-gsoc.org/. This issue, and any others tagged gsoc and project are not generally available bugs, but related to project ideas for GSoC.
Project Idea: File Source Compression
Project description:
DFFML's initial release includes a FileSource which saves and loads data from files using the load_fd and dump_fd methods.
dffml/dffml/source/json.py
Lines 19 to 27 in dd8007d
For the open method of FileSource
dffml/dffml/source/file.py
Lines 36 to 44 in dd8007d
Allow for reading and writing the following file formats, transparently (so without subclasses having to do anything) to any source which is a subclass of FileSource.
Skills: Python, git
Difficulty level: Easy
Related Readings/Links:
See https://docs.python.org/3/library/archiving.html for documentation
Potential mentors: @pdxjohnny
Getting Started: Figure out how to do one of the file types, probably gzip (as that probably is as simple as using https://docs.python.org/3/library/gzip.html#gzip.GzipFile if the filename ends in .gz) then move on to the rest. For now just make modifications directly to the FileSource class. We may have you split out the logic later, but don't worry about another class for now.
What we want to see in your application: Describe how you intend to solve the problem, and give us some "stretch goals", maybe implement a remote file source which reads from URLs. Don't forget to include some time for building appropriate tests.