Skip to content

comments from arrow #6

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
martindurant opened this issue May 15, 2018 · 4 comments
Closed

comments from arrow #6

martindurant opened this issue May 15, 2018 · 4 comments

Comments

@martindurant
Copy link
Member

@pitrou wrote the following on [email protected]

Here are some comments:

  • API naming: you seem to favour re-using Unix command-line monickers in
    some places, while using more regular verbs or names in other
    places. I think it should be consistent. Since the Unix
    command-line doesn't exactly cover the exposed functionality, and
    since Unix tends to favour short cryptic names, I think it's better
    to use Python-like naming (which is also more familiar to non-Unix
    users). For example "move" or "rename" or "replace" instead of "mv",
    etc.

  • **kwargs parameters: a couple APIs (mkdir, put...) allow passing
    arbitrary parameters, which I assume are intended to be
    backend-specific. It makes it difficult to add other optional
    parameters to those APIs in the future. So I'd make the
    backend-specific directives a single (optional) dict parameter rather
    than a **kwargs.

  • invalidate_cache doesn't state whether it invalidates recursively
    or not (recursively sounds better intuitively?). Also, I think it
    would be more flexible to take a list of paths rather than a single
    path.

  • du: the effect of the deep parameter isn't obvious to me. I don't
    know what it would mean not to recurse here: what is the size of a
    directory if you don't recurse into it?

  • glob may need a formal definition (are trailing slashes
    significant for directory or symlink resolution? this kind of thing),
    though you may want to keep edge cases backend-specific.

  • are head and tail at all useful? They can be easily recreated
    using a generic open facility.

  • read_block tries to do too much in a single API IMHO, and
    using open directly is more flexible anyway.

  • if touch is intended to emulate the Unix API of the same name, the
    docstring should state "Create empty file or update last modification
    timestamp".

  • the information dicts returned by several APIs (ls, info....)
    need standardizing, at least for non backend-specific fields.

  • if the backend is a networked filesystem with non-trivial latency,
    perhaps the operations would deserve being batched (operate on
    several paths at once), though I will happily defer to your expertise
    on the topic.

@martindurant
Copy link
Member Author

@wesm commented on dev@arrow

https://cwiki.apache.org/confluence/display/ARROW/Python+Filesystems+and+Filesystem+API

I added a JIRA filter and tagged a couple of filesystem-related
issues; there are more that should be added to the list. There's a lot
of other work related to filesystem implementations that we can help
organize and plan here.

As far as refining the details of the API, we should see what's the
best place to collect feedback and discuss. Martin, can you set up a
pull request with the entire patch so that many people can comment and
discuss?

NB: TensorFlow defines a filesystem abstraction, albeit in C++ with
SWIG bindings. We might also look there as a check on some of our
assumptions.

@martindurant
Copy link
Member Author

Martin, can you set up a pull request with the entire patch

@wesm , sorry, I didn't understand this - a PR of which patch to where?

I think I said this previously, I'd be very happy to see everything in this repo be taken over by apache/arrow, if it's at all useful. That might in itself improve conversation. I'm also thoroughly behind amending the file-system APIs I'm involved in to adhere to a consensus design.

The link to TensorFlow is interesting, a fairly different and minimalist take on filesystem functions. I assume that there is no file-like layer (yet), i.e., you can neither grab a file descriptor handle nor pass anything directly to python functions expecting to read from files.

@wesm
Copy link

wesm commented Jul 6, 2018

It would be just for the purposes of code review / line-by-line comments. So you could create an empty branch (say based on the repo's initial commit) and then make a PR from HEAD into the empty branch

@martindurant
Copy link
Member Author

#7

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants