Code review #7

Closed
wants to merge 10 commits into from
1 change: 1 addition & 0 deletions .gitignore
@@ -86,6 +86,7 @@ celerybeat-schedule
.venv
venv/
ENV/
.idea/

# Spyder project settings
.spyderproject
83 changes: 82 additions & 1 deletion README.md
@@ -1,2 +1,83 @@
# filesystem_spec
A specification that python filesystems should adhere to.

A specification for pythonic filesystems.

## Purpose

To produce a template or specification for a file-system interface that specific implementations should follow,
so that applications making use of them can rely on a common behaviour and not have to worry about the specific
internal implementation decisions of any given backend.

In addition, if this is well-designed, then additional functionality, such as a key-value store or FUSE
mounting of the file-system implementation, may be available for all implementations "for free".

## Background

Python provides a standard interface for open files, so that alternate implementations of file-like objects can
work seamlessly with many functions which rely only on the methods of that standard interface. A number of libraries
have implemented a similar concept for file-systems, where file operations can be performed on a logical file-system
which may be local, a structured data store or some remote service.

This repository is intended to be a place to define a standard interface that such file-systems should adhere to,
such that code using them should not have to know the details of the implementation in order to operate on any of
a number of backends.

Everything here is up for discussion, and although a little code has already been included to kick things off, it
is only meant as a suggestion of one possible way of doing things. Hopefully, the community can come together to
define an interface that serves the greatest number of users, and having the specification will make developing
other file-system implementations simpler.

There is no specific model (yet) of how the contents of this repo would be used, whether as a spec to refer to,
or perhaps as something to subclass or use as a mixin; that question can also form part of the conversation.

#### History

I (Martin Durant) have been involved in building a number of remote-data file-system implementations, principally
in the context of the [Dask](http://dask.pydata.org/en/latest/) project. In particular, several are listed
in [the docs](http://dask.pydata.org/en/latest/remote-data-services.html) with links to the specific repositories.
With common authorship, there is much that is similar between the implementations, for example posix-like naming
of the operations, and this has allowed Dask to interact with the various backends and parse generic
URLs in order to select amongst them. However, *some* extra code was required in each case to adapt the peculiarities
of each implementation to the generic usage that Dask demanded. People may find the
[code](https://github.com/dask/dask/blob/master/dask/bytes/core.py#L266) which parses URLs and creates file-system
instances interesting.

At the same time, the Apache [Arrow](https://arrow.apache.org/) project was also concerned with a similar problem,
particularly a common interface to local and HDFS files, for example the
[hdfs](https://arrow.apache.org/docs/python/filesystems.html) interface (which actually communicates with HDFS
via a choice of drivers). These are mostly used internally within Arrow, but Dask was modified to be able
to use the alternate HDFS interface (which solves some security issues with `hdfs3`). In the process, a
[conversation](https://github.com/dask/dask/issues/2880)
was started, and I invite all interested parties to continue the conversation in this location.

There is a good argument that this type of code has no place in Dask, which is concerned with making graphs
representing computations, and executing those graphs on a scheduler. Indeed, the file-systems are generally useful,
and each has a user-base wider than just those who work via Dask.

## Influences

The following are places to consider when choosing how we would like the file-system specification to look:

- Python's [os](https://docs.python.org/3/library/os.html) module and its `path` namespace; also other file-related
functionality in the standard library
- posix/bash method naming conventions that linux/unix/osx users are familiar with; or perhaps their Windows variants
- the existing implementations for the various backends (e.g.,
[gcsfs](http://gcsfs.readthedocs.io/en/latest/api.html#gcsfs.core.GCSFileSystem) or Arrow's
[hdfs](https://arrow.apache.org/docs/python/filesystems.html#hdfs-api))
- [pyfilesystems](https://docs.pyfilesystem.org/en/latest/index.html), an attempt to do something similar, with a
plugin architecture. That project has several types of local file-system, and a lot of well-thought-out
validation code.

## Contents of the Repo

The main proposal here is in `fsspec/spec.py`, a single class with methods and doc-strings, and a little code. The
initial method names were copied from `gcsfs`, but this reflects only laziness on the part of the initial committer.
Although the directory and files look like a Python package, they are not meant for installation or execution
until possibly some later date, or maybe never, if this is to be only a loose reference specification.

In addition, `fsspec/utils.py` contains a couple of useful functions that Dask happens to rely on; it is envisaged
that if the spec here matures into real code, then a number of helpful functions may live alongside the main
definitions. Furthermore, `fsspec/mapping.py` shows how a key-value map may be easily implemented for all file-systems
for free, by adhering to a single definition of the structure. This is meant as a motivator, and happens to be
particularly useful for the [zarr](https://zarr.readthedocs.io) project.
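
As a minimal sketch of how these pieces might fit together (assuming that the `AbstractFileSystem` base class in
`fsspec/spec.py` supplies a no-argument constructor and an `open` method that dispatches to `_open`; the local
path used below is hypothetical):

```python
from fsspec.registry import get_filesystem_class
from fsspec.mapping import FSMap

# 'file' is the only protocol in the proposed registry so far
fs = get_filesystem_class('file')()            # a LocalFileSystem instance

# expose a directory as a mutable key-value store
d = FSMap('/tmp/fsspec_demo', fs, create=True)
d['loc1'] = b'Hello World'
print(d['loc1'])                               # b'Hello World'
```
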
Empty file added fsspec/__init__.py
119 changes: 119 additions & 0 deletions fsspec/local.py
@@ -0,0 +1,119 @@
import os
import shutil
import tempfile
from .spec import AbstractFileSystem


class LocalFileSystem(AbstractFileSystem):
def mkdir(self, path, **kwargs):
os.mkdir(path, **kwargs)

def makedirs(self, path, exist_ok=False):
os.makedirs(path, exist_ok=exist_ok)

def rmdir(self, path):
os.rmdir(path)

def ls(self, path, detail=False):
paths = [os.path.abspath(os.path.join(path, f))
for f in os.listdir(path)]
if detail:
return [self.info(f) for f in paths]
else:
return paths

def walk(self, path, simple=False):
out = os.walk(os.path.abspath(path))
if simple:
results = []
for dirpath, dirnames, filenames in out:
results.extend([os.path.join(dirpath, f) for f in filenames])
return results
else:
return out

def info(self, path):
out = os.stat(path)
if os.path.isfile(path):
t = 'file'
elif os.path.isdir(path):
t = 'directory'
elif os.path.islink(path):
t = 'link'
else:
t = 'other'
result = {
'name': path,
'size': out.st_size,
'type': t,
'created': out.st_ctime
}
for field in ['mode', 'uid', 'gid', 'mtime']:
result[field] = getattr(out, 'st_' + field)
return result

def copy(self, path1, path2, **kwargs):
""" Copy within two locations in the filesystem"""
shutil.copyfile(path1, path2)

get = copy
put = copy

def mv(self, path1, path2, **kwargs):
""" Move file from one location to another """
os.rename(path1, path2)

def rm(self, path, recursive=False):
if recursive:
shutil.rmtree(path)
else:
os.remove(path)

def _open(self, path, mode='rb', block_size=None, **kwargs):
return LocalFileOpener(path, mode, **kwargs)

def touch(self, path, **kwargs):
""" Create empty file, or update timestamp """
if self.exists(path):
os.utime(path, None)
else:
open(path, 'a').close()


class LocalFileOpener(object):
    def __init__(self, path, mode, autocommit=True):
# TODO: does autocommit mean write directory to destination, or
# do move operation immediately on close
self.path = path
self._incontext = False
if autocommit or 'w' not in mode:
self.autocommit = True
self.f = open(path, mode=mode)
else:
# TODO: check if path is writable?
self.autocommit = False
            fd, name = tempfile.mkstemp()
            os.close(fd)  # close the low-level handle; reopen below with the requested mode
            self.temp = name
            self.f = open(name, mode=mode)

def commit(self):
if self._incontext:
raise RuntimeError('Cannot commit while within file context')
os.rename(self.temp, self.path)

def discard(self):
if self._incontext:
raise RuntimeError('Cannot discard while within file context')
if self.autocommit is False:
os.remove(self.temp)

def __getattr__(self, item):
return getattr(self.f, item)

def __enter__(self):
self._incontext = True
return self.f

def __exit__(self, exc_type, exc_value, traceback):
self.f.close()
self._incontext = False
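

# --- Reviewer's sketch: illustrative usage, not part of this diff -----------
# Assumes the AbstractFileSystem base class supplies a no-argument constructor
# and generic helpers such as ``exists`` (used by ``touch`` above); the path
# below is hypothetical.
if __name__ == '__main__':
    fs = LocalFileSystem()
    fs.makedirs('/tmp/fsspec_demo', exist_ok=True)
    fs.touch('/tmp/fsspec_demo/a')                 # create an empty file
    print(fs.ls('/tmp/fsspec_demo'))               # absolute paths of directory contents
    print(fs.info('/tmp/fsspec_demo/a')['type'])   # 'file'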
104 changes: 104 additions & 0 deletions fsspec/mapping.py
@@ -0,0 +1,104 @@

from collections.abc import MutableMapping


class FSMap(MutableMapping):
"""Wrap a FileSystem instance as a mutable wrapping.

The keys of the mapping become files under the given root, and the
values (which must be bytes) the contents of those files.

Parameters
----------
root : string
prefix for all the files
fs : FileSystem instance
    check : bool (=False)
        performs a touch at the location, to check for write access.
    create : bool (=False)
        create the root directory if it does not already exist.

Examples
--------
>>> fs = FileSystem(**parameters) # doctest: +SKIP
>>> d = FSMap('my-data/path/', fs) # doctest: +SKIP
>>> d['loc1'] = b'Hello World' # doctest: +SKIP
>>> list(d.keys()) # doctest: +SKIP
['loc1']
>>> d['loc1'] # doctest: +SKIP
b'Hello World'
"""

def __init__(self, root, fs, check=False, create=False):
self.fs = fs
self.root = root
if create:
self.fs.mkdir(root)
if check:
if not self.fs.exists(root):
                raise ValueError("Path %s does not exist. Create "
                                 "with the ``create=True`` keyword" % root)
self.fs.touch(root+'/a')
self.fs.rm(root+'/a')

def clear(self):
"""Remove all keys below root - empties out mapping
"""
try:
self.fs.rm(self.root, True)
self.fs.mkdir(self.root)
except (IOError, OSError):
pass

def _key_to_str(self, key):
"""Generate full path for the key"""
return '/'.join([self.root, key])

def _str_to_key(self, s):
"""Strip path of to leave key name"""
return s[len(self.root) + 1:]

def __getitem__(self, key, default=None):
"""Retrieve data"""
key = self._key_to_str(key)
try:
with self.fs.open(key, 'rb') as f:
result = f.read()
except (IOError, OSError):
if default is not None:
return default
raise KeyError(key)
return result

def __setitem__(self, key, value):
"""Store value in key"""
key = self._key_to_str(key)
with self.fs.open(key, 'wb') as f:
f.write(value)

def keys(self):
"""List currently defined keys"""
return (self._str_to_key(x) for x in self.fs.walk(self.root))

def __iter__(self):
return self.keys()

def __delitem__(self, key):
"""Remove key"""
self.fs.rm(self._key_to_str(key))

def __contains__(self, key):
"""Does key exist in mapping?"""
return self.fs.exists(self._key_to_str(key))

def __len__(self):
"""Number of stored elements"""
return sum(1 for _ in self.keys())

def __getstate__(self):
"""Mapping should be pickleable"""
return self.fs, self.root

def __setstate__(self, state):
fs, root = state
self.fs = fs
self.root = root
20 changes: 20 additions & 0 deletions fsspec/registry.py
@@ -0,0 +1,20 @@
import importlib
__all__ = ['registry', 'get_filesystem_class', 'default']

registry = {}
default = 'fsspec.local.LocalFileSystem'

known_implementations = {
'file': default,
}


def get_filesystem_class(protocol):
if protocol not in registry:
if protocol not in known_implementations:
raise ValueError("Protocol not known: %s" % protocol)
        # look up the fully-qualified class name for this protocol and import it,
        # e.g. 'fsspec.local.LocalFileSystem' -> module 'fsspec.local', class 'LocalFileSystem'
        mod, name = known_implementations[protocol].rsplit('.', 1)
        mod = importlib.import_module(mod)
        registry[protocol] = getattr(mod, name)

return registry[protocol]
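

# --- Reviewer's sketch: illustrative usage, not part of this diff -----------
# A protocol string resolves to an implementation class which can then be
# instantiated; 'file' is the only protocol registered so far.
if __name__ == '__main__':
    cls = get_filesystem_class('file')   # -> fsspec.local.LocalFileSystem
    print(cls.__name__)                  # 'LocalFileSystem'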