Read csv headers #37966
Merged
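This PR lets storage_options carry custom HTTP headers when pandas reads from a plain (non-fsspec) URL, in readers such as read_csv, read_json, and read_parquet. A minimal usage sketch; the URL and header value below are placeholders, not taken from the PR:

import pandas as pd

# Placeholder URL; any HTTP(S) endpoint serving a CSV would do.
url = "https://example.com/data.csv"

# With this change, the dict is sent as HTTP request headers instead of being rejected.
df = pd.read_csv(url, storage_options={"User-Agent": "pandas-client/1.0"})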
Commits (44)
bb3e8e6
storage_options as headers and tests added
db51474
additional tests - gzip, test additional headers receipt
6f901b8
bailed on using threading for testing
3af6a3d
clean up comments add json http tests
bad5739
Merge branch 'master' into read_csv_headers to update
8f5a0f1
added documentation on storage_options for headers
9fcc72a
DOC:Added doc for custom HTTP headers in read_csv and read_json
df6e539
DOC:Corrected versionadded tag and added issue number for reference
98db1c4
DOC:updated storage_options documentation
f28f36c
TST:updated with tm.assert_frame_equal
dd3265f
TST:fixed incorrect usage of tm.assert_frame_equal
02fc840
CLN:reordered imports to fix pre-commit error
da97f0a
DOC:changed whatsnew and added to shared_docs.py GH36688
fce4b17
ENH: read nonfsspec URL with headers built from storage_options GH36688
e0cfcb6
TST:Added additional tests parquet and other read methods GH36688
33115b7
TST:removed mocking in favor of threaded http server
5a1c64e
DOC:refined storage_options doscstring
018a399
Merge branch 'master' into read_csv_headers
cdknox 87d7dc6
CLN:used the github editor and had pep8 issues
64a0d19
CLN: leftover comment removed
1724e9b
TST:attempted to address test warning of unclosed socket GH36688
f8b8c43
TST:added pytest.importorskip to handle the two main parquet engines …
a17d574
CLN: imports moved to correct order GH36688
eed8915
TST:fix fastparquet tests GH36688
75573a4
CLN:removed blank line at end of docstring GH36688
dc596c6
CLN:removed excess newlines GH36688
e27e3a9
CLN:fixed flake8 issues GH36688
734c9d3
TST:renamed a test that was getting clobbered and fixed the logic GH3…
8a5c5a3
CLN:try to silence mypy error via renaming GH36688
978d94a
TST:pytest.importorfail replaced with pytest.skip GH36688
807eb25
TST:content of dataframe on error made more useful GH36688
44c2869
CLN:fixed flake8 error GH36688
01ce3ae
TST: windows fastparquet error needs raised for troubleshooting GH36688
13bc775
CLN:fix for flake8 GH36688
6915517
TST:changed compression used in to_parquet from 'snappy' to None GH36688
186b0a4
TST:allowed exceptions to be raised via removing a try except block G…
88e9600
TST:replaced try except with pytest.importorskip GH36688
2a05d0f
CLN:removed dict() in favor of {} GH36688
d38a813
Merge branch 'master' into read_csv_headers
268e06a
DOC: changed potentially included version from 1.2.0 to 1.3.0 GH36688
565197f
TST:user agent tests moved from test_common to their own file GH36688
842e594
TST: used fsspec instead of patching bytesio GH36688
c0c3d34
TST: added importorskip for fsspec on FastParquet test GH36688
7025abb
TST:added missing importorskip to fsspec in another test GH36688
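Several commits above replace mocking with a local threaded HTTP server that records the headers the reader sends. A minimal sketch of that testing approach; the handler and names below are illustrative, not the PR's actual test code:

import http.server
import threading

import pandas as pd


class HeaderCaptureHandler(http.server.BaseHTTPRequestHandler):
    # Serve a tiny CSV and remember the User-Agent the client sent.
    captured_user_agent = None

    def do_GET(self):
        type(self).captured_user_agent = self.headers.get("User-Agent")
        body = b"a,b\n1,2\n"
        self.send_response(200)
        self.send_header("Content-Type", "text/csv")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


server = http.server.ThreadingHTTPServer(("localhost", 0), HeaderCaptureHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

url = f"http://localhost:{server.server_port}/data.csv"
df = pd.read_csv(url, storage_options={"User-Agent": "pandas-test-agent"})
assert HeaderCaptureHandler.captured_user_agent == "pandas-test-agent"
server.shutdown()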
Diff: pandas/io/parquet.py
@@ -14,7 +14,13 @@
 from pandas import DataFrame, MultiIndex, get_option
 from pandas.core import generic

-from pandas.io.common import IOHandles, get_handle, is_fsspec_url, stringify_path
+from pandas.io.common import (
+    IOHandles,
+    get_handle,
+    is_fsspec_url,
+    is_url,
+    stringify_path,
+)


 def get_engine(engine: str) -> "BaseImpl":
@@ -66,8 +72,10 @@ def _get_path_or_handle(
         fs, path_or_handle = fsspec.core.url_to_fs(
             path_or_handle, **(storage_options or {})
         )
-    elif storage_options:
-        raise ValueError("storage_options passed with buffer or non-fsspec filepath")
+    elif storage_options and (not is_url(path_or_handle) or mode != "rb"):
+        # can't write to a remote url
+        # without making use of fsspec at the moment
+        raise ValueError("storage_options passed with buffer, or non-supported URL")

     handles = None
     if (
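The reworked condition above allows storage_options with a plain http(s) URL only when reading (mode "rb"); writing to such a URL still raises, since that would require fsspec. A sketch of the resulting behavior with the pyarrow engine; the URL is a placeholder:

import pandas as pd

url = "https://example.com/data.parquet"  # placeholder non-fsspec URL
headers = {"User-Agent": "pandas-client/1.0"}

# Reading: is_url(url) is True and mode is "rb", so the guard is skipped and
# the headers are forwarded to get_handle (and on to the HTTP request).
# df = pd.read_parquet(url, storage_options=headers)

# Writing: mode is "wb", so the elif branch raises the reworded error.
try:
    pd.DataFrame({"a": [1, 2]}).to_parquet(url, engine="pyarrow", storage_options=headers)
except ValueError as err:
    print(err)  # storage_options passed with buffer, or non-supported URL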
@@ -79,7 +87,9 @@ def _get_path_or_handle(
         # use get_handle only when we are very certain that it is not a directory
         # fsspec resources can also point to directories
         # this branch is used for example when reading from non-fsspec URLs
-        handles = get_handle(path_or_handle, mode, is_text=False)
+        handles = get_handle(
+            path_or_handle, mode, is_text=False, storage_options=storage_options
+        )
         fs = None
         path_or_handle = handles.handle
     return path_or_handle, handles, fs
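The hunks above only forward storage_options into get_handle; the header handling for plain URLs presumably lives in pandas/io/common.py, which is not shown in this excerpt. Conceptually it amounts to something like the following sketch (illustrative only, not the PR's code):

from typing import Optional
from urllib.request import Request, urlopen


def open_url_with_headers(url: str, storage_options: Optional[dict] = None):
    # For a plain (non-fsspec) URL, treat the storage_options mapping as
    # HTTP request headers and return a file-like response object.
    req = Request(url, headers=dict(storage_options or {}))
    return urlopen(req)


# resp = open_url_with_headers(
#     "https://example.com/data.csv", {"User-Agent": "pandas-client/1.0"}
# )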
@@ -307,7 +317,9 @@ def read(
             # use get_handle only when we are very certain that it is not a directory
             # fsspec resources can also point to directories
             # this branch is used for example when reading from non-fsspec URLs
-            handles = get_handle(path, "rb", is_text=False)
+            handles = get_handle(
+                path, "rb", is_text=False, storage_options=storage_options
+            )
             path = handles.handle
         parquet_file = self.api.ParquetFile(path, **parquet_kwargs)
@@ -404,10 +416,12 @@ def to_parquet(
     return None


+@doc(storage_options=generic._shared_docs["storage_options"])
 def read_parquet(
     path,
     engine: str = "auto",
     columns=None,
+    storage_options: StorageOptions = None,
     use_nullable_dtypes: bool = False,
     **kwargs,
 ):

Review comment on the new storage_options parameter (conversation marked as resolved by jreback): "This new parameter should maybe be added after use_nullable_dtypes."
@@ -432,13 +446,18 @@ def read_parquet(
         By file-like object, we refer to objects with a ``read()`` method,
         such as a file handle (e.g. via builtin ``open`` function)
         or ``StringIO``.
-    engine : {'auto', 'pyarrow', 'fastparquet'}, default 'auto'
+    engine : {{'auto', 'pyarrow', 'fastparquet'}}, default 'auto'
         Parquet library to use. If 'auto', then the option
         ``io.parquet.engine`` is used. The default ``io.parquet.engine``
         behavior is to try 'pyarrow', falling back to 'fastparquet' if
         'pyarrow' is unavailable.
     columns : list, default=None
         If not None, only these columns will be read from the file.
+
+    {storage_options}
+
+        .. versionadded:: 1.3.0
+
     use_nullable_dtypes : bool, default False
         If True, use dtypes that use ``pd.NA`` as missing value indicator
         for the resulting DataFrame (only applicable for ``engine="pyarrow"``).
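The doubled braces in the engine line and the bare {storage_options} placeholder both exist because the @doc decorator treats the docstring as a substitution template, so literal braces have to be escaped. A simplified stand-in for that mechanism (pandas' real pandas.util._decorators.doc is more elaborate, and the fragment text here is illustrative):

_shared_docs = {
    "storage_options": (
        "storage_options : dict, optional\n"
        "    Extra options, e.g. HTTP headers for a plain URL."  # illustrative text
    )
}


def doc(**params):
    # Substitute shared docstring fragments into the decorated function's docstring.
    def decorator(func):
        func.__doc__ = func.__doc__.format(**params)
        return func

    return decorator


@doc(storage_options=_shared_docs["storage_options"])
def read_parquet(path, engine="auto", storage_options=None):
    """Read a parquet file.

    engine : {{'auto', 'pyarrow', 'fastparquet'}}, default 'auto'
    {storage_options}
    """


print(read_parquet.__doc__)  # braces un-doubled, placeholder expanded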
@@ -448,6 +467,7 @@ def read_parquet(
         support dtypes) may change without notice.

         .. versionadded:: 1.2.0

     **kwargs
         Any additional kwargs are passed to the engine.
@@ -456,6 +476,11 @@ def read_parquet(
     DataFrame
     """
     impl = get_engine(engine)
+
     return impl.read(
-        path, columns=columns, use_nullable_dtypes=use_nullable_dtypes, **kwargs
+        path,
+        columns=columns,
+        storage_options=storage_options,
+        use_nullable_dtypes=use_nullable_dtypes,
+        **kwargs,
     )