read_sas fails due to unclear problems in SAS dataset #16615
Can you provide a data set that reproduces the issue? Either something existing online, or you can attach it to the issue.
As always, when I try to re-create the problem, the file reads fine. I'll try to fetch the original file tomorrow and update.
After some further investigation I think the problem could be elsewhere, not in the new-line or carriage-return symbols. Actually, all I needed was to re-create the file with a simple data step (beginning `data new_file;`), after which the new dataset is read properly by read_sas. I'm attaching the problematic file, which gives the error below:

```
Traceback (most recent call last):
```
Dug into this a bit because I was seeing a similar issue. I think it's something to do with unexpected bytes: starting at row 1806 in your file there's a bunch of odd-looking bytes which the parser is choking on somehow. I can't get a fix working, but as far as I can see:

```python
import numpy as np
from pandas.io.sas.sas7bdat import SAS7BDATReader
from pandas.io.sas._sas import Parser

reader = SAS7BDATReader('load_log.sas7bdat', index=None, encoding=None, chunksize=None)
print(reader.row_count)
# 2097

nd = (reader.column_types == b'd').sum()
ns = (reader.column_types == b's').sum()
nrows = reader.row_count
reader._string_chunk = np.empty((ns, nrows), dtype=object)
reader._byte_chunk = np.empty((nd, 8 * nrows), dtype=np.uint8)
reader._current_row_in_chunk_index = 0

p = Parser(reader)
p.read(nrows)
print(reader._current_row_in_chunk_index)
# 1805
print(reader._current_row_in_file_index)
# 1805
```
Iterating through the file row by row shows the same thing:

```python
import pandas as pd

rows = list(pd.read_sas('load_log.sas7bdat', iterator=True))
print(len(rows))
# 2097
print(rows[1804]['libname'])
# 1804    b'TRANS'
print(rows[1805]['libname'])
# 1805    b'\x00\x00\x00\x00\x00\x00\x00\x00'

odd_bytes = rows[1805]['libname'].iloc[0]
print(odd_bytes)
# b'\x00\x00\x00\x00\x00\x00\x00\x00'
print(odd_bytes.decode('latin-1'))
#
print(len(odd_bytes.decode('latin-1')))
# 8
```
Thank you Ian; it seems \x00 is a NUL character. It's interesting that SAS does not mind having NULs in the dataset, yet removes them during a regular dataset rewrite. So the question that's left is why read_sas misbehaves when it encounters a NUL character in the data.
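To look for such values after a (partial) read, embedded NUL bytes can be detected with a small helper. This is only a sketch: the `find_nul_rows` function and the sample frame below are hypothetical, with the frame mimicking the `libname` values printed above.

```python
import pandas as pd

def find_nul_rows(df, column):
    """Return the row positions whose value in `column` contains a NUL.

    Values returned by read_sas are bytes objects when encoding=None,
    so check for b'\x00'; decoded str values are checked for '\x00'.
    """
    def has_nul(v):
        if isinstance(v, bytes):
            return b"\x00" in v
        if isinstance(v, str):
            return "\x00" in v
        return False
    return [i for i, v in enumerate(df[column]) if has_nul(v)]

# Hypothetical data mimicking the output above: row 1 holds eight NUL bytes.
df = pd.DataFrame({"libname": [b"TRANS", b"\x00" * 8, b"WORK"]})
print(find_nul_rows(df, "libname"))  # [1]
```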
That sample SAS dataset has 9 deleted observations, at observation numbers 69, 70, 71, 72, 97, 1218, 1219, 1220 and 1221. Does this Python package understand how to skip the deleted observations?
There are no NUL characters in any of the character variables in that SAS dataset, unless they are in the 9 deleted observations.
…das-dev#16615) SAS can apparently generate data pages that have bit 7 (value 128) set on the page type. It seems that the presence of bit 8 (value 256) is what determines whether a page is a data page. So treat a page as a data page whenever bit 8 is set, and ignore the lower bits.
Any fixes/suggestions for this? I'm running into this error and don't know much about SAS.
I got the same issue when the specific field contains more than 8,000 characters (a text field). Any suggestions for dealing with this issue?
I'm also hitting this:

```python
pd.read_sas("https://www2.census.gov/programs-surveys/supplemental-poverty-measure/datasets/spm/spm_pu_2018.sas7bdat")
```

As a workaround, importing with the sas7bdat package works:

```python
from sas7bdat import SAS7BDAT

# After downloading the file:
df = SAS7BDAT("spm_pu_2018.sas7bdat").to_data_frame()
```
Hi MaxGhenis, hopefully the pandas or sas7bdat packages can fix these issues.
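Another best-effort workaround when only part of a file is unreadable is to read in chunks and drop the chunks that raise. This is a hypothetical sketch (the `salvage_chunks` helper is not part of pandas); rows inside failing chunks are lost, so it only helps when the corruption is localized, and the demo uses a stand-in reader so no sas7bdat file is needed:

```python
import pandas as pd

def salvage_chunks(reader, chunksize=1000, max_skips=100):
    """Collect the chunks a SAS reader can parse, skipping chunks that raise.

    `reader` is anything with a read(nrows) method, such as the object
    returned by pd.read_sas(path, chunksize=..., iterator=True).
    max_skips guards against a reader that fails without advancing.
    """
    frames, skips = [], 0
    while skips <= max_skips:
        try:
            frames.append(reader.read(chunksize))
        except StopIteration:
            break
        except ValueError:
            skips += 1  # skip the chunk the parser chokes on
    return pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()

# Demo with a stand-in reader: the middle chunk raises ValueError.
class FakeReader:
    def __init__(self, chunks):
        self._chunks = iter(chunks)
    def read(self, nrows):
        chunk = next(self._chunks, None)
        if chunk is None:
            raise StopIteration
        if isinstance(chunk, Exception):
            raise chunk
        return chunk

good = pd.DataFrame({"x": [1, 2]})
df = salvage_chunks(FakeReader([good, ValueError("bad chunk"), good]))
print(len(df))  # 4
```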
Using pandas 1.3.5 on Windows 10 through Anaconda. A bunch of datasets raise the same error. For example, for the file fts0003.sas7bdat:

```python
import pandas as pd
df = pd.read_sas("fts0003.sas7bdat")
```

returns

```
Warning: column count mismatch (587 + 1023 != 3337)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_18836/2218533414.py in <module>
----> 2 df = pd.read_sas("fts0003.sas7bdat")

~\.miniconda3\lib\site-packages\pandas\io\sas\sasreader.py in read_sas(filepath_or_buffer, format, index, encoding, chunksize, iterator)
    159
    160 with reader:
--> 161     return reader.read()

~\.miniconda3\lib\site-packages\pandas\io\sas\sas7bdat.py in read(self, nrows)
    754         p.read(nrows)
    755
--> 756         rslt = self._chunk_to_dataframe()
    757         if self.index is not None:
    758             rslt = rslt.set_index(self.index)

~\.miniconda3\lib\site-packages\pandas\io\sas\sas7bdat.py in _chunk_to_dataframe(self)
    798
    799             if self._column_types[j] == b"d":
--> 800                 rslt[name] = self._byte_chunk[jb, :].view(dtype=self.byte_order + "d")
    801                 rslt[name] = np.asarray(rslt[name], dtype=np.float64)
    802                 if self.convert_dates:

~\.miniconda3\lib\site-packages\pandas\core\frame.py in __setitem__(self, key, value)
   3610         else:
   3611             # set column
-> 3612             self._set_item(key, value)
   3613
   3614     def _setitem_slice(self, key: slice, value):

~\.miniconda3\lib\site-packages\pandas\core\frame.py in _set_item(self, key, value)
   3782         ensure homogeneity.
   3783         """
-> 3784         value = self._sanitize_column(value)
   3785
   3786         if (

~\.miniconda3\lib\site-packages\pandas\core\frame.py in _sanitize_column(self, value)
   4507
   4508         if is_list_like(value):
-> 4509             com.require_length_match(value, self.index)
   4510         return sanitize_array(value, self.index, copy=True, allow_2d=True)
   4511

~\.miniconda3\lib\site-packages\pandas\core\common.py in require_length_match(data, index)
    529     """
    530     if len(data) != len(index):
--> 531         raise ValueError(
    532             "Length of values "
    533             f"({len(data)}) "

ValueError: Length of values (10275) does not match length of index (10243)
```

From the column count mismatch, I'd suspect the issue has to do with recognizing missing values, which these data files are rich in. However, I know close to nothing about SAS, so I'm having difficulty providing good insights.
Hi AndreaPasqualini, the method that I used is to limit the column length to less than 8192 in SAS, so you would have to modify the columns in SAS. I still cannot find a good solution on the Python side.
Thank you for the suggestion, but that is something I cannot do: I have no access to SAS and am only a consumer of those data. I have provided the data to help the developers troubleshoot the problem.
Problem description
I was trying to read a SAS dataset with pandas 0.19.2. It was not successful, failing with ValueError: Length of values does not match length of index.
After some research I came to suspect that a new-line symbol in one of the character values causes this error.
I removed new-line and carriage-return symbols from the column values in the SAS data, and read_sas then finished without errors. I assume that read_sas treats any new-line symbol it encounters as the start of a new table row.
Expected Output
read_sas should translate new-line symbols found in column values to spaces and finish without an error.
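The requested transformation can be sketched on an already-decoded frame. This is a hypothetical helper (not pandas API), shown on sample data rather than on the raw sas7bdat bytes where the actual fix would have to live:

```python
import pandas as pd

def strip_linebreaks(df):
    """Replace newline and carriage-return characters with spaces in every
    object (string or bytes) column, mirroring the manual SAS-side cleanup."""
    def clean(v):
        if isinstance(v, bytes):
            return v.replace(b"\n", b" ").replace(b"\r", b" ")
        if isinstance(v, str):
            return v.replace("\n", " ").replace("\r", " ")
        return v
    out = df.copy()
    for col in out.columns:
        if out[col].dtype == object:
            out[col] = out[col].map(clean)
    return out

df = pd.DataFrame({"note": ["line1\nline2", b"ok\r\n"]})
print(strip_linebreaks(df)["note"].tolist())  # ['line1 line2', b'ok  ']
```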