New line stripping is broken in binary reading mode #173
This was referenced Jan 20, 2022
ejguan pushed a commit to ejguan/data that referenced this issue on Jan 21, 2022
Summary: Fixes pytorch#173

Note that the [input to `strip`](https://docs.python.org/3/library/stdtypes.html#str.strip)

> is a string specifying the **set of characters** to be removed.

[Emphasis mine]

Thus, stripping works something like

```python
for char in chars:
    string = string.replace(char, "")
```

rather than

```python
string = string.replace(chars, "")
```

This means that always stripping `"\r\n"` is harmless even if the line terminator is only `"\n"` or `"\r"`.

Pull Request resolved: pytorch#174
Reviewed By: ejguan
Differential Revision: D33684458
Pulled By: NivekT
fbshipit-source-id: 9821b77d60d3afe038ae698965beefe319783aa1
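To make the "set of characters" point concrete, here is a minimal sketch. The helper name `strip_terminators` and the sample strings are made up for illustration and are not taken from torchdata. It also shows a detail the `replace` analogy glosses over: `strip` only touches the two ends of the string, so terminators in the middle are left alone.

```python
# Hypothetical helper for illustration; not taken from torchdata.
def strip_terminators(line: str) -> str:
    # The argument is treated as a set of characters: any run of CR or LF
    # characters at either end of the line is removed, in any order.
    return line.strip("\r\n")


results = {
    "lf": strip_terminators("abc\n"),
    "crlf": strip_terminators("abc\r\n"),
    "cr": strip_terminators("abc\r"),
    # Only the ends are stripped; the interior newline survives.
    "interior": strip_terminators("a\nb\r\n"),
}

# By contrast, replace() looks for the exact substring "\r\n",
# so a line ending in a bare "\n" is left unchanged.
replaced = "abc\n".replace("\r\n", "")

print(results)
print(repr(replaced))
```

All three terminator styles collapse to the same stripped value, which is why passing `"\r\n"` unconditionally is safe.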
ejguan added a commit that referenced this issue on Jan 21, 2022
Summary: Fixes #173

Reviewed By: ejguan
Differential Revision: D33684458
Pulled By: NivekT
fbshipit-source-id: 9821b77d60d3afe038ae698965beefe319783aa1
ghstack-source-id: 37a119b
Pull Request resolved: #176
Let's say we have this file:
If we open it in text reading mode (the default), we get the following output for the lines:
By default, Python recognizes the different line terminators and maps them to `\n`. Thus, our current line stripping is sufficient:

data/torchdata/datapipes/iter/util/plain_text_reader.py
Lines 46 to 49 in c06066a
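The text-mode behaviour can be reproduced with a small sketch. The sample content and the helper `read_text_lines` below are assumptions made for illustration (the example file from the issue is not reproduced here); the sketch only relies on the built-in `open`, whose default `newline=None` enables universal newlines.

```python
import os
import tempfile

# Hypothetical sample content using three different line terminators.
SAMPLE = b"line1\nline2\r\nline3\r"


def read_text_lines(data: bytes) -> list:
    """Write data to a temporary file and read it back in text mode."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "sample.txt")
        with open(path, "wb") as f:
            f.write(data)
        # Default newline=None: "\n", "\r\n" and "\r" all end a line
        # and are translated to "\n" before being returned.
        with open(path, "r", encoding="utf-8") as f:
            return list(f)


text_lines = read_text_lines(SAMPLE)
stripped = [line.strip("\n") for line in text_lines]

print(text_lines)
print(stripped)
```

Every line comes back ending in `"\n"`, so stripping just that one character leaves clean lines.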
However, if we open it in binary reading mode, this is a different story:
Python does not perform the newline mapping here, so any `\r` characters are left in place. Thus, in this mode our stripping is not sufficient.
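A minimal sketch of the binary-mode case, again using made-up sample content and a hypothetical helper `read_binary_lines` rather than the file from the issue. It compares stripping only `b"\n"` with stripping `b"\r\n"`; the second variant follows the reasoning in the fix's commit message, though the exact change made in torchdata may differ.

```python
import os
import tempfile

# Hypothetical sample content using three different line terminators.
SAMPLE = b"line1\nline2\r\nline3\r"


def read_binary_lines(data: bytes) -> list:
    """Write data to a temporary file and read it back in binary mode."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "sample.txt")
        with open(path, "wb") as f:
            f.write(data)
        # Binary mode: lines are split on b"\n" only, with no translation.
        with open(path, "rb") as f:
            return list(f)


binary_lines = read_binary_lines(SAMPLE)

# Stripping only b"\n" leaves the carriage returns behind.
only_lf = [line.strip(b"\n") for line in binary_lines]

# Stripping the set {CR, LF} handles all three terminator styles.
cr_and_lf = [line.strip(b"\r\n") for line in binary_lines]

print(binary_lines)
print(only_lf)
print(cr_and_lf)
```

Note that a binary file terminated only by `\r` would come back as a single line, since binary iteration splits on `b"\n"` alone; stripping cannot repair that.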