Skip to content

New line stripping is broken in binary reading mode #173

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
pmeier opened this issue Jan 20, 2022 · 0 comments
Closed

New line stripping is broken in binary reading mode #173

pmeier opened this issue Jan 20, 2022 · 0 comments

Comments

@pmeier
Copy link
Contributor

pmeier commented Jan 20, 2022

Let's say we have this file:

path = "example.txt"

with open(path, "w", newline="") as file:
    file.write("foo\n")
    file.write("bar\r\n")
    file.write("baz\r")

If we open in text reading mode (default), we get the following output for the lines:

with open(path) as file:
    for line in file:
        print(repr(line))
'foo\n'
'bar\n'
'baz\n'

By default, Python recognizes the different line terminators and maps them to \n. Thus, our current line stripping is sufficient:

if isinstance(line, str):
yield line.strip("\n")
else:
yield line.strip(b"\n")

dp = IterableWrapper([path])
dp = FileOpener(dp)
dp = LineReader(dp, decode=True, strip_newline=True, return_path=False)

for item in dp:
    print(repr(item))
'foo'
'bar'
'baz'

However, if we open it in binary reading mode, this is a different story:

with open(path, "rb") as file:
    for line in file:
        print(repr(line.decode()))
'foo\n'
'bar\r\n'
'baz\r'

Python does not perform the newline mapping here. Thus, in this mode our stripping is not sufficient

dp = IterableWrapper([path])
dp = FileOpener(dp, mode="rb")
dp = LineReader(dp, decode=True, strip_newline=True, return_path=False)

for item in dp:
    print(repr(item))
'foo'
'bar\r'
'baz\r'
ejguan pushed a commit to ejguan/data that referenced this issue Jan 21, 2022
Summary:
Fixes pytorch#173

Note that the [input to `strip`](https://docs.python.org/3/library/stdtypes.html#str.strip)

> is a string specifying the **set of characters** to be removed. [Emphasis mine]

Thus, stripping works something like

```python
for char in chars:
    string.replace(char, "")
```

rather than

```python
string.replace(chars, "")
```

This means that always stripping `"\r\n"` is harmless even if the line terminator is only `"\n"` or `\"r"`.

Pull Request resolved: pytorch#174

Reviewed By: ejguan

Differential Revision: D33684458

Pulled By: NivekT

fbshipit-source-id: 9821b77d60d3afe038ae698965beefe319783aa1
ejguan added a commit that referenced this issue Jan 21, 2022
Summary:
Fixes #173

Note that the [input to `strip`](https://docs.python.org/3/library/stdtypes.html#str.strip)

> is a string specifying the **set of characters** to be removed. [Emphasis mine]

Thus, stripping works something like

```python
for char in chars:
    string.replace(char, "")
```

rather than

```python
string.replace(chars, "")
```

This means that always stripping `"\r\n"` is harmless even if the line terminator is only `"\n"` or `\"r"`.

Reviewed By: ejguan

Differential Revision: D33684458

Pulled By: NivekT

fbshipit-source-id: 9821b77d60d3afe038ae698965beefe319783aa1

[ghstack-poisoned]
ejguan added a commit that referenced this issue Jan 21, 2022
Summary:
Fixes #173

Note that the [input to `strip`](https://docs.python.org/3/library/stdtypes.html#str.strip)

> is a string specifying the **set of characters** to be removed. [Emphasis mine]

Thus, stripping works something like

```python
for char in chars:
    string.replace(char, "")
```

rather than

```python
string.replace(chars, "")
```

This means that always stripping `"\r\n"` is harmless even if the line terminator is only `"\n"` or `\"r"`.

Reviewed By: ejguan

Differential Revision: D33684458

Pulled By: NivekT

fbshipit-source-id: 9821b77d60d3afe038ae698965beefe319783aa1

ghstack-source-id: 37a119b
Pull Request resolved: #176
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant