std.os.windows: Support UTF-8 ↔ UTF-16 conversion for Windows native console I/O #16618
Conversation
Thanks for working on this! At some point I'd like to benchmark reading/writing to console against master to see what the performance cost of this approach is. Prior attempt at the same thing: #12400
Update: I also tried UTF-16 to UTF-8 conversion for raw mode, but I encountered ferocious edge cases, especially when I read the console input partially (leaving UTF-16 code units in the console input text buffer) and then switch the console input mode on/off. It should be perfectly fine as long as people don't mess with the console input mode. I am optimistic enough to assume that it is effectively done.
Is the hidden allocation here fine? Wouldn't this be better solved with a stack-allocated buffer and multiple writes?
I did that just to keep the logic simple.
Done. The temporary buffer is now 1024 bytes long, and no heap allocation is involved.
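For readers following along, a rough sketch of what such a chunked conversion can look like is below: UTF-8 input is converted through a fixed-size stack buffer and handed to WriteConsoleW in pieces. The function name is hypothetical, the WriteConsoleW binding is assumed to follow the documented Win32 signature, the std.unicode helper is named as it was around the time of this PR, and this is not the PR's actual code.

```zig
const std = @import("std");
const windows = std.os.windows;

/// Sketch: write UTF-8 `bytes` to a console handle by converting chunks
/// through a fixed-size stack buffer and passing them to WriteConsoleW.
fn writeConsoleUtf8(handle: windows.HANDLE, bytes: []const u8) !void {
    var utf16_buf: [1024]u16 = undefined;
    var index: usize = 0;
    while (index < bytes.len) {
        // A 4-byte UTF-8 code point becomes at most 2 UTF-16 code units,
        // so a chunk of utf16_buf.len / 2 input bytes always fits.
        var end = @min(bytes.len, index + utf16_buf.len / 2);
        // Do not split a UTF-8 sequence: back up to a code point boundary.
        while (end < bytes.len and (bytes[end] & 0b1100_0000) == 0b1000_0000) end -= 1;
        const n = try std.unicode.utf8ToUtf16Le(&utf16_buf, bytes[index..end]);
        var written: windows.DWORD = undefined;
        if (windows.kernel32.WriteConsoleW(
            handle,
            &utf16_buf,
            @as(windows.DWORD, @intCast(n)),
            &written,
            null,
        ) == 0) return error.WriteConsoleFailed;
        index = end;
    }
}
```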
@squeek502 Just rebased onto master.
Will try writing some benchmarks / test cases when I get a chance. In the meantime, it might be helpful if you could take a look at #12400 and compare your implementation with its (relatively simple) implementation. What are the relevant differences? Are there things that your implementation handles that wouldn't be handled by the one in #12400?
As far as I can tell after a quick glance at #12400:
This is not good. Let's take a step back and understand what is going on before adding such hacks to the codebase.
@@ -523,19 +527,186 @@ pub fn ReadFile(in_hFile: HANDLE, buffer: []u8, offset: ?u64, io_mode: std.io.Mo
};
break :blk &overlapped_data;
} else null;
if (kernel32.ReadFile(in_hFile, buffer.ptr, want_read_count, &amt_read, overlapped) == 0) {
var console_mode: DWORD = undefined;
const is_console_handle: bool = kernel32.GetConsoleMode(in_hFile, &console_mode) != FALSE;
Really? You want to call GetConsoleMode with every call to ReadFile?
Some context:

- In std: Windows Console IO Fix #12400 it used isTty, which uses os.isatty, which calls GetConsoleMode (so it'd call it on every read/write).
- Rust uses an is_console function which is a wrapper around GetConsoleMode (so it calls it on every read/write AFAICT).

However, it seems Go stores the kind as "console" in newFile to skip the GetConsoleMode on read/write:

- https://github.com/golang/go/blob/6e8caefc19cae465444775f6cd107b138a26cce7/src/os/file_windows.go#L49-L51
- https://github.com/golang/go/blob/6e8caefc19cae465444775f6cd107b138a26cce7/src/internal/poll/fd_windows.go#L419-L420

But it still seems to have a GetConsoleMode call in a (lower-level?) API? I'm not familiar with Go.
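For reference, the per-call check that #12400, Rust, and this PR all reduce to is tiny. A minimal sketch using the existing GetConsoleMode binding in std.os.windows.kernel32 (the helper name is made up for illustration):

```zig
const windows = @import("std").os.windows;

/// GetConsoleMode succeeds only for console handles, so its return value
/// doubles as an "is this a console?" test (the same idea as Rust's
/// is_console and as os.isatty on Windows).
fn isConsoleHandle(handle: windows.HANDLE) bool {
    var mode: windows.DWORD = undefined;
    return windows.kernel32.GetConsoleMode(handle, &mode) != 0;
}
```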
> Really? You want to call GetConsoleMode with every call to ReadFile?
No, I don't.
However, std.fs.File is merely a thin wrapper around the underlying OS file handle. If we want to remember whether the OS handle is a console handle at std.fs.File creation, we will need to add OS-specific fields to the std.fs.File data structure, or add additional records in std.os.windows along with a lookup scheme keyed by the std.fs.File handle (either a pointer or a unique ID), just like how C runtime libraries deal with POSIX file descriptors internally.

For minimally intrusive changes to platform-agnostic code (i.e. std.fs.File), I chose calling GetConsoleMode with every ReadFile call. I would like to hear your thoughts on design choices before making any change.
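To illustrate the alternative being weighed here, a hypothetical wrapper that caches the console check once at handle-acquisition time might look like the sketch below. This is not what std.fs.File does today; the type and field names are made up.

```zig
const windows = @import("std").os.windows;

/// Determine console-ness once, when the handle is obtained, and remember
/// it next to the handle instead of re-querying on every read/write.
const CachedConsoleHandle = struct {
    handle: windows.HANDLE,
    is_console: bool,

    fn init(handle: windows.HANDLE) CachedConsoleHandle {
        var mode: windows.DWORD = undefined;
        return .{
            .handle = handle,
            .is_console = windows.kernel32.GetConsoleMode(handle, &mode) != 0,
        };
    }
};
```

The trade-off is exactly the one described above: the cached flag avoids a syscall per read/write, but it has to live somewhere, either as an OS-specific field in the platform-agnostic std.fs.File or in a side table keyed by handle.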
// There is no reliable way to implement perfectly platform-agnostic UTF-16 to UTF-8
// conversion for raw mode, because it is impossible to know the number of pending
// code units stored in the console input buffer, while in cooked mode we can rely on
// the terminating LF character. Without knowing that, ReadConsoleW() may accidentally
// pop out characters without blocking, or prompt for user input at unexpected times.
// In the case of raw mode, redirect to kernel32.ReadFile() without conversion for now;
// just don't make things worse.
This is not good enough. If it cannot be implemented perfectly using the current abstractions, then the current abstractions need to change.
Let me elaborate on this:
The current goal of my implementation is to make users totally unaware of the presence of UTF-16, even if they call std.os.windows.ReadFile directly. Users are able to pass UTF-8-encoded strings everywhere, and std.os.windows.ReadFile does the translation job under the hood. That means I need to insert a UTF-16 → UTF-8 translation layer between std.os.windows.ReadFile and kernel32.ReadConsoleW.
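As a concrete (and deliberately simplified) illustration of that layer's read direction, the sketch below reads UTF-16 code units with ReadConsoleW and converts them to UTF-8 for the caller. The function name is hypothetical, the ReadConsoleW binding is assumed to follow the documented Win32 signature, the std.unicode helper is named as it was around the time of this PR, and the partial-read state discussed next is ignored entirely.

```zig
const std = @import("std");
const windows = std.os.windows;

/// Sketch: read UTF-16 from a console handle and hand UTF-8 to the caller.
fn readConsoleUtf8(handle: windows.HANDLE, out: []u8) !usize {
    var utf16_buf: [256]u16 = undefined;
    // Read at most as many UTF-16 code units as are guaranteed to fit into
    // `out` after conversion (one code unit expands to at most 3 UTF-8 bytes).
    const max_units = @min(utf16_buf.len, out.len / 3);
    if (max_units == 0) return error.BufferTooSmall;
    var units_read: windows.DWORD = undefined;
    if (windows.kernel32.ReadConsoleW(
        handle,
        &utf16_buf,
        @as(windows.DWORD, @intCast(max_units)),
        &units_read,
        null,
    ) == 0) return error.ReadConsoleFailed;
    return std.unicode.utf16leToUtf8(out, utf16_buf[0..units_read]);
}
```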
Here comes the key problem that gives rise to annoying edge cases: the user-provided buffer is meant to be UTF-8 encoded. Console input (which is UTF-16 encoded) must be converted to UTF-8 before being copied into the user-provided buffer, while the length of the converted UTF-8 can't be easily determined up front. When the user-provided buffer is not large enough to hold the converted console input, there are three pieces of data we have to deal with carefully (a sketch of this state follows the list):

- UTF-8 fragments. For example, a Unicode character from console input converts into a 3-byte sequence, but the user-provided buffer is only 2 bytes long. Then ReadFile should store the first 2 bytes into the user-provided buffer and stash the final byte. Whenever ReadFile is called again, this byte should be popped first before reading anything else from the console.
- A single UTF-16 high surrogate that does not yet form a complete Unicode code point. We should stash it separately and wait for a low surrogate from another ReadConsoleW call before converting it into a UTF-8 byte sequence.
- UTF-16 code units left in ReadConsoleW's dedicated buffer. These are completely opaque to user-mode code, causing serious trouble, as discussed below.
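A hypothetical shape for the per-handle state implied by this list (field names are illustrative, not the ones used in this PR):

```zig
/// State a UTF-16 → UTF-8 read translation layer has to carry between calls.
const ConsoleReadState = struct {
    // (1) Converted UTF-8 bytes that did not fit into the caller's buffer
    // on a previous call; a single code point is at most 4 bytes.
    utf8_fragment: [4]u8 = undefined,
    utf8_fragment_len: u3 = 0,
    // (2) A UTF-16 high surrogate still waiting for its low surrogate
    // from a later ReadConsoleW call.
    pending_high_surrogate: ?u16 = null,
    // (3) intentionally has no field: the UTF-16 code units left in
    // ReadConsoleW's dedicated buffer live inside the console host and
    // cannot be observed or stored by user-mode code at all.
};
```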
There are two modes affecting ReadConsoleW behaviour, determined by the ENABLE_LINE_INPUT flag (accessed with the GetConsoleMode/SetConsoleMode functions); a small mode check is sketched after this list:

- Cooked mode (ENABLE_LINE_INPUT set): if a CR is encountered in the dedicated buffer, an LF is appended to the user-provided buffer and reading from the console ends immediately. Otherwise, wait for console input until a CR is entered (either a single key press or pasting multiple characters). Everything after the CR is left inside the dedicated buffer after the read.
- Raw mode (ENABLE_LINE_INPUT unset): if there is at least one character available in the dedicated buffer, reading from the console ends immediately. Otherwise, wait for console input (either a single key press or pasting multiple characters). The dedicated buffer is flushed after the read.
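Checking which of the two modes a console input handle is in only takes a GetConsoleMode call. A minimal sketch (the helper name and the locally defined constant are for illustration; ENABLE_LINE_INPUT is 0x0002 per the Win32 documentation):

```zig
const windows = @import("std").os.windows;

const ENABLE_LINE_INPUT: windows.DWORD = 0x0002;

/// Distinguish cooked mode from raw mode for a console input handle.
fn isCookedMode(handle: windows.HANDLE) bool {
    var mode: windows.DWORD = undefined;
    if (windows.kernel32.GetConsoleMode(handle, &mode) == 0) return false; // not a console handle
    return (mode & ENABLE_LINE_INPUT) != 0;
}
```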
Cooked mode can be emulated perfectly, as in my implementation. However, to emulate the raw-mode behaviour, we have to try flushing the UTF-8 fragments and then call ReadConsoleW to flush the dedicated buffer. The problem arises when we have popped at least one byte from the UTF-8 fragments: because we have already "read" something (from the UTF-8 fragments), std.os.windows.ReadFile should return immediately if the ReadConsoleW dedicated buffer is empty, but calling ReadConsoleW when the dedicated buffer is empty causes an unwanted wait for console input. As I explained before, there is no way to know whether the dedicated buffer is empty other than calling ReadConsoleW, so it is a stalemate.
If there are better abstractions, please let me know.
Relevant commit message from Go: golang/go@610d522
The readConsole implementation has changed a bit since then; the latest is here: https://github.com/golang/go/blob/6e8caefc19cae465444775f6cd107b138a26cce7/src/internal/poll/fd_windows.go#L454-L521

There does not seem to be any special handling of ENABLE_LINE_INPUT in Rust or Go at all.
For what it's worth, here are my initial reactions to these points:
This is a neutral point IMO, not a pro nor a con.
This makes sense to me and seems like a good change. #12400 returned
This needs some justification. Why should Zig care about LF ↔ CRLF conversion in console input/output? It doesn't seem like Rust/Go do this AFAICT.
👍
For console output, I agree that converting LF to CRLF will not bring special benefits, as the Windows Console Host should recognize a single LF. This conversion makes more sense when writing to a text file. I implemented it this way just because C
But for console input, if we convert only the text encoding after
Matching libc behavior is not a goal here. In my opinion the goals should be:
All right, I will remove LF → CRLF conversion for console output. But I still believe it is a good idea to keep CRLF → LF conversion for console input:
What do you think of this?
It seems like handling this via conversion within
All right, I will remove LF ↔ CRLF conversion completely and leave the job to users, considering that there are no abstractions of "text mode" and "binary mode" for Zig's
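For readers who end up doing this themselves, a user-side CRLF → LF normalization of already-read console input can be as small as the sketch below (illustrative only; a CR at the very end of the buffer, whose LF may arrive with the next read, is deliberately left alone):

```zig
/// Rewrite `buf` in place, collapsing CRLF pairs to LF, and return the new length.
fn crlfToLf(buf: []u8) usize {
    var out: usize = 0;
    var i: usize = 0;
    while (i < buf.len) : (i += 1) {
        // Drop a CR only when it is immediately followed by an LF.
        if (buf[i] == '\r' and i + 1 < buf.len and buf[i + 1] == '\n') continue;
        buf[out] = buf[i];
        out += 1;
    }
    return out;
}
```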
I'm sorry, I didn't review this in time, and now it has bitrotted. Furthermore, so many pull requests have stacked up that I can't keep up and I am therefore declaring Pull Request Bankruptcy and closing old PRs that now have conflicts with master branch. If you want to reroll, you are by all means welcome to revisit this changeset with respect to the current state of master branch, and there's a decent chance your patch will be reviewed the second time around. Either way, I'm closing this now, otherwise the PR queue will continue to grow indefinitely.
All right. I have been busy recently; hopefully I will come back to this PR and rebase all the changes in a month or two.
These commits aim to eliminate text encoding problems regarding Windows native console I/O.
To-do list:
- WriteConsoleW() (resolves Zig runtime should set console output mode to utf-8 by default on Windows #7600, closes std.start: initialize Windows console output CP to UTF-8 on exe startup #14411)
- ReadConsoleW() (resolves Cannot read unicode data from STDIN on Windows #5148)