Description
Using go1.5
I discovered this while fixing other archive/tar issues (I found a fair number of them, mostly minor). However, fixing this will change the way archive/tar reads and writes certain formats.
What the current archive/tar thinks the GNU format is:
- A magic and version that form the string "ustar\x20\x20\x00" (this is correct).
- That the structure is identical to the POSIX format. That is, there is a 155-byte prefix section (this is incorrect).
- That it extends the POSIX format by adding the ability to perform base-256 encoding (this is not necessarily specific to GNU format).
What the GNU manual actually says the format is:
The GNU manual says that the format for headers using this magic is the following (in Go syntax):
type headerGNU struct {
// Original V7 header
name [100]byte // 0
mode [8]byte // 100
uid [8]byte // 108
gid [8]byte // 116
size [12]byte // 124
mtime [12]byte // 136
chksum [8]byte // 148
typeflag [1]byte // 156
linkname [100]byte // 157
// This section is based on the Posix standard.
magic [6]byte // 257: "ustar "
version [2]byte // 263: " \x00"
uname [32]byte // 265
gname [32]byte // 297
devmajor [8]byte // 329
devminor [8]byte // 337
// The GNU format replaces the prefix field with this stuff.
// The fact that GNU replaces the prefix with this makes it non-compliant.
atime [12]byte // 345
ctime [12]byte // 357
offset [12]byte // 369
longnames [4]byte // 381
unused [1]byte // 385
sparse [4]headerSparse // 386
isextended [1]byte // 482
realsize [12]byte // 483
// 495
}
type headerSparse struct {
offset [12]byte // 0
numbytes [12]byte // 12
// 24
}
In fact, the GNU structure swaps out the prefix section of POSIX for a set of extra fields covering atime, ctime, and sparse file support (contrary to what Go thinks).
Regarding the use of base-256 encoding, it seems that GNU was the first to introduce this encoding back in 1999. Since then, pretty much every tar decoder handles reading base-256 encoding regardless of whether the archive is in GNU format or not. So the use of base-256 encoding alone may not be reason enough to mark the format as GNU.
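To make the base-256 discussion concrete, here is a minimal sketch of decoding a numeric field that may be either octal ASCII or base-256. The function name `parseNumeric` is made up for illustration and is not the archive/tar API; the sketch only handles non-negative values and rejects anything else.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseNumeric is a hypothetical helper, not part of archive/tar.
// It decodes a tar numeric field stored as octal ASCII or as base-256.
// Simplification: negative base-256 values are rejected.
func parseNumeric(b []byte) (int64, error) {
	if len(b) > 0 && b[0]&0x80 != 0 {
		// Base-256: the top bit of the first byte is the marker and the
		// next bit acts as the sign; the rest is a big-endian integer.
		if b[0]&0x40 != 0 {
			return 0, fmt.Errorf("negative base-256 values not handled in this sketch")
		}
		var x uint64
		for i, c := range b {
			if i == 0 {
				c &= 0x7f // strip the marker bit
			}
			if x>>55 != 0 {
				return 0, fmt.Errorf("base-256 value overflows int64")
			}
			x = x<<8 | uint64(c)
		}
		return int64(x), nil
	}
	// Octal ASCII, padded with spaces and/or NUL bytes.
	s := strings.Trim(string(b), " \x00")
	if s == "" {
		return 0, nil
	}
	return strconv.ParseInt(s, 8, 64)
}

func main() {
	octal := []byte("00000001750\x00")                         // 1000 in octal
	binary := []byte{0x80, 0, 0, 0, 0, 0, 0, 0x02, 0, 0, 0, 0} // 1<<33, i.e. 8 GiB
	for _, field := range [][]byte{octal, binary} {
		n, err := parseNumeric(field)
		fmt.Println(n, err)
	}
}
```

The second field holds 8 GiB, the smallest size that no longer fits in the 11 octal digits of a 12-byte size field. Nothing in the decoding depends on which magic the header carries.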
Problem 1:
When reading, if the decoder detects the GNU magic number, it will attempt to read 155 bytes for the prefix. This is just plain wrong: it will read the atime, ctime, etc. instead. This causes the prefix to be incorrect.
See this playground example
The paths there have something like "12574544345" prepended to them. This is because when the tar reader tries to read the prefix, it is actually reading the atime (which is in ASCII octal and is null terminated). Thus, it incorrectly uses the atime as the prefix.
This probably went undetected for so long since the "incremental" mode of GNU tar is rarely used, and thus the atime and ctime fields are never filled out and left as null bytes. This happens to work in the common case, since the cstring for this field ends up being an empty string.
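The misreading can be reproduced without a real archive. The following is a minimal sketch that builds a single 512-byte header block by hand; `cString` and `buggyName` are hypothetical helpers written for this illustration, and `buggyName` only mimics the behavior described above rather than quoting the library's code.

```go
package main

import (
	"bytes"
	"fmt"
)

// cString returns the bytes up to the first NUL as a string.
func cString(b []byte) string {
	if i := bytes.IndexByte(b, 0); i >= 0 {
		return string(b[:i])
	}
	return string(b)
}

// buggyName (hypothetical) treats bytes 345..500 as a USTAR prefix even
// though the block carries the GNU magic, as described in Problem 1.
func buggyName(hdr []byte) string {
	name := cString(hdr[0:100])
	if prefix := cString(hdr[345:500]); prefix != "" {
		return prefix + "/" + name
	}
	return name
}

func main() {
	hdr := make([]byte, 512)
	copy(hdr[0:], "dir/file.txt")
	copy(hdr[257:], "ustar  \x00")     // GNU magic and version
	copy(hdr[345:], "12574544345\x00") // GNU atime, octal ASCII

	fmt.Println(buggyName(hdr))      // 12574544345/dir/file.txt
	fmt.Println(cString(hdr[0:100])) // dir/file.txt
}
```

With the atime bytes left as NULs, `buggyName` returns the plain name, which matches the observation that the common case happens to work.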
Problem 2:
When writing, if a numeric field was ever too large to represent in octal format, it would trigger the usedBinary flag and cause the library to output the GNU magic numbers, but subsequently fail to encode in the GNU format. Since it believes that the GNU format has a prefix field, it erroneously tries to use it, losing some information in the process.
This is ultimately what causes #9683, but is rare in practice since the perfect conditions need to be met for GNU format to be used. There is a very narrow range between the use cases of USTAR and PAX where the logic will use GNU.
Solution:
When decoding, change the reader so that it doesn't read the 155-byte prefix field when the GNU magic is present (since this is just plain wrong). Optionally, support parsing of the atime and ctime from the GNU format. Nothing needs to change for sparse file support since that logic correctly understood the GNU format.
When encoding, I propose the following order of precedence:
- First, use the 1988 POSIX (USTAR) standard when possible for maximum backwards compatibility.
- If any numeric field goes beyond the octal representation, or any names are longer than what is supported, just use the 2001 POSIX (PAX) standard.
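The precedence above can be sketched as follows. Everything here is an assumption made for illustration: `entry` is a trimmed-down stand-in for tar.Header, and `fitsOctal`, `fitsUSTARName`, and `chooseFormat` are hypothetical names. The name check is simplified (it ignores character-set concerns and only looks at field lengths).

```go
package main

import "fmt"

type format int

const (
	formatUSTAR format = iota
	formatPAX
)

func (f format) String() string {
	if f == formatUSTAR {
		return "USTAR"
	}
	return "PAX"
}

// entry is a hypothetical, trimmed-down stand-in for tar.Header.
type entry struct {
	Name     string
	Size     int64
	UID, GID int
}

// fitsOctal reports whether x fits in an octal field of n bytes,
// leaving one byte for the terminator.
func fitsOctal(n int, x int64) bool {
	digits := uint(n - 1)
	return x >= 0 && x < int64(1)<<(3*digits)
}

// fitsUSTARName reports whether name fits in the 100-byte name field,
// or can be split at a '/' into a prefix (<= 155) and a name (<= 100).
func fitsUSTARName(name string) bool {
	if len(name) <= 100 {
		return true
	}
	for i := 1; i < len(name)-1 && i <= 155; i++ {
		if name[i] == '/' && len(name)-i-1 <= 100 {
			return true
		}
	}
	return false
}

// chooseFormat prefers USTAR and falls back to PAX; GNU is never chosen.
func chooseFormat(e entry) format {
	if fitsUSTARName(e.Name) &&
		fitsOctal(12, e.Size) &&
		fitsOctal(8, int64(e.UID)) &&
		fitsOctal(8, int64(e.GID)) {
		return formatUSTAR
	}
	return formatPAX
}

func main() {
	entries := []entry{
		{Name: "small.txt", Size: 1000},
		{Name: "huge.img", Size: 1 << 33},   // one past the 11-digit octal limit
		{Name: "big-uid.txt", UID: 1 << 21}, // one past the 7-digit octal limit
	}
	for _, e := range entries {
		fmt.Println(e.Name, chooseFormat(e))
	}
}
```

A real writer would also have to check the remaining numeric and string fields (mtime, linkname, uname, gname, and so on); the sketch only covers enough fields to show the decision.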
Let's avoid writing the GNU format. In fact, the GNU manual itself says the following under the POSIX section:
This archive format will be the default format for future versions of GNU tar.
The only advantages that GNU offers over USTAR are:
- Unlimited length filenames (only ASCII)
- Relatively large filesizes
- Possibly atime and ctime
However, PAX offers all of these over USTAR and far more:
- Support for unlimited-length strings (including UTF-8) for filenames, usernames, etc.
- Unlimited large integers for filesizes, uids, etc.
- Sub-second resolution times.
- No need for base-256 encoding (or for assuming that decoders can handle it), since PAX has its own well-defined method of encoding arbitrarily large integers.
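For reference, PAX stores such values as text records of the form "LEN KEY=VALUE\n", where LEN is the decimal length of the whole record, including the digits of LEN itself. Below is a minimal sketch of building one record; `formatPAXRecord` is a name chosen for this example, not a claim about the library's API.

```go
package main

import (
	"fmt"
	"strconv"
)

// formatPAXRecord (hypothetical helper) builds one PAX extended header
// record: "LEN KEY=VALUE\n", where LEN counts the entire record.
func formatPAXRecord(key, value string) string {
	const padding = 3 // the space, the '=', and the trailing newline
	size := len(key) + len(value) + padding
	size += len(strconv.Itoa(size))
	record := strconv.Itoa(size) + " " + key + "=" + value + "\n"

	// Adding the length digits can push the total across a power of ten
	// (for example from 9 to 11 bytes), so recompute once if needed.
	if len(record) != size {
		size = len(record)
		record = strconv.Itoa(size) + " " + key + "=" + value + "\n"
	}
	return record
}

func main() {
	fmt.Printf("%q\n", formatPAXRecord("size", "8589934592"))
	fmt.Printf("%q\n", formatPAXRecord("mtime", "1350244992.023960108"))
}
```

The first record carries a size that does not fit in an 11-digit octal field, and the second carries a sub-second timestamp; neither needs any binary encoding.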
Not to mention, we are already outputting PAX in many situations. What's the point of straddling three different output formats?
Thoughts?