
archive/tar: package understanding of GNU format is wrong #12594

Closed
@dsnet


Using go1.5

I discovered this while fixing other archive/tar issues (I found a fair number of them, mostly minor). However, fixing this one will change the way archive/tar reads and writes certain formats.

What the current archive/tar thinks the GNU format is:

What the GNU manual actually says the format is:

The GNU manual says that the format for headers using this magic is the following (in Go syntax):

type headerGNU struct {
    // Original V7 header
    name     [100]byte //   0
    mode     [8]byte   // 100
    uid      [8]byte   // 108
    gid      [8]byte   // 116
    size     [12]byte  // 124
    mtime    [12]byte  // 136
    chksum   [8]byte   // 148
    typeflag [1]byte   // 156
    linkname [100]byte // 157

    // This section is based on the Posix standard.
    magic      [6]byte         // 257: "ustar "
    version    [2]byte         // 263: " \x00"
    uname      [32]byte        // 265
    gname      [32]byte        // 297
    devmajor   [8]byte         // 329
    devminor   [8]byte         // 337

    // The GNU format replaces the prefix field with this stuff.
    // The fact that GNU replaces the prefix with this makes it non-compliant.
    atime      [12]byte        // 345
    ctime      [12]byte        // 357
    offset     [12]byte        // 369
    longnames  [4]byte         // 381
    unused     [1]byte         // 385
    sparse     [4]headerSparse // 386
    isextended [1]byte         // 482
    realsize   [12]byte        // 483
                               // 495
}

type headerSparse struct {
    offset   [12]byte //  0
    numbytes [12]byte // 12
                      // 24
}

In fact, the GNU structure swaps out the POSIX prefix field for a set of extra fields covering atime, ctime, and sparse file support (contrary to what Go thinks).

Regarding base-256 encoding, it seems that GNU was the first to introduce it, back in 1999. Since then, pretty much every tar decoder handles reading base-256 encoding regardless of whether the archive is in GNU format or not. It is therefore not clear that a header needs to be marked as GNU just because base-256 encoding was used.
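To make the two numeric encodings concrete, here is a minimal sketch of a field decoder that accepts either ASCII octal or base-256. The helper name `parseNumeric` is hypothetical (it is not the archive/tar API), and the sketch assumes non-negative values only; GNU tar also allows negative base-256 values, which are rejected here.

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
	"strconv"
)

// parseNumeric decodes a tar numeric field (hypothetical helper).
// If the high bit of the first byte is set, the field is treated as
// base-256: the remaining bits form a big-endian binary number.
// Otherwise the field is treated as NUL/space padded ASCII octal.
func parseNumeric(b []byte) (int64, error) {
	if len(b) == 0 {
		return 0, errors.New("empty field")
	}
	if b[0]&0x80 != 0 {
		if b[0]&0x40 != 0 {
			return 0, errors.New("negative base-256 values not handled in this sketch")
		}
		var x uint64
		for i, c := range b {
			if i == 0 {
				c &= 0x7f // drop the base-256 flag bit
			}
			if x>>55 != 0 {
				return 0, errors.New("value overflows int64")
			}
			x = x<<8 | uint64(c)
		}
		return int64(x), nil
	}
	s := string(bytes.Trim(b, " \x00"))
	if s == "" {
		return 0, nil
	}
	return strconv.ParseInt(s, 8, 64)
}

func main() {
	mode, err := parseNumeric([]byte("0000644\x00"))
	fmt.Println(mode, err)

	// 8 GiB does not fit in 11 octal digits, so it is stored in base-256.
	size, err := parseNumeric([]byte{0x80, 0, 0, 0, 0, 0, 0, 0x02, 0, 0, 0, 0})
	fmt.Println(size, err)
}
```

Note that nothing in the decoder depends on the magic value, which matches the observation that readers accept base-256 in any format.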

Problem 1:

When reading, if the decoder detects the GNU magic number, it attempts to read 155 bytes for the prefix. This is just plain wrong: it reads the atime, ctime, etc. instead, which makes the prefix incorrect.

See this playground example
The paths there have something like "12574544345" prepended to them. This is because when archive/tar tries to read the prefix, it is actually reading the atime (which is in ASCII octal and is null terminated). Thus, it incorrectly uses the atime as the prefix.

This probably went undetected for so long because the "incremental" mode of GNU tar is rarely used, so the atime and ctime fields are never filled out and are left as null bytes. This happens to work in the common case, since the cstring for this field ends up being an empty string.
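The failure mode can be reproduced without archive/tar at all. The following sketch builds a bare 512-byte GNU header block by hand and compares a reader that always treats offset 345 as the prefix with one that only does so for the USTAR magic. All function names here are hypothetical and the block is not a complete valid header (no checksum, for instance); only the field offsets come from the layout above.

```go
package main

import (
	"bytes"
	"fmt"
)

const (
	magicGNU   = "ustar  \x00" // magic "ustar " followed by version " \x00"
	magicUSTAR = "ustar\x0000" // magic "ustar\x00" followed by version "00"
)

// cString returns the bytes of b up to the first NUL as a string.
func cString(b []byte) string {
	if i := bytes.IndexByte(b, 0); i >= 0 {
		b = b[:i]
	}
	return string(b)
}

// gnuBlock builds a skeletal GNU header with only name, magic, and atime set.
func gnuBlock(name, atime string) []byte {
	block := make([]byte, 512)
	copy(block[0:100], name)
	copy(block[257:265], magicGNU)
	copy(block[345:357], atime) // GNU atime lives where USTAR keeps the prefix
	return block
}

// naiveName mimics the bug: it reads a 155-byte prefix regardless of format.
func naiveName(block []byte) string {
	name := cString(block[0:100])
	if prefix := cString(block[345:500]); prefix != "" {
		return prefix + "/" + name
	}
	return name
}

// fixedName only honors the prefix field when the magic is USTAR.
func fixedName(block []byte) string {
	name := cString(block[0:100])
	if string(block[257:265]) != magicUSTAR {
		return name
	}
	if prefix := cString(block[345:500]); prefix != "" {
		return prefix + "/" + name
	}
	return name
}

func main() {
	block := gnuBlock("file.txt", "12574544345")
	fmt.Println(naiveName(block)) // atime is mistaken for the prefix
	fmt.Println(fixedName(block))

	// With atime left as NUL bytes the bug is invisible.
	fmt.Println(naiveName(gnuBlock("file.txt", "")))
}
```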

Problem 2:

When writing, if a numeric field is ever too large to represent in octal format, it triggers the usedBinary flag and causes the library to output the GNU magic numbers, but the library then fails to encode in the GNU format. Since it believes that the GNU format has a prefix field, it erroneously tries to use it, losing some information in the process.

This is ultimately what causes #9683, but is rare in practice since the perfect conditions need to be met for GNU format to be used. There is a very narrow range between the use cases of USTAR and PAX where the logic will use GNU.
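For reference, the point at which a value stops fitting in octal is easy to compute. The sketch below assumes that one byte of each field is reserved for a terminator, so an n-byte field holds n-1 octal digits; the helper name `fitsOctal` is chosen for illustration.

```go
package main

import "fmt"

// fitsOctal reports whether x can be written as ASCII octal in a field of
// fieldLen bytes, assuming one byte is reserved for a NUL or space terminator.
func fitsOctal(fieldLen int, x int64) bool {
	if fieldLen < 1 {
		return false
	}
	// Each octal digit carries 3 bits.
	return x >= 0 && x>>uint(3*(fieldLen-1)) == 0
}

func main() {
	// 12-byte fields (size, mtime): 11 digits, so up to 2^33 - 1.
	fmt.Println(fitsOctal(12, 1<<33-1), fitsOctal(12, 1<<33))

	// 8-byte fields (mode, uid, gid): 7 digits, so up to 2^21 - 1.
	fmt.Println(fitsOctal(8, 1<<21-1), fitsOctal(8, 1<<21))
}
```

Under this assumption, a file of 8 GiB or more, or a uid above 2097151, is enough to push the writer out of plain octal.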

Solution:

When decoding, change it so that the reader doesn't read the 155-byte prefix field for GNU headers (since this is just plain wrong). Optionally, support parsing of the atime and ctime from the GNU format. Nothing needs to change for sparse file support since that logic correctly understood the GNU format.
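The optional atime/ctime parsing could look roughly like the sketch below. It assumes the fields are ASCII octal seconds since the Unix epoch (base-256 is not handled here) and uses a zero `time.Time` to mean "not set"; `gnuTimes` and `parseOctalTime` are hypothetical names.

```go
package main

import (
	"bytes"
	"fmt"
	"strconv"
	"time"
)

const magicGNU = "ustar  \x00" // magic "ustar " followed by version " \x00"

// parseOctalTime decodes a NUL-terminated ASCII octal timestamp.
// It returns false for empty or malformed fields.
func parseOctalTime(b []byte) (time.Time, bool) {
	if i := bytes.IndexByte(b, 0); i >= 0 {
		b = b[:i]
	}
	s := string(bytes.TrimSpace(b))
	if s == "" {
		return time.Time{}, false
	}
	sec, err := strconv.ParseInt(s, 8, 64)
	if err != nil {
		return time.Time{}, false
	}
	return time.Unix(sec, 0), true
}

// gnuTimes extracts atime and ctime from a GNU header block.
// Unset or unparsable fields come back as the zero time.
func gnuTimes(block []byte) (atime, ctime time.Time) {
	if len(block) < 512 || string(block[257:265]) != magicGNU {
		return time.Time{}, time.Time{}
	}
	atime, _ = parseOctalTime(block[345:357])
	ctime, _ = parseOctalTime(block[357:369])
	return atime, ctime
}

func main() {
	block := make([]byte, 512)
	copy(block[257:265], magicGNU)
	copy(block[345:357], "12574544345") // atime only; ctime left as NULs

	atime, ctime := gnuTimes(block)
	fmt.Println(atime.UTC(), ctime.IsZero())
}
```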

When encoding, I propose the following order of precedence:

  • First, use the 1988 POSIX (USTAR) standard when possible for maximum backwards compatibility.
  • If any numeric field goes beyond the octal representation, or any names are longer than what is supported, just use the 2001 POSIX (PAX) standard.
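The proposed precedence amounts to a single check per header. Here is a rough sketch under simplifying assumptions: the `entry` type is a hypothetical, trimmed-down header; only the name, size, uid, and gid are considered; names must be ASCII; and numeric fields reserve one byte for a terminator.

```go
package main

import "fmt"

// entry is a hypothetical, trimmed-down header used only for this sketch.
type entry struct {
	Name string
	Size int64
	UID  int64
	GID  int64
}

// fitsOctal reports whether x fits in fieldLen-1 octal digits.
func fitsOctal(fieldLen int, x int64) bool {
	return fieldLen >= 1 && x >= 0 && x>>uint(3*(fieldLen-1)) == 0
}

// fitsUSTARName reports whether name is ASCII and fits either in the
// 100-byte name field or split at a '/' into a 155-byte prefix plus name.
func fitsUSTARName(name string) bool {
	for i := 0; i < len(name); i++ {
		if name[i] >= 0x80 {
			return false
		}
	}
	if len(name) <= 100 {
		return true
	}
	for i := 1; i < len(name) && i <= 155; i++ {
		rest := len(name) - i - 1
		if name[i] == '/' && rest > 0 && rest <= 100 {
			return true
		}
	}
	return false
}

// chooseFormat prefers USTAR and falls back to PAX; GNU is never chosen.
func chooseFormat(e entry) string {
	if fitsUSTARName(e.Name) && fitsOctal(12, e.Size) &&
		fitsOctal(8, e.UID) && fitsOctal(8, e.GID) {
		return "USTAR"
	}
	return "PAX"
}

func main() {
	fmt.Println(chooseFormat(entry{Name: "a.txt", Size: 10}))
	fmt.Println(chooseFormat(entry{Name: "a.txt", Size: 1 << 33}))
	fmt.Println(chooseFormat(entry{Name: "é.txt", Size: 10}))
}
```

A real writer would also have to check linkname, uname, gname, and the device numbers, but the decision stays two-way.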

Let's avoid writing the GNU format. In fact, the GNU manual itself says the following under the POSIX section:

This archive format will be the default format for future versions of GNU tar.

The only advantages that GNU offers over USTAR are:

  • Unlimited length filenames (only ASCII)
  • Relatively large filesizes
  • Possibly atime and ctime

However, PAX offers all of these over USTAR and far more:

  • Support for unlimited-length strings (including UTF-8) for filenames, usernames, etc.
  • Arbitrarily large integers for filesizes, uids, etc.
  • Sub-second resolution times.
  • No need for base-256 encoding (or for assuming that decoders can handle it), since PAX has its own well-defined method of encoding arbitrarily large integers.
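To illustrate that last point: a PAX extended header stores each value as a text record of the form "LEN KEY=VALUE\n", where LEN is the decimal length of the whole record including the length digits themselves. A minimal sketch of producing one record follows; the function name `formatPAXRecord` is chosen for illustration.

```go
package main

import (
	"fmt"
	"strconv"
)

// formatPAXRecord builds one PAX extended header record.
// The length prefix counts every byte of the record, itself included,
// so the size is recomputed once if adding the digits changes their count.
func formatPAXRecord(key, value string) string {
	const padding = 3 // the space, the '=', and the trailing newline
	size := len(key) + len(value) + padding
	size += len(strconv.Itoa(size))
	record := strconv.Itoa(size) + " " + key + "=" + value + "\n"
	if len(record) != size {
		// The digit count grew (e.g. 9 -> 10), so redo with the final size.
		size = len(record)
		record = strconv.Itoa(size) + " " + key + "=" + value + "\n"
	}
	return record
}

func main() {
	// An 8 GiB size needs no base-256: it is just decimal text.
	fmt.Printf("%q\n", formatPAXRecord("size", "8589934592"))
	// UTF-8 names are stored as-is; the length is in bytes.
	fmt.Printf("%q\n", formatPAXRecord("path", "é.txt"))
}
```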

Not to mention, we are already outputting PAX in many situations. What's the point of straddling 3 different output formats?

Thoughts?

Labels: FrozenDueToAge, NeedsFix (the path to resolution is known, but the work has not been done)
