[API Proposal]: TarReader should not assume same format for all entries in a tar file

### Background and motivation

The user @Bio2Hazard was kind enough to report [this bug](https://github.com/dotnet/runtime/issues/68230#issuecomment-1128266820) when testing the new `System.Formats.Tar` APIs published in .NET 7 Preview 4 with a tar file generated by their own work tools. They shared the file [here](https://github.com/dotnet/runtime/issues/68230#issuecomment-1129201469).

The tar file contains 4 entries:

- 2 consecutive `RegularFile` `ustar` entries.
- Then a `gnu` metadata entry of type `LongLink`.
- Then the `RegularFile` entry with the actual data that the previous `LongLink` metadata entry describes.
 
I confirmed this by inspecting the archive with a hex editor.

The archive also shows that all 4 entries have `magic` and `version` metadata fields that follow the `ustar` rules, even the 3rd entry, whose entry type `L` is clearly not supported by `ustar`. In the `gnu` format, the `magic` and `version` fields differ slightly from those of the POSIX formats (`ustar` and `pax`). I describe those differences [here](https://github.com/dotnet/runtime/issues/68230#issuecomment-1131311063).
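To make the field differences concrete, here is a minimal sketch of classifying a single 512-byte header block on its own, without looking at earlier entries. The offsets (type flag at 156, `magic` at 257, `version` at 263) and the `ustar`/`gnu` byte values come from the tar header layout. `DetectedFormat`, `TarHeaderSniffer`, and the rule that a format-specific entry type wins over `magic`/`version` are assumptions made for illustration; they are not part of `System.Formats.Tar` and not necessarily how `TarReader` decides.

```cs
using System;

// Hypothetical enum and helper, for illustration only.
public enum DetectedFormat { Unknown, V7, Ustar, Pax, Gnu }

public static class TarHeaderSniffer
{
    private const int HeaderSize = 512;
    private const int TypeFlagOffset = 156; // 1 byte
    private const int MagicOffset = 257;    // 6 bytes
    private const int VersionOffset = 263;  // 2 bytes

    // POSIX (ustar/pax): magic "ustar\0", version "00".
    private static ReadOnlySpan<byte> PosixMagic => new byte[] { 0x75, 0x73, 0x74, 0x61, 0x72, 0x00 };
    private static ReadOnlySpan<byte> PosixVersion => new byte[] { 0x30, 0x30 };

    // GNU: magic "ustar " (trailing space), version " \0".
    private static ReadOnlySpan<byte> GnuMagic => new byte[] { 0x75, 0x73, 0x74, 0x61, 0x72, 0x20 };
    private static ReadOnlySpan<byte> GnuVersion => new byte[] { 0x20, 0x00 };

    // V7 headers predate the magic field, so it is all zeros.
    private static ReadOnlySpan<byte> EmptyMagic => new byte[] { 0, 0, 0, 0, 0, 0 };

    public static DetectedFormat Detect(ReadOnlySpan<byte> header)
    {
        if (header.Length < HeaderSize)
        {
            return DetectedFormat.Unknown;
        }

        byte typeFlag = header[TypeFlagOffset];
        ReadOnlySpan<byte> magic = header.Slice(MagicOffset, 6);
        ReadOnlySpan<byte> version = header.Slice(VersionOffset, 2);

        // Assumption: entry types that only one format defines take precedence
        // over magic/version, so an 'L' entry with ustar magic still counts as gnu.
        if (typeFlag == (byte)'x' || typeFlag == (byte)'g')
        {
            return DetectedFormat.Pax; // pax extended attributes
        }
        if (typeFlag == (byte)'L' || typeFlag == (byte)'K')
        {
            return DetectedFormat.Gnu; // gnu long path / long link metadata
        }

        if (magic.SequenceEqual(PosixMagic) && version.SequenceEqual(PosixVersion))
        {
            // A data entry that follows a pax extended header also lands here;
            // tracking that relationship is left out of this sketch.
            return DetectedFormat.Ustar;
        }
        if (magic.SequenceEqual(GnuMagic) && version.SequenceEqual(GnuVersion))
        {
            return DetectedFormat.Gnu;
        }
        if (magic.SequenceEqual(EmptyMagic))
        {
            return DetectedFormat.V7;
        }

        return DetectedFormat.Unknown;
    }
}
```

Under these assumptions, the 3rd entry of the reported archive (type `L` with `ustar`-style `magic` and `version`) would be classified as `gnu`, while the other three entries would be classified as `ustar`.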

The current behavior of our `TarReader` is to throw `FormatException` when, after reading the first entry in a particular format, it encounters a subsequent entry in a different format, or an entry whose type is unsupported by the format initially assumed for the whole archive.

Surprisingly, SharpCompress and 7-zip can open this same file and traverse all of its entries without problems. This means that:

- They do not assume the whole archive is in the same format, and allow intermixing entries of different formats.
- They do not mind if an entry has a `magic` and a `version` that belong to one format, but an entry type that belongs to another.

I am opening this proposal to discuss the possibility of becoming as flexible as SharpCompress and 7-zip, at least when it comes to the `TarReader`.

The `TarWriter` should keep its current behavior: the user specifies the archive format in the constructor, and if an entry that the format does not support is inserted, it is converted to that format if possible; otherwise an exception is thrown. This behavior aligns with the Unix `tar` tool, which fails if, for example, the user tries to add a block device entry, or a long path entry, when creating a `v7` or `ustar` archive (these two formats do not support those types of files).

@bartonjs @stephentoub @eerhardt @adamsitnik @Jozkee @jeffhandley @baronfel 

### API Proposal

Remove the `Format` property from `TarReader`, so the reader stops assuming all entries share one format, and expose the format on each `TarEntry` instead:

Remove:
```diff
namespace System.Formats.Tar;

public class TarReader : IDisposable
{
-    public TarFormat Format { get; }
}
```

Add:
```diff
public class TarEntry
{
+    public TarFormat Format { get; }
}
```

### API Usage

```cs
using System.Formats.Tar;
using System.IO;

using FileStream fs = File.OpenRead("archive.tar"); // Archive with intermixed entries
using TarReader reader = new(fs);

// GetNextEntry returns null once the end of the archive is reached
TarEntry? entry;
while ((entry = reader.GetNextEntry()) is not null)
{
    // The format is detected per entry by using the Format property
    switch (entry.Format)
    {
        case TarFormat.V7:
            //...
            break;
        case TarFormat.Ustar:
            //...
            break;
        case TarFormat.Pax:
            //...
            break;
        case TarFormat.Gnu:
            //...
            break;
        case TarFormat.Unknown:
            //...
            break;
    }
}
```

### Alternative Designs

One alternative is to remove nothing: if we encounter an entry in a different format than the first one, we switch the reader's `Format` property to `Unknown`, but keep returning entries to the user when they call `GetNextEntry`. We would have to document this very clearly.
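A rough sketch of how that archive-level fallback could behave is shown here. `SketchTarFormat` is a hypothetical stand-in for the `TarFormat` enum so the example compiles on its own, and `ArchiveFormatTracker` is an illustrative helper, not an existing or proposed type.

```cs
using System;

// Hypothetical stand-in for TarFormat, for illustration only.
public enum SketchTarFormat { Unknown, V7, Ustar, Pax, Gnu }

// Hypothetical helper: keeps one archive-level format and degrades it to
// Unknown on the first mismatch instead of throwing.
public sealed class ArchiveFormatTracker
{
    private bool _sawEntry;

    public SketchTarFormat Format { get; private set; } = SketchTarFormat.Unknown;

    // Called once for each entry the reader hands back to the user.
    public void Observe(SketchTarFormat entryFormat)
    {
        if (!_sawEntry)
        {
            // The first entry decides the initial archive-level format.
            Format = entryFormat;
            _sawEntry = true;
        }
        else if (Format != entryFormat)
        {
            // Mixed formats: report Unknown from now on, but keep reading.
            Format = SketchTarFormat.Unknown;
        }
    }
}
```

With this approach a caller cannot tell which format an individual entry uses, only that the archive is mixed, which is the main trade-off against exposing `Format` on each entry.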

### Risks

Low.

- We are still in preview, so we have time for adjustments.
- We want to be as helpful as SharpCompress and 7-zip, especially if they are able to handle these cases and we don't.
- The spec does not explicitly indicate that all entries are expected to be in the same format, but it does specify clear rules about the metadata differences between formats, which should help ensure we properly detect the format of each entry individually and independently from the others.
