Skip to content

x/build/cmd/rundockerbuildlets: scaleway buildlets fail looping #22748

Closed
@bradfitz

Description

@bradfitz

The scaleway builders are all saying:

root@scw-e8738f:~# journalctl -f -u rundockerbuildlet.service
...
Nov 15 20:41:28 scw-e8738f rundockerbuildlet[4707]: 2017/11/15 20:41:28 Creating scaleway-prod-05 ...
Nov 15 20:41:28 scw-e8738f rundockerbuildlet[4707]: 2017/11/15 20:41:28 Error creating scaleway-prod-05: exit status 125, docker: Error response from daemon: Conflict. The name "/scaleway-prod-05" is already in use by container e1a1f4e89f3a073402f2ae513753db661c41a4727e808215758364c313850357. You have to remove (or rename) that container to be able to reuse that name..
Nov 15 20:41:28 scw-e8738f rundockerbuildlet[4707]: See 'docker run --help'.
Nov 15 20:41:29 scw-e8738f rundockerbuildlet[4707]: 2017/11/15 20:41:29 Creating scaleway-prod-05 ...
Nov 15 20:41:29 scw-e8738f rundockerbuildlet[4707]: 2017/11/15 20:41:29 Error creating scaleway-prod-05: exit status 125, docker: Error response from daemon: Conflict. The name "/scaleway-prod-05" is already in use by container e1a1f4e89f3a073402f2ae513753db661c41a4727e808215758364c313850357. You have to remove (or rename) that container to be able to reuse that name..
Nov 15 20:41:29 scw-e8738f rundockerbuildlet[4707]: See 'docker run --help'.

/cc @jessfraz @adams-sarah @andybons

Activity

added this to the Unreleased milestone on Nov 15, 2017
added
Buildersx/build issues (builders, bots, dashboards)
on Nov 15, 2017
gopherbot

gopherbot commented on Nov 15, 2017

@gopherbot
Contributor

Change https://golang.org/cl/78032 mentions this issue: dashboard: remove linux-arm as a trybot

bradfitz

bradfitz commented on Nov 16, 2017

@bradfitz
ContributorAuthor

I nuked all 50 instances and recreated and now I see them just deleting & restarting the containers:

Nov 16 00:08:02 scw-e8738f rundockerbuildlet[4815]: 2017/11/16 00:08:02 Removed container 755cec905f30 (scaleway-prod-50)
Nov 16 00:08:02 scw-e8738f rundockerbuildlet[4815]: 2017/11/16 00:08:02 Creating scaleway-prod-50 ...
Nov 16 00:08:03 scw-e8738f rundockerbuildlet[4815]: 2017/11/16 00:08:03 Created scaleway-prod-50
Nov 16 00:08:17 scw-e8738f rundockerbuildlet[4815]: 2017/11/16 00:08:17 Removed container 2c1569cd1186 (scaleway-prod-50)
Nov 16 00:08:17 scw-e8738f rundockerbuildlet[4815]: 2017/11/16 00:08:17 Creating scaleway-prod-50 ...
Nov 16 00:08:19 scw-e8738f rundockerbuildlet[4815]: 2017/11/16 00:08:19 Created scaleway-prod-50
Nov 16 00:08:32 scw-e8738f rundockerbuildlet[4815]: 2017/11/16 00:08:32 Created scaleway-prod-50
Nov 16 00:08:47 scw-e8738f rundockerbuildlet[4815]: 2017/11/16 00:08:47 Removed container b69851e6724a (scaleway-prod-50)
Nov 16 00:08:47 scw-e8738f rundockerbuildlet[4815]: 2017/11/16 00:08:47 Creating scaleway-prod-50 ...
Nov 16 00:08:49 scw-e8738f rundockerbuildlet[4815]: 2017/11/16 00:08:49 Created scaleway-prod-50
Nov 16 00:09:00 scw-e8738f rundockerbuildlet[4815]: 2017/11/16 00:09:00 Removed container a278c75f4c14 (scaleway-prod-50)
Nov 16 00:09:00 scw-e8738f rundockerbuildlet[4815]: 2017/11/16 00:09:00 Creating scaleway-prod-50 ...
Nov 16 00:09:02 scw-e8738f rundockerbuildlet[4815]: 2017/11/16 00:09:02 Created scaleway-prod-50
Nov 16 00:09:07 scw-e8738f rundockerbuildlet[4815]: 2017/11/16 00:09:07 Removed container 650d128f0243 (scaleway-prod-50)
Nov 16 00:09:07 scw-e8738f rundockerbuildlet[4815]: 2017/11/16 00:09:07 Creating scaleway-prod-50 ...
Nov 16 00:09:08 scw-e8738f rundockerbuildlet[4815]: 2017/11/16 00:09:08 Created scaleway-prod-50

Why? Maybe because of #22749 is making the containers start up too slowly?

bradfitz

bradfitz commented on Nov 16, 2017

@bradfitz
ContributorAuthor

Hm... archive/tar changes?

The docker logs are saying:

2017/11/16 01:05:03 Downloaded ./buildlet.exe (7178728 bytes)
2017/11/16 01:05:03 buildlet starting.
2017/11/16 01:05:04 Dialing coordinator farmer.golang.org:443 ...
2017/11/16 01:05:04 Doing TLS handshake with coordinator (verifying hostname "farmer.golang.org")...
2017/11/16 01:05:04 Registering reverse mode with coordinator...
2017/11/16 01:05:04 Connected to coordinator; reverse dialing active
2017/11/16 01:05:05 Removing .
2017/11/16 01:05:05 Ignoring fail of RemoveAll(.)
2017/11/16 01:05:06 writetgz: untarring Request.Body into /workdir/go
2017/11/16 01:05:06 extracted tarball into /workdir/go: 1 files, 1 dirs (1.338425ms)
2017/11/16 01:05:06 writetgz: untarring Request.Body into /workdir/go
2017/11/16 01:05:06 error extracting tarball into /workdir/go after 0 files, 0 dirs, 105.528154ms: tar file contained invalid name ""
2017/11/16 01:05:07 Halting in 1 second.
2017/11/16 01:05:07 buildlet reverse mode exiting.

Note the tar file contained invalid name "".

I probably built the linux-arm binary with Go tip when I addressed #21839, and now the new tip-built buildlet can't parse the coordinator's Go 1.8 tar files.

/cc @dsnet

dsnet

dsnet commented on Nov 16, 2017

@dsnet
Member

What's the code that generates the tar file (or reads it)? The "0 files, 0 dirs" is also interesting.

bradfitz

bradfitz commented on Nov 16, 2017

@bradfitz
ContributorAuthor

The The "0 files, 0 dirs" just means it's failing on the first entry and the counters for number of files and number of dirs haven't been incremented yet.

The tar.gz comes from x/build/cmd/coordinator which gets it either from gitmirror (which just does exec.CommandContext(ctx, "git", "archive", "--format=tgz", rev)) or it gets it directly from Gerrit's equivalent export-a-tarball handler.

There's only one use of tar.NewWriter that we're using in the coordinator (to generate a VERSION file), but that push works fine. You can see it in the log above (the first one, extracted tarball into /workdir/go: 1 files, 1 dirs (1.338425ms))

bradfitz

bradfitz commented on Nov 16, 2017

@bradfitz
ContributorAuthor

Btw, I re-pushed a new buildlet binary built at Go 1.9 and they're all working again. So it does seem to be something at tip.

dsnet

dsnet commented on Nov 16, 2017

@dsnet
Member

Is there any way to get access to the tar.gz file being passed in?

bradfitz

bradfitz commented on Nov 16, 2017

@bradfitz
ContributorAuthor

With some work I suppose.

dsnet

dsnet commented on Nov 16, 2017

@dsnet
Member

I have a theory.

I added support for parsing the PAX records for an esoteric feature of tar, which are global PAX records, denoted by a typeflag of TypeXGlobalHeader. The relevant new code may return a header without a filename (since they don't make sense for global headers).

Assuming this is the issue, you probably want to either push the validRelPath check to be within each case of the switch statement or check the Header.Typeflag up front .

dsnet

dsnet commented on Nov 16, 2017

@dsnet
Member

Confirmed. I ran git archive --format=tgz ee6321b5405504f1d090d01a4703ec9ff6b218ea | gzip -d | head -c 512| hexdump -C and got:

00000000  70 61 78 5f 67 6c 6f 62  61 6c 5f 68 65 61 64 65  |pax_global_heade|
00000010  72 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |r...............|
00000020  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00000060  00 00 00 00 30 30 30 30  36 36 36 00 30 30 30 30  |....0000666.0000|
00000070  30 30 30 00 30 30 30 30  30 30 30 00 30 30 30 30  |000.0000000.0000|
00000080  30 30 30 30 30 36 34 00  31 33 32 30 33 31 32 35  |0000064.13203125|
00000090  36 36 31 00 30 30 31 34  35 31 31 00 67 00 00 00  |661.0014511.g...|
000000a0  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*

You can see a typeflag of 'g' at offset 0x9c.
I'm rather surprised that git outputs archives using this esoteric feature.

bradfitz

bradfitz commented on Nov 16, 2017

@bradfitz
ContributorAuthor

Do you think the Go tar reader should expose these records by default, or do you think they should be opt-in?

dsnet

dsnet commented on Nov 16, 2017

@dsnet
Member

They were actually always returned in Go1.9 and earlier, but returned a bogus file containing the raw contents of the PAX headers.

The buildlet logic actually happened to work by chance. There aren't any hard restrictions on what filename to use for global headers, so the one used by git just happened to work for validRelPath. So if you looked onto the filesystem of the builders, you would have found a bogus file called "pax_global_header".

I was actually surprised that the buildet worked at all in Go1.9 and earlier. The surprising behavior is that tar.Header{Typeflag: tar.TypeXGlobalHeader}.FileInfo().Mode().IsRegular() reports true, when I would have expected false. Had this been false, then it would hit the default case in the logic, making the problem obvious.

Unfortunately, I don't know how to fix tar.Header.FileInfo.Mode to report false here. The problem is that os.FileMode.IsRegular has no way to indicate that something is not a regular file without saying that it is something else since it only checks no mode bits are set.

gopherbot

gopherbot commented on Nov 16, 2017

@gopherbot
Contributor

Change https://golang.org/cl/78355 mentions this issue: archive/tar: use placeholder name for global PAX records

gopherbot

gopherbot commented on Nov 18, 2017

@gopherbot
Contributor

Change https://golang.org/cl/78546 mentions this issue: cmd/buildlet: ignore pax global header when untarring

locked and limited conversation to collaborators on Nov 24, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    No type

    Projects

    No projects

    Relationships

    None yet

      Development

      No branches or pull requests

        Participants

        @bradfitz@dsnet@gopherbot

        Issue actions

          x/build/cmd/rundockerbuildlets: scaleway buildlets fail looping · Issue #22748 · golang/go