CSI: validate that single-node mounts aren't used with canaries #13380
Hi @f3l1x Thanks for using Nomad and for reporting this issue. We'll try to replicate this locally and update the issue.
Hi again @f3l1x So a couple of things stick out to me. Since it sometimes runs and sometimes doesn't, it can be really tricky to debug. Can you include your server and client logs, ideally both when it works and when it doesn't? Also, if you could include your server and client configs with any secrets removed, that would be really helpful. Replicating your environment as best we can is going to be essential.
Hi @DerekStrickland. I've verified this job right now, and changing only the meta of the job 10 times results in the same state (pending allocation -> progress_deadline). With
Thanks for the clarification 😄 Are you able to share the logs and configuration files with us?
I cut the logs around the failed deployment. I hope that helps. If you need anything more, just say. Nomad server:
Nomad client:
@DerekStrickland Just an idea, but is it possible that we're facing this trouble because we use
That's a really interesting theory. The "max claims reached" log message might point to just that. I'm cc'ing my colleague @tgross for a consultation 😄
Hi folks. Yes, we already validate that the
Hi @tgross, thank you. Can you please clarify what the correct usage is of
That's a volume that can accept multiple readers but only a single writer. As noted in the
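For concreteness, here is a minimal sketch of a Nomad CSI volume specification that requests that multiple-reader/single-writer capability; the id, name, and plugin_id below are placeholders, not taken from this issue.

```hcl
# Hypothetical volume spec (e.g. for `nomad volume create`); ids are placeholders.
id        = "shared-data"
name      = "shared-data"
type      = "csi"
plugin_id = "nfs"

capability {
  # Readers may attach from any node, but only one writer at a time.
  access_mode     = "multi-node-single-writer"
  attachment_mode = "file-system"
}
```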
AFAIK CSI
That means if a bunch of workloads are dispatched to the same Nomad client, that should work for a SINGLE_NODE_WRITER (RWO) volume - it's still a single writer from the host perspective, not from the process perspective.
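To make the discussion concrete, here is a hedged sketch (job, group, and volume names invented, not from this issue) of the kind of claim at issue: a group whose allocations all mount a single-node-writer volume.

```hcl
job "example" {
  datacenters = ["dc1"]

  group "worker" {
    count = 2 # two allocations that may or may not land on the same client

    volume "data" {
      type            = "csi"
      source          = "shared-data"
      access_mode     = "single-node-writer"
      attachment_mode = "file-system"
    }

    task "app" {
      driver = "docker"

      volume_mount {
        volume      = "data"
        destination = "/srv/data"
      }

      config {
        image   = "busybox:1.36"
        command = "sleep"
        args    = ["3600"]
      }
    }
  }
}
```

Under the interpretation described in this thread, the write claim is tracked per workload, so the second allocation's claim can be rejected even if both allocations land on the same client.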
Hi @scaleoutsean. That's not my reading of the spec: "published" is a different state than "staged", and publishing includes the association with a specific workload.
@tgross okay, then - even though our opinions differ, it's useful to know how the spec is understood by Nomad. If you're willing to entertain the possibility of the spec being wrong or unclear: yes, publishing is different, but as you said that's an association with a workload (not covered by the spec), whereas

If a volume has a single host FS with just one file, write.txt, and is published twice to two workloads running on the same node (i.e.

In fact each pod could even write to a different file. Imagine two (stand-alone) MinIO containers allowing uploads to the same filesystem (last writer wins), while reads would be parallel. And this also wouldn't be a

And even in a "worst case" scenario where multiple workloads write to the same file, that's workable as well, as long as the workloads are smart (lock a byte range for modifications, or lazily obtain a write lock only when writing). That's no different from how it works on a Linux VM, where multiple applications log to the same file without there being a

It's beneficial to Nomad users if they can schedule multiple workloads that use the same volume on a host. If you have a single-host filesystem, you can't work on it in parallel (if the spec is understood to mean "single workload"), although the host may have plentiful resources that allow parallel execution (e.g. parameterized batch jobs). A second workload that tries to obtain an exclusive lock on a file already locked by the first workload couldn't start, but that is expected and consistent with VM or bare-metal environments - if one tries to start two PostgreSQL instances using the same data and log files and that doesn't work, they probably won't argue it's a PostgreSQL bug.

Related to this issue, I haven't looked at how the provisioner used by the OP works, but "sometimes it's happening" indicates there's no problem; if you get lucky and the second pod that uses the volume gets scheduled on the same worker where the existing workload is, it'll work.
In my experience that's, uh, definitely a possibility. 😀 So we're absolutely open to discussing it. As it turns out, there's an open issue in the spec repo that covers exactly this case: container-storage-interface/spec#178, which suggests that you're not alone in wanting this.
Totally agreed that the application could own the "who's writing?" semantics. The application developer knows way more about the usage pattern than the orchestrator (Nomad in this case) can possibly know. But there's benefit in our being conservative here and protecting users from corrupting their data by imposing the requirement that their application be aware of these semantics. And I think that's what makes it a harder sell for us. That being said, if container-storage-interface/spec#178 ends up getting resolved, we'll most likely end up needing to support that approach anyway. We'll likely need to get around to fixing #11798 as well.
Yeah, something like the
@scaleoutsean thanks for that reference. It looks like the k8s folks have at least at some point nudged the spec folks about this very issue: container-storage-interface/spec#465 (comment). Our team is currently focused on getting Nomad 1.4.0 out the door, but I think this is worth discussing further here once we've got some breathing room.
Nomad version
1.3.1
Operating system and Environment details
Debian 11.3
Issue
Hi ✋
We're running Nomad 1.3.1 with 3 Nomad servers, 3 Nomad clients, Consul, Traefik, and an NFS CSI plugin.
We are seeing an allocation stuck in the pending state forever. It eventually hits the progress_deadline (10m) and the deployment fails.
I am not sure if it's related to CSI; we are using NFS (https://gitlab.com/rocketduck/csi-plugin-nfs). But maybe it's not - sometimes it happens with CSI and sometimes without.
Reproduction steps
Take a look at the job file. If I change only the metadata from version=10 to version=20, the allocation gets stuck pending until progress_deadline. Whether I use a static or a dynamic port does not matter. Sometimes it surprisingly works. :-)
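For clarity, the only change between the two deployments is the job's meta block, roughly along these lines (a sketch, not the actual job file):

```hcl
meta {
  # Changing only this value triggers a new deployment of the same job.
  version = "20" # previously "10"
}
```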
Expected Result
Deployment succeeds.
Actual Result
Deployment fails. The allocation stays pending.
Job file (if appropriate)