CSI: validate that single-node mounts aren't used with canaries #13380
Hi @f3l1x Thanks for using Nomad and for reporting this issue. We'll try to replicate this locally and update the issue.
Hi again @f3l1x So a couple of things stick out to me. Since it sometimes runs and sometimes doesn't, it can be really tricky to debug. Can you include your server and client logs, ideally both when it works and when it doesn't? Also, if you could include your server and client configs with any secrets removed, that would be really helpful. Replicating your environment as best we can is going to be essential.
Hi @DerekStrickland. I've verified this job right now, and changing only the meta of the job 10 times results in the same state (pending allocation -> progress_deadline). With
Thanks for the clarification 😄 Are you able to share the logs and configuration files with us?
I cut the logs around the failed deployment. I hope that helps. If you need anything more, just say. Nomad server:
Nomad client:
@DerekStrickland Just an idea, but is it possible that we're facing this trouble because we use
That's a really interesting theory. The "max claims reached" log message might point to just that. I'm cc'ing my colleague @tgross for a consultation 😄
Hi folks. Yes, we already validate that the
Hi @tgross, thank you. Can you please clarify what the correct usage is of
That's a volume that can accept multiple readers but only a single writer. As noted in the
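For concreteness, here is a minimal sketch of a Nomad CSI volume specification that requests that multiple-reader/single-writer capability; the id, name, and plugin_id below are placeholders, not taken from this issue.

```hcl
# Hypothetical volume spec (e.g. for `nomad volume create`); ids are placeholders.
id        = "shared-data"
name      = "shared-data"
type      = "csi"
plugin_id = "nfs"

capability {
  # Readers may attach from any node, but only one writer at a time.
  access_mode     = "multi-node-single-writer"
  attachment_mode = "file-system"
}
```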
AFAIK CSI
That means if a bunch of workloads are dispatched to the same Nomad client, that should work for a SINGLE_NODE_WRITER (RWO) volume - it's still a single writer from the host perspective, not from the process perspective.
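To make the discussion concrete, here is a hedged sketch (job, group, and volume names invented, not from this issue) of the kind of claim at issue: a group whose allocations all mount a single-node-writer volume.

```hcl
job "example" {
  datacenters = ["dc1"]

  group "worker" {
    count = 2 # two allocations that may or may not land on the same client

    volume "data" {
      type            = "csi"
      source          = "shared-data"
      access_mode     = "single-node-writer"
      attachment_mode = "file-system"
    }

    task "app" {
      driver = "docker"

      volume_mount {
        volume      = "data"
        destination = "/srv/data"
      }

      config {
        image   = "busybox:1.36"
        command = "sleep"
        args    = ["3600"]
      }
    }
  }
}
```

Under the interpretation described in this thread, the write claim is tracked per workload, so the second allocation's claim can be rejected even if both allocations land on the same client.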
Hi @scaleoutsean. That's not my reading of the spec: "published" is a different state than "staged", and publishing includes the association with a specific workload.
@tgross okay, then - even though our opinions differ, it's useful to know how the spec is understood by Nomad. If you're willing to entertain the possibility of the spec being wrong or unclear: yes, publishing is different, but as you said that's an association with a workload (not covered by the spec), whereas

If a volume has a single host FS with just one file, write.txt, and is published twice to two workloads running on the same node (i.e.

In fact each pod could even write to a different file. Imagine two (stand-alone) MinIO containers allowing uploads to the same filesystem (last writer wins), while reads would be parallel. And this also wouldn't be a

And even in a "worst case" scenario where multiple workloads write to the same file, that's workable as well, as long as the workloads are smart (lock a byte range for modifications, or lazily obtain a write lock only when writing). That's no different from how it works on a Linux VM, where multiple applications log to the same file without there being a

It's beneficial to Nomad users if they can schedule multiple workloads that use the same volume on a host. If you have a single-host filesystem, you can't work on it in parallel (if the spec is understood to mean "single workload"), although the host may have plentiful resources that allow parallel execution (e.g. parameterized batch jobs). A second workload that tries to obtain an exclusive lock on a file already locked by the first workload couldn't start, but that is expected and consistent with VM or bare-metal environments - if one tries to start two PostgreSQL instances using the same data and log files and that doesn't work, they probably won't argue it's a PostgreSQL bug.

Related to this issue, I haven't looked at how the provisioner used by the OP works, but "sometimes it's happening" indicates there's no problem; if you get lucky and the second pod that uses the volume gets scheduled on the same worker where the existing workload is, it'll work.
In my experience that's, uh, definitely a possibility. 😀 So we're absolutely open to discussing it. As it turns out, there's an open issue in the spec repo that covers exactly this case: container-storage-interface/spec#178, which suggests that you're not alone in wanting this.
Totally agreed that the application could own the "who's writing?" semantics. The application developer knows way more about the usage pattern than the orchestrator (Nomad in this case) can possibly know. But there's benefit in our being conservative here and protecting users from corrupting their data by imposing the requirement that their application be aware of these semantics. And I think that's what makes it a harder sell for us. That being said, if container-storage-interface/spec#178 ends up getting resolved, we'll most likely end up needing to support that approach anyway. We'll likely need to get around to fixing #11798 as well.
Yeah, something like the
@scaleoutsean thanks for that reference. It looks like the k8s folks have at least at some point nudged the spec folks about this very issue: container-storage-interface/spec#465 (comment). Our team is currently focused on getting Nomad 1.4.0 out the door, but I think this is worth discussing further here once we've got some breathing room.
Nomad version
1.3.1
Operating system and Environment details
Debian 11.3
Issue
Hi ✋
We're running Nomad 1.3.1 with 3 Nomad servers, 3 Nomad clients, Consul, Traefik, and an NFS CSI plugin.
We are seeing an allocation stuck in the pending state forever. It eventually hits the progress_deadline (10m) and the deployment fails.
I am not sure if it's related to CSI; we are using NFS (https://gitlab.com/rocketduck/csi-plugin-nfs). But maybe it's not - sometimes it happens with CSI and sometimes without.
Reproduction steps
Take a look at the job file. If I change only the metadata from version=10 to version=20, the allocation gets stuck pending until progress_deadline. Whether I use a static or a dynamic port does not matter. Sometimes it surprisingly works. :-)
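For clarity, the only change between the two deployments is the job's meta block, roughly along these lines (a sketch, not the actual job file):

```hcl
meta {
  # Changing only this value triggers a new deployment of the same job.
  version = "20" # previously "10"
}
```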
Expected Result
Deployment succeeds.
Actual Result
Deployment fails. The allocation stays pending.
Job file (if appropriate)