Clarify Identity Service vs Probe request #171

chakri-nelluri · 2018-01-24T15:30:33Z

Today, we have Identity service which establishes the identity and discovers the list of supported versions.
Also, a service can crash and spin back up anytime and the error codes defined in the probe can ideally be returned for any RPC call.
Given that do we really need a Probe call? Can we just use Identity service to establish the identity and readiness of the plugin?

saad-ali · 2018-02-01T23:24:50Z

The primary utility of the Probe calls is deployment verification . This can be a one time or periodic check to ensure volume plugin is healthy (and has all dependencies available). This information can be used, for example, to monitor the health of the plugin and redeploy the plugin (or take other automated measures) when it becomes unhealthy. On Kubernetes, for example, this could be hooked in to the Liveness/Readiness mechanism on plugin deployment.

That said, since the purpose of this call is to check the health of the plugin, we don't really need a separate ControllerProbe(...) and NodeProbe(...) call. Regardless of if the plugin is deployed in "controller mode" or "node mode", the same Probe(...) call in the Identity service should be sufficient. @jdef agreed to make this change as part of the "making controller service optional" refactor.

Additionally, if we consolidate Probe in to the Identity service, it's worth considering your question: do we really need a standalone Probe call at all, couldn't we use the GetSupportedVersions(...) call for the same purpose. <- I don't feel very strongly about that so I'd be ok either way.

chakri-nelluri · 2018-02-02T15:04:49Z

Thanks @saad-ali.
Thinking a bit more, what is the expected workflow when a v0.1 plugin is silently replaced with v0.2 plugin?
My assumption is we detect the mismatch from the API reply (INVALID_ARGUMENT & error description in this case) and will redo the identity service calls.
Given that.. how about we replace(rename) GetSupportedVersions with Probe call and return "SupportedVersions" as part of Probe response?

saad-ali · 2018-02-08T04:26:37Z

We had a big discussion about consolidating ControllerProbe(...) and NodeProbe(...) calls in to a single Probe(...) call in the Identity service this morning.

@akutz brought up that this can't be done because volume plugins may not know what mode they are running in (controller, node, both) since the spec does not dictate that plugins should know nor does it offer any mechanism for them to find out in a standard way. A naive plugin, therefore, could depend entirely on the CO to make the appropriate calls, and would not know how to handle a generic Probe(...) call.

I agree that this is an issue. I don't think this mode of operation should be supported. We should modify the spec to dictate that a plugin must know (somehow) what mode it is running in (controller, node, or both). This will let us have a cleaner API, and reduce the matrix of possible deployment configurations slightly.

That said we should not dictate the exact mechanism for how the plugin discovers its mode, as that is an implementation detail for plugin author: a plugin may choose to use whatever they like: an environment variable, binary flag, etc.

Once we can make the assumption that an instance of a plugin knows what mode it is operating in, it makes sense to consolidate Probe(...).

I'm also open to @chakri-nelluri's proposal to collapse GetSupportedVersions(...) and just have the Probe(...) call return the SupportedVersions as part of the Probe(...) response.

CC'ing some of the people who seemed interested in the discussion: @childsb @chakri-nelluri @akutz @jieyu @jdef @julian-hj @lpabon. Apologies if I missed anyone.

akutz · 2018-02-09T06:52:59Z

Hi @saad-ali,

We should not collapse GetSupportedVersions. That's a very, very bad idea. The reason is because Go+gRPC and CSI do not offer true backward compatibility due to the way Go's gRPC implementation handles type registration. That being the case we absolutely need a simple, forever static means of obtaining the supported versions. We agreed early on that GetSupportedVersions would stay the same and not change, and it mustn't. Or else there cannot be a version-agnostic means of determining what API version is supported by a remote CSI endpoint. Because if the client is written in Go it cannot support multiple CSI versions due to the aforementioned reason. Not unless it forgoes language bindings altogether.

We cannot act as if CSI is not already in the wild. It is. It's not prudent to make drastic changes anymore. It may not yet be 1.0.0, but we should act like it unless there's an enormous reason to ignore that behavior. GetSupportedVersions is the one RPC that should never, ever change.

akutz · 2018-02-09T07:02:55Z

Hi @saad-ali,

So here's the thing. The spec already defines CSI_ENDPOINT as a global environment variable that SPs should reference. The fact is that CSI_ENDPOINT is set/configured by something. Whether that's an admin or a CO config file that deploys SPs, CSI_ENDPOINT defined in the spec, set, and used by the SP. The spec defines CSI_ENDPOINT because it's a piece of information that all SPs need to use.

How is this any different than, let's say, CSI_MODE? Yes, it's an implementation detail, but so is CSI_ENDPOINT! The CSI spec should absolutely define CSI_MODE as a standard environment variable that all SPs should use in order to determine the desired mode of operation. That way there is an agreed upon mechanic, just like CSI_ENDPOINT, that COs/admins can use to configure an SP and an SP looks for to determine its desired mode of operation.

`CSI_MODE`

The environment variable CSI_MODE defines the configured service model of the SP process:

Value	Behavior
`<not_set>`	Controller+Node+Identity
`<set_and_empty>`	Controller+Node+Identity
`controller`	Controller+Identity
`node`	Node+Identity

akutz · 2018-02-09T07:15:01Z

Hi @saad-ali,

Re: Probe... I had a long chat with @codenrhoden, and he and I both agree, unless Probe is invoked before all other RPCs and every time the SP process restarts, then Probe is utterly useless. All RPCs should have a defined error code that SPs can return to indicate that Probe has not yet been called and that the CO should call probe. That's the only way Probe has any value.

The value of Probe is such that an SP can be made ready without the SP being restarted. For example:

 1. CO -> CreateVolume -> SP
 2. SP -> FailedPrecondition (probe required) -> CO
 3. CO -> Probe -> SP
 4. SP -> FailedPrecondition (missing kernel module) -> CO

 5. CO Admin Inserts Module

 6. CO -> CreateVolume -> SP
 7. SP -> FailedPrecondition (probe required) -> CO
 8. CO -> Probe -> SP
 9. SP -> Success -> CO
10. CO -> CreateVolume -> SP

The example above demonstrates why Probe is valuable, but only if the SP is able to indicate to the CO that it is required. The CO will not know if the SP process is recycled, so the SP must be able to self-detect that a Probe has not been invoked. This provides an opportunity to correct the SP's dependencies or configuration without having to recycle the SP.

Other possibilities:

The SP process should not be allowed to start at all if the configuration is invalid or there are missing host dependencies. This has the downside the the error may be buried in logs and not obvious to the CO admin.
The SP auto-probes with every incoming RPC. In other words, the SP does a dependency/config check via middleware ahead of every incoming request from the CO. This could be expensive and is the very reason the Probe RPC exists.

The Probe RPC is absolutely required. However, it has no value unless an SP is able to indicate to the CO that Probe must be invoked. The CO should invoke Probe whenever it receives this error code from the SP, even, possibly, multiple times.

Additionally, CSI_MODE should be codified into the spec as a way for SPs to know both which services to register and serve as well as to know how to handle Probe requests. The fact is most developers will not likely create separate code bases for controller and node services. Thus devs will likely use an environment variable to handle the service mode, so why not simply standardize it like CSI_ENDPOINT. Please see my above comments for more details regarding CSI_MODE.

codenrhoden · 2018-02-09T18:34:20Z

After the CSI meeting on Wednesday, I had some thoughts on this I wanted to share. I agree with @saad-ali's comment above that:

Once we can make the assumption that an instance of a plugin knows what mode it is operating in, it makes sense to consolidate Probe(...).

Whether it's through CSI_MODE or left as an implementation detail, I think we've agreed that a plugin should know what "mode" it is running in, and thus a unified Probe call in Identity makes sense.

However, during the call on Tuesday someone asked "do we expect that a probe is required after an SP restarts?". And the answer was no. I contend that if that is the case, Probe is useless. Right now the spec says of Probe: The CO SHALL invoke this RPC prior to any other controller service RPC in order to allow the CO to determine the readiness of the controller service. Similar language is in place for the Node service.

No matter how you look at it, the SP is required per the spec to reject all incoming RPC requests to the Node or Controller service if it has not been probed. If a Controller request comes in prior to a Probe, the CO has violated the spec, and it is appropriate for the SP to return an error.

I think this applies no matter what the lifecycle of the SP is. Whether it is starting up for the first time, the 30th time, or starting anew after an upgrade, the SP cannot accept requests until it has been probed, so that it can verify dependencies, configuration, etc. I believe that is what Probe is for. Every time an SP is restarted, system state (e.g. kernel modules, binaries) or configuration could have changed, so a new Probe is necessary to revalidate that the SP is "good to go". So to me, saying that CO doesn't have to re-Probe after a restart makes no sense.

I contend that all RPCs in the Node and Controller services should have an error code that indicates that a Probe is required when one hasn't happened already. In this way, a CO does not have to care if a plugin has been restarted or just started the first time, it merely is notified that a probe hasn't happened yet, and therefore it needs to do one. I think the correct error for this would be a FAILED_PRECONDITION, which some sort of standard error message in error.details that indicates that a probe is required.

I in fact implemented this sort of gate for the csi-scaleio driver, and was forced to implement what I called "auto-probe" where the probing was done automatically due to the k8s implementation not ever issuing a probe, despite being required by the spec. I don't like the idea of an "auto-probe", because I think it would be harder for a CO to know to look for an error about the SP not being ready. It seems more straightforward to know you need to Probe (by virtue of an error telling you that it hasn't been done), actually doing the Probe, and if that Probe returns any error, you know that the error is supposed to tell you what is wrong with the SP.

jdef · 2018-02-09T18:52:59Z

We used to have CSI_MODE and got rid of it. The spec focuses on protocol over everything else, intentionally. Requiring a CSI_ENDPOINT envvar was a compromise. I strongly disagree that we should include any other envvars as CSI standards.

…

On Fri, Feb 9, 2018 at 2:02 AM, Schley Andrew Kutz ***@***.*** > wrote: Hi @saad-ali <https://github.com/saad-ali>, So here's the thing. The spec already defines CSI_ENDPOINT as a global environment variable that SPs should reference. The fact is that CSI_ENDPOINT is set/configured by *something*. Whether that's an admin or a CO config file that deploys SPs, it's defined in the spec, set, and used by the SP. The spec defines CSI_ENDPOINT because it's a piece of information that all SPs need to use. How is this *any* different than, let's say, CSI_MODE? Yes, it's an implementation detail, but so is CSI_ENDPOINT! The CSI spec should absolutely define CSI_MODE as a standard environment variable that all SPs should use in order to determine the desired mode of operation. That way there is an agreed upon mechanic, just like CSI_ENDPOINT, that COs/admins can use to configure an SP and an SP looks for to determine its desired mode of operation. CSI_MODE The environment variable CSI_MODE defines the configured service model of the SP process: Value Behavior <not_set> Controller+Node+Identity <set_and_empty> Controller+Node+Identity controller Controller+Identity node Node+Identity — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#171 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/ACPVLJ87yy-2loW5XxwwMCJ-M-4nwQCAks5tS-2ggaJpZM4Rrb_w> .

childsb · 2018-02-09T18:53:11Z

I know this is ugly........... But the common way to solve most of the issues discussed here is by including the API version in the function names... such as:

ControllerProbeV1(...), ControllerProbeV2(...)

akutz · 2018-02-09T19:01:52Z

Hi @childsb,

You're right, it's ugly :) The issue is that Go prevents multiple gRPC bindings, and I don't think the version in the RPC name is a tenable solution either.

akutz · 2018-02-09T19:02:50Z

Hi @jdef,

Then either remove CSI_ENDPOINT or acknowledge that CSI_MODE is as universal and define that as well. Both are global configuration properties relevant to all SPs and should be part of the spec. If not you will see SPs with wildly varied configuration property names for the same types of circumstances.

childsb · 2018-02-09T19:12:10Z

You're right, it's ugly :) The issue is that Go prevents multiple gRPC bindings, and I don't think the version in the RPC name is a tenable solution either.

how does multiple RPC bindings play into this? If the endpoint is versioned to the spec, it would bind to the endpoint which it supports..

Then either remove CSI_ENDPOINT or acknowledge that CSI_MODE is as universal and define that as well. Both are global configuration properties relevant to all SPs and should be part of the spec. If not you will see SPs with wildly varied configuration property names for the same types of circumstances.

this is a deployment/packaging issue.. while it could be standardized it, deployment & packaging is listed as out-of-scope for the CSI spec..

akutz · 2018-02-09T19:49:46Z

Hi @childsb,

The reason multiple bindings are an issue is to do with the way Go's gRPC implementation registers types (the relevant CSI type registration). The proto.RegisterType function used to, and plans to again, panic when multiple types are registered with the same type name.

What could happen is if we include the API version in the proto's defined namespace so that all of the CSI proto messages include the version in the proto type namespace. For example, here is what it is now:

syntax = "proto3";
package csi;

We can change it to:

syntax = "proto3";
package csi.v020;
option go_package="csi";
option java_package="csi";
option csharp_namespace="csi";

This way the CSI message types would change from /csi/CreateVolumeRequest to /csi/v020/CreateVolumeRequest. For example:

proto.RegisterType((*GetSupportedVersionsRequest)(nil), "csi.v020.GetSupportedVersionsRequest")
proto.RegisterType((*GetSupportedVersionsResponse)(nil), "csi.v020.GetSupportedVersionsResponse")
proto.RegisterType((*Version)(nil), "csi.v020.Version")
proto.RegisterType((*GetPluginInfoRequest)(nil), "csi.v020.GetPluginInfoRequest")
proto.RegisterType((*GetPluginInfoResponse)(nil), "csi.v020.GetPluginInfoResponse")
proto.RegisterType((*CreateVolumeRequest)(nil), "csi.v020.CreateVolumeRequest")
proto.RegisterType((*CreateVolumeResponse)(nil), "csi.v020.CreateVolumeResponse")

Anyway, I'm on board with the idea of including the API version in the package namespace so that types registered with protobuf are isolated in their own versioned namespace.

cpuguy83 · 2018-02-09T20:14:16Z

We have to be able to make changes to the spec that are going to be breaking.

akutz · 2018-02-09T20:20:05Z

Hi @cpuguy83,

I agree. I disagree that many of these changes are necessary or even useful. It seems many of them are due to preference, not necessity.

akutz · 2018-02-09T20:20:50Z

Hi @cpuguy83,

We have to be able to make changes to the spec that are going to be breaking.

I'm also confused how you arrive at this remark when my previous comment was aligned with how to do just what you suggested in a safe manner by versioning the CSI protobuf message namespace.

cpuguy83 · 2018-02-09T20:38:58Z

@akutz It's a huge complication in order to support 0.1 plugins going forward.

akutz · 2018-02-09T20:41:04Z

Hi @cpuguy83,

That's fine, except the spec also calls out multiple supported versions. I'm suggesting an approach to be flexible. You're suggesting dropping backwards support. I don't care if we do, but I don't see what one has to do with the other. I'm just trying to propose a potential means of supporting multiple API versions in the bindings.

cpuguy83 · 2018-02-09T20:45:37Z

@akutz Aren't the protos backwards compatible by themselves (once we hit stable and stop moving things around)?

childsb · 2018-02-09T20:47:12Z

I'm not familiar enough with gRPC to know if @akutz suggestion is standard practice, but it seems legit....I do like the idea of allowing for breaking change AND keeping early API..

@akutz Aren't the protos backwards compatible by themselves (once we hit stable and stop moving things around)?

@cpuguy83 wasn't your earlier point that we will never be stable and should expect to always move things around?

cpuguy83 · 2018-02-09T20:48:42Z

Typically you'd version the RPC service itself, but versioning the types tends to be problematic.
Once types are stable, you just don't do things to break backwards-compat... e.g. don't change the field numbers around.

akutz · 2018-02-09T20:50:16Z

Hi @cpuguy83,

@akutz Aren't the protos backwards compatible by themselves (once we hit stable and stop moving things around)?

I don't have faith that things will stop moving around sufficiently for it to ever be backwards compatible due to the way Go and gRPC handle message type registration. Working with Go means that there is no room for any error if the bindings share a type namespace. My proposal is simple, causes no real issues, and enables support for multiple CSI API bindings in a single Go process.

akutz · 2018-02-09T20:51:51Z

Hi @cpuguy83,

Once types are stable, you just don't do things to break backwards-compat... e.g. don't change the field numbers around.

Is that accurate though?. SemVer allows for breaking changes between major versions. So it's perfectly reasonable for you or anyone to recommend a breaking change between 1.0 and 2.0. All I'm suggesting is to make it possible for Go devs to handle that much more gracefully than requiring separate binaries that support different, major CSI versions.

akutz · 2018-02-09T20:54:02Z

Hi @cpuguy83 / @childsb,

How about instead of versioning the API namespace for each iteration of MAJOR, MINOR, or PATCH we version it only for major changes? For example:

csi.v1xx.GetSupportedVersionsRequest
csi.v2xx.GetSupportedVersionsRequest
etc.

That would reduce the number of namespaces and also enable in-proc support for backwards compatibility across major versions.

akutz · 2018-02-09T20:58:21Z

Hi @childsb,

I'm not familiar enough with gRPC to know if @akutz suggestion is standard practice,

I am not either. I do know enough about gRPC to know that breaking changes breaks backwards compatibility if you change message names or field order. How many times have message names changed already? I am just trying to open people up to the idea of a way of allowing for that without breaking support across the board for languages like Go that make supporting multiple API versions problematic.

cpuguy83 · 2018-02-09T21:30:23Z

@akutz
Yes, this is generally what I would expect.
See https://github.com/containerd/containerd/tree/master/api/services/containers/v1 as an example.
The RPC, and the messages which are specific to those RPCs are under a v1 namespace.
A breaking change would mean making a v2 service.

Sometimes there are common types, in this case it ends up not really making sense to version them since they are consumed by multiple services.

akutz · 2018-02-09T21:31:57Z

Hi @cpuguy83,

Okay then. Can we then reach a compromised agreement where the proto namespace will have vMAJOR in it to allow for multiple, in-proc MAJOR versions of the API bindings? There's no need for multiple Go package names even.

childsb · 2018-02-09T21:48:25Z

Given this version scheme, does it make sense to consider the current spec work a true V2, or should it be a V1.1

akutz · 2018-02-09T21:50:34Z

Hi @childsb,

Until the spec is 1.0.0 all MAJOR changes must occur by virtue of incrementing the otherwise MINOR field since there is no real PATCH field at the moment.

chakri-nelluri · 2018-02-10T01:43:05Z

Thanks Guys. Couple of quick thoughts I had when I opened the issue:

Probe - What it the role of the probe?

1st Argument:
- It is just a health check and can be invoked as many times as possible. I like the idea, but would prefer to rename it. We should clarify the sequence of events after a probe failure though. Do we expect to query identity service and reestablish the identity on probe failure? If that is the case, given we have a stateless model, why can't probe piggyback some identity information all the time?
- Probe can fail for multiple reasons including timeouts & live upgrades etc.
- If a plugin fails to initialize a particular service, the failure reason can be included as part of the probe response.

2nd Argument:
- Probe is used to initialize the plugin and I agree there could be a use case for it. But, what it is the initial state of the plugin and what services it will listen to.
Option 1: It will only start identity service initially and will start controller service on the controller probe request and will start node service on node probe request. If this the case, then controller probe & node probe have to be part of Identity service, if not, it is a chicken and egg problem.
Option 2: It will start identity, controller & node services at start, but it will init controller & node services only after corresponding probe request. I do not completely like this. Ideally no services should be advertised, unless a plugin is expected to play that role.

childsb · 2018-02-12T20:09:28Z

If we really want the CO to provide an init call to the plugin, i think we should call it init(...) or something that implies such (though packaging and deployment is a NON goal in the SPEC)... Probe to me is always a status check and not a bootstrap call to action.

akutz · 2018-02-13T17:52:14Z

Hi @childsb,

I disagree. I think probe can be both a liveness and bootstrap check. An SP should be able to, via any RPC, indicate that a probe is required. That would enable both scenarios. Today the spec does say explicitly that probe calls should be invoked ahead of any other RPCs:

The CO SHALL invoke this RPC prior to any other controller service RPC in order to allow the CO to determine the readiness of the controller service.

That must imply when after an SP process starts or as written is useless. Thus today the spec does indicate that probe is both liveness and in effect an init call. I disagree with that language, however, as it isn't being given any initial configuration. It's just indicating to the process to prepare itself.

childsb · 2018-02-13T18:05:29Z

Sure, we can conflate init and probe, but whats the benefit @akutz ?

akutz · 2018-02-13T18:09:03Z

Hi @childsb,

Because Init implies that it occurs once, and there's no good way for the CO to know if the SP process has restarted. Short of the SP process refusing to start due to failed host dependencies or missing configuration, an SP should be able to indicate that a probe is required in order to respond to the CO with the error details about the same failed preconditions. I prefer this scenario since an SP failing to start will result in the error buried in logs, unable to be given to the CO admin as easily as the result of a probe call.

I believe myself and @codenrhoden outlined all of this in our comments above:

Please take another look at those comments for detailed reasons related to your question.

childsb · 2018-02-13T18:31:40Z

I think the point I was trying to make is that probe is a readiness check from API POV, and thats it.. if a volume plugin uses it for something else (such as init) its at its own peril (and i'm not saying lets add init to the spec)

Why the distinction? if the CO acknowledges it as a bootstrap process it implies some sort of recovery handling from the CO. If its only readiness check the responsibility is on the CSI driver to resolve its own state/readiness if it does use probe(..) to init.

codenrhoden · 2018-02-13T18:35:50Z

I never took Probe for any kind of a health/liveness check, as has been suggested in the last week or so. The reason for that is twofold:

The spec calls out that Probe can be an expensive call for some storage providers, and therefore is not something you want to call repeatedly
The spec calls out that Probe is a gating call for all other Controller and Node RPCs. Right now there is no reliable way to inform a CO that it is necessary. We want to make sure that a CO only calls it when needed, and then leaves the SP along until it is restarted/upgraded.

Is there agreement on that?

childsb · 2018-02-13T18:43:17Z

Probe should be idempotent
Probe could be called at any time, for any reason
Ordering of probe and other calls can not be guaranteed by the CO (nor dictated to the CO)

Many plugins will probably choose probe(...) to start bootstrap, but dictating the behavior as API is problematic.

codenrhoden · 2018-02-13T18:48:04Z

Probe should be idempotent

sure

Probe could be called at any time, for any reason

sure

Ordering of probe and other calls can not be guaranteed by the CO (nor dictated to the CO)

sure. However, the spec requires that a probe comes first. All we're asking for is the ability for an SP to reliably communicate to the CO that a probe hasn't happened yet, so that it can send one.

childsb · 2018-02-13T18:54:52Z

sure. However, the spec requires that a probe comes first. All we're asking for is the ability for an SP to reliably communicate to the CO that a probe hasn't happened yet, so that it can send one

Sorry i'm slow... Why does the CO HAVE to re-probe on request? Can't the plugin probe itself if it needs it?

akutz · 2018-02-13T18:56:28Z

Hi @childsb,

Because the SP can either fail to start with the error as to why or probe itself ahead of every RPC and return some error to the CO that the spec does not define and the CO may not understand.

codenrhoden · 2018-02-13T19:00:26Z

If a plugin is going to probe itself, why have the RPC defined at all?

From the spec: The CO SHALL invoke this RPC prior to any other controller service RPC in order to allow the CO to determine the readiness of the controller service. A CO MAY invoke this call multiple times with the understanding that a plugin's implementation MAY NOT be trivial and there MAY be overhead incurred by such repeated calls.

Could the plugin probe itself? Yes, it could. But then how do you inform the CO what is wrong? I would think the CO would be interested in the error details returned from a Probe call, not some error from a ListVolumes calls that says "hey, the plugin is not ready and here is why". Now the CO has to handle that on every call. In order to make it easier for a CO to handle that kind of error recovery, I think there should be a well-defined error that says that a Probe has to happen. Then the CO can gracefully handle an error from that RPC.

Unless the language of the spec changes to no longer say that a CO calling this RPC MUST happen first, I don't understand the pushback here. How do you suggest a CO go about meeting its requirement to first call a Probe?

childsb · 2018-02-13T19:39:05Z

i'm not sure that i agree (with the specs assertion) that probe should be called before any other request.. let me go back over it.

saad-ali · 2018-02-15T06:25:20Z

We should not collapse GetSupportedVersions. That's a very, very bad idea. The reason is because Go+gRPC and CSI do not offer true backward compatibility due to the way Go's gRPC implementation handles type registration. That being the case we absolutely need a simple, forever static means of obtaining the supported versions. We agreed early on that GetSupportedVersions would stay the same and not change, and it mustn't. Or else there cannot be a version-agnostic means of determining what API version is supported by a remote CSI endpoint. Because if the client is written in Go it cannot support multiple CSI versions due to the aforementioned reason. Not unless it forgoes language bindings altogether.

This is a good argument. I agree we should leave GetSupportedVersions(...) as is.

How is this any different than, let's say, CSI_MODE? Yes, it's an implementation detail, but so is CSI_ENDPOINT! The CSI spec should absolutely define CSI_MODE as a standard environment variable that all SPs should use in order to determine the desired mode of operation. That way there is an agreed upon mechanic, just like CSI_ENDPOINT, that COs/admins can use to configure an SP and an SP looks for to determine its desired mode of operation.

The spec specifically avoided specifying packaging, except in this case. And now that I reflect on it, defining the exact mechanism by which a plugin would discover initialization information (like socket path) in the spec is unnecessary and we probably shouldn't have done it. The spec should dictate that communication must occur via a unix domain socket, but it doesn't need to dictate that it must be set via a CSI_ENDPOINT environment variable. That is an implementation/deployment detail. It doesn't matter if one plugin chooses to receive this information via an environment variable and another chooses to receive it as a binary flag.

Additionally, CSI_MODE should be codified into the spec as a way for SPs to know both which services to register and serve as well as to know how to handle Probe requests. The fact is most developers will not likely create separate code bases for controller and node services. Thus devs will likely use an environment variable to handle the service mode, so why not simply standardize it like CSI_ENDPOINT. Please see my above comments for more details regarding CSI_MODE.

Agreed, we should modify the spec to dictate that a plugin must know (somehow) what mode it is running in (controller, node, or both). I disagree that the exact exact mechanism for how the plugin discovers its mode should be dictated. As stated above, I think we should loosen the existing env var requirements to allow the plugin to decide what mechanism it will choose to fulfill this requirement.

Re: Probe... I had a long chat with @codenrhoden, and he and I both agree, unless Probe is invoked before all other RPCs and every time the SP process restarts, then Probe is utterly useless. All RPCs should have a defined error code that SPs can return to indicate that Probe has not yet been called and that the CO should call probe. That's the only way Probe has any value.

@chakri-nelluri and I had this thought which led him to open this issue. Upon further discussion we realized that the value of Probe is as a mechanism to hook CO health probes in to (to allow the CO, if the plugin reports unhealthy, to do something to bring it back in to a healthy state).

The SP process should not be allowed to start at all if the configuration is invalid or there are missing host dependencies. This has the downside the the error may be buried in logs and not obvious to the CO admin.

Agreed. A SP should fail with non-0 exit code if pre-conditions are not met at initialization and spit out information in to stderr. This should be guidance (not a requirement and probably not part of the spec).

Whether it's through CSI_MODE or left as an implementation detail, I think we've agreed that a plugin should know what "mode" it is running in, and thus a unified Probe call in Identity makes sense.

Yes!

It is just a health check and can be invoked as many times as possible.

Yes!

I like the idea, but would prefer to rename it.

I'd prefer to just clarifying the definition instead of (or in addition to) renaming it.

We should clarify the sequence of events after a probe failure though. Do we expect to query identity service and reestablish the identity on probe failure?

No, this is unnecessary. The purpose is to check the health of this specific instance (deployment) of the plugin only. Only action should be what ever "rectification" logic that makes sense and is defined by the deployment (e.g. plugin reported unhealthy restart it).

If that is the case, given we have a stateless model, why can't probe piggyback some identity information all the time?

For the reasons @akutz listed above it makes sense to not combine Probe with GetSupportedVersion. We could consider combining it with GetPluginInfo() -- but since they serve two different purposes, I'm ok leaving them separate.

If we really want the CO to provide an init call to the plugin, i think we should call it init(...) or something that implies such (though packaging and deployment is a NON goal in the SPEC)... Probe to me is always a status check and not a bootstrap call to action.

No, Initialization configuration information should be passed in via packaging and deployment mechanisms, not via the interface.

Probe to me is always a status check and not a bootstrap call to action.

Yes!

However, during the call on Tuesday someone asked "do we expect that a probe is required after an SP restarts?". And the answer was no. I contend that if that is the case, Probe is useless. Right now the spec says of Probe: The CO SHALL invoke this RPC prior to any other controller service RPC in order to allow the CO to determine the readiness of the controller service. Similar language is in place for the Node service.

I never took Probe for any kind of a health/liveness check, as has been suggested in the last week or so. The reason for that is twofold:
The spec calls out that Probe can be an expensive call for some storage providers, and therefore is not something you want to call repeatedly
The spec calls out that Probe is a gating call for all other Controller and Node RPCs. Right now there is no reliable way to inform a CO that it is necessary. We want to make sure that a CO only calls it when needed, and then leaves the SP along until it is restarted/upgraded.
Is there agreement on that?

I think this wording is incorrect and part of what led to the confusion around the purpose of probe.

Probe should be idempotent
Probe could be called at any time, for any reason
Ordering of probe and other calls can not be guaranteed by the CO (nor dictated to the CO)

Yes! Yes! Yes!

However, the spec requires that a probe comes first. All we're asking for is the ability for an SP to reliably communicate to the CO that a probe hasn't happened yet, so that it can send one.
...
Unless the language of the spec changes to no longer say that a CO calling this RPC MUST happen first, I don't understand the pushback here. How do you suggest a CO go about meeting its requirement to first call a Probe?

That was a mistake. The spec should not require this. Probe should not be used as a "initialization signal".

Let's summarize, there are four distinct duties, and it is worth clarifying what component is responsible for them:

Initialization checks
- Upon initialization validate that all required pre-conditions are met (e.g. kernel modules present, binaries available, etc.)
  - Not the responsibility of Probe().
  - Should be left to the plugin: on start up the plugin runs checks and exists with non-0 exit code if pre-conditions fail spitting out information to stderr.
Initialization configuration information
- Plugin needs to know what mode, etc. it is running.
  - Not the responsibility of Probe() to set this or signal it in any way.
  - Initialization configuration information should be passed in via packaging and deployment mechanisms (env var, binary flags, etc.), not via the interface.
RPC pre-condition checks
- Verify that pre-conditions are met before a CSI RPC is issued.
  - Not the responsibility of Probe(). Any such check done by probe could be invalid by the time a subsequent CSI RPC call is issued.
  - Should be left to the plugin: on receiving a CSI RPC, if some pre-conditions are invalid, return a non-success gRPC error code.
Periodic health monitoring to allow CO to "try and fix things"
- This is the sole purpose of the Probe call and it should be re-worded and consolidated to make this clear.
- Example: If the plugin is deployed on Kubernetes, a Kubernetes liveness probe can be set up to issue Probe() calls periodically. On failure, Kubernetes would automatically restart the container. This could catch and fix, for example, a deadlock, where the plugin is running, but unable to make progress. It is up to the plugin to determine what should cause the Probe(...) call to fail.

How about instead of versioning the API namespace for each iteration of MAJOR, MINOR, or PATCH we version it only for major changes? For example:
csi.v1xx.GetSupportedVersionsRequest
csi.v2xx.GetSupportedVersionsRequest
etc.
That would reduce the number of namespaces and also enable in-proc support for backwards compatibility across major versions.

This is a neat idea. It reminds me of the way that Kubernetes API does versioning, and allows for alpha and beta releases in the future:

csi.v1beta1.GetSupportedVersionsRequest
csi.v2alpha1.GetSupportedVersionsRequest

We have had lots of discussion, I want to make sure we come to a conclusion and actually get something done.

Concrete proposal:

Add a requirement to dictate that a plugin must know (somehow) what mode it is running in (controller, node, or both).
Remove the requirement that end point must be set via a CSI_ENDPOINT environment variable (allow plugin to choose).
- The interface should not define packaging or deployment. Having implemented CSI in Kubernetes, I see no need for us to dictate that endpoint must be passed in via a specific env var.
Consolidate NodeProbe and ControllerProbe in to a single Probe call.
- spec: consolidate xxxProbe calls into Identity.Probe #190
Reword the definition of `Probe(..) to say something like:
a. The purpose of the probe call is to serve as a plugin health check. Failure is an indication that the plugin has encountered an error that requires restarting the plugin to fix. If no such condition exists, the call should simply returns success in all cases.
b. Probe can be called at any time, for any reason. i.e. it may be called periodically.
c. Ordering of probe and other calls is no longer dictated by the spec (i.e. remove the clause about Probe must be called before other calls).
Add vMAJOR version (e.g. csi.v1.) to either the API bindings or Go package names.

The code freeze for Kubernetes is coming up on Feb 26, and I'd like to get these changes in. So could you please chime in if you disagree ASAP.

cpuguy83 · 2018-02-15T15:44:48Z

Agreed with @saad-ali's proposal.
I would only make sure on point 1 to be explicit that the spec does not dictate how a plugin should find out it's mode and is left to the SP implementation to deal with it.

@saad-ali

This patch removes all of the SP implementation information previously detailed in the specification. With respect to issue container-storage-interface#171 and comments made by @saad-ali, @childsb, @jdef, and others, it has become apparent that the specification should not supply any suggestions or restrictions on matters related to an SP's packaging, deployment, or runtime configuration. Removed information includes, but not limited to: * `CSI_ENDPOINT` - SPs are now responsible for defining the manner by which they grok the gRPC endpoint on which they serve the CSI services. CO admins are now responsible for deploying and configuring SPs using that SP's documentation. * Env var names - The spec no longer claims ownership of the `CSI_` prefix for environment variables. * Permissions - The spec no longer asserts any recommendations with respect to container permissions that an SP might require. * Logging - All logging details are the responsibility of the SP and may vary SP to SP. The removed information may very well be suitable to live in a separate document that provides guidelines for SP creation and configuration, such as common environment variable names, logging and packaging details, etc. cc @codenrhoden

@saad-ali

This patch removes all of the SP implementation information previously detailed in the specification. With respect to issue container-storage-interface#171 and comments made by @saad-ali, @childsb, @jdef, and others, it has become apparent that the specification should not supply any suggestions or restrictions on matters related to an SP's packaging, deployment, or runtime configuration. Removed information includes, but not limited to: * `CSI_ENDPOINT` - SPs are now responsible for defining the manner by which they grok the gRPC endpoint on which they serve the CSI services. CO admins are now responsible for deploying and configuring SPs using that SP's documentation. * Env var names - The spec no longer claims ownership of the `CSI_` prefix for environment variables. * Permissions - The spec no longer asserts any recommendations with respect to container permissions that an SP might require. * Logging - All logging details are the responsibility of the SP and may vary SP to SP. The removed information may very well be suitable to live in a separate document that provides guidelines for SP creation and configuration, such as common environment variable names, logging and packaging details, etc. cc @codenrhoden

saad-ali · 2018-02-16T01:21:09Z

I spoke with @jieyu about the proposal above. While he agrees with not specifying the mechanism by which a plugin will discover mode (proposal 1), he explained that Mesos needs CSI_ENDPOINT (proposal 2) because unlike Kubernetes, Mesos manages the deployment of CSI plugins and depends on this behavior.

To accommodate that use case we agreed on a revised proposal:

Concrete proposal:

Add a requirement to dictate that a plugin must know (somehow) what mode it is running in (controller, node, or both).
- @jdef will add this to spec: consolidate xxxProbe calls into Identity.Probe #190
~~Remove the requirement that end point must be set via a CSI_ENDPOINT environment variable (allow plugin to choose).~~
- Dropping this due to the reason listed above.
Consolidate NodeProbe and ControllerProbe in to a single Probe call.
- spec: consolidate xxxProbe calls into Identity.Probe #190
Reword the definition of `Probe(..) to say something like:
- @jdef will add this to spec: consolidate xxxProbe calls into Identity.Probe #190
  a. The purpose of the probe call is to serve as a plugin health check. Failure is an indication that the plugin has encountered an error that requires restarting the plugin to fix. If no such condition exists, the call should simply returns success in all cases.
  b. Probe can be called at any time, for any reason. i.e. it may be called periodically.
  c. Ordering of probe and other calls is no longer dictated by the spec (i.e. remove the clause about Probe must be called before other calls).
Add vMAJOR version (e.g. csi.v1.) to either the API bindings or Go package names.
- We will decouple this and track it independently: Consider adding vMAJOR version (e.g. csi.v1.) to either the API bindings or package names #193

chakri-nelluri · 2018-02-17T06:58:24Z

@saad-ali Shall we use the standard GRPC health checking model: https://github.com/grpc/grpc/blob/master/doc/health-checking.md

chakri-nelluri · 2018-02-17T11:35:59Z

@akutz @saad-ali @childsb Is the versioning going to be at a service level or at an API level inside the same service? Lets clarify. I have seen some people doing at service level rather than at api level. Ex: https://developers.google.com/assistant/sdk/reference/rpc/

saad-ali · 2018-02-21T23:05:37Z

@saad-ali Shall we use the standard GRPC health checking model: https://github.com/grpc/grpc/blob/master/doc/health-checking.md

Oh nice. So far we've not imported any external protos.

saad-ali · 2018-02-21T23:06:58Z

@akutz @saad-ali @childsb Is the versioning going to be at a service level or at an API level inside the same service? Lets clarify. I have seen some people doing at service level rather than at api level. Ex: https://developers.google.com/assistant/sdk/reference/rpc/

Please review #195 and add comments there.

jdef · 2018-02-26T15:32:42Z

@chakri-nelluri @saad-ali something that I find interesting about the gRPC health check protocol is that there's an OPTIONAL service name in the request, which allows for (optionally) more granular, per-service level health checks. For example, if a plugin instance implemented controller and node services, and the Probe RPC supported a similar, optional service name in the request, then a CO could be configured to probe the health of the controller and node services independently. In the end, for CSI, I'm not convinced that this would be a huge win (yet) because the only thing a CO can do w/ an unhealthy plugin is to respawn it -- strategy is independent of which specific gRPC service is failing. Or am I missing something obvious?

Anyway, since we've simplified the API, we can always add an optional service name to the Probe request at some later point, if needed, without breaking the API.

saad-ali added this to the v0.2 milestone Jan 24, 2018

jdef mentioned this issue Feb 5, 2018

Capabilities and Probe RPCs for Identity service. #184

Closed

akutz mentioned this issue Feb 8, 2018

NodeStageVolume/NodeUnstageVolume #169

Merged

jdef mentioned this issue Feb 10, 2018

spec: consolidate xxxProbe calls into Identity.Probe #190

Merged

akutz mentioned this issue Feb 15, 2018

Remove all SP Implementation Info #192

Closed

saad-ali mentioned this issue Feb 16, 2018

Consider adding vMAJOR version (e.g. csi.v1.) to either the API bindings or package names #193

Closed

saad-ali closed this as completed in #190 Feb 21, 2018

jdef mentioned this issue Feb 26, 2018

ProbeController rpc on Controller service #102

Closed

Clarify Identity Service vs Probe request #171

Clarify Identity Service vs Probe request #171

Comments

chakri-nelluri commented Jan 24, 2018

saad-ali commented Feb 1, 2018

chakri-nelluri commented Feb 2, 2018

saad-ali commented Feb 8, 2018

akutz commented Feb 9, 2018 • edited Loading

akutz commented Feb 9, 2018 • edited Loading

CSI_MODE

akutz commented Feb 9, 2018

codenrhoden commented Feb 9, 2018

jdef commented Feb 9, 2018 via email

childsb commented Feb 9, 2018

akutz commented Feb 9, 2018

akutz commented Feb 9, 2018

childsb commented Feb 9, 2018

akutz commented Feb 9, 2018

cpuguy83 commented Feb 9, 2018

akutz commented Feb 9, 2018

akutz commented Feb 9, 2018

cpuguy83 commented Feb 9, 2018

akutz commented Feb 9, 2018

cpuguy83 commented Feb 9, 2018

childsb commented Feb 9, 2018 • edited Loading

cpuguy83 commented Feb 9, 2018

akutz commented Feb 9, 2018

akutz commented Feb 9, 2018

akutz commented Feb 9, 2018

akutz commented Feb 9, 2018

cpuguy83 commented Feb 9, 2018

akutz commented Feb 9, 2018

childsb commented Feb 9, 2018

akutz commented Feb 9, 2018 • edited Loading

chakri-nelluri commented Feb 10, 2018 • edited Loading

childsb commented Feb 12, 2018 • edited Loading

akutz commented Feb 13, 2018

childsb commented Feb 13, 2018

akutz commented Feb 13, 2018 • edited Loading

childsb commented Feb 13, 2018

codenrhoden commented Feb 13, 2018

childsb commented Feb 13, 2018

codenrhoden commented Feb 13, 2018

childsb commented Feb 13, 2018

akutz commented Feb 13, 2018

codenrhoden commented Feb 13, 2018

childsb commented Feb 13, 2018

saad-ali commented Feb 15, 2018 • edited Loading

cpuguy83 commented Feb 15, 2018

saad-ali commented Feb 16, 2018

chakri-nelluri commented Feb 17, 2018

chakri-nelluri commented Feb 17, 2018 • edited Loading

saad-ali commented Feb 21, 2018

saad-ali commented Feb 21, 2018

jdef commented Feb 26, 2018 • edited Loading

akutz commented Feb 9, 2018 •

edited

Loading

akutz commented Feb 9, 2018 •

edited

Loading

`CSI_MODE`

childsb commented Feb 9, 2018 •

edited

Loading

akutz commented Feb 9, 2018 •

edited

Loading

chakri-nelluri commented Feb 10, 2018 •

edited

Loading

childsb commented Feb 12, 2018 •

edited

Loading

akutz commented Feb 13, 2018 •

edited

Loading

saad-ali commented Feb 15, 2018 •

edited

Loading

chakri-nelluri commented Feb 17, 2018 •

edited

Loading

jdef commented Feb 26, 2018 •

edited

Loading