Have kubelet pass pod annotations directly to CNI plugins #69882
Comments
/sig network
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/remove-lifecycle stale
Multus would make use of this too (cc @dcbw), to figure out if it needs to create additional networks for a pod. And we could just pass the entire pod JSON. I think plenty of network plugins could make use of that. (HostPort pods are one thing in particular that the plugin might need to deal with that aren't labels or annotations.)
Do we have a priority for this? I can work on this if needed.
I mean, we're surviving just fine without this feature, so I guess it's not a high priority.
@danwinship Thanks for confirming my assumption |
Well, the discussion above is incomplete; no one ever decided exactly how the data would be passed, and it's not clear to me if you're talking about just passing the annotations or if you mean the entire pod json. So, if you want to describe what you're planning to implement before writing too much code, that might be good. |
I am also interested in helping with defining the requirements and the specific mechanisms. I thought passing the entire pod json is probably too much, but maybe we can pass pod labels matching a well-known pattern ("cni*"?) as args to libcni? Additionally, kubelet could call different CNI plugins (not just the same plugin with a different configuration) based on a well-known pod label. This would be useful for launching pods with different networking requirements on the same node. In any case, I'm interested in discussing and helping with the implementation. |
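To make the "labels matching a well-known pattern as args to libcni" idea concrete, here is a minimal kubelet-side sketch. It uses libcni's RuntimeConf (whose Args list is what plugins see as CNI_ARGS); the cni.example.com/ prefix and the helper name are purely illustrative, not an agreed convention.

```go
package main

import (
	"strings"

	"github.com/containernetworking/cni/libcni"
)

// buildRuntimeConf is a hypothetical helper: it copies pod labels that match
// an illustrative well-known prefix into the key/value Args list that libcni
// exposes to plugins as CNI_ARGS, alongside the metadata kubelet already sends.
func buildRuntimeConf(containerID, netnsPath, ifName, podNamespace, podName string, labels map[string]string) *libcni.RuntimeConf {
	args := [][2]string{
		{"K8S_POD_NAMESPACE", podNamespace},
		{"K8S_POD_NAME", podName},
	}
	for k, v := range labels {
		if strings.HasPrefix(k, "cni.example.com/") { // illustrative prefix
			args = append(args, [2]string{k, v})
		}
	}
	return &libcni.RuntimeConf{
		ContainerID: containerID,
		NetNS:       netnsPath,
		IfName:      ifName,
		Args:        args,
	}
}
```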
Thanks @danwinship and @ofiliz
|
No, there are already other annotations that don't match that naming convention that plugins want access to. E.g., the Network Attachment annotation or the pod bandwidth annotations. You should pass all of the annotations.
Thanks for giving more context. I will make a PR passing all the annotations. |
@danwinship @kmala I don't think we need to pass annotations to CNI plugins, because it's the CRI's job to translate the pod spec or annotations into the CNI RuntimeConfig.
I am worried about the amount of information we pass into CNI (and how up to date it is at the time of the CNI call). I think we should keep the amount of data we pass minimal. The above represents a challenge to CNI developers because CNI is currently stateless, so data will have to be re-read as needed.
@mars1024 This is about CNI plugins that are specifically interested in Kubernetes-specific things. There are already other bits of kubernetes-specific information that get passed to plugins (eg, the namespace of the pod). This would just be one more. CNI plugins that want to be container-platform-agnostic would just ignore the kubernetes-specific information.

@khenidak No, we would not pass updates to the CNI plugin. Plugins that want to track the state of a Pod after creation time would have to continue doing that on their own, just like they do now. The goal here is just to simplify things for CNI plugins that need the pod annotations (or more) at pod creation time.
@danwinship I understand your point, thank you. The point I was trying to make: today we will add annotations because they are needed. Later some CNI plugins will come to depend on that data, and the surface we have to maintain keeps growing.
@khenidak @kmala @danwinship we've periodically discussed the issue in SIG Network in the past. And I think the reasons not to pass annotations still hold. Those are roughly that:

1. it adds complexity to the kubelet/CRI/CNI interface
2. it increases the size of, and the processing required for, the data passed on every CNI call
3. plugins that need these annotations already have a KubeClient anyway
Basically we drew a line between plugins that are pretty dumb (e.g. flannel, bridge, kubenet, etc.) and plugins that are more capable and already use KubeClients. And decided that it wasn't worth the downsides (complexity, size, processing, etc.) to put annotations and/or other stuff into the CNI plugin. I don't think I've seen compelling arguments to revisit that decision :(
|
@dcbw I think while your point 3 above, "plugins that need these annotations already have a KubeClient anyway" is certainly true, this is only because we provide no other way for such plugins to achieve their goals. In complex Kubernetes networking setups there is now an explosion of daemons running on each node for this very reason. Common examples are daemons to configure sidecar proxies and network policy agents. These daemons consume node resources and increase deployment complexity. Every watch and get on API server also causes node scalability problems. Another related problem we are solving here is not just what config to pass to CNI plugins, but also which CNI plugins to call. Existing design forces all plugins to be called for all pods. The common solution to that problem is to use a delegating CNI plugin, but because those plugins have no in-context information to decide which plugin(s) to call, they also have to fetch pod specs, inflating the problem. |
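To illustrate the "which CNI plugins to call" half of that problem: a delegating meta-plugin essentially has to make a decision like the sketch below, and today it can only do so after fetching the pod spec itself. The annotation key here is hypothetical.

```go
// selectDelegate is an illustrative decision a delegating (meta) CNI plugin
// has to make; with annotations passed in-band it could read them directly
// from its stdin/config instead of querying the API server first.
func selectDelegate(annotations map[string]string) string {
	if net, ok := annotations["networks.example.com/selection"]; ok { // hypothetical key
		return net // name of the delegate CNI configuration to invoke
	}
	return "default"
}
```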
But most of them are still going to need the KubeClient even if we provide them the annotations. E.g., if the pod has the network attachment annotation, then that will point to a CRD which the delegating plugin will also need to look at. So it's still going to need a KubeClient anyway.
I understand that some plugins will/might still need to call the KubeClient, which we may not be able to solve, but there are still multiple use cases, like custom firewall and network CNI plugins, which only need the metadata of the pod; this feature would help them a lot.
I'd avoid tying in pod annotations via prefixes like cni.k8s.io/*.
The prefix is just an example and I am okay with changing it depending on what everyone agrees on, but the important point I want to make is that the CNI spec already supports passing of metadata by the runtimes, and instead of asking the plugins to get the information by talking to the apiserver all the time, it only makes sense to pass generic metadata to the CNI plugins. Since we already pass metadata like the Kubernetes namespace and pod name, I think we can pass more metadata for the CNI plugins to use.
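For reference, the metadata mentioned here already reaches plugins through the CNI_ARGS environment variable (keys like IgnoreUnknown, K8S_POD_NAMESPACE, K8S_POD_NAME, K8S_POD_INFRA_CONTAINER_ID). A minimal plugin-side sketch of reading it, with error handling omitted:

```go
package main

import (
	"os"
	"strings"
)

// parseCNIArgs splits the semicolon-separated CNI_ARGS environment variable
// ("IgnoreUnknown=1;K8S_POD_NAMESPACE=default;K8S_POD_NAME=foo;...") into a map.
func parseCNIArgs() map[string]string {
	out := map[string]string{}
	for _, kv := range strings.Split(os.Getenv("CNI_ARGS"), ";") {
		parts := strings.SplitN(kv, "=", 2)
		if len(parts) == 2 {
			out[parts[0]] = parts[1]
		}
	}
	return out
}
```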
For dynamic selection of network (e.g. Multus-style), I don't see any way we can avoid a Kube-client in the near future, mostly because of status reporting. Some day we should probably standardize on a way to get multiple networks in the CRI, but I don't think that's going to happen now.

However, I'm more sympathetic to allowing per-pod tunables. There is some precedent for this with the special bandwidth annotation, which is directly translated into a CNI capability argument. However, not all possible parameters should be capability args. When CNI was originally conceived, it was definitely expected that end-users would be able to pass runtime arguments directly to the CNI plugins; the intent was that the user could supply extra key/value pairs that the runtime passes straight through as CNI args. So I would be in favor of adding a well-known mechanism (such as an annotation) for passing per-pod args through to the plugins.
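For context on the bandwidth precedent: the kubelet-side handling looks roughly like the sketch below, turning the kubernetes.io/ingress-bandwidth and kubernetes.io/egress-bandwidth annotations into a capability argument that a plugin declaring the "bandwidth" capability receives under runtimeConfig. The helper itself is illustrative and simplified (the real code parses the values as resource quantities).

```go
// bandwidthCapabilityArgs sketches how the existing bandwidth annotations are
// surfaced to plugins as a CNI capability argument rather than via CNI_ARGS.
// Real kubelet code parses the values into bits/sec; here they are passed raw.
func bandwidthCapabilityArgs(annotations map[string]string) map[string]interface{} {
	bw := map[string]interface{}{}
	if v, ok := annotations["kubernetes.io/ingress-bandwidth"]; ok {
		bw["ingressRate"] = v
	}
	if v, ok := annotations["kubernetes.io/egress-bandwidth"]; ok {
		bw["egressRate"] = v
	}
	if len(bw) == 0 {
		return nil
	}
	// Merged by libcni into the plugin's config as runtimeConfig.bandwidth.
	return map[string]interface{}{"bandwidth": bw}
}
```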
@danwinship "But most of them are still going to need to need the KubeClient" I don't think we have any data to support that claim. I can speak for EKS and the scenarios we are considering are all solvable without a kubeclient. I agree with the annotations (not the whole podspec) as CNI args approach, or the cni.k8s.io prefixed annotations. With a tactical 20-line change here we can fix the majority of common use cases. The remaining cases that require kubeclient can continue to do so. |
OK, as discussed here and in the corresponding PR, there are issues with this idea/implementation: exactly what gets passed (annotations only, labels, or the whole pod), how it flows through the CRI rather than straight from the kubelet, and how stale the data may be by the time the plugin sees it.
If we are going to do this, it needs a KEP so we can hash all of that out (and get input from CRI people as well). |
I've been watching this issue since I would have liked our CNI not to watch Pods, but we already had a kubeclient to manage CRDs representing our container links, so I was in agreement with @ofiliz. Now I'm changing our CNI for a nested environment, e.g. a Pod of a Kubernetes running in VMs whose interfaces are managed by another Kubernetes. This inner CNI would have to talk to two API servers: the inner Kubernetes for Pods and the upper Kubernetes for the Links of the VMs. If I had Pod annotations from kubelet, I could make the CNI talk to just the upper Kubernetes. It's still a narrow situation and relates to our design decisions, so probably not a requirement; I just wanted to share a case where a kubeclient is not really needed and some Pod information can suffice. Still, how much information is needed is a subject of debate.
Related issue: #84248 (node local apicache) This seems like a strong use case for a node-local API. Basically, address the problems with CNI plugins needing to watch the apiserver, without needing to load everything into the CNI API. |
Is this a BUG REPORT or FEATURE REQUEST?:
/kind feature
What happened:
Currently, if a CNI network plugin needs access to annotations that are configured on a pod, it needs to do a GET/read of the pod manifest from the Kubernetes API server, based on the pod's name.
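A minimal sketch of what that looks like today on the plugin side, using client-go (recent versions) and the pod namespace/name that kubelet already passes via CNI_ARGS; the kubeconfig path is illustrative:

```go
package main

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

// podAnnotations fetches the pod named in CNI_ARGS and returns its annotations.
// This round trip to the API server is exactly what this issue proposes to avoid.
func podAnnotations(namespace, name string) (map[string]string, error) {
	cfg, err := clientcmd.BuildConfigFromFlags("", "/etc/cni/net.d/kubeconfig") // illustrative path
	if err != nil {
		return nil, err
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		return nil, err
	}
	pod, err := client.CoreV1().Pods(namespace).Get(context.TODO(), name, metav1.GetOptions{})
	if err != nil {
		return nil, err
	}
	return pod.Annotations, nil
}
```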
What you expected to happen:
With the introduction of CNI release 0.4.0 (which supports CNI API version 0.2.0), it should be possible to make changes to kubelet such that pod annotations (and possibly labels) can be dynamically inserted into the CNI network configuration that is passed to CNI plugin binaries via stdin. In many cases, this would eliminate the need for CNI plugins to perform Kubernetes API reads of pod manifests.
This can be achieved by using the new args field in the CNI network configuration. This field is intended to allow runtimes (e.g. kubelet) to dynamically insert per-container (per-pod, in the case of Kubernetes) configuration when a container/pod is being created.
The CNI conventions for the args field currently suggest the use of a "labels" field (a list of key/value objects under args) to convey mappings to the plugin.
It's not clear whether this "labels" field should be used ("overloaded") for Kubernetes pod annotations... maybe the CNI maintainers would prefer that we create a new "annotations" field in parallel with "labels".
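Whichever field name is chosen, the mechanics on the kubelet side could look something like this sketch, which uses libcni's InjectConf to add an args section before invoking the plugin. The "annotations" key is hypothetical, pending whatever the CNI maintainers prefer:

```go
package main

import (
	"github.com/containernetworking/cni/libcni"
)

// withPodArgs returns a copy of the network configuration with the pod's
// annotations injected under args, so the plugin receives them on stdin.
// The field name is a placeholder, not an agreed-upon convention.
func withPodArgs(conf *libcni.NetworkConfig, annotations map[string]string) (*libcni.NetworkConfig, error) {
	return libcni.InjectConf(conf, map[string]interface{}{
		"args": map[string]interface{}{
			"annotations": annotations,
		},
	})
}
```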
The use of this feature would also require the use of CNI version 0.4.0 binaries, so we should consider bumping up the Kubernetes reference version for CNI from 0.3.0.
How to reproduce it (as minimally and precisely as possible):
Refer to code here. This is approximately where changes would be needed.
Anything else we need to know?:
Use of this feature will require CNI versions 0.4.0 or newer. The Kubernetes reference version for CNI should be bumped from 0.3.0 to 0.4.0 at some point.