I wrote a little bit about how to do deployment updates without serving errors in #Kubernetes. Since I don't have a blog, I'll just tweet. You have some pods, and maybe a load-balancer, too. You want to update the deployment. How to do this "best" in K8s today? 1/
You want to 1) start sending traffic to new pods; 2) stop sending new traffic to old (being deleted); 3) allow existing traffic to old pods to finish (drain). I'll assume you set Deployment `maxSurge` and `maxUnavailable` to allow at least 1 new pod before removing old ones. 2/
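In Deployment YAML, the surge settings from 2/ look roughly like this (a minimal sketch — `my-app` and the image tag are placeholders, and 3 replicas is just an example):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app              # placeholder name
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1           # create at least 1 new pod first...
      maxUnavailable: 0     # ...before any old pod is removed
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
      - name: app
        image: my-app:v2    # placeholder image
```

With `maxUnavailable: 0`, an old pod is only deleted after a new pod is Ready, which is the precondition the rest of this thread assumes.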
So now that you have at least 1 new pod ready, it's time to cull at least 1 old pod. Remember, this is Kubernetes, so it's almost all asynchronous. When you DELETE a pod, multiple things happen roughly at the same time, but really it's in whatever order is worst for you. 3/
1) Kubelet observes the pod's `deletionTimestamp` being set, and starts the end-of-life process. Usually this means sending SIGTERM, but it could also involve a `preStop` handler of your choice. 4/
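As a pod-spec fragment, here's where the grace period and the `preStop` handler from 4/ live (values are examples, and the drain script path is hypothetical):

```yaml
# Pod spec fragment (not a full manifest; values are examples).
spec:
  terminationGracePeriodSeconds: 60   # the wall clock that starts ticking
                                      # once deletionTimestamp is set
  containers:
  - name: app
    image: my-app:v2                  # placeholder
    lifecycle:
      preStop:
        exec:
          command: ["/bin/drain.sh"]  # hypothetical script; the kubelet runs
                                      # this before sending SIGTERM
```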
2) Endpoints are updated to either remove the pod or mark it unready (depending on Endpoints vs. EndpointSlice). This itself causes another round of async things to happen. 5/
3) The kube-proxy (or equivalent) on each node observes the endpoints changing, and programs the data-plane. Usually this means "don't send any more connections", but the implementations vary quite a lot. 6/
4) External things which track endpoints (like load-balancer NEGs on GCP) are updated to stop sending new connections. These things tend to be much slower than in-cluster data-plane programming, and there can be an arbitrary number of them. 7/
I feel the need to reiterate - all of these happen completely async to almost all the others (there's some causality in there but not much). There is very little "before" or "after" that is guaranteed. 8/
Somewhere in there, your pod receives a SIGTERM. The "obvious" thing to do is stop accepting new connections, right? Yeah, don't do that. You might get that SIGTERM _before_ kube-proxies and/or LBs are updated! They could still be sending you traffic, and you will reject it. 9/
While Kubernetes has "finalizers", we do not distinguish "delete requested" from "delete acknowledged by all parties", only "delete started" and "delete finished". Once `deletionTimestamp` is set, a (wall) clock is ticking. 10/
You want to drain existing connections, but need to still accept new ones! Proxies/LBs will (usually) remove the pod "soon", and new traffic will tail off. Existing connections can complete within the pod's grace period (you did set that, right?) and then your pod will die. 11/
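The app-side behavior from 9/-11/ can be sketched like this in Python (stdlib only; the 5-second propagation budget and the port are assumptions — Kubernetes guarantees neither):

```python
import signal
import threading
import time
from http.server import HTTPServer, BaseHTTPRequestHandler

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b"ok")

def install_graceful_sigterm(server, propagation_delay=5.0):
    """On SIGTERM: keep accepting (proxies/LBs may still route to us),
    wait for endpoint updates to propagate, then stop the accept loop.
    In-flight requests on already-accepted connections finish normally."""
    def on_sigterm(signum, frame):
        def drain():
            time.sleep(propagation_delay)  # assumed propagation budget;
                                           # keep it under the grace period
            server.shutdown()              # stops serve_forever()
        threading.Thread(target=drain, daemon=True).start()
    signal.signal(signal.SIGTERM, on_sigterm)

if __name__ == "__main__":
    server = HTTPServer(("0.0.0.0", 8080), Handler)
    install_graceful_sigterm(server)
    server.serve_forever()   # returns only after shutdown()
    server.server_close()
```

The key design point is the first line of the handler's comment: SIGTERM does *not* close the listening socket. The process only stops accepting after the (assumed) propagation delay, and it must exit before `terminationGracePeriodSeconds` runs out or it gets SIGKILLed.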
I know it's not super satisfying. We want a deterministic answer, but I hope you can now reason through why this is hard. You can explore more by emulating the "better" process. 12/
Before you delete the pod, change its labels so it no longer matches the service. Now wait for _all_ the nodes to update their iptables/ebpf/ipvs/... A lone straggler means you have to wait. I hope none of your nodes are crashed - that could take a while. 13/
Also wait for all external observers to "finish". This can take arbitrarily long and there can be an arbitrary number of them. Also, you might not know about all of them, but I'm reasonable, so let's assume you do. 14/
Only once that is _all_ done can you actually delete the pod. If you HAVE TO stop accepting on SIGTERM (e.g. using a 3rd party thing - nginx, I'm looking at you), you can use a "sleep" in `lifecycle.preStop`. To borrow a phrase - it's janky, but it's battle hardened jank. 15/
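The `preStop` sleep trick from 15/ looks like this (a fragment; the timings are examples you'd tune, and the sleep must fit inside the grace period):

```yaml
# Pod spec fragment: delay SIGTERM so proxies/LBs can update first.
spec:
  terminationGracePeriodSeconds: 60   # must cover the sleep + drain time
  containers:
  - name: nginx
    image: nginx:1.25                 # example of a server that stops
                                      # accepting as soon as it sees SIGTERM
    lifecycle:
      preStop:
        exec:
          command: ["sleep", "15"]    # buy time for endpoint propagation
                                      # before the kubelet sends SIGTERM
```

Newer Kubernetes releases also add a built-in `sleep` lifecycle action, so the container image doesn't need a `sleep` binary.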
TL;DR: on SIGTERM, do NOT stop accepting new connections - keep accepting while you drain existing ones, and rely on the grace period to bound it. If you truly must stop on SIGTERM, delay the signal with a `preStop` sleep. 16/
@thockin how about a new deploy2.0 with an "additional" pod selector ver=v2.0; start this deploy2.0 with 1 replica. These new pods will also be part of the endpoints of the existing service. Check everything is good in the new pod, scale it up, add ver=v2.0 to the service, and scale down the old deploy.
@fiddler_roof There is an argument to be made for a v2 of many builtin APIs to capture better defaults. But there is no existing relationship between Deployment and Service, so such a thing would be net-new and has a lot of design problems (there's a reason we didn't do that from the start)
@thockin In Knative we added a pre stop hook on the user container that tells our sidecar to start failing readiness probes on their behalf, and delays sending the SIGTERM to them until the probes have started directing traffic away from the pod.
@thockin If this is "best" in Kubernetes, then ..... are there any plans to simplify this? Is there already a KEP or SIG?
@thockin Kubernetes screwed up safe stopping of service pods. It’s possible to do drain stops more safely by default.
