Kubebuilder provides tooling to help get you started quickly writing operators for Kubernetes, and builds on top of controller-runtime. I’ve been looking at how errors are handled in a couple of Kubebuilder projects recently. I’d seen a couple of GitHub issues that mentioned that controller-runtime has back-off behaviour for errors so started looking through the docs to find out more, but didn’t find anything. If I get chance, I’d like to find a suitable place to send a PR to add some details in the docs, but for now I’m collating my notes here as a reference for future me!
Baseline
As a starting point, the code below is a simple Reconcile loop. Along with this code there is a Custom Resource Definition for a Demo object, and the Reconcile function is invoked by controller-runtime whenever a Demo object is created, updated or deleted.
|
|
As you can see in the definition of the Reconcile method above, it takes in a Request instance (defined in controller-runtime) and returns a Result (again defined in controller-runtime) and an error.
The first few lines of the Reconcile method are setting up some variables, and then the main block in the middle is looking up the Demo instead based on the metadata in the req parameter, and handling errors. Finally, after not really doing anything, the method returns an empty Result and nil error.
If we run this and create a Demo object we would see log output similar to the output below showing the reconciler starting and successfully completing.
2020-02-02T07:06:42.513Z INFO controllers.Demo Starting {"demo": "default/demo-sample"}
2020-02-02T07:06:42.514Z DEBUG controller-runtime.controller Successfully Reconciled {"controller": "demo", "request": "default/demo-sample"}
Adding in Errors
Now let’s change the last line of the reconciler to return an error:
return ctrl.Result{}, fmt.Errorf("dummy error")
If we run the same test as before we now start to see error messages in the logs as shown in the output below. Note that the full error details have been stripped out as it is quite noisy. It is clear that the reconciler was invoked more than once for the same object. This standard behaviour with controller-runtime; if you return an error it will retry the execution of the Reconcile function.
Looking at the timestamps we can see that the first few retries happen very quickly, but the gap between retries starts to increase. controller-runtime is automatically implementing a retry back-off for us!
2020-02-02T06:51:18.288Z INFO controllers.Demo Starting {"demo": "default/demo-sample"}
2020-02-02T06:51:18.288Z ERROR controller-runtime.controller Reconciler error {"controller": "demo", "request": "default/demo-sample", "error": "dummy error"}
2020-02-02T06:51:19.289Z INFO controllers.Demo Starting {"demo": "default/demo-sample"}
2020-02-02T06:51:19.289Z ERROR controller-runtime.controller Reconciler error {"controller": "demo", "request": "default/demo-sample", "error": "dummy error"}
2020-02-02T06:51:20.289Z INFO controllers.Demo Starting {"demo": "default/demo-sample"}
2020-02-02T06:51:20.289Z ERROR controller-runtime.controller Reconciler error {"controller": "demo", "request": "default/demo-sample", "error": "dummy error"}
2020-02-02T06:51:21.290Z INFO controllers.Demo Starting {"demo": "default/demo-sample"}
2020-02-02T06:51:21.290Z ERROR controller-runtime.controller Reconciler error {"controller": "demo", "request": "default/demo-sample", "error": "dummy error"}
2020-02-02T06:51:22.290Z INFO controllers.Demo Starting {"demo": "default/demo-sample"}
2020-02-02T06:51:22.291Z ERROR controller-runtime.controller Reconciler error {"controller": "demo", "request": "default/demo-sample", "error": "dummy error"}
2020-02-02T06:51:23.291Z INFO controllers.Demo Starting {"demo": "default/demo-sample"}
2020-02-02T06:51:23.291Z ERROR controller-runtime.controller Reconciler error {"controller": "demo", "request": "default/demo-sample", "error": "dummy error"}
2020-02-02T06:51:24.315Z INFO controllers.Demo Starting {"demo": "default/demo-sample"}
2020-02-02T06:51:24.316Z ERROR controller-runtime.controller Reconciler error {"controller": "demo", "request": "default/demo-sample", "error": "dummy error"}
2020-02-02T06:51:25.316Z INFO controllers.Demo Starting {"demo": "default/demo-sample"}
2020-02-02T06:51:25.317Z ERROR controller-runtime.controller Reconciler error {"controller": "demo", "request": "default/demo-sample", "error": "dummy error"}
2020-02-02T06:51:26.317Z INFO controllers.Demo Starting {"demo": "default/demo-sample"}
2020-02-02T06:51:26.318Z ERROR controller-runtime.controller Reconciler error {"controller": "demo", "request": "default/demo-sample", "error": "dummy error"}
2020-02-02T06:51:27.598Z INFO controllers.Demo Starting {"demo": "default/demo-sample"}
2020-02-02T06:51:27.598Z ERROR controller-runtime.controller Reconciler error {"controller": "demo", "request": "default/demo-sample", "error": "dummy error"}
2020-02-02T06:51:30.158Z INFO controllers.Demo Starting {"demo": "default/demo-sample"}
2020-02-02T06:51:30.158Z ERROR controller-runtime.controller Reconciler error {"controller": "demo", "request": "default/demo-sample", "error": "dummy error"}
2020-02-02T06:51:35.279Z INFO controllers.Demo Starting {"demo": "default/demo-sample"}
2020-02-02T06:51:35.280Z ERROR controller-runtime.controller Reconciler error {"controller": "demo", "request": "default/demo-sample", "error": "dummy error"}
2020-02-02T06:51:45.520Z INFO controllers.Demo Starting {"demo": "default/demo-sample"}
2020-02-02T06:51:45.520Z ERROR controller-runtime.controller Reconciler error {"controller": "demo", "request": "default/demo-sample", "error": "dummy error"}
2020-02-02T06:52:06.001Z INFO controllers.Demo Starting {"demo": "default/demo-sample"}
2020-02-02T06:52:06.001Z DEBUG controller-runtime.controller Successfully Reconciled {"controller": "demo", "request": "default/demo-sample"}
Digging into the Code
In this section we’ll explore the back-off behaviour seen above by digging into the libraries that our controller is built on, following the trail of the implementation to find where (and how) it is implemented. If you just want the end results of this then skip to the Summary section.
Starting in the Kubebuilder-generated project, the Reconcile method on our controller is wired up in the SetupWithManager function shown below.
|
|
This provides a starting point for digging into the controller-runtime code and as we do we reach this code which creates a Controller. Lines 8-12 below show where the rate-limiting queue is created.
|
|
Note that the NewNamedRateLimitingQueue takes a workqueue.DefaultControllerRateLimiter as a parameter. We’ll come back to explore that in a moment, but for now we’ll look at the NewNamedRateLimitingQueue (from the client-go package) which is shown below.
|
|
As you can see from the implementation this uses an internal rateLimitingType, which has a few functions defined:
|
|
The AddRateLimited function is interesting as it uses the rateLimiter to determine where the specified item should be added in the queue (i.e. what time it should be available for de-queueing). As we saw in the call to NewNamedRateLimitingQueue, the rate limiter is returned from workqueue.DefaultControllerRateLimiter (also in client-go) shown below:
|
|
This is the rate-limiter that AddRateLimited uses to determine when items should be enqueued for. MaxOfRateLimiter.When takes the longest duration of the encapsulated rate-limiters, as shown below.
|
|
So now we know that items added to the work queue using AddRateLimited will get a per-item exponential back-off and 10 items per second limit (with a burst allowance of 100). The next step is to determine when items are added to the queue using AddRateLimited.
The code we saw earlier that creates the work queue assigned that as part of initialising a controller.Controller instance. The Controller type in this context is in the internal/controller/controller.go in controller-runtime which has the following logic in it’s reconcileHandler method:
|
|
In this code, the c.Do.Reconcile is the Reconcile method from our operator, so we can see that Reconcile calls that return an error or set Requeue: true on the Result will result in calls to AddAfter and will have the exponential back-off and 10 events per second rate limiting applied. We can also see that setting RequeueAfter: someDuration does not result in rate-limiting being applied.
New items
The previous section gives us a clear picture of when rate-limiting is applied, but what about new instances of our CRD? Are we rate-limited on those?
To answer that we can go back to the SetupWithManager code that we saw earlier. Repeating the process of the following the thread of the code, we can see that the Complete method calls Build, which then calls the doWatch method:
|
|
Here we can see bldr.ctrl.Watch being invoked for a handler.EnqueueRequestForObject on a source.Kind. Tracking through the Watch method we find that this is implemented in the Controller.Watch method which calls into the Controller.Start method. This in turn calls the Start method on Source.Kind. This method uses an EventHandler to wrap the event handler that was passed in. This code has OnAdd (shown below), OnUpdate and OnDelete methods.
|
|
The snippet above omits the code to load the metadata, but once loaded it calls any predicates that were specified when setting up the controller. This is where the predicates from the previous posts on filtering deletes and updates are executed.
The code for OnUpdate and OnDelete is similar - if no predicate returns false the code goes on to call the Update or Delete method on the event handler. This event handler is the one that was passed down from the doWatch method which was a handler.EnqueueRequestForObject. Looking at the code for the EnqueueRequestForObject.Create method:
|
|
In this code we can see that the event is being added to the queue using Add rather than AddAfter, so no rate-limiting is applied to the incoming events.
Summary
This post ended up being a bit of a journey through the controller-runtime and client-go codebases, but coming out the other side we now know that
- rate-limiting is not applied to incoming events (e.g. new watched items)
- rate-limiting is applied to reconcile responses with an error or that set
Requeue: true - rate-limiting is not applied to reconcile responses that set
RequeueAfter: someDuration
When rate-limiting is applied it is the combination of:
- a per-item exponential back-off
- and
rate.NewLimiter(10, 100)(i.e. rate limit of 10 per second and burst amount of 100)
At the time of writing, this rate-limiting cannot be configured but there is a pull request open to allow the rate-limiter to be specified in the options in SetupWithManager.