scheduler: fix inFlightPods leak when pod is recreated during scheduling failure
by MaybeSam05 · April 10, 2026
Overview
A pod deleted and recreated with the same name while a scheduling failure is being handled leaves a stale entry in the inFlightPods map, causing unbounded growth of inFlightEvents and related metrics. The fix captures the original in-flight UID before handleSchedulingFailure refreshes the pod object from the informer, and passes that UID explicitly to AddUnschedulableIfNotPresent so the correct entry is cleaned up.

Motivation
Because handleSchedulingFailure refreshes the pod object from the informer, the refreshed pod can carry a different UID than the one originally popped from the queue; lookups keyed by the new UID then miss the stale entry. This causes inFlightPods entries to never be removed, blocking event pruning and allowing metrics to grow without bound.

C4 Context
This change is part of the Kubernetes container orchestration system. Kubernetes manages the lifecycle of containerized workloads across a cluster, and correctness of its internal scheduling state is critical to stable cluster operation. The bug affects long-running clusters where rapid pod churn (delete + recreate with the same name) occurs frequently.
Within Kubernetes, the affected container is the kube-scheduler process. The scheduler is responsible for assigning Pods to Nodes and maintains an internal priority queue (SchedulingQueue) to track pods awaiting scheduling. The queue uses inFlightPods and inFlightEvents maps to coordinate scheduling cycles with cluster events.
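To make the leak mechanism concrete, here is a simplified, self-contained model of the in-flight tracking described above. The types (`queue`, `pop`, `done`) are stand-ins invented for illustration, not the real kube-scheduler structures, which interleave events and pod markers in a more involved way:

```go
package main

import (
	"container/list"
	"fmt"
)

// uid is a stand-in for k8s.io/apimachinery types.UID.
type uid string

// queue models (in miniature) how the scheduling queue tracks in-flight pods:
// each popped pod gets a marker in inFlightEvents, and inFlightPods maps the
// pod's UID to that marker so it can be removed when scheduling finishes.
type queue struct {
	inFlightPods   map[uid]*list.Element
	inFlightEvents *list.List
}

func newQueue() *queue {
	return &queue{inFlightPods: map[uid]*list.Element{}, inFlightEvents: list.New()}
}

// pop records the pod as in-flight.
func (q *queue) pop(u uid) {
	q.inFlightPods[u] = q.inFlightEvents.PushBack(u)
}

// done removes the in-flight marker. A lookup keyed by the wrong UID is a
// silent no-op, which is exactly how the entry leaks.
func (q *queue) done(u uid) {
	if e, ok := q.inFlightPods[u]; ok {
		q.inFlightEvents.Remove(e)
		delete(q.inFlightPods, u)
	}
}

func main() {
	q := newQueue()
	q.pop("old-uid")

	// Bug scenario: the pod was recreated, so cleanup runs with the new UID
	// and misses the entry recorded under the old one.
	q.done("new-uid")
	fmt.Println("leaked entries:", len(q.inFlightPods)) // the "old-uid" entry remains

	// Correct cleanup uses the originally popped UID.
	q.done("old-uid")
	fmt.Println("after correct cleanup:", len(q.inFlightPods))
}
```

As long as the marker for "old-uid" stays in the list, no event older than it can be pruned, which is why the leak also inflates inFlightEvents and the associated metrics.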
At the component level, the fix touches handleSchedulingFailure in schedule_one.go, which reloads pod state from the shared informer after a pre-binding scheduling failure. It also modifies AddUnschedulableIfNotPresent in scheduling_queue.go / active_queue.go to accept an explicit in-flight UID, and threads that UID through determineSchedulingHintForInFlightPod and clusterEventsForPod so that map lookups remain consistent even when the refreshed pod object carries a different UID than the one originally popped from the queue.
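The shape of that UID-threading change can be sketched as follows. This is a hypothetical, simplified model of the behavior described above (stand-in types, invented function body), not the actual kube-scheduler code; in particular, the empty-UID fallback mirrors the stated contract that existing call sites passing an empty string preserve prior behavior:

```go
package main

import "fmt"

// uid is a stand-in for k8s.io/apimachinery types.UID.
type uid string

type queue struct {
	inFlightPods map[uid]bool
}

// addUnschedulableIfNotPresent sketches the new parameter: the caller passes
// the UID it originally popped, so the map lookup stays consistent even if
// the pod object was since refreshed from the informer.
func (q *queue) addUnschedulableIfNotPresent(podUID, inFlightUID uid) {
	key := inFlightUID
	if key == "" {
		key = podUID // legacy call sites (empty string) keep the old lookup
	}
	delete(q.inFlightPods, key) // clean up the in-flight entry under the right key
}

func main() {
	q := &queue{inFlightPods: map[uid]bool{"old-uid": true}}

	// The refreshed pod carries "new-uid", but the caller captured "old-uid"
	// before the refresh and passes it explicitly.
	q.addUnschedulableIfNotPresent("new-uid", "old-uid")
	fmt.Println("remaining in-flight:", len(q.inFlightPods))
}
```

Any call site that forgets to pass the captured UID falls back to the pod's own (possibly refreshed) UID, which is the regression risk the review flags later.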
Bug Fix: UID Mismatch Detection in handleSchedulingFailure
This change adds a guard in schedule_one.go that detects when a pod has been recreated (same name, different UID) during a scheduling failure. When a UID mismatch is detected between the cached pod (from the informer) and the pod that was originally popped from the scheduling queue, the handler now returns early, skipping requeueing and status updates for the stale entry. This prevents the in-flight pod tracking maps from retaining entries that can never be cleaned up.
Bug Fix: UID Mismatch Detection in handleSchedulingFailure – Key Signatures
| Name | File | What it does |
|---|---|---|
| `handleSchedulingFailure` | `pkg/scheduler/schedule_one.go` | Handles scheduling failures for a pod by either logging an assignment abort or requeueing the pod as unschedulable; now with an early return when the cached pod's UID differs from the originally-queued pod's UID. |
Bug Fix: UID Mismatch Detection in handleSchedulingFailure – Walkthrough
When a pod fails scheduling, handleSchedulingFailure looks up the current state of the pod via the SharedInformer cache (cachedPod) and compares it against the pod that was originally popped from the scheduling queue (podInfo.Pod).
The existing logic already handles the case where the pod has been bound to a node (the sibling branch of the new check), but previously had no defense against the pod having been deleted and a new pod created with the same name.
The new check cachedPod.UID != podInfo.Pod.UID detects this recreation scenario: if the UIDs differ, the cachedPod is a completely different object that merely shares a namespace/name with the pod that failed scheduling.
In that case the handler logs at verbosity 2 and returns immediately, skipping both the AddUnschedulableIfNotPresent call and any status patch โ because the stale pod no longer exists in the API and the new pod will go through its own independent scheduling cycle.
Without this guard, the in-flight pod tracking maps (nominatedPods, scheduling cycle maps, etc.) would retain an entry keyed to the old UID that can never be removed, since the cleanup path (DonePod) would be associated with the new pod's lifecycle, not the old one's.
Bug Fix: UID Mismatch Detection in handleSchedulingFailure
pkg/scheduler/schedule_one.go modified
```diff
@@ -1256,6 +1256,10 @@ func (sched *Scheduler) handleSchedulingFailure(ctx context.Context, podFwk fram
 		logger.Info("Pod has been assigned to node. Abort adding it back to queue.", "pod", klog.KObj(pod), "node", cachedPod.Spec.NodeName)
 		// We need to call DonePod here because we don't call AddUnschedulableIfNotPresent in this case.
 	} else {
+		if cachedPod.UID != podInfo.Pod.UID {
+			logger.V(2).Info("Pod was recreated while handling scheduling failure. Skip requeueing and status updates.", "pod", klog.KObj(pod), "oldUID", podInfo.Pod.UID, "newUID", cachedPod.UID)
+			return
+		}
 		// As <cachedPod> is from SharedInformer, we need to do a DeepCopy() here.
 		// ignore this err since apiserver doesn't properly validate affinity terms
 		// and we can't fix the validation for backwards compatibility.
```
Bug Fix: UID Mismatch Detection in handleSchedulingFailure – Issues
- **medium** The early `return` skips calling `DonePod`, which is explicitly called in the sibling branch (node-assigned case) precisely because `AddUnschedulableIfNotPresent` is not invoked. If `DonePod` is also required here to release in-flight tracking state for the old UID, its absence could itself cause a leak, the very problem the fix is trying to prevent. The commit message says this prevents tracking map leaks, but the fix path should be verified to confirm `DonePod` is not needed (or it should be added before the `return`).
- **low** The log message says "Skip requeueing and status updates", but the code only skips the else-branch logic; it should be confirmed that no status patch or nomination update happens after this function returns in the caller, to ensure the log message is fully accurate.
- **low** No unit test is added or referenced in this diff for the new UID-mismatch code path, making it harder to catch regressions if the surrounding control flow changes in the future.
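The `DonePod` concern can be illustrated with a simplified, self-contained sketch. The types and function below are stand-ins invented for illustration (not the real scheduler code); the point is only the structural difference between an early return that releases the old UID's tracking state and one that does not:

```go
package main

import "fmt"

// uid is a stand-in for k8s.io/apimachinery types.UID.
type uid string

// sched models only the in-flight tracking relevant to the issue.
type sched struct {
	inFlight map[uid]bool
}

// donePod releases in-flight tracking state for a UID.
func (s *sched) donePod(u uid) {
	delete(s.inFlight, u)
}

// handleFailure models just the recreation branch of handleSchedulingFailure.
// callDone toggles whether the remediation (DonePod before return) is applied.
func (s *sched) handleFailure(cachedUID, poppedUID uid, callDone bool) {
	if cachedUID != poppedUID {
		if callDone {
			s.donePod(poppedUID) // release tracking for the stale entry
		}
		return // skip requeueing and status updates
	}
	// (normal requeueing path omitted)
}

func main() {
	s := &sched{inFlight: map[uid]bool{"old-uid": true}}
	s.handleFailure("new-uid", "old-uid", false)
	fmt.Println("without DonePod, in-flight:", len(s.inFlight))

	s.handleFailure("new-uid", "old-uid", true)
	fmt.Println("with DonePod, in-flight:", len(s.inFlight))
}
```

Whether the real early return needs this depends on whether anything else on the failure path drains the old UID's entry; that is exactly what the issue asks the author to verify.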
Regression Test: TestHandleSchedulingFailureSkipsRecreatedPod
The PR adds a regression test in schedule_one_test.go that exercises the fix end-to-end. It creates an 'old' pod, pops it from the scheduling queue (putting it in-flight), then replaces the informer's view with a 'recreated' pod of the same name but a new UID, and calls handleSchedulingFailure. The test asserts that the in-flight entry is cleared, the recreated pod is not added to the backoff/unschedulable queues, no nomination is recorded, and the recreated pod's status in the API server is untouched.

Regression Test: TestHandleSchedulingFailureSkipsRecreatedPod – Key Signatures
| Name | File | What it does |
|---|---|---|
| `TestHandleSchedulingFailureSkipsRecreatedPod` | `pkg/scheduler/schedule_one_test.go` | Regression test that simulates a pod being popped from the scheduling queue, replaced in the informer cache by a recreated pod with the same name but a different UID, and then asserts that handleSchedulingFailure clears the in-flight entry without polluting the backoff/unschedulable queues, nomination records, or the API server pod status. |
Regression Test: TestHandleSchedulingFailureSkipsRecreatedPod – Walkthrough
The test creates an oldPod with UID "old-uid" and a recreatedPod as a deep copy with UID "new-uid", seeding the fake clientset with only the recreated pod; this mimics the real-world scenario where a pod is deleted and a new one with the same name is created before failure handling runs.
The informer factory is wired to the fake clientset and synced, so when handleSchedulingFailure consults the informer's lister it will see recreatedPod (UID "new-uid"), not oldPod.
The scheduling queue is populated with oldPod and immediately popped, placing it into the in-flight set; the test asserts upfront that the pod is indeed in-flight before calling handleSchedulingFailure, establishing the precondition clearly.
handleSchedulingFailure is then invoked with the popped oldPod QueuedPodInfo, an Unschedulable status, and a NominatingInfo targeting "node1": the combination that, without the fix, would have re-enqueued or nominated the stale pod.
The in-flight drain is verified with a PollUntilContextTimeout loop rather than a direct assertion, correctly accounting for any asynchronous cleanup that handleSchedulingFailure may trigger internally.
Four independent postcondition checks follow: in-flight cleared, backoff queue empty, unschedulable queue empty, and no nomination recorded for "node1". Each covers a distinct code path the fix must guard.
Finally, the test fetches the pod from the fake API server and diffs its status against the original recreated pod's status, ensuring no spurious status patch (e.g., an unschedulable condition) was written for the new pod.
Regression Test: TestHandleSchedulingFailureSkipsRecreatedPod
pkg/scheduler/schedule_one_test.go modified
```diff
@@ -1318,6 +1318,79 @@ func TestSchedulerScheduleOne(t *testing.T) {
 	}
 }
+func TestHandleSchedulingFailureSkipsRecreatedPod(t *testing.T) {
+	logger, ctx := ktesting.NewTestContext(t)
+	ctx, cancel := context.WithCancel(ctx)
+	defer cancel()
+
+	oldPod := st.MakePod().Name("foo").Namespace("ns").UID("old-uid").SchedulerName(testSchedulerName).Obj()
+	recreatedPod := oldPod.DeepCopy()
+	recreatedPod.UID = "new-uid"
+
+	client := clientsetfake.NewClientset(recreatedPod)
+	informerFactory := informers.NewSharedInformerFactory(client, 0)
+	eventBroadcaster := events.NewBroadcaster(&events.EventSinkImpl{Interface: client.EventsV1()})
+
+	schedFramework, err := tf.NewFramework(ctx,
+		[]tf.RegisterPluginFunc{
+			tf.RegisterQueueSortPlugin(queuesort.Name, queuesort.New),
+			tf.RegisterBindPlugin(defaultbinder.Name, defaultbinder.New),
+		},
+		testSchedulerName,
+		frameworkruntime.WithClientSet(client),
+		frameworkruntime.WithEventRecorder(eventBroadcaster.NewRecorder(scheme.Scheme, testSchedulerName)),
+		frameworkruntime.WithInformerFactory(informerFactory),
+	)
+	if err != nil {
+		t.Fatal(err)
+	}
+
+	ar := metrics.NewMetricsAsyncRecorder(10, time.Second, ctx.Done())
+	queue := internalqueue.NewSchedulingQueue(nil, informerFactory, internalqueue.WithMetricsRecorder(ar))
+	sched := &Scheduler{
+		client:          client,
+		SchedulingQueue: queue,
+	}
+
+	informerFactory.Start(ctx.Done())
+	informerFactory.WaitForCacheSync(ctx.Done())
+
+	queue.Add(ctx, oldPod)
+	popped, err := queue.Pop(logger)
+	if err != nil {
+		t.Fatalf("Pop: %v", err)
+	}
+	if got := queue.InFlightPods(); !podListContainsPod(got, oldPod) {
+		t.Fatalf("expected popped pod to be in-flight before failure handling, got %v", got)
+	}
+
+	nominatingInfo := &fwk.NominatingInfo{NominatingMode: fwk.ModeOverride, NominatedNodeName: "node1"}
+	sched.handleSchedulingFailure(ctx, schedFramework, popped, fwk.NewStatus(fwk.Unschedulable, "no fit"), nominatingInfo, time.Now())
+
+	if err := wait.PollUntilContextTimeout(ctx, time.Millisecond, wait.ForeverTestTimeout, false, func(context.Context) (bool, error) {
+		return len(queue.InFlightPods()) == 0, nil
+	}); err != nil {
+		t.Fatalf("in-flight pod was not cleared: %v", queue.InFlightPods())
+	}
+	if got := queue.PodsInBackoffQ(); len(got) != 0 {
+		t.Fatalf("expected recreated pod to stay out of backoffQ, got %v", got)
+	}
+	if got := queue.UnschedulablePods(); len(got) != 0 {
+		t.Fatalf("expected recreated pod to stay out of unschedulablePods, got %v", got)
+	}
+	if got := queue.NominatedPodsForNode("node1"); len(got) != 0 {
+		t.Fatalf("expected recreated pod to stay out of nominated pods, got %v", got)
+	}
+
+	updatedPod, err := client.CoreV1().Pods(recreatedPod.Namespace).Get(ctx, recreatedPod.Name, metav1.GetOptions{})
+	if err != nil {
+		t.Fatalf("Get pod: %v", err)
+	}
+	if diff := cmp.Diff(recreatedPod.Status, updatedPod.Status); diff != "" {
+		t.Fatalf("expected recreated pod status to remain unchanged (-want,+got):\n%s", diff)
+	}
+}
+
 type constSigPluginConfig struct {
 	name      string
 	signature []fwk.SignFragment
```
Regression Test: TestHandleSchedulingFailureSkipsRecreatedPod – Issues
- **medium** The `popped` variable returned by `queue.Pop` is a `*framework.QueuedPodInfo`, but the test never verifies that `popped.Pod.UID == "old-uid"` before passing it to `handleSchedulingFailure`. If the queue implementation ever returns the wrong entry, this silent assumption could make the test pass for the wrong reason.
- **low** The `wait.PollUntilContextTimeout` uses `wait.ForeverTestTimeout` as the deadline. If the in-flight pod is never cleared (i.e., the bug is present), the test will hang for the full timeout duration rather than failing quickly; a shorter, explicit timeout (e.g., 5 seconds) would give faster feedback during CI.
- **low** The `Scheduler` struct is initialized with only the `client` and `SchedulingQueue` fields, leaving other fields (e.g., `nodeInfoSnapshot`, `percentageOfNodesToScore`) as zero values. This is fine for this specific test path, but a brief comment explaining why a minimal struct is sufficient would improve maintainability.
- **low** There is no assertion that the `oldPod`'s status in the API server was not patched; the test only checks the recreated pod. If the fix accidentally patched the wrong object (using the old pod's name but the new pod's namespace, for instance), this would go undetected. Asserting a 404 or verifying no status update was sent for `old-uid` would close this gap. (Though with a fake clientset keyed by name this is tricky, so at minimum a comment acknowledging the limitation would help.)
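The precondition check suggested in the first issue could look like the sketch below. The types here are minimal stand-ins invented for illustration (the real test would assert directly on the `*framework.QueuedPodInfo` returned by `queue.Pop`):

```go
package main

import "fmt"

// Hypothetical stand-ins for the framework types used by the test; the real
// QueuedPodInfo carries a full *v1.Pod, not just a UID.
type pod struct{ UID string }
type queuedPodInfo struct{ Pod *pod }

// assertPoppedUID models the suggested guard: fail fast if the queue returned
// an entry other than the one the test enqueued, instead of silently passing
// a wrong pod to handleSchedulingFailure.
func assertPoppedUID(popped *queuedPodInfo, want string) error {
	if popped.Pod.UID != want {
		return fmt.Errorf("popped pod has UID %q, want %q", popped.Pod.UID, want)
	}
	return nil
}

func main() {
	popped := &queuedPodInfo{Pod: &pod{UID: "old-uid"}}
	if err := assertPoppedUID(popped, "old-uid"); err != nil {
		fmt.Println("precondition failed:", err)
		return
	}
	fmt.Println("precondition holds: popped UID is old-uid")
}
```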
Review Complete
To recap: a pod deleted and recreated with the same name during failure handling leaves a stale entry in the inFlightPods map, causing unbounded growth of inFlightEvents and related metrics. The fix captures the original in-flight UID before handleSchedulingFailure refreshes the pod object from the informer, and passes that UID explicitly to AddUnschedulableIfNotPresent so the correct entry is cleaned up.

Key risks to keep in mind
- The new `schedulingCycleInFlightUID` parameter is added to `AddUnschedulableIfNotPresent`; all existing call sites pass an empty string which preserves prior behavior, but any missed call site would silently regress to the old buggy behavior.
- The fix relies on capturing the UID at the right moment (before the informer refresh); if the capture point is wrong or the variable is not threaded correctly through `determineSchedulingHintForInFlightPod` and `clusterEventsForPod`, in-flight lookups could still mismatch.
- The test coverage is limited to the new `TestPriorityQueue_AddUnschedulableIfNotPresent_PodRecreatedSameName` case with `SchedulerQueueingHints` enabled; the bug scenario with hints disabled is not explicitly covered.
- Race conditions in the narrow window between pod deletion/recreation and scheduling failure handling are inherently hard to test deterministically; the new test may not catch all timing variants.