scheduler: fix inFlightPods leak when pod is recreated during scheduling failure
by MaybeSam05 · April 10, 2026
Overview
A pod deleted and recreated with the same name while a scheduling failure is being handled leaves a stale entry in the inFlightPods map, causing unbounded growth of inFlightEvents and related metrics. The fix captures the original in-flight UID before handleSchedulingFailure refreshes the pod object from the informer, and passes that UID explicitly to AddUnschedulableIfNotPresent so the correct entry is cleaned up.

Motivation
Because handleSchedulingFailure refreshes the pod object from the informer, the refreshed pod can carry a different UID than the one originally popped from the queue; lookups keyed by the new UID then miss the stale entry. This causes inFlightPods entries to never be removed, blocking event pruning and allowing metrics to grow without bound.

C4 Context
This change is part of the Kubernetes container orchestration system. Kubernetes manages the lifecycle of containerized workloads across a cluster, and correctness of its internal scheduling state is critical to stable cluster operation. The bug affects long-running clusters where rapid pod churn (delete + recreate with the same name) occurs frequently.
Within Kubernetes, the affected container is the kube-scheduler process. The scheduler is responsible for assigning Pods to Nodes and maintains an internal priority queue (SchedulingQueue) to track pods awaiting scheduling. The queue uses inFlightPods and inFlightEvents maps to coordinate scheduling cycles with cluster events.
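To make the leak mechanism concrete, here is a simplified, self-contained model of the in-flight tracking described above. The types (`queue`, `pop`, `done`) are stand-ins invented for illustration, not the real kube-scheduler structures, which interleave events and pod markers in a more involved way:

```go
package main

import (
	"container/list"
	"fmt"
)

// uid is a stand-in for k8s.io/apimachinery types.UID.
type uid string

// queue models (in miniature) how the scheduling queue tracks in-flight pods:
// each popped pod gets a marker in inFlightEvents, and inFlightPods maps the
// pod's UID to that marker so it can be removed when scheduling finishes.
type queue struct {
	inFlightPods   map[uid]*list.Element
	inFlightEvents *list.List
}

func newQueue() *queue {
	return &queue{inFlightPods: map[uid]*list.Element{}, inFlightEvents: list.New()}
}

// pop records the pod as in-flight.
func (q *queue) pop(u uid) {
	q.inFlightPods[u] = q.inFlightEvents.PushBack(u)
}

// done removes the in-flight marker. A lookup keyed by the wrong UID is a
// silent no-op, which is exactly how the entry leaks.
func (q *queue) done(u uid) {
	if e, ok := q.inFlightPods[u]; ok {
		q.inFlightEvents.Remove(e)
		delete(q.inFlightPods, u)
	}
}

func main() {
	q := newQueue()
	q.pop("old-uid")

	// Bug scenario: the pod was recreated, so cleanup runs with the new UID
	// and misses the entry recorded under the old one.
	q.done("new-uid")
	fmt.Println("leaked entries:", len(q.inFlightPods)) // the "old-uid" entry remains

	// Correct cleanup uses the originally popped UID.
	q.done("old-uid")
	fmt.Println("after correct cleanup:", len(q.inFlightPods))
}
```

As long as the marker for "old-uid" stays in the list, no event older than it can be pruned, which is why the leak also inflates inFlightEvents and the associated metrics.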
At the component level, the fix touches handleSchedulingFailure in schedule_one.go, which reloads pod state from the shared informer after a pre-binding scheduling failure. It also modifies AddUnschedulableIfNotPresent in scheduling_queue.go / active_queue.go to accept an explicit in-flight UID, and threads that UID through determineSchedulingHintForInFlightPod and clusterEventsForPod so that map lookups remain consistent even when the refreshed pod object carries a different UID than the one originally popped from the queue.
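The shape of that UID-threading change can be sketched as follows. This is a hypothetical, simplified model of the behavior described above (stand-in types, invented function body), not the actual kube-scheduler code; in particular, the empty-UID fallback mirrors the stated contract that existing call sites passing an empty string preserve prior behavior:

```go
package main

import "fmt"

// uid is a stand-in for k8s.io/apimachinery types.UID.
type uid string

type queue struct {
	inFlightPods map[uid]bool
}

// addUnschedulableIfNotPresent sketches the new parameter: the caller passes
// the UID it originally popped, so the map lookup stays consistent even if
// the pod object was since refreshed from the informer.
func (q *queue) addUnschedulableIfNotPresent(podUID, inFlightUID uid) {
	key := inFlightUID
	if key == "" {
		key = podUID // legacy call sites (empty string) keep the old lookup
	}
	delete(q.inFlightPods, key) // clean up the in-flight entry under the right key
}

func main() {
	q := &queue{inFlightPods: map[uid]bool{"old-uid": true}}

	// The refreshed pod carries "new-uid", but the caller captured "old-uid"
	// before the refresh and passes it explicitly.
	q.addUnschedulableIfNotPresent("new-uid", "old-uid")
	fmt.Println("remaining in-flight:", len(q.inFlightPods))
}
```

Any call site that forgets to pass the captured UID falls back to the pod's own (possibly refreshed) UID, which is the regression risk the review flags later.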
Bug Fix: UID Mismatch Detection in handleSchedulingFailure
This change adds a guard in schedule_one.go that detects when a pod has been recreated (same name, different UID) during a scheduling failure. When a UID mismatch is detected between the cached pod (from the informer) and the pod that was originally popped from the scheduling queue, the handler now returns early, skipping requeueing and status updates for the stale entry. This prevents the in-flight pod tracking maps from retaining entries that can never be cleaned up.
Bug Fix: UID Mismatch Detection in handleSchedulingFailure – Key Signatures
| Name | File | What it does |
|---|---|---|
| `handleSchedulingFailure` | `pkg/scheduler/schedule_one.go` | Handles scheduling failures for a pod by either logging an assignment abort or requeueing the pod as unschedulable; now with an early return when the cached pod's UID differs from the originally-queued pod's UID. |
Bug Fix: UID Mismatch Detection in handleSchedulingFailure – Walkthrough
When a pod fails scheduling, handleSchedulingFailure looks up the current state of the pod via the SharedInformer cache (cachedPod) and compares it against the pod that was originally popped from the scheduling queue (podInfo.Pod).
The existing logic already handles the case where the pod has been bound to a node (the sibling branch of the new check), but previously had no defense against the pod having been deleted and a new pod created with the same name.
The new check cachedPod.UID != podInfo.Pod.UID detects this recreation scenario: if the UIDs differ, the cachedPod is a completely different object that merely shares a namespace/name with the pod that failed scheduling.
In that case the handler logs at verbosity 2 and returns immediately, skipping both the AddUnschedulableIfNotPresent call and any status patch โ because the stale pod no longer exists in the API and the new pod will go through its own independent scheduling cycle.
Without this guard, the in-flight pod tracking maps (nominatedPods, scheduling cycle maps, etc.) would retain an entry keyed to the old UID that can never be removed, since the cleanup path (DonePod) would be associated with the new pod's lifecycle, not the old one's.
Bug Fix: UID Mismatch Detection in handleSchedulingFailure
pkg/scheduler/schedule_one.go modified
```diff
@@ -1256,6 +1256,10 @@ func (sched *Scheduler) handleSchedulingFailure(ctx context.Context, podFwk fram
 		logger.Info("Pod has been assigned to node. Abort adding it back to queue.", "pod", klog.KObj(pod), "node", cachedPod.Spec.NodeName)
 		// We need to call DonePod here because we don't call AddUnschedulableIfNotPresent in this case.
 	} else {
+		if cachedPod.UID != podInfo.Pod.UID {
+			logger.V(2).Info("Pod was recreated while handling scheduling failure. Skip requeueing and status updates.", "pod", klog.KObj(pod), "oldUID", podInfo.Pod.UID, "newUID", cachedPod.UID)
+			return
+		}
 		// As <cachedPod> is from SharedInformer, we need to do a DeepCopy() here.
 		// ignore this err since apiserver doesn't properly validate affinity terms
 		// and we can't fix the validation for backwards compatibility.
```
Bug Fix: UID Mismatch Detection in handleSchedulingFailure – Issues
- **medium** The early `return` skips calling `DonePod`, which is explicitly called in the sibling branch (node-assigned case) precisely because `AddUnschedulableIfNotPresent` is not invoked. If `DonePod` is also required here to release in-flight tracking state for the old UID, its absence could itself cause a leak, the very problem the fix is trying to prevent. The commit message says this prevents tracking map leaks, but the fix path should be verified to confirm `DonePod` is not needed (or it should be added before the `return`).
- **low** The log message says "Skip requeueing and status updates", but the code only skips the else-branch logic; it should be confirmed that no status patch or nomination update happens after this function returns in the caller, to ensure the log message is fully accurate.
- **low** No unit test is added or referenced in this diff for the new UID-mismatch code path, making it harder to catch regressions if the surrounding control flow changes in the future.
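The `DonePod` concern can be illustrated with a simplified, self-contained sketch. The types and function below are stand-ins invented for illustration (not the real scheduler code); the point is only the structural difference between an early return that releases the old UID's tracking state and one that does not:

```go
package main

import "fmt"

// uid is a stand-in for k8s.io/apimachinery types.UID.
type uid string

// sched models only the in-flight tracking relevant to the issue.
type sched struct {
	inFlight map[uid]bool
}

// donePod releases in-flight tracking state for a UID.
func (s *sched) donePod(u uid) {
	delete(s.inFlight, u)
}

// handleFailure models just the recreation branch of handleSchedulingFailure.
// callDone toggles whether the remediation (DonePod before return) is applied.
func (s *sched) handleFailure(cachedUID, poppedUID uid, callDone bool) {
	if cachedUID != poppedUID {
		if callDone {
			s.donePod(poppedUID) // release tracking for the stale entry
		}
		return // skip requeueing and status updates
	}
	// (normal requeueing path omitted)
}

func main() {
	s := &sched{inFlight: map[uid]bool{"old-uid": true}}
	s.handleFailure("new-uid", "old-uid", false)
	fmt.Println("without DonePod, in-flight:", len(s.inFlight))

	s.handleFailure("new-uid", "old-uid", true)
	fmt.Println("with DonePod, in-flight:", len(s.inFlight))
}
```

Whether the real early return needs this depends on whether anything else on the failure path drains the old UID's entry; that is exactly what the issue asks the author to verify.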
Regression Test: TestHandleSchedulingFailureSkipsRecreatedPod
The PR adds a regression test in schedule_one_test.go that exercises the fix end-to-end. It creates an 'old' pod, pops it from the scheduling queue (putting it in-flight), then replaces the informer's view with a 'recreated' pod of the same name but a new UID, and calls handleSchedulingFailure. The test asserts that the in-flight entry is cleared, the recreated pod is not added to the backoff/unschedulable queues, no nomination is recorded, and the recreated pod's status in the API server is untouched.

Regression Test: TestHandleSchedulingFailureSkipsRecreatedPod – Key Signatures
| Name | File | What it does |
|---|---|---|
| `TestHandleSchedulingFailureSkipsRecreatedPod` | `pkg/scheduler/schedule_one_test.go` | Regression test that simulates a pod being popped from the scheduling queue, replaced in the informer cache by a recreated pod with the same name but a different UID, and then asserts that handleSchedulingFailure clears the in-flight entry without polluting the backoff/unschedulable queues, nomination records, or the API server pod status. |
Regression Test: TestHandleSchedulingFailureSkipsRecreatedPod – Walkthrough
The test creates an oldPod with UID "old-uid" and a recreatedPod as a deep copy with UID "new-uid", seeding the fake clientset with only the recreated pod; this mimics the real-world scenario where a pod is deleted and a new one with the same name is created before failure handling runs.
The informer factory is wired to the fake clientset and synced, so when handleSchedulingFailure consults the informer's lister it will see recreatedPod (UID "new-uid"), not oldPod.
The scheduling queue is populated with oldPod and immediately popped, placing it into the in-flight set; the test asserts upfront that the pod is indeed in-flight before calling handleSchedulingFailure, establishing the precondition clearly.
handleSchedulingFailure is then invoked with the popped oldPod QueuedPodInfo, an Unschedulable status, and a NominatingInfo targeting "node1": the combination that, without the fix, would have re-enqueued or nominated the stale pod.
The in-flight drain is verified with a PollUntilContextTimeout loop rather than a direct assertion, correctly accounting for any asynchronous cleanup that handleSchedulingFailure may trigger internally.
Four independent postcondition checks follow: in-flight cleared, backoff queue empty, unschedulable queue empty, and no nomination recorded for "node1". Each covers a distinct code path the fix must guard.
Finally, the test fetches the pod from the fake API server and diffs its status against the original recreated pod's status, ensuring no spurious status patch (e.g., an unschedulable condition) was written for the new pod.
Regression Test: TestHandleSchedulingFailureSkipsRecreatedPod
pkg/scheduler/schedule_one_test.go modified
```diff
@@ -1318,6 +1318,79 @@ func TestSchedulerScheduleOne(t *testing.T) {
 	}
 }
+func TestHandleSchedulingFailureSkipsRecreatedPod(t *testing.T) {
+	logger, ctx := ktesting.NewTestContext(t)
+	ctx, cancel := context.WithCancel(ctx)
+	defer cancel()
+
+	oldPod := st.MakePod().Name("foo").Namespace("ns").UID("old-uid").SchedulerName(testSchedulerName).Obj()
+	recreatedPod := oldPod.DeepCopy()
+	recreatedPod.UID = "new-uid"
+
+	client := clientsetfake.NewClientset(recreatedPod)
+	informerFactory := informers.NewSharedInformerFactory(client, 0)
+	eventBroadcaster := events.NewBroadcaster(&events.EventSinkImpl{Interface: client.EventsV1()})
+
+	schedFramework, err := tf.NewFramework(ctx,
+		[]tf.RegisterPluginFunc{
+			tf.RegisterQueueSortPlugin(queuesort.Name, queuesort.New),
+			tf.RegisterBindPlugin(defaultbinder.Name, defaultbinder.New),
+		},
+		testSchedulerName,
+		frameworkruntime.WithClientSet(client),
+		frameworkruntime.WithEventRecorder(eventBroadcaster.NewRecorder(scheme.Scheme, testSchedulerName)),
+		frameworkruntime.WithInformerFactory(informerFactory),
+	)
+	if err != nil {
+		t.Fatal(err)
+	}
+
+	ar := metrics.NewMetricsAsyncRecorder(10, time.Second, ctx.Done())
+	queue := internalqueue.NewSchedulingQueue(nil, informerFactory, internalqueue.WithMetricsRecorder(ar))
+	sched := &Scheduler{
+		client:          client,
+		SchedulingQueue: queue,
+	}
+
+	informerFactory.Start(ctx.Done())
+	informerFactory.WaitForCacheSync(ctx.Done())
+
+	queue.Add(ctx, oldPod)
+	popped, err := queue.Pop(logger)
+	if err != nil {
+		t.Fatalf("Pop: %v", err)
+	}
+	if got := queue.InFlightPods(); !podListContainsPod(got, oldPod) {
+		t.Fatalf("expected popped pod to be in-flight before failure handling, got %v", got)
+	}
+
+	nominatingInfo := &fwk.NominatingInfo{NominatingMode: fwk.ModeOverride, NominatedNodeName: "node1"}
+	sched.handleSchedulingFailure(ctx, schedFramework, popped, fwk.NewStatus(fwk.Unschedulable, "no fit"), nominatingInfo, time.Now())
+
+	if err := wait.PollUntilContextTimeout(ctx, time.Millisecond, wait.ForeverTestTimeout, false, func(context.Context) (bool, error) {
+		return len(queue.InFlightPods()) == 0, nil
+	}); err != nil {
+		t.Fatalf("in-flight pod was not cleared: %v", queue.InFlightPods())
+	}
+	if got := queue.PodsInBackoffQ(); len(got) != 0 {
+		t.Fatalf("expected recreated pod to stay out of backoffQ, got %v", got)
+	}
+	if got := queue.UnschedulablePods(); len(got) != 0 {
+		t.Fatalf("expected recreated pod to stay out of unschedulablePods, got %v", got)
+	}
+	if got := queue.NominatedPodsForNode("node1"); len(got) != 0 {
+		t.Fatalf("expected recreated pod to stay out of nominated pods, got %v", got)
+	}
+
+	updatedPod, err := client.CoreV1().Pods(recreatedPod.Namespace).Get(ctx, recreatedPod.Name, metav1.GetOptions{})
+	if err != nil {
+		t.Fatalf("Get pod: %v", err)
+	}
+	if diff := cmp.Diff(recreatedPod.Status, updatedPod.Status); diff != "" {
+		t.Fatalf("expected recreated pod status to remain unchanged (-want,+got):\n%s", diff)
+	}
+}
+
 type constSigPluginConfig struct {
 	name      string
 	signature []fwk.SignFragment
```
Regression Test: TestHandleSchedulingFailureSkipsRecreatedPod – Issues
- **medium** The `popped` variable returned by `queue.Pop` is a `*framework.QueuedPodInfo`, but the test never verifies that `popped.Pod.UID == "old-uid"` before passing it to `handleSchedulingFailure`. If the queue implementation ever returns the wrong entry, this silent assumption could make the test pass for the wrong reason.
- **low** The `wait.PollUntilContextTimeout` uses `wait.ForeverTestTimeout` as the deadline. If the in-flight pod is never cleared (i.e., the bug is present), the test will hang for the full timeout duration rather than failing quickly; a shorter, explicit timeout (e.g., 5 seconds) would give faster feedback during CI.
- **low** The `Scheduler` struct is initialized with only the `client` and `SchedulingQueue` fields, leaving other fields (e.g., `nodeInfoSnapshot`, `percentageOfNodesToScore`) as zero values. This is fine for this specific test path, but a brief comment explaining why a minimal struct is sufficient would improve maintainability.
- **low** There is no assertion that the `oldPod`'s status in the API server was not patched; the test only checks the recreated pod. If the fix accidentally patched the wrong object (using the old pod's name but the new pod's namespace, for instance), this would go undetected. Asserting a 404 or verifying no status update was sent for `old-uid` would close this gap. (Though with a fake clientset keyed by name this is tricky, so at minimum a comment acknowledging the limitation would help.)
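The precondition check suggested in the first issue could look like the sketch below. The types here are minimal stand-ins invented for illustration (the real test would assert directly on the `*framework.QueuedPodInfo` returned by `queue.Pop`):

```go
package main

import "fmt"

// Hypothetical stand-ins for the framework types used by the test; the real
// QueuedPodInfo carries a full *v1.Pod, not just a UID.
type pod struct{ UID string }
type queuedPodInfo struct{ Pod *pod }

// assertPoppedUID models the suggested guard: fail fast if the queue returned
// an entry other than the one the test enqueued, instead of silently passing
// a wrong pod to handleSchedulingFailure.
func assertPoppedUID(popped *queuedPodInfo, want string) error {
	if popped.Pod.UID != want {
		return fmt.Errorf("popped pod has UID %q, want %q", popped.Pod.UID, want)
	}
	return nil
}

func main() {
	popped := &queuedPodInfo{Pod: &pod{UID: "old-uid"}}
	if err := assertPoppedUID(popped, "old-uid"); err != nil {
		fmt.Println("precondition failed:", err)
		return
	}
	fmt.Println("precondition holds: popped UID is old-uid")
}
```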
Review Complete
To recap: a pod deleted and recreated with the same name during failure handling leaves a stale entry in the inFlightPods map, causing unbounded growth of inFlightEvents and related metrics. The fix captures the original in-flight UID before handleSchedulingFailure refreshes the pod object from the informer, and passes that UID explicitly to AddUnschedulableIfNotPresent so the correct entry is cleaned up.

Key risks to keep in mind
- The new `schedulingCycleInFlightUID` parameter is added to `AddUnschedulableIfNotPresent`; all existing call sites pass an empty string which preserves prior behavior, but any missed call site would silently regress to the old buggy behavior.
- The fix relies on capturing the UID at the right moment (before the informer refresh); if the capture point is wrong or the variable is not threaded correctly through `determineSchedulingHintForInFlightPod` and `clusterEventsForPod`, in-flight lookups could still mismatch.
- The test coverage is limited to the new `TestPriorityQueue_AddUnschedulableIfNotPresent_PodRecreatedSameName` case with `SchedulerQueueingHints` enabled; the bug scenario with hints disabled is not explicitly covered.
- Race conditions in the narrow window between pod deletion/recreation and scheduling failure handling are inherently hard to test deterministically; the new test may not catch all timing variants.