If pendingPodConditions isn't set, KEDA never detects any pending jobs #6157

Makeshift opened this issue Sep 12, 2024 · 2 comments

Makeshift commented Sep 12, 2024

Report

If pendingPodConditions is not set on a ScaledJob that uses the accurate scaling strategy, KEDA appears to always report the number of pending jobs as 0. As a result, if a job takes longer to start up than the trigger's polling interval, you end up with duplicate jobs.

Expected Behavior

When using scalingStrategy.strategy: accurate, I'd expect KEDA to correctly count the pods that have been scheduled but are not yet running, and to calculate the quantity to scale up by as QueueLength - RunningJobs - PendingJobs.
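
For example, with a queue length of 3, 0 running jobs, and 3 pending jobs, that calculation gives 3 - 0 - 3 = 0, so no additional jobs should be created on the next poll.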

Actual Behavior

If pendingPodConditions is not set in your ScaledJob, KEDA always reports the number of pending jobs as 0, resulting in duplicate jobs if your job has a long startup time.

Steps to Reproduce the Problem

  1. Define a ScaledJob with the spec fields below and a pod spec that will spend at least 10 seconds in the pending state:

jobTargetRef:
  # pod template whose pods stay Pending for at least 10 seconds
pollingInterval: 5
scalingStrategy:
  strategy: accurate

  2. Add 3 items to the trigger such that KEDA will scale up to 3 replicas. These 3 replicas should stay in the 'pending' state; however, the KEDA operator will log "Number of pending Jobs": 0

  3. After 5 seconds (the next poll), KEDA will attempt to launch 3 additional replicas (bringing the total to 6) because it does not see the pending jobs.

  4. Reset and modify your ScaledJob to add all possible pendingPodConditions (a complete manifest sketch follows these steps):

jobTargetRef:
  # pod template whose pods stay Pending for at least 10 seconds
pollingInterval: 5
scalingStrategy:
  strategy: accurate
  pendingPodConditions:
    - Ready
    - PodReadyToStartContainers
    - ContainersReady
    - Initialized
    - PodScheduled
  5. Add 3 items to the trigger such that KEDA will scale up to 3 replicas. These 3 replicas should stay in the 'pending' state, and KEDA should log
"Number of pending Jobs": 4
No need to create jobs - all requested jobs already exist
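
For reference, here is a minimal ScaledJob sketch that combines the working configuration above with an SQS trigger. The name, namespace, image, and queue URL are placeholders rather than values from this report, and the trigger assumes the aws-sqs-queue scaler mentioned under Scaler Details:

apiVersion: keda.sh/v1alpha1
kind: ScaledJob
metadata:
  name: sqs-consumer            # placeholder name
  namespace: staging
spec:
  jobTargetRef:
    backoffLimit: 0
    template:
      spec:
        restartPolicy: Never
        containers:
          - name: consumer
            image: example.com/consumer:latest   # placeholder image
  pollingInterval: 5
  scalingStrategy:
    strategy: accurate
    pendingPodConditions:
      - Ready
      - PodReadyToStartContainers
      - ContainersReady
      - Initialized
      - PodScheduled
  triggers:
    - type: aws-sqs-queue
      metadata:
        queueURL: https://sqs.eu-west-1.amazonaws.com/000000000000/example-queue.fifo   # placeholder queue
        queueLength: "1"
        awsRegion: eu-west-1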

Logs from KEDA operator

15s poll time with pendingPodConditions unset, comments mine:

2024-09-12T14:48:13Z	INFO	scaleexecutor	Scaling Jobs	{"scaledJob.Name": "qj-editionsendconsumer-staging", "scaledJob.Namespace": "staging", "Number of running Jobs": 0}
2024-09-12T14:48:13Z	INFO	scaleexecutor	Scaling Jobs	{"scaledJob.Name": "qj-editionsendconsumer-staging", "scaledJob.Namespace": "staging", "Number of pending Jobs": 0}
# 3 items submitted to queue - 3 jobs created and pending
2024-09-12T14:48:13Z	INFO	scaleexecutor	Creating jobs	{"scaledJob.Name": "qj-editionsendconsumer-staging", "scaledJob.Namespace": "staging", "Effective number of max jobs": 3}
2024-09-12T14:48:13Z	INFO	scaleexecutor	Creating jobs	{"scaledJob.Name": "qj-editionsendconsumer-staging", "scaledJob.Namespace": "staging", "Number of jobs": 3}
2024-09-12T14:48:13Z	INFO	scaleexecutor	Created jobs	{"scaledJob.Name": "qj-editionsendconsumer-staging", "scaledJob.Namespace": "staging", "Number of jobs": 3}
# KEDA claims 0 pending jobs
2024-09-12T14:48:28Z	INFO	scaleexecutor	Scaling Jobs	{"scaledJob.Name": "qj-editionsendconsumer-staging", "scaledJob.Namespace": "staging", "Number of running Jobs": 0}
2024-09-12T14:48:28Z	INFO	scaleexecutor	Scaling Jobs	{"scaledJob.Name": "qj-editionsendconsumer-staging", "scaledJob.Namespace": "staging", "Number of pending Jobs": 0}
# No new items added, previous 3 jobs still pending, KEDA creates 3 more jobs
2024-09-12T14:48:28Z	INFO	scaleexecutor	Creating jobs	{"scaledJob.Name": "qj-editionsendconsumer-staging", "scaledJob.Namespace": "staging", "Effective number of max jobs": 6}
2024-09-12T14:48:28Z	INFO	scaleexecutor	Creating jobs	{"scaledJob.Name": "qj-editionsendconsumer-staging", "scaledJob.Namespace": "staging", "Number of jobs": 3}
2024-09-12T14:48:28Z	INFO	scaleexecutor	Created jobs	{"scaledJob.Name": "qj-editionsendconsumer-staging", "scaledJob.Namespace": "staging", "Number of jobs": 3}
# The first 3 jobs are finally now running, KEDA still can't see the most recent 3 jobs that are pending
2024-09-12T14:48:43Z	INFO	scaleexecutor	Scaling Jobs	{"scaledJob.Name": "qj-editionsendconsumer-staging", "scaledJob.Namespace": "staging", "Number of running Jobs": 3}
2024-09-12T14:48:43Z	INFO	scaleexecutor	Scaling Jobs	{"scaledJob.Name": "qj-editionsendconsumer-staging", "scaledJob.Namespace": "staging", "Number of pending Jobs": 0}

15s poll time with pendingPodConditions set to all conditions; only part of the log, but it shows KEDA working correctly:

2024-09-12T14:42:13Z	INFO	scaleexecutor	Scaling Jobs	{"scaledJob.Name": "qj-editionsendconsumer-staging", "scaledJob.Namespace": "staging", "Number of pending Jobs": 4}
2024-09-12T14:42:13Z	INFO	scaleexecutor	No need to create jobs - all requested jobs already exist	{"scaledJob.Name": "qj-editionsendconsumer-staging", "scaledJob.Namespace": "staging", "jobs": 0}

KEDA Version

2.15.1

Kubernetes Version

1.28

Platform

Amazon Web Services

Scaler Details

SQS FIFO queue

Anything else?

I dug into this a little and I think the culprit is here. It looks like this function may be reporting a pod as running or completed when it is actually pending.

My conclusion that this is a bug is based on how the default behaviour is described here:

Default behavior - Job that have not finished yet and the underlying pod is either not running or has not been completed yet
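
To make the suspected failure mode concrete, here is a minimal Go sketch of what that documented default ought to amount to. The function names and structure are my own illustration, not KEDA's actual code, which may differ:

package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// isPodRunningOrCompleted is an illustrative helper (names are mine, not KEDA's):
// a pod counts as "running or completed" only once its phase is Running or Succeeded.
func isPodRunningOrCompleted(pod corev1.Pod) bool {
	return pod.Status.Phase == corev1.PodRunning || pod.Status.Phase == corev1.PodSucceeded
}

// jobIsPendingByDefault mirrors the documented default behaviour: a job counts as
// pending while none of its pods are running or completed yet.
func jobIsPendingByDefault(pods []corev1.Pod) bool {
	for _, pod := range pods {
		if isPodRunningOrCompleted(pod) {
			return false
		}
	}
	return true
}

func main() {
	// A pod that has been created but whose containers have not started yet.
	pendingPod := corev1.Pod{Status: corev1.PodStatus{Phase: corev1.PodPending}}

	// Per the docs, a job whose only pod is Pending should count as pending.
	fmt.Println(jobIsPendingByDefault([]corev1.Pod{pendingPod})) // true
}

Under that reading, a job whose pod is still Pending should be counted as pending, which is not what the "Number of pending Jobs": 0 log lines above show.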

quiqueg commented Oct 11, 2024

I noticed the same behavior and can confirm that adding pendingPodConditions fixed it for us.


andretibolaintelipost commented Oct 12, 2024

Adding pendingPodConditions also fixed it for me, but it took me a while to get there. I resorted to a long polling interval at first, but that was affecting end users. Only after coming across this issue was I finally able to scale without duplicates and with a short polling interval.
