Paused jobs in Prompt Reco due to MaxPSS reached #46040
cms-bot internal usage |
A new Issue was created by @malbouis. @Dr15Jones, @antoniovilela, @makortel, @mandrenguyen, @rappoccio, @sextonkennedy, @smuzaffar can you please review it and eventually sign/assign? Thanks. cms-bot commands are listed here |
There seems to be very little information there. E.g. I don't see CMSSW logs or the configuration. The
The job stayed quite steadily under 11 GB for almost 6 hours, and then in the last 1.5 hours the memory usage increased by 5 GB. |
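For reference, the RSS/PSS numbers discussed in this thread come from the kernel's per-process accounting. A minimal sketch of reading them directly (assuming a Linux host where /proc/&lt;pid&gt;/smaps_rollup is available; the PID argument is a placeholder, not taken from this job):

```python
#!/usr/bin/env python3
# Sketch: read the current RSS and PSS of a process from /proc/<pid>/smaps_rollup.
# Assumes a Linux kernel that provides smaps_rollup; the PID is a placeholder.
import sys

def read_rss_pss_kb(pid):
    """Return (RSS, PSS) in kB for the given PID (or 'self')."""
    rss = pss = None
    with open(f"/proc/{pid}/smaps_rollup") as f:
        for line in f:
            if line.startswith("Rss:"):
                rss = int(line.split()[1])  # value is in kB
            elif line.startswith("Pss:"):
                pss = int(line.split()[1])  # value is in kB
    return rss, pss

if __name__ == "__main__":
    pid = sys.argv[1] if len(sys.argv) > 1 else "self"
    rss_kb, pss_kb = read_rss_pss_kb(pid)
    print(f"RSS = {rss_kb / 1024**2:.2f} GB, PSS = {pss_kb / 1024**2:.2f} GB")
```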
assign reconstruction, dqm
Just guessing that the high memory usage would be caused by the application code |
New categories assigned: reconstruction,dqm @jfernan2,@mandrenguyen,@rvenditti,@syuvivida,@tjavaid,@nothingface0,@antoniovagnerini you have been requested to review this Pull request/Issue and eventually sign? Thanks |
We have another paused job for the muon PD that exceeded the memory limit. The tarball can be found here:
I copied the RAW input file to the following location so that the issue can be reproduced anytime:
Best, |
Looking at the |
Confused... the log file in /eos/home-c/cmst0/public/PausedJobs/Run2024G/maxPSS/PromptReco_Run386319_Muon1/job/WMTaskSpace/cmsRun1 is from run 386037 - or so the fwk thinks. The job ran for 40 hrs and 200k events - e.g., 9 kHz into a PD. Seems like garbage data (certainly not good cosmics/circulating data) |
28-Sep-2024 07:02:12 UTC Initiating request to open file root://eoscms.cern.ch//eos/cms/tier0/store/data/Run2024H/Cosmics/RAW/v1/000/386/037/00000/8959d673-4a4c-487b-8e25-213767c3a788.root?eos.app=cmst0
Indeed, lumi section 100 of this run has very high rates. |
Oh sorry, I mixed up the job tarballs of a different paused job. Please ignore my previous comment about the files. The one you analyzed (run 386037) was a cosmic run where DT got out of global run at LS ~ 100. I will try to get another example, hoping the files are still on disk. |
I found the correct tarball + RAW file and copied them over to the same location (removed the old files):
The maxPSS error is visible in the |
New look into
While VSIZE grows somewhat gradually (even if in steps) after the first ~1000 events, the RSS shows rapid growth towards the end of the job, starting around the 7160th event (the job processed a total of 7382 events). (By the way, the earlier case could have been interesting to study as well, to find the ~58 kB/event hoarding/leak, even if the physics content was garbage.) |
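The per-event numbers above can be pulled out of the cmsRun log when the SimpleMemoryCheck service is enabled; a rough sketch below, assuming the usual "MemoryCheck: event : VSIZE &lt;vsize&gt; &lt;delta&gt; RSS &lt;rss&gt; &lt;delta&gt;" line format (values in MB) and a placeholder log file name:

```python
#!/usr/bin/env python3
# Sketch: pull per-event VSIZE/RSS out of SimpleMemoryCheck lines in a cmsRun log
# and estimate the average RSS growth per event. The log file name is a placeholder,
# and the "MemoryCheck: event : VSIZE <v> <dv> RSS <r> <dr>" format (values in MB)
# is an assumption about the service's output.
import re

pattern = re.compile(r"MemoryCheck: event : VSIZE (\S+) \S+ RSS (\S+) \S+")

vsize, rss = [], []
with open("cmsRun1-stdout.log") as f:  # placeholder log file name
    for line in f:
        m = pattern.search(line)
        if m:
            vsize.append(float(m.group(1)))
            rss.append(float(m.group(2)))

if len(rss) > 1:
    growth_mb = rss[-1] - rss[0]
    per_event_kb = 1024.0 * growth_mb / (len(rss) - 1)
    print(f"{len(rss)} samples: RSS {rss[0]:.0f} -> {rss[-1]:.0f} MB "
          f"(~{per_event_kb:.1f} kB/event on average)")
```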
We have a new instance of this issue. I put the tarball and RAW input here in case it can help the investigation: |
I wonder if these "sudden growth" periods could be related to an unrelated process being terminated in the presence of high fragmentation, which was brought up e.g. in #42387. |
@germanfgv Have you tested if resubmission of the paused job would make any difference? I'm just wondering if we'd already have any evidence for these RSS/PSS rapid growths being reproducible or not. |
Continuing on
I ran the job on cmsdev42 (via an slc7 container on an el8 host), and got the following RSS and VSIZE (to be compared to #46040 (comment)). I think the difference in RSS behavior hints at the behavior being dependent on the overall system state (supporting the hypothesis of the "RSS storm in presence of high fragmentation"). |
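For comparisons like this across hosts, a simple way to record the RSS evolution of the running cmsRun process is to sample /proc/&lt;pid&gt;/status periodically; a minimal sketch (the PID comes from the command line, the sampling interval is a placeholder):

```python
#!/usr/bin/env python3
# Sketch: sample VmRSS of a running process every 30 s and print a timestamped
# trace, so the RSS evolution on two hosts can be compared side by side.
import sys
import time

def vmrss_kb(pid):
    with open(f"/proc/{pid}/status") as f:
        for line in f:
            if line.startswith("VmRSS:"):
                return int(line.split()[1])  # kB
    return None

pid = sys.argv[1]
while True:
    try:
        rss = vmrss_kb(pid)
    except FileNotFoundError:
        break  # the process has exited
    if rss is None:
        break
    print(f"{time.strftime('%H:%M:%S')} RSS {rss / 1024:.0f} MB", flush=True)
    time.sleep(30)  # placeholder sampling interval
```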
Two more occurred and are reported here:
The RAW files are also copied over. |
@jeyserma Does Tier0 usually retry these jobs, or fail them after the first attempt? |
maxPSS paused jobs are automatically retried 3 times by our agent. For this particular memory issue, we increased the memory limit to 17 GB (the default is 16 GB for 8 cores), and they ran fine. |
Thanks @jeyserma. Are these failure reports posted only if the job has failed all 4 times (i.e. it continues to fail), or also if it failed at some point even though a later retry succeeded? |
Under the assumption that we should reduce the memory footprint in general, I ran IgProf in the job of
The total amount of allocated memory is 4.22 GB, which divides roughly into the following (including only the largest contributors or ones that are otherwise interesting): |
This one already had an issue open in #42995 |
Spun off to #46446 |
Spun off to #46448 |
Spun off to #46449 |
Spun off to #46450 |
Hi @makortel. What I claimed previously was wrong: a PromptReco job that exceeds memory is never retried automatically (that is only true for Express, hence my confusion). We had a few extra paused jobs last week due to this memory issue, and I decided to retry them without increasing the memory; they all finished successfully. That is probably not true for all maxMemory jobs that were paused, though we can't try it anymore. Nevertheless, on average the Muon PD is probably closer to the maxMemory limit, and therefore we see such an increase in paused jobs. |
Thanks @jeyserma!
Would the logs of those failed jobs still be available? I'd like to collect more statistics on the "failing behavior". Given that the evidence so far suggests the operating system's dynamic behavior plays a role, my suggestion would be to retry a job paused because of reaching MaxPSS once (or maybe twice), either with the same or a slightly larger limit. |
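Purely as an illustration of that proposal (not the Tier0 agent's actual logic; the function name, defaults, and units below are made up):

```python
# Illustrative sketch only: a possible retry policy for MaxPSS-paused jobs.
def next_memory_limit(attempt, limit_gb, max_retries=2, bump_gb=1):
    """Return the limit (GB) for the next attempt, or None to pause the job."""
    if attempt >= max_retries:
        return None  # give up and pause for operator attention
    # First retry with the same limit, later retries with a slightly larger one.
    return limit_gb if attempt == 0 else limit_gb + bump_gb

# Example: attempt 0 failed -> retry at 16 GB; attempt 1 failed -> retry at 17 GB;
# attempt 2 failed -> pause.
for attempt in range(3):
    print(attempt, next_memory_limit(attempt, 16))
```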
Spun off to #46466 |
Just out of curiosity I collected the total cost of ML algorithms from the aforementioned IgProf memory profiles (the numbers reflect the state of a long-running 8-thread/stream job; not counting the
So a total of about 620 MB. FYI @cms-sw/ml-l2 |
Spun off to #46494 |
Spun off to #46498 |
@cms-sw/tracking-pog-l2 |
Seems like a large fraction is the cost of recomputing a hit depending on the track parameters. Unfortunately that's apparently inlined and not clearly visible in the IgProf profile:
cmssw/RecoTracker/TransientTrackingRecHit/src/TkClonerImpl.cc Lines 35 to 40 in b96fd02
vs https://mkortela.web.cern.ch/mkortela/cgi-bin/navigator/issue46040/test_17_total/304
1M in
Does it matter? |
The memory churn from O(1 MHz) of memory allocations may have a significant impact on memory getting more fragmented, which could then lead the OS to use (much) more RSS in some situations (I mean, this sounds plausible, but there is little direct evidence beyond what was presented in #42387). Or in other words, the main practical impact of memory churn is some slowdown, until it gets so bad that everything breaks. |
Dear all,
There are two jobs that failed due to reaching MaxPSS in Tier0 processing:
The tarball can be found at
/eos/home-c/cmst0/public/PausedJobs/Run2024G/maxPSS/job_569724
Would experts please investigate?
Thanks!