Paused jobs in Prompt Reco due to MaxPSS reached #46040
cms-bot internal usage |
A new Issue was created by @malbouis. @Dr15Jones, @antoniovilela, @makortel, @mandrenguyen, @rappoccio, @sextonkennedy, @smuzaffar can you please review it and eventually sign/assign? Thanks. cms-bot commands are listed here |
There seems to be very little information there. E.g. I don't see CMSSW logs or the configuration. The
The job stayed quite steadily under 11 GB for almost 6 hours, and then in the last 1.5 hours the memory usage increased by 5 GB. |
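For reference, the RSS/PSS numbers discussed in this thread come from the kernel's per-process accounting. A minimal sketch of reading them directly (assuming a Linux host where /proc/&lt;pid&gt;/smaps_rollup is available; the PID argument is a placeholder, not taken from this job):

```python
#!/usr/bin/env python3
# Sketch: read the current RSS and PSS of a process from /proc/<pid>/smaps_rollup.
# Assumes a Linux kernel that provides smaps_rollup; the PID is a placeholder.
import sys

def read_rss_pss_kb(pid):
    """Return (RSS, PSS) in kB for the given PID (or 'self')."""
    rss = pss = None
    with open(f"/proc/{pid}/smaps_rollup") as f:
        for line in f:
            if line.startswith("Rss:"):
                rss = int(line.split()[1])  # value is in kB
            elif line.startswith("Pss:"):
                pss = int(line.split()[1])  # value is in kB
    return rss, pss

if __name__ == "__main__":
    pid = sys.argv[1] if len(sys.argv) > 1 else "self"
    rss_kb, pss_kb = read_rss_pss_kb(pid)
    print(f"RSS = {rss_kb / 1024**2:.2f} GB, PSS = {pss_kb / 1024**2:.2f} GB")
```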
assign reconstruction, dqm
Just guessing that the high memory usage would be caused by the application code |
New categories assigned: reconstruction,dqm @jfernan2,@mandrenguyen,@rvenditti,@syuvivida,@tjavaid,@nothingface0,@antoniovagnerini you have been requested to review this Pull request/Issue and eventually sign? Thanks |
We have another paused job for the muon PD that exceeded the memory limit. The tarball can be found here:
I copied the RAW input file to the following location so that the issue can be reproduced anytime:
Best, |
Looking at the |
Confused... the log file in /eos/home-c/cmst0/public/PausedJobs/Run2024G/maxPSS/PromptReco_Run386319_Muon1/job/WMTaskSpace/cmsRun1 is from run 386037 - or so the fwk thinks. The job ran for 40 hrs and 200k events - e.g., 9 kHz into a PD. Seems like garbage data (certainly not good cosmics/circulating data) |
28-Sep-2024 07:02:12 UTC Initiating request to open file root://eoscms.cern.ch//eos/cms/tier0/store/data/Run2024H/Cosmics/RAW/v1/000/386/037/00000/8959d673-4a4c-487b-8e25-213767c3a788.root?eos.app=cmst0
Indeed, lumi section 100 of this run has very high rates. |
Oh sorry, I mixed up the job tarballs of a different paused job. Please ignore my previous comment about the files. The one you analyzed (run 386037) was a cosmic run where DT got out of global run at LS ~ 100. I will try to get another example, hoping the files are still on disk. |
I found the correct tarball + RAW file and copied them over to the same location (removed the old files):
The maxPSS error is visible in the |
New look into
While VSIZE grows somewhat gradually (even if in steps) after the first ~1000 events, the RSS shows rapid growth towards the end of the job, starting around the 7160th event (the job processed a total of 7382 events). (By the way, the earlier case could have been interesting to study as well, to find the ~58 kB/event hoarding/leak, even if the physics content was garbage.) |
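The per-event numbers above can be pulled out of the cmsRun log when the SimpleMemoryCheck service is enabled; a rough sketch below, assuming the usual "MemoryCheck: event : VSIZE &lt;vsize&gt; &lt;delta&gt; RSS &lt;rss&gt; &lt;delta&gt;" line format (values in MB) and a placeholder log file name:

```python
#!/usr/bin/env python3
# Sketch: pull per-event VSIZE/RSS out of SimpleMemoryCheck lines in a cmsRun log
# and estimate the average RSS growth per event. The log file name is a placeholder,
# and the "MemoryCheck: event : VSIZE <v> <dv> RSS <r> <dr>" format (values in MB)
# is an assumption about the service's output.
import re

pattern = re.compile(r"MemoryCheck: event : VSIZE (\S+) \S+ RSS (\S+) \S+")

vsize, rss = [], []
with open("cmsRun1-stdout.log") as f:  # placeholder log file name
    for line in f:
        m = pattern.search(line)
        if m:
            vsize.append(float(m.group(1)))
            rss.append(float(m.group(2)))

if len(rss) > 1:
    growth_mb = rss[-1] - rss[0]
    per_event_kb = 1024.0 * growth_mb / (len(rss) - 1)
    print(f"{len(rss)} samples: RSS {rss[0]:.0f} -> {rss[-1]:.0f} MB "
          f"(~{per_event_kb:.1f} kB/event on average)")
```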
We have a new instance of this issue. I put the tarball and RAW input here in case it can help the investigation: |
I wonder if these "sudden growth" periods could be related to an unrelated process being terminated in the presence of high fragmentation, which was brought up e.g. in #42387. |
@germanfgv Have you tested if resubmission of the paused job would make any difference? I'm just wondering if we'd already have any evidence for these RSS/PSS rapid growths being reproducible or not. |
Continuing on
I ran the job on cmsdev42 (via an slc7 container on an el8 host), and got the following RSS and VSIZE (to be compared to #46040 (comment)). I think the difference in RSS behavior hints at the behavior being dependent on the overall system state (supporting the hypothesis of the "RSS storm in presence of high fragmentation"). |
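For comparisons like this across hosts, a simple way to record the RSS evolution of the running cmsRun process is to sample /proc/&lt;pid&gt;/status periodically; a minimal sketch (the PID comes from the command line, the sampling interval is a placeholder):

```python
#!/usr/bin/env python3
# Sketch: sample VmRSS of a running process every 30 s and print a timestamped
# trace, so the RSS evolution on two hosts can be compared side by side.
import sys
import time

def vmrss_kb(pid):
    with open(f"/proc/{pid}/status") as f:
        for line in f:
            if line.startswith("VmRSS:"):
                return int(line.split()[1])  # kB
    return None

pid = sys.argv[1]
while True:
    try:
        rss = vmrss_kb(pid)
    except FileNotFoundError:
        break  # the process has exited
    if rss is None:
        break
    print(f"{time.strftime('%H:%M:%S')} RSS {rss / 1024:.0f} MB", flush=True)
    time.sleep(30)  # placeholder sampling interval
```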
Two more occurred and are reported here:
The RAW files are also copied over. |
@jeyserma Does Tier0 usually retry these jobs, or fail them after the first attempt? |
maxPSS paused jobs are automatically retried 3 times by our agent. For this particular memory issue, we increased the memory limit to 17 GB (the default is 16 GB for 8 cores), and they ran fine. |
Thanks @jeyserma. Are these failure reports posted only if the job has failed all 4 times (i.e. it continues to fail), or also if it failed at some point even though a later retry succeeded? |
Under the assumption that we should reduce the memory footprint in general, I ran IgProf in the job of
The total amount of allocated memory is 4.22 GB, which divides roughly into the following (including only the largest contributors or ones that are otherwise interesting): |
This one already had an issue open in #42995 |
Spun off to #46446 |
Spun off to #46448 |
Spun off to #46449 |
Spun off to #46450 |
Hi @makortel. What I claimed previously was wrong: a PromptReco job that exceeds memory is never retried automatically (that is only true for Express, hence my confusion). We had a few extra paused jobs last week due to this memory issue, and I decided to retry them without increasing the memory; they all finished successfully. That is probably not true for all maxMemory jobs that were paused, though we can't try it anymore. Nevertheless, on average the Muon PD is probably closer to the maxMemory limit, and therefore we see such an increase in paused jobs. |
Thanks @jeyserma!
Would the logs of those failed jobs still be available? I'd like to collect more statistics on the "failing behavior". Given that the evidence so far suggests the operating system's dynamic behavior plays a role, my suggestion would be to retry a job paused because of reaching MaxPSS once (or maybe twice), either with the same or a slightly larger limit. |
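Purely as an illustration of that proposal (not the Tier0 agent's actual logic; the function name, defaults, and units below are made up):

```python
# Illustrative sketch only: a possible retry policy for MaxPSS-paused jobs.
def next_memory_limit(attempt, limit_gb, max_retries=2, bump_gb=1):
    """Return the limit (GB) for the next attempt, or None to pause the job."""
    if attempt >= max_retries:
        return None  # give up and pause for operator attention
    # First retry with the same limit, later retries with a slightly larger one.
    return limit_gb if attempt == 0 else limit_gb + bump_gb

# Example: attempt 0 failed -> retry at 16 GB; attempt 1 failed -> retry at 17 GB;
# attempt 2 failed -> pause.
for attempt in range(3):
    print(attempt, next_memory_limit(attempt, 16))
```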
Spun off to #46466 |
Just out of curiosity I collected the total cost of ML algorithms from the aforementioned IgProf memory profiles (the numbers reflect the state of a long-running 8-thread/stream job; not counting the
So a total of about 620 MB. FYI @cms-sw/ml-l2 |
Spun off to #46494 |
Spun off to #46498 |
@cms-sw/tracking-pog-l2 |
Seems like a large fraction is the cost of recomputing a hit depending on the track parameters. Unfortunately that's apparently inlined and not clearly visible in the IgProf profile:
cmssw/RecoTracker/TransientTrackingRecHit/src/TkClonerImpl.cc Lines 35 to 40 in b96fd02
vs https://mkortela.web.cern.ch/mkortela/cgi-bin/navigator/issue46040/test_17_total/304
1M in
Does it matter? |
The memory churn from O(1 MHz) of memory allocations may have a significant impact on memory getting more fragmented, which could then lead the OS to use (much) more RSS in some situations (I mean, this sounds plausible, but there is little direct evidence beyond what was presented in #42387). Or in other words, the main practical impact of memory churn is some slowdown, until it gets so bad that everything breaks. |
Dear all,
There are two jobs that failed due to reaching MaxPSS in Tier0 processing:
The tarball can be found at
/eos/home-c/cmst0/public/PausedJobs/Run2024G/maxPSS/job_569724
Would experts please investigate?
Thanks!