
Need to pull logs multiple ways and compare them #305

Open

jenimal opened this issue Jan 24, 2017 · 16 comments

jenimal commented Jan 24, 2017

Last week at the WMArchive meeting it was requested that we pull logs three different ways and compare them:

1. "the old way", by pulling them off the agent:
   https://twiki.cern.ch/twiki/bin/view/CMSPublic/CompOpsWorkflowTeamLeaderResponsibilities#Retrieving_log_files_from_failed
2. via the instructions at:
   https://github.com/dmwm/WMArchive/wiki/Given-a-LFN,-return-a-logArchive-or-logCollect-tarball-location
3. by looking at what Unified pulls

The goal is to see whether we get the same information all 3 ways, and whether that information is what the sites need to debug workflows.
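For the comparison step itself, a minimal sketch (the two tarball names below are placeholders for archives fetched via two different routes) is to diff their member lists:

# Hypothetical comparison of two log tarballs retrieved different ways:
# Job_*.tar.bz2 pulled off the agent, logArchive*.tar.gz pulled from EOS/WMArchive.
diff <(tar tjf Job_from_agent.tar.bz2 | sort) \
     <(tar tzf logArchive_from_archive.tar.gz | sort)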

jenimal commented Jan 27, 2017

Start by looking at workflow:
pdmvserv_task_TSG-PhaseIFall16GS-00013__v1_T_170125_113822_6705
going after a 134 failure
+++++++++++++++++++++++++++++++++++++++++++++++++++++++
Unified has the following:
https://cmst2.web.cern.ch/cmst2/unified/report/pdmvserv_task_TSG-PhaseIFall16GS-00013__v1_T_170125_113822_6705#134

This is what WMStats has to say about it:
Job Name: 71ad71c2-e368-11e6-8fee-002590494fb0-134
WMBS job id: 2047173
Workflow: pdmvserv_task_TSG-PhaseIFall16GS-00013__v1_T_170125_113822_6705
Task: /pdmvserv_task_TSG-PhaseIFall16GS-00013__v1_T_170125_113822_6705/TSG-PhaseIFall16GS-00013_0/TSG-PhaseIFall16GS-00013_0MergeRAWSIMoutput/TSG-PhaseIFall16DR-00001_0
Status: jobfailed
Input dataset:
Site: T2_US_Florida
Agent: vocms0304.cern.ch
ACDC URL: https://cmsweb.cern.ch/couchdb/acdcserver
State Transition
Exit code: 134
Retry count: 3
Errors:
logArch1
cmsRun1
CMSSWStepFailure (Exit Code: 134)
Adding last 25 lines of CMSSW stdout:
Begin processing the 764th record. Run 1, Event 4681529, LumiSection 9197 at 27-Jan-2017 11:47:43.960 EST
Begin processing the 765th record. Run 1, Event 4681530, LumiSection 9197 at 27-Jan-2017 11:48:03.450 EST
Begin processing the 766th record. Run 1, Event 4681528, LumiSection 9197 at 27-Jan-2017 11:48:06.635 EST
%MSG-w InvalidHelix: SeedGeneratorFromRegionHitsEDProducer:hltPAIter5PixelLessSeeds 27-Jan-2017 11:48:21 EST Run: 1 Event: 4681527
PixelClusterShapeSeedComparitor helix is not valid, result is bad
%MSG
Begin processing the 767th record. Run 1, Event 4681533, LumiSection 9197 at 27-Jan-2017 11:48:26.111 EST
Begin processing the 768th record. Run 1, Event 4681532, LumiSection 9197 at 27-Jan-2017 11:48:35.665 EST
%MSG-w InvalidHelix: SeedGeneratorFromRegionHitsEDProducer:hltPAIter6TobTecSeedsPair 27-Jan-2017 11:48:52 EST Run: 1 Event: 4681528
PixelClusterShapeSeedComparitor helix is not valid, result is bad
%MSG
Begin processing the 769th record. Run 1, Event 4681534, LumiSection 9197 at 27-Jan-2017 11:48:54.684 EST
Begin processing the 770th record. Run 1, Event 4681531, LumiSection 9197 at 27-Jan-2017 11:48:55.746 EST
Begin processing the 771st record. Run 1, Event 4681535, LumiSection 9197 at 27-Jan-2017 11:49:15.523 EST
Begin processing the 772nd record. Run 1, Event 4681539, LumiSection 9197 at 27-Jan-2017 11:49:20.260 EST
Begin processing the 773rd record. Run 1, Event 4681538, LumiSection 9197 at 27-Jan-2017 11:49:34.538 EST
Begin processing the 774th record. Run 1, Event 4681537, LumiSection 9197 at 27-Jan-2017 11:49:59.378 EST
Begin processing the 775th record. Run 1, Event 4681540, LumiSection 9197 at 27-Jan-2017 11:50:07.452 EST
Begin processing the 776th record. Run 1, Event 4681536, LumiSection 9197 at 27-Jan-2017 11:50:16.873 EST
Begin processing the 777th record. Run 1, Event 4681543, LumiSection 9197 at 27-Jan-2017 11:50:51.183 EST
Begin processing the 778th record. Run 1, Event 4681541, LumiSection 9197 at 27-Jan-2017 11:50:54.028 EST
terminate called after throwing an instance of 'std::bad_alloc'
what(): std::bad_alloc
Complete
process id is 8273 status is 134
CmsRunFailure
Message: Error running cmsRun
{'arguments': ['/bin/bash', '/tmp/838534.cms_slurm_job/glide_Iv0D3A/execute/dir_6508/job/WMTaskSpace/cmsRun1/cmsRun1-main.sh', '', 'slc6_amd64_gcc530', 'scramv1', 'CMSSW', 'CMSSW_8_1_0', 'FrameworkJobReport.xml', 'cmsRun', 'PSet.py', '', '', '']}
Return code: 134

ModuleName : WMCore.WMSpec.Steps.WMExecutionFailure
MethodName : __init__
ClassInstance : None
FileName : /tmp/838534.cms_slurm_job/glide_Iv0D3A/execute/dir_6508/job/WMCore.zip/WMCore/WMSpec/Steps/WMExecutionFailure.py
ClassName : None
LineNumber : 18
ErrorNr : 134

Traceback:

BadFWJRXML (Exit Code: 50115)
Error reading XML job report file, possibly corrupt XML File:
Details: no element found: line 1782, column 0
ErrorLoggingAddition (Exit Code: 99999)
Adding extra error in order to hold error report

Adding last ten lines of CMSSW stderr:
WARNING: In non-interactive mode release checks e.g. deprecated releases, production architectures are disabled.
WARNING: There already exists /tmp/838534.cms_slurm_job/glide_Iv0D3A/execute/dir_6508/job/WMTaskSpace/cmsRun1/CMSSW_8_1_0 area for SCRAM_ARCH slc6_amd64_gcc530.
/tmp/838534.cms_slurm_job/glide_Iv0D3A/execute/dir_6508/job/WMTaskSpace/cmsRun1/cmsRun1-main.sh: line 78: 8273 Aborted $EXECUTABLE -j $JOB_REPORT $CONFIGURATION 2>&1
stageOut1
ReportManipulatingError (Exit Code: 99999)
Could not find report file for step stageOut1!
Input files
Lumis
Output:
logArchive
lfn: /store/unmerged/logs/prod/2017/1/27/pdmvserv_task_TSG-PhaseIFall16GS-00013__v1_T_170125_113822_6705/TSG-PhaseIFall16GS-00013_0/TSG-PhaseIFall16GS-00013_0MergeRAWSIMoutput/TSG-PhaseIFall16DR-00001_0/0000/3/71ad71c2-e368-11e6-8fee-002590494fb0-134-3-logArchive.tar.gz
location: null
pfn:
checksums: adler32: e8f1a256
size: 0
log archive:

========================================================================
Steps taken using the "old way" of getting logs:
ssh vocms0304.cern.ch
sudo -u cmst1 /bin/bash --init-file ~cmst1/.bashrc
cmst1@vocms0304:/afs/cern.ch/user/j/jen_a $ source ~cmst1/status.sh
cd install/wmagent/JobArchiver/

ls install/wmagent/JobArchiver/logDir/p/pdmvserv_task_TSG-PhaseIFall16GS-00013__v1_T_170125_113822_6705/JobCluster_2047/Job_2047173.tar.bz2

cp install/wmagent/JobArchiver/logDir/p/pdmvserv_task_TSG-PhaseIFall16GS-00013__v1_T_170125_113822_6705/JobCluster_2047/Job_2047173.tar.bz2 ~jen_a/

tar jxf Job_2047173.tar.bz2

files can be found at:
/afs/cern.ch/user/j/jen_a/public/Job_2047173.tar.bz2
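A consolidated sketch of the steps above, with this example's values filled in (the JobCluster_2047 grouping is inferred from the path and is an assumption; substitute your own workflow name and WMBS job id):

# Run on the agent (vocms0304) as cmst1, from the WMAgent install area.
WORKFLOW=pdmvserv_task_TSG-PhaseIFall16GS-00013__v1_T_170125_113822_6705
JOBID=2047173
TARBALL=install/wmagent/JobArchiver/logDir/p/${WORKFLOW}/JobCluster_2047/Job_${JOBID}.tar.bz2

cp "$TARBALL" ~jen_a/                        # copy the archived job tarball somewhere readable
cd ~jen_a && tar jxf Job_${JOBID}.tar.bz2    # unpack the job's logs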

================================================================

jenimal commented Jan 27, 2017

OK, attempting to pull from WMArchive, following the directions here:
https://github.com/dmwm/WMArchive/wiki/Given-a-LFN,-return-a-logArchive-or-logCollect-tarball-location

mac-122185:~ jen_a$ ssh vocms0130.cern.ch

-bash-4.1$ export PYTHONPATH=$PWD/WMArchive/src/python:$PWD/WMCore/src/python:/afs/cern.ch/user/v/valya/public/spark:/usr/lib/spark/python
-bash-4.1$ export PATH=$PWD/WMArchive/bin:$PATH
-bash-4.1$ echo $PATH
/afs/cern.ch/user/j/jen_a/WMArchive/bin:/usr/sue/bin:/usr/lib64/qt-3.3/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin
-bash-4.1$ cd WMArchive
-bash-4.1$ klist
Ticket cache: FILE:/tmp/krb5cc_49573_Tnp2396932
Default principal: [email protected]

Valid starting Expires Service principal
01/28/17 00:07:05 01/29/17 00:06:55 krbtgt/[email protected]
01/28/17 00:07:06 01/29/17 00:06:55 afs/[email protected]

OK where am I supposed to find the right area to do this?
source /data/srv/current/apps/wmarchive/etc/profile.d/init.sh
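For reference, the myspark LogFinder run in the next comment takes a --spec file; here is a minimal sketch of such an lfn.json, using the logArchive LFN from the WMStats dump above (the timerange is my guess, bracketing the job date of Jan 27, 2017; the field layout follows the spec shown later in this thread):

# Write a hypothetical lfn.json spec for the LogFinder script.
cat > lfn.json <<'EOF'
{"spec": {"lfn": "/store/unmerged/logs/prod/2017/1/27/pdmvserv_task_TSG-PhaseIFall16GS-00013__v1_T_170125_113822_6705/TSG-PhaseIFall16GS-00013_0/TSG-PhaseIFall16GS-00013_0MergeRAWSIMoutput/TSG-PhaseIFall16DR-00001_0/0000/3/71ad71c2-e368-11e6-8fee-002590494fb0-134-3-logArchive.tar.gz",
          "timerange": [20170126, 20170128]},
 "fields": []}
EOF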

jenimal commented Feb 2, 2017

I keep getting timeouts:
ssh analytix.cern.ch

-bash-4.2$ export PYTHONPATH=$PWD/WMArchive/src/python:$PWD/WMCore/src/python:/afs/cern.ch/user/v/valya/public/spark:/usr/lib/spark/python
-bash-4.2$ export PATH=$PWD/WMArchive/bin:$PATH
-bash-4.2$ cd WMArchive
-bash-4.2$ myspark --script=LogFinder --spec=lfn.json --yarn
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/lib/zookeeper/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/lib/flume-ng/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
17/02/02 22:17:41 INFO SparkContext: Running Spark version 1.6.0
17/02/02 22:17:41 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/02/02 22:17:42 INFO SecurityManager: Changing view acls to: jen_a
17/02/02 22:17:42 INFO SecurityManager: Changing modify acls to: jen_a
17/02/02 22:17:42 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(jen_a); users with modify permissions: Set(jen_a)
17/02/02 22:17:42 INFO Utils: Successfully started service 'sparkDriver' on port 43430.
17/02/02 22:17:43 INFO Slf4jLogger: Slf4jLogger started
17/02/02 22:17:43 INFO Remoting: Starting remoting
17/02/02 22:17:43 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://[email protected]:43167]
17/02/02 22:17:43 INFO Remoting: Remoting now listens on addresses: [akka.tcp://[email protected]:43167]
17/02/02 22:17:43 INFO Utils: Successfully started service 'sparkDriverActorSystem' on port 43167.
17/02/02 22:17:43 INFO SparkEnv: Registering MapOutputTracker
17/02/02 22:17:43 INFO SparkEnv: Registering BlockManagerMaster
17/02/02 22:17:43 INFO DiskBlockManager: Created local directory at /tmp/blockmgr-5cde160b-b424-47ba-9003-e061f6f8d899
17/02/02 22:17:43 INFO MemoryStore: MemoryStore started with capacity 2.1 GB
17/02/02 22:17:43 INFO SparkEnv: Registering OutputCommitCoordinator
17/02/02 22:17:44 INFO Server: jetty-8.y.z-SNAPSHOT
17/02/02 22:17:44 INFO AbstractConnector: Started [email protected]:4040
17/02/02 22:17:44 INFO Utils: Successfully started service 'SparkUI' on port 4040.
17/02/02 22:17:44 INFO SparkUI: Started SparkUI at http://128.142.23.174:4040
17/02/02 22:17:44 INFO SparkContext: Added JAR file:/usr/lib/avro/avro-mapred.jar at spark://128.142.23.174:43430/jars/avro-mapred.jar with timestamp 1486070264326
17/02/02 22:17:44 INFO SparkContext: Added JAR file:/usr/lib/spark/examples/lib/spark-examples-1.6.0-cdh5.7.5-hadoop2.6.0-cdh5.7.5.jar at spark://128.142.23.174:43430/jars/spark-examples-1.6.0-cdh5.7.5-hadoop2.6.0-cdh5.7.5.jar with timestamp 1486070264327
17/02/02 22:17:44 INFO Utils: Copying /afs/cern.ch/user/j/jen_a/WMArchive/src/python/WMArchive/Tools/myspark.py to /tmp/spark-c8d85608-5724-446f-acf7-02ffa639a98d/userFiles-da89c2dd-9454-4c89-840a-f87b1e0ef10b/myspark.py
17/02/02 22:17:44 INFO SparkContext: Added file file:/afs/cern.ch/user/j/jen_a/WMArchive/src/python/WMArchive/Tools/myspark.py at file:/afs/cern.ch/user/j/jen_a/WMArchive/src/python/WMArchive/Tools/myspark.py with timestamp 1486070264597
17/02/02 22:17:44 INFO Executor: Starting executor ID driver on host localhost
17/02/02 22:17:44 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 43208.
17/02/02 22:17:44 INFO NettyBlockTransferService: Server created on 43208
17/02/02 22:17:44 INFO BlockManager: external shuffle service port = 7337
17/02/02 22:17:44 INFO BlockManagerMaster: Trying to register BlockManager
17/02/02 22:17:44 INFO BlockManagerMasterEndpoint: Registering block manager localhost:43208 with 2.1 GB RAM, BlockManagerId(driver, localhost, 43208)
17/02/02 22:17:44 INFO BlockManagerMaster: Registered BlockManager
17/02/02 22:17:45 WARN DomainSocketFactory: The short-circuit local reads feature cannot be used because libhadoop cannot be loaded.
17/02/02 22:17:46 INFO EventLoggingListener: Logging events to hdfs:///user/spark/applicationHistory/local-1486070264687
17/02/02 22:17:46 INFO Utils: Copying /afs/cern.ch/user/j/jen_a/WMArchive/src/python/WMArchive/PySpark/LogFinder.py to /tmp/spark-c8d85608-5724-446f-acf7-02ffa639a98d/userFiles-da89c2dd-9454-4c89-840a-f87b1e0ef10b/LogFinder.py
17/02/02 22:17:46 INFO SparkContext: Added file /afs/cern.ch/user/j/jen_a/WMArchive/src/python/WMArchive/PySpark/LogFinder.py at file:/afs/cern.ch/user/j/jen_a/WMArchive/src/python/WMArchive/PySpark/LogFinder.py with timestamp 1486070266908

Load LogFinder

{"logCollect": ["root://castorcms.cern.ch//castor/cern.ch/cms/store/logs/prod/2017/01/WMAgent/pdmvserv_task_HIG-RunIISummer16DR80Premix-01090__v1_T_161206_005801_5743/pdmvserv_task_HIG-RunIISummer16DR80Premix-01090__v1_T_161206_005801_5743-LogCollectForHIG-RunIISummer16DR80Premix-01090_0-cmswn2179-393-logs.tar"]}

-bash-4.2$ xrdcp root://castorcms.cern.ch//castor/cern.ch/cms/store/logs/prod/2017/01/WMAgent/pdmvserv_task_HIG-RunIISummer16DR80Premix-01090__v1_T_161206_005801_5743/pdmvserv_task_HIG-RunIISummer16DR80Premix-01090__v1_T_161206_005801_5743-LogCollectForHIG-RunIISummer16DR80Premix-01090_0-cmswn2179-393-logs.tar .
[0B/0B][100%][==================================================][0B/s]
Run: [ERROR] Operation expired
-bash-4.2$ ls
bin doc etc lfn.json README.md setup.py src test
-bash-4.2$
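One check that might help narrow down a hang like this (a sketch, not something tried here) is to stat the file on the remote endpoint before attempting the copy:

# Ask the CASTOR xrootd endpoint about the file before copying it.
xrdfs castorcms.cern.ch stat /castor/cern.ch/cms/store/logs/prod/2017/01/WMAgent/pdmvserv_task_HIG-RunIISummer16DR80Premix-01090__v1_T_161206_005801_5743/pdmvserv_task_HIG-RunIISummer16DR80Premix-01090__v1_T_161206_005801_5743-LogCollectForHIG-RunIISummer16DR80Premix-01090_0-cmswn2179-393-logs.tar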

jenimal commented Feb 2, 2017

@vkuznet @yuyiguo can you tell me what I'm doing wrong that my xrdcp is timing out?

vkuznet commented Feb 2, 2017

Jen, please try xrdcp from lxplus and, if it times out, post your question to the dmDevelopment forum. xrdcp is outside the scope of the WMArchive project; it's a generic tool for getting files from remote locations.

jenimal commented Feb 3, 2017

still got the timeout:
-bash-4.1$ xrdcp root://castorcms.cern.ch//castor/cern.ch/cms/store/logs/prod/2017/01/WMAgent/pdmvserv_task_HIG-RunIISummer16DR80Premix-01090__v1_T_161206_005801_5743/pdmvserv_task_HIG-RunIISummer16DR80Premix-01090__v1_T_161206_005801_5743-LogCollectForHIG-RunIISummer16DR80Premix-01090_0-cmswn2179-393-logs.tar .
[0B/0B][100%][==================================================][0B/s]
Run: [ERROR] Operation expired
-bash-4.1$

jenimal commented Feb 3, 2017

And the reply:
Aren't Castor files on tape? You can't xrdcp a file on tape. Has this xrdcp command ever worked for Castor files?

---Carl

On 02/03/2017 02:36 PM, Jennifer K Adelman-Mccarthy wrote:

*** Discussion title: Data Management Development

Hi,

I am testing log file retrieval for WMArchive and am having issues with timeouts when trying to retrieve the log files. Valentin suggested I email you to see if you can help:

xrdcp root://castorcms.cern.ch//castor/cern.ch/cms/store/logs/prod/2017/01/WMAgent/pdmvserv_task_HIG-RunIISummer16DR80Premix-01090__v1_T_161206_005801_5743/pdmvserv_task_HIG-RunIISummer16DR80Premix-01090__v1_T_161206_005801_5743-LogCollectForHIG-RunIISummer16DR80Premix-01090_0-cmswn2179-393-logs.tar .
[0B/0B][100%][==================================================][0B/s]
Run: [ERROR] Operation expired

Any ideas?

Jen



Visit this CMS message (to reply or unsubscribe) at:
https://hypernews.cern.ch/HyperNews/CMS/get/dmDevelopment/1989.html

So how do I get them off, then? The wiki at https://github.com/dmwm/WMArchive/wiki/Given-a-LFN,-return-a-logArchive-or-logCollect-tarball-location just says to do an xrdcp.

jenimal commented Feb 6, 2017

OK Dirk had the right solution for pulling the logs:
xrdcp root://eoscms.cern.ch//store/logs/prod/2017/01/WMAgent/pdmvserv_task_HIG-RunIISummer16DR80Premix-01090__v1_T_161206_005801_5743/pdmvserv_task_HIG-RunIISummer16DR80Premix-01090__v1_T_161206_005801_5743-LogCollectForHIG-RunIISummer16DR80Premix-01090_0-cmswn2179-393-logs.tar .

OK how do I know which of the tar files to look into?
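To find the right piece inside the logCollect tarball, a minimal sketch (assuming it bundles the per-job logArchive.tar.gz files, as the listings later in this thread suggest) is to list its members and pull out the one matching your job:

# List the members of the logCollect tarball, then extract and unpack one logArchive.
TARBALL=pdmvserv_task_HIG-RunIISummer16DR80Premix-01090__v1_T_161206_005801_5743-LogCollectForHIG-RunIISummer16DR80Premix-01090_0-cmswn2179-393-logs.tar
tar tvf "$TARBALL" | grep logArchive            # see which logArchive files are inside
tar xvf "$TARBALL" <member-path-from-listing>   # hypothetical placeholder: the member you want
tar zxvf <member-path-from-listing>             # unpack that job's logs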

vkuznet commented Feb 8, 2017

Jen,
I am not familiar with how the tarball is organized; I think you'd better ask
Seangchan @ticoann and Alan @amaltaro.

ticoann commented Feb 8, 2017

The starting point is the unmerged file, right?

As in the wiki page
https://github.com/dmwm/WMArchive/wiki/Given-a-LFN,-return-a-logArchive-or-logCollect-tarball-location

it returns both the logArchive files and the corresponding output files as one list (see the sketch after the listing below).
I am not sure how the list is organized, but I am assuming the order of the unmerged outputs matches the order of the logArchive files. Is that right, @vkuznet?

"queries": ["/store/unmerged/RunIISummer16MiniAODv2/DMV_NNPDF30_Vector_Mphi-500_Mchi-200_gSM-0p25_gDM-1p0_v2_13TeV-powheg/MINIAODSIM/PUMoriond17_80X_mcRun2_asymptotic_2016_TrancheIV_v6-v1/130000/EEDD6A05-01AD-E611-B4C7-001E675A6AA9.root",
 "/store/unmerged/RunIISummer16MiniAODv2/DMV_NNPDF30_Vector_Mphi-500_Mchi-200_gSM-0p25_gDM-1p0_v2_13TeV-powheg/MINIAODSIM/PUMoriond17_80X_mcRun2_asymptotic_2016_TrancheIV_v6-v1/130000/58098F3B-8FAC-E611-BC87-001E67DFF7CB.root",
 "/store/unmerged/RunIISummer16MiniAODv2/DMV_NNPDF30_Vector_Mphi-500_Mchi-200_gSM-0p25_gDM-1p0_v2_13TeV-powheg/MINIAODSIM/PUMoriond17_80X_mcRun2_asymptotic_2016_TrancheIV_v6-v1/130000/46B86B27-01AD-E611-9A98-001E67E69E32.root",
 "/store/unmerged/RunIISummer16MiniAODv2/DMV_NNPDF30_Vector_Mphi-500_Mchi-200_gSM-0p25_gDM-1p0_v2_13TeV-powheg/MINIAODSIM/PUMoriond17_80X_mcRun2_asymptotic_2016_TrancheIV_v6-v1/130000/DCFBE0FB-00AD-E611-BD13-001E67DFFF5F.root",
 "/store/unmerged/RunIISummer16MiniAODv2/DMV_NNPDF30_Vector_Mphi-500_Mchi-200_gSM-0p25_gDM-1p0_v2_13TeV-powheg/MINIAODSIM/PUMoriond17_80X_mcRun2_asymptotic_2016_TrancheIV_v6-v1/130000/689521FA-00AD-E611-A6AB-001E67A404B5.root",
 "/store/unmerged/RunIISummer16MiniAODv2/DMV_NNPDF30_Vector_Mphi-500_Mchi-200_gSM-0p25_gDM-1p0_v2_13TeV-powheg/MINIAODSIM/PUMoriond17_80X_mcRun2_asymptotic_2016_TrancheIV_v6-v1/130000/549A3202-01AD-E611-A745-001E67A3FE66.root",
 "/store/unmerged/RunIISummer16MiniAODv2/DMV_NNPDF30_Vector_Mphi-500_Mchi-200_gSM-0p25_gDM-1p0_v2_13TeV-powheg/MINIAODSIM/PUMoriond17_80X_mcRun2_asymptotic_2016_TrancheIV_v6-v1/130000/3CB5DEF8-00AD-E611-A256-001E67A3FC1D.root",
 "/store/unmerged/RunIISummer16MiniAODv2/DMV_NNPDF30_Vector_Mphi-500_Mchi-200_gSM-0p25_gDM-1p0_v2_13TeV-powheg/MINIAODSIM/PUMoriond17_80X_mcRun2_asymptotic_2016_TrancheIV_v6-v1/130000/969189F9-00AD-E611-82B1-001E67A3F8A8.root",
 "/store/unmerged/RunIISummer16MiniAODv2/DMV_NNPDF30_Vector_Mphi-500_Mchi-200_gSM-0p25_gDM-1p0_v2_13TeV-powheg/MINIAODSIM/PUMoriond17_80X_mcRun2_asymptotic_2016_TrancheIV_v6-v1/130000/F63804FA-00AD-E611-95A3-001E67A3AEB8.root",
 "/store/unmerged/RunIISummer16MiniAODv2/DMV_NNPDF30_Vector_Mphi-500_Mchi-200_gSM-0p25_gDM-1p0_v2_13TeV-powheg/MINIAODSIM/PUMoriond17_80X_mcRun2_asymptotic_2016_TrancheIV_v6-v1/130000/C84C5B0D-01AD-E611-877B-001E67A42161.root", 
 "/store/unmerged/RunIISummer16MiniAODv2/DMV_NNPDF30_Vector_Mphi-500_Mchi-200_gSM-0p25_gDM-1p0_v2_13TeV-powheg/MINIAODSIM/PUMoriond17_80X_mcRun2_asymptotic_2016_TrancheIV_v6-v1/130000/52C56AFB-00AD-E611-ACEC-001E67DFFB4F.root",
 "/store/unmerged/RunIISummer16MiniAODv2/DMV_NNPDF30_Vector_Mphi-500_Mchi-200_gSM-0p25_gDM-1p0_v2_13TeV-powheg/MINIAODSIM/PUMoriond17_80X_mcRun2_asymptotic_2016_TrancheIV_v6-v1/130000/8C5F11F9-00AD-E611-B3C5-001E675A6653.root",
 "/store/unmerged/RunIISummer16MiniAODv2/DMV_NNPDF30_Vector_Mphi-500_Mchi-200_gSM-0p25_gDM-1p0_v2_13TeV-powheg/MINIAODSIM/PUMoriond17_80X_mcRun2_asymptotic_2016_TrancheIV_v6-v1/130000/6CD954F7-00AD-E611-83C6-001E67DDCC81.root",
 "/store/unmerged/logs/prod/2016/11/17/pdmvserv_EXO-RunIISummer16MiniAODv2-00065_00033_v0__161115_160653_2105/StepOneProc/0000/0/40227668-ac52-11e6-9b63-02163e017c3c-0-0-logArchive.tar.gz",
 "/store/unmerged/logs/prod/2016/11/17/pdmvserv_EXO-RunIISummer16MiniAODv2-00065_00033_v0__161115_160653_2105/StepOneProc/0000/0/931d89dc-ac4f-11e6-9b63-02163e017c3c-4-0-logArchive.tar.gz",
 "/store/unmerged/logs/prod/2016/11/17/pdmvserv_EXO-RunIISummer16MiniAODv2-00065_00033_v0__161115_160653_2105/StepOneProc/0000/0/40227668-ac52-11e6-9b63-02163e017c3c-3-0-logArchive.tar.gz",
 "/store/unmerged/logs/prod/2016/11/17/pdmvserv_EXO-RunIISummer16MiniAODv2-00065_00033_v0__161115_160653_2105/StepOneProc/0000/0/1fc3af04-ac52-11e6-9b63-02163e017c3c-5-0-logArchive.tar.gz",
 "/store/unmerged/logs/prod/2016/11/17/pdmvserv_EXO-RunIISummer16MiniAODv2-00065_00033_v0__161115_160653_2105/StepOneProc/0000/0/931d89dc-ac4f-11e6-9b63-02163e017c3c-2-0-logArchive.tar.gz",
 "/store/unmerged/logs/prod/2016/11/17/pdmvserv_EXO-RunIISummer16MiniAODv2-00065_00033_v0__161115_160653_2105/StepOneProc/0000/0/40227668-ac52-11e6-9b63-02163e017c3c-2-0-logArchive.tar.gz",
 "/store/unmerged/logs/prod/2016/11/17/pdmvserv_EXO-RunIISummer16MiniAODv2-00065_00033_v0__161115_160653_2105/StepOneProc/0000/0/931d89dc-ac4f-11e6-9b63-02163e017c3c-0-0-logArchive.tar.gz",
 "/store/unmerged/logs/prod/2016/11/17/pdmvserv_EXO-RunIISummer16MiniAODv2-00065_00033_v0__161115_160653_2105/StepOneProc/0000/0/931d89dc-ac4f-11e6-9b63-02163e017c3c-1-0-logArchive.tar.gz",
 "/store/unmerged/logs/prod/2016/11/17/pdmvserv_EXO-RunIISummer16MiniAODv2-00065_00033_v0__161115_160653_2105/StepOneProc/0000/0/931d89dc-ac4f-11e6-9b63-02163e017c3c-3-0-logArchive.tar.gz",
 "/store/unmerged/logs/prod/2016/11/17/pdmvserv_EXO-RunIISummer16MiniAODv2-00065_00033_v0__161115_160653_2105/StepOneProc/0000/0/40227668-ac52-11e6-9b63-02163e017c3c-1-0-logArchive.tar.gz",
 "/store/unmerged/logs/prod/2016/11/17/pdmvserv_EXO-RunIISummer16MiniAODv2-00065_00033_v0__161115_160653_2105/StepOneProc/0000/0/1fc3af04-ac52-11e6-9b63-02163e017c3c-0-0-logArchive.tar.gz",
 "/store/unmerged/logs/prod/2016/11/17/pdmvserv_EXO-RunIISummer16MiniAODv2-00065_00033_v0__161115_160653_2105/StepOneProc/0000/0/1fc3af04-ac52-11e6-9b63-02163e017c3c-1-0-logArchive.tar.gz",
 "/store/unmerged/logs/prod/2016/11/17/pdmvserv_EXO-RunIISummer16MiniAODv2-00065_00033_v0__161115_160653_2105/StepOneProc/0000/0/1fc3af04-ac52-11e6-9b63-02163e017c3c-4-0-logArchive.tar.gz"]}

vkuznet commented Feb 9, 2017 via email

jenimal commented Feb 10, 2017

Can somebody help me find this log file? When I look in WMStats I am looking for:
lfn: /store/unmerged/logs/prod/2017/2/10/pdmvserv_task_TSG-PhaseIFall16GS-00005__v1_T_170205_161623_9267/TSG-PhaseIFall16GS-00005_0/TSG-PhaseIFall16GS-00005_0MergeRAWSIMoutput/TSG-PhaseIFall16DR-00003_0/0002/2/03d16bc8-ef51-11e6-88b9-02163e0184a6-0-2-logArchive.tar.gz

The "L" in WMStats tells me it ran on vocms0126.cern.ch
jobcooloff: 2017/2/10 (Fri) 07:17:28 UTC, T1_US_FNAL_Disk
jobcooloff: 2017/2/10 (Fri) 08:49:57 UTC, T1_US_FNAL_Disk
jobcooloff: 2017/2/10 (Fri) 17:01:34 UTC, T1_US_FNAL_Disk

WMBS job id: 3875037

I go onto vocms0126.cern.ch and look there first, since the job is still in cooloff...
and the tarball isn't in JobArchiver:
cmst1@vocms0126:/data/srv/wmagent/current/install/wmagent/JobArchiver/logDir/p/pdmvserv_task_TSG-PhaseIFall16GS-00005__v1_T_170205_161623_9267/JobCluster_3875 $ ls
Job_3875038.tar.bz2 Job_3875041.tar.bz2 Job_3875371.tar.bz2 Job_3875373.tar.bz2 Job_3875617.tar.bz2
Job_3875040.tar.bz2 Job_3875043.tar.bz2 Job_3875372.tar.bz2 Job_3875616.tar.bz2

So I try WMArchive:
more lfn.json
{"spec":{"lfn":"/store/unmerged/logs/prod/2017/2/10/pdmvserv_task_TSG-PhaseIFall16GS-00005__v1_T_170205_161623_9267/TSG-PhaseIFa
ll16GS-00005_0/TSG-PhaseIFall16GS-00005_0MergeRAWSIMoutput/TSG-PhaseIFall16DR-00003_0/0002/2/03d16bc8-ef51-11e6-88b9-02163e0184a
6-0-2-logArchive.tar.gz","timerange":[20170209,20170210]}, "fields":[]}

17/02/10 18:53:37 INFO EventLoggingListener: Logging events to hdfs:///user/spark/applicationHistory/local-1486749212555
17/02/10 18:53:37 INFO Utils: Copying /afs/cern.ch/user/j/jen_a/WMArchive/src/python/WMArchive/PySpark/LogFinder.py to /tmp/spark-ff4b51f7-96e8-4fa5-918b-0b9b8c3674db/userFiles-8bf03806-3ec6-43e2-ab53-496ed8d5e51f/LogFinder.py
17/02/10 18:53:37 INFO SparkContext: Added file /afs/cern.ch/user/j/jen_a/WMArchive/src/python/WMArchive/PySpark/LogFinder.py at file:/afs/cern.ch/user/j/jen_a/WMArchive/src/python/WMArchive/PySpark/LogFinder.py with timestamp 1486749217719

Load LogFinder

{"logCollect": []}

And it doesn't appear to be there either.

Help!

Jen

ticoann commented Feb 10, 2017

Hi Jen, the logCollect job probably hasn't finished (or started) yet. That is probably why you can't find a logCollect location in WMArchive.

The log file is not archived in JobArchiver yet. I am not sure whether this is a bug or it's just not ready yet.

You can find the log files here:

/data/srv/wmagent/current/install/wmagent/JobCreator/JobCache/pdmvserv_task_TSG-PhaseIFall16GS-00005__v1_T_170205_161623_9267/TSG-PhaseIFall16GS-00005_0/TSG-PhaseIFall16GS-00005_0MergeRAWSIMoutput/TSG-PhaseIFall16DR-00003_0/JobCollection_384033_0/job_3875037

I am not sure how long the file will be preserved there, but I think it will be moved to JobArchiver when it gets deleted.

cmst1@vocms0126:/data/srv/wmagent/current/install/wmagent/JobCreator/JobCache/pdmvserv_task_TSG-PhaseIFall16GS-00005__v1_T_170205_161623_9267/TSG-PhaseIFall16GS-00005_0/TSG-PhaseIFall16GS-00005_0MergeRAWSIMoutput/TSG-PhaseIFall16DR-00003_0/JobCollection_384033_0/job_3875037 $ ls
condor.42895.25.err  condor.42895.25.out  condor.43095.81.log  condor.43343.27.err  condor.43343.27.out  Report.0.pkl  Report.2.pkl
condor.42895.25.log  condor.43095.81.err  condor.43095.81.out  condor.43343.27.log  job.pkl		 Report.1.pkl
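If you only have the WMBS job id and not the full task path, a sketch for locating the cache directory on the agent (assuming the JobCache layout shown above) is:

# Hypothetical helper: find the job's cache directory given only its WMBS job id
# (this walks the whole JobCache tree, so it can take a while on a busy agent).
JOBID=3875037
find /data/srv/wmagent/current/install/wmagent/JobCreator/JobCache -type d -name "job_${JOBID}"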

jenimal commented Feb 10, 2017

I was looking in JobArchiver... I wonder if they are still in JobCreator... let me go look.

OK, so there they are.
So, when looking for log files for workflows that are still running (a rough sketch follows below):

  1. Look in JobCreator. If they are no longer there,
  2. look in JobArchiver. If they are no longer there,
  3. look in EOS.

If you still can't find them... go to the bar....
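The same lookup order as shell, with hypothetical placeholders for the workflow-specific pieces (the EOS step reuses the xrdcp form that worked earlier in this thread):

# 1. JobCreator cache on the agent (job still running / in cooloff); same find as the sketch above.
JOBID=3875037
find /data/srv/wmagent/current/install/wmagent/JobCreator/JobCache -type d -name "job_${JOBID}"

# 2. JobArchiver on the agent (job already archived).
ls /data/srv/wmagent/current/install/wmagent/JobArchiver/logDir/*/*/JobCluster_*/Job_${JOBID}.tar.bz2

# 3. EOS (logCollect tarball, once a logCollect job has run); angle-bracket values are placeholders.
xrdcp root://eoscms.cern.ch//store/logs/prod/<year>/<month>/WMAgent/<workflow>/<logcollect-tarball>.tar .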

ticoann commented Feb 10, 2017

Bar sounds good. :-) But before you hit the bar, you can check WMArchive as well. The logCollect job may be finished while the workflow is still running. But even if the workflow is finished, the logCollect job still might not be done, since we don't count it toward the completed status.

Anyway, searching for the logCollect job in WMArchive is not ideal for currently running workflows.

jenimal commented Feb 10, 2017

Yep, but the deal is that we need to be able to find them and get them posted in GGUS tickets for the sites if we suspect site issues. The fact that the logs "move" while a workflow is in flight makes them tricky to find. BUT if we know all the landing places along the way, at least we know where to look, and in what order, when we need to.
