Home
Lightweight Distributed Metric Service: A Scalable Infrastructure for Continuous Monitoring of Large Scale Computing Systems and Applications
A. Agelastos, B. Allan, J. Brandt, P. Cassella, J. Enos, J. Fullop, A. Gentile, S. Monk, N. Naksinehaboon, J. Ogden, M. Rajan, M. Showerman, J. Stevenson, N. Taerat, and T. Tucker
IEEE/ACM Int'l. Conf. for High Performance Storage, Networking, and Analysis (SC14) New Orleans, LA. Nov 2014.
LDMS v4 Tutorials from LDMSCON2020:
All tutorials are available at: LDMSCON2020 Tutorials
LDMS v4: Writing Sampler and Store Plugins
Available at LDMSCON2020 Tutorials; select LDMSv4_SamplerStoreTutorials_LDMSCON2020.pdf
Sandia National Laboratories, SAND2020-8001 C, 2020.
Baler: Deterministic, lossless log message clustering tool
N. Taerat, J. Brandt, A. Gentile, M. Wong, and C. Leangsuksun
In: Computer Science - Research and Development
Volume 26, Numbers 3-4, 285-295, DOI: 10.1007/s00450-011-0155-3
Int'l. Supercomputing Conference (ISC). Hamburg, Germany. June 2011.
New Systems, New Behaviors, New Patterns: Monitoring Insights from System Standup
J. Brandt, A. Gentile, C. Martin, J. Repik, and N. Taerat
Workshop on Monitoring and Analysis for High Performance Computing Systems Plus Applications (HPCMASPA) at IEEE Int'l. Conf. on Cluster Computing (CLUSTER) Chicago, IL. Sept 2015.
NOTE: Publications prior to Sept 2011 refer to a different and now deprecated architecture for data collection and transport (i.e., they do NOT use LDMS). Many publications listed here are also available at OSTI.GOV some months after publication.
Integrating Systems Operations into CoDesign -- Keynote 🔸
Presented by A. Gentile
2nd Int'l Workshop on Monitoring and Operational Data Analytics (MODA21). Jul 2021.
Delay Sensitivity-Driven Congestion Mitigation for HPC Systems
A. Patke, S. Jha, H. Qiu, J. Brandt, A. Gentile, J. Greenseid, Z. Kalbarczyk, and R. Iyer
ACM Int'l Conference on Supercomputing (ICS2021). Jun 2021.
Enabling System and Application Data Fusion
Presented by A. Gentile
2021 ECP Annual Meeting Center and Application Monitoring WG. Apr 2021.
HPC System Data Pipeline to Enable Meaningful Insights through Analytic-Driven Visualizations
B. Schwaller, N. Tucker, T. Tucker, B. Allan, and J. Brandt
Workshop on Monitoring and Analysis for High Performance Computing Systems Plus Applications (HPCMASPA) at IEEE Int'l. Conf. on Cluster Computing (CLUSTER) Kobe, Japan, Sept 2020.
Towards Workload-Adaptive Scheduling for HPC Clusters
A. Goponenko, R. Izadpanah, J. Brandt, and D. Dechev
Workshop on Monitoring and Analysis for High Performance Computing Systems Plus Applications (HPCMASPA) at IEEE Int'l. Conf. on Cluster Computing (CLUSTER) Kobe, Japan, Sept 2020.
LDMS Monitoring of EDR InfiniBand Networks -- workshop work-in-progress paper & presentation
B. Allan, M. Aguilar, B. Schwaller, S. Langer
Workshop on Monitoring and Analysis for High Performance Computing Systems Plus Applications (HPCMASPA) at IEEE Int'l. Conf. on Cluster Computing (CLUSTER) Kobe, Japan, Sept 2020. Also as Sandia Technical Report SAND2020-8534C (paper) and SAND2020-9599C (presentation).
Inspecting fast commodity RDMA network performance on production systems with LDMS -- Workshop presentation
B. Allan, M. Aguilar, B. Schwaller, S. Langer
LDMSCON2020:
LDMS Users Group Conference 2020, Albuquerque, NM, Aug 2020. Technical report SAND2020-8014C.
Production LDMS, genders, systemd, and the future -- Workshop presentation
B. Allan
LDMSCON2020:
LDMS Users Group Conference 2020, Albuquerque, NM, Aug 2020. Technical report SAND2020-8015C.
LDMS packaging: Moving from tribal knowledge to community knowledge -- Workshop presentation
B. Allan
LDMSCON2020:
LDMS Users Group Conference 2020, Albuquerque, NM, Aug 2020. Technical report SAND2020-8013C.
Measuring Congestion in High-Performance Datacenter Networks
S. Jha, A. Gentile, J. Brandt, A. Patke, B. Lim, G. Bauer, M. Showerman, L. Kaplan, Z. Kalbarczyk, W. Kramer, and R. Iyer
at The 17th USENIX Symposium on Networked Systems Design and Implementation (NSDI). Feb 2020.
Attributing Performance Variation from Integrated Application and System Data -- poster
O. Aaziz, B. Allan, J. Brandt, J. Cook, K. Devine, J. Elliott, A. Gentile, S. Olivier, K. Pedretti, and T. Tucker
Applied Computer Science Meeting, Feb 2020.
Enabling Machine Learning-based HPC Performance Diagnostics in Production Environments -- Panel Organizer 🔸
Organizers: M. Showerman, J. Greenseid, A. Gentile, and J. Brandt
Panelists: W. T. Kramer (NCSA), R. Gerber (NERSC), N. Brown (EPCC), and A. Saxton (NCSA)
SC19, Fri Nov 22, 2019, 8:30 AM.
Holistic Measurement Driven System Assessment (HMDSA) -- poster
S. Jha, M. Showerman, A. Saxton, J. Enos, G. Bauer, Z. Kalbarczyk, A. Gentile, J. Brandt, R. Iyer, and W. T. Kramer
SC19, Nov 2019.
A Machine Learning Approach to Understanding HPC Application Performance Variation -- poster
B. Aksar, B. Schwaller, O. Aaziz, E. Ates, J. Brandt, A. K. Coskun, M. Egele, and V. Leung
SC19, Nov 2019.
LDMS v4: Writing Sampler and Store Plugins
A. Gentile
LDMS User's Group Conference 2019 (LDMSCON2019)
Sandia National Laboratories, SAND2019-12858 O, Oct 2019.
Figures of merit for production HPC
B. Allan
Sandia National Laboratories, SAND2019-12564, Oct. 2019.
Proxy or Imposter? A Method and Case Study to Determine the Answer
O. Aaziz, J. Cook, C. Vaughan, and D. Richards
Workshop on Monitoring and Analysis for High Performance Computing Systems Plus Applications (HPCMASPA) at IEEE Int'l Conference on Cluster Computing (CLUSTER), Sep 2019.
Standardized Environment for Monitoring Heterogeneous Architectures
C. Brown, B. Schwaller, N. Gauntt, B. Allan and K. Davis
Workshop on Monitoring and Analysis for High Performance Computing Systems Plus Applications (HPCMASPA) at IEEE Int'l Conference on Cluster Computing (CLUSTER), Sep 2019.
A Study of Network Congestion in Two Supercomputing High-Speed Interconnects
S. Jha, A. Patke, J. Brandt, A. Gentile, M. Showerman, E. Roman, Z. Kalbarczyk, and R. Iyer
at 26th Symposium on High Performance Interconnects (HOTI), Aug 2019.
Sandia HPC cluster performance monitoring, analysis & visualization
B. Allan
Sandia National Laboratories, SAND2019-10266C, Aug. 2019.
HPAS: An HPC Performance Anomaly Suite for Reproducing Performance Variations
E. Ates, Y. Zhang, B. Aksar, J. Brandt, V. J. Leung, M. Egele, and A. K. Coskun
at Int'l Conf. on Parallel Processing (ICPP). Aug 2019.
Production Application Performance Data Streaming for System Monitoring
R. Izadpanah, B. Allan, D. Dechev, and J. Brandt
ACM Transactions on Modeling and Performance Evaluation of Computing Systems (TOMPECS). Vol. 4, Issue 2, Jun 2019. doi:10.1145/3319498
Exploring New Monitoring and Analysis Capabilities on Cray’s Software Preview System
J. Brandt, C. Brown, S. Donoho, A. Gentile, J. Greenseid, W. Kramer, P. Langer, A. Rashid, K. Rehm, and M. Showerman
at Cray User Group (CUG) 2019. May 2019.
Extracting Actionable System-Application Performance Factors
J. Brandt, A. Gentile, and J. Cook
Minisymposium on Modeling Resource Utilization and Contention in HPC System-Application Interactions -- Minisymposium Organizer 🔸
at the SIAM Conf. on Computational Science and Engineering (CSE 19), Feb-Mar 2019.
Holistic Measurement Driven System Assessment (HMDSA) -- poster
Bill Kramer, Greg Bauer, Brett Bode, Mike Showerman, Jeremy Enos, Aaron Saxton, Saurabh Jha, Zbigniew Kalbarczyk, and Ravishankar Iyer (NCSA/UIUC) and James Brandt and Ann Gentile (SNL)
at Exascale Computing Project Annual Meeting 2019, Jan 2019.
and HMDSA Project Website
Two Weeks In The Life of Skybridge -- SLURM and LDMS metrics and metadata
B. Allan
Sandia National Laboratories SAND 2019-4915, April 2019.
Platform Independent Run Time HPC Monitoring, Analysis, and Feedback at Any-Scale -- Featured Presentation at DOE Booth 🔸
J. Brandt
SC18, Nov 2018.
Monitoring Large-Scale HPC Systems: Extracting and Presenting Meaningful System and Application Insights -- BoF Session Organizer 🔸
SC18, Nov 2018.
An Efficient Latch-free Database Index Based on Multi-dimensional Lists
K. Lamar, R. Izadpanah, J. Brandt, and D. Dechev
2018 IEEE 37th International Performance Computing and Communications Conference (IPCCC). Nov 2018.
Online Diagnosis of Performance Variation in HPC Systems Using Machine Learning
O. Tuncer, E. Ates, Y. Zhang, A. Turk, J. Brandt, V. Leung, M. Egele, and A. Coskun
IEEE Transactions on Parallel and Distributed Systems doi: 10.1109/TPDS.2018.2870403, Sep 2018.
A Methodology for Characterizing the Correspondence Between Real and Proxy Applications
O. Aaziz, J.M. Cook, J. Cook, T. Juedeman, D. Richards, and C. Vaughan
IEEE Int'l. Conf. on Cluster Computing (CLUSTER), Sep 2018.
Large-Scale System Monitoring Experiences and Recommendations -- Invited Peer-Reviewed Submission 🔸
V. Ahlgren, S. Andersson, J. Brandt, N. P. Cardo, S. Chunduri, J. Enos, P. Fields, A. Gentile, R. Gerber, M. Gienger, J. Greenseid, A. Greiner, B. Hadri, Y. (Helen) He, D. Hoppe, U. Kaila, K. Kelly, M. Klein, A. Kristiansen, S. Leak, M. Mason, K. Pedretti, J-G. Piccinali, J. Repik, J. Rogers, S. Salminen, M. Showerman, C. Whitney, and J. Williams (Authors representing ALCF, CSC, CSCS, HLRS, KAUST, LANL, NCSA, NERSC, ORNL, SNL, and Cray)
Workshop on Monitoring and Analysis of High Performance Computing Systems Plus Applications (HPCMASPA) at IEEE Int'l. Conf. on Cluster Computing (CLUSTER), Sep 2018.
Characterizing Supercomputer Traffic Networks Through Link-Level Analysis
S. Jha, J. Brandt, A. Gentile, Z. Kalbarczyk, and R. Iyer
Workshop on Monitoring and Analysis of High Performance Computing Systems Plus Applications (HPCMASPA) at IEEE Int'l. Conf. on Cluster Computing (CLUSTER), Sep 2018.
Modeling Expected Application Runtime for Characterizing and Assessing Job Performance
O. Aaziz, J. Cook, and M. Tanash
Workshop on Monitoring and Analysis of High Performance Computing Systems Plus Applications (HPCMASPA) at IEEE Int'l. Conf. on Cluster Computing (CLUSTER), Sep 2018.
Taxonomist: Application Detection through Rich Monitoring Data -- Best Artifact Award 🔸
E. Ates, O. Tuncer, A. Turk, V. J. Leung, J. Brandt, M. Egele and A. K. Coskun
24th Int'l European Conference on Parallel and Distributed Computing (Euro-Par), Turin, Italy, Aug 2018.
Artifact
Integrating Low-latency Analysis into HPC System Monitoring
R. Izadpanah, N. Naksinehaboon, J. Brandt, A. Gentile, and D. Dechev
47th Int'l Conference on Parallel Processing (ICPP), Eugene, OR, Aug 2018.
Cray System Monitoring: Successes, Requirements, Priorities
V. Ahlgren, S. Andersson, J. Brandt, N. P. Cardo, S. Chunduri, J. Enos, P. Fields, A. Gentile, R. Gerber, J. Greenseid, A. Greiner, B. Hadri, Y. He, D. Hoppe, U. Kaila, K. Kelly, M. Klein, A. Kristiansen, S. Leak, M. Mason, K. Pedretti, J-G. Piccinali, J. Repik, J. Rogers, S. Salminen, M. Showerman, C. Whitney, and J. Williams
(Authors representing ALCF, CSC, CSCS, HLRS, KAUST, LANL, NCSA, NERSC, ORNL, SNL, and Cray)
Cray Users Group (CUG), Stockholm, Sweden. May 2018.
Supporting Failure Analysis with Discoverable, Annotated Log Datasets
S. Leak, A. Greiner, A. Gentile, and J. Brandt
Cray Users Group (CUG), Stockholm, Sweden. May 2018.
Automated Analysis and Effective Feedback -- BOF Session Organizer 🔸
M. Showerman, J. Brandt, and A. Gentile
Cray Users Group (CUG), May 2018.
Runtime HPC System and Application Performance Assessment and Diagnostics
J. Brandt, A. Gentile, Jon Cook, B. Allan, Jeanine Cook, O. Aaziz, T. Tucker, N. Naksinehaboon, N. Taerat, E. Ates, O. Tuncer, M. Egele, A. Turk, and A. Coskun
Conference on Data Analysis (CODA), Santa Fe, NM, March 2018.
Continuous Performance Tracking for Kokkos using LDMS
J. Brandt, S. Hammond, T. Tucker, A. Gentile, and J. Cook
Programming Models and CoDesign Meeting, Albuquerque, NM. Feb 2018.
Systems Monitoring Data in Action -- BoF Session Organizer 🔸
SC17, 12:15pm-1:15 pm Thurs Nov 16 2017.
Holistic Measurement Driven System Assessment
S. Jha, J. Brandt, A. Gentile, Z. Kalbarczyk, G. Bauer, J. Enos, M. Showerman, L. Kaplan, B. Bode, A. Greiner, A. Bonnie, M. Mason, R. Iyer, and W. Kramer
Workshop on Monitoring and Analysis of High Performance Computing Systems Plus Applications (HPCMASPA) at IEEE Int'l. Conf. on Cluster Computing (CLUSTER), Sept 2017.
Diagnosing Performance Variations in HPC Applications Using Machine Learning -- Gauss Award Winner 🔸
O. Tuncer, E. Ates, Y. Zhang, A. Turk, J. Brandt, V. J. Leung, M. Egele, and A. K. Coskun
ISC High Performance 2017 (ISC), Jun 2017.
LDMS Version 3 Tutorial and Demo Material
J. Brandt, T. Tucker, A. Gentile, N. Naksinehaboon, and N. Taerat
Sandia National Laboratories, SAND2017-5153 O, May 2017.
Understanding Fault Scenarios and Impacts Through Fault Injection Experiments in Cielo
V. Formicola, S. Jha, F. Deng, D. Chen (UIUC), A. Bonnie, M. Mason (LANL), J. Brandt, A. Gentile (SNL), L. Kaplan, J. Repik (Cray), J. Enos, M. Showerman (NCSA), A. Greiner (NERSC), Z. Kalbarczyk, R. Iyer, and W. Kramer (UIUC)
Cray Users Group (CUG), May 2017.
Runtime Collection and Analysis of System Metrics for Production Monitoring of Trinity Phase II (and slides)
A. DeConinck, H. Nam, D. Morton, A. Bonnie, C. Lueninghoener (LANL), J. Brandt, A. Gentile, K. Pedretti, A. Agelastos, C. Vaughan, S. Hammond, B. Allan (SNL), M. Davis and J. Repik (Cray)
Cray Users Group (CUG), May 2017.
Holistic Systems Monitoring and Analysis -- BOF Session Organizer 🔸
M. Showerman, J. Brandt, and A. Gentile
Cray Users Group (CUG), May 2017.
Contention and Congestion: Challenges and Approaches to Understanding Application Impact
A. Gentile, J. Brandt, A. Agelastos, J. Lamb, K. Ruggirello, and J. Stevenson
Minisymposium on Understanding Performance Variability due to Application-Data Center Interaction -- Minisymposium Organizer 🔸
at the SIAM Conf. on Computational Science and Engineering (CSE 17), Feb 2017.
Data Analytics Support for HPC System Management -- Panelist 🔸
SC16, Fri 18th Nov 2016 10:30-noon.
Monitoring Large Scale HPC Systems: Understanding, Diagnosis and Attribution of Performance Variation and Issues -- BoF Session Organizer 🔸
SC16, 5:15pm-7pm Wed Nov 16 2016.
Discovery, Interpretation, and Communication of Meaningful Information in HPC Monitoring Data
University of Central Florida, Oct 2016.
Holistic Measurement Driven Resilience
Chaos Community Day, Seattle, WA. Aug 2016.
Continuous Whole-System Monitoring Toward Rapid Understanding of Production HPC Applications and Systems
A. Agelastos, B. Allan, J. Brandt, A. Gentile, S. Lefantzi, S. Monk, J. Ogden, M. Rajan, and J. Stevenson
Parallel Computing (2016), Elsevier B. V., http://dx.doi.org/10.1016/j.parco.2016.05.009
Large-Scale Persistent Numerical Data Source Monitoring System Experiences
J. Brandt, A. Gentile, M. Showerman, J. Enos, J. Fullop, and G. Bauer
Workshop on Monitoring and Analysis for High Performance Computing Systems Plus Applications (HPCMASPA)
at IEEE Int'l. Parallel and Distributed Processing Symposium (IPDPS) Chicago, IL. May 2016.
Design and Implementation of a Scalable HPC Monitoring System
S. Sanchez, A. Bonnie, G. Van Heule, C. Robinson, A. DeConinck, K. Kelly, Q. Snead, and J. Brandt
Workshop on Monitoring and Analysis for High Performance Computing Systems Plus Applications (HPCMASPA)
at IEEE Int'l. Parallel and Distributed Processing Symposium (IPDPS) Chicago, IL. May 2016.
Network Performance Counter Monitoring and Analysis on the Cray XC Platform
J. Brandt, E. Froese, A. Gentile, L. Kaplan, B. Allan, and E. Walsh
Cray Users Group (CUG), May 2016.
Dynamic Model Specific Register (MSR) Data Collection as a System Service
G. H. Bauer, J. Brandt, A. Gentile, A. Kot, and M. Showerman
Cray Users Group (CUG), May 2016.
Design and Implementation of a Scalable HPC Monitoring System for Trinity
A. DeConinck, A. Bonnie, K. Kelly, S. Sanchez, C. Martin, and M. Mason (LANL), J. Brandt, A. Gentile, B. Allan, and A. Agelastos (SNL), M. Davis and M. Berry (Cray)
Cray Users Group (CUG), May 2016.
Addressing the Challenges of "Systems Monitoring" Data Flows -- BOF Session Organizer 🔸
M. Showerman, J. Brandt, and A. Gentile
Cray Users Group (CUG), May 2016.
Smart HPC Centers: Data, Analysis, Feedback, and Response
J. Brandt, A. Gentile, C. Martin, B. Allan, and K. Devine
Minisymposium on Improving Performance, Throughput, and Efficiency of HPC Centers through Full System Data Analytics -- Minisymposium Organizer 🔸
at the SIAM Conf. on Parallel Processing for Scientific Computing (PP16), Paris, France. Apr 2016.
Monitoring High Speed Network Fabrics: Experiences and Needs
J. Brandt, A. Gentile, B. Allan, S. Lefantzi, and M. Aguilar
at Open Fabrics Alliance Workshop, Monterey, CA. Apr 2016.
Monitoring Large Scale HPC Platforms: Issues, Approaches, and Experiences
Univ. of Central Florida, Jan 2016.
HPC Monitoring, Understanding, and Performance: Where Less is Less -- Featured Presentation at DOE Booth 🔸
J. Brandt
at IEEE/ACM Int'l. Conf. for High Performance Storage, Networking, and Analysis (SC15) Austin, TX. Nov 2015.
LDMS Demo at DOE Booth, SC15, Nov 2015.
Monitoring Large-Scale HPC Systems: Data Analytics and Insights - BOF Session Organizer 🔸
at IEEE/ACM Int'l. Conf. for High Performance Storage, Networking, and Analysis (SC15) Austin, TX. Nov 2015.
Infrastructure for In Situ System Monitoring and Application Data Analysis
J. Brandt, K. Devine, and A. Gentile
In Situ Infrastructures for Enabling Extreme-scale Analysis and Visualization (ISAV 2015) at IEEE/ACM Int'l. Conf. for High Performance Storage, Networking, and Analysis (SC15), Austin, TX. Nov 2015.
New Systems, New Behaviors, New Patterns: Monitoring Insights from System Standup
J. Brandt, A. Gentile, C. Martin, J. Repik, and N. Taerat
Workshop on Monitoring and Analysis for High Performance Computing Systems Plus Applications (HPCMASPA) at IEEE Int'l. Conf. on Cluster Computing (CLUSTER) Chicago, IL. Sept 2015.
Extending LDMS to Enable Performance Monitoring in Multi-Core Applications
S. Feldman, D. Zhang, D. Dechev, and J. Brandt
Workshop on Monitoring and Analysis for High Performance Computing Systems Plus Applications (HPCMASPA) at IEEE Int'l. Conf. on Cluster Computing (CLUSTER) Chicago, IL. Sept 2015.
Toward Rapid Understanding of Production HPC Applications and Systems
A. Agelastos, B. Allan, J. Brandt, A. Gentile, S. Lefantzi, S. Monk, J. Ogden, M. Rajan, and J. Stevenson
IEEE Int'l. Conf. on Cluster Computing (CLUSTER), Chicago, IL. Sept 2015.
Enabling Advanced Operational Analysis Through Multi-Subsystem Data Integration on Trinity -- Best Paper Finalist 🔸
J. Brandt, D. DeBonis, A. Gentile, J. Lujan, C. Martin, D. Martinez, S. Olivier, K. Pedretti, N. Taerat, and R. Velarde
Cray User's Group (CUG), Chicago, IL. April 2015.
Scalable Integrated High-Fidelity Continuous Monitoring
at System Monitoring of Cray Systems BoF
at Cray User's Group (CUG), Chicago, IL. April 2015.
Demonstrating Improved Application Performance Using Dynamic Monitoring and Task Mapping -- Minisymposium Presentation
J. Brandt, K. Devine, A. Gentile, and K. Pedretti
Minisymposium on Topology Mapping and Locality
at the SIAM Conf. on Computational Science and Engineering (CSE 15), Salt Lake City, UT. Mar 2015.
Extreme-scale HPC Monitoring
In Sandia National Laboratories HPC Annual Report 2014, 2014.
Lightweight Distributed Metric Service: A Scalable Infrastructure for Continuous Monitoring of Large Scale Computing Systems and Applications
A. Agelastos, B. Allan, J. Brandt, P. Cassella, J. Enos, J. Fullop, A. Gentile, S. Monk, N. Naksinehaboon, J. Ogden, M. Rajan, M. Showerman, J. Stevenson, N. Taerat, and T. Tucker
IEEE/ACM Int'l. Conf. for High Performance Storage, Networking, and Analysis (SC14) New Orleans, LA. Nov 2014.
Monitoring Large-Scale HPC Systems: Issues and Approaches -- BOF Session Organizer 🔸
IEEE/ACM Int'l. Conf. for High Performance Storage, Networking, and Analysis (SC14), New Orleans, LA. Nov 2014.
Demonstrating Improved Application Performance Using Dynamic Monitoring and Task Mapping
J. Brandt, K. Devine, A. Gentile, and K. Pedretti
1st Workshop on Monitoring and Analysis for High Performance Computing Systems Plus Applications (HPCMASPA)
at IEEE Int'l. Conf. on Cluster Computing (CLUSTER), Madrid, Spain. Sept 2014.
Monitoring Application Resource Utilization on the Intel PHI Coprocessor -- Minitalk
J. Brandt and A. Gentile
1st Workshop on Monitoring and Analysis for High Performance Computing Systems Plus Applications (HPCMASPA)
at IEEE Int'l. Conf. on Cluster Computing (CLUSTER), Madrid, Spain. Sept 2014.
Memory Reliability and Performance Degradation -- Minitalk (Extended Abstract)
Benjamin Allan
1st Workshop on Monitoring and Analysis for High Performance Computing Systems Plus Applications (HPCMASPA)
at IEEE Int'l. Conf. on Cluster Computing (CLUSTER), Madrid, Spain. Sept 2014.
Large Scale System Monitoring and Analysis on Blue Waters Using OVIS -- Best Paper Finalist 🔸
M. Showerman, J. Enos, J. Fullop (NCSA), P. Cassella (Cray), N. Naksinehaboon, N. Taerat, T. Tucker (OGC), J. Brandt, A. Gentile, and B. Allan (SNL)
Cray User's Group (CUG), Lugano, Switzerland. May 2014.
Large Scale HPC Monitoring
New Mexico State University, Las Cruces, NM. April 2014.
High Fidelity Data Collection and Transport Service Applied to the Cray XE6/XK6
J. Brandt, T. Tucker, A. Gentile, D. Thompson, V. Kuhns, and J. Repik
Cray User's Group (CUG), Napa Valley, CA. May 2013.
Filtering Log Data: Finding Needles in the Haystack
L. Yu, Z. Zheng, Z. Lan, T. Jones, J. Brandt, and A. Gentile
42nd Annual IEEE/IFIP Int'l. Conf. on Dependable Systems and Networks (DSN), Boston, MA June 2012.
Report of Experiments and Evidence for ASC L2 Milestone 4467 - Demonstration of a Legacy Application's Path to Exascale
B. Barrett, R. Barrett, J. Brandt, R. Brightwell, M. Curry, N. Fabian, K. Ferreira, A. Gentile, S. Hemmert, S. Kelly, R. Klundt, J. Laros, V. Leung, M. Levenhagen, G. Lofstead, K. Moreland, R. Oldfield, K. Pedretti, A. Rodrigues, D. Thompson, T. Tucker, L. Ward, J. Van Dyke, C. Vaughan, and K. Wheeler
SAND2012-1750. Sandia National Laboratories. March 2012.
OVIS, Lightweight Data Metric Service (LDMS), and Log File Analysis
SC|11 Seattle, WA, November 2011.
- Exhibit ASC Booth 803 -- Demos & talk
- OVIS at Petascale Systems Management BOF -- Panelist 🔸
Develop Feedback System for Intelligent Dynamic Resource Allocation to Improve Application Performance
J. Brandt, A. Gentile, D. Thompson and T. Tucker
SAND2011-6301. Sandia National Laboratories. September 2011.
Framework for Enabling System Understanding
J. Brandt, F. Chen, A. Gentile, C. Leangsuksun, J. Mayo, P. Pébay, D. Roe, N. Taerat, D. Thompson, and M. Wong
4th Workshop on Resiliency (Resilience) in High Performance Computing
at Euro-Par 2011, Bordeaux, France. August 2011.
Baler: Deterministic, lossless log message clustering tool
N. Taerat, J. Brandt, A. Gentile, M. Wong, and C. Leangsuksun
In: Computer Science - Research and Development
Volume 26, Numbers 3-4, 285-295, DOI: 10.1007/s00450-011-0155-3
Int'l. Supercomputing Conference (ISC). Hamburg, Germany. June 2011.
OVIS, Lightweight Data Metric Service (LDMS), and Log File Analysis
SC|10 New Orleans, LA, Nov 2010.
- Exhibit ASC Booth Demos
- Exhibit ASC Booth talk: OVIS 3: Scalable Data Collection and Analysis for Large Scale HPC System Understanding
Scalable HPC Monitoring and Analysis for Understanding and Automated Response -- Invited Presentation 🔸
HPC Resilience Summit 2010: Workshop on Resilience for Exascale HPC
at the Los Alamos Computer Science Symposium, Santa Fe, NM. Oct 2010.
OVIS 3.2 User's Guide (NB: Deprecated)
J. Brandt, A. Gentile, C. Houf, J. Mayo, P. Pébay, D. Roe, D. Thompson, and M. Wong
SAND 2010-7109, Sandia National Laboratories, Oct 2010.
Understanding Large Scale HPC Systems Through Scalable Monitoring and Analysis
New Mexico State University, NM. October 2010.
Understanding Large Scale HPC Systems Through Scalable Monitoring and Analysis -- Invited Presentation 🔸
European Grid Initiative (EGI) Technical Forum 2010, Amsterdam, Netherlands. September 2010.
Computing Contingency Statistics in Parallel: Design Trade-Offs and Limiting Cases
P. Pébay, D. Thompson, and J. Bennett
IEEE Int'l. Conf. on Cluster Computing (CLUSTER), Heraklion, Greece. September 2010.
A Framework for Graph-Based Synthesis, Analysis, and Visualization of HPC Cluster Job Data
J. Brandt, V. De Sapio, A. Gentile, P. Kegelmeyer, J. Mayo, P. Pébay, D. Roe, D. Thompson, and M. Wong
SAND2010-2400, Sandia National Laboratories, August 2010.
The OVIS analysis architecture (NB: Deprecated)
J. M. Brandt, V. De Sapio, A. C. Gentile, J. Mayo, P. Pébay, D. Roe, D. Thompson, and M. H. Wong
Sandia Report SAND2010-5107, Sandia National Laboratories, July 2010.
The Python command line interface to the OVIS analysis functionality (NB: Deprecated)
J. M. Brandt, A. C. Gentile, J. Mayo, P. Pébay, D. Thompson, and M. H. Wong
Sandia Report SAND2010-4289, Sandia National Laboratories, June 2010.
Quantifying Effectiveness of Failure Prediction and Response in HPC Systems: Methodology and Example
J. Brandt, F. Chen, V. De Sapio, A. Gentile, J. Mayo, P. Pébay, D. Roe, D. Thompson, and M. Wong
1st Int'l Workshop on Fault-Tolerance for HPC at Extreme Scale (FTXS)
at the 40th Annual IEEE/IFIP Int'l. Conf. on Dependable Systems and Networks (DSN) Chicago, IL. June 2010.
Using Cloud Constructs and Predictive Analysis to Enable Pre-Failure Process Migration in HPC Systems
J. Brandt, F. Chen, V. De Sapio, A. Gentile, J. Mayo, P. Pébay, D. Roe, D. Thompson, and M. Wong
Workshop on Resiliency in High-Performance Computing (Resilience) in Clusters, Clouds, and Grids
at the 10th IEEE Int'l. Symposium on Cluster, Cloud, and Grid Computing (CCGRID), Melbourne, Australia. May 2010.
Combining Virtualization, Resource Characterization, and Resource Management to Enable Efficient High Performance Compute Platforms Through Intelligent Dynamic Resource Allocation
J. Brandt, F. Chen, V. De Sapio, A. Gentile, J. Mayo, P. Pébay, D. Roe, D. Thompson, and M. Wong
6th Workshop on System Management Techniques, Processes, and Services (SMTPS) - Special Focus on Cloud Computing
at the 24th IEEE Int'l. Parallel and Distributed Processing Symposium (IPDPS), Atlanta, GA. April 2010.
Scalable Information Fusion for Fault Tolerance in Large-Scale HPC -- Minisymposium Presentation 🔸
J. Brandt, F. Chen, V. De Sapio, A. Gentile, J. Mayo, P. Pébay, D. Roe, D. Thompson, and M. Wong
Minisymposium on Vertically Integrated Fault Tolerance for Large-Scale Scientific Computing
at the SIAM Conf. on Parallel Processing and Scientific Computing (PP10), Seattle, WA. Feb 2010.
OVIS in HPC: Information Fusion for Resilience
Louisiana Tech University, Ruston, LA. Host: Box Leangsuksun. December 2009.
Failure Prediction and Resilience in Large-Scale HPC Platforms
SC|09 Portland, OR, November 2009.
- Exhibit Presentation and Demo
Advanced ParaView Visualization
K. Moreland, J. Ahrens, D. DeMarle, D. Thompson, P. Pébay and N. Fabian
peer-reviewed tutorial on the use of statistics engines
at the IEEE VisWeek 2009, Atlantic City, NJ. October 2009.
Data Fusion and Statistical Analysis: Piercing the Darkness of the Black Box -- Invited Presentation 🔸
J. Brandt, F. Chen, V. De Sapio, A. Gentile, J. Mayo, P. Pébay, D. Roe, D. Thompson, and M. Wong
Workshop on Resiliency for Petascale HPC
at the Los Alamos Computer Science Symposium (LACSS 2009), Santa Fe, NM. October 2009.
Methodologies for Advance Warning of Compute Cluster Problems via Statistical Analysis: A Case Study
J. Brandt, A. Gentile, J. Mayo, P. Pébay, D. Roe, D. Thompson, and M. Wong
Workshop on Resiliency in High Performance Computing (Resilience)
at the 18th ACM Int'l. Symposium on High Performance Distributed Computing (HPDC), Munich, Germany. June 2009.
Resource Monitoring and Management with OVIS to Enable HPC in Cloud Computing -- Best Paper Award 🔸
J. Brandt, A. Gentile, J. Mayo, P. Pébay, D. Roe, D. Thompson, and M. Wong
5th Workshop on System Management Techniques, Processes, and Services (SMTPS) - Special Focus on Cloud Computing
at the 23rd IEEE Int'l. Parallel and Distributed Processing Symposium (IPDPS), Rome, Italy. May 2009.
OVIS 2.0 User's Guide (Deprecated)
J. Brandt, A. Gentile, J. Mayo, P. Pébay, D. Roe, D. Thompson, and M. Wong
SAND 2009-2329, Sandia National Laboratories, April 2009
OVIS: Scalable Real-time Analysis of Very Large Datasets
Overview viewgraph. 2009.
OVIS2: Whole System Monitoring and Analysis - Toward Understanding and Prediction
J. Brandt, B. Debusschere, A. Gentile, J. Mayo, P. Pébay, D. Thompson, and M. Wong
SC|08 Austin, TX. November 2008.
- Exhibit Presentation and Demo
Combining System Characterization and Novel Execution Models to Achieve Scalable Robust Computing -- Invited Presentation 🔸
H. Adalsteinsson, J. Brandt, B. Debusschere, A. Gentile, J. Mayo, P. Pébay, D. Thompson, and M. Wong
Workshop on Resiliency for Petascale HPC
at the Los Alamos Computer Science Symposium (LACSS 2008), Santa Fe, NM. October 2008.
OVIS: Scalable, Real-time Statistical Analysis of Very Large Datasets
J. Brandt, B. Debusschere, A. Gentile, J. Mayo, P. Pébay, D. Thompson, and M. Wong
2008 Sandia Workshop on Data Mining and Data Analysis
Extended abstract, SAND Report 2008-6109, Sandia National Laboratories, September 2008.
Using Probabilistic Characterization to Reduce Runtime Faults on HPC Systems
J. Brandt, B. Debusschere, A. Gentile, J. Mayo, P. Pébay, D. Thompson, and M. Wong
Workshop on Resiliency in High-Performance Computing (Resilience)
at the 8th IEEE Symposium on Cluster Computing and the Grid (CCGRID), Lyon, France, May 2008.
OVIS-2: A Robust Distributed Architecture for Scalable RAS
J. Brandt, B. Debusschere, A. Gentile, J. Mayo, P. Pébay, D. Thompson, and M. Wong
4th Workshop on System Management Techniques, Processes, and Services (SMTPS)
at the 22nd IEEE Int'l. Parallel and Distributed Processing Symposium (IPDPS), Miami, FL, April 2008.
OVIS-2: A Distributed Framework for Scalable Monitoring and Analysis of Large Computational Clusters
J. Brandt, B. Debusschere, A. Gentile, J. Mayo, P. Pébay, D. Thompson, and M. Wong
SC|07 Reno, NV, November 2007.
- Exhibit Presentation and Demo
Monitoring Computational Clusters with OVIS
J. M. Brandt, A. C. Gentile, P. P. Pébay and M. H. Wong
SAND Report 2006-7939, Sandia National Laboratories, December 2006.
OVIS: A Tool for Intelligent, Real-time Monitoring of Computational Clusters
J. M. Brandt, A. C. Gentile, J. Ortega, P. P. Pébay, D. C. Thompson, and M. H. Wong
SC|06 Tampa, FL, November 2006.
- Exhibit Presentation and Demo
OVIS: A Tool for Intelligent, Real-time Monitoring of Computational Clusters
J. M. Brandt, A. C. Gentile, D. J. Hale, and P. P. Pébay
The 2nd Workshop on System Monitoring Tools for Large-Scale Parallel Systems (SMTPS)
at the 20th IEEE Int'l. Parallel and Distributed Processing Symposium (IPDPS), Rhodes, Greece, April 2006.
Distributed, Intelligent RAS System for Large Computational Clusters: FactSheet
J. M. Brandt, A. C. Gentile, P. P. Pébay and M. H. Wong
Fact sheet, Sandia National Laboratories, April 2006.
Bayesian Inference for Intelligent, Real-time Monitoring of Computational Clusters
J. M. Brandt, A. C. Gentile, D. J. Hale, Y. M. Marzouk, and P. P. Pébay
SC|05 Seattle, Washington, November 2005.
- Exhibit Presentation, Demo, and Flier
- Conference Poster
Meaningful Automated Statistical Analysis of Large Computational Clusters
J. M. Brandt, A. C. Gentile, Y. M. Marzouk, and P. P. Pébay
at IEEE Int'l. Conf. on Cluster Computing (CLUSTER), Boston MA, September 2005.
Meaningful Automated Statistical Analysis of Large Computational Clusters
J. M. Brandt, A. C. Gentile, Y. M. Marzouk, and P. P. Pébay
SAND Report 2005-4558, Sandia National Laboratories, July 2005.
Detection of System Abnormalities Through Behavioral Analysis of ASC Codes
J. M. Brandt and A. C. Gentile
SC|04 Exhibit, Pittsburgh, PA, November 2004.
- Exhibit Demo
Distributed Intelligent RAS System for Large Computational Clusters
J. M. Brandt, N. M. Berry, R. A. Yao, B. M. Tsudama, and A. C. Gentile
SC|03, Phoenix, AZ November 2003.
- Exhibit Demo
- Conference Poster
The ASCR-funded exascale resilience project "Holistic Measurement Driven Resilience: Combining Operational Fault and Failure Measurements and Fault Injection for Quantifying Fault Detection and Impact" releases system datasets in support of resilience research.
Cielo Fault Injection Dataset 2016
S. Jha, V. Formicola, A. Bonnie, M. Mason, D. Chen, F. Deng, A. Gentile, J. Brandt, L. Kaplan, J. Repik, J. Enos, M. Showerman, A. Greiner, Z. Kalbarczyk, R. Iyer, and W. Kramer.
LA-UR-19-22749, SAND2019-3531 O, Mar 2019.
Mutrino Dataset 2/15-6/16 (12/16 Release) (About)
J. Brandt, A. Gentile, and J. Repik
SAND2016-12310 O, Dec 2016
[Online]: http://portal.nersc.gov/project/m888/resilience/datasets/mutrino/mutrino1yr-v122016.tgz
Mutrino Dataset 2/15-5/15 (About)
J. Brandt, A. Gentile, and J. Repik
SAND2016-2449 O, Mar 2016
[Online]: http://portal.nersc.gov/project/m888/resilience/datasets/mutrino/logs.051715.cr.tgz