
Configuration Considerations (v3)


LDMS is designed to enable lightweight, reliable, high-fidelity data collection and transport. To ensure this, particularly when you are collecting large amounts of data, from large numbers of data sources, or at high frequency, there are a number of configuration considerations and parameters you will want to set appropriately.


LDMS Daemons with Samplers (aka Samplers)

Memory allocation

Each LDMS daemon is configured with the amount of memory it reserves to hold its metric sets. Metric set sizes (both metadata and data) can be seen in a metric set's metadata; you can use this information to guide your choice.
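
As a minimal sketch (the memory size, port, and log settings here are illustrative assumptions, not recommendations), the reserved memory is set with the daemon's -m option, and the per-set metadata and data sizes can be inspected with ldms_ls -v:

```
# Start a sampler daemon with 512 MB reserved for metric sets (illustrative value)
ldmsd -x sock:10001 -m 512M -l /var/log/ldmsd.log -v ERROR

# Inspect set sizes to guide the -m choice
ldms_ls -h localhost -x sock -p 10001 -v
```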

Collection frequency and offset

You can determine your collection frequency based on how often you want the data values; this may depend on the expected update frequency of the underlying source, the time resolution of the features you are interested in, the amount of data you want to handle, and the impact of collecting the information. Note that the aggregation interval can be set independently of the sampling interval, so you could sample every second and initially aggregate at minute intervals, until you decide to change the aggregation interval.
You can set the offset to be the same or different across nodes, as desired. In practice, most people set the same offset everywhere, so that the data is most closely aligned in time to support analysis.
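
As an illustrative sketch (plugin choice, interval, and offset values are assumptions, not recommendations), a sampler plugin's interval and offset are given in microseconds when it is started:

```
# Load and configure the meminfo sampler, then sample once per second at zero offset
load name=meminfo
config name=meminfo producer=nid00001 instance=nid00001/meminfo component_id=1
start name=meminfo interval=1000000 offset=0
```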

LDMS daemon threads

Logging

LDMS daemons take configuration arguments for the log level and the target output, which can be a file, syslog, or even /dev/null. Some levels (e.g., DEBUG) can be quite verbose. After testing, you may opt to turn logging off in production.
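
For example (the paths and levels here are illustrative), the log destination and level are typically given on the ldmsd command line:

```
# Verbose logging while testing
ldmsd -x sock:10001 -l /var/log/ldmsd.log -v DEBUG

# Minimal logging in production
ldmsd -x sock:10001 -l /dev/null -v ERROR
```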

LDMS Daemons which are Aggregating (aka Aggregators)

Memory allocation

Fan-in of LDMS Daemons (both those with Samplers and multi-level aggregation)

Aggregation frequency and offset

The aggregation frequency can be independent of the sampling frequency, as previously described. The offset is an easy way to increase the probability that you are not trying to aggregate while the samplers are collecting (if this happens, the set will be marked inconsistent and will be aggregated but not stored). The metadata of a metric set includes the duration of the sampling; set the aggregation offset greater than this value, but less than the sampling interval. In practice, for collection intervals of 1 second, an aggregation offset of a few milliseconds is fine.
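
A minimal sketch, assuming the sock transport and illustrative host names, ports, and intervals (the updater offset, like the sampler offset, is in microseconds):

```
# Add a sampler daemon as a producer; the interval here is the reconnect interval
prdcr_add name=nid00001 host=nid00001 type=active xprt=sock port=10001 interval=20000000
prdcr_start name=nid00001

# Update (aggregate) all matching producers every second, offset 100 ms after the samplers
updtr_add name=all_sets interval=1000000 offset=100000
updtr_prdcr_add name=all_sets regex=.*
updtr_start name=all_sets
```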

LDMS daemon threads

Transport type

A common architectural model is a system with a high-speed network that supports RDMA, and an ultimate destination of one or more remote monitoring and analysis hosts. Often the remote hosts are reachable only from certain hosts in the system, and the only connectivity of those hosts is over Ethernet or IB. LDMS supports mixed transports in the same overall setup. A multi-level aggregation setup may be desirable, in which the first-level aggregator fetches data from the sampler daemons over RDMA, and the second-level (remote) aggregator fetches data from the first-level aggregators over Ethernet or IB.
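
As an illustrative sketch (host names and ports are assumptions), each aggregator pulls over whichever transport reaches its producers, while listening on its own transport for the next level:

```
# Samplers listen on RDMA, e.g.: ldmsd -x rdma:10001 ...
# First-level aggregator (listening on sock:10002) pulls from the samplers over RDMA
prdcr_add name=nid00001 host=nid00001 type=active xprt=rdma port=10001 interval=20000000

# Second-level (remote) aggregator pulls from the first level over sock (e.g., Ethernet)
prdcr_add name=agg1 host=agg1.example.com type=active xprt=sock port=10002 interval=20000000
```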

Logging

Sampling on an Aggregator

Currently LDMS does not support sampling plugins in an aggregator. If you want to sample on a node where you are aggregating, run a separate LDMS daemon with the sampler plugin on the same node and include it as a host from which to aggregate (i.e., you can run multiple LDMS daemons on the same node, and the aggregator pulls from the co-located daemon the same way it does from remote daemons).
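
A minimal sketch, assuming illustrative ports, with a sampler daemon and an aggregator daemon co-located on one node:

```
# Sampler daemon on the aggregator node, listening on its own port
# (load/config/start its sampler plugin as shown earlier)
ldmsd -x sock:10001 -m 256M

# In the co-located aggregator's configuration, add the local daemon as a producer
prdcr_add name=local_sampler host=localhost type=active xprt=sock port=10001 interval=20000000
prdcr_start name=local_sampler
```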

Failover

LDMS Daemons which are Aggregating and Storing

Store frequency

Storage targets

LDMS seeks to support flexible direction of data to your storage infrastructure. The simplest store is the CSV store, which writes the data from all components to a separate file handle per metric set. This can be an actual text file, or a write to a named pipe through which you can forward the data onward (e.g., to syslog). You can then transform that data as you please at your destination.
Multiple concurrent storage targets are supported. For instance, you may use the function store to store a subset of the metrics (or functions thereof) in a metric set which you want to access quickly (e.g., I/O rates) in one place, while also sending the full metric set to an alternate, long-term store.
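
A minimal sketch using the CSV store, with illustrative path, container, and schema names:

```
# Load and configure the CSV store, then attach a storage policy to a schema
load name=store_csv
config name=store_csv path=/var/lib/ldms/csv
strgp_add name=store_meminfo plugin=store_csv container=csv schema=meminfo
strgp_prdcr_add name=store_meminfo regex=.*
strgp_start name=store_meminfo
```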

Flush to store targets

You will need to be sure that the recipient of your stored data can handle the flow of data being written (e.g., when writing to a slow file system or other sink). This may also place some requirements on the format of the output data (e.g., you may want to ensure that full, not partial, lines are flushed out). There are parameters to the store that control flushing behavior (e.g., system determined, number of lines, number of bytes).
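
As an illustrative sketch, the CSV store exposes buffering controls; the parameter names and semantics shown here are assumptions that may vary by release, so check the store_csv man page for your version:

```
# buffer=0: flush every line; buffer=1: leave flushing to the system;
# buffer=N with buffertype: flush after roughly N lines or N kB (assumed semantics)
config name=store_csv path=/var/lib/ldms/csv buffer=1
```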

Sampler Plugins

Metrics in a metric set

Grouping metrics within a metric set provides two benefits: a) an efficient memory layout, with all metrics fetched together, and b) all metrics associated with the same timestamp. The latter aids analysis. Many LDMS samplers have been written to collect all possible metrics from a data source (e.g., meminfo), even if you may not be interested in all of those metrics to first order. Relatedly, you can write samplers that include metrics from multiple sources within the same sampler (as is done in the Cray system sampler) to get all desired metrics in the same data set.

Collection redundancy

You may want to redundantly collect data which is exposed to multiple nodes (e.g., network performance counters on the Cray Gemini and Cray Aries) in order to ensure that full data is collected when some nodes go down. Since only the smaller data section (not the larger metadata) is transported each time, this redundancy should not result in traffic sizes that impact application performance. An additional benefit is avoiding subdivision of the data, which may entail significantly more work on the store end to recombine (e.g., reassembling all the network counters for a single Aries ASIC after they have been subdivided into separate metric sets among nodes).

Authentication

Authentication determines access to the metric sets, for example querying via ldms_ls or an ldmsd aggregating metric sets from other ldmsd's. In v3, the authentication options are a shared secret or none. On Cray systems, the ptag also restricts access to the daemon.
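
A minimal sketch of the shared-secret option, assuming the daemons were built with authentication enabled; the file location, key name, and environment variable shown are assumptions that may differ, so consult the ldms_authentication documentation for your build:

```
# Create a secret word file readable only by the owner, and point ldmsd and ldms_ls at it
echo "secretword=REPLACE_WITH_A_LONG_RANDOM_STRING" > $HOME/.ldmsauth.conf
chmod 600 $HOME/.ldmsauth.conf
export LDMS_AUTH_FILE=$HOME/.ldmsauth.conf
```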

Misc

Some of our early (2013 to early 2016) experiences with production large-scale monitoring on Blue Waters, which motivated some design and configuration decisions, are described in "Large-scale Persistent Numerical Data Source Monitoring System Experiences," Brandt et al., HPCMASPA 2016.
