Configuration Considerations and Best Practices (v4)
LDMS is designed to enable lightweight, reliable, high-fidelity data collection and transport. To ensure this, particularly when you are collecting large amounts of data, from large numbers of data sources, or at high frequency, there are a number of configuration considerations and parameters you will want to set appropriately.
- Users configure each LDMS daemon with the size of memory to reserve to hold its metric sets. Sets occupy memory that is mapped and locked for the purpose of exchange with a peer via RDMA. Typically, these transports have limited remote memory access resources. To minimize LDMS utilization of these, we map all set memory a priori and use it as needed to contain instances of sets.
- The default values are currently based on sampler estimates and will be sufficient even for the aggregator test cases. For large-scale systems and many sets you should increase the aggregator set memory size. This is the `-m` option on the command line. Sets may require anywhere from 2K to 64K each, depending on the set.
- Metric set sizes (both metadata and data) can be seen in a metric set's metadata. You can use this information to guide your choice.
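- As a quick check, `ldms_ls` with the verbose flag reports per-set sizes. A minimal sketch, assuming a daemon listening on sock port 50000 (host and port are placeholders):
```
# show per-set metadata/data sizes in the verbose listing
ldms_ls -h localhost -x sock -p 50000 -v
```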
- You can determine your sampling frequency based on how often you want the data values, which may be subject to the expected update frequency of the underlying source, the time resolution of the features you are interested in, the amount of data you want to handle, and the impact of collecting the information. Note that the aggregation interval can be set independently of the sampling interval (just as in v3). In v3, modifying the data collection frequency required manual coordination of aggregation and sampling frequencies. New with v4 is a "hints" feature that enables the aggregators to coordinate their aggregation frequency with a sampler's sampling frequency automatically. This feature is enabled by default but can be overridden in the aggregator configuration. This means that samplers can now be re-configured dynamically and the aggregators will be re-configured to match. This requires consideration in sizing aggregator-to-sampler fanout, which should take into account the highest expected sampler frequencies.
- You can set the offset the same or differently on all nodes, if desired. In practice, most people choose to set the offset the same so that the data is most closely aligned in time to support analysis.
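- For example, a minimal configuration fragment that could be used unchanged on every node so that samples stay time-aligned (the plugin name and interval are illustrative):
```
# same interval and offset on all nodes
start name=meminfo interval=1000000 offset=0
```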
- `-P` specifies the number of threads working on the LDMSD tasks. In the sampler case, the tasks are periodic `timeout` events for each sampler plugin. The default is `1`. For samplers, 1 thread is usually sufficient. If a sampler plugin takes an unusually long time to sample, multiple worker threads can be used so that other sampler plugins are not blocked by the long-running sampler.
- LDMS daemons take configuration arguments for the log level and target output, which can be a file, syslog, or even /dev/null. Some levels (e.g., DEBUG) may be quite verbose. After testing, in production you may opt to turn logging off.
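- A minimal sketch of the relevant command-line options (check the exact flags and log level names against the ldmsd man page for your version):
```
# log to a file at a relatively quiet level
ldmsd -x sock:50000 -m 512M -l /var/log/ldmsd.log -v ERROR
# effectively disable logging for production
ldmsd -x sock:50000 -m 512M -l /dev/null
```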
- Changing the sampling interval can be done using ldmsd_controller or ldmsctl. In the examples below, assume there is already a sampler daemon listening on port 50000 and we want to change to a 1s sampling interval.
- ldmsd_controller example:
```
# To change to a 1s interval, create a script, in this case freq.sh
more freq.sh
#!/bin/bash
echo "stop name=all_example"
echo "start name=all_example interval=1000000"
echo "quit"

# Use ldmsd_controller to run the script
ldmsd_controller --xprt sock --port 50000 --host localhost --script freq.sh
```
- ldmsctl example:
```
# Start ldmsctl
ldmsctl -x sock -p 50000 -h localhost
ldmsctl> stop name=all_example
ldmsctl> start name=all_example interval=1000000
ldmsctl> quit
```
- TODO (Data sources whose metrics may vary or who may entirely come and go)
- Users configure each LDMS daemon with the size of memory to reserve to hold its metric sets. Sets occupy memory that is mapped and locked for the purpose of exchange with a peer via RDMA. Typically, these transports have limited remote memory access resources. To minimize LDMS utilization of these, we map all set memory a priori and use it as needed to contain instances of sets.
- The memory allocation for the aggregator must support all collected sets. The default values are currently based on sampler estimates and will be sufficient even for the aggregator test cases.
- For large-scale systems and many sets you should increase the aggregator set memory size. In many setups the aggregators run on dedicated (e.g., non-compute) nodes, and thus large allocations are not a concern. In large-scale setups we have used around 2G for Cray Aries systems. Cray Gemini systems cannot allocate that much memory, however (limit here).
- This is the `-m` option on the command line.
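- A minimal sketch of starting an aggregator with a larger set-memory reservation (transport, port, and paths are placeholders):
```
# reserve 2G of locked set memory for the aggregator
ldmsd -x sock:10000 -m 2G -c /etc/ldmsd/agg.conf -l /var/log/ldmsd-agg.log
```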
- TODO
- A pair of aggregators can be set up to be each other's backup in serving the next-level aggregator. For example, a pair of level-1 aggregators, `agg-1-0` and `agg-1-1`, both serve the level-2 aggregator `agg-2-0`. When `agg-1-0` is out of service, `agg-1-1` will pick up `agg-1-0`'s load (producers and updaters) and continue serving the data. Let's assume that each aggregator runs on its own host (e.g., agg-1-0.hostname) and listens on port `10000`.
- Example:
```
# on agg-1-0
# add a bunch of producers (connections to samplers)
prdcr_add name=node000 ...
prdcr_start name=node000 ...
...
# add an updater that works on all producers
updtr_add name=upd regex=.*
updtr_start name=upd
# add a failover partner
failover_config host=agg-1-1.hostname port=10000 xprt=sock interval=1000000
```
```
# on agg-1-1
# add a bunch of producers (connections to samplers)
prdcr_add name=node100 ...
prdcr_start name=node100 ...
...
# add an updater that works on all producers
updtr_add name=upd regex=.*
updtr_start name=upd
# add a failover partner
failover_config host=agg-1-0.hostname port=10000 xprt=sock interval=1000000
```
```
# on agg-2-0
# add agg-1-0
prdcr_add name=a10 host=agg-1-0.hostname port=10000 xprt=sock type=active interval=1000000
prdcr_start name=a10
# add agg-1-1
prdcr_add name=a11 host=agg-1-1.hostname port=10000 xprt=sock type=active interval=1000000
prdcr_start name=a11
# update all
updtr_add name=upd regex=.*
updtr_start name=upd
```
- In this example, `agg-1-0` aggregates sets from `node000..099`, `agg-1-1` aggregates from `node100..199`, and `agg-2-0` aggregates from `agg-1-0` and `agg-1-1`. In other words, `agg-2-0` sees sets originating from `node000..099` through `agg-1-0`, and from `node100..199` through `agg-1-1`. Without failover, if `agg-1-0` dies, `agg-2-0` will not be able to see the updates of `node000..099`. With failover, the workload of `node000..099` will be picked up by `agg-1-1`, and `agg-2-0` will continue seeing updates from `node000..099`, but through `agg-1-1`. The same goes for `agg-1-0` if `agg-1-1` dies.
- Please see more about ldmsd failover in the `ldmsd_failover(7)` manual page.
- Aggregation frequency can be independent of the sampling frequency, as previously described. The offset is an easy way to increase the probability that you are not trying to aggregate while the samplers are collecting (if this happens, the set will be marked inconsistent and will be aggregated but not stored). The metadata of a metric set includes the duration of the sampling time; set the aggregation offset greater than these values, but less than the sampling interval. In practice, for collection intervals of 1 sec, an aggregation offset of milliseconds is fine.
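- A minimal sketch of an updater that polls once per second, 100 ms after the samplers' offset (command names follow ldmsd_controller; the interval and offset values are illustrative):
```
updtr_add name=upd interval=1000000 offset=100000
updtr_prdcr_add name=upd regex=.*
updtr_start name=upd
```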
- The purpose of groups is for the aggregator to grab all the metric sets together in one call and minimize the number of individual grabs needed to pull all of the sets. The choice of which producers, and how many, to include in an updtr group depends on a) the frequency at which you want to pull the sets and b) processor type. All producers in a group should be intended to be pulled at the same frequency.
- If you have misconfigured this with respect to the ability to process all the sets in the updtr group, then you will see an error message that a new update has been scheduled while one is still in progress. This can be resolved by using more updtr groups with fewer producers per group.
- LDMS sets on an LDMS daemon can be grouped into an `ldms_setgroup`. The `setgroup` is actually an LDMS set with special attributes. Grouping sets helps ease working with a collection of sets, for example setting an update frequency on the aggregator so that sets related to disk I/O are updated differently from sets related to CPU usage. Or, with
```
ldms_ls -l <group_name>
```
you can see the data of the sets in the group without having to see the data from all sets.
- Please see more about setgroups in the `ldmsd_setgroup(7)` manual page.
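- A rough sketch of creating a group and inserting members on a sampler daemon, loosely following `ldmsd_setgroup(7)` (verify the command and parameter names against the man page for your version; the host and instance names are placeholders):
```
# create a group that an aggregator can update on its own schedule
setgroup_add name=node01/grp_io producer=node01 interval=1000000 offset=0
# insert existing set instances into the group
setgroup_ins name=node01/grp_io instance=node01/procdiskstats,node01/procnetdev
```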
TODO
- Recommendation: export ZAP_EVENT_WORKERS=8
- `-P` specifies the number of threads working on the LDMSD tasks. In the aggregator case, the tasks are periodic timeout events for each updater/producer. Usually, 1 thread is sufficient for posting updates/connects. However, the update completion handling needs to scale up if there is processing to do on the updated data. The `ZAP_EVENT_WORKERS` environment variable controls the number of threads working the set completion routine (default: `4`).
- Recommendation: export ZAP_UGNI_CQ_DEPTH=65536
- The default CQ depth is 2K. This is not enough for aggregators. The CQ contains slots that are consumed when RDMA requests are completed. RDMA requests occur at:
- lookup: happens right after the connect
- update: happens every time the updater schedules a set update
- push: happens when a set registered for a push closes a transaction boundary (i.e., a sample completes)
- If the low-level completion thread cannot keep up, the CQ may overflow, resulting in GNI_RESOURCE_ERRORS (you will see these in the Cray logs).
- Recommendation: export ZAP_EVENT_QDEPTH=65536
- Recommendation: export ZAP_EVENT_WORKERS=8
- I/O events are delivered to threads for handling by the application. To maintain ordering, an endpoint is assigned to one and only one of the I/O worker threads. Increasing the number of workers will help with performance.
- Because the handling of an event can take a long time (e.g., obtaining and then storing the updated data), these queues may need to be deeper than expected and the number of threads larger. Increasing the queue depth and the number of workers will help with performance.
- If the I/O still cannot keep up due to your system limitations, you may need to split up your store (e.g., use multiple containers, or multiple aggregators writing to different locations) so that the data does not go to a single sink.
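- Putting the recommendations above together, a hypothetical aggregator start-up fragment might look like the following (transport, port, memory size, and paths are placeholders; the ugni transport applies only to Cray systems):
```
# environment tuning for a busy aggregator
export ZAP_UGNI_CQ_DEPTH=65536
export ZAP_EVENT_QDEPTH=65536
export ZAP_EVENT_WORKERS=8
# start the aggregator with extra worker threads and set memory
ldmsd -x ugni:10000 -m 2G -P 8 -c /etc/ldmsd/agg.conf -l /var/log/ldmsd-agg.log
```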
- A common architectural model is a system with a high-speed network, which can support RDMA, and an ultimate destination of remote monitoring and analysis host(s). Often the remote host(s) are only reachable from certain hosts in the system, and the only connectivity of those hosts is over Ethernet or IB. LDMS supports mixed transports in the same overall setup. A multi-level aggregation setup may be desirable, in which the first-level aggregators fetch data from the sampler daemons over RDMA, and the second-level (remote) aggregator fetches data from the first-level aggregators over Ethernet or IB.
- In some security domains, only certain directions are allowed to initiate a connection. Active vs. passive designates whether the daemon initiates the connection or waits to be connected to.
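- A hypothetical sketch of a first-level aggregator that listens on RDMA for its samplers and on a sock transport for the remote second-level aggregator (this assumes your ldmsd version supports multiple listening endpoints; ports and paths are placeholders):
```
# listen on RDMA (for samplers) and on sock (for the remote aggregator)
ldmsd -x rdma:10001 -x sock:10002 -m 2G -c /etc/ldmsd/agg1.conf
```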
- TODO
- Currently LDMS does not support sampling plugins in an aggregator. If you want to sample on a node where you are aggregating, run a separate LDMS daemon with the sampler plugin on the same node, and include it as a host from which to aggregate (e.g., you can run multiple LDMS daemons on the same node and they can aggregate in the same way remote daemons do). The exception to this is the dstat sampler plugin, which monitors the daemon itself. The dstat data collected about the daemon it runs in will not be sent to store plugins on the same daemon, but must be collected and stored by a higher-level aggregator which monitors the first daemon. The dstat plugin may also be used directly in sampler daemons to monitor overhead, but adds to the overhead being monitored.
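- A hypothetical sketch of running a sampler daemon alongside an aggregator daemon on the same node (ports, memory sizes, and config paths are placeholders):
```
# sampler daemon for local metrics
ldmsd -x sock:10001 -m 512M -c /etc/ldmsd/sampler.conf
# aggregator daemon; its config would include a prdcr_add pointing at localhost:10001
ldmsd -x sock:10000 -m 2G -c /etc/ldmsd/agg.conf
```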
- TODO
- TODO
- LDMS seeks to support flexible direction of data to your storage infrastructure. The simplest store is the CSV store, which writes the data out to a separate file handle per metric set for all components. This can either be an actual text file, or a write to a named pipe by which you can forward the data onward (e.g., to syslog). Ultimately, you can then transform that data as you please at your destination.
- A message bus can also be a popular storage target, since it can then be a low-delay sink which also supports multiple consumers or subscribers that can handle the actual store details. There is a RabbitMQ store for this purpose, and others have written Kafka-based stores (not yet contributed to this release).
- Multiple concurrent storage targets are supported. For instance, you may use the function store to store a subset of the metrics (or functions thereof) in a metric set which you want to access quickly (e.g., I/O rates) in one place, while also sending the full metric set to an alternate, long-term store.
- Another option to downselect metrics is to configure the storage policy to add only a subset of the metrics in a metric set.
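- A minimal sketch of a CSV storage policy that stores only selected metrics (the paths, schema, and metric names are illustrative; see ldmsd_controller and the store_csv documentation for the exact options in your version):
```
# load and configure the CSV store
load name=store_csv
config name=store_csv path=/var/ldms/csv
# storage policy for the meminfo schema, restricted to two metrics
strgp_add name=store_meminfo plugin=store_csv container=meminfo schema=meminfo
strgp_prdcr_add name=store_meminfo regex=.*
strgp_metric_add name=store_meminfo metric=MemFree
strgp_metric_add name=store_meminfo metric=MemAvailable
strgp_start name=store_meminfo
```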
- You will need to be sure that your data store recipient can handle the flow of data being written to the store (e.g., when writing to slow file systems or other sinks). This may also place some requirements on the format of the output data (e.g., you may want to ensure that full, not partial, lines are flushed out). There are parameters to the store to control flushing options (e.g., system determined, number of lines, number of bytes).
- For small systems and test cases that have only a few data sources or infrequent collection rates, using system buffering may result in very infrequent writeouts and so it may appear that your data is not being stored when it merely has not flushed yet. In such cases either turn off buffering (0) or configure more frequent flushing.
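- For example, a hedged sketch of turning buffering off in the CSV store for a small test system (the `buffer` parameter name is taken from the store_csv documentation; verify it for your version):
```
# buffer=0 disables buffering so rows appear in the CSV files immediately
config name=store_csv path=/var/ldms/csv buffer=0
```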
- TODO
TODO
- Metrics grouped within a metric set provide the benefits of: a) efficient memory layout and fetching all metrics together, and b) all metrics being associated with the same timestamp. The latter aids analysis. Many LDMS samplers have been written to collect all possible metrics from a data source (e.g., meminfo), even if you may not be interested in all those metrics to first order. Relatedly, you can write samplers to include metrics from multiple sources within the same sampler (as is done in the Cray system sampler) to get all desired metrics in the same dataset.
- In order to minimize the impact on-node, we most often collect data in its raw form on-node and do additional computations (e.g., rates) either when injecting into the store (e.g., via store_function_csv) or at the store itself (e.g., if it is a database). This also gives you the ability to retain access to the raw data in future in case there are additional derivations you discover later.
- All set instances of the same schema are supposed to contain the same metrics. For non-homogeneous data sources, the metrics may vary. A common practice is to identify the type of processor:
```
env PROCTYPE=$(cat /sys/devices/system/cpu/modalias | cut -d ':' -f 3 | tr ',' '_')
```
and then use that processor type in the schema name when configuring the plugin:
```
load name=vmstat
config name=vmstat producer=${NID} instance=${NID}/vmstat component_id=${COMPONENT_ID} schema=vmstat_${PROCTYPE} job_set=${NID}/jobinfo uid=0 gid=44476 perm=0777
start name=vmstat interval=${SAMPLE_INTERVAL} offset=${SAMPLE_OFFSET}
```
- You may want to redundantly collect data which is exposed to multiple nodes (e.g., network performance counters on the Cray Gemini and Cray Aries) in order to ensure that full data is collected when some nodes go down. Since only the smaller sized data (not the larger metadata) is transported each time, this redundancy should not result in application-performance impacting traffic sizes. An additional benefit is that subdivision of data may entail significantly more work on the store end in order to recombine divided data (e.g., getting all the network counters for a single Aries ASIC after it has been subdivided into separate metric sets among nodes).
- Component id can be anything; however, it is most often used as a unique numerical identifier for a host, for instance the numerical node id for the source, as would appear in a scheduler resource list. This is numeric in order to facilitate quick comparison (e.g., in a database search) as opposed to a string comparison.
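- A hypothetical shell fragment for deriving a numeric component id from a node name like `nid00042` (the naming scheme and variable names are illustrative, not part of LDMS):
```
# strip non-digits and leading zeros: nid00042 -> 42
COMPONENT_ID=$(hostname | sed 's/[^0-9]//g' | sed 's/^0*//')
# then reference it in the sampler config, e.g.:
#   config name=meminfo producer=${HOSTNAME} instance=${HOSTNAME}/meminfo component_id=${COMPONENT_ID} schema=meminfo
```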
- The container parameter for a store plugin represents some form of encapsulation appropriate for the store. For CSV files, this is a directory under which all the CSV files will be written. Other stores may have different meanings (e.g., a database schema or table).
- Authentication determines access to the metric sets, such as querying via ldms_ls and for ldmsds aggregating metric sets from other ldmsds. In v4, the authentication options are: munge, shared secret, or none. On Cray systems, the ptag also restricts access to the daemon.
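- For example, a minimal sketch using munge authentication on both ends (assumes a running munge service; the port and config path are placeholders):
```
# daemon side
ldmsd -x sock:10000 -a munge -c /etc/ldmsd/agg.conf
# client queries must use the same authentication method
ldms_ls -h localhost -x sock -p 10000 -a munge
```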
- Some of our early (2013 to early 2016) experiences in production large-scale monitoring on Blue Waters, which motivated some design and configuration decisions, are described in "Large-scale Persistent Numerical Data Source Monitoring System Experiences," Brandt et al., HPCMASPA 2016.