Skip to content

Latest commit

 

History

History
213 lines (157 loc) · 7.53 KB

tracing.md

File metadata and controls

213 lines (157 loc) · 7.53 KB

Overview

This document explains how to trace Kata Containers components.

Introduction

The Kata Containers runtime and agent are able to generate OpenTelemetry trace spans, which allow the administrator to observe what those components are doing and how much time they are spending on each operation.

OpenTelemetry summary

An OpenTelemetry-enabled application creates a number of trace "spans". A span contains the following attributes:

  • A name
  • A pair of timestamps (recording the start time and end time of some operation)
  • A reference to the span's parent span

All spans need to be finished, or completed, to allow the OpenTelemetry framework to generate the final trace information (by effectively closing the transaction encompassing the initial (root) span and all its children).

For Kata, the root span represents the total amount of time taken to run a particular component from startup to its shutdown (the "run time").

Architecture

Runtime tracing architecture

The runtime, which runs in the host environment, has been modified to optionally generate trace spans which are sent to a trace collector on the host.

Agent tracing architecture

An OpenTelemetry system (such as Jaeger) uses a collector to gather up trace spans from the application for viewing and processing. For an application to use the collector, it must run in the same context as the collector.

This poses a problem for tracing the Kata Containers agent since it does not run in the same context as the collector: it runs inside a virtual machine (VM).

To allow spans from the agent to be sent to the trace collector, Kata provides a trace forwarder component. This runs in the same context as the collector (generally on the host system) and listens on a VSOCK channel for traces generated by the agent, forwarding them on to the trace collector.

Note:

This design supports agent tracing without having to make changes to the image, but also means that custom images can also benefit from agent tracing.

The following diagram summarises the architecture used to trace the Kata Containers agent:

+--------------------------------------------+
| Host                                       |
|                                            |
| +---------------+                          |
| | OpenTelemetry |                          |
| | Trace         |                          |
| | Collector     |                          |
| +---------------+                          |
|       ^                  +---------------+ |
|       | spans            | Kata VM       | |
| +-----+-----+            |               | |
| | Kata      |    spans   o     +-------+ | |
| | Trace     |<-----------------| Kata  | | |
| | Forwarder |    VSOCK   o     | Agent | | |
| +-----------+    Channel |     +-------+ | |
|                          +---------------+ |
+--------------------------------------------+

Agent tracing prerequisites

  • You must have a trace collector running.

    Although the collector normally runs on the host, it can also be run from inside a Docker image configured to expose the appropriate host ports to the collector.

    The Jaeger "all-in-one" Docker image method is the quickest and simplest way to run the collector for testing.

  • If you wish to trace the agent, you must start the trace forwarder.

Notes:

  • If agent tracing is enabled but the forwarder is not running, the agent will log an error (signalling that it cannot generate trace spans), but continue to work as normal.

  • The trace forwarder requires a trace collector (such as Jaeger) to be running before it is started. If a collector is not running, the trace forwarder will exit with an error.

Enable tracing

By default, tracing is disabled for all components. To enable any form of tracing an enable_tracing option must be enabled for at least one component.

Note:

Enabling this option will only allow tracing for subsequently started containers.

Enable runtime tracing

To enable runtime tracing, set the tracing option as shown:

[runtime]
enable_tracing = true

Enable agent tracing

To enable agent tracing, set the tracing option as shown:

[agent.kata]
enable_tracing = true

Note:

If both agent tracing and runtime tracing are enabled, the resulting trace spans will be "collated": expanding individual runtime spans in the Jaeger web UI will show the agent trace spans resulting from the runtime operation.

Appendices

Agent tracing requirements

Host environment

  • The host kernel must support the VSOCK socket type.

    This will be available if the kernel is built with the CONFIG_VHOST_VSOCK configuration option.

  • The VSOCK kernel module must be loaded:

    $ sudo modprobe vhost_vsock
    

Guest environment

  • The guest kernel must support the VSOCK socket type:

    This will be available if the kernel is built with the CONFIG_VIRTIO_VSOCKETS configuration option.

    Note: The default Kata Containers guest kernel provides this feature.

Agent tracing limitations

  • Agent tracing is only "completed" when the workload and the Kata agent process have exited.

    Although trace information can be inspected before the workload and agent have exited, it is incomplete. This is shown as <trace-without-root-span> in the Jaeger web UI.

    If the workload is still running, the trace transaction -- which spans the entire runtime of the Kata agent -- will not have been completed. To view the complete trace details, wait for the workload to end, or stop the container.

Performance impact

OpenTelemetry is designed for high performance. It combines the best of two previous generation projects (OpenTracing and OpenCensus) and uses a very efficient mechanism to capture trace spans. Further, the trace points inserted into the agent are generated dynamically at compile time. This is advantageous since new versions of the agent will automatically benefit from improvements in the tracing infrastructure. Overall, the impact of enabling runtime and agent tracing should be extremely low.

Agent shutdown behaviour

In normal operation, the Kata runtime manages the VM shutdown and performs certain optimisations to speed up this process. However, if agent tracing is enabled, the agent itself is responsible for shutting down the VM. This it to ensure all agent trace transactions are completed. This means there will be a small performance impact for container shutdown when agent tracing is enabled as the runtime must wait for the VM to shutdown fully.

Set up a tracing development environment

If you want to debug, further develop, or test tracing, enabling full debug is highly recommended. For working with the agent, you may also wish to enable a debug console to allow you to access the VM environment.