Device activity tracing measures the execution time of kernels and memory transfers running on the device. This capability is part of the Level Zero core API and does not require enabling the Instrumentation Layer.
There are also two alternative ways to get kernel execution time, both based on the Metrics API and covered in the corresponding chapter:
- Based on metric queries;
- Based on stream markers.
The general approach for all of these options is to find (or intercept with the Tracing API) the functions that append an activity to a command list (e.g. zeCommandListAppendLaunchKernel) and inject additional time measurements before and after this activity.
Supported Runtimes:
Supported OS:
- Linux
- Windows
Supported HW:
- Intel(R) Processor Graphics GEN9+
Needed Headers:
Needed Libraries:
- oneAPI Level Zero libraries
Device activity tracing is part of the Level Zero core functionality. It collects timestamps for events created from a pool with profiling capabilities enabled.
To use it, create an event pool with timestamp support and then an event from that pool:
// Create an event pool
ze_event_pool_desc_t event_pool_desc = {
ZE_STRUCTURE_TYPE_EVENT_POOL_DESC, nullptr,
ZE_EVENT_POOL_FLAG_KERNEL_TIMESTAMP | ZE_EVENT_POOL_FLAG_HOST_VISIBLE, // all events in pool contain profiling information
1}; // number of events in pool
ze_event_pool_handle_t event_pool = nullptr;
zeEventPoolCreate(context, &event_pool_desc, 0, nullptr, &event_pool);
assert(event_pool != nullptr);
// Create an event
ze_event_desc_t event_desc = {
ZE_STRUCTURE_TYPE_EVENT_DESC, nullptr, 0,
ZE_EVENT_SCOPE_FLAG_HOST, ZE_EVENT_SCOPE_FLAG_HOST};
ze_event_handle_t event = nullptr;
zeEventCreate(event_pool, &event_desc, &event);
assert(event != nullptr);
The next step is to intercept target device activity and use the created event to measure its execution time, e.g.:
zeCommandListAppendLaunchKernel(cmd_list, kernel, global_size,
event /* profiling event */, 0, nullptr);
To get device activity timestamps, one may use the following function (it should be called only after the activity has completed):
ze_kernel_timestamp_result_t timestamp{};
zeEventQueryKernelTimestamp(event, &timestamp);
Finally, to compute the actual activity duration in nanoseconds, retrieve the timer resolution for the device and scale the timestamp difference:
ze_device_properties_t props{ZE_STRUCTURE_TYPE_DEVICE_PROPERTIES, };
ze_result_t status = zeDeviceGetProperties(device, &props);
assert(status == ZE_RESULT_SUCCESS);
// With ZE_STRUCTURE_TYPE_DEVICE_PROPERTIES, timerResolution is in nanoseconds per cycle
uint64_t time_ns = (timestamp.context.kernelEnd - timestamp.context.kernelStart) * props.timerResolution;
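The subtraction above implicitly assumes that the counter did not wrap between kernel start and kernel end. The sketch below is one possible way to make the computation robust to a single wrap. KernelDurationNs is a hypothetical helper, not part of Level Zero. It assumes that valid_bits comes from kernelTimestampValidBits and that ns_per_cycle is timerResolution expressed in nanoseconds per cycle.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not a Level Zero function): converts a pair of kernel
// timestamps into a duration in nanoseconds.
//   start, end   - kernelStart / kernelEnd values in GPU cycles
//   valid_bits   - assumed to be props.kernelTimestampValidBits
//   ns_per_cycle - assumed to be props.timerResolution in ns per cycle
uint64_t KernelDurationNs(uint64_t start, uint64_t end,
                          uint32_t valid_bits, uint64_t ns_per_cycle) {
  assert(valid_bits > 0 && valid_bits <= 64);
  // Keep only the bits the hardware counter actually provides
  const uint64_t mask =
      (valid_bits == 64) ? ~0ull : ((1ull << valid_bits) - 1ull);
  // Modular subtraction stays correct if the counter wrapped once
  // between start and end
  const uint64_t cycles = (end - start) & mask;
  return cycles * ns_per_cycle;
}
```

For example, with a 32-bit counter, a start of 0xFFFFFFF0 and an end of 0x10 are treated as 32 elapsed cycles instead of a huge bogus value.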
There are two types of timestamps one may retrieve:
- global: wall-clock start/end time of the event in GPU clocks; should be used to map a kernel to the global application timeline;
- context: context start/end time of the event in GPU clocks; only includes the time while the HW context is actively running on the GPU, so it may be used to calculate a precise kernel duration.
The major difference between context and global timestamps is that global time will include time of activity preemption and context will not.
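As a rough illustration of this difference, the sketch below estimates how many cycles an activity spent preempted by comparing the two durations. The structs are local stand-ins that mirror the global and context members of ze_kernel_timestamp_result_t so the example compiles without the Level Zero headers. PreemptedCycles is a hypothetical helper, and it assumes both timestamp pairs are expressed in the same clock units and did not wrap.

```cpp
#include <cstdint>

// Local stand-ins mirroring the Level Zero timestamp structures
struct TimestampData {
  uint64_t kernelStart;
  uint64_t kernelEnd;
};

struct TimestampResult {
  TimestampData global;   // wall-clock GPU time
  TimestampData context;  // time while the HW context was running
};

// Hypothetical helper: cycles covered by the global interval but not by the
// context interval, i.e. an estimate of the time spent preempted
uint64_t PreemptedCycles(const TimestampResult& ts) {
  const uint64_t global_cycles =
      ts.global.kernelEnd - ts.global.kernelStart;
  const uint64_t context_cycles =
      ts.context.kernelEnd - ts.context.kernelStart;
  // Clamp to zero in case of small measurement discrepancies
  return (global_cycles > context_cycles) ? global_cycles - context_cycles
                                          : 0;
}
```

A result of zero suggests the activity ran without interruption; a large value suggests the wall-clock duration was inflated by other work on the device.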
A common problem when collecting kernel timestamps is mapping them onto the general CPU timeline. Since Level Zero provides kernel timestamps in GPU clocks, one may need to convert them to CPU time. Starting from Level Zero 1.1, a new function, zeDeviceGetGlobalTimestamps, is available. Using this function, one can get correlated host (CPU) and device (GPU) timestamps for any particular device:
uint64_t host_timestamp = 0, device_timestamp = 0;
ze_result_t status = zeDeviceGetGlobalTimestamps(
device, &host_timestamp, &device_timestamp);
assert(status == ZE_RESULT_SUCCESS);
The host timestamp value corresponds to CLOCK_MONOTONIC_RAW on Linux or QueryPerformanceCounter on Windows, while the device timestamp for a GPU is collected in raw GPU cycles.
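To place the returned host timestamp next to the application's own CPU-side measurements, the application can read the same clock directly. The following is a minimal Linux-only sketch. HostMonotonicRawNs is a hypothetical helper name, and the assumption that the host timestamp is expressed in nanoseconds of CLOCK_MONOTONIC_RAW should be verified on the target platform.

```cpp
#include <cassert>
#include <cstdint>
#include <time.h>

// Hypothetical helper (Linux only): reads CLOCK_MONOTONIC_RAW and returns
// the value in nanoseconds
uint64_t HostMonotonicRawNs() {
  timespec ts{};
  const int rc = clock_gettime(CLOCK_MONOTONIC_RAW, &ts);
  assert(rc == 0);
  (void)rc;  // avoid an unused-variable warning when asserts are disabled
  return static_cast<uint64_t>(ts.tv_sec) * 1000000000ull +
         static_cast<uint64_t>(ts.tv_nsec);
}
```

On Windows the equivalent would be built on QueryPerformanceCounter, which is not shown here.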
Note that the number of valid bits for the device timestamp returned by zeDeviceGetGlobalTimestamps is timestampValidBits, while the global kernel timestamp returned by zeEventQueryKernelTimestamp has kernelTimestampValidBits valid bits (both values are fields of ze_device_properties_t). Currently kernelTimestampValidBits is less than timestampValidBits, so to map kernels onto the CPU timeline one may need to truncate the device timestamp to kernelTimestampValidBits:
ze_device_properties_t props{ZE_STRUCTURE_TYPE_DEVICE_PROPERTIES_1_2, };
ze_result_t status = zeDeviceGetProperties(device, &props);
assert(status == ZE_RESULT_SUCCESS);
uint64_t mask = (1ull << props.kernelTimestampValidBits) - 1ull;
uint64_t kernel_timestamp = (device_timestamp & mask);
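Putting the correlated timestamp pair and the truncation together, a kernel start can be projected onto the host timeline roughly as follows. KernelStartOnHostNs is a hypothetical helper, not a Level Zero function. It assumes the host/device pair was captured after the kernel started and less than one counter wrap later, that the host timestamp is in nanoseconds, and that cycles_per_sec is timerResolution as reported for ZE_STRUCTURE_TYPE_DEVICE_PROPERTIES_1_2.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: maps a global kernel start timestamp (GPU cycles)
// to host time (ns), using a correlated host/device pair taken afterwards
uint64_t KernelStartOnHostNs(uint64_t host_sync_ns, uint64_t device_sync,
                             uint64_t kernel_start,
                             uint32_t kernel_valid_bits,
                             uint64_t cycles_per_sec) {
  // Up to 34 bits, cycles * 1e9 below still fits into 64 bits;
  // wider counters would need a split division
  assert(kernel_valid_bits > 0 && kernel_valid_bits <= 34);
  assert(cycles_per_sec > 0);
  const uint64_t mask = (1ull << kernel_valid_bits) - 1ull;
  // Truncate the device timestamp to the kernel timestamp width, then
  // measure how many cycles before the sync point the kernel started
  const uint64_t cycles_ago = ((device_sync & mask) - kernel_start) & mask;
  const uint64_t ns_ago = cycles_ago * 1000000000ull / cycles_per_sec;
  return host_sync_ns - ns_ago;
}
```

Taking a fresh correlated pair close to each measurement keeps the distance between the sync point and the kernel short, which makes the single-wrap assumption easier to satisfy.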
To convert GPU cycles into nanoseconds, one may use the timerResolution field of the ze_device_properties_t structure, which represents cycles per second starting from Level Zero 1.2:
ze_device_properties_t props{ZE_STRUCTURE_TYPE_DEVICE_PROPERTIES_1_2, };
ze_result_t status = zeDeviceGetProperties(device, &props);
assert(status == ZE_RESULT_SUCCESS);
const uint64_t NSEC_IN_SEC = 1000000000;
uint64_t device_timestamp_ns = NSEC_IN_SEC * device_timestamp / props.timerResolution;
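One caveat with the expression above is that NSEC_IN_SEC * device_timestamp can overflow 64 bits once the timestamp exceeds roughly 1.8e10 cycles. A possible workaround, sketched below with the hypothetical helper CyclesToNs, is to scale whole seconds and the sub-second remainder separately. It assumes cycles_per_sec is the Level Zero 1.2 timerResolution value and stays below about 18 GHz.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: converts GPU cycles to nanoseconds without
// overflowing on large cycle counts
uint64_t CyclesToNs(uint64_t cycles, uint64_t cycles_per_sec) {
  assert(cycles_per_sec > 0);
  const uint64_t kNsecInSec = 1000000000ull;
  // Whole seconds first, then the remainder (always < cycles_per_sec),
  // so neither multiplication exceeds 64 bits for realistic frequencies
  return (cycles / cycles_per_sec) * kNsecInSec +
         (cycles % cycles_per_sec) * kNsecInSec / cycles_per_sec;
}
```

For small inputs the result matches the direct formula; for example, 192 cycles at 12500000 cycles per second gives 15360 ns.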
Event pool profiling does not require any additional environment variables. Simply run the application as is:
./<application>
- refer to oneAPI Level Zero documentation to learn more