Releases: triton-inference-server/server
Release 1.15.0 corresponding to NGC container 20.07
NVIDIA Triton Inference Server
The NVIDIA Triton Inference Server provides a cloud inferencing solution optimized for NVIDIA GPUs. The server provides an inference service via an HTTP or GRPC endpoint, allowing remote clients to request inferencing for any model being managed by the server. For edge deployments, Triton Server is also available as a shared library with an API that allows the full functionality of the server to be included directly in an application.
What's New In 1.15.0
- Support for the legacy V1 HTTP/REST and GRPC protocols, and the corresponding client libraries, is released on the GitHub branch r20.07-v1 and as NGC container 20.07-v1-py3.
Known Issues
- TensorRT reformat-free I/O is not supported.
- Some versions of Google Kubernetes Engine (GKE) contain a regression in the handling of LD_LIBRARY_PATH that prevents the inference server container from running correctly (see issue 141255952). Use a GKE 1.13 or earlier version or a GKE 1.14.6 or later version to avoid this issue.
Client Libraries and Examples
Ubuntu 18.04 builds of the client libraries and examples are included in this release in the attached v1.15.0_ubuntu1804.clients.tar.gz file. See the documentation section 'Building the Client Libraries and Examples' for more information on using these files. The client SDK is also available as an NGC container.
Custom Backend SDK
Ubuntu 18.04 builds of the custom backend SDK are included in this release in the attached v1.15.0_ubuntu1804.custombackend.tar.gz file. See the documentation section 'Building a Custom Backend' for more information on using these files.
Release 2.0.0 corresponding to NGC container 20.06
NVIDIA Triton Inference Server
The NVIDIA Triton Inference Server provides a cloud inferencing solution optimized for NVIDIA GPUs. The server provides an inference service via an HTTP or GRPC endpoint, allowing remote clients to request inferencing for any model being managed by the server. For edge deployments, Triton Server is also available as a shared library with an API that allows the full functionality of the server to be included directly in an application.
What's New In 2.0.0
- Updates for KFServing HTTP/REST and GRPC protocols and corresponding Python and C++ client libraries.
- Migration from Triton V1 to Triton V2 requires significant changes; see the “Backwards Compatibility” and “Roadmap” sections of the GitHub README for more information.
Known Issues
- The KFServing HTTP/REST and GRPC protocols and corresponding V2 experimental Python and C++ clients are beta quality and are likely to change. Specifically:
- The data returned by the statistics API will be changing to include additional information.
- The data returned by the repository index API will be changing to include additional information.
- The new C API specified in tritonserver.h is beta quality and is likely to change.
- TensorRT reformat-free I/O is not supported.
- Some versions of Google Kubernetes Engine (GKE) contain a regression in the handling of LD_LIBRARY_PATH that prevents the inference server container from running correctly (see issue 141255952). Use a GKE 1.13 or earlier version or a GKE 1.14.6 or later version to avoid this issue.
Client Libraries and Examples
Ubuntu 18.04 builds of the client libraries and examples are included in this release in the attached v2.0.0_ubuntu1804.clients.tar.gz file. See the documentation section 'Building the Client Libraries and Examples' for more information on using these files. The client SDK is also available as an NGC container.
Custom Backend SDK
Ubuntu 18.04 builds of the custom backend SDK are included in this release in the attached v2.0.0_ubuntu1804.custombackend.tar.gz file. See the documentation section 'Building a Custom Backend' for more information on using these files.
Jetson Jetpack Support
A release of Triton for the Developer Preview of JetPack 4.4 (https://developer.nvidia.com/embedded/jetpack) is provided in the attached file: v2.0.0-jetpack4.4ga.tgz. This experimental release supports the TensorFlow (1.15.2), TensorRT (7.1) and Custom backends as well as ensembles. GPU metrics, GCS storage and S3 storage are not supported.
The tar file contains the Triton server executable and shared libraries and also the C++ and Python client libraries and examples.
Installation and Usage
The following dependencies must be installed before running Triton.
apt-get update && \
apt-get install -y --no-install-recommends \
software-properties-common \
autoconf \
automake \
build-essential \
cmake \
git \
libb64-dev \
libgoogle-glog0v5 \
libre2-dev \
libssl-dev \
libtool \
libboost-dev \
libcurl4-openssl-dev \
rapidjson-dev \
patchelf \
zlib1g-dev
Additionally, to run the clients the following dependencies must be installed.
apt-get install -y --no-install-recommends \
curl \
libopencv-dev=3.2.0+dfsg-4ubuntu0.1 \
libopencv-core-dev=3.2.0+dfsg-4ubuntu0.1 \
pkg-config \
python3 \
python3-pip \
python3-dev
python3 -m pip install --upgrade wheel setuptools
python3 -m pip install --upgrade grpcio-tools numpy pillow
The Python wheel for the Python client library is included in the tar file and can be installed by running the following command:
python3 -m pip install --upgrade clients/python/triton*.whl
Release 1.14.0 corresponding to NGC container 20.06
NVIDIA Triton Inference Server
The NVIDIA Triton Inference Server provides a cloud inferencing solution optimized for NVIDIA GPUs. The server provides an inference service via an HTTP or GRPC endpoint, allowing remote clients to request inferencing for any model being managed by the server. For edge deployments, Triton Server is also available as a shared library with an API that allows the full functionality of the server to be included directly in an application.
What's New In 1.14.0
- Support for the legacy V1 HTTP/REST and GRPC protocols, and the corresponding client libraries, is released on the GitHub branch r20.06-v1 and as NGC container 20.06-v1-py3.
Known Issues
- TensorRT reformat-free I/O is not supported.
- Some versions of Google Kubernetes Engine (GKE) contain a regression in the handling of LD_LIBRARY_PATH that prevents the inference server container from running correctly (see issue 141255952). Use a GKE 1.13 or earlier version or a GKE 1.14.6 or later version to avoid this issue.
Client Libraries and Examples
Ubuntu 18.04 builds of the client libraries and examples are included in this release in the attached v1.14.0_ubuntu1804.clients.tar.gz file. See the documentation section 'Building the Client Libraries and Examples' for more information on using these files. The client SDK is also available as an NGC container.
Custom Backend SDK
Ubuntu 18.04 builds of the custom backend SDK are included in this release in the attached v1.14.0_ubuntu1804.custombackend.tar.gz file. See the documentation section 'Building a Custom Backend' for more information on using these files.
Release 1.13.0 corresponding to NGC container 20.03.1
NVIDIA Triton Inference Server
The NVIDIA Triton Inference Server provides a cloud inferencing solution optimized for NVIDIA GPUs. The server provides an inference service via an HTTP or GRPC endpoint, allowing remote clients to request inferencing for any model being managed by the server. For edge deployments, Triton Server is also available as a shared library with an API that allows the full functionality of the server to be included directly in an application.
What's New In 1.13.0
- Updates for KFServing HTTP/REST and GRPC protocols and corresponding Python and C++ client libraries. See the Roadmap section of the README for more information.
- Update GRPC version to 1.24.0.
- Several issues with S3 storage were resolved.
- Fix the last_inference_timestamp value to correctly show the time when inference last occurred for each model.
- The Caffe2 backend is deprecated. Support for Caffe2 models will be removed in a future release.
Known Issues
- The KFServing HTTP/REST and GRPC protocols and corresponding V2 experimental Python and C++ clients are beta quality and are likely to change. Specifically:
- The data returned by the statistics API will be changing to include additional information.
- The data returned by the repository index API will be changing to include additional information.
- The new C API specified in tritonserver.h is beta quality and is likely to change.
- When using the experimental V2 HTTP/REST C++ client, classification results are not supported for output tensors. This issue will be fixed in the next release.
- When using the experimental V2 perf_client_v2, for high concurrency values perf_client_v2 may not be able to achieve throughput as high as V1 perf_client. This will be fixed in the next release.
- TensorRT reformat-free I/O is not supported.
- Some versions of Google Kubernetes Engine (GKE) contain a regression in the handling of LD_LIBRARY_PATH that prevents the inference server container from running correctly (see issue 141255952). Use a GKE 1.13 or earlier version or a GKE 1.14.6 or later version to avoid this issue.
Client Libraries and Examples
Ubuntu 18.04 builds of the client libraries and examples are included in this release in the attached v1.13.0_ubuntu1804.clients.tar.gz file. See the documentation section 'Building the Client Libraries and Examples' for more information on using these files. The client SDK is also available as an NGC container.
Custom Backend SDK
Ubuntu 18.04 builds of the custom backend SDK are included in this release in the attached v1.13.0_ubuntu1804.custombackend.tar.gz file. See the documentation section 'Building a Custom Backend' for more information on using these files.
Jetson Jetpack Support
An experimental release of Triton for the Developer Preview of JetPack 4.4 is available as part of the 20.03 release. See 20.03 release for more information.
Release 1.12.0 corresponding to NGC container 20.03
NVIDIA Triton Inference Server
The NVIDIA Triton Inference Server provides a cloud inferencing solution optimized for NVIDIA GPUs. The server provides an inference service via an HTTP or GRPC endpoint, allowing remote clients to request inferencing for any model being managed by the server. For edge deployments, Triton Server is also available as a shared library with an API that allows the full functionality of the server to be included directly in an application.
What's New In 1.12.0
- Add queuing policies for dynamic batching scheduler. These policies are specified in the model configuration and allow each model to set maximum queue size, time outs, and priority levels for inference requests.
- Support for large ONNX models where weights are stored in separate files.
- Allow ONNX Runtime optimization level to be configured via the model configuration optimization setting.
- Experimental Python client and server support for community standard GRPC inferencing API.
- Add --min-supported-compute-capability flag to allow Triton Server to use older, unsupported GPUs.
- Fix perf_client shared memory support. In some cases the shared-memory option did not work correctly because of how input and output tensor names were handled. This issue is now resolved.
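The queuing policies and the ONNX Runtime optimization level are both declared in the model's config.pbtxt. Below is a minimal, hypothetical sketch: the field names follow model_config.proto, but the specific values (batch sizes, queue size, timeout) are illustrative assumptions, not recommendations.

```protobuf
# Hypothetical config.pbtxt fragment; all values are illustrative.
dynamic_batching {
  preferred_batch_size: [ 4, 8 ]
  priority_levels: 2                       # two priority levels for requests
  default_priority_level: 1
  default_queue_policy {
    max_queue_size: 16                     # bound the per-model queue
    timeout_action: REJECT                 # reject requests that time out
    default_timeout_microseconds: 100000
  }
}
# ONNX Runtime graph optimization level (for an ONNX Runtime backed model).
optimization { graph { level: 1 } }
```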
Known Issues
- TensorRT reformat-free I/O is not supported.
- Some versions of Google Kubernetes Engine (GKE) contain a regression in the handling of LD_LIBRARY_PATH that prevents the inference server container from running correctly (see issue 141255952). Use a GKE 1.13 or earlier version or a GKE 1.14.6 or later version to avoid this issue.
Client Libraries and Examples
Ubuntu 18.04 builds of the client libraries and examples are included in this release in the attached v1.12.0_ubuntu1804.clients.tar.gz file. See the documentation section 'Building the Client Libraries and Examples' for more information on using these files. The client SDK is also available as an NGC container.
Custom Backend SDK
Ubuntu 18.04 builds of the custom backend SDK are included in this release in the attached v1.12.0_ubuntu1804.custombackend.tar.gz file. See the documentation section 'Building a Custom Backend' for more information on using these files.
Jetson Jetpack Support
An experimental release of Triton for the Developer Preview of JetPack 4.4 (https://developer.nvidia.com/embedded/jetpack) is provided in the attached file: v1.12.0-jetpack4.4dp.tgz. This experimental release supports the TensorFlow (1.15.2), TensorRT (7.1) and Custom backends as well as ensembles. GPU metrics, GCS storage and S3 storage are not supported.
The tar file contains the Triton executable and shared libraries and also the C++ and Python client libraries and examples.
Installation and Usage
The following dependencies must be installed before running Triton.
apt-get update && \
apt-get install -y --no-install-recommends \
software-properties-common \
autoconf \
automake \
build-essential \
cmake \
git \
libgoogle-glog0v5 \
libre2-dev \
libssl-dev \
libtool \
libboost-dev \
libcurl4-openssl-dev \
zlib1g-dev
Additionally, to run the clients the following dependencies must be installed.
apt-get install -y --no-install-recommends \
curl \
libopencv-dev=3.2.0+dfsg-4ubuntu0.1 \
libopencv-core-dev=3.2.0+dfsg-4ubuntu0.1 \
pkg-config \
python3 \
python3-pip \
python3-dev
python3 -m pip install --upgrade wheel setuptools
python3 -m pip install --upgrade grpcio-tools numpy pillow
The Python wheel for the Python client library is included in the tar file and can be installed by running the following command:
python3 -m pip install --upgrade clients/python/tensorrtserver-*.whl
Release 1.11.0 corresponding to NGC container 20.02
NVIDIA TensorRT Inference Server
The NVIDIA TensorRT Inference Server provides a cloud inferencing solution optimized for NVIDIA GPUs. The server provides an inference service via an HTTP or GRPC endpoint, allowing remote clients to request inferencing for any model being managed by the server.
What's New In 1.11.0
- The TensorRT backend is improved to have significantly better performance. Improvements include reducing thread contention, using pinned memory for faster CPU<->GPU transfers, and increasing compute and memory copy overlap on GPUs.
- Reduce memory usage of TensorRT models in many cases by sharing weights across multiple model instances.
- Boolean data-type and shape tensors are now supported for TensorRT models.
- A new model configuration option allows the dynamic batcher to create “ragged” batches for custom backend models. A ragged batch is a batch where one or more of the input/output tensors have different shapes in different batch entries.
- Local S3 storage endpoints are now supported for model repositories. A local S3 endpoint is specified as 's3://host:port/path/to/repository'.
- The Helm chart showing an example Kubernetes deployment is updated to include Prometheus and Grafana support so that inference server metrics can be collected and visualized.
- The inference server container no longer sets LD_LIBRARY_PATH; instead the server uses RUNPATH to locate its shared libraries.
- Python 2 is end-of-life so all support for it has been removed. Python 3 is still supported.
- Ubuntu 18.04 with January 2020 updates
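The ragged-batch option described above is enabled per input in the model configuration. A minimal, hypothetical sketch for a custom backend model follows; the input name and dims are illustrative assumptions, and the field layout follows model_config.proto.

```protobuf
# Hypothetical config.pbtxt fragment; names and dims are illustrative.
platform: "custom"
input [
  {
    name: "INPUT0"
    data_type: TYPE_FP32
    dims: [ -1 ]
    allow_ragged_batch: true   # batch entries may have different shapes
  }
]
dynamic_batching { }
```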
Known Issues
- TensorRT reformat-free I/O is not supported.
- Some versions of Google Kubernetes Engine (GKE) contain a regression in the handling of LD_LIBRARY_PATH that prevents the inference server container from running correctly (see issue 141255952). Use a GKE 1.13 or earlier version or a GKE 1.14.6 or later version to avoid this issue.
Client Libraries and Examples
Ubuntu 16.04 and Ubuntu 18.04 builds of the client libraries and examples are included in this release in the attached v1.11.0_ubuntu1604.clients.tar.gz and v1.11.0_ubuntu1804.clients.tar.gz files. See the documentation section 'Building the Client Libraries and Examples' for more information on using these files. The client SDK is also available as an NGC container.
Custom Backend SDK
Ubuntu 16.04 and Ubuntu 18.04 builds of the custom backend SDK are included in this release in the attached v1.11.0_ubuntu1604.custombackend.tar.gz and v1.11.0_ubuntu1804.custombackend.tar.gz files. See the documentation section 'Building a Custom Backend' for more information on using these files.
Release 1.10.0 corresponding to NGC container 20.01
NVIDIA TensorRT Inference Server
The NVIDIA TensorRT Inference Server provides a cloud inferencing solution optimized for NVIDIA GPUs. The server provides an inference service via an HTTP or GRPC endpoint, allowing remote clients to request inferencing for any model being managed by the server.
What's New In 1.10.0
- Server status can be requested in JSON format using the HTTP/REST API. Use endpoint /api/status?format=json.
- The dynamic batcher now has an option to preserve the ordering of batched requests when there are multiple model instances. See model_config.proto for more information.
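The ordering option is a single flag in the dynamic_batching section of the model configuration. A hypothetical sketch (the preferred batch sizes are illustrative assumptions):

```protobuf
# Hypothetical config.pbtxt fragment.
dynamic_batching {
  preferred_batch_size: [ 4, 8 ]
  preserve_ordering: true   # responses returned in request arrival order
}
```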
Known Issues
- TensorRT reformat-free I/O is not supported.
- Some versions of Google Kubernetes Engine (GKE) contain a regression in the handling of LD_LIBRARY_PATH that prevents the inference server container from running correctly (see issue 141255952). Use a GKE 1.13 or earlier version or a GKE 1.14.6 or later version to avoid this issue.
Client Libraries and Examples
Ubuntu 16.04 and Ubuntu 18.04 builds of the client libraries and examples are included in this release in the attached v1.10.0_ubuntu1604.clients.tar.gz and v1.10.0_ubuntu1804.clients.tar.gz files. See the documentation section 'Building the Client Libraries and Examples' for more information on using these files. The client SDK is also available as an NGC container.
Custom Backend SDK
Ubuntu 16.04 and Ubuntu 18.04 builds of the custom backend SDK are included in this release in the attached v1.10.0_ubuntu1604.custombackend.tar.gz and v1.10.0_ubuntu1804.custombackend.tar.gz files. See the documentation section 'Building a Custom Backend' for more information on using these files.
Release 1.9.0, corresponding to NGC container 19.12
NVIDIA TensorRT Inference Server
The NVIDIA TensorRT Inference Server provides a cloud inferencing solution optimized for NVIDIA GPUs. The server provides an inference service via an HTTP or GRPC endpoint, allowing remote clients to request inferencing for any model being managed by the server.
What's New In 1.9.0
- The model configuration now includes a model warmup option. This option provides the ability to tune and optimize the model before inference requests are received, avoiding initial inference delays. This option is especially useful for frameworks like TensorFlow that perform network optimization in response to the initial inference requests. Models can be warmed-up with one or more synthetic or realistic workloads before they become ready in the server.
- An enhanced sequence batcher now has multiple scheduling strategies. A new Oldest strategy integrates with the dynamic batcher to enable improved inference performance for models that don’t require all inference requests in a sequence to be routed to the same batch slot.
- The perf_client now has an option to generate requests using a realistic Poisson distribution or a user-provided distribution.
- A new repository API (available in the shared library API, HTTP, and GRPC) returns an index of all models available in the model repositories visible to the server. This index can be used to see which models are available for loading onto the server.
- The server status returned by the server status API now includes the timestamp of the last inference request received for each model.
- Inference server tracing capabilities are now documented in the Optimization section of the User Guide. Tracing support is enhanced to provide trace for ensembles and the contained models.
- A community contributed Dockerfile is now available to build the TensorRT Inference Server clients on CentOS.
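The warmup option above is declared in the model configuration. A hypothetical sketch follows, with an assumed input name and shape, using zero-filled synthetic data; the field layout follows model_config.proto.

```protobuf
# Hypothetical config.pbtxt fragment; input name and dims are assumptions.
model_warmup [
  {
    name: "zero_data_warmup"   # label reported while the model warms up
    batch_size: 1
    inputs {
      key: "INPUT0"            # assumed input tensor name
      value: {
        data_type: TYPE_FP32
        dims: [ 16 ]
        zero_data: true        # zero-filled synthetic payload
      }
    }
  }
]
```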
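The Poisson load that perf_client can generate corresponds to exponentially distributed inter-arrival gaps between requests. The following is an illustrative Python model of that schedule, not perf_client's actual implementation:

```python
import random

def poisson_schedule(rate_per_sec, n_requests, seed=0):
    """Return n_requests send times (in seconds) whose inter-arrival
    gaps are drawn from an exponential distribution, so the request
    arrivals form a Poisson process at the given average rate."""
    rng = random.Random(seed)
    t, times = 0.0, []
    for _ in range(n_requests):
        t += rng.expovariate(rate_per_sec)  # mean gap = 1 / rate_per_sec
        times.append(t)
    return times

# 1000 requests at an average of 100 requests/second.
times = poisson_schedule(100.0, 1000)
```

At 100 requests/second the mean gap is 10 ms, but individual gaps vary widely, which exercises a server's queuing and batching behavior more realistically than a fixed-interval load.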
Known Issues
- The beta of the custom backend API version 2 has non-backwards compatible changes to enable complete support for input and output tensors in both CPU and GPU memory:
  - The signature of the CustomGetNextInputV2Fn_t function adds the memory_type_id argument.
  - The signature of the CustomGetOutputV2Fn_t function adds the memory_type_id argument.
- The beta of the inference server library API has non-backwards compatible changes to enable complete support for input and output tensors in both CPU and GPU memory:
  - The signature and operation of the TRTSERVER_ResponseAllocatorAllocFn_t function has changed. See src/core/trtserver.h for a description of the new behavior.
  - The signature of the TRTSERVER_InferenceRequestProviderSetInputData function adds the memory_type_id argument.
  - The signature of the TRTSERVER_InferenceResponseOutputData function adds the memory_type_id argument.
- TensorRT reformat-free I/O is not supported.
- Some versions of Google Kubernetes Engine (GKE) contain a regression in the handling of LD_LIBRARY_PATH that prevents the inference server container from running correctly (see issue 141255952). Use a GKE 1.13 or earlier version or a GKE 1.14.6 or later version to avoid this issue.
Client Libraries and Examples
Ubuntu 16.04 and Ubuntu 18.04 builds of the client libraries and examples are included in this release in the attached v1.9.0_ubuntu1604.clients.tar.gz and v1.9.0_ubuntu1804.clients.tar.gz files. See the documentation section 'Building the Client Libraries and Examples' for more information on using these files. The client SDK is also available as an NGC container.
Custom Backend SDK
Ubuntu 16.04 and Ubuntu 18.04 builds of the custom backend SDK are included in this release in the attached v1.9.0_ubuntu1604.custombackend.tar.gz and v1.9.0_ubuntu1804.custombackend.tar.gz files. See the documentation section 'Building a Custom Backend' for more information on using these files.
Release 1.8.0, corresponding to NGC container 19.11
NVIDIA TensorRT Inference Server
The NVIDIA TensorRT Inference Server provides a cloud inferencing solution optimized for NVIDIA GPUs. The server provides an inference service via an HTTP or GRPC endpoint, allowing remote clients to request inferencing for any model being managed by the server.
What's New In 1.8.0
- Shared-memory support is expanded to include CUDA shared memory.
- Improve efficiency of pinned-memory used for ensemble models.
- The perf_client application has been improved with easier-to-use command-line arguments (while maintaining compatibility with existing arguments).
- Support for string tensors added to perf_client.
- Documentation contains a new “Optimization” section discussing some common optimization strategies and how to use perf_client to explore these strategies.
Deprecated Features
- The asynchronous inference API has been modified in the C++ and Python client libraries.
  - In the C++ library:
    - The non-callback version of the AsyncRun function was removed.
    - The GetReadyAsyncRequest function was removed.
    - The signature of the GetAsyncRunResults function was changed to remove the is_ready and wait arguments.
  - In the Python library:
    - The non-callback version of the async_run function was removed.
    - The get_ready_async_request function was removed.
    - The signature of the get_async_run_results function was changed to remove the wait argument.
Known Issues
- The beta of the custom backend API version 2 has non-backwards compatible changes to enable complete support for input and output tensors in both CPU and GPU memory:
  - The signature of the CustomGetNextInputV2Fn_t function adds the memory_type_id argument.
  - The signature of the CustomGetOutputV2Fn_t function adds the memory_type_id argument.
- The beta of the inference server library API has non-backwards compatible changes to enable complete support for input and output tensors in both CPU and GPU memory:
  - The signature and operation of the TRTSERVER_ResponseAllocatorAllocFn_t function has changed. See src/core/trtserver.h for a description of the new behavior.
  - The signature of the TRTSERVER_InferenceRequestProviderSetInputData function adds the memory_type_id argument.
  - The signature of the TRTSERVER_InferenceResponseOutputData function adds the memory_type_id argument.
- TensorRT reformat-free I/O is not supported.
- Some versions of Google Kubernetes Engine (GKE) contain a regression in the handling of LD_LIBRARY_PATH that prevents the inference server container from running correctly (see issue 141255952). Use a GKE 1.13 or earlier version or a GKE 1.14.6 or later version to avoid this issue.
Client Libraries and Examples
Ubuntu 16.04 and Ubuntu 18.04 builds of the client libraries and examples are included in this release in the attached v1.8.0_ubuntu1604.clients.tar.gz and v1.8.0_ubuntu1804.clients.tar.gz files. See the documentation section 'Building the Client Libraries and Examples' for more information on using these files. The client SDK is also available as an NGC container.
Custom Backend SDK
Ubuntu 16.04 and Ubuntu 18.04 builds of the custom backend SDK are included in this release in the attached v1.8.0_ubuntu1604.custombackend.tar.gz and v1.8.0_ubuntu1804.custombackend.tar.gz files. See the documentation section 'Building a Custom Backend' for more information on using these files.
Release 1.7.0, corresponding to NGC container 19.10
NVIDIA TensorRT Inference Server
The NVIDIA TensorRT Inference Server provides a cloud inferencing solution optimized for NVIDIA GPUs. The server provides an inference service via an HTTP or GRPC endpoint, allowing remote clients to request inferencing for any model being managed by the server.
What's New In 1.7.0
- A Client SDK container is now provided on NGC in addition to the inference server container. The client SDK container includes the client libraries and examples.
- TensorRT optimization may now be enabled for any TensorFlow model by enabling the feature in the optimization section of the model configuration.
- The ONNX Runtime backend now includes the TensorRT and OpenVINO execution providers. These providers are enabled in the optimization section of the model configuration.
- Automatic configuration generation (--strict-model-config=false) now works correctly for TensorRT models with variable-sized inputs and/or outputs.
- Multiple model repositories may now be specified on the command line. Optional command-line options can be used to explicitly load specific models from each repository.
- Ensemble models are now pruned dynamically so that only models needed to calculate the requested outputs are executed.
- The example clients now include a simple Go example that uses the GRPC API.
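The TensorRT-for-TensorFlow feature is turned on through the execution accelerators in the model's optimization settings. A hypothetical sketch follows; the field layout matches current model_config.proto and may differ slightly in this older release.

```protobuf
# Hypothetical config.pbtxt fragment for a TensorFlow model.
optimization {
  execution_accelerators {
    gpu_execution_accelerator: [ { name: "tensorrt" } ]
  }
}
```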
Known Issues
- In TensorRT 6.0.1, reformat-free I/O is not supported.
- Some versions of Google Kubernetes Engine (GKE) contain a regression in the handling of LD_LIBRARY_PATH that prevents the inference server container from running correctly (see issue 141255952). Use a GKE 1.13 or earlier version or a GKE 1.14.6 or later version to avoid this issue.
Client Libraries and Examples
Ubuntu 16.04 and Ubuntu 18.04 builds of the client libraries and examples are included in this release in the attached v1.7.0_ubuntu1604.clients.tar.gz and v1.7.0_ubuntu1804.clients.tar.gz files. See the documentation section 'Building the Client Libraries and Examples' for more information on using these files. The client SDK is also available as an NGC container.
Custom Backend SDK
Ubuntu 16.04 and Ubuntu 18.04 builds of the custom backend SDK are included in this release in the attached v1.7.0_ubuntu1604.custombackend.tar.gz and v1.7.0_ubuntu1804.custombackend.tar.gz files. See the documentation section 'Building a Custom Backend' for more information on using these files.