Update README and versions for 20.08 release
David Goodwin authored and dzier committed Aug 27, 2020
1 parent 6d65a85 commit 4479257
Showing 3 changed files with 248 additions and 10 deletions.
10 changes: 5 additions & 5 deletions Dockerfile
@@ -146,8 +146,8 @@ FROM ${TENSORFLOW2_IMAGE} AS tritonserver_tf2
############################################################################
FROM ${BASE_IMAGE} AS tritonserver_build

-ARG TRITON_VERSION=2.2.0dev
-ARG TRITON_CONTAINER_VERSION=20.08dev
+ARG TRITON_VERSION=2.2.0
+ARG TRITON_CONTAINER_VERSION=20.08

# libgoogle-glog0v5 is needed by caffe2 libraries.
# libcurl4-openSSL-dev is needed for GCS
@@ -366,8 +366,8 @@ ENTRYPOINT ["/opt/tritonserver/nvidia_entrypoint.sh"]
############################################################################
FROM ${BASE_IMAGE}

-ARG TRITON_VERSION=2.2.0dev
-ARG TRITON_CONTAINER_VERSION=20.08dev
+ARG TRITON_VERSION=2.2.0
+ARG TRITON_CONTAINER_VERSION=20.08

ENV TRITON_SERVER_VERSION ${TRITON_VERSION}
ENV NVIDIA_TRITON_SERVER_VERSION ${TRITON_CONTAINER_VERSION}
@@ -377,7 +377,7 @@ LABEL com.nvidia.tritonserver.version="${TRITON_SERVER_VERSION}"

ENV PATH /opt/tritonserver/bin:${PATH}

# Need to include pytorch in LD_LIBRARY_PATH since Torchvision loads custom
# ops from that path
ENV LD_LIBRARY_PATH /opt/tritonserver/lib/pytorch/:$LD_LIBRARY_PATH

246 changes: 242 additions & 4 deletions README.rst
@@ -30,13 +30,251 @@
NVIDIA Triton Inference Server
==============================

**NOTE: You are currently on the r20.08 branch which tracks
stabilization towards the next release. This branch is not usable
during stabilization.**

.. overview-begin-marker-do-not-remove

NVIDIA Triton Inference Server provides a cloud inferencing solution
optimized for NVIDIA GPUs. The server provides an inference service
via an HTTP/REST or GRPC endpoint, allowing remote clients to request
inferencing for any model being managed by the server. For edge
deployments, Triton Server is also available as a shared library with
an API that allows the full functionality of the server to be included
directly in an application.

What's New in 2.2.0
-------------------

* TensorFlow 2.x is now supported in addition to TensorFlow 1.x. See
the Frameworks Support Matrix for the supported TensorFlow
versions. The version of TensorFlow used can be selected when
launching Triton with the
--backend-config=tensorflow,version=<version> flag. Set <version> to
1 or 2 to select TensorFlow 1 or TensorFlow 2, respectively. By
default, TensorFlow 1 is used.

* Inference request timeout option added to the Python and C++
client libraries (see the usage sketch after this list).

* GRPC inference protocol updated to fix a performance regression.

* Explicit major/minor versioning added to TRITONSERVER and
TRITONBACKEND APIs.

* New CMake option TRITON_CLIENT_SKIP_EXAMPLES to disable building the
client examples.
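
The following minimal sketch shows how the new request timeout might
be used from the Python client. The module layout, the client_timeout
argument, the tensor names, and the model name *simple* are
assumptions based on the current tritonclient package and may differ
for the 2.2.0 client wheel.

.. code-block:: python

    # Hedged sketch: gRPC inference with a client-side timeout.
    # Module names, the client_timeout argument, and the model
    # "simple" are assumptions and may differ for the 2.2.0 wheel.
    import numpy as np
    import tritonclient.grpc as grpcclient
    from tritonclient.utils import InferenceServerException

    client = grpcclient.InferenceServerClient(url="localhost:8001")

    # Build a single FP32 input tensor for the hypothetical model.
    data = np.ones((1, 16), dtype=np.float32)
    infer_input = grpcclient.InferInput("INPUT0", list(data.shape), "FP32")
    infer_input.set_data_from_numpy(data)

    try:
        # The request fails if no response arrives within 5 seconds.
        result = client.infer(
            model_name="simple",
            inputs=[infer_input],
            outputs=[grpcclient.InferRequestedOutput("OUTPUT0")],
            client_timeout=5.0,
        )
        print(result.as_numpy("OUTPUT0"))
    except InferenceServerException as err:
        print("inference failed or timed out:", err)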

Features
--------

* `Multiple framework support
<https://docs.nvidia.com/deeplearning/triton-inference-server/master-user-guide/docs/model_repository.html#framework-model-definition>`_. The
server can manage any number and mix of models (limited by system
disk and memory resources). Supports TensorRT, TensorFlow GraphDef,
TensorFlow SavedModel, ONNX, PyTorch, and Caffe2 NetDef model
formats. Both TensorFlow 1.x and TensorFlow 2.x are supported. Also
supports TensorFlow-TensorRT and ONNX-TensorRT integrated
models. Variable-size input and output tensors are allowed if
supported by the framework. See `Capabilities
<https://docs.nvidia.com/deeplearning/triton-inference-server/master-user-guide/docs/capabilities.html#capabilities>`_
for detailed support information for each framework.

* `Concurrent model execution support
<https://docs.nvidia.com/deeplearning/triton-inference-server/master-user-guide/docs/model_configuration.html#instance-groups>`_. Multiple
models (or multiple instances of the same model) can run
simultaneously on the same GPU.

* Batching support. For models that support batching, Triton Server
can accept requests for a batch of inputs and respond with the
corresponding batch of outputs. Triton Server also supports multiple
`scheduling and batching
<https://docs.nvidia.com/deeplearning/triton-inference-server/master-user-guide/docs/model_configuration.html#scheduling-and-batching>`_
algorithms that combine individual inference requests together to
improve inference throughput. These scheduling and batching
decisions are transparent to the client requesting inference.

* `Custom backend support
<https://github.com/NVIDIA/triton-inference-server/blob/master/docs/backend.rst>`_. Triton
Server allows individual models to be implemented with custom
backends instead of by a deep-learning framework. With a custom
backend a model can implement any logic desired, while still
benefiting from the GPU support, concurrent execution, dynamic
batching and other features provided by the server.

* `Ensemble support
<https://docs.nvidia.com/deeplearning/triton-inference-server/master-user-guide/docs/models_and_schedulers.html#ensemble-models>`_. An
ensemble represents a pipeline of one or more models and the
connection of input and output tensors between those models. A
single inference request to an ensemble will trigger the execution
of the entire pipeline.

* Multi-GPU support. Triton Server can distribute inferencing across
all system GPUs.

* Triton Server provides `multiple modes for model management
<https://docs.nvidia.com/deeplearning/triton-inference-server/master-user-guide/docs/model_management.html>`_. These
model management modes allow for both implicit and explicit loading
and unloading of models without requiring a server restart.

* `Model repositories
<https://docs.nvidia.com/deeplearning/triton-inference-server/master-user-guide/docs/model_repository.html#>`_
may reside on a locally accessible file system (e.g. NFS), in Google
Cloud Storage or in Amazon S3.

* HTTP/REST and GRPC `inference protocols
<https://docs.nvidia.com/deeplearning/triton-inference-server/master-user-guide/docs/http_grpc_api.html>`_
based on the community developed `KFServing protocol
<https://github.com/kubeflow/kfserving/tree/master/docs/predict-api/v2>`_; an example request sketch follows this list.

* Readiness and liveness `health endpoints
<https://docs.nvidia.com/deeplearning/triton-inference-server/master-user-guide/docs/http_grpc_api.html>`_
suitable for any orchestration or deployment framework, such as
Kubernetes.

* `Metrics
<https://docs.nvidia.com/deeplearning/triton-inference-server/master-user-guide/docs/metrics.html>`_
indicating GPU utilization, server throughput, and server latency.

* `C library interface
<https://docs.nvidia.com/deeplearning/triton-inference-server/master-user-guide/docs/library_api.html>`_
allows the full functionality of Triton Server to be included
directly in an application.
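
As a concrete illustration of the HTTP/REST protocol, health
endpoints, and metrics listed above, the sketch below queries a
running server with Python's requests library. Triton's default ports
(8000 for HTTP/REST, 8002 for metrics), the tensor names, and the
model name *simple* are assumptions; the endpoint paths follow the
KFServing v2 predict protocol referenced above.

.. code-block:: python

    # Hedged sketch of the KFServing-v2-style HTTP/REST endpoints.
    # Default ports and the model name "simple" are assumptions.
    import requests

    BASE = "http://localhost:8000"

    # Liveness and readiness endpoints return HTTP 200 when healthy.
    print("live:", requests.get(BASE + "/v2/health/live").status_code)
    print("ready:", requests.get(BASE + "/v2/health/ready").status_code)

    # Inference request: JSON body describing named, typed input tensors.
    payload = {
        "inputs": [
            {"name": "INPUT0", "shape": [1, 16], "datatype": "FP32",
             "data": [0.0] * 16}
        ]
    }
    resp = requests.post(BASE + "/v2/models/simple/infer", json=payload)
    print(resp.json())

    # Prometheus-format metrics (GPU utilization, throughput, latency).
    print(requests.get("http://localhost:8002/metrics").text[:200])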

.. overview-end-marker-do-not-remove

The current release of the Triton Inference Server is 2.2.0 and
corresponds to the 20.08 release of the tritonserver container on
`NVIDIA GPU Cloud (NGC) <https://ngc.nvidia.com>`_. The branch for
this release is `r20.08
<https://github.com/NVIDIA/triton-inference-server/tree/r20.08>`_.

Backwards Compatibility
-----------------------

Version 2 of Triton is beta quality, so you should expect some changes
to the server and client protocols and APIs. Version 2 of Triton does
not generally maintain backwards compatibility with version 1.
Specifically, you should take the following items into account when
transitioning from version 1 to version 2:

* The Triton executables and libraries are in /opt/tritonserver. The
Triton executable is /opt/tritonserver/bin/tritonserver.

* Some *tritonserver* command-line arguments are removed, changed or
have different default behavior in version 2.

* --api-version, --http-health-port, --grpc-infer-thread-count,
--grpc-stream-infer-thread-count, --allow-poll-model-repository,
--allow-model-control and --tf-add-vgpu are removed.

* The default for --model-control-mode is changed to *none*.

* --tf-allow-soft-placement and --tf-gpu-memory-fraction are renamed
to --backend-config="tensorflow,allow-soft-placement=<true,false>"
and --backend-config="tensorflow,gpu-memory-fraction=<float>".

* The HTTP/REST and GRPC protocols, while conceptually similar to
version 1, are completely changed in version 2. See the `inference
protocols
<https://docs.nvidia.com/deeplearning/triton-inference-server/master-user-guide/docs/http_grpc_api.html>`_
section of the documentation for more information.

* Python and C++ client libraries are re-implemented to match the new
HTTP/REST and GRPC protocols. The Python client no longer depends on
a C++ shared library and so should be usable on any platform that
supports Python. See the `client libraries
<https://docs.nvidia.com/deeplearning/triton-inference-server/master-user-guide/docs/client_library.html>`_
section of the documentation for more information (a short usage
sketch follows this list).

* The version 2 CMake build requires these changes:

* The CMake flag names have changed from a TRTIS prefix to a TRITON
prefix. For example, TRITON_ENABLE_TENSORRT.

* The build targets are *server*, *client* and *custom-backend* to
build the server, client libraries and examples, and custom
backend SDK, respectively.

* In the Docker containers the environment variables indicating the
Triton version have changed to have a TRITON prefix, for example,
TRITON_SERVER_VERSION.
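
As a small illustration of the re-implemented v2 Python client, the
sketch below checks server status and fetches model metadata over
HTTP. The tritonclient module layout reflects the current package and
is an assumption for the 2.2.0 wheel; the model name *simple* is
hypothetical.

.. code-block:: python

    # Hedged sketch of the v2 Python HTTP client; module layout and the
    # model name "simple" are assumptions.
    import tritonclient.http as httpclient

    client = httpclient.InferenceServerClient(url="localhost:8000")

    # Liveness/readiness map to the v2 health endpoints.
    print("server live: ", client.is_server_live())
    print("server ready:", client.is_server_ready())

    # Model metadata describes the model's input and output tensors.
    print(client.get_model_metadata("simple"))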

Documentation
-------------

The User Guide, Developer Guide, and API Reference `documentation for
the current release
<https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html>`_
provide guidance on installing, building, and running Triton Inference
Server.

You can also view the `documentation for the master branch
<https://docs.nvidia.com/deeplearning/triton-inference-server/master-user-guide/docs/index.html>`_
and for `earlier releases
<https://docs.nvidia.com/deeplearning/triton-inference-server/archives/index.html>`_.

NVIDIA publishes a number of `deep learning examples that use Triton
<https://github.com/NVIDIA/DeepLearningExamples>`_.

An `FAQ
<https://docs.nvidia.com/deeplearning/triton-inference-server/master-user-guide/docs/faq.html>`_
provides answers for frequently asked questions.

READMEs for deployment examples can be found in subdirectories of
deploy/, for example, `deploy/single_server/README.rst
<https://github.com/NVIDIA/triton-inference-server/tree/master/deploy/single_server/README.rst>`_.

The `Release Notes
<https://docs.nvidia.com/deeplearning/triton-inference-server/release-notes/index.html>`_
and `Support Matrix
<https://docs.nvidia.com/deeplearning/dgx/support-matrix/index.html>`_
indicate the required versions of the NVIDIA Driver and CUDA, and also
describe which GPUs are supported by Triton Server.

Presentations and Papers
^^^^^^^^^^^^^^^^^^^^^^^^

* `High-Performance Inferencing at Scale Using the TensorRT Inference Server <https://developer.nvidia.com/gtc/2020/video/s22418>`_.

* `Accelerate and Autoscale Deep Learning Inference on GPUs with KFServing <https://developer.nvidia.com/gtc/2020/video/s22459>`_.

* `Deep into Triton Inference Server: BERT Practical Deployment on NVIDIA GPU <https://developer.nvidia.com/gtc/2020/video/s21736>`_.

* `Maximizing Utilization for Data Center Inference with TensorRT
Inference Server
<https://on-demand-gtc.gputechconf.com/gtcnew/sessionview.php?sessionName=s9438-maximizing+utilization+for+data+center+inference+with+tensorrt+inference+server>`_.

* `NVIDIA TensorRT Inference Server Boosts Deep Learning Inference
<https://devblogs.nvidia.com/nvidia-serves-deep-learning-inference/>`_.

* `GPU-Accelerated Inference for Kubernetes with the NVIDIA TensorRT
Inference Server and Kubeflow
<https://www.kubeflow.org/blog/nvidia_tensorrt/>`_.

Contributing
------------

Contributions to Triton Inference Server are more than welcome. To
contribute, make a pull request and follow the guidelines outlined in
the `Contributing <CONTRIBUTING.md>`_ document.

Reporting problems, asking questions
------------------------------------

We appreciate any feedback, questions or bug reports regarding this
project. When help with code is needed, follow the process outlined in
the Stack Overflow (https://stackoverflow.com/help/mcve)
document. Ensure posted examples are:

* minimal – use as little code as possible that still produces the
same problem

* complete – provide all parts needed to reproduce the problem. Check
whether you can strip external dependencies and still show the
problem. The less time we spend reproducing problems, the more time
we have to fix them

* verifiable – test the code you're about to provide to make sure it
reproduces the problem. Remove all other problems that are not
related to your request/question.

.. |License| image:: https://img.shields.io/badge/License-BSD3-lightgrey.svg
:target: https://opensource.org/licenses/BSD-3-Clause
2 changes: 1 addition & 1 deletion VERSION
@@ -1 +1 @@
-2.2.0dev
+2.2.0
