Update README and versions for 20.08 release
David Goodwin authored and dzier committed Aug 27, 2020
1 parent 6d65a85 commit 4479257
Showing 3 changed files with 248 additions and 10 deletions.
10 changes: 5 additions & 5 deletions Dockerfile
@@ -146,8 +146,8 @@ FROM ${TENSORFLOW2_IMAGE} AS tritonserver_tf2
############################################################################
FROM ${BASE_IMAGE} AS tritonserver_build

-ARG TRITON_VERSION=2.2.0dev
-ARG TRITON_CONTAINER_VERSION=20.08dev
+ARG TRITON_VERSION=2.2.0
+ARG TRITON_CONTAINER_VERSION=20.08

# libgoogle-glog0v5 is needed by caffe2 libraries.
# libcurl4-openSSL-dev is needed for GCS
@@ -366,8 +366,8 @@ ENTRYPOINT ["/opt/tritonserver/nvidia_entrypoint.sh"]
############################################################################
FROM ${BASE_IMAGE}

-ARG TRITON_VERSION=2.2.0dev
-ARG TRITON_CONTAINER_VERSION=20.08dev
+ARG TRITON_VERSION=2.2.0
+ARG TRITON_CONTAINER_VERSION=20.08

ENV TRITON_SERVER_VERSION ${TRITON_VERSION}
ENV NVIDIA_TRITON_SERVER_VERSION ${TRITON_CONTAINER_VERSION}
@@ -377,7 +377,7 @@ LABEL com.nvidia.tritonserver.version="${TRITON_SERVER_VERSION}"

ENV PATH /opt/tritonserver/bin:${PATH}

# Need to include pytorch in LD_LIBRARY_PATH since Torchvision loads custom
# ops from that path
ENV LD_LIBRARY_PATH /opt/tritonserver/lib/pytorch/:$LD_LIBRARY_PATH

246 changes: 242 additions & 4 deletions README.rst
@@ -30,13 +30,251 @@
NVIDIA Triton Inference Server
==============================

**NOTE: You are currently on the r20.08 branch which tracks
stabilization towards the next release. This branch is not usable
during stabilization.**

.. overview-begin-marker-do-not-remove

NVIDIA Triton Inference Server provides a cloud inferencing solution
optimized for NVIDIA GPUs. The server provides an inference service
via an HTTP/REST or GRPC endpoint, allowing remote clients to request
inferencing for any model being managed by the server. For edge
deployments, Triton Server is also available as a shared library with
an API that allows the full functionality of the server to be included
directly in an application.

What's New in 2.2.0
-------------------

* TensorFlow 2.x is now supported in addition to TensorFlow 1.x. See
the Frameworks Support Matrix for the supported TensorFlow
versions. The version of TensorFlow used can be selected when
launching Triton with the
--backend-config=tensorflow,version=<version> flag. Set <version> to
1 or 2 to select TensorFlow 1 or TensorFlow 2, respectively. By
default, TensorFlow 1 is used.

* Inference request timeout option added to the Python and C++
client libraries (see the usage sketch after this list).

* GRPC inference protocol updated to fix a performance regression.

* Explicit major/minor versioning added to TRITONSERVER and
TRITONBACKEND APIs.

* New CMake option TRITON_CLIENT_SKIP_EXAMPLES to disable building the
client examples.
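
The following minimal sketch shows how the new request timeout might
be used from the Python client. The module layout, the client_timeout
argument, the tensor names, and the model name *simple* are
assumptions based on the current tritonclient package and may differ
for the 2.2.0 client wheel.

.. code-block:: python

    # Hedged sketch: gRPC inference with a client-side timeout.
    # Module names, the client_timeout argument, and the model
    # "simple" are assumptions and may differ for the 2.2.0 wheel.
    import numpy as np
    import tritonclient.grpc as grpcclient
    from tritonclient.utils import InferenceServerException

    client = grpcclient.InferenceServerClient(url="localhost:8001")

    # Build a single FP32 input tensor for the hypothetical model.
    data = np.ones((1, 16), dtype=np.float32)
    infer_input = grpcclient.InferInput("INPUT0", list(data.shape), "FP32")
    infer_input.set_data_from_numpy(data)

    try:
        # The request fails if no response arrives within 5 seconds.
        result = client.infer(
            model_name="simple",
            inputs=[infer_input],
            outputs=[grpcclient.InferRequestedOutput("OUTPUT0")],
            client_timeout=5.0,
        )
        print(result.as_numpy("OUTPUT0"))
    except InferenceServerException as err:
        print("inference failed or timed out:", err)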

Features
--------

* `Multiple framework support
<https://docs.nvidia.com/deeplearning/triton-inference-server/master-user-guide/docs/model_repository.html#framework-model-definition>`_. The
server can manage any number and mix of models (limited by system
disk and memory resources). Supports TensorRT, TensorFlow GraphDef,
TensorFlow SavedModel, ONNX, PyTorch, and Caffe2 NetDef model
formats. Both TensorFlow 1.x and TensorFlow 2.x are supported. Also
supports TensorFlow-TensorRT and ONNX-TensorRT integrated
models. Variable-size input and output tensors are allowed if
supported by the framework. See `Capabilities
<https://docs.nvidia.com/deeplearning/triton-inference-server/master-user-guide/docs/capabilities.html#capabilities>`_
for detailed support information for each framework.

* `Concurrent model execution support
<https://docs.nvidia.com/deeplearning/triton-inference-server/master-user-guide/docs/model_configuration.html#instance-groups>`_. Multiple
models (or multiple instances of the same model) can run
simultaneously on the same GPU.

* Batching support. For models that support batching, Triton Server
can accept requests for a batch of inputs and respond with the
corresponding batch of outputs. Triton Server also supports multiple
`scheduling and batching
<https://docs.nvidia.com/deeplearning/triton-inference-server/master-user-guide/docs/model_configuration.html#scheduling-and-batching>`_
algorithms that combine individual inference requests together to
improve inference throughput. These scheduling and batching
decisions are transparent to the client requesting inference.

* `Custom backend support
<https://github.com/NVIDIA/triton-inference-server/blob/master/docs/backend.rst>`_. Triton
Server allows individual models to be implemented with custom
backends instead of by a deep-learning framework. With a custom
backend a model can implement any logic desired, while still
benefiting from the GPU support, concurrent execution, dynamic
batching and other features provided by the server.

* `Ensemble support
<https://docs.nvidia.com/deeplearning/triton-inference-server/master-user-guide/docs/models_and_schedulers.html#ensemble-models>`_. An
ensemble represents a pipeline of one or more models and the
connection of input and output tensors between those models. A
single inference request to an ensemble will trigger the execution
of the entire pipeline.

* Multi-GPU support. Triton Server can distribute inferencing across
all system GPUs.

* Triton Server provides `multiple modes for model management
<https://docs.nvidia.com/deeplearning/triton-inference-server/master-user-guide/docs/model_management.html>`_. These
model management modes allow for both implicit and explicit loading
and unloading of models without requiring a server restart.

* `Model repositories
<https://docs.nvidia.com/deeplearning/triton-inference-server/master-user-guide/docs/model_repository.html#>`_
may reside on a locally accessible file system (e.g. NFS), in Google
Cloud Storage or in Amazon S3.

* HTTP/REST and GRPC `inference protocols
<https://docs.nvidia.com/deeplearning/triton-inference-server/master-user-guide/docs/http_grpc_api.html>`_
based on the community developed `KFServing protocol
<https://github.com/kubeflow/kfserving/tree/master/docs/predict-api/v2>`_; an example request sketch follows this list.

* Readiness and liveness `health endpoints
<https://docs.nvidia.com/deeplearning/triton-inference-server/master-user-guide/docs/http_grpc_api.html>`_
suitable for any orchestration or deployment framework, such as
Kubernetes.

* `Metrics
<https://docs.nvidia.com/deeplearning/triton-inference-server/master-user-guide/docs/metrics.html>`_
indicating GPU utilization, server throughput, and server latency.

* `C library interface
<https://docs.nvidia.com/deeplearning/triton-inference-server/master-user-guide/docs/library_api.html>`_
allows the full functionality of Triton Server to be included
directly in an application.
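
As a concrete illustration of the HTTP/REST protocol, health
endpoints, and metrics listed above, the sketch below queries a
running server with Python's requests library. Triton's default ports
(8000 for HTTP/REST, 8002 for metrics), the tensor names, and the
model name *simple* are assumptions; the endpoint paths follow the
KFServing v2 predict protocol referenced above.

.. code-block:: python

    # Hedged sketch of the KFServing-v2-style HTTP/REST endpoints.
    # Default ports and the model name "simple" are assumptions.
    import requests

    BASE = "http://localhost:8000"

    # Liveness and readiness endpoints return HTTP 200 when healthy.
    print("live:", requests.get(BASE + "/v2/health/live").status_code)
    print("ready:", requests.get(BASE + "/v2/health/ready").status_code)

    # Inference request: JSON body describing named, typed input tensors.
    payload = {
        "inputs": [
            {"name": "INPUT0", "shape": [1, 16], "datatype": "FP32",
             "data": [0.0] * 16}
        ]
    }
    resp = requests.post(BASE + "/v2/models/simple/infer", json=payload)
    print(resp.json())

    # Prometheus-format metrics (GPU utilization, throughput, latency).
    print(requests.get("http://localhost:8002/metrics").text[:200])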

.. overview-end-marker-do-not-remove

The current release of the Triton Inference Server is 2.2.0 and
corresponds to the 20.08 release of the tritonserver container on
`NVIDIA GPU Cloud (NGC) <https://ngc.nvidia.com>`_. The branch for
this release is `r20.08
<https://github.com/NVIDIA/triton-inference-server/tree/r20.08>`_.

Backwards Compatibility
-----------------------

Version 2 of Triton is beta quality, so you should expect some changes
to the server and client protocols and APIs. Version 2 of Triton does
not generally maintain backwards compatibility with version 1.
Specifically, you should take the following items into account when
transitioning from version 1 to version 2:

* The Triton executables and libraries are in /opt/tritonserver. The
Triton executable is /opt/tritonserver/bin/tritonserver.

* Some *tritonserver* command-line arguments are removed, changed or
have different default behavior in version 2.

* --api-version, --http-health-port, --grpc-infer-thread-count,
--grpc-stream-infer-thread-count, --allow-poll-model-repository,
--allow-model-control and --tf-add-vgpu are removed.

* The default for --model-control-mode is changed to *none*.

* --tf-allow-soft-placement and --tf-gpu-memory-fraction are renamed
to --backend-config="tensorflow,allow-soft-placement=<true,false>"
and --backend-config="tensorflow,gpu-memory-fraction=<float>".

* The HTTP/REST and GRPC protocols, while conceptually similar to
version 1, are completely changed in version 2. See the `inference
protocols
<https://docs.nvidia.com/deeplearning/triton-inference-server/master-user-guide/docs/http_grpc_api.html>`_
section of the documentation for more information.

* Python and C++ client libraries are re-implemented to match the new
HTTP/REST and GRPC protocols. The Python client no longer depends on
a C++ shared library and so should be usable on any platform that
supports Python. See the `client libraries
<https://docs.nvidia.com/deeplearning/triton-inference-server/master-user-guide/docs/client_library.html>`_
section of the documentation for more information (a short usage
sketch follows this list).

* The version 2 CMake build requires these changes:

* The CMake flag names have changed from a TRTIS prefix to a TRITON
prefix. For example, TRITON_ENABLE_TENSORRT.

* The build targets are *server*, *client* and *custom-backend* to
build the server, client libraries and examples, and custom
backend SDK, respectively.

* In the Docker containers the environment variables indicating the
Triton version have changed to have a TRITON prefix, for example,
TRITON_SERVER_VERSION.
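
As a small illustration of the re-implemented v2 Python client, the
sketch below checks server status and fetches model metadata over
HTTP. The tritonclient module layout reflects the current package and
is an assumption for the 2.2.0 wheel; the model name *simple* is
hypothetical.

.. code-block:: python

    # Hedged sketch of the v2 Python HTTP client; module layout and the
    # model name "simple" are assumptions.
    import tritonclient.http as httpclient

    client = httpclient.InferenceServerClient(url="localhost:8000")

    # Liveness/readiness map to the v2 health endpoints.
    print("server live: ", client.is_server_live())
    print("server ready:", client.is_server_ready())

    # Model metadata describes the model's input and output tensors.
    print(client.get_model_metadata("simple"))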

Documentation
-------------

The User Guide, Developer Guide, and API Reference `documentation for
the current release
<https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html>`_
provide guidance on installing, building, and running Triton Inference
Server.

You can also view the `documentation for the master branch
<https://docs.nvidia.com/deeplearning/triton-inference-server/master-user-guide/docs/index.html>`_
and for `earlier releases
<https://docs.nvidia.com/deeplearning/triton-inference-server/archives/index.html>`_.

NVIDIA publishes a number of `deep learning examples that use Triton
<https://github.com/NVIDIA/DeepLearningExamples>`_.

An `FAQ
<https://docs.nvidia.com/deeplearning/triton-inference-server/master-user-guide/docs/faq.html>`_
provides answers for frequently asked questions.

READMEs for deployment examples can be found in subdirectories of
deploy/, for example, `deploy/single_server/README.rst
<https://github.com/NVIDIA/triton-inference-server/tree/master/deploy/single_server/README.rst>`_.

The `Release Notes
<https://docs.nvidia.com/deeplearning/triton-inference-server/release-notes/index.html>`_
and `Support Matrix
<https://docs.nvidia.com/deeplearning/dgx/support-matrix/index.html>`_
indicate the required versions of the NVIDIA Driver and CUDA, and also
describe which GPUs are supported by Triton Server.

Presentations and Papers
^^^^^^^^^^^^^^^^^^^^^^^^

* `High-Performance Inferencing at Scale Using the TensorRT Inference Server <https://developer.nvidia.com/gtc/2020/video/s22418>`_.

* `Accelerate and Autoscale Deep Learning Inference on GPUs with KFServing <https://developer.nvidia.com/gtc/2020/video/s22459>`_.

* `Deep into Triton Inference Server: BERT Practical Deployment on NVIDIA GPU <https://developer.nvidia.com/gtc/2020/video/s21736>`_.

* `Maximizing Utilization for Data Center Inference with TensorRT
Inference Server
<https://on-demand-gtc.gputechconf.com/gtcnew/sessionview.php?sessionName=s9438-maximizing+utilization+for+data+center+inference+with+tensorrt+inference+server>`_.

* `NVIDIA TensorRT Inference Server Boosts Deep Learning Inference
<https://devblogs.nvidia.com/nvidia-serves-deep-learning-inference/>`_.

* `GPU-Accelerated Inference for Kubernetes with the NVIDIA TensorRT
Inference Server and Kubeflow
<https://www.kubeflow.org/blog/nvidia_tensorrt/>`_.

Contributing
------------

Contributions to Triton Inference Server are more than welcome. To
contribute, make a pull request and follow the guidelines outlined in
the `Contributing <CONTRIBUTING.md>`_ document.

Reporting problems, asking questions
------------------------------------

We appreciate any feedback, questions or bug reports regarding this
project. When help with code is needed, follow the process outlined in
the Stack Overflow (https://stackoverflow.com/help/mcve)
document. Ensure posted examples are:

* minimal – use as little code as possible that still produces the
same problem

* complete – provide all parts needed to reproduce the problem. Check
whether you can strip external dependencies and still show the
problem. The less time we spend reproducing problems, the more time
we have to fix them

* verifiable – test the code you're about to provide to make sure it
reproduces the problem. Remove all other problems that are not
related to your request/question.

.. |License| image:: https://img.shields.io/badge/License-BSD3-lightgrey.svg
:target: https://opensource.org/licenses/BSD-3-Clause
2 changes: 1 addition & 1 deletion VERSION
@@ -1 +1 @@
-2.2.0dev
+2.2.0
