Releases: NVIDIA/DALI

DALI v1.23.0

24 Feb 15:33
ee99d8f

Key Features and Enhancements

This DALI release includes the following key features and enhancements:

  • Enabled conditional execution: support for if/else statements with runtime predicates inside pipeline (#4561, #4618, #4602, #4589, #4617).
  • Added GPU experimental.inputs.video operator that supports decoding large videos from a memory buffer across multiple iterations (#4613, #4584, #4603, #4564).
  • Added support for lossless JPEG decoding on CPU and GPU with fn.experimental.decoders.image (#4625, #4600, #4587, #4572, #4592, #4548).
  • Added fn.experimental.tensor_resize operator (#4492).
  • Added fn.experimental.equalize operator (#4575, #4565).
  • Added API for pre-allocation and releasing of memory pools (#4563, #4556).

Fixed Issues

  • Fixed GPU fn.constant operator synchronization issue (#4643).
  • Fixed out-of-bounds access with trailing wildcard in fn.reshape (#4631).
  • Fixed insufficient alignment issues in GPU video decoding (#4622).

Improvements

  • Dependencies update (#4649)
  • Reduce L0 test time (#4645)
  • Extend input API utilities to support input operators (#4642)
  • Add slice_flip_normalize_* to the minimum build (used by imgcodec)
  • VideoInput<MixedBackend> (#4613)
  • Move slice_flip_kernel* to separate compilation units (#4637)
  • Bump nvCOMP to 2.6.1 (#4638)
  • Add fn.experimental.crop_mirror_normalize (#4562)
  • Simplify setup stage of Cast operator (#4633)
  • Move to CUDA 12.0U1 (#4632)
  • Fix the warning in the build with sanitizer (#4626)
  • Optimize CPU time of JPEG lossless decoder (#4625)
  • Support inferring batch size from tensor argument inputs (#4617)
  • reshape: restore the support for trailing wildcard in rel_shape (#4623)
  • Add DALI Conditionals documentation (#4589)
  • Enable nose2 test timer (#4610)
  • New SliceFlipNormalizeGPU kernel (#4356)
  • DataId mechanism for fn.inputs.video operator (#4584)
  • Add experimental.tensor_resize operator (#4492)
  • MixedBackend support for InputOperator (#4603)
  • Fix HasHwDecoder (#4601)
  • Track DataNodes produced by .gpu() in conditionals (#4602)
  • Update the math expression docs (#4568)
  • Clear operator traces before launching the operator (#4605)
  • Skip JPEG lossless tests for compute capability < SM60 (#4600)
  • Add experimental python 3.11 support (#4586)
  • Improve error message when trying to decode JPEG lossless images on the CPU backend (#4587)
  • Improve pipeline graph traversal (#4583)
  • Make .so files patched in one go when the wheel is produced (#4582)
  • Operator trace mechanism (#4564)
  • Add equalize operator (#4575)
  • Add equalize kernel (#4565)
  • Support for JPEG lossless images in GPU fn.experimental.decoders.image (#4572)
  • Add experimental support for if statements in DALI (#4561)
  • Add CodeQL workflow for GitHub code scanning (#4438)
  • Update nvCOMP to 2.6 (#4579)
  • Give the ability to link each part of CUDA toolkit statically (#4570)
  • Fix TL0_python-self-test-base-cuda for CUDA 12 (#4577)
  • Add functions to preallocate pools and release unused pool memory (#4563)
  • Disable strict_overflow warning. (#4567)
  • Remove unused define_graph argument from build pipeline method (#4555)
  • Add release_unused function to memory pools. (#4556)
  • Change CUDA C++ standard to C++17 (#4506)
  • Create axes_utils.h (#4548)

Bug Fixes

  • Fixing API utils (#4651)
  • constant operator: Set proper stream in constant storage. (#4643)
  • Coverity 2023.01-02 (#4641)
  • Allow 1-off discrepancies in the equalize op between GPU and CPU baseline (#4639)
  • Fix pipeline leak in InputOperatorMixedTest (#4630)
  • reshape: Prevent out-of-bounds access with trailing wildcard in rel_shape (#4631)
  • Fix @autoserialize problem with unknown module (#4628)
  • Fix classification of argument input-only operators in AutoGraph (#4618)
  • Fix stack op error message so that it reports dim of offending operand (#4616)
  • Make sure that ulMaxWidth is aligned to 32 bytes in the video decoder (#4622)
  • Fix sanitizer error: memory & pipeline leaks (#4619)
  • Fix rel_shape length validation in reshape (#4595)
  • Fix non-VMM pool release_unused. Don't rely on cudaGetMemInfo in preallocation tests. (#4596)
  • Fix errors reported by LASAN (#4594)
  • Add nvjpeg calls used for lossless jpeg decoding to the stub generator (#4592)
  • Fix passing WITH_DYNAMIC_* flags to conda build (#4597)
  • Fix pool preallocation tests (#4585)
  • Fix imgcodec fallback and error handling (#4573)
  • Fix CUDA_TARGET_ARCHS handling in CMake 3.18+ (#4559)

Breaking API changes

There are no breaking changes in this DALI release.

Deprecated features

No features were deprecated in this release.

Known issues:

  • The video loader operator requires a key frame at least once every 10 to 15 frames of the video stream.
    If key frames occur less often than that, the returned frames might be out of sync.
  • Experimental VideoReaderDecoder does not support open GOP.
    It will not report an error and might produce invalid frames. VideoReader uses a heuristic approach to detect open GOP and should work in most common cases.
  • The DALI TensorFlow plugin might not be compatible with TensorFlow versions 1.15.0 and later.
    To use DALI with the TensorFlow version that does not have a prebuilt plugin binary shipped with DALI, make sure that the compiler that is used to build TensorFlow exists on the system during the plugin installation. (Depending on the particular version, you can use GCC 4.8.4, GCC 4.8.5, or GCC 5.4.)
  • In experimental debug and eager modes, the GPU external source is not properly synchronized with DALI internal streams.
    As a workaround, you can manually synchronize the device before returning the data from the callback.
  • Due to some known issues with Meltdown/Spectre mitigations and DALI, DALI shows best performance when running in Docker with escalated privileges, for example:
    • privileged=yes in Extra Settings for AWS data points
    • --privileged or --security-opt seccomp=unconfined for bare Docker.
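
The key-frame requirement in the first known issue can be checked ahead of time when the key-frame positions of a stream are already known. The sketch below is a hypothetical helper, not part of DALI. It assumes the 0-based frame indices of the key frames were obtained elsewhere (for example, from a demuxer), and it uses 15 frames, the upper end of the range quoted above, as an assumed default limit.

```python
def max_keyframe_gap(keyframe_indices, total_frames):
    """Return the longest distance, in frames, between consecutive key frames.

    keyframe_indices: 0-based frame numbers that are key frames (assumed input).
    total_frames: number of frames in the stream.
    """
    if total_frames <= 0:
        raise ValueError("total_frames must be positive")
    keys = sorted(set(keyframe_indices))
    if not keys:
        # No key frame at all: treat the whole stream as a single gap.
        return total_frames
    # Append the end of the stream so the tail after the last key frame counts.
    boundaries = keys + [total_frames]
    return max(b - a for a, b in zip(boundaries, boundaries[1:]))


def keyframes_dense_enough(keyframe_indices, total_frames, limit=15):
    """True if no gap between key frames exceeds `limit` frames."""
    return max_keyframe_gap(keyframe_indices, total_frames) <= limit
```

A stream that fails this check is a candidate for re-encoding with a shorter key-frame interval before it is passed to the video loader.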

Binary builds

NOTE: DALI builds for CUDA 12 dynamically link the CUDA toolkit. To use DALI, install the latest CUDA toolkit.

CUDA 11.0 and CUDA 12.0 builds use CUDA toolkit enhanced compatibility. 
They are built with the latest CUDA 11.x/12.x toolkit respectively but they can run on the latest, 
stable CUDA 11.0/CUDA 12.0 capable drivers (450.80 or later and 525.60 or later respectively).
However, using the most recent driver may enable additional functionality. 
More details can be found in the enhanced CUDA compatibility guide.
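
The driver minimums quoted above (450.80 for CUDA 11 builds, 525.60 for CUDA 12 builds) can be turned into a small pre-flight check. This is a minimal sketch using only the two thresholds from these notes. The function names are hypothetical, and the driver version string is assumed to be dot-separated integers such as "525.60.13".

```python
# Minimum driver versions quoted in the release notes, keyed by CUDA major version.
MIN_DRIVER = {11: (450, 80), 12: (525, 60)}


def parse_version(text):
    """Turn a dotted version string such as '525.60.13' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def driver_is_sufficient(cuda_major, driver_version):
    """True if the driver meets the minimum quoted for that CUDA build."""
    if cuda_major not in MIN_DRIVER:
        raise ValueError(f"no minimum driver recorded for CUDA {cuda_major}")
    # Tuple comparison is element-wise, so (450, 80, 2) >= (450, 80) holds.
    return parse_version(driver_version) >= MIN_DRIVER[cuda_major]
```

On a machine with the NVIDIA driver installed, the version string can typically be read with `nvidia-smi --query-gpu=driver_version --format=csv,noheader` and passed to `driver_is_sufficient`.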

Install via pip for CUDA 12.0:
pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-cuda120==1.23.0
pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-tf-plugin-cuda120==1.23.0

or for CUDA 11:
pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-cuda110==1.23.0
pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-tf-plugin-cuda110==1.23.0
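
The install commands above follow a regular pattern: the package name encodes the CUDA variant and the version is pinned with `==`. The sketch below builds such a command string; the helper name is hypothetical and it only reproduces the pattern seen in these notes. It does not check whether a given combination was actually published (for instance, these notes list CUDA 10.2 wheels only up to 1.21).

```python
INDEX_URL = "https://developer.download.nvidia.com/compute/redist/"
# CUDA variant suffixes that appear in the package names in these notes.
CUDA_VARIANTS = ("102", "110", "120")


def pip_install_command(version, cuda_variant, tf_plugin=False):
    """Build the pip command for a DALI wheel or its TensorFlow plugin."""
    if cuda_variant not in CUDA_VARIANTS:
        raise ValueError(f"unknown CUDA variant: {cuda_variant!r}")
    name = "nvidia-dali-tf-plugin" if tf_plugin else "nvidia-dali"
    package = f"{name}-cuda{cuda_variant}=={version}"
    return " ".join(["pip", "install", "--extra-index-url", INDEX_URL, package])


if __name__ == "__main__":
    print(pip_install_command("1.23.0", "120"))
    print(pip_install_command("1.23.0", "120", tf_plugin=True))
```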

Or use direct download links (CUDA 12.0):

Or use direct download links (CUDA 11.0):

FFmpeg source code:

  • This software uses code of FFmpeg licensed under the LGPLv2.1 and its source can be downloaded here

Libsndfile source code:

DALI v1.22.0

19 Jan 14:43
c572c3f

Key Features and Enhancements

This DALI release includes the following key features and enhancements:

  • Added CUDA 12.0 support (#4502).
    • Reduced binary size for CUDA 12 builds.
  • Added CPU experimental.inputs.video operator that supports decoding video from a memory buffer across multiple iterations to reduce memory usage (#4519).
  • Added GPU fn.experimental.filter (convolution) operator (#4298, #4525).
  • Added support for decoding raw H264 and H265 streams from memory (#4480).

Fixed Issues

No major issues were fixed in this release.

Improvements

  • Update DALI TensorFlow examples to work with 2.11 (#4554)
  • Update nvCOMP to 2.5 (#4550)
  • Fix TL1_custom_src_pattern_build test (#4546)
  • Allow CPU dtype source in GPU cast_like (#4547)
  • Add GPU filter operator (2D, 3D) (#4525)
  • Remove usage of the unified memory from the remap test (#4544)
  • Split DALI operator tests into two jobs (#4543)
  • Update suppression list for sanitizer tests (#4542)
  • Update Boost preprocessor and rapidjson (#4538)
  • Update libtiff (#4531)
  • Fix linter errors & numpy dependency workaround (#4532)
  • VideoInput operator for the CPU (#4519)
  • Use pointer in NVDECLease. Store owner pointer in NVDECLease. (#4523)
  • Extract ResizeAttrBase to be reused in TensorResizeAttr (#4515)
  • Add GPU filter kernel (#4298)
  • Propagate SourceInfo (when unambiguous) from inputs to outputs. (#4518)
  • Limit NumPy version to pre-1.24 (#4527)
  • Avoid signed/unsigned comparison in clamp<S, U>. (#4524)
  • Update YOLO example to support the latest TensorFlow version (#4522)
  • Utilities and refactoring pre-VideoInput operator (#4513)
  • Enable CUDA 12.0 support (#4502)
  • Extracting InputOperator from ExternalSource (#4505)
  • Add expand_dims utility (#4493)
  • Remove Operator inheritance from VideoDecoderBase (#4508)
  • Extend decoding support (#4480)
  • Place AutoGraph as private submodule of DALI and enable tests (#4504)
  • Link CFITSIO library with cmake (#4487)

Bug Fixes

  • Add the missing installation of sanitizer to the deps image (#4521)
  • Fix DALI build without FFmpeg (#4534)
  • Replace usages of numpy.bool with bool (#4526)
  • Fix missing #include <optional>. (#4520)
  • Fix exclusion of CFITSIO test when BUILD_CFITSIO=OFF (#4510)
  • Don't look for duplicate arguments in parent schemas. (#4507)
  • Fix size argument to strncpy in cfitsio_test. Fix copyright notice. (#4509)

Breaking API changes

  • DALI builds for CUDA 12 dynamically link the CUDA toolkit. To use DALI, install the latest CUDA toolkit.
  • DALI 1.21 was the last release built for CUDA 10.2.

Deprecated features

No features were deprecated in this release.

Known issues:

  • The video loader operator requires a key frame at least once every 10 to 15 frames of the video stream.
    If key frames occur less often than that, the returned frames might be out of sync.
  • Experimental VideoReaderDecoder does not support open GOP.
    It will not report an error and might produce invalid frames. VideoReader uses a heuristic approach to detect open GOP and should work in most common cases.
  • The DALI TensorFlow plugin might not be compatible with TensorFlow versions 1.15.0 and later.
    To use DALI with the TensorFlow version that does not have a prebuilt plugin binary shipped with DALI, make sure that the compiler that is used to build TensorFlow exists on the system during the plugin installation. (Depending on the particular version, you can use GCC 4.8.4, GCC 4.8.5, or GCC 5.4.)
  • In experimental debug and eager modes, the GPU external source is not properly synchronized with DALI internal streams.
    As a workaround, you can manually synchronize the device before returning the data from the callback.
  • Due to some known issues with Meltdown/Spectre mitigations and DALI, DALI shows best performance when running in Docker with escalated privileges, for example:
    • privileged=yes in Extra Settings for AWS data points
    • --privileged or --security-opt seccomp=unconfined for bare Docker.

Binary builds

NOTE: DALI builds for CUDA 12 dynamically link the CUDA toolkit. To use DALI, install the latest CUDA toolkit.

CUDA 11.0 and CUDA 12.0 builds use CUDA toolkit enhanced compatibility. 
They are built with the latest CUDA 11.x/12.x toolkit respectively but they can run on the latest, 
stable CUDA 11.0/CUDA 12.0 capable drivers (450.80 or later and 525.60 or later respectively).
However, using the most recent driver may enable additional functionality. 
More details can be found in the enhanced CUDA compatibility guide.

Install via pip for CUDA 12.0:
pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-cuda120==1.22.0
pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-tf-plugin-cuda120==1.22.0

or for CUDA 11:
pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-cuda110==1.22.0
pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-tf-plugin-cuda110==1.22.0

Or use direct download links (CUDA 12.0):

Or use direct download links (CUDA 11.0):

FFmpeg source code:

  • This software uses code of FFmpeg licensed under the LGPLv2.1 and its source can be downloaded here

Libsndfile source code:

DALI v1.21.0

28 Dec 09:52

Key Features and Enhancements

This DALI release includes the following key features and enhancements:

  • Added the following experimental image decoding operators with support for higher dynamic ranges (#4223):
    • experimental.decoders.image
    • experimental.decoders.image_crop
    • experimental.decoders.image_random_crop
    • experimental.decoders.image_slice
  • Added the GPU debayer operator (#4495, #4486).

Fixed Issues

The following issues were fixed in this release:

  • Fixed the issue where the GPU numpy reader was crashing on a DALI process teardown with cufile 1.4.0 (#4466).
  • Fixed the issue where the GPU video decoder was failing in multi-GPU settings (#4517).

Improvements

  • Optimizing ShiftPixelCenter kernel configuration (#4430).
  • Update "Compiling from source" tutorial (#4010).
  • Imgcodec's decode operator (#4223).
  • Move to use CMake in DALI deps where possible (#4445).
  • Bump supported tf version (#4459).
  • Optimize inflate tests (#4456).
  • Execute whole Keras code in the expected device scope (#4462).
  • Update the TensorFlow test to work with 2.11.x (#4460).
  • Crop rounding argument to control the conversion of anchors to integral values (#4461).
  • Make Transpose's perm argument optional (by default, reverse dims) (#4465).
  • Add CastLike operator (#4467).
  • Accept negative axis in Cat and Stack operators (#4468).
  • Code drop AutoGraph based on TensorFlow 2.10.0 (#4485).
  • Remove build and doc files from AutoGraph (#4489).
  • Rearrange AutoGraph tests (#4490).
  • Adjust the documentation template for the latest sphinx_rtd_theme (#4481).
  • Bump the nvidia-tensorflow to 22.11 in tests (#4472).
  • Improve error reporting in the video decoder (#4484).
  • Move to generic CUDA_CALL for nvCOMP (#4474).
  • Extend the warning about the lack of the necessary CUDA libraries (#4473).
  • Allow negative axes in reductions module (#4470).
  • Add kernel-wrapper around NPP debayer calls (#4486).
  • Remove TF-specific codepaths from AutoGraph (#4491).
  • Lint the AutoGraph code (#4494).
  • Add bytes_per_sample_hint parameter to parallel external source (#4155).
  • Add debayer operator (#4495).
  • Remove trailing comments from .flake.ag (#4497).
  • Update DALI_DEPS_VERSION (#4496).
  • Deprecate CUDA 10.2 (#4503).
  • Extract CachingList from ExternalSource (#4501).

Bug Fixes

  • Do not call nvcomp with no input (#4434).
  • Fix libtiff CVE-2022-3970 (#4448).
  • TL3 SSD Install pycocotools from latest NVIDIA cocoapi repo (#4457).
  • Fix numpy reader crash (#4466).
  • Fix stub generation for dynamic linking (#4478).
  • Fix issues found by static analysis (#4477).
  • Fix PES tests with Python3.6/3.7 (#4500).
  • Patch FFmpeg for CVE-2022-3965, CVE-2022-3964 (#4499).
  • Fix video decoder cache for multiple GPUs (#4517).

Breaking API changes

There are no breaking changes in this DALI release.

Deprecated features

  • DALI 1.21 is the final release that will support CUDA 10.2.

Known issues:

  • The GPU numpy reader might crash during the DALI process teardown with cufile 1.4.0.
  • The video loader operator requires a key frame at least once every 10 to 15 frames of the video stream.
    If key frames occur less often than that, the returned frames might be out of sync.
  • Experimental VideoReaderDecoder does not support open GOP.
    It will not report an error and might produce invalid frames. VideoReader uses a heuristic approach to detect open GOP and should work in most common cases.
  • The DALI TensorFlow plugin might not be compatible with TensorFlow versions 1.15.0 and later.
    To use DALI with the TensorFlow version that does not have a prebuilt plugin binary shipped with DALI, make sure that the compiler that is used to build TensorFlow exists on the system during the plugin installation. (Depending on the particular version, you can use GCC 4.8.4, GCC 4.8.5, or GCC 5.4.)
  • In experimental debug and eager modes, the GPU external source is not properly synchronized with DALI internal streams.
    As a workaround, you can manually synchronize the device before returning the data from the callback.
  • Due to some known issues with Meltdown/Spectre mitigations and DALI, DALI shows best performance when running in Docker with escalated privileges, for example:
    • privileged=yes in Extra Settings for AWS data points
    • --privileged or --security-opt seccomp=unconfined for bare Docker.

Binary builds

Install via pip for CUDA 10.2:
pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-cuda102==1.21.0
pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-tf-plugin-cuda102==1.21.0

or for CUDA 11:

The CUDA 11.0 build uses CUDA toolkit enhanced compatibility. It is built with the latest CUDA 11.x toolkit
but can run on the latest, stable CUDA 11.0 capable drivers (450.80 or later).
Using the latest driver may enable additional functionality.
More details can be found in the enhanced CUDA compatibility guide.

pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-cuda110==1.21.0
pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-tf-plugin-cuda110==1.21.0

Or use direct download links (CUDA 10.2):

Or use direct download links (CUDA 11.0):

FFmpeg source code:

  • This software uses code of FFmpeg licensed under the LGPLv2.1 and its source can be downloaded here

Libsndfile source code:

DALI v1.20.0

30 Nov 16:13
b0c2e72

Key Features and Enhancements

This DALI release includes the following key features and enhancements:

  • Added the fn.experimental.remap operator for generic geometric transformation of images and video (#4379, #4419, #4365, #4374, #4425).
  • Added MPEG4 support to the GPU video decoder (#4424, #4327).
  • Added the fn.experimental.inflate operator that enables decompression of LZ4 compressed input (#4366).
  • Added support for broadcasting in arithmetic operators (CPU and GPU) (#4348).
  • Added experimental split and merge operators for conditional execution (#4359, #4405, #4358).
  • The following optimizations in GPU operators:
    • Optimized MelScale kernel (#4395).
    • Optimizations in the GPU decoder (#4351).
    • Simplified arithmetic GPU operator (#4411).
    • Split reduction kernels (#4383).
    • Avoiding copy from non-pinned memory in PreemphasisFilter operator (#4380).
    • Refactored the ConvertTimeMajorSpectrogram kernel (#4389).
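
For the broadcasting entry above (#4348), the sketch below shows how an output shape is derived under NumPy-style broadcasting rules. It is plain Python and does not call DALI. That DALI's arithmetic operators follow exactly these rules is an assumption here, since the release note does not spell them out, and `broadcast_shape` is a hypothetical helper name.

```python
from itertools import zip_longest


def broadcast_shape(*shapes):
    """Compute the broadcast result shape, NumPy-style, for the given shapes."""
    result = []
    # Walk the shapes from the innermost (rightmost) dimension outwards,
    # treating missing outer dimensions as extent 1.
    for extents in zip_longest(*(reversed(s) for s in shapes), fillvalue=1):
        non_unit = {e for e in extents if e != 1}
        if len(non_unit) > 1:
            # Two different extents, neither of them 1: not broadcastable.
            raise ValueError(f"shapes {shapes} cannot be broadcast together")
        result.append(non_unit.pop() if non_unit else 1)
    return tuple(reversed(result))
```

Under these rules an HWC image of shape (480, 640, 3) combined with a per-channel vector of shape (3,) yields (480, 640, 3), while shapes (2, 3) and (4, 3) are rejected.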

Fixed Issues

The following issues were fixed in this release:

  • Fixed TensorList copy synchronization issues (#4458, #4453).
  • Fixed an issue with hint grid size in OpticalFlow (#4443).
  • Fixed the ES synchronization issues in integrated memory devices (#4321, #4423).
  • Added a missing CUDA stream synchronization before cuvidUnmapVideoFrame in nvDecoder (#4426).
  • Fixed the pipeline initialization in Python after deserialization (#4350).
  • Fixed issues with serialization of functions in recent notebook versions (#4406).
  • Fixed integration with the new TF version by replacing Status::OK() with Status() in the TF plugin (#4442).

Improvements

  • Update dependencies 22/11 (#4427)
  • fn.experimental.remap optimizations (#4419)
  • Add mkv support (#4424)
  • Add inflate operator (#4366)
  • Include nvCOMP's license and notice in the acknowledgements (#4368)
  • Use numpy instead of naive loops in remap test. (#4425)
  • MelScale kernel optimization (#4395)
  • Optimize GPU decoder (#4351)
  • Simplify arithmetic operator GPU implementation (#4411)
  • Add CVE reporting guideline to the repo and readme (#4385)
  • Add internal Split and Merge operators (#4359)
  • Fix fstring usage for warning in pipeline (#4401)
  • Add fn.experimental.remap operator (#4379)
  • Divide expression_impl to avoid recompiling all ops when touching a detail in the impl (#4412)
  • Refactor ConvertTimeMajorSpectrogram kernel (#4389)
  • Remove documentation about data_layout argument for paddle and pytorch iterators (#4409)
  • Serialize failing global functions by value (#4406)
  • Limit the TF memory usage in test_dali_tf_dataset_shape.py tests (#4400)
  • Split reduction kernels (#4383)
  • Add convenient conversions from a list of arrays to DALI TensorList (#4391)
  • Add permute_in_place function with tests. (#4387)
  • Split cuda utils.h & fix includes (#4386)
  • Enable MPEG4 GPU decoding (#4327)
  • Update CUDA toolkit for Jetson build to 11.8 (#4376)
  • Remove TensorFlow 1.15 support from CUDA 11 (#4377)
  • Avoid copying from non-pinned memory in PreemphasisFilter operator (#4380)
  • Support broadcasting in arithmetic operators (CPU & GPU) (#4348)
  • Remove unnecessary reset in the PyTorch SSD example (#4373)
  • Remap kernel implementation with NPP (#4365)
  • Utils and prerequisites for NppRemapKernel implementation (#4374)
  • Extend DALIInterpType to_string (#4370)
  • Validate ROI in imgcodec (#4279)
  • Workspace unification (#4339)
  • Extend and relax TensorList sample APIs (#4358)
  • Remove the Pipeline/Executor completion callback APIs (#4345)

Bug Fixes

  • Fix H2H copy in HW NVJPEG. (#4458)
  • Fix an issue with improper hint grid size in OpticalFlow (#4443)
  • Enable support for full-swing videos (#4447)
  • Fix TensorList copy ordering issues (#4453)
  • Replace Status::OK() with Status() for TF plugin (#4442)
  • Adds a cuda stream synchronization before cuvidUnmapVideoFrame in nvDecoder (#4426)
  • Fix ES synchronization issues in integrated memory devices (#4321)
  • Fix debug build warnings in the inflate op (#4433)
  • Fix ExecutorSyncTest that runs the SimpleExecutor twice (#4432)
  • Fix setting pinned status of the tensor list in Python (#4431)
  • Pinned resource test fix: reset the device buffer on a proper stream. (#4428)
  • Fix libtiff CVEs (#4414)
  • Fix pinned resource test on integrated GPUs (#4423)
  • Fix builtin test - do not use operators lib (#4420)
  • Harden the code against ODR violations (#4421)
  • Unroll nested namespaces (#4415)
  • Add proper validation for empty batch in External Source (#4404)
  • Fix video decoder test for aarch64 (#4402)
  • Fix to enable leading underscore in op name (#4405)
  • Serialize failing global functions by value (#4406)
  • Add cuh files to linter (#4384)
  • Avoid reading out of bounds (#4398)
  • Fix namespace resolution for CUDA and STL math functions (#4378)
  • Fix unnecessary copy of the workspace object. (#4371)
  • Fix pipeline initialization in python after deserialization (#4350)
  • Fix misleading video example with timestamps (#4364)
  • Fix sanitizer build tests (#4367)

Breaking API changes

  • Removed the Pipeline/Executor completion callback APIs (#4345).
  • [C++ API] Workspace unification: C++ workspace is no longer templated with backend type (#4339).

Deprecated features

  • DALI will drop support for CUDA 10.2 in an upcoming release.

Known issues:

  • The GPU numpy reader might crash during the DALI process teardown with cufile 1.4.0.
  • The video loader operator requires a key frame at least once every 10 to 15 frames of the video stream.
    If key frames occur less often than that, the returned frames might be out of sync.
  • Experimental VideoReaderDecoder does not support open GOP.
    It will not report an error and might produce invalid frames. VideoReader uses a heuristic approach to detect open GOP and should work in most common cases.
  • The DALI TensorFlow plugin might not be compatible with TensorFlow versions 1.15.0 and later.
    To use DALI with the TensorFlow version that does not have a prebuilt plugin binary shipped with DALI, make sure that the compiler that is used to build TensorFlow exists on the system during the plugin installation. (Depending on the particular version, you can use GCC 4.8.4, GCC 4.8.5, or GCC 5.4.)
  • In experimental debug and eager modes, the GPU external source is not properly synchronized with DALI internal streams.
    As a workaround, you can manually synchronize the device before returning the data from the callback.
  • Due to some known issues with Meltdown/Spectre mitigations and DALI, DALI shows best performance when running in Docker with escalated privileges, for example:
    • privileged=yes in Extra Settings for AWS data points
    • --privileged or --security-opt seccomp=unconfined for bare Docker.

Binary builds

Install via pip for CUDA 10.2:
pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-cuda102==1.20.0
pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-tf-plugin-cuda102==1.20.0

or for CUDA 11:

The CUDA 11.0 build uses CUDA toolkit enhanced compatibility. It is built with the latest CUDA 11.x toolkit
but can run on the latest, stable CUDA 11.0 capable drivers (450.80 or later).
Using the latest driver may enable additional functionality.
More details can be found in the enhanced CUDA compatibility guide.

pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-cuda110==1.20.0
pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-tf-plugin-cuda110==1.20.0

Or use direct download links (CUDA 10.2):

Or use direct download links (CUDA 11.0):

FFmpeg source code:

  • This software uses code of FFmpeg licensed under the LGPLv2.1 and its source can be downloaded here

Libsndfile source code:

DALI v1.19.0

02 Nov 11:11

Key Features and Enhancements

This DALI release includes the following key features and enhancements:

  • Added the experimental.decoders.video stand-alone video decoder to decode video on GPU and CPU provided as an in-memory buffer (for example, through an external source) (#4354, #4296).
  • Added support to decode indexless videos (#4347, #4302, and #4335).

Fixed Issues

The following issues were fixed in this release:

  • Fixed the handling of Caffe LMDB empty samples (without data or labels) (#4266).

Improvements

  • Exclude HEVC files from video decoder test. (#4357)
  • Fix a typo in Debug Mode documentation (#4355)
  • Parallelize gpu video decoding (#4354)
  • Make tests for DALI linked dynamically with CUDA more flexible (#4341)
  • Update MXNet version used in tests (#4342)
  • Enable indexless video decoding for GPU (#4347)
  • Prevent obtaining handle values from dead unique handles and stream leases. (#4346)
  • Update broadcasting shape simplification logic (#4314)
  • Add warning about the end of support for CUDA 10.2 (#4334)
  • Frames decoder gpu without index (#4302)
  • Enable indexless decoding in CPU video decoder (#4335)
  • Update outdated links in the documentation (#4329)
  • Add Mixed VideoDecoder (#4296)
  • Update cutlass and DALI_deps revision. (#4328)
  • Fixes and performance improvements in imgcodec/nvjpeg (#4318)
  • Update Jetson build env to support CUDA 11.4 and Orin (#4250)
  • Update nvJPEG2k version to 0.6.0 (#4320)
  • Add missing documentation to (Future)DecodingResult(Promise). (#4310)
  • Update libcudacxx target macros for clang and SM90. (#4315)
  • Don't use nvjpegGetHardwareDecoderInfo in pre-11.8 toolkits. (#4325)
  • Prune static cuda libraries DALI links with from unused archs (#4317)
  • Fix clang warnings (#4312)
  • Add pass-through tracking to auto-pinning buffers (#4294)
  • Update protobuf (v21.5 to v21.7) (#4313)
  • Extended ImageDecoder tests (#4297)
  • Refactor OpSchema - move implementation to one translation unit (#4293)
  • Emit the warning about the default value change only when using the default. (#4214)
  • Reduce the batch size in RN50 data pipeline tests. (#4304)
  • Enable ROI adjustment for multi-frame inputs + cleanup. (#4303)
  • Use GPU Convert in nvJPEG decoder (#4247)
  • Aggregating ImageDecoder (#4224)
  • Support palette TIFFs (#4206)
  • Refactor video decoder for reusability (#4290)
  • Add ROI support to nvJPEG (#4244)
  • RemapKernel API (#4284)
  • Presteps to image_decoder.* APIs (#4277)
  • Add frames decoder CPU without index (#4278)
  • Add experimental.decoders.video for CPU (#4270)
  • Fix a typo in the documentation (#4258)
  • Add orientation to GPU image data Convert (#4232)
  • Fix hang in TL1_tensorflow-dali_test (#4255)
  • Make test_dltensor_operator.py consistent when the HW decoder is available (#4272)
  • Fix issues in DALI in action snippet (#4268)
  • Assure operator documentation links to enum types (#4264)
  • Support applying orientation in Convert (#4219)
  • Add image decoder registry. (#4261)
  • Support tiled TIFFs (#4201)
  • Bump up TensorFlow version in tests (#4238)

Bug Fixes

  • Fix coverity issues (#4349)
  • Revert pruning of unused architectures (#4336)
  • Fix order of access order waiting in TL's set_order (#4338)
  • Fix NVJPEG pinned buffer synchronization. (#4337)
  • Change the default order of data storage objects (#4276)
  • Fix checking of the return status of the bundle lib tests (#4330)
  • Fix executor test - add test operators (#4323)
  • Fix parameter propagation in ImageDecoder. (#4309)
  • Fix normalization when running GPU color space conversion (#4285)
  • Fix support for ANY_DATA in nvJPEG2K (#4299)
  • Fix inconsistent tensor recreation in TensorList (#4286)
  • Fix no ffmpeg build (#4288)
  • Fix libtiff error handling (#4274)
  • Fix imgcodec batched APIs and tests (#4263)
  • Fix handling of Caffe LMDB without valid data (#4266)
  • Move params in PerThreadResources move constructor (#4265)
  • Fix fusing the dimensions in SliceFlipNormalizePermutePadGpu (#4234)
  • Improve error handling in LibTiffDecoder (#4210)
  • Fix exception handling in BatchParallelDecoderImpl (#4262)
  • Make nvjpeg decoder use its own thread pool (#4241)

Breaking API changes

There are no breaking changes in this DALI release.

Deprecated features

DALI will drop support for CUDA 10.2 in an upcoming release.

Known issues:

  • The video loader operator requires a key frame at least once every 10 to 15 frames of the video stream.
    If key frames occur less often than that, the returned frames might be out of sync.
  • Experimental VideoReaderDecoder does not support open GOP.
    It will not report an error and might produce invalid frames. VideoReader uses a heuristic approach to detect open GOP and should work in most common cases.
  • The DALI TensorFlow plugin might not be compatible with TensorFlow versions 1.15.0 and later.
    To use DALI with the TensorFlow version that does not have a prebuilt plugin binary shipped with DALI, make sure that the compiler that is used to build TensorFlow exists on the system during the plugin installation. (Depending on the particular version, you can use GCC 4.8.4, GCC 4.8.5, or GCC 5.4.)
  • In experimental debug and eager modes, the GPU external source is not properly synchronized with DALI internal streams.
    As a workaround, you can manually synchronize the device before returning the data from the callback.
  • Due to some known issues with Meltdown/Spectre mitigations and DALI, DALI shows best performance when running in Docker with escalated privileges, for example:
    • privileged=yes in Extra Settings for AWS data points
    • --privileged or --security-opt seccomp=unconfined for bare Docker.
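
The key-frame requirement in the first known issue above can be checked before a video is fed to the loader. The sketch below is a minimal, library-independent illustration: the function names are hypothetical, the key-frame indices are assumed to come from whatever demuxing tool is at hand, and the limit of 15 is simply the upper end of the "10 to 15 frames" range quoted above.

```python
from typing import Sequence


def max_keyframe_gap(keyframes: Sequence[int], total_frames: int) -> int:
    """Return the longest run of frames between consecutive key frames.

    `keyframes` holds the 0-based indices of the key frames; how they are
    obtained (for example from a demuxer) is outside this sketch.
    """
    if not keyframes:
        # No key frame at all: the whole stream is one gap.
        return total_frames
    ordered = sorted(keyframes)
    # Distance between neighbouring key frames.
    gaps = [b - a for a, b in zip(ordered, ordered[1:])]
    # Frames before the first key frame and from the last key frame to the end.
    gaps.append(ordered[0])
    gaps.append(total_frames - ordered[-1])
    return max(gaps)


def keyframes_dense_enough(keyframes, total_frames, limit=15):
    # `limit` = 15 is an assumption: the upper end of the quoted range.
    return max_keyframe_gap(keyframes, total_frames) <= limit


print(keyframes_dense_enough([0, 12, 24, 36], total_frames=48))  # True
print(keyframes_dense_enough([0, 30], total_frames=60))          # False
```

A stream that fails this check may need to be re-encoded with a shorter key-frame interval before it is used with the video loader.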

Binary builds

Install via pip for CUDA 10.2:
pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-cuda102==1.19.0
pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-tf-plugin-cuda102==1.19.0

or for CUDA 11:

CUDA 11.0 build uses CUDA toolkit enhanced compatibility. It is built with the latest CUDA 11.x toolkit
while it can run on the latest, stable CUDA 11.0 capable drivers (450.80 or later). 
Using the latest driver may enable additional functionality. 
More details can be found in enhanced CUDA compatibility guide.

pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-cuda110==1.19.0
pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-tf-plugin-cuda110==1.19.0

Or use direct download links (CUDA 10.2):

Or use direct download links (CUDA 11.0):

FFmpeg source code:

  • This software uses code of FFmpeg licensed under the LGPLv2.1 and its source can be downloaded here

Libsndfile source code:

DALI v1.18.0

05 Oct 17:19

Key Features and Enhancements

This DALI release includes the following key features and enhancements:

  • Unified batch representation in the GPU and CPU stages of the pipeline (effort towards conditional execution) (#4253, #4236, #4220, #4189).
  • Added support to specify the fill_value argument for each sample in the fn.erase operator (#4182).
  • Added support for the memory video file in FramesDecoder (#4184).
  • Moved the audio_resample operator out of experimental module (#4194).

Fixed Issues

The following issues were fixed in this release:

  • Fixed an unnecessary synchronization in MakeContiguous (#4248).
  • Fixed the Python tool to create the webdataset index (#4226).
  • Added a fix to prevent DALI from allocating GPU memory when constructing CPU TensorList (#4203).
  • Fixed a PyTorch example to comply with newer PyTorch versions (#4213).

Improvements

  • GPU image data conversion (#4208)
  • Fix libtiff and libtar vulnerabilities (#4245)
  • Update third party dependencies (#4233)
  • Reduce batch size in the WebDataset integration using External Source example (#4240)
  • Rename the set and copy sample APIs in TensorList (#4236)
  • Move nvjpeg decoder files to imgcodec/decoders/nvjpeg/ (#4235)
  • Add Nvjpeg decoder (#4178)
  • Rename TensorVector to TensorList (#4220)
  • Make JPEG HW decoder test to fully use HW and not hybrid approach (#4222)
  • Add bulk parameter passing to decoders and factories. (#4212)
  • Support any bitdepth in TIFF (#4180)
  • Remove TensorList and use only TensorVector (#4189)
  • [imgcodec] API adjustments (#4205)
  • ROI support for nvjpeg2k decoder (#4175)
  • Use deprecated PIL resampling import for Python 3.6, due to lack of availability of a newer version of PIL (#4200)
  • Add arithmetic expression broadcasting utils (#4188)
  • Support higher TIFF bitdepths (#4174)
  • Enable per-sample fill_value argument in Erase operator (#4182)
  • Fix python linter errors for the qa/ directory (#4117)
  • Fix usage of deprecated np.float in tests (#4192)
  • Adjust PIL interpolation types to module PIL.Image.Resampling (#4195)
  • Move audio_resample out of experimental module (#4194)
  • Support different layouts in imgcodec's Convert (#4157)
  • Fix typos in iterator last_batch_policy argument documentation (#4170)
  • Fix synchronization in external source tests (#4153)
  • Add support for memory video file in FramesDecoder (#4184)
  • Support outputting YCbCr in libjpeg-turbo decoder (#4156)
  • Use std::exchange in move operator for Tensors (#4183)

Bug Fixes

  • Unify buffers caching in CPU/GPU external source (#4253)
  • Fix builds without nvJPEG (#4252)
  • Separate nvjpeg lib wrapper and stub from the decoder (#4249)
  • Prevent unnecessary synchronization in MakeContiguous. (#4248)
  • Do not leak DecodeParams (#4242)
  • Fix AssertClose bug in Imgcodec tests (#4243)
  • Fix bug in CPU Convert (#4237)
  • Fix webdataset python index creation script (#4226)
  • Fix In memory video decoding tests (#4216)
  • Fix UnpackBits (#4227)
  • Fix issues detected by Coverity. (#4221)
  • Make TensorList constructor for CPU not using GPU memory (#4203)
  • Fix the indexing for newer PyTorch (#4213)
  • Fix possibly incorrect parallel write access to vector (#4211)
  • Fix Layout propagation in TensorVector (#4202)

Breaking API changes

There are no breaking changes in this DALI release.

Deprecated features

There are no deprecated features in this DALI release.

Known issues:

  • The video loader operator requires that key frames occur at least once every 10 to 15 frames of the video stream.
    If key frames occur less often than that, the returned frames might be out of sync.
  • Experimental VideoReaderDecoder does not support open GOP.
    It will not report an error and might produce invalid frames. VideoReader uses a heuristic approach to detect open GOP and should work in most common cases.
  • The DALI TensorFlow plugin might not be compatible with TensorFlow versions 1.15.0 and later.
    To use DALI with the TensorFlow version that does not have a prebuilt plugin binary shipped with DALI, make sure that the compiler that is used to build TensorFlow exists on the system during the plugin installation. (Depending on the particular version, you can use GCC 4.8.4, GCC 4.8.5, or GCC 5.4.)
  • In experimental debug and eager modes, the GPU external source is not properly synchronized with DALI internal streams.
    As a workaround, you can manually synchronize the device before returning the data from the callback.
  • Due to some known issues with Meltdown/Spectre mitigations and DALI, DALI shows best performance when running in Docker with escalated privileges, for example:
    • privileged=yes in Extra Settings for AWS data points
    • --privileged or --security-opt seccomp=unconfined for bare Docker.

Binary builds

Install via pip for CUDA 10.2:
pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-cuda102==1.18.0
pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-tf-plugin-cuda102==1.18.0

or for CUDA 11:

CUDA 11.0 build uses CUDA toolkit enhanced compatibility. It is built with the latest CUDA 11.x toolkit
while it can run on the latest, stable CUDA 11.0 capable drivers (450.80 or later). 
Using the latest driver may enable additional functionality. 
More details can be found in enhanced CUDA compatibility guide.

pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-cuda110==1.18.0
pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-tf-plugin-cuda110==1.18.0

Or use direct download links (CUDA 10.2):

Or use direct download links (CUDA 11.0):

FFmpeg source code:

  • This software uses code of FFmpeg licensed under the LGPLv2.1 and its source can be downloaded here

Libsndfile source code:

DALI v1.17.0

05 Oct 17:18

Key Features and Enhancements

This DALI release includes the following key features and enhancements:

  • Added CUDA 11.8 support.
  • Improved color conversion performance and precision (#4139).
  • Laid the groundwork for ongoing conditional execution effort (#4149, #4124, #4083, #3827, #4049).
  • Laid the groundwork for ongoing effort on improved decoding and processing of images.
  • Documentation improvements (#4168, #4102, #4059, #4094).

Fixed Issues

The following issues were fixed in this release:

  • Fixed the default dtype in the color twist family of operators (#4067).
  • Fixed handling of TIFFs with palette (#4089).

Improvements

  • Separating nvjpeg2k utils in imgcodec (#4160)
  • Add NvJpeg2000Decoder (#4114)
  • Port operators Python tests to nose2 (#4037)
  • Refactor Tensor Vector (#4149)
  • Rename ImageDecoder to ImageDecoderFactory. (#4169)
  • Add section on deferred setup and shm limit to PES docs (#4168)
  • Change pinned version of matplotlib (#4167)
  • Add LibTIFF decoder (#4109)
  • Make decoder_test_helper.h accept TensorView (#4154)
  • Update dependencies (#4152)
  • Add color conversion support (#4143)
  • Extend the ImageDecoder testing framework to support GPU decoders (#4142)
  • Add color space conversion to imgcodec (#4121)
  • Fix CVE-2022-34526 (#4133)
  • Copy nvjpeg utils into imgcodec (#4148)
  • Fix linter for files inside the dali_tf_plugin directory (#4118)
  • Add LibJpegTurboDecoder (#4099)
  • Color conversion - optimizations and tests (#4139)
  • Move to CUDA 11.7U1 (#4137)
  • Remove pageable copies from Convolution, Transpose and Warp kernels. (#4141)
  • Add AsTensor and related APIs to Tensor Vector (#4124)
  • [imgcodec] Add thread index and cuda stream to Decode APIs (#4128)
  • Move operator test files (#4125)
  • Silence some constexpr-related warnings in NVCC 10. (#4131)
  • Move libjpeg-turbo utils/impl to imgcodec directory (#4129)
  • Add missing constexpr to vec and mat. (#4130)
  • Parse EXIF metadata in PNG imgcodec parser (#4122)
  • Add parenthesis to assert to avoid using \ (#4123)
  • Fix error reported by flake8 5.0.1 (#4120)
  • Turn Python linter on by default (#3997)
  • Add decoder test framework (#4103)
  • Add dali namespace to third_party copy of OpenCV's exif (#4112)
  • Parsing EXIF metadata in WebP images (#4087)
  • Add PNG parser (#4052)
  • Fix OpenCV warning in jpeg compression distortion tests (#4107)
  • Document unsupported external source arguments in TF Dataset (#4102)
  • Add boilerplate synchronization for batch copying (#4083)
  • Pin Numba version to 0.55.2 (#4108)
  • Example image decoder using OpenCV (#4036)
  • Remove signal handler for SIGKILL (#4015)
  • Extract common functions from numpy reader (#4100)
  • Add JPEG EXIF parser (#4073)
  • Remove video reader warning that a frame has been seen twice (#4092)
  • Remove unnecessary logging from resize checkerboard tests (#4086)
  • Add Jpeg2000 parser (#4068)
  • Fix flake8 warnings (#4074)
  • Fix & extend formatting of collections. (#4082)
  • Add inherited members to the Pytorch plugin docs (#4094)
  • Adjust Doxygen configuration (#4088)
  • Add imgcodec compatibility tests (#4057)
  • Add restrictions to set_type (#4071)
  • Add WebP parser (#4053)
  • Add JPEG Parser (#4050)
  • Silence buggy GCC warning about freeing non-heap objects. (#4077)
  • Add a tool for testing Imgcodec against ImageMagick (#4058)
  • BMP parser (#4062)
  • Make endian swapping work with ADL. (#4075)
  • Add utilities for swapping endianness. (#4069)
  • Add PNM parser (#4044)
  • Add references to image_processing/index. Add optional ordering to references. (#4059)
  • Extract EXIF parser from OpenCV (#4063)
  • Fix ifndef guards to be at the end of the file (#4064)
  • Stop exposing internal contiguous TV storage (#3827)
  • ReadValue extension to support enums (#4060)
  • Propagate device_id in ShareData and SetSample APIs (#4049)
  • Add TIFF parser (#4040)
  • Make the DALI video reader throw an exception when the VFR video is decoded (#4022)
  • Add ReadHeader util to parser baseclass (#4042)

Bug Fixes

  • Prevent excessive synchronization in MakeContiguous (#4228)
  • Prevent overflow in random_resized_crop tests (#4187)
  • Fix invalid destruction order in decoder test helper (#4186)
  • Added missing const in for loops (#4185)
  • Fix coverity issues (#4164)
  • Conditional compilation of TIFF Codec (#4166)
  • Fix zlib CVE-2022-37434 (#4150)
  • Pin matplotlib version to 3.5.2 (#4159)
  • Fix parsing of grayscale bitmaps (#4147)
  • Install flake8 for xavier builds (#4127)
  • Fix handling of TIFFs with palette (#4089)
  • Fix missing override in decoder test (#4105)
  • Disable HEVC tests for FramesDecoderGpu when it is not supported by the GPU (#4084)
  • Fix default dtype in color twist family of operators (#4067)
  • Fix libtiff CVE-2022-2058, CVE-2022-2057, CVE-2022-2056 (#4047)

Breaking API changes

There are no breaking changes in this DALI release.

Deprecated features

There are no deprecated features in this DALI release.

Known issues:

  • The video loader operator requires that key frames occur at least once every 10 to 15 frames of the video stream.
    If key frames occur less often than that, the returned frames might be out of sync.
  • Experimental VideoReaderDecoder does not support open GOP.
    It will not report an error and might produce invalid frames. VideoReader uses a heuristic approach to detect open GOP and should work in most common cases.
  • The DALI TensorFlow plugin might not be compatible with TensorFlow versions 1.15.0 and later.
    To use DALI with the TensorFlow version that does not have a prebuilt plugin binary shipped with DALI, make sure that the compiler that is used to build TensorFlow exists on the system during the plugin installation. (Depending on the particular version, you can use GCC 4.8.4, GCC 4.8.5, or GCC 5.4.)
  • In experimental debug and eager modes, the GPU external source is not properly synchronized with DALI internal streams.
    As a workaround, you can manually synchronize the device before returning the data from the callback.
  • Due to some known issues with Meltdown/Spectre mitigations and DALI, DALI shows best performance when running in Docker with escalated privileges, for example:
    • privileged=yes in Extra Settings for AWS data points
    • --privileged or --security-opt seccomp=unconfined for bare Docker.

Binary builds

Install via pip for CUDA 10.2:
pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-cuda102==1.17.0
pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-tf-plugin-cuda102==1.17.0

or for CUDA 11:

CUDA 11.0 build uses CUDA toolkit enhanced compatibility. It is built with the latest CUDA 11.x toolkit
while it can run on the latest, stable CUDA 11.0 capable drivers (450.80 or later). 
Using the latest driver may enable additional functionality. 
More details can be found in enhanced CUDA compatibility guide.

pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-cuda110==1.17.0
pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-tf-plugin-cuda110==1.17.0

Or use direct download links (CUDA 10.2):

Or use direct download links (CUDA 11.0):

FFmpeg source code:

  • This software uses code of FFmpeg licensed under the LGPLv2.1 and its source can be downloaded here

Libsndfile source code:

DALI v1.16.1

26 Aug 15:16

Key Features and Enhancements

This release includes bug fixes, so there are no new features or enhancements.

Fixed Issues

The following issues were fixed in this release:

  • Fixed a memory leak in fn.decoders.image on corrupted images (#4138).
    • A memory leak in the libjpeg-turbo decoder implementation in case of corrupted images was fixed.
  • Fixed a crash in fn.readers.numpy when pad_last_batch is set and more than one thread is used by DALI (#4056).
  • Fixed a faulty check that prevented the feed_input method from working after the pipeline was deserialized (#4096).

Improvements

  • None

Bug Fixes

  • Fix pad_last_batch in GPU NumpyReader (#4056)
  • Fix feed_input after deserialization (#4096)
  • Fix memory leak in libjpeg-turbo decoder implementation in case of corrupted images (#4138)
  • Add zlib to conda recipe (#4173)
  • Fix Numba versions in tests (#4111)
  • Fix device pick in Numpy reader tests (#4104)

Breaking API changes

There are no breaking changes in this DALI release.

Deprecated features

There are no deprecated features in this DALI release.

Known issues:

  • The video loader operator requires that key frames occur at least once every 10 to 15 frames of the video stream.
    If key frames occur less often than that, the returned frames might be out of sync.
  • Experimental VideoReaderDecoder does not support open GOP.
    It will not report an error and might produce invalid frames. VideoReader uses a heuristic approach to detect open GOP and should work in most common cases.
  • The DALI TensorFlow plugin might not be compatible with TensorFlow versions 1.15.0 and later.
    To use DALI with the TensorFlow version that does not have a prebuilt plugin binary shipped with DALI, make sure that the compiler that is used to build TensorFlow exists on the system during the plugin installation. (Depending on the particular version, you can use GCC 4.8.4, GCC 4.8.5, or GCC 5.4.)
  • In experimental debug and eager modes, GPU external source is not properly synchronized with DALI internal streams. As a workaround, the user may manually synchronize the device before returning the data from the callback.
  • Due to some known issues with Meltdown/Spectre mitigations and DALI, DALI shows best performance when running in Docker with escalated privileges, for example:
    • privileged=yes in Extra Settings for AWS data points
    • --privileged or --security-opt seccomp=unconfined for bare Docker.

Binary builds

Install via pip for CUDA 10.2:
pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-cuda102==1.16.1
pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-tf-plugin-cuda102==1.16.1

or for CUDA 11:

CUDA 11.0 build uses CUDA toolkit enhanced compatibility. It is built with the latest CUDA 11.x toolkit
while it can run on the latest, stable CUDA 11.0 capable drivers (450.80 or later). 
Using the latest driver may enable additional functionality. 
More details can be found in enhanced CUDA compatibility guide.

pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-cuda110==1.16.1
pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-tf-plugin-cuda110==1.16.1

Or use direct download links (CUDA 10.2):

Or use direct download links (CUDA 11.0):

FFmpeg source code:

  • This software uses code of FFmpeg licensed under the LGPLv2.1 and its source can be downloaded here

Libsndfile source code:

DALI v1.16.0

25 Jul 12:38
83da787

Key Features and Enhancements

This DALI release includes the following key features and enhancements:

  • Added GPU non-silent region detection operator (#3944, #4001).
  • Added experimental support for the eager execution of stateful operators and arithmetic operators (#4016, #3952, #3969, #3990).
  • Added antialias flag to Resize operator for improved control over resampling mode used (#4032).
  • Added experimental support for custom GPU Numba operators (#3891, #3998, #4006, #4013).
  • Added support for processing video and handling of temporal arguments to color-manipulation operators and affine transform operators (#3937, #3946, #3917).

Fixed Issues

The following issues were fixed in this release:

  • Fixed DALI + PyTorch Lightning iterator issue resulting in subsequent epochs terminating too early (#3923, #4048).
  • Fixed scalars handling by the readers.tfrecord operator (#4024).
  • Fixed variable batch size handling by the crop and coord_transform operators (#4045, #3958).
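
As a rough illustration of the iterator fix listed above (calling reset() when iter() is invoked), the toy class below shows how resetting inside `__iter__` lets a new loop start from the beginning even when the previous loop stopped part-way. `EpochIterator` is a hypothetical stand-in written for this note, not DALI's iterator, and the conditions under which the real iterator resets may differ.

```python
class EpochIterator:
    """Toy stand-in for a framework iterator (hypothetical, not DALI's class)."""

    def __init__(self, batches):
        self._batches = list(batches)
        self._pos = 0

    def reset(self):
        # Rewind to the start of the data for a new epoch.
        self._pos = 0

    def __iter__(self):
        # Resetting here means every `for` loop starts a full epoch,
        # even if the previous loop was abandoned half-way.
        self.reset()
        return self

    def __next__(self):
        if self._pos >= len(self._batches):
            raise StopIteration
        batch = self._batches[self._pos]
        self._pos += 1
        return batch


it = EpochIterator(["b0", "b1", "b2"])
first_epoch = []
for batch in it:
    first_epoch.append(batch)
    if batch == "b1":
        break  # the loop is abandoned before the data is exhausted
second_epoch = list(it)
print(first_epoch, second_epoch)
```

Without the `reset()` call in `__iter__`, the second loop in this toy would only see the leftover batch, which is one way an epoch can end earlier than expected.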

Improvements

  • Add little-endian and big-endian read functions for InputStreams (#4038)
  • Add antialias flag to Resize (#4032)
  • Reformat python files (#4026)
  • Python formatting (#4035)
  • Enable nose2 in Python Tests (#4033)
  • Imgcodec module boilerplate (interfaces/placeholders/basic logic) (#4029)
  • Remove deprecated option options.experimental_optimization.map_vectorization.enabled (#4027)
  • Guided contribution tutorial (#4011)
  • Fix python formatting (#3982)
  • Add eager mode stateful operators (#4016)
  • Disable Numba GPU op for incompatible Numba versions (#4025)
  • Add missing quote marks to the DALI_AFFINITY_MASK usage example (#4020)
  • Add abstract InputStream. Refactor existing FileStreams to use it. (#4019)
  • Make DALI iterator to call reset() when iter() is called upon it (#3923)
  • Add eager mode operators coverage test (#3952)
  • Add ack for Numba GPU op (#3998)
  • Add eager mode arithm ops (#3969)
  • Reduce DALI conda package installation time (#3995)
  • Add Non-silent region GPU operator (#3944)
  • Workaround for nosetests in Python 3.10 (#3986)
  • Numba cuda operator (#3891)
  • Fix Python formatting (#3992)
  • Fix Python formatting (#3988)
  • Add examples of processing video that utilize per-frame operator (#3917)
  • Per frame affine transforms (#3946)
  • Handle partially pruned multi-output external sources (#3975)
  • Dependencies update (#3979)
  • Doxygen typo (#3989)
  • Add per frame parameters support to brightness_contrast and color_twist families (#3937)
  • Fix missing return (#3985)
  • Support vector alike output for OpSpec::TryGetRepeatedArgument (#3851)
  • Fix Python formatting (#3962)
  • Fix and reenable optimized Cast kernel (#3976)

Bug Fixes

  • Fix lack of reset when iter() is called on the DALI framework iterator (#4048)
  • Use actual batch size instead of max batch size in crop_attr.h (#4045)
  • Support scalars in readers.tfrecord (#4024)
  • Add const char* ctor to ThreadPool (#4005)
  • Remove unconditional float16 type mapping in Numba GPU op (#4013)
  • Change flake8 config (#4004)
  • Fix Numba CI issues (#4006)
  • Fix and simplify moving mean squares CPU kernel. (#4001)
  • Fix nan check and unused external source arguments in debug mode (#3990)
  • Fix fn.coord_transform handling of a default matrix in variable batch case (#3958)
  • Fix test_dali_tf_dataset_mnist_eager test (#3991)
  • Fix test_dali_tf_dataset_mnist_eager.py and test_dali_tf_dataset_mnist_graph.py tests (#3987)
  • Improve handling of "dtype" arguments in OpSchema/OpSpec (#3981)

Breaking API changes

  • The shape of scalars read by the readers.tfrecord operator is now () instead of (1,).
  • For cubic and linear interpolation modes, the resize operator now applies the antialiasing filter by default. Antialiasing can be turned off with the antialias flag.

Deprecated features

  • The triangular interpolation for resize operator has been deprecated as it is equivalent to linear interpolation with antialiasing on.
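
To make the relationship between linear interpolation, antialiasing and the triangular filter more concrete, here is a small 1D sketch in plain Python. It is a generic textbook-style illustration under stated assumptions, not DALI's kernel: `resize_linear_1d` is a hypothetical name, pixel-centre alignment and border clamping are choices made for this example, and the only point is that widening the triangle by the downscale factor keeps fine detail from being skipped.

```python
import math


def resize_linear_1d(signal, out_size, antialias=True):
    """Resample a list of floats to `out_size` samples with a triangular kernel."""
    in_size = len(signal)
    scale = in_size / out_size  # > 1 means downscaling
    # Without antialiasing the triangle always spans one input sample on each
    # side (plain linear interpolation). With antialiasing it is widened by
    # the downscale factor, so samples between output positions are not skipped.
    support = max(scale, 1.0) if antialias else 1.0
    out = []
    for i in range(out_size):
        center = (i + 0.5) * scale - 0.5  # output sample centre in input coordinates
        lo = math.floor(center - support)
        hi = math.ceil(center + support)
        acc = 0.0
        weight_sum = 0.0
        for j in range(lo, hi + 1):
            w = max(0.0, 1.0 - abs(j - center) / support)
            if w == 0.0:
                continue
            jj = min(max(j, 0), in_size - 1)  # clamp to the signal border
            acc += w * signal[jj]
            weight_sum += w
        out.append(acc / weight_sum)
    return out


spike = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
print(resize_linear_1d(spike, 2, antialias=False))  # [0.0, 0.0]: the spike is lost
print(resize_linear_1d(spike, 2, antialias=True))   # first sample is non-zero
```

In this toy setting, the antialiased path is exactly a triangular filter whose width follows the scale factor, which is the sense in which the two modes coincide.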

Known issues:

  • The video loader operator requires that key frames occur at least once every 10 to 15 frames of the video stream.
    If key frames occur less often than that, the returned frames might be out of sync.
  • Experimental VideoReaderDecoder does not support open GOP.
    It will not report an error and might produce invalid frames. VideoReader uses a heuristic approach to detect open GOP and should work in most common cases.
  • The DALI TensorFlow plugin might not be compatible with TensorFlow versions 1.15.0 and later.
    To use DALI with the TensorFlow version that does not have a prebuilt plugin binary shipped with DALI, make sure that the compiler that is used to build TensorFlow exists on the system during the plugin installation. (Depending on the particular version, you can use GCC 4.8.4, GCC 4.8.5, or GCC 5.4.)
  • In experimental debug and eager modes, GPU external source is not properly synchronized with DALI internal streams. As a workaround, the user may manually synchronize the device before returning the data from the callback.
  • Due to some known issues with Meltdown/Spectre mitigations and DALI, DALI shows best performance when running in Docker with escalated privileges, for example:
    • privileged=yes in Extra Settings for AWS data points
    • --privileged or --security-opt seccomp=unconfined for bare Docker.

Binary builds

Install via pip for CUDA 10.2:
pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-cuda102==1.16.0
pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-tf-plugin-cuda102==1.16.0

or for CUDA 11:

CUDA 11.0 build uses CUDA toolkit enhanced compatibility. It is built with the latest CUDA 11.x toolkit
while it can run on the latest, stable CUDA 11.0 capable drivers (450.80 or later). 
Using the latest driver may enable additional functionality. 
More details can be found in enhanced CUDA compatibility guide.

pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-cuda110==1.16.0
pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-tf-plugin-cuda110==1.16.0

Or use direct download links (CUDA 10.2):

Or use direct download links (CUDA 11.0):

FFmpeg source code:

  • This software uses code of FFmpeg licensed under the LGPLv2.1 and its source can be downloaded here

Libsndfile source code:

DALI v1.15.0

22 Jun 14:31

Key Features and Enhancements

This DALI release includes the following key features and enhancements:

  • Added the GPU audio resampling operator (#3884, #3914, #3911).
  • Improved the performance of the GPU fn.readers.numpy by custom GDS staging (#3894, #3905).
  • Added support for video processing and per-frame (temporal) arguments to the warp_affine operator (#3879, #3900).
  • Added HEVC support to the GPU frames decoder (#3896).
  • Added experimental support for the eager execution of stateless operators as Python functions and readers as iterators (#3887, #3930).
  • Added CUDA 11.7 support (#3906).
  • Profiling improvements:
    • Added more NVTX ranges to the executor (#3928)
    • Added thread names to all DALI threads (#3912)

Fixed Issues

The following issues were fixed in this release:

  • Added the missing device/device synchronization when copying pipeline outputs with copy_to_external (#3953).
  • Fixed the buffer synchronization between default and custom stream in a multi-GPU case (#3957).

Improvements

  • Fix Python formatting (#3961)
  • Fix coverity issues (#3974)
  • Add FindReduceGPU and FindRegionGPU kernels (#3951)
  • Fix Python formatting (#3965)
  • Add .style.yapf file (#3970)
  • Update Optical Flow example (#3971)
  • Fix per frame pass through (#3959)
  • Fixing Python code formatting (#3948)
  • Suppress the use of a staging buffer for nvJPEG input if it's already pinned. (#3956)
  • Fix cyclic dependency import problem in fn.py in python 3.6 (#3955)
  • Refactor qa test scripts (#3933)
  • Change thread pool creation for eager operators to lazy (#3931)
  • Fix sequence shape test (#3949)
  • Expose readers as iterators in eager mode (#3930)
  • Add Python linter (#3929)
  • Remove redundant quote marks from the protobuf version specifier (#3945)
  • Skip GDS tests when the GPU is incompatible. (#3941)
  • Add sequence processing to warp operator (#3879)
  • Add MovingMeanSquareGpu kernel (#3922)
  • Pin protobuf to <4 for Paddle Paddle (#3940)
  • Update compilation flags for the DALI TensorFlow plugin (#3943)
  • Change MultiDevice to MultiGpu test suffix (#3942)
  • Bump up the nvidia-tensorflow version to 20.05 in tests (#3938)
  • Add FindFirstLastGPU kernel (#3932)
  • Adjust PR template to ask for listing existing tests that apply (#3939)
  • Pin protobuf to <4 (#3934)
  • Add VFR detection (#3921)
  • Fix CVE-2022-0562 in libtiff (#3925)
  • Update RNN-T pipeline tests to include GPU resampling and silence detection (#3920)
  • Add more NVTX ranges to the executor (#3928)
  • Add HEVC support for FramesDecoderGpu (#3896)
  • Add a thread name to all DALI threads (#3912)
  • Add dataclasses pip package to tests deps to fix Python3.6 operator tests (#3926)
  • Add fn.experimental.audio_resample GPU (#3911)
  • Custom staging for GDS (#3894)
  • Update the readme roadmap link to use 2022 one (#3918)
  • Support specifying per-frame positional arguments in sequence processing test utility (#3901)
  • Move audio resampler CPU implementation to a single compilation unit (#3914)
  • Add stateless CPU eager operators (#3887)
  • Add CUDA 11.7 support (#3906)
  • Add VideoReaderDecoder test for missing labels (#3908)
  • Add signal resampling GPU kernel (#3884)
  • Optimize parameter passing for ScatterGather GPU (#3905)
  • Add references to ops documentation in the tutorials (#3904)
  • Enable per-frame operator on GPU (#3900)

Bug Fixes

  • Fix dltensor operator tests (#3984)
  • Prevent clobbering of outputs before non-blocking copy_to_external finishes. (#3953)
  • Fix a bug in AccessOrder when synchronizing with a default stream on the same device, which is not the current device. (#3957)
  • Workaround for GDS memory leak in GDSMem tests. (#3936)
  • Fix circular imports in eager mode (#3919)
  • Remove intermediate Tensor and use DynamicScratchpad for op tile descriptors. (#3915)
  • Add missing moving of order in TensorVector's move assignment/constructor (#3899)

Breaking API changes

There are no breaking changes in this DALI release.

Deprecated features

There are no deprecated features in this DALI release.

Known issues:

  • The video loader operator requires that key frames occur at least once every 10 to 15 frames of the video stream.
    If key frames occur less often than that, the returned frames might be out of sync.
  • Experimental VideoReaderDecoder does not support open GOP.
    It will not report an error and might produce invalid frames. VideoReader uses a heuristic approach to detect open GOP and should work in most common cases.
  • The DALI TensorFlow plugin might not be compatible with TensorFlow versions 1.15.0 and later.
    To use DALI with the TensorFlow version that does not have a prebuilt plugin binary shipped with DALI, make sure that the compiler that is used to build TensorFlow exists on the system during the plugin installation. (Depending on the particular version, you can use GCC 4.8.4, GCC 4.8.5, or GCC 5.4.)
  • Due to some known issues with Meltdown/Spectre mitigations and DALI, DALI shows best performance when running in Docker with escalated privileges, for example:
    • privileged=yes in Extra Settings for AWS data points
    • --privileged or --security-opt seccomp=unconfined for bare Docker.

Binary builds

Install via pip for CUDA 10.2:
pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-cuda102==1.15.0
pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-tf-plugin-cuda102==1.15.0

or for CUDA 11:

CUDA 11.0 build uses CUDA toolkit enhanced compatibility. It is built with the latest CUDA 11.x toolkit
while it can run on the latest, stable CUDA 11.0 capable drivers (450.80 or later). 
Using the latest driver may enable additional functionality. 
More details can be found in enhanced CUDA compatibility guide.

pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-cuda110==1.15.0
pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-tf-plugin-cuda110==1.15.0

Or use direct download links (CUDA 10.2):

Or use direct download links (CUDA 11.0):

FFmpeg source code:

  • This software uses code of FFmpeg licensed under the LGPLv2.1 and its source can be downloaded here

Libsndfile source code: