
Introduce DALI proxy #5726

Open · wants to merge 22 commits into main from dali_proxy2

Conversation

@jantonguirao (Contributor) commented Nov 27, 2024

Category:

New feature

Description:

  • DALI proxy is a new way to integrate DALI pipelines with existing PyTorch data loading pipelines.
  • The idea is that PyTorch data worker processes send data to the main process, where it is processed by DALI before being handed over to the training loop (see the sketch below).
  • The solution allows mixing data loading in PyTorch with partial processing in DALI.
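A rough end-to-end sketch of the intended usage pattern. The `DALIServer` and `DataLoader` names come from the discussion below; the `dali_proxy` import path and the `dali_server.proxy` per-sample callable are assumptions for illustration, not the confirmed API:

```python
# Hypothetical usage sketch -- not the confirmed API surface of this PR.
import numpy as np
import torch.utils.data
from nvidia.dali import pipeline_def, fn, types
from nvidia.dali.plugin.pytorch import dali_proxy  # import path assumed

@pipeline_def
def decode_pipeline():
    # Encoded images arrive from the PyTorch workers through this named input.
    jpegs = fn.external_source(name="images", no_copy=True)
    images = fn.decoders.image(jpegs, device="mixed", output_type=types.RGB)
    return fn.resize(images, size=[224, 224])

class RawJpegDataset(torch.utils.data.Dataset):
    """Plain PyTorch dataset: reads encoded files and hands them to the proxy."""
    def __init__(self, filenames, labels, proxy):
        self.filenames, self.labels, self.proxy = filenames, labels, proxy
    def __len__(self):
        return len(self.filenames)
    def __getitem__(self, idx):
        encoded = np.fromfile(self.filenames[idx], dtype=np.uint8)
        # The proxy call returns a placeholder; DALI runs later in the main process.
        return self.proxy(encoded), self.labels[idx]

filenames, labels = ["img0.jpg", "img1.jpg"], [0, 1]  # example data
pipe = decode_pipeline(batch_size=64, num_threads=4, device_id=0)

with dali_proxy.DALIServer(pipe) as dali_server:
    dataset = RawJpegDataset(filenames, labels, dali_server.proxy)
    loader = dali_proxy.DataLoader(dali_server, dataset, batch_size=64, num_workers=4)
    for images, targets in loader:
        ...  # training step on already-decoded, resized batches
```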

Co-author: @mdabek-nvidia

Additional information:

Affected modules and functionalities:

  • Torch plugin

Key points relevant for the review:

dali/python/nvidia/dali/plugin/pytorch/__init__.py

Tests:

  • Existing tests apply
  • New tests added
    • Python tests
    • GTests
    • Benchmark
    • Other
  • N/A
    Added options to run the RN50 and EfficientNet examples with DALI proxy

Checklist

Documentation

  • Existing documentation applies
  • Documentation updated
    • Docstring
    • Doxygen
    • [?] RST
    • [?] Jupyter
    • Other
  • N/A
    TODO

DALI team only

Requirements

  • Implements new requirements
  • Affects existing requirements
  • N/A

REQ IDs: N/A

JIRA TASK: N/A

@dali-automaton (Collaborator): CI MESSAGE: [20865442]: BUILD STARTED

@dali-automaton (Collaborator): CI MESSAGE: [20865442]: BUILD PASSED

@jantonguirao jantonguirao changed the title Dali proxy2 Introduce DALI proxy Nov 28, 2024
@jantonguirao jantonguirao marked this pull request as ready for review November 28, 2024 09:21
dali/python/nvidia/dali/plugin/pytorch/__init__.py — 10 outdated review threads, resolved
@szkarpinski (Collaborator):
I see there are no tests except for the resnet50 example. I believe we should have normal TL0 tests as well.

@szkarpinski (Collaborator):
Did you test error propagation between the processes? Multiprocessing doesn't automagically propagate exceptions, AFAIK. Maybe we should have tests that check how errors in particular processes are reported?
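For context, a minimal generic illustration of the point above (plain Python, not DALI code): an exception raised in a multiprocessing worker does not reach the parent process unless it is explicitly captured and forwarded, e.g. through a queue.

```python
import multiprocessing as mp
import traceback

def worker(error_queue):
    try:
        raise ValueError("something went wrong in the worker")
    except Exception:
        # Without this explicit forwarding, the parent would only observe a
        # finished (or dead) process, not the original traceback.
        error_queue.put(traceback.format_exc())

if __name__ == "__main__":
    errors = mp.Queue()
    p = mp.Process(target=worker, args=(errors,))
    p.start()
    p.join()
    if not errors.empty():
        print("worker failed:\n" + errors.get())
```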

@jantonguirao jantonguirao force-pushed the dali_proxy2 branch 2 times, most recently from caf2d0b to 0472490, on November 29, 2024 at 10:28
@klecki (Contributor) left a comment:
Posting a few comments as the code started moving.
I also feel like we should limit the scope of the API and hide most of the implementation.

dali/python/setup.py.in — outdated review thread, resolved
dali/python/nvidia/dali/plugin/pytorch/__init__.py — 2 outdated review threads, resolved
@jantonguirao jantonguirao force-pushed the dali_proxy2 branch 7 times, most recently from c4a7f74 to 1a50983, on December 3, 2024 at 17:52
@dali-automaton (Collaborator): CI MESSAGE: [21050826]: BUILD STARTED

@dali-automaton (Collaborator): CI MESSAGE: [21050826]: BUILD FAILED

@jantonguirao jantonguirao force-pushed the dali_proxy2 branch 2 times, most recently from 3bdfd37 to 014032d, on December 3, 2024 at 18:32
@dali-automaton (Collaborator): CI MESSAGE: [21052236]: BUILD STARTED

@dali-automaton (Collaborator): CI MESSAGE: [21052236]: BUILD FAILED

@klecki (Contributor) left a comment:
I still haven't read all the tests, but I'm posting more comments on the implementation.


**DALI Proxy** is a tool designed to integrate NVIDIA DALI pipelines with PyTorch data workers while maintaining the simplicity of PyTorch's dataset logic. The key features of DALI Proxy include:

- **Efficient GPU Utilization**: DALI Proxy ensures GPU data processing occurs on the same process running the main loop. This avoids performance degradation caused by multiple CUDA contexts for the same GPU.
Contributor:
Suggested change
- **Efficient GPU Utilization**: DALI Proxy ensures GPU data processing occurs on the same process running the main loop. This avoids performance degradation caused by multiple CUDA contexts for the same GPU.
- **Efficient GPU Utilization**: DALI Proxy ensures GPU data processing occurs in the same process running the main loop. This avoids performance degradation caused by multiple CUDA contexts for the same GPU.

?

- Each data worker invokes the proxy, which returns a **reference to a future processed sample**.
- During batch collation, the proxy groups data into a batch and sends it to the server for execution.
- The server processes the batch asynchronously and outputs the actual data to an output queue.
- The PyTorch DataLoader retrieves either the processed data or references to pending pipeline runs. If it encounters pipeline run references, it queries the DALI server for the actual data, waiting if necessary until the data becomes available in the output queue.
Contributor:
This sentence is too complicated. I would stick with the simplified user POV: you call the proxy in the worker to offload data for processing with DALI and put a placeholder for the result. When the data loader returns the processed data, it replaces the placeholders with the actual results from the DALI pipeline. Skip the "if it encounters pipeline run references" and the waiting parts here.
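To make the described flow concrete, here is a small self-contained toy sketch of the placeholder/future mechanism outlined above. It uses a thread and plain Python objects and is in no way the actual DALI implementation:

```python
# Toy illustration of the placeholder/future flow -- not the DALI implementation.
import threading
import queue
from dataclasses import dataclass

@dataclass(frozen=True)
class PendingRun:
    """Placeholder returned to the data worker; resolved later by the loader."""
    run_id: int

class ToyServer:
    def __init__(self):
        self._requests = queue.Queue()   # batches waiting to be processed
        self._outputs = {}               # run_id -> processed batch
        self._done = threading.Condition()
        threading.Thread(target=self._run, daemon=True).start()

    def _run(self):
        while True:
            run_id, batch = self._requests.get()
            processed = [x * 2 for x in batch]   # stand-in for DALI processing
            with self._done:
                self._outputs[run_id] = processed
                self._done.notify_all()

    def submit(self, run_id, batch):
        self._requests.put((run_id, batch))
        return PendingRun(run_id)

    def wait(self, pending):
        with self._done:
            while pending.run_id not in self._outputs:
                self._done.wait()
            return self._outputs.pop(pending.run_id)

# "Collation": group raw samples into a batch, submit it, keep a placeholder.
server = ToyServer()
placeholder = server.submit(run_id=0, batch=[1, 2, 3, 4])
# "DataLoader" side: replace the placeholder with the actual processed batch.
print(server.wait(placeholder))   # [2, 4, 6, 8]
```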


**1. DALI Pipeline**

The DALI pipeline defines the data processing steps. Input data is fed using ``fn.external_source``.
Contributor:
As we are expanding this, I guess this might be a nice place to mention the mapping between the external_source name and the names of the parameters. Please mention that we require at least one input, and that it is the input to the proxy.

You can link to the operator doc and the argument with something like this AFAIR

:meth:`~nvidia.dali.fn.external_source`
:paramref:`~nvidia.dali.fn.external_source.source`
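A minimal sketch of such a pipeline, assuming (as the comment above suggests) that the ``external_source`` ``name`` is what the proxy's call parameters map to; the parameter names shown are illustrative:

```python
from nvidia.dali import pipeline_def, fn, types

@pipeline_def
def example_pipeline():
    # Each named external_source becomes one input fed by the proxy.
    jpegs = fn.external_source(name="images", no_copy=True)
    labels = fn.external_source(name="labels")
    images = fn.decoders.image(jpegs, device="mixed", output_type=types.RGB)
    return images, labels

# Conceptually, the proxy would then be invoked per sample as
#   proxy(images=<encoded bytes>, labels=<label>)
# with keyword names matching the external_source names above (assumption).
```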


**5. Integration with PyTorch DataLoader**

The ``DataLoader`` wrapper provided by DALI Proxy simplifies the integration process.
Contributor:
Could you expose the dali_proxy DataLoader and DALIServer via autoclass here and link to those sections whenever you mention them? We have docstrings there, but we don't show them here.

Maybe mention that one can start and stop the server by hand, but that the context manager is the recommended way?

@@ -56,22 +56,36 @@ export PATH_TO_IMAGENET=/imagenet
export RESULT_WORKSPACE=./

# synthetic benchmark
python multiproc.py --nproc_per_node 8 ./main.py --amp --static-loss-scale 128 --batch-size 128 --epochs 1 --prof 1000 --no-checkpoints --training-only --data-backend synthetic --workspace $RESULT_WORKSPACE --report-file bench_report_synthetic.json $PATH_TO_IMAGENET
python multiproc.py --nproc_per_node 8 ./main.py --amp --static-loss-scale 128 --batch-size 128 --epochs 3 --prof 1000 --no-checkpoints --training-only --data-backend synthetic --workspace $RESULT_WORKSPACE --report-file bench_report_synthetic.json $PATH_TO_IMAGENET
Contributor:
Does it work correctly with more than one epoch? With the synthetic benchmark the concept of an epoch didn't really exist, as far as I can remember; that's why it just did 1k iterations. Dunno if it will now do 3k or explode, but making it longer doesn't really give us much.

Contributor (Author):
The iterator does have a size, so it works.

raise RuntimeError("The provided pipeline doesn't have any inputs")
pipe_input_names_set = set(pipe_input_names)
input_names_set = set(input_names or [])
if len(input_names_set) != len(input_names_set):
Contributor:
Suggested change
if len(input_names_set) != len(input_names_set):
if len(input_names_set) != len(input_names):

This will fail if you allow input_names to be None. Also, as written now, the check compares the same value to itself.

pipe_input_names_set = set(pipe_input_names)
input_names_set = set(input_names or [])
if len(input_names_set) != len(input_names_set):
raise RuntimeError("``input_names`` argument should not contain any duplicated values")
Contributor:
Suggested change
raise RuntimeError("``input_names`` argument should not contain any duplicated values")
raise RuntimeError(f"``input_names`` argument should not contain any duplicated values, got {input_names}.")
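For reference, a consolidated sketch folding in both suggestions above; the variable names come from the quoted snippet, and the surrounding `if not pipe_input_names:` guard is an assumption about the context:

```python
# Hypothetical consolidation of the two suggestions above; not the final code.
if not pipe_input_names:
    raise RuntimeError("The provided pipeline doesn't have any inputs")
pipe_input_names_set = set(pipe_input_names)
input_names_set = set(input_names or [])
# Guard against None before comparing lengths, and report the offending value.
if input_names is not None and len(input_names_set) != len(input_names):
    raise RuntimeError(
        f"``input_names`` argument should not contain any duplicated values, got {input_names}."
    )
```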

Comment on lines 413 to 473
call_impl.__signature__ = inspect.Signature(parameters)
_DALIProxy.__call__ = call_impl
Contributor:
Cool 😎
Does it have a chance of working with an IDE or Jupyter?

I think for an IDE to work, we need __call__ visible statically; there is no way of injecting a proper signature stub, so we probably still need a static __call__(self, *inputs, **kwargs) with a docstring defined, and then replace it with this hook.
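A standalone sketch of the pattern suggested above: keep a statically visible `__call__` stub for IDEs and docs, then swap in an implementation carrying an explicit `inspect.Signature`. The names here are illustrative, not the actual DALI proxy code:

```python
import inspect

class _DALIProxyDemo:
    def __call__(self, *inputs, **kwargs):
        """Statically visible stub so IDEs and help() show something sensible."""
        raise NotImplementedError

def _make_call_impl(input_names):
    def call_impl(self, *args, **kwargs):
        # Bind against the injected signature and map inputs by name.
        bound = inspect.signature(call_impl).bind(self, *args, **kwargs)
        bound.apply_defaults()
        return {name: bound.arguments[name] for name in input_names}
    params = [inspect.Parameter("self", inspect.Parameter.POSITIONAL_OR_KEYWORD)]
    params += [
        inspect.Parameter(name, inspect.Parameter.POSITIONAL_OR_KEYWORD)
        for name in input_names
    ]
    call_impl.__signature__ = inspect.Signature(params)
    return call_impl

# Replace the static stub with the dynamically-signed implementation:
_DALIProxyDemo.__call__ = _make_call_impl(["images", "labels"])
proxy = _DALIProxyDemo()
print(inspect.signature(proxy.__call__))   # (images, labels)
print(proxy(1, labels=2))                  # {'images': 1, 'labels': 2}
```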

Signed-off-by: Joaquin Anton Guirao <[email protected]> (19 commits)
@jantonguirao jantonguirao force-pushed the dali_proxy2 branch 3 times, most recently from 75f5921 to 32983fe, on December 24, 2024 at 12:16
@dali-automaton (Collaborator): CI MESSAGE: [21819002]: BUILD STARTED

Signed-off-by: Joaquin Anton Guirao <[email protected]> (2 commits)
@dali-automaton (Collaborator): CI MESSAGE: [21823100]: BUILD STARTED

Labels: none yet · Projects: none yet · 5 participants