
Auto Reorder #6

Open · wants to merge 3,335 commits into base: main
Conversation

@zjjott commented Feb 6, 2024

Auto Reorder

Uses a linear program to set the instruction order.

To run the tests:

TF_CPP_MAX_VLOG_LEVEL=2 bazel run --compilation_mode=dbg xla/hlo/experimental/auto_reorder:auto_reorder_test --incompatible_strict_action_env --action_env=USE_CUDA --action_env=XLA_CUDA
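As a rough illustration of the idea (a toy sketch, not this pass's actual formulation): for a purely precedence-constrained schedule, the linear program `min Σ t_i` subject to `t_j ≥ t_i + 1` for every dependency edge `i → j` and `t_i ≥ 0` has an integral optimum where `t_i` equals the length of the longest dependency chain ending at `i`; sorting by those start times yields a valid instruction order. The real pass would build its program from an HLO module and latency estimates.

```python
# Toy sketch of precedence-constrained instruction ordering.
# The example graph below is hypothetical.

def lp_order(deps):
    """deps: {node: set of nodes it depends on}.

    For the LP  min sum(t_i)  s.t.  t_j >= t_i + 1 per edge i -> j,
    t_i >= 0, the optimum is t_i = longest dependency chain ending
    at i, which we compute directly by memoized recursion.
    """
    t = {}

    def depth(n):
        if n not in t:
            t[n] = 1 + max((depth(d) for d in deps[n]), default=0)
        return t[n]

    for n in deps:
        depth(n)
    # Sorting by the LP start times gives a dependency-respecting order.
    return sorted(deps, key=lambda n: (t[n], n))

deps = {"a": set(), "b": {"a"}, "c": {"a"}, "d": {"b", "c"}}
print(lp_order(deps))  # ['a', 'b', 'c', 'd']
```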

wrengr and others added 28 commits March 28, 2024 14:32
This is a prospective change for openxla#10966.  In particular, this will help fix an OSS build problem: "tensorflow/xla/linux/cpu/build_cpu" not being able to find the `InitializeAbslLogging` function.

PiperOrigin-RevId: 620055000
…a tuple-tree of `numpy.ndarray`.

This is intended for internal debugging use. It cannot be used on OSS because the relevant protobufs are not part of the public API. (Though it must not break the OSS build, naturally.)

PiperOrigin-RevId: 620064326
…s to the library for internal debugging tools.

PiperOrigin-RevId: 620068167
These were used by KernelGen but are no longer needed.

PiperOrigin-RevId: 620084345
…n is the entry computation root

PiperOrigin-RevId: 620107928
PiperOrigin-RevId: 620111958
We need to honor it.

PiperOrigin-RevId: 620121620
Updates LLVM usage to match
[aa2c14de1adc](llvm/llvm-project@aa2c14de1adc)

PiperOrigin-RevId: 620124069
PiperOrigin-RevId: 620149815
This CL extracts the current Triton codegen requirements for each HLO instruction into a single function, to clean up the code in the Triton fusion passes.

PiperOrigin-RevId: 620157253
This is required in cases where embedded thunk arguments share the same buffer (i.e., they are located at different offsets within the same buffer).

PiperOrigin-RevId: 620179451
…aring the same buffer

PiperOrigin-RevId: 620184639
PiperOrigin-RevId: 620194665
Changes based on the Hurwitz Zeta algorithm from the article linked in the comments.

PiperOrigin-RevId: 620272234
There is an internal issue with running tests on H100s requiring the change to be rolled back.

Reverts 0ab2be0

PiperOrigin-RevId: 620273492
…n resharding costs for a given edge as part of one matrix object.

PiperOrigin-RevId: 620273768
PiperOrigin-RevId: 620281417
Trying to prevent `error: "Using deprecated NumPy API, disable it with #define NPY_NO_DEPRECATED_API NPY_1_7_API_VERSION"`

PiperOrigin-RevId: 620284610
Updates LLVM usage to match
[80aa52d8c5a8](llvm/llvm-project@80aa52d8c5a8)

PiperOrigin-RevId: 620285862
…using mpi.

Imported from GitHub PR openxla#7849

Mpi collectives as proposed in jax-ml/jax#11182.

I only implemented the inter-process communication, and this does not yet support more than one thread per process. Properly adding support for multiple threads/devices per process in the future looks quite a bit more involved.

For MPI I am building and linking against https://github.com/eschnett/MPItrampoline, which dlopens the (wrapped) MPI library at runtime. To wrap and load the desired MPI library, one needs to compile https://github.com/eschnett/MPIwrapper and set `MPITRAMPOLINE_LIB=/path/to/libmpiwrapper.so`.
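For illustration, selecting the wrapper at runtime is just an environment-variable setting made before MPI is initialized (the path below is the placeholder from the description, not a real library):

```python
import os

# Placeholder path: point MPItrampoline at a compiled MPIwrapper library.
# This must be set before the process loads/initializes MPI.
os.environ.setdefault("MPITRAMPOLINE_LIB", "/path/to/libmpiwrapper.so")

print(os.environ["MPITRAMPOLINE_LIB"])
```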

@hawkinsp
Copybara import of the project:

--
b74bbb9 by Clemens Giuliani <[email protected]>:

add mpi collectives

--
23508eb by Clemens Giuliani <[email protected]>:

add explicit Init and Finalize methods and export them to python

--
bbe5840 by Clemens Giuliani <[email protected]>:

add comment

--
38d1562 by Clemens Giuliani <[email protected]>:

fix windows build

--
201f723 by Clemens Giuliani <[email protected]>:

fmt

--
2784869 by Clemens Giuliani <[email protected]>:

bump xla_extension_version

Merging this change closes openxla#7849

COPYBARA_INTEGRATE_REVIEW=openxla#7849 from inailuig:mpi_collectives 2784869
PiperOrigin-RevId: 620302264
…` to `bytes`

`xla::PjRtValueType` is defined in C++, where its `std::string` value can
contain any string (not necessarily UTF-8). Protobuf version 3 requires a
`string` field to contain UTF-8, so `bytes` is more suitable for expressing
this value.

(Note that the string value of `xla::PjRtValueType` is often consumed by
Python, where nanobind converts `std::string` into a Python `str` with UTF-8
decoding. However, that is merely what some users of `xla::PjRtValueType`
choose to do; it is not sufficient to constrain the string to be UTF-8 only
in the C++ APIs.)

This is a preemptive change; no problem with using a `string` field has been
observed so far.

PiperOrigin-RevId: 620315110
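The motivation above can be seen with a short example: an arbitrary byte payload, as a C++ `std::string` might hold, need not be valid UTF-8, so it cannot always round-trip through a proto3 `string` field, whereas a `bytes` field accepts it unchanged.

```python
# An arbitrary byte payload, as a C++ std::string might hold.
payload = b"\x80\xffnot-utf8"

try:
    payload.decode("utf-8")  # what a proto3 `string` field requires
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False

print(valid_utf8)  # False: this value needs a `bytes` field
```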
PiperOrigin-RevId: 620320903
PiperOrigin-RevId: 620324878
ghpvnist and others added 19 commits April 8, 2024 14:19
HloDimensionsInstruction::ClassOf should return false for kTopK.

PiperOrigin-RevId: 622950232
They don't work after the stream is initialized in GpuStream (the only Stream implementation that makes use of the priority). Instead, move the parameter to Stream::Initialize, which is the only place it's actually used.

PiperOrigin-RevId: 622958008
…nate bool allocated_ member that's now unnecessary.

PiperOrigin-RevId: 622964979
PiperOrigin-RevId: 623002758
PiperOrigin-RevId: 623011436
PiperOrigin-RevId: 623013151
PiperOrigin-RevId: 623053994
      cost_analysis->bytes_accessed(instr) / (1e6 * actual_bandwidth));
  total_time += communication_time;
  return total_time;
}

std::vector<double> GpuPerformanceWithCollectiveModel::GetInterInnerBandwidths(
    const HloInstruction& instr, const GpuHloCostAnalysis* cost_analysis,
    const se::DeviceDescription& gpu_device_info) {


Is this function used to compute the intra-node and inter-node bandwidths? Could you add a brief comment describing the function?

  auto inner_node_numel_bytes =
      numel_bytes * (std::min(kInnerNodeGpu, total_gpu) - 1);

// all-gather-start(f32[12800,2400]{0,1} replica_groups={{0,1,2,3}})


This could be replaced with an explicit computation formula.
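The formula the reviewer asks for might look roughly like the sketch below. Everything here is a hypothetical stand-in, not the PR's actual cost model: collective traffic is split into an intra-node share and an inter-node share, and each is divided by its bandwidth (the bandwidth values and the 8-GPU node size are illustrative only).

```python
def estimate_collective_time_us(numel_bytes, total_gpu,
                                gpus_per_node=8,
                                inner_bw_gbps=300.0,   # hypothetical NVLink-class
                                inter_bw_gbps=25.0):   # hypothetical NIC-class
    """Toy intra/inter-node traffic split for a collective op.

    Each rank exchanges numel_bytes with each of its (k - 1) peers
    inside the node; any traffic to GPUs beyond the node goes over
    the slower inter-node link.
    """
    inner_peers = min(gpus_per_node, total_gpu) - 1
    inner_bytes = numel_bytes * inner_peers
    inter_bytes = numel_bytes * max(total_gpu - gpus_per_node, 0)
    # GB/s -> bytes/us is a factor of 1e3.
    return (inner_bytes / (inner_bw_gbps * 1e3)
            + inter_bytes / (inter_bw_gbps * 1e3))

# e.g. an f32[12800,2400] all-gather over replica_groups={{0,1,2,3}}:
t = estimate_collective_time_us(numel_bytes=4 * 12800 * 2400, total_gpu=4)
print(round(t, 1))  # microseconds
```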
