
Auto Reorder #6

Open · wants to merge 3,335 commits into base: main
Conversation

@zjjott commented Feb 6, 2024

Auto Reorder

Uses a linear program to set the instruction order.

To run the tests:

TF_CPP_MAX_VLOG_LEVEL=2 bazel run --compilation_mode=dbg xla/hlo/experimental/auto_reorder:auto_reorder_test --incompatible_strict_action_env --action_env=USE_CUDA --action_env=XLA_CUDA
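As a rough illustration of the idea (a toy sketch, not this pass's actual formulation): for a purely precedence-constrained schedule, the linear program `min Σ t_i` subject to `t_j ≥ t_i + 1` for every dependency edge `i → j` and `t_i ≥ 0` has an integral optimum where `t_i` equals the length of the longest dependency chain ending at `i`; sorting by those start times yields a valid instruction order. The real pass would build its program from an HLO module and latency estimates.

```python
# Toy sketch of precedence-constrained instruction ordering.
# The example graph below is hypothetical.

def lp_order(deps):
    """deps: {node: set of nodes it depends on}.

    For the LP  min sum(t_i)  s.t.  t_j >= t_i + 1 per edge i -> j,
    t_i >= 0, the optimum is t_i = longest dependency chain ending
    at i, which we compute directly by memoized recursion.
    """
    t = {}

    def depth(n):
        if n not in t:
            t[n] = 1 + max((depth(d) for d in deps[n]), default=0)
        return t[n]

    for n in deps:
        depth(n)
    # Sorting by the LP start times gives a dependency-respecting order.
    return sorted(deps, key=lambda n: (t[n], n))

deps = {"a": set(), "b": {"a"}, "c": {"a"}, "d": {"b", "c"}}
print(lp_order(deps))  # ['a', 'b', 'c', 'd']
```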

wrengr and others added 28 commits March 28, 2024 14:32
This is a prospective change for openxla#10966.  In particular, this will help fix an OSS build problem: "tensorflow/xla/linux/cpu/build_cpu" not being able to find the `InitializeAbslLogging` function.

PiperOrigin-RevId: 620055000
…a tuple-tree of `numpy.ndarray`.

This is intended for internal debugging use. It cannot be used on OSS because the relevant protobufs are not part of the public API. (Though it must not break the OSS build, naturally.)

PiperOrigin-RevId: 620064326
…s to the library for internal debugging tools.

PiperOrigin-RevId: 620068167
These were used by KernelGen but are no longer needed.

PiperOrigin-RevId: 620084345
…n is the entry computation root

PiperOrigin-RevId: 620107928
PiperOrigin-RevId: 620111958
We need to honor it.

PiperOrigin-RevId: 620121620
Updates LLVM usage to match
[aa2c14de1adc](llvm/llvm-project@aa2c14de1adc)

PiperOrigin-RevId: 620124069
PiperOrigin-RevId: 620149815
This CL extracts the current Triton codegen requirements for each HLO instruction into a single function, to clean up the code in the Triton fusion passes.

PiperOrigin-RevId: 620157253
This is required in cases where embedded thunk arguments share the same buffer (i.e., they are located at different offsets within the same buffer).

PiperOrigin-RevId: 620179451
…aring the same buffer

PiperOrigin-RevId: 620184639
PiperOrigin-RevId: 620194665
Changes based on the Hurwitz Zeta algorithm from the article linked in the comments.

PiperOrigin-RevId: 620272234
There is an internal issue with running tests on H100s requiring the change to be rolled back.

Reverts 0ab2be0

PiperOrigin-RevId: 620273492
…n resharding costs for a given edge as part of one matrix object.

PiperOrigin-RevId: 620273768
PiperOrigin-RevId: 620281417
Trying to prevent `error: "Using deprecated NumPy API, disable it with #define NPY_NO_DEPRECATED_API NPY_1_7_API_VERSION"`

PiperOrigin-RevId: 620284610
Updates LLVM usage to match
[80aa52d8c5a8](llvm/llvm-project@80aa52d8c5a8)

PiperOrigin-RevId: 620285862
…using mpi.

Imported from GitHub PR openxla#7849

Mpi collectives as proposed in jax-ml/jax#11182.

I only implemented the inter-process communication, and this does not yet support more than one thread per process. Properly adding support for multiple threads/devices per process in the future looks quite a bit more involved.

For MPI I am building and linking against https://github.com/eschnett/MPItrampoline, which dlopens the (wrapped) MPI library at runtime. To wrap and load the desired MPI library, one needs to compile https://github.com/eschnett/MPIwrapper and set `MPITRAMPOLINE_LIB=/path/to/libmpiwrapper.so`.
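For illustration, selecting the wrapper at runtime is just an environment-variable setting made before MPI is initialized (the path below is the placeholder from the description, not a real library):

```python
import os

# Placeholder path: point MPItrampoline at a compiled MPIwrapper library.
# This must be set before the process loads/initializes MPI.
os.environ.setdefault("MPITRAMPOLINE_LIB", "/path/to/libmpiwrapper.so")

print(os.environ["MPITRAMPOLINE_LIB"])
```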

@hawkinsp
Copybara import of the project:

--
b74bbb9 by Clemens Giuliani <[email protected]>:

add mpi collectives

--
23508eb by Clemens Giuliani <[email protected]>:

add explicit Init and Finalize methods and export them to python

--
bbe5840 by Clemens Giuliani <[email protected]>:

add comment

--
38d1562 by Clemens Giuliani <[email protected]>:

fix windows build

--
201f723 by Clemens Giuliani <[email protected]>:

fmt

--
2784869 by Clemens Giuliani <[email protected]>:

bump xla_extension_version

Merging this change closes openxla#7849

COPYBARA_INTEGRATE_REVIEW=openxla#7849 from inailuig:mpi_collectives 2784869
PiperOrigin-RevId: 620302264
…` to `bytes`

`xla::PjRtValueType` is defined in C++, where its `std::string` value can
contain any string (not necessarily UTF-8). Protobuf version 3 requires a
`string` field to contain UTF-8, so `bytes` is more suitable for expressing
this value.

(Note that the string value of `xla::PjRtValueType` is often consumed by
Python, where nanobind converts `std::string` into a Python `str` with UTF-8
decoding. However, that is merely what some users of `xla::PjRtValueType`
choose to do; it is not sufficient to constrain the string to be UTF-8 only
in the C++ APIs.)

This is a preemptive change; no problem with using a `string` field has been
observed so far.

PiperOrigin-RevId: 620315110
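The motivation above can be seen with a short example: an arbitrary byte payload, as a C++ `std::string` might hold, need not be valid UTF-8, so it cannot always round-trip through a proto3 `string` field, whereas a `bytes` field accepts it unchanged.

```python
# An arbitrary byte payload, as a C++ std::string might hold.
payload = b"\x80\xffnot-utf8"

try:
    payload.decode("utf-8")  # what a proto3 `string` field requires
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False

print(valid_utf8)  # False: this value needs a `bytes` field
```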
PiperOrigin-RevId: 620320903
PiperOrigin-RevId: 620324878
ghpvnist and others added 19 commits April 8, 2024 14:19
HloDimensionsInstruction::ClassOf should return false for kTopK.

PiperOrigin-RevId: 622950232
They don't work after the stream is initialized in GpuStream (the only Stream implementation that makes use of the priority). Instead, move the parameter to Stream::Initialize, which is the only place it's actually used.

PiperOrigin-RevId: 622958008
…nate bool allocated_ member that's now unnecessary.

PiperOrigin-RevId: 622964979
PiperOrigin-RevId: 623002758
PiperOrigin-RevId: 623011436
PiperOrigin-RevId: 623013151
PiperOrigin-RevId: 623053994
      cost_analysis->bytes_accessed(instr) / (1e6 * actual_bandwidth));
  total_time += communication_time;
  return total_time;
}

std::vector<double> GpuPerformanceWithCollectiveModel::GetInterInnerBandwidths(
    const HloInstruction& instr, const GpuHloCostAnalysis* cost_analysis,
    const se::DeviceDescription& gpu_device_info) {


Is this function used to compute the intra-node and inter-node bandwidths? Could you add a brief comment describing the function?

  auto inner_node_numel_bytes =
      numel_bytes * (std::min(kInnerNodeGpu, total_gpu) - 1);

// all-gather-start(f32[12800,2400]{0,1} replica_groups={{0,1,2,3}})


This could be replaced with an explicit computation formula.
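The formula the reviewer asks for might look roughly like the sketch below. Everything here is a hypothetical stand-in, not the PR's actual cost model: collective traffic is split into an intra-node share and an inter-node share, and each is divided by its bandwidth (the bandwidth values and the 8-GPU node size are illustrative only).

```python
def estimate_collective_time_us(numel_bytes, total_gpu,
                                gpus_per_node=8,
                                inner_bw_gbps=300.0,   # hypothetical NVLink-class
                                inter_bw_gbps=25.0):   # hypothetical NIC-class
    """Toy intra/inter-node traffic split for a collective op.

    Each rank exchanges numel_bytes with each of its (k - 1) peers
    inside the node; any traffic to GPUs beyond the node goes over
    the slower inter-node link.
    """
    inner_peers = min(gpus_per_node, total_gpu) - 1
    inner_bytes = numel_bytes * inner_peers
    inter_bytes = numel_bytes * max(total_gpu - gpus_per_node, 0)
    # GB/s -> bytes/us is a factor of 1e3.
    return (inner_bytes / (inner_bw_gbps * 1e3)
            + inter_bytes / (inter_bw_gbps * 1e3))

# e.g. an f32[12800,2400] all-gather over replica_groups={{0,1,2,3}}:
t = estimate_collective_time_us(numel_bytes=4 * 12800 * 2400, total_gpu=4)
print(round(t, 1))  # microseconds
```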
