
Hoisting transpose operation #90

Open
giuseros opened this issue Dec 7, 2021 · 5 comments

giuseros commented Dec 7, 2021

Hi @nicolasvasilache, all,
Before I venture into writing a pass, I was wondering if you have already thought about how transposition is handled in the code.

How (I think) transposition is handled in the current code

During packing:

  • We load each element one by one from memory
  • We store each element into a buffer of type memref<?x?x?xKx1>

In the micro-kernel:

  • We load each element one-by-one from the Kx1 buffer
  • We call vector.transpose to obtain our 1xK vector
  • We execute the outerproduct

Is this what we are doing, or am I missing something?
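To make this concrete, here is a rough MLIR sketch of what I think the micro-kernel side looks like today (everything is illustrative: the names, K = 4, f32, and the 8-wide LHS are made up, and the element-by-element loads are compressed into a single transfer_read for brevity; this is not the actual generated IR):

```mlir
// Illustrative sketch only: %packed, %lhs, %acc0, the indices and the
// K = 4 / f32 types are assumptions, not the actual generated IR.
// Read one Kx1 column from the packed buffer.
%col  = vector.transfer_read %packed[%i, %j, %k, %c0, %c0], %pad
          : memref<?x?x?x4x1xf32>, vector<4x1xf32>
// Transpose it to 1xK and flatten to a 1-D vector.
%rowT = vector.transpose %col, [1, 0] : vector<4x1xf32> to vector<1x4xf32>
%row  = vector.shape_cast %rowT : vector<1x4xf32> to vector<4xf32>
// Feed the outer product.
%acc1 = vector.outerproduct %lhs, %row, %acc0 : vector<8xf32>, vector<4xf32>
```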

What I would like to have

Is it possible to hoist the vector.transpose into the packing phase? I.e.:

During packing:

  • We load each element one by one from memory
  • We call vector.transpose and store in a buffer of type memref<?x?x?x1xK>

In the micro-kernel:

  • We load directly with vector.load from the 1xK buffer (see the sketch below)
  • We execute the outerproduct
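
Concretely, the micro-kernel would then reduce to something like this (same illustrative names and types as the sketch above, not real IR):

```mlir
// Illustrative sketch: with the buffer packed as ...x1xK, the row is
// loaded directly and no vector.transpose remains in the hot loop.
%row  = vector.load %packed[%i, %j, %k, %c0, %c0]
          : memref<?x?x?x1x4xf32>, vector<4xf32>
%acc1 = vector.outerproduct %lhs, %row, %acc0 : vector<8xf32>, vector<4xf32>
```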

Is this something hard to do?

Thanks,
Giuseppe

giuseros commented Dec 7, 2021

cc: @stevenvar

nicolasvasilache commented Dec 8, 2021

Thanks for your question; this is a missing feature that, thanks to recent developments, is being pushed back onto our stack of shorter-term work.
I'd recommend you come ask questions in the public IREE chat named "codegen" for low-latency iteration: https://discord.gg/ZNtWrXF6.

There are a few things intersecting here but a rough summary is:

  1. for x86 we have not yet needed this: the ISA has a "broadcast scalar and fma" instruction, which makes transposition unnecessary (and maybe even detrimental)
  2. @bjacob is also looking at matmul for other ARM targets and is iterating on a mix of rank-reducing subview/subtensor and vector.shape_cast N-d <-> 1-d
  3. @gysit is investigating some issues with 2-d convolution padding related to rank-reducing subview/subtensor
  4. we know we want to extend linalg.pad_tensor with an optional permutation_map that will carry the transposition information and will need to be supported by hoisting.

I think 4. is relatively easy to get started on, as far as extending the op semantics/verification/tests and tracking uses to ensure transformations fail in the presence of this permutation_map.
Vectorization should also be reasonably easy by just inserting the vector.transpose between the read/write.
Extensions to hoist padding are a bit more involved but we know what to do.
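
To sketch what 4. might look like (hypothetical: the permutation_map attribute and its placement below are invented for illustration, this is not existing linalg.pad_tensor syntax):

```mlir
// HYPOTHETICAL sketch: linalg.pad_tensor has no permutation_map today.
// The attribute would record that the padded result is also transposed,
// so hoisting and vectorization can account for the permutation.
%padded = linalg.pad_tensor %src low[%c0, %c0] high[%h0, %h1]
    {permutation_map = affine_map<(d0, d1) -> (d1, d0)>} {
  ^bb0(%i: index, %j: index):
    linalg.yield %cst : f32
} : tensor<?x?xf32> to tensor<?x?xf32>
```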

Do you guys want to take ownership of point 4. and start working on core MLIR patches?

giuseros commented Dec 8, 2021

Hi Nicolas, thanks for the answer. Yes, @stevenvar is having a look at this. We saw two possibilities:
a) Having a pass that hoists the transpose out of the micro-kernel
b) Hoisting the transpose "ab initio", along the lines of point 4 in your list

Do you think that a) is the wrong way to go (or that it is harder to do than b))?

About your point 1., do you mean x86 has an fmla vec, vec, scalar? On Arm there is an indexed fmla, fmla vec_c, vec_a, vec_b[i], which broadcasts the i-th lane of vec_b into a logical vector broadcast and then does fmla vec_c, vec_a, broadcast. The point is that vec_b is still a vector that needs to be loaded from memory, rather than a single scalar.

@nicolasvasilache

> Do you think that a) is the wrong way to go (or that it is harder to do than b))?

I don't think a) is wrong in itself, but it is definitely harder given the current state of the world, and there are also tradeoffs and composability differences:

  1. padding and hoist padding happen at the tensor level; a) would happen on vectors, much later in the pipeline.
  2. hoist padding already exists. It is unclear how the higher-D tensor would be created in a). At the moment, on tensors, we do a bounding-box analysis and a few sophisticated things that would be painful to reproduce.
  3. there are opportunities at the tensor level regarding canonicalization with consumer linalg ops, and other things that are natural on tensors and for which the vector level is too late.
  4. we already identified the linalg.pad_tensor extensions as something we want in general, so there is an opportunity for convergence.

> About your point 1., do you mean x86 has an fmla vec, vec, scalar?

There is an instruction vfmadd231ps zmm0,zmm4,DWORD PTR [rsi+0x4]{1to16}
See slide 42 of this prez: https://drive.google.com/corp/drive/folders/1lLhWopx_WCtFq3gTDGVJEzV9hFD7dwmI.
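
At the vector level, that pattern is essentially broadcast-then-fma (illustrative sketch; the names and the 16 x f32 shape are assumptions chosen to match the zmm example above):

```mlir
// Illustrative: the {1to16} embedded broadcast splats one scalar of B
// across the lanes, then a fused multiply-add accumulates.
%bvec = vector.broadcast %b_scalar : f32 to vector<16xf32>
%acc1 = vector.fma %a, %bvec, %acc0 : vector<16xf32>
```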

giuseros commented Dec 8, 2021

Thanks a lot for all the explanation. So yes, we can take ownership of point 4.
