
WIP: Move from mt to dist #24

Open · wants to merge 3 commits into base: main

Conversation

andresrossb

Created pfor, pfor observations, and lsqDB_dist.
It will not run yet; I still need to add `using Distributed` and initialize the workers.
It's simply a starting idea for how to build the matrix.
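
A minimal sketch of what that missing setup could look like (using the standard Distributed stdlib; `build_row` and `configs` are illustrative placeholders, not the actual pfor/lsqDB_dist code):

```julia
# Hypothetical sketch, not the actual pfor/lsqDB_dist code: what "add
# `using Distributed` and initialize the workers" could look like.
using Distributed
addprocs(4)                              # start 4 local worker processes

@everywhere begin
    # every worker needs the row builder; here just a toy stand-in for
    # evaluating the basis on one configuration
    build_row(config) = fill(float(config), 3)
end

configs = 1:10
rows = pmap(build_row, configs)          # one row block per config, in parallel
A = reduce(vcat, permutedims.(rows))     # stack into the design matrix on the master
```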

cortner mentioned this pull request Dec 24, 2021
cortner (Member) commented Dec 24, 2021

Thank you, Andres. I see this is a PR from a fork. Would you be willing to give @wcwitt and me push access so we can collaborate on this PR?

cortner changed the title from "First attempt at distributed" to "WIP: Move from mt to dist" on Dec 24, 2021
andresrossb and others added 2 commits January 28, 2022 22:17
wcwitt (Collaborator) commented Jan 31, 2022

My last version used distributed assembly, but with the design matrix as a SharedArray, which meant it couldn't work across multiple nodes. I now have a version with the design matrix as a DArray. I still convert it back to a regular array for the solvers, but making the solvers work with the distributed matrix will be the next step.
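
Roughly, the shape of this approach is the following (an illustrative sketch only; the real code assembles the matrix block-by-block on the workers rather than distributing a pre-built array):

```julia
# Rough sketch of the DArray approach (illustrative; the real code assembles
# the design matrix block-by-block on the workers rather than distributing a
# pre-built array).
using Distributed
addprocs(2)
@everywhere using DistributedArrays

A  = randn(1_000, 50)          # stand-in for the assembled design matrix
b  = randn(1_000)
dA = distribute(A)             # DArray: blocks live on the workers, can span nodes

A_local = convert(Array, dA)   # gather back to a regular Array for the solvers
x = A_local \ b                # e.g. an ordinary least-squares solve on the master
```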

However, I recently rebased on the latest IPFitting, so I'll need to force push to get it here. Do you mind?

cortner (Member) commented Feb 2, 2022

What do you mean "get it here"? You want it to go onto the v0.10 branch? I think then it needs to be rebased. Unfortunately we started planning this before the decision was made to retire IPFitting.

cortner (Member) commented Feb 2, 2022

Specifically, I think we need to rebase onto main - how do you feel about this? Or do you actually WANT it on the 0.5.x branch?

wcwitt (Collaborator) commented Feb 2, 2022

I mean I've already rebased onto v0.10, so I would need to force push here. And since this isn't my PR, I thought I should ask first.

cortner (Member) commented Feb 2, 2022

Oh, I see, perfect. Yes, please go ahead - can you edit the target branch, or shall I?

wcwitt (Collaborator) commented Feb 2, 2022

Thanks. I'm not able to edit the target branch, sorry.

wcwitt (Collaborator) commented Feb 2, 2022

More generally, I don't think we should merge this until I have at least one of the solvers working with the distributed matrix. I'll keep rebasing onto v0.x as appropriate and flag you when it's ready for discussion/review.

cortner (Member) commented Feb 2, 2022

I think once it is rebased onto main, you can just keep merging changes from main into this branch. Personally I like that better, but I guess it is a matter of taste?

cortner changed the base branch from master to main on February 2, 2022
cortner (Member) commented Feb 2, 2022

I've just given you write access, so once this is done you can keep the PRs here if you prefer.

wcwitt (Collaborator) commented Feb 8, 2022

While attempting to use the LSQR routine from IterativeSolvers, I've discovered that some of the DistributedArrays functionality is rather brittle.

For example, thus far I have been distributing over the configs for convenience, which means that the submatrices belonging to different workers are not all the same size. (Not great for load balancing, but the easiest way to start.) Constructing the DArray this way works fine, but it turns out that some of the DArray math routines (e.g., matrix multiplication) implicitly assume submatrices of equal size.

Just noting this for now; I still need to figure out a solution, likely by distributing more carefully.
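
A sketch of the kind of call involved, together with the obvious but unsatisfying workaround of gathering the matrix before the solve (sizes and names are made up; assumes DistributedArrays.jl and IterativeSolvers.jl are installed):

```julia
# Sketch only: the kind of call that exposes the issue. lsqr needs A*x and
# A'*r products, and the distributed versions of those can break when the
# per-worker blocks are uneven. Gathering first works but defeats the purpose
# for matrices too large for one node.
using Distributed
addprocs(2)
@everywhere using DistributedArrays
using IterativeSolvers

A  = randn(200, 20)
b  = randn(200)
dA = distribute(A)                 # even blocks here; uneven blocks are the problem case
x  = lsqr(convert(Array, dA), b)   # workaround: gather to a dense Array first
```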

cortner (Member) commented Feb 8, 2022

What about distributing one way for assembly and then "redistributing" for the linear algebra? But this is extremely weird in my view.

wcwitt (Collaborator) commented Feb 9, 2022

Yeah, that might work, although I'm trying to avoid ever putting the full matrix on one worker, which makes it a little tricky. I'll keep thinking about it.

In the meantime, I filed an issue, JuliaParallel/DistributedArrays.jl#237.

jameskermode commented

We also needed to use constant-size blocks for the gap_fit parallelisation, as it's a ScaLAPACK requirement. We had also distributed by configuration, applying some heuristics to give a reasonable workload balance. Until now we added zero padding to the blocks to equalise the sizes, which doesn't change the solution of the linear system (so long as you add zeros to the RHS vectors as well) but is not optimal for memory usage or time, so we're rethinking whether it would have been better to distribute completely evenly. @Sideboard can fill in more details when we speak.
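
For concreteness, the zero-padding trick amounts to something like the following toy sketch (made-up block sizes, not the gap_fit code):

```julia
# Toy sketch of the zero-padding idea (made-up block sizes, not the gap_fit code).
# Zero rows contribute nothing to A'A or A'b, so the least-squares solution is
# unchanged, at the cost of extra memory and flops.
blocks = [randn(5, 3), randn(9, 3), randn(7, 3)]    # uneven per-worker blocks
rhs    = [randn(5),    randn(9),    randn(7)]
m = maximum(size.(blocks, 1))                       # common block height

pad_A(B) = vcat(B, zeros(m - size(B, 1), size(B, 2)))
pad_b(y) = vcat(y, zeros(m - length(y)))

A,  b  = reduce(vcat, blocks),         reduce(vcat, rhs)
Ap, bp = reduce(vcat, pad_A.(blocks)), reduce(vcat, pad_b.(rhs))
@assert (A \ b) ≈ (Ap \ bp)                         # same solution, padded or not
```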

wcwitt (Collaborator) commented Feb 21, 2022

I've been thinking about this and I'm leaning towards distributing fully evenly. It won't take that much more code than the zero padding. It will be good to hear about your experience.
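
One possible way to carve the rows into fully even chunks, ignoring configuration boundaries (an assumed approach, not the final code):

```julia
# Assumed approach, not the final code: split the global row range into
# near-equal contiguous chunks, one per worker, ignoring config boundaries.
even_ranges(nrows, nchunks) =
    [(1 + div((i - 1) * nrows, nchunks)):div(i * nrows, nchunks) for i in 1:nchunks]

even_ranges(23, 4)    # => [1:5, 6:11, 12:17, 18:23]
```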
