WIP: Move from mt to dist #24
base: main
Conversation
Thank you, Andres. I see this is a PR from a fork. Would you be willing to give myself and @wcwitt push access so we can collaborate on this PR?
Created pfor, pfor observations, and lsqDB_dist. It will not run yet; I still need to add "using Distributed" and initialize the workers. It's simply a starting idea on how to build the matrix.
My last version used distributed assembly, but with the design matrix as a SharedArray, which meant it couldn't work across multiple nodes. I now have a version with the design matrix as a DArray. I still convert it back to a regular array for the solvers, but they'll be the next step. However, I recently rebased on the latest IPFitting, so I'll need to force push to get it here. Do you mind?
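To make the assembly idea concrete, here is a minimal standalone sketch of distributing by configuration using only the stdlib Distributed (not the actual DArray-based code in the PR); assemble_block and the configs are hypothetical stand-ins for the real basis evaluation:

```julia
using Distributed
addprocs(2)

@everywhere begin
    # Hypothetical stand-in for the real basis evaluation: each config
    # contributes config.nobs rows to a 3-column design matrix.
    function assemble_block(config)
        n = config.nobs
        return [Float64(i + j) for i in 1:n, j in 1:3]
    end
end

# Configs contribute different numbers of observations.
configs = [(nobs = 2,), (nobs = 5,), (nobs = 3,)]

# Workers assemble per-config row blocks; vcat collects the full
# matrix on the master process (only viable for small problems).
blocks = pmap(assemble_block, configs)
A = reduce(vcat, blocks)
```

The DArray version keeps the blocks on the workers instead of gathering them, which is the whole point for multi-node runs; this sketch only shows the per-config split.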
What do you mean "get it here"? You want it to go onto the v0.10 branch? I think then it needs to be rebased. Unfortunately we started planning this before the decision was made to retire IPFitting.
Specifically, I think we need to rebase onto
I mean I've already rebased to v0.10, so I would need to force push here. And this isn't my PR, so I thought I should ask first.
Oh, I see, perfect. Yes, please go ahead. Can you edit the target branch, or shall I?
Thanks. I'm not able to edit the target branch, sorry |
More generally, I don't think we should merge this until I have at least one of the solvers working with the distributed matrix. I'll keep rebasing to v0.x as appropriate and flag you when it's ready for discussion/review.
I think once it is rebased onto
Just gave you write access, so once this is done you can keep the PRs here if you prefer.
While attempting to use the LSQR routine from IterativeSolvers, I've discovered some of the DistributedArrays functionality is rather brittle. For example, thus far I have been distributing over the configs for convenience, which means that the submatrices belonging to different workers are not all the same size. (Not great for load balancing, but the easiest way to start.) Constructing the DArray this way works fine. But it turns out some of the DArray math routines (e.g., matrix multiplication) implicitly assume submatrices of equal size. Just noting this for now - still need to figure out a solution, likely by distributing more carefully.
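To make the ragged-block issue concrete, a small sketch with made-up observation counts shows how splitting by config leaves workers with submatrices of different heights:

```julia
# Hypothetical observation counts per config (rows each config adds).
nrows = [12, 7, 30, 7, 19, 4]

nw = 3  # number of workers
# Round-robin assignment of configs to workers:
chunks = [nrows[i:nw:end] for i in 1:nw]
heights = map(sum, chunks)
# heights == [19, 26, 34]: the local blocks differ in height, which
# is what trips up DArray routines that assume equal-size submatrices.
```

Any assignment that keeps whole configs together will generically produce unequal heights; equal blocks require splitting rows across config boundaries.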
What about distributing one way for assembly and then redistributing for the linalg? But this is extremely weird in my view.
Yeah that might work, although I'm trying to avoid ever putting the full matrix on one worker, which makes it a little tricky. I'll keep thinking about it. In the meantime, I filed an issue, JuliaParallel/DistributedArrays.jl#237. |
We also needed to use constant-size blocks for the gap_fit parallelisation, as it's a ScaLAPACK requirement. We had also distributed by configuration, applying some heuristics to give reasonable workload balance. Until now we added zero padding to the blocks to equalise the sizes, which doesn't change the solution to the linear system (so long as you add zeros to the RHS vectors as well) but is not optimal for memory usage or time, so we're rethinking whether it would have been better to distribute completely evenly. @Sideboard can fill in more details when we speak.
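The zero-padding claim is easy to check numerically; a standalone sketch with a toy matrix (not the gap_fit code): padding with zero rows leaves A'A and A'b unchanged, so the least-squares solution is identical.

```julia
using LinearAlgebra

# Small overdetermined system.
A = [1.0 2.0; 3.0 4.0; 5.0 6.0; 7.0 9.0]
b = [1.0, 2.0, 3.0, 4.0]

# Pad the block with zero rows, and pad b to match:
Apad = vcat(A, zeros(2, 2))
bpad = vcat(b, zeros(2))

x  = A \ b        # least-squares solution of the original system
xp = Apad \ bpad  # solution of the padded system
@assert isapprox(x, xp; atol = 1e-10)
```

The padding costs memory and flops, which is the trade-off mentioned above, but it is mathematically harmless.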
I've been thinking about this and I'm leaning towards distributing fully evenly. It won't take that much more code than the zero padding. It will be good to hear about your experience.