Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixes for typo and directory changes in DYAD #40

Conversation

hariharan-devarajan
Copy link
Collaborator

So, the new structure didn't copy the images in the right place. Additionally, the path changes were not propagated into the notebook.

I fixed all typos. But the image need a fresh install of dlio_benchmark as it seems the architecture is not aching. I am getting illegal instructions error.

@ilumsden Can u help me fix the image so that I can continue testing. Some things till left to do are

Module 4: 04_flux_tutorial_conclusions.ipynb the file is not found.
Then to run the code,

  1. The directory path is wrong. (Fixed)
  2. seems like the architecture has changed. I get illegal instructions on the existing DLIO Benchmark code
  3. When i run the shutting down of service i get flux-exec: flux_reactor_run: Invalid argument

@vsoch
Copy link
Member

vsoch commented Jul 30, 2024

@hariharan-devarajan it looks like the build was OK, at least in the CI! Apologies for the path changes without updating the notebook, I didn't even open it, because I wouldn't have been oriented to work on it. I'll leave you and @ilumsden to work on this, but please ping me if/when it's ready for review.

@vsoch
Copy link
Member

vsoch commented Jul 30, 2024

A few nits preparing for future review:

  • Commit meaningfully (I am working on this too) with the format category: message and then the longer statement below it that describes the change.
  • Make sure there is not an empty cell added to the end

@ilumsden
Copy link
Collaborator

@hariharan-devarajan regarding the illegal instructions stuff, what architecture is the computer on which you built the container? Also, what command produced this error? If you could share a screenshot of the error, that would help too.

In my experience, errors like this usually occur on ARM systems (e.g., newer Macs) due to an architecture mismatch between the host system and the container. If this is the issue, it's easily solved using the --platform flag.

@hariharan-devarajan
Copy link
Collaborator Author

hariharan-devarajan commented Aug 12, 2024

@hariharan-devarajan regarding the illegal instructions stuff, what architecture is the computer on which you built the container? Also, what command produced this error? If you could share a screenshot of the error, that would help too.

In my experience, errors like this usually occur on ARM systems (e.g., newer Macs) due to an architecture mismatch between the host system and the container. If this is the issue, it's easily solved using the --platform flag.

Thanks @ilumsden. Currently, I am stuck with some deadlines. I would appreciate if u could test the Tutorial. If you have any issues, I can probably help u out. Let me know if u can try this out or not.

@ilumsden
Copy link
Collaborator

Sounds good. That's my top priority this week (unless I need to do more work on a paper, but I suspect my main work for that paper is done).

vsoch and others added 2 commits August 12, 2024 23:48
Problem: The tutorial series this year is no longer
RADIUSS, but renamed to HPCIC.
Solution: Rename all assets to HPCIC/hpcic, will
await merge on confirmation from the hpcic/radiuss
team!

Signed-off-by: vsoch <[email protected]>
@ilumsden ilumsden requested a review from vsoch August 21, 2024 00:29
@ilumsden
Copy link
Collaborator

@vsoch the DYAD notebook is now fully working. I tested it on an x86 instance from Jetstream cloud.

I still have to rebase this PR, but, other than that, this is ready for your review.

@ilumsden
Copy link
Collaborator

@hariharan-devarajan since the 2024-radiuss-aws branch from this repo has already been merged, you will need to edit the PR to merge into master.

@hariharan-devarajan
Copy link
Collaborator Author

What do u mean edit? U mean base it off master?

The changes I did were minor. Could we just apply the changes on a new PR?

hariharan-devarajan and others added 6 commits August 20, 2024 20:57
This commit corrects logic in the the PyTorch data loader for DYAD.
It also makes various corrections to the text in the DYAD notebook.
The flux-sched image for Ubuntu Jammy has a system install of
UCX 1.12.0. However, we are wanting to use UCX 1.13.1 with DYAD.
This commit updates LD_LIBRARY_PATH to point to UCX 1.13.1 to prevent
runtime issues with DYAD.
In light of the name change of DLIO Profiler to DFTracer, this commit
updates the env file created in the DYAD notebook to use the new names
for environment variables.
This commit fixes a bug in the DYAD PyTorch data loader
that causes 'brokers_per_node' to not be set before reference.
This commit tweaks the DLIO config file to use forking for
multiprocessing instead of spawning
This commit changes cpu-affinity to off when running DLIO for
training for consistency
@ilumsden
Copy link
Collaborator

We could do that if you'd prefer. The stuff I've added to get everything working is pretty minor too.

@vsoch
Copy link
Member

vsoch commented Aug 21, 2024

Please rebase before review - thank you!

@ilumsden
Copy link
Collaborator

@vsoch @hariharan-devarajan so that we can merge into master, I opened #43. That PR is the same as this one plus rebasing.

@vsoch please review that PR.

@ilumsden ilumsden closed this Aug 21, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants