Question: Could this be used to get a dask cluster running on AWS EMR? #28
Comments
Definitely. I haven't tried it, but it's likely that things will just work. Steps I'd try:
|
That workflow would make a very nice (v/b)log. |
thanks @jcrist. One other question/thought that popped to mind based on this code:

from dask_yarn import YarnCluster
from dask.distributed import Client

# Create a cluster where each worker has two cores and eight GB of memory
cluster = YarnCluster(environment='environment.tar.gz',
                      worker_vcores=2,
                      worker_memory="8GB")

# Scale out to ten such workers
cluster.scale(10)

# Connect to the cluster
client = Client(cluster)

How do subsequent connections to the cluster work? I imagine you wouldn't re-execute this code each time as (I think) it would create a new cluster each time. Would a common workflow involve printing out the master scheduler endpoint, then each time you SSH into the master node, you connect with |
Yes, that is correct. The intent is that you create a cluster, do your work, then shut the cluster down. There's nothing baked into dask-yarn currently for spinning up a persistent cluster (although this would be fairly trivial to write up if it'd be useful to you). Since clusters are fairly quick to spin up, and keeping a persistent one that's idle would hog resources from others, this hasn't been a huge problem so far. If you're wanting to exit the terminal while something is running in the background, you can use more general solutions for this (e.g. |
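The create, work, shut down pattern described above could be sketched roughly as follows. This is only an illustration, not something shipped with dask-yarn: `temporary_cluster` and `make_yarn_cluster` are hypothetical helper names, and the sketch assumes the cluster object exposes `scale(n)` and `shutdown()` the way `YarnCluster` is used in this thread.

```python
from contextlib import contextmanager


@contextmanager
def temporary_cluster(make_cluster, n_workers):
    """Create a cluster, scale it out, and always shut it down afterwards.

    `make_cluster` is any zero-argument callable returning an object with
    `scale(n)` and `shutdown()` methods (an assumption of this sketch).
    """
    cluster = make_cluster()
    try:
        cluster.scale(n_workers)
        yield cluster
    finally:
        # Runs even if the work raised, so idle workers don't hog resources.
        cluster.shutdown()


def make_yarn_cluster():
    # Imported lazily so the helper above does not require dask-yarn
    # to be installed just to be imported.
    from dask_yarn import YarnCluster

    return YarnCluster(environment="environment.tar.gz",
                       worker_vcores=2,
                       worker_memory="8GB")


# Intended usage (needs a working YARN setup, so it is left commented out):
#
#   from dask.distributed import Client
#   with temporary_cluster(make_yarn_cluster, 10) as cluster:
#       client = Client(cluster)
#       ...  # do your work
```

The `try`/`finally` means the cluster is released even when the work fails partway through, which matches the point about not leaving an idle cluster around.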
I see, makes sense. thanks for the quick response. |
I'm currently working on writing up docs on getting started on AWS EMR, but the immediate tl;dr is:
All in all this was a fairly painless process. I hope to get the docs up in the next couple of days, will ping here again once there's a PR. |
Work is happening in #41. I have a nice bootstrap action written up, and just need to write the docs now. If you're feeling antsy, you can try the bootstrap action out already (just upload to s3 and configure in your EMR setup, see https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-bootstrap.html). This sets up a working jupyter notebook server and dask environment, complete with dashboard access. I hope to finish the docs up sometime tomorrow. |
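For anyone configuring the bootstrap action programmatically rather than through the console, a minimal sketch of the relevant piece of configuration is below. It assumes boto3's EMR client and its `run_job_flow` method, which accepts a `BootstrapActions` list; the helper name `bootstrap_action`, the action name, and the bucket path are all hypothetical placeholders, not values from this project.

```python
def bootstrap_action(name, script_path, args=()):
    """Build one entry for the BootstrapActions list that boto3's
    EMR client method run_job_flow accepts."""
    return {
        "Name": name,
        "ScriptBootstrapAction": {
            "Path": script_path,   # where the uploaded script lives
            "Args": list(args),    # extra arguments passed to the script
        },
    }


# Hypothetical bucket and key; replace with wherever the script was uploaded.
action = bootstrap_action("dask-yarn bootstrap", "s3://my-bucket/bootstrap-dask")

# The entry would then be passed along with the rest of the cluster settings:
#
#   import boto3
#   boto3.client("emr").run_job_flow(..., BootstrapActions=[action], ...)
```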
Thanks for the updates & all the work. None of this is too urgent from my end...just want to be able to leverage dask + EMR down the road. I am planning to do some more in-depth experimentation with all this two weekends from now, so hopefully I will be able to provide some useful feedback then. |
Glad to hear it. The initial documentation on EMR is now in. When you get a chance, it would be useful for you to try things out and report back what was confusing/undocumented/could-be-better :). |
Wanted to comment that I used the EMR documentation and it was amazingly clear and helpful - definitely excellent. |
Thanks for this, super helpful. The only thing I found missing is how to connect to the dask dashboard if you are using the ssh tunnel to connect to jupyter, since the url provided by the notebook (/proxy/42727/status) gives a 500 error. |
@manugarri I ran into a similar issue- turns out installing |
Potentially naive question, as I just learned what YARN was at a meetup last night. I think Amazon's EMR service is built around it. With that in mind, could you use this package, or parts of it, to get a dask cluster up and running on EMR?
I know the recommended deployment is using kubernetes, but my company blocked AWS' kubernetes service (EKS) 🤦‍♂️.
Any tips/advice would be greatly appreciated.