Testing plan/notes #1
Spark Things
Tweaking pyspark config:
conf = (SparkConf()
        .setAppName("implicit_benchmark")
        .setMaster('local[*]')  # Run Spark locally with as many worker threads as logical cores on your machine.
        .set('spark.driver.memory', '16G')
       )
Spark repartitioning:
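The repartitioning snippet itself did not survive in these notes; a minimal sketch of what it might look like, where `spark`, the input path, and the partition counts are all illustrative assumptions rather than values from the original:

```python
# Hedged sketch: the original repartitioning code was lost from the notes.
# `spark`, the input path, and the partition counts are illustrative.
from pyspark.sql import SparkSession

spark = SparkSession.builder.config(conf=conf).getOrCreate()
df = spark.read.parquet("...")   # placeholder input path
df = df.repartition(200)         # full shuffle into 200 partitions
# df = df.coalesce(50)           # narrows partitions without a full shuffle
```

`repartition(n)` always shuffles; `coalesce(n)` only merges existing partitions, so it is cheaper when reducing the partition count.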
Dask Things
resources:
notes from talking to Martin Durant:
|
Test this out: dask/dask-yarn#28
Raise issues for the following:
>>> bag = dask.bag.read_avro(urlpath, storage_options = {'profile_name': AWS_PROFILE})
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/anaconda3/envs/gandalf/lib/python3.6/site-packages/dask/bag/avro.py", line 103, in read_avro
heads, sizes = zip(*out)
ValueError: not enough values to unpack (expected 2, got 0)
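The ValueError comes from the `heads, sizes = zip(*out)` unpacking when `out` is empty, which commonly means no files matched the urlpath (or the storage options prevented listing them). A stand-alone reproduction of just that unpacking failure:

```python
# When read_avro matches no files, its internal result list is empty,
# so zip(*[]) yields zero values instead of the expected two.
out = []  # simulates "no files matched the urlpath"
try:
    heads, sizes = zip(*out)
except ValueError as e:
    print(e)  # not enough values to unpack (expected 2, got 0)
```

A clearer error ("no files found matching urlpath") at that point would make the issue easier to diagnose.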
Testing Plan
Dummy Credit Card Application Dataset
Test 1
Leave the default schedulers
Modifications
Test 2 - Running Some Calcs
Modifications
Test 3 - Running some Python UDFs
Test 4 - Scaling on a Single Machine
NYC Taxi Public Dataset
Coming soon...
Sample ETL Workflows