Finalize recipe (+ test new rclone copy stage) #1

Open · wants to merge 33 commits into main
Conversation

jbusecke (Contributor)

This seems like a pretty simple recipe, but running it locally on the hub with

mamba create -n runner0102 python=3.11 -y
conda activate runner0102
pip install pangeo-forge-runner==0.10.2 --no-cache-dir
pangeo-forge-runner bake \
  --repo=./ \
  --Bake.recipe_id=chirps-global-daily \
  -f configs/config_local_hub.py

Blows out the 128 GB of memory! The files are extremely compressed, but they should still fit into memory (1 GB on disk, ~20 GB in memory × 2 files). This smells like the same issue we (@norlandrhagen) had with fsspec ballooning the memory for older recipes and VirtualiZarr generation. What is going on here...?

@jbusecke (Contributor Author)

Oh I did not set the target_chunks
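For reference, a minimal sketch of where target_chunks would go, assuming the recipe uses pangeo_forge_recipes.transforms.StoreToZarr; the URLs, pattern, and chunk sizes below are illustrative placeholders, not the values in this repo.

import apache_beam as beam
from pangeo_forge_recipes.patterns import pattern_from_file_sequence
from pangeo_forge_recipes.transforms import OpenURLWithFSSpec, OpenWithXarray, StoreToZarr

# Hypothetical list of CHIRPS netCDF files, one per year (placeholder URLs).
urls = [f"https://example.org/chirps-v2.0.{year}.days_p05.nc" for year in range(1981, 1983)]
pattern = pattern_from_file_sequence(urls, concat_dim="time")

recipe = (
    beam.Create(pattern.items())
    | OpenURLWithFSSpec()
    | OpenWithXarray()
    | StoreToZarr(
        store_name="chirps-global-daily.zarr",
        combine_dims=pattern.combine_dim_keys,
        # Without target_chunks, whole input files can end up rechunked in
        # memory at once, which matches the blow-up described above.
        target_chunks={"time": 200, "latitude": 400, "longitude": 400},
    )
)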

@jbusecke (Contributor Author)

Ok nice, this is working locally.

Running a test on Dataflow now.

@jbusecke (Contributor Author)

Ok that works out fine. @ks905383 can you check out the test dataset?

You can run this on the LEAP hub:

import xarray as xr
xr.open_dataset("gs://leap-scratch/data-library/feedstocks/chirps_feedstock/chirps-global-daily.zarr", engine='zarr', chunks={})

@ks905383 (Collaborator)

Nice, seems to work for 1981-1982 (and I see you've staged the recipe for the whole timeframe)

@jbusecke (Contributor Author)

@ks905383 awesome. I would love to see what sort of processing you apply to the data, so we can prototype that. If you think that should rather be applied afterwards, we can get this finalized.

@jbusecke (Contributor Author)

I have rebased this on leap-stc/leap-data-management-utils#60 and will see if the copy stage works (moving it to m2lines-testing like in https://github.com/leap-stc/test-transfer) before merging and processing the entire thing.

  • Redirect to proper OSN bucket
  • Update secrets in manager to fit bucket

@jbusecke jbusecke changed the title Test recipe [Not working yet] Finalize recipe (+ test new rclone copy stage) Oct 24, 2024
@jbusecke (Contributor Author)

Ayyyy this worked apparently. Need to run for dinner, but will test output tomorrow! Exciting.

@jbusecke (Contributor Author)

Was able to confirm that the transfer works. Let's change the target bucket/creds and then ingest the entire recipe over at leap-stc/leap-data-management-utils#60.

@jbusecke (Contributor Author)

There is still a bit of a smell here since I have to define the target bucket+path in the recipe and the creds are hardcoded here. @norlandrhagen do you have opinions on how to reconcile this?

@jbusecke (Contributor Author)

This should be all set, but it seems the original server is down at the moment.

cc @ks905383

@norlandrhagen

For this bit?

| CopyRclone(target=catalog_store_urls["chirps-global-daily"].replace("https://nyu1.osn.mghpcc.org/", ""))  # FIXME

Maybe we can separate the prefix from the bucket/path with some leap_data_management_utils help?
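One option (a sketch of a hypothetical helper, not something that exists in leap_data_management_utils today) would be to split the catalog URL into its endpoint and the bucket/path that rclone needs, so the endpoint string never has to be hard-coded in the recipe:

from urllib.parse import urlparse

def split_osn_url(url: str) -> tuple[str, str]:
    """Split an OSN/S3-style https URL into (endpoint, bucket_and_path)."""
    parsed = urlparse(url)
    return f"{parsed.scheme}://{parsed.netloc}", parsed.path.lstrip("/")

# Example (bucket/path is a placeholder):
# split_osn_url("https://nyu1.osn.mghpcc.org/bucket-name/chirps-global-daily.zarr")
# -> ("https://nyu1.osn.mghpcc.org", "bucket-name/chirps-global-daily.zarr")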

@jbusecke (Contributor Author)

jbusecke commented Oct 25, 2024

@norlandrhagen 👍 I think we could possibly do this as a clean-up sprint where we reduce the number of manually entered (and highly interdependent) naming entries for the user?

Tracking this in leap-stc/LEAP_template_feedstock#61

@jbusecke (Contributor Author)

What the hell is this error?

gcsfs.retry.HttpError: The object leap-scratch/data-library/feedstocks/output/chirps_feedstock/chirps-global-daily-11557918430-1/chirps-global-daily.zarr/latitude/1 exceeded the rate limit for object mutation operations (create, update, and delete). Please reduce your request rate. See https://cloud.google.com/storage/docs/gcs429., 429

@jbusecke (Contributor Author)

I guess we were hammering the storage at a rate that is not allowed? See https://cloud.google.com/storage/docs/gcs429

I thought this was the cloud! Hahaha.

@jbusecke (Contributor Author)

This is pretty odd. We also specified max-workers as 50, and it seems to have scaled beyond that?

@jbusecke (Contributor Author)

Ughhh, it might be similar to this, where each write tries to create an empty 'directory'? But that seems counterintuitive to my understanding of object storage...

@jbusecke (Contributor Author)

And I guess max-workers is useless when using Dataflow Prime.
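For what it's worth, a minimal sketch of how horizontal scaling is usually capped through Apache Beam pipeline options when submitting directly to Dataflow (project, region, and bucket are placeholders); whether Dataflow Prime actually honors these caps is exactly the open question here.

from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner="DataflowRunner",
    project="my-gcp-project",            # placeholder
    region="us-central1",                # placeholder
    temp_location="gs://my-bucket/tmp",  # placeholder
    num_workers=10,
    max_num_workers=50,                        # upper bound for horizontal autoscaling
    autoscaling_algorithm="THROUGHPUT_BASED",  # set to "NONE" to pin num_workers
)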

@norlandrhagen

I also hit a max quota error the other week, so there may be some Google Cloud console settings to dial.

@jbusecke (Contributor Author)

I'll try one more time with fewer workers and larger chunks (hopefully resulting in fewer writes?), but I wonder if we could address this more generally by creating an 'empty directory structure' before writing out the chunks. This seems to be what people suggest here and here.
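For the 'empty structure first' idea, a rough sketch of the general pattern (outside of pangeo-forge, with placeholder files and chunk sizes): write all of the Zarr metadata up front with compute=False, then have each worker fill in a data region, so objects like coordinate chunks are written once rather than re-written by many workers.

import xarray as xr

target = "gs://leap-scratch/data-library/feedstocks/chirps_feedstock/chirps-global-daily.zarr"

# Placeholder list of local source files.
input_files = ["chirps-v2.0.1981.days_p05.nc", "chirps-v2.0.1982.days_p05.nc"]

ds = xr.open_mfdataset(input_files, combine="by_coords").chunk(
    {"time": 200, "latitude": 400, "longitude": 400}
)
ds.to_zarr(target, compute=False)  # writes array/group metadata only, no chunk data

# Each worker then writes its own slice into an existing region, e.g.:
# subset.to_zarr(target, region={"time": slice(i0, i1)})  # after dropping vars without a time dim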

@jbusecke (Contributor Author)

Ughhh, it's scaling past the limit I set again... Is this a bug in the runner?

@jbusecke (Contributor Author)

I'll have to look into this more next week. I need to turn to some other things now.
