Finalize recipe (+ test new rclone copy stage) #1

Open · wants to merge 33 commits into main
Conversation

jbusecke (Contributor)

This seems like a pretty simple recipe, but running it locally on the hub with

mamba create -n runner0102 python=3.11 -y
conda activate runner0102
pip install pangeo-forge-runner==0.10.2 --no-cache-dir
pangeo-forge-runner bake \
  --repo=./ \
  --Bake.recipe_id=chirps-global-daily \
  -f configs/config_local_hub.py

Blows out the 128 GB of memory! The files are extremely compressed, but they should still fit into memory (1 GB on disk, ~20 GB in memory × 2 files). This smells like the same issue we (@norlandrhagen) had with fsspec ballooning the memory for older recipes and VirtualiZarr generation. What is going on here...?

@jbusecke (Contributor Author)

Oh I did not set the target_chunks
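For reference, a minimal sketch of where target_chunks would go, assuming the recipe uses pangeo_forge_recipes.transforms.StoreToZarr; the URLs, pattern, and chunk sizes below are illustrative placeholders, not the values in this repo.

import apache_beam as beam
from pangeo_forge_recipes.patterns import pattern_from_file_sequence
from pangeo_forge_recipes.transforms import OpenURLWithFSSpec, OpenWithXarray, StoreToZarr

# Hypothetical list of CHIRPS netCDF files, one per year (placeholder URLs).
urls = [f"https://example.org/chirps-v2.0.{year}.days_p05.nc" for year in range(1981, 1983)]
pattern = pattern_from_file_sequence(urls, concat_dim="time")

recipe = (
    beam.Create(pattern.items())
    | OpenURLWithFSSpec()
    | OpenWithXarray()
    | StoreToZarr(
        store_name="chirps-global-daily.zarr",
        combine_dims=pattern.combine_dim_keys,
        # Without target_chunks, whole input files can end up rechunked in
        # memory at once, which matches the blow-up described above.
        target_chunks={"time": 200, "latitude": 400, "longitude": 400},
    )
)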

@jbusecke (Contributor Author)

Ok nice, this is working locally.

Running a test on Dataflow now.

@jbusecke (Contributor Author)

Ok that works out fine. @ks905383 can you check out the test dataset?

You can run this on the LEAP hub:

import xarray as xr
xr.open_dataset("gs://leap-scratch/data-library/feedstocks/chirps_feedstock/chirps-global-daily.zarr", engine='zarr', chunks={})

@ks905383 (Collaborator)

Nice, seems to work for 1981-1982 (and I see you've staged the recipe for the whole timeframe)

@jbusecke (Contributor Author)

@ks905383 awesome. I would love to see what sort of processing you apply to the data, so we can prototype that. If you think that should rather be applied afterwards, we can get this finalized.

@jbusecke (Contributor Author)

I have rebased this on leap-stc/leap-data-management-utils#60 and will see if the copy stage works (moving it to m2lines-testing like in https://github.com/leap-stc/test-transfer) before merging and processing the entire thing.

  • Redirect to proper OSN bucket
  • Update secrets in manager to fit bucket

@jbusecke jbusecke changed the title Test recipe [Not working yet] Finalize recipe (+ test new rclone copy stage) Oct 24, 2024
@jbusecke (Contributor Author)

Ayyyy this worked apparently. Need to run for dinner, but will test output tomorrow! Exciting.

@jbusecke (Contributor Author)

Was able to confirm that the transfer works. Let's change the target bucket/creds and then ingest the entire recipe over at leap-stc/leap-data-management-utils#60.

@jbusecke (Contributor Author)

There is still a bit of a smell here since I have to define the target bucket+path in the recipe and the creds are hardcoded here. @norlandrhagen do you have opinions on how to reconcile this?

@jbusecke (Contributor Author)

This should be all set, but it seems the original server is down at the moment.

cc @ks905383

@norlandrhagen

For this bit?

| CopyRclone(target=catalog_store_urls["chirps-global-daily"].replace("https://nyu1.osn.mghpcc.org/", ""))  # FIXME

Maybe we can separate the prefix from the bucket/path with some leap_data_management_utils help?
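One option (a sketch of a hypothetical helper, not something that exists in leap_data_management_utils today) would be to split the catalog URL into its endpoint and the bucket/path that rclone needs, so the endpoint string never has to be hard-coded in the recipe:

from urllib.parse import urlparse

def split_osn_url(url: str) -> tuple[str, str]:
    """Split an OSN/S3-style https URL into (endpoint, bucket_and_path)."""
    parsed = urlparse(url)
    return f"{parsed.scheme}://{parsed.netloc}", parsed.path.lstrip("/")

# Example (bucket/path is a placeholder):
# split_osn_url("https://nyu1.osn.mghpcc.org/bucket-name/chirps-global-daily.zarr")
# -> ("https://nyu1.osn.mghpcc.org", "bucket-name/chirps-global-daily.zarr")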

@jbusecke (Contributor Author)

jbusecke commented Oct 25, 2024

@norlandrhagen 👍 I think we could possibly do this as a clean-up sprint where we reduce the number of manually entered (and highly interdependent) naming entries for the user?

Tracking this in leap-stc/LEAP_template_feedstock#61

@jbusecke (Contributor Author)

What the hell is this error?

gcsfs.retry.HttpError: The object leap-scratch/data-library/feedstocks/output/chirps_feedstock/chirps-global-daily-11557918430-1/chirps-global-daily.zarr/latitude/1 exceeded the rate limit for object mutation operations (create, update, and delete). Please reduce your request rate. See https://cloud.google.com/storage/docs/gcs429., 429

@jbusecke (Contributor Author)

I guess we were hammering the storage at a rate that is not allowed? See https://cloud.google.com/storage/docs/gcs429

I thought this was the cloud! Hahaha.

@jbusecke (Contributor Author)

This is pretty odd. We also specified max-workers as 50, and it seems to have scaled beyond that?

@jbusecke (Contributor Author)

Ughhh, it might be similar to this, where each write tries to create an empty 'directory'? But that seems counterintuitive to my understanding of object storage...

@jbusecke (Contributor Author)

And I guess max-workers is useless when using Dataflow Prime.
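For what it's worth, a minimal sketch of how horizontal scaling is usually capped through Apache Beam pipeline options when submitting directly to Dataflow (project, region, and bucket are placeholders); whether Dataflow Prime actually honors these caps is exactly the open question here.

from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner="DataflowRunner",
    project="my-gcp-project",            # placeholder
    region="us-central1",                # placeholder
    temp_location="gs://my-bucket/tmp",  # placeholder
    num_workers=10,
    max_num_workers=50,                        # upper bound for horizontal autoscaling
    autoscaling_algorithm="THROUGHPUT_BASED",  # set to "NONE" to pin num_workers
)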

@norlandrhagen

I also hit a max quota error the other week, so there may be some Google Cloud console settings to dial.

@jbusecke (Contributor Author)

I'll try one more time with fewer workers and larger chunks (hopefully resulting in fewer writes?), but I wonder if we could address this more generally by creating an 'empty directory structure' before writing out the chunks. This seems to be what people suggest here and here.
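For the 'empty structure first' idea, a rough sketch of the general pattern (outside of pangeo-forge, with placeholder files and chunk sizes): write all of the Zarr metadata up front with compute=False, then have each worker fill in a data region, so objects like coordinate chunks are written once rather than re-written by many workers.

import xarray as xr

target = "gs://leap-scratch/data-library/feedstocks/chirps_feedstock/chirps-global-daily.zarr"

# Placeholder list of local source files.
input_files = ["chirps-v2.0.1981.days_p05.nc", "chirps-v2.0.1982.days_p05.nc"]

ds = xr.open_mfdataset(input_files, combine="by_coords").chunk(
    {"time": 200, "latitude": 400, "longitude": 400}
)
ds.to_zarr(target, compute=False)  # writes array/group metadata only, no chunk data

# Each worker then writes its own slice into an existing region, e.g.:
# subset.to_zarr(target, region={"time": slice(i0, i1)})  # after dropping vars without a time dim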

@jbusecke (Contributor Author)

Ughhh, it's scaling past the limit I set again... Is this a bug in the runner?

@jbusecke (Contributor Author)

I'll have to look into this more next week. I need to turn to some other things now.
