[Jobs] Allowing to specify intermediate bucket for file upload #4257

Open · wants to merge 64 commits into master
Conversation

zpoint (Collaborator) commented on Nov 4, 2024

Feature for #3978

  • Add a config jobs: bucket in ~/.sky/config.yaml.
    • Support s3, gcs, azure, r2, ibm
    • If this bucket is specified, the bucket name and storage type will be used. Otherwise, one will be generated named: skypilot-filemounts-{username}-{run_id}.
    • Add an attribute to the storage YAML config: _bucket_sub_path (both COPY and MOUNT modes honor it; if a subpath is specified, the user only sees that subpath instead of the whole bucket). A config sketch follows this list.
      • The leading _ means internal use only; remove the _ to expose it in the future.
    • smoke test
      • Add a smoke test case: pytest -s tests/test_smoke.py::test_managed_jobs_intermediate_storage
        • Verify that the new config file is loaded and intermediate bucket names in the config are used.
      • Add a test case: pytest -s tests/test_smoke.py::TestStorageWithCredentials::test_bucket_sub_path --gcp
        • Verify that the bucket subdirectory file sync and deletion have the expected behavior. Mainly, ensure the _bucket_sub_path works as expected.
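
For illustration, a minimal sketch of how the config and the storage attribute fit together. The bucket names and paths below are hypothetical, and the exact placement of _bucket_sub_path inside a task's storage config is an assumption (the attribute is internal-only for now):

# ~/.sky/config.yaml: point managed jobs at an existing intermediate bucket.
# Per the schema pattern discussed below, s3://, gs://, https:// (azure),
# r2:// and cos:// (ibm) URIs are accepted.
jobs:
  bucket: s3://my-intermediate-bucket

# task.yaml: a storage entry using the internal subpath attribute.
file_mounts:
  /checkpoint:
    name: my-existing-bucket
    source: ~/data/checkpoints
    mode: MOUNT
    _bucket_sub_path: experiments/run-1   # only this prefix of the bucket is visible to the task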

For managed jobs:

  • All these file sync operations now use the intermediate bucket. Previously, each file sync would create a new temporary bucket:
    • Previous:
      • workdir specified in the task.yaml workdir -> {tmp-bucket-A}/workdir
      • Local file mount specified in the task.yaml file_mount -> {tmp-bucket-B}{i}
      • Temporary files tmp_files -> {tmp-bucket-C}/tmp-files
    • Now:
      • workdir specified in the task.yaml workdir -> {intermediate-bucket}/job-{run_id}/workdir
      • Local file mount specified in the task.yaml file_mount -> {intermediate-bucket}/job-{run_id}/local-file-mounts/{i}
      • Temporary files tmp_files -> {intermediate-bucket}/job-{run_id}/tmp-files
  • Clean up
    • After the managed job finishes and cleanup is triggered:
      • If the user doesn't specify a bucket in the config, the SkyPilot-generated {intermediate-bucket}/job-{run_id} bucket is deleted automatically, as before.
      • If the user specifies a bucket name in the config, only the directory created above is deleted; the bucket itself still exists afterwards, and the user needs to delete it manually as usual (due to a potential race problem).
    • Currently aws, gcs, and r2 use CLI commands to remove the directory, while cos and https (azure) use the SDK to list all files under the directory and then delete them.

Even when jobs use the same bucket, each job creates its own subdirectory based on the {run_id}, and the subdirectories for auto-created files are cleaned up after the job finishes.
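
To make the layout concrete, here is a sketch of the shared bucket's contents for a single job, with a hypothetical bucket name and run ID (the paths follow the mapping above):

s3://my-intermediate-bucket/
  job-abc123/                    # per-job subdirectory, removed on cleanup
    workdir/                     # synced task workdir
    local-file-mounts/0, 1, ...  # local file mounts from the task YAML
    tmp-files/                   # temporary files

The bucket itself is kept after cleanup when it was specified by the user; only the job-abc123/ prefix is removed.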

This allows users who don't have permission to create buckets to manually specify an existing bucket name in ~/.sky/config.yaml.

Test plan

smoke test

pytest -s tests/test_smoke.py::test_managed_jobs_storage
pytest -s tests/test_smoke.py::TestStorageWithCredentials::test_bucket_sub_path --aws

custom test

(sky) ➜  cat ~/.sky/config.yaml
jobs:
  bucket: s3://zpoint-bucket-s3/
(sky) ➜ cat ~/Desktop/hello-sky/work_dir_1.yaml
name: test_workdir_bucket_name_1

workdir: .

resources:
  cloud: aws
  instance_type: t3.small

file_mounts:
  # this will use the user-specified storage config below (its own bucket)
  /checkpoint:
    name: zpoint-filemounts-bucket
    source: ~/Desktop/dir1
    mode: MOUNT
    store: azure

  # these will all use the same bucket configured under jobs->bucket in ~/.sky/config.yaml for their storage
  /dir1: ~/Desktop/dir1
  /dir2: ~/Desktop/dir2
  /dir3/dir3.py: ~/Desktop/dir1/dir1.py

run: |
  sleep 10
  ls /checkpoint
  ls .

Launch

(sky) ➜ sky jobs launch ~/Desktop/hello-sky/work_dir_1.yaml
⠋ Syncing ~/Desktop/dir1 -> https://xxx/zpoint-filemounts-bucket/
⠹ Syncing . -> s3://zpoint-bucket-s3/
⠧ Syncing ~/Desktop/dir1 -> s3://zpoint-bucket-s3/
⠦ Syncing ~/Desktop/dir2 -> s3://zpoint-bucket-s3/
⠴ Syncing /var/folders/83/zxqx914s57x310rfnhq8kk9r0000gn/T/skypilot-filemounts-files-aca97801 -> s3://zpoint-bucket-s3/

Looks good

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Any manual or new tests for this PR (please specify below)
  • All smoke tests: pytest tests/test_smoke.py
  • Relevant individual smoke tests: pytest tests/test_smoke.py::test_fill_in_the_name
  • Backward compatibility tests: conda deactivate; bash -i tests/backward_compatibility_tests.sh

romilbhardwaj self-requested a review on November 6, 2024 18:42
romilbhardwaj (Collaborator) left a comment:


Thanks @zpoint! Left some comments about the design, correctness and lifecycle management for the bucket contents.

zpoint requested a review from romilbhardwaj on November 8, 2024 09:13
zpoint requested a review from romilbhardwaj on November 9, 2024 15:39
Comment on lines 715 to 719
jobs_configs['properties']['bucket'] = {
'type': 'string',
'pattern': '^(https|s3|gs|r2|cos)://.+',
'required': []
}
Collaborator:
Maybe we should include it in controller_resources_schema and enable this feature for SkyServe as well?

zpoint (Collaborator, Author):

Added to controller_resources_schema.

Could you point me to the code to change if I need to enable this for SkyServe?

romilbhardwaj (Collaborator) commented:
Thanks @zpoint! I'm trying to run this yaml:

resources:
  cloud: aws

file_mounts:
  ~/aws: ~/aws

workdir: ~/tmp-workdir

num_nodes: 1

run: |
  echo "Hello, world!"
  ls ~/aws
  ls .

The task output should show me contents of my workdir and file_mount. Instead, I get:

├── Waiting for task resources on 1 node.
└── Job started. Streaming logs... (Ctrl-C to exit log streaming; job will not be killed)
(sky-3ba1-romilb, pid=2100) Hello, world!
(sky-3ba1-romilb, pid=2100) job-89ab894f
(sky-3ba1-romilb, pid=2100) job-89ab894f
✓ Managed job finished: 1 (status: SUCCEEDED).

Related to this: #4257 (comment)?

zpoint requested a review from romilbhardwaj on November 12, 2024 10:01
zpoint requested a review from romilbhardwaj on December 14, 2024 11:05
is_sky_managed: Optional[bool] = None,
sync_on_reconstruction: Optional[bool] = True,
_bucket_sub_path: Optional[str] = None, # pylint: disable=invalid-name
_allow_bucket_creation: bool = True): # pylint: disable=invalid-name
Collaborator:

Why is this _allow_bucket_creation arg needed?

Can we instead check store.is_sky_managed field after the store's initialization to check if bucket was created, and if it was created then raise an error?

Eg. near L727 in controller_utils.py, we can initialize the store object and check is_sky_managed, and raise error appropriately.

store = store_cls(name=bucket_name,
                  source=workdir,
                  _bucket_sub_path=bucket_sub_path,
                  **store_kwargs)
if store.is_sky_managed:
    # The bucket was newly created, which should not happen since the
    # user-specified bucket is expected to already exist.
    # Clean up and raise an error.
    store.delete()
    raise ...

Let me know if I'm missing something. It would be great if we could avoid adding new arguments to the Store/Storage interface.

zpoint (Collaborator, Author):

We can do that in controller_utils.py. It avoids adding a parameter, but keeps the logic outside of the store. If we're not going to reuse the bucket-creation restriction elsewhere, handling it in controller_utils.py will be fine.

zpoint requested a review from romilbhardwaj on December 20, 2024 04:34
zpoint (Collaborator, Author) commented on Dec 22, 2024

@romilbhardwaj Review reminder 😊

romilbhardwaj (Collaborator) left a comment:


Thanks @zpoint! Tested functionality - works nicely. Left some comments on the code.

Comment on lines 669 to 670
f'Configured to use a non-existent bucket: {store.name}, '
'please check the jobs->bucket in your sky config.')
Collaborator:

Suggested change:
- f'Configured to use a non-existent bucket: {store.name}, '
- 'please check the jobs->bucket in your sky config.')
+ f'Jobs bucket {store.name!r} does not exist. '
+ 'Please check jobs.bucket configuration in your SkyPilot config.')

Collaborator:

BTW, the logging shown above is quite confusing ("Created S3 bucket", "Deleted S3 bucket"). Is there a way to hide it / different way to check if the bucket already exists?

I 12-23 14:28:46 controller_utils.py:691] ⚙︎ Translating workdir to SkyPilot Storage...
I 12-23 14:28:47 storage.py:1590]   Created S3 bucket 'romil-jobs-bucket2' in us-east-1
I 12-23 14:28:54 storage.py:1386] Deleted S3 bucket romil-jobs-bucket2.
sky.exceptions.StorageBucketCreateError: Configured to use a non-existent bucket: romil-jobs-bucket2, please check the jobs->bucket in your sky config.

zpoint (Collaborator, Author):

Found a way to hide the above logs by using sky_logging.silent(). The output is now:

⚙︎ Translating workdir and file_mounts with local source paths to SkyPilot Storage...
sky.exceptions.StorageBucketCreateError: Jobs bucket 'zpointggg-bucket-s3-intermediate' does not exist.  Please check jobs.bucket configuration in your SkyPilot config.

zpoint requested a review from romilbhardwaj on December 24, 2024 04:15
zpoint requested a review from romilbhardwaj on December 27, 2024 04:38
romilbhardwaj (Collaborator) left a comment:


Thanks @zpoint! lgtm, we can merge this after storage related smoke tests pass.
