Workflows should enforce push-level tag completeness #3717

Open
RyanZotti opened this issue Sep 13, 2023 · 0 comments

Quilt's Workflow documentation mentions three benefits: consistency, completeness, and context. There should really be a fourth benefit, though it doesn't start with the letter c: freshness.

Let's say I have two tags: author and code. Today's Quilt Workflows don't support push-level completeness for them. Before I give an illustrative example, let me explain how my organization uses Quilt.

My institution uses Quilt for data and GitHub for code. When a user pushes to an existing Quilt package, we require the user to specify two tags:

  • author: An email address. If someone has a question about the data captured in this package version, this is the person to contact
  • code: A link to a commit on GitHub. If someone wants to look at the code that produced the data, they can visit this link

We actually require more than those two tags, but I'm omitting them for simplicity.

Now let me illustrate the problem based on how workflows are designed today.

  • Person A pushes a change to a package. They update the author tag to point to their email address. They update the code tag to point to their code.
  • The next day Person B pushes a change to the same package (after they "browse" the latest version, of course). Person B isn't aware of the company's package standards, so they don't know to update the author or code tag. Person B pushes their change to Quilt. Quilt Workflows examines the push, says "looks great!", and lets it go through. Now the tags are wrong. It looks like Person A authored two versions. Person B doesn't show up anywhere. The data from the second version has nothing to do with the code tag that the data is now associated with.
  • Fast forward a year. Person C looks at the package and has a question about the second version. They look at the code tag and view the code. The code doesn't make any sense - how could it have produced this data? So they look at the author tag. It says Person A. So they contact Person A. Person A has no idea what Person C is talking about. Person A says their code doesn't do that and they have no idea where the data came from. Person A says Person C will have to send an email to the entire department to figure out who might have produced the data from the second version. That's painful.

Example:

s3://example/.quilt/workflows/config.yml contents:

version:
  base: "1"
  catalog: "1"
default_workflow: "basic"
is_workflow_required: True
workflows:
  basic:
    name: Basic requirements
    description: Require code and author tags
    metadata_schema: minimal
    is_message_required: true
schemas:
  minimal:
    url: s3://example/schemas/minimal.json

s3://example/schemas/minimal.json contents:

{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "$id": "https://example.com/minimal.schema.json",
  "required": [
    "author",
    "code"
  ]
}
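
To make the gap concrete, here is a minimal sketch of what a `required` check looks at. It is a hand-rolled stand-in for that one JSON Schema keyword, not Quilt's validator, and the `missing_required` helper and the placeholder email address are assumptions made for illustration. It shows that metadata inherited unchanged from the previous version still satisfies the schema.

```python
# Hand-rolled stand-in for the JSON Schema "required" keyword only.
# This is NOT Quilt's validation code; it just illustrates the presence check.
schema = {"required": ["author", "code"]}


def missing_required(meta, schema):
    """Return the required keys that are absent from the metadata."""
    return [key for key in schema.get("required", []) if key not in meta]


# Metadata pushed by Person A (the email address is a hypothetical placeholder).
person_a_meta = {
    "author": "person-a@example.com",
    "code": "https://github.com/my-company/example/commit/332be7a62b6ae5222aa74f5d525c8221a9393d45",
}

# Person B browses the package and never touches the metadata,
# so their push carries the very same values.
person_b_meta = dict(person_a_meta)

# Both keys are present, so a presence check has nothing to report.
print(missing_required(person_b_meta, schema))
```

The check only asks whether the keys exist. It has no notion of whether the values were set for this particular push.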

Person A

Person A produces some data.

mkdir -p data/a/b/c
echo "abc" > data/a/b/c/file-1.txt
echo "123" > data/a/b/c/file-2.txt

Person A adds correct, up-to-date tags and pushes to the Quilt package.

import quilt3

package_name = "rzotti/example"
registry = "s3://example"
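p = quilt3.Package()
# Add the local directory under the same logical path inside the package.
p.set_dir("data/a/b/c", "data/a/b/c")
# Package-level metadata: these are the tags the workflow schema requires.
p.set_meta({
    "author": "[email protected]",
    "code": "https://github.com/my-company/example/commit/332be7a62b6ae5222aa74f5d525c8221a9393d45"
})
message = "First commit"
p.push(package_name, registry=registry, message=message)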

Person B

Person B adds some data.

echo "xyz" > data/a/b/c/file-3.txt

Person B completely ignores the tags and performs a push. Ideally they should get an error reminding them to update the tags. Today this code works fine.

import quilt3

package_name = "rzotti/example"
registry = "s3://example"
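# browse() loads the latest version, including Person A's author and code tags.
p = quilt3.Package.browse(package_name, registry=registry)
p.set("data/a/b/c/file-3.txt", "data/a/b/c/file-3.txt")
message = "Second commit"
# No set_meta() call: the inherited tags are pushed unchanged and still pass validation.
p.push(package_name, registry=registry, message=message)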

Quilt's Workflow capability already supports throwing errors at the push level for messages (if the user forgets to add a message to their push they get an error reminding them). Ideally Quilt could support similar errors for tags per push too.
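
As a rough illustration of the behavior I'm asking for, here is a client-side sketch of such a check. Everything in it is hypothetical: `find_stale_tags`, `check_push`, and `StaleTagsError` are names I made up, and nothing like this exists in quilt3 today. It treats a tag as stale when it is missing or has the same value as in the previous version.

```python
class StaleTagsError(ValueError):
    """Raised when tags that must be refreshed on every push were left as-is."""


def find_stale_tags(previous_meta, new_meta, per_push_tags):
    """Return the per-push tags that are missing or unchanged in new_meta."""
    stale = []
    for tag in per_push_tags:
        missing = tag not in new_meta
        unchanged = (
            not missing
            and tag in previous_meta
            and new_meta[tag] == previous_meta[tag]
        )
        if missing or unchanged:
            stale.append(tag)
    return stale


def check_push(previous_meta, new_meta, per_push_tags):
    """Hypothetical pre-push hook: fail loudly instead of pushing stale tags."""
    stale = find_stale_tags(previous_meta, new_meta, per_push_tags)
    if stale:
        raise StaleTagsError(
            "Update these tags before pushing: " + ", ".join(stale)
        )


# Person B's situation: metadata inherited from Person A, nothing updated.
previous = {"author": "person-a@example.com", "code": "commit-1"}  # placeholders
inherited = dict(previous)

try:
    check_push(previous, inherited, ["author", "code"])
except StaleTagsError as err:
    print(err)
```

Comparing values is only an approximation. The same author can legitimately push twice in a row, so a real implementation would probably need to know which tags were explicitly set as part of the push rather than whether their values differ.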

I don't have a strong preference about how or where the config.yml or .json should allow the user to specify which tags need to be updated as part of the push as long as I can specify it somewhere.
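
Purely to show one possible shape, the sketch below puts the list of tags next to `is_message_required` in the workflow definition. The `tags_required_per_push` key is invented for this example and is not an existing Quilt option. The config is written as a Python dict rather than YAML so the snippet runs without extra dependencies.

```python
# Hypothetical workflow config, mirroring config.yml as a Python dict.
# "tags_required_per_push" is an invented key, NOT an existing Quilt option.
config = {
    "default_workflow": "basic",
    "workflows": {
        "basic": {
            "name": "Basic requirements",
            "metadata_schema": "minimal",
            "is_message_required": True,
            "tags_required_per_push": ["author", "code"],
        }
    },
}


def per_push_tags(config, workflow=None):
    """Return the tags a workflow wants refreshed on every push (may be empty)."""
    name = workflow or config["default_workflow"]
    return list(config["workflows"][name].get("tags_required_per_push", []))


print(per_push_tags(config))
```

Any other location would work just as well for me, for example a custom keyword inside the JSON schema file.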
