Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Invalid columns in compact manifest for AnVIL #6110

Closed
hannes-ucsc opened this issue Mar 31, 2024 · 15 comments
Closed

Invalid columns in compact manifest for AnVIL #6110

hannes-ucsc opened this issue Mar 31, 2024 · 15 comments
Assignees
Labels
+ [priority] High API API change affecting callers bug [type] A defect preventing use of the system as specified demo [process] To be demonstrated at the end of the sprint demoed [process] Successfully demonstrated to team manifests [subject] Generation and contents of manifests orange [process] Done by the Azul team spike:2 [process] Spike estimate of two points

Comments

@hannes-ucsc
Copy link
Member

hannes-ucsc commented Mar 31, 2024

species for example, but there could be others. There may also be missing columns or extra columns. We didn't put a lot of effort into curating the manifest config for AnVIL. The default approach should be to include all fields as columns in a compact manifest and to derive the manifest config from the field mapping for the /index/… endpoints.

The HCA manifest is a hand-curated set of fields but we've learned that no matter what set of fields we pick, there will always be people requesting yet another field to be included, so maintaining that set became a whack-a-mole. There will always be people who request that fields be removed because they don't see a use case for those fields, which seems short-sighted/pedantic to me. The only valid reasons not to include a field should be technical limitations, e.g., size, cardinality.

The current manifest config for AnVIL also uses a naming convention that is inconsistent with the naming of fields in the /index/… endpoint responses. The Data Browser doesn't currently expose the compact manifest so we should be able to rename columns using the same dotted convention used for the /index/… endpoints which itself is based on the naming of fields in the AnVIL schema.

[edit] Quick link to important comment detailing the implementation plan.

@github-actions github-actions bot added the orange [process] Done by the Azul team label Mar 31, 2024
@dsotirho-ucsc dsotirho-ucsc added bug [type] A defect preventing use of the system as specified manifests [subject] Generation and contents of manifests labels Apr 2, 2024
@dsotirho-ucsc dsotirho-ucsc added the - [priority] Medium label Apr 2, 2024
hannes-ucsc added a commit that referenced this issue Apr 3, 2024
This is the main part, it does break the IT.
hannes-ucsc added a commit that referenced this issue Apr 3, 2024
Name special fields in camel case on AnVIL. This repairs the IT.
@hannes-ucsc hannes-ucsc added the API API change affecting callers label Apr 3, 2024
hannes-ucsc added a commit that referenced this issue Apr 3, 2024
This is the main part, it does break the IT.
hannes-ucsc added a commit that referenced this issue Apr 3, 2024
Name special fields in camel case on AnVIL. This repairs the IT.
hannes-ucsc added a commit that referenced this issue Apr 3, 2024
This is the main part, it does break the IT.
hannes-ucsc added a commit that referenced this issue Apr 3, 2024
Name special fields in camel case on AnVIL. This repairs the IT.
hannes-ucsc added a commit that referenced this issue Apr 3, 2024
Name special fields in camel case on AnVIL. This repairs the IT.
@hannes-ucsc
Copy link
Member Author

For demo, show compact manifest on anvilprod. Review column headers. Show that species column is gone and that organism_type is there. Compare manifest with /index/files response. Download a file using a URL from the manifest.

@hannes-ucsc hannes-ucsc added the demo [process] To be demonstrated at the end of the sprint label Apr 3, 2024
dsotirho-ucsc pushed a commit that referenced this issue Apr 3, 2024
dsotirho-ucsc pushed a commit that referenced this issue Apr 3, 2024
Name special fields in camel case on AnVIL. This repairs the IT.
@hannes-ucsc
Copy link
Member Author

hannes-ucsc commented May 7, 2024

Discovered during demo that the manifest contains additional columns compared to the /index/files endpoint, for example, activities.reference_assembly.

@hannes-ucsc
Copy link
Member Author

Additionally, the service response contains the files.uuid field which the manifest omits. It seems to have the same value as files.document_id. It could be that the service response field is redundant.

@hannes-ucsc
Copy link
Member Author

hannes-ucsc commented May 7, 2024

The manifest has files.file_size. The service response has files.size. Same for .file_name vs .name. And .file_url and .url

@hannes-ucsc
Copy link
Member Author

files.accessible is missing from manifest.

@achave11-ucsc
Copy link
Member

Assignee to consider next steps.

@hannes-ucsc
Copy link
Member Author

hannes-ucsc commented Aug 23, 2024

Spike in anvilprod to systematically compare the set of columns in the manifest with that of the inner entity properties in the /index/files endpoint. The set comparison should yield three sets, that of columns only in the manifest, that of properties only the endpoint and that of fields with other inconsistencies, e.g. a type mismatch.

@hannes-ucsc hannes-ucsc added the spike:1 [process] Spike estimate of one point label Aug 23, 2024
@hannes-ucsc hannes-ucsc removed their assignment Aug 23, 2024
@dsotirho-ucsc dsotirho-ucsc self-assigned this Aug 23, 2024
@dsotirho-ucsc
Copy link
Contributor

dsotirho-ucsc commented Aug 23, 2024

Only in the manifest

diagnoses.document_id
diagnoses.source_datarepo_row_ids
diagnoses.diagnosis_id

Only in the /index/{entity} response

activities.accessible
biosamples.accessible
bundles.accessible
datasets.description
datasets.accessible
donors.accessible
files.version
files.uuid
files.size (duplicate of `files.file_size`)
files.name (duplicate of `files.file_name`)
files.accessible

Mismatches

manifest          | /index/{entity} repsonse
------------------------------------------
bundle_uuid       | bundles.bundle_uuid
bundle_version    | bundles.bundle_version
files.file_url    | files.url
source_id         | sources.source_id
source_spec       | sources.source_spec

@hannes-ucsc
Copy link
Member Author

For demo, repeat the spike (before demo) and review differences (during demo). I would like to be present at that demo.

@dsotirho-ucsc dsotirho-ucsc added the demoed [process] Successfully demonstrated to team label Oct 29, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
+ [priority] High API API change affecting callers bug [type] A defect preventing use of the system as specified demo [process] To be demonstrated at the end of the sprint demoed [process] Successfully demonstrated to team manifests [subject] Generation and contents of manifests orange [process] Done by the Azul team spike:2 [process] Spike estimate of two points
Projects
None yet
Development

No branches or pull requests

3 participants