fix: revert distinct for associations input file #871
✨ Context
I noticed that the GWAS associations input file (`gs://gwas_catalog_inputs/gwas_catalog_associations_ontology_annotated.tsv`) contained duplicate rows, and added a line to drop duplicates after reading the file. This change introduced an unexpected bug: in the resulting data, `variantId`s were randomly mapped to different `studyId`s in each run (likely due to PySpark's optimisations). Duplicates in the input file are not really a problem, so I am reverting the initial change in this PR.
🛠 What does this PR implement
Reverts the changes made to drop duplicate rows from the associations input file.
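To illustrate how a deduplication step could scramble the mapping, here is a minimal pure-Python sketch of one plausible mechanism, which this PR does not confirm. It assumes the temporary id came from `monotonically_increasing_id()` evaluated after the shuffle that dropping duplicates introduces, and that the plan was evaluated more than once. The helper `monotonic_ids` and the sample rows are hypothetical; the helper only mimics the documented id layout (partition index in the upper bits, row position within the partition in the lower 33 bits).

```python
# Pure-Python illustration; no Spark session is needed.
# All names and data below are hypothetical.


def monotonic_ids(partitions):
    """Assign ids the way Spark documents monotonically_increasing_id():
    partition index in the upper bits, row offset in the lower 33 bits."""
    ids = {}
    for partition_index, rows in enumerate(partitions):
        for offset, row in enumerate(rows):
            ids[(partition_index << 33) + offset] = row
    return ids


# (variantId, studyId) pairs after duplicates have been dropped.
rows = [("variantA", "study1"), ("variantB", "study2"), ("variantC", "study3")]

# Two evaluations of the same data. After a shuffle, the placement of rows
# within partitions is not guaranteed to be the same each time.
first_eval = monotonic_ids([[rows[0], rows[1]], [rows[2]]])
second_eval = monotonic_ids([[rows[2], rows[0]], [rows[1]]])

# Joining on the temporary id across the two evaluations pairs each
# variantId with the studyId of a different row.
mismatched = {first_eval[i][0]: second_eval[i][1] for i in first_eval}
print(mismatched)
```

Both evaluations produce the same set of ids, but each id points at a different row, so any join on the id across evaluations silently mixes rows. Without the extra shuffle, the row order follows the input file more directly, which is consistent with the revert making the problem go away.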
This PR also renames the temporary `studyLocusId` created using `monotonically_increasing_id()` to `rowId`, to distinguish it from the actual `studyLocusId` assigned at a later stage.
🙈 Missing
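🚦 Before submitting
- Do these changes target the `dev` branch?
- Do the changes pass local tests (`make test`)?
- Do the changes pass pre-commit rules (`poetry run pre-commit run --all-files`)?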