fix: revert distinct for associations input file #871

Open
wants to merge 5 commits into
base: dev

vivienho
Contributor

@vivienho vivienho commented Oct 23, 2024

✨ Context

I noticed that the GWAS associations input file (gs://gwas_catalog_inputs/gwas_catalog_associations_ontology_annotated.tsv) contained duplicate rows, so I added a line to drop duplicates after reading the file. That change introduced an unexpected bug: in the resulting data, variantIds were mapped to different studyIds on each run (likely due to PySpark's optimisations). Duplicate rows in the input file are not really a problem, so this PR reverts the initial change.

🛠 What does this PR implement

Reverts the changes made to drop duplicate rows from the associations input file.

This PR also renames the temporary studyLocusId created using monotonically_increasing_id() to rowId, to distinguish it from the actual studyLocusId assigned at a later stage.
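
A minimal pure-Python sketch of one plausible mechanism for the bug, not confirmed in this PR: an id that depends only on row position (as with monotonically_increasing_id()) stops lining up if the rows come back in a different order each time the deduplicated data is evaluated. The row values and the helper names with_row_id and join_on_row_id are hypothetical and are not taken from the codebase.

```python
def with_row_id(rows):
    """Stand-in for monotonically_increasing_id(): the id depends only on row position."""
    return list(enumerate(rows))


def join_on_row_id(left_rows, right_rows):
    """Pair the variantId from one evaluation with the studyId from another, joined on rowId."""
    studies = {row_id: study for row_id, (_, study) in with_row_id(right_rows)}
    return [(variant, studies[row_id]) for row_id, (variant, _) in with_row_id(left_rows)]


# Hypothetical (variantId, studyId) rows, with one duplicate as in the input file.
file_rows = [("v1", "s1"), ("v2", "s2"), ("v2", "s2"), ("v3", "s3")]

# Assumption: deduplication gives no ordering guarantee. Model two evaluations
# that return the same unique rows in different orders.
first_eval = sorted(set(file_rows))
second_eval = sorted(set(file_rows), reverse=True)

# Ids assigned after the dedup no longer refer to the same rows in both evaluations.
unstable = join_on_row_id(first_eval, second_eval)

# Without the dedup, rows keep file order, so the positional ids stay aligned.
stable = join_on_row_id(file_rows, file_rows)

print(unstable)
print(stable)
```

In this toy model the join without deduplication reproduces the original pairs, while the join after deduplication pairs some variantIds with the wrong studyIds, which matches the symptom described above.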

🙈 Missing

🚦 Before submitting

  • Do these changes cover one single feature (one change at a time)?
  • Did you read the contributor guideline?
  • Did you make sure to update the documentation with your changes?
  • Did you make sure there is no commented out code in this PR?
  • Did you follow conventional commits standards in PR title and commit messages?
  • Did you make sure the branch is up-to-date with the dev branch?
  • Did you write any new necessary tests?
  • Did you make sure the changes pass local tests (make test)?
  • Did you make sure the changes pass pre-commit rules (e.g. poetry run pre-commit run --all-files)?

@github-actions github-actions bot added bug Something isn't working Datasource size-XS labels Oct 23, 2024
@github-actions github-actions bot removed the Method label Oct 23, 2024
@vivienho vivienho marked this pull request as ready for review October 25, 2024 11:30
Labels
bug Something isn't working Datasource size-S