Closes #716 #722

Open: wants to merge 3 commits into base: main

@shamikbose (Contributor) commented Jul 3, 2022

Note: This dataset has a few issues:

  1. The abstracts have to be downloaded from PubMed with eUtils, which is slow because the API is throttled.
  2. The abstracts seem to be assembled inconsistently. In some cases the title is included, but in others it seems to be ignored. As a result, 7 examples have mismatched offsets.
    Is there a standard way these abstracts are formed?
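
For the throttling issue, a minimal sketch of batching PMIDs into fewer E-utilities `efetch` requests and pacing them might look like the following. The helper names, the batch size of 200, and the 0.34 s interval (roughly 3 requests per second, which is an assumption about the limit without an API key) are illustrative and not part of this PR's dataloader.

```python
import time
from urllib.parse import urlencode

# Public NCBI E-utilities efetch endpoint.
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def batched_efetch_urls(pmids, batch_size=200):
    """Yield one efetch URL per batch of PMIDs (batch_size is an assumed value)."""
    for i in range(0, len(pmids), batch_size):
        batch = pmids[i:i + batch_size]
        query = urlencode({"db": "pubmed", "id": ",".join(batch), "retmode": "xml"})
        yield f"{EFETCH_URL}?{query}"


def throttled(items, min_interval=0.34, sleep=time.sleep, clock=time.monotonic):
    """Yield items no faster than one per min_interval seconds.

    sleep and clock are injectable so the pacing can be tested without waiting.
    """
    last = None
    for item in items:
        now = clock()
        if last is not None and now - last < min_interval:
            sleep(min_interval - (now - last))
        last = clock()
        yield item
```

Each URL yielded by `throttled(batched_efetch_urls(pmids))` could then be fetched with `urllib.request.urlopen`; the XML parsing is left out of this sketch.
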
  • Confirm that this PR is linked to the dataset issue.
  • Create the dataloader script biodatasets/my_dataset/my_dataset.py (please use only lowercase and underscore for dataset naming).
  • Provide values for the _CITATION, _DATASETNAME, _DESCRIPTION, _HOMEPAGE, _LICENSE, _URLs, _SUPPORTED_TASKS, _SOURCE_VERSION, and _BIGBIO_VERSION variables.
  • Implement _info(), _split_generators() and _generate_examples() in dataloader script.
  • Make sure that the BUILDER_CONFIGS class attribute is a list with at least one BigBioConfig for the source schema and one for a bigbio schema.
  • Confirm dataloader script works with datasets.load_dataset function.
  • Confirm that your dataloader script passes the test suite run with python -m tests.test_bigbio biodatasets/my_dataset/my_dataset.py. - Note
  • If my dataset is local, I have provided an output of the unit-tests in the PR (please copy paste). This is OPTIONAL for public datasets, as we can test these without access to the data files.

Tagging errors are down from 475 to 247.
The abstract is built as follows:
`{title} {label}: {abstract.label}`
Offsets are mismatched in 7 examples; all others pass.
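
As a rough illustration of that template, the sketch below joins a title and labeled abstract sections into one string and lists the annotations whose offsets do not line up. The function names, the single-space separator, and the toy data are assumptions for illustration; the actual dataloader may assemble the text differently.

```python
def build_text(title, sections):
    """Join the title and (label, body) sections as '{title} {label}: {body} ...'.

    Sections without a label contribute only their body.
    """
    parts = [title]
    parts += [f"{label}: {body}" if label else body for label, body in sections]
    return " ".join(parts)


def mismatched(text, annotations):
    """Return the (start, end, surface) annotations whose slice differs from surface."""
    return [a for a in annotations if text[a[0]:a[1]] != a[2]]


# Toy data, not taken from the dataset.
text = build_text("T.", [("BACKGROUND", "abc"), (None, "def")])
annotations = [(0, 2, "T."), (3, 13, "BACKGROUND"), (0, 1, "x")]
bad = mismatched(text, annotations)
```

Running `mismatched` per PMID gives a list of the examples that still need a closer look.
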
@shamikbose (Contributor, Author)

Some concrete examples of the inconsistent abstract construction:

Example 1:
18239642 '-174G>C' G-174C 394 401 rs1800795 NSM
18239642 '-572G>C' G-572C 375 382 rs1800796 NSM
18239642 '-596A>G' A-596G 356 363 rs1800797 NSM
Abstract:
Title:Modifying effects of IL-6 polymorphisms on body size-associated breast cancer risk.OBJECTIVE:The association between obesity and breast cancer risk is complex. We examined whether the association between body size and breast cancer risk is modified by interleukin-6 (IL6) genotype.METHODS AND PROCEDURES:Five polymorphisms in the IL-6 gene (rs1800797/-596A>G, rs1800796/-572G>C, rs1800795/-174G>C, rs2069832/IVS2G>A, and rs2069849 exon 5 C>T) were studied. We investigated IL6 genotypes and haplotypes with indicators of body size among non-Hispanic white (NHW) and Hispanic/American Indian (AI) breast cancer cases and controls living in the Southwestern United States.RESULTS:We observed lower mean levels of BMI among NHW women who carried one or two copies of the GGCAC haplotype (in order: rs1800797, rs1800796, rs1800795, rs2069832, and rs2069849; P trend 0.02). This haplotype, with an estimated frequency of 43% in NHW study controls, was considerably less common in Hispanic/AI controls (19%). We did not detect significant interactions between IL6 genotypes or haplotypes and BMI categorized as low/normal (<25), overweight (25 to <30), or obese (> or =30) and breast cancer risk in either NHW or Hispanic/AI women. However, we detected consistent and significant interactions between waist-to-hip ratio (WHR) and IL6 rs1800795/-174 G>C genotype for breast cancer risk. These associations were restricted to postmenopausal NHW women. Among women without recent hormone exposure, those with a WHR >0.9 and the rs1800795 GG genotype had a greater than threefold increased risk of breast cancer (odds ratios (ORs) 3.22, 95% confidence intervals (CIs) 1.27, 817) when compared with women with a WHR <0.8 and the rs1800795 GG genotype (P interaction 0.01).DISCUSSION:These data suggest that IL-6 genotypes may influence breast cancer risk in conjunction with central adiposity.

Example 2:
18092344 'c.30T>A' c.30T>A 36 43 rs2043211 NSM
18092344 'p.C10X' p.C10X 45 51 rs2043211 PSM
Abstract:
Title: No association of the CARD8 (TUCAN) c.30T>A (p.C10X) variant with Crohn's disease: a study in 3 independent European cohorts.
BACKGROUND:A recent study reported that the c.30T>A (p.Cys10Ter; rs2043211) variant, in the CARD8 (TUCAN) gene, is associated with Crohn's disease (CD). The aim of this study was to analyze the frequency of p.C10X in 3 independent European (IBD) cohorts from Germany, Hungary, and the Netherlands.METHODS:We included a European IBD cohort of 921 patients and compared the p.C10X genotype frequency to 832 healthy controls. The 3 study populations analyzed were: (1) Germany [CD, n = 317; ulcerative colitis (UC), n = 180], (2) Hungary (CD, n = 149; UC, n = 119), and (3) the Netherlands (CD, n = 156). Subtyping analysis was performed in respect to NOD2 variants (p.Arg702Trp, p.Gly908Arg, c.3020insC) and to clinical characteristics. Ethnically matched controls were included (German, n = 413; Hungarian, n = 202; Dutch, n = 217).RESULTS:We observed no significant difference in p.C10X genotype frequency in either patients with CD or patients with UC compared with controls in all 3 cohorts. Conversely to the initial association study, we found a trend toward lower frequencies of the suggestive risk wild type in CD from the Netherlands compared with controls (P = 0.14). We found neither evidence for genetic interactions between p.C10X and NOD2 nor the C10X variant to be associated with a CD or UC phenotype.CONCLUSIONS:Analyzing 3 independent European IBD cohorts, we found no evidence that the C10X variant in CARD8 confers susceptibility for CD.

In Example 1, the start and end offsets match up if you include the word "Title", but in Example 2 they match up if you exclude it.
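
One possible workaround, sketched here under assumptions, is to try each candidate text variant per document and keep the first one in which every annotated span matches its surface form. The variant names and the exact `"Title: "` prefix string are hypothetical; the title and offsets are taken from Example 2 above.

```python
def find_alignment(candidates, mentions):
    """Return the name of the first candidate text in which all mentions align.

    candidates: dict mapping a variant name to a text.
    mentions: list of (start, end, surface) with end exclusive.
    Returns None if no variant matches every mention.
    """
    for name, text in candidates.items():
        if all(text[start:end] == surface for start, end, surface in mentions):
            return name
    return None


# Title and offsets from Example 2 (PMID 18092344).
title = ("No association of the CARD8 (TUCAN) c.30T>A (p.C10X) variant with "
         "Crohn's disease: a study in 3 independent European cohorts.")
mentions = [(36, 43, "c.30T>A"), (45, 51, "p.C10X")]
candidates = {
    "with_prefix": "Title: " + title,  # assumed prefix format
    "without_prefix": title,
}
chosen = find_alignment(candidates, mentions)
```

For this title the variant without the prefix is the one that aligns, which is consistent with the observation about Example 2.
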

@mariosaenger (Collaborator)

@phlobo What do we want to do with this dataset? It contains only the annotations, not the abstracts / texts. The latter could be downloaded via the API; however, there might be a lot of offset errors due to changed content, etc.

@mariosaenger self-assigned this Oct 28, 2024

@phlobo (Collaborator) commented Oct 28, 2024

> @phlobo What do we want to do with this dataset? It contains only the annotations, not the abstracts / texts. The latter could be downloaded via the API; however, there might be a lot of offset errors due to changed content, etc.

Would it be an option to include the abstracts (e.g., as a zip file) as part of the repo? I guess there are other datasets (MedMentions comes to mind) that redistribute PubMed abstracts as part of a GitHub repo.
