
Develop a new data ingest / ETL pipeline for indexing eQTL data into the new mongo database #3

Open · 4 of 8 tasks
karatugo opened this issue Jul 10, 2024 · 5 comments
karatugo (Member) commented Jul 10, 2024

We need to develop a robust, scalable data ingest/ETL (Extract, Transform, Load) pipeline that reads eQTL (expression Quantitative Trait Loci) data from FTP sources, indexes it into a MongoDB database, and serves it via an API. This pipeline will ensure efficient data extraction, transformation, and retrieval to support downstream analysis and querying through a web service.

  • Scalable data ingest/ETL pipeline
  • Read from FTP sources
  • Ingest with the correct schema
  • Save to MongoDB
  • Index MongoDB (is it automatic? discuss with the DBA team)
  • Deploy to Sandbox
  • Deploy to Prod
  • Implement API
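The checklist above breaks down into three pipeline stages: extract from FTP, transform into schema-shaped documents, and load into MongoDB. A minimal sketch of what each stage could look like, using only the Python standard library (function names and FTP details are illustrative, not from the actual service):

```python
import gzip
import io
from ftplib import FTP

def extract(ftp_host: str, remote_path: str) -> bytes:
    """Download a gzipped TSV from the FTP source and return its raw bytes."""
    buf = io.BytesIO()
    with FTP(ftp_host) as ftp:
        ftp.login()  # anonymous login; adjust if credentials are required
        ftp.retrbinary(f"RETR {remote_path}", buf.write)
    return buf.getvalue()

def transform(raw: bytes) -> list[dict]:
    """Decompress and parse a gzipped TSV into one dict per row, keyed by header."""
    text = gzip.decompress(raw).decode("utf-8")
    lines = text.strip().split("\n")
    header = lines[0].split("\t")
    return [dict(zip(header, line.split("\t"))) for line in lines[1:]]

def load(docs: list[dict], collection) -> None:
    """Bulk-insert the documents into a MongoDB (pymongo-style) collection."""
    if docs:
        collection.insert_many(docs)
```

In a real deployment each stage would be wrapped with retries and batching, but the stage boundaries stay the same.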
karatugo (Member Author) commented:
Files to Index

  • QTD0000*.all.tsv.gz: Contains comprehensive eQTL data. This should be the primary source for indexing.
  • QTD0000*.cc.tsv.gz: Contains specific eQTL data (likely condition-specific or subset). Also useful for indexing.
  • QTD0000*.permuted.tsv.gz: Contains permuted eQTL data for significance testing. Useful for specific analyses but not primary indexing.

Suggested MongoDB Schema

Here's a refined schema to capture the necessary details from these files:

  1. Study Information:

    • study_id: QTD000021
    • study_name: "Sample eQTL Study"
  2. Sample Information:

    • sample_id: Auto-generated or derived from context if available?
  3. eQTL Information:

    • molecular_trait_id: Corresponding trait ID.
    • molecular_trait_object_id: Object ID for the molecular trait.
    • chromosome: Chromosome number.
    • position: Position on the chromosome.
    • ref: Reference allele.
    • alt: Alternative allele.
    • variant: Variant identifier.
    • ma_samples: Minor allele sample count.
    • maf: Minor allele frequency.
    • pvalue: P-value of the association.
    • beta: Effect size.
    • se: Standard error.
    • type: Variant type (e.g., SNP).
    • aan: Additional annotation number.
    • r2: R-squared value.
    • gene_id: Gene identifier.
    • median_tpm: Median TPM (Transcripts Per Million).
    • rsid: Reference SNP ID.
  4. Permuted eQTL Information:

    • p_perm: Permuted p-value.
    • p_beta: Permuted beta value.

Example MongoDB Document Structure

{
  "study_id": "QTD000021",
  "study_name": "Sample eQTL Study",
  "samples": [
    {
      "sample_id": "sample001",
      "eqtls": [
        {
          "molecular_trait_id": "ENSG00000187583",
          "molecular_trait_object_id": "ENSG00000187583",
          "chromosome": "1",
          "position": 14464,
          "ref": "A",
          "alt": "T",
          "variant": "chr1_14464_A_T",
          "ma_samples": 41,
          "maf": 0.109948,
          "pvalue": 0.15144,
          "beta": 0.25567,
          "se": 0.17746,
          "type": "SNP",
          "aan": 42,
          "r2": 382,
          "gene_id": "ENSG00000187583",
          "median_tpm": 0.985,
          "rsid": "rs546169444",
          "permuted": {
            "p_perm": 0.000999001,
            "p_beta": 3.3243e-12
          }
        }
      ]
    }
  ]
}

Steps to Implement

  1. Extract Data:

    • Parse QTD0000*.all.tsv.gz and QTD0000*.cc.tsv.gz to extract eQTL data.
    • Parse QTD0000*.permuted.tsv.gz to extract permuted data and merge with the main eQTL data.
  2. Transform Data:

    • Normalize data fields and structure according to the MongoDB schema.
  3. Load Data:

    • Insert the structured documents into MongoDB.
    • Ensure appropriate indexes on fields such as gene_id, chromosome, position, and variant for efficient querying.
  4. API Development:

    • Develop endpoints for querying the eQTL data based on different parameters.
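For the merge in step 1, the permuted files carry one row per molecular trait, so a lookup keyed on molecular_trait_id is enough. A sketch (column names follow the schema above; the function name is illustrative):

```python
def merge_permuted(eqtls: list[dict], permuted: list[dict]) -> list[dict]:
    """Attach permuted p-values to each eQTL record, keyed on molecular_trait_id.

    Assumes one permuted row per molecular trait, as in *.permuted.tsv.gz.
    """
    by_trait = {row["molecular_trait_id"]: row for row in permuted}
    for eqtl in eqtls:
        perm = by_trait.get(eqtl["molecular_trait_id"])
        if perm is not None:
            eqtl["permuted"] = {
                "p_perm": float(perm["p_perm"]),
                "p_beta": float(perm["p_beta"]),
            }
    return eqtls
```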

Indexing Strategy

  • Create indexes on key fields for efficient retrieval:
    • gene_id
    • chromosome
    • position
    • variant
    • rsid
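With pymongo, the list above could be applied as below. ASCENDING is pymongo's sort-order constant (the integer 1), inlined here so the snippet has no external dependency; database and collection names would come from the deployment config. One judgment call in this sketch: chromosome and position are combined into a single compound index, so region queries (chromosome plus a position range) can be answered with one index scan.

```python
# pymongo's ASCENDING sort-order constant is the integer 1.
ASCENDING = 1

# Single-field indexes for the point-lookup fields; compound index
# on (chromosome, position) for region queries.
EQTL_INDEXES = [
    [("gene_id", ASCENDING)],
    [("variant", ASCENDING)],
    [("rsid", ASCENDING)],
    [("chromosome", ASCENDING), ("position", ASCENDING)],
]

def create_eqtl_indexes(collection) -> None:
    """Apply the index specs to a pymongo-style collection object."""
    for spec in EQTL_INDEXES:
        collection.create_index(spec)
```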

karatugo (Member Author) commented:
@karatugo Focus on Mongo indexing, deployment, and API development.

karatugo (Member Author) commented:
Deployment to sandbox is in progress. The build step runs successfully, but the deploy step has some errors at the moment. I'll prioritise this next week.

karatugo (Member Author) commented Oct 24, 2024

Sandbox deployment worked when the singularity commands were run manually, but while automating it I got the error below.

  • Fix this error and test it in sandbox
FATAL:   could not open image /nfs/public/rw/gwas/deposition/singularity_cache/eqtl-sumstats-service_72de6563bdc84abc0be38ef294c854e3dd30f56e.sif: failed to retrieve path for /nfs/public/rw/gwas/deposition/singularity_cache/eqtl-sumstats-service_72de6563bdc84abc0be38ef294c854e3dd30f56e.sif: lstat /nfs/public: no such file or directory

karatugo (Member Author) commented:

Fixed the above error; now working on the Mongo save failure issue.
