
Develop a new data ingest / ETL pipeline for indexing eQTL data into the new mongo database #3

Open · 4 of 8 tasks
karatugo opened this issue Jul 10, 2024 · 5 comments
karatugo (Member) commented Jul 10, 2024

We need to develop a robust, scalable data ingest/ETL (Extract, Transform, Load) pipeline that reads eQTL (expression Quantitative Trait Loci) data from FTP sources, indexes it into a MongoDB database, and serves it via an API. This pipeline will ensure efficient data extraction, transformation, and retrieval to support downstream analysis and querying through a web service.

  • Scalable data ingest/ETL pipeline
  • Read from FTP sources
  • Ingest with the correct schema
  • Save to MongoDB
  • Index MongoDB (is it automatic? discuss with the DBA team)
  • Deploy to Sandbox
  • Deploy to Prod
  • Implement API
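The checklist above breaks down into three pipeline stages: extract from FTP, transform into schema-shaped documents, and load into MongoDB. A minimal sketch of what each stage could look like, using only the Python standard library (function names and FTP details are illustrative, not from the actual service):

```python
import gzip
import io
from ftplib import FTP

def extract(ftp_host: str, remote_path: str) -> bytes:
    """Download a gzipped TSV from the FTP source and return its raw bytes."""
    buf = io.BytesIO()
    with FTP(ftp_host) as ftp:
        ftp.login()  # anonymous login; adjust if credentials are required
        ftp.retrbinary(f"RETR {remote_path}", buf.write)
    return buf.getvalue()

def transform(raw: bytes) -> list[dict]:
    """Decompress and parse a gzipped TSV into one dict per row, keyed by header."""
    text = gzip.decompress(raw).decode("utf-8")
    lines = text.strip().split("\n")
    header = lines[0].split("\t")
    return [dict(zip(header, line.split("\t"))) for line in lines[1:]]

def load(docs: list[dict], collection) -> None:
    """Bulk-insert the documents into a MongoDB (pymongo-style) collection."""
    if docs:
        collection.insert_many(docs)
```

In a real deployment each stage would be wrapped with retries and batching, but the stage boundaries stay the same.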
karatugo (Member Author) commented:
Files to Index

  • QTD0000*.all.tsv.gz: Contains comprehensive eQTL data. This should be the primary source for indexing.
  • QTD0000*.cc.tsv.gz: Contains specific eQTL data (likely condition-specific or subset). Also useful for indexing.
  • QTD0000*.permuted.tsv.gz: Contains permuted eQTL data for significance testing. Useful for specific analyses but not primary indexing.

Suggested MongoDB Schema

Here's a refined schema to capture the necessary details from these files:

  1. Study Information:

    • study_id: QTD000021
    • study_name: "Sample eQTL Study"
  2. Sample Information:

    • sample_id: Auto-generated or derived from context if available?
  3. eQTL Information:

    • molecular_trait_id: Corresponding trait ID.
    • molecular_trait_object_id: Object ID for the molecular trait.
    • chromosome: Chromosome number.
    • position: Position on the chromosome.
    • ref: Reference allele.
    • alt: Alternative allele.
    • variant: Variant identifier.
    • ma_samples: Minor allele sample count.
    • maf: Minor allele frequency.
    • pvalue: P-value of the association.
    • beta: Effect size.
    • se: Standard error.
    • type: Variant type (e.g., SNP).
    • aan: Additional annotation number.
    • r2: R-squared value.
    • gene_id: Gene identifier.
    • median_tpm: Median TPM (Transcripts Per Million).
    • rsid: Reference SNP ID.
  4. Permuted eQTL Information:

    • p_perm: Permuted p-value.
    • p_beta: Permuted beta value.

Example MongoDB Document Structure

{
  "study_id": "QTD000021",
  "study_name": "Sample eQTL Study",
  "samples": [
    {
      "sample_id": "sample001",
      "eqtls": [
        {
          "molecular_trait_id": "ENSG00000187583",
          "molecular_trait_object_id": "ENSG00000187583",
          "chromosome": "1",
          "position": 14464,
          "ref": "A",
          "alt": "T",
          "variant": "chr1_14464_A_T",
          "ma_samples": 41,
          "maf": 0.109948,
          "pvalue": 0.15144,
          "beta": 0.25567,
          "se": 0.17746,
          "type": "SNP",
          "aan": 42,
          "r2": 382,
          "gene_id": "ENSG00000187583",
          "median_tpm": 0.985,
          "rsid": "rs546169444",
          "permuted": {
            "p_perm": 0.000999001,
            "p_beta": 3.3243e-12
          }
        }
      ]
    }
  ]
}

Steps to Implement

  1. Extract Data:

    • Parse QTD0000*.all.tsv.gz and QTD0000*.cc.tsv.gz to extract eQTL data.
    • Parse QTD0000*.permuted.tsv.gz to extract permuted data and merge with the main eQTL data.
  2. Transform Data:

    • Normalize data fields and structure according to the MongoDB schema.
  3. Load Data:

    • Insert the structured documents into MongoDB.
    • Ensure appropriate indexes on fields such as gene_id, chromosome, position, and variant for efficient querying.
  4. API Development:

    • Develop endpoints for querying the eQTL data based on different parameters.
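For the merge in step 1, the permuted files carry one row per molecular trait, so a lookup keyed on molecular_trait_id is enough. A sketch (column names follow the schema above; the function name is illustrative):

```python
def merge_permuted(eqtls: list[dict], permuted: list[dict]) -> list[dict]:
    """Attach permuted p-values to each eQTL record, keyed on molecular_trait_id.

    Assumes one permuted row per molecular trait, as in *.permuted.tsv.gz.
    """
    by_trait = {row["molecular_trait_id"]: row for row in permuted}
    for eqtl in eqtls:
        perm = by_trait.get(eqtl["molecular_trait_id"])
        if perm is not None:
            eqtl["permuted"] = {
                "p_perm": float(perm["p_perm"]),
                "p_beta": float(perm["p_beta"]),
            }
    return eqtls
```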

Indexing Strategy

  • Create indexes on key fields for efficient retrieval:
    • gene_id
    • chromosome
    • position
    • variant
    • rsid
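With pymongo, the list above could be applied as below. ASCENDING is pymongo's sort-order constant (the integer 1), inlined here so the snippet has no external dependency; database and collection names would come from the deployment config. One judgment call in this sketch: chromosome and position are combined into a single compound index, so region queries (chromosome plus a position range) can be answered with one index scan.

```python
# pymongo's ASCENDING sort-order constant is the integer 1.
ASCENDING = 1

# Single-field indexes for the point-lookup fields; compound index
# on (chromosome, position) for region queries.
EQTL_INDEXES = [
    [("gene_id", ASCENDING)],
    [("variant", ASCENDING)],
    [("rsid", ASCENDING)],
    [("chromosome", ASCENDING), ("position", ASCENDING)],
]

def create_eqtl_indexes(collection) -> None:
    """Apply the index specs to a pymongo-style collection object."""
    for spec in EQTL_INDEXES:
        collection.create_index(spec)
```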

karatugo (Member Author) commented:
@karatugo Focus on Mongo indexing, deployment, and API development.

karatugo (Member Author) commented:
Deployment to sandbox is in progress. The build step runs successfully, but the deploy step has some errors at the moment. I'll prioritise this next week.

karatugo (Member Author) commented Oct 24, 2024

Sandbox deployment worked when the singularity commands were run manually, but while automating it I got the error below.

  • Fix this error and test it in sandbox
FATAL:   could not open image /nfs/public/rw/gwas/deposition/singularity_cache/eqtl-sumstats-service_72de6563bdc84abc0be38ef294c854e3dd30f56e.sif: failed to retrieve path for /nfs/public/rw/gwas/deposition/singularity_cache/eqtl-sumstats-service_72de6563bdc84abc0be38ef294c854e3dd30f56e.sif: lstat /nfs/public: no such file or directory

karatugo (Member Author) commented:

Fixed the above error; now working on the Mongo save failure issue.
