In this tutorial series, we demonstrate the benefit of packaging raw -omics data in Quilt packages with attached sample-level metadata. Annotating packages with workflow-standardized metadata enables the creation of AWS Athena tables that join sample-level metadata with pipeline outputs (e.g. from Nextflow) generated from your processed NGS data. Joining these two data sources in Athena lets users efficiently query large datasets across multiple processing runs and cohorts using SQL.
For example, use an SQL query within a Jupyter Notebook to generate a table of EGFR expression across all colon cancer cell lines, where "colon cancer" represents a piece of sample-level metadata from the raw data Quilt packages, and "EGFR expression" is a piece of processed data from packaged Nextflow pipeline outputs.
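A minimal sketch of that query is shown below. It assumes hypothetical Athena tables `sample_metadata` (built from the Quilt package metadata) and `gene_expression` (built from the packaged nf-core/rnaseq outputs), a hypothetical Glue database name, and the `awswrangler` library for running Athena queries from a notebook; your table and column names will differ.

```python
import awswrangler as wr  # assumed helper library for querying Athena from Python

# Hypothetical Athena tables: `sample_metadata` (from Quilt package metadata)
# and `gene_expression` (from packaged nf-core/rnaseq outputs).
query = """
SELECT m.sample_id,
       m.cell_line,
       e.tpm AS egfr_tpm
FROM sample_metadata AS m
JOIN gene_expression AS e
  ON m.sample_id = e.sample_id
WHERE m.tumor_type = 'Colon Cancer'
  AND e.gene_name = 'EGFR'
"""

# Returns a pandas DataFrame of EGFR expression across colon cancer cell lines.
df = wr.athena.read_sql_query(query, database="ccle_demo")  # hypothetical Glue database
```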
The ultimate goal of this demo is to provide an end-to-end framework, from raw data to analysis, to maximize the utility of your NGS data, and make querying your datasets fast & easy (no more searching through directories & file systems to find specific sample or run IDs!).
For the purpose of this tutorial, we are using a subset of publicly available RNA-sequencing data generated by the Cancer Cell Line Encyclopedia (CCLE) initiative.
RNA-sequencing data is processed using the nf-core/rnaseq Nextflow pipeline with the nf-quilt plugin, which packages the pipeline outputs into a Quilt package with the pipeline parameters attached as metadata.
Although focused on bulk RNA-seq data, this tutorial is generalizable: the core principles apply across data types, and the workflow is reproducible with your in-house datasets.
We have generated a series of core tutorials (plus one optional notebook) demonstrating a framework that takes you from raw NGS data to annotated Quilt data packages with sample metadata and Nextflow pipeline outputs, enabling quick data access and queries through AWS Athena.
00_curate_raw_ccle_rnaseq_data.ipynb (optional)
01_create_metadata_workflow_schema.ipynb
02_generate_raw_data_pkgs_with_metadata.ipynb
Raw data is either generated in-house by an instrument or, as in this demo, curated from a public source. Here, we downloaded raw RNA-sequencing data in the form of FASTQs from the Sequence Read Archive (SRA). The raw sequencing data was then packaged into Quilt packages, one package per sample.
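As a rough sketch of what per-sample packaging can look like with the `quilt3` Python API (the sample ID, file paths, package namespace, and registry bucket below are all hypothetical placeholders):

```python
import quilt3

sample_id = "SAMPLE_001"  # placeholder sample/run accession

# Build one package per sample from its raw FASTQ files.
pkg = quilt3.Package()
pkg.set(f"{sample_id}_R1.fastq.gz", f"fastq/{sample_id}_R1.fastq.gz")
pkg.set(f"{sample_id}_R2.fastq.gz", f"fastq/{sample_id}_R2.fastq.gz")

# Push to a hypothetical Quilt registry bucket under a per-sample package name.
pkg.push(
    f"ccle-rnaseq/{sample_id}",
    registry="s3://your-quilt-bucket",
    message="Raw FASTQs for one CCLE sample",
)
```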
Sample-level metadata describing both biological (tumor type, patient age, histology ...) and technical (sequencer used, library kit, freezing media used for storage ...) features of the sample were obtained from SRA and attached as metadata to each Quilt package housing raw data.
Quilt workflows & metadata schemas were used to ensure the integrity of the metadata across samples -- a key step to maximize the utility of sample metadata in downstream analysis! No more Tumor vs. tumor vs. tumour...!!
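The sketch below shows one way this can look in `quilt3`, assuming a hypothetical per-sample package, registry bucket, metadata fields, and a workflow named "rnaseq-sample" whose JSON Schema is defined in the registry's workflow configuration:

```python
import quilt3

# Browse the previously pushed per-sample package (names are hypothetical).
pkg = quilt3.Package.browse("ccle-rnaseq/SAMPLE_001", registry="s3://your-quilt-bucket")

# Attach sample-level metadata; the field names and allowed values would be
# governed by the workflow's JSON Schema (e.g. "Colon Cancer", never "colon cancer").
pkg.set_meta({
    "sample_id": "SAMPLE_001",
    "tumor_type": "Colon Cancer",
    "sequencer": "Illumina NovaSeq 6000",
    "library_kit": "TruSeq Stranded mRNA",
})

# Pushing with a workflow asks Quilt to validate the metadata against that
# workflow's schema before accepting the revision ("rnaseq-sample" is hypothetical).
pkg.push(
    "ccle-rnaseq/SAMPLE_001",
    registry="s3://your-quilt-bucket",
    workflow="rnaseq-sample",
    message="Attach schema-validated sample metadata",
)
```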
03_run_nfcore_rnaseq_with_nfquilt.ipynb
The Nextflow nf-core/rnaseq pipeline, in conjunction with the nf-quilt plugin, was used to process the raw sequencing data (FASTQs) and generate per-sample expression values. Samples were processed together in batches (called "runs"), mirroring common practice in NGS centers where multiple samples on a sequencing flow cell are pre-processed at the same time. The nf-quilt plugin automatically packages the Nextflow pipeline output into a Quilt package and appends detailed pipeline run metadata to the package.
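Once a run completes, the resulting output package and its run metadata can be inspected from Python, for example with the sketch below (the package name and registry bucket are hypothetical):

```python
import quilt3

# Browse the pipeline output package created by nf-quilt (names are hypothetical).
run_pkg = quilt3.Package.browse("nf-quilt/rnaseq-run-001", registry="s3://your-quilt-bucket")

# Package-level metadata holds the pipeline parameters and run details.
print(run_pkg.meta)

# Top-level entries are the nf-core/rnaseq output folders and files.
print(list(run_pkg.keys()))
```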
04_athena_metadata_nfcore_output.ipynb
To enable valuable data searches, we must align the sample-level metadata attached to the raw data packages with the pipeline outputs. In this demo, the primary data generated by the pipeline are expression tables. With Athena, it's possible to integrate the sample metadata and pipeline output tables to empower quick queries and slicing and dicing of large datasets.
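As one example of slicing and dicing across runs (again assuming the hypothetical `sample_metadata` and `gene_expression` Athena tables, Glue database, and the `awswrangler` library from the earlier sketch):

```python
import awswrangler as wr

# Hypothetical example: count samples per tumor type and per processing run,
# joining package-derived sample metadata with pipeline output tables.
query = """
SELECT m.tumor_type,
       e.run_id,
       COUNT(DISTINCT m.sample_id) AS n_samples
FROM sample_metadata AS m
JOIN gene_expression AS e
  ON m.sample_id = e.sample_id
GROUP BY m.tumor_type, e.run_id
ORDER BY m.tumor_type, e.run_id
"""

summary = wr.athena.read_sql_query(query, database="ccle_demo")
```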
06_query_athena_data_and_perform_analysis
Once Athena is enabled, the world (or the data, in this case...) is the computational biologist's oyster! Computational biologists can now use Athena to make SQL queries that return the desired subsets of data quickly and efficiently. Queries can be performed directly in Jupyter notebooks, enabling seamless data loading upstream of analysis.
In contrast, without Athena, computational biologists would have to figure out which samples they want by loading a master metadata table from somewhere, do some detective work to track down where the output tables for their desired samples live, and load those files one by one.
Additionally, Athena tables are compatible with interactive dashboards (e.g. Tableau, Spotfire, QuickSight), allowing you to track sample counts, which samples have been processed, and other accounting metrics that may be helpful beyond computational teams (business development, project management) in a "no-code" manner.
The tutorials are in the form of fully executable Jupyter notebooks. To run them, you will need the following prerequisites:
- Python >=3.7
- Required Python packages: install with `pip install -r requirements.txt`
- AWS credentials
- Quilt Open Data Account
- Nextflow Tower account (optional)
We'd love to help! Please reach out to the Quilt Data team with any comments or questions. Let's get your data up to snuff together!
- Laura Richards: [email protected]
- Simon Kohnstamm: [email protected]
- Kevin Moore: [email protected]