Contents of Zenodo archive of processed CLIPNET data and results

The training data and the data needed to reproduce the figures are available on Zenodo (DOI: 10.5281/zenodo.10597358). To preserve the directory structure, we packaged the data into tar archives, divided roughly by figure or analysis. The files are described below:

  • procap_library_prefixes.txt: Prefixes for the PRO-cap libraries (n=67) used to train and evaluate CLIPNET.
  • procap_to_1kGP_conversion.json: Lists the individual ID for each PRO-cap library (for extracting genotypes from 1kGP). Note that some libraries were ultimately excluded from CLIPNET, so this file has more than 67 entries.
  • training_data.tar.gz: Contains processed data used to train the CLIPNET models.
    • individual_pints_peaks/: Contains the PINTS peaks for each individual PRO-cap library.
    • individual_jittered_windows/: Contains the jittered 1 kb windows for each individual PRO-cap library (window centers shifted uniformly at random by up to ±250 bp around the center of each peak).
    • processed_data/: Contains the processed data used to train the models, including the individualized sequences and the RPM-normalized PRO-cap signal, packaged as npz arrays. Data were concatenated across libraries, then split into the data folds described in processed_data/data_fold_assignments.csv.gz. The PRO-cap data are structured as N x 2000 arrays (1000 bp of plus (pl) strand signal, then 1000 bp of minus (mn) strand signal). The sequence data are structured as N x 1000 x 4 arrays (N = number of sequences, 1000 = sequence length, 4 = two-hot encoding of sequences).
  • evaluation_metric.tar.gz: Contains the evaluation metrics for the CLIPNET models. Supporting data are in evaluation_data.tar.gz.
    • ensemble_test/: Contains the evaluation metrics for the individual models on the complete hold out data set (fold 0).
    • individual_test/: Contains the evaluation metrics for each model fold on its own held-out fold (model 1 held out fold 1, model 2 held out fold 2, etc.).
    • fixed_uniq_windows.bed.gz: A fixed set of 48,058 1 kb windows used to evaluate the models. We selected PRO-cap peaks that were present in at least 60 of the 67 libraries, then selected 1 kb windows around each of them (with 250 bp jittering).
    • mean_predictor_corrs.csv.gz: Correlations between an averaged PRO-cap track (across loci) and the individual tracks.
    • replicate_pearsons.csv.gz: Correlation between tracks from isogenic replicates (n=9).
    • clipnet_test_predictions.h5: Predictions of the ensemble model on data fold 0.
    • puffin_clipnet_test_perf.csv.gz: Track correlations for Puffin's PRO-cap head.
  • evaluation_data.tar.gz: Contains data and predictions used to evaluate the performance of the CLIPNET models.
    • processed_data/: Contains the processed data used to evaluate the models.
      • procap/: Contains the processed PRO-cap signal (csv) for each data fold.
      • sequences/: Contains the sequences (fasta) for each data fold.
    • merged_pl_rpm.bw: bigWig file containing RPM-normalized plus strand signals, averaged across all individuals.
    • merged_mn_rpm.bw: bigWig file containing RPM-normalized minus strand signals, averaged across all individuals.
  • deepshap_scores.tar.gz: Contains DeepSHAP contribution scores.
    • merged_windows_all.bed.gz: A nonredundant set of 212,777 windows around PRO-cap peaks (union across all libraries) used for calculating DeepSHAP scores.
    • all_tss_windows_reference_seq.fna.gz: The reference (hg38) sequence for the windows in merged_windows_all.bed.gz.
    • all_seqs_onehot.npz: A one-hot encoded version of the reference sequence. This and the score arrays are structured as N x 4 x 1000 arrays for compatibility with TF-MoDISco.
    • mean_across_folds_all_profile.npz: The profile contribution scores (mean across model folds).
    • mean_across_folds_all_quantity.npz: The quantity contribution scores (mean across model folds).
  • tfmodisco_results.tar.gz: Contains TF-MoDISco results.
    • mean_across_folds_all_profile_modisco.h5: The TF-MoDISco results for the profile contribution scores.
    • mean_across_folds_all_quantity_modisco.h5: The TF-MoDISco results for the quantity contribution scores.
    • mean_across_folds_all_profile_modisco/: A report of the TF-MoDISco results for the profile contribution scores.
    • mean_across_folds_all_quantity_modisco/: A report of the TF-MoDISco results for the quantity contribution scores.
    • mean_across_folds_all_modisco_positions.h5: Distribution of TF-MoDISco motif positions around the max TSS for each window.
  • qtl_analysis.tar.gz: Contains the finished QTL analysis (log L2 ref - alt scores). Supporting data are in qtl_data.tar.gz.
  • qtl_data.tar.gz: Contains analysis of both tiQTLs and diQTLs.
    • tiqtl/: Contains the tiQTL analysis.
      • predictions/: Contains predictions for each individual centered on each tiQTL.
        • ensemble_predictions/: Contains the predictions of the ensemble model.
        • individual_predictions/: Contains the predictions of the individual models.
      • tiQTL_snps.bed.gz: The SNPs used for the tiQTL analysis (note that we dropped multiallelic SNPs).
      • tiqtl_windows.bed.gz: The windows used for the tiQTL analysis.
    • diqtl/: Contains the diQTL analysis. Identical structure to tiqtl/.
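
The array layouts described above (N x 2000 PRO-cap signal, N x 1000 x 4 sequences in the training data, and N x 4 x 1000 arrays in the DeepSHAP archive) can be handled along the following lines. This is a minimal sketch on synthetic stand-in arrays, not the authors' loading code. The helper names (split_strands, to_channels_first, first_array) are hypothetical, and because this README does not state the key names inside the npz files, the sketch reads whichever array is stored first rather than assuming a key.

```python
import io

import numpy as np


def split_strands(procap):
    """Split an N x 2L PRO-cap array into plus (first L columns) and minus (last L columns) strands."""
    procap = np.asarray(procap)
    if procap.ndim != 2 or procap.shape[1] % 2 != 0:
        raise ValueError(f"expected an N x 2L array, got shape {procap.shape}")
    half = procap.shape[1] // 2
    return procap[:, :half], procap[:, half:]


def to_channels_first(seqs):
    """Reorder N x length x 4 sequences to N x 4 x length (the layout used by the DeepSHAP arrays)."""
    seqs = np.asarray(seqs)
    if seqs.ndim != 3 or seqs.shape[2] != 4:
        raise ValueError(f"expected an N x length x 4 array, got shape {seqs.shape}")
    return np.transpose(seqs, (0, 2, 1))


def first_array(npz_source):
    """Return the first array in an npz archive without assuming its key name."""
    with np.load(npz_source) as archive:
        return archive[archive.files[0]]


# Synthetic stand-ins with the documented shapes (N = 3); not real CLIPNET data.
rng = np.random.default_rng(0)
procap = rng.random((3, 2000))
seqs = np.zeros((3, 1000, 4))

# Round-trip through an in-memory npz archive to mimic reading a packaged file.
buffer = io.BytesIO()
np.savez_compressed(buffer, procap)
buffer.seek(0)
loaded = first_array(buffer)

pl, mn = split_strands(loaded)
seqs_cf = to_channels_first(seqs)
print(pl.shape, mn.shape, seqs_cf.shape)
```

To use this on the real archives, pass the path of an extracted npz file to first_array in place of the in-memory buffer, and check archive.files first if a file holds more than one array.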