The data needed to reproduce the figures, as well as the training data, are available at 10.5281/zenodo.10597358. To preserve directory structure, we packaged the data into tar files, divided roughly by figure/analysis. Below is a description of the files:
procap_library_prefixes.txt
: Prefixes for the PRO-cap libraries (n=67) used to train and evaluate CLIPNET.
procap_to_1kGP_conversion.json
: Lists the individual ID for each PRO-cap library (for extracting genotypes from 1kGP). Note that some libraries were ultimately excluded from CLIPNET, so this file has more than 67 entries.
training_data.tar.gz
: Contains processed data used to train the CLIPNET models.
individual_pints_peaks/
: Contains the PINTS peaks for each individual PRO-cap library.
individual_jittered_windows/
: Contains the jittered (uniformly random, +/- 250 bp around the center of each peak) 1 kb windows for each individual PRO-cap library.
processed_data/
: Contains the processed data used to train the models, including the individualized sequences and PRO-cap signal (RPM normalized). Packaged as npz arrays. Data were concatenated across libraries, then split into the data folds described in processed_data/data_fold_assignments.csv.gz. We note that the PRO-cap data are structured as N x 2000 arrays (1000 bp plus (pl) strand, 1000 bp minus (mn) strand). The sequence data are structured as N x 1000 x 4 arrays (N = number of sequences, 1000 = sequence length, 4 = two-hot encoding of sequences).
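The array layout described above can be illustrated with a minimal sketch on synthetic data. Several things here are assumptions rather than documented facts: that the plus strand occupies the first 1000 columns (inferred from the order in which the strands are listed), that "two-hot" means the four channels at each position sum to 2 (one base with value 2 when homozygous, two bases with value 1 when heterozygous), and the helper name `split_strands`. Real arrays would come from `np.load` on the npz files; the key names inside each file are not listed here, so inspect them with the `.files` attribute of the loaded object.

```python
import numpy as np


def split_strands(procap, window=1000):
    """Split an (N, 2 * window) signal array into two (N, window) halves.

    Assumption: the first half is the plus (pl) strand and the second
    half is the minus (mn) strand.
    """
    procap = np.asarray(procap)
    if procap.ndim != 2 or procap.shape[1] != 2 * window:
        raise ValueError(f"expected shape (N, {2 * window}), got {procap.shape}")
    return procap[:, :window], procap[:, window:]


# Synthetic stand-ins for arrays that would be loaded from the npz files.
rng = np.random.default_rng(0)
procap = rng.random((3, 2000))

# Toy two-hot sequences (assumed convention: channels sum to 2 per position).
seqs = np.zeros((3, 1000, 4))
seqs[:, :, 0] = 2          # every position homozygous for channel 0
seqs[0, 10, :] = [1, 1, 0, 0]  # one heterozygous position in the first sequence

plus, minus = split_strands(procap)
print(plus.shape, minus.shape, seqs.shape)
```

Checking shapes before training or evaluation is a cheap way to catch an array that was loaded under the wrong key or with the wrong axis order.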
evaluation_metric.tar.gz
: Contains the evaluation metrics for the CLIPNET models. Supporting data are in evaluation_data.tar.gz.
ensemble_test/
: Contains the evaluation metrics for the individual models on the complete hold-out data set (fold 0).
individual_test/
: Contains the evaluation metrics for the model folds on the individual model hold-out folds (model 1 used fold 1 as a hold-out, model 2 used fold 2, etc.).
fixed_uniq_windows.bed.gz
: A fixed set of 48,058 1 kb windows used to evaluate the models. We selected PRO-cap peaks that were present in at least 60 of the 67 libraries, then selected 1 kb windows around each of them (with 250 bp jittering).
mean_predictor_corrs.csv.gz
: Correlation between an averaged PRO-cap track (across loci) and individual tracks.
replicate_pearsons.csv.gz
: Correlation between tracks from isogenic replicates (n=9).
clipnet_test_predictions.h5
: Prediction of the ensembled model on data fold 0.
puffin_clipnet_test_perf.csv.gz
: Track correlations for Puffin's PRO-cap head.
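For readers who want to recompute track correlations of this kind, the following is a minimal sketch of a per-window Pearson correlation between observed and predicted tracks, plus a "mean predictor" baseline in which the track averaged across loci is compared against each individual track. This is one reading of the descriptions above, not the exact evaluation code; the function name `rowwise_pearson` and the toy arrays are illustrative assumptions.

```python
import numpy as np


def rowwise_pearson(a, b):
    """Pearson r between matching rows of two (N, L) arrays.

    Rows with zero variance give nan, since Pearson r is undefined there.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    if a.ndim != 2 or a.shape != b.shape:
        raise ValueError("a and b must be 2D arrays of identical shape")
    a_c = a - a.mean(axis=1, keepdims=True)
    b_c = b - b.mean(axis=1, keepdims=True)
    num = (a_c * b_c).sum(axis=1)
    den = np.sqrt((a_c ** 2).sum(axis=1) * (b_c ** 2).sum(axis=1))
    with np.errstate(divide="ignore", invalid="ignore"):
        return num / den


# Toy tracks: 3 windows x 4 positions (real tracks are much longer).
observed = np.array([[0.0, 1.0, 4.0, 1.0],
                     [0.0, 2.0, 3.0, 0.0],
                     [1.0, 0.0, 5.0, 2.0]])
predicted = np.array([[0.0, 2.0, 8.0, 2.0],   # exactly 2 * observed
                      [3.0, 1.0, 0.0, 3.0],   # exactly 3 - observed
                      [1.0, 1.0, 4.0, 1.0]])

model_r = rowwise_pearson(observed, predicted)

# Baseline: use the track averaged across windows as the prediction for every window.
mean_track = observed.mean(axis=0, keepdims=True)
baseline_r = rowwise_pearson(observed, np.broadcast_to(mean_track, observed.shape))
print(model_r, baseline_r)
```

Comparing `model_r` against `baseline_r` shows how much of a model's correlation could be explained by the average profile shape alone.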
evaluation_data.tar.gz
: Contains data and predictions used to evaluate the performance of the CLIPNET models.
processed_data/
: Contains the processed data used to evaluate the models.
procap/
: Contains the processed PRO-cap signal (csv) for each data fold.
sequences/
: Contains the sequences (fasta) for each data fold.
merged_pl_rpm.bw
: bigWig file containing RPM-normalized plus strand signals, averaged across all individuals.
merged_mn_rpm.bw
: bigWig file containing RPM-normalized minus strand signals, averaged across all individuals.
deepshap_scores.tar.gz
: Contains DeepSHAP contribution scores.
merged_windows_all.bed.gz
: A nonredundant set of 212,777 windows around PRO-cap peaks (union across all libraries) used for calculating DeepSHAP scores.
all_tss_windows_reference_seq.fna.gz
: The reference (hg38) sequence for the windows in merged_windows_all.bed.gz.
all_seqs_onehot.npz
: A one-hot encoded version of the reference sequence. This and the score arrays are structured as N x 4 x 1000 arrays for compatibility with TF-MoDISco.
mean_across_folds_all_profile.npz
: The profile contribution scores (mean across model folds).
mean_across_folds_all_quantity.npz
: The quantity contribution scores (mean across model folds).
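Note that the N x 4 x 1000 layout used here differs from the N x 1000 x 4 layout of the training sequences. A minimal sketch of converting between the two with a transpose is given below. The helper names and the A, C, G, T channel order are assumptions for illustration; the actual channel order should be checked against the sequences in all_tss_windows_reference_seq.fna.gz.

```python
import numpy as np


def one_hot(seq, alphabet="ACGT"):
    """One-hot encode a DNA string as a (length, 4) array.

    Assumption: channel order follows `alphabet`; bases outside it
    (e.g. N) are left as all-zero rows.
    """
    arr = np.zeros((len(seq), len(alphabet)), dtype=np.float32)
    for i, base in enumerate(seq.upper()):
        j = alphabet.find(base)
        if j >= 0:
            arr[i, j] = 1.0
    return arr


def to_channels_first(x):
    """Convert (N, length, 4) arrays to the (N, 4, length) layout."""
    x = np.asarray(x)
    if x.ndim != 3 or x.shape[2] != 4:
        raise ValueError(f"expected shape (N, length, 4), got {x.shape}")
    return np.transpose(x, (0, 2, 1))


# Two toy 4 bp sequences stand in for the 1000 bp windows.
length_last = np.stack([one_hot("ACGT"), one_hot("GGNA")])
channels_first = to_channels_first(length_last)
print(length_last.shape, "->", channels_first.shape)
```

Applying the same transpose with axes `(0, 2, 1)` to a channels-first array converts it back.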
tfmodisco_results.tar.gz
: Contains TF-MoDISco results.
mean_across_folds_all_profile_modisco.h5
: The TF-MoDISco results for the profile contribution scores.
mean_across_folds_all_quantity_modisco.h5
: The TF-MoDISco results for the quantity contribution scores.
mean_across_folds_all_profile_modisco/
: A report of the TF-MoDISco results for the profile contribution scores.
mean_across_folds_all_quantity_modisco/
: A report of the TF-MoDISco results for the quantity contribution scores.
mean_across_folds_all_modisco_positions.h5
: Distribution of TF-MoDISco motif positions around the max TSS for each window.
qtl_analysis.tar.gz
: Contains the finished QTL analysis (log L2 ref - alt scores). Supporting data are in qtl_data.tar.gz.
qtl_data.tar.gz
: Contains analysis of both tiQTLs and diQTLs.
tiqtl/
: Contains the tiQTL analysis.
predictions/
: Contains predictions for each individual centered on each tiQTL.
ensemble_predictions/
: Contains the predictions of the ensemble model.
individual_predictions/
: Contains the predictions of the individual models.
tiQTL_snps.bed.gz
: The SNPs used for the tiQTL analysis (note that we dropped multiallelic SNPs).
tiqtl_windows.bed.gz
: The windows used for the tiQTL analysis.
diqtl/
: Contains the diQTL analysis. Identical structure to tiqtl/.
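Because the archives preserve directory structure, it can be useful to list their contents before unpacking. The following is a minimal sketch using only the Python standard library. The function names and the example paths in the comments are illustrative assumptions, and the archives are assumed to be gzip-compressed tar files, as their .tar.gz extensions suggest.

```python
import tarfile
from pathlib import Path


def list_members(archive):
    """Return the member paths stored in a gzip-compressed tar archive."""
    with tarfile.open(archive, "r:gz") as tar:
        return tar.getnames()


def extract_archive(archive, dest):
    """Extract an archive into `dest`, keeping its directory structure."""
    dest = Path(dest)
    dest.mkdir(parents=True, exist_ok=True)
    # Use the safer "data" extraction filter where this Python version supports it.
    kwargs = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(dest, **kwargs)
    return dest


# Example usage (paths are placeholders for wherever the files were downloaded):
# print(list_members("qtl_data.tar.gz")[:10])
# extract_archive("qtl_data.tar.gz", "data/")
```

Listing members first makes it easy to confirm which subdirectories (for example tiqtl/ and diqtl/) an archive will create before committing disk space to a full extraction.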