single sample processing for pindel cohort #96

Open, wants to merge 64 commits into base: dev

Commits:
86e99c8
Apply fix provided Bt Kai Ye/Xiaofei Yang to recover missed complex e…
keiranmraine Jan 13, 2020
d5726b4
Apply fix provided Bt Kai Ye/Xiaofei Yang to recover missed complex e…
keiranmraine Jan 13, 2020
a2a1bcd
deletions working, cleaned up install process
keiranmraine Apr 20, 2020
3251131
Simple del/ins "working", but looks like switch in parsing format is …
keiranmraine Apr 21, 2020
51296ce
Initial pass at cohort pindel processing, no visualisation outputs (B…
keiranmraine May 6, 2020
ca37b83
Merge pull request #89 from cancerit/feature/insertLargerThanDel
keiranmraine May 6, 2020
49ecb5f
Functional cohort sample code with blat and BAM outputs
keiranmraine May 12, 2020
79a822f
Script for assessing impact of different lengths of target seq
keiranmraine May 12, 2020
e831ae6
Correct missing param default
keiranmraine May 12, 2020
e8f7868
minor fixes
keiranmraine May 14, 2020
50eaaee
Final version of plots
keiranmraine May 14, 2020
7302b1d
Fix bug in read selection, performance implications but best I can th…
keiranmraine Jun 2, 2020
9fa478b
cleanup
keiranmraine Jun 2, 2020
477a0e6
lazy stuff
keiranmraine Jun 2, 2020
8f9d163
convert to have ability to work on multisample VCF
keiranmraine Jun 23, 2020
e166a67
working towards vaf fill in
keiranmraine Jul 28, 2020
673609c
fix conflict during merge
keiranmraine Jul 28, 2020
cee0c94
Working fill-in code
keiranmraine Aug 3, 2020
67b6445
Finalise last of cohort processing scripts
keiranmraine Aug 6, 2020
9ae6799
Update tests for changes to how data is passed
keiranmraine Aug 7, 2020
c263413
Minor updates to deploy
keiranmraine Aug 13, 2020
0c385bb
Cleanup following bug hunt
keiranmraine Aug 25, 2020
11d1ed0
Final few issues with change to compressed files handled.
keiranmraine Aug 26, 2020
c3edc49
reduce change of exceptionally high number of files in a single direc…
keiranmraine Sep 4, 2020
e61a3a9
Fix up cohort tools that share VcfBlatAugment object for changes to c…
keiranmraine Sep 4, 2020
a15da93
typo
keiranmraine Sep 25, 2020
538b077
Cleanup so restarts work if failure occured during samtools sort
keiranmraine Sep 25, 2020
01e4f38
some lagging differences
keiranmraine Mar 30, 2021
ea0b2c4
docker cleanup
keiranmraine May 20, 2021
08135e5
add in new filter form Stan
keiranmraine May 20, 2021
95e1b48
hide vs-code workspace files
keiranmraine May 20, 2021
9ab3df8
Merge branch 'dev' into feature/patientCohortAxt
keiranmraine May 24, 2021
a73c82b
correct cli
keiranmraine Aug 24, 2021
68a877a
Cleanup legacy dev practices, add new method of handling license headers
keiranmraine Aug 26, 2021
089d30b
Correct c/p error
keiranmraine Aug 26, 2021
086c7ce
Indicate users should access wiki
keiranmraine Aug 26, 2021
7d1c812
remove comments
keiranmraine Aug 27, 2021
31aea3b
Correction to copyright line
keiranmraine Aug 27, 2021
5c54634
Correct skywalking eyes pattern
keiranmraine Aug 27, 2021
5b1db62
consolidate versions
keiranmraine Sep 8, 2021
5e2a44c
remove dev comment
keiranmraine Sep 8, 2021
1ce8000
update swe image, fix up badges
keiranmraine Sep 8, 2021
378cdc5
Detail the changes
keiranmraine Sep 8, 2021
31b8b7e
Iterative improvement to calls
keiranmraine Oct 6, 2021
12c8600
Some simple util scripts for generating data grids
keiranmraine Oct 6, 2021
ab98c3b
Faster blats due to reduced read parsing when multiple hits at same l…
keiranmraine Oct 6, 2021
c1afb9e
missing licenses
keiranmraine Oct 6, 2021
4f6b59e
current state
keiranmraine Nov 10, 2021
bd547b8
Improves large del processing
keiranmraine Nov 16, 2021
1a10d7b
Addition of simple repeat filters
keiranmraine Dec 13, 2021
eca430e
bring in recent changes from dev
keiranmraine Dec 15, 2021
83af748
some docs
keiranmraine Jan 20, 2022
bf3e985
Update Implement.pm
rulixxx Sep 30, 2022
03a97dd
Update pindelCohortVafSliceFill.pl
rulixxx Oct 10, 2022
cb472f9
Update pindelCohortVafSliceFill.pl
rulixxx Oct 10, 2022
8f53e43
metanorm filtering rules
rulixxx Nov 29, 2022
9113d8a
Create metanormRules.lst
rulixxx Nov 29, 2022
ac057fb
Update Pindel.pm
rulixxx Nov 30, 2022
cb8479a
Update Implement.pm
rulixxx Nov 30, 2022
b1be25f
Update Implement.pm
rulixxx Nov 30, 2022
4eab4bd
Update CHANGES.md
rulixxx Nov 30, 2022
93839f5
Update Implement.pm
rulixxx Dec 23, 2022
66dd39d
upped version
Dec 23, 2022
011bd27
reformatted blat command
Dec 23, 2022
5 changes: 5 additions & 0 deletions .dockerignore
@@ -6,7 +6,12 @@
/CHANGES.md
/.gitignore
/.git
/perl/blib
/pm_to_blib
/perl/docs
/perl/docs.tar.gz
/python/env
/install_tmp
/.circleci
/*.code-workspace
/tmp
2 changes: 2 additions & 0 deletions .gitignore
@@ -21,3 +21,5 @@
/perl/pm_to_blib
.idea/*
/python/env
/tmp
*.code-workspace
206 changes: 20 additions & 186 deletions CHANGES.md
@@ -1,188 +1,22 @@
# CHANGES

## 3.6.0
- Addition of `FF019` and `FF020` flags
- New flag rule set `pulldownFfpeRulesFragment.lst` including FF019 and FF020 made

## 3.5.0

- Update to core pindel algorithm to allow complex DI events to have longer inserted sequence than deleted
- Masking real events

## 3.4.1

- Updated Dockerfile to use pcap-core 5.4.0 - htslib/samtools 1.11

## 3.4.0

- Updated Dockerfile to use pcap-core 5.2.2
- Modified setup script to use build/\*.sh

## 3.3.0

- I/O hardening, see [milestone 3](https://github.com/cancerit/cgpPindel/milestone/3)

## 3.2.2

- Handle Input files that may have no reads at all, specifically an issue when generating a normal panel.

## 3.2.1

- Added Dockerfile and docker documentation

## 3.2.0

- Tabix search for high depth/excluded regions now performed in memory using IntervalTrees
- Reduces runtime of input step by ~50%
- Improved disk access profile
- Zero impact on results

## 3.1.2

- 3.0.5 introduced species parsing bug causing single word species names to be invalid.

## 3.1.1

- Fix regression - ability to cope with chromosomes with no events.

## 3.1.0

- Incorporates updated pindel which improves sensitivity
- Internally interpret QCFAIL to determine if whole pair fails

## 3.0.6

- Fixed version tag

## 3.0.5

- Handles species names with spaces in it
- modified checks for species,assembly and checksum

## 3.0.4

- Output bug for pindel BAM/CRAM corrected. When more than 1 chr in output files had no reads.

## 3.0.3

- Changes to how germline filter determined resulted in dummy germline bed file not being generated as previously.
- This release reinstates the old behaviour.

## 3.0.2

- Correct example rule files for \*Fragment.lst files to use FFnnn filter types

## 3.0.1

- Update tabix calls to directly use query_full (solves GRCh38 contig name issues).

## 3.0.0

- Germline bed file is now merged for adjacent regions (#31)
- More compressed intermediate files (#55)
- Change to `Const::Fast` where appropriate (#41)
- Removed TG VG from genotype.
- Readgroups are always variable, often 1 in data from last few years
- Not used by our filters.
- Supports BAM/CRAM inputs
- Output will be aligned with inputs
- bam vs cram
- bai vs csi
- Although ground work for csi input/output has been done `Bio::DB::HTS` doesn't support csi indexed input yet.
- Created our own fork at [`cancerit/Bio::DB::HTS`][cancerit-biodbhts] so that this could be enabled.
- You will need to install this manually or use one of our images for this functionallity.
- [dockstore-cgpwxs][ds-cgpwxs-git]
- [dockstore-cgpwxs][ds-cgpwgs-git]

<!-- -->

## 2.2.5

- Update tabix->query to tabix->query_full

## 2.2.4

- Force sorting of FILTER field to make records easier to diff.
- Fix sorting of final VCF to handle events with same start better when using comparison tools

## 2.2.3

Correct read sorting during collection of DI events. Caused some events to be split into many and
others to be missed (Thanks to @liangkaiye for patch)

## 2.2.3

Correct read sorting during collection of DI events. Caused some events to be split into many and
others to be missed (Thanks to @liangkaiye for patch)

## 2.2.2

Correction to sorting of VCF files

## 2.2.0

Reduces the amount of temporary space required and overall I/O

To process 40 million readpairs (40x Tumour + 40x Normal, chr21, 100bp reads):

Original time:

```
User time (seconds): 3553.88
System time (seconds): 63.92
Percent of CPU this job got: 159%
Elapsed (wall clock) time (h:mm:ss or m:ss): 37:51.63
File system inputs: 64
File system outputs: 1782080
```

New time:

```
User time (seconds): 3572.21
System time (seconds): 74.06
Percent of CPU this job got: 167%
Elapsed (wall clock) time (h:mm:ss or m:ss): 36:15.01
File system inputs: 0
File system outputs: 1139128
```

```
Original peak size: 650MB
New peak size: 291MB
```

__~55%__ reduction in working space and about __40%__ fewer writes to the file system.

Exactly the same results:

```bash
$ diff old/f9c3bc8e-dbc4-1ed0-e040-11ac0d4803a9_vs_f9c3bc8e-dbc1-1ed0-e040-11ac0d4803a9.germline.bed new/f9c3bc8e-dbc4-1ed0-e040-11ac0d4803a9_vs_f9c3bc8e-dbc1-1ed0-e040-11ac0d4803a9.germline.bed

$ diff_bams -a old/f9c3bc8e-dbc4-1ed0-e040-11ac0d4803a9_vs_f9c3bc8e-dbc1-1ed0-e040-11ac0d4803a9_wt.bam -b new/f9c3bc8e-dbc4-1ed0-e040-11ac0d4803a9_vs_f9c3bc8e-dbc1-1ed0-e040-11ac0d4803a9_wt.bam
Reference sequence count passed
Reference sequence order passed
Matching records: 194543

$ diff_bams -a old/f9c3bc8e-dbc4-1ed0-e040-11ac0d4803a9_vs_f9c3bc8e-dbc1-1ed0-e040-11ac0d4803a9_mt.bam -b new/f9c3bc8e-dbc4-1ed0-e040-11ac0d4803a9_vs_f9c3bc8e-dbc1-1ed0-e040-11ac0d4803a9_mt.bam
Reference sequence count passed
Reference sequence order passed
Matching records: 239737

$ /software/CGP/canpipe/live/bin/canpipe_live vcftools --gzvcf old/f9c3bc8e-dbc4-1ed0-e040-11ac0d4803a9_vs_f9c3bc8e-dbc1-1ed0-e040-11ac0d4803a9.flagged.vcf.gz --gzdiff new/f9c3bc8e-dbc4-1ed0-e040-11ac0d4803a9_vs_f9c3bc8e-dbc1-1ed0-e040-11ac0d4803a9.flagged.vcf.gz
...
Comparing individuals in VCF files...
N_combined_individuals: 2
N_individuals_common_to_both_files: 2
N_individuals_unique_to_file1: 0
N_individuals_unique_to_file2: 0
Comparing sites in VCF files...
Found 15321 SNPs common to both files.
Found 0 SNPs only in main file.
Found 0 SNPs only in second file.
After
```

[cancerit-biodbhts]: https://github.com/cancerit/Bio-DB-HTS/releases/tag/v2.10-rc1
[ds-cgpwgs-git]: https://github.com/cancerit/dockstore-cgpwgs
[ds-cgpwxs-git]: https://github.com/cancerit/dockstore-cgpwxs
## 1.0.1
- Added a file check after blat step

## 1.0.0
- Added filters to FlagVcf.pl to allow flagging of per-sample vcf outputs
- Fixed bugs in Implement.pm and pindelCohortVafSliceFill.pl
- Adds code to allow single sample processing with more accurate VAF calculations (via BLAT)
- Status of new scripts, "pre-release" indicates defaults and CLI may change:
  - stable
    - pindelCohort.pl
    - pindel_blat_vaf.pl
  - pre-release
    - pindelCohort_to_vcf.pl
    - pindel_vcfSortNsplit.pl
    - pindelCohortMerge.pl
    - pindelCohortVafFill.pl
    - pindelCohortVafSplit.pl
    - pindelCohortVafSliceFill.pl
- pinning to pindel v3.6.0
- Switch license management to skywalking-eyes.
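
Reviewer note: the changelog entry above mentions "more accurate VAF calculations". As background only, VAF (variant allele fraction) is conventionally the share of reads at a site that support the variant. The sketch below shows just that arithmetic. The function name `vaf` and its two read-count arguments are hypothetical, and nothing here reflects how the BLAT-based scripts in this PR derive their counts.

```shell
# Hypothetical helper, not part of this PR.
# vaf MUT_READS WT_READS: prints mutant / (mutant + wildtype) to 3 decimal places.
# Prints "NA" when there is no coverage, to avoid dividing by zero.
vaf() {
  awk -v m="$1" -v w="$2" 'BEGIN {
    total = m + w
    if (total == 0) { print "NA"; exit }
    printf "%.3f\n", m / total
  }'
}
```

For example, `vaf 5 15` treats 5 variant-supporting reads out of 20 total as a fraction of one quarter.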
3 changes: 2 additions & 1 deletion Dockerfile
@@ -6,7 +6,8 @@ USER root
# ALL tool versions used by opt-build.sh
# need to keep in sync with setup.sh
ENV VER_CGPVCF="v2.2.1"\
-    VER_VCFTOOLS="0.1.16"
+    VER_VCFTOOLS="0.1.16"\
+    VER_BLAT="v385"

# hadolint ignore=DL3008
RUN apt-get -yq update \
2 changes: 2 additions & 0 deletions build/opt-build-local.sh
@@ -63,3 +63,5 @@ if [ ! -e $SETUP_DIR/cgpPindel.success ]; then
cd $SETUP_DIR
touch $SETUP_DIR/cgpPindel.success
fi

rm -rf $SETUP_DIR
7 changes: 7 additions & 0 deletions build/opt-build.sh
@@ -75,3 +75,10 @@ if [ ! -e $SETUP_DIR/cgpVcf.success ]; then
rm -rf distro.* distro/*
touch $SETUP_DIR/cgpVcf.success
fi
set -x
if [ ! -e $SETUP_DIR/ucscTools.success ]; then
  curl -sSL http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64.${VER_BLAT}/blat/blat > $INST_PATH/bin/blat
  curl -sSL http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64.${VER_BLAT}/pslPretty > $INST_PATH/bin/pslPretty
  chmod ugo+x $INST_PATH/bin/blat $INST_PATH/bin/pslPretty
  touch $SETUP_DIR/ucscTools.success
fi
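
Reviewer note: `curl -sSL` without `-f` exits successfully on an HTTP error such as a 404, so an error page could be saved as `blat` and marked executable. One possible hardening is sketched below, as an illustration rather than the PR's code. The names `fetch_tool` and `FETCH_CMD` are hypothetical; the only real flag relied on beyond the original is curl's `-f` (`--fail`).

```shell
# Sketch only: fetch_tool and FETCH_CMD are hypothetical names, not part of this PR.

# Downloader command. -f makes curl exit non-zero on HTTP errors (e.g. 404)
# instead of writing the error page to the output.
FETCH_CMD=${FETCH_CMD:-curl -fsSL}

# fetch_tool URL DEST: download to a temporary file next to DEST, then move it
# into place only if the download succeeded and produced a non-empty file.
fetch_tool() {
  local url=$1 dest=$2 tmp
  tmp=$(mktemp "${dest}.XXXXXX")
  if $FETCH_CMD "$url" > "$tmp" && [ -s "$tmp" ]; then
    chmod 755 "$tmp"
    mv "$tmp" "$dest"
  else
    rm -f "$tmp"
    echo "failed to fetch $url" >&2
    return 1
  fi
}

# Intended use inside the ucscTools block (URLs taken from the diff above):
# fetch_tool "http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64.${VER_BLAT}/blat/blat" "$INST_PATH/bin/blat"
# fetch_tool "http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64.${VER_BLAT}/pslPretty" "$INST_PATH/bin/pslPretty"
```

Writing to a temporary file first means a failed download never leaves a partial file at the final path, so the `ucscTools.success` marker would only be touched after both tools are really in place.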
12 changes: 10 additions & 2 deletions perl/Makefile.PL
@@ -29,7 +29,6 @@
# 2009, 2010, 2011, 2012’.
#


use ExtUtils::MakeMaker;

WriteMakefile(
@@ -42,7 +41,16 @@ WriteMakefile(
bin/FlagVcf.pl
bin/pindel_merge_vcf_bam.pl
bin/pindel_np_from_vcf.pl
-    bin/pindel_germ_bed.pl)],
+    bin/pindel_germ_bed.pl
+    bin/pindelCohort.pl
+    bin/pindelCohort_to_vcf.pl
+    bin/pindel_vcfSortNsplit.pl
+    bin/pindel_blat_vaf.pl
+    bin/pindelCohortMerge.pl
+    bin/pindelCohortVafFill.pl
+    bin/pindelCohortVafSplit.pl
+    bin/pindelCohortVafSliceFill.pl
+    )],
PREREQ_PM => {
'Const::Fast' => 0.014,
'Try::Tiny' => 0.19,