forked from micknudsen/snpahoy

ID SNP analysis for NGS sequencing

asp8200/snpahoy

SNP Ahoy!

Just a little tool for checking ID SNPs. It works in both germline and somatic modes, as described in the sections below. By default, only sites with at least 30X coverage are considered, and a site is called homozygous when the frequency of its most common allele is at least 95%.

$ snpahoy --help
Usage: snpahoy [OPTIONS] COMMAND [ARGS]...

Options:
  --minimum_coverage INTEGER      Only consider SNP positions with a least this
                                  coverage  [default: 30]

  --minimum_base_quality INTEGER  Only count bases with at least this quality
                                  [default: 1]

  --homozygosity_threshold FLOAT  Consider a SNP position homozygote if
                                  frequency of most common allele is this or
                                  higher  [default: 0.95]

  --help                          Show this message and exit.

Commands:
  germline
  somatic
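
The two default thresholds can be illustrated with a minimal Python sketch. This is not snpahoy's actual implementation: the function name `call_genotype` is hypothetical, and the rule for heterozygous sites (report the two most common alleles in alphabetical order) is an assumption, since the text above only specifies the coverage and homozygosity rules.

```python
from typing import Dict


def call_genotype(counts: Dict[str, int],
                  minimum_coverage: int = 30,
                  homozygosity_threshold: float = 0.95) -> str:
    """Hypothetical genotype caller mirroring the documented defaults."""
    depth = sum(counts.values())
    # Sites below the coverage threshold (or with no reads) are not genotyped.
    if depth == 0 or depth < minimum_coverage:
        return ""
    # Rank alleles by count, most common first (ties broken alphabetically).
    ranked = sorted(counts, key=lambda allele: (-counts[allele], allele))
    major = ranked[0]
    # Homozygous if the most common allele reaches the threshold frequency.
    if counts[major] / depth >= homozygosity_threshold:
        return major * 2
    # Assumption: a heterozygous call is the two most common alleles, sorted.
    return "".join(sorted(ranked[:2]))


print(call_genotype({"A": 1, "C": 87, "G": 0, "T": 0}))  # CC
```

With 87 of 88 reads supporting C (about 98.9%), the site passes both thresholds and is called CC, matching the first site in the germline example output.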

Germline Mode

To run in germline mode, provide a BAM file using the --bam_file option, together with the required --bed_file and --output_json_file options.

$ snpahoy germline --help
Usage: snpahoy germline [OPTIONS]

Options:
  --bed_file PATH          BED file with SNP postions  [required]
  --bam_file PATH          BAM file (must be indexed)  [required]
  --output_json_file PATH  JSON output file  [required]
  --help                   Show this message and exit.
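
For scripted use, the options above can be assembled into a command line from Python. The sketch below is only an illustration: the helper `build_germline_command` and the output name `germline.json` are hypothetical, while the option names are taken from the help output. Global options such as --minimum_coverage are placed before the subcommand, following the top-level usage line.

```python
from typing import List


def build_germline_command(bed_file: str, bam_file: str, output_json_file: str,
                           minimum_coverage: int = 30) -> List[str]:
    """Assemble the argument list for `snpahoy germline` (hypothetical helper)."""
    return [
        "snpahoy",
        # Global options go before the subcommand.
        "--minimum_coverage", str(minimum_coverage),
        "germline",
        "--bed_file", bed_file,
        "--bam_file", bam_file,
        "--output_json_file", output_json_file,
    ]


command = build_germline_command("snps.bed", "germline.bam", "germline.json")
print(" ".join(command))
# To run the tool (requires snpahoy and an indexed BAM file), pass the list to
# subprocess.run(command, check=True) after importing subprocess.
```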

The output JSON file contains input information, genotypes at all SNP positions, and a summary. If a SNP is not genotyped (as for the chrY SNPs in the example below), the empty string is reported as its genotype.

{
    "input": {
        "settings": {
            "minimum-coverage": 30,
            "minimum-base-quality": 1,
            "homozygosity-threshold": 0.95
        },
        "files": {
            "bed-file": "snps.bed",
            "bam-file": "germline.bam"
        }
    },
    "output": {
        "details": {
            "chr1:4789323": {
                "genotype": "CC",
                "depth": 88,
                "counts": {
                    "A": 1,
                    "C": 87,
                    "G": 0,
                    "T": 0
                }
            },
            "chr1:4895801": {
                "genotype": "TT",
                "depth": 65,
                "counts": {
                    "A": 0,
                    "C": 0,
                    "G": 0,
                    "T": 65
                }
            },

            (...)

            "chrY:23164803": {
                "genotype": "",
                "depth": 0,
                "counts": {
                    "A": 0,
                    "C": 0,
                    "G": 0,
                    "T": 0
                }
            }
        },
        "summary": {
            "snps": {
                "total": 1041,
                "genotyped": 1016
            },
            "heterozygotes-fraction": 0.4744,
            "mean-maf-homozygote-sites": 0.0022,
            "mean-off-genotype-frequency": 0.0023
        }
    }
}
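
As a minimal sketch of how such a report might be consumed, the Python snippet below lists the SNPs that were not genotyped. The helper `ungenotyped_sites` is hypothetical, and the report is a trimmed-down copy of the example above embedded as a string; in practice the file written via --output_json_file would be loaded instead.

```python
import json

# A trimmed-down report using the same keys as the example output.
report_text = """
{
    "output": {
        "details": {
            "chr1:4789323": {"genotype": "CC", "depth": 88,
                             "counts": {"A": 1, "C": 87, "G": 0, "T": 0}},
            "chrY:23164803": {"genotype": "", "depth": 0,
                              "counts": {"A": 0, "C": 0, "G": 0, "T": 0}}
        }
    }
}
"""


def ungenotyped_sites(report: dict) -> list:
    """Return the positions whose genotype is the empty string."""
    details = report["output"]["details"]
    return [site for site, info in details.items() if info["genotype"] == ""]


report = json.loads(report_text)
missing = ungenotyped_sites(report)
print(missing)
print(len(report["output"]["details"]) - len(missing), "site(s) genotyped")
```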

Somatic Mode

To run in somatic mode, provide tumor and germline BAM files using the --tumor_bam_file and --germline_bam_file options.

$ snpahoy somatic --help
Usage: snpahoy somatic [OPTIONS]

Options:
  --bed_file PATH           BED file with SNP postions  [required]
  --tumor_bam_file PATH     Tumor BAM file (must be indexed)  [required]
  --germline_bam_file PATH  Germline BAM file (must be indexed)  [required]
  --output_json_file PATH   JSON output file  [required]
  --help                    Show this message and exit.

Output is similar to that in germline mode. Only sites that are genotyped in both tumor and germline are used, and the mean MAF calculations use the sites that are homozygous in the germline sample.

{
    "input": {
        "settings": {
            "minimum-coverage": 30,
            "minimum-base-quality": 1,
            "homozygosity-threshold": 0.95
        },
        "files": {
            "bed-file": "snps.bed",
            "tumor-bam-file": "tumor.bam",
            "germline-bam-file": "germline.bam"
        }
    },
    "output": {
        "details": {
            "tumor": { ... },
            "germline": { ... }
        },
        "summary": {
            "snps": {
                "total": 1041,
                "genotyped": 1000
            },
            "tumor": {
                "heterozygotes-fraction": 0.474,
                "mean-maf-homozygote-sites": 0.0019,
                "mean-off-genotype-frequency": 0.0019
            },
            "germline": {
                "heterozygotes-fraction": 0.474,
                "mean-maf-homozygote-sites": 0.0022,
                "mean-off-genotype-frequency": 0.0019
            }
        }
    }
}

This tool was developed with the MSK IMPACT panel in mind. Suggested cut-offs for identifying sample swap or contamination are 0.55 for heterozygotes fractions and 0.01 for mean MAFs.
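
The cut-offs can be applied to a summary block with a few lines of Python. This is a sketch under stated assumptions, not part of snpahoy: the helper `flag_sample` is hypothetical, values strictly above a cut-off are treated as suspicious, and the mean MAF cut-off is applied only to the `mean-maf-homozygote-sites` field.

```python
from typing import Dict, List

# Suggested cut-offs from the text above.
HETEROZYGOTES_FRACTION_CUTOFF = 0.55
MEAN_MAF_CUTOFF = 0.01


def flag_sample(summary: Dict[str, float]) -> List[str]:
    """Return the names of the metrics that exceed the suggested cut-offs."""
    flags = []
    # Assumption: values strictly above a cut-off are considered suspicious.
    if summary["heterozygotes-fraction"] > HETEROZYGOTES_FRACTION_CUTOFF:
        flags.append("heterozygotes-fraction")
    if summary["mean-maf-homozygote-sites"] > MEAN_MAF_CUTOFF:
        flags.append("mean-maf-homozygote-sites")
    return flags


# Tumor summary values from the somatic example output.
tumor = {
    "heterozygotes-fraction": 0.474,
    "mean-maf-homozygote-sites": 0.0019,
    "mean-off-genotype-frequency": 0.0019,
}
print(flag_sample(tumor))
```

For the example tumor summary, both metrics are below their cut-offs, so no flags are raised.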

Installation

The recommended way to install snpahoy is by using conda:

$ conda install -c micknudsen snpahoy
