Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Load custom variants #350

Open
silenus092 opened this issue Oct 30, 2018 · 3 comments
Open

Load custom variants #350

silenus092 opened this issue Oct 30, 2018 · 3 comments

Comments

@silenus092
Copy link

Hi,

I successfully install and test ExAC browser by following instructions from readme.

However, I want to try it by loading my VCF files (custom variants) into mongodb and show them on ExAC browser as well, so I begin to inspect how the code works.

  1. I paste my vcf file and name it as ExAC*.test1.vcf.gz under an exac_data directory

  2. I use the following command to load the variants into database
    python manage.py load_variants_file

  3. Unfortunately , the process was not successfully loaded. I saw the error message,and it said
    the python code cannot pass the info tags AC_AFR , AC_AMR , AC_EAS , AC_FIN etc.
    Because my VCF file didn't contain those tags.
    I found that the ExAC_HC.0.3.vep was annotated by VariantAnnotator and VEP , also the header represents tags as follows AN_AFR ,AN_AMR ,AC_AFR , AC_AMR , AC_EAS , AC_FIN etc.

4.Then , I try running both VariantAnnotator and VEP on my vcf file,nevertheless, those info tags still did not appear.

Therefore ,could you kindly explain the process of how to load custom variants to ExAC browser, so I can reproduce and apply a process in my vcf file.
I also read from #268 however it is not much of help for this problem

Best regards,
Bob

@nawatts
Copy link

nawatts commented Oct 30, 2018

Hi @silenus092,

The ExAC browser is tailored to the ExAC dataset. It was not designed to allow loading arbitrary datasets and there is no supported way to do so.

get_variants_from_sites_vcf in parsing.py is the relevant code for loading variants. You would need to modify that function based on the info fields present in your VCF file.

exac_browser/parsing.py

Lines 53 to 147 in a212465

def get_variants_from_sites_vcf(sites_vcf):
"""
Parse exac sites VCF file and return iter of variant dicts
sites_vcf is a file (gzipped), not file path
"""
vep_field_names = None
for line in sites_vcf:
try:
line = line.strip('\n')
if line.startswith('##INFO=<ID=CSQ'):
vep_field_names = line.split('Format: ')[-1].strip('">').split('|')
if line.startswith('##INFO=<ID=DP_HIST'):
dp_mids = map(float, line.split('Mids: ')[-1].strip('">').split('|'))
if line.startswith('##INFO=<ID=GQ_HIST'):
gq_mids = map(float, line.split('Mids: ')[-1].strip('">').split('|'))
if line.startswith('#'):
continue
# If we get here, it's a variant line
if vep_field_names is None:
raise Exception("VEP_field_names is None. Make sure VCF header is present.")
# This elegant parsing code below is copied from https://github.com/konradjk/loftee
fields = line.split('\t')
info_field = dict([(x.split('=', 1)) if '=' in x else (x, x) for x in re.split(';(?=\w)', fields[7])])
consequence_array = info_field['CSQ'].split(',') if 'CSQ' in info_field else []
annotations = [dict(zip(vep_field_names, x.split('|'))) for x in consequence_array if len(vep_field_names) == len(x.split('|'))]
coding_annotations = [ann for ann in annotations if ann['Feature'].startswith('ENST')]
alt_alleles = fields[4].split(',')
# different variant for each alt allele
for i, alt_allele in enumerate(alt_alleles):
vep_annotations = [ann for ann in coding_annotations if int(ann['ALLELE_NUM']) == i + 1]
# Variant is just a dict
# Make a copy of the info_field dict - so all the original data remains
# Add some new keys that are allele-specific
pos, ref, alt = get_minimal_representation(fields[1], fields[3], alt_allele)
variant = {}
variant['chrom'] = fields[0]
variant['pos'] = pos
variant['rsid'] = fields[2]
variant['xpos'] = get_xpos(variant['chrom'], variant['pos'])
variant['ref'] = ref
variant['alt'] = alt
variant['xstart'] = variant['xpos']
variant['xstop'] = variant['xpos'] + len(variant['alt']) - len(variant['ref'])
variant['variant_id'] = '{}-{}-{}-{}'.format(variant['chrom'], variant['pos'], variant['ref'], variant['alt'])
variant['orig_alt_alleles'] = [
'{}-{}-{}-{}'.format(variant['chrom'], *get_minimal_representation(fields[1], fields[3], x))
for x in alt_alleles
]
variant['site_quality'] = float(fields[5])
variant['filter'] = fields[6]
variant['vep_annotations'] = vep_annotations
variant['allele_count'] = int(info_field['AC_Adj'].split(',')[i])
if not variant['allele_count'] and variant['filter'] == 'PASS': variant['filter'] = 'AC_Adj0' # Temporary filter
variant['allele_num'] = int(info_field['AN_Adj'])
if variant['allele_num'] > 0:
variant['allele_freq'] = variant['allele_count']/float(info_field['AN_Adj'])
else:
variant['allele_freq'] = None
variant['pop_acs'] = dict([(POPS[x], int(info_field['AC_%s' % x].split(',')[i])) for x in POPS])
variant['pop_ans'] = dict([(POPS[x], int(info_field['AN_%s' % x])) for x in POPS])
variant['pop_homs'] = dict([(POPS[x], int(info_field['Hom_%s' % x].split(',')[i])) for x in POPS])
variant['ac_male'] = info_field['AC_MALE']
variant['ac_female'] = info_field['AC_FEMALE']
variant['an_male'] = info_field['AN_MALE']
variant['an_female'] = info_field['AN_FEMALE']
variant['hom_count'] = sum(variant['pop_homs'].values())
if variant['chrom'] in ('X', 'Y'):
variant['pop_hemis'] = dict([(POPS[x], int(info_field['Hemi_%s' % x].split(',')[i])) for x in POPS])
variant['hemi_count'] = sum(variant['pop_hemis'].values())
variant['quality_metrics'] = dict([(x, info_field[x]) for x in METRICS if x in info_field])
variant['genes'] = list({annotation['Gene'] for annotation in vep_annotations})
variant['transcripts'] = list({annotation['Feature'] for annotation in vep_annotations})
if 'DP_HIST' in info_field:
hists_all = [info_field['DP_HIST'].split(',')[0], info_field['DP_HIST'].split(',')[i+1]]
variant['genotype_depths'] = [zip(dp_mids, map(int, x.split('|'))) for x in hists_all]
if 'GQ_HIST' in info_field:
hists_all = [info_field['GQ_HIST'].split(',')[0], info_field['GQ_HIST'].split(',')[i+1]]
variant['genotype_qualities'] = [zip(gq_mids, map(int, x.split('|'))) for x in hists_all]
yield variant
except Exception:
print("Error parsing vcf line: " + line)
traceback.print_exc()
break

In particular, it looks like the population specific fields are not present in your VCF.

exac_browser/parsing.py

Lines 120 to 122 in a212465

variant['pop_acs'] = dict([(POPS[x], int(info_field['AC_%s' % x].split(',')[i])) for x in POPS])
variant['pop_ans'] = dict([(POPS[x], int(info_field['AN_%s' % x])) for x in POPS])
variant['pop_homs'] = dict([(POPS[x], int(info_field['Hom_%s' % x].split(',')[i])) for x in POPS])

variant['pop_hemis'] = dict([(POPS[x], int(info_field['Hemi_%s' % x].split(',')[i])) for x in POPS])

In addition to the code for loading variants, there is code in other parts of the browser (such as the HTML templates) that assumes variants have all the fields defined in the ExAC VCF. That would also have to be modified to support the format of variants in your dataset.

@silenus092
Copy link
Author

silenus092 commented Oct 31, 2018

Thank you very much , then
if it'spossible to generate VCF file similar to ExAC_HC.0.3.*.vep.vcf that including all required tags (INFO) such as pop_acs,pop_ans,pop_homs ,AC_AMR ,Hemi_FIN etc. so I don't have to modify the code.
how can I achieve this task? like which VEP version , external resources and command are used

@nawatts
Copy link

nawatts commented Nov 1, 2018

How to generate correct values for those fields (and whether or not that is possible at all) would depend on your sample data.

The VEP version used for ExAC was version 85 (http://exac.broadinstitute.org/faq)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants