Load custom variants #350

silenus092 · 2018-10-30T04:07:29Z

Hi,

I successfully install and test ExAC browser by following instructions from readme.

However, I want to try it by loading my VCF files (custom variants) into mongodb and show them on ExAC browser as well, so I begin to inspect how the code works.

I paste my vcf file and name it as ExAC*.test1.vcf.gz under an exac_data directory
I use the following command to load the variants into database
python manage.py load_variants_file
Unfortunately , the process was not successfully loaded. I saw the error message,and it said
the python code cannot pass the info tags AC_AFR , AC_AMR , AC_EAS , AC_FIN etc.
Because my VCF file didn't contain those tags.
I found that the ExAC_HC.0.3.vep was annotated by VariantAnnotator and VEP , also the header represents tags as follows AN_AFR ,AN_AMR ,AC_AFR , AC_AMR , AC_EAS , AC_FIN etc.

4.Then , I try running both VariantAnnotator and VEP on my vcf file,nevertheless, those info tags still did not appear.

Therefore ,could you kindly explain the process of how to load custom variants to ExAC browser, so I can reproduce and apply a process in my vcf file.
I also read from #268 however it is not much of help for this problem

Best regards,
Bob

The text was updated successfully, but these errors were encountered:

nawatts · 2018-10-30T13:37:32Z

Hi @silenus092,

The ExAC browser is tailored to the ExAC dataset. It was not designed to allow loading arbitrary datasets and there is no supported way to do so.

get_variants_from_sites_vcf in parsing.py is the relevant code for loading variants. You would need to modify that function based on the info fields present in your VCF file.

exac_browser/parsing.py

Lines 53 to 147 in a212465

    
           def get_variants_from_sites_vcf(sites_vcf): 
        
               """ 
        
               Parse exac sites VCF file and return iter of variant dicts 
        
               sites_vcf is a file (gzipped), not file path 
        
               """ 
        
               vep_field_names = None 
        
               for line in sites_vcf: 
        
                   try: 
        
                       line = line.strip('\n') 
        
                       if line.startswith('##INFO=<ID=CSQ'): 
        
                           vep_field_names = line.split('Format: ')[-1].strip('">').split('|') 
        
                       if line.startswith('##INFO=<ID=DP_HIST'): 
        
                           dp_mids = map(float, line.split('Mids: ')[-1].strip('">').split('|')) 
        
                       if line.startswith('##INFO=<ID=GQ_HIST'): 
        
                           gq_mids = map(float, line.split('Mids: ')[-1].strip('">').split('|')) 
        
                       if line.startswith('#'): 
        
                           continue 
        
                       # If we get here, it's a variant line 
        
                       if vep_field_names is None: 
        
                           raise Exception("VEP_field_names is None. Make sure VCF header is present.") 
        
                       # This elegant parsing code below is copied from https://github.com/konradjk/loftee 
        
                       fields = line.split('\t') 
        
                       info_field = dict([(x.split('=', 1)) if '=' in x else (x, x) for x in re.split(';(?=\w)', fields[7])]) 
        
                       consequence_array = info_field['CSQ'].split(',') if 'CSQ' in info_field else [] 
        
                       annotations = [dict(zip(vep_field_names, x.split('|'))) for x in consequence_array if len(vep_field_names) == len(x.split('|'))] 
        
                       coding_annotations = [ann for ann in annotations if ann['Feature'].startswith('ENST')] 
        
                       alt_alleles = fields[4].split(',') 
        
                       # different variant for each alt allele 
        
                       for i, alt_allele in enumerate(alt_alleles): 
        
                           vep_annotations = [ann for ann in coding_annotations if int(ann['ALLELE_NUM']) == i + 1] 
        
                           # Variant is just a dict 
        
                           # Make a copy of the info_field dict - so all the original data remains 
        
                           # Add some new keys that are allele-specific 
        
                           pos, ref, alt = get_minimal_representation(fields[1], fields[3], alt_allele) 
        
                           variant = {} 
        
                           variant['chrom'] = fields[0] 
        
                           variant['pos'] = pos 
        
                           variant['rsid'] = fields[2] 
        
                           variant['xpos'] = get_xpos(variant['chrom'], variant['pos']) 
        
                           variant['ref'] = ref 
        
                           variant['alt'] = alt 
        
                           variant['xstart'] = variant['xpos'] 
        
                           variant['xstop'] = variant['xpos'] + len(variant['alt']) - len(variant['ref']) 
        
                           variant['variant_id'] = '{}-{}-{}-{}'.format(variant['chrom'], variant['pos'], variant['ref'], variant['alt']) 
        
                           variant['orig_alt_alleles'] = [ 
        
                               '{}-{}-{}-{}'.format(variant['chrom'], *get_minimal_representation(fields[1], fields[3], x)) 
        
                               for x in alt_alleles 
        
                           ] 
        
                           variant['site_quality'] = float(fields[5]) 
        
                           variant['filter'] = fields[6] 
        
                           variant['vep_annotations'] = vep_annotations 
        
                           variant['allele_count'] = int(info_field['AC_Adj'].split(',')[i]) 
        
                           if not variant['allele_count'] and variant['filter'] == 'PASS': variant['filter'] = 'AC_Adj0' # Temporary filter 
        
                           variant['allele_num'] = int(info_field['AN_Adj']) 
        
                           if variant['allele_num'] > 0: 
        
                               variant['allele_freq'] = variant['allele_count']/float(info_field['AN_Adj']) 
        
                           else: 
        
                               variant['allele_freq'] = None 
        
                           variant['pop_acs'] = dict([(POPS[x], int(info_field['AC_%s' % x].split(',')[i])) for x in POPS]) 
        
                           variant['pop_ans'] = dict([(POPS[x], int(info_field['AN_%s' % x])) for x in POPS]) 
        
                           variant['pop_homs'] = dict([(POPS[x], int(info_field['Hom_%s' % x].split(',')[i])) for x in POPS]) 
        
                           variant['ac_male'] = info_field['AC_MALE'] 
        
                           variant['ac_female'] = info_field['AC_FEMALE'] 
        
                           variant['an_male'] = info_field['AN_MALE'] 
        
                           variant['an_female'] = info_field['AN_FEMALE'] 
        
                           variant['hom_count'] = sum(variant['pop_homs'].values()) 
        
                           if variant['chrom'] in ('X', 'Y'): 
        
                               variant['pop_hemis'] = dict([(POPS[x], int(info_field['Hemi_%s' % x].split(',')[i])) for x in POPS]) 
        
                               variant['hemi_count'] = sum(variant['pop_hemis'].values()) 
        
                           variant['quality_metrics'] = dict([(x, info_field[x]) for x in METRICS if x in info_field]) 
        
                           variant['genes'] = list({annotation['Gene'] for annotation in vep_annotations}) 
        
                           variant['transcripts'] = list({annotation['Feature'] for annotation in vep_annotations}) 
        
                           if 'DP_HIST' in info_field: 
        
                               hists_all = [info_field['DP_HIST'].split(',')[0], info_field['DP_HIST'].split(',')[i+1]] 
        
                               variant['genotype_depths'] = [zip(dp_mids, map(int, x.split('|'))) for x in hists_all] 
        
                           if 'GQ_HIST' in info_field: 
        
                               hists_all = [info_field['GQ_HIST'].split(',')[0], info_field['GQ_HIST'].split(',')[i+1]] 
        
                               variant['genotype_qualities'] = [zip(gq_mids, map(int, x.split('|'))) for x in hists_all] 
        
                           yield variant 
        
                   except Exception: 
        
                       print("Error parsing vcf line: " + line) 
        
                       traceback.print_exc() 
        
                       break

In particular, it looks like the population specific fields are not present in your VCF.

exac_browser/parsing.py

Lines 120 to 122 in a212465

    
           variant['pop_acs'] = dict([(POPS[x], int(info_field['AC_%s' % x].split(',')[i])) for x in POPS]) 
        
           variant['pop_ans'] = dict([(POPS[x], int(info_field['AN_%s' % x])) for x in POPS]) 
        
           variant['pop_homs'] = dict([(POPS[x], int(info_field['Hom_%s' % x].split(',')[i])) for x in POPS])

exac_browser/parsing.py

Line 129 in a212465

    
           variant['pop_hemis'] = dict([(POPS[x], int(info_field['Hemi_%s' % x].split(',')[i])) for x in POPS])

In addition to the code for loading variants, there is code in other parts of the browser (such as the HTML templates) that assumes variants have all the fields defined in the ExAC VCF. That would also have to be modified to support the format of variants in your dataset.

silenus092 · 2018-10-31T03:33:29Z

Thank you very much , then
if it'spossible to generate VCF file similar to ExAC_HC.0.3.*.vep.vcf that including all required tags (INFO) such as pop_acs,pop_ans,pop_homs ,AC_AMR ,Hemi_FIN etc. so I don't have to modify the code.
how can I achieve this task? like which VEP version , external resources and command are used

nawatts · 2018-11-01T19:14:48Z

How to generate correct values for those fields (and whether or not that is possible at all) would depend on your sample data.

The VEP version used for ExAC was version 85 (http://exac.broadinstitute.org/faq)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Load custom variants #350

Load custom variants #350

silenus092 commented Oct 30, 2018

nawatts commented Oct 30, 2018

silenus092 commented Oct 31, 2018 •

edited

Loading

nawatts commented Nov 1, 2018

Load custom variants #350

Load custom variants #350

Comments

silenus092 commented Oct 30, 2018

nawatts commented Oct 30, 2018

silenus092 commented Oct 31, 2018 • edited Loading

nawatts commented Nov 1, 2018

silenus092 commented Oct 31, 2018 •

edited

Loading