converting the `download/*_genome.csv` files into a species list #58

taylorreiter · 2024-03-14T14:39:26Z

PreHGT runs at the genus level to pull in (pseudo) pangenome information, which is used to estimate contamination vs. real transfer events.

Right now, we don't do a good job of reporting how many and which species are represented in each genus. Below is some code I recently used to get the species (organism name) information from NCBI based on the genome accession (`GCA*`/`GCF*`).

I ran this on all files matching `download/*_genome.csv`.

Install tools

conda install -c conda-forge ncbi-datasets-cli jq

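Collect genome accessions without CSV headers

# Match only the per-genus input files so genomes.csv is never read back into itself on a re-run
for infile in *_genome.csv
do
  # Skip the header line of each file and append the remaining rows
  tail -n +2 "$infile" >> genomes.csv
done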

Get species (organism name)

# genomes.csv has no header row, so every line is read; the 3-argument match() requires GNU awk
while IFS= read -r accession
do
    datasets summary genome accession "$accession" | jq -r '.reports[] | [.accession, .organism.organism_name] | @csv'
done < <(awk -F, '{ if (match($2, /(GCA|GCF)_[0-9]+\.[0-9]+/, m)) print m[0] }' genomes.csv)
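If the output of the loop above is redirected to a file, a per-species tally gets at the "how many/which species" question. This is only a minimal sketch: it assumes each row looks like `"accession","organism name"` (the shape `@csv` produces for two strings) and that the first two words of the organism name are genus and species, which will not hold for every NCBI record. The function name `count_species` and the sample rows are hypothetical.

```shell
# Hypothetical helper: count genomes per species from a CSV whose rows look like
# "GCA_000000001.1","Genus species strain info"
count_species() {
    awk -F'","' '{
        name = $2
        gsub(/"/, "", name)                 # drop the trailing CSV quote
        n = split(name, words, " ")
        # Assumption: the first two words are "Genus species"
        species = (n >= 2) ? words[1] " " words[2] : name
        counts[species]++
    }
    END { for (s in counts) printf "%s,%d\n", s, counts[s] }' "$1" | sort
}

# Illustrative input only: placeholder accessions, not real records
example="$(mktemp)"
printf '%s\n' \
    '"GCA_000000001.1","Escherichia coli str. K-12"' \
    '"GCA_000000002.1","Escherichia coli O157:H7"' \
    '"GCF_000000003.1","Escherichia albertii"' > "$example"

count_species "$example"
```

Each output line is `species,count`, sorted by species name, so the result can be pasted into a report or joined back to the genus-level tables.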
