typora-copy-images-to | sequence-fold | command-fold | last-update |
---|---|---|---|
./ |
80 xclip -o | tr -d "\n" | fold -w80 |
75 xclip -o | tr -d "\n" | fold -sw75 | sed -e 's/$/\\/' |
April 2nd, 2018 |
by alper yilmaz for GTU Bioinformatics Program Course
2019-03-14 (PDF version of this document is accessible at goo.gl/bhrkqQ
)
[TOC]
Instructions:
- If you are viewing this document in JupyterLab environment, you can edit and run commands in cells. You can use
Ctrl+Enter
keyboard shortcut to run a cell. Normally the code is run in Python kernel but if the code has preceding!
then it's run in terminal. Same effect is achieved by adding%%bash
on top of the cell.- Also, JupyterLab environment provides actual terminal access. Under
File
menu go toNew
option and chooseTerminal
.
- Also, JupyterLab environment provides actual terminal access. Under
- To exercise terminal commands, please register an account at Docker Cloud and log in to Play With Docker site with your Docker ID. Then, click
Add New Instance
and type the following command in terminal windowdocker run -it alperyilmaz/blast-course
- Alternative access: Windows users should download and use Putty software to connect remote server in order to run the commands. The IP address is will be provided during or right after class and you should have your usernames provided. Linux users or Windows users with cygwin installed in their system should use
ssh
for connection. - Any command starting with dollar sign '$', such as:
$ blastn -db Ecoli_cds -query test.fa
can be copied and pasted to terminal without the dollar sign and executed. Please exercise how to paste text to Putty window before hand (the middle click trick) in order to have better experience.
- Web (Howto Guide link)
- No installation (both software and index)
- Suitable for small jobs
- Output is in HTML (links to resulting sequences)
- Web service (REST or programmatic access)
- Usually an app or software uses this channel to query and retrieve results (details) (sample code)
- Commandline
- Needs terminal access (which rocks ;)
- Needs installation of executable and preparation of indexes
- No limit in job size
- No need to wait for queue
- Scriptable (can easily be part of a pipeline or automated script)
Web BLAST is actually the commandline BLAST running in server. Web BLAST is running BLAST version 2.6.1+ (see below) and the commandline BLAST is running 2.6.0+ version (see
blastn -h
output below).Web BLAST is configured so that the input and arguments are retrieved from HTML form filledin by user and the results are sent to user in HTML format. In commandline version, the user provides an actual file as input and gets an actual file as output.
The concept of commandline and Linux operating system itself might be a new concept to most of us. Thus, couple of the resources below might orient you for this new concept.
-
For brief introduction to Linux commandline please go over the presentation prepared by Christopher Shaffer at Washington University.
-
Check out BLAST in commandline presentation by Andrew Alverson.
-
A single page of Linux cheat sheet might give you an idea of important commands.
No need to run these commands, they are just for demonstration purposes.
-
a short example
$ blastn -db Ecoli_cds -query test.fa
-
an advanced example (looping through all output options - no single click involved)
$ for i in {0..12}; do blastn -db customdb -query test.fa -outfmt $i -out custom-format"$i".out; done
No need to run those commands for now.
- NCBI indexes
- can be downloaded and updated regularly (local Blast, works offline)
nr
content is around 27 GB (compressed) andnt
content is around 36 GB (compressed)- whole BLAST in your own computer!
- accessed remotely by
-remote
option in terminal (remote Blast)- sequences are sent to Blast web server for search (needs internet connection)
- then why not just use the web version? Scriptable, remember!
- can be downloaded and updated regularly (local Blast, works offline)
- Custom index (local Blast, works offline)
- You can prepare a custom Fasta file and then index it. After that you can Blast many sequences against your custom index.
Note: Beware that
nr
ornt
database contains ALL sequences in non-redundant fashion. It includes mRNA sequences as well as genome files. So, if you BLAST a gene sequence tonr
you'd get too many hits due to chromosome, partial chromosome, contigs, etc. results.
The BLAST software has overwhelming number of arguments (shown below)
$ blastn -h
USAGE
blastn [-h] [-help] [-import_search_strategy filename]
[-export_search_strategy filename] [-task task_name] [-db database_name]
[-dbsize num_letters] [-gilist filename] [-seqidlist filename]
[-negative_gilist filename] [-entrez_query entrez_query]
[-db_soft_mask filtering_algorithm] [-db_hard_mask filtering_algorithm]
[-subject subject_input_file] [-subject_loc range] [-query input_file]
[-out output_file] [-evalue evalue] [-word_size int_value]
[-gapopen open_penalty] [-gapextend extend_penalty]
[-perc_identity float_value] [-qcov_hsp_perc float_value]
[-xdrop_ungap float_value] [-xdrop_gap float_value]
[-xdrop_gap_final float_value] [-searchsp int_value] [-max_hsps int_value]
[-sum_stats bool_value] [-penalty penalty] [-reward reward] [-no_greedy]
[-min_raw_gapped_score int_value] [-template_type type]
[-template_length int_value] [-dust DUST_options]
[-filtering_db filtering_database]
[-window_masker_taxid window_masker_taxid]
[-window_masker_db window_masker_db] [-soft_masking soft_masking]
[-ungapped] [-culling_limit int_value] [-best_hit_overhang float_value]
[-best_hit_score_edge float_value] [-window_size int_value]
[-off_diagonal_range int_value] [-use_index boolean] [-index_name string]
[-lcase_masking] [-query_loc range] [-strand strand] [-parse_deflines]
[-outfmt format] [-show_gis] [-num_descriptions int_value]
[-num_alignments int_value] [-line_length line_length] [-html]
[-max_target_seqs num_sequences] [-num_threads int_value] [-remote]
[-version]
DESCRIPTION
Nucleotide-Nucleotide BLAST 2.2.30+
Use '-help' to print detailed descriptions of command line arguments
If you type blastn -help
you'll get more detailed explanation for each argument. Luckily we don't need to specify each argument since most of them have defaults:
-num_alignments <Integer, >=0>
Number of database sequences to show alignments for
Default = `250'
* Incompatible with: max_target_seqs
Required arguments are:
-query <File_In>
Input file name
Default = `-'
-db <String>
BLAST database name
* Incompatible with: subject, subject_loc
Let's run the command from Jupyter. Select the cell below and then just click Run button or press Ctrl+Enter
keys
blastn -h
OR
! blastn -h
The "BLAST Results" page contains options for Dowloading data and formatting the view, they also have commandline counterparts
Please check out the interactive demo of Needleman-Wunsch algorithm and observe how match, mismatch and gap penalty scores effect the resulting alignment. (alternative link, you need to download index.html and nwunsch.js files). The Wikipedia page for this algorithm explains how to fill in the table. While Needleman-Wunsch algorithm is used for global alignment, Smith-Waterman algorithm is used for local alignments. The visualization of local alignment algorithm is located here.
If you're interested in more depth of alignment algorithms used by BLAST, please download Sequence Alignment Teacher software (link to its publication).
The commandline argument -task
is used for defining the search strategy which are predefined sets of arguments.
-task <String, Permissible values: 'blastn' 'blastn-short' 'dc-megablast' 'megablast' 'rmblastn' >
Task to execute
Default = `megablast'
blastn: the traditional program used for inter-species comparisons
blastn-short: optimized for sequences less than 30 nucleotides
dc-megablast: typically used for inter-species comparisons
megablast: for very similar sequences (e.g, sequencing errors)
rmblastn: RepeatMasker compatible version of the standard NCBI blastn program (details)
option/task | megablast | blastn | blastn-short |
---|---|---|---|
word_size | 28 | 11 | 7 |
gapextend | none | 2 | 2 |
Blast User Manual Table C2 lists differences between different strategies.
megablast
is default task forblastn
. If you don't get any hits, you should trydc-megablast
andblastn
By using -export_search_strategy
option you can save the search strategy and by using -import_search_strategy
option you can import an existing strategy.
NCBI Blast toolkit comes with necessary executable to generate BLAST index. Let's see contents of the working directory before we generate the index.
ls
Let's generate index for Ecoli-cds.fa
file.
makeblastdb -in Ecoli-cds.fa -dbtype nucl -title Ecoli_cds -out Ecoli_cds
If you are interested in contents of the file, you can easily see the contents oft the file irrespective of its size (Linux rocks ;). The cat
command can print contents of a file on screen. If the file contents is large then you can kill the process by pressing Control+c
. Since the fasta file is long, let's use another function to glimpse at file contents. head
prints first 10 lines of the file and tail
prints last 10 lines of the file.
head Ecoli-cds.fa
Let's see some more commandline tricks. What is the fastest way to know how many genes are there in the fasta file? Let's count number of lines that contain >
characters in the file. To achieve this we'll use grep
command which prints the line containing a specific pattern. Let's ask grep
to print lines which contain >
character.
grep ">" Ecoli-cds.fa | head
Now, let's use the terminal pipe, |
to direct output of grep
to another command which can count number of lines. The command is called wc
and when used with -l
argument it counts number of lines.
grep ">" Ecoli-cds.fa | wc -l
grep
command prints lines that contain a certain character or pattern and sends its output to another command which counts number of lines. The pipe character |
(you can print this character by pressing AltGr+<
keys) is used to pipe data between different commands. It looks like the fasta file contains 4097 sequences.
About copy/paste: The long commands are broken into multiple lines with
\
character, so you should include those characters if there're any. When you paste the command to terminal you would notice an extra>
character at beginning of lines, that's normal, and indicates that line continues from previous line.
blastn -query test.fa -db Ecoli_cds
In order have shorter output, we are requesting 10 descriptions and 5 alignments only.
blastn -remote -db nr -query test.fa -entrez_query "E.coli[Organism]" \
-evalue 1e-20 -num_descriptions 10 -num_alignments 5
Please go over BLAST Results manual.
There are numerous formats generated by BLAST, they can be listed in detail upon $ blastn -help
command:
-outfmt <String>
alignment view options:
0 = pairwise,
1 = query-anchored showing identities,
2 = query-anchored no identities,
3 = flat query-anchored, show identities,
4 = flat query-anchored, no identities,
5 = XML Blast output,
6 = tabular,
7 = tabular with comment lines,
8 = Text ASN.1,
9 = Binary ASN.1,
10 = Comma-separated values,
11 = BLAST archive format (ASN.1)
12 = JSON Seqalign output
As you can see, options 6 and 7 are tabular format. Let's examine format 6:
blastn -query test.fa -db 16SMicrobial -max_target_seqs 1 -outfmt 6
If you also examine format 7, it's tabular and it contains column names (in commented lines)
blastn -query test.fa -db 16SMicrobial -max_target_seqs 1 -outfmt 7
Options 6, 7 and 10 can be additionally configured to produce a custom format specified by space delimited format specifiers. Please check the help output for details. For instance, the command below generates a tabular output in which taxon id or kingdom names are printed in addition to standard alignment information.
blastn -query test.fa -db 16SMicrobial -max_target_seqs 1 -outfmt '6 \
qseqid sseqid evalue bitscore sgi sacc staxids sscinames scomnames \
sskingdoms stitle'
The output can be saved to a named file via -out
argument or it can be directed to a file or piped to another command for filtering, sorting operations:
blastn -db Ecoli_cds -query test.fa -out results.out
OR
blastn -db Ecoli_cds -query test.fa > results.out
OR
blastn -query Ecoli-cds.fa -db Ecoli_cds -outfmt 6 -evalue 0.001 -task \
blastn | awk '$1 != $2 && $3 > 95' | sort -k3nr
The text output is more plain than web output (local blast software can generate a local html output if desired). But the simplicity of plain text brings the power of filtering, sorting the results easily.
For instance, let's count number of results by wc
command
blastn -query Ecoli-cds.fa -db Ecoli_cds -outfmt 6 -evalue 0.001 -task blastn | wc -l
By the help of awk
command, we can filter lines. In this example we are doing BLAST of E.coli proteins to themselves in search of paralogs. First thing to do is remove lines where subject and query are same proteins. Since subject and query information is on columns 1 and 2, we can ask awk
to show show lines where content of column 1 (i.e. $1
) is not equal (i.e. !=
) to content of second column (i.e. $2
)
blastn -query Ecoli-cds.fa -db Ecoli_cds -outfmt 6 -evalue 0.001 -task \
blastn | awk '$1 != $2' | head
The 3rd column is percent alignment and we can observe there are near perfect matches (%100) and matches with low similarity (~ %70). Let's use awk
again to filter high percentage alignments. In awk
, multiple conditions can be used in AND or OR fashion. For AND fashion we can use &&
. Below we are asking awk
to filter lines where first column content is not equal to second column content AND third column value greater than 95.
blastn -query Ecoli-cds.fa -db Ecoli_cds -outfmt 6 -evalue 0.001 -task \
blastn | awk '$1 != $2 && $3 > 95' | head
When using table format for output (-outfmt 6
) there are several columns that can be used to sort or filter the results. There's no single column which is perfect for each case, which column to use depends on the purpose and characteristic of the alignment. Here are the columns and their caveats:
- bitscore: depends on alignment length. A perfect match for a short query will get lower score compared to perfect match for a long query sequence.
- evalue: depends on alignment length and index database size. A short sequence might not get
0.0
e-value at all. Also, probability of finding a match in a large database is higher than probability in a smaller database. - percent identity: can be very misleading when considered alone. If alignment results are sorted according percent identity, very short and insignificant perfect match will have 100% identity and show up on top of the list.
- alignment length: can be misleading when considered alone. If alignment results are sorted according to alignment length, alignments with gaps and mismatches will show up on top of the list.
The examples below were taken from Martin Jones' Python for Biologists website - link
Display just the hits that are longer than 590 bases (i.e. just the lines where the fifth field is greater than 590):
awk '$5 > 590'
display just the hits that include the very start of the query sequence (i.e. the lines where the eighth field is equal to one):
awk '$8 == 1'
display just the hits with fewer than four miss-matches (i.e. the lines where the sixth field is less than four):
awk '$6 < 4'
combine multiple criteria with &&
– here’s how we display just the hits with fewer than four miss-matches whose length is less than 599 bases:
awk '6 < 4 && 5 < 599'
Let's define an environment variable to tell where we keep database files
%env BLASTDB=/home/jovyan/blastdb
Print directory names that are searched for database indexes:
blastdbcmd -show_blastdb_search_path
List database indexes for a given directory:
blastdbcmd -list $HOME/blastdb
Print information for all sequences in the index with given format (id and taxon id, in the example below):
blastdbcmd -db 16SMicrobial -entry all -outfmt "%g %T" | head
Collect sequence id that belongs to E.coli (taxon id=562) then pipe those sequence ids back into blastdbcmd
to retrieve their sequences:
blastdbcmd -db 16SMicrobial -entry all -outfmt "%g %T" | awk '$2 == 562 \
{print $1}' | blastdbcmd -db 16SMicrobial -entry_batch - -out ecoli-16s.fa
Let's check the contents of the file
head ecoli-16s.fa
BLAST results in XML format can be parsed with BioPython. In that case, you have programmatic access to BLAST result contents. In order to have XML formatted output, you need to use -outfmt 5
argument.
Let's generate BLAST output in XML format:
blastn -query test.fa -db Ecoli_cds -outfmt 5 -evalue 0.000001 -out blast_result.xml
from Bio.Blast import NCBIXML
result_handle = open("blast_result.xml")
blast_records = NCBIXML.parse(result_handle)
E_VALUE_THRESH = 0.04
for blast_record in blast_records:
for alignment in blast_record.alignments:
for hsp in alignment.hsps:
if hsp.expect < E_VALUE_THRESH:
print("****Alignment****")
print("sequence:", alignment.title)
print("length:", alignment.length)
print("e value:", hsp.expect)
print(hsp.query[0:75] + "...")
print(hsp.match[0:75] + "...")
print(hsp.sbjct[0:75] + "...")
Let's get familiar with online version of Clustal
Please use online Clustal service to align the sequences below:
>Sequence1
GARFIELDTHELASTFATCAT
>Sequence2
GARFIELDTHEFASTCAT
>Sequence3
GARFIELDTHEVERYFASTCAT
>Sequence4
THEFATCAT
The website will show the following output:
As you might have known, the global alignment is affected by numerous factors and the result might not be the "true alignment". Also, various algorithms give different result for same input sequence. You can test this by using the sample sequence above in ClustalO, MAFFT, T-Coffee and Muscle programs online.
The ClustalO generates pairwise distances between sequences and then executes progressive alignment based on distance matrix. Input, output, percent identity matrix, phylogenetic tree files can be downloaded from online ClustalO service.
The "Submission Details" tab contains information about how the software was run in the server.
Let's run ClustalO locally the exactly same way it has run on the server.
Use the "garfield" sequences to perform MSA in terminal.
There are numerous output formats available:
--outfmt={a2m=fa[sta],clu[stal],msf,phy[lip],selex,st[ockholm],vie[nna]} MSA output file format (default: fasta)
Let's perform multiple sequence alignment with minimum number of arguments
$ clustalo --infile garfield.fa
>Sequence1
GARFIELDTHELASTFAT-CAT
>Sequence2
GARFIELDTHEFASTCAT----
>Sequence3
GARFIELDTHEVERYFASTCAT
>Sequence4
--------THEFAT-----CAT
clustalo --infile garfield.fa
The default output format is fasta formatted multiple sequence alignment. Let's check out clustal
output
$ clustalo --infile garfield.fa --outfmt clustal
CLUSTAL O(1.2.4) multiple sequence alignment
Sequence1 GARFIELDTHELASTFAT-CAT
Sequence2 GARFIELDTHEFASTCAT----
Sequence3 GARFIELDTHEVERYFASTCAT
Sequence4 --------THEFAT-----CAT
***.
clustalo --infile garfield.fa --outfmt clustal
Let's match the commandline arguments with submission details of online version of ClustalO.
$ clustalo --infile garfield.fa --threads 8 --MAC-RAM 8000 --outfmt clustal --resno --output-order tree-order --seqtype protein
CLUSTAL O(1.2.4) multiple sequence alignment
Sequence4 --------THEFAT-----CAT 9
Sequence3 GARFIELDTHEVERYFASTCAT 22
Sequence1 GARFIELDTHELASTFAT-CAT 21
Sequence2 GARFIELDTHEFASTCAT---- 18
***.
clustalo --infile garfield.fa --threads 8 --MAC-RAM 8000 --outfmt clustal --resno --output-order tree-order --seqtype protein
The output matches exactly the output from online version.