Genome Blast

BLAST is the tool most frequently used for calculating sequence similarity. BLAST comes in variations for use with different query sequences against different databases. Disclaimer: Information in this "Blast How to Guide" was compiled from http://www.ncbi.nlm.nih.gov

Blast Form

The Tool or Species tab on the Echinobase website leads to the BLAST form. The BLAST form has five sections: Program selection, Database selection, Enter query sequence, Algorithm parameters and Output format.

Blast programs

Program Selection allows you to optimize your search for different scenarios.

blastp

compares an amino acid query sequence against a protein sequence database

blastn

compares a nucleotide query sequence against a nucleotide sequence database

blastx

compares a nucleotide query sequence translated in all reading frames against a protein sequence database

tblastn

compares a protein query sequence against a nucleotide sequence database dynamically translated in all reading frames

tblastx

compares the six-frame translations of a nucleotide query sequence against the six-frame translations of a nucleotide sequence database. Please note that the tblastx program cannot be used with the nr database on the BLAST Web page.
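The program descriptions above can be summarized as a small lookup table. The following is a minimal sketch, not part of the BLAST form itself; the names `PROGRAMS` and `candidate_programs` are hypothetical and chosen only for illustration.

```python
# Query and database sequence types for each BLAST program, as listed above.
PROGRAMS = {
    "blastp": ("protein", "protein"),
    "blastn": ("nucleotide", "nucleotide"),
    "blastx": ("nucleotide", "protein"),
    "tblastn": ("protein", "nucleotide"),
    "tblastx": ("nucleotide", "nucleotide"),
}


def candidate_programs(query_type, database_type):
    """Return the programs whose query/database types match the arguments."""
    return [name for name, types in PROGRAMS.items()
            if types == (query_type, database_type)]


# A nucleotide query against a protein database leaves only blastx.
print(candidate_programs("nucleotide", "protein"))
```

For a nucleotide query against a nucleotide database the sketch returns both blastn and tblastx; choosing between them depends on whether the comparison should be made at the nucleotide level or at the translated level.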

Blast databases

For Strongylocentrotus purpuratus, you can BLAST against the Spur0.5 assembly (Scaffolds), Spur3.1 assembly (Scaffolds), Spur3.1 contigs, RNA-seq, Sp Genes, Sp Peptides, RNA-seq Peptides and Clones databases.
For Lytechinus variegatus, you can BLAST against Lytechinus variegatus genome assembly 0.4 and Lytechinus variegatus RNA sequences.
For Patiria miniata, you can BLAST against Patiria miniata genome assembly 1.0, Patiria miniata contigs 1.0 and Patiria miniata RNA sequences.

Enter query sequence

Query sequence(s) to be used for a BLAST search should be pasted in the 'Search' text area. It accepts a number of different types of input and automatically determines the format of the input. To allow this feature, certain conventions are required with regard to the input of identifiers (e.g., accessions or gi's). Accepted input types are FASTA, bare sequence, or sequence identifiers.

FASTA sequence

A sequence in FASTA format begins with a single-line description, followed by lines of sequence data. The description line (defline) is distinguished from the sequence data by a greater-than (">") symbol at the beginning. It is recommended that all lines of text be shorter than 80 characters in length.
An example sequence in FASTA format is:
>SPU_XXXXXX
ATGCCTGCAATGAGCGCCGACGCTCTGCGTGCCCCGTCCTACAACGTTTCGCATCTTCTCAACGCCGTACAGTCAGAGATGAACCGCGGGAGGGACGATGTGGAATT
TGGAAAAAGTTTCACAAGTTGACCAACGAGATGATCGTGACAAAAAGCGGGAGGCGAATGTTCCCAGTCCTATCCGTGCTCGACTTCTCCGCCGCAGACGATCACCG
TGGAAGTACGTCAACGGCGAGTGGATCCCCGGCGGCAAGCCCGACGGCTCGCCTCCGACCACTGGATGAAACAGGCCGTCAACTTCAGCAAAGTGAAGTTGTCGAA
AACTCAACGGAAGCGGGCAGGTGATGCTAAACTCCCTTCACAAG
where SPU_XXXXXX is an SPU ID or any other identification code.
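The FASTA layout described above can be read and written with a few lines of code. This is a minimal sketch for illustration only; `parse_fasta` and `format_fasta` are hypothetical helper names, and the wrap width of 70 is an assumption that simply keeps lines under the recommended 80 characters.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a list of (defline, sequence) pairs."""
    records = []
    defline, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new defline closes the previous record, if there was one.
            if defline is not None:
                records.append((defline, "".join(chunks)))
            defline, chunks = line[1:], []
        elif defline is not None:
            chunks.append(line)
    if defline is not None:
        records.append((defline, "".join(chunks)))
    return records


def format_fasta(defline, sequence, width=70):
    """Write one record with sequence lines wrapped to `width` characters."""
    lines = [">" + defline]
    lines += [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join(lines)


records = parse_fasta(">SPU_XXXXXX\nATGCCTGCAATG\nAGCGCCGACGCT\n")
print(records[0][0], len(records[0][1]))
```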

Bare Sequence

This may be just lines of sequence data, without the FASTA definition line, e.g.:
ATGCCTGCAATGAGCGCCGACGCTCTGCGTGCCCCGTCCTACAACGTTTCGCATCTTCTCAACGCCGTACAGTCAGAGATGAACCGCGGGAGGGACGATGTGGAATT
TGGAAAAAGTTTCACAAGTTGACCAACGAGATGATCGTGACAAAAAGCGGGAGGCGAATGTTCCCAGTCCTATCCGTGCTCGACTTCTCCGCCGCAGACGATCACCG
TGGAAGTACGTCAACGGCGAGTGGATCCCCGGCGGCAAGCCCGACGGCTCGCCTCCGACCACTGGATGAAACAGGCCGTCAACTTCAGCAAAGTGAAGTTGTCGAAC
AACTCAACGGAAGCGGGCAGGTGATGCTAAACTCCCTTCACAAG
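How the form distinguishes the three accepted input types is not documented here. The sketch below shows one plausible heuristic, under the assumption that a leading ">" marks FASTA, that text made only of letters is a bare sequence, and that anything else is a list of identifiers. The function name `guess_input_type` is hypothetical, and the real server may apply different rules.

```python
import string

# Letters (plus "*" for stops and "-" for gaps) that may occur in sequence data.
SEQUENCE_CHARS = set(string.ascii_uppercase + "*-")


def guess_input_type(text):
    """Guess whether pasted text is FASTA, a bare sequence, or identifiers."""
    stripped = text.strip()
    if not stripped:
        return "empty"
    if stripped.startswith(">"):
        return "fasta"
    # Remove all whitespace so multi-line sequences are treated as one string.
    residues = "".join(stripped.split()).upper()
    if set(residues) <= SEQUENCE_CHARS:
        return "bare sequence"
    return "identifiers"


print(guess_input_type(">SPU_XXXXXX\nATGCCTGCAATG"))
print(guess_input_type("ATGCCTGCAATG\nAGCGCCGACGCT"))
```

A purely alphabetic identifier would be misread as a sequence by this heuristic, which is one reason identifier input needs the conventions mentioned above.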

Load query file from disk

This function allows users to upload a text file containing queries in FASTA format. The file can also contain sequence identifiers instead of FASTA sequences. Long sequences should be uploaded through this option to avoid possible browser buffer size limits.

Blast parameters

Low Complexity Filter

The server filters your query sequence for low compositional complexity regions by default. Low complexity regions commonly give spuriously high scores that reflect compositional bias rather than significant position-by-position alignment. Filtering can eliminate these potentially confounding matches (e.g., hits against proline-rich regions or poly-A tails) from the blast reports, leaving regions whose blast statistics reflect the specificity of their pairwise alignment. Queries searched with the blastn program are filtered with DUST. Other programs use SEG. Low complexity sequence found by a filter program is substituted using the letter "N" in nucleotide sequences (e.g., "NNNNNNNNNNNNN") and the letter "X" in protein sequences (e.g., "XXXXXXXXX"). Users may turn off filtering by using the "Filter" option on the "Advanced options for the BLAST server" page.
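To make the substitution step concrete, the toy example below masks long runs of a single repeated letter. It is deliberately much simpler than DUST or SEG and is not an implementation of either; the function name `mask_homopolymers` and the run length of 10 are assumptions made for illustration.

```python
import re


def mask_homopolymers(sequence, mask_char="N", min_run=10):
    """Replace runs of one repeated letter (length >= min_run) with mask_char.

    Use mask_char="N" for nucleotide and "X" for protein sequences.
    """
    if min_run < 2:
        raise ValueError("min_run must be at least 2")
    # (.) captures one letter; \1{n,} requires at least n further copies of it.
    pattern = re.compile(r"(.)\1{%d,}" % (min_run - 1))
    return pattern.sub(lambda m: mask_char * len(m.group(0)), sequence)


# A poly-A tail of 12 bases is masked; the flanking sequence is untouched.
print(mask_homopolymers("ACGT" + "A" * 12 + "GCTA"))
```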

FILTER (Mask for lookup table only)

This option masks only for purposes of constructing the lookup table used by BLAST. The BLAST extensions are performed without masking. This option is still experimental and may change in the near future.

EXPECT

The statistical significance threshold for reporting matches against database sequences; the default value is 10, such that 10 matches are expected to be found merely by chance, according to the stochastic model of Karlin and Altschul (1990). If the statistical significance ascribed to a match is greater than the EXPECT threshold, the match will not be reported. Lower EXPECT thresholds are more stringent, leading to fewer chance matches being reported. Fractional values are acceptable.
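In the Karlin and Altschul model, the expected number of chance matches with raw score at least S is E = K m n e^(-lambda S), where m and n are the query and database lengths. The sketch below evaluates that formula and applies the EXPECT threshold. The lambda and K values in the usage line are placeholders assumed for illustration; the server reports the constants it actually used in the Statistics section of the output.

```python
import math


def bit_score(raw_score, lam, k):
    """Convert a raw score to a bit score: (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def expect_value(raw_score, query_len, db_len, lam, k):
    """Expected number of chance matches: E = K * m * n * exp(-lambda * S)."""
    return k * query_len * db_len * math.exp(-lam * raw_score)


def is_reported(e_value, expect_threshold=10.0):
    """A match is reported unless its E-value exceeds the EXPECT threshold."""
    return e_value <= expect_threshold


# Placeholder constants (assumed, not taken from this server).
e = expect_value(raw_score=60, query_len=250, db_len=1_000_000, lam=0.267, k=0.041)
print(e, is_reported(e))
```

Lowering `expect_threshold` makes `is_reported` reject more matches, which is the stringency effect described above.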

Matrix

A key element in evaluating the quality of a pairwise sequence alignment is the "substitution matrix", which assigns a score for aligning any possible pair of residues. The theory of amino acid substitution matrices is described in [1], and applied to DNA sequence comparison in [2]. In general, different substitution matrices are tailored to detecting similarities among sequences that are diverged by differing degrees [1-3]. A single matrix may nevertheless be reasonably efficient over a relatively broad range of evolutionary change [1-3]. Experimentation has shown that the BLOSUM-62 matrix [4] is among the best for detecting most weak protein similarities. For particularly long and weak alignments, the BLOSUM-45 matrix may prove superior. A detailed statistical theory for gapped alignments has not been developed, and the best gap costs to use with a given substitution matrix are determined empirically. Short alignments need to be relatively strong (i.e. have a higher percentage of matching residues) to rise above background noise. Such short but strong alignments are more easily detected using a matrix with a higher "relative entropy" [1] than that of BLOSUM-62. In particular, short query sequences can only produce short alignments, and therefore database searches with short queries should use an appropriately tailored matrix. The BLOSUM series does not include any matrices with relative entropies suitable for the shortest queries, so the older PAM matrices [5,6] may be used instead. For proteins, a provisional table of recommended substitution matrices and gap costs for various query lengths is:

Query length    Substitution matrix    Gap costs
<35             PAM-30                 (9,1)
35-50           PAM-70                 (10,1)
50-85           BLOSUM-80              (10,1)
>85             BLOSUM-62              (11,1)
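The table above can be expressed as a simple lookup for protein queries. This is a sketch only; `recommend_matrix` is a hypothetical name, and because the table's ranges share the boundary values 50 and 85, the code assumes those lengths fall in the lower row.

```python
def recommend_matrix(query_length):
    """Return (matrix, (gap_existence, gap_extension)) for a protein query."""
    if query_length < 1:
        raise ValueError("query_length must be positive")
    if query_length < 35:
        return "PAM-30", (9, 1)
    if query_length <= 50:      # assumption: 50 belongs to the 35-50 row
        return "PAM-70", (10, 1)
    if query_length <= 85:      # assumption: 85 belongs to the 50-85 row
        return "BLOSUM-80", (10, 1)
    return "BLOSUM-62", (11, 1)


print(recommend_matrix(30))
print(recommend_matrix(120))
```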

Query Genetic Code

Genetic code to be used in blastx translation of the query.
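As an illustration of what translating the query "in all reading frames" means, the sketch below produces the six conceptual translations of a nucleotide sequence. It assumes the standard genetic code; selecting a different Query Genetic Code on the form would correspond to a different codon table. The function names are hypothetical, and unknown codons (for example those containing N) are written as "X" by assumption.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons from the start of dna; '*' marks a stop."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_translations(dna):
    """Return a dict mapping frame labels (+1..+3, -1..-3) to protein strings."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


for frame, protein in six_frame_translations("ATGCCTGCAATGAGCGCC").items():
    print(frame, protein)
```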

Frame shift penalty for blastx

When a protein is aligned to a nucleotide sequence, there are 6 possible kinds of match at any point. In an out-of-frame (OOF) alignment, the upper sequence is DNAP (the 3-frame translated DNA) and the lower sequence is the protein. At any position, the next protein residue may be aligned to one of 6 possible positions in DNAP:

(TBO - traditional blast output)

0: 3 nucleotides missing - gap (TBO notation "-")

OOF alignment with DNAP:

DTRGGDTPQKSVFSRAQNTLWGERGDTQKRGGAQRGDIFSLWGG-GVLCV
| | | | | | | | | | | | | | | | |
D G T K F A T G G Q G Q D S G K V V

TBO:

DGTKFATGGQGQDSG-VV
DGTKFATGGQGQDSG VV
DGTKFATGGQGQDSGKVV

1: 2 nucleotides missing - "frameshift -2" (TBO notation "\\")

OOF alignment with DNAP:

DTRGGDTPQKSVFSRAQNTLWGERGDTQKRGGAQRGDIFSLWGGGGVLCV
| | | | | | | | | | | | | | |/ | |
D G T K F A T G G Q G Q D S GK V V

TBO:

DGTKFATGGQGQDSG\\GVV
DGTKFATGGQGQDSG VV
DGTKFATGGQGQDSG KVV

2: 1 nucleotide missing - "frameshift -1" (TBO notation "\")

OOF alignment with DNAP:

DTRGGDTPQKSVFSRAQNTLWGERGDTQKRGGAQRGDIFSLWGGERGV
| | | | | | | | | | | | | | / | |
D G T K F A T G G Q G Q D S G K V

TBO:

DGTKFATGGQGQDS\GEV
DGTKFATGGQGQDS G V
DGTKFATGGQGQDS GKV

3: Complete match

OOF alignment with DNAP:

DTRGGDTPQKSVFSRAQNTLWGERGDTQKRGGAQRGDIFSLWGGEKRGV
| | | | | | | | | | | | | | | | |
D G T K F A T G G Q G Q D S G K V

TBO:

DGTKFATGGQGQDSGKV
DGTKFATGGQGQDSGKV
DGTKFATGGQGQDSGKV

4: 1 nucleotide insertion - "frameshift +1" (TBO notation "/")

OOF alignment with DNAP:

DTRGGDTPQKSVFSRAQNTLWGERGDTQKRGGAQRGDIFSLWGGVEKRGV
| | | | | | | | | | | | | | | \
D G T K F A T G G Q G Q D S G K V

TBO:

DGTKFATGGQGQDSG/KV
DGTKFATGGQGQDSG KV
DGTKFATGGQGQDSG KV

5: 2 nucleotides insertion - "frameshift +2" (TBO notation "//")

OOF alignment with DNAP:

DTRGGDTPQKSVFSRAQNTLWGERGDTQKRGGAQRGDIFSLFLWGGEKRGV
| | | | | | | | | | | | | | \ | |
D G T K F A T G G Q G Q D S G K V

TBO:

DGTKFATGGQGQDS//GKV
DGTKFATGGQGQDS GKV
DGTKFATGGQGQDS GKV

BLASTN Program Advanced Options

-G Cost to open a gap [Integer]
default = 5
-E Cost to extend a gap [Integer]
default = 2
-q Penalty for a mismatch in the blast portion of run [Integer]
default = -3
-r Reward for a match in the blast portion of run [Integer]
default = 1
-e Expectation value (E) [Real]
default = 10.0
-W Word size, default is 11 for blastn, 3 for other programs.
-v Number of one-line descriptions (V) [Integer]
default = 100
-b Number of alignments to show (B) [Integer]
default = 100
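The option letters above match those of the legacy NCBI `blastall` command-line program. The sketch below only assembles an argument list from the documented defaults and does not run anything. Whether the server invokes `blastall` in this way is an assumption, and the file name `query.fa` and database name `mydb` are hypothetical.

```python
# Defaults for blastn as listed above.
BLASTN_DEFAULTS = {
    "-G": 5,      # cost to open a gap
    "-E": 2,      # cost to extend a gap
    "-q": -3,     # penalty for a mismatch
    "-r": 1,      # reward for a match
    "-e": 10.0,   # expectation value
    "-W": 11,     # word size
    "-v": 100,    # number of one-line descriptions
    "-b": 100,    # number of alignments to show
}


def build_blastn_args(query_file, database, overrides=None):
    """Return a blastall-style argument list for a blastn search."""
    options = dict(BLASTN_DEFAULTS)
    for flag, value in (overrides or {}).items():
        if flag not in BLASTN_DEFAULTS:
            raise ValueError(f"unsupported option: {flag}")
        options[flag] = value
    args = ["blastall", "-p", "blastn", "-d", database, "-i", query_file]
    for flag, value in options.items():
        args += [flag, str(value)]
    return args


# A more stringent E-value and fewer alignments than the defaults.
print(" ".join(build_blastn_args("query.fa", "mydb", {"-e": 0.001, "-b": 20})))
```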

BLASTP Program Advanced Options

BLASTX Program Advanced Options

TBLASTN Program Advanced Options

-G Cost to open a gap [Integer]
default = 11
-E Cost to extend a gap [Integer]
default = 1
-e Expectation value (E) [Real]
default = 10.0
-W Word size, default is 11 for blastn, 3 for other programs.
-v Number of one-line descriptions (V) [Integer]
default = 100
-b Number of alignments to show (B) [Integer]
default = 100

Limited values for gap existence and extension are supported for these three programs.
Some supported and suggested values are:

Existence Extension

10 1
10 2
11 1
8 2
9 2
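The existence and extension values combine into a single cost per gap. The sketch below assumes the usual affine convention, in which a gap of k residues costs existence + k * extension, and checks a pair against the suggested values listed above. The names are hypothetical, and the list is described above as "some" supported values, so a pair missing from it is not necessarily rejected by the server.

```python
# (existence, extension) pairs suggested in the table above.
SUGGESTED_GAP_COSTS = [(10, 1), (10, 2), (11, 1), (8, 2), (9, 2)]


def is_suggested(existence, extension):
    """Return True if the pair appears in the suggested list."""
    return (existence, extension) in SUGGESTED_GAP_COSTS


def gap_penalty(gap_length, existence, extension):
    """Affine cost of one gap: existence + extension * gap_length."""
    if gap_length < 1:
        raise ValueError("gap_length must be at least 1")
    return existence + extension * gap_length


# With the default protein costs (11, 1), a 3-residue gap costs 14.
print(is_suggested(11, 1), gap_penalty(3, 11, 1))
```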

Blast output format

Graphical Overview

An overview of the database sequences aligned to the query sequence is shown. The score of each alignment is indicated by one of five different colors, which divide the range of scores into five groups. Multiple alignments on the same database sequence are connected by a striped line. Mousing over a hit sequence causes the definition and score to be shown in the window at the top; clicking on a hit sequence takes the user to the associated alignments.

Description

Restricts the number of short descriptions of matching sequences reported to the number specified; the default limit is 100 descriptions.

Alignments

Restricts the database sequences for which high-scoring segment pairs (HSPs) are reported to the number specified; the default limit is 100. If more database sequences than this happen to satisfy the statistical significance threshold for reporting (see EXPECT above), only the matches ascribed the greatest statistical significance are reported.
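The interaction between the EXPECT threshold and the Alignments limit can be sketched as a filter followed by a sort and a cut. This is an illustration under the assumption that each hit is a (sequence ID, E-value) pair; `select_hits` and the sequence IDs in the usage lines are hypothetical.

```python
def select_hits(hits, expect=10.0, max_alignments=100):
    """Keep hits passing the EXPECT threshold, most significant first.

    hits is a list of (sequence_id, e_value) pairs. A lower E-value means
    greater statistical significance.
    """
    significant = [hit for hit in hits if hit[1] <= expect]
    significant.sort(key=lambda hit: hit[1])
    return significant[:max_alignments]


hits = [("seqA", 2e-30), ("seqB", 15.0), ("seqC", 0.004), ("seqD", 3.1)]
print(select_hits(hits, expect=10.0, max_alignments=2))
```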

Blast Output

The general organization of Blast results page is as follows:

Header

This section is at the beginning of the BLAST result page and ends just before the Graphical Overview. The header image placed at the top clearly labels the page as "BLAST search results". Below the header, the page lists the BLAST program used for the search, its version and date. The next piece of information describes the target database: the database title, the total number of sequences, and the total length in number of letters. Information for the input query follows. This is parsed out from the definition line (defline) of the input query.

Graphical Overview

This section graphically summarizes the BLAST output. The following example is taken from the actual search mentioned above. The title of the graph hyperlinks to a short explanation. The number given in this title is the total number of alignment segments (high-scoring segment pairs, or HSPs), which is greater than or equal to the setting in "Alignments" since some database sequences could have more than one segment aligned to the input query. The text box displays information on matching database sequences. When you mouse over a hit in the graph (a colored bar), the browser displays the information for that HSP in this box. Within the bordered graph, the top segment displays the color key and the query-based scale. The colored bars represent the actual HSPs. The position of each bar indicates the region of the query the HSP covers.

Description

This section lists the database sequences producing significant alignments with the query sequence, along with their scores and E-values. The sequence ID also links out to the genome browser.

Alignment

The actual alignment is displayed here. The alignment header has the following fields:
# Features: the annotated sequence feature found in the HSP-covered region of the database sequence
# Score: the alignment score for the HSP in bits (or as a raw score)
# Expect: the significance of the HSP
# Identities: identities between query and database sequence; the length includes gaps if present
# Strand: query strand over database sequence strand
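The Identities field can be reproduced from the two aligned strings. The sketch below counts identical positions and divides by the alignment length, gaps included, as described above. The function name `identities` and the use of "-" as the gap character are assumptions made for illustration.

```python
def identities(query_aln, subject_aln, gap="-"):
    """Return (matches, alignment_length, percent) for two aligned strings."""
    if len(query_aln) != len(subject_aln):
        raise ValueError("aligned strings must have the same length")
    length = len(query_aln)  # gap columns count toward the length
    matches = sum(1 for q, s in zip(query_aln, subject_aln)
                  if q == s and q != gap)
    percent = round(100 * matches / length) if length else 0
    return matches, length, percent


matches, length, percent = identities("ACG-TA", "ACGTTA")
print(f"Identities = {matches}/{length} ({percent}%)")
```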

Statistics

This section summarizes the search and provides detailed search statistics. We further divide this section into a few subsections and describe them individually below.
*First is the detailed database information.
*Next are the statistical constants used in the calculation of the Expect value.
*The scoring matrix used by BLAST to evaluate and score alignments is given after that.
*Detailed search statistics are given in the subsection below. The last subsection lists mostly the dropoff and cutoff values used in the search.
# A: Two-hits window size, the distance between two word hits. Blastp and discontiguous megablast impose a two-hits requirement. The window size sets the distance between the beginning of the first hit and the end of the second.
# X1: dropoff for first ungapped extension raw (bits);
# X2: dropoff for second gapped extension raw (bits);
# X3: dropoff for final gapped extension with traceback raw (bits).
The X dropoff value, the difference between the highest score achieved and the score at the current position, determines when an extension effort should be terminated.
# S1: the cutoff score for the maximal-scoring segment pair (MSP);
# S2: the cutoff for multiple HSPs.
S1 and S2 are cutoff scores calculated by the algorithm and are not adjustable. Pairwise comparisons scoring below these cutoff scores are ignored.
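The X dropoff rule can be illustrated with a simplified ungapped extension to the right of a word hit. This is a sketch of the idea only, not the BLAST implementation; `extend_right` is a hypothetical name, and the match and mismatch scores default to the blastn values of 1 and -3 listed under the advanced options.

```python
def extend_right(query, subject, q_start, s_start, x_drop, match=1, mismatch=-3):
    """Extend an ungapped alignment to the right until the score drops too far.

    Returns (best_score, best_length): the highest score reached and the
    number of aligned positions at which it was reached.
    """
    score = best = 0
    best_len = 0
    i = 0
    while q_start + i < len(query) and s_start + i < len(subject):
        score += match if query[q_start + i] == subject[s_start + i] else mismatch
        i += 1
        if score > best:
            best, best_len = score, i
        elif best - score > x_drop:
            # The score has fallen more than x_drop below the best: stop.
            break
    return best, best_len


print(extend_right("ACGTACGTTTTT", "ACGTACGTAAAA", 0, 0, x_drop=5))
```

A larger `x_drop` lets the extension continue through a short mismatching stretch in the hope of reaching further matches, at the cost of more computation.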

Further Reading