SPU proteins

Table of Contents

  1. Electronic annotation of GLEAN proteins
  2. Manual characterization of GLEAN proteins
  3. Alternative, complementary classification of proteins

Electronic annotation of GLEAN proteins

More than 10,000 SPU_ (GLEAN) gene models were annotated by a consortium of scientists. The remaining SPU_ proteins (approximately 18,000) were annotated as follows. First, larger proteins (>100 amino acids) were selected, and each was annotated by protein sequence match. Annotated proteins from other organisms in the non-redundant protein set (nr) or the reference sequences (refseq_protein) in the NCBI database were used for the comparison. Initially, a BLAST report (graphic output) was obtained by querying NCBI's protein data sets with each SPU_ (GLEAN) protein. Then the following criteria were used to annotate the SPU_ protein:
1. Inspect the extent of homology by surveying the span of the query-to-subject sequence alignment. Typically, alignment of at least 50% of the query sequence is required to accept the match. Once the match is accepted, the subject gene's name and symbol are adopted for the SPU_ gene model, usually tagged with a "-like" suffix. Alignments spanning at least 80% or 90% of the query sequence are regarded as true homology, and the gene name is then assigned without the "-like" suffix. Gene names are assigned to SPU_ gene models in accordance with mammalian (mouse) gene names and symbols, after checking the E-value. (Even if the SPU_ protein is more homologous to proteins of insects or fungi/bacteria, we used mammalian names.)
2. An E-value greater than 50 is usually required, but occasionally a value in the 30-50 range is also accepted if, for example, the query and subject sequences share critical protein motifs, or the homology occurs across a wide range of species. Usually an E-value greater than 100 or even 150 is required to assign gene names without the "-like" tag.
3. Consider the uniqueness (abundance) of the SPU_ protein sequences. The more unique the sequences in the SPU_ protein (or the lower the abundance of its motifs), the higher the clarity and confidence that the match is valid. SPU_ proteins carrying highly redundant sequences that avidly match dozens or hundreds of subject proteins are difficult to define.
4. Try to match the SPU_ protein to other proteins within the same S. purpuratus proteome, and also consider cross-species multiple alignments.
5. Gene models without significant homology to proteins in other species could be either spurious proteins or Sp-specific proteins. Usually they have paralogs with high-quality matches; these are then named "hypothetical protein-xxx", where xxx is a serial number, and annotated with the comment "Sp-specific".
6. Gene models with low-quality orthologous matches are also named "hypothetical protein-xxx".
7. Whenever possible, use gene names other than "hypothetical protein-xxx", including names derived from motif names (such as Sp-Ccdc123 or Sp-Wdr48, for coiled-coil domain containing and WD repeat, respectively) or Sp-Kiaa1234 if the putative subject proteins are hypothetical, arbitrarily named proteins.
8. Redundancy of query proteins in S. purpuratus is indicated by the number 1, 2, or 3 in the 9th column of the currently used gene_info_table format. Pol polyprotein-like sequences (presumably from retroposons), ankyrin-repeat sequences, and notch-like or EGF-like sequences occur most frequently, and serial numbers are used to name these genes.
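
The naming rules in criteria 1, 2, 5 and 6 can be summarized as a small decision function. The following is only a minimal sketch and not the pipeline that was actually used: the class BestHit, the functions is_candidate and assign_name, and the exact "Sp-<symbol>-like" spelling are hypothetical names chosen for illustration. It also assumes that the figure the text calls "E-value" is read as a larger-is-stronger match value, as the thresholds above imply, and it folds the "critical motifs or homology across many species" exception into a single flag.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BestHit:
    """Best BLAST match for one SPU_ protein (hypothetical structure)."""
    subject_symbol: str          # mammalian (mouse) gene symbol of the subject
    query_coverage: float        # fraction of the query aligned, 0.0 to 1.0
    match_value: float           # the value the text calls "E-value" (assumed larger = stronger)
    shares_critical_motif: bool = False  # stands in for the exceptions in criterion 2


def is_candidate(protein_sequence: str) -> bool:
    """Only larger proteins (>100 amino acids) enter the electronic annotation."""
    return len(protein_sequence) > 100


def assign_name(hit: Optional[BestHit], serial: int) -> str:
    """Return a gene name following criteria 1, 2, 5 and 6 (simplified)."""
    hypothetical = f"hypothetical protein-{serial}"

    # No usable match, or less than 50% of the query aligned: not accepted.
    if hit is None or hit.query_coverage < 0.50:
        return hypothetical

    # Usually a value above 50 is required; 30-50 only with supporting evidence.
    strong_enough = hit.match_value > 50 or (
        hit.match_value >= 30 and hit.shares_critical_motif
    )
    if not strong_enough:
        return hypothetical

    name = f"Sp-{hit.subject_symbol}"
    # At least 80% coverage and a value above 100 drop the "-like" tag.
    if hit.query_coverage >= 0.80 and hit.match_value > 100:
        return name
    return f"{name}-like"
```

Under these assumptions, assign_name(BestHit("Wdr48", 0.92, 160), serial=1) yields "Sp-Wdr48", while the same symbol with 60% coverage and a value of 70 yields "Sp-Wdr48-like". Criteria 3, 4 and 8 (sequence uniqueness, within-proteome matches, redundancy codes) call for human judgment and are deliberately left out.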

Manual characterization of GLEAN proteins

The first release of Sp genome data in 2006 and the resulting GLEAN gene models (predicted genes) were initially characterized by nearly 200 scientists in the US and overseas. The annotators focused on groups of proteins that are highly orthologous to proteins with known functions in human and other organisms. The annotation data were published in the December 2006 issue of Developmental Biology as follows:
Dev Biol. 2006 Dec 1;300(1):9-14. Epub 2006 Oct 10. The sea urchin's siren.

Dev Biol. 2006 Dec 1;300(1):27-34. Epub 2006 Oct 18. High regulatory gene use in sea urchin embryogenesis: Implications for bilaterian development and evolution.

Dev Biol. 2006 Dec 1;300(1):366-84. Epub 2006 Sep 3. The chemical defensome: environmental sensing and response genes in the Strongylocentrotus purpuratus genome.

Dev Biol. 2006 Dec 1;300(1):2-8. Epub 2006 Oct 10. Shedding genomic light on Aristotle's lantern.

Dev Biol. 2006 Dec 1;300(1):194-218. Epub 2006 Aug 25. Protein tyrosine and serine-threonine phosphatases in the sea urchin, Strongylocentrotus purpuratus: identification and potential functions.

Dev Biol. 2006 Dec 1;300(1):132-52. Epub 2006 Aug 24. RTK and TGF-beta signaling pathways genes in the sea urchin genome.

Dev Biol. 2006 Dec 1;300(1):49-62. Epub 2006 Sep 22. Sea urchin Forkhead gene family: phylogeny and embryonic expression.

Dev Biol. 2006 Dec 1;300(1):238-51. Epub 2006 Sep 14. The genomic repertoire for cell cycle control and DNA metabolism in S. purpuratus.

Dev Biol. 2006 Dec 1;300(1):308-20. Epub 2006 Sep 1. The sea urchin histone gene complement.

Dev Biol. 2006 Dec 1;300(1):385-405. Epub 2006 Aug 5. Oogenesis: single cell development and differentiation.

Dev Biol. 2006 Dec 1;300(1):15-26. Epub 2006 Jul 20. In the beginning...animal fertilization and sea urchin development.

Dev Biol. 2006 Dec 1;300(1):121-31. Epub 2006 Aug 24. A genome-wide survey of the evolutionarily conserved Wnt pathways in the sea urchin Strongylocentrotus purpuratus.

Dev Biol. 2006 Dec 1;300(1):153-64. Epub 2006 Sep 1. Genomics and expression profiles of the Hedgehog and Notch signaling pathways in sea urchin development.

Dev Biol. 2006 Dec 1;300(1):461-75. Epub 2006 Sep 5. Opsins and clusters of sensory G-protein-coupled receptors in the sea urchin genome.

Dev Biol. 2006 Dec 1;300(1):267-81. Epub 2006 Aug 12. Sea urchin metalloproteases: a genomic survey of the BMP-1/tolloid-like, MMP and ADAM families.

Dev Biol. 2006 Dec 1;300(1):485-95. Epub 2006 Sep 26. The S. purpuratus genome: a comparative perspective.

Dev Biol. 2006 Dec 1;300(1):74-89. Epub 2006 Aug 22. Identification and characterization of homeobox transcription factor genes in Strongylocentrotus purpuratus, and their expression in embryonic development.

Dev Biol. 2006 Dec 1;300(1):416-33. Epub 2006 Sep 12. A functional genomic and proteomic perspective of sea urchin calcium signaling and egg activation.

Dev Biol. 2006 Dec 1;300(1):90-107. Epub 2006 Aug 22. Gene families encoding transcription factors expressed in early development of Strongylocentrotus purpuratus.

Dev Biol. 2006 Dec 1;300(1):219-37. Epub 2006 Aug 26. Analysis of cytoskeletal and motility proteins in the sea urchin genome assembly.

Dev Biol. 2006 Dec 1;300(1):180-93. Epub 2006 Sep 12. The sea urchin kinome: a first look.

Dev Biol. 2006 Dec 1;300(1):349-65. Epub 2006 Sep 3. The immune gene repertoire encoded in the purple sea urchin genome.

Dev Biol. 2006 Dec 1;300(1):165-79. Epub 2006 Aug 24. Lineage-specific expansions provide genomic complexity among sea urchin GTPases.

Dev Biol. 2006 Dec 1;300(1):321-34. Epub 2006 Aug 30. The genomic underpinnings of apoptosis in Strongylocentrotus purpuratus.

Dev Biol. 2006 Dec 1;300(1):476-84. Epub 2006 Aug 22. A database of mRNA expression patterns for the sea urchin embryo.

Dev Biol. 2006 Dec 1;300(1):35-48. Epub 2006 Aug 10. Identification and developmental expression of the ets gene family in the sea urchin (Strongylocentrotus purpuratus).

Dev Biol. 2006 Dec 1;300(1):108-20. Epub 2006 Aug 22. The C2H2 zinc finger genes of Strongylocentrotus purpuratus and their expression in embryonic development.

Dev Biol. 2006 Dec 1;300(1):335-48. Epub 2006 Aug 15. A genome-wide analysis of biomineralization-related proteins in the sea urchin Strongylocentrotus purpuratus.

Dev Biol. 2006 Dec 1;300(1):282-92. Epub 2006 Aug 22. Intermediary metabolism in sea urchin: the first inferences from the genome sequence.

Dev Biol. 2006 Dec 1;300(1):406-15. Epub 2006 Aug 4. Germ line determinants are not localized early in sea urchin development, but do accumulate in the small micromere lineage.

Dev Biol. 2006 Dec 1;300(1):434-60. Epub 2006 Aug 10. A genomic view of the sea urchin nervous system.

Dev Biol. 2006 Dec 1;300(1):293-307. Epub 2006 Aug 4. Translational control genes in the sea urchin genome.

Dev Biol. 2006 Dec 1;300(1):63-73. Epub 2006 Aug 4. Genetic organization and embryonic expression of the ParaHox genes in the sea urchin S. purpuratus: insights into the relationship between clustering and

Dev Biol. 2006 Dec 1;300(1):252-66. Epub 2006 Aug 7. The echinoderm adhesome.

Alternative, complementary classification of proteins

We are currently working on a new scheme to classify the Sp proteome into alternative protein groups. The scheme is intended to be radically different from existing protein classification approaches, and aims to provide database query terms that are complementary to conventional protein family/domain names. The 3,362 Pfam domain names from the GLEAN proteins were regrouped according to a variety of attributes, including structure, biological process, intracellular association, and enzymatic activity. Currently, the 3,362 Pfam groups have been compressed into 135 new groups.
http://piano.caltech.edu/4/135-alternative_categories
http://piano.caltech.edu/4/alternative_category-pfam.correlationtable
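
As an illustration of how such a regrouping could serve as a query aid, the sketch below collapses Pfam domain names into broader categories and then lists the proteins that fall into each category. It is a minimal example under stated assumptions: the category labels, the few mappings shown, the SPU_ identifiers, and the function group_by_category are all hypothetical and are not taken from the correlation table linked above.

```python
from collections import defaultdict
from typing import Dict, List

# Hypothetical excerpt of a Pfam-name -> alternative-category mapping.
# The real scheme maps 3,362 Pfam names onto 135 categories.
PFAM_TO_CATEGORY: Dict[str, str] = {
    "WD40": "repeat scaffolds",
    "Ank": "repeat scaffolds",
    "Pkinase": "phosphorylation",
    "Homeobox": "DNA binding",
}


def group_by_category(protein_domains: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Group protein IDs under each alternative category their domains map to."""
    groups = defaultdict(set)
    for protein_id, domains in protein_domains.items():
        for domain in domains:
            category = PFAM_TO_CATEGORY.get(domain)
            if category is not None:  # domains without a mapping are skipped
                groups[category].add(protein_id)
    # Sort IDs so that the output is stable between runs.
    return {category: sorted(ids) for category, ids in groups.items()}
```

A protein carrying several domains appears under every category its domains map to, which is what lets a category name act as a query term complementary to the individual Pfam names.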