PlantOrDB

Ortholog and Paralog

Genes with different functions originally arose from ancestral genes through gene duplication, mutation and functional recombination. It is widely accepted that orthologs are homologous genes that diverged through speciation events, whereas paralogs are homologous genes that resulted from gene duplication events (Fitch, 1970; Jensen, 2001; Sonnhammer & Koonin, 2012). With the rapid increase of genomic data, identifying and distinguishing these genes among different species has become an important part of functional genomics research. In the past, some researchers regarded genes with the same or similar functions in different species as orthologs, whereas others tried to identify orthologs by sequence similarity. However, orthologous genes in different species do not always keep the same function, and the most similar gene sequences are not always orthologs (Theissen, 2002).

The picture becomes more complicated because speciation and duplication events can alternate. As shown in the following figure A, the ortholog relation is first of all symmetric: Arth-A1 is an ortholog of Orsa-A1 and vice versa. Second, orthologs are non-transitive: Arth-A1 is an ortholog of Orsa-A1 and Arth-A2 is an ortholog of Orsa-A1, but Arth-A1 and Arth-A2 are not orthologs of each other. Third, orthologs do not always have a one-to-one relationship. Duplication after speciation can produce one-to-many or many-to-many relationships. For example, Arth-A1 and Arth-A2 in Arabidopsis have a many-to-many ortholog relationship with Orsa-A1 and Orsa-A2 in rice, whereas Orsa-B in rice has a one-to-many ortholog relationship with Arth-B1 and Arth-B2 in Arabidopsis. Finally, because duplication and speciation can alternate, there are two types of paralogs: in-paralogs and out-paralogs. There are three pairs of in-paralogs, each the result of a duplication: Arth-A1 and Arth-A2, Orsa-A1 and Orsa-A2, and Arth-B1 and Arth-B2. An out-paralog relation exists between any gene among Arth-A1, Arth-A2, Orsa-A1 and Orsa-A2 and any gene among Arth-B1, Arth-B2 and Orsa-B, as the result of a duplication-speciation or duplication-speciation-duplication sequence of events.

With regard to sequence similarity, a gene sequence is in general more similar to its in-paralogs than to its orthologs, and more similar to its orthologs than to its out-paralogs. As shown in the following figure B, for example, Arth-A1 has the shortest distance to its in-paralog Arth-A2, an intermediate distance to its ortholog Orsa-A1, and the longest distance to its out-paralog Arth-B1.
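The relations described above can be read off a gene tree whose internal nodes are labelled with their event. The following is a minimal sketch, not PlantOrDB code: the tree topology is an assumption reconstructed from the text (the figure itself is not reproduced here), the `species-gene` naming is taken from the examples, and the in-paralog test (all genes below the duplication come from one species) is a simplification that only holds for this two-species case.

```python
# Hypothetical gene tree inferred from the text: an internal node is
# (event, [children]); a leaf is a gene name of the form "<species>-<gene>".
TREE = ("duplication", [
    ("speciation", [
        ("duplication", ["Arth-A1", "Arth-A2"]),
        ("duplication", ["Orsa-A1", "Orsa-A2"]),
    ]),
    ("speciation", [
        ("duplication", ["Arth-B1", "Arth-B2"]),
        "Orsa-B",
    ]),
])


def leaves(node):
    """Return all gene names below a node."""
    if isinstance(node, str):
        return [node]
    return [gene for child in node[1] for gene in leaves(child)]


def lca(node, a, b):
    """Return the deepest internal node that contains both genes a and b."""
    for child in node[1]:
        if not isinstance(child, str):
            below = leaves(child)
            if a in below and b in below:
                return lca(child, a, b)
    return node


def relation(a, b, tree=TREE):
    """Classify a gene pair by the event at their last common ancestor."""
    node = lca(tree, a, b)
    if node[0] == "speciation":
        return "ortholog"
    # Duplication: in-paralogs if the duplication is lineage-specific
    # (simplified test: every gene below it belongs to one species).
    species = {gene.split("-")[0] for gene in leaves(node)}
    return "in-paralog" if len(species) == 1 else "out-paralog"


print(relation("Arth-A1", "Orsa-A1"))
print(relation("Arth-A1", "Arth-A2"))
print(relation("Arth-A1", "Arth-B1"))
```

Under these assumptions the sketch also reproduces the non-transitivity noted above: Arth-A1 and Arth-A2 are each orthologs of Orsa-A1, yet they are in-paralogs of each other.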

Whole pipeline

The whole system is composed of a MySQL database, two Perl-based data processing pipelines and AJAX-based PHP web interfaces. One pipeline pre-builds ortholog/homolog gene families and loads the resulting data into the database (hereafter the “pre-built” pipeline). The other classifies, on the fly, a query sequence uploaded online by a user into one existing gene family (hereafter the “on-the-fly” pipeline). The “pre-built” pipeline clusters protein sequences into ortholog and homolog gene families using different filtration criteria. It then creates multiple alignments, builds phylogenetic trees and detects diagnostic characters for all gene families. After a user submits a query sequence online, the “on-the-fly” pipeline finds the best matched gene family for the query and inserts the query at the proper place within the existing phylogenetic tree and protein sequence alignment of that family.

Gene Family Generator

Gene Family Generator clusters all amino acid sequences into gene families based on BLAST search results (McGinnis & Madden, 2004). It works in three steps: an all-against-all BLAST search, a BLAST result filter and gene family creation. For the all-against-all search, which covers all peptide sequences, PlantOrDB adopted a Perl-based program as part of Gene Family Generator; it makes use of the multiple cores of a standard single server and shortens BLAST execution time substantially. The BLAST result filter applies an e-value threshold and an overlap region rule to the all-against-all results. Any two gene sequences, whether from the same species or from different species, are treated as homologous relatives if their BLAST e-value is within the 1e-10 cutoff (e-value threshold) and the overlapping region covers more than 80% of the longer sequence (overlap region rule). Homolog gene families are then generated from these homolog relations: if a gene is a homologous relative of a gene in a gene family, it is considered a member of that family. Gene Family Generator randomly picks a gene (sequence), recursively finds all of its relatives in the all-against-all BLAST output and thereby generates one gene family. It then randomly picks another gene that is not yet listed in an established gene family, so that each gene belongs to only one family, and repeats the process until every gene has been assigned to a gene family. The minimum number of genes in a gene family is 2, which means that singleton sequences are discarded in the current version.
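The filter and the family creation step can be sketched as follows. This is an illustration under stated assumptions, not the Perl program used by PlantOrDB: the hit tuple layout `(query, subject, evalue, aligned_length)`, the `lengths` dictionary and the function names are hypothetical, and a union-find structure is used in place of the random-seed recursive expansion described above. Both approaches yield the same families, namely the connected components of the homolog relation.

```python
from collections import defaultdict

E_VALUE_CUTOFF = 1e-10   # e-value threshold from the text
MIN_OVERLAP = 0.80       # overlap region rule: more than 80% of the longer sequence


def is_homolog_hit(hit, lengths):
    """Apply the e-value threshold and the overlap region rule to one hit.

    hit is an assumed tuple: (query, subject, evalue, aligned_length).
    """
    query, subject, evalue, aligned_length = hit
    longer = max(lengths[query], lengths[subject])
    return (query != subject
            and evalue <= E_VALUE_CUTOFF
            and aligned_length / longer > MIN_OVERLAP)


def build_families(hits, lengths):
    """Group genes into families: connected components of accepted hits."""
    parent = {}

    def find(gene):
        parent.setdefault(gene, gene)
        while parent[gene] != gene:
            parent[gene] = parent[parent[gene]]  # path halving
            gene = parent[gene]
        return gene

    for hit in hits:
        if is_homolog_hit(hit, lengths):
            parent[find(hit[0])] = find(hit[1])

    groups = defaultdict(set)
    for gene in list(parent):
        groups[find(gene)].add(gene)
    # Families need at least 2 genes; singletons are discarded.
    return sorted(sorted(group) for group in groups.values() if len(group) >= 2)


lengths = {"g1": 100, "g2": 100, "g3": 100, "g4": 100, "g5": 100}
hits = [
    ("g1", "g2", 1e-50, 90),  # passes both rules
    ("g2", "g3", 1e-20, 85),  # passes both rules, links g3 to g1 via g2
    ("g3", "g4", 1e-5, 95),   # fails the e-value threshold
    ("g4", "g5", 1e-30, 50),  # fails the overlap region rule
]
print(build_families(hits, lengths))
```

Because membership is transitive in this scheme, g1 and g3 end up in the same family even though no accepted hit connects them directly.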

Alignment Constructor

Alignment Constructor performs multiple sequence alignment for each gene family. Many tools, including MAFFT (Katoh and Standley, 2013) and ClustalW (Larkin et al., 2007), can complete multiple sequence alignment tasks with reliable accuracy. PlantOrDB uses MAFFT 7.0 (Katoh and Standley, 2013) because of its “--add” option (Katoh and Frith, 2012), which adds a new unaligned query sequence to an existing multiple sequence alignment. This feature is essential for our “on-the-fly” pipeline, which needs to place a query sequence uploaded online by a user into a gene family temporarily without repeating the time-consuming process of aligning the whole family. That is why MAFFT 7.0 is advantageous here over other multiple sequence alignment tools.
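A minimal sketch of how the “--add” option might be invoked from a script is shown below. It assumes that the `mafft` executable is on the PATH; the file names and both function names are hypothetical, and MAFFT writes the resulting alignment to standard output.

```python
import subprocess


def mafft_add_command(query_fasta, alignment_fasta):
    """Build the MAFFT command that adds new sequences to an existing alignment."""
    return ["mafft", "--add", query_fasta, alignment_fasta]


def add_to_alignment(query_fasta, alignment_fasta, out_fasta):
    """Run MAFFT and save the extended alignment (requires mafft to be installed)."""
    with open(out_fasta, "w") as out:
        subprocess.run(mafft_add_command(query_fasta, alignment_fasta),
                       stdout=out, check=True)


# Hypothetical file names, for illustration only.
print(" ".join(mafft_add_command("query.fasta", "family.aln.fasta")))
```

Only the sequences in the query file are aligned against the existing alignment, which is what makes the step cheap compared with realigning the whole family.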

Tree Builder

Tree Builder uses multiple sequence alignments to build phylogenetic trees. PlantOrDB adopted FastTree2 (Price et al., 2010) as its Tree Builder. PAUP* (Swofford D.L., 2003), PHYLIP (Felsenstein, J, 1989), RAxML (Stamatakis, 2006) and PhyML (Stéphane Guindon and Frédéric Delsuc, 1970) are also popular tools for creating phylogenetic trees. Since some of our gene families contain over 10,000 genes, we put considerable effort into testing tree building tools that can scale up to huge gene families. FastTree2 is based on an approximately maximum-likelihood method and was designed to process huge multiple sequence alignments efficiently, using a reasonable amount of memory without sacrificing the quality of the phylogenetic trees. FastTree2 is also 100 to 1,000 times faster than PhyML 3.0 or RAxML 7 for large alignments (Price et al., 2010). That is why we selected FastTree2 for PlantOrDB.

Event Identifier

The fourth step in our “pre-built” pipeline is Event Identifier, which identifies the event at every node of the phylogenetic trees. There are two kinds of events in phylogenetic trees: speciation events and duplication events. The Speciation versus Duplication Inference (SDI) algorithm (Zmasek & Eddy, 2001) was adopted to identify the event at every node. The SDI algorithm uses the species phylogenetic tree shown in supplemental figure S2 as its reference. PlantOrDB implements the Event Identifier program in Perl.
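The core idea of duplication inference against a reference species tree can be sketched as below. This is a simplified illustration of the general mapping approach, not the PlantOrDB Perl implementation and not a full reimplementation of SDI: each gene tree node is mapped to the last common ancestor of its children's species, and a node is called a duplication when it maps to the same species tree node as one of its children. The two-species tree, the `Ancestor` label and the nested-tuple gene tree are assumptions made for the example.

```python
# Hypothetical reference species tree, stored as child -> parent links.
SPECIES_PARENT = {"Arth": "Ancestor", "Orsa": "Ancestor", "Ancestor": None}


def ancestors(species):
    """Return the path from a species tree node up to the root."""
    path = []
    while species is not None:
        path.append(species)
        species = SPECIES_PARENT[species]
    return path


def species_lca(a, b):
    """Last common ancestor of two species tree nodes."""
    seen = set(ancestors(a))
    for node in ancestors(b):
        if node in seen:
            return node
    raise ValueError("nodes are not in the same species tree")


def infer_events(node, events):
    """Label each internal gene tree node; return (mapped species node, genes).

    A gene tree node is a 2-tuple of children; a leaf is "<species>-<gene>".
    events maps the sorted tuple of genes below a node to its event.
    """
    if isinstance(node, str):
        return node.split("-")[0], [node]
    left, right = node
    mapped_left, genes_left = infer_events(left, events)
    mapped_right, genes_right = infer_events(right, events)
    mapped = species_lca(mapped_left, mapped_right)
    genes = genes_left + genes_right
    is_duplication = mapped in (mapped_left, mapped_right)
    events[tuple(sorted(genes))] = "duplication" if is_duplication else "speciation"
    return mapped, genes


gene_tree = ((("Arth-A1", "Arth-A2"), ("Orsa-A1", "Orsa-A2")),
             (("Arth-B1", "Arth-B2"), "Orsa-B"))
events = {}
infer_events(gene_tree, events)
for genes, event in events.items():
    print(event, genes)
```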

Diagnostics Generator

The fifth component in our “pre-built” pipeline is Diagnostics Generator, which extracts the diagnostic characters of all gene families. Our pipeline determines the diagnostic characters that characterize or define each ortholog gene set (or group) using both multiple sequence alignments and phylogenetic trees. PlantOrDB uses two types of diagnostic characters to differentiate groups: pure and private. For a node, we define all of its descendant sequences as its clade. Both pure and private diagnostic characters appear exclusively within that clade. The difference is that pure diagnostic characters are shared by all members of the clade, whereas private diagnostic characters are shared by only some of its members.
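The pure/private distinction can be illustrated with a small sketch. It is an example under assumptions rather than the PlantOrDB implementation: the alignment is a hypothetical dictionary of equal-length aligned sequences, positions are zero-based column indices, and gap characters are ignored.

```python
def diagnostic_characters(alignment, clade):
    """Find characters that occur only inside the clade, column by column.

    alignment: dict of sequence name -> aligned sequence (equal lengths).
    clade: set of sequence names below the node of interest.
    Returns a list of (column, state, kind), with kind "pure" or "private".
    """
    inside = [alignment[name] for name in clade]
    outside = [seq for name, seq in alignment.items() if name not in clade]
    result = []
    for col in range(len(inside[0])):
        outside_states = {seq[col] for seq in outside}
        inside_states = [seq[col] for seq in inside]
        # Keep states never seen outside the clade, ignoring gaps.
        for state in sorted(set(inside_states) - outside_states - {"-"}):
            shared_by_all = inside_states.count(state) == len(inside)
            result.append((col, state, "pure" if shared_by_all else "private"))
    return result


# Toy alignment, for illustration only.
alignment = {
    "Arth-A1": "MKLV",
    "Arth-A2": "MKLI",
    "Orsa-A1": "MRAV",
    "Orsa-A2": "MRAV",
}
for item in diagnostic_characters(alignment, {"Arth-A1", "Arth-A2"}):
    print(item)
```

In this toy alignment, K and L at columns 1 and 2 are pure for the Arabidopsis clade because both members carry them and no other sequence does, while I at column 3 is private because only Arth-A2 carries it.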

PlantOrDB data source

35 Land Plant Species Genomes:

Short Name        Full Name                   Taxon ID
Acoerulea         Aquilegia coerulea          218851
Alyrata           Arabidopsis lyrata          59689
Athaliana         Arabidopsis thaliana        3702
Bdistachyon       Brachypodium distachyon     15368
Brapa             Brassica rapa               3711
Crubella          Capsella rubella            81985
Cpapaya           Carica papaya               3649
Cclementina       Citrus clementina           85681
Csinensis         Citrus sinensis             2711
Csativus          Cucumis sativus             3659
Egrandis          Eucalyptus grandis          71139
Fvesca            Fragaria vesca              57918
Gmax              Glycine max                 3847
Graimondii        Gossypium raimondii         29730
Lusitatissimum    Linum usitatissimum         4006
Mdomestica        Malus domestica             3750
Mesculenta        Manihot esculenta           3983
Mtruncatula       Medicago truncatula         3880
Mguttatus         Mimulus guttatus            4155
Osativa           Oryza sativa                4530
Pvirgatum         Panicum virgatum            38727
Pvulgaris         Phaseolus vulgaris          3885
Ppatens           Physcomitrella patens       3218
Ptrichocarpa      Populus trichocarpa         3694
Ppersica          Prunus persica              3760
Rcommunis         Ricinus communis            3988
Smoellendorffii   Selaginella moellendorffii  88036
Sitalica          Setaria italica             4555
Slycopersicum     Solanum lycopersicum        4081
Stuberosum        Solanum tuberosum           4113
Sbicolor          Sorghum bicolor             4558
Thalophila        Thellungiella halophila     98038
Tcacao            Theobroma cacao             3641
Vvinifera         Vitis vinifera              29760
Zmays             Zea mays                    4577

6 Green Algae Species Genomes:

Short Name              Full Name                         Taxon ID
Vcarteri                Volvox carteri                    3067
Creinhardtii            Chlamydomonas reinhardtii         3055
Csubellipsoidea C-169   Coccomyxa subellipsoidea C-169    574566
Mpusilla CCMP1545       Micromonas pusilla CCMP1545       564608
Mpusilla RCC299         Micromonas pusilla RCC299         296587
Olucimarinus            Ostreococcus lucimarinus          242159

Analysis Results:

Number of analyzed sequences: 1,530,047

Number of sequences in Homolog gene families: 1,291,670

Number of Homolog gene families/phylogenetic trees: 49,355


Query Classification

Traditionally, the only way to plug a new query sequence into an existing gene family is to add the sequence to its most similar family (the best matched family), redo the multiple sequence alignment, and reconstruct the phylogenetic tree from the new alignment. Fortunately, the CAOS algorithm (Sarkar et al., 2002) can be used not only to extract the character attributes of a given gene family and compute its diagnostic character states, but also to add a new query sequence to an existing phylogenetic tree. It does so in combination with MAFFT 7.0, which can add a query sequence to the existing alignment without reconstructing the whole multiple sequence alignment. When a user submits a query sequence online, the backend “on-the-fly” pipeline is invoked and processes the sequence in the following steps: 1) determine the best matched gene family by BLAST, 2) align the query sequence to the existing multiple sequence alignment using MAFFT 7.0 (i.e., the --add option), and 3) insert the aligned sequence into the phylogenetic tree of the best matched family with the CAOS program. To insert the query sequence at an appropriate position, the CAOS program uses the existing tree of the best matched gene family as a guide tree, searches for matches between the query sequence and the diagnostic characters of the nodes from the root towards the branches of the guide tree, and determines the proper node position for the query sequence.
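The root-to-branch search in step 3 can be pictured with the simplified sketch below. It is not the CAOS program and does not reproduce its actual scoring rules. It only illustrates the idea of walking down a guide tree and, at each node, following the child whose diagnostic characters best match the aligned query. The node layout (`name`, `diagnostics`, `children`), the scoring by match count and the toy tree are all assumptions.

```python
def place_query(aligned_query, node):
    """Walk the guide tree from the root and return the path of node names.

    node: dict with "name", "diagnostics" (list of (column, state)) and
    "children" (list of nodes). aligned_query must already be aligned to
    the family alignment so that column indices are comparable.
    """
    path = [node["name"]]
    while node["children"]:
        scored = [
            (sum(aligned_query[col] == state for col, state in child["diagnostics"]), child)
            for child in node["children"]
        ]
        best_score = max(score for score, _ in scored)
        if best_score == 0:
            break  # no child matches: attach the query at the current node
        node = next(child for score, child in scored if score == best_score)
        path.append(node["name"])
    return path


# Toy guide tree with hypothetical diagnostic characters.
guide_tree = {
    "name": "root", "diagnostics": [], "children": [
        {"name": "clade-A", "diagnostics": [(1, "K"), (2, "L")], "children": [
            {"name": "Arth-A1", "diagnostics": [(3, "V")], "children": []},
            {"name": "Arth-A2", "diagnostics": [(3, "I")], "children": []},
        ]},
        {"name": "clade-B", "diagnostics": [(1, "R")], "children": []},
    ],
}
print(place_query("MKLI", guide_tree))
```

In this sketch the last name on the returned path is the node next to which the query would be attached.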