Genomes contain a large number of genes that do not have recognizable homologues in other species. These genes, found in only one or a few closely related species, are known as orphan genes. Their limited distribution implies that many of them are probably involved in lineage-specific adaptive processes. One important question that has remained elusive to date is how orphan genes originate. It has been proposed that they might have arisen by gene duplication followed by a period of very rapid sequence divergence, which would have erased any traces of similarity to other evolutionarily related genes. However, this explanation does not seem plausible for genes lacking homologues in very closely related species. In the present article, we review recent efforts to identify the mechanisms of formation of primate orphan genes. These studies reveal an unexpectedly important role of transposable elements in the formation of novel protein-coding genes in the genomes of primates.
- coding potential
- de novo gene formation
- evolutionary rate
- gene duplication
- orphan gene
- transposable element
In all genome sequence projects undertaken to date, researchers have identified a surprisingly large number of genes that show no similarity to genes from other species. For example, in the analysis of the complete mouse genome sequence, it was found that as many as 14% of the genes did not show homology with genes from non-mammalian species. A recent analysis of Drosophila melanogaster genes has determined that approx. 18% are restricted to the Drosophila group, lacking homologues in other insects. The origin of many such orphan genes is puzzling. Domazet-Loso and Tautz proposed that they might have predominantly arisen by gene duplication. They argued that the lack of success in detecting homology with other family members could be explained by a very rapid divergence of one of the copies driven by adaptive evolution. Indeed, it has been observed repeatedly that orphan genes display very high evolutionary rates [3–5]. However, in a study using protein evolution simulations, it was shown that even for very rapidly evolving proteins, it should be possible to identify homologues by sequence similarity searches if such homologues exist, at least within the distance separating humans from other extant vertebrates.
A number of mechanisms for the formation of novel sequences other than gene duplication have been reported, including exon shuffling, lateral gene transfer, gene fusion, exonization from TEs (transposable elements) and de novo formation from non-coding genomic regions. A recent study of Drosophila lineage-specific genes has found that approx. 12% of the genes have probably originated de novo from genomic non-coding regions. This fraction is much higher than previously suspected and suggests that de novo gene formation may have been overlooked in the past. The same origin was previously proposed for five genes expressing sperm proteins in Drosophila and, more recently, for a gene termed BSC4 in Saccharomyces cerevisiae.
Approx. 3% of human genes are restricted to primates. We have recently used comparative genomics tools to try to elucidate how these novel genes have been formed. The present review focuses on the results of this analysis and other related studies.
Identification of orphan genes
The first task in the identification of orphan genes is to perform homology searches using all the proteins from the genome under investigation as queries and the proteins from a set of other genomes as targets. In the study of primate orphan genes, we classified all human proteins into four age groups, primate-specific, mammalian-specific, vertebrate-specific and eukaryotic-specific, depending on the presence or absence of homologues in 15 complete eukaryotic genomes, using BLASTP sequence similarity searches. This procedure was similar to that employed to identify lineage-specific genes in other studies [2,5].
From the results of these searches, we found 270 genes that were primate-specific, as we could only identify homologues in Macaca mulatta (rhesus macaque) and Pan troglodytes (chimpanzee). These genes were termed primate orphan genes. Few of them had been experimentally characterized. One of these was dermcidin, a well-characterized peptide with antimicrobial activity secreted in the skin and also involved in neural survival. Other genes with associated functional data were HB-1, a minor histocompatibility protein, SPHAR (S-phase-response protein), involved in the regulation of DNA synthesis, and FAM9B (family with sequence similarity 9, member B) and FAM9C (family with sequence similarity 9, member C), proposed to mediate recombination during meiosis.
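The classification step can be illustrated with a minimal sketch. It assumes that the BLASTP searches have already been run and summarized as (species, E-value) pairs for each human protein. The species table, the E-value cut-off and the function name are hypothetical choices made for illustration; the 15 genomes and the threshold used in the original analysis are not given here.

```python
# Age groups named in the text, ordered from youngest to oldest.
AGE_GROUPS = [
    "primate-specific",
    "mammalian-specific",
    "vertebrate-specific",
    "eukaryotic-specific",
]

# Hypothetical species table: index into AGE_GROUPS for each target genome.
# The 15 genomes actually used are not listed in the text.
GROUP_INDEX = {
    "Pan troglodytes": 0,
    "Macaca mulatta": 0,
    "Mus musculus": 1,
    "Gallus gallus": 2,
    "Danio rerio": 2,
    "Drosophila melanogaster": 3,
    "Saccharomyces cerevisiae": 3,
}


def age_group(hits, max_evalue=1e-4):
    """Assign a human protein to the broadest group with a significant hit.

    hits: iterable of (species, evalue) pairs taken from BLASTP output.
    max_evalue: assumed significance cut-off, not taken from the text.
    """
    oldest = 0  # a human protein with no outside hits is at least primate-specific
    for species, evalue in hits:
        if evalue <= max_evalue:
            oldest = max(oldest, GROUP_INDEX[species])
    return AGE_GROUPS[oldest]
```

Under these assumptions, a protein with significant hits only in chimpanzee and macaque is labelled primate-specific, whereas a single significant hit in yeast is enough to place it in the oldest group.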
Are orphan genes protein-coding?
In a recent study, Clamp et al. argued that the vast majority of the ORFs (open reading frames) present in human orphan genes were likely to be spurious predictions. This was based on their short size and poor level of sequence conservation across species. However, strong sequence divergence does not necessarily imply lack of function. For example, we noticed that the subset of primate orphan genes encoding experimentally verified proteins showed a similarly high level of sequence divergence to that found in the remaining primate orphan genes. Similarly, many of the functionally characterized orphan proteins were remarkably short; for example, dermcidin is only 110 amino acids in length.
One way to assess whether a gene is protein-coding or not is by measuring the coding potential of its putative coding sequence. Codon usage scores are calculated as the log likelihood ratio of codon frequencies under two models: a coding model and a codon equiprobability model. In the coding model, the likelihood of each codon is estimated using the relative frequency of that codon in a large dataset of bona fide protein-coding genes. Under the codon equiprobability model, the likelihood of each codon is considered to be 1/64. To be able to compare sequences of different length, the mean codon usage score across all codons in the sequence is computed. Positive values indicate that the sequence has a high probability of being protein-coding, whereas negative values indicate that it does not.
We applied this calculation to the dataset of 270 primate orphan genes and compared it with several coding and non-coding sequence datasets. The average coding potential of orphan genes was 0.063, similar to the value observed for short human genes encoding experimentally validated proteins (0.081), but clearly distinct from the values observed in a collection of non-coding RNA genes (−0.082) or of non-coding frames extracted from the primate orphan genes (−0.064). These results led us to conclude that the primate orphan genes we had identified mainly comprised protein-coding genes.
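The score described above can be sketched in a few lines. This is an illustration under stated assumptions rather than the implementation used in the study: the logarithm base (natural log here), the pseudo-frequency assigned to codons absent from the training set, and the function names are all choices made for the example.

```python
import math
from collections import Counter


def split_codons(seq):
    """Split an in-frame sequence into complete codons, ignoring any remainder."""
    seq = seq.upper()
    usable = len(seq) - len(seq) % 3
    return [seq[i:i + 3] for i in range(0, usable, 3)]


def codon_frequencies(coding_seqs):
    """Relative codon frequencies in a training set of in-frame coding sequences."""
    counts = Counter()
    for seq in coding_seqs:
        counts.update(split_codons(seq))
    total = sum(counts.values())
    return {codon: n / total for codon, n in counts.items()}


def coding_score(seq, freqs, pseudo=1e-6):
    """Mean per-codon log likelihood ratio: coding model versus uniform 1/64.

    pseudo is an assumed floor for codons never seen in the training set.
    """
    codons = split_codons(seq)
    if not codons:
        raise ValueError("sequence contains no complete codon")
    ratios = [math.log(freqs.get(c, pseudo) / (1 / 64)) for c in codons]
    return sum(ratios) / len(ratios)
```

A sequence whose codons are used more often in the training set than the equiprobable expectation of 1/64 obtains a positive score; averaging over codons is what makes sequences of different length comparable.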
Rapid evolution of orphan genes
One of the most striking features of orphan genes is their high evolutionary rate. This has been observed in very diverse groups of organisms such as Drosophila, yeast, mammals and primates. In coding sequences, we can distinguish between non-synonymous substitutions, which involve amino acid replacements, and synonymous substitutions, which do not involve such replacements. The frequency of the two types of substitution can be estimated using pairs of aligned homologous coding sequences, for example by employing maximum likelihood methods. The ratio between the number of non-synonymous substitutions per non-synonymous site (Ka) and the number of synonymous substitutions per synonymous site (Ks) is informative about the selective forces acting on the protein. Most genes show Ka/Ks values much lower than 1, indicating that negative (purifying) selection drives the evolution of the encoded protein. However, some genes show quite high Ka/Ks values, e.g. close to or even higher than 1. This means that they are subject to either very weak negative selection or positive (adaptive) selection, or to a combination of the two forces.
Primate orphan genes showed a median Ka/Ks of 0.91 using human and macaque orthologous sequence alignments. In comparison, mammalian-specific genes showed a median Ka/Ks of 0.64, vertebrate-specific genes of 0.38 and eukaryotic conserved genes of 0.23. Therefore groups of younger genes tended to show higher evolutionary rates. Similar results have been found in previous studies. In a study by Albà and Castresana, evolutionary rates were estimated from human and mouse orthologous pairs, and human genes were classified into four age groups: tetrapods, deuterostomes, metazoans and eukaryotes. They found that the youngest group, tetrapods, evolved almost four times faster than the oldest group. Cai et al. performed a similar study in Ascomycota. They defined several levels of lineage-specificity using Ascomycota, other fungi and animal genomes and they found increasingly higher evolutionary rates in genes showing increasingly higher lineage-specificity. Although these differences may be related to the action of positive selection in some of the most rapidly evolving genes, as has been suggested in the case of the primate-specific gene family morpheus, in general they are likely to be due to different degrees of purifying selection. In particular, it seems plausible that old proteins will contain a larger proportion of strongly constrained functional and structural sites than younger proteins.
A second feature that correlates with gene age is protein length, which tends to increase as we consider older proteins [4,12,22]. Interestingly, the change in protein length is accompanied by a change in the structural content of proteins, as shown in a study based on proteins with known structures from different phyla. This study showed that young proteins were often short and tended to adopt α, β and α+β structures, proteins of intermediate age showed intermediate length and could also adopt α/β structures, and the oldest proteins were the longest and predominantly of the α/β class. The authors proposed that these would represent different stages in the evolution of a protein.
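To make the Ka/Ks ratio concrete, the following sketch computes it from pre-counted differences and sites. It deliberately uses a simple counting approach with a Jukes-Cantor correction as a stand-in for the maximum likelihood methods mentioned above, and it assumes that the numbers of synonymous and non-synonymous sites and differences have already been obtained from a pairwise codon alignment. The function names are illustrative.

```python
import math


def jukes_cantor(p):
    """Jukes-Cantor corrected distance for a proportion p of differing sites."""
    if not 0 <= p < 0.75:
        raise ValueError("proportion must lie in [0, 0.75)")
    return -0.75 * math.log(1 - 4 * p / 3)


def ka_ks(nonsyn_diffs, nonsyn_sites, syn_diffs, syn_sites):
    """Ka/Ks from counts of differences and sites in a pairwise alignment."""
    ka = jukes_cantor(nonsyn_diffs / nonsyn_sites)
    ks = jukes_cantor(syn_diffs / syn_sites)
    if ks == 0:
        raise ValueError("Ks is zero; the ratio is undefined")
    return ka / ks


# Hypothetical counts: 10 differences over 700 non-synonymous sites and
# 30 differences over 300 synonymous sites give a ratio well below 1,
# the pattern expected under purifying selection.
ratio = ka_ks(10, 700, 30, 300)
```

When the proportions of differing sites are equal in the two classes, the ratio is 1, which is the expectation in the absence of selection on amino acid replacements.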
Mechanisms of formation of orphan genes
Although the mechanisms of formation of orphan genes are not yet fully understood, several studies have shed light on this issue in recent years. In the study of primate orphan genes we distinguished among three main mechanisms (Figure 1), as follows.
Gene duplication
Gene duplication is the most important mechanism for the formation of novel genes. In general, however, both gene copies retain significant sequence similarity to gene family members evolving in other species. This means that the vast majority of novel genes formed by gene duplication will not qualify as orphan genes. This does not preclude the possibility that some orphan genes will have detectable paralogues in the same genome. In fact, we found that approx. 66 of the primate orphan genes showed significant similarity to human genes that were conserved in other mammals, indicating that gene duplication had played a role in their formation. This included many cases with partial sequence similarity, suggesting the formation of chimaeric gene structures formed by two or more sequences with diverse origins. In Drosophila, Zhou et al. found several lineage-specific chimaeric genes, presumably formed by gene duplication and the recruitment of additional exon or intron sequences from genes located in adjacent regions of the genome.
De novo formation from non-coding genomic regions
Genes formed entirely de novo from non-coding genomic regions have been reported in Drosophila [8,9] and in S. cerevisiae. The first step to determine whether an orphan gene has been formed from a previously non-coding region is to identify the syntenic genomic regions from closely related species. Secondly, one needs to rule out the existence of a similar, but unannotated, gene in the same location in the other species. This generally involves translating the syntenic regions in all possible frames to assess whether a similar gene with an uninterrupted ORF can be reconstructed. In the primate orphan gene dataset, we discovered 15 cases in which the corresponding mammalian genomic regions did not have the capacity to encode similar proteins. These genes may therefore have originated de novo from non-coding regions.
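The translation step can be sketched as follows. The example translates a DNA segment in all six reading frames with the standard genetic code and reports the longest stretch free of stop codons. It covers only this step; comparing the resulting translations with the orphan protein, which is needed to decide whether a similar gene can be reconstructed, is not shown. Function names are illustrative.

```python
from itertools import product

# Standard genetic code, built with bases in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna):
    """Translate complete codons; ambiguous codons are written as X."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translations(dna):
    """Translations of the three forward and three reverse-complement frames."""
    dna = dna.upper()
    reverse = dna.translate(COMPLEMENT)[::-1]
    return [
        translate(strand[offset:])
        for strand in (dna, reverse)
        for offset in range(3)
    ]


def longest_open_stretch(dna):
    """Length in codons of the longest stop-free stretch in any frame."""
    return max(
        len(segment)
        for protein in six_frame_translations(dna)
        for segment in protein.split("*")
    )
```

A syntenic region in which every frame is interrupted by stop codons well before the length of the orphan protein is reached is a candidate for lacking the corresponding coding capacity.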
Exaptation from TEs
The exonization of sequences from TEs is well documented. In mammals, this seems to be facilitated by the presence of potential splice sites in some TEs. One example is the human tumour necrosis factor receptor gene (p75TNFR), in which an alternative 5′ exon is exapted from a TE. This process could also generate completely new genes, as suggested for two mouse genes, lungerkine and mNSC1, comprising TE sequences almost entirely.
Searches of TE sequences within genes can be performed using the TE genomic annotations from the RepeatMasker program available at the UCSC Genome Database. A surprisingly high number of primate orphan genes, 142, were found to contain TE sequences. The vast majority of these genes were located in primate-specific genomic regions. One example is the F379 retina-specific protein family. This family is encoded by four genes that are located in subtelomeric DNA regions. These genes are formed by three coding exons, and the second exon is completely covered by two SINEs (short interspersed transposable elements): an Alu (Flam_C subtype) and an MIR (mammalian-wide interspersed repeat) (Figure 2).
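A search of this kind reduces to intersecting exon coordinates with repeat annotations. The sketch below assumes that exons and repeats on the same chromosome are given as 0-based, half-open intervals, and that each repeat is a (start, end, name) tuple; the coordinates in the example are invented, and the function name is illustrative.

```python
def exon_te_coverage(exon, repeats):
    """Fraction of exon bases covered by at least one annotated repeat.

    exon: (start, end) interval, 0-based and half-open (assumed convention).
    repeats: iterable of (start, end, name) tuples on the same chromosome.
    """
    start, end = exon
    # Clip each repeat to the exon and keep only those that overlap it.
    clipped = sorted(
        (max(r_start, start), min(r_end, end))
        for r_start, r_end, _name in repeats
        if min(r_end, end) > max(r_start, start)
    )
    covered = 0
    cursor = start  # right-most exon position already counted
    for r_start, r_end in clipped:
        if r_end > cursor:
            covered += r_end - max(r_start, cursor)
            cursor = r_end
    return covered / (end - start)


# Invented coordinates: two overlapping repeats that together span the exon.
example = exon_te_coverage((100, 200), [(90, 150, "Alu"), (140, 210, "MIR")])
```

Counting covered bases with a moving cursor avoids counting positions twice when annotated repeats overlap, so a value of 1.0 corresponds to an exon that is completely covered by TE sequence.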
In the human genome, there are four main types of TE: in decreasing frequency, LINEs (long interspersed transposable elements), SINEs, LTR (long terminal repeat) elements and DNA elements (Figure 3). In contrast, the most common TEs in primate orphan genes were SINEs, including several subtypes of the Alu family. This indicates that SINEs are more easily exonized than other types of TE and can therefore contribute in a more significant manner to processes of novel gene formation.
Novel genes are continuously being formed in genomes. The origin of orphan genes, those that do not have detectable homologues in other species, has remained enigmatic. However, several recent studies have started to shed light on the mechanisms of formation of such genes. Data from comparative genomics analysis indicate that genes can originate from different types of non-coding genomic sequence at an unexpectedly high frequency. In primates, the exonization of transposable elements, and in particular Alu elements, appears to have facilitated the formation of many new genes.
We received financial support from Ministerio de Ciencia e Innovación (Formación de Profesorado Universitario to M.T.-R., Plan Nacional BFU200-07120, Ramón y Cajal Programme to R.C.), Ministerio de Sanidad y Consumo (Red Española de Esclerosis Múltiple to N.B.) and Fundació Institució Catalana de Recerca i Estudis Avançats (to M.M.A).
Protein Evolution: Sequences, Structures and Systems: Biochemical Society Focused Meeting to commemorate the 200th Anniversary of Charles Darwin's birth held at the Wellcome Trust Conference Centre, Cambridge, U.K., 26–27 January 2009. Organized and Edited by Roman Laskowski (EMBL-EBI, Hinxton, U.K.), Michael Sternberg (Imperial College London, U.K.) and Janet Thornton (EMBL-EBI, Hinxton, U.K.).
Abbreviations: ORF, open reading frame; SINE, short interspersed transposable element; TE, transposable element
- © The Authors Journal compilation © 2009 Biochemical Society