Charles Darwin's theory of evolution was based on studies of biology at the species level. In the time since his death, studies at the molecular level have confirmed his ideas about the kinship of all life on Earth and have provided a wealth of detail about the evolutionary relationships between different species and a deeper understanding of the finer workings of natural selection. We now have a wealth of data, including the genome sequences of a wide range of organisms, an even larger number of protein sequences, a significant knowledge of the three-dimensional structures of proteins, DNA and other biological molecules, and a huge body of information about the operation of these molecules as systems in the molecular machinery of all living things. This issue of Biochemical Society Transactions contains papers from oral presentations given at a Biochemical Society Focused Meeting to commemorate the 200th Anniversary of Charles Darwin's birth, held on 26–27 January 2009 at the Wellcome Trust Conference Centre, Cambridge. The talks reported on some of the insights into evolution which have been obtained from the study of protein sequences, structures and systems.
- gene duplication
- horizontal gene transfer
- protein evolution
- protein network
- protein structure
In the 150 years since On the Origin of Species was published, we have learnt much about that “mystery of mysteries”: how species originate and evolve. Much of the knowledge has come from huge advances in molecular biology which have revealed the constituent molecules of all living things (proteins, DNA, RNA and other molecules), how these operate in concert and how individual traits are passed from parent to offspring. We also know more about how natural selection operates on variations present at the molecular and whole-organism levels to “preserve the favoured races in the struggle for life”.
Furthermore, the analyses of, first, protein sequences and then genomic sequences have provided ample evidence for “the mutual relations of all the beings which live around us”. The similarity of related species is reflected by the similarity of their genetic make-up and allows the evolutionary history of proteins and protein families to be traced back to “some other and generally extinct species”.
As of May 2009, the genomes of over 900 organisms had been sequenced, and over 7.5 million protein sequences were in the UniProtKB/TrEMBL Protein Database. More than 51000 experimentally determined models of protein three-dimensional structures had been deposited at the worldwide Protein Data Bank (wwPDB), whereas the MetaCyc database held over 1200 experimentally elucidated metabolic pathways from more than 1600 different organisms. Besides these, there were over 150 databases holding gene expression data, proteomic data, protein interaction data, metabolomic and metabonomic data, phenomic and phenotypic data, as well as pathways and quantitative models.
The papers in this issue of Biochemical Society Transactions reflect how this wealth of data is beginning to provide answers to some of the questions concerning the finer details of evolution at the molecular level.
How do new genes arise?
The principal source of novel genes appears to be gene duplication, wherein a gene is copied to another part of the DNA or, very rarely, the whole genome is duplicated. One copy of the duplicated gene undergoes modification, which may result in its acquiring a new adaptive function and so being retained by natural selection. There is abundant evidence for gene duplication in the sequenced genomes.
Ponting and Goodstadt compare the mouse and human genomes (see pp. 734–739). The sequencing of both of these is now nearly complete and, crucially, includes the ‘difficult’ (in terms of sequencing) parts of the genome where many duplicated segments are found and where many lineage-specific genes appear to lie. The last common ancestor of these two species lived some 90 million years ago in the Cretaceous period, and many duplications, insertions and deletions have occurred in that time. The authors find that approx. 20% of human and 24% of mouse genes have appeared by duplication since the split, with fewer than 1% of the genes which were present in the last common ancestor having been lost. Most changes have occurred in large multicopy families which are better able to respond to changes in the environment. Over 15000 genes in the two genomes are direct orthologues and have tended to preserve their sequences. These genes are likely to be crucial for function. Mice have more genes than humans overall, largely as a result of a high level of gene duplication in the olfactory and vomeronasal receptors and pheromone genes. Primates, on the other hand, appear to have lost these genes in the last 25 million years. An interesting observation is that, because small gene duplications are more common than large ones, compact genes are more likely to be duplicated with their promoters, non-coding regulatory regions and open reading frames intact.
Not all new genes arise as a result of gene duplication. Toll-Riera et al. analysed orphan genes in primate genomes to discern how these might have arisen (see pp. 778–782). Orphan genes are ones with no recognizable homologues in other species, hence many are likely to be lineage-specific. Approx. 3% of primate genes appear only in primate genomes. They tend to exhibit accelerated evolution, yet this rate is not so high as to explain their occurrence as the result of gene duplication followed by rapid sequence divergence. Rather, it seems that most of these orphans have come from transposable elements, in particular the widespread Alu elements, becoming novel protein-coding genes.
In bacteria, new genes can arise by HGT (horizontal gene transfer), also referred to as lateral gene transfer, wherein genetic material from a different organism, or indeed a different species, is incorporated into the genome. The result is that some traits can be acquired much faster than through Darwinian evolution where genes are only transferred vertically, i.e. from parent to offspring. HGT is responsible for the rapid spread of drug-resistance in pathogenic bacteria and the rise of the so-called superbugs. Many unicellular eukaryotes can also acquire new genes by HGT, and some of these species are important pathogens of humans and agriculturally important plants. Whitaker et al. describe various approaches for identifying HGT in eukaryotes to assess its contribution to the evolution of the receiving organisms (see pp. 792–795). Zámocký et al. provide evidence from two fungal genomes that the fungal catalase peroxidase genes were acquired by HGT from ancient bacteria (see pp. 772–777).
How is a new function acquired?
When a gene is duplicated, how can a new function arise for the copy quickly enough for it to be retained in the proteome rather than being lost by mutations to become merely a pseudogene? This is the so-called ‘Ohno's dilemma’. This dilemma is particularly troubling if the duplicated gene is highly specialized and requires major change to alter its function. However, many genes are not so specialized. For example, enzymes such as the esterases, cytochromes P450 and GSTs (glutathione transferases) have a broad specificity and are able to interact with many different substrates. Over time, copies of such a gene can evolve a more finely tuned specificity for a single substrate. The paper by Mannervik et al. describes experimental approaches, using directed evolution of enzymes, to demonstrate how this can happen (see pp. 740–744). Using DNA shuffling of two related GSTs, the authors randomly picked chimaeric variants and tested their relative activities on a set of substrates. By clustering the activity patterns obtained, new variants (or functional quasi-species), having a different pattern of activities from either parent, were observed.
Function can also be modified by sequence insertions which, if adopted, can embellish the three-dimensional structure of the resultant protein and provide potential for quite dramatically different functions. Dessailly et al. describe the example of the HUP domains, which are ancient domains found in many proteins (see pp. 745–750). Their structure is the common Rossmann fold which allows for large structural insertions at various points. As a consequence, the HUP domains are found in combination with over 150 other domains and are involved in markedly different functional activities.
Gene duplication, followed by gene fusion or fission, or exon shuffling can lead to new combinations of protein domains, and hence the potential for the creation of new proteins. Buljan and Bateman show that most domain gains and losses occur at protein termini, as changes at the ends are less likely to disrupt the protein's structure (see pp. 751–755). They also trace the evolution of the immunoglobulin superfamily through 500 million years from its origins in the cell receptor proteins of primitive sponges to its current status as one of the largest families in the human genome.
What are the constraints on protein evolution?
Amino acid mutations may alter function and, consequently, natural selection may retain these mutations. Over the last half century, studies of the three-dimensional structures of proteins have revealed how a protein's three-dimensional structure relates to its biochemical function. Gong et al. show how different locations in the three-dimensional structure impose different constraints on which amino acid substitutions can be tolerated in order to preserve structure and/or function (see pp. 727–733). Some mutations can have a dramatic impact on a protein's function (if, say, they affect a catalytic residue), whereas others may have none at all.
Warnecke et al. describe several factors at the DNA and mRNA levels which can constrain the choice of amino acids in certain protein coding regions (see pp. 756–761). For example, as the amino acids are encoded by different DNA codons, organisms whose genomes contain different intergenic GC-content exhibit corresponding differences in their amino acid usage. In Escherichia coli, and some other organisms, highly expressed genes tend to encode amino acids that are relatively cheap (in terms of energy) to synthesize. Another factor that affects amino acid choice may be the secondary structure of the mRNA encoding a protein; secondary structure can affect the mRNA's half-life or translational properties and consequently its ‘fitness’ and ability to lead to protein product. The location of the gene in the nucleosome can also act as a constraint, as the ability of DNA to twist and bend around the histones in the nucleosome is governed by its sequence. Most interestingly, the authors demonstrate that, in species (such as humans) where the intronic regions separating exons tend to be very long and numerous, additional motifs identifying the intron–exon boundaries are found within the exons themselves. As a consequence, the need to maintain these so-called ESEs (exonic splice enhancers) imposes a severe constraint on which amino acids can be encoded in the ESE regions. This constraint is reflected in detectable biases for the types of amino acids encoded near to exon boundaries, and also in a lower rate of SNPs (single nucleotide polymorphisms) in these regions.
Tracing protein evolutionary history
Given the sheer amount of sequence data that we now have, it is possible to trace the evolution of many proteins or protein families through time. One such family is the spectrins, described by Baines (see pp. 796–803), which are involved in cross-linking proteins and lipids within the cell membrane to the cytoskeleton. This linkage enhances the mechanical stability of the membrane and strengthens the cell's resistance to mechanical stresses. A variant that evolved in mammals following a gene duplication is responsible for the elasticity of non-nucleated red blood cells. Spectrins are found in very primitive organisms such as choanoflagellates where they perform a different, albeit unknown, function, and so must date from the ancestors of the first primitive animals and have evolved new functions since then. Variation in their forms and functions in fish, frogs, chickens and mammals suggests that gene duplications have resulted in functionally divergent versions.
Another ancient family is the SNARE (soluble N-ethylmaleimide-sensitive fusion protein-attachment protein receptor) proteins, which are responsible for pulling together and joining membranes during vesicle fusion. They are crucial in the endocytic and secretory pathways in eukaryotes. Kienle et al. identify 20 distinct groups of SNAREs using HMMs (hidden Markov models) and suggest these are likely to correspond to the original SNARE repertoire of the proto-eukaryotic cell (see pp. 787–791). There seems to have been a major expansion of SNAREs during the rise of multicellularity and a further expansion of secretory SNAREs in vertebrates. In contrast, in fungi, which developed multicellularity independently, there was no similar SNARE expansion, but this can be understood in terms of fundamental differences in lifestyles between fungi and vertebrates.
The rate of protein evolution
Evolution does not occur at a constant rate, but depends on the environmental selective pressures on the individuals of the species at any given time. Such changes in evolutionary rate, or episodic evolution, can be detected from analysis of amino acid changes in groups of related genes. Previous analyses have focused on mammals. In the paper by Studer and Robinson-Rechavi (see pp. 783–786), fish-specific genes from five sequenced fish genomes were analysed. They found that episodic positive selection has affected at least 77% of the genes in question, a much larger proportion than was obtained in the previous studies on mammalian genomes.
Individual proteins can evolve at different rates. In his talk, Dan Tawfik described a number of factors affecting the rate of a protein's evolution. The work of his group aims to reproduce the evolution of new proteins in the laboratory and determine the traits affecting their ‘evolvability’; these include functional promiscuity, conformational plasticity and the modularity of their fold.
The evolution of gene expression
Araxi Urrutia focused on the evolution of gene expression in her talk. She showed that highly expressed genes appear to be adapted for rapid expression, having shorter introns, higher codon usage bias and encoding shorter proteins with higher frequencies of less complex amino acids. Furthermore, gene expression patterns are related to their genomic context: housekeeping genes tend to cluster in gene-dense, GC-rich regions of the genome.
A widespread, yet simple, regulator of protein activity is phosphorylation, performed by kinase proteins, which involves transfer of the γ-phosphate of ATP to a specific side chain on the target protein. The phosphorylation of the given side chain alters the function of the protein, perhaps switching its function on or off. Structural studies have revealed much about the regulatory mechanisms of phosphorylation and the differences in structure of active and inactive forms of the kinases, as discussed by Dame Louise Johnson (see pp. 627–641). CDKs (cyclin-dependent kinases), which are important in the cell cycle, are primary targets in cancer treatment and indeed there are nine approved drugs that target the ATP-binding site of CDKs. Some drugs bind to the active and some to the inactive form of their target proteins. Resistance to the active-binding drugs is less likely as mutations affecting the active form are less likely to be tolerated.
Proteins do not operate in isolation, but rather form large interconnected networks of interacting molecules containing positive- and negative-feedback loops and other regulatory mechanisms. Huvet et al. (see pp. 762–767) present an analysis of nine E. coli proteins involved in the response to attack by bacteriophages, which necessitates the repair of damage to the inner cell membrane. The system in E. coli has been well studied and can be represented by a mechanistic model describing the interactions and the dependencies of the proteins in the system. Analysis of 129 bacterial genomes shows that either all nine proteins are found in a given genome or hardly any at all. This illustrates the importance of considering genes in context, e.g. as parts of interacting networks.
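The all-or-nothing pattern described above can be illustrated by counting, for each genome, how many members of a system are present. The following is a minimal sketch under stated assumptions, not the method of Huvet et al.: the protein names (`p1` to `p4`), the genome labels and the presence/absence data are hypothetical placeholders.

```python
from collections import Counter


def presence_counts(profiles, system):
    """Return, for each genome, how many proteins of the system it contains."""
    return {genome: len(system & present) for genome, present in profiles.items()}


def count_histogram(counts):
    """Tally how many genomes contain 0, 1, 2, ... members of the system."""
    return Counter(counts.values())


# Hypothetical toy data: a four-protein system and four genomes.
system = {"p1", "p2", "p3", "p4"}
profiles = {
    "genome_A": {"p1", "p2", "p3", "p4", "x"},  # complete system plus an unrelated gene
    "genome_B": set(),                           # no members present
    "genome_C": {"p1", "p2", "p3", "p4"},        # complete system
    "genome_D": {"p2", "y"},                     # a single member
}

counts = presence_counts(profiles, system)
histogram = count_histogram(counts)
```

A histogram concentrated at the two extremes (all members or almost none) is what one would expect if the proteins are retained or lost together as a functional unit.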
Robertson and Lovell do just that, using protein–protein interaction data to help explain the evolution of new functions (see pp. 768–771). Functional change can be measured by comparing the numbers of interactions that paralogues make with other proteins, and how many of these interactions are shared. Indeed, this provides a valuable measure of change which does not correlate with sequence divergence because it is so dependent on just those residues (<10–15% of a protein's total) that are involved in binding. Both whole-genome and single-gene duplications can contribute to functional innovation via ‘rewiring’ the interactions between proteins, but their effects are different for different functional categories because of the different evolutionary constraints involved.
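One simple way to quantify how many interactions two paralogues share is the Jaccard index of their sets of interaction partners. The sketch below assumes this measure for illustration only; it is not necessarily the one used by Robertson and Lovell, and the partner names (`q1` to `q5`) are hypothetical.

```python
def interaction_overlap(partners_a, partners_b):
    """Jaccard index of two partner sets: 1.0 if identical, 0.0 if disjoint."""
    a, b = set(partners_a), set(partners_b)
    union = a | b
    if not union:
        # Convention assumed here: with no recorded interactions, report no overlap.
        return 0.0
    return len(a & b) / len(union)


# Hypothetical interaction partners of two paralogues.
paralogue_1 = {"q1", "q2", "q3"}
paralogue_2 = {"q2", "q3", "q4", "q5"}

# Two shared partners (q2, q3) out of five distinct partners in total.
overlap = interaction_overlap(paralogue_1, paralogue_2)
```

A low overlap between recently duplicated genes would indicate substantial rewiring even when their sequences remain similar.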
Molecular biology has proved beyond doubt Darwin's assertion that all species are descended from earlier forms as a result of “descent with modification” rather than individual creation. Many of these evolutionary paths can now be traced at the protein and protein family levels. Moreover, our expanding knowledge of how new genes are created and functions altered has demonstrated how the molecular machinery of cells can generate the raw material on which natural selection can act and lead to newer, better adapted, forms.
Protein Evolution: Sequences, Structures and Systems: Biochemical Society Focused Meeting to commemorate the 200th Anniversary of Charles Darwin's birth held at the Wellcome Trust Conference Centre, Cambridge, U.K., 26–27 January 2009. Organized and Edited by Roman Laskowski (EMBL-EBI, Hinxton, U.K.), Michael Sternberg (Imperial College London, U.K.) and Janet Thornton (EMBL-EBI, Hinxton, U.K.).
Abbreviations: CDK, cyclin-dependent kinase; ESE, exonic splice enhancer; GST, glutathione transferase; HGT, horizontal gene transfer; SNARE, soluble N-ethylmaleimide-sensitive fusion protein-attachment protein receptor
© The Authors. Journal compilation © 2009 Biochemical Society