We present the BPIFAn/BPIFBn systematic nomenclature for the PLUNC (palate lung and nasal epithelium clone)/PSP (parotid secretory protein)/BSP30 (bovine salivary protein 30)/SMGB (submandibular gland protein B) family of proteins, based on an adaptation of the SPLUNCn (short PLUNCn)/LPLUNCn (long PLUNCn) nomenclature. The nomenclature is applied to a set of 102 sequences which we believe represent the current reliable data for BPIFA/BPIFB proteins across all species, including marsupials and birds. The nomenclature will be implemented by the HGNC (HUGO Gene Nomenclature Committee).
- cognate representative nomenclature
- palate, lung and nasal epithelium clone (PLUNC)
- phylogenetic tree
The PLUNC (palate lung and nasal epithelium clone) family has been introduced elsewhere in this issue. The purpose of the present article is to introduce a unified nomenclature for these genes/proteins which arose out of discussions at the recent focused meeting ‘Proteins with a BPI/LBP/PLUNC-like Domain: Revisiting the Old and Characterizing the New’ held in Nottingham, U.K. As discussed below, various confusing and duplicate names exist for these proteins, which hinders communication between workers in the field, and makes it difficult for those from outside the field to understand the relationships between the proteins. Our aim in proposing this nomenclature is to make the relationships between different proteins as clear as possible.
A feature of the PLUNC proteins is that they are rapidly evolving: orthologues show high sequence dissimilarity (compared with interspecies comparisons for other proteins), and there are significantly different numbers of paralogues in different species [1–4]. It was therefore felt necessary to define a nomenclature that could most distinctly indicate 1:1 orthology between proteins, but also indicate the more subtle relationships where species or lineage-specific paralogues have arisen.
The meeting was not simply focused on the PLUNC family, but also on the wider BPI (bactericidal/permeability-increasing protein) fold-containing superfamily, which we define as all proteins that can be strongly predicted to contain either one or both domains of the BPI fold. This superfamily can be robustly segregated by simple phylogenetic analysis into PLUNC and non-PLUNC branches. The non-PLUNC branch includes BPI, LBP (lipopolysaccharide-binding protein), CETP (cholesteryl ester-transfer protein), PLTP (phospholipid-transfer protein) and also other proteins such as BPIL2 or LBPBPI1. Suitable names already exist for BPI, LBP, CETP and PLTP, and very little is characterized about many of the other proteins in the non-PLUNC branch. Nomenclature discussions were therefore initially restricted to the PLUNC branch. However, as discussed below, this restriction was later relaxed to allow for future inclusion of the non-PLUNC proteins.
Our proposals maintain a strong correspondence with a nomenclature that has become common in most of the literature and in database annotations, and have been approved by the HGNC (HUGO Gene Nomenclature Committee).
Over 25 years ago, rodent PSP (parotid secretory protein) was the first member of the PLUNC family to be identified and cloned. Studies on PSP have been largely focused on expression analysis and it is well recognized as a highly abundant protein in rodent saliva. A second related gene, SMGB (submandibular gland protein B), was subsequently cloned from rat salivary glands. The cloning of mouse PLUNC in 1999 added another related protein to this family, and analysis contained within that paper recognized the similarity that these three proteins shared with the database sequences for mouse VEMSGP (von Ebner minor salivary gland protein) (GenBank® accession number U46068) and cow BSP30 (bovine salivary protein 30) A (accession number U79413). It was following the analysis of the human and mouse PLUNC genes that, in 2002, we showed that humans contain at least seven expressed genes in the PLUNC locus. Subsequently, this number has undergone a number of revisions and has now been refined to eight authentic genes and three pseudogenes within the human locus. As has been highlighted elsewhere in this issue, this number varies across mammalian species [1,12]. At that time, we used a nomenclature which distinguished between the two possible types of proteins, based on length: SPLUNCn (short PLUNCn; e.g. SPLUNC2) for the one-domain (‘short’) proteins comprising approximately 250 amino acids, and LPLUNCn (long PLUNCn) for the two-domain (‘long’) proteins comprising approximately 450 amino acids.
At the meeting, a parallel discussion session considered the various issues surrounding defining a satisfactory nomenclature. These discussions were then continued in a plenary session. One issue that was discussed was whether PLUNC was suitable as a basis for a ‘root’ name for the family. The term PLUNC was originally coined as an acronym for palate, lung and nasal epithelium clone. This acronym does not accurately convey the true variety of localizations of PLUNC, and furthermore the word ‘carcinoma’ has been substituted for ‘clone’ in many instances in the databases: a rewording that seems to have little basis in science. There is also a widespread concern that the SPLUNC1 and LPLUNC1 style of nomenclature can lead to confusion, whereby SPLUNC1 is misinterpreted as a short form of a longer protein LPLUNC1.
Continuation of the discussions following the meeting led to the proposal of a nomenclature that can encompass the whole BPI fold-containing superfamily with the introduction of a BPIF root. New gene symbols will be allocated by the HGNC for the PLUNC branch proteins, whereas the symbols will have the status of aliases for the well-established BPI, LBP, CETP and PLTP proteins.
The BPIF superfamily is divided into the subfamilies BPIFA, BPIFB, BPIFC, etc. BPIFA will replace the SPLUNC root, BPIFB will replace the LPLUNC root, BPIFC will replace the symbol for BPIL2, and BPIFD, BPIFE and so on will be used by the HGNC as aliases for BPI, LBP, etc., in a manner yet to be finalized. The allocation of these latter aliases will not be discussed further here.
Nomenclature for the PLUNC branch proteins
(i) The nomenclature system will be a modification of the existing SPLUNCn/LPLUNCn style names:
(a) The SPLUNC root will be replaced by BPIFA
(b) The LPLUNC root will be replaced by BPIFB
(c) Wherever possible the existing numbering of proteins will be retained
(ii) Assignment to families will be on the basis of inspection of sequence-based phylogenetic trees and pairwise identity patterns as described below.
(iii) Small amendments will be made to the nomenclature of a limited number of proteins. A number of proteins for which no SPLUNCn/LPLUNCn style name existed will have a name allocated.
(iv) The main amendment to the current usage is where expansion of the number of paralogues has occurred in a lineage, and where none of the resulting proteins can be identified at the sequence level to have retained significantly the most similarity to the presumptive orthologue. In these cases ‘A’, ‘B’, ‘C’ etc. will be appended to the gene name. Where expansions have occurred in two lineages separately, different letters of the alphabet will then be used in the two lineages.
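As an illustration of rules (i) and (iv), the root replacement can be sketched in a few lines of Python. This is a minimal sketch only: the function name `to_bpif` and its `paralogue_letter` argument are hypothetical, and the sketch deliberately ignores the small number of proteins whose numbering changes, for which the published table of correspondences remains the authority.

```python
import re

# Root replacements proposed in the text: SPLUNC -> BPIFA, LPLUNC -> BPIFB.
_ROOTS = {"SPLUNC": "BPIFA", "LPLUNC": "BPIFB"}
_OLD_NAME = re.compile(r"^(SPLUNC|LPLUNC)(\d+)$", re.IGNORECASE)


def to_bpif(old_name, paralogue_letter=""):
    """Translate an SPLUNCn/LPLUNCn-style name into BPIFAn/BPIFBn style.

    The existing number is retained (rule i.c). `paralogue_letter` is a
    hypothetical argument for rule (iv): a letter such as 'A' appended
    when a lineage-specific expansion has no clear orthologue.
    """
    match = _OLD_NAME.match(old_name.strip())
    if match is None:
        raise ValueError(f"not an SPLUNCn/LPLUNCn-style name: {old_name!r}")
    root, number = match.group(1).upper(), match.group(2)
    return f"{_ROOTS[root]}{number}{paralogue_letter.upper()}"
```

For example, `to_bpif("SPLUNC1")` gives `BPIFA1` and `to_bpif("LPLUNC1")` gives `BPIFB1`. Names outside the SPLUNCn/LPLUNCn scheme (PSP, SMGB, BSP30A, etc.) are rejected rather than guessed, since their new symbols were assigned by inspection.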
Application of the nomenclature proposals
For the purpose of refining and testing the above nomenclature proposals, a collection of 102 sequences was established, which is a combination of well-established PLUNC branch sequences, along with other sequences in which we have moderately high confidence from the current NCBI nr database. Initially, BLAST searches against the nr database were performed using all human and/or mouse LPLUNC and SPLUNC proteins as queries until a converged set of proteins was obtained. The sequences were reduced to a maximum of 90% pairwise identity using cd-hit, and sequences from the set in Chiang et al. were then added. A number of manually collected sequences were added, especially to populate more fully the BASE branch. An iterative process of constructing phylogenetic trees and inspecting apparent anomalies against EST (expressed sequence tag) databases was then followed. In this process, a number of sequences were either discarded as faulty predictions, or were replaced by more secure predictions. Phylogenetic trees were prepared using ClustalW and visualized using ITOL. Sequence analysis was facilitated by using the Jalview resource and in-house Python scripts. A phylogenetic tree of the resulting dataset is shown in Figure 1. We believe that this represents a good snapshot of the current state of knowledge of the species distribution of PLUNC proteins, and this is our basis for assessing the effectiveness of the proposed nomenclature.
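The redundancy-reduction step can be pictured as a greedy filter of the kind that cd-hit is based on: sequences are visited from longest to shortest, and a sequence is retained only if it is not too similar to any sequence already retained. The sketch below illustrates that idea only; it is not cd-hit's implementation, and the function name `greedy_reduce` and the caller-supplied `identity` function are assumptions made for the example.

```python
def greedy_reduce(sequences, identity, max_identity=90.0):
    """Return the names of a non-redundant subset of `sequences`.

    sequences    : dict mapping sequence name -> sequence string
    identity     : callable (seq_a, seq_b) -> percent identity; supplied by
                   the caller (hypothetical, e.g. from a pairwise alignment)
    max_identity : a sequence is dropped if its identity to any retained
                   sequence reaches this value
    """
    kept = []
    # Longest first, so the longest member represents each redundant group.
    for name in sorted(sequences, key=lambda n: len(sequences[n]), reverse=True):
        seq = sequences[name]
        if all(identity(seq, sequences[k]) < max_identity for k in kept):
            kept.append(name)
    return kept
```

The identity measure is left as a parameter because the result depends on how identity is defined (alignment method, treatment of gaps), which the filter itself should not decide.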
By inspection of the phylogenetic tree, all proteins in the collection were assigned to the appropriate BPIFAn or BPIFBn families, giving rise to the annotations in Figure 1. A table of correspondences between the new nomenclature and existing names is given in Table 1. The delineation of families was largely guided by analysis of proteins from eutherian mammals (i.e. excluding avian and marsupial sequences). As discussed by Chiang et al., a number of chicken proteins can be assigned to the families defined by the eutherian mammals. For the more divergent proteins chicken TENP and chicken OVO36, we preferred to create new family designations (BPIFB7 and BPIFB8 respectively). The mouse protein vomeromodulin [19,20] is only weakly placed in the PLUNC branch by phylogenetic methods; however, its genomic location confirms this analysis. It is assigned to its own family (BPIFB9).
In order to test the robustness of this procedure of naming of proteins we analysed the pairwise identities of all the collected sequences with a single representative from each family. This method is more objective than a method based on inspecting a phylogenetic tree, since it is less affected by the details of a large multiple-sequence alignment. We selected the human protein (where it exists as an expressed protein) as the prime representative sequence for each family. Where no human protein exists, we chose the mouse protein (for BPIFA5, BPIFA6 and BPIFB5) and the chimpanzee protein [for BPIFA4 (BASE)].
We then used ClustalW to pairwise align each sequence against the representative sequence from each family. The resulting pairwise identities are plotted in Figure 2. With the exception of BPIFA2 proteins, in all the eutherian proteins the identity with the cognate representative (i.e. the representative sequence for the family to which the protein is assigned) is significantly higher than that with any of the non-cognate representatives. In most cases the pairwise identity with the cognate representative protein exceeds 50%, and the pairwise identity with non-cognate representatives is less than 30%. There are, however, a number of cases that complicate the analysis, so that a ‘>50%-cognate versus <30%-non-cognate’ rule does not universally apply. The simplest of these cases are BPIFB3/4, BPIFA5 and BPIFA4.
BPIFB3 and BPIFB4 have retained a significantly higher similarity to each other, so that the pairwise identities of BPIFB3 proteins with BPIFB4 proteins (and vice versa) are approximately 40%.
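The check described above can be sketched as follows, assuming the pairwise alignments have already been produced (for example by ClustalW) and are available as gapped strings of equal length. The function names, and the convention of counting identical residues over columns in which neither sequence has a gap, are assumptions for this sketch; ClustalW's own reported identity may be defined differently.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity over alignment columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = identical = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue  # skip gapped columns
        compared += 1
        if a.upper() == b.upper():
            identical += 1
    return 100.0 * identical / compared if compared else 0.0


def assign_family(identities, cognate_min=50.0, noncognate_max=30.0):
    """Pick the family whose representative is most similar to the query.

    identities : dict mapping family name -> percent identity between the
                 query and that family's representative sequence
    Returns (best_family, clear), where `clear` is True only when the
    '>50% cognate versus <30% non-cognate' pattern holds for the query.
    """
    if not identities:
        raise ValueError("no identities supplied")
    ranked = sorted(identities.items(), key=lambda item: item[1], reverse=True)
    best_family, best = ranked[0]
    runner_up = ranked[1][1] if len(ranked) > 1 else 0.0
    clear = best > cognate_min and runner_up < noncognate_max
    return best_family, clear
```

A query for which `clear` is False is not necessarily misassigned; it simply falls into one of the complicating cases discussed in the text and needs inspection against the phylogenetic tree.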
BPIFA5, which is specific to rodents [2,3], has clearly arisen from a duplication of the BPIFA1 proteins. The pairwise identities of BPIFA5 proteins with BPIFA1 proteins (and vice versa) are approximately 55%.
The BPIFA4 (BASE) family is quite divergent, with pairwise identities with the representative sequence between 40–50% [21,22]. The horse BPIFA4 (also known as latherin) has the lowest pairwise identity with chimpanzee BPIFA4, and in the phylogenetic tree its position is somewhat anomalous relative to the BPIFA4 sequences from other species. This suggests that it has been under different evolutionary pressures to the other BPIFA4 proteins and may have developed, at least in part, a different function to the other BPIFA4 proteins.
The most complex family is BPIFA2. This is very highly divergent, and is the only case in the current dataset where addition of letters of the alphabet to the protein name is required. In the case of cow, four BPIFA2-related paralogues exist, which are currently referred to as BSP30A, BSP30B, BSP30C and BSP30D. Three of the proteins have similar pairwise identities with human BPIFA2 of approximately 40%. The fourth protein (BSP30C) has a substantially lower similarity to human BPIFA2; however, it is still clearly a member of this cluster of proteins. As no single protein shows markedly greater similarity to human BPIFA2 than the others, these proteins are renamed BPIFA2A, BPIFA2B, BPIFA2C and BPIFA2D. Fortunately, it is possible at this stage to retain the correspondence of letters between BSP30A and BPIFA2A, etc. In rats, a similar situation exists, where there are two BPIFA2 paralogues, which have the historical gene symbols Psp and Smgb. There is only a marginal difference in pairwise similarities to human BPIFA2 (and all the values are much lower than 35%), and thus they become BPIFA2E and BPIFA2F respectively. Note that these symbols represent the rat protein; rodent gene symbols are of the format Bpifa2e and Bpifa2f. For other mammalian species mentioned in the present paper, protein and gene symbols are of the same format. The BPIFA2E and BPIFA2F symbols are chosen to use letters of the alphabet distinct from those chosen for the BSP30 proteins. In mouse, Bpifa2e is a protein-coding gene, but the mouse orthologue of rat Bpifa2f is a pseudogene, which was previously known as mouse Splunc4, and is now referred to as Bpifa2f-ps, in accordance with rules for mouse pseudogene nomenclature. This is the only instance where the numbering in the new system is different to the SPLUNCn/LPLUNCn system.
Besides these two lineage-specific duplications, the pairwise identities between the remaining non-primate sequences and human BPIFA2 are also low, at approximately 50%, indicating that there is some form of very strong evolutionary pressure acting on all the BPIFA2 proteins.
It is worthwhile to briefly return to the case of BPIFA5, in order to illustrate the opposite case to BPIFA2. Since mouse and rat BPIFA1 have retained much greater similarity to human BPIFA1 than the mouse and rat BPIFA5 proteins have, the former keep the BPIFA1 symbol and a new symbol (BPIFA5) is allocated to the latter, rather than introducing BPIFA1A and BPIFA1B.
All sequences used in this study are available as Supplementary Online Data at http://www.biochemsoctrans.org/bst/039/bst0390976add.htm.
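The contrast between the BPIFA2 and BPIFA5 cases amounts to a simple decision rule, sketched below. The text only requires that one paralogue has retained 'significantly' more similarity to the presumptive orthologue, so the numerical `margin` used here is a hypothetical threshold chosen for illustration, as are the function name and the identity values in the usage note.

```python
from string import ascii_uppercase


def name_paralogues(family, identities, margin=15.0, first_letter="A"):
    """Propose symbols for lineage-specific paralogues of one family.

    identities   : dict mapping paralogue name -> percent identity to the
                   family's representative (e.g. the human protein)
    margin       : hypothetical threshold (percentage points) by which the
                   best paralogue must exceed the next to count as the
                   clear orthologue
    first_letter : first suffix letter to use, so that separate lineages
                   can be given distinct letters
    Returns a dict paralogue name -> symbol, with None marking a paralogue
    that would need a newly allocated family number.
    """
    ranked = sorted(identities, key=identities.get, reverse=True)
    if len(ranked) < 2:
        return {name: family for name in ranked}
    if identities[ranked[0]] - identities[ranked[1]] >= margin:
        # One clear orthologue keeps the family symbol (the BPIFA1/BPIFA5 case).
        return {name: (family if name == ranked[0] else None) for name in identities}
    # No clear orthologue: append letters in input order (the BPIFA2 case).
    letters = ascii_uppercase[ascii_uppercase.index(first_letter.upper()):]
    if len(identities) > len(letters):
        raise ValueError("not enough letters left for this lineage")
    return {name: f"{family}{letter}" for name, letter in zip(identities, letters)}
```

With illustrative (not measured) identities, `name_paralogues("BPIFA2", {"Psp": 33.0, "Smgb": 31.0}, first_letter="E")` yields BPIFA2E and BPIFA2F, whereas `name_paralogues("BPIFA1", {"Bpifa1": 70.0, "Bpifa5": 45.0})` keeps BPIFA1 for the first protein and flags the second as needing a new family number.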
The methodology used in Figure 2 should allow ready checking of the assignment of new protein sequences. The authors are willing to offer assistance in classifying new protein sequences, and would be grateful to be informed about sequences that are in conflict with the system.
We have presented the BPIFAn/BPIFBn nomenclature as a relatively simple modification of the SPLUNCn/LPLUNCn nomenclature to allow systematic treatment of all known PLUNC branch proteins. We have also assembled a set of 102 sequences that we believe represent the current reliable data for BPIFA/BPIFB proteins across all species, including marsupials and birds. There are currently six BPIFA and nine BPIFB families. The BPIFA2 family is, by a significant margin, the most diverse family, with two lineage-specific duplications, and low pairwise identities with the remaining members. The nomenclature will be implemented by the HGNC, which will also finalize proposals for creating BPIF-style aliases for the members of the wider BPIF superfamily.
We are very grateful to Sven Gorr, Tom Wheeler, Peter Di, Klaus Kopec, Mats Lindahl and Joel Gautron who actively contributed to the nomenclature discussions at the meeting and were subsequently able to comment during the development of this framework.
Proteins with a BPI/LBP/PLUNC-Like Domain: Revisiting the Old and Characterizing the New: A Biochemical Society Focused Meeting held at New Business School, University of Nottingham, U.K., 5–7 January 2011. Organized and Edited by Colin Bingle (Sheffield, U.K.) and Sven-Ulrik Gorr (University of Minnesota School of Dentistry, Minneapolis, MN, U.S.A.).
Abbreviations: BPI, bactericidal/permeability-increasing protein; BSP30, bovine salivary protein 30; CETP, cholesteryl ester-transfer protein; HGNC, HUGO Gene Nomenclature Committee; LBP, lipopolysaccharide-binding protein; PLUNC, palate, lung and nasal epithelium clone; SPLUNCn, short PLUNCn; LPLUNCn, long PLUNCn; PLTP, phospholipid-transfer protein; PSP, parotid secretory protein; SMGB, submandibular gland protein B; VEMSGP, von Ebner minor salivary gland protein
- © The Authors Journal compilation © 2011 Biochemical Society