## Abstract

Many functional proteins do not have well defined folded structures. In recent years, both experimental and computational approaches have been developed to study the conformational behaviour of this type of protein. It has been shown previously that experimental RDCs (residual dipolar couplings) can be used to study the backbone sampling of disordered proteins in some detail. In these studies, the backbone structure was modelled using a common geometry for all amino acids. In the present paper, we demonstrate that experimental RDCs are also sensitive to the specific geometry of each amino acid as defined by energy-minimized internal co-ordinates. We have modified the FM (flexible-Meccano) algorithm that constructs conformational ensembles on the basis of a statistical coil model, to account for these differences. The modified algorithm inherits the advantages of the FM algorithm to efficiently sample the potential energy landscape for coil conformations. The specific geometries incorporated in the new algorithm result in a better reproduction of experimental RDCs and are generally applicable for further studies to characterize the conformational properties of intrinsically disordered proteins. In addition, the internal-co-ordinate-based algorithm is an order of magnitude more efficient, and facilitates side-chain construction, surface osmolyte simulation, spin-label distribution sampling and proline *cis*/*trans* isomer simulation.

**Keywords:**

- coil library
- flexible-Meccano algorithm
- intrinsically disordered protein (IDP)
- molecular dynamics simulation
- residual dipolar coupling (RDC)
- statistical coil model

## Introduction

Approximately 50% of mammalian proteins are predicted to contain long (more than 30 residues) disordered regions, and approximately 25% are predicted to be fully disordered, lacking a well-defined three-dimensional structure under physiological conditions [1]. These so-called IDPs (intrinsically disordered proteins) play key roles in a variety of physiological processes, including signalling, cell cycle control, molecular recognition, transcription and replication, and in the development of neurodegenerative diseases, such as Alzheimer's disease and Parkinson's disease [2,3]. Although these proteins contain only a marginal amount of well-defined structure, their conformational characterization provides insights into the dynamics of their stability, leading to a better understanding of disease-related aggregation and fibrillation. Owing to the structural heterogeneity of IDPs, conventional approaches for structure determination are inappropriate for studying such flexible systems. Novel experimental techniques and computational models are therefore essential for characterizing their rapidly interconverting conformations.

Although MD (molecular dynamics) simulations can provide an atomic-resolution description of the unfolded protein ensemble [4,5], there are still limitations in the currently available potential energy force fields to correctly describe the conformational sampling [6] and timescale of unfolded proteins in solution [7,8]. Alternatively, we have developed a conformational sampling algorithm termed FM (flexible-Meccano) [9,10] based on the so-called statistical coil model [9,11,12]. FM efficiently samples the backbone dihedral angle energy surface (φ/ψ) derived from highly resolved crystallographic structures excluding secondary elements, and constructs conformers with only sequence information. To verify FM-generated models, NMR spectroscopy provides the most informative experimental parameters, at amino-acid-specific resolution. In addition to using regular parameters, such as chemical shifts, scalar couplings, nuclear Overhauser effects and relaxation rates, to characterize the properties of unfolded proteins [13], RDCs (residual dipolar couplings) have also been demonstrated to be extremely useful to describe the unfolded state ensemble [14–19] owing to their sensitivity to local conformational sampling. The distribution of RDCs can be calculated very precisely as ensemble and time averages from the well-understood geometry-dependence of nucleus–nucleus dipolar interactions [20]. Accordingly, our ensemble model was verified by comparing RDCs calculated from ensemble structures and experimental measurements [9,18]. In order to improve agreement between the properties of unfolded proteins and our model, increasing the amount of data available from different types of RDCs is essential. Thus eight types of published experimental RDCs for urea-denatured ubiquitin [18,21], as well as five types of newly measured couplings of urea-denatured Protein G, were used to assist in refining our algorithm. 
In the present paper, we describe a new method, which uses energy-minimized geometry derived from an existing potential energy force field [22], to construct the structural ensemble in a more accurate and efficient way. The predicted RDCs derived from the ensemble generated by the new algorithm improve the agreement with experimental data. This improvement is expected to have direct consequences on the quality of ensembles selected against experimental observables, for example using the ASTEROIDS approach [23–26], from a pool of structures generated using FM.

## Protein purification and NMR measurement

The purification and preparation of denatured ubiquitin and Protein G (8 M urea and 10 mM glycine/HCl buffer, pH 2.5, at 25°C) in isotropic solution or in stretched polyacrylamide gel are described in [18,27]. The RDCs for ubiquitin are taken from our previous publications [18,21]; RDCs for GB1 (the first Ig-binding domain of Protein G) were recorded using the same methods as described in the ubiquitin studies.

## Data analysis

Theoretical RDCs were calculated on the assumption of steric exclusion [28,29] using an efficient in-house-written algorithm. The detailed algorithm is described in [30]. Briefly, the maximal extension of a molecule for each direction of a unit sphere is calculated. The probability for finding the molecule in a certain orientation is then derived as the volume that can be occupied by the molecule between two infinitely extended parallel planes relative to the total distance between the planes. The alignment tensor then corresponds to the average over all orientations of second rank spherical harmonics weighted by this probability. The theoretical RDCs are then calculated from the alignment tensor:
where *D*_{*k*,*ij*} represents the RDC between nuclei *i* and *j* for ensemble member *k* with individual alignment tensor *S*_{*k*,*m*} (written in irreducible form [31]), *Y*_{2*m*} are spherical harmonics, *r*_{*k*,*ij*}, Θ_{*k*,*ij*} and Φ_{*k*,*ij*} are the polar co-ordinates of the internuclear vector, and γ are the nuclear gyromagnetic ratios. The distances used for calculating *D*_{HN} and *D*_{CαHα} are 1.02 and 1.1 Å (1 Å=0.1 nm) respectively according to Bax and co-workers [32], otherwise using values calculated from the co-ordinates. RDCs then were averaged over all members of the ensemble to obtain the predicted value. The size of each ensemble throughout the present paper is 50000.
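
For orientation, one commonly used convention for the relation between the RDC and the irreducible alignment tensor, written with the symbols defined above, is sketched here together with the ensemble average. The prefactor, the normalization of *S*_{*k*,*m*} and the complex conjugate are assumptions taken from general RDC conventions rather than from the present text, and μ_{0} (vacuum permeability), *h* (Planck's constant) and *N*_{ens} (number of ensemble members) are symbols introduced only for this illustration.

$$
D_{k,ij} = -\frac{\gamma_i \gamma_j \mu_0 h}{8\pi^{3} r_{k,ij}^{3}} \sqrt{\frac{4\pi}{5}} \sum_{m=-2}^{2} S_{k,m}^{*}\, Y_{2m}\!\left(\Theta_{k,ij}, \Phi_{k,ij}\right),
\qquad
\langle D_{ij} \rangle = \frac{1}{N_{\mathrm{ens}}} \sum_{k=1}^{N_{\mathrm{ens}}} D_{k,ij}
$$

Other conventions differ from this form by constant factors, which are absorbed when a single scaling factor is fitted to the experimental data.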

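χ^{2} analysis is used to indicate the agreement between experimental and theoretical values. It is defined as:

$$
\chi^{2} = \sum_{i=1}^{N} \frac{\left(D_{i}^{\mathrm{exp}} - D_{i}^{\mathrm{calc}}\right)^{2}}{\sigma_{i}^{2}}
$$

where *D*_{*i*}^{exp} and *D*_{*i*}^{calc} are the experimental and theoretical RDCs, σ_{*i*} is the experimental error, and the summation runs over all *N* observed data.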
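
The χ^{2} comparison can be sketched in a few lines of Python, assuming the conventional definition as a sum of squared residuals weighted by the experimental errors. The function name and the numerical values are hypothetical and serve only as an illustration.

```python
from typing import Sequence


def chi_squared(experimental: Sequence[float],
                predicted: Sequence[float],
                errors: Sequence[float]) -> float:
    """Sum of squared, error-weighted residuals (assumed conventional form)."""
    if not (len(experimental) == len(predicted) == len(errors)):
        raise ValueError("all inputs must have the same length")
    return sum(((e - p) / s) ** 2
               for e, p, s in zip(experimental, predicted, errors))


# Hypothetical RDCs (Hz) for two residues, not data from the paper.
example = chi_squared([1.0, 2.0], [1.5, 1.0], [0.5, 1.0])
print(example)
```

A smaller value indicates better agreement between the ensemble-averaged prediction and the measurement.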

## IC (internal co-ordinate)-based algorithm for constructing unfolded protein ensembles

Geometries, in the form of ICs, were derived from energy-minimized structures in the CHARMM force field topology [22]. Each line in the IC table contains the names of four atoms and three parameters (see Figure 1 for an example). The three parameters are the dihedral angle defined by the four atoms, the angle between the last three atoms and the bond length between the last two atoms respectively. Therefore the co-ordinates of the fourth atom can be derived from the co-ordinates of the first three atoms and these three parameters. Accordingly, the algorithm starts from three seed atoms, Cα(*i*), C(*i*) and N(*i*+1) for residue *i*, which are either taken from a folded domain, in the case of partially folded proteins, or placed in a standard three-atom geometry when constructing a fully unfolded polypeptide chain.
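
The placement of a fourth atom from three known atoms and the three IC parameters can be sketched as follows. This is a generic internal-to-Cartesian conversion written for illustration, not the authors' implementation; the function name `place_fourth_atom`, the helper functions and the toy co-ordinates are assumptions.

```python
import math
from typing import Tuple

Vec = Tuple[float, float, float]


def _sub(u: Vec, v: Vec) -> Vec:
    return (u[0] - v[0], u[1] - v[1], u[2] - v[2])


def _cross(u: Vec, v: Vec) -> Vec:
    return (u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0])


def _unit(u: Vec) -> Vec:
    norm = math.sqrt(u[0] ** 2 + u[1] ** 2 + u[2] ** 2)
    return (u[0] / norm, u[1] / norm, u[2] / norm)


def place_fourth_atom(a: Vec, b: Vec, c: Vec, bond: float,
                      angle_deg: float, dihedral_deg: float) -> Vec:
    """Return d with |c-d| = bond, angle b-c-d = angle_deg and
    dihedral a-b-c-d = dihedral_deg (0 = cis, 180 = trans)."""
    theta = math.radians(angle_deg)
    phi = math.radians(dihedral_deg)
    bc = _unit(_sub(c, b))               # unit vector along the b->c bond
    n = _unit(_cross(_sub(b, a), bc))    # normal to the a-b-c plane
    m = _cross(n, bc)                    # in-plane axis perpendicular to bc
    along = -bond * math.cos(theta)
    in_plane = bond * math.sin(theta) * math.cos(phi)
    out_of_plane = bond * math.sin(theta) * math.sin(phi)
    return tuple(c[i] + along * bc[i] + in_plane * m[i] + out_of_plane * n[i]
                 for i in range(3))


# Toy seed atoms (not real backbone co-ordinates).
a, b, c = (0.0, 1.0, 0.0), (0.0, 0.0, 0.0), (1.0, 0.0, 0.0)
print(place_fourth_atom(a, b, c, 1.0, 90.0, 180.0))  # trans placement
```

Applying this step repeatedly, once per IC line, grows the chain atom by atom from the three seed atoms.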

Each additional residue (*i*) is built consecutively according to the order of the topology file: O(*i*), N(*i*), C(*i*−1), Cα(*i*−1), H_{N}(*i*), Cβ(*i*) and Hα(*i*). While constructing atoms N(*i*) and C(*i*−1), a combination of ψ and φ angles is randomly taken from the coil library database. Once a residue is constructed, an amino-acid-specific sphere [33] is placed at the position of Cβ (or Cα for glycine). If this sphere overlaps with any previously built sphere, the residue is rejected and another φ/ψ angle combination is selected from the database, until a conformation free of steric clashes is found.
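
The build-and-reject loop described above can be illustrated with a deliberately simplified toy model. In this sketch each residue is reduced to a single sphere centre, the two sampled angles are used only as direction angles (a stand-in for the real IC-based construction), and one common minimum separation replaces the amino-acid-specific spheres. The names `toy_next_centre` and `grow_chain`, the 3.8 Å step, the 3.6 Å separation and the random "library" are all assumptions made for illustration.

```python
import math
import random
from typing import List, Sequence, Tuple

Vec = Tuple[float, float, float]


def toy_next_centre(prev: Vec, phi: float, psi: float, step: float = 3.8) -> Vec:
    """Stand-in for residue construction: the angles (degrees) only set a direction."""
    az, pol = math.radians(phi), math.radians(psi)
    return (prev[0] + step * math.sin(pol) * math.cos(az),
            prev[1] + step * math.sin(pol) * math.sin(az),
            prev[2] + step * math.cos(pol))


def grow_chain(n_residues: int, library: Sequence[Tuple[float, float]],
               min_separation: float = 3.6, max_tries: int = 1000,
               seed: int = 0) -> List[Vec]:
    """Add residues one at a time, redrawing phi/psi until no spheres clash."""
    rng = random.Random(seed)
    centres: List[Vec] = [(0.0, 0.0, 0.0)]
    while len(centres) < n_residues:
        for _ in range(max_tries):
            phi, psi = rng.choice(library)           # random draw from the library
            candidate = toy_next_centre(centres[-1], phi, psi)
            if all(math.dist(candidate, c) >= min_separation for c in centres):
                centres.append(candidate)            # accepted: no steric clash
                break
        else:
            raise RuntimeError("no clash-free conformation found")
    return centres


# Hypothetical library of 200 random phi/psi pairs (not a real coil library).
lib_rng = random.Random(1)
toy_library = [(lib_rng.uniform(-180, 180), lib_rng.uniform(-180, 180))
               for _ in range(200)]
chain = grow_chain(10, toy_library)
print(len(chain))
```

Because a rejected residue is simply redrawn rather than triggering a rebuild of the whole chain, the cost per conformer grows slowly with chain length.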

## Difference between the two algorithms

Instead of using peptide planes derived from highly resolved X-ray structures, as is the case in the previous FM algorithm, this new algorithm applies energy-minimized backbone geometry as building blocks, giving specific conformations for different amino acid types. A detailed comparison between the geometries in terms of IC of these two algorithms is listed in Supplementary Table S1 at http://www.biochemsoctrans.org/bst/040/bst0400989add.htm. The most pronounced differences are the angles between atoms Cα, N and H_{N}; for some amino acid types, this can differ from the previous FM model by up to 7°. The tetrahedral angles around Cα, instead of using idealized 109°, range from ~105° to ~114° in the energy-minimized geometry. We note that, although the local geometries in terms of angle and bond length are different between these two algorithms, the radius of gyration (*R*_{g}) and φ/ψ angle distribution generated from them are very similar (Supplementary Figures S1 and S2 at http://www.biochemsoctrans.org/bst/040/bst0400989add.htm), indicating that the new algorithm is not altering the overall geometry or the local sampling.

In addition to the difference between building blocks, the new algorithm is approximately ten times faster than the original version (50000 structures of a 76-amino-acid protein can be created in 2 min on a single Intel 2.8 GHz CPU) mainly because a Levenberg–Marquardt minimization is no longer used to position the peptide plane [34]. We denote this new algorithm FM2 in the present paper for further comparison.

## Using extensive sets of RDCs to verify the computational model

RDCs, which report the time and ensemble average of the nuclear dipole–dipole interaction, provide a quantitative description of the local order in the unfolded state and are therefore probably the most powerful parameters for verifying the conformational sampling of a simulated structural ensemble. It has been demonstrated that RDCs calculated from a statistical coil ensemble can reproduce experimental RDCs in unfolded states reasonably well, as well as identify transient long-range contacts or residual structure in systems that diverge from the unfolded state [9,11,17,19,35]. Owing to the relative difficulty of RDC measurements, most RDCs reported to date for unfolded proteins are limited to a few types, mostly *D*_{HN} and *D*_{CαHα}. In order to extend our understanding of the unfolded protein conformation, we have collected RDCs of up to eight different coupling types for denatured ubiquitin and five different coupling types for denatured Protein G.

The data from ubiquitin have been used previously in comparison with the original FM algorithm to characterize the conformational sampling of this unfolded system, and more generally to determine the precision to which highly flexible proteins can be analysed from RDC data [18]. In the present paper, we repeat this analysis using the new conformational sampling algorithm based on amino-acid-specific ICs. The comparisons of *D*_{HN}, *D*_{CαHα} and *D*_{HNHα-1} from these two algorithms and experimental values for both proteins are shown in Figure 2; the other types of RDCs are shown in Supplementary Figure S3 (at http://www.biochemsoctrans.org/bst/040/bst0400989add.htm). As the overall level of alignment is unknown, the scaling needs to be optimized against the experimental data. In all cases, only one scaling factor is applied, optimized in this case according to *D*_{HN}.
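
The single scaling factor can be illustrated with a short sketch. The text states only that one factor is optimized against *D*_{HN}; the error-weighted least-squares form used here, the function name `optimal_scale` and the numbers are assumptions.

```python
from typing import Sequence


def optimal_scale(experimental: Sequence[float],
                  predicted: Sequence[float],
                  errors: Sequence[float]) -> float:
    """Factor s minimizing sum(((exp - s * pred) / err)**2) (closed form)."""
    numerator = sum(e * p / s ** 2
                    for e, p, s in zip(experimental, predicted, errors))
    denominator = sum(p * p / s ** 2 for p, s in zip(predicted, errors))
    if denominator == 0.0:
        raise ValueError("predicted values are all zero")
    return numerator / denominator


# Hypothetical D_HN values (Hz); the factor fitted here would then be
# applied unchanged to every other coupling type.
d_hn_exp = [2.0, 4.0, -6.0]
d_hn_pred = [1.0, 2.0, -3.0]
scale = optimal_scale(d_hn_exp, d_hn_pred, [1.0, 1.0, 1.0])
d_caha_scaled = [scale * value for value in [0.5, -1.5]]
print(scale, d_caha_scaled)
```

Fitting the factor on one coupling type and reusing it elsewhere means that the remaining coupling types act as an independent check on the ensemble.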

The prediction from FM is reproduced as reported previously [18]. In the previous studies using the FM algorithm, agreement between prediction and experiment was improved by using an additional scaling factor for all H–H RDCs, compared with RDCs measured between covalently bound spins. As a result of the comparison, more extended conformational sampling than that present in the statistical coil was invoked, presumably because of the extension of the chain due to the presence of high concentrations of urea [21,36–38]. Using FM2, the agreement between experimental and predicted RDCs is improved significantly for both *D*_{CαHα} and *D*_{HNHα-1} when the single scaling factor is optimized against *D*_{HN}. This remarkable improvement is due to the differences in angles between the backbone geometries, which, although small, nevertheless have a measurable effect on the ability to reproduce the experimental data compared with the common peptide plane geometry that was used for the previous study. A difference of a few degrees in bond orientation and tetrahedral angle geometry around Cα can significantly change the predicted RDCs, despite the high similarity of overall geometry and local sampling between ensembles generated from these two algorithms. As an example, a 7° difference in Cα-N-H_{N} can change *D*_{HN} by approximately 5% assuming an extended conformation (results not shown).

In the light of these results, we have reassessed our previous conclusion that urea-denatured proteins sample an enhanced population of extended conformations. The predicted values for *D*_{CαHα} in the FM2 case are still found to be overestimated compared with the experimental data, whereas other RDCs are found to be either underestimated or overestimated (Supplementary Table S3). As in the previous study, different ensembles were generated by enhancing the sampling of the extended region (50°<ψ<180° and φ<0° in Ramachandran space) from one to four times in 0.5 increments. In other words, seven ensembles were generated, having 59% (standard), 68.3%, 74.2%, 78.2%, 81.2%, 83.4% and 85.2% of φ/ψ angles in the extended region respectively. The χ^{2} values from the FM2 algorithm for different levels of extension in the case of ubiquitin are shown in Figure 3. This target function converges at approximately 80% of φ/ψ angles distributed in the extended region of Ramachandran space, in excellent agreement with previous results [37]. Crucially, the χ^{2} is smaller in all cases for FM2 ensembles (Supplementary Figure S4 at http://www.biochemsoctrans.org/bst/040/bst0400989add.htm). A comparison of *D*_{HN}, *D*_{CαHα} and *D*_{HNHα-1} predicted from more extended sampling for both proteins (Figure 4) shows that more extended sampling improves the predicted value in both algorithms for *D*_{CαHα}, but not for *D*_{HNHα-1}. In fact, *D*_{HNHα-1} shows less dependency on the level of sampling (Supplementary Figures S4 and S5 at http://www.biochemsoctrans.org/bst/040/bst0400989add.htm). Therefore the improvement in *D*_{CαHα} is attributed not only to backbone geometry, but also to more extended sampling, whereas the improvement in *D*_{HNHα-1} is mainly due to the energy-minimized geometry.
Furthermore, as shown previously [18], the reproduction of the long-range RDCs *D*_{HNHN+1} and *D*_{HNHN+2} by FM2 is significantly improved with more extended sampling (Figure 5), in addition to the fact that the energy-minimized geometry also improves the predicted values (Supplementary Figure S3A).
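
The seven extended-region percentages quoted above are consistent with a simple reweighting of the standard library. Assuming that "one to four times" means a relative sampling weight *w* applied to φ/ψ pairs in the extended region, starting from a standard fraction of 59%, the resulting fraction is 0.59*w*/(0.59*w*+0.41). The sketch below checks this reading; the function names are hypothetical.

```python
def is_extended(phi: float, psi: float) -> bool:
    """Extended region as defined in the text: phi < 0 and 50 < psi < 180 (degrees)."""
    return phi < 0.0 and 50.0 < psi < 180.0


def extended_fraction(base_fraction: float, weight: float) -> float:
    """Fraction of extended phi/psi pairs after up-weighting that region."""
    extended = base_fraction * weight
    return extended / (extended + (1.0 - base_fraction))


weights = [1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
percentages = [round(100.0 * extended_fraction(0.59, w), 1) for w in weights]
print(percentages)
```

Under this assumption the computed values round to 59.0, 68.3, 74.2, 78.2, 81.2, 83.4 and 85.2%, matching the seven ensembles listed in the text.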

We have also repeated two earlier analyses with FM2: the use of the genetic algorithm ASTEROIDS [39] to select FM2-generated ensembles that describe backbone conformational sampling from RDCs, and the prediction of side-chain RDCs for unfolded proteins using three-staggered rotamer populations derived from ^{3}*J*-couplings [27]. The new algorithm improves the results and does not contradict the conclusions of the previous analyses (Supplementary Figures S6 and S7 at http://www.biochemsoctrans.org/bst/040/bst0400989add.htm).

In order to determine how general the improvement from the amino-acid-specific geometry is, we have applied the same approach to GB1. Extensive RDCs were again measured under conditions of urea denaturation (see above). Figure 2(B) shows the comparison of the ability of the two algorithms to reproduce the experimental data, again indicating a better reproduction using the FM2 approach, whereas Figure 4(B) shows the same level of reproduction of the data when the same level of extended conformational sampling is used for GB1 as for ubiquitin. These findings indicate that the results are transferable between the two systems and further underline the remarkable sensitivity of RDCs to the details of local amino acid geometry, as well as to the conformational sampling regime.

## Conclusion

It is now generally accepted that many functional proteins do not have well-defined folded structures. In recent years, both experimental and computational approaches have been developed to study this type of protein [40–42]. On the computational side, several sample-then-select methods have been applied to different biologically important systems to gain structural insight into IDPs, e.g. protein phosphatase 1 regulators [43], α-synuclein [24] and Sic1 protein [44]. Constructing a geometrically correct ensemble of structures for further analysis is critical. In the present paper, we have described a new algorithm to construct such ensembles on the basis of a statistical coil model. This algorithm inherits the advantage of the FM method, which efficiently samples the energy landscape for coil conformations, and combines this with more accurate amino-acid-specific geometries from energy-minimized calculations. This new algorithm results in a better reproduction of experimental RDCs and is generally applicable for further studies to characterize the conformational properties of IDPs. In addition, the IC-based algorithm also facilitates side-chain construction [21], surface osmolyte simulation [21], spin-label distribution sampling, and proline *cis*–*trans* isomer simulation.

## Funding

We acknowledge financial support from FINOVI, a MALZ TAU-STRUCT grant from the Agence Nationale de Recherche (France) and the National Science Council of Taiwan (to J.-r.H.).

## Footnotes

Intrinsically Disordered Proteins: A Biochemical Society Focused Meeting held at University of York, U.K., 26–27 March 2012. Organized and Edited by Jennifer Potts (York, U.K.) and Mike Williamson (Sheffield, U.K.).

**Abbreviations:**
FM, flexible-Meccano;
IC, internal co-ordinate;
IDP, intrinsically disordered protein;
RDC, residual dipolar coupling

- © The Authors Journal compilation © 2012 Biochemical Society