SCIENTIFIC FRONTIER I:
Beyond the Genome

The human genome will be sequenced by the year 2005, and with it the sequences of all human proteins, the proteome, will become available (on the order of 10^5 sequences). Moreover, other genomes are being sequenced, including those of several plants and bacteria. Protein sequences, however, provide little insight into protein function unless there is sequence homology to a protein of known function. The structure of a protein in its active, folded state results from a delicate balance of non-covalent interactions within the protein itself and between the protein and its environment. The energies involved are small and difficult to measure or compute. Consequently, experimental determinations of the exact geometry of an active site, the disposition of solvent molecules, the amplitudes of the molecular motions that bring functional groups into close proximity, and similar details are all critically important to understanding protein function. Thus, the challenge of structurally characterizing the proteome is compounded by the need for many high-resolution structures and for detailed descriptions of dynamics. This information will have a broad impact on fundamental biology, medicine, and biotechnology. Structures will be used to understand the molecular basis of disease, to develop diagnostics and therapies, to understand the molecular basis of toxin action and how to combat it, to design drugs and develop pesticides, to engineer enzymes, to support bioremediation, and to develop industrial catalysts, sensors, and so on.

Over the next decade the already large ratio of known protein sequences to known three-dimensional structures will increase dramatically. Homology modeling will reduce the size of the problem, even though such an approach will not yield high-resolution models of structure or dynamics. The number of distinct protein domain folds is estimated to be between 400 and 8,000, depending in part on the definition of a distinct fold. Generating a representative set of structures that covers all of these folds, so that homology modeling can approximate the remaining structures, will require between 3,000 and 10,000 structures to be determined. Currently, about 400 three-dimensional structures are determined annually by all techniques combined, and only a fraction of these yield new folds. The rate at which new structures are generated must accelerate even to begin to keep up with the sequence information being generated now.
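The scale of this gap can be made concrete with a small back-of-the-envelope calculation. The Python sketch below is only an illustration: it takes the figures quoted above (3,000 to 10,000 structures needed, about 400 determined per year) and assumes a constant annual rate. The function name and the new_fold_fraction parameter are hypothetical; the text says only that "a fraction" of structures give new folds, so the value 0.5 used here is a placeholder, not a measured number.

```python
def years_to_cover(target_structures, structures_per_year, new_fold_fraction=1.0):
    """Years needed to reach target_structures at a constant annual rate.

    new_fold_fraction is a hypothetical parameter: the share of each year's
    structures that actually contributes to the representative set.
    """
    if structures_per_year <= 0:
        raise ValueError("structures_per_year must be positive")
    if not 0 < new_fold_fraction <= 1:
        raise ValueError("new_fold_fraction must be in (0, 1]")
    return target_structures / (structures_per_year * new_fold_fraction)


ANNUAL_RATE = 400                   # structures per year, from the text
LOW_TARGET, HIGH_TARGET = 3_000, 10_000  # required structures, from the text

# Optimistic case: every structure determined counts toward the set.
best_case = years_to_cover(LOW_TARGET, ANNUAL_RATE)
worst_case = years_to_cover(HIGH_TARGET, ANNUAL_RATE)

# Illustrative case: only half of the structures are useful (assumed value).
half_useful = years_to_cover(LOW_TARGET, ANNUAL_RATE, new_fold_fraction=0.5)

print(f"All structures useful: {best_case:.1f} to {worst_case:.1f} years")
print(f"Half useful, low target: {half_useful:.1f} years")
```

Even under the optimistic assumption that every structure counts, the current rate implies a span of years to decades, which is the sense in which the rate must accelerate.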

NMR is playing an increasingly important role in characterizing the structure and dynamics of proteins, complementing results obtained by other techniques, most notably X-ray crystallography. In 1995 about one quarter of the new structures came from NMR studies. These studies yield time-averaged representations of the molecules in aqueous solution, often at physiological temperatures, conditions that are arguably closer to those of the native functional state than the conditions in a crystal.

In addition, NMR can provide information on the hydration of biomolecules, which is critical for maintaining the functional state. It can also provide unique insights into the stabilities of different regions of the protein, as well as information on the nature of unstructured states and the conformational ensembles of inherently flexible structures. Thus, NMR is ideally suited to the study of protein folding, considered by many to be the major unsolved problem in structural biology and perhaps holding the keys to protein structure prediction.

NMR has tremendous potential for providing a structural and dynamic basis for understanding how a protein sequence translates to a folded conformation, but some aspects of protein function are truly beyond what is directly coded by the genome. A great many proteins are post-translationally modified in some way, whether for signaling, for modifying stability or activity, or for marking proteins for disposal. These modifications, including phosphorylation and glycosylation, can be monitored by NMR to achieve a functional understanding of their role.

Even though membrane proteins represent 30% of the proteome, relatively little is known about their structure because of their resistance to crystallization. Solid-state NMR can yield time-averaged structures of proteins in the fluid membrane environment, the milieu that is critically important for the function of membrane proteins. The precise distance and orientational constraints provided by solid-state techniques yield high-resolution structures for this important class of proteins. In addition, with the recent development of transverse relaxation-optimized spectroscopy (TROSY), it may be possible to use solution methods and GHz NMR fields to solve the structures of membrane proteins reconstituted into small vesicles. While the limits of this approach are not yet defined, success in this arena would constitute a major breakthrough.

NMR is also uniquely suited to determining both the rates and types of molecular motion in solution (where isotropic global motions occur) and in the solid state (characterized by anisotropic global motions). In the search for correlations between dynamics and function, it is important to know not only the time scale of the motion but also other details, such as the axis about which the motion occurs, its amplitude, whether the process is diffusional or discontinuous, and whether motions at adjacent sites are correlated. NMR can provide this information in exquisite detail so that unifying functional correlations can be elucidated.

The limitations of solution NMR spectroscopy as a method for macromolecular structure determination are steadily receding: the molecular weight limit for complete structures is approaching 50,000 daltons, concentrations as low as 200 micromolar are analyzable, and the quality of NMR structures is continuously improving. Great enhancements over this performance are just on the horizon. Residual dipolar interactions in weakly aligned systems have provided the first absolute (i.e., relative to the laboratory reference frame) structural constraints for solution NMR. The structural dependence of isotropic chemical shifts and other NMR observables can be calculated through quantum chemical methods better today than ever before, leading to improved structural constraints. Such improvements will increase the precision and accuracy of structures determined by solution NMR. In addition to improving the structural constraints, higher magnetic fields, such as those proposed for the NMRC, will take optimal advantage of TROSY experiments for improved resolution, extending NMR to backbone structures of proteins with molecular weights perhaps as high as 100,000 daltons. New approaches to selective isotope labeling, which take advantage of metabolic pathways in the cell cultures producing protein samples or of novel chemical procedures such as stereospecific deuteration, will improve resonance selection. Moreover, new methods in molecular biology for splicing protein fragments are becoming available, so that labeled domains can be placed in a natural-abundance background, creating opportunities to focus structural efforts on specific domains within larger structures.

Higher fields, high-temperature superconducting probes, and low-temperature coils and preamplifiers are all leading to major improvements in sensitivity. Presently, protein concentrations of hundreds of micromolar are needed for NMR experiments. This solubility requirement is a limiting factor for many solution NMR studies and is analogous to the crystallographer's problem of obtaining suitable crystals. The developments mentioned above, in combination with larger sample volumes, will reduce the protein concentrations needed to tens of micromolar, and the aggregation problems that have led to "poorly behaved" samples for NMR characterization will become less frequent.
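Why sensitivity gains matter so much for lowering concentration can be illustrated with a short sketch. It relies on two standard signal-averaging relations that are not stated in this report and are used here as simplifying assumptions: signal-to-noise per scan is proportional to sample concentration, and signal-to-noise grows with the square root of acquisition time. The function name, the 300 micromolar starting point (one example of "hundreds of micromolar"), and the tenfold sensitivity gain are all hypothetical.

```python
def acquisition_time_factor(conc_old_um, conc_new_um, sensitivity_gain=1.0):
    """Factor by which acquisition time must change to keep the same S/N.

    Assumes S/N per scan is proportional to concentration times the
    instrument sensitivity gain, and that S/N grows as sqrt(time).
    """
    if conc_old_um <= 0 or conc_new_um <= 0 or sensitivity_gain <= 0:
        raise ValueError("all arguments must be positive")
    # Relative S/N per unit time after the change.
    snr_ratio = (conc_new_um / conc_old_um) * sensitivity_gain
    # Time scales with the inverse square of the per-scan S/N.
    return 1.0 / snr_ratio ** 2


# Tenfold dilution with unchanged hardware (illustrative concentrations).
no_gain = acquisition_time_factor(300.0, 30.0)

# Same dilution with a hypothetical tenfold sensitivity improvement.
with_gain = acquisition_time_factor(300.0, 30.0, sensitivity_gain=10.0)

print(f"Time factor without hardware gains: {no_gain:.0f}x")
print(f"Time factor with a 10x sensitivity gain: {with_gain:.1f}x")
```

Under these assumptions a tenfold drop in concentration costs roughly a hundredfold in measurement time, so working at tens of micromolar is practical only if the hardware and sample-volume improvements recover most of that factor.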

With the implementation of an NMR Collaboratorium operating over the Next-Generation Internet, most of the work of spectroscopic assignment and data analysis could be carried out by individual research groups at remote locations. The development of automated assignment software, and indeed of complete turnkey packages for both assignment and three-dimensional structure analysis, would give relatively non-expert personnel access to the most advanced NMR technology. As has been the experience of the synchrotron community (Appendix II), this increased access will increase the rate at which protein structures are solved using NMR techniques. The simultaneous development of high-resolution protein structures and improved computational techniques for structure refinement can be expected to result in a new generation of structures with much higher information content. These will then form the basis for better calculations aimed at a more fundamental understanding of how proteins function.

Thus, we foresee that NMR spectroscopy will play an increasingly important role in protein structure determination, both as a major contributor to the development of the protein structure database and as a source of unique information about protein structure and dynamics, leading to an understanding of both mechanistic and kinetic functional attributes. Membrane protein structure and dynamics will be determined in both micellar and lipid bilayer environments. The next-generation GHz (and higher) NMR technology, when available to the broad research community, will significantly increase the impact of NMR in structural biology research beyond the genome.