WO2008118545A2 - Methods for generating novel stabilized proteins - Google Patents

Methods for generating novel stabilized proteins

Info

Publication number
WO2008118545A2
Authority
WO
WIPO (PCT)
Prior art keywords
seq
segment
sequence
chimeras
sequences
Prior art date
Application number
PCT/US2008/053344
Other languages
French (fr)
Other versions
WO2008118545A3 (en)
Inventor
Frances H. Arnold
Yougen Li
Original Assignee
The California Institute Of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from US12/024,515 external-priority patent/US20080248545A1/en
Application filed by The California Institute Of Technology filed Critical The California Institute Of Technology
Publication of WO2008118545A2 publication Critical patent/WO2008118545A2/en
Publication of WO2008118545A3 publication Critical patent/WO2008118545A3/en

Classifications

    • C: CHEMISTRY; METALLURGY
    • C12: BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12N: MICROORGANISMS OR ENZYMES; COMPOSITIONS THEREOF; PROPAGATING, PRESERVING, OR MAINTAINING MICROORGANISMS; MUTATION OR GENETIC ENGINEERING; CULTURE MEDIA
    • C12N9/00: Enzymes; Proenzymes; Compositions thereof; Processes for preparing, activating, inhibiting, separating or purifying enzymes
    • C12N9/0004: Oxidoreductases (1.)
    • C12N9/0071: Oxidoreductases (1.) acting on paired donors with incorporation of molecular oxygen (1.14)
    • C12N9/0077: Oxidoreductases (1.) acting on paired donors with incorporation of molecular oxygen (1.14) with a reduced iron-sulfur protein as one donor (1.14.15)

Landscapes

  • Life Sciences & Earth Sciences (AREA)
  • Chemical & Material Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Genetics & Genomics (AREA)
  • Organic Chemistry (AREA)
  • Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Zoology (AREA)
  • Wood Science & Technology (AREA)
  • Molecular Biology (AREA)
  • Microbiology (AREA)
  • Biotechnology (AREA)
  • Biomedical Technology (AREA)
  • Biochemistry (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Medicinal Chemistry (AREA)
  • Peptides Or Proteins (AREA)

Abstract

The disclosure provides methods for identifying and producing stabilized chimeric proteins.

Description

METHODS FOR GENERATING NOVEL STABILIZED PROTEINS
CROSS REFERENCE TO RELATED APPLICATIONS
[0001] The application claims priority under 35 U.S.C. §119 to 60/900,229, filed February 8, 2007, and 60/918,528, filed March 16, 2007. The application also claims priority to U.S. Patent Application No. 12/024,515, filed February 1, 2008. The disclosures of these applications are incorporated herein by reference.
STATEMENT REGARDING FEDERALLY SPONSORED RESEARCH
[0002] The U.S. Government has certain rights in this invention pursuant to Grant No. GM068664 awarded by the National Institutes of Health and Grant No. DAAD19-03-0D-0004 awarded by ARO - US Army Robert Morris Acquisition Center.
FIELD OF THE INVENTION
[0003] The invention relates to biomolecular engineering and design, including methods for the design and engineering of biopolymers such as proteins and nucleic acids.
BACKGROUND
[0004] A repertoire of stable proteins that can be further refined for research, industry and medical use is important.
SUMMARY
[0005] The disclosure provides a polypeptide comprising sequences from CYP102A1, CYP102A2 or CYP102A3 and having the general structure from N-terminus to C-terminus: [segment 1]-[segment 2]-[segment 3]-[segment 4]-[segment 5]-[segment 6]-[segment 7]-[segment 8] wherein:
segment 1 is from about amino acid residue 1 to about x1 of SEQ ID NO:1 ("1"), SEQ ID NO:2 ("2") or SEQ ID NO:3 ("3");
segment 2 is from about amino acid residue x1 to about x2 of SEQ ID NO:1 ("1"), SEQ ID NO:2 ("2") or SEQ ID NO:3 ("3");
segment 3 is from about amino acid residue x2 to about x3 of SEQ ID NO:1 ("1"), SEQ ID NO:2 ("2") or SEQ ID NO:3 ("3");
segment 4 is from about amino acid residue x3 to about x4 of SEQ ID NO:1 ("1"), SEQ ID NO:2 ("2") or SEQ ID NO:3 ("3");
segment 5 is from about amino acid residue x4 to about x5 of SEQ ID NO:1 ("1"), SEQ ID NO:2 ("2") or SEQ ID NO:3 ("3");
segment 6 is from about amino acid residue x5 to about x6 of SEQ ID NO:1 ("1"), SEQ ID NO:2 ("2") or SEQ ID NO:3 ("3");
segment 7 is from about amino acid residue x6 to about x7 of SEQ ID NO:1 ("1"), SEQ ID NO:2 ("2") or SEQ ID NO:3 ("3"); and
segment 8 is from about amino acid residue x7 to about x8 of SEQ ID NO:1 ("1"), SEQ ID NO:2 ("2") or SEQ ID NO:3 ("3");
wherein:
x1 is residue 62, 63, 64, 65 or 66 of SEQ ID NO:1, or residue 63, 64, 65, 66 or 67 of SEQ ID NO:2 or SEQ ID NO:3;
x2 is residue 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131 or 132 of SEQ ID NO:1, or residue 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, or 133 of SEQ ID NO:2 or SEQ ID NO:3;
x3 is residue 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, or 177 of SEQ ID NO:1, or residue 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, or 178 of SEQ ID NO:2 or SEQ ID NO:3;
x4 is residue 214, 215, 216, 217 or 218 of SEQ ID NO:1, or residue 215, 216, 217, 218 or 219 of SEQ ID NO:2 or SEQ ID NO:3;
x5 is residue 266, 267, 268, 269 or 270 of SEQ ID NO:1, or residue 268, 269, 270, 271 or 272 of SEQ ID NO:2 or SEQ ID NO:3;
x6 is residue 326, 327, 328, 329 or 330 of SEQ ID NO:1, or residue 328, 329, 330, 331 or 332 of SEQ ID NO:2 or SEQ ID NO:3;
x7 is residue 402, 403, 404, 405 or 406 of SEQ ID NO:1, or residue 404, 405, 406, 407 or 408 of SEQ ID NO:2 or SEQ ID NO:3; and
x8 is an amino acid residue corresponding to the C-terminus of the heme domain of CYP102A1, CYP102A2 or CYP102A3 or the C-terminus of SEQ ID NO:1, SEQ ID NO:2 or SEQ ID NO:3;
wherein the polypeptide has a CO-binding peak at 450 nm; and wherein the general structure is selected from the group consisting of: 11311333, 11312331, 11312333, 21111133, 21111213, 21111231, 21111233, 21111311, 21111313, 21111331,
21111332, 21112131, 21112133, 21112211, 21112213, 21112231, 21112313, 21112323, 21113213, 21113231, 21113233, 21113311, 21113313, 21113331, 21113332, 21211133, 21211213, 21211231, 21211233, 21211311, 21211313, 21211331, 21211332, 21211333, 21212131, 21212211, 21212232, 21212311, 21212313, 21212323, 21212331, 21213213, 21213233, 21213311, 21213313, 21213331, 21311113, 21311123, 21311131, 21311132, 21311133, 21311211, 21311212, 21311213, 21311221, 21311232, 21311312, 21311321, 21311322, 21311323, 21311332, 21312113, 21312131, 21312132, 21312221, 21312232, 21312312, 21313113, 21313131, 21313132, 21313133, 21313211, 21313212, 21313213, 21313223, 21313232, 21313321, 21313323, 21313332, 21321333, 21322313, 21322333, 21331331, 21332213, 21332311, 21332313, 22111233, 22111313, 22111331, 22111333, 22112213, 22112231, 22112311, 22112313, 22112332, 22113331, 22113333, 22211231, 22211233, 22211313, 22211333, 22212213, 22212231, 22212233, 22212311, 22212313, 22212331, 22212332, 22213313, 22213331, 22213333, 22311111, 22311113, 22311131, 22311132, 22311133, 22311211, 22311213, 22311221, 22311223, 22311232, 22311311, 22311312, 22311313, 22311321, 22311323, 22312113, 22312131, 22312212, 22312213, 22312321, 22312323, 22313113, 22313131, 22313133, 22313211, 22313213, 22313223, 22313311, 22313312, 22313313, 22313321, 22322333, 22332331, 22332333, 22333333, 23311231, 23311331, 23311333, 23312133, 23312213, 23312231, 23312233, 23312313, 23312331, 23312332, 23312333, 23313313, 23313331, 31311331, 31311333, 31312313, 31312331, 31313331, 32112231 and 32311333. In one embodiment, the polypeptide comprises a heme domain and the heme domain is fused to a functional reductase domain having at least 50% identity to the reductase domain of SEQ ID NO:1, 2, or 3.
[0006] The disclosure also provides a polypeptide having the general structure from N-terminus to C-terminus: [segment 1]-[segment 2]-[segment 3]-[segment 4]-[segment 5]-[segment 6]-[segment 7]-[segment 8] wherein segment 1 comprises at least 50-100% identity to the sequence of SEQ ID NO:4, 5, or 6; wherein segment 2 comprises at least 50-100% identity to the sequence of SEQ ID NO:7, 8, or 9; wherein segment 3 comprises at least 50-100% identity to the sequence of SEQ ID NO:10, 11 or 12; segment 4 comprises at least 50-100% identity to the sequence of SEQ ID NO:13, 14, or 15; segment 5 comprises at least 50-100% identity to the sequence of SEQ ID NO:16, 17, or 18; segment 6 comprises at least 50-100% identity to the sequence of SEQ ID NO:19, 20, or 21; segment 7 comprises at least 50-100% identity to the sequence of SEQ ID NO:22, 23, or 24; and segment 8 comprises at least 50-100% identity to a sequence of SEQ ID NO:25, 26, or 27, and wherein the polypeptide has a CO-binding peak at 450 nm. In one aspect, the polypeptide comprises a heme domain and the heme domain is fused to a functional reductase domain having at least 50% identity to the reductase domain of SEQ ID NO:1, 2, or 3.
[0007] Also provided is a polynucleotide encoding a polypeptide as set forth herein and above. The polynucleotide can be contained in a vector, such as an expression vector, or in a host cell (either as part of the genome or within a vector within the host cell).
[0008] The disclosure also provides an enzyme extract comprising a polypeptide produced from the host cell of the disclosure.
[0009] The disclosure provides chimeric polypeptides of P450. The disclosure provides a polypeptide comprising sequences from CYP102A1 ("1"), A2 ("2") and A3 ("3") having the general formula 11311333, 11312331, 11312333, 21111133, 21111213, 21111231, 21111233, 21111311, 21111313, 21111331, 21111332, 21112131, 21112133, 21112211, 21112213, 21112231, 21112313, 21112323, 21113213, 21113231, 21113233, 21113311, 21113313, 21113331, 21113332, 21211133, 21211213, 21211231, 21211233, 21211311, 21211313, 21211331, 21211332, 21211333, 21212131, 21212211, 21212232, 21212311, 21212313, 21212323, 21212331, 21213213, 21213233, 21213311, 21213313, 21213331, 21311113, 21311123, 21311131, 21311132, 21311133, 21311211, 21311212, 21311213, 21311221, 21311232, 21311312, 21311321, 21311322, 21311323, 21311332, 21312113, 21312131, 21312132, 21312221, 21312232, 21312312, 21313113, 21313131, 21313132, 21313133, 21313211, 21313212, 21313213, 21313223, 21313232, 21313321, 21313323, 21313332, 21321333, 21322313, 21322333, 21331331, 21332213, 21332311, 21332313, 22111233, 22111313, 22111331, 22111333, 22112213, 22112231, 22112311, 22112313, 22112332, 22113331, 22113333, 22211231, 22211233, 22211313, 22211333, 22212213, 22212231, 22212233, 22212311, 22212313, 22212331, 22212332, 22213313, 22213331, 22213333, 22311111, 22311113, 22311131, 22311132, 22311133, 22311211, 22311213, 22311221, 22311223, 22311232, 22311311, 22311312, 22311313, 22311321, 22311323, 22312113, 22312131, 22312212, 22312213, 22312321, 22312323, 22313113, 22313131, 22313133, 22313211, 22313213, 22313223, 22313311, 22313312, 22313313, 22313321, 22322333, 22332331, 22332333, 22333333, 23311231, 23311331, 23311333, 23312133, 23312213, 23312231, 23312233, 23312313, 23312331, 23312332, 23312333, 23313313, 23313331, 31311331, 31311333, 31312313, 31312331, 31313331, 32112231 and 32311333, from N-terminus to C-terminus; wherein the polypeptide has a CO-binding peak at 450 nm.
In one aspect, the polypeptide comprises a first peptide segment comprising about 64 to 68 amino acids having at the C-terminus of the first peptide segment a sequence E(S or E or K)RFD (SEQ ID NO:29), a second peptide segment comprising about 56 to 60 amino acids having at the C-terminus of the second peptide segment a sequence K(G or D)YH(A or E or S) (SEQ ID NO:30), a third peptide segment comprising about 42 to 46 amino acids having at the C-terminus of the third peptide segment a sequence GFNYR (SEQ ID NO:31), a fourth peptide segment comprising about 48 to 52 amino acids having at the C-terminus of the fourth peptide segment a sequence (D or S)LVD(K or S or R) (SEQ ID NO:32), a fifth peptide segment comprising about 50 to 54 amino acids having at the C-terminus of the fifth peptide segment a sequence HETTS (SEQ ID NO:33), a sixth peptide segment comprising about 58 to 62 amino acids having at the C-terminus of the sixth peptide segment a sequence PTAPA (SEQ ID NO:34), a seventh peptide segment comprising about 74 to 78 amino acids having at the C-terminus of the seventh peptide segment a sequence G(Q or M)QFA (SEQ ID NO:35), and an eighth peptide segment comprising a sequence extending from the C-terminus of the seventh peptide segment to the C-terminus of the heme domain of a P450 BM3 of SEQ ID NO:1, 2, or 3, or the C-terminus of a P450 BM3 of SEQ ID NO:1, 2, or 3, and wherein the polypeptide has a CO-binding peak at 450 nm.
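The eight-digit notation used above can be read mechanically: each digit names the parent that supplies the corresponding segment, from N-terminus to C-terminus. The following is a minimal sketch of that reading, assuming nothing beyond the digit-to-parent mapping stated in the text; the function and constant names are hypothetical and chosen only for illustration.

```python
from itertools import product

# Digit-to-parent mapping as stated in the text ("1", "2", "3").
PARENTS = {"1": "CYP102A1", "2": "CYP102A2", "3": "CYP102A3"}
N_SEGMENTS = 8


def parse_chimera(code):
    """Return the parent supplying each of the eight segments, N- to C-terminus."""
    if len(code) != N_SEGMENTS or any(c not in PARENTS for c in code):
        raise ValueError(f"expected {N_SEGMENTS} digits drawn from 1, 2, 3: {code!r}")
    return [PARENTS[c] for c in code]


def library_size():
    """Count every possible eight-segment, three-parent combination (parents included)."""
    return sum(1 for _ in product(PARENTS, repeat=N_SEGMENTS))


segments = parse_chimera("11311333")  # first general structure listed above
for number, parent in enumerate(segments, start=1):
    print(f"segment {number}: {parent}")
print("library size:", library_size())
```

The library size computed this way is 3 to the power 8, which matches the 6,561-member synthetic library mentioned in the description of Figure 5.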
[0010] The sequences and thermostabilities of cytochrome P450 proteins assembled by structure-guided SCHEMA recombination were determined in order to identify relationships that would allow prediction of the stabilities of untested sequences. The disclosure shows that a chimera's thermostability can be predicted from the additive contributions of sequence fragments. Those contributions can be determined either by linear regression of stability-sequence data or, with less accuracy, from the frequencies with which the specific sequence fragments appear in folded vs. unfolded chimera populations. Using these observations as the basis for predicting highly stable sequences, a diverse family of 40 thermostable cytochrome P450s was constructed whose half-lives of inactivation at 57 °C are as much as 100 times that of the most stable parent. Differing from any known natural P450 by up to 100 amino acid substitutions and from one another by as many as 88, the stable P450s are diverse, yet still retain catalytic activity. Some are significantly more active than the parent enzymes towards a nonnatural substrate, 2-phenoxyethanol. This stabilized protein family provides a unique ensemble for biotechnological applications and for studying sequence-stability-function relationships.
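The additive model described in paragraph [0010] can be illustrated with a small sketch. The encoding below (an intercept plus one indicator per segment for parents A1 and A3, with A2 taken as the reference parent), the helper names, and the synthetic noise-free data are all assumptions made for illustration and are not taken from the disclosure. The hand-written solver only keeps the example free of dependencies; a numerical library would normally be used instead.

```python
import random

N_SEGMENTS = 8
OTHERS = ("1", "3")  # assumption: parent "2" (A2) is the reference


def encode(code):
    """Intercept plus one 0/1 indicator per (segment, non-reference parent)."""
    row = [1.0]
    for digit in code:
        row.extend(1.0 if digit == p else 0.0 for p in OTHERS)
    return row


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def fit(codes, t50s):
    """Ordinary least squares via the normal equations (X^T X) w = X^T y."""
    rows = [encode(c) for c in codes]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, t50s)) for i in range(k)]
    return solve(xtx, xty)


def predict(weights, code):
    """Predicted T50 is the sum of the contributions of the fragments present."""
    return sum(w * x for w, x in zip(weights, encode(code)))


# Synthetic demonstration: made-up contributions, noise-free "measurements".
rng = random.Random(0)
true_w = [50.0] + [rng.uniform(-3.0, 3.0) for _ in range(2 * N_SEGMENTS)]
codes = ["".join(rng.choice("123") for _ in range(N_SEGMENTS)) for _ in range(200)]
t50s = [predict(true_w, c) for c in codes]
weights = fit(codes, t50s)
max_err = max(abs(a - b) for a, b in zip(weights, true_w))
print("largest coefficient error:", max_err)
```

Because the synthetic data are exactly additive, the fit recovers the made-up contributions up to rounding; real T50 measurements would carry experimental noise and leave residuals.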
BRIEF DESCRIPTION OF THE FIGURES
[0011] Figure 1 shows thermostabilities of parental and chimeric cytochromes P450. The distribution of T50 values for 185 chimeric cytochromes P450 has an average of 50.4 °C and a standard deviation of 4.5 °C. Thermostabilities for parents A1, A2 and A3 are indicated (solid lines), with four experimental replicate measurements for A2 to examine measurement variability (dotted lines, standard deviation of 1.0 °C).
[0012] Figure 2A-B shows that sequence elements contribute additively to thermostabilities of chimeric cytochromes P450. a, Predicted T50 from a simple linear model correlates with the measured T50 for 185 P450 chimeras with r = 0.857. b, Linear model derived from the data in panel a accurately predicts stabilities of 20 additional chimeras, including the most-stable P450 (MTP) (top rightmost point).
[0013] Figure 3A-B shows that relative frequencies of sequence elements among folded chimeras correlate with relative stability contributions. a, Thermostability contributions of fragments from parents A1 and A3 relative to those from parent A2, obtained by linear regression analysis of 205 folded chimeras with measured T50. b, Frequencies of fragments from parents A1 and A3 relative to those from parent A2 among folded chimeras. c, Relative fragment thermostability contributions correlate with their relative frequencies among folded chimeras.
[0014] Figure 4A-D shows chimera thermostabilities and folding status predicted from sequence element frequencies in multiple sequence alignments of folded and unfolded proteins. a, Consensus energies computed from Boltzmann statistics and fragment frequencies of folded chimeras correlate with measured thermostabilities (T50s). b, The distribution of consensus energies of 620 folded chimeras and 335 unfolded chimeras. Folded chimeras (dark grey) have lower consensus energies than unfolded chimeras (light grey). Overlap region is shown. The consensus energies were calculated as in a. Distributions are histogram densities. c, Consensus energies computed from Boltzmann statistics and fragment frequencies using folded and unfolded chimeras correlate with measured thermostabilities (T50). d, Folded chimeras (dark grey) have lower consensus energies than unfolded chimeras (light grey). Overlap region is shown. Consensus energies were calculated as in c.
[0015] Figure 5A-D shows linear regression analysis of protein stability. a. Predicted T50 compared to experimental T50 for the training data set. The r value for the regression line is 0.901. Squares represent outlier points removed after training. b. Predicted T50 compared to measured T50 for the test data set. The r value for the regression line is 0.856. c. Prediction accuracy (indicated by the correlation coefficient between predicted T50 and measured T50) depends on the number of chimeras used for regression analysis. d. Prediction of T50s of 6,561 members of the synthetic protein library.
[0016] Figure 6 shows prediction accuracy (indicated by the Spearman rank-order correlation coefficient between predicted consensus energies and measured T50) is related to the number of chimeras used for consensus analysis.
[0017] Figure 7A-B shows sequence diversity for 40 stable chimeric cytochrome P450 heme domains and the three parent sequences. a. The number of amino acid differences between each pair of chimeras (black) and for parent-chimera pairs (grey). Pairwise sequence differences range from 7 to 167 amino acids. b. It is not possible to create a two-dimensional illustration with all chimera-chimera Euclidean distances perfectly proportional to the underlying sequence differences. A multi-dimensional scaling (XGOBI) was used to optimize a two-dimensional representation that minimizes the discrepancy between the Euclidean distances and the sequence differences.
[0018] Figure 8 shows a comparison of the ranking performance using regression (circles) to the ranking performance using consensus (filled circles). The points represent the performance of each ranking method when partitioning the set of three parents and 205 chimeras with measured T50 values into the top 10, 20, 30, ..., 200. For example, the y-positions of the leftmost points indicate that the consensus method correctly flags 4 of the top 10 chimeras while the regression method correctly flags 6. The x-positions of the leftmost points indicate that the consensus method correctly flags 96 of the bottom 99 chimeras while the regression method correctly flags 97. The regression model has superior ranking performance for all threshold choices.
[0019] Figure 9 depicts the sequence domains.
[0020] Figure 10 shows the amino acid sequence for CYP102A1.
[0021] Figure 11 shows the amino acid sequence for CYP102A2.
[0022] Figure 12 shows the amino acid sequence for CYP102A3.
[0023] Figure 13A-B shows an alignment of SEQ ID NOs: 1-3.
[0024] Figure 14 shows chemical structures and abbreviations. Substrates are grouped according to the pairwise correlations. Members of a group are highly correlated; intergroup correlations are low.
[0025] Figure 15 shows a summary of normalized activities for all 56 enzymes acting on 11 substrates. Activities are shown using a color scale (white indicating highest and black lowest activity), with columns representing substrates and rows representing proteins. Not-analyzed A3, A3-R1 and A3-R2 proteins are shown in grey. Protein rows are ordered by their chimeric sequence first, and then by heme domain (R0) and R1-, R2- and R3-fusions.
[0026] Figure 16A-D shows substrate-activity profiles for parent heme domain mono- and peroxygenases. Panel (A) shows parent peroxygenase profiles, panel (B) parent holoenzyme monooxygenase profiles, panel (C) the A1 protein set and panel (D) the A2 protein set. In (A) and (B) the origin of the heme domain (A1 ("1"), A2 ("2") and A3 ("3")) is indicated. The protein set in panel (C) includes the heme domain A1 or its R1-, R2- or R3-fusion protein. Panel (D) depicts the A2 protein set.
[0027] Figure 17A-F shows that K-means clustering analysis separates chimeras into five clusters. All protein-activity profiles are depicted in (A). Panels (B) through (F) show profiles for sequences within each cluster. Panel (B) depicts 32312333-R1/R2, 32313233-R1/R2. Panel (C) depicts 22213132-R2, 21313111-R3, 21313311-R3. Panel (D) depicts A1-R1/R2, 12112333-R1/R2, 11113311-R1/R2 and 22213132-R1. Panel (E) depicts 21313111-R1/R2, 22313233-R2, 22312333-R2, 32312231-R2, 32312333-R0, 32312333-R3, 32313233-R0, and 32313233-R3. Panel (F) depicts the remaining sequences.
[0028] Figure 18 shows the interface between the FMN backbone and heme domain based on the 1BVY structure. Residue indicate the degree of conservation. Hydrogen bonds are shown as dashed lines. The amino acids correspond to CYP102A1 numbering.
[0029] Figure 19A-P shows substrate-activity profiles of all chimeras. The columns are coded as follows from front to back: heme domain (R0, front), R1-, R2-, R3-fusion protein.
[0030] Figure 20A-B shows examples of the correlation of absorbance values measured within substrate Group A and Group B. Panel (A) shows the correlation between diphenyl ether (DP) and ethyl phenoxyacetate (PA) with R² = 0.71. Panel (B) shows the correlation between tolbutamide (TB) activity and chlorzoxazone (CH) activity with R² = 0.94.
DETAILED DESCRIPTION
[0031] As used herein and in the appended claims, the singular forms "a," "an," and "the" include plural referents unless the context clearly dictates otherwise. Thus, for example, reference to "a domain" includes a plurality of such domains and reference to "the protein" includes reference to one or more proteins, and so forth.
[0032] Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure belongs. Although methods and materials similar or equivalent to those described herein can be used in the practice of the disclosed methods and compositions, the exemplary methods, devices and materials are described herein.
[0033] The publications discussed above and throughout the text are provided solely for their disclosure prior to the filing date of the present application. Nothing herein is to be construed as an admission that the inventors are not entitled to antedate such disclosure by virtue of prior disclosure.
[0034] Proteins fold into native structures determined by their amino acid sequences and thereby become biologically active. Stability of the native structure therefore plays a vital role in function, and also in protein turnover, genetic diseases, mutational tolerance, functional evolvability, and even the rate of evolution. Proteins with enhanced stability are of significant benefit in industrial applications, where they are better suited to formulation, long-term storage, and extended use in non-natural environments such as elevated temperature. Stabilized proteins are also better starting points for engineering, because their enhanced robustness to mutations makes it easier for them to acquire new functional properties.
[0035] The versatile cytochrome P450 family of heme-containing redox enzymes hydroxylates a wide range of substrates to generate products of significant medical and industrial importance. A particularly well-studied member of this diverse enzyme family, cytochrome P450 BM3 (CYP102A1, or "A1"; SEQ ID NO:1; see also GenBank Accession No. J04832, which is incorporated herein by reference) from Bacillus megaterium, has been engineered extensively for biotechnological applications that include fine chemical synthesis and producing human metabolites of drugs.
[0036] The disclosure demonstrates a new approach to making highly stable, functional proteins with diverse sequences by predicting the stable chimeras in a site-directed SCHEMA recombination library. Two approaches to analyzing a subset of the SCHEMA chimeras were shown to be effective. Both approaches assume that the fragments contribute in an additive manner to the overall stability, but estimate those contributions using different data. One calculates the contributions by linear regression of sequence-stability data, and the other is based on consensus analysis of the multiple sequence alignments (MSAs) of folded versus unfolded proteins. Both approaches identify highly stable sequences; SCHEMA recombination ensures that they also retain biological function and exhibit high sequence diversity by conserving important functional residues while exchanging tolerant ones.
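The consensus approach mentioned in paragraph [0036] can be sketched as follows. The text states only that consensus energies are computed from Boltzmann statistics and fragment frequencies; the specific form used here (energy equal to the negative sum of log frequencies of each fragment among folded chimeras), the pseudocount, the function names, and the toy list of folded chimeras are assumptions for illustration, not the disclosed calculation.

```python
import math
from collections import Counter

PARENT_DIGITS = "123"


def fragment_frequencies(folded_codes, pseudocount=1.0):
    """Per-segment frequency of each parent among folded chimeras.

    The pseudocount is an assumption that keeps unseen fragments from
    producing log(0).
    """
    n_segments = len(folded_codes[0])
    total = len(folded_codes) + len(PARENT_DIGITS) * pseudocount
    freqs = []
    for i in range(n_segments):
        counts = Counter(code[i] for code in folded_codes)
        freqs.append({p: (counts[p] + pseudocount) / total for p in PARENT_DIGITS})
    return freqs


def consensus_energy(code, freqs):
    """Assumed Boltzmann-style energy: lower means more consensus-like."""
    return -sum(math.log(freqs[i][p]) for i, p in enumerate(code))


# Toy data, hypothetical: not the folded set analyzed in the disclosure.
folded = ["21312333", "21311333", "22312333", "21313333", "11312333"]
freqs = fragment_frequencies(folded)
low = consensus_energy("21312333", freqs)   # built from common fragments
high = consensus_energy("13121111", freqs)  # built from rare fragments
print(f"common fragments: {low:.2f}, rare fragments: {high:.2f}")
```

Under this assumed form the energy is additive over segments, so it ranks chimeras in the same spirit as the regression model while needing only folded or unfolded labels rather than measured T50 values.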
[0037] That fragments of the primary sequence contribute additively to stability may appear surprising, considering the cooperative nature of protein folding and the many tertiary contacts in the native structure. The high degree of additivity observed in this study may be a feature of SCHEMA library design. SCHEMA identifies those sequence fragments that minimize the number of contacts, or interactions, that can be broken upon recombination. Two residues in a chimera are defined to have a contact if any heavy atoms are within 4.5 Å; the contact is broken if they do not appear together in any parent at the same positions. Among a total of about 500 contacts for a P450 chimera, an average of fewer than 30 were broken for the sequences in the SCHEMA library. The fragments that were swapped in this library have a high number of internal contacts; the inter-fragment contacts are either few or are conserved among the parents. Therefore, the fragments function as pseudo-independent structural modules that make roughly additive contributions to stability. The additivity was strong enough to enable detection of sequencing errors based on deviations from additivity, prediction of thermostabilities for uncharacterized chimeras with high accuracy, and prediction of the T50 of the most stable chimera to within measurement error. This additivity enabled a new approach to stabilizing an entire protein family that does not require high throughput selection or screening.
[0038] An "amino acid" is a molecule having the structure wherein a central carbon atom (the α-carbon atom) is linked to a hydrogen atom, a carboxylic acid group (the carbon atom of which is referred to herein as a "carboxyl carbon atom"), an amino group (the nitrogen atom of which is referred to herein as an "amino nitrogen atom"), and a side chain group, R. When incorporated into a peptide, polypeptide, or protein, an amino acid loses one or more atoms of its amino and carboxylic groups in the dehydration reaction that links one amino acid to another. As a result, when incorporated into a protein, an amino acid is referred to as an "amino acid residue."
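The SCHEMA contact rule stated in paragraph [0037] (a contact exists when any heavy atoms of two residues are within 4.5 Å, and it is broken when the residue pair does not occur together at the same positions in any parent) can be sketched as below. The data layout, function names, and the four-residue toy parents are hypothetical and serve only to make the rule concrete; the sketch also assumes the parents are already aligned to a common numbering.

```python
import math

CONTACT_CUTOFF = 4.5  # angstroms, as stated in the text


def in_contact(atoms_a, atoms_b, cutoff=CONTACT_CUTOFF):
    """True if any heavy atom of one residue is within the cutoff of the other."""
    return any(math.dist(a, b) <= cutoff for a in atoms_a for b in atoms_b)


def count_broken_contacts(parents, segment_of, code, contacts):
    """Count contacts whose residue pair is absent from every parent.

    parents:    dict mapping parent digit to an aligned sequence
    segment_of: segment index for each aligned position
    code:       chimera code, one parent digit per segment
    contacts:   list of (i, j) position pairs that are in contact
    """
    chimera = [parents[code[segment_of[k]]][k] for k in range(len(segment_of))]
    broken = 0
    for i, j in contacts:
        pair = (chimera[i], chimera[j])
        if not any((seq[i], seq[j]) == pair for seq in parents.values()):
            broken += 1
    return broken


# Toy example (hypothetical): three 4-residue parents, two 2-residue segments.
toy_parents = {"1": "ACDE", "2": "AGDF", "3": "KGHF"}
toy_segments = [0, 0, 1, 1]
toy_contacts = [(0, 1), (1, 2), (1, 3)]

print(in_contact([(0.0, 0.0, 0.0)], [(0.0, 0.0, 4.0)]))
print(count_broken_contacts(toy_parents, toy_segments, "12", toy_contacts))
print(count_broken_contacts(toy_parents, toy_segments, "11", toy_contacts))
```

In the toy chimera "12", the pair at positions 1 and 3 (C with F) occurs in no parent, so that single contact counts as broken; a parent sequence by construction breaks none.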
[0039] "Protein" or "polypeptide" refers to any polymer of two or more individual amino acids (whether or not naturally occurring) linked via a peptide bond, and occurs when the carboxyl carbon atom of the carboxylic acid group bonded to the α-carbon of one amino acid (or amino acid residue) becomes covalently bound to the amino nitrogen atom of the amino group bonded to the α-carbon of an adjacent amino acid. The term "protein" is understood to include the terms "polypeptide" and "peptide" (which, at times may be used interchangeably herein) within its meaning. In addition, proteins comprising multiple polypeptide subunits (e.g., DNA polymerase III, RNA polymerase II) or other components (for example, an RNA molecule, as occurs in telomerase) will also be understood to be included within the meaning of "protein" as used herein. Similarly, fragments of proteins and polypeptides are also within the scope of the invention and may be referred to herein as "proteins." In one aspect of the disclosure, a stabilized protein comprises a chimera of two or more parental peptide segments.
[0040] A "peptide segment" refers to a portion or fragment of a larger polypeptide or protein. A peptide segment need not on its own have functional activity, although in some instances, a peptide segment may correspond to a domain of a polypeptide wherein the domain has its own biological activity. A stability-associated peptide segment is a peptide segment found in a polypeptide that promotes stability, function, or folding compared to a related polypeptide lacking the peptide segment. A destabilizing-associated peptide segment is a peptide segment that is identified as causing a loss of stability, function or folding when present in a polypeptide.
[0041] A particular amino acid sequence of a given protein (i.e., the polypeptide's "primary structure," when written from the amino-terminus to carboxy-terminus) is determined by the nucleotide sequence of the coding portion of an mRNA, which is in turn specified by genetic information, typically genomic DNA (including organelle DNA, e.g., mitochondrial or chloroplast DNA). Thus, determining the sequence of a gene assists in predicting the primary sequence of a corresponding polypeptide and, more particularly, the role or activity of the polypeptide or proteins encoded by that gene or polynucleotide sequence.
[0042] "Polynucleotide" or "nucleic acid sequence" refers to a polymeric form of nucleotides. In some instances a polynucleotide refers to a sequence that is not immediately contiguous with either of the coding sequences with which it is immediately contiguous (one on the 5' end and one on the 3' end) in the naturally occurring genome of the organism from which it is derived. The term therefore includes, for example, a recombinant DNA which is incorporated into a vector; into an autonomously replicating plasmid or virus; or into the genomic DNA of a prokaryote or eukaryote, or which exists as a separate molecule (e.g., a cDNA) independent of other sequences. The nucleotides of the invention can be ribonucleotides, deoxyribonucleotides, or modified forms of either nucleotide. A polynucleotide as used herein refers to, among others, single- and double-stranded DNA, DNA that is a mixture of single- and double-stranded regions, single- and double-stranded RNA, and RNA that is a mixture of single- and double-stranded regions, and hybrid molecules comprising DNA and RNA that may be single-stranded or, more typically, double-stranded or a mixture of single- and double-stranded regions.
[0043] In addition, polynucleotide as used herein refers to triple-stranded regions comprising RNA or DNA or both RNA and DNA. The strands in such regions may be from the same molecule or from different molecules. The regions may include all of one or more of the molecules, but more typically involve only a region of some of the molecules. One of the molecules of a triple-helical region often is an oligonucleotide. The term polynucleotide encompasses genomic DNA or RNA (depending upon the organism, i.e., RNA genome of viruses), as well as mRNA encoded by the genomic DNA, and cDNA. Polynucleotides encoding P450 from Bacillus megaterium (see, e.g., GenBank accession no. J04832) and Bacillus subtilis are known.
[0044] A "nucleic acid segment," "oligonucleotide segment" or "polynucleotide segment" refers to a portion of a larger polynucleotide molecule. The polynucleotide segment need not correspond to an encoded functional domain of a protein; however, in some instances the segment will encode a functional domain of a protein. A polynucleotide segment can be about 6 nucleotides or more in length (e.g., 6-20, 20-50, 50-100, 100-200, 200-300, 300-400 or more nucleotides in length). A stability-associated peptide segment can be encoded by a stability-associated polynucleotide segment, wherein the peptide segment promotes stability, function, or folding compared to a polypeptide lacking the peptide segment.
[0045] A chimera is a combination of at least two segments of at least two different parent proteins. As appreciated by one of skill in the art, the segments need not actually come from each of the parents, as it is the particular sequence that is relevant, and not the physical nucleic acids themselves. For example, a chimeric P450 will have at least two segments from two different parent P450s. The two segments are connected so as to result in a new P450. In other words, a protein will not be a chimera if it has the identical sequence of either one of the parents.
A chimeric protein can comprise more than two segments from two different parent proteins. For example, there may be 2, 3, 4, 5-10, 10-20, or more parents for each final chimera or library of chimeras. The segment of each parent enzyme can be very short or very long; the segments can range in length from 1 contiguous amino acid to the entire length of the protein. In one embodiment, the minimum length is 10 amino acids. In one embodiment, a single crossover point is defined for two parents. The crossover location defines where one parent's amino acid segment will stop and where the next parent's amino acid segment will start. Thus, a simple chimera would only have one crossover location where the segment before that crossover location would belong to one parent and the segment after that crossover location would belong to the second parent. In one embodiment, the chimera has more than one crossover location, for example, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11-30, or more crossover locations. How these crossover locations are named and defined is discussed below. In an embodiment where there are two crossover locations and two parents, there will be a first contiguous segment from a first parent, followed by a second contiguous segment from a second parent, followed by a third contiguous segment from the first parent. Contiguous is meant to denote that there is nothing of significance interrupting the segments. These contiguous segments are connected to form a contiguous amino acid sequence. For example, a P450 chimera from CYP102A1 (hereinafter "A1") and CYP102A2 (hereinafter "A2"), with two crossovers at 100 and 150, could have the first 100 amino acids from A1, followed by the next 50 from A2, followed by the remainder of the amino acids from A1, all connected in one contiguous amino acid chain. Alternatively, the P450 chimera could have the first 100 amino acids from A2, the next 50 from A1, and the remainder from A2. 
As appreciated by one of skill in the art, variants of chimeras exist as well as the exact sequences. Thus, not 100% of each segment need be present in the final chimera if it is a variant chimera. The amount that may be altered, either through additional residues or removal or alteration of residues, will be defined as the term variant is defined. Of course, as understood by one of skill in the art, the above discussion applies not only to amino acids but also to the nucleic acids which encode the amino acids.
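The two-crossover example above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name build_chimera, the toy 200-residue placeholder sequences, and the requirement that the parents are pre-aligned to equal length are all hypothetical and are not taken from the disclosure.

```python
def build_chimera(parents, crossovers, order):
    """Join contiguous segments of aligned parent sequences.

    parents    : dict mapping a parent label to its aligned sequence
    crossovers : sorted positions at which one parent's segment stops
    order      : parent label supplying each consecutive segment
    """
    lengths = {len(seq) for seq in parents.values()}
    if len(lengths) != 1:
        raise ValueError("parents must be aligned to the same length")
    if len(order) != len(crossovers) + 1:
        raise ValueError("need exactly one more segment than crossovers")
    # Segment boundaries: start of sequence, each crossover, end of sequence.
    bounds = [0, *crossovers, lengths.pop()]
    return "".join(
        parents[label][start:end]
        for label, start, end in zip(order, bounds, bounds[1:])
    )


# Toy placeholders standing in for real parent sequences (hypothetical).
toy_parents = {"A1": "a" * 200, "A2": "b" * 200}
toy_chimera = build_chimera(toy_parents, [100, 150], ["A1", "A2", "A1"])
```

Swapping the order to ["A2", "A1", "A2"] gives the alternative chimera described in the same paragraph.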
[0046] Protein stability is a key factor for industrial protein use (e.g., enzyme reaction) in denaturing conditions required for efficient product development and in therapeutic and diagnostic protein products. Methods for optimizing protein stability have included directed evolution and domain shuffling. However, screening and developing such recombinant libraries is difficult and time consuming.
[0047] Directed evolution has proven to be an effective technique for engineering proteins with desired properties. Because the probability of a protein retaining its fold and function decreases exponentially with the number of random substitutions introduced (Bloom et al., Proc. Natl Acad. Sci. USA, 102, 606-611, 2005), only a few mutations are made in each generation in order to maintain a reasonable fraction of functional proteins for screening (Voigt et al., Advances in Protein Chemistry, Vol. 55, Academic Press, pp. 79-160, 2001). Creating libraries with higher levels of mutation while maintaining structure and function requires identifying mutations that are less likely to disrupt the structure (Lutz and Patrick, Curr. Opin. Biotechnol., 15, 291-297, 2004). One strategy to accomplish this is homologous recombination: mutations introduced by recombination are less deleterious than random mutations because they are compatible with the backbone structure (Drummond et al., Proc. Natl Acad. Sci. USA, 102, 5280-5385, 2005). Random recombination of highly similar proteins often generates libraries with a high fraction of functional sequences; however, as more distantly related proteins are recombined, the fraction of chimeric proteins that fold correctly decreases.
[0048] Efforts have been made to identify consensus mutations that provide stabilizing effects. Consensus stabilization has been shown to be effective in some cases and to some degree, but not all consensus mutations are stabilizing (e.g., more than 40% of the consensus residues identified from multiple sequence alignment of naturally occurring β-lactamases are in fact destabilizing rather than stabilizing (Amin et al., Prot. Eng. Des. & Sel., 17(11):787-793, 2004)). These methods have two problems: first, single mutations generally have small effects on stability; and second, not all mutations can be combined such that the stabilizing effects can be properly measured.
[0049] Thus, methods of protein development have focused on providing stabilized proteins by generating a large number of recombined proteins and assaying each recombined protein for activity. A method of identifying stabilizing mutations is a first step in removing or narrowing possible candidates. For this reason it is of value to be able to make multiple versions of a protein that are stabilized. If one has many stable variants to choose from, then those variants that exhibit all of the properties of interest can be identified by appropriate analysis of those properties. The disclosure provides a method for making many (e.g., from 1 to many thousand) variants of a protein having amino acid sequences that may differ at multiple amino acid positions and that are stabilized and thus are likely to be functional. Such techniques for generating libraries of stabilized proteins have not previously been provided in the art.
[0050] A number of techniques are used for generating novel proteins including, for example, rational design, which uses computational methods to identify sites for introducing disulfide bonds; directed evolution; and consensus stabilization. The foregoing methods do not utilize a linear regression or consensus analysis to assist in selectively designing stabilized proteins.
[0051] Recombination has been widely applied to accelerate in vitro protein evolution. In this process, the genetic information of several genes is exchanged to produce a library of recombined, recombinant mutants. These mutants are screened for improvement in properties of interest, such as stability, activity, or altered substrate specificity. In vitro recombination methods include DNA shuffling, random-priming recombination, and the staggered extension process (StEP). In DNA shuffling, the parental DNA is enzymatically digested into fragments. The fragments can be reassembled into offspring genes. 
In the random-priming method, template DNA sequences are primed with random-sequence primers and then extended by DNA polymerase to create fragments. The template is removed and the fragments are reassembled into full-length genes, as in the final step of DNA shuffling. In each of these methods, the number of cut points can be increased by starting with smaller fragments or by limiting the extension reaction. StEP recombination differs from the first two methods because it does not use gene fragments. The template genes are primed and extended before denaturation and reannealing. As the fragments grow, they reanneal to new templates and thus combine information from multiple parents. This process is cycled hundreds of times until a full-length offspring gene is formed. The foregoing methods are known in the art .
[0052] Recently, it has been shown that recombining genes that have evolved independently in nature is a powerful way to quickly accumulate large improvements in stability and function. Given the explosive growth in the gene databases due to the exhaustive sequencing of large numbers of organisms, the sequences of homologous genes are easily accessible. These sequences can be synthesized or cloned for evolution of protein functions by recombination methods described above and known in the art.
[0053] Common to these experimental approaches to recombination in vitro is that the genes are cut and reformed randomly; that is, there is little or no a priori input into the experimental protocol regarding which genes are chosen for recombination and where the cut points should occur, other than in regions of high sequence similarity. Using the SCHEMA method (described further herein), sequences are predicted that are more likely to generate diverse recombined, recombinant gene libraries and the desired improvements in the recombined, recombinant genes.
[0054] As a first step in performing any recombination techniques a set of related polypeptides is identified. The relatedness of the polypeptides can be determined in any number of ways known in the art. For example, polypeptides may be related structurally either in their primary sequence or in the secondary or tertiary sequence. Methods of identifying sequence identity or 3D structural similarities are known and are further described herein. Another method to identify a related polypeptide is through evolutionary analysis. Evolutionary trees have been developed for a large number of proteins and are available to those of skill in the art.
[0055] A parental sequence used as a basis for defining a set of related polypeptides can be provided by any of a number of mechanisms, including, but not limited to, sequencing, or querying a nucleic acid or protein database. Additionally, while the parental sequence can be provided in a physical sense (e.g., isolated or synthesized), typically the parental sequence or sequences are obtained in silico.
[0056] For embodiments of the disclosure involving amino acid sequences, the parental sequences typically are derived from a common family of proteins having similar three-dimensional structures (e.g., protein superfamilies). However, the nucleic acid sequences encoding these proteins might or might not share a high degree of sequence identity. As described later herein, the methods include assessing crossover positions using any number of techniques (e.g., SCHEMA etc.).
[0057] Sequence similarity/identity of various stringency and length can be detected and recognized using a number of methods or algorithms known to one of skill in the art. For example, many identity or similarity determination methods have been designed for comparative analysis of sequences of biopolymers, for spell-checking in word processing, and for data retrieval from various databases. With an understanding of double-helix pair-wise complement interactions among the four principal nucleobases in natural polynucleotides, models that simulate annealing of complementary homologous polynucleotide strings can also be used as a foundation of sequence alignment or other operations typically performed on the character strings corresponding to the sequences herein (e.g., word-processing manipulations, construction of figures comprising sequence or subsequence character strings, output tables, etc.). An example of a software package for calculating sequence identity is BLAST, which can be adapted to the disclosure by inputting character strings corresponding to the sequences herein.
[0058] After providing parental sequences, the sequences are aligned. In other embodiments, a plurality of parental sequences are provided, which are then aligned with either a reference sequence, or with one another. Alignment and comparison of relatively short amino acid sequences (for example, less than about 30 residues) is typically straightforward. Comparison of longer sequences can require more sophisticated methods to achieve optimal alignment of two sequences.
[0059] Optimal alignment of sequences can be performed, for example, by a number of available algorithms, including, but not limited to, the "local homology" algorithm of Smith and Waterman (Adv. Appl. Math. 2:482, 1981), the "homology alignment" algorithm of Needleman and Wunsch (J. Mol. Biol. 48:443, 1970), the "search for similarity" method of Pearson and Lipman (Proc. Natl. Acad. Sci. USA 85:2444, 1988), or by computerized implementations of these algorithms (e.g., GAP, BESTFIT, FASTA and TFASTA available in the Wisconsin Genetics Software Package Release 7.0, Genetics Computer Group, 575 Science Dr., Madison, Wis.; and BLAST, see, e.g., Altschul et al., Nuc. Acids Res. 25:3389-3402, 1997 and Altschul et al., J. Mol. Biol. 215:403-410, 1990). Alternatively, the sequences can be aligned by inspection. Generally the best alignment (i.e., the relative positioning resulting in the highest percentage of sequence identity over the comparison window) generated by the various methods is selected. However, in certain embodiments of the disclosure, the best alignment may alternatively be a superpositioning of selected structural features, and not necessarily the highest sequence identity.
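As an illustration of the global "homology alignment" idea, the following Python sketch computes a Needleman-Wunsch-style alignment score by dynamic programming. It is a simplified teaching example, not the implementation used by GAP or any other package named above: the scoring values (match +1, mismatch -1, linear gap penalty -1) and the function name are assumptions chosen for brevity.

```python
def global_alignment_score(seq_a, seq_b, match=1, mismatch=-1, gap=-1):
    """Best global alignment score of two sequences under a linear gap penalty.

    Only the score is returned; recovering the alignment itself would
    require keeping the full table and tracing back through it.
    """
    # Row 0: aligning a prefix of seq_b against nothing costs one gap each.
    previous = [j * gap for j in range(len(seq_b) + 1)]
    for i in range(1, len(seq_a) + 1):
        current = [i * gap]  # prefix of seq_a against nothing
        for j in range(1, len(seq_b) + 1):
            pair = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            current.append(
                max(
                    previous[j - 1] + pair,  # align the two residues
                    previous[j] + gap,       # gap in seq_b
                    current[j - 1] + gap,    # gap in seq_a
                )
            )
        previous = current
    return previous[-1]
```

For example, under these assumed scores "ACGT" against "AGT" scores 2 (three matches and one gap).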
[0060] The term "sequence identity" means that two amino acid sequences are substantially identical (i.e., on an amino acid-by-amino acid basis) over a window of comparison. The term "sequence similarity" refers to similar amino acids that share the same biophysical characteristics. The term "percentage of sequence identity" or "percentage of sequence similarity" is calculated by comparing two optimally aligned sequences over the window of comparison, determining the number of positions at which the identical residues (or similar residues) occur in both polypeptide sequences to yield the number of matched positions, dividing the number of matched positions by the total number of positions in the window of comparison (i.e., the window size), and multiplying the result by 100 to yield the percentage of sequence identity (or percentage of sequence similarity). With regard to polynucleotide sequences, the terms sequence identity and sequence similarity have comparable meaning as described for protein sequences, with the term "percentage of sequence identity" indicating that two polynucleotide sequences are identical (on a nucleotide-by-nucleotide basis) over a window of comparison. As such, a percentage of polynucleotide sequence identity (or percentage of polynucleotide sequence similarity, e.g., for silent substitutions or other substitutions, based upon the analysis algorithm) also can be calculated. Maximum correspondence can be determined by using one of the sequence algorithms described herein (or other algorithms available to those of ordinary skill in the art) or by visual inspection.
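The percentage calculation described above (matched positions divided by window size, multiplied by 100) can be written out as a short Python sketch. The function name and the treatment of gap characters ("-" never counts as a match but still counts toward the window size) are assumptions made for illustration; alignment programs differ in how they handle gapped positions.

```python
def percent_identity(aligned_a, aligned_b):
    """Percentage of identical positions over the window of comparison.

    Both inputs are rows of an existing alignment, so they must have the
    same length; that shared length is taken as the window size.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    if not aligned_a:
        raise ValueError("window of comparison is empty")
    matched = sum(
        1 for x, y in zip(aligned_a, aligned_b) if x == y and x != "-"
    )
    return 100.0 * matched / len(aligned_a)
```

A percentage of sequence similarity could be obtained the same way by replacing the equality test with a lookup of residue pairs regarded as similar.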
[0061] As applied to polypeptides, the term substantial identity or substantial similarity means that two peptide sequences, when optimally aligned, such as by the programs BLAST, GAP or BESTFIT using default gap weights or by visual inspection, share sequence identity or sequence similarity. Similarly, as applied in the context of two nucleic acids, the term substantial identity or substantial similarity means that the two nucleic acid sequences, when optimally aligned, such as by the programs BLAST, GAP or BESTFIT using default gap weights (described in detail below) or by visual inspection, share sequence identity or sequence similarity.
[0062] One example of an algorithm that is suitable for determining percent sequence identity or sequence similarity is the FASTA algorithm, which is described in Pearson, W. R. & Lipman, D. J., (1988) Proc. Natl. Acad. Sci. USA 85:2444. See also, W. R. Pearson, (1996) Methods Enzymology 266:227-258. Preferred parameters used in a FASTA alignment of DNA sequences to calculate percent identity or percent similarity are optimized, BL50 Matrix 15: -5, k-tuple=2; joining penalty=40, optimization=28; gap penalty=-12, gap length penalty=-2; and width=16.
[0063] Another example of a useful algorithm is PILEUP. PILEUP creates a multiple sequence alignment from a group of related sequences using progressive, pairwise alignments to show relationship and percent sequence identity or percent sequence similarity. It also plots a tree or dendrogram showing the clustering relationships used to create the alignment. PILEUP uses a simplification of the progressive alignment method of Feng & Doolittle, (1987) J. Mol. Evol. 35:351-360. The method used is similar to the method described by Higgins & Sharp, CABIOS 5:151-153, 1989. The program can align up to 300 sequences, each of a maximum length of 5,000 nucleotides or amino acids. The multiple alignment procedure begins with the pairwise alignment of the two most similar sequences, producing a cluster of two aligned sequences. This cluster is then aligned to the next most related sequence or cluster of aligned sequences. Two clusters of sequences are aligned by a simple extension of the pairwise alignment of two individual sequences. The final alignment is achieved by a series of progressive, pairwise alignments. The program is run by designating specific sequences and their amino acid or nucleotide coordinates for regions of sequence comparison and by designating the program parameters. Using PILEUP, a reference sequence is compared to other test sequences to determine the percent sequence identity (or percent sequence similarity) relationship using the following parameters: default gap weight (3.00), default gap length weight (0.10), and weighted end gaps. PILEUP can be obtained from the GCG sequence analysis software package, e.g., version 7.0 (Devereux et al., (1984) Nuc. Acids Res. 12:387-395).
[0064] Another example of an algorithm that is suitable for multiple DNA and amino acid sequence alignments is the CLUSTALW program (Thompson, J. D. et al., (1994) Nuc. Acids Res. 22:4673-4680). 
CLUSTALW performs multiple pairwise comparisons between groups of sequences and assembles them into a multiple alignment based on sequence identity. Gap open and gap extension penalties were 10 and 0.05, respectively. For amino acid alignments, a BLOSUM matrix can be used as the protein weight matrix (Henikoff and Henikoff, (1992) Proc. Natl. Acad. Sci. USA 89:10915-10919).
[0065] Another method of determining relatedness is through protein and polynucleotide alignments. Common methods include using sequence-based searches available on-line and through various software distribution routes. Homology or identity at the amino acid or nucleotide level can be determined by BLAST (Basic Local Alignment Search Tool) and by ClustalW analysis using the algorithm employed by the programs blastp, blastn, blastx, tblastn and tblastx (Karlin et al., Proc. Natl. Acad. Sci. USA 87, 2264-2268, 1990; Thompson et al., Nucleic Acids Res 22, 4673-4680, 1994; and Altschul, J. Mol. Evol. 36, 290-300, 1993, fully incorporated by reference), which are tailored for sequence similarity searching. The approach used by the BLAST program is to first consider similar segments between a query sequence and a database sequence, then to evaluate the statistical significance of all matches that are identified and finally to summarize only those matches which satisfy a preselected threshold of significance. For a discussion of basic issues in similarity searching of sequence databases, see Altschul et al., Nature Genetics 6, 119-129, 1994, which is fully incorporated by reference. The search parameters for histogram, descriptions, alignments, expect (i.e., the statistical significance threshold for reporting matches against database sequences), cutoff, matrix and filter are at the default settings. The default scoring matrix used by blastp, blastx, tblastn, and tblastx is the BLOSUM62 matrix (Henikoff et al., Proc. Natl. Acad. Sci. USA 89, 10915-10919, 1992, fully incorporated by reference). 
For blastn, the scoring matrix is set by the ratios of M (i.e., the reward score for a pair of matching residues) to N (i.e., the penalty score for mismatching residues), wherein the default values for M and N are 5 and -4, respectively.
[0066] Accordingly, by using such methods, families or groups of structurally related polypeptides can be identified. Typically the protein homology (whether they are evolutionarily, and therefore structurally, related) is determined primarily by sequence similarity (sequences are more similar than expected at random). Sequences that are as low as 15-20% similar by alignments are likely related and encode proteins with similar structures. Additional structural relatedness can be determined using any number of further techniques including, but not limited to, X-ray crystallography, NMR, searching protein structure databases, homology modeling, de novo protein folding, and computational protein structure prediction. Such additional techniques can be used alone or in addition to sequence-based alignment techniques. In one aspect, the degree of similarity/identity between two polypeptides (including peptide segments or domains) or polynucleotide sequences should be at least about 20% or more (e.g., 30%, 35%, 40%, 45%, 50%, 55%, 60%, 65%, 70%, 75%, 80%, 85%, 90%, 95%, 98% or 99%).
[0067] In some aspects, parent sequences are chosen from a database of sequences by a sequence homology search such as BLAST. Parental sequences will typically be between about 20% and 95% identical, more typically between 35% and 80% identical. The lower the identity, the higher the mutation level (and possibly the greater the possible stability enhancement and functional variation in the resulting sequences) following recombination between parental strands. The higher the identity, the higher the probability that the sequences will fold and function.
[0068] Thermodynamic stability is an important biological property that has evolved to an optimal level to fit the functional needs of proteins. Therefore, investigating the stability of proteins is important not only because it affords information about the physical chemistry of folding, but also because it can provide important biological insights. A proper understanding of protein stability is also useful for technological purposes. The ability to rationally make proteins of high stability, low aggregation or low degradation rates will be valuable for a number of applications. For example, proteins that can resist unfolding can be used in industrial processes that require enzyme catalysis at high temperatures (Van den Burg et al., Proc. Natl. Acad. Sci. U.S.A. 95(5): 2056-60, 1998); and the ability to produce proteins with low degradation rates within the cell can help to maximize production of recombinant proteins (Kwon et al., Protein Eng. 9(12): 1197-202, 1996).
[0069] Stability measurements can also be used as probes of other biological phenomena. The most basic of these phenomena is biological activity. The ability of proteins to populate their native states is a universal requirement for function. Therefore, stability can be used as a convenient, first level assay for function. For example, libraries of polypeptide sequences can be tested for stability in order to select for sequences that fold into stable conformations and might potentially be active (Sandberg et al., Biochem. 34: 11970-78, 1995). Heme domains of cytochromes can be assayed for proper folding using CO-binding to the iron/heme. The heme domain for SEQ ID NO:1 extends from about amino acid residue 1 to about 434; and for SEQ ID NO:2 or 3 extends from about amino acid residue 1 to about 436.
[0070] Changes in stability can also be used to detect binding. When a ligand binds to the native conformation of a protein, the global stability of a protein is increased (Schellman, Biopolymers 14: 999-1018, 1975; Pace & McGrath, (1980) J. Biol. Chem. 255: 3862-65; Pace & Grimsley, Biochem. 27: 3242-46, 1988). The binding constant can be measured by analyzing the extent of the stability increase. This strategy has been used to analyze the binding of ions and small molecules to a number of proteins (Pace & McGrath, (1980) J. Biol. Chem. 255: 3862-65; Pace & Grimsley, (1988) Biochem. 27: 3242-46; Schwartz, (1988) Biochem. 27: 8429-36; Brandts & Lin, (1990) Biochem. 29: 6927-40; Straume & Freire, (1992) Anal. Biochem. 203: 259-68; Graziano et al., (1996) Biochem. 35: 13386-92; Kanaya et al., (1996) J. Biol. Chem. 271: 32729-36).
[0071] The linkage between stability and binding has recently been implemented as a method to detect ligand binding (U.S. Pat. No. 5,679,582 to Bowie & Pakula). This method, however, does not take advantage of the high sensitivity available from an analytical technique such as MALDI mass spectrometry, and cannot be employed at the low protein levels that MALDI mass spectrometry can detect. Moreover, proteolytic methods can require additional steps to isolate and analyze proteolytic fragments and cannot be performed in an in vivo setting. Finally, this method cannot be employed to generate quantitative measurements of protein stability.
[0072] The expressed chimeric recombinant proteins are measured for stability and/or biological activity. Techniques for measuring stability and activity are known in the art and include, for example, the ability to retain function (e.g., enzymatic activity) at elevated temperature or under 'harsh' conditions of pH, salt, organic solvent, and the like; and/or the ability to maintain function for a longer period of time (e.g., in storage in normal conditions, or in harsh conditions). 
Function will of course depend upon the type of protein being generated and will be based upon its intended purpose. For example, P450 mutants can be tested for the ability to convert alkanes to alcohols under various conditions of pH, solvents and temperature.
[0073] The best methodology for protein stabilization depends on the target protein and the relative ease with which folding status and stability are measured. The linear regression model uses stability data, which are often more difficult to obtain than a simple determination of folding status. The linear regression model, however, requires fewer measurements and always predicted more true positives with fewer false positives than the consensus approach based on folding status (Fig. 8). While the linear regression model predicted absolute thermostabilities with higher accuracy than the consensus model, the latter nonetheless reliably predicted highly stable chimeras. The two approaches have significant overlaps in their predicted stable sequences, including the MTP. Eight of the top seventeen stable chimeras predicted by consensus have predicted T50 > 60°C by linear regression (Table 3). Among the 50 chimeras with lowest consensus energies, 28 sequences are also predicted very stable by linear regression, and all 17 chimeras constructed have measured T50 > 58°C (there are 87 chimeras with predicted T50 > 58°C). Nineteen of the top 50 stable chimeras predicted by linear regression have consensus energies ranked in the top 50. The consensus analysis can also be performed using multiple sequence alignments that are based on functional status (functional or not) rather than folding status, since most chimeras that fold also retain functions shared by the parent proteins.
[0074] Consensus stabilization is based on the idea that the frequencies of sequence elements correlate with their corresponding stability contributions. This correlation is typically assumed to follow a Boltzmann-like exponential relationship. Such a relationship is most sensible if, in analogy to statistical mechanics, the sequences are randomly sampled from the ensemble of all folded sequences. Natural sequences are related by divergent evolution and may not comprise such a sample. 
The chimeric data set, in contrast, represents a large and nearly random sample of all the 6,561 possible chimeras. Dramatic support for the fundamental assumptions underlying consensus stabilization approaches was found: sequence elements contribute additively to stability, stabilizing fragments occur at higher frequencies among folded sequences, and the consensus sequence is the most stable in the ensemble. Because the chimeric library also gives unique access to unfolded sequences, failure to fold is informative as well: incorporation of sequence element frequencies among unfolded sequences significantly improves predictions of both relative stability and protein folding status. These results demonstrate the tolerance of the consensus stabilization idea to different ensembles (chimeric libraries versus evolved families) and sequence changes (recombination versus stepwise mutation), supporting its use as a reliable method for protein stabilization.
[0075] The ensemble of 40 stable, homologous P450s also represents a valuable resource for studying the relationship between enzyme stability and activity. Comparative studies of homologous enzymes from mesophilic and thermophilic organisms often find that the highly stable enzymes from thermophiles are less active at room temperature than their marginally stable mesophilic counterparts. These findings have been used to argue that highly stable proteins are inherently less capable of functioning as active enzymes at room temperature, perhaps due to decreased flexibility. An alternative viewpoint holds that proteins from thermophiles are poor enzymes at low temperature (e.g., room temperature) because they have evolved under pressure to function at high temperatures, just as proteins from mesophiles are marginally stable because they have never been selected to fold at higher temperatures. This viewpoint, that the poor low-temperature activity of naturally thermostable enzymes represents evolutionary statistics rather than an inherent biophysical tradeoff, has anecdotal support from engineering experiments that have dramatically stabilized proteins while retaining their room-temperature activity. The current data, in which a large set of proteins with varying stabilities has been generated by recombination without evolutionary selection for either activity or stability, provide a more rigorous test. Over half of the 40 thermostable chimeras in Table 3 are also more active on 2-phenoxyethanol than the most active parent, demonstrating that there is no fundamental biophysical tradeoff between stability and activity on this substrate. Such trade-offs, if they exist, must be connected to significantly more optimized functions. 
[0076] The disclosure demonstrates that chimeric proteins exhibit a broad range of stabilities, and that stability of a given folded sequence can be predicted based on data (either stability or folding status) from a limited sampling of the chimeric library. Using this information, dozens of diverse, highly stable proteins were created.
[0077] The following examples are meant to further explain, but not limit, the foregoing disclosure or the appended claims.
EXAMPLES
[0078] Thermostability measurements. Cell extracts were prepared and P450 concentrations were determined as reported previously (13). Cell extract samples containing 4 μM of P450 were heated in a thermocycler over a range of temperatures for 10 minutes followed by rapid cooling to 4°C for 1 minute. The precipitate was removed by centrifugation. The P450 remaining in the supernatant was measured by CO-difference spectroscopy. T50, the temperature at which 50 percent of protein irreversibly denatured after a 10-min incubation, was determined by fitting the data to a two-state denaturation model. To check the variability and reproducibility of the measurement, four parallel independent experiments (from cell culture to T50 measurement) were conducted on A2, which yielded an average T50 of 43.6°C and a standard deviation (σM) of 1.0°C. For some sequences, T50 values were measured twice, and the average of all the measurements was used in the analysis.
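A T50 fit of this kind can be sketched as follows. The disclosure does not give the functional form of the two-state model or the fitting software, so everything here is an assumption for illustration: the sigmoidal form 1 / (1 + exp((T - T50) / width)), the coarse grid search in place of a proper least-squares optimizer, the names fraction_folded and fit_t50, and the synthetic data points.

```python
import math


def fraction_folded(temp, t50, width):
    """Assumed two-state curve: fraction of P450 remaining after heating."""
    return 1.0 / (1.0 + math.exp((temp - t50) / width))


def fit_t50(temps, fractions):
    """Grid-search the (T50, width) pair with the smallest squared error.

    T50 is scanned in 0.1 degree steps across the measured range and the
    transition width in 0.1 degree steps from 0.5 to 5.0.
    """
    best = None
    for t50_tenths in range(int(min(temps) * 10), int(max(temps) * 10) + 1):
        t50 = t50_tenths / 10
        for width_tenths in range(5, 51):
            width = width_tenths / 10
            sse = sum(
                (fraction_folded(t, t50, width) - f) ** 2
                for t, f in zip(temps, fractions)
            )
            if best is None or sse < best[0]:
                best = (sse, t50, width)
    return best[1], best[2]


# Synthetic, noise-free data generated from the assumed model itself.
example_temps = [36, 38, 40, 42, 44, 46, 48, 50, 52]
example_fractions = [fraction_folded(t, 43.6, 1.5) for t in example_temps]
example_t50, example_width = fit_t50(example_temps, example_fractions)
```

With real measurements the fractions would be the CO-difference signal at each temperature normalized to the unheated sample, and replicate fits would be averaged as described above.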
[0079] Linear regression. The linear model

T_{50} = a_0 + \sum_{i} \sum_{j} a_{ij} x_{ij}

was used for regression, where T50 is the dependent variable and fragments x_{ij} (from the ith position and jth parent, where i = 1, 2, ..., 8 and j = 1 or 3) are the independent variables. The x_{ij} were dummy-coded, such that if a chimera took fragment 1 from parent 1, x_{11} = 1 and x_{13} = 0. Parent A2 was used as the reference for all eight fragments, so the constant term (a_0) is the predicted T50 of A2. The thermostability contribution of each fragment relative to the corresponding A2 fragment is given by the regression coefficient a_{ij}. Regression was performed using SPSS (SPSS for Windows, Rel. 11.0.1. 2001. Chicago: SPSS Inc.).
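The dummy coding and the additive prediction can be illustrated with a short Python sketch. A chimera is written as an eight-digit string naming the parent of each fragment (e.g., 21312333). The function names and the two coefficient values below are hypothetical placeholders chosen only to show the arithmetic; they are not the regression coefficients obtained in the disclosure.

```python
def dummy_code(chimera):
    """Return x[(i, j)] for fragment positions i = 1..8 and parents j = 1, 3.

    Parent 2 (A2) is the reference, so a fragment taken from A2 is coded
    as zeros for both j = 1 and j = 3 at that position.
    """
    if len(chimera) != 8 or set(chimera) - set("123"):
        raise ValueError("chimera must be eight digits drawn from 1, 2, 3")
    return {
        (i, j): 1 if parent == str(j) else 0
        for i, parent in enumerate(chimera, start=1)
        for j in (1, 3)
    }


def predict_t50(chimera, a0, coefficients):
    """Additive model: constant term plus the sum of a_ij * x_ij."""
    x = dummy_code(chimera)
    return a0 + sum(coefficients.get(key, 0.0) * value for key, value in x.items())


# Hypothetical values for illustration only (not fitted coefficients).
toy_a0 = 44.0
toy_coefficients = {(1, 1): 1.5, (8, 3): -2.0}
toy_prediction = predict_t50("12222223", toy_a0, toy_coefficients)
```

Because A2 is the reference, the all-A2 sequence 22222222 is predicted at exactly the constant term, which matches the statement that a_0 is the predicted T50 of A2.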
[0080] Construction of thermostable chimeric cytochrome P450s. To construct a given stable chimera, two chimeras having parts of the targeted gene (e.g., 21311212 and 11312333 for the target chimera 21312333) were selected as templates. The target gene was constructed by overlap extension PCR, cloned into the pCWori expression vector, and transformed into the catalase-free E. coli strain SN0037. All constructs were confirmed by sequencing.
[0081] Enzyme activity assay. Activity on 2-phenoxyethanol was analyzed in 96-well plates using the 4-aminoantipyrine (4-AAP) assay. 80 μl of P450 chimera (4 μM) was mixed with 20 μl of 2-phenoxyethanol (3 M) in each well. The reaction was initiated by adding 20 μl of 120 mM hydrogen peroxide. The reaction mixture was incubated at room temperature for two hours. Then 50 μl of basic buffer (0.2 M NaOH and 4 M urea) was added to the reaction mixture to raise the pH for the 4-AAP assay. 25 μl of 0.6% 4-AAP was added, the reading at 500 nm was taken for zeroing, and then 25 μl of 0.6% potassium persulfate was added. After incubation for 10 minutes at room temperature, the absorbance at 500 nm was recorded. The total turnover number (TTN) was calculated and then normalized to the most active parent, A1.
[0082] To generate a library of novel CYP102A sequences for these applications, a structure-guided SCHEMA recombination of the heme domains of CYP102A1 and its homologs CYP102A2 (A2) and CYP102A3 (A3) was used to create at least 2,300 new, properly folded and catalytically active enzymes. The folded chimeras exhibit a great deal of sequence diversity, differing from the closest parent sequence by an average of 72 amino acid substitutions. Some of these chimeric P450s were shown to be more stable than any of the parents.
[0083] The SCHEMA library was constructed by site-directed recombination at seven crossover sites, so that a chimeric P450 sequence is made up of eight fragments, each chosen from one of the three parents. The thermostabilities of a subset of the folded chimeras were measured, and the relationship between sequence and stability was analyzed. Thermostability is well described by a model that assumes the contributions of the chimera's eight sequence fragments are additive. The sequences of 620 folded and 335 unfolded chimeras were examined, and it was found that the most thermostable chimeras tend to contain 'consensus fragments', i.e., those appearing more frequently among the folded chimeras and less frequently among the unfolded ones. Chimera thermostability can thus be predicted by determining either the folding status or the thermostabilities of a small sampling of the library. Based on these results, 40 chimeric cytochrome P450s were predicted, constructed and characterized; they are highly stable, catalytically active and have sequences that differ significantly from any known P450.
[0084] Thermostabilities of chimeric P450 heme domains. 620 folded and 335 unfolded chimeric P450 heme domains from a SCHEMA library constructed from three parents (CYP102A1, A2 and A3; SEQ ID NO: 1-3) have been provided. The 3^8 = 6,561 possible chimeric sequences are written according to fragment composition: 23121321, for example, represents a protein which inherits the first fragment from parent A2, the second from A3, the third from A1, and so on. To determine the relationship between sequence and stability, thermostabilities of 185 folded P450 chimeras were measured (Table 1) in the form of T50, the temperature at which 50% of the protein was irreversibly denatured after incubation for ten minutes. The parental proteins have T50 values of 54.9 °C (A1), 43.6 °C (A2) and 49.1 °C (A3). The T50 distribution of the chimeras, shown in Fig. 1, has an average of 50.3 °C and a standard deviation of 4.5 °C. This subset of the folded P450s contains many that are more stable than the most stable parent (A1).
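The fragment-composition notation and the library size can be made concrete with a short Python sketch. It simply enumerates every eight-character string over the parent labels 1, 2 and 3; the function and variable names are illustrative and not part of the disclosure.

```python
from itertools import product

PARENTS = "123"   # 1 = A1, 2 = A2, 3 = A3
N_FRAGMENTS = 8   # seven crossover sites give eight fragments


def all_chimeras():
    """Enumerate every fragment composition, e.g. '23121321'."""
    return ["".join(combo) for combo in product(PARENTS, repeat=N_FRAGMENTS)]


def describe(chimera):
    """Map each position of a fragment string to its parent name."""
    return [f"A{label}" for label in chimera]


library = all_chimeras()
```

The list contains 3^8 entries, including the three parents themselves (11111111, 22222222 and 33333333).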
Table 1. T50 values and sequences of 205 chimeric cytochromes P450.
Sequence T50 (°C) Sequence T50 (°C) Sequence T50 (°C) Sequence T50 (°C)
32233232 39.8 32312322 49.1 32212231 47.4 23213333 56.1
32313233 52.9 32312231 52.6 23212212 48.0 21333233 54.2
21133233 48.8 21232332 49.3 22113223 49.9 22233212 44.0
31312113 45.0 31331331 47.3 22233211 46.3 21313112 54.8
21332223 48.3 21132222 45.6 23213311 49.5 31213233 50.6
21312323 61.5 21212333 63.2 31212321 44.9 22132113 40.6
22312322 54.6 21231233 50.6 23112233 51.0 31112333 55.7
21212112 51.2 22212322 50.7 32332323 48.5 31212331 51.8
23133121 47.3 21112122 50.3 22112223 52.8 22232222 47.5
11312233 51.6 22111223 51.3 32313231 52.5 23332221 46.4
21133312 45.4 23233212 39.5 32132232 42.5 21332131 58.5
21133313 50.8 31312212 48.9 22232233 49.6 23231233 45.5
11332233 43.3 32211323 46.6 22232322 45.4 22111332 50.9
31212332 53.4 21213231 54.9 22333211 50.7 23312121 49.3
12211232 49.1 21332312 52.9 22332223 52.4 22332222 50.3
31312133 52.6 22332211 53.0 23213212 49.0 23312323 53.8
12232332 39.2 22113323 53.5 23333213 50.1 21131121 53.0
22133232 47.9 22113332 48.7 31312233 57.9 32212232 48.8
22233221 46.8 22213132 52.0 22232333 53.7 22112323 55.3
23113323 51.0 31213332 50.8 31333233 46.5 21232232 49.5
11332212 47.8 22113211 51.1 22213212 50.5 11212333 50.4
32332231 49.4 22313323 60.0 22132212 46.6 31212232 51.0
22132331 53.3 32333233 47.2 21332233 58.9 23213211 47.4
23313111 56.9 22331223 51.7 23333131 50.5 11331312 43.5
23112323 46.0 23333233 51.0 31312332 54.9 23331233 50.9
11113311 51.2 22333332 49.0 21333221 51.3 22133323 49.4
21232233 50.6 23332331 48.0 22333223 49.9 33333233 46.3
12332233 47.1 21233132 42.4 21111333 62.4 22233323 48.4
23333311 45.7 13333211 45.7 12212212 44.8 32232131 43.9
32132233 42.9 22232331 50.5 11313233 48.3 31312323 52.3
22331123 47.9 22313233 58.5 32113232 47.9 21313313 64.4
12212332 48.4 31311233 56.9 21113322 50.4 22333231 53.1
31212323 48.7 21132321 49.3 31313232 51.9 22232123 43.1
21132323 50.1 12322333 47.9 31332233 49.9 33312333 54.7
23332231 51.4 23313233 56.3 21133232 46.4 22313232 58.8
12112333 50.9 21332322 48.8 22112211 54.7 22312111 53.0
22133212 47.2 22132231 53.0 21333333 58.0 32212233 49.9
31113131 54.9 21113312 53.0 22213223 50.8 21212321 53.3
23313333 61.2 22312223 56.2 21332112 50.4 21333211 55.9
21113133 51.9 23332223 46.7 21331332 52.0 22232212 46.2
21111323 54.4 32212323 48.4 11313333 53.8 23313323 50.9
22212123 47.7 21212111 57.2 32311323 52.0 32312333 57.8
12211333 50.6 31212212 47.1 23132231 48.0 12313331 51.2
23113112 46.3 22232121 49.7 12232232 40.9 21311331 62.9
21313122 50.5 21232212 47.8 21212231 59.9 21313231 61.0
23112333 54.3 21333223 49.1 21132212 48.8 22312133 57.1
12213212 44.0 23213232 48.5 23133311 44.2 22312231 60.0
23132233 43.6 22113232 51.1 22113111 49.2 22312311 55.6
21313311 56.9 11331333 46.3 23212211 50.7 22312332 59.1
21332231 60.0 22333321 49.2 21132112 47.1 22312333 63.5
23133233 43.1 21232321 46.0 23132311 44.5 21312333 64.4
21312123 60.8
Note: The first 185 chimeras are those for data training and testing, and the last 20 chimeras (bold) are those used to test the linear regression model.
[0085] Linear regression analysis of chimera thermostability.
The T50 values of the 185 chimeric P450s were analyzed by linear regression, under the implicit assumption that each fragment contributes independently to protein thermostability. Regression of T50 against chimera fragment composition revealed a strong linear correlation between predicted and observed T50 over all 185 chimeras: Pearson r = 0.857 (Fig. 2a) (Table 2).
Table 2. Thermostability contribution from each fragment calculated by linear regression.
Figure imgf000034_0001
Note: The thermostability contribution of each fragment shown is relative to the corresponding fragment from parent A2, which was used as the reference.
[0086] To examine whether the results allow generalization from one data subset to another and to address the possibility of over-fitting, the data were randomly divided into two parts, a training set (140 data points) and a test set (45 data points). The standard deviations of regression (σR) and of measurement (σM = 1.0 °C) were used to guide the data training. After each training cycle, every data point was weighted in terms of its role in determining the regression line. If the prediction error (the difference between the predicted and measured T50) of a data point was more than 2σR, it was removed. When σR was less than 2σM (2.0 °C), the training process stopped. After two training cycles, a σR of 1.9 °C was achieved. After removing only 8 outliers, r for the training set improved from 0.847 to 0.901 (Fig. 5a). When the trained regression parameters (Table 2) were used to predict thermostabilities of proteins in the test data set, r = 0.856, indicating that additive contributions derived from one group of proteins can be used to accurately predict thermostabilities of another group (Fig. 5b). The linear regression model was further confirmed by 10-fold cross-validation.
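The training cycle described above (fit, drop points whose error exceeds 2σR, stop once σR falls below 2σM) can be sketched in Python as below. Several details are assumptions made for illustration: σR is computed as the root-mean-square residual, the regression itself is passed in as a caller-supplied `fit` function, and `mean_fit` is only a toy stand-in for the fragment-based regression.

```python
import math


def residual_sd(predict, points):
    """Root-mean-square prediction error (used here as sigma_R; an assumption)."""
    errors = [t50 - predict(x) for x, t50 in points]
    return math.sqrt(sum(e * e for e in errors) / len(errors))


def train_with_outlier_removal(data, fit, sigma_m=1.0, max_cycles=10):
    """Refit until sigma_R < 2 * sigma_M, removing points with error > 2 * sigma_R."""
    kept = list(data)
    predict = fit(kept)
    sigma_r = residual_sd(predict, kept)
    cycles = 0
    while sigma_r >= 2 * sigma_m and cycles < max_cycles:
        trimmed = [(x, t50) for x, t50 in kept if abs(t50 - predict(x)) <= 2 * sigma_r]
        if not trimmed or len(trimmed) == len(kept):
            break  # nothing sensible left to remove
        kept = trimmed
        predict = fit(kept)
        sigma_r = residual_sd(predict, kept)
        cycles += 1
    return predict, kept, sigma_r


def mean_fit(points):
    """Toy model that predicts the mean T50 regardless of the input features."""
    mean_t50 = sum(t50 for _, t50 in points) / len(points)
    return lambda x: mean_t50
```

Passing the regression as a function keeps the trimming logic independent of how the fragment coefficients are actually estimated.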
[0087] The model parameters obtained from the training set (Table 2) were used to predict that the most thermostable P450 (MTP) chimera in the library would have a T50 of 63.8 °C and fragment composition 21312333. This sequence was constructed, expressed and characterized; its T50 of 64.4 °C, within measurement error of the predicted value, made it 9.5 °C more stable than the most thermostable parent, A1. It was in fact the most stable of all 239 chimeras characterized to date. To further test the model predictions, the T50 values of nineteen additional chimeras from the subset of 620 folded chimeras were measured, seven predicted to be highly thermostable and twelve picked at random (Table 1). Predicted and measured T50 values for all 20 new P450s, including the MTP, correlated extremely well (r = 0.949) (Fig. 2b).
[0088] In the absence of noise, determining the weights for predictor variables in a regression model only requires making as many measurements as there are predictor variables. In the presence of noise, additional measurements will tend to increase the accuracy of the predictions. Subsets of sequences were randomly selected from the 205 chimeras with measured T50 values, and regression models based on these subsets were tested for their ability to predict the T50 values of the remaining chimeras. 35 to 40 measurements were sufficient for accurate predictions of chimera stability, although slight improvements in prediction accuracy could be seen with more data points (Fig. 5c).
[0089] Protein stabilization by additivity of fragment contributions. Linear regression model parameters obtained from 205 T50 measurements (Table 2) were used to predict T50 values for all 6,561 chimeras in the SCHEMA P450 library (Fig. 5d). A significant number (~300) of chimeras are predicted to be more stable than the most stable parent. Those with predicted T50 values greater than or equal to 60 °C (31 in total) were selected for construction and further characterization. Five had been identified previously; the remaining 26 were constructed. All 31 predicted stable chimeras were stable, with T50 between 58.5 °C and 64.4 °C (Table 3). The stability predictions were quite accurate, with a root mean square deviation between the predicted and measured T50 values of 1.5 °C, close to the measurement error (1.0 °C).
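The two accuracy measures quoted in this section, the Pearson correlation coefficient and the root mean square deviation between predicted and measured T50 values, can be computed as in the following Python sketch. The helper names are illustrative, and no values from the tables are reproduced here.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def rmsd(predicted, measured):
    """Root mean square deviation between predicted and measured values (deg C)."""
    squared = [(p - m) ** 2 for p, m in zip(predicted, measured)]
    return math.sqrt(sum(squared) / len(squared))
```

For a single chimera such as the MTP, the deviation is just the absolute difference between its predicted and measured T50.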
Table 3. A stabilized cytochrome P450 heme domain family.
Figure imgf000037_0001
1 predicted to be highly stable by linear regression; 2 predicted to be stable by consensus analysis; 3 activity on 2-phenoxyethanol is reported as total turnover number normalized to the most active parent protein, A1.
[0090] Consensus analysis of folded and unfolded sequences.
Whether the multiple sequence alignment of the folded chimeras could be used to predict stable sequences, in a manner similar to 'consensus stabilization' methods based on natural sequence alignments, was analyzed. With the collection of folded chimeras and the estimates of fragment stability contributions, it can be assessed whether protein fragments that appear more often in the folded chimeras contribute more to protein stability. The stability contribution of each fragment (relative to the least-stable parent, A2) was derived from the regression analysis (Fig. 3a), and the difference in frequency of each fragment (relative to A2) was calculated in the set of 620 folded chimeras (Fig. 3b). These data reveal a strong positive relationship between relative stability contribution and relative fragment frequency (Fig. 3c).
[0091] To assess the predictive value of sequence-element frequencies more quantitatively, the stability of each chimera was calculated using the approach of Steipe et al. Assuming the frequency $f_i$ of a fragment at position i is exponentially related to its stability contribution and that these fragment contributions are additive, a total chimera consensus energy can be calculated from $E_{total} = -\sum_i \ln f_i$. Lower consensus energies (based on the multiple sequence alignment (MSA) of 620 folded chimeras) are associated with higher T50 values (Fig. 4a; Pearson r = 0.34, P < 10^-5). Furthermore, folded proteins tend to have lower consensus energies than unfolded ones (Fig. 4b; Wilcoxon signed rank test P ≪ 10^-9).
[0092] A unique feature of the synthetic protein family is that it includes unfolded sequences, and those unfolded sequences contain useful information. Destabilizing effects were thought to follow the same pattern as stabilizing ones, such that the frequency $u_i$ of a fragment at position i in the unfolded population predicts its destabilizing effect just as its frequency in the folded population predicts its stabilizing effect. Revised chimera consensus energies, $E_{total} = -\sum_i \ln (f_i / u_i)$, computed using fragment frequencies from the 620 folded and 335 unfolded sequences, significantly improved prediction of both stability (Fig. 4c; Pearson r = -0.53, P < 10^-15) and folding status (Fig. 4d). Incorporating the unfolded sequences in this way makes interpretation of the consensus energies straightforward: fragments with energies below zero appear more often in folded than in unfolded proteins and tend to have stabilizing effects. It also helps control for small biases in chimera library construction or sampling, which are the same for both multiple sequence alignments and therefore cancel out in the calculation of consensus energy.
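The consensus-energy calculation can be illustrated with the Python sketch below. It assumes the energy is the negative sum of log fragment frequencies in the folded set, and, for the revised form, the negative sum of log ratios of folded to unfolded frequencies; this functional form is inferred from the description rather than quoted from it. The pseudocount that avoids taking the logarithm of zero, and all names, are additional assumptions.

```python
import math
from collections import Counter


def fragment_frequencies(sequences, n_positions=8, parents="123", pseudocount=1.0):
    """Per-position parent frequencies with additive smoothing (an assumption)."""
    total = len(sequences) + pseudocount * len(parents)
    freqs = []
    for i in range(n_positions):
        counts = Counter(seq[i] for seq in sequences)
        freqs.append({p: (counts[p] + pseudocount) / total for p in parents})
    return freqs


def consensus_energy(chimera, folded_freqs, unfolded_freqs=None):
    """Sum of -ln(f_i), or of -ln(f_i / u_i) when unfolded frequencies are given."""
    energy = 0.0
    for i, parent in enumerate(chimera):
        f = folded_freqs[i][parent]
        u = unfolded_freqs[i][parent] if unfolded_freqs is not None else 1.0
        energy -= math.log(f / u)
    return energy


# Tiny made-up alignments, for illustration only.
example_folded = ["11111111", "11111111", "22222222"]
example_unfolded = ["22222222", "22222222", "11111111"]
```

With the ratio form, a fragment that is more frequent among folded than unfolded sequences contributes a negative term, which matches the interpretation of energies below zero given in the text.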
[0093] The tradeoff between the number of chimera sequences used to calculate the energies and the statistical error associated with ranking chimeras by consensus was calculated. Random subsets containing 5, 10, 15, ..., 300 sequences were drawn from 628 chimeras selected at random from the SCHEMA library (a mixture of folded and unfolded chimeras), and consensus energies were calculated for the three parents and the 205 chimeras with known T50 values. The Spearman rank correlation coefficient (rs) was then calculated between the consensus energy predictions and the measured T50 values. This was repeated 50 times, and the average rs and standard deviation were calculated for each sample size (Fig. 6). The average rank-order correlation coefficient is reliably above 0.4 (with standard deviations less than 0.09) when 200 or more chimera sequences are used.
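The Spearman rank correlation used in this analysis can be computed as in the following Python sketch, which ranks both variables (averaging tied ranks) and then takes the Pearson correlation of the ranks. The function names are illustrative, and the repeated subsampling itself is not shown.

```python
import math


def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1  # mean of ranks pos+1 .. end+1
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks


def spearman_rs(xs, ys):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)
```

Because lower consensus energy is expected to accompany higher T50, the sign of rs depends on whether energies or negated energies are passed in.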
[0094] Protein stabilization by consensus. Having demonstrated that sequence and folding status alone can be used to make nontrivial predictions of relative stability, the most stable chimeras were then predicted. The total consensus energies of all 6,561 chimeras in the library were calculated; the 20 with the lowest consensus energies are listed in Table 4. Due to bias in the library construction, the data set of 955 chimeras has very few representatives of A2 at position 4, preventing accurate assessment of this fragment's thermostability contribution. Three sequences with this fragment were not constructed; the remaining seventeen were constructed. These chimeras are all highly stable, with T50 values between 58.2 °C and 64.4 °C (Table 3). The sequence with consensus fragments at all eight positions (21312333), and therefore the lowest consensus energy, is the "consensus sequence" and should be the most stable chimera. Indeed, the consensus sequence has the highest measured stability among all 239 chimeras with known T50 and is also the MTP predicted by the linear regression model.
Table 4: The 20 chimeras with lowest total consensus energies.
Figure imgf000040_0001
[0095] Stability predictions identify errors in sequencing. The stability predictions were found to be sufficiently accurate to identify both sequencing errors and point mutations in the chimeras. The sequences of P450 chimeras were originally determined in high throughput by DNA probe hybridization, which has a ~3% error rate; small numbers of point mutations during library construction are also expected. Thus approximately 7 incorrect sequence readings are expected for the total set of 239 chimeras studied, and other sequences may have point mutations. From the original set of 190 chimeras whose T50 values were measured and analyzed by linear regression, the 13 chimeras with a prediction error of more than 4 °C were resequenced. Five either had incorrect sequences or contained point mutations (Table 5); these five chimeras were eliminated from the subsequent linear regression analysis used to determine the model parameters in Table 2.
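The screening logic in this paragraph amounts to two small calculations, sketched here in Python: the expected number of misread sequences for a given error rate, and the selection of chimeras whose prediction error exceeds the 4 °C threshold. The record format and function names are hypothetical.

```python
def expected_misreads(n_chimeras, error_rate):
    """Expected number of incorrect sequence readings."""
    return n_chimeras * error_rate


def flag_for_resequencing(records, threshold_c=4.0):
    """Return names whose |predicted - measured| T50 exceeds the threshold.

    records: iterable of (name, predicted_t50, measured_t50) tuples.
    """
    return [
        name
        for name, predicted, measured in records
        if abs(predicted - measured) > threshold_c
    ]
```

For 239 chimeras and a 3% error rate, the first function gives about 7.2, consistent with the "approximately 7" stated in the text.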
Table 5: Sequence Errors and Mutations Identified by Linear Regression
Figure imgf000041_0001
[0096] Further work also showed that both the regression and consensus models do well enough to significantly increase the odds of identifying sequencing errors and mutations. From the initial high-throughput CO difference spectroscopy and probe hybridization sequencing analysis, chimeras 22313333, 21311311, and 22311333 had been labeled unfolded. All three, however, were predicted to be highly stable. Full sequencing showed that the original 22313333 construct was incomplete and missing some fragments. Re-constructed 22313333 was folded and very stable, with T50 = 64.3 °C. Similarly, the original 21311311 construct had an insertion, which after removal generated a chimera with T50 = 61.0 °C. Finally, 22311333 had two point mutations leading to two amino acid substitutions. After these mutations were corrected by site-directed mutagenesis, 22311333 was shown to fold properly, with T50 = 60.1 °C.
[0097] Further characterization of stable cytochrome P450 chimeras. Important measures of stability for protein applications include the ability to withstand denaturing conditions, including elevated temperatures. An enzyme's half-life of (irreversible) inactivation (t1/2) is commonly used to describe stability, and t1/2 often correlates with other stability measures. The t1/2 was measured at 57 °C for 13 stable chimeras and the three parents (Table 6). The results show that the increased stability can have a profound effect on half-life: while the most stable parent, A1, lost its ability to bind CO with a half-life of 15 min at this temperature, chimera 21312231 had a half-life of 1600 min, or more than 108 times greater. The MTP, chimera 21312333, also had a very long half-life, at 1550 min. T50 has also been shown to correlate linearly with the urea concentration required for half-maximal denaturation for variants of CYP102A1. Thus it was expected that the stable P450 chimeras are also more tolerant to inactivation by denaturants.
Table 6. Half-lives of inactivation (t1/2) at 57 °C of three parent proteins and 13 stabilized chimeric proteins.
Figure imgf000042_0001
[0098] All 40 new chimeras were verified by full sequencing to eliminate any possibility that the enhanced thermostabilities were due to mutations, insertions or deletions. The 40 stable chimeras comprise a diverse family of sequences, differing from one another at 14 to 88 amino acid positions (49 on average) (Fig. 7). The distance to the closest parent is as high as 100 amino acids. The 40 chimeras thus comprise a family of properly folded, highly stable cytochrome P450s that exhibit considerable sequence novelty.
[0099] The activities of the stable chimeras were assessed in order to explore the relationship between activity and stability, and specifically to determine whether the increased stability came at the cost of catalytic function. The 40 thermostable chimeras and the three parents were assayed for peroxygenase activity on 2-phenoxyethanol, a substrate that is accepted by all three parent enzymes. All 40 chimeras were active on 2-phenoxyethanol (Table 3). Furthermore, many of them were more active than the most active parent (A1). One of the most stable enzymes (T50 = 64.3 °C), chimera 22313333, was 9 times more active at room temperature than the most active parent. Consistent with observations from the extensive analysis of the substrate specificities of a sampling of the P450 chimeras, changes in sequence can have significant effects on relative activity. Chimera 21313333, for example, was also highly stable, but had less than half the activity of 22313333 on this nonnatural substrate (Table 3).
[00100] The protein expression levels of most of the thermostable chimeras were higher than those of the parent proteins. Most thermostable chimeras expressed well even without the inducing agent isopropyl-beta-D-thiogalactopyranoside (IPTG). Thus the family of 40 chimeric P450s reported here represents a set of well-expressed, highly stable, catalytically active enzymes with novel sequences.
[00101] In a further set of experiments, seventeen proteins, including the three parent heme domains, were chosen for holoenzyme construction by fusion to a wild-type CYP102A reductase domain. For each sequence, four proteins were examined (the heme domain and its fusion to each of the three reductase domains), for a total of 68 constructs. Heme domains contain the first 463 amino acids for A1 and the first 466 amino acids for A2 and A3. The reductase domains start at amino acid E464 for R1, K467 for R2 and D467 for R3 and encode the linker region of the corresponding reductase.
[00102] The chimeric sequences are reported in terms of the parent from which each of the eight sequence blocks is inherited (Table 7). Twelve of the fourteen chimeras were selected because they displayed relatively high activities on substrates in preliminary studies. Chimera 23132233 was chosen because it displayed low peroxygenase activity, while 22312333 was selected because it is more thermostable than any of the parents (T50 = 62 °C). For the constructs studied here, the reductase identity is indicated as the ninth sequence element, with R0 referring to no reductase (i.e., heme domain peroxygenase).
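The construct naming scheme (eight heme-domain blocks followed by a ninth element for the reductase) can be illustrated with a short Python sketch. The hyphenated form such as 21313311-R2 follows the row labels used in the activity tables; the function names and the choice to return None for R0 are illustrative assumptions.

```python
REDUCTASES = ("R0", "R1", "R2", "R3")  # R0 = heme domain alone (no reductase)


def parse_construct(name):
    """Split a label such as '21313311-R2' into (heme blocks, reductase or None)."""
    heme, reductase = name.split("-")
    if len(heme) != 8 or set(heme) - set("123"):
        raise ValueError(f"unexpected heme block string: {heme!r}")
    if reductase not in REDUCTASES:
        raise ValueError(f"unexpected reductase label: {reductase!r}")
    return heme, (None if reductase == "R0" else reductase)


def constructs_for(heme_sequences):
    """List the four constructs (R0 to R3) for each heme sequence."""
    return [f"{heme}-{r}" for heme in heme_sequences for r in REDUCTASES]
```

Applied to seventeen heme sequences, `constructs_for` yields the 68 constructs mentioned in the text.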
[00103] Table 7: Pairwise correlations of normalized activities for monooxygenases (R1, R2, R3) and peroxygenases (R0) of fourteen chimeras and the A1 and A2 parents. R² values are reported. Bold and underlined = 0.7-1.0; underlined = 0.4-0.7; regular = 0.0-0.4.
Heme sequence R0/R1 R0/R2 R0/R3 R1/R2 R1/R3 R2/R3
11111111 0.49 0.00 0.53 0.21 0.66 0.11
22222222 0.70 0.53 0.49 0.75 0.83 0.66
11113311 0.61 0.65 0.49 0.90 0.59 0.78
12112333 0.11 0.04 0.00 0.91 0.11 0.10
21113312 0.14 0.01 0.00 0.73 0.76 0.77
21313111 0.24 0.19 0.05 0.84 0.15 0.39
21313311 0.25 0.28 0.00 0.41 0.01 0.34
21333233 0.90 0.64 0.87 0.72 0.95 0.66
22132231 0.80 0.85 0.56 0.98 0.64 0.60
22213132 0.46 0.08 0.37 0.11 0.01 0.54
22312333 0.01 0.02 0.00 0.69 0.69 0.25
22313233 0.17 0.01 0.08 0.02 0.85 0.07
23132233 0.96 0.89 0.97 0.90 0.99 0.90
32312231 0.14 0.06 0.02 0.07 0.04 0.21
32312333 0.33 0.41 0.02 0.97 0.40 0.33
32313233 0.15 0.44 0.09 0.74 0.60 0.38
[00104] To assess the functional diversity of the chimeric P450s, their activities were measured on the eleven substrates shown in Figure 14. Propranolol (PR), tolbutamide (TB) and chlorzoxazone (CH) are drugs that are metabolized by human P450s. 12-p-nitrophenoxycarboxylic acid (PN) is a long-chain fatty acid surrogate; the parent A1-R1 holoenzyme and the A1 heme domain (with the F87A mutation) both show high activity on this substrate. Previous work showed that A1 has weak peroxygenase activity on some of the aromatic substrates. Aromatic hydroxylation products of all substrates can be detected quantitatively using the 4-aminoantipyrine assay. PN hydroxylation can be monitored spectrophotometrically.
[00105] Peroxygenase activities of the 16 heme domains (all except A3) were determined by assaying for product formation after a fixed reaction time in 96-well plates. Similar assays were used to determine monooxygenase activities for each of the fusion proteins. Final enzyme concentrations were fixed at 1 μM in order to reduce large errors associated with low expression and to allow chimera activities to be compared using absorbance values directly. Protein concentrations were re-assayed in 96-well format and determined to be 0.88 μM ± 13% (SD/average). All samples were prepared and analyzed in triplicate, and outlier data points were eliminated. Tables 8 and 9 report the averages and standard deviations for each of the assays. More than 85% of the data for each substrate was retained, and more than 95% was retained for 6 of the 11 substrates (Table 10).
[00106] Table 8: Average activity in absorbance units for each substrate-construct pair (maximal value for each substrate in bold/italic) .
0.000 0. I011 0 I.013 0.011 0.178
Figure imgf000046_0001
0.032 0.033 0.302
11111111-R2 0.484 0.179 0.157 O.llβ 0.200 0.114 0.146 0.029 0.026 0.029 0.114
11111111-R3 0.0-18 0.000 0.038 0.059 0.030 0.054 0.023 0.019 0.022 0.132
22222222-R0 0.054 0.000 0.000 0.000 0.013 0.009 0.000 0.010 0.014 0.011 0.026
22222222-R1 0.042 0.000 0.038 0.000 0.027 0.031 0.020 0.021 0.016 0.020 0.064
22222222-R2 0.039 0.000 0.045 0.000 0.027 0.083 0.022 0.020 0.016 0.018 0.037
22222222-R3 0.065 0.000 0.040 0.000 0.048 0.031 0.055 0.028 0.024 0.024 0.079
33333333-R3 0.049 0.000 0.033 0.000 0.046 0.026 0.056 0.030 0.022 0.024 0.063
11113311-R0 0.463 0.000 0.046 0.000 0.011 0.031 0.000 0.013 0.012 0.009 0.190
11113311-R1 0.44β 0.235 0.160 0.072 0.135 0.225 0.061 0.029 0.028 0.027 0.364
11113311-R2 0.329 0.145 0.087 0.000 0.091 0.159 0.051 0.030 0.024 0.024 0.277
11113311-R3 O.llβ 0.000 0.033 0.000 0.032 0.026 0.047 0.022 0.017 0.019 0.155
12112333-R0 0.544 0.053 0.048 0.000 0.013 0.038 0.012 0.014 0.013 0.056
12112333-R1 0.513 0.282 0.163 0.091 0.124 0.414 0.038 0.020 0.017 0.019 0.170
12112333-R2 0.511 0.334 0.163 0.116 0.135 0.462 0.063 0.025 0.024 0.025 0.143
12112333-R3 0.129 0.044 0.039 0.000 0.043 0.058 0.030 0.025 0.019 0.022 0.053
21113312-R0 0.522 0.135 0.078 0.000 0.017 0.034 0.017 0.017 0.013 0.069
21113312-R1 0.269 0.107 0.084 0.000 0.053 0.056 0.046 0.033 0.045 0.034 0.065
21113312-R2 0.213 0.085 0.073 0.046 0.066 0.047 0.055 0.033 0.038 0.031 0.050
21113312-R3 0.179 0.063 0.058 0.000 0.049 0.034 0.075 0.034 0.037 0.033 0.031
21313111-R0 0.731 0.105 0.073 0.000 0.016 o.osβ 0.018 0.012 0.013 0.000
21313111-R1 0.617 0.313 0.173 0.167 0.059 0.370 0.044 0.024 0.024 0.024 0.033
21313111-R2 0.560 0.282 0.139 0.152 0.102 0.332 0.079 0.029 0.027 0.025 0.000
21313111-R3 0.767 0.256 0.258 0.207 0.260 0.518 0.137 0.102 0.039 0.076 0.000
21313311-R0 0.365 0.000 0.046 0.000 0.009 0.038 o.oβo 0.012 0.011 0.012
21313311-R1 0.343 0.082 0.109 0.061 0.089 0.202 0.017 0.019 0.015 0.019 0.000
21313311-R2 0.306 0.07« 0.092 0.000 0.086 0.149 0.050 0.030 0.029 0.029 0.000
21313311-R3 0.190 0.109 0.098 0.097 0.115 0.150 0.136 0.072 0.071 0.060 0.000
21333233-R0 0.113 0.000 0.036 0.000 0.020 0.016 0.023 0.025 0.020 0.020 0.000
21333233-R1 0.046 0.000 0.035 0.000 0.029 0.026 0.022 0.024 0.019 0.022 0.000
21333233-R2 0.180 0.104 0.119 0.000 0.070 0.090 0.039 0.036 0.034 0.031 0.062
21333233-R3 0.057 0.000 0.035 0.000 0.036 0.028 0.040 0.026 0.025 0.024 0.000
22132231-R0 0.034 0.000 0.000 0.000 0.009 0.006 0.000 0.005 0.008 0.007 0.000
22132231-R1 0.025 0.000 0.024 0.000 0.023 0.018 0.000 0.018 0.014 0.018 0.000
22132231-R2 0.045 0.000 0.035 0.000 0.026 0.033 0.000 0.018 0.016 0.020 0.000
22132231-R3 0.022 0.000 0.000 0.000 0.016 0.015 0.025 0.014 0.012 0.015 0.000
22213132-R0 0.269 0.051 0.061 0.000 0.010 0.017 0.020 0.010 0.019 0.013 0.000
22213132-R1 0.584 0.217 0.238 0.076 0.081 0.172 0.068 0.031 0.040 0.030 0.133
22213132-R2 0.377 0.289 0.253 0.169 0.153 0.206 0.152 0.122 0.130 0.126 0.000
22213132-R3 0.172 0.070 0.077 0.000 0.038 0.043 0.051 0.026 0.025 0.024 0.015
22312333-R0 0.103 0.000 0.024 0.000 0.008 0.017 0.000 0.009 0.00« 0.009 0.000
22312333-R1 0.080 0.000 0.044 0.000 0.058 0.132 0.082 0.015 0.015 0.018 0.000
22312333-R2 0.172 0.067 0.084 0.049 0.121 0.356 0.117 0.019 0.012 0.017 0.000
22312333-R3 0.034 0.000 0.000 0.000 0.022 0.019 0.093 0.012 0.011 0.015 0.000
22313233-R0 0.185 0.000 0.050 0.000 0.011 0.029 0.003 0.009 0.010 0.000
22313233-R1 0.064 0.000 0.036 0.000 0.033 0.044 0.023 0.021 0.018 0.021 0.000
22313233-R2 0.260 0.204 0.150 0.137 0.089 0.415 0.049 0.022 0.016 0.019 0.000
22313233-R3 0.077 0.041 0.000 0.034 0.031 0.053 0.026 0.020 0.023 0.000
23132233-R0 0.024 0.000 0.000 0.000 0.019 0.019 0.022 0.025 0.021 0.021 0.000
23132233-R1 0.044 0.000 0.043 0.000 0.051 0.037 0.035 0.042 0.039 0.036 0.000
23132233-R2 0.049 0.000 0.055 0.046 0.054 0.044 0.043 0.043 0.041 0.038 0.000
23132233-R3 0.030 0.000 0.031 0.000 0.034 0.024 0.025 0.031 0.026 0.028 0.000
32312231-R0 0.354 0.065 0.085 0.000 0.016 0.067 0.015 0.013 0.018 0.000
32312231-R1 0.067 0.053 0.055 0.000 0.051 0.156 0.063 0.021 0.016 0.021 0.139
32312231-R2 0.204 0.245 0.277 0.154 0.090 0.448 0.063 0.019 0.016 0.020 0.048
32312231-R3 0.064 0.000 0.035 0.000 0.025 0.024 0.044 0.018 0.015 0.018 0.000
32312333-R0 1.101 0.335 0.236 0.076 0.025 0.297 0.067 0.019 0.019 0.019 0.000
32312333-R1 1.030 0.860 0.803 0.MO 0.167 0.664 0.233 0.022 0.048 0.023 0.034
32312333-R2 0.907 0.712 0.653 0.246 0.133 0.538 0.174 0.01a 0.023 0.022 0.044
32312333-R3 0.212 0.189 0.264 0.178 0.066 0.561 0.145 0.023 0.023 0.023 0.000
32313233-R0 0.796 0.383 0.276 0.095 0.036 0.389 0.121 0.009 0.023 0.023 0.000
32313233-R1 0.249 0.471 0.476 0.280 0.163 0.742 0.361 0.044 0.048 0.039 0.018
32313233-R2 0.535 0.566 0.454 0.197 0.153 0.485 0.229 0.029 0.037 0.029 0.017
32313233-R3 0.147 0.123 0.125 o.oai 0.056 OJM 0.153 0.034 0.032 0.031 0.000
[00107] Table 9: Standard deviation/average of absorbance for each substrate-construct pair. Blanks indicate where the average absorbance equals zero.
0.098 0. ϊ052
11111111-R1
Figure imgf000047_0001
0.364 0.054 0.123 0.106 0.076
11111111-R2 0.039 0.020 0.118 0.135 0.041 0.030 0.112 0.113 0.120 0.067 0.159
11111111-R3 0.054 0.031 0.029 0.066 0.189 0.092 0.082 0.118 0.033
22222222-R0 0.089 0.156 0.264 0.261 0.005 0.159 0.125
22222222-R1 0.128 0.074 0.077 0.119 0.255 0.076 0.144 0.144 0.040
22222222-R2 0.071 0.054 0.113 0.0B1 0.251 0.085 0.103 0.099 0.011
22222222-R3 0.053 0.111 0.084 0.070 0.054 0.155 0.123 0.086 0.096
33333333-R3 0.154 0.126 0.017 0.094 0.082 0.110 0.155 0.088 0.063
11113311-R0 0.092 0.097 0.086 0.370 0.117 0.033 0.000 0.05S
11113311-R1 0.045 0.158 0.124 0.092 0.159 0.032 0.622 0.0S4 0.127 0.079 O.0O7
11113311-R2 0.04« 0.018 0.U3 0.035 0.079 0.177 0.130 0.102 0.038 0.012
11113311-R3 0.103 0.093 0.033 0.065 0.110 0.110 0.176 0.022 0.102
12112333-R0 0.012 0.046 0.045 0.159 0.034 0.193 0.114 0.067 0.073
12112333-R1 0.092 0.014 0.114 0.107 0.029 0.104 0.065 0.177 0.137 0.069 0.075
12112333-R2 0.054 0.118 0.094 0.021 0.024 0.081 0.115 0.160 0.019 0.073 0.129
12112333-R3 0.039 0.016 0.057 0.020 0.035 0.064 0.0*2 0.066 0.115 0.133
21113312-R0 0.129 0.076 0.126 0.074 0.176 0.156 0.053 0.156 0.118
21113312-R1 0.065 0.049 0.060 0.045 0.046 0.075 0.156 0.051 0.058 0.250
21113312-R2 0.024 0.190 0.114 0.150 0.064 0.182 0.183 0.182 0.083 0.051 0.379
21113312-R3 0.O94 0.147 O.0B7 0.051 0.044 0.0 -05 0.350 0.121 0.110 0.030
21313111-R0 0.07β 0.Ϊ77 0.142 0.038 0.092 0.138 0.167 0.107
21313111-R1 0.116 0.046 0.019 0.0-8 0.055 0.032 0.239 0.135 0.107 0.083 O.095
21313111-R2 0.012 0.0B4 0.076 0.039 0.037 0.069 0.424 0.083 0.106 0.088
21313111-R3 0.038 0.200 0.092 0.034 0.034 0.107 0.195 0.03S 0.145 0.127
21313311-R0 0.065 0.143 0.162 0.078 0.041 0.163 0.105
21313311-R1 0.026 0.051 0.166 0.178 0.086 0.024 QAAS, 0.029 0.097 0.072
21313311-R2 0.137 0.141 0.169 0.018 0.049 0.020 0.183 0.034 0.049
21313311-R3 0.012 0.053 0.038 0.075 0.010 0.111 0.131 0.148 0.091 0.040
21333233-R0 0.062 0.242 0.110 0.188 0.377 0.159 0.133 0.128
21333233-R1 0.095 0.049 0.038 0.192 0.189 0.085 0.074 0.120
21333233-R2 0.036 0.183 0.135 0.016 0.044 0.026 0.119 0.117 0.062 0.105
21333233-R3 0.043 0.044 0.044 0.182 0.067 0.043 0.082 0.041
22132231-R0 0.002 0.180 0.398 0.677 0.060 0.189
22132231-R1 0.052 0.041 0.051 0.077 0.183 0.166 0.110
22132231-R2 0.063 0.067 0.019 0.092 0.063 0.143 0.073
22132231-R3 o.αso 0.061 0.014 0.137 0.142 0.160 0.044
22213132-R0 0.153 0.128 0.058 0.081 0.147 0.156 0.166 0.073 0.137
22213132-R1 0.077 0.118 0.SO4 0.053 0.066 0.058 0.339 0.098 0.147 0.030 0.C43
22213132-R2 0.065 0.091 0.059 0.075 0.050 0.039 0.070 0.124 0.120 0.005
22213132-R3 0.037 0.061 0.116 0.061 0.052 0.119 0.144 0.111 0.114 0.000
22312333-R0 0.023 0.173 0.181 0.367 0.151 0.132 0.170
22312333-R1 0.103 0.110 0.046 0.068 0.266 0.098 0.035 0.076
22312333-R2 0.060 0.191 0.108 0.050 0.047 0.059 0.042 0.160 0.091 0.016
22312333-R3 0.101 0.077 0.127 0.153 0.121 0.264 0.038
22313233-R0 0.100 0.158 0.080 0.134 0.334 0.246 0.127
22313233-R1 0.055 0.023 0.158 0.034 0.154 0.101 0.079 0.104
22313233-R2 0.076 0.245 0.144 0.062 0.079 0.019 0.118 0.006 0.134 0.106
22313233-R3 0.028 0.005 0.036 0.141 0.155 0.040 0.031 0.104
23132233-R0 0.056 0.013 0.095 0.058 0.092 0.182 0.086
23132233-R1 0.050 0.109 0.045 0.050 0.060 0.012 0.116 0.078
23132233-R2 0.042 0.009 0.178 0.076 0.067 0.078 0.122 0.091 0.113
23132233-R3 0.061 0.052 0.028 0.047 0.146 0.053 0.089 0.093
32312231-R0 0.119 0.119 0.019 0.085 0.034 0.167 0.105 0.177
32312231-R1 0.114 0.046 0.133 0.108 0.074 0.531 0.050 0.102 0.064 0.190
32312231-R2 0.043 0.061 0.062 0.146 0.107 0.058 0.174 0.096 0.191 0.083 0.08S
32312231-R3 0.036 o.on 0.031 0.118 0.054 0.055 0.117 0.051
32312333-R0 0.031 0.074 0.089 0.03« 0.071 0.OtS 0.056 0.137 0.077 0.125
32312333-R1 0.068 0.111 0.045 0.020 0.056 0.113 0.014 0.052 0.102 0.042 0.457
32312333-R2 0.051 0.107 0.035 0.019 0.049 0.097 0.150 0.173 0.023 0.068 0.139
32312333-R3 0.107 0.070 0.079 0.133 0.030 0.075 0.095 0.050 0.073 0.069
32313233-R0 0.030 0.149 0.049 0.120 0.031 0.140 0.050 1.863 0.074 0.067
32313233-R1 0.143 0.105 0.036 0.011 0.063 0.069 0.184 0.147 0.073 0.044 0.C62
32313233-R2 0.064 0.053 0.033 0.020 0.063 0.113 0.102 0.122 0.072 0.035 0.346
32313233-R3 0.064 0.093 0.073 0.034 0.013 0.034 0.005 0.132 0.133 0.039
[00108] Table 10: Summary of error statistics for collected absorbance data sorted by substrates. The percent of the standard deviation divided by the average value and the percentage of data points retained for the analysis are measures of data quality. For each substrate, 65 data points were collected. The Triplicates/Duplicates column indicates how many of those data points were used for the analysis performed here.
Figure imgf000048_0001
[00109] The data allow the chimeras to be compared with respect to their activities on a given substrate, and also with respect to their activity profiles and therefore their specificities. Chimeras having a similar profile form the same relative amounts of products from all substrates and are therefore likely to have similar specificities. To better visualize differences among chimeras, the highest average absorbance value for a given substrate was set to 100%, and all other absorbances for the same substrate, but different chimeras, were normalized to this value. Figure 15 is a heat plot of the complete data set of normalized absorbances, while Figure 19 shows the substrate-activity profiles in the form of bar plots.
[00110] Figure 16A shows the normalized substrate-activity profiles of the A1 and A2 peroxygenases. Both have relatively low or no activity on any of the substrates except PN, where A1 makes about an order of magnitude more product than does A2. Profiles for the reconstituted parent holoenzymes are shown in Figure 16B. Fusion of A1 and R1 generated an enzyme with profile peaks on ethyl 4-phenylbutyrate (PB) and PN. A1-R1 is in fact the second-best-performing enzyme on PB. The A1 peroxygenase activity on this substrate, however, is among the worst, showing that peroxygenase specificity does not necessarily predict that of the monooxygenase. Fusion of A2 to R2 slightly increased activity relative to A2, but did not alter the profile. The A3-R3 holoenzyme exhibits some activity on the drug-like substrates (PR, TB, CH) as well as PN and PB.
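The per-substrate normalization described in paragraph [00109] can be sketched as follows. This is a minimal illustration, not the procedure actually used to produce Figure 15: the function name `normalize_by_substrate`, the nested-dictionary layout and the example absorbances are all assumptions made for the sketch.

```python
def normalize_by_substrate(absorbance):
    """Scale each substrate so that its most active enzyme reads 100%.

    absorbance: dict mapping enzyme name -> dict mapping substrate -> average
    absorbance (hypothetical layout chosen for this sketch).
    """
    substrates = {s for profile in absorbance.values() for s in profile}
    # Highest average absorbance observed for each substrate across all enzymes.
    column_max = {
        s: max(profile.get(s, 0.0) for profile in absorbance.values())
        for s in substrates
    }
    return {
        enzyme: {
            s: (100.0 * value / column_max[s]) if column_max[s] > 0 else 0.0
            for s, value in profile.items()
        }
        for enzyme, profile in absorbance.items()
    }


# Illustrative values only; they are not taken from the absorbance tables.
example = {
    "enzyme_a": {"PN": 0.20, "PB": 0.50},
    "enzyme_b": {"PN": 0.80, "PB": 0.25},
}
normalized = normalize_by_substrate(example)
```

Because every substrate is rescaled independently, an enzyme's normalized profile reflects its rank relative to the other enzymes on each substrate rather than absolute product formation.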
[00111] Fusion of the A1 and A2 heme domains to other reductase domains yields holoenzymes that are active on some substrates (Figures 16C and 16D). The A2 fusions have relatively low activities. A1 fusions with R1 and R2, on the other hand, created highly active enzymes with different specificities: the A1-R1 profile has peaks on PN and PB, while that of A1-R2 has peaks on PB, phenoxyethanol (PE) and zoxazolamine (ZX). The A1-R3 fusion is less active on nearly all substrates.
[00112] The 14 chimeric heme domains generated 56 chimeric peroxygenases and monooxygenases. Nearly all the chimera fusions outperformed even the best parent holoenzyme, and chimeric peroxygenases consistently outperformed the parent peroxygenases (Figure 15 and Figure 19). The best enzyme for each substrate is listed in Table 11. All the best enzymes are chimeras. Most of the best enzymes are also holoenzymes; only PE has a peroxygenase as the best catalyst.
[00113] Table 11: Summary of most active chimeric proteins for each substrate, and pairwise correlation matrix of the activities on all substrates. R² values are reported. Bold and underlined = 0.7-1.0; underlined = 0.4-0.7; regular = 0.0-0.4.
Protein      Substrate  PE    EB    PA    PT    PB    DP    ZX    PR    CH    TB    PN
32312231-R0  PE         N.A.  0.61  0.45  0.37  0.18  0.35  0.15  0.01  0.05  0.02  0.01
32312231-R1  EB               N.A.  0.92  0.80  0.41  0.73  0.56  0.04  0.13  0.06  0.00
32312231-R1  PA                     N.A.  0.81  0.39  0.71  0.62  0.04  0.14  0.06  0.00
32312231-R1  PT                           N.A.  0.56  0.85  0.66  0.14  0.24  0.16  0.00
21313111-R3  PB                                 N.A.  0.49  0.49  0.36  0.37  0.33  0.08
32313233-R1  DP                                       N.A.  0.58  0.05  0.10  0.06  0.00
32313233-R1  ZX                                             N.A.  0.18  0.29  0.21  0.00
22213132-R2  PR                                                   N.A.  0.91  0.95  0.00
22213132-R2  CH                                                         N.A.  0.94  0.00
22213132-R2  TB                                                               N.A.  0.00
11113311-R1  PN                                                                     N.A.
[00114] The data show that there exists a discrete set of characteristic substrate-activity profiles to which each chimera can be uniquely assigned. A k-means clustering analysis was applied to the normalized absorbance data to better understand the functional diversity. K-means clustering, a statistical algorithm that partitions data into clusters based on data similarity, has been used to group mutants exhibiting similar substrate specificities, protein fragments (4-7 residues) of similar structure, and interacting nucleotide pairs with similar 3D structures. For this analysis, the normalized data were used to ensure that each of the 11 dimensions is given equal weight by the clustering algorithm. The clustering was performed over values of k (number of clusters) ranging from k=2 to k=8. The highest silhouette value was observed at k=5.
[00115] The cluster composition for k=5 is depicted in Figure 17. Cluster 1, consisting of chimeras 32312333-R1/R2 and 32313233-R1/R2 (Figure 17B), is characterized by low relative activities on CH, TB, PR and PN and high relative activities on all other substrates. In fact, two of these chimeras are the best enzymes on all the remaining substrates except PB and PE.
[00116] Cluster 2 is made up of 22213132-R2, 21313111-R3 and 21313311-R3, which are the most active enzymes on TB, CH and PR (Figure 17C). Cluster 2 enzymes are entirely inactive on PN and show low activity on most of the substrates that cluster 1 enzymes accept (PE, DP, PA and EB). Relative activities on the remaining substrates (i.e. PB, ZX and PT) are moderate (although lower than those of cluster 1 chimeras). An exception is 21313111-R3, which is the best enzyme for PB and also fairly good on PE and DP.
[00117] Cluster 3 contains chimeras A1-R1/R2, 12112333-R1/R2, 11113311-R1/R2 and 22213132-R1 (Figure 17D). The A1-like sequences are characterized by high relative activity on PN (on which 11113311-R1/R2 and A1-R1 are the three top-ranking enzymes), moderate to high relative activity on PB and moderate activity on PE.
[00118] Cluster 4 contains 21313111-R1/R2, 22313233-R2, 22312333-R2, 32312231-R2, 32312333-R0, 32312333-R3, 32313233-R0, and 32313233-R3 (Figure 17E). This cluster is characterized by having the highest relative activity on PE, in addition to moderate activities on PT, DP and ZX. The remaining chimeras appear in a fifth cluster with relatively low activity on everything except PN and PE (Figure 17F). This cluster contains parental sequences A1-R0, A1-R3, A2-R0, A2-R1/R2/R3 and A3-R3. Native sequences are thus found in two of the clusters. The remaining clusters (1, 2 and 4) are made up of highly active chimeras that have acquired novel profiles.
[00119] The partition created by the clustering algorithm shows that the presence and identity of the reductase can alter the activity profile and thus the specificity of a heme domain sequence. For example, the R1 and R2 fusions of 32312333 and 32313233 appear in cluster 1, whereas their R0 and R3 counterparts are in cluster 4. Sequences 22213132 and 21313111 also behave differently when fused to different reductases. 22213132-R2, for example, displays pronounced peaks on substrates TB, CH and PR that are not present in the corresponding peroxygenase and R1/R3 profiles (Figure 19E) and is thus the only member with this heme domain sequence appearing in cluster 2. 21313111-R3 and 21313111-R2/R1 have nearly opposite profiles (Figure 19J) and consequently appear in different clusters. Thus the best choice of reductase depends on both the substrate and the chimera sequence.
[00120] As shown in Figure 15, each of the 14 chimeric heme domains can be fused to a parental reductase to generate a functional monooxygenase. The resulting monooxygenases are generally more active under these conditions than the corresponding peroxygenases (see Figure 19). The R1 and R2 fusions tend to outperform R3 fusions. While altering reductase identity never completely deactivates the protein, it does affect specificity in some cases. To quantify the differences between the profiles of the four different enzymes that can be made from a given chimera, the pairwise linear correlation coefficients (R²) of the R0/R1, R0/R2, R0/R3, R1/R2, R1/R3 and R2/R3 profiles were determined for each heme domain sequence (with the exception of A3). The results are shown in Table 7. High correlations represent enzyme pairs with similar specificities. The results show that peroxygenase and monooxygenase specificities are usually different, that R1/R2 fusions of a chimera are often very similar (five pairs have R² values above 0.9), and that the R1 and R2 fusions are less similar to the R3 enzymes.
[00121] To understand whether a chimera's activity on one substrate predicts activity on another, the pairwise correlations of the absorbances of all the possible substrate pairs were determined (Table 11). Correlations were used to identify substrates having similar chimera profiles. This analysis led to the identification of three substrate clusters characterized by high values of the correlation coefficients. Members of different clusters are poorly correlated. DP, PT, PA and EB all exhibit high correlations with each other (R² = 0.71-0.92, see Figure 20A for an example) and were grouped into the core of substrate group A. Group B consists of CH, TB and PR. The categorization of this group is clearly defined: its members show high correlations with each other (R² above 0.9, see Figure 20B for an example), but correlate very poorly with the other substrates (R² = 0.01-0.37). PN does not correlate significantly with any of the other substrates tested (R² = 0.00-0.08) and is its own substrate group C.
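The pairwise substrate correlations of paragraph [00121] can be illustrated with a short sketch that computes the squared Pearson correlation coefficient for every substrate pair. The function names and the example absorbances are hypothetical; the source does not state which software was used for this calculation.

```python
def r_squared(x, y):
    """Squared Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant profile carries no correlation information
    return sxy * sxy / (sxx * syy)


def pairwise_r_squared(profiles):
    """profiles: dict substrate -> list of absorbances, one per enzyme,
    with the enzymes in the same order for every substrate."""
    names = sorted(profiles)
    return {
        (a, b): r_squared(profiles[a], profiles[b])
        for i, a in enumerate(names)
        for b in names[i + 1:]
    }


# Illustrative values only; they are not taken from the absorbance tables.
profiles = {
    "CH": [0.1, 0.2, 0.3, 0.4],
    "TB": [0.2, 0.4, 0.6, 0.8],
    "PN": [0.4, 0.1, 0.4, 0.1],
}
matrix = pairwise_r_squared(profiles)
```

In this toy input the CH and TB columns are proportional and so correlate perfectly, while PN correlates weakly with both; this mirrors the kind of grouping described in the text but is not a reproduction of the reported values.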
[00122] ZX, PB and PE show moderate correlation to members of the group A core (R² = 0.56-0.66, 0.39-0.56 and 0.35-0.61, respectively). These substrates are considered loosely associated with group A since they do not belong to any other group due to poor correlation with each other and the remaining substrates.
[00123] There exists a correspondence between the chimera clusters and the substrate groups. Group A core substrates have cluster 1 chimeras as their top-performing enzymes, whereas substrates of group B have cluster 2 chimeras as their top-performing enzymes. The top catalysts for group C are three of the cluster 3 chimeras. Members of a substrate group thus share the same best-performing enzymes.
[00124] The folded P450s contain an average of 72 mutations from their closest parent. A large fraction of the folded P450s are catalytically active, as shown by their activity on a single substrate (PN). Eleven additional substrates were selected for characterization of 14 of the active chimeric heme domains and their fusions with each of the three parental reductase domains. Many of the chimeras were shown to be significantly more active than the parents. In fact, for every single substrate, including one widely used to assay CYP102A1 (PN), the top-performing enzyme is a chimera. Recombining mutations already accepted in natural homologs thus leads to a family of highly active enzymes that accept a broader range of substrates.
[00125] The chimeric enzymes exhibit distinct specificities and can be partitioned into clusters based on their specificity. One cluster contains parent A1-R1 and all chimeras with A1-like profiles. Another cluster contains low activity chimeras and includes all remaining parental sequences. The remaining clusters represent highly active chimeras that have acquired new specificities. Members of a cluster are likely to exhibit common structural, physical or chemical features that account for their similar catalytic properties. If the library is large enough, statistical techniques can be used to determine how sequence elements relate to the observed profiles. In particular, if there are sufficient numbers of chimeras in each cluster, then powerful tools such as logistic regression or machine learning can be used to predict which cluster an untested sequence belongs to. This type of analysis would enable the prediction of substrate profiles of untested chimeras based on sequence information alone. The functionally diverse enzymes generated by SCHEMA-guided recombination can therefore be used to probe the sequence and structural basis of enzyme specificity. For example, the chimeras in the library with parent A1 in blocks 1, 3 and 4 are all among the best enzymes for PN. These same enzymes display low relative activity on all the remaining substrates except for PB. This suggests that having parent A1 sequence at one or more of these blocks improves PN activity and specificity.
[00126] Substrates were also partitioned into groups based on the linear correlations of substrate pairs. An enzyme active on one member of a substrate group is therefore likely to be active on another member of the same group. One group consists of the drug-like substrates TB, PR and CH (Figure 14). Another consists of PT, PA, EB and DP. If these correlations hold for the larger library of chimeric enzymes, we should be able to predict with reasonable accuracy the relative activities of a chimera on all the substrates in a group by testing activity on only one. This type of analysis can be expanded to a larger collection of substrates to identify additional groups or additional members of an existing group.
[00127] The observed correspondence between the three substrate groups and chimera clusters 1, 2 and 3 illustrates that each group can be associated with a cluster made up of or containing the top-performing enzymes for the substrates in that group. Some degree of correspondence can be expected, given how the partitions were constructed. However, because intra-group correlations are not one and inter-group correlations are not zero, the correspondence is not perfect. For this reason there exist chimeras whose profiles exhibit peaks on only certain members of a group (cluster 4) and others that exhibit peaks on members of different groups (cluster 2 and 3 chimeras). Cluster 4 chimeras have peaks on only certain members of group A and are thus responsible for the lower correlations among group A substrates. Some cluster 2 and cluster 3 chimeras exhibit peaks on PB (on the edge of group A) as well as on group B and C substrates, respectively. In fact, although PB correlates mostly with group A core substrates, it shares its top-performing enzymes with groups B and C and thus displays a hybrid behavior. This is why PB correlates less with group A than core substrates do and why it has higher correlations with group B and C members than any other substrate not belonging to these groups.
[00128] Because chimeras displaying high relative activity have more weight in determining the correlation coefficients, the top enzymes for one member of a substrate group will usually be among the top ones for all members of that group. The clearer the definition of the substrate groups, the more likely this is to hold. Given the many important applications of P450s in medicine and biocatalysis, and the lack of high-throughput screens for many compounds of interest, an approach to screening that is based on carefully chosen 'surrogate' substrates could significantly enhance our ability to identify useful catalysts. Clearly, any member of a well-defined substrate group can be a surrogate for other members of that group. Further analysis may also help to identify the critical physical, structural or chemical properties of substrates belonging to a known group. This will make it possible to predict which chimeras will be most active on a new, untested substrate.
[00129] Swapping reductase domains consistently yields active monooxygenases and conserves key P450-reductase FMN domain interactions. Reconstitution of the chimeric CYP102A heme domains with the three parental reductases generated functional monooxygenases in all cases. Although their specificities were often different (particularly when fused to R3), fusion to a reductase was never detrimental to activity, and swapping the reductase never completely inactivated the enzyme (Figure 19). Subtle changes in the structure and coupling behavior that affect total product formation may account for specificity differences. The fact that the parental reductase domains are accepted without loss of function, however, suggests that key domain-domain interactions are conserved upon reductase swapping.
[00130] Although a complete crystal structure of a CYP102A holoenzyme is not available, a partial CYP102A1 structure (1BVY) includes the interface between the heme and the reductase FMN domains. Only a few direct contacts, including one hydrogen bond, one salt bridge and several water-mediated contacts, make up this A1-R1 interface. Parental sequences were aligned using ClustalW, and the interactions depicted in the 1BVY crystal structure were found to involve amino acids that are mostly conserved in the parent proteins. Figure 18 displays the interface between the heme and reductase domains of CYP102A1 and highlights the amino acids involved in key interactions. The salt bridge is formed between reductase residue E494 and heme domain residue H100, both of which are conserved in all three parents. Thus this key interaction would be retained upon reductase swapping that conserves the orientation of the two domains.
[00131] The direct hydrogen bond occurs between the reductase backbone carbonyl of N573 and the side-chain hydroxyl group of heme domain residue S383. N573 is only conserved in R1 and R2, but because the interaction involves the backbone oxygen, the reductase side of the interface is not affected by changes in the side-chain identity. S383 is only conserved in parents A1 and A3. However, the corresponding residue in A2, D385, may also be capable of forming the hydrogen bond. This interaction may therefore be present in all the chimeras.
[00132] There are two water-mediated hydrogen bonds between the hydrogen of the indole nitrogen of reductase residue W574 and the backbone carbonyls of S383 and I385. W574 was earlier shown to be useful for electron transfer from the FMN to the heme and is conserved in R1, R2 and R3. S383 and I385 are conserved in A1 and A3 but not A2, where the corresponding residues are D385 and V387. Because the hydrogen bonds involve the backbone oxygens of these residues, these interactions may be retained upon domain substitution. Also, all possible pairwise interactions that can be formed at these positions by domain swapping already exist in at least one of the parental sequences and are thus likely not to be destabilizing. Finally, the substitutions that do occur are conservative, replacing a hydrophilic residue with another hydrophilic residue and a hydrophobic residue with another hydrophobic residue. The third water-mediated hydrogen bond, between the side chains of reductase residue R498 and heme domain residue E244 (block 5), is conserved in A1-R1 and A2-R2 but not A3-R3, where the corresponding residues are G501 and V246. A3-R3 thus cannot form this interaction, nor can any chimera that inherits A3 sequence at block 5 and/or is fused to R3.
[00133] The direct hydrogen bond, two of the three water-mediated hydrogen bonds and the salt bridge are all conserved in the chimera-reductase fusions. The third water-mediated hydrogen bond is conserved only in R1/R2 fusions that do not have parent A3 in block 5 (8 out of 17 sequences). Thus the activities of the reconstituted monooxygenases are consistent with their sequences, the domain-domain interactions identified in the 1BVY structure and the assumption that the overall structures and orientations are conserved upon reductase swapping. These results demonstrate the highly conservative nature of mutation by recombination of protein domains: as long as key interactions are retained, the remaining sequences can vary extensively.
[00134] Chimeras are presented herein as an 8-digit number, where each digit indicates the parent from which each of the eight blocks was inherited. The identity of the reductase is indicated by R0 (for no reductase) or R1, R2 or R3 for the CYP102A1, A2, or A3 reductases, respectively.
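The naming convention in paragraph [00134] lends itself to a small parser. The sketch below is illustrative only: the `Construct` type and the function names are hypothetical, and the parser deliberately handles only the 8-digit form (for example 32312231-R1), not parent shorthand such as A1-R1.

```python
import re
from typing import NamedTuple, Tuple


class Construct(NamedTuple):
    blocks: Tuple[int, ...]  # parent (1, 2 or 3) of each of the eight blocks
    reductase: int           # 0 = no reductase (R0); 1, 2, 3 = R1, R2, R3


# Eight digits drawn from 1-3, a hyphen, then R0-R3.
_NAME = re.compile(r"([123]{8})-R([0-3])")


def parse_construct(name: str) -> Construct:
    """Split a name such as '32312231-R1' into block parents and reductase."""
    match = _NAME.fullmatch(name.strip())
    if match is None:
        raise ValueError(f"not an 8-digit chimera name: {name!r}")
    digits, reductase = match.groups()
    return Construct(tuple(int(d) for d in digits), int(reductase))


def is_peroxygenase(construct: Construct) -> bool:
    """A heme domain without a reductase (R0) is assayed as a peroxygenase."""
    return construct.reductase == 0


example = parse_construct("32312231-R1")
```

Strict validation is a deliberate choice here: OCR-damaged labels such as "3231223I-R0" are rejected rather than silently misread.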
[00135] To construct the holoenzymes, the chimeric heme domains were fused to each of the three wild-type reductase domains after amino acid residue 463 when the last block originates from CYP102A1 and after residue 466 when it originates from CYP102A2 or CYP102A3. The holoenzymes were constructed by overlap extension PCR and/or ligation and cloned into the pCWori expression vector. All constructs were confirmed by sequencing. Table 12 provides exemplary sequences associated with the chimeras described herein.
[00136] TABLE 12:
Figure imgf000057_0001
Figure imgf000058_0001
[00137] Proteins were expressed in E. coli as described previously and purified by anion exchange on Toyopearl SuperQ-650M from Tosoh. After binding of the proteins, the matrix was washed with a 30 mM NaCl buffer, and proteins were eluted with 150 mM NaCl (all buffers used for purification contained 25 mM phosphate buffer pH 8.0). Proteins were rebuffered into 100 mM phosphate buffer and concentrated using 30,000 MWCO Amicon Ultra centrifugal filter devices (Millipore). Proteins were stored at -20°C in 50% glycerol.
[00138] Protein concentration was measured by CO absorption at 450 nm. A protein concentration of 1 μM was chosen for the activity assays. Protein concentrations were re-assayed in 96-well format and determined to be 0.88 μM +/- 13% (SD/average).
[00139] Proteins were assayed for mono- or peroxygenase activities in 96-well plates. Heme domains were assayed for peroxygenase activity using hydrogen peroxide as the oxygen and electron source. Reductase domain fusion proteins were assayed for monooxygenase activity, using molecular oxygen and NADPH. Reactions were carried out in 100 mM EPPS buffer pH 8, 1% acetone, 1% DMSO, 1 μM protein in 120 μl volumes. Substrate concentrations depended on their solubility under the assay conditions. Final concentrations were: 2-phenoxyethanol (PE), 100 mM; ethoxybenzene (EB), 50 mM; ethyl phenoxyacetate (PA), 10 mM; 3-phenoxytoluene (PT), 10 mM; ethyl 4-phenylbutyrate (PB), 5 mM; diphenyl ether (DP), 10 mM; zoxazolamine (ZX), 5 mM; propranolol (PR), 4 mM; chlorzoxazone (CH), 5 mM; tolbutamide (TB), 10 mM; 12-p-nitrophenoxycarboxylic acid (PN), 0.25 mM. The reaction was initiated by the addition of NADPH or hydrogen peroxide stock solution (final concentration of 500 μM NADPH or 2 mM hydrogen peroxide) and mixed briefly. After 2 hrs at room temperature, reactions with substrates 1-10 were quenched with 120 μl of 0.1 M NaOH and 4 M urea. Thirty-six μl of 0.6% (w/v) 4-aminoantipyrine (4-AAP) was then added. The 96-well plate reader was zeroed at 500 nm and 36 μl of 0.6% (w/v) potassium persulfate was added. After 20 min, the absorbance at 500 nm was read [21]. Reactions on PN were monitored directly at 410 nm by the absorption of accumulated 4-nitrophenol. All experiments were performed in triplicate, and the absorption data were averaged.
[00140] The background absorbance (BG) was subtracted from the raw data. BG reactions contained buffer, cofactor and substrate in the absence of protein sample and were done in triplicate. All absorbance measurements were done once on three separate samples (triplicate sampling). Data points with a SD/average ≥ 20% that did not lie within the average ± 1.1*SD were eliminated. 1.1*SD was chosen so that for each substrate at least 85% of the points were retained. This never resulted in the elimination of more than one point from each triplicate set of measurements. All points with an average absorbance < BG were set to zero, because they are assumed to belong to inactive proteins. The absorbance matrix thus obtained for all 68 proteins on all 11 substrates is displayed in Supplemental Table 2. The SD/average matrix is displayed in Supplemental Table 3. SD/average was calculated ignoring values for inactive enzymes.
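The replicate filtering of paragraph [00140] can be sketched as below. Several details are assumptions, because the text does not specify them: the sample standard deviation (n-1 denominator) is used, SD/average is evaluated on background-subtracted values, and "average absorbance < BG" is read as a non-positive background-corrected average. The function name and example readings are hypothetical.

```python
from statistics import mean, stdev


def clean_triplicate(raw, background, cv_cutoff=0.20, sd_window=1.1):
    """Return (cleaned average absorbance, number of replicates kept).

    raw: replicate absorbance readings for one substrate-construct pair.
    background: average absorbance of the protein-free BG reactions.
    """
    corrected = [value - background for value in raw]
    avg = mean(corrected)
    if avg <= 0:
        # Assumed reading of "average absorbance < BG": treat as inactive.
        return 0.0, len(corrected)
    sd = stdev(corrected)  # sample SD; an assumption, see the lead-in
    kept = corrected
    if sd / avg >= cv_cutoff:
        # Noisy set: drop points outside average +/- 1.1*SD.
        kept = [v for v in corrected if abs(v - avg) <= sd_window * sd]
    return max(mean(kept), 0.0), len(kept)


# Illustrative readings only; they are not taken from the patent tables.
noisy = clean_triplicate([0.50, 0.52, 0.90], background=0.05)
tight = clean_triplicate([0.30, 0.31, 0.32], background=0.05)
inactive = clean_triplicate([0.04, 0.05, 0.03], background=0.05)
```

With a sample standard deviation and three replicates, at most one point can lie more than 1.1*SD from the mean, which is consistent with the statement that no more than one point was ever removed from a triplicate.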
[00141] K-means clustering is a partitioning method that divides a set of observations into k mutually exclusive clusters. K-means treats each data point as an object having a location in m-dimensional space (m=11 in this analysis) [23]. It then finds a partition such that members of the same cluster are as close as possible to each other and as far as possible from members of other clusters. For this reason, a measure of the meaningfulness of a partition is given by the silhouette value
$$ s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}} $$
where a(i) is the average distance of point i to all other points in its cluster and b(i) is the average distance of point i to all points in the closest cluster. It is evident that -1 ≤ s ≤ 1 and that the quality of the clustering increases as s → 1 [42]. Distances are measured by the square of the Euclidean distance.
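As an illustration of the silhouette criterion, the sketch below computes s(i) = (b(i) - a(i)) / max{a(i), b(i)} for a given partition using squared Euclidean distances. This is the standard form of the silhouette value and is assumed here to match the definition in the text; the function names and the toy points are hypothetical, and the clustering step itself is not shown.

```python
def squared_distance(p, q):
    """Square of the Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def silhouette_values(points, labels):
    """Silhouette value s(i) for every point, given one cluster label per point."""
    clusters = {}
    for index, label in enumerate(labels):
        clusters.setdefault(label, []).append(index)
    if len(clusters) < 2:
        raise ValueError("silhouette values need at least two clusters")

    values = []
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            values.append(0.0)  # common convention for singleton clusters
            continue
        # a(i): average distance to the other members of the same cluster.
        a = sum(squared_distance(points[i], points[j]) for j in own) / len(own)
        # b(i): average distance to the members of the closest other cluster.
        b = min(
            sum(squared_distance(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denominator = max(a, b)
        values.append((b - a) / denominator if denominator > 0 else 0.0)
    return values


# Toy data: two well-separated pairs of points (not data from the patent).
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = silhouette_values(points, [0, 0, 1, 1])
bad = silhouette_values(points, [0, 1, 0, 1])
```

Averaging the per-point values over all points gives a single score per value of k, which is one way a choice such as k=5 could be compared against the other values tried.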
1. DePristo, M. A., Weinreich, D. M. & Hartl, D. L. Missense meanderings in sequence space: A biophysical view of protein evolution. Nat. Rev. Genet. 6, 678-687 (2005).
2. Yue, P., Li, Z. L. & Moult, J. Loss of protein structure stability as a major causative factor in monogenic disease. J. Mol. Biol. 353, 459-473 (2005).
3. Bloom, J. D. et al. Thermodynamic prediction of protein neutrality. Proc. Nat. Acad. Sci. USA 102, 606-611 (2005).
4. Bloom, J. D., Labthavikul, S. T., Otey, C. R. & Arnold, F. H. Protein stability promotes evolvability. Proc. Nat. Acad. Sci. USA 103, 5869-5874 (2006).
5. Drummond, D. A., Bloom, J. D., Adami, C., Wilke, C. O. & Arnold, F. H. Why highly expressed proteins evolve slowly. Proc. Nat. Acad. Sci. USA 102, 14338-14343 (2005).
6. Niehaus, F., Bertoldo, C., Kahler, M. & Antranikian, G. Extremophiles as a source of novel enzymes for industrial application. Appl. Microbiol. Biot. 51, 711-729 (1999).
7. Zeikus, J. G., Vieille, C. & Savchenko, A. Thermozymes: biotechnology and structure-function relationships. Extremophiles 2, 179-183 (1998).
8. Guengerich, F. P. Cytochrome P450 enzymes in the generation of commercial products. Nat. Rev. Drug Discov. 1, 359-366 (2002).
9. Landwehr, M. et al. Enantioselective alpha-hydroxylation of 2-arylacetic acid derivatives and buspirone catalyzed by engineered cytochrome P450BM-3. J. Am. Chem. Soc. 128, 6058-6059 (2006).
10. Otey, C. R., Bandara, G., Lalonde, J., Takahashi, K. & Arnold, F. H. Preparation of human metabolites of propranolol using laboratory-evolved bacterial cytochromes P450. Biotechnol. Bioeng. 93, 494-499 (2006).
11. Urlacher, V. B. & Eiben, S. Cytochrome P450 monooxygenases: perspectives for synthetic application. Trends Biotechnol. 24, 324-330 (2006).
12. van Vugt-Lussenburg, B. M. A. et al. Heterotropic and homotropic cooperativity by a drug-metabolising mutant of cytochrome P450BM3. Biochem. Bioph. Res. Comm. 346, 810-818 (2006).
13. Otey, C. R. et al. Structure-guided recombination creates an artificial family of cytochromes P450. PLoS Biol. 4, e112 (2006).
14. Dietterich, T. G. Approximate statistical tests for comparing supervised classification learning algorithms. Neural Comput. 10, 1895-1923 (1998).
15. Fox, R. et al. Optimizing the search algorithm for protein engineering by directed evolution. Protein Eng. 16, 589-597 (2003).
16. Amin, N. et al. Construction of stabilized proteins by combinatorial consensus mutagenesis. Protein Eng. Des. Sel. 17, 787-793 (2004).
17. Lehmann, M. et al. The consensus concept for thermostability engineering of proteins: further proof of concept. Protein Eng. 15, 403-411 (2002).
18. Steipe, B., Schiller, B., Pluckthun, A. & Steinbacher, S. Sequence statistics reliably predict stabilizing mutations in a protein domain. J. Mol. Biol. 240, 188-192 (1994).
19. Joern, J. M., Meinhold, P. & Arnold, F. H. Analysis of shuffled gene libraries. J. Mol. Biol. 316, 643-656 (2002).
20. Johannes, T. W., Woodyer, R. D. & Zhao, H. M. Directed evolution of a thermostable phosphite dehydrogenase for NAD(P)H regeneration. Appl. Environ. Microb. 71, 5728-5734 (2005).
21. Landwehr, M., Carbone, M., Otey, C. R., Li, Y. & Arnold, F. H. Diversification of catalytic function in a synthetic family of chimeric cytochrome P450s. Chem. Biol. In press (2007).
22. Somero, G. N. Proteins and temperature. Annu. Rev. Physiol. 57, 43-68 (1995).
23. Arnold, F. H., Wintrode, P. L., Miyazaki, K. & Gershenson, A. How enzymes adapt: lessons from directed evolution. Trends Biochem. Sci. 26, 100-106 (2001).
24. Taverna, D. M. & Goldstein, R. A. Why are proteins marginally stable? Proteins 46, 105-109 (2002).
25. Bloom, J. D., Raval, A. & Wilke, C. O. Thermodynamics of neutral protein evolution. Genetics 175, 255-266 (2007).
26. Serrano, L., Day, A. G. & Fersht, A. R. Step-wise mutation of barnase to binase - a procedure for engineering increased stability of proteins and an experimental-analysis of the evolution of protein stability. J. Mol. Biol. 233, 305-312 (1993).
27. Giver, L., Gershenson, A., Freskgard, P. O. & Arnold, F. H. Directed evolution of a thermostable esterase. Proc. Nat. Acad. Sci. USA 95, 12809-12813 (1998).

Claims

What is claimed is:
1. A polypeptide comprising sequences from CYP102A1, CYP102A2 or CYP102A3 and having the general structure from N-terminus to C-terminus:
[segment 1]-[segment 2]-[segment 3]-[segment 4]-[segment 5]-[segment 6]-[segment 7]-[segment 8]
wherein:
segment 1 is from about amino acid residue 1 to about x1 of SEQ ID NO:1 ("1"), SEQ ID NO:2 ("2") or SEQ ID NO:3 ("3");
segment 2 is from about amino acid residue x1 to about x2 of SEQ ID NO:1 ("1"), SEQ ID NO:2 ("2") or SEQ ID NO:3 ("3");
segment 3 is from about amino acid residue x2 to about x3 of SEQ ID NO:1 ("1"), SEQ ID NO:2 ("2") or SEQ ID NO:3 ("3");
segment 4 is from about amino acid residue x3 to about x4 of SEQ ID NO:1 ("1"), SEQ ID NO:2 ("2") or SEQ ID NO:3 ("3");
segment 5 is from about amino acid residue x4 to about x5 of SEQ ID NO:1 ("1"), SEQ ID NO:2 ("2") or SEQ ID NO:3 ("3");
segment 6 is from about amino acid residue x5 to about x6 of SEQ ID NO:1 ("1"), SEQ ID NO:2 ("2") or SEQ ID NO:3 ("3");
segment 7 is from about amino acid residue x6 to about x7 of SEQ ID NO:1 ("1"), SEQ ID NO:2 ("2") or SEQ ID NO:3 ("3"); and
segment 8 is from about amino acid residue x7 to about x8 of SEQ ID NO:1 ("1"), SEQ ID NO:2 ("2") or SEQ ID NO:3 ("3");
wherein:
x1 is residue 62, 63, 64, 65 or 66 of SEQ ID NO:1, or residue 63, 64, 65, 66 or 67 of SEQ ID NO:2 or SEQ ID NO:3;
x2 is residue 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131 or 132 of SEQ ID NO:1, or residue 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132 or 133 of SEQ ID NO:2 or SEQ ID NO:3;
x3 is residue 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176 or 177 of SEQ ID NO:1, or residue 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177 or 178 of SEQ ID NO:2 or SEQ ID NO:3;
x4 is residue 214, 215, 216, 217 or 218 of SEQ ID NO:1, or residue 215, 216, 217, 218 or 219 of SEQ ID NO:2 or SEQ ID NO:3;
x5 is residue 266, 267, 268, 269 or 270 of SEQ ID NO:1, or residue 268, 269, 270, 271 or 272 of SEQ ID NO:2 or SEQ ID NO:3;
x6 is residue 326, 327, 328, 329 or 330 of SEQ ID NO:1, or residue 328, 329, 330, 331 or 332 of SEQ ID NO:2 or SEQ ID NO:3;
x7 is residue 402, 403, 404, 405 or 406 of SEQ ID NO:1, or residue 404, 405, 406, 407 or 408 of SEQ ID NO:2 or SEQ ID NO:3; and
x8 is an amino acid residue corresponding to the C-terminus of the heme domain of CYP102A1, CYP102A2 or CYP102A3 or the C-terminus of SEQ ID NO:1, SEQ ID NO:2 or SEQ ID NO:3;
wherein the polypeptide has a CO-binding peak at 450 nm; and
wherein the general structure is selected from the group consisting of:
11311333, 11312331, 11312333, 21111133, 21111213, 21111231, 21111233, 21111311, 21111313, 21111331, 21111332, 21112131, 21112133, 21112211, 21112213, 21112231, 21112313, 21112323, 21113213, 21113231, 21113233, 21113311, 21113313, 21113331, 21113332, 21211133, 21211213, 21211231, 21211233, 21211311, 21211313, 21211331, 21211332, 21211333, 21212131, 21212211, 21212232, 21212311, 21212313, 21212323, 21212331, 21213213, 21213233, 21213311, 21213313, 21213331, 21311113, 21311123, 21311131, 21311132, 21311133, 21311211, 21311212, 21311213, 21311221, 21311232, 21311312, 21311321, 21311322, 21311323, 21311332, 21312113, 21312131, 21312132, 21312221, 21312232, 21312312, 21313113, 21313131, 21313132, 21313133, 21313211, 21313212, 21313213, 21313223, 21313232, 21313321, 21313323, 21313332, 21321333, 21322313, 21322333, 21331331, 21332213, 21332311, 21332313, 22111233, 22111313, 22111331, 22111333, 22112213, 22112231, 22112311, 22112313, 22112332, 22113331, 22113333, 22211231, 22211233, 22211313, 22211333, 22212213, 22212231, 22212233, 22212311, 22212313, 22212331, 22212332, 22213313, 22213331, 22213333, 22311111, 22311113, 22311131, 22311132, 22311133, 22311211, 22311213, 22311221, 22311223, 22311232, 22311311, 22311312, 22311313, 22311321, 22311323, 22312113, 22312131, 22312212, 22312213, 22312321, 22312323, 22313113, 22313131, 22313133, 22313211, 22313213, 22313223, 22313311, 22313312, 22313313, 22313321, 22322333, 22332331, 22332333, 22333333, 23311231, 23311331, 23311333, 23312133, 23312213, 23312231, 23312233, 23312313, 23312331, 23312332, 23312333, 23313313, 23313331, 31311331, 31311333, 31312313, 31312331, 31313331, 32112231 and 32311333.
2. The polypeptide of claim 1, wherein the polypeptide comprises a heme domain and the heme domain is fused to a functional reductase domain having at least 50% identity to the reductase domain of SEQ ID NO:1, 2, or 3.
3. A polypeptide having the general structure from N-terminus to C-terminus:
[segment 1]-[segment 2]-[segment 3]-[segment 4]-[segment 5]-[segment 6]-[segment 7]-[segment 8]
wherein segment 1 comprises at least 50-100% identity to the sequence of SEQ ID NO:4, 5, or 6;
wherein segment 2 comprises at least 50-100% identity to the sequence of SEQ ID NO:7, 8, or 9;
wherein segment 3 comprises at least 50-100% identity to the sequence of SEQ ID NO:10, 11, or 12;
segment 4 comprises at least 50-100% identity to the sequence of SEQ ID NO:13, 14, or 15;
segment 5 comprises at least 50-100% identity to the sequence of SEQ ID NO:16, 17, or 18;
segment 6 comprises at least 50-100% identity to the sequence of SEQ ID NO:19, 20, or 21;
segment 7 comprises at least 50-100% identity to the sequence of SEQ ID NO:22, 23, or 24; and
segment 8 comprises at least 50-100% identity to a sequence of SEQ ID NO:25, 26, or 27,
and wherein the polypeptide has a CO-binding peak at 450 nm.
4. The polypeptide of claim 3, wherein the polypeptide comprises a heme domain and the heme domain is fused to a functional reductase domain having at least 50% identity to the reductase domain of SEQ ID NO:1, 2, or 3.
5. A polynucleotide encoding a polypeptide of claim 1, 2, 3 or 4.
6. A vector comprising a polynucleotide of claim 5.
7. A host cell comprising the vector of claim 6.
8. A host cell comprising the polynucleotide of claim 5.
9. An enzyme extract comprising a polypeptide produced from the host cell of claim 7 or 8.
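
The eight-digit designations recited in claim 1 (for example 21311133) can be read as one parent label per segment: the i-th digit names the parent sequence ("1", "2" or "3") that contributes segment i. The following Python sketch illustrates that reading only. The function names, the toy parent strings and the toy boundary positions are hypothetical and are not taken from the sequence listing, and the sketch assumes that each crossover residue x_i is the last residue of segment i, a convention the claim language ("from about ... to about") does not fix.

```python
from typing import Dict, List, Sequence

# Hypothetical toy parents (16 residues each) standing in for SEQ ID NO:1-3.
TOY_PARENTS: Dict[str, str] = {"1": "A" * 16, "2": "C" * 16, "3": "D" * 16}

# Hypothetical crossover positions x1..x8 (1-based, last residue of each
# segment) for each parent; real values would come from claim 1.
TOY_BOUNDARIES: Dict[str, Sequence[int]] = {
    p: (2, 4, 6, 8, 10, 12, 14, 16) for p in TOY_PARENTS
}


def parse_chimera_code(code: str, n_segments: int = 8) -> List[str]:
    """Split a designation such as '21311133' into one parent label per segment."""
    if len(code) != n_segments or any(c not in "123" for c in code):
        raise ValueError(f"expected {n_segments} digits drawn from 1, 2, 3: {code!r}")
    return list(code)


def assemble_chimera(
    code: str,
    parents: Dict[str, str],
    boundaries: Dict[str, Sequence[int]],
) -> str:
    """Concatenate segment i from the parent named by digit i of the code."""
    pieces = []
    for i, parent in enumerate(parse_chimera_code(code)):
        ends = boundaries[parent]
        # Assumption: segment i spans residues (x_{i-1}, x_i] of its parent.
        start = 0 if i == 0 else ends[i - 1]
        pieces.append(parents[parent][start:ends[i]])
    return "".join(pieces)


if __name__ == "__main__":
    print(assemble_chimera("21311133", TOY_PARENTS, TOY_BOUNDARIES))
```

Because the crossover positions differ between parents (for example, x5 is offset by two residues between SEQ ID NO:1 and SEQ ID NO:2 or 3), the sketch keeps a separate boundary list per parent rather than a single shared one.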
PCT/US2008/053344 2007-02-08 2008-02-07 Methods for generating novel stabilized proteins WO2008118545A2 (en)

Applications Claiming Priority (6)

Application Number Priority Date Filing Date Title
US90022907P 2007-02-08 2007-02-08
US60/900,229 2007-02-08
US91852807P 2007-03-16 2007-03-16
US60/918,528 2007-03-16
US12/024,515 2008-02-01
US12/024,515 US20080248545A1 (en) 2007-02-02 2008-02-01 Methods for Generating Novel Stabilized Proteins

Publications (2)

Publication Number Publication Date
WO2008118545A2 true WO2008118545A2 (en) 2008-10-02
WO2008118545A3 WO2008118545A3 (en) 2009-12-30

Family

ID=39789216

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2008/053344 WO2008118545A2 (en) 2007-02-08 2008-02-07 Methods for generating novel stabilized proteins

Country Status (1)

Country Link
WO (1) WO2008118545A2 (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7863030B2 (en) 2003-06-17 2011-01-04 The California Institute Of Technology Regio- and enantioselective alkane hydroxylation with modified cytochrome P450
US8026085B2 (en) 2006-08-04 2011-09-27 California Institute Of Technology Methods and systems for selective fluorination of organic molecules
US8252559B2 (en) 2006-08-04 2012-08-28 The California Institute Of Technology Methods and systems for selective fluorination of organic molecules
US8802401B2 (en) 2007-06-18 2014-08-12 The California Institute Of Technology Methods and compositions for preparation of selectively protected carbohydrates
US9322007B2 (en) 2011-07-22 2016-04-26 The California Institute Of Technology Stable fungal Cel6 enzyme variants

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050037411A1 (en) * 2003-08-11 2005-02-17 California Institute Of Technology Thermostable peroxide-driven cytochrome P450 oxygenase variants and methods of use
US7226768B2 (en) * 2001-07-20 2007-06-05 The California Institute Of Technology Cytochrome P450 oxygenases

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7226768B2 (en) * 2001-07-20 2007-06-05 The California Institute Of Technology Cytochrome P450 oxygenases
US20050037411A1 (en) * 2003-08-11 2005-02-17 California Institute Of Technology Thermostable peroxide-driven cytochrome P450 oxygenase variants and methods of use

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7863030B2 (en) 2003-06-17 2011-01-04 The California Institute Of Technology Regio- and enantioselective alkane hydroxylation with modified cytochrome P450
US8343744B2 (en) 2003-06-17 2013-01-01 The California Institute Of Technology Regio- and enantioselective alkane hydroxylation with modified cytochrome P450
US8741616B2 (en) 2003-06-17 2014-06-03 California Institute Of Technology Regio- and enantioselective alkane hydroxylation with modified cytochrome P450
US9145549B2 (en) 2003-06-17 2015-09-29 The California Institute Of Technology Regio- and enantioselective alkane hydroxylation with modified cytochrome P450
US8026085B2 (en) 2006-08-04 2011-09-27 California Institute Of Technology Methods and systems for selective fluorination of organic molecules
US8252559B2 (en) 2006-08-04 2012-08-28 The California Institute Of Technology Methods and systems for selective fluorination of organic molecules
US8802401B2 (en) 2007-06-18 2014-08-12 The California Institute Of Technology Methods and compositions for preparation of selectively protected carbohydrates
US9322007B2 (en) 2011-07-22 2016-04-26 The California Institute Of Technology Stable fungal Cel6 enzyme variants

Also Published As

Publication number Publication date
WO2008118545A3 (en) 2009-12-30

Similar Documents

Publication Publication Date Title
US20080248545A1 (en) Methods for Generating Novel Stabilized Proteins
Gumulya et al. Engineering highly functional thermostable proteins using ancestral sequence reconstruction
Otey et al. Structure-guided recombination creates an artificial family of cytochromes P450
Bloom et al. Neutral genetic drift can alter promiscuous protein functions, potentially aiding functional evolution
US20120171693A1 (en) Methods for Generating Novel Stabilized Proteins
Perperopoulou et al. Recent advances in protein engineering and biotechnological applications of glutathione transferases
Garcia et al. Reconstructing the evolutionary history of nitrogenases: Evidence for ancestral molybdenum‐cofactor utilization
JP2021131901A (en) Automated screening of enzyme variants
Park et al. Energetics-based protein profiling on a proteomic scale: identification of proteins resistant to proteolysis
US20080268517A1 (en) Stable, functional chimeric cytochrome p450 holoenzymes
WO2008118545A2 (en) Methods for generating novel stabilized proteins
Nutschel et al. Systematically scrutinizing the impact of substitution sites on thermostability and detergent tolerance for Bacillus subtilis lipase A
Chandler et al. Strategies for increasing protein stability
Rembeza et al. Experimental and computational investigation of enzyme functional annotations uncovers misannotation in the EC 1.1. 3.15 enzyme class
Matsuoka et al. Discovery of fungal denitrification inhibitors by targeting copper nitrite reductase from Fusarium oxysporum
Saab‐Rincón et al. Stabilization of the reductase domain in the catalytically self‐sufficient cytochrome P450BM3 by consensus‐guided mutagenesis
Kramm et al. Short‐chain dehydrogenases/reductases in cyanobacteria
Nakano et al. Benchmark analysis of native and artificial NAD+-dependent enzymes generated by a sequence-based design method with or without phylogenetic data
Fiorentini et al. The extreme structural plasticity in the CYP153 subfamily of P450s directs development of designer hydroxylases
Roda et al. Structural-based modeling in protein engineering. A must do
WO2008115844A2 (en) Stable, functional chimeric cytochrome p450 holoenzymes
WO2005017106A2 (en) Libraries of optimized cytochrome p450 enzymes and the optimized p450 enzymes
Verma et al. MAP2. 03D: a sequence/structure based server for protein engineering
Luo et al. Evidence that the C-terminal domain of a type B PutA protein contributes to aldehyde dehydrogenase activity and substrate channeling
Wilderman et al. Functional characterization of cytochromes P450 2B from the desert woodrat Neotoma lepida

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 08780403

Country of ref document: EP

Kind code of ref document: A2

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 08780403

Country of ref document: EP

Kind code of ref document: A2