Large language models generate functional protein sequences across diverse families

Ali Madani, Ben Krause, Eric R. Greene, Subu Subramanian, Benjamin P. Mohr, James M. Holton, Jose Luis Olmos Jr., Caiming Xiong, Zachary Z. Sun, Richard Socher, James S. Fraser & Nikhil Naik

Article | Published: 26 January 2023 | Nature Biotechnology (2023) | https://doi.org/10.1038/s41587-022-01618-2

Subjects: Enzymes, Machine learning, Proteomics

Abstract

Deep-learning language models have shown promise in various biotechnological applications, including protein design and engineering. Here we describe ProGen, a language model that can generate protein sequences with a predictable function across large protein families, akin to generating grammatically and semantically correct natural-language sentences on diverse topics. The model was trained on 280 million protein sequences from >19,000 families and is augmented with control tags specifying protein properties. ProGen can be further fine-tuned on curated sequences and tags to improve the controllable generation of proteins from families with sufficient homologous samples. Artificial proteins fine-tuned to five distinct lysozyme families showed catalytic efficiencies similar to those of natural lysozymes, with sequence identity to natural proteins as low as 31.4%. ProGen is readily adapted to diverse protein families, as we demonstrate with chorismate mutase and malate dehydrogenase.
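The abstract's core mechanism, autoregressive generation steered by control tags, can be illustrated with a short, self-contained sketch. The code below is not ProGen's implementation or API: the tag format, the toy stand-in model and all function names are assumptions made for illustration, and only the decoding strategy (nucleus sampling, cf. ref. 75) reflects a technique the paper cites.

import numpy as np

AMINO_ACIDS = list("ACDEFGHIKLMNPQRSTVWY")
CONTROL_TAG = "<family:lysozyme>"  # hypothetical tag format, for illustration only

rng = np.random.default_rng(0)

def toy_logits(context):
    """Stand-in for the trained model: next-residue logits given tags + prefix."""
    # A real conditional language model would attend over the full context;
    # here we return random logits so the sketch runs end to end.
    return rng.normal(size=len(AMINO_ACIDS))

def nucleus_sample(logits, top_p=0.9, temperature=1.0):
    """Top-p (nucleus) sampling: draw from the smallest set of tokens whose
    cumulative probability reaches top_p (cf. ref. 75)."""
    probs = np.exp((logits - logits.max()) / temperature)
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]  # token indices, most probable first
    cutoff = np.searchsorted(np.cumsum(probs[order]), top_p) + 1
    keep = order[:cutoff]
    return int(rng.choice(keep, p=probs[keep] / probs[keep].sum()))

def generate(tag, length=60):
    """Generate one artificial sequence, steered by a control tag."""
    residues = []
    for _ in range(length):
        idx = nucleus_sample(toy_logits([tag] + residues))
        residues.append(AMINO_ACIDS[idx])
    return "".join(residues)

print(generate(CONTROL_TAG))

In a real run, toy_logits would be replaced by the trained conditional language model, and the control tag would be drawn from the property-tag vocabulary the model was trained with.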
Fig. 1: Artificial protein generation with conditional language modeling.
Fig. 2: Generated artificial antibacterial proteins are diverse and express well in our experimental system.
Fig. 3: Artificial protein sequences are functional while reaching as low as 31% identity to any known protein, exhibit catalytic efficiencies comparable to a highly evolved natural protein, and demonstrate structures similar to known natural folds.
Fig. 4: Applicability of conditional language modeling to other protein systems.

Data availability

All sequence databases used in this study are publicly available and include UniProtKB, UniParc, NCBI Taxonomy, Pfam, UniRef30, the NCBI nr database and InterPro. Please refer to Supplementary Table 1 for more details. Sequences and activity data for the natural and artificial lysozymes tested are in the Supplementary Material. Evaluation data for the CM experiments can be found in Russ et al.^6. Evaluation data for the MDH experiments can be found in Repecka et al.^52. The crystal structure datasets generated during the current study are available under PDB accession 7RGR. Source data are provided with this paper.

Code availability

Our code and checkpoints are publicly available on Zenodo and can be reproduced using the details provided in the Methods section on data preparation, model architecture and training protocol. Major components of our model architecture and training protocol can be reproduced using CTRL (https://github.com/salesforce/ctrl). The most up-to-date and supported codebase can be found at https://github.com/salesforce/progen.
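The Data availability statement and Fig. 3 both refer to sequence identity between artificial and natural proteins (functional designs at identities as low as 31%). As a rough illustration of what a "maximum identity to any known protein" computation involves, here is a minimal Needleman-Wunsch global alignment in pure Python; the paper's actual search pipeline (described in the Methods) is not reproduced here, and the scoring parameters and helper names are illustrative assumptions.

def global_align_identity(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch global alignment; returns percent identity,
    defined as matched positions divided by alignment length (gaps included)."""
    n, m = len(a), len(b)
    # Dynamic-programming matrix of best alignment scores.
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
    # Traceback to count matched positions and total alignment length.
    i, j, matches, length = n, m, 0, 0
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + (
            match if a[i - 1] == b[j - 1] else mismatch
        ):
            matches += a[i - 1] == b[j - 1]
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
        length += 1
    return 100.0 * matches / length

def max_identity(generated, known_sequences):
    """Identity of the generated sequence to its closest known sequence."""
    return max(global_align_identity(generated, k) for k in known_sequences)

print(max_identity("MKTAYIAKQR", ["MKTAYIAKQL", "MATAYAAKQR"]))

In practice this brute-force scan would be replaced by a dedicated homology-search tool run against the full set of known sequences; the toy version only makes the definition of percent identity concrete.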
References

1. Koga, N. et al. Principles for designing ideal protein structures. Nature 491, 222-227 (2012).
2. Lin, Y.-R. et al. Control over overall shape and size in de novo designed proteins. Proc. Natl Acad. Sci. USA 112, E5478-E5485 (2015).
3. Huang, P.-S., Boyken, S. E. & Baker, D. The coming of age of de novo protein design. Nature 537, 320-327 (2016).
4. Huang, P.-S. et al. De novo design of a four-fold symmetric TIM-barrel protein with atomic-level accuracy. Nat. Chem. Biol. 12, 29-34 (2016).
5. Boyken, S. E. et al. De novo design of protein homo-oligomers with modular hydrogen-bond network-mediated specificity. Science 352, 680-687 (2016).
6. Lapedes, A. S., Giraud, B. G., Liu, L. & Stormo, G. D. Correlated mutations in models of protein sequences: phylogenetic and structural effects. Lect. Notes Monogr. Ser. 33, 236-256 (1999).
7. Russ, W. P. et al. An evolution-based model for designing chorismate mutase enzymes. Science 369, 440-445 (2020).
8. Hopf, T. A. et al. The EVcouplings Python framework for coevolutionary sequence analysis. Bioinformatics 35, 1582-1584 (2019).
9. Morcos, F. et al. Direct-coupling analysis of residue coevolution captures native contacts across many protein families. Proc. Natl Acad. Sci. USA 108, E1293-E1301 (2011).
10. Ovchinnikov, S., Kamisetty, H. & Baker, D. Robust and accurate prediction of residue-residue interactions across protein interfaces using evolutionary information. eLife 3, e02030 (2014).
11. Alley, E. C., Khimulya, G., Biswas, S., AlQuraishi, M. & Church, G. M. Unified rational protein engineering with sequence-based deep representation learning. Nat. Methods 16, 1315-1322 (2019).
12. Wu, Z. et al. Signal peptides generated by attention-based neural networks. ACS Synth. Biol. 9, 2154-2161 (2020).
13. Shin, J.-E. et al. Protein design and variant prediction using autoregressive generative models. Nat. Commun. 12, 2403 (2021).
14. Jumper, J. et al. Highly accurate protein structure prediction with AlphaFold. Nature 596, 583-589 (2021).
15. Bryant, D. H. et al. Deep diversification of an AAV capsid protein by machine learning. Nat. Biotechnol. 39, 691-696 (2021).
16. Das, P. et al. Accelerated antimicrobial discovery via deep generative models and molecular dynamics simulations. Nat. Biomed. Eng. 5, 613-623 (2021).
17. Anishchenko, I. et al. De novo protein design by deep network hallucination. Nature 600, 547-552 (2021).
18. Moffat, L., Kandathil, S. M. & Jones, D. T. Design in the DARK: learning deep generative models for de novo protein design. Preprint at bioRxiv https://doi.org/10.1101/2022.01.27.478087 (2022).
19. Ferruz, N., Schmidt, S. & Höcker, B. ProtGPT2 is a deep unsupervised language model for protein design. Nat. Commun. 13, 4348 (2022).
20. Huang, B. et al. A backbone-centred energy function of neural networks for protein design. Nature 602, 523-528 (2022).
21. Leinonen, R. et al. UniProt archive. Bioinformatics 20, 3236-3237 (2004).
22. Bairoch, A. et al. The Universal Protein Resource (UniProt). Nucleic Acids Res. 33, D154-D159 (2005).
23. Finn, R. D. et al. Pfam: the protein families database. Nucleic Acids Res. 42, D222-D230 (2014).
24. Vaswani, A. et al. Attention is all you need. In 31st Conference on Neural Information Processing Systems (NIPS, 2017).
25. Devlin, J., Chang, M.-W., Lee, K. & Toutanova, K. BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT, 2019).
26. Brown, T. B. et al. Language models are few-shot learners. In 34th Conference on Neural Information Processing Systems (NeurIPS, 2020).
27. Zellers, R. et al. Defending against neural fake news. In 33rd Conference on Neural Information Processing Systems (NeurIPS, 2019).
28. Keskar, N. S., McCann, B., Varshney, L. R., Xiong, C. & Socher, R. CTRL: a conditional transformer language model for controllable generation. Preprint at arXiv https://doi.org/10.48550/arXiv.1909.05858 (2019).
29. AlQuraishi, M. The future of protein science will not be supervised. Some Thoughts on a Mysterious Universe https://moalquraishi.wordpress.com/2019/04/01/the-future-of-protein-science-will-not-be-supervised/ (2019).
30. Rives, A. et al. Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. Proc. Natl Acad. Sci. USA 118, e2016239118 (2021).
31. Elnaggar, A. et al. ProtTrans: towards cracking the language of life's code through self-supervised deep learning and high performance computing. IEEE Trans. Pattern Anal. Mach. Intell. 44, 7112-7127 (2021).
32. Peters, M. E. et al. Deep contextualized word representations. In Proceedings of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT, 2018).
33. Howard, J. & Ruder, S. Universal language model fine-tuning for text classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (ACL, 2018).
34. Radford, A., Narasimhan, K., Salimans, T. & Sutskever, I. Improving language understanding by generative pre-training. Preprint at https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf (2018).
35. Biswas, S., Khimulya, G., Alley, E. C., Esvelt, K. M. & Church, G. M. Low-N protein engineering with data-efficient deep learning. Nat. Methods 18, 389-396 (2021).
36. Pfaff, C. W. Constraints on language mixing: intrasentential code-switching and borrowing in Spanish/English. Language 55, 291-318 (1979).
37. Poplack, S. Sometimes I'll start a sentence in Spanish Y TERMINO EN ESPAÑOL: toward a typology of code-switching. Linguistics 18, 581-618 (1980).
38. Dathathri, S. et al. Plug and play language models: a simple approach to controlled text generation. In 8th International Conference on Learning Representations (ICLR, 2020).
39. Matthews, B. W. Comparison of the predicted and observed secondary structure of T4 phage lysozyme. Biochim. Biophys. Acta 405, 442-451 (1975).
40. Broendum, S. S., Buckle, A. M. & McGowan, S. Catalytic diversity and cell wall binding repeats in the phage-encoded endolysins. Mol. Microbiol. 110, 879-896 (2018).
41. Love, M. J., Abeysekera, G. S., Muscroft-Taylor, A. C., Billington, C. & Dobson, R. C. J. On the catalytic mechanism of bacteriophage endolysins: opportunities for engineering. Biochim. Biophys. Acta Proteins Proteom. 1868, 140302 (2020).
42. Martin, P. P. Potts Models and Related Problems in Statistical Mechanics (World Scientific, 1991).
43. Thomas, J., Ramakrishnan, N. & Bailey-Kellogg, C. Graphical models of residue coupling in protein families. IEEE/ACM Trans. Comput. Biol. Bioinform. 5, 183-197 (2008).
44. Weigt, M., White, R. A., Szurmant, H., Hoch, J. A. & Hwa, T. Identification of direct residue contacts in protein-protein interaction by message passing. Proc. Natl Acad. Sci. USA 106, 67-72 (2009).
45. Balakrishnan, S., Kamisetty, H., Carbonell, J. G., Lee, S.-I. & Langmead, C. J. Learning generative models for protein fold families. Proteins 79, 1061-1078 (2011).
46. Stein, R. R., Marks, D. S. & Sander, C. Inferring pairwise interactions from biological data using maximum-entropy probability models. PLoS Comput. Biol. 11, e1004182 (2015).
47. Mirdita, M., Steinegger, M., Breitwieser, F., Söding, J. & Levy Karin, E. Fast and sensitive taxonomic assignment to metagenomic contigs. Bioinformatics 37, 3029-3031 (2021).
48. Mooers, B. H. M., Tronrud, D. E. & Matthews, B. W. Evaluation at atomic resolution of the role of strain in destabilizing the temperature-sensitive T4 lysozyme mutant Arg 96 → His. Protein Sci. 18, 863-870 (2009).
49. Baase, W. A., Liu, L., Tronrud, D. E. & Matthews, B. W. Lessons from the lysozyme of phage T4. Protein Sci. 19, 631-641 (2010).
50. Kuroki, R., Weaver, L. H. & Matthews, B. W. A covalent enzyme-substrate intermediate with saccharide distortion in a mutant T4 lysozyme. Science 262, 2030-2033 (1993).
51. Mchaourab, H. S., Oh, K. J., Fang, C. J. & Hubbell, W. L. Conformation of T4 lysozyme in solution. Hinge-bending motion and the substrate-induced conformational transition studied by site-directed spin labeling. Biochemistry 36, 307-316 (1997).
52. Kim, J.-K. et al. BetaCavityWeb: a webserver for molecular voids and channels. Nucleic Acids Res. 43, W413-W418 (2015).
53. Rost, B. Twilight zone of protein sequence alignments. Protein Eng. 12, 85-94 (1999).
54. Pearson, W. R. An introduction to sequence similarity ('homology') searching. Curr. Protoc. Bioinformatics 3, 3.1 (2013).
55. Repecka, D. et al. Expanding functional protein sequence spaces using generative adversarial networks. Nat. Mach. Intell. 3, 324-333 (2021).
56. Ruder, S., Peters, M. E., Swayamdipta, S. & Wolf, T. Transfer learning in natural language processing. In Proc. 2019 Conference of the North American Chapter of the Association for Computational Linguistics (eds Burstein, J., Doran, C. & Solorio, T.) (Association for Computational Linguistics, 2019).
57. Huh, M., Agrawal, P. & Efros, A. A. What makes ImageNet good for transfer learning? Preprint at arXiv https://doi.org/10.48550/arXiv.1608.08614 (2016).
58. LeCun, Y., Bengio, Y. & Hinton, G. Deep learning. Nature 521, 436-444 (2015).
59. Norn, C. et al. Protein sequence design by conformational landscape optimization. Proc. Natl Acad. Sci. USA 118, e2017228118 (2021).
60. Anand, N. et al. Protein sequence design with a learned potential. Nat. Commun. 13, 746 (2022).
61. Federhen, S. The NCBI Taxonomy database. Nucleic Acids Res. 40, D136-D143 (2012).
62. Pettit, L. D. The IUPAC stability constants database. Chem. Int. 28, 14-15 (2006).
63. Ashburner, M. et al. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat. Genet. 25, 25-29 (2000).
64. Bengio, Y., Ducharme, R., Vincent, P. & Janvin, C. A neural probabilistic language model. J. Mach. Learn. Res. 3, 1137-1155 (2003).
65. Madani, A. et al. ProGen: language modeling for protein generation. Preprint at bioRxiv https://doi.org/10.1101/2020.03.07.982272 (2020).
66. Vig, J. et al. BERTology meets biology: interpreting attention in protein language models. In International Conference on Learning Representations (ICLR, 2020).
67. Goyal, K., Dyer, C. & Berg-Kirkpatrick, T. Exposing the implicit energy networks behind masked language models via Metropolis-Hastings. In 10th International Conference on Learning Representations (ICLR, 2022).
68. Bhattacharya, N. et al. Single layers of attention suffice to predict protein contacts. Preprint at bioRxiv https://doi.org/10.1101/2020.12.21.423882 (2020).
69. Ramsauer, H. et al. Hopfield networks is all you need. Preprint at arXiv https://doi.org/10.48550/arXiv.2008.02217 (2020).
70. Alley, E., Khimulya, G., Biswas, S., AlQuraishi, M. & Church, G. M. Unified rational protein engineering with sequence-based deep representation learning. Nat. Methods 16, 1315-1322 (2019).
71. Rao, R. et al. Evaluating protein transfer learning with TAPE. Adv. Neural Inf. Process. Syst. 32, 9689-9701 (2019).
72. Kingma, D. P. & Ba, J. Adam: a method for stochastic optimization. Preprint at arXiv https://doi.org/10.48550/arXiv.1412.6980 (2014).
73. Pascanu, R., Mikolov, T. & Bengio, Y. On the difficulty of training recurrent neural networks. In Proc. 30th International Conference on Machine Learning (eds Dasgupta, S. & McAllester, D.) 1310-1318 (PMLR, 2013).
74. Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I. & Salakhutdinov, R. Dropout: a simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 15, 1929-1958 (2014).
75. Holtzman, A., Buys, J., Du, L., Forbes, M. & Choi, Y. The curious case of neural text degeneration. In 8th International Conference on Learning Representations (ICLR, 2020).
76. Goodfellow, I. J. et al. Generative adversarial networks. In 28th Conference on Neural Information Processing Systems (NIPS, 2014).
77. Koehn, P. Pharaoh: a beam search decoder for phrase-based statistical machine translation models. In Machine Translation: From Real Users to Research 115-124 (Springer, 2004).
78. Sun, Z. Z. et al. Protocols for implementing an Escherichia coli based TX-TL cell-free expression system for synthetic biology. J. Vis. Exp. 16, e50762 (2013).
79. Kabsch, W. XDS. Acta Crystallogr. D Biol. Crystallogr. 66, 125-132 (2010).
80. McCoy, A. J. et al. Phaser crystallographic software. J. Appl. Crystallogr. 40, 658-674 (2007).
81. Kovalevskiy, O., Nicholls, R. A., Long, F., Carlon, A. & Murshudov, G. N. Overview of refinement procedures within REFMAC5: utilizing data from different sources. Acta Crystallogr. D Struct. Biol. 74, 215-227 (2018).
82. Terwilliger, T. C. et al. Iterative model building, structure refinement and density modification with the PHENIX AutoBuild wizard. Acta Crystallogr. D Biol. Crystallogr. 64, 61-69 (2008).
83. Hoh, S. W., Burnley, T. & Cowtan, K. Current approaches for automated model building into cryo-EM maps using Buccaneer with CCP-EM. Acta Crystallogr. D Struct. Biol. 76, 531-541 (2020).
84. Emsley, P., Lohkamp, B., Scott, W. G. & Cowtan, K. Features and development of Coot. Acta Crystallogr. D Biol. Crystallogr. 66, 486-501 (2010).
85. Afonine, P. V. et al. Towards automated crystallographic structure refinement with phenix.refine. Acta Crystallogr. D Biol. Crystallogr. 68, 352-367 (2012).
86. Raffel, C. et al. Exploring the limits of transfer learning with a unified text-to-text transformer. Preprint at arXiv https://doi.org/10.48550/arXiv.1910.10683 (2019).
87. Studier, F. W. Protein production by auto-induction in high density shaking cultures. Protein Expr. Purif. 41, 207-234 (2005).
88. Mirdita, M. et al. ColabFold: making protein folding accessible to all. Nat. Methods 19, 679-682 (2022).

Acknowledgements

We thank B. McCann, C. Gee, E. Procko, K. Trego, L. Varshney, N. Shirish Keskar and S. Savarese for their feedback at various stages of this project. We thank A. Cook, D. Lo and V. Nemali for operational support. Thanks also to the Salesforce Research Computing Infrastructure team and the Google Cloud TPU team for their help with computing resources, in addition to Twist Bioscience for DNA synthesis support. Beamline 8.3.1 at the Advanced Light Source is operated by the University of California Office of the President, Multicampus Research Programs and Initiatives grant MR-15-328599, the National Institutes of Health (R01 GM124149 and P30 GM124169), Plexxikon Inc. and the Integrated Diffraction Analysis Technologies program of the US Department of Energy Office of Biological and Environmental Research. The Advanced Light Source (Berkeley, CA) is a national user facility operated by Lawrence Berkeley National Laboratory on behalf of the US Department of Energy under contract number DE-AC02-05CH11231, Office of Basic Energy Sciences. Icons in one figure were created using BioRender (https://biorender.com). E.R.G. is supported by NIH F32-GM144982-01. J.S.F. was supported by NIH GM123159, NIH GM145238 and a Sanghvi-Agarwal Innovation Award.

Author information

Author notes: These authors contributed equally: Ben Krause and Eric R. Greene.

Affiliations
1. Salesforce Research, Palo Alto, CA, USA: Ali Madani, Ben Krause, Caiming Xiong, Richard Socher & Nikhil Naik
2. Profluent Bio, San Francisco, CA, USA: Ali Madani
3. Department of Bioengineering and Therapeutic Sciences, University of California, San Francisco, San Francisco, CA, USA: Eric R. Greene, Jose Luis Olmos Jr. & James S. Fraser
4. Department of Molecular and Cell Biology, University of California, Berkeley, Berkeley, CA, USA: Subu Subramanian
5. Howard Hughes Medical Institute, University of California, Berkeley, Berkeley, CA, USA: Subu Subramanian
6. Tierra Biosciences, San Leandro, CA, USA: Benjamin P. Mohr & Zachary Z. Sun
7. Molecular Biophysics and Integrated Bioimaging Division, Lawrence Berkeley National Laboratory, Berkeley, CA, USA: James M. Holton
8. Stanford Synchrotron Radiation Lightsource, SLAC National Accelerator Laboratory, Menlo Park, CA, USA: James M. Holton
9. Department of Biochemistry and Biophysics, University of California, San Francisco, San Francisco, CA, USA: James M. Holton
Contributions

A.M. conceived and designed the study in collaboration with S.S. A.M. and B.K. designed and performed machine learning modeling, generation and scoring. B.P.M. performed the cell-free expression and activity assay and was supervised by Z.Z.S. E.R.G. performed the cell-based expression and kinetics assay and was supervised by J.S.F. J.M.H., J.L.O. and J.S.F. performed the structure determination. A.M., S.S., B.K. and N.N. performed computational analysis and were advised by C.X. R.S. provided advice on machine learning and computational methods. A.M., J.S.F. and N.N. wrote the manuscript with feedback and contributions from all authors, in particular from E.R.G. and B.K. N.N. supervised and managed the project.

Corresponding authors

Correspondence to Ali Madani or Nikhil Naik.

Ethics declarations

Competing interests: A.M. is a co-founder of Profluent Bio. All other authors declare no competing interests.

Peer review

Peer review information: Nature Biotechnology thanks the anonymous reviewers for their contribution to the peer review of this work.

Additional information

Publisher's note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information: Supplementary Tables 1-7, Supplementary Figures 1-12 and Supplementary References. Reporting Summary.

Source data

Source Data Fig. 1: Experimental data for protein sequences and activity.
Source Data Fig. 2: Structure report for artificial lysozyme deposited as 7RGR.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article: Madani, A., Krause, B., Greene, E.R. et al. Large language models generate functional protein sequences across diverse families. Nat. Biotechnol. (2023). https://doi.org/10.1038/s41587-022-01618-2

Received: 12 July 2022; Accepted: 17 November 2022; Published: 26 January 2023
DOI: https://doi.org/10.1038/s41587-022-01618-2

This article is cited by

* Regression Transformer enables concurrent sequence regression and generation for molecular language modelling. Jannis Born & Matteo Manica. Nature Machine Intelligence (2023)
* Hallucinating functional protein sequences. David Belanger & Lucy J. Colwell. Nature Biotechnology (2023)
* Fast, accurate antibody structure prediction from deep learning on massive set of natural antibodies. Jeffrey A. Ruffolo, Lee-Shin Chu & Jeffrey J. Gray. Nature Communications (2023)
* Prepare for truly useful large language models. Nature Biomedical Engineering (2023)
* Welcome to the Era of ChatGPT et al. Timm Teubner, Christoph M. Flath & Oliver Hinz. Business & Information Systems Engineering (2023)

Associated content

Hallucinating functional protein sequences. David Belanger & Lucy J. Colwell. Nature Biotechnology, News & Views, 26 January 2023.