https://news.mit.edu/2024/new-way-let-ai-chatbots-converse-all-day-without-crashing-0213

A new way to let AI chatbots converse all day without crashing

Researchers developed a simple yet effective solution for a puzzling problem that can worsen the performance of large language models such as ChatGPT.

Adam Zewe | MIT News
Publication Date: February 13, 2024

[Image: Cartoon with several online chat windows saying "Oops, something went wrong," and one in the center continuing to perform. Caption: Researchers developed a technique that enables an AI chatbot like ChatGPT to efficiently conduct a day-long conversation with a human collaborator without slowing down or crashing, no matter how much text the conversation involves. Credit: Christine Daniloff, MIT]

When a human-AI conversation involves many rounds of continuous dialogue, the large language models that drive chatbots like ChatGPT sometimes start to collapse, causing the bots' performance to rapidly deteriorate.

A team of researchers from MIT and elsewhere has pinpointed a surprising cause of this problem and developed a simple solution that enables a chatbot to maintain a nonstop conversation without crashing or slowing down.

Their method involves a tweak to the key-value cache (which is like a conversation memory) at the core of many large language models.
In some methods, when this cache needs to hold more information than it has capacity for, the first pieces of data are bumped out. This can cause the model to fail.

By ensuring that these first few data points remain in memory, the researchers' method allows a chatbot to keep chatting no matter how long the conversation goes.

The method, called StreamingLLM, enables a model to remain efficient even when a conversation stretches on for more than 4 million words. When compared to another method that avoids crashing by constantly recomputing part of the past conversation, StreamingLLM performed more than 22 times faster.

This could allow a chatbot to conduct long conversations throughout the workday without needing to be continually rebooted, enabling efficient AI assistants for tasks like copywriting, editing, or generating code.

"Now, with this method, we can persistently deploy these large language models. By making a chatbot that we can always chat with, and that can always respond to us based on our recent conversations, we could use these chatbots in some new applications," says Guangxuan Xiao, an electrical engineering and computer science (EECS) graduate student and lead author of a paper on StreamingLLM.

Xiao's co-authors include his advisor, Song Han, an associate professor in EECS, a member of the MIT-IBM Watson AI Lab, and a distinguished scientist at NVIDIA; as well as Yuandong Tian, a research scientist at Meta AI; Beidi Chen, an assistant professor at Carnegie Mellon University; and senior author Mike Lewis, a research scientist at Meta AI. The work will be presented at the International Conference on Learning Representations.

A puzzling phenomenon

Large language models encode data, such as the words in a user query, into representations called tokens. Many models employ an attention mechanism that uses these tokens to generate new text.

Typically, an AI chatbot writes new text based on text it has just seen, so it stores recent tokens in memory, called a KV cache, to use later. The attention mechanism builds a grid that includes all tokens in the cache, an "attention map" that charts how strongly each token, or word, relates to each other token. Understanding these relationships is one feature that enables large language models to generate human-like text.

But when the cache gets very large, the attention map can become even more massive, which slows down computation. Also, if encoding content requires more tokens than the cache can hold, the model's performance drops. For instance, one popular model can store 4,096 tokens, yet there are about 10,000 tokens in an academic paper.

To get around these problems, researchers employ a "sliding cache" that bumps out the oldest tokens to make room for new ones. However, the model's performance often plummets as soon as that first token is evicted, rapidly reducing the quality of newly generated words.

In this new paper, the researchers realized that if they keep the first token in the sliding cache, the model will maintain its performance even when the cache size is exceeded. But this didn't make any sense. The first word in a novel likely has nothing to do with the last word, so why would the first word be so important for the model to generate the newest word? In their new paper, the researchers also uncovered the cause of this phenomenon.

Attention sinks

Some models use a Softmax operation in their attention mechanism, which assigns a score to each token representing how much it relates to each other token. The Softmax operation requires all attention scores to sum to 1. Since most tokens aren't strongly related to one another, their attention scores are very low, and the model dumps any remaining attention score into the first token. The researchers call this first token an "attention sink."
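To make the Softmax constraint concrete, here is a small NumPy sketch. The logits below are invented for illustration, with a bias added to the first column to mimic the sink behavior a trained model learns: every row of the attention map must sum to 1, so attention that isn't needed elsewhere has to land somewhere, and it piles up on the globally visible first token.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Hypothetical attention logits for 6 cached tokens (6 queries x 6 keys).
# Real logits come from query-key dot products; here we simply bias the
# first column upward to mimic a learned attention sink.
logits = rng.normal(size=(6, 6))
logits[:, 0] += 3.0

# Causal mask: a token may not attend to tokens that come after it.
future = np.triu(np.ones((6, 6), dtype=bool), k=1)
logits[future] = -np.inf

attn = softmax(logits)
print(attn.sum(axis=-1))        # every row sums to 1, as Softmax requires
print(attn[:, 0].round(2))      # most of each row's mass lands on token 0
```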
"We need an attention sink, and the model decides to use the first token as the attention sink because it is globally visible: every other token can see it. We found that we must always keep the attention sink in the cache to maintain the model dynamics," Han says.

In building StreamingLLM, the researchers discovered that placing four attention sink tokens at the beginning of the sliding cache leads to optimal performance.

They also found that each token's positional encoding must be based on the token's position within the cache, rather than its position in the original text, and must stay consistent as new tokens are added and others are bumped out. If token 5 is bumped out, for example, token 6 is re-encoded as position 5, since it is now the fifth token in the cache.

By combining these two ideas, they enabled StreamingLLM to maintain a continuous conversation while outperforming a popular method that uses recomputation. For instance, when the cache holds 256 tokens, the recomputation method takes 63 milliseconds to decode a new token, while StreamingLLM takes 31 milliseconds. But if the cache grows to 4,096 tokens, recomputation requires 1,411 milliseconds per new token, while StreamingLLM needs just 65 milliseconds.
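Putting the pieces together, a minimal Python sketch of the cache policy might look like the following. This is our own illustrative reconstruction, not the authors' code: the class and method names are invented, and a real implementation evicts per-layer key/value tensors rather than token IDs. The first four tokens are pinned as attention sinks, eviction only touches the rolling window of recent tokens, and positional indices follow each token's slot in the cache.

```python
from collections import deque

class StreamingKVCache:
    """Toy sketch of a StreamingLLM-style eviction policy: pin the first
    `num_sinks` tokens as attention sinks, keep a rolling window of recent
    tokens, and assign positions by cache slot, not original text position.
    """

    def __init__(self, capacity: int = 8, num_sinks: int = 4):
        assert capacity > num_sinks
        self.capacity, self.num_sinks = capacity, num_sinks
        self.sinks: list[int] = []          # earliest tokens, never evicted
        self.window: deque[int] = deque()   # recent tokens, oldest evicted

    def append(self, token_id: int) -> None:
        if len(self.sinks) < self.num_sinks:
            self.sinks.append(token_id)     # pin the first few tokens
        else:
            self.window.append(token_id)
            while len(self.sinks) + len(self.window) > self.capacity:
                self.window.popleft()       # evict oldest non-sink token

    def contents(self) -> list[tuple[int, int]]:
        # (cache position, token id): positions follow the cache slot.
        return list(enumerate(self.sinks + list(self.window)))

cache = StreamingKVCache(capacity=8, num_sinks=4)
for t in range(12):
    cache.append(t)
print(cache.contents())
# [(0, 0), (1, 1), (2, 2), (3, 3), (4, 8), (5, 9), (6, 10), (7, 11)]
# The sinks survive, the middle of the stream is evicted, and token 8
# is encoded at cache position 4.
```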
"The innovative approach of StreamingLLM, centered around the attention sink mechanism, ensures stable memory usage and performance, even when processing texts up to 4 million tokens in length," says Yang You, a presidential young professor of computer science at the National University of Singapore, who was not involved with this work. "This capability is not just impressive; it's transformative, enabling StreamingLLM to be applied across a wide array of AI applications. The performance and versatility of StreamingLLM mark it as a highly promising technology, poised to revolutionize how we approach AI-driven generation applications."

Tianqi Chen, an assistant professor in the machine learning and computer science departments at Carnegie Mellon University who was also not involved with this research, agreed, saying, "StreamingLLM enables the smooth extension of the conversation length of large language models. We have been using it to enable the deployment of Mistral models on iPhones with great success."

The researchers also explored the use of attention sinks during model training by prepending several placeholder tokens to all training samples. They found that training with attention sinks allowed a model to maintain performance with only one attention sink in its cache, rather than the four that are usually required to stabilize a pretrained model's performance.
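The training-time variant is simple to sketch. Below is a hypothetical illustration (the reserved token ID and function name are our assumptions, not details from the paper): a dedicated placeholder token is prepended to every training sample so the model learns to use it as a single, stable attention sink.

```python
# Hypothetical sketch of training with a dedicated attention sink.
# We assume the tokenizer reserves ID 0 for the sink token; the paper's
# actual preprocessing details may differ.
SINK_TOKEN_ID = 0

def prepend_sinks(token_ids: list[int], num_sinks: int = 1) -> list[int]:
    """Prepend placeholder sink tokens to one tokenized training sample."""
    return [SINK_TOKEN_ID] * num_sinks + token_ids

# Toy "tokenized corpus": every sample gets the sink token up front.
tokenized_corpus = [[17, 42, 5], [99, 3]]
batch = [prepend_sinks(sample) for sample in tokenized_corpus]
print(batch)   # [[0, 17, 42, 5], [0, 99, 3]]
```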
But while StreamingLLM enables a model to conduct a continuous conversation, the model cannot remember words that aren't stored in the cache. In the future, the researchers plan to target this limitation by investigating methods to retrieve evicted tokens or to enable the model to memorize previous conversations.

StreamingLLM has been incorporated into NVIDIA's large language model optimization library, TensorRT-LLM.

This work is funded, in part, by the MIT-IBM Watson AI Lab, the MIT Science Hub, and the U.S. National Science Foundation.

Paper: "Efficient Streaming Language Models with Attention Sinks"