https://arxiv.org/abs/2110.13711

arXiv:2110.13711 [cs.LG] — Computer Science > Machine Learning
[Submitted on 26 Oct 2021]

Title: Hierarchical Transformers Are More Efficient Language Models
Authors: Piotr Nawrot, Szymon Tworkowski, Michal Tyrolski, Lukasz Kaiser, Yuhuai Wu, Christian Szegedy, Henryk Michalewski

Abstract: Transformer models yield impressive results on many NLP and sequence modeling tasks. Remarkably, Transformers can handle long sequences which allows them to produce long coherent outputs: full paragraphs produced by GPT-3 or well-structured images produced by DALL-E. These large language models are impressive but also very inefficient and costly, which limits their applications and accessibility. We postulate that having an explicit hierarchical architecture is the key to Transformers that efficiently handle long sequences. To verify this claim, we first study different ways to downsample and upsample activations in Transformers so as to make them hierarchical. We use the best performing upsampling and downsampling layers to create Hourglass - a hierarchical Transformer language model. Hourglass improves upon the Transformer baseline given the same amount of computation and can yield the same results as Transformers more efficiently. In particular, Hourglass sets new state-of-the-art for Transformer models on the ImageNet32 generation task and improves language modeling efficiency on the widely studied enwik8 benchmark.

Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Cite as: arXiv:2110.13711 [cs.LG] (or arXiv:2110.13711v1 [cs.LG] for this version)

Submission history:
From: Szymon Tworkowski
[v1] Tue, 26 Oct 2021 14:00:49 UTC (1,841 KB)
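To make the "downsample, process, upsample" idea from the abstract concrete, below is a minimal, illustrative PyTorch sketch of an hourglass-style block. It is an assumption-laden toy, not the authors' implementation: it uses simple average pooling for downsampling and nearest-neighbor repetition for upsampling with a residual connection, and it omits details the paper cares about, such as the causal shift required for autoregressive decoding and the attention-based shortening/upsampling variants the authors compare.

```python
# Illustrative sketch only: shorten activations, run Transformer layers at the
# reduced length ("hourglass waist"), then upsample and merge with a residual.
import torch
import torch.nn as nn


class HourglassBlock(nn.Module):
    def __init__(self, d_model: int, n_heads: int, shorten_factor: int, n_inner_layers: int):
        super().__init__()
        self.k = shorten_factor
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        # Layers that operate on the shortened sequence, where attention is cheaper.
        self.inner = nn.TransformerEncoder(layer, num_layers=n_inner_layers)

    def downsample(self, x: torch.Tensor) -> torch.Tensor:
        # Average-pool every k consecutive token activations: (B, L, D) -> (B, L/k, D).
        b, length, d = x.shape
        return x.view(b, length // self.k, self.k, d).mean(dim=2)

    def upsample(self, x_short: torch.Tensor) -> torch.Tensor:
        # Repeat each pooled vector k times to restore the original length.
        return x_short.repeat_interleave(self.k, dim=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Assumes the sequence length is a multiple of the shortening factor.
        shortened = self.downsample(x)
        processed = self.inner(shortened)
        # Residual connection from the full-resolution activations.
        return x + self.upsample(processed)


# Usage: 512 input tokens are processed at length 512 / 4 = 128 inside the block.
block = HourglassBlock(d_model=256, n_heads=4, shorten_factor=4, n_inner_layers=2)
out = block(torch.randn(2, 512, 256))
print(out.shape)  # torch.Size([2, 512, 256])
```

The point of the sketch is the cost argument: the inner layers attend over a sequence k times shorter, so their self-attention cost drops roughly by a factor of k², which is the efficiency lever the abstract attributes to the hierarchical design.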