https://spectrum.ieee.org/in-the-17th-century-leibniz-dreamed-of-a-machine-that-could-calculate-ideas

IEEE Spectrum | Artificial Intelligence | Article

In the 17th Century, Leibniz Dreamed of a Machine That Could Calculate Ideas
The machine would use an "alphabet of human thoughts" and rules to combine them

Oscar Schwartz | 04 Nov 2019 | 4 min read
Gottfried Wilhelm Leibniz in front of pages from his dissertation entitled On the Combinatorial Art. Illustration: Gluekit

This is part two of a six-part series on the history of natural language processing.

In 1666, the German polymath Gottfried Wilhelm Leibniz published an enigmatic dissertation entitled On the Combinatorial Art. Only 20 years old but already an ambitious thinker, Leibniz outlined a theory for automating knowledge production via the rule-based combination of symbols.

Leibniz's central argument was that all human thoughts, no matter how complex, are combinations of basic and fundamental concepts, in much the same way that sentences are combinations of words, and words combinations of letters. He believed that if he could find a way to symbolically represent these fundamental concepts and develop a method by which to combine them logically, then he would be able to generate new thoughts on demand.

The idea came to Leibniz through his study of Ramon Llull, a 13th-century Majorcan mystic who devoted himself to devising a system of theological reasoning that would prove the "universal truth" of Christianity to non-believers. Llull himself was inspired by Jewish Kabbalists' letter combinatorics (see part one of this series), which they used to produce generative texts that supposedly revealed prophetic wisdom. Taking the idea a step further, Llull invented what he called a volvelle, a circular paper mechanism with increasingly small concentric circles on which were written symbols representing the attributes of God. Llull believed that by spinning the volvelle in various ways, bringing the symbols into novel combinations with one another, he could reveal all the aspects of his deity.

Leibniz was much impressed by Llull's paper machine, and he embarked on a project to create his own method of idea generation through symbolic combination. He wanted to use his machine not for theological debate, but for philosophical reasoning. He proposed that such a system would require three things: an "alphabet of human thoughts"; a list of logical rules for their valid combination and re-combination; and a mechanism that could carry out the logical operations on the symbols quickly and accurately--a fully mechanized update of Llull's paper volvelle. He imagined that this machine, which he called "the great instrument of reason," would be able to answer all questions and resolve all intellectual debate. "When there are disputes among persons," he wrote, "we can simply say, 'Let us calculate,' and without further ado, see who is right."

The notion of a mechanism that produced rational thought encapsulated the spirit of Leibniz's times. Other Enlightenment thinkers, such as René Descartes, believed that there was a "universal truth" that could be accessed through reason alone, and that all phenomena were fully explainable if the underlying principles were understood. The same, Leibniz thought, was true of language and cognition itself. But many others saw this doctrine of pure reason as deeply flawed, and felt that it signified a new age of sophistry professed from on high. One such critic was the author and satirist Jonathan Swift, who took aim at Leibniz's thought-calculating machine in his 1726 book, Gulliver's Travels.
In one scene, Gulliver visits the Grand Academy of Lagado, where he encounters a strange mechanism called "the engine." The machine has a large wooden frame with a grid of wires; on the wires are small wooden cubes with symbols written on each side. The students of the academy crank handles on the side of the machine, causing the wooden cubes to rotate and spin and bringing the symbols into new combinations. A scribe then writes down the output of the machine and hands it to the presiding professor. Through this process, the professor claims, he and his students can "write books in philosophy, poetry, politics, laws, mathematics, and theology, without the least assistance from genius or study."

This scene, with its pre-digital language generation, was Swift's parody of Leibniz's thought generation through symbolic combinatorics--and more broadly, an argument against the primacy of science. As with the Lagado academy's other attempts at contributing to its nation's development through research--such as trying to turn human excrement back into food--Gulliver sees the engine as a pointless experiment.

Swift's point was that language is not a formal system that represents human thought, as Leibniz proposed, but a messy and ambiguous form of expression that makes sense only in relation to the context in which it is used. To have a machine generate language requires more than having the right set of rules and the right machine, Swift argued--it requires the ability to understand the meaning of words, something that neither the Lagado engine nor Leibniz's "instrument of reason" could do.

In the end, Leibniz never constructed his idea-generating machine. In fact, he abandoned the study of Llull's combinatorics altogether and, later in life, came to see the pursuit of mechanizing language as immature. But the idea of using mechanical devices to perform logical functions remained with him, inspiring the construction of his "step reckoner," a mechanical calculator built in 1673.

But as today's data scientists devise ever-better algorithms for natural language processing, they're having debates that echo the ideas of Leibniz and Swift: Even if you can create a formal system to generate human-seeming language, can you give it the ability to understand what it's saying?

This is the second installment of a six-part series on the history of natural language processing. Last week's post started the story with a Kabbalist mystic in medieval Spain. Come back next Monday for part three, which describes the language models that were painstakingly built by Andrey Markov and Claude Shannon. You can also check out our prior series on the untold history of AI.
IEEE Spectrum | Artificial Intelligence | Special Report Feature

Deep Learning's Diminishing Returns
The cost of improvement is becoming unsustainable

Neil C. Thompson, Kristjan Greenewald, Keeheon Lee, and Gabriel F. Manso | 24 Sep 2021 | 10 min read

A robot arm being pushed down by a very big dollar icon. Illustration: Eddie Guy

Deep learning is now being used to translate between languages, predict how proteins fold, analyze medical scans, and play games as complex as Go, to name just a few applications of a technique that is now becoming pervasive. Success in those and other realms has brought this machine-learning technique from obscurity in the early 2000s to dominance today.

Although deep learning's rise to fame is relatively recent, its origins are not. In 1958, back when mainframe computers filled rooms and ran on vacuum tubes, knowledge of the interconnections between neurons in the brain inspired Frank Rosenblatt at Cornell to design the first artificial neural network, which he presciently described as a "pattern-recognizing device." But Rosenblatt's ambitions outpaced the capabilities of his era--and he knew it. Even his inaugural paper was forced to acknowledge the voracious appetite of neural networks for computational power, bemoaning that "as the number of connections in the network increases...the burden on a conventional digital computer soon becomes excessive."

This article is part of our special report on AI, "The Great AI Reckoning."

Fortunately for such artificial neural networks--later rechristened "deep learning" when they included extra layers of neurons--decades of Moore's Law and other improvements in computer hardware yielded a roughly 10-million-fold increase in the number of computations that a computer could do in a second. So when researchers returned to deep learning in the late 2000s, they wielded tools equal to the challenge.

These more-powerful computers made it possible to construct networks with vastly more connections and neurons and hence greater ability to model complex phenomena. Researchers used that ability to break record after record as they applied deep learning to new tasks.

While deep learning's rise may have been meteoric, its future may be bumpy. Like Rosenblatt before them, today's deep-learning researchers are nearing the frontier of what their tools can achieve. To understand why this will reshape machine learning, you must first understand why deep learning has been so successful and what it costs to keep it that way.
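For a concrete picture of what Rosenblatt's "pattern-recognizing device" computed, here is a minimal sketch of a perceptron in Python. It is an illustration only, not code from the article or from Rosenblatt's hardware; the toy data set and training loop are invented for the example.

```python
# A minimal perceptron: a weighted sum of inputs passed through a threshold,
# with weights nudged after each mistake. Illustrative sketch only.
import numpy as np

rng = np.random.default_rng(0)

# Toy "pattern recognition" task: separate points above and below a line.
X = rng.uniform(-1, 1, size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # 1 if above the line, else 0

w = np.zeros(2)   # connection weights
b = 0.0           # bias (threshold)

# Classic perceptron learning rule: adjust weights only on misclassified points.
for _ in range(20):                        # a few passes over the data
    for xi, target in zip(X, y):
        pred = int(xi @ w + b > 0)
        if pred != target:
            w += (target - pred) * xi
            b += (target - pred)

accuracy = np.mean([int(xi @ w + b > 0) == t for xi, t in zip(X, y)])
print(f"training accuracy: {accuracy:.2f}")   # near 1.0 for separable data
```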
Deep learning is a modern incarnation of the long-running trend in artificial intelligence that has been moving from streamlined systems based on expert knowledge toward flexible statistical models. Early AI systems were rule based, applying logic and expert knowledge to derive results. Later systems incorporated learning to set their adjustable parameters, but these were usually few in number. Today's neural networks also learn parameter values, but those parameters are part of such flexible computer models that--if they are big enough--they become universal function approximators, meaning they can fit any type of data. This unlimited flexibility is the reason why deep learning can be applied to so many different domains.

The flexibility of neural networks comes from taking the many inputs to the model and having the network combine them in myriad ways. This means the outputs won't be the result of applying simple formulas but instead immensely complicated ones. For example, when the cutting-edge image-recognition system Noisy Student converts the pixel values of an image into probabilities for what the object in that image is, it does so using a network with 480 million parameters. The training to ascertain the values of such a large number of parameters is even more remarkable because it was done with only 1.2 million labeled images--which may understandably confuse those of us who remember from high school algebra that we are supposed to have more equations than unknowns.

Breaking that rule turns out to be the key. Deep-learning models are overparameterized, which is to say they have more parameters than there are data points available for training. Classically, this would lead to overfitting, where the model not only learns general trends but also the random vagaries of the data it was trained on. Deep learning avoids this trap by initializing the parameters randomly and then iteratively adjusting sets of them to better fit the data using a method called stochastic gradient descent. Surprisingly, this procedure has been proven to ensure that the learned model generalizes well.

The success of flexible deep-learning models can be seen in machine translation. For decades, software has been used to translate text from one language to another. Early approaches to this problem used rules designed by grammar experts. But as more textual data became available in specific languages, statistical approaches--ones that go by such esoteric names as maximum entropy, hidden Markov models, and conditional random fields--could be applied. Initially, the approaches that worked best for each language differed based on data availability and grammatical properties. For example, rule-based approaches to translating languages such as Urdu, Arabic, and Malay outperformed statistical ones--at first. Today, all these approaches have been outpaced by deep learning, which has proven itself superior almost everywhere it's applied.
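To make the overparameterization point above concrete, here is a minimal sketch, not taken from the article, of a model with ten times more parameters than training points fit by plain stochastic gradient descent. The random-feature model, toy data, and hyperparameters are assumptions chosen purely for illustration.

```python
# Overparameterization sketch: 400 parameters, 40 training points, plain SGD.
import numpy as np

rng = np.random.default_rng(0)

# 40 noisy training points from a simple underlying function.
x_train = rng.uniform(-1, 1, size=40)
y_train = np.sin(3 * x_train) + 0.1 * rng.normal(size=40)

# Random-feature model with 400 parameters, 10x more than training points.
n_features = 400
W = rng.normal(scale=4.0, size=n_features)       # fixed random frequencies
b = rng.uniform(0, 2 * np.pi, size=n_features)   # fixed random phases

def features(x):
    # Map inputs to 400 random cosine features (a stand-in for a big network).
    return np.cos(np.outer(x, W) + b) / np.sqrt(n_features)

theta = np.zeros(n_features)   # trainable parameters

# Plain stochastic gradient descent on squared error, one point at a time.
lr = 0.1
for epoch in range(500):
    for i in rng.permutation(len(x_train)):
        phi = features(x_train[i:i + 1])[0]
        err = phi @ theta - y_train[i]
        theta -= lr * err * phi

# Despite far more parameters than data points, the fit typically generalizes
# reasonably well on fresh inputs rather than simply memorizing the noise.
x_test = rng.uniform(-1, 1, size=200)
pred = features(x_test) @ theta
print("test RMSE:", np.sqrt(np.mean((pred - np.sin(3 * x_test)) ** 2)))
```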
So the good news is that deep learning provides enormous flexibility. The bad news is that this flexibility comes at an enormous computational cost. This unfortunate reality has two parts.

[Chart: Extrapolating the gains of recent years might suggest that by 2025 the error level in the best deep-learning systems designed for recognizing objects in the ImageNet data set should be reduced to just 5 percent (top). But the computing resources and energy required to train such a future system would be enormous, leading to the emission of as much carbon dioxide as New York City generates in one month (bottom). Source: N.C. Thompson, K. Greenewald, K. Lee, G.F. Manso]

The first part is true of all statistical models: To improve performance by a factor of k, at least k^2 more data points must be used to train the model. The second part of the computational cost comes explicitly from overparameterization. Once accounted for, this yields a total computational cost for improvement of at least k^4. That little 4 in the exponent is very expensive: A 10-fold improvement, for example, would require at least a 10,000-fold increase in computation.

To make the flexibility-computation trade-off more vivid, consider a scenario where you are trying to predict whether a patient's X-ray reveals cancer. Suppose further that the true answer can be found if you measure 100 details in the X-ray (often called variables or features). The challenge is that we don't know ahead of time which variables are important, and there could be a very large pool of candidate variables to consider.

The expert-system approach to this problem would be to have people who are knowledgeable in radiology and oncology specify the variables they think are important, allowing the system to examine only those. The flexible-system approach is to test as many of the variables as possible and let the system figure out on its own which are important, requiring more data and incurring much higher computational costs in the process.

Models for which experts have established the relevant variables are able to learn quickly what values work best for those variables, doing so with limited amounts of computation--which is why they were so popular early on. But their ability to learn stalls if an expert hasn't correctly specified all the variables that should be included in the model. In contrast, flexible models like deep learning are less efficient, taking vastly more computation to match the performance of expert models. But, with enough computation (and data), flexible models can outperform ones for which experts have attempted to specify the relevant variables.

Clearly, you can get improved performance from deep learning if you use more computing power to build bigger models and train them with more data. But how expensive will this computational burden become? Will costs become sufficiently high that they hinder progress?

To answer these questions in a concrete way, we recently gathered data from more than 1,000 research papers on deep learning, spanning the areas of image classification, object detection, question answering, named-entity recognition, and machine translation. Here, we will only discuss image classification in detail, but the lessons apply broadly.

Over the years, reducing image-classification errors has come with an enormous expansion in computational burden. For example, in 2012 AlexNet, the model that first showed the power of training deep-learning systems on graphics processing units (GPUs), was trained for five to six days using two GPUs. By 2018, another model, NASNet-A, had cut the error rate of AlexNet in half, but it used more than 1,000 times as much computing to achieve this.

Our analysis of this phenomenon also allowed us to compare what's actually happened with theoretical expectations. Theory tells us that computing needs to scale with at least the fourth power of the improvement in performance.
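The arithmetic behind these scaling exponents is simple enough to sketch directly. The short Python snippet below is illustrative only; the exponent 4 is the theoretical floor just described, and the exponent 9 is the empirical estimate discussed next.

```python
# Compute multiplier needed for a k-fold performance improvement,
# under a power-law cost model: compute grows as k ** exponent.
# Exponent 4 is the theoretical floor; exponent 9 is the empirical trend
# the authors describe below.
def extra_compute(improvement_factor: float, exponent: float) -> float:
    return improvement_factor ** exponent

for k in (2, 10):  # e.g., halving the error rate, or a 10x improvement
    print(f"{k}x better: theory ~{extra_compute(k, 4):,.0f}x compute, "
          f"observed trend ~{extra_compute(k, 9):,.0f}x compute")

# 2x better:  theory ~16x,     observed ~512x (the article's "more than 500 times")
# 10x better: theory ~10,000x, observed ~1,000,000,000x
```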
In practice, the actual requirements have scaled with at least the ninth power. This ninth power means that to halve the error rate, you can expect to need more than 500 times the computational resources. That's a devastatingly high price. There may be a silver lining here, however. The gap between what's happened in practice and what theory predicts might mean that there are still undiscovered algorithmic improvements that could greatly improve the efficiency of deep learning.

As we noted, Moore's Law and other hardware advances have provided massive increases in chip performance. Does this mean that the escalation in computing requirements doesn't matter? Unfortunately, no. Of the 1,000-fold difference in the computing used by AlexNet and NASNet-A, only a six-fold improvement came from better hardware; the rest came from using more processors or running them longer, incurring higher costs.

Having estimated the computational cost-performance curve for image recognition, we can use it to estimate how much computation would be needed to reach even more impressive performance benchmarks in the future. For example, achieving a 5 percent error rate would require 10^19 billion floating-point operations. Important work by scholars at the University of Massachusetts Amherst allows us to understand the economic cost and carbon emissions implied by this computational burden. The answers are grim: Training such a model would cost US $100 billion and would produce as much carbon emissions as New York City does in a month. And if we estimate the computational burden of a 1 percent error rate, the results are considerably worse.

Is extrapolating out so many orders of magnitude a reasonable thing to do? Yes and no. Certainly, it is important to understand that the predictions aren't precise, although with such eye-watering results, they don't need to be to convey the overall message of unsustainability. Extrapolating this way would be unreasonable if we assumed that researchers would follow this trajectory all the way to such an extreme outcome. We don't. Faced with skyrocketing costs, researchers will either have to come up with more efficient ways to solve these problems, or they will abandon working on these problems and progress will languish.

On the other hand, extrapolating our results is not only reasonable but also important, because it conveys the magnitude of the challenge ahead. The leading edge of this problem is already becoming apparent. When Google subsidiary DeepMind trained its system to play Go, it was estimated to have cost $35 million. When DeepMind's researchers designed a system to play the StarCraft II video game, they purposefully didn't try multiple ways of architecting an important component, because the training cost would have been too high.

At OpenAI, an important machine-learning think tank, researchers recently designed and trained a much-lauded deep-learning language system called GPT-3 at the cost of more than $4 million. Even though they made a mistake when they implemented the system, they didn't fix it, explaining simply in a supplement to their scholarly publication that "due to the cost of training, it wasn't feasible to retrain the model." Even businesses outside the tech industry are now starting to shy away from the computational expense of deep learning.
A large European supermarket chain recently abandoned a deep-learning-based system that markedly improved its ability to predict which products would be purchased. The company executives dropped that attempt because they judged that the cost of training and running the system would be too high.

Faced with rising economic and environmental costs, the deep-learning community will need to find ways to increase performance without causing computing demands to go through the roof. If they don't, progress will stagnate. But don't despair yet: Plenty is being done to address this challenge.

One strategy is to use processors designed specifically to be efficient for deep-learning calculations. This approach was widely used over the last decade, as CPUs gave way to GPUs and, in some cases, field-programmable gate arrays and application-specific ICs (including Google's Tensor Processing Unit). Fundamentally, all of these approaches sacrifice the generality of the computing platform for the efficiency of increased specialization. But such specialization faces diminishing returns. So longer-term gains will require adopting wholly different hardware frameworks--perhaps hardware that is based on analog, neuromorphic, optical, or quantum systems. Thus far, however, these wholly different hardware frameworks have yet to have much impact.

Another approach to reducing the computational burden focuses on generating neural networks that, when implemented, are smaller. This tactic lowers the cost each time you use them, but it often increases the training cost (what we've described so far in this article). Which of these costs matters most depends on the situation. For a widely used model, running costs are the biggest component of the total sum invested. For other models--for example, those that frequently need to be retrained--training costs may dominate. In either case, the total cost must be larger than just the training on its own. So if the training costs are too high, as we've shown, then the total costs will be, too.

And that's the challenge with the various tactics that have been used to make implementation smaller: They don't reduce training costs enough. For example, one tactic allows for training a large network but penalizes complexity during training. Another involves training a large network and then "pruning" away unimportant connections. Yet another finds as efficient an architecture as possible by optimizing across many models--something called neural-architecture search. While each of these techniques can offer significant benefits for implementation, the effects on training are muted--certainly not enough to address the concerns we see in our data. And in many cases they make the training costs higher.
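As a deliberately simplified illustration of one of the implementation-shrinking tactics mentioned above, here is a sketch of magnitude pruning, in which the smallest weights of a trained layer are zeroed out so the deployed model is cheaper to run. The weight matrix and sparsity level are invented for the example; real pruning pipelines typically also fine-tune the remaining weights.

```python
# Magnitude pruning sketch: zero out the smallest weights of a trained layer.
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a trained layer's weight matrix.
weights = rng.normal(size=(256, 256))

def magnitude_prune(w: np.ndarray, sparsity: float) -> np.ndarray:
    """Zero out the fraction `sparsity` of weights with the smallest magnitude."""
    threshold = np.quantile(np.abs(w), sparsity)
    return np.where(np.abs(w) < threshold, 0.0, w)

pruned = magnitude_prune(weights, sparsity=0.9)  # keep only the largest 10%
print("fraction of weights remaining:", np.mean(pruned != 0))

# The catch discussed above: this shrinks the deployed model, but the expensive
# part, training the large network in the first place, has already happened.
```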
One up-and-coming technique that could reduce training costs goes by the name meta-learning. The idea is that the system learns on a variety of data and then can be applied in many areas. For example, rather than building separate systems to recognize dogs in images, cats in images, and cars in images, a single system could be trained on all of them and used multiple times.

Unfortunately, recent work by Andrei Barbu of MIT has revealed how hard meta-learning can be. He and his coauthors showed that even small differences between the original data and where you want to use it can severely degrade performance. They demonstrated that current image-recognition systems depend heavily on things like whether the object is photographed at a particular angle or in a particular pose. So even the simple task of recognizing the same objects in different poses causes the accuracy of the system to be nearly halved. Benjamin Recht of the University of California, Berkeley, and others made this point even more starkly, showing that even with novel data sets purposely constructed to mimic the original training data, performance drops by more than 10 percent. If even small changes in data cause large performance drops, the data needed for a comprehensive meta-learning system might be enormous. So the great promise of meta-learning remains far from being realized.

Another possible strategy to evade the computational limits of deep learning would be to move to other, perhaps as-yet-undiscovered or underappreciated types of machine learning. As we described, machine-learning systems constructed around the insight of experts can be much more computationally efficient, but their performance can't reach the same heights as deep-learning systems if those experts cannot distinguish all the contributing factors. Neuro-symbolic methods and other techniques are being developed to combine the power of expert knowledge and reasoning with the flexibility often found in neural networks.

Like the situation that Rosenblatt faced at the dawn of neural networks, deep learning is today becoming constrained by the available computational tools. Faced with computational scaling that would be economically and environmentally ruinous, we must either adapt how we do deep learning or face a future of much slower progress. Clearly, adaptation is preferable. A clever breakthrough might find a way to make deep learning more efficient or computer hardware more powerful, which would allow us to continue to use these extraordinarily flexible models. If not, the pendulum will likely swing back toward relying more on experts to identify what needs to be learned.