When Data Is Missing, Scientists Guess. Then Guess Again.

By Matt von Hippel, Contributing Writer
October 2, 2024

Across the social and biological sciences, statisticians use a technique that leverages randomness to deal with the unknown.

Introduction

Data is almost always incomplete. Patients drop out of clinical trials and survey respondents skip questions; schools fail to report scores, and governments ignore elements of their economies. When data goes missing, standard statistical tools, like taking averages, are no longer useful.

"We cannot calculate with missing data, just as we can't divide by zero," said Stef van Buuren, the professor of statistical analysis of incomplete data at the University of Utrecht.

Suppose you are testing a new drug to reduce blood pressure. You measure the blood pressure of your study participants every week, but a few get impatient: Their blood pressure hasn't improved much, so they stop showing up.

You could leave those patients out, keeping only the data of those who completed the study, a method known as complete case analysis. That may seem intuitive, even obvious. It's also cheating. If you leave out the people who didn't complete the study, you're excluding the cases where your drug did the worst, making the treatment look better than it actually is. You've biased your results.

Avoiding this bias, and doing it well, is surprisingly hard. For a long time, researchers relied on ad hoc tricks, each with their own major shortcomings. But in the 1970s, a statistician named Donald Rubin proposed a general technique, albeit one that strained the computing power of the day. His idea was essentially to make a bunch of guesses about what the missing data could be, and then to use those guesses. This method met with resistance at first, but over the past few decades, it has become the most common way to deal with missing data in everything from population studies to drug trials. Recent advances in machine learning might make it even more widespread.
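To see that bias in miniature before turning to Rubin's fix, here is a small simulation sketch in Python. All of its numbers are hypothetical (the sample size, the 5-point true effect, the dropout rule); the point is only that a complete case analysis flatters the drug once the worst responders stop showing up.

```python
import numpy as np

# Hypothetical trial: the drug truly lowers blood pressure by 5 points on average.
rng = np.random.default_rng(0)
n = 1_000
baseline = rng.normal(150, 10, n)              # week-0 blood pressure
final = baseline - 5 + rng.normal(0, 10, n)    # end-of-study blood pressure
improvement = baseline - final

# Patients who improved the least are the most likely to stop showing up.
p_dropout = 1 / (1 + np.exp(improvement / 5))
stayed = rng.random(n) > p_dropout

true_change = (final - baseline).mean()
complete_case_change = (final[stayed] - baseline[stayed]).mean()
print(f"True average change:    {true_change:+.1f}")
print(f"Complete-case estimate: {complete_case_change:+.1f}")
# The complete-case number comes out noticeably more negative than the truth:
# dropping the dropouts makes the drug look better than it is.
```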
A Touch of Randomness

In the 1970s, Donald Rubin invented and evangelized a new statistical method for dealing with missing data. Though controversial at first, today it's ubiquitous across many scientific fields. (Photo courtesy of Donald Rubin)

Outside of statistics, to "impute" means to assign responsibility or blame. Statisticians instead assign data. If you forget to fill out your height on a questionnaire, for instance, they might assign you a plausible height, like the average height for your gender. That kind of guess is known as single imputation. A statistical technique that dates back to 1930, single imputation works better than just ignoring missing data. By the 1960s, it was often statisticians' method of choice. Rubin would change that.

Rubin began his undergraduate studies in the early '60s as a physics major, only to switch to psychology. Then, after beginning graduate school at Harvard University, he was told he could not skip the psychology department's required math courses. Feeling that he'd already covered the material in college, he switched to computer science, completing his master's degree in 1966. Afterward, he spent a summer writing statistics programs for a sociologist, which inspired him to get a doctorate in statistics.

During his doctoral studies, Rubin grew interested in the missing data problem. Though single imputation avoided the bias of complete case analysis, Rubin saw that it had its own flaw: overconfidence. No matter how accurate a guess might seem, statisticians can never be completely sure it's correct. Techniques involving single imputation often underestimate the uncertainty they introduce. Moreover, while statisticians can find ways to correct for this, Rubin realized that their methods tended to be finicky and specialized, with each situation practically requiring its own master's thesis. He wanted a method that was both accurate and general, adaptable to almost any situation.

Stef van Buuren of the University of Utrecht has helped develop statistical techniques for analyzing incomplete data. "We cannot calculate with missing data, just as we can't divide by zero," he said. (Photo: Eric de Vries)

In 1971, a year after completing his doctorate, Rubin started working for the Educational Testing Service in Princeton, New Jersey. When a government agency asked ETS to analyze a survey with missing data, Rubin proposed an unconventional but surprisingly simple solution: Don't just impute once. Impute multiple times.

Imputing, and Imputing Again

Let's go back to that blood pressure study. You're testing a new blood pressure drug, and some of the patients stop showing up at the clinic. What do you do?

If you were using single imputation, you might assume that anyone who left the study kept their last measured blood pressure forever. Or you could try something more sophisticated: Find, say, another patient whose progress was similar to that of the missing case, and use their data instead. But there are likely several similar cases you could choose from -- and substituting a different value can lead to a very different result. All the different choices you might make give what statisticians call a distribution of predictions for the missing data.
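To make that concrete, here is a minimal sketch, with entirely made-up numbers, of two single-imputation strategies applied to a toy table of weekly blood-pressure readings. Each fills in the blanks with a single guess, and each gives a different answer for the final-week average.

```python
import numpy as np

# Toy table: rows are patients, columns are weekly readings; NaN marks missed visits.
data = np.array([
    [150.0, 145.0, 142.0, 140.0],
    [155.0, 152.0, np.nan, np.nan],   # dropped out after week 1
    [148.0, 147.0, 146.0, 144.0],
    [160.0, 158.0, 157.0, np.nan],    # missed the final visit
])

# Strategy 1: last observation carried forward.
locf = data.copy()
for row in locf:
    for j in range(1, row.size):
        if np.isnan(row[j]):
            row[j] = row[j - 1]

# Strategy 2: borrow the missing weeks from the most similar complete case
# ("similar" here just means the closest week-0 reading).
complete_cases = data[~np.isnan(data).any(axis=1)]
borrowed = data.copy()
for row in borrowed:
    gaps = np.isnan(row)
    if gaps.any():
        donor = complete_cases[np.argmin(np.abs(complete_cases[:, 0] - row[0]))]
        row[gaps] = donor[gaps]

print("Final-week average, carry-forward:", locf[:, -1].mean())      # 148.25
print("Final-week average, borrowed:     ", borrowed[:, -1].mean())  # 141.0
```

Both completed tables look equally authoritative even though several of their entries are pure guesses; that false certainty is the overconfidence Rubin objected to.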
Rubin's approach, called multiple imputation, takes that distribution into account. To use it, first make several copies of your data set. For a given missing value in one copy, randomly assign a guess from your distribution. By design, you're more likely to pick one of the better guesses, but you'll also have a small chance of picking one of the less plausible guesses. This process reflects the uncertainty in each guess. Repeat these steps for the missing value in each of the other copies of the data set.

Once you've filled in all the missing data, you can analyze each completed data set. You'll end up with several different predictions for the effectiveness of your drug. Then you can use a recipe known as Rubin's rules to pool your results and get an average prediction. By following these steps, you can also compute a better estimate of your final prediction's uncertainty. (A worked sketch of this recipe appears at the end of this section.) For drug regulators like the U.S. Food and Drug Administration, being accurate about that uncertainty is crucial: It influences whether or not a drug will get approved.

The Multiplying Uses of Multiple Imputation

When Rubin first introduced his technique in the early 1970s, many scientists were dubious. Why, they asked, would they want to use anything other than the best guess? Even those who wanted to try it sometimes found it hard to implement: If their study involved, say, census data, then storing several copies of it would mean managing hundreds of millions of data entries. In an era when data had to be stored on punch cards, this was almost impossible.

Rubin evangelized his method throughout the 1970s and '80s. He consulted for a number of government agencies, including the IRS, the National Institutes of Health, the Department of Labor and the Department of Defense -- agencies that had the resources to make many copies of large data sets. His work with them showed how effective multiple imputation could be. The organizations also created imputed data that others could then use in their own analyses.

By the 1990s, computer memory and processing power had significantly advanced. Multiple imputation became accessible not just to government agencies, but to individual researchers. Among them was van Buuren. In 1999, he and Karin Groothuis-Oudshoorn released a computer program that made it even easier for scientists to use multiple imputation. More programs followed, and multiple imputation became more widespread. Then, in 2010, a report commissioned by the FDA strongly recommended against single imputation and older ad hoc methods. Multiple imputation became the go-to technique in medicine.

Multiple imputation turned out to be both rigorous and versatile. While there are other methods that avoid the drawbacks of single imputation, multiple imputation is the most general: It works any time you might have otherwise tried to use single imputation. Software for multiple imputation still struggles with the largest and most complicated data sets. But new multiple-imputation software that uses machine learning has been able to impute more complicated data. This, in turn, has introduced multiple imputation to fields like engineering, where ad hoc methods have been more common.
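For readers who want to see the mechanics, here is a minimal Python sketch of the recipe Rubin proposed (copy the data set, fill in each copy at random, analyze each one, then pool). The toy numbers and the way plausible values are generated are assumptions made purely for illustration; real multiple-imputation software models the missing values far more carefully. The pooling step at the end, though, follows the standard form of Rubin's rules.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data: each entry is a patient's change in blood pressure; NaN marks dropouts.
change = np.array([-8.0, -3.0, np.nan, -6.0, np.nan, -1.0, -7.0, np.nan, -4.0, -2.0])
observed = change[~np.isnan(change)]
missing = np.isnan(change)
m = 20                                   # number of imputed copies of the data set

estimates, variances = [], []
for _ in range(m):
    copy = change.copy()
    # Fill each gap with a random draw from a distribution of plausible values.
    # Here: resample an observed patient and add noise, shifted upward because
    # dropouts tended to respond worse (an assumption made only for illustration).
    draws = rng.choice(observed, size=missing.sum()) + rng.normal(2.0, 2.0, missing.sum())
    copy[missing] = draws
    estimates.append(copy.mean())                   # analysis of one completed copy
    variances.append(copy.var(ddof=1) / copy.size)  # its squared standard error

# Rubin's rules: pool the m analyses into one estimate with an honest uncertainty.
q_bar = np.mean(estimates)                 # pooled estimate
u_bar = np.mean(variances)                 # average within-imputation variance
b = np.var(estimates, ddof=1)              # between-imputation variance
total_variance = u_bar + (1 + 1 / m) * b

print(f"Pooled average change: {q_bar:.2f}, standard error: {np.sqrt(total_variance):.2f}")
```

The between-imputation term is exactly what single imputation discards: it measures how much the answer moves when the guesses change, which is the extra uncertainty regulators want reported.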
That said, some researchers still worry about the mathematical rigor of these new machine-learning-based techniques and are more hesitant to adopt them. For now, though, it seems that Rubin's multiple imputation is here to stay. Whether scientists are testing a new drug or analyzing voting patterns, random guesses are helping them stay honest about what they know.