(C) PLOS One. This story was originally published by PLOS One and is unaltered.

A multi-lab experimental assessment reveals that replicability can be improved by using empirical estimates of genotype-by-lab interaction [1]

Iman Jaljuli (Department of Statistics, Operations Research, Tel-Aviv University, Tel-Aviv; Department of Epidemiology, Biostatistics, Memorial Sloan Kettering Cancer Center, New York, United States of America)

Date: 2023-05

The utility of mouse and rat studies critically depends on their replicability in other laboratories. A widely advocated approach to improving replicability is through the rigorous control of predefined animal or experimental conditions, known as standardization. However, this approach limits the generalizability of the findings to the standardized conditions alone and is a potential cause of, rather than a solution to, what has been called a replicability crisis. Alternative strategies include estimating the heterogeneity of effects across laboratories, either through designs that vary testing conditions or by direct statistical analysis of laboratory variation. We previously evaluated our statistical approach for estimating the interlaboratory replicability of a single-laboratory discovery. Those results, however, were from a well-coordinated, multi-lab phenotyping study and did not extend to the more realistic setting in which laboratories operate independently of each other. Here, we sought to test our statistical approach in a realistic prospective experiment, in mice, using 152 results from 5 independent published studies deposited in the Mouse Phenome Database (MPD). In independent replication experiments at 3 laboratories, we found that 53 of the results were replicable, so the other 99 were considered non-replicable.
Of the 99 non-replicable results, 59 were statistically significant (at the 0.05 level) in their original single-lab analysis, putting the probability that a single-lab statistical discovery is made even though it is non-replicable at 59.6%. We then introduced the dimensionless “Genotype-by-Laboratory” (GxL) factor: the ratio between the standard deviation of the GxL interaction and the standard deviation within groups. Using the GxL factor reduced the number of single-lab statistical discoveries and, with it, reduced the probability that a non-replicable result is discovered in the single lab to 12.1%. Such a reduction naturally leads to reduced power to make replicable discoveries, but this reduction was small (from 87% to 66%), indicating the small price paid for the large improvement in replicability. Tools and data needed for the above GxL adjustment are publicly available at the MPD and will become increasingly useful as the range of assays and testing conditions in this resource increases.

Data Availability: Raw data files and R code for reproducible research are available online at Zenodo (https://doi.org/10.5281/zenodo.7672211) and GitHub (https://github.com/IJaljuli/Improving-replicability-using-interaction-2021). The replicability estimator tool is now implemented in MPD (https://phenome.jax.org/replicability). This enables users to submit new experimental results, select prior relevant studies, and perform the GxL adjustment on their own data. All other data are available in the manuscript and Supporting Information files.

Introduction

The scientific community is concerned about published results that fail to replicate in many fields, including preclinical animal models, drug discovery, and the discovery of mammalian gene function [1–3]. Many reports have called out a “crisis” in replicability as an explanation for translational failures of preclinical models.
Indeed, some of the first concerns regarding the complex interaction between genotype and the conducting laboratory were raised in the field of rodent behavioral phenotyping [4]. While mouse and rat models may predict the human situation, as in the case of activity-dependent neuroprotective protein (ADNP) and the potential of its fragment as a drug (reviewed in Gozes [5]), the utility of any findings critically depends on their replicability in other laboratories [6–8]. A similar concern arises regarding the interaction between the conducting laboratory and novel pharmacological treatments (e.g., Rossello and colleagues [9]), which are of vital importance for translational research into novel drug development.

It should be emphasized that the impact of such animal studies goes well beyond animal behavior to clinical studies in neurology and psychiatry. These clinical studies, requiring multiple research centers, are much less homogeneous in terms of the genetic and environmental backgrounds of the treatment cohorts. As a result, many failures are noted in clinical studies employing therapies deemed efficacious in animal studies. As Collins and Tabak wrote when discussing these problems in preclinical animal studies, “If the antecedent work is questionable and the trial is particularly important, key preclinical studies may first need to be validated independently” [10].

In response, there have been several attempts to refine experimental design and practice in order to extract a pure treatment effect. In most cases, a radical push toward standardization of laboratory conditions, genotypes, and other study conditions has been advocated. However, such attempts are misguided, as effects are often dependent on idiosyncratic conditions, and therefore standardization produces exactly the opposite of the intended effect: rather than increasing replicability, it limits generalizability to the narrow range of conditions under which the finding was obtained.
This is sometimes referred to as “the standardization fallacy” [11,12]. The problem intensifies if the usual recommendation to increase power through a larger sample size is followed, for then there is high power to find even small effects particular to the study. One should instead seek to estimate the extent to which a discovery is replicable across the range of likely conditions. For this purpose, heterogenization or systematic variation of testing conditions has been advanced as a strategy; however, both approaches increase experimental costs through somewhat larger sample sizes [7,12]. Moreover, these efforts have yet to prove practical and useful [8,13].

In a previous publication [14], we proposed an alternative to standardization or heterogenization in order to assess statistically the replicability of single-lab results, before making the effort to replicate them across multiple labs. The statistical approach hinges on the “Random Lab Model” for the measured phenotype of a specific genotype in a particular laboratory [15]. In particular, we considered a result to be “replicable” if it was tested in a multi-lab experiment and was statistically significant under the assumptions of the random lab model (the 0.05 level is used throughout the paper). This model treats both the effect of the lab and, more importantly, the effect of the interaction of this genotype with this particular lab as random. The random effect of the lab cancels out when comparing 2 genotypes in the same lab, but the random interaction contributions add up. Moreover, the actual interaction effect cannot be separated from the lab effect in the analysis of single-lab results. Still, it can be separated in multi-lab experiments, and while the values are irrelevant to a new lab, their standard deviation is relevant and can be estimated.
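The cancellation argument above can be written out as a short derivation. The notation here is assumed for illustration and is not taken from the paper: it supposes an additive model with independent, normally distributed lab, interaction, and within-group terms, and groups of sizes n_1 and n_2.

```latex
% Assumed notation (illustrative only):
% y_{GLi} is the phenotype of animal i of genotype G measured in lab L.
y_{GLi} = \mu_G + a_L + b_{G \times L} + \varepsilon_{GLi},
\qquad a_L \sim N(0, \sigma_L^2),\quad
b_{G \times L} \sim N(0, \sigma_{G \times L}^2),\quad
\varepsilon_{GLi} \sim N(0, \sigma^2)

% Difference of group means for genotypes G_1 and G_2 in the same lab L:
% the lab term a_L appears in both means and cancels.
\bar{y}_{G_1 L} - \bar{y}_{G_2 L}
  = (\mu_{G_1} - \mu_{G_2})
  + (b_{G_1 \times L} - b_{G_2 \times L})
  + (\bar{\varepsilon}_{G_1 L} - \bar{\varepsilon}_{G_2 L})

% The two independent interaction terms add up in the variance.
\operatorname{Var}\!\left(\bar{y}_{G_1 L} - \bar{y}_{G_2 L}\right)
  = 2\,\sigma_{G \times L}^2 + \sigma^2\left(\frac{1}{n_1} + \frac{1}{n_2}\right)
```

Under these assumptions the lab variance \sigma_L^2 never enters the comparison, while the interaction variance enters twice. A single lab cannot estimate \sigma_{G \times L}^2 from its own data, which is why it has to be estimated from multi-lab data.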
We therefore suggested estimating the interlaboratory replicability of novel discoveries in a single-lab study in the following way: We first estimate the Genotype-by-Laboratory (GxL) interaction standard deviation from previous data from other labs and possibly other genotypes. We then adjust the within-groups standard deviation, which is usually used for testing and for confidence intervals in a single-lab analysis, by inflating it with the GxL interaction standard deviation (see Statistical methods). This “GxL adjustment” thus generates a larger yardstick against which genotype differences are tested and confidence intervals are reported. Consequently, this adjustment raises the benchmark for discovering a significant genotype effect, trading some statistical power for better replicability. We demonstrated that previous phenotyping results from multi-lab databases can be used to derive a GxL-adjustment term to ensure (within the usual 0.05 error) the replicability of single-lab results, for the same phenotypes and genotypes, even before making the effort of replicating the findings in additional laboratories [14].

This demonstration, however, still raises several important questions. Kafkafi and colleagues [14] used data from a highly coordinated, multi-lab phenotyping program to estimate the standard deviations of the GxL interaction for each phenotype. These were then used to adjust the results of each of these same labs separately. While the success of this demonstration is encouraging, it does not cover the more realistic setting where the adjusted laboratories operate independently of the laboratories used for generating the GxL adjustment. Here, we investigate whether a GxL adjustment derived from independently collected data in other labs, when applied to single-lab results, reduces the proportion of single-lab discoveries among the non-replicable discoveries, relative to the naïve analysis, and what loss of power it involves.
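As a rough illustration of how such an adjustment enlarges the yardstick, the following Python sketch compares two genotype groups from one lab with and without an interaction term. It is a minimal sketch under stated assumptions, not the authors' implementation: the function name `gxl_adjusted_test`, the example measurements, the form of the adjusted standard error (within-groups variance plus twice the interaction variance), and the normal approximation for p-values are all assumptions made here. The paper's Statistical methods section, which is not reproduced in this text, is the authoritative definition.

```python
import math
from statistics import NormalDist, mean, variance


def gxl_adjusted_test(group_a, group_b, sigma_gxl):
    """Compare two genotype groups from one lab, naive vs. GxL-adjusted.

    sigma_gxl is an interaction standard deviation estimated elsewhere
    (for example from multi-lab data), in the same units as the data.
    """
    n_a, n_b = len(group_a), len(group_b)
    diff = mean(group_a) - mean(group_b)

    # Pooled within-groups variance, as in the usual two-sample t-test.
    s2_within = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)

    # Naive standard error: within-groups noise only.
    se_naive = math.sqrt(s2_within * (1 / n_a + 1 / n_b))

    # ASSUMPTION: adjusted standard error adds 2 * sigma_gxl^2, because the
    # interaction terms of the two genotypes add up in the difference.
    se_adjusted = math.sqrt(s2_within * (1 / n_a + 1 / n_b) + 2 * sigma_gxl**2)

    def two_sided_p(stat):
        # ASSUMPTION: normal approximation, not a t reference distribution.
        return 2 * (1 - NormalDist().cdf(abs(stat)))

    naive_stat = diff / se_naive
    adjusted_stat = diff / se_adjusted
    return {
        "difference": diff,
        "naive_stat": naive_stat,
        "naive_p": two_sided_p(naive_stat),
        "adjusted_stat": adjusted_stat,
        "adjusted_p": two_sided_p(adjusted_stat),
    }


if __name__ == "__main__":
    # Hypothetical measurements for two genotypes in a single lab.
    genotype_a = [10.1, 11.3, 9.8, 10.9, 10.4]
    genotype_b = [12.0, 12.8, 11.5, 12.4, 12.9]
    for key, value in gxl_adjusted_test(genotype_a, genotype_b, 1.0).items():
        print(f"{key}: {value:.4f}")
```

With `sigma_gxl = 0` the adjusted statistic equals the naive one. Any positive value shrinks the statistic, which is the trade of some power for better replicability described above.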
A related important question is whether GxL estimation from standardized studies can be used to successfully identify replicable results in studies that were not subject to the same standardization. Namely, will the adjustment based on data from the International Mouse Phenotyping Consortium (IMPC) [16,17], which typically uses relatively well-coordinated, standardized protocols, predict the replicability of results obtained in more common and realistic scenarios, such as those deposited by many investigators into the Mouse Phenome Database (MPD; Phenome.jax.org [18])? Unlike the IMPC, MPD archives previously conducted studies, which were not a priori meant to be part of a multi-lab project. The methods, apparatus, endpoints, and protocols of such experiments are thus not expected to be standardized.

Finally, our previous demonstration of GxL adjustment tested only genotype effects, using inbred strains and knockouts, but not pharmacological effects. It therefore remains to be tested whether the pre-estimated interaction of treatment with lab (TxL), or the interaction of the genotype and pharmacological treatment with the lab (GxTxL), can also be used to adjust single-lab treatment testing in a similar way.

In order to enable such studies, we modify our previous GxL adjustment by introducing the dimensionless GxL-factor per phenotype and subpopulation, defined as the ratio of the interaction standard deviation to the pooled within-groups standard deviation. The intuition underlying this factor can be explained by a simplified situation in which one lab measures distance traveled in inches, while in multiple benchmarking labs (the multi-lab) it is measured in centimeters. Standard deviations are affected by the unit of measurement, so one cannot use the interaction standard deviation from the centimeter-based multi-lab experiment as a proxy for the interaction standard deviation in the inch-based lab.
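The unit dependence in this example can be checked numerically. The sketch below uses made-up numbers (both lists are hypothetical and the variable names are chosen here for illustration). It expresses the same quantities in centimeters and in inches, and computes both the standard deviations and their ratio in each unit.

```python
from statistics import stdev

CM_PER_INCH = 2.54

# Hypothetical within-group distances traveled, in centimeters.
within_group_cm = [310.0, 295.5, 330.2, 301.7, 322.4]
# Hypothetical genotype-by-lab interaction effects, in centimeters.
interaction_effects_cm = [12.0, -8.5, 4.1, -7.6]


def to_inches(values_cm):
    """Convert a list of centimeter values to inches."""
    return [v / CM_PER_INCH for v in values_cm]


sd_within_cm = stdev(within_group_cm)
sd_interaction_cm = stdev(interaction_effects_cm)

sd_within_in = stdev(to_inches(within_group_cm))
sd_interaction_in = stdev(to_inches(interaction_effects_cm))

# Each standard deviation changes with the unit of measurement...
print(f"interaction SD: {sd_interaction_cm:.3f} cm vs {sd_interaction_in:.3f} in")

# ...but the ratio of the two is the same in either unit.
gxl_factor_cm = sd_interaction_cm / sd_within_cm
gxl_factor_in = sd_interaction_in / sd_within_in
print(f"ratio in cm: {gxl_factor_cm:.6f}, ratio in inches: {gxl_factor_in:.6f}")
```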
However, taking the ratio of the interaction standard deviation to the pooled within-groups standard deviation from the multi-lab analysis defines a scale-free factor that will be the same in the single lab. Taking the GxL-factor from the multi-lab and multiplying it by the within-groups standard deviation in the new lab then produces the right value (in inches). Turning to a more realistic situation, where a widely used activity measure, “percent time spent at center,” is measured by 2 different systems with some variation in the definition of “center,” we still expect the GxL-factor to be quite stable across labs (it is also termed the “environmental effect ratio” by Higgins and colleagues [19]). Thus, the use of the scale-free, dimensionless GxL-factor enables us to carry the information about the interaction of a phenotype to other laboratories, other genotypes, and variations in setups and conditions.

In the present study, we assessed the value of the GxL adjustment for experimental results previously submitted to the MPD, involving genotype effects on several phenotypes, as well as the effect of fluoxetine treatment on various genotypes. For this purpose, we conducted an experiment measuring the above phenotypes on several genotypes across 3 labs, without strong interlaboratory standardization and coordination. The replications obtained in our own experiment enabled us to estimate the GxL parameter and to identify the non-replicable discoveries from MPD. The proportion of these that were statistically significant in their original study is an estimate of the probability that a statistical discovery is made even though it is not replicable. A convenient term for this probability is the “Type-I replicability error,” in analogy to the Type-I error in testing, which is the probability of making a statistical discovery even if there is no effect.
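The estimate described here is a simple proportion, which the following sketch makes concrete. The function name and the flag-list representation are assumptions made for illustration. The only figures taken from the text are the 99 non-replicable results, of which 59 were significant in their original single-lab analysis.

```python
def type1_replicability_error(significant, replicable):
    """Share of non-replicable results that were significant in a single lab.

    significant[i]: True if result i was a single-lab statistical discovery.
    replicable[i]:  True if result i replicated in the multi-lab experiment.
    """
    flags = [sig for sig, rep in zip(significant, replicable) if not rep]
    if not flags:
        raise ValueError("no non-replicable results to evaluate")
    return sum(flags) / len(flags)


# Counts reported in the text: 99 non-replicable results, 59 of them
# statistically significant in the original single-lab analysis.
significant = [True] * 59 + [False] * 40
replicable = [False] * 99

error = type1_replicability_error(significant, replicable)
print(f"Type-I replicability error: {error:.1%}")
```

Running the same function on significance flags from a GxL-adjusted analysis would give the corresponding adjusted estimate.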
We could thereby show that using the GxL adjustment in the original studies would have greatly reduced the number of non-replicable discoveries, and thereby reduced this Type-I replicability error. We therefore recommend supplementing any single-lab discovery with a GxL-adjusted analysis as an assessment of whether it is predicted to replicate across multiple labs.

[1] Url: https://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.3002082

Published and (C) by PLOS One. Content appears here under this condition or license: Creative Commons - Attribution BY 4.0.