Developing better digital health measures of Parkinson's disease using free living data and a crowdsourced data analysis challenge [1]

Solveig K. Sieberts (Sage Bionetworks, Seattle, Washington, United States of America), Henryk Borzymowski (Independent Researcher), Yuanfang Guan (Department of Computational Medicine and Bioinformatics), et al.

Date: 2023-04

One of the promising opportunities of digital health is its potential to lead to a more holistic understanding of disease by interacting with the daily lives of patients and by collecting large amounts of real-world data. Validating and benchmarking indicators of disease severity in the home setting is difficult, however, given the large number of confounders present in the real world and the challenges of collecting ground-truth data in the home. Here we leverage two datasets collected from patients with Parkinson's disease, which couple continuous wrist-worn accelerometer data with frequent symptom reports in the home setting, to develop digital biomarkers of symptom severity. Using these data, we performed a public benchmarking challenge in which participants were asked to build measures of severity across three symptoms (on/off medication, dyskinesia, and tremor). Forty-two teams participated, and performance improved over baseline models for each subchallenge. Additional ensemble modeling across submissions further improved performance, and the top models validated in a subset of patients whose symptoms were observed and rated by trained clinicians.

Motion sensors available in consumer devices like smartphones, smartwatches, and fitness trackers have enormous potential for use in tracking health and, in the case of movement disorders, understanding symptom severity. Here, we use data collected passively from smartphones and smartwatches as patients go about their daily lives to measure symptom severity in Parkinson's disease. We challenged analysts around the world to develop algorithms to interpret the sensor data from these smart devices and scored their submissions to determine those that performed best. Forty-two teams from around the world participated, and for all three symptoms we measured (on/off medication, dyskinesia, and tremor), models using the sensor data showed the ability to predict symptom severity. We also validated these models against symptom severity scores reported by trained doctors.

Competing interests: I have read the journal's policy and the authors of this manuscript have the following competing interests: YG serves as scientific advisor for Eli Lilly and Company and Merck & Co., and serves as scientific advisor for and receives grants from Merck KGaA. YH has grant funding through BBJ from Sanofi S.A. and UCB for unrelated projects. BBJ has grant funding from Sanofi S.A. and UCB for unrelated projects. AJ is funded by the MJ Fox Foundation for data curation. All other authors report no competing interests.

Funding: The BEAT-PD Challenge was funded by the Michael J. Fox Foundation (MJFF) in a grant to LO. MJFF played an advisory role in the design of the challenge. The salaries of SKS, AJ, AM2, NS, PS, and LO were partially supported by funds from MJFF. The ensemble computations were partly enabled by Scientific Computing resources at the Icahn School of Medicine at Mount Sinai, and YL and GP were supported by NIH R01HG011407-01A1. YG is funded separately by the Michael J. Fox Foundation.
BKBJ was supported by a grant from NIH NINDS (award #: K99NS114850). MSK is supported by a grant from NIH (award #: 5T32HG002295-18). CSV, GS, and RZ received research support funded by the NIH NINDS under award number P50NS108676. These funders had no role in study design, data collection and analysis, the decision to publish, or preparation of the manuscript.

Challenge participants were asked to build models from wearable data that were able to predict PD severity labels collected through the Hauser diaries. Given the large amount of heterogeneity in disease symptoms between PD patients [13] and the large amounts of data available per study subject, we designed the challenge such that personalized models of disease could be used, to potentially better capture the previously observed variation in PD [14] and to enable a future of patient-specific, long-term tracking [15]. The challenge leveraged two datasets: the Clinicians Input Study (CIS-PD) [9, 10] and REAL-PD, also known as the Parkinson@Home Validation Study [11], both of which employed similar approaches pairing smartwatch sensor data with patient-reported symptom severity collected frequently, at home, over multiple days. In both studies, data from smartwatches (an Apple Watch in CIS-PD and a Motorola watch paired with an Android phone in REAL-PD) were collected from patients as they went about their daily lives. Patients also reported symptom severity in 30-minute increments using digital Hauser diaries over the course of multiple days of these studies [12]. The challenge leveraged 2,476 symptom reports from 16 subjects for CIS-PD and 782 symptom reports from 12 subjects for REAL-PD.

Using real-world data to better understand the variety and severity of disease requires both exploratory studies and validation in a heterogeneous environment. Prior work has demonstrated that digital measures that validate under lab conditions do not always validate in a home environment [6]. Previously, we showed that smartphone sensor measurements from prescribed activities in the home could be used to distinguish PD from non-PD patients [7]. In the same exercise, we also showed that wearable sensors from short, prescribed activities in the clinic could be used to assess symptom severity in PD. We employed a crowdsourcing approach to achieve these goals and benchmark the best methods [8]. Here we extend our previous work to understand whether sensor data collected passively during patients' daily lives can be used to assess symptom severity and medication fluctuations. To this end, we ran the Biomarker and Endpoint Assessment to Track Parkinson's Disease (BEAT-PD) DREAM Challenge, which leveraged data coupling patient-reported severity measures from EMAs with accelerometer data from wrist-worn, consumer smartwatches.

For many diseases, brief clinic visits do not adequately capture the full lived experience of patients. This is especially true for Parkinson's disease (PD), which is characterized by motor symptoms such as tremor and slowness of movement, as well as a broad set of non-motor symptoms in areas such as cognition, mood, and sleep. Of these, only a few are easily evaluated during clinician exams or captured by patient reports. Because Parkinson's symptoms can be highly variable [1], short, infrequent physician assessments do not capture the fluctuations experienced by patients. In fact, motor fluctuations are a common side effect of the drug treatments commonly used for PD.
Additionally, the symptoms and assessments that are clinically monitored do not always overlap with the symptoms that interfere with a patient's life [2]. This disconnect is increasingly recognized and, for example, was noted by the FDA, which recently rejected Verily's Virtual Motor Exam for PD as a clinical trial tool because it had "limited capacity to evaluate meaningful aspects of concepts of interest that are relevant to the patients' ability to function in day-to-day life." [3] The development of in-home monitoring using digital health tools, ecological momentary assessments (EMAs), and wearables offers a way to develop measures of disease that capture the lived experience through the collection of real-world data [4, 5].

For tremor (SC3) (S9 Table) and dyskinesia (SC2) (S8 Table), the results were less consistent across models. For tremor, the top-scoring model from this subchallenge (yuanfang.guan) showed only one nominally significant subject (1046, p-value = 0.003), and the cross-subject meta-analysis was not significant after multiple-test correction (unadjusted p-value = 0.047). However, the models by dbmi and HaProzdor showed more patients with significant correlation between predictions and labels (6 and 2 of 13, respectively), as well as greater overall significance (meta-analysis p-value = 1.97e-10 and 1.27e-04, respectively). For dyskinesia, only one model (team ROC BEAT-PD) showed nominal significance (unadjusted p-value = 0.035), and only one out of 6 subjects showed significant correlations with any of the models (Pearson correlation = 0.286 and 0.290, unadjusted p-value = 0.003 and 0.004, for ROC BEAT-PD and dbmi, respectively).

The top teams were also invited to apply their models to sensor data collected from the same CIS-PD study participants during the completion of short (~30 second) specified tasks. Each of these segments was assessed in person for symptom severity by a clinical PD expert in order to ascertain the degree to which the models recapitulate clinician-rated severity. Four teams (dbmi, HaProzdor, ROC BEAT-PD, and yuanfang.guan) participated in this exercise and submitted predictions for 1,277 segments across 16 subjects. Within-subject correlation between the predicted value and the symptom severity label was used as the measure of accuracy, rather than MSE, in order to account for the fact that a patient's perception of average severity may differ from a physician's. That is to say, the distributions may be shifted, but we expect the patient- and physician-derived severity ratings to be correlated. For on/off medication, all four models showed significant positive correlation with the clinical ratings for some, but not all, of the subjects (S7 Table–S9 Table). Cross-subject meta-analysis was significant for all teams, however. As seen with the challenge predictions, there was substantial heterogeneity across subjects (4–7 of 14 showing nominal p-value < 0.05) and across models. For a few subjects, we see high positive correlations for some teams (dbmi and ROC BEAT-PD) and high negative correlations for others (yuanfang.guan). Interestingly, subject 1004, who showed high heterogeneity across models in the challenge predictions, shows strong positive correlations in the clinical segments.

We evaluated whether subject-specific factors or patient characteristics were associated with better predictability for each model.
The patient characteristics explored were age and disease severity as measured by the Movement Disorder Society Unified Parkinson's Disease Rating Scale (MDS-UPDRS) [23] instrument, parts I, II, and IV, as well as part III assessed in both the on- and off-medication states. We also explored the effect of data and metadata characteristics, including the number of observations (n), the variance of the labels, and the mean difference between the symptom reporting period and the time the report was made (reporting lag), with the hypothesis that symptom reports made well after the reporting period may be less accurate. Overall, the only significant association observed was with label variance for on/off medication (S4 Table). Label variance also showed a trend of positive correlation for all models for dyskinesia and tremor (S5 Table and S6 Table).

Using the models statistically outperforming the Null model (from teams dbmi, HaProzdor, hecky, Problem Solver, ROC BEAT-PD, and yuanfang.guan for Subchallenges 1 (on/off) and 3 (tremor), and teams hecky, ROC BEAT-PD, and yuanfang.guan for SC2), we sought to examine whether all subjects were predictable by these models or whether heterogeneity leads to models working well for some patients but not others. To do so, we defined the lift, Lift_model = MSE_Null - MSE_model, to be the improvement in MSE of the submitted model over the Null model, where a positive value indicates an improved prediction. Generally, we observed that the contribution to MSE improvement over the Null model was largely driven by a small number of subjects, who were well predicted by all or most of the top models (S6 Fig–S8 Fig). For SC1 (on/off), 7 of 22 subjects were responsible for the majority of the improvement in MSE. Upon examining the percent scale (Lift_model / MSE_Null), we observed additional subjects who have statistically significant lift, but whose overall contribution to the improvement in performance is low. For SC2 (dyskinesia) and SC3 (tremor), 7 of 16 and 4 of 19 individuals, respectively, account for most of the lift. In rare instances, we observe individuals who are predictable by some models but poorly predicted by others (e.g., subject 1004; S6 Fig–S8 Fig). In these cases, the teams employing individualized models (dbmi and HaProzdor) perform particularly poorly, suggesting that employing global models protects against overfitting. This observation is consistent across subchallenges.

We then evaluated the median and RL-based ensemble models in the test data and compared the results to the teams' individual models. The performances of the final ensemble predictors are shown in Table 1 and Fig 1. The RL ensembles were the best-performing models in every case, performing better than the median ensembles and the best team's model for every subchallenge. The median ensemble, however, performed better than the best team model in SC3 only. Still, we observe that ensemble approaches can improve prediction accuracy when applied to models submitted during the course of a predictive modeling challenge.

Based on their performance in the training data cross-validations, two models were chosen to be evaluated on the test data, one from among the unsupervised methods and one from among the supervised methods. The median of the individual base predictions was the best-performing unsupervised ensemble method in the training data evaluation for two of the three subchallenges (S5 Fig).
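To make the unsupervised median ensemble concrete, the sketch below (in Python, with synthetic placeholder predictions rather than the teams' actual submissions) shows how per-report predictions from several base models can be combined without using any training labels:

    import numpy as np

    # Synthetic placeholder: one row per symptom report, one column per base model.
    base_preds = np.array([
        [1.2, 0.9, 1.1, 1.4, 1.0],   # report 1: predictions from five base models
        [2.8, 3.1, 2.9, 2.5, 3.0],   # report 2
        [0.1, 0.4, 0.0, 0.3, 0.2],   # report 3
    ])

    # The median across base models requires no fitting, so it can be applied
    # directly to test-set predictions without touching the test labels.
    median_ensemble = np.median(base_preds, axis=1)
    print(median_ensemble)  # one ensembled severity prediction per report

Because it requires no fitting, this ensemble cannot overfit the base models' errors, which is one reason simple medians are a common unsupervised baseline for combining predictors.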
Among supervised ensemble methods, the Reinforcement Learning (RL)-based ensemble selection algorithms [20, 22] were the best-performing methods for all three subchallenges in the training set evaluation (S5 Fig). In this case, the optimal RL-based algorithm was an L2-regularized linear regression function applied to all five teams' individual predictions. It is important to note that this model selection and optimization was done entirely on the training data set, as would have been available to challenge participants, and, in fact, the team generating the ensemble predictions was blinded to the test data in the same way challenge participants were.

To investigate whether the overall predictive performance of the challenge could be improved further, we constructed heterogeneous ensembles [19] of the solutions to the three subchallenges submitted by the five best-performing teams (ROC BEAT-PD, dbmi, HaProzdor, yuanfang.guan, and hecky). These teams were selected based on having submitted models that were significantly better than the Null model (nominal bootstrap p-value < 0.05) for at least one subchallenge. One team that met this bar chose not to join this effort and was not included. This investigation was carried out in two stages. In the first stage, several classes of heterogeneous ensemble methods [19–22] were tested in a nested cross-validation setup applied to the training sets of SC1–3 to determine the best ensemble method(s) for each subchallenge. The various ensemble methods showed variable performance across subchallenges when evaluated within the training data cross-validations (S5 Fig), though they still outperformed the best individual team model in most cases. This implied that the ensembles were likely to improve accuracy over the individual models.

For the SC2 (dyskinesia) winning model from team ROC BEAT-PD, model effects were observed to be predominantly linear, so Gini importance [16] was used to explore feature importance in this case. For their models, team ROC BEAT-PD fit a single model for all subjects, incorporating patient characteristics to capture patient heterogeneity. In particular, the clinical MDS-UPDRS scores were highly ranked, suggesting that the predictions were kept grounded by the static/baseline information and then modulated by the real-time sensor data. UPDRS question 4.1 relates to dyskinesia burden and was by far the most important predictor of dyskinesia level in the CIS-PD cohort, accounting for 45% of the model (S1 Data). A PCA vector based on the UPDRS Part III (motor symptom) questions was the strongest predictor of dyskinesia in the REAL-PD cohort (37% of the model). The most important sensor-based feature in dyskinesia prediction was mean acceleration (vm_mean). Additionally, "counts per minute" (cpm_mean), a feature designed to mimic conventional Actigraph reports, was relatively highly ranked (7% of the dyskinesia model). Finally, correlation coefficients between the acceleration axes (i.e., x vs. y, x vs. z, and y vs. z) were well represented, with each pair generally accounting for 2–7% of any model. All of these sensor-derived features were more important in the REAL-PD cohort; CIS-PD predictions were largely dictated by the static features. MDS-UPDRS Parts III and IV accounted for over half of the CIS-PD dyskinesia and tremor prediction models.
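As a hypothetical illustration of how Gini importances yield percentage contributions of the kind quoted above, the sketch below fits a scikit-learn random forest on synthetic data (the feature names are placeholders, not team ROC BEAT-PD's actual feature set) and reads off its normalized importances:

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor

    # Synthetic stand-in for a combined feature matrix of static clinical scores
    # and sensor-derived summaries; names are illustrative placeholders only.
    rng = np.random.default_rng(0)
    feature_names = ["updrs_4_1", "updrs_part3_pc1", "vm_mean", "cpm_mean", "corr_xy"]
    X = rng.normal(size=(500, len(feature_names)))
    y = 0.8 * X[:, 0] + 0.3 * X[:, 2] + rng.normal(scale=0.5, size=500)  # synthetic severity labels

    model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)

    # feature_importances_ holds the mean decrease in impurity (Gini) importance;
    # the values sum to 1, so they can be reported as percentage contributions.
    for name, imp in sorted(zip(feature_names, model.feature_importances_),
                            key=lambda pair: -pair[1]):
        print(f"{name}: {imp:.1%}")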
Team dbmi (winner of SC1 and co-winner of SC3) and team ROC BEAT-PD (winner of SC2) both used random forest-based [16] machine learning models, which allowed us to explore model feature importance. Team dbmi trained a random forest model on manually extracted signal features from the raw data, with separate models trained for each patient-phenotype combination. To explore the feature importance within team dbmi's SC1 and SC3 models, we computed SHAP values [17], which quantify the importance of features in a way that is comparable across different models. We computed SHAP values for every prediction and SHAP interaction values for a randomly selected subset of predictions (see Methods). In general, we observed that model predictions were multi-factorial in nature. Effects of individual features were small, and main effects were generally outweighed by interaction effects (S2 Fig). However, there was general consistency among the top features, even across the two outcomes examined (on/off (SC1) and tremor (SC3)), with the two models sharing 11 of their top 15 features. Nine of the top ten features by SHAP value magnitude correspond to measures of signal magnitude from the accelerometer, including various data quantiles, the signal mean, and the 0 Hz component of the Fast Fourier Transform (S3 Fig). There was a strong correlation among the top features (S4 Fig), potentially diminishing the importance of individual features and causing stochasticity in feature scores across models. No significant differences in feature effects were observed when comparing across labels (S3 Table). Additionally, we observed no association with previously reported features correlating with Parkinsonian symptoms, such as spectral power in the 3–7 Hz band for tremor or spectral entropy for dyskinesia [18].

Fig 1. Bootstraps (n = 1000) of submissions for (A) SC1: on/off, (B) SC2: dyskinesia, and (C) SC3: tremor. Team models (black) and their ensembles (blue) are ordered by rank. Boxes correspond to the 25th, 50th, and 75th percentiles, and individual points are displayed beyond 1.5*IQR (interquartile range) from the edge of the box. For each subchallenge, a Null model (shown in red), estimated as the subject-specific mean of the training labels, was used as a benchmark. Models submitted by teams Lifespark and Sydney Neurophysics were outliers and have not been displayed in order to present greater definition among the top models. For SC2, the UT BEAT-PD TEAM bootstraps have been truncated for the same reason. The non-truncated figures are available in the Supplementary materials.

Among the top 6 teams whose models performed statistically significantly better than the Null model (bootstrap p-value < 0.05) in at least one subchallenge, all but one used signal processing for feature extraction followed by machine learning to build their models. The remaining team, which was the co-winner in SC3 (tremor) and runner-up in SC2 (dyskinesia), input the sensor data directly into a deep learning model. Among the signal processing approaches, two teams fit individual models for each subject: team dbmi, who won SC1 (on/off) and co-won SC3 (tremor), and team HaProzdor, who was runner-up in SC1 (on/off). The remaining teams fit global models with subject-specific information to capture both within- and across-subject effects.
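The sketch below is a minimal, hypothetical illustration of the per-subject SHAP analysis described above (synthetic data and placeholder feature names; this is not team dbmi's actual pipeline), using the shap package's TreeExplainer on a per-subject random forest:

    import numpy as np
    import shap
    from sklearn.ensemble import RandomForestRegressor

    # Synthetic stand-in for one subject's window-level signal features (placeholder names).
    rng = np.random.default_rng(1)
    feature_names = ["acc_q10", "acc_q90", "acc_mean", "fft_0hz", "spectral_entropy"]
    X = rng.normal(size=(300, len(feature_names)))
    y = 1.5 * X[:, 2] + 0.5 * X[:, 0] * X[:, 1] + rng.normal(scale=0.3, size=300)

    # One model per subject-phenotype combination, as in the individualized approach.
    model = RandomForestRegressor(n_estimators=100, random_state=1).fit(X, y)

    # TreeExplainer yields per-prediction SHAP values; interaction values are more
    # expensive, so they are computed on a random subset of predictions.
    explainer = shap.TreeExplainer(model)
    shap_values = explainer.shap_values(X)                        # shape: (n_samples, n_features)
    subset = rng.choice(len(X), size=50, replace=False)
    interactions = explainer.shap_interaction_values(X[subset])   # (50, n_features, n_features)

    # Rank features by mean absolute SHAP value across all predictions.
    mean_abs = np.abs(shap_values).mean(axis=0)
    for name, score in sorted(zip(feature_names, mean_abs), key=lambda pair: -pair[1]):
        print(f"{name}: {score:.3f}")

Mean absolute SHAP values give a ranking that is comparable across models, while the interaction matrix separates main effects (its diagonal) from pairwise interaction effects.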
Along with the lack of consistency among the top models, we surveyed all challenge participants and found no association between approach (including data cleaning and preprocessing, feature extraction, and modeling) and predictive performance.

For SC1 (on/off predictions), we received submissions from 37 teams (Fig 1A and S1A Fig), of which 9 performed strictly better than the Null model and 6 performed significantly better at a nominal bootstrap p-value of 0.05. The best model achieved a weighted MSE of 0.878, compared to 0.967 for the Null model. For SC2 (dyskinesia), we received 38 submissions, of which 8 performed strictly better than the Null model and 3 were statistically better at a nominal bootstrap p-value of 0.05 (Fig 1B and S1B Fig). The best model achieved a weighted MSE of 0.405, compared to 0.437 for the Null model. For SC3 (tremor), we received submissions from 37 teams, of which 9 strictly outperformed the Null model and 6 were statistically better at a nominal bootstrap p-value of 0.05 (Fig 1C and S1C Fig). The two top models were not distinguishable from each other, with weighted MSEs of 0.4003 and 0.4006 (p-value of the one-sided Wilcoxon signed-rank test on the bootstrap scores > 0.05), and the weighted MSE of the Null model was 0.440.

We developed three subchallenges using the three symptoms that were captured in both the CIS-PD and REAL-PD Hauser diaries: on/off medication (Subchallenge 1 [SC1]), dyskinesia (Subchallenge 2 [SC2]), and tremor (Subchallenge 3 [SC3]). Challenge participants were free to participate in any or all subchallenges, and one model was scored per team per subchallenge. Participants were asked to predict medication status (SC1) or symptom severity (SC2 and SC3) using non-overlapping 20-minute sensor readings associated with the time of each symptom report, as well as baseline patient demographics and MDS-UPDRS scores assessed by a clinician in both the on and off states. Training and test partitions were split within subjects to enable subject-specific models; that is, challenge participants could choose to build either global models or personalized machine learning models (S1 Table and S2 Table). The same training and test splits were used across all three subchallenges. Test partition labels were withheld from challenge participants, and they were asked to predict the phenotype severity in the test partition. Weighted mean-square error (MSE) was used as the scoring metric in each subchallenge and was calculated as a weighted average of the per-subject MSE, where the weight was the square root of the subject-specific number of test observations (i.e., weighted MSE = sum_k sqrt(n_k) * MSE_k / sum_k sqrt(n_k), where n_k is the number of test observations for subject k). This weighting scheme was chosen to downweight the contributions from subjects with substantially more observations, because there was a large range in the number of test observations across subjects (11–99 per subject). Models were compared to a baseline Null model that generated predictions according to the subject-specific mean of the training labels, which is the best prediction in the absence of any sensor data. Bootstrap p-values were computed to compare each submission to the Null model.

Discussion

The BEAT-PD DREAM Challenge was an open-source, collaborative effort to publicly benchmark the ability to use wearable data collected passively during free-living conditions to predict PD symptom severity.
Utilizing a challenge framework allowed us to very quickly explore a large space of solutions and to engage a community of researchers from around the world. The open-source nature of the DREAM challenge framework means that all of the participants' methods have been shared and are available as a resource to the community (www.synapse.org/beatpdchallenge). The results of the challenge demonstrate that passive data from wrist-worn sensors could be used to predict PD symptom severity and motor fluctuations, with multiple models and their ensembles showing significantly improved prediction over the Null model for each symptom tracked. Many of these models showed significant validation against clinical ratings for the same patients: of the four models that could be applied to the short, clinically rated validation data, 4 showed significant association in SC1 (on/off), 1 in SC2 (dyskinesia), and 3 in SC3 (tremor). This is a necessary proof-of-concept toward the development and deployment of validated instruments for passive monitoring of PD.

Past efforts have primarily focused on predicting symptom severity from short, well-defined tasks [5,7]. A few efforts have attempted to passively monitor PD symptoms in daily life, chiefly tremor and gait impairments [5,24]. Consistent with previous efforts [7], prediction of dyskinesia was more difficult than prediction of tremor or medication on/off state. This was supported by the fact that only 3 models significantly outperformed the Null model for SC2 (dyskinesia), and of those, only one model's predictions significantly correlated with clinician ratings. This worse performance may be due to the difficulty in distinguishing choreic movements from certain types of voluntary movements [18]. Indeed, the most important sensor-derived features from the SC2-winning model appear to capture overall motion rather than specific types of motion. Modeling strategies that take activity into account, for example human activity recognition (HAR), may be more successful in distinguishing movement types, though it is possible that certain types of activities will always be subject to high error rates in the prediction of dyskinesia. This is consistent with previous work that has shown good ability to predict symptom severity in the context of fixed activities [7,18].

Most of the top-performing models used signal processing methods, with the exception of the co-winner of the tremor subchallenge (SC3), which used a deep learning approach. However, it is important to note that a bug discovered in the code of team ROC BEAT-PD would have rendered them the winner in SC3 (tremor) had it been discovered and fixed during the competition (updated weighted MSE = 0.3990). Still, the sole deep learning approach remained among the top models for SC3. It was also the runner-up in SC2 and one of only three models statistically outperforming the Null model for prediction of dyskinesia severity (SC2), although it failed to validate in the clinically rated segments. While deep learning approaches have performed well for predicting PD diagnosis or PD symptom severity in the past, they appear to be most successful when trained on very large data sets, and have performed comparably to signal processing methods on moderately sized data sets [7]. In this exercise, we also noted a general similarity in predictions across individuals.
However, we did observe examples where the deep learning approach performed better or worse than the signal processing approaches (S6 Fig–S8 Fig), although it is presently unclear what factors drive those differences. In the moderately sized data set used in this Challenge, subject-specific sample size did not appear to be a mediating factor.

Among the signal processing approaches, the top performers utilized a similar workflow: splitting the 20-minute recordings into smaller windows, followed by feature extraction and machine learning. However, there was considerable variability in how these steps were implemented. Some teams performed pre-processing (e.g., resampling, normalization, interpolation, removal of gravity), though ROC BEAT-PD (the winner of SC2) did not. Segmentation window sizes ranged from 10 to 60 seconds, with varying overlaps. Some teams used custom features, while several used the publicly available package tsfresh [25]. With respect to machine learning approaches, most teams used random forest [16] models, though one team among the top performers combined these with multiple other models via ensemble approaches to generate their final predictions. There were also differences among the teams in their choice to build individualized versus global models. While both types of models performed similarly overall amongst the top models, there appeared to be examples of patients for whom individual models performed substantially better or worse than the global models (S6 Fig–S8 Fig). Given the relatively limited sample size with respect to the number of individuals, it seems likely that global modeling approaches would benefit greatly from an increase in numbers. Still, given the highly individualistic nature of Parkinson's symptom manifestation, it is unclear whether these types of models will ever be successful in independent individuals without some degree of within-individual training.

We also found that combining information across models in the form of ensemble modeling improved prediction accuracy over the best-performing model for all three subchallenges. The RL-based ensemble algorithms [20,22] produced the most accurate predictors for all the subchallenges (Table 1). These near-exhaustive and systematic algorithms are designed to select a parsimonious and effective ensemble (subset) from a large set of base predictors. However, since these algorithms were only applied to five base predictors in this study, the best ensemble was found to be an L2-regularized linear regression function applied to the full set of base predictors. It is also interesting to note that during the course of this analysis, team ROC BEAT-PD discovered a bug in their code; fixing it decreased their weighted MSE to 0.8879 and 0.3990 (from 0.8897 and 0.4012) for SC1 and SC3, while slightly increasing their MSE in SC2 to 0.4056 from 0.4053. Despite these modest changes, applying the same algorithms to the improved models resulted in little change in the performance of the RL-based ensembles (weighted MSE of 0.8686, 0.4043, and 0.3938 for SC1, SC2, and SC3, respectively, in contrast to 0.8687, 0.4048, and 0.3937 for the RL ensembles of the original submissions) and no change to the median-based ensembles. This demonstrated the robustness of the ensembles. In future challenges, we aim to apply these ensemble algorithms to larger sets of submissions, and we expect to develop even more accurate and parsimonious ensembles.
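A minimal sketch of this kind of L2-regularized linear stacking is shown below (synthetic placeholder data; this is not the organizers' actual RL-based implementation, which additionally searches over subsets of base predictors). A ridge regression is fit only on the training-set predictions of the base models, mirroring the blinding described above, and then applied to their test-set predictions:

    import numpy as np
    from sklearn.linear_model import RidgeCV

    # Synthetic placeholders: rows are symptom reports, columns are five base models'
    # predictions; y_train holds the corresponding self-reported severity labels.
    rng = np.random.default_rng(2)
    train_base = rng.normal(loc=2.0, size=(400, 5))
    y_train = train_base.mean(axis=1) + rng.normal(scale=0.3, size=400)
    test_base = rng.normal(loc=2.0, size=(100, 5))

    # L2-regularized (ridge) linear stacker; the regularization strength is chosen
    # by cross-validation on the training data only.
    stacker = RidgeCV(alphas=np.logspace(-3, 3, 13)).fit(train_base, y_train)
    ensemble_test_predictions = stacker.predict(test_base)
    print(stacker.coef_)  # per-model weights in the learned linear combination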
While the results of this challenge show promise for the vision of passive, low-burden, at-home monitoring of PD symptoms, the current results are not yet practically useful. Although the symptoms we analyzed are well-established motor outcomes, the severity scores available were patient-reported. Patient-reported data can be subject to perception and recall bias; however, researchers have previously observed that patient reporting accuracy is high even in the presence of depression or cognitive difficulties [26]. Additionally, we observed good correspondence between patient and expert severity ratings in the in-clinic (CIS-PD) and at-home (REAL-PD) clinician visits. We also found no association between model accuracy and reporting lag in the models developed in the course of this challenge.

In our previous challenge, we observed that large amounts of data allow more sophisticated methods [7] to be used. Even though we had large amounts of longitudinal data, it was derived from a small number of subjects (16 to 22, depending on subchallenge). Utilizing a larger number of subjects in future efforts could improve the performance of global models by capturing more of the inter-individual variability expected in PD. Future studies, such as remote longitudinal studies, have the potential to collect data from thousands of patients [1,27]. If studies like these can be paired with the corresponding outcome variables, it might be possible to build better models of disease. Large sample sizes become particularly important as we move away from the basic motor symptoms that are typically measured in the clinic and address additional symptoms that affect patients in their lived experience but are not consistently experienced across PD patients.

[1] https://journals.plos.org/digitalhealth/article?id=10.1371/journal.pdig.0000208 (Creative Commons Attribution, CC BY 4.0)