• Machine-Learning Prediction of Comorbid Substance Use Disorders in ADHD Youth Using Swedish Registry Data

      Zhang-James, Yanli; Chen, Qi; Kuja-Halkola, Ralf; Lichtenstein, Paul; Larsson, Henrik; Faraone, Stephen V. (Cold Spring Harbor Laboratory, 2019-06-06)
      Background: Children with attention-deficit/hyperactivity disorder (ADHD) have a high risk for substance use disorders (SUDs). Early identification of at-risk youth would help allocate scarce resources for prevention programs. Methods: Psychiatric and somatic diagnoses, family history of these disorders, measures of socioeconomic distress, and information about birth complications were obtained from the national registers in Sweden for 19,787 children with ADHD born between 1989 and 1993. We trained (a) a cross-sectional random forest (RF) model using data available by age 17 to predict SUD diagnosis between ages 18 and 19; and (b) a longitudinal recurrent neural network (RNN) model with the Long Short-Term Memory (LSTM) architecture to predict new diagnoses at each age. Results: The area under the receiver operating characteristic curve (AUC) was 0.73 (95% CI 0.70–0.76) for the RF model. With prior diagnosis removed from the predictors, the RF model still achieved significant AUCs when predicting all SUD diagnoses (0.69, 95% CI 0.66–0.72) or new diagnoses (0.67, 95% CI 0.64–0.71) at ages 18–19. For the model predicting new diagnoses, calibration was good, with a low Brier score of 0.086. The longitudinal LSTM model was able to predict later SUD risk as early as 2 years of age, 10 years before the earliest diagnosis. The average AUC of the longitudinal models predicting new diagnoses 1, 2, 5 and 10 years in the future was 0.63. Conclusions: Population registry data can be used to predict the risk of comorbid SUDs in individuals with ADHD. Such predictions can be made many years before the age of onset, and SUD risk can be monitored with longitudinal models throughout child development. Nevertheless, more work is needed to create prediction models based on electronic health records or linked population registers that are sufficiently accurate for use in the clinic.
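
      The two evaluation metrics reported above, AUC (discrimination) and the Brier score (calibration), can be illustrated with a minimal Python sketch. This is not the authors' code: the function names `roc_auc` and `brier_score` are hypothetical, and the labels and predicted probabilities are invented toy values, not registry data or study results.

```python
# Minimal sketch of the two metrics named in the abstract.
# All data below are invented toy values for illustration only.

def roc_auc(y_true, y_score):
    """AUC as the probability that a randomly chosen positive case
    receives a higher score than a randomly chosen negative case
    (ties count as one half)."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and outcome.
    Lower is better; 0 means perfect probabilistic predictions."""
    if len(y_true) != len(y_prob):
        raise ValueError("y_true and y_prob must have the same length")
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


# Toy example: 1 = later SUD diagnosis, 0 = no diagnosis (hypothetical).
y_true = [0, 0, 1, 0, 1, 0, 0, 1]
y_prob = [0.10, 0.40, 0.35, 0.05, 0.80, 0.20, 0.30, 0.60]

auc = roc_auc(y_true, y_prob)
brier = brier_score(y_true, y_prob)
print(f"AUC = {auc:.3f}, Brier score = {brier:.3f}")
```

      The pairwise AUC computation is quadratic in the number of cases, which is fine for illustration; for a cohort the size of the one described here, a rank-based implementation from an established library would be the more practical choice.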