Predicting Longitudinal Outcomes of Alzheimer’s Disease via a Tensor-Based Joint Classification and Regression Model

Alzheimer’s disease (AD) is a serious neurodegenerative condition that affects millions of people across the world. Recently machine learning models have been used to predict the progression of AD, although they frequently do not take advantage of the longitudinal and structural components associated with multi-modal medical data. To address this, we present a new algorithm that uses the multi-block alternating direction method of multipliers to optimize a novel objective that combines multi-modal longitudinal clinical data of various modalities to simultaneously predict the cognitive scores and diagnoses of the participants in the Alzheimer’s Disease Neuroimaging Initiative cohort. Our new model is designed to leverage the structure associated with clinical data that is not incorporated into standard machine learning optimization algorithms. This new approach shows state-of-the-art predictive performance and validates a collection of brain and genetic biomarkers that have been recorded previously in AD literature.


Introduction
Alzheimer's disease (AD) is a neurodegenerative disorder that has serious mental and financial consequences for those affected and their families. AD is characterized by progressive declines of memory and cognitive capabilities. According to the Alzheimer's Association 1 5.7 million people in the United States are currently suffering from AD-related dementia. In 2018 alone, the total financial cost associated with health care, long-term care, and hospice services for patients suffering from dementia was estimated to be $277 billion. It is forecasted that by 2050, the number of people suffering from AD will surpass 13.8 million. Furthermore, the Alzheimer's Association emphasizes that early detection and diagnosis of individuals with AD could save up to $7.9 trillion in associated medical costs. With the projected increase in individual hardship and financial burden caused by AD, it is essential that the scientific community develop computational methods for early diagnosis and treatment of AD.
A central research component, designed to assist in early identification of dementia, has focused on discovering characteristic biomarkers that are closely associated with the development of AD. This branch of research has been driven by the successful development and deployment of a variety of non-invasive clinical observations such as positron emission tomography (PET), magnetic resonance imaging (MRI) scans, and genetic analysis through the identification of single nucleotide polymorphisms (SNPs). By way of public-private partnerships, such as the Alzheimer's Disease Neuroimaging Initiative (ADNI), 39 clinical data from each of theses modalities have been made publicly available to the scientific community. Through the effective analyses of these AD data sources, we are able to build models that have the potential to help clinical researchers narrow down the array of phenotypic and genetic measures that are predictive of cognitive decline. Given the complexity and size of these clinical datasets, there has been a concerted effort to design new machine learning methods to assist in the discovery of AD-related biomarkers.
In recent years, various computational methods 18,27,40,41 have been proposed to identify biomarkers associated with AD. Although these methods have shown good predictive performance, they only incorporate clinical data that is collected at a single time-point. Since these approaches rely on a single point in time, they are unable to identify longitudinal patterns found across patient data. Recent works 3,15,33,34 explored using longitudinal data to predict an AD diagnosis, which validated that specific regions of the brain (derived from neuroimaging modalities) are the most useful for diagnosing AD over time.
With the above recognitions, in this work we aim to develop a principled approach to incorporate longitudinal data from multiple data sources that the ADNI provides. Through extensive empirical studies, our new approach has shown great promise in predicting cognitive scores, diagnoses and identifying AD-relevant genetic and phenotypic biomarkers. Specifically, we present the following:

-
A principled strategy for incorporating tensor data (e.g. longitudinal) collected from multiple data sources, which leads to a new objective that is able to combine multimodal longitudinal clinical data of various modalities to simultaneously predict the cognitive scores and diagnoses of the participants in the ADNI cohort.
-An effective optimization algorithm, using the multi-block alternating direction method of multipliers, to optimize the proposed objective.
-A collection of phenotypic biomarkers, some of which have been shown by previous research to be predictive of cognitive decline, identified by our model.

Methods
In this manuscript, we write tensors as cursive uppercase letters . Given a matrix M, its i-th row and j-th column are denoted as m i and m j respectively. We define the Frobenius norm of the m × n matrix A as The input imaging features are represented by the tensor: = X 1 , X 2 , …, X T ∈ ℝ n × d × T Each X t represents the input observations for n patients with d features at a given time t.
Each X t can be further broken down into K modalities: X t j j = 1 K . The output diagnoses and cognitive scores are represented by another tensor: d × c e are the learned coefficient matrices for the respective regression and classification tasks. The input output , and learned coefficient tensors are illustrated in Fig. 1.

The Longitudinal Joint Learning Model
A key idea behind our approach is to perform the regression and classification tasks at the same time. Joint regression and classification can help discover more robust patterns than those discovered when classification and regression are performed separately. 3,31,35 In order to link the regression and classification tasks, following the large body of previous works 3,31,35 we introduce the following regularized joint learning model: where ℒ r and ℒ c are the prescribed loss functions associated with the regression and classification tasks respectively. Here the regularization function ℛ( ) is applied to the matrix unfolded from tensor , i.e., we construct V ∈ ℛ d × cT by taking the (W t , P t ) matrix pairs at each time-point and joining them along their columns. 33,34 This joint regularization scheme in Eq. (1) is designed to identify features in that are predictive of both clinical diagnoses and cognitive scores. This approach reasonably assumes that there exists a relationship between the classification and regression tasks. For example, if a patient does poorly on a given cognitive test then they are more likely to be diagnosed with AD.
Regularizing the joint coefficient matrices ( , ) allows us to discover biomarkers that are strongly associated with the two related tasks. We design the regularization function ℛ( ) as following.
First, in order to associate the longitudinal imaging and genetic markers to predict cognitive scores and diagnoses over time, we apply the widely used l 2,1 -norm 20,32 to the unfolded Second, as we combine K different modalities (MRI, SNP, FreeSurfer, etc.) together, it is critical for our model to differentiate the impact that each modality has on the joint model. In order to capture the impact of each modality, we leverage the group l 1 -norm (G 1 -norm): where V j is a matrix constructed of the rows in V that corresponds to the j-th modality in .
Finally, we know that as AD develops, many cognitive measures are related to one another within the same modality. In order to account for this inter-modal relationship, we leverage the trace norm regularization 21,24,33,34,38 Bringing together these three regularizations, we present our new objective as following: where the first term is the multivariate regression loss at each longitudinal time-point; and the second term represents the loss of c C × T one-vs-all multi-class support-vector machine (SVM) penalized via the hinge-loss, where y ikt ∈ − 1, 1 is the class label associated with ith patient at time t, and b kt is the bias associated with the (k × t)-th SVM. The notation (·) + is defined as (a) + = max(0, a).

The Solution Algorithm Using the Multi-Block ADMM
While the objective of our new method in Eq. (2) is clearly and reasonably motivated, all its terms are dependent on . Thus, it is difficult to optimize this objective in general. To solve the proposed objective, we derive an efficient iterative algorithm using the multi-block extension 8 of the alternating direction method of multipliers (ADMM). 2 The ADMM aims to decouple a larger and more difficult problem into a series of smaller sub-problems that are easier to solve. 2 An extension to ADMM, known as multi-block ADMM, 8 is designed to extend the ADMM framework to optimize functions of the following form: Equation.
(3) can be solved by minimizing the following unconstrained objective: 2,8 where y is a Lagrangian multiplier and μ > 0 is a constant. The objective in Eq. (4) can be solved by the following iterative procedure that updates each x k (primal) and the Lagrangian variable y (dual): where ρ > 1 is a constant. The process described above in Eq. (5) is repeated until the algorithm converges. In order to decouple the terms containing in Eq. (2), we introduce four new variables and a set of corresponding equality constraints as following: subject to e ikt = y ikt − (x it p kt + b kt ), F = V, G = V, and H = V .
Brand et al. Page 5 Since each y ikt in the second term must be equal to either −1 or 1, we can use the following to move from Eq. (2) to Eq. (6): 1 − x it p kt + b kt y ikt = y ikt y ikt − x it p kt + b kt y ikt = y ikt y ikt − x it p kt + b kt . 22 Then we can solve Eq. (6) by minizing the following ojective: where λ ikt , Σ, Θ, and Ω are the Lagrangian multipliers. The updates for each of the primal variables can be calculated by taking the derivative of Eq. (7), with respect to each of the primal variables, setting the resulting equation equal to zero, and solving for the associated primal variable. Due to space considerations, we will provide the detailed mathematical derivation for each variable in an extended journal version of this paper. The derived parameter updates are provided in Algorithm 1.

Data.
We downloaded MRI scans, SNP genotypes, and demographic information for 821 ADNI-1 participants. We performed FreeSurfer automated parcellation on the MRI data by following Risacher et al. 25 and extracted mean modulated gray matter measures for 90 target regions of interest. We followed the SNP quality control steps discussed in Shen et al. 26 We also

Settings.
The performance and standard deviation results reported in Table 1 and Table 2 are calculated from ten five-fold cross validation experiments applied to and ; in-between each cross validation experiment and are randomly shuffled. Each method reported in the following experiments were tuned via a reasonable hyper parameter search to guarantee a fair comparison. The optimal tuning parameters are chosen by the model that provides the best regression or classification performance during a single five-fold cross validation experiment. In choosing the parameters for our new method, we fine tuned the γ parameters, described in Eq. (7), by applying powers of 10 between 10 −5 and 10 5 and choosing the best model based on the average multitask performance. Following the search, we achieve the best performance at γ 1 = .00001, γ 2 = .01, γ 3 = 100, μ = .001 and ρ = 1.2, which we use in all our experiments.

Performance
Regression.-We compare our algorithm against multivariate linear regression (Linear), l 2regularized linear regression (Ridge), l 1 -regularized linear regression (Lasso), 29 and multi-layer perception regression (MLP). 7 In Table 1, our new method shows superior regression performance when compared to the aforementioned methods. This is likely because our method incorporates information provided by the longitudinal regularizations across both tasks.
Classification.-We report the iterated five-fold cross validation results on the classification task of our method compared to a variety of popular machine learning algorithms for classification in Table 2. We compare our method against logistic regression (Logistic), random forest classifier (RandomForest), support vector machine using a sigmoid-kernel (SVM), k-nearest neighbors classifier (KNN), logistic regression with elastic net regularization (ElasticNet), 4 and a linear support vector machine (LinearSVM). 13 Both ElasticNet and LienarSVM have been used in the past to classify patients with AD vs. HC.
From Table 2 we can see that our algorithm shows significant improvement when predicting AD and HC diagnoses. This improvement does not appear to extend to MCI diagnoses, where logistic regression improves upon our model. This disparity is likely because the c c one-vs-all multi-class SVMs constructed in are not normalized against one another. Nonetheless outperforms the detection of HC and AD in ADNI participants when compared to the methods in Table 2.

Empirical Convergence
It is well known that the multi-block ADMM approach described in Algorithm 1 does not necessarily converge. 5 So, in order to determine convergence properties of the proposed algorithm, we perform the following empirical analyses. First, we want to determine whether the initialization of the model has a significant effect on the convergence of Algorithm 1. Second, we want to determine whether our multi-block optimization scheme actually matches the constraints incorporated by the augmented Lagrangian after a reasonable number of iterations.
To analyze the first issue we apply our algorithm to the same dataset three times and plot the objective on the left-hand-side of Fig. 2. This plot shows that, even with random initialization, our algorithm converges to a similar solution after only one-hundred iterations.
To analyze the second issue, we plot the difference between the introduced variables (e ik , F, G, H) designed to decouple the original objective in Eq. (2). As can be seen on the righthand-side of Fig. 2, once the objective has converged the difference between the decoupled variables and the variables that they replaced are within 10 −3 after one-hundred iterations of the proposed method. The convergence of the overall objective across differently initialized runs, and the eventual gap decrease, provide empirical evidence for the convergence of the proposed multi-block ADMM algorithm.

Biomarker Identification
In addition to predictive performance, our method is easily interpreted and can assist in the identification of AD-related biomarkers. Fig. 3 we plot the magnitudes, derived from , of coefficients associated with the FreeSurfer features contained in . We can clearly see that the biomarkers discovered across all four time-points are all longitudinally consistent. Visually, the brain heat-map images from Baseline to Month 24 look almost identical; this illustrates the power of the l 2,1 -norm regularization that provides our algorithm with the ability to identify longitudinally consistent biomarkers. This consistency is especially important from the clinical perspective. We find that the biomarkers identified by our method are strongly supported by previous research. For example, Mu et al. 19 provide a review documenting how the hippocampus is affected by the early stages of AD; this part of the brain is one of the top-5 regions discovered by our model in Fig. 3. Van Hoesen et al. 30 provide strong evidence that a severely damaged entorhinal cortex (Broadmann's area 28) is observed in patients suffering from AD; the thickness of the entorhinal cortex is also identified by our method. Furthermore, Poulin et al. 23 analyzed the impact of amygdala atrophy and determined that it was highly predictive of AD severity during the early clinical stages of AD; this finding is also supported by the FreeSurfer brain regions identified by our model. Table 3 we rank the top-30 SNPs discovered by our algorithm. As we expect, the highest impact SNP discovered by our algorithm is rs429358; this SNP, known frequently as the APOE-ε4 allele, has been found 12 to be highly predictive of early-onset AD. The authors' note that approximately one third of the SNPs identified by our new method have previously been linked to AD; this further validates the utility of our approach in discovering well-known, as well as possibly-novel, AD biomarkers.

Conclusion
In this work we present a multi-block alternating direction method of multipliers approach to optimize the proposed new model that incorporates the l 2,1 -norm, group l 1 -norm and tracenorm regularizations to discover important features contained in the ADNI dataset. This work illustrates a principled approach to combine multi-modal data using clinical time series data. The presented optimization algorithm is able to identify clinically relevant biomarkers and shows state-of-the-art predictive performance when jointly predicting the cognitive scores and diagnoses of ADNI participants. Visualization of the input (X), coefficient (V) and output (Y) tensors. In each time-point of X the K modalities (MRI, SNP, etc.) are explicitly defined to facilitate the calculation of the group l 1 -norm. The goal of the proposed method is to learn a joint model V that can effectively map X to the cognitive scores and diagnoses encoded in Y. Left: The proposed objective in Eq. 7 plotted over one-hundred iterations of Algorithm 1. In each run the primal and dual variables are randomly re-initialized. Right: The difference between the introduced variables designed to decouple the terms in Eq. (2).  Multi-class F 1 scores and their standard deviations, of the iterated five-fold cross validation experiments, for predicting the cognitive status of ADNI participants averaged over each time-point.  The top-30 SNPs identified by our algorithm.