The reliability of a deep learning model in clinical out-of-distribution MRI data: a multicohort study

Deep learning (DL) methods have in recent years yielded impressive results in medical imaging, with the potential to function as clinical aids to radiologists. However, DL models in medical imaging are often trained on public research cohorts with images acquired with a single scanner or with strict protocol harmonization, which is not representative of a clinical setting. The aim of this study was to investigate how well a DL model performs on unseen clinical data sets, collected with different scanners, protocols and disease populations, and whether more heterogeneous training data improves generalization. In total, 3117 brain MRI scans from multiple dementia research cohorts and memory clinics, visually rated by a neuroradiologist according to Scheltens' scale of medial temporal atrophy (MTA), were included in this study. By training multiple versions of a convolutional neural network on different subsets of these data to predict MTA ratings, we assessed the impact that including images from a wider distribution during training had on performance in external memory clinic data. Our results showed that the model generalized well to data sets acquired with protocols similar to those of the training data, but performed substantially worse in clinical cohorts with visibly different tissue contrasts in the images. This implies that future DL studies investigating performance on out-of-distribution (OOD) MRI data need to assess multiple external cohorts for reliable results. Further, including data from a wider range of scanners and protocols improved performance on OOD data, which suggests that more heterogeneous training data makes the model generalize better. To conclude, this is the most comprehensive study to date investigating domain shift in deep learning on MRI data, and we advocate rigorous evaluation of DL models on clinical data before they are certified for deployment.


Introduction
The use of deep learning (DL) models in neuroimaging has increased rapidly in the last few years, often showing superior diagnostic abilities compared to traditional imaging software (see [1,2] for reviews). This makes DL models promising as diagnostic aids to clinicians. However, for software to function in a clinical setting it should work on images acquired with different scanners and protocol parameters, and of varying image quality, a scenario reflective of most clinical settings today. Fig. 1 shows illustrative examples of the variability in images from some of the centers included in this study. Training a DL model on magnetic resonance imaging (MRI) scans requires a large dataset to obtain good performance. However, (labeled) clinical data is generally difficult (and expensive) to acquire due to strict privacy regulations on medical data. Most researchers are therefore constrained to rely on publicly available neuroimaging datasets, which are typically research cohorts that differ from a clinical setting in several ways: 1) Images are acquired with the same scanner and protocol, or protocols have been harmonized across machines. This is done to reduce image variability and confounding effects, which are problematic also for traditional neuroimaging software such as FSL, FreeSurfer and SPM [3]. 2) Research cohorts often have strict inclusion and exclusion criteria for the individuals enrolled, in order to study a particular effect of interest. For instance, to study the progression of patients suffering from Alzheimer's disease (AD) it may be necessary to exclude comorbidities, such as cerebrovascular pathology or a history of traumatic brain injury, in order to reduce heterogeneity not relevant to the research question. This is the case for the Alzheimer's Disease Neuroimaging Initiative (ADNI) cohort, the most extensive public neuroimaging data set in AD, used for training and evaluation in multiple DL studies on AD [1].
However, since comorbidities are frequent alongside AD, the ADNI cohort is hardly reflective of the heterogeneous AD profiles of patients in the clinics [4,5]. Thus, a DL model trained on data from a research cohort may perform worse in a clinical setting due to difficulties generalizing to new scanners/protocols (point 1) and/or a more heterogeneous population (point 2). Investigating the performance on out-of-distribution data (OOD data, i.e. images acquired with different scanners/protocols than those included in the training set) is an important step towards establishing the clinical applicability of DL models and understanding the challenges that can arise when deploying them.
Some previous studies have investigated the clinical applicability of machine learning models, or domain shift (training a model on data from one domain and applying it to data from another). A recent paper by De Fauw et al. (2018) trained and applied a deep learning model to a clinical dataset of 3D optical coherence tomography scans, which predicted referral decisions with performance similar to that of experts [6]. However, when the model was applied to images from a new scanning device the performance was poor. Since they used a two-stage model architecture, in which the first part segmented the image into different tissue types (making the subsequent analysis scanner-independent), it was sufficient to retrain only the segmentation network with a (much smaller) data set from the new device. Klöppel and colleagues (2015) investigated the performance of a trained SVM classifier for diagnosing dementia on a clinical data set with a more heterogeneous population [7]. Their models were also fed tissue-segmentation maps preprocessed using SPM, and they found a drop in performance compared to the "clean" training set, as well as lower performance than previous studies had reported (typically cross-validation performance). Zech et al. (2018) explicitly investigated how a convolutional neural network (CNN) trained for pneumonia screening on chest X-rays generalized to new cohorts, and found significantly lower performance in OOD cohorts. Further, they demonstrated that a CNN could accurately classify which hospital an image was acquired at, and could thus potentially leverage this information to adjust its predictions based on the different disease prevalences in the cohorts [8]. Some recent studies have investigated MRI segmentation performance across centers and again found drops in performance [9,10,11]. These analyses were made on small numbers of images, as segmented data is typically expensive and time-consuming to label.
In contrast to segmented data, visual ratings of atrophy, which still serve as the main tools to quantify neurodegeneration in memory clinics, offer a faster method to annotate brain images, making it feasible to label large datasets (>1000 images) from multiple clinics. Our group recently proposed AVRA (Automatic Visual Ratings of Atrophy), a DL model based on convolutional neural networks (CNNs) [12]. AVRA takes an unprocessed T1-weighted MRI image as input and predicts ratings on Scheltens' Medial Temporal Atrophy (MTA) scale, commonly used clinically to diagnose dementia [13] (see Fig. 1 for examples of the MTA scale).
The aim of this study is to systematically investigate the performance of a CNN-based model (AVRA) on OOD data from clinical neuroimaging cohorts. We study the impact that more heterogeneous training data has on generalization to OOD data by training and evaluating AVRA on images from different combinations of cohorts. Two of these cohorts are research-oriented and similar to each other in terms of disease population (AD) and protocol harmonization. The other two datasets consist of clinical data from multiple European sites, including individuals with different and mixed types of dementia, not just AD. Additionally, we assess the inter- and intra-scanner variability of AVRA in a systematic test-retest set. To our knowledge this is the largest and most comprehensive study on the generalization of DL models in neuroimaging and MRI data.

MRI data and protocols
The 3117 images analyzed in this study came from the five cohorts described in Table 1, where we also list the reasons for including these datasets in the current study. Full lists of scanners and scanning protocols are provided as Supplementary Data. TheHiveDB was used for data management in this study [14].
Data used in the preparation of this article were obtained from the Alzheimer's Disease Neuroimaging Initiative (ADNI) database (adni.loni.usc.edu). The ADNI was launched in 2003 as a public-private partnership, led by Principal Investigator Michael W. Weiner, MD. The primary goal of ADNI has been to test whether serial magnetic resonance imaging (MRI), positron emission tomography (PET), other biological markers, and clinical and neuropsychological assessment can be combined to measure the progression of mild cognitive impairment (MCI) and early Alzheimer's disease (AD). In brief, ADNI is a large, public dataset that has helped advance the field of AD and neuroimaging. However, its strictly harmonized protocols and strict exclusion criteria make ADNI unrepresentative of a clinical setting. Some subjects were scanned multiple times (within a month) on both a 1.5T and a 3T scanner, in which case one of the images was selected at random during training for the current study.

AddNeuroMed is an imaging cohort collected at six sites across Europe with the aim of developing and validating novel biomarkers for AD [15]. The MRI images were acquired with protocols designed to be compatible with data from ADNI, and the two cohorts have been successfully combined in previous studies [16,17,18]. AddNeuroMed was an interesting cohort in which to assess AVRA's reliability, due to its consistent scanning parameters and acquisition methods similar to ADNI. Thus, this dataset represented a research cohort in which we expected our DL model to show good performance when trained on ADNI data. A subset of the images (122), from patients diagnosed with AD, had been visually rated for MTA. Exclusion criteria for both of these studies included history of head trauma, neurological or psychiatric disorders (apart from AD), organ failure, and drug/alcohol abuse.

The MemClin data set was also used for training in our previous study detailing AVRA [12]. MemClin consists of images of patients with AD or frontotemporal lobe dementia collected from the memory clinic at Karolinska Hospital in Huddinge, Sweden. This data set better resembled a clinical setting, with varying scanning parameters and field strengths, although the disease population was not completely representative of patients in a memory clinic.

Table 1: An overview of how the cohorts used for training and/or evaluation differ from each other, and the purpose of including them in the present study. The E-DLB cohort (denoted E-DLB all, referring to all images in the cohort) was stratified into different subsets in order to isolate specific features of interest. N train/N test refers to the number of labeled images used during training/evaluation, where some cohorts were split into a training and a test set. Abbreviations: Deep Learning (DL); Out-of-distribution (OOD) data; Alzheimer's disease (AD); Healthy controls (HC); Frontotemporal lobe dementia (FTLD); Dementia with Lewy Bodies (DLB); Parkinson's disease with dementia (PDD).

ADNI. Scanners/protocols: multiple scanners and sites, but strictly harmonized with a phantom; both 1.5T and 3T. Disease population: AD spectrum and HC. Purpose of inclusion: common cohort to train and evaluate DL models in, which we hypothesize should not generalize well.

AddNeuroMed (N=122). Scanners/protocols: harmonized, designed to be compatible with ADNI. Disease population: AD patients only. Purpose of inclusion: assess AVRA in an external research cohort similar to ADNI.

MemClin. Scanners/protocols: unharmonized, part of clinical routine at a single memory clinic. Disease population: mainly AD spectrum and HC, with 37 FTLD patients. Purpose of inclusion: large clinical cohort with a disease population similar to ADNI and AddNeuroMed.

E-DLB all (N=645). Scanners/protocols: retrospective unharmonized data of varying quality from 12 European sites, acquired as part of their clinical routine. Disease population: mainly DLB spectrum, but also HC, AD and PDD. Purpose of inclusion: assess performance of AVRA in a large, realistic clinical cohort.

E-DLB AD. Scanners/protocols: same as E-DLB all. Disease population: only individuals with AD pathology from E-DLB all. Purpose of inclusion: isolate effects of scanners/protocols not seen during training from effects of disease population.

E-DLB DLB / E-DLB PDD (N={266,97}). Scanners/protocols: same as E-DLB all. Disease population: only individuals with DLB or PDD pathology from E-DLB all, respectively. Purpose of inclusion: assess the impact that scanners/protocols and disease populations not seen during training have on AVRA performance.

E-DLB train 25% / E-DLB train 50%. Scanners/protocols: same as E-DLB all. Disease population: randomly selected images with a probability of 25% (or 50%) from all centers in E-DLB all. Purpose of inclusion: assess the effect that including training data from the test set distribution has on AVRA performance.

E-DLB C1 / E-DLB C2. Scanners/protocols: both centers have used a single scanner (3T) and protocol. Disease population: images stratified into three groups: from center C1, from C2, and all images in E-DLB all not from C1 or C2. Purpose of inclusion: "external validation sets": how would AVRA perform if deployed in two external memory clinics?

Test-retest. Scanners/protocols: three different Siemens scanners. Disease population: young (38 ± 13 years old) MS patients and healthy controls. Purpose of inclusion: systematic evaluation of the impact scanner variability has on AVRA predictions.
The only exclusion criterion was a history of traumatic brain injury. Images and ratings have previously been analyzed in [19,20].
The fourth cohort in this study consists of clinical MRI images from the European consortium for Dementia with Lewy Bodies (referred to as E-DLB from here on), previously described in [21,22]. Patients referred to memory, movement disorder, psychiatric, geriatric or neurology clinics who had undergone an MRI were selected from 12 sites in Europe. These individuals were diagnosed with Dementia with Lewy Bodies (DLB), AD, Parkinson's Disease with Dementia (PDD), mild cognitive impairment (MCI, due to AD or DLB), or were normal elderly controls (NC). The images were acquired as part of the clinical routine, consequently without protocol harmonization, and can thus be considered to reflect a clinical setting well. Exclusion criteria for the E-DLB cohort were a recent diagnosis of major somatic illness, history of psychotic or bipolar disorders, acute delirium, or terminal illness.
We also investigated AVRA's rating consistency on unprocessed MRI images (i.e. no lesion filling) of three healthy individuals and nine individuals with multiple sclerosis (MS, mean disease duration 7.3 ± 5.2 years) who were scanned twice, with repositioning, in three different Siemens scanners (i.e. six scans in total) in a single day. Six of the patients had relapsing-remitting MS, two secondary progressive MS, and one primary progressive MS. This data set was collected for a previous study [3], and we will refer to this small set as the test-retest dataset. These individuals were not rated for MTA by a radiologist.

Radiologist ratings
An experienced neuroradiologist (Lena Cavallin, L.C.) visually rated 3117 T1-weighted brain images (blind to age and sex) according to the established MTA rating scale. These ratings have been used in previous studies on AD [23] and E-DLB [22] by our group, and the distribution of ratings is shown in Table 2. These rating scales provide a quantitative measure of atrophy in specific regions, and while they are often used for dementia diagnosis, the rating scales themselves are independent of diagnosis, age and sex. L.C. has previously demonstrated excellent inter- and intra-rater agreement in research studies [12].

Model description
Our group recently proposed a method we call AVRA (Automatic Visual Ratings of Atrophy) that provides computed scores for three visual rating scales commonly used clinically: Scheltens' MTA scale (see Fig. 1), Pasquier's frontal subscale of global cortical atrophy (GCA-F), and Koedam's scale of posterior atrophy (PA) [12]. AVRA showed substantial rating agreement with an expert neuroradiologist on all three scales on a hold-out test set (N=464) drawn from the same distribution as the training data (N=1886) from two AD cohorts. Since the measures are independent of diagnosis, sex and age, a DL tool such as AVRA (trained end-to-end, performing its own feature extraction from the entire brain volume) should work equally well on different disease populations.
For this experiment we focused only on the MTA scale and used the same network architecture and hyperparameters as previously described in [12], but with different combinations of cohorts in the training set. Briefly, AVRA is a Recurrent Convolutional Neural Network (R-CNN) that takes an unprocessed MRI volume as input, which is then processed slice by slice by the model. A residual attention network [24] is used to extract features from each slice, and these are forwarded to a Long Short-Term Memory (LSTM) network [25]. The LSTM modules remember relevant information provided by each slice and use it to predict the atrophy score the radiologist would give. This prediction is continuous, but when studying the inter-rater agreement with the radiologist, expressed in kappa statistics or accuracy, we round AVRA's output to the nearest integer.

Table 2: Distribution of MTA ratings from a neuroradiologist in the different cohorts, together with sex (percentage female) and age demographics. The lines in bold refer to the statistics of the whole cohort, whereas the rows not in boldface are the subsets used during training. N is the total number of rated images; since both left and right hemispheres were rated, there were 2N ratings. The MTA distribution shows the percentage of each radiologist rating per (sub-)cohort. A small linespace is added between some E-DLB subsets to illustrate the grouping of subsets in which no overlap between training and test sets occurs.
A trained version of AVRA, targeted towards neuroimaging researchers, is publicly available at https://github.com/gsmartensson/avra_public.
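As a schematic illustration of the slice-recurrent design described above, the sketch below replaces the residual attention network and the LSTM with simple stand-in operations; it is our illustrative sketch, not AVRA's actual implementation, and all weight names and dimensions are assumptions:

```python
import numpy as np

def rate_volume(volume, w_feat, w_in, w_h, w_out):
    """Toy sketch of a slice-recurrent pipeline: a per-slice feature
    extractor feeds a recurrent unit that carries information across
    slices, ending in a single continuous rating. The tanh layers are
    stand-ins for the attention CNN and LSTM used in the real model."""
    h = np.zeros(w_h.shape[0])
    for slc in volume:                        # process the volume slice by slice
        feat = np.tanh(w_feat @ slc.ravel())  # stand-in for the CNN feature extractor
        h = np.tanh(w_in @ feat + w_h @ h)    # stand-in for the LSTM state update
    return float(w_out @ h)                   # continuous MTA prediction

def to_mta_grade(prediction):
    """Round a continuous prediction to the nearest integer MTA grade (0-4),
    as done when computing agreement statistics against the radiologist."""
    return int(np.clip(np.rint(prediction), 0, 4))
```

The key design point this illustrates is that the recurrent state lets the model aggregate evidence across neighboring slices before committing to a rating.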

Training procedure
To systematically investigate the performance on new data distributions, we trained versions of AVRA while keeping the number of subjects fixed at the maximum size of the ADNI training data (N = 1568), since more training data generally leads to better performance and varying training set sizes could therefore bias the results. ADNI was the largest dataset with ratings available to us, and needed to be part of all training sets in order for the number of images to be large enough for training. When adding data from an additional cohort, we replaced a subject in ADNI with one from the new cohort that had received the same ratings from the radiologist. This way, both the size and the label distribution of the training data were kept constant. Each training set was divided into five cross-validation sets (to replicate the procedure in [12]) and the five trained models were used as an ensemble classifier.
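The subject-replacement scheme can be sketched as follows; the function name and data structures are our own illustrative assumptions rather than the study's implementation:

```python
import random

def build_mixed_training_set(adni, extra, seed=0):
    """Replace ADNI subjects with subjects from an additional cohort that
    received the same radiologist rating, keeping both the training set
    size and the label distribution fixed.

    `adni` and `extra` are lists of (subject_id, rating) pairs; these
    structures are illustrative, not the ones used in the study."""
    rng = random.Random(seed)
    pool = list(adni)
    for subj, rating in extra:
        # candidates currently in the pool with an identical rating
        matches = [i for i, (_, r) in enumerate(pool) if r == rating]
        if not matches:
            continue  # no subject with a matching label left to swap out
        pool[rng.choice(matches)] = (subj, rating)
    return pool
```

Because each swap exchanges two subjects with identical labels, any performance difference between training sets can be attributed to the image distribution rather than to set size or label balance.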
Each of the cohorts has different characteristics, as outlined in Table 1. Since the E-DLB cohort was highly diverse in terms of scanners and disease population, we stratified it into different partitions (some with overlap, but no training/test set pair shared any images) in order to isolate specific features. To investigate the performance drop due to OOD test data, we randomly assigned each subject to E-DLB train 25%, E-DLB train 50% or E-DLB test 50%, where the numbers refer to the percentage of subjects from the whole cohort, with no overlap between train and test. This setup aims to simulate realistic ways of introducing a DL model into a new clinic: 1) as is (i.e. with no additional labeled data from the new clinic); 2) retraining, or fine-tuning, the existing model with some additional labeled data from the same clinics (E-DLB train 25%); 3) same as 2) but with twice as much additional data (E-DLB train 50%). To assess the impact of disease population we sampled individuals on the AD spectrum (E-DLB AD), on the DLB spectrum (E-DLB DLB), or with PDD (E-DLB PDD) into three subsets. Since the bulk of the training images comes from ADNI, an AD cohort, it is of interest to see whether the models overfit to AD atrophy patterns and are influenced by neighboring regions in the medial temporal lobe that are not part of the MTA scale.
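One plausible way to implement the random assignment is sketched below, assuming the 25% training subset is nested within the 50% one (the text does not state the nesting explicitly, so this detail is our assumption):

```python
import random

def stratify_edlb(subject_ids, seed=0):
    """Hypothetical sketch of the E-DLB split: 50% of subjects form the
    test set, the remaining 50% form the larger training subset, and half
    of those (25% of the cohort) form the smaller training subset. The
    exact procedure used in the study may differ."""
    rng = random.Random(seed)
    ids = list(subject_ids)
    rng.shuffle(ids)
    half = len(ids) // 2
    test_50 = set(ids[:half])                                 # E-DLB test 50%
    train_50 = set(ids[half:])                                # E-DLB train 50%
    train_25 = set(ids[half:half + len(train_50) // 2])       # nested 25% subset
    return train_25, train_50, test_50
```

The crucial invariant, whatever the exact procedure, is that neither training subset shares any subject with the test set.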
To study whether AVRA's generalizability improved when widening the training data distribution, we also computed the performance on data from two clinics that we refer to as E-DLB C1 and E-DLB C2. A single 3T scanner and protocol was used at each site, yet the images have visibly different intensities (see examples in Fig. 1). We view these centers as "external validation sets" to estimate the performance we might expect if implementing AVRA in a new memory clinic (although single-scanner usage and the study populations may not perfectly represent a memory clinic sample). We added data from all other centers to our training set (E-DLB train C1,C2) to study whether more heterogeneous training data improves generalization to new protocols.

Evaluation metrics
We assess the performance of AVRA using Cohen's linearly weighted kappa κ w, which is the most common metric for inter- and intra-rater agreement for visual ratings in the literature. It ranges from -1 to 1, where κ w ∈ [0.2,0.4) is generally considered fair, κ w ∈ [0.4,0.6) moderate, κ w ∈ [0.6,0.8) substantial and κ w ∈ [0.8,1] almost perfect [26]. As opposed to accuracy, κ w takes the rating distributions of the two sets into account, which is particularly useful when the number of ratings in each class is imbalanced. For comparison, AVRA achieved inter-rater agreements of κ w = 0.72-0.74 (left and right MTA, respectively) with an expert radiologist on a test set from the same data distribution as the training data in [12], similar to reported inter-rater agreements between two radiologists. Since computing κ w requires rounding AVRA's continuous predictions to the nearest integer, the mean squared error (MSE) was also reported. Accuracies are included as Supplementary Data.
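For reference, linearly weighted kappa can be computed from the confusion matrix of two integer rating vectors as follows; this is a minimal sketch of the standard formula, not the code used in the study:

```python
import numpy as np

def linearly_weighted_kappa(rater_a, rater_b, n_classes):
    """Cohen's linearly weighted kappa between two integer rating vectors.

    kappa_w = 1 - sum(W * O) / sum(W * E), where O is the normalized
    confusion matrix, E its expected counterpart under independent
    marginals, and W the linear disagreement weights |i - j|."""
    O = np.zeros((n_classes, n_classes))
    for a, b in zip(rater_a, rater_b):
        O[a, b] += 1
    O /= O.sum()
    # expected agreement from the two raters' marginal distributions
    E = np.outer(O.sum(axis=1), O.sum(axis=0))
    # linear disagreement weights: penalty grows with rating distance
    idx = np.arange(n_classes)
    W = np.abs(idx[:, None] - idx[None, :])
    return 1.0 - (W * O).sum() / (W * E).sum()
```

Unlike plain accuracy, disagreements of one MTA grade are penalized less than disagreements of several grades, and chance agreement implied by the marginal rating distributions is discounted.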

Results
The rating agreements between AVRA and the neuroradiologist are summarized in Table 3. When training only on the research cohort ADNI, we saw a general drop in performance in the clinical cohorts compared to the ADNI test set, particularly in the E-DLB C1 set. Adding data from the similar cohort AddNeuroMed did little to improve generalization, whereas including the clinical MemClin cohort had a positive impact on performance. The overall impression was that including data from clinical cohorts in the training set improved the rating agreements and accuracies in the clinical test sets, although not consistently. Surprisingly, the rating agreement was greater in the sub-cohorts E-DLB DLB and E-DLB PDD than in E-DLB AD when training only on images from AD cohorts.
In Fig. 2 we focus on the centers E-DLB C1 and E-DLB C2, where AVRA's performance metrics were particularly low (C1) or close to the within-distribution test set performance (C2) when trained on research data. We compared the predictions made by the ensemble models trained only on ADNI (x-axis) to those made when trained on data from ADNI plus clinical images from the MemClin and E-DLB train C1,C2 cohorts. Thus, no images from these centers had been part of either training set, but the latter included clinical images acquired with a wider range of protocols. We observed systematic differences in the predictions between the two models, most notably in the C1 cohort. Note the intensity differences between tissue types in images from ADNI, C1 and C2 in Fig. 1.
AVRA's MTA ratings on the test-retest cohort are plotted in Fig. 3 for the models trained on the least and the most heterogeneous data. Within each model, we observed small intra-subject rating variability for most subjects. It was mainly the predictions for the two images acquired with the Siemens Trio 3T that stood out. While the direction of the rating differences was not consistent across subjects, this suggests that AVRA may systematically rate images acquired with some protocols/scanners differently. Comparing the two versions, we see that the model trained only on ADNI systematically rates images lower than the model also trained on clinical data, consistent with Fig. 2. Further, it should be noted that these participants were younger than those in any of the training cohorts and, for the patients suffering from MS, from a different disease population.

Discussion
In this study we systematically showed that the performance of a CNN trained on MRI images from homogeneous research cohorts generally drops when the model is applied to clinical data. In one center, where image intensity was visibly different from that of the training data, the performance of AVRA was lower due to a systematic underestimation. However, by including images acquired with a wider range of scanners and protocols in the training set, we observed an increase in the robustness/reliability of the DL model on unseen OOD data, without substantial damage to the within-distribution test set performance. This is the first study on a large MRI neuroimaging data set labeled by the same expert neuroradiologist (thus with no inter-observer bias) and with fixed training set sizes and label distributions. These results add to the evidence that rigorous testing of DL applications in medical imaging needs to be performed on external data before clinical use.
From our results in Table 3 we note several interesting findings. First, the level of agreement is lower in the clinical cohorts MemClin and E-DLB all when training only on research cohorts (ADNI with or without AddNeuroMed). This suggests that we can expect degradation of a CNN model when it is applied to MRI images acquired with protocols not seen during training, which is problematic for scalable deployment in clinics. Similar findings have previously been reported for segmentation tasks on cross-institutional MRI data [10,11] and for chest X-ray data [27,28]. While inter-rater agreement levels of κ w > 0.6 might be considered acceptable for visual ratings in many clinical situations (reported κ w levels between radiologists are typically between 0.6 and 0.8 in previous studies [12]), we see that the agreement in E-DLB C1 is substantially lower when training only on data from harmonized research cohorts. Further, the performance drop observed in E-DLB C1 but not in E-DLB C2 implies that evaluating DL models on data from a single external center is not sufficient to assess the degree of generalization. In order to deploy a clinical DL model we believe it is necessary to report the epistemic uncertainty of a prediction, i.e. the model's uncertainty due to not having been exposed to a similar image during training. This would signal that more "E-DLB C1-like" data needs to be included in the training set for the DL model to show good performance in C1 (sometimes referred to as active learning). Developing scalable methods to estimate DL model uncertainty, or to detect OOD data, is an active research field but was not explored in the current study and dataset.

Second, including images of larger variability from clinical cohorts improved performance even when keeping the training set size and label distribution fixed. Including data from MemClin in the training set had a positive impact on the E-DLB sets, and vice versa.
This implies that by training a supervised DL model on data from a wide range of scanners, protocols, field strengths and diagnoses/labels, it is possible to achieve acceptable performance on new unseen data. The systematic prediction differences for E-DLB C1 in Fig. 2 illustrate this point well: training data from other memory clinics had a large impact on the predictions.
Third, we investigated the performance of AVRA in DLB and PDD populations when trained on images of subjects on the AD spectrum (from healthy controls, to patients with mild cognitive impairment, to AD). Unexpectedly, the agreement was higher in both the DLB and the PDD populations than in the AD population from the E-DLB cohort. These results could potentially be explained by the differences in rating distributions between the disease populations. PDD and DLB individuals generally had lower MTA ratings than the AD patients, and from Fig. 2 we see that the model trained only on ADNI tends to rate too low, particularly for higher MTA values. Thus, this systematic error could affect AVRA's performance in the AD population more. However, the relatively high agreements in E-DLB DLB and E-DLB PDD show that AVRA has the potential to generalize across disease populations. This finding is likely attributable to the strength of the clinical visual rating scales, which are disease-unspecific by design, and demonstrates the power of incorporating domain knowledge when building DL models. A previous study applying machine learning models (SVM) to unseen clinical data reported and discussed difficulties in determining whether subjects suffered from mixed pathologies (e.g. both AD and FTD) or a misdiagnosis [7]. A model trained to discriminate, e.g., AD patients from healthy controls, both generally defined by strict inclusion and exclusion criteria in research cohorts, does just that. Applying such an "AD model" in a more heterogeneous cohort with controls, AD and DLB subjects would thus most probably misdiagnose DLB as AD due to their resembling patterns of atrophy [22].
The test-retest results (Fig. 3) show impressive consistency for each DL model in most predictions. The ratings from the version trained on multiple data sets seem to show higher variability for many subjects than those from the version trained only on ADNI. Given that this model showed better generalization in the analyses summarized in Table 3, this is somewhat counterintuitive. It should be noted, however, that these differences are small considering that the models were trained on integer ratings with some degree of intra-rater variability. This inter-scanner variability could partially be explained by a minor overfit to scanner and protocol. This is, however, preferable to the ADNI-only model, whose ratings seem to be systematically too low. Within-scanner and within-field-strength variability was practically non-existent, and it is only the images from the 3T scanner that deviate notably for some patients. This means that we expect AVRA to be useful for longitudinal studies, where the data is typically collected in a harmonized way. Guo et al. (2019) analyzed the same dataset using different (non-machine learning) neuroimaging software and reported smaller within- than between-scanner variability [3]. A previous study investigating the impact that the choice of scanner and field strength has on the performance of an SVM classifier found the largest performance drop when training on 1.5T data and testing on 3T data and vice versa, while the classifier generalized well to new scanners within the same field strength [29]. The analyses in [29] were done in the ADNI cohort, with protocols harmonized using a phantom to reduce scanner and site variability. For computer scientists it would solve many practical issues if protocols were harmonized across clinics and used as defaults. However, this seems unlikely given the enormous effort of implementing it, the continuous development of new (improved) sequences, and the disruption to the habits and workflows of clinicians.
Further, the real gain of machine learning applications would be on CT images, which are cheaper and more commonly available, but where image quality variation is even greater. Thus, scanner/protocol generalization remains an important issue that needs solving prior to deploying DL models as clinical aids. Since labeled data in medicine is often difficult or expensive to acquire, semi-supervised approaches may play a big role in medical machine learning applications, as they allow the inclusion of unlabeled images in the training data. This has been shown to improve generalization on medical OOD data [9,10,30].
The current study has some limitations that we leave for future studies. Foremost, we trained and evaluated a single network architecture, and we cannot say to what degree the results are representative of DL models in general. By using the same hyperparameters as in [12] (tuned to optimize performance on a within-distribution cross-validation set), nothing prevented AVRA from overfitting to the training protocols. Further, while the kappa metric is the most common way to quantify the reliability of visual ratings, it can be noisy since the predictions need to be rounded to the nearest integer. The MSE metric does not require rounding, but is on the other hand sensitive to outliers. Since AVRA takes unprocessed MRI images as input, just as a radiologist would, we did not explore the impact that preprocessing or intensity normalization could have on generalization.

Conclusion
In this study we assessed how well a supervised deep learning model (AVRA), trained on unprocessed MRI brain images to predict Scheltens' MTA score, generalizes to external clinical data. More specifically, we trained multiple versions of AVRA on data from different combinations of research and clinical cohorts, while keeping the training set size and label distribution fixed. We found that AVRA trained on homogeneous data from a research cohort generalized well to cohorts with similar protocols, but worse when applied to clinical data. On images from one specific memory clinic the performance dropped to an unacceptably low level. Including more heterogeneous data from a wider range of scanners and protocols during training improved the performance also on out-of-distribution data. Furthermore, when applying AVRA to images of patients suffering from neurological disorders other than AD, we did not observe a noticeable decrease in performance. From these findings we advocate that DL models need to be rigorously tested on OOD data before being deployed in clinics. This is, to our knowledge, the largest and most comprehensive study to date on the effect of domain shift in MRI images and deep learning models.
As supplementary data we provide accuracy results complementary to the MSE and Cohen's κ w in Table A.