BASE: Brain Age Standardized Evaluation

Brain age, most commonly inferred from T1-weighted magnetic resonance images (T1w MRI), is a robust biomarker of brain health and related diseases. Superior accuracy in brain age prediction, often falling within a 2-3 year range, is achieved predominantly through deep neural networks. However, comparing study results is difficult due to differences in datasets, evaluation methodologies and metrics. Addressing this, we introduce Brain Age Standardized Evaluation (BASE), which includes (i) a standardized T1w MRI dataset including multi-site, new unseen site, test-retest and longitudinal data, and an associated (ii) evaluation protocol, including repeated model training and a comprehensive set of performance metrics measuring accuracy, robustness, reproducibility and consistency aspects of brain age predictions, and (iii) a statistical evaluation framework based on linear mixed-effects models for rigorous performance assessment and cross-comparison. To showcase BASE, we comprehensively evaluate four deep learning based brain age models, appraising their performance in scenarios that utilize multi-site, test-retest, unseen site, and longitudinal T1w brain MRI datasets. Ensuring full reproducibility and application in future studies, we have made all associated data information and code publicly accessible at https://github.com/AralRalud/BASE.git.


Introduction
Brain age is an estimate of biological age derived from brain magnetic resonance images (MRIs), and it has emerged as a significant biomarker of neurological health and aging. Assessing brain age involves training a machine learning model for age prediction using input T1-weighted (T1w) MRIs of a healthy population, followed by the application of the model outside the training dataset to detect potential brain age discrepancies in diverse health conditions. For instance, increased brain age with respect to healthy controls has been demonstrated in patients with neurological diseases such as Alzheimer's dementia (Franke and Gaser, 2012), multiple sclerosis (Høgestøl et al., 2019; Cole et al., 2020), schizophrenia (Schnack et al., 2016; Koutsouleris et al., 2014), and other diseases like type 2 diabetes (Franke et al., 2013), human immunodeficiency virus (HIV) (Petersen et al., 2021; Cole et al., 2017c), and in obese (Ronan et al., 2016) and vitamin D deficient subjects (Terock et al., 2022).
The use of deep learning (DL) models for brain age prediction has seen a surge in recent years (Baecker et al., 2021b; Tanveer et al., 2023). However, differences in evaluation protocols, such as the use of varying performance metrics, different validation datasets, age spans, subject counts, T1w preprocessing pipelines, and post-processing age-bias corrections, make comparisons across studies challenging, if not impossible. Although the evaluation of models on new site data is somewhat common, their evaluation on longitudinal datasets, to assess the ability to capture the linear trend associated with aging, is rather rare. Even in studies that performed such evaluations (Dartora et al., 2022; Dunås et al., 2021; Beheshti et al., 2021), the consistency of predictions was either assessed visually or based on cross-sectional metrics, which seems inadequate. Furthermore, the reproducibility of predictions across models trained with different weight initializations (Jonsson et al., 2019; Levakov et al., 2020) or those using test-retest settings (Franke and Gaser, 2012; Cole et al., 2017b; Feng et al., 2020) has not been systematically evaluated.

This study was supported by the Slovenian Research Agency (Core Research Grant No. P2-0232 and Research Grants Nos. J2-2500 and J2-3059). The APC was funded by the Slovenian Research Agency. E-mail address: ziga.spiclin@fe.uni-lj.si (Ž. Špiclin). Data used in preparation of this article were obtained from the Alzheimer's Disease Neuroimaging Initiative (ADNI) database (adni.loni.usc.edu); as such, the investigators within the ADNI contributed to the design and implementation of ADNI and/or provided data but did not participate in analysis or writing of this report. A complete listing of ADNI investigators can be found at: http://adni.loni.usc.edu/wp-content/uploads/how_to_apply/ADNI_Acknowledgement_List.pdf.
To bridge these gaps, we propose the Brain Age Standardized Evaluation (BASE), which aims to establish a standardized approach to evaluate brain age prediction models, integrating best practices and overcoming the limitations of existing methodologies.

https://doi.org/10.1016/j.neuroimage.2023.120469 Received 27 July 2023; Received in revised form 31 October 2023; Accepted 20 November 2023

This paper is organized as follows: a review of related work is provided in Section 2; Section 3 describes the BASE datasets, performance metrics, evaluation protocols, and a statistical framework used for assessing brain age models; the models and their evaluation using BASE are detailed in Sections 4 and 5, respectively; finally, the discussion and conclusion are presented in Sections 6 and 7, respectively.

Related work and contribution
Recent research efforts in brain age prediction have focused on introducing novel DL architectures (He et al., 2022b,a; Bellantuono et al., 2021), diversifying training strategies, including cascade learning (Cheng et al., 2021) and model ensembling over modalities (Kuo et al., 2021; Peng et al., 2021; Dunås et al., 2021; Jonsson et al., 2019), modifying the input T1w image into a two-channel representation encoding contrast and morphometry information (He et al., 2021), simplifying preprocessing by utilizing only image registration to common space (Dartora et al., 2022), and optimizing sampling strategies to achieve an evenly sampled training set over the entire age span (Feng et al., 2020). A general deficiency of these research studies is the lack of a common, standardized evaluation approach.
Present methodologies for evaluating brain age models predominantly concentrate on contrasting the performances of traditional machine learning models (Beheshti et al., 2022; Baecker et al., 2021a; Han et al., 2022; Xiong et al., 2023). In these studies, models are typically trained and tested on the same collection of MRIs. Such evaluations may fall short in fully capturing various confounding elements such as subject and scanner variability, thus sidelining several crucial aspects of model performance. Although the recent comprehensive research by More et al. (2023) delves into these aspects, it primarily focuses on traditional machine learning models. It thereby overlooks certain aspects intrinsic to deep learning models, such as the reproducibility of predictions of multiple models trained with different weight initializations and the effect of potential alterations in preprocessing between training and test datasets.
The accuracy of brain age models is conventionally assessed through the Mean Absolute Error (MAE) computed across all test subjects, signifying the discrepancy between biological and predicted age. However, MAE can present a misleading picture, particularly when the test data comprise age ranges that are overrepresented in the training data, leading to more precise predictions (for instance, when there is a high proportion of young subjects, as in the OpenBHB dataset (Dufumier et al., 2021)). As such, the MAE is not sensitive to a possible increase (or decrease) of absolute errors in specific age subintervals. Some studies attempt to circumvent this issue by reporting the MAE by age interval (He et al., 2022b; Levakov et al., 2020; Amoroso et al., 2019). There is a clear need for a robustness metric to differentiate between close-fitting models (Cheng et al., 2021), which demonstrate consistent precision across all ages, and loose-fitting models (He et al., 2021), which exhibit variable accuracy, especially in underrepresented age intervals throughout the entire age span.
Methodological studies reporting improvements in brain age prediction accuracy on healthy subjects often lack rigorous statistical evaluation. Conversely, studies on diseased populations typically involve statistical evaluation, employing t-tests and/or ANOVA with post hoc pairwise comparisons (Franke and Gaser, 2012). Noteworthy practices include the use of Linear Mixed-effects Models (LMEM) on subjects with Alzheimer's disease, mild cognitive impairment, schizophrenia or depression (Bashyam et al., 2020), and multiple sclerosis (Høgestøl et al., 2019; Cole et al., 2020), using the brain age gap as an independent variable. Such a rigorous statistical framework, including its parametrization, is yet to be established for the evaluation of brain age on healthy subject datasets.
Validation of brain age prediction models for clinical applications should involve assessing their performance on new (unseen) site T1w subject scans, not used during model training (Feng et al., 2020; Jonsson et al., 2019; Dufumier et al., 2022; Franke and Gaser, 2012; He et al., 2021, 2022a,b; Bellantuono et al., 2021; Han et al., 2022; Dartora et al., 2022; Cai et al., 2023; Bashyam et al., 2020). When models are applied to an unseen dataset, a deterioration in performance metrics is generally observed (Feng et al., 2020; Dufumier et al., 2022; Jonsson et al., 2019; Han et al., 2022; Dartora et al., 2022; Cai et al., 2023; Bashyam et al., 2020), which is often compensated for through a linear bias correction. However, recent studies advise against such age-bias correction, since bias-corrected metrics can indicate high accuracy even for models showing poor initial performance (de Lange et al., 2022; Butler et al., 2021). Nevertheless, when the offset appears to be systematic across the entire age span (Franke and Gaser, 2012), applying an offset adjustment may be appropriate.
The consistency of age predictions is vital for longitudinal intra-subject evaluations, especially when tracking disease progression or deviations from the normative aging trajectory. While there has been significant progress in providing extensive public datasets and benchmarking platforms, which incorporate multi-site train and test datasets as well as new site data (for instance, OpenBHB (Dufumier et al., 2022)), research on longitudinal datasets involving healthy subjects remains underrepresented. Current studies usually resort to visual methods to evaluate longitudinal consistency, charting longitudinal predictions on linear graphs (Dunås et al., 2021; Dartora et al., 2022). Quantitative longitudinal performance evaluation metrics were used in the study by Dunås et al. (2021), where lines between time points were fitted to analyze the predicted longitudinal trajectories. While the analysis of slope and intercept allows monitoring the rate of change over time, it does not capture the magnitude of the error of the predicted difference, which would be analogous to the MAE. This observation underscores the necessity for specialized metrics designed to evaluate the consistency of brain age predictions on longitudinal data.
Finally, the reproducibility of any biomarker holds vital importance for practical application and can be assessed using test-retest data. However, brain age studies have thus far used either (i) a limited number of test-retest subjects with a large number of scans per subject (Feng et al., 2020) or (ii) a large number of test-retest subjects, each with few scans (Cole et al., 2017b; Franke and Gaser, 2012). The best observed practice for assessing test-retest agreement is to report the intraclass correlation coefficient (ICC). Another aspect is the reproducibility of brain age predictions across DL model realizations, considering the initial random weight selection, where the ICC can also be utilized. However, such evaluations have rarely been performed in studies involving DL models (Jonsson et al., 2019; Levakov et al., 2020).
The contribution of this paper is BASE, which comprises (i) a standardized T1w MRI dataset, including multi-site, new unseen site, test-retest, and longitudinal datasets, along with (ii) an evaluation protocol. The evaluation protocol includes a comprehensive set of established and novel performance metrics to measure the accuracy, robustness, reproducibility, and consistency aspects of brain age predictions, complemented by (iii) a statistical evaluation framework based on LMEMs. This protocol is crafted for compatibility not just with our proposed T1w MRI dataset, but can also be adapted for use with alternative datasets relevant to brain age prediction. We demonstrate the use of BASE in a comprehensive evaluation of four DL brain age models, with reproducible results using our public implementation at https://github.com/AralRalud/BASE.git.

BASE protocol
The BASE protocol is depicted in Fig. 1. The model evaluation phase involves four tasks: (1) comparison of the performance of DL models and/or the comparative evaluation of the impact of model training strategies, (2) performance evaluation on seen/unseen datasets, (3) reproducibility and (4) consistency evaluation on the respective test-retest and longitudinal datasets. The principal results of BASE, sourced from Sections 5.1-5.4, are depicted in the form of a radar plot in Fig. 2.
The building blocks of BASE comprise the data, performance metrics, and statistical analysis framework, which are detailed in the following subsections.

Datasets
In developing BASE, we established four distinct datasets. The primary dataset encompasses multi-site T1w MRIs, allocated for the purposes of training, validation, and testing. The remaining three datasets are dedicated exclusively to testing, each serving a specific function: one for new unseen site T1w MRIs, another for test-retest T1w MRIs, and the last for longitudinal T1w MRIs. Across all datasets, the included subjects are healthy adults, ranging in age from 18 to 95 years.
The multi-site dataset (cf. Table 1) comprises seven publicly available datasets with a total of 4428 T1w MRIs of healthy subjects. Many of these datasets sourced their images from several hospitals or sites, employing a variety of MRI scanners (e.g., GE, Siemens, and Philips) with 1.5T and 3T field strengths. The OASIS 2 and CamCAN datasets were the only ones in which scans were acquired on a single scanner. The incorporation of datasets from multiple sources, sites, and vendors inherently leads to variations in the acquisition pipelines.
All MRIs underwent a visual quality check. Images that did not pass the check (e.g., due to motion artifacts) were excluded (n = 408), while subjects under the age of 18 or with nondisclosed ages were discarded (n = 481). For subjects with multiple T1w scans, we retained the chronologically first non-discarded image. Ultimately, 2504 T1w MRIs were accepted and split into training (n = 2012), validation (n = 245) and test (n = 247) datasets. The distribution of subjects' ages per dataset, as well as within the train/validation/test subsets, is provided in the Supplementary Materials (Appendix A.1). For reproducibility purposes, we have made the subject IDs for each split available in the online project repository.

The unseen site and longitudinal datasets were sourced from a subset of the UK Biobank (UKB) dataset (Miller et al., 2016). We identified 1493 subjects who met the inclusion criteria, which included having two MR scans and no long-standing illnesses. In addition, subjects were required to self-report an overall health rating of excellent or good at both scans. The unseen site dataset comprises the 1493 baseline scans, while the longitudinal dataset includes 2986 T1w MRI scans from both baseline and follow-up sessions. The average time between scans was 2.25 ± 0.12 years. Finally, for the test-retest dataset, we used the OASIS-1 dataset (Marcus et al., 2007), which comprises 316 healthy adults, each with two T1w MRIs acquired within a couple of hours (i.e., test-retest scans).

Performance metrics
The established metric for assessing accuracy in model predictions is the mean absolute error, $\mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N} |y_i - y'_i|$, where $y_i$ denotes the true age and $y'_i$ the predicted age of the $i$-th subject. Additionally, we report the mean error, $\mathrm{ME} = \frac{1}{N}\sum_{i=1}^{N} (y'_i - y_i)$, to detect instances of age under- or over-estimation, with the assumption that ME is normally distributed around zero mean.
A model is considered robust if the MAE within any age subinterval is consistent with the overall MAE. To assess this, we propose the maximal MAE, $\mathrm{mMAE} = \max_j \mathrm{MAE}_j$, where $\mathrm{MAE}_j$ is computed over the $n_j$ samples from the age interval $[a_j, a_{j+1})$ and $\sum_j n_j = N$. To ensure a balanced number of subjects in each interval on the test set, we utilized the intervals [18, 25], (25, 35], …, (75, 85], (85, 100].

A model exhibits high reproducibility if the prediction error is similar on test-retest scans and/or across models of the same architecture trained with different weight initializations. We compute the average standard deviation (SD) of the scan predictions, $\mathrm{SD}(y') = \frac{1}{NS}\sum_{i=1}^{N}\sum_{s=1}^{S} \sqrt{\frac{1}{M-1}\sum_{m=1}^{M} \big(y'^{(m)}_{is} - \bar{y}'_{is}\big)^2}$, where $N$ denotes the number of subjects, $S$ the number of scans per subject, and $M$ the number of repeated model trainings, each with a different weight initialization. Further, $y'^{(m)}_{is}$ denotes the prediction of the $m$-th model for the $s$-th scan of subject $i$, and $\bar{y}'_{is}$ its mean prediction over all $M$ trained models.
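The accuracy and robustness metrics above can be sketched in a few lines of NumPy; the function names and the half-open interval convention are illustrative assumptions, not taken from the reference implementation:

```python
import numpy as np

# Age interval boundaries for mMAE: [18, 25], (25, 35], ..., (85, 100]
EDGES = (18, 25, 35, 45, 55, 65, 75, 85, 100)

def mae(y_true, y_pred):
    """Mean absolute error between true and predicted ages."""
    return float(np.mean(np.abs(np.asarray(y_pred, float) - np.asarray(y_true, float))))

def me(y_true, y_pred):
    """Mean (signed) error; negative values indicate age underestimation."""
    return float(np.mean(np.asarray(y_pred, float) - np.asarray(y_true, float)))

def mmae(y_true, y_pred, edges=EDGES):
    """Maximal MAE over the age subintervals defined by `edges`."""
    y_true = np.asarray(y_true, float)
    y_pred = np.asarray(y_pred, float)
    per_interval = []
    for j, (lo, hi) in enumerate(zip(edges[:-1], edges[1:])):
        # first interval is closed on both ends, the rest are half-open (lo, hi]
        mask = (y_true >= lo if j == 0 else y_true > lo) & (y_true <= hi)
        if mask.any():
            per_interval.append(np.mean(np.abs(y_pred[mask] - y_true[mask])))
    return float(max(per_interval))
```

A model whose errors concentrate in one underrepresented age interval will show an mMAE much larger than its overall MAE, which is exactly the loose-fitting behavior the metric is meant to expose.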
We further computed the mean difference of the per-subject test-retest predictions, $\mathrm{MD} = \frac{1}{NM}\sum_{i=1}^{N}\sum_{m=1}^{M} d'^{(m)}_i$, and the corresponding average SD of the differences, $\mathrm{SD}(d) = \frac{1}{N}\sum_{i=1}^{N} \sqrt{\frac{1}{M-1}\sum_{m=1}^{M}\big(d'^{(m)}_i - \bar{d}_i\big)^2}$, where $\bar{d}_i$ denotes the mean difference for subject $i$ over the $M$ trained models, and $d'^{(m)}_i = y'^{(m)}_{i1} - y'^{(m)}_{i2}$ represents the difference in prediction between the $i$-th subject's first and second scan for the $m$-th model.
The degree of agreement between the predictions of models trained with different weight initializations is further quantified using intraclass correlation coefficient (ICC) (Finn, 1970).
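As an illustration, an ICC can be computed from ANOVA mean squares; the sketch below uses the widely used two-way random-effects, absolute-agreement variant ICC(2,1), which is a common choice for agreement studies but not necessarily the exact coefficient of Finn (1970) used in this work:

```python
import numpy as np

def icc_2_1(x):
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    x: (n_subjects, k_raters) matrix, e.g. predictions of k trained models."""
    x = np.asarray(x, float)
    n, k = x.shape
    grand = x.mean()
    row_means = x.mean(axis=1)                     # per-subject means
    col_means = x.mean(axis=0)                     # per-rater (model) means
    msr = k * np.sum((row_means - grand) ** 2) / (n - 1)   # between-subject MS
    msc = n * np.sum((col_means - grand) ** 2) / (k - 1)   # between-rater MS
    sse = np.sum((x - row_means[:, None] - col_means[None, :] + grand) ** 2)
    mse = sse / ((n - 1) * (k - 1))                # residual MS
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)
```

Perfectly agreeing raters yield an ICC of 1, while disagreement that is large relative to the between-subject variance drives the coefficient toward 0.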
Differences in brain age predictions on successive scans of healthy subjects should be consistent with the time elapsed between scan acquisitions. To evaluate this, we compute the age difference between the baseline (bl) and follow-up (fu) scans of the $i$-th subject, $\Delta y_i = y^{fu}_i - y^{bl}_i$, and the corresponding difference of model predictions, $\Delta y'_i = y'^{fu}_i - y'^{bl}_i$. Subsequently, we define the Mean Difference Error (MdE), Mean Absolute Difference Error (MAdE) and Maximal Mean Absolute Difference Error (mMAdE) as $\mathrm{MdE} = \frac{1}{N}\sum_{i=1}^{N} (\Delta y'_i - \Delta y_i)$, $\mathrm{MAdE} = \frac{1}{N}\sum_{i=1}^{N} |\Delta y'_i - \Delta y_i|$, and $\mathrm{mMAdE} = \max_j \mathrm{MAdE}_j$, using the same age interval boundaries $a_1, a_2, \ldots$ as defined for the robustness assessment.
In case the longitudinal and test-retest data contain $S > 2$ images per subject, the formulations of MD, SD(d), MdE, MAdE and mMAdE can be generalized by averaging across all $\binom{S}{2}$ pairs of scans per subject.
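A minimal sketch of the longitudinal consistency metrics for the paired-scan case (S = 2); binning subjects by their baseline age for mMAdE is an assumption of this sketch:

```python
import numpy as np

def longitudinal_errors(age_bl, age_fu, pred_bl, pred_fu,
                        edges=(18, 25, 35, 45, 55, 65, 75, 85, 100)):
    """Return (MdE, MAdE, mMAdE) for paired baseline/follow-up predictions.

    The per-subject error is the predicted minus the true age difference."""
    age_bl = np.asarray(age_bl, float)
    d_true = np.asarray(age_fu, float) - age_bl          # elapsed time
    d_pred = np.asarray(pred_fu, float) - np.asarray(pred_bl, float)
    err = d_pred - d_true
    mde = float(err.mean())                              # signed bias
    made = float(np.abs(err).mean())                     # magnitude
    mmade = 0.0
    for j, (lo, hi) in enumerate(zip(edges[:-1], edges[1:])):
        mask = (age_bl >= lo if j == 0 else age_bl > lo) & (age_bl <= hi)
        if mask.any():
            mmade = max(mmade, float(np.abs(err[mask]).mean()))
    return mde, made, mmade
```

A negative MdE indicates that the model systematically underestimates the age difference between visits, even when the cross-sectional MAE looks acceptable.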

Statistical analysis
Linear mixed-effects models (LMEMs) were employed to characterize the relationship between the error and the absolute error (AE) as dependent variables, with the model architecture serving as a fixed effect and the subject ID as a random effect.This configuration ensures that all responses from a specific subject are adjusted by a unique additive value corresponding to that subject.By treating the subject ID as a random effect, we effectively accommodated the dependent nature of the data, which stems from generating multiple brain age predictions for the same individual.
For all models, we report the estimated regression coefficients along with their 95% confidence intervals (CIs). To explain the variability in the response variable due to the fixed effect, we performed an Analysis of Variance (ANOVA) on the fitted model and pairwise comparisons between the levels of the fixed factor using the Estimated Marginal Means (EMM) method, with a Tukey adjustment for multiple comparisons.
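On synthetic data, this LMEM setup can be sketched with the statsmodels library; the simulated effect sizes, variable names, and noise levels below are purely illustrative:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n_subj, archs = 200, ["m1", "m2", "m3", "m4"]
subj_intercept = rng.normal(0.0, 1.0, n_subj)   # random effect per subject

rows = []
for i in range(n_subj):
    for j, arch in enumerate(archs):
        # synthetic absolute error: base 3 years plus a per-architecture shift
        ae = 3.0 + 0.5 * j + subj_intercept[i] + rng.normal(0.0, 0.5)
        rows.append({"subject": i, "model": arch, "ae": ae})
df = pd.DataFrame(rows)

# AE as dependent variable, model architecture as fixed effect,
# subject ID as random intercept (groups)
fit = smf.mixedlm("ae ~ model", df, groups=df["subject"]).fit()
print(fit.fe_params)
```

Post hoc EMM pairwise comparisons with a Tukey adjustment are typically performed with dedicated tooling (e.g., R's emmeans); statsmodels offers `pairwise_tukeyhsd` only for the simpler one-way, fixed-effects case.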
To statistically evaluate longitudinal consistency, we ran a t-test of the null hypothesis that the average slope between age estimates on baseline and follow-up T1w scans equals 1.
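A minimal sketch of this slope test on synthetic longitudinal predictions (using SciPy; the simulated 30% underestimation of longitudinal change is purely illustrative):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 500
dt = rng.uniform(2.0, 2.5, n)                 # years between baseline and follow-up
# simulate a model that captures only ~70% of the true longitudinal change
d_pred = 0.7 * dt + rng.normal(0.0, 0.3, n)   # predicted age differences
slopes = d_pred / dt                          # per-subject slope estimates

# H0: the average slope equals 1 (predictions track elapsed time)
t_stat, p_val = stats.ttest_1samp(slopes, popmean=1.0)
```

A significantly negative t-statistic would indicate that the model systematically compresses the longitudinal aging trajectory.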
In all statistical tests, the significance threshold was set at α = 0.05, unless noted otherwise.

T1-weighted image preprocessing
Each input T1w image was converted to the NIfTI format. The raw T1w image underwent adaptive non-local means denoising (Manjón et al., 2010). Next, we performed a 12 degree-of-freedom affine registration using NiftyReg (Modat et al., 2014) to map the denoised T1w image into the 7th generation Montreal Neurological Institute (MNI) atlas space (version 2009c) (Fonov et al., 2009). To improve registration accuracy, intensity inhomogeneity correction (without a mask) was applied to the denoised image using the N4 algorithm (Tustison et al., 2010) prior to running the registration. The intensity-inhomogeneity-corrected, denoised T1w image was used during registration only. With the obtained affine mapping, the denoised T1w image was resampled to the MNI space using sinc interpolation, such that all preprocessed T1w images had a size of 193 × 229 × 193 and isotropic 1 mm spacing.
Finally, a two-step grayscale correction was applied. The first step, (1) intensity windowing, involves the computation of lower and upper thresholds based on the grayscale histogram, smoothed with a Gaussian filter. The lower threshold is set to the location of the histogram's lowest intensity mode plus twice the mode's full width at half maximum (FWHM); note that this mode corresponds to the grayscale values of the background and non-tissue regions of the T1w MRI image. To compute the upper threshold, the grayscale values beyond the 99th percentile are first set to the value of the lower threshold. Inflection points in the intensity distribution from the 50th to the 95th percentile are then identified by computing the second derivative. The upper threshold is defined as the value of the percentile at a selected inflection point, plus three times the median absolute deviation of the pixel intensities above the lower threshold. The second step, (2) intensity inhomogeneity correction, utilizes the N4 algorithm with the MNI152 atlas mask dilated by 3 voxels. From all resulting preprocessed T1w MRI images we removed the non-informative empty space around the head by cropping to a size of 157 × 189 × 170.

UK biobank T1w preprocessing
We utilized the UKB dataset, employing both raw T1w MRI images and preprocessed images obtained through the protocol outlined in Smith et al. (2022). We generated two versions of preprocessed T1w MRI images: (1) the raw T1w defaced images in subject image space were preprocessed as described in the previous section, and (2) the UKB-preprocessed T1w images were spatially mapped from the 6th to the 7th generation MNI atlas (version 2009) (Fonov et al., 2009), using 3rd order interpolation and a linear transformation matrix between the two atlases, pre-computed by FSL's FLIRT (Jenkinson and Smith, 2001; Jenkinson et al., 2002), to ensure that the preprocessed images were registered to the same atlas as used in our preprocessing.

Prediction models
To showcase BASE, we reimplemented four CNN-based brain age models based on their descriptions in the literature. The architectures of the four models are depicted in Fig. 3.
Model 1 (Cole et al., 2017b) was among the first 3D regression CNNs applied to brain age prediction, and was trained and tested on the preprocessed T1w MRIs. Model 2 (Huang et al., 2017) is a multi-channel 2D regression CNN, trained and tested on 15 equidistantly sampled axial slices of the preprocessed T1w as input channels. Model 3 (Ueda et al., 2019) is similar to Model 1, but applied to downsampled 3D T1w. The use of multi-channel 2D or downsampled 3D models may reduce computational complexity with little impact on prediction performance, as suggested by a recent review on DL brain age regression (Tanveer et al., 2023); a hypothesis that we aim to verify.
Finally, Model 4 (Peng et al., 2021) is a fully convolutional classification model, outputting a probability distribution over non-overlapping 2-year age intervals, which reported some of the best brain age prediction results among DL models. It was trained and tested on the preprocessed T1w images, using a weighted sum over the class probabilities to predict the age.
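The age readout of such a classification model can be sketched as an expectation over bin centers; the bin range 18-96 used here is a hypothetical choice for illustration, not taken from the cited work:

```python
import numpy as np

def expected_age(probs, lo=18.0, hi=96.0, width=2.0):
    """Predict age as the probability-weighted sum of 2-year bin centers."""
    centers = np.arange(lo + width / 2.0, hi, width)   # 19, 21, ..., 95
    probs = np.asarray(probs, float)
    assert probs.shape[-1] == centers.size             # one probability per bin
    return float(probs @ centers)
```

This soft-classification readout yields a continuous age estimate even though the network's output is discrete, which is what makes the classification formulation comparable to the regression models.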
All models were implemented in PyTorch 1.4.0 for Python 3.6.8.The details on model training and hyperparameter tuning are presented in the Supplementary materials A.1.

Offset correction
Predicting age on a dataset involving domain shift (i.e., an unseen scanner and/or T1w preprocessing) usually incurs a drop in accuracy, observed as a systematic offset versus the true age. We applied an offset correction for the value of ME, calculated as $y'_{i,\mathrm{corr}} = y'_i - \mathrm{ME}$. The offset correction was applied to all four models, with ME computed on a per-model basis.
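The offset correction itself is a one-liner; a minimal sketch (the function name is ours):

```python
import numpy as np

def offset_correct(y_true, y_pred):
    """Subtract the per-model mean error (ME) measured on the new-site data.

    Only a constant shift is removed: unlike linear age-bias regression,
    the slope and the dispersion of the predictions are left unchanged."""
    y_true = np.asarray(y_true, float)
    y_pred = np.asarray(y_pred, float)
    return y_pred - np.mean(y_pred - y_true)
```

After correction the ME is zero by construction, so any remaining MAE reflects dispersion and trend errors rather than a constant site-related shift.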
Several recent studies have cautioned against the use of linear age-bias correction (de Lange et al., 2022; Butler et al., 2021). These methods involve regressing the age out of the brain age gap, forcing an alignment between predicted and true age even for poorly fitting models; in the worst-case scenario, a poorly fitting model would predict the median age for all subjects. Unlike fitting a linear regression line, offset correction does not force this alignment: it neither corrects a model's inability to capture the linear trend, nor reduces the dispersion of its predictions.

Experiments and results
Our experiments showcase an objective, quantitative, and comparative evaluation of the four DL-based brain age models using BASE in four tasks, each with a corresponding set of data, performance metrics, and statistical analyses, as outlined in the following subsections.

Impact of model architecture
The performance of the four DL model architectures, described in Section 4.2, was evaluated. We trained a total of 20 models on the multi-site dataset, i.e., M = 5 random weight initializations for each of the four models.
The final predictions were obtained by averaging the M = 5 predictions across models with different weight initializations. This so-called mean ensembling strategy has been shown to generally improve model accuracy (Jonsson et al., 2019; Levakov et al., 2020; Peng et al., 2021; Couvy-Duchesne et al., 2020).
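The mean ensembling step reduces to averaging along the model axis; a trivial sketch:

```python
import numpy as np

def mean_ensemble(preds):
    """Average the predictions of M independently trained models.

    preds: array-like of shape (M, N) -- M weight initializations, N subjects."""
    return np.asarray(preds, float).mean(axis=0)
```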
We evaluated the accuracy and robustness of age predictions for the models trained on the multi-site dataset, obtained by the mean ensembling strategy on the multi-site test dataset.We fit a LMEM with AE as the dependent variable, subject ID as a random effect and model architecture as a fixed effect.
Results in Table 2 show that the best accuracy was achieved by the mean ensemble of Model 1, with a MAE of 2.96 years and a ME close to zero. Furthermore, Model 1 also exhibited comparatively small SDs of ME and MAE versus the other models. According to the MAE values, as well as their SDs, Models 1, 3, and 4 performed better than Model 2, since the former input the 3D T1w MRI, whereas the latter inputs subsampled 2D axial slices. The most robust model, according to mMAE, was Model 1. Among the models inputting 3D T1w MRIs, Model 4 performed the worst in terms of accuracy and robustness. Furthermore, we observed that $R^2$ and $r$ have little or no discriminating power to differentiate model performances. An overview of the models' performances on the multi-site dataset, including the MAE, mMAE, and the absolute value of ME, is visually presented in the top-right blue area of the radar plot in Fig. 2.
In evaluating the significance of the observed differences, the LMEM analysis and ANOVA test (F(3, 738) = 7.709, p < 0.001) showed that model architecture had a significant effect on the AE. The exact regression coefficients, their 95% CIs, and ANOVA F-values are reported in Supplementary Table 8. The results of the LMEM post-hoc pairwise analysis are presented in Fig. 4. The AEs of Model 2 were statistically significantly different from those of Models 1, 3, and 4. The EMMs did not significantly differ for the other model pairs.

Performance on unseen site dataset
The four models employing the mean ensembling strategy were applied to the unseen site dataset, using two distinct T1w preprocessing procedures: one identical to the preprocessing of the training dataset (seen) and the other different (unseen) (cf. Section 4.1.1). We predicted the age using all 20 previously trained models (cf. Section 5.1) on the 1493 T1w baseline scans from the UKB dataset. The predictions are showcased as scatter plots in Fig. 5. Performance evaluation in Table 3 shows that, while all models captured the linear trend of aging, a systematic offset parallel to the identity line can be observed. All models underestimated age across the whole age interval, which was especially evident for predictions on data with the unseen T1w preprocessing. The MAE of Models 1 and 4, using the same preprocessing, was 3.73 and 3.65 years, respectively, and increased by less than a year when applied to T1w scans with the unseen preprocessing. This increase was much larger for Models 2 and 3, with the MAE increasing from 4.32 and 3.93 years to almost 10 and 8 years, respectively. The difference was even more pronounced for the mMAE, which increased to over 10 years.
Compared to the results on the multi-site test dataset (Table 2), the MAE of regression Models 1, 2, and 3 increased by about 0.75 years; however, the increase was smallest for the classification Model 4, at 0.4 years.
Offset correction improved both accuracy and robustness metrics (cf. Table 3, top vs. bottom). Compared to results on the multi-site test dataset, the increase in MAE due to the unseen site was 0.34 years, with an additional 0.45 years due to the unseen T1w preprocessing. The offset-corrected metrics for the new site, with both the same and the new T1w preprocessing, are visually summarized in the bottom-right yellow area of the radar plot in Fig. 2. Since the corrected ME equals 0 and is the same for all models, it was not included in the plot.
Statistical evaluation involved fitting two LMEMs on the offset-corrected mean ensemble predictions on the unseen site dataset: one for predictions with the same T1w preprocessing and one for the unseen T1w preprocessing. The LMEMs were fit with AE as the dependent variable, subject ID as the random effect, and model architecture as the fixed effect.
The results of the ANOVA showed that model architecture was significant for both the same (F(3, 4476) = 55.7, p < 0.001) and the unseen T1w preprocessing (F(3, 4476) = 53.9, p < 0.001). The post-hoc pairwise differences between the EMMs of the LMEM fit on data with the same T1w preprocessing showed a statistically significant difference between Model 2 and all other models (p < 0.001) and between Models 1 and 3 (p = 0.041). However, the post-hoc pairwise analysis on the unseen T1w preprocessing data showed a statistically significant difference between all pairs (p < 0.001; p = 0.009 between Models 3 and 4), except between Models 1 and 3 (p = 0.658). Coefficient estimates and their 95% CIs are reported in Supplementary Table 9.
The LMEM and ANOVA analyses were also run with the sex variable and its interaction with model architecture as fixed effects. ANOVA indicated no significant differences in MAE with respect to sex (F(1, 245) = 0.004, p = 0.952), nor statistical significance for the interaction of sex and model architecture (F(3, 735) = 0.004, p = 0.203) (results not shown). These findings indicate that the accuracy of age predictions remains stable across sex groups.

Test-retest reproducibility
Using brain age as a biomarker necessitates consistent age predictions on MRIs taken within a short time span, i.e., low intra-model variance, despite potential accuracy bias. To verify this, we applied all 20 models from the experiment described in Section 5.1 to obtain age predictions on the test-retest dataset. We then computed the reproducibility metrics and conducted statistical analyses using LMEM and ANOVA.
The reproducibility results are summarized in Table 4 for the five trained models (M = 5) and two scans (S = 2) per subject. The average difference between the first and second scan, MD, ranged from −0.03 years for Model 1 to −0.10 years for Model 2. The average standard deviation of scan predictions, SD(y'), was lowest for Model 2 at 1.97 years, followed by Model 1 at 2.02 years. Fig. 6 displays the age prediction difference between the two scans for each subject, where each of the five points per subject represents one of the M = 5 models with different weight initializations. The difference in age predictions remained consistent within subjects, with values close to 0, although for some subjects the age prediction difference reached up to four years. Notably, for Model 4 there was minimal within-subject variation, indicating that the large differences in age prediction were consistent across all M = 5 models. As a result, the average standard deviation of differences, SD(d), was lowest for Model 4 (cf. Table 4), at 0.63 years.
The agreement in the predicted difference among the n = 5 models was computed using the ICC, with Model 2 achieving the highest level of agreement at an ICC of 0.59 (cf. Table 4). However, the results showed moderate to poor reliability for all four models. Yet, the ICC for each individual T1w scan was excellent for all models, ranging from 0.95 for Model 3 to 0.98 for Models 1, 2 and 4. We infer that the differences stem from the quality of the input T1w scans, especially the lower input resolution of Model 3, and that the models generally exhibit good reproducibility. The values of all metrics in Table 4 are visually summarized in the bottom-left red area of the radar plot in Fig. 2.
The observations above are supported by statistical analyses. Specifically, we fitted an LMEM with the prediction difference as the dependent variable, subject ID as a random effect, and model architecture as a fixed effect. Pairwise marginal means show that none of the paired differences are statistically significant (p > 0.05). The ANOVA test did not identify model architecture as statistically significant (F(3, 3531) = 2.097, p = 0.098). The exact regression coefficients, their 95% CIs, and ANOVA F-values are reported in Supplementary Table 10.
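The reproducibility quantities above can be computed directly from a table of predictions. The sketch below uses hypothetical values; the subject IDs and the arrangement of scans and seeds are assumptions for illustration only.

```python
from statistics import mean, stdev

# Hypothetical test-retest predictions: for each subject, two scans,
# each predicted by the n = 5 differently initialized models.
preds = {
    "s1": [[70.1, 70.5, 69.8, 70.3, 70.0], [70.4, 70.9, 70.1, 70.6, 70.2]],
    "s2": [[55.2, 54.8, 55.5, 55.0, 55.1], [54.9, 54.5, 55.2, 54.7, 54.8]],
}

# Average scan-to-scan prediction difference, pooled over subjects and seeds.
diffs = [s2 - s1 for scan1, scan2 in preds.values()
         for s1, s2 in zip(scan1, scan2)]
mean_diff = mean(diffs)

# Average standard deviation of the n = 5 predictions for a single scan.
sd_pred = mean(stdev(scan) for scans in preds.values() for scan in scans)
```

With real data, `mean_diff` corresponds to the average first-to-second-scan difference reported in Table 4, and `sd_pred` to the average standard deviation of scan predictions.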

Longitudinal consistency
All 20 models trained on the multi-site dataset (Section 5.1) were applied to the longitudinal dataset (which had the same T1w preprocessing as the multi-site dataset). Subsequently, mean ensemble predictions were computed and consistency metrics evaluated. Results in Table 5 show that the MdE (i.e., the ME between the actual and predicted age difference) was generally negative, and all models underestimated the age difference between visits, with values ranging from 0.52 to 0.9 years. Model 4 achieved the best longitudinal accuracy and robustness (lowest MAdE and mMAdE values, respectively), despite exhibiting the largest bias, as indicated by the highest MdE. The MAdE corresponded to 50%-90% of the average age span between scans. Values of the three metrics from Table 5 are visually summarized in the top-left green area of the radar plot in Fig. 2. Fig. 7 shows the age trajectories based on the chronological versus the predicted brain age between the baseline and follow-up visit for approximately 60 randomly chosen subjects and their corresponding T1w scans from the UKB test set. We expect the slopes to be close or equal to that of the identity line (dashed diagonal line in Fig. 7). For Model 2, the subject-specific age differences follow a rather random pattern, while for Models 1 and 4, the majority of the subject-specific lines appear parallel to the identity line.
For statistical confirmation, the slope for the i-th subject was evaluated as k_i = (y'_2 − y'_1) / (y_2 − y_1), where (y_1, y'_1) denote the baseline age and its estimate and (y_2, y'_2) denote the follow-up age and its estimate. The null hypothesis that the average slope across subjects equals 1 was rejected for all models. The average slope was closest to 1 for Model 4, at 0.96.
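The subject-level slope test can be sketched as follows. The visit data are hypothetical, and a plain one-sample t statistic against a slope of 1 stands in for the exact testing procedure used in the study.

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical visits: (baseline age, follow-up age, and their predictions).
visits = [
    (60.0, 62.0, 61.0, 62.6),
    (70.0, 72.5, 71.2, 73.4),
    (55.0, 57.0, 54.1, 56.2),
    (65.0, 67.5, 66.0, 68.3),
]

# Per-subject slope of predicted vs. chronological age change: k_i.
slopes = [(p2 - p1) / (y2 - y1) for y1, y2, p1, p2 in visits]

# One-sample t statistic against the ideal slope of 1.
n = len(slopes)
t_stat = (mean(slopes) - 1.0) / (stdev(slopes) / sqrt(n))
```

An average slope below 1 (here about 0.91) yields a negative t statistic, i.e., the models compress the true within-subject age change.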
Finally, we fit an LMEM with the MAdE as the dependent variable, model architecture as a fixed factor and subject ID as a random factor. The estimated coefficients significantly differ between architectures. Additionally, the ANOVA was statistically significant (F(3, 4476) = 109.26, p < 0.001). Post-hoc pairwise analysis of the EMMs showed statistical significance between all pairs (p < 0.001) except Models 1 and 4 (p = 0.248). Exact coefficients are presented in Supplementary Table 11.

Discussion
We proposed the Brain Age Standardized Evaluation or BASE protocol and showcased a comprehensive, objective, quantitative, and reproducible validation and comparison of four DL-based brain age prediction models. The principal results of using BASE are visually summarized in Fig. 2. The proposed datasets and evaluation protocol in BASE represent a framework that ensures reproducibility across different studies, as it considers and tackles the confounding factors impacting the variability of results. Namely, the use of heterogeneous, multi-site and multi-source datasets induces variability of results caused by MRI scanner-specific and biological (subject) variability, while the use of multiple T1w preprocessing pipelines induces variability of results caused by the use of specific tools and implementations. To account for model (epistemic) uncertainty, we adopted repeated model training using five different seeds for random model weight initialization and incorporated this in a statistical framework based on LMEMs.

Table 6
The overall performance rank and individual task rankings for each model. Within each task, as shown in Fig. 2, models were first ranked based on each metric. Subsequently, an average rank was computed for each task by aggregating these metric ranks. For the final overall ranking, the average ranks from all tasks were consolidated, resulting in an overall model ranking.
We introduced the BASE evaluation in conjunction with four datasets, each corresponding to a specific aspect. When provided with a suitable dataset, BASE can be applied to various other datasets, including those from other modalities such as functional and diffusion tensor MRI and positron emission tomography. However, the results from this study, as from any other, are directly comparable only when applied to the same datasets, subjected to identical preprocessing procedures. Alterations in dataset attributes or variations in preprocessing can have a significant impact on model outcomes. Although model rankings based on accuracy largely remained the same when changing the preprocessing, there can be variations in MAE values, which may hinder comparisons across studies (Dular et al., 2023).
We developed a detailed set of performance metrics tailored to evaluate the accuracy, robustness, reproducibility, and consistency of brain age models. Depending on the research objectives, specific components of the BASE evaluation can be favored. For example, given its best ranking in longitudinal consistency (cf. Table 6) and its comparable reproducibility, Model 4 emerges as the prime choice for patient monitoring. Considering its accuracy and robustness across both known and unseen sites, Model 1 is best suited for population studies, out of the models compared.

Accuracy and robustness
In addition to the MAE, the main metric used in brain age estimation, we propose the inclusion of the ME as a complementary measure. The ME allows assessing the offset across the entire age interval, which is particularly insightful when models are applied to an unseen site dataset (Section 5.2). Furthermore, we recommend reporting the standard deviations of the MAE and ME so as to evaluate model precision. While many studies report the MAE along with its standard deviation, it is essential to clarify that this standard deviation is typically computed over the MAE values obtained from repeated model training with different weight initializations or cross-validation folds, rather than across all subjects. The former provides insight into model reproducibility, while the latter offers information on prediction dispersion. In this paper, we argue for and recommend reporting the latter, as it provides valuable information on prediction variability.
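A minimal sketch of these accuracy metrics, computed across subjects rather than across repeated trainings (the ages and predictions below are hypothetical):

```python
from statistics import mean, stdev

# Hypothetical chronological ages and one model's predictions.
age  = [25.0, 40.0, 55.0, 70.0, 85.0]
pred = [27.1, 41.5, 54.2, 68.3, 82.9]

errors = [p - a for p, a in zip(pred, age)]
me  = mean(errors)                       # signed bias (offset)
mae = mean(abs(e) for e in errors)       # accuracy
# Dispersion across subjects (not across repeated trainings):
sd_me  = stdev(errors)
sd_mae = stdev([abs(e) for e in errors])
```

Here the ME is near zero while the MAE is not, illustrating why the two metrics carry complementary information.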
We introduced the robustness metric mMAE, where a large discrepancy between the MAE and mMAE can serve as an indicator that the MAE is biased due to differences in age structure or age span between the training and test datasets. For instance, Han et al. (2022) reported an overall MAE of 3.72 years, where low MAEs of 2.86 and 2.97 years were obtained on two large pediatric datasets (age < 22) with over 10,000 subjects, but a large MAE of 5.35 years on 252 adults up to the age of 60.
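The exact definition of mMAE is not restated here; the sketch below assumes it averages per-age-bin MAEs so that every age range contributes equally, which illustrates how such a metric exposes bias hidden by a skewed age structure:

```python
from statistics import mean

# Assumed definition: average per-age-bin MAEs so each age range
# contributes equally, regardless of how many subjects it holds.
def mmae(ages, preds, bin_width=10):
    bins = {}
    for a, p in zip(ages, preds):
        bins.setdefault(a // bin_width, []).append(abs(p - a))
    return mean(mean(errs) for errs in bins.values())

# A young-skewed toy dataset: many accurate young subjects, one poor older one.
ages  = [20, 21, 22, 23, 24, 65]
preds = [21, 22, 21, 24, 25, 59]

overall_mae = mean(abs(p - a) for a, p in zip(ages, preds))  # pulled down by the skew
robust_mae  = mmae(ages, preds)                              # exposes the 6-year miss
```

The overall MAE looks favorable because young subjects dominate, while the bin-averaged value reveals the poor performance at higher ages.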
While the RMSE, r, and R² are commonly reported in brain age studies, we did not include them in BASE. Note that the information in the RMSE is largely captured by the standard deviation of the ME, and is thus redundant. Furthermore, it may take up to four decimals of r to detect a difference in model performance (He et al., 2022b), whereas the proposed metrics are more sensitive to differences in performance.

Performance on unseen site dataset
Brain age models are generally applied to new (unseen) cohorts, where the anticipated goal is to estimate the brain age gap between healthy individuals and those with a specific condition; consequently, the model needs to provide accurate age assessments for healthy controls.
The observed drop in performance on unseen site datasets, i.e., an increase in MAE of about 0.7 years, aligns with the existing brain age literature. For instance, Feng et al. (2020) found a minor increase in MAE of 0.15 years, Jonsson et al. (2019) larger increases of about 3 and 5 years on two unseen datasets, and Dartora et al. (2022) increases of 0.92 and 3.04 years on two unseen datasets for a model trained on minimally preprocessed T1w images. As our results show, differences in the T1w preprocessing contribute a substantial additional drop in performance, such as an increase in MAE above 1 year.
Ranking models according to the MAE may be relevant for best model selection if the MAE increases are consistent. He et al. (2021) evaluated the performance of three distinct models on three unseen datasets and observed an overall increase in MAE of approximately 0.7 and up to one year. Despite the changes in MAE, the rank order of the models' accuracy across the three datasets remained consistent.
Our findings mirror this: similar performance ranks were observed on the unseen as well as the seen data, and even on unseen data with different T1w preprocessing. The increase in MAE was systematic across the entire age span, but varied depending on the model and dataset. This observation is apparent from Fig. 5, which shows Models 1 and 4 to be less susceptible to changes in the dataset and T1w preprocessing. Furthermore, all models tend to perform better on datasets that resemble the T1w preprocessing of the training set (Dular et al., 2023).

Offset correction
As a result of regression dilution, researchers often observe a systematic over- and under-estimation of brain age at the lower and upper ends of the dataset age span. To alleviate this phenomenon, many researchers apply a post-hoc correction of the predictions in the form of a (linear) bias correction (de Lange et al., 2019; Peng et al., 2021; Cole et al., 2017a; Smith et al., 2019; Cheng et al., 2021; Dunås et al., 2021), fitting a regression line on the training or validation dataset. However, recent studies (de Lange et al., 2022; Butler et al., 2021) caution against the use of such corrections, since they can inflate performance metrics.

L. Dular et al.
Upon visual inspection of Fig. 5, the increase in MAE seems systematic across the whole age span and specific to the model and dataset. This systematic offset was also reported by Franke and Gaser (2012), who proposed that the increase in MAE is dataset-specific, resulting in a consistent offset in subjects' age predictions across multiple scanning time points.
We propose correcting this offset when testing on a new unseen site dataset; however, the offset-uncorrected model predictions should always be inspected and reported in order to evaluate the validity of the predicted brain age estimates.
The offset correction does not compromise the reproducibility and consistency metrics, while the accuracy and robustness metrics are improved. It is important to note that a model with poor predictive power and a significant bias towards the mean will still yield poor performance, even after the offset correction. Unlike fitting a linear regression model, our approach does not result in overly optimistic performance.
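A constant-offset correction of this kind can be sketched as follows; the unseen-site ages and predictions are hypothetical, and, in contrast to a fitted regression line, only a constant is subtracted:

```python
from statistics import mean

# Hypothetical unseen-site ages and predictions with a systematic offset.
age  = [30.0, 45.0, 60.0, 75.0]
pred = [33.2, 48.1, 62.9, 78.2]

# Estimate the constant offset (the ME) and subtract it from every prediction.
offset = mean(p - a for p, a in zip(pred, age))
corrected = [p - offset for p in pred]

mae_before = mean(abs(p - a) for p, a in zip(pred, age))
mae_after  = mean(abs(p - a) for p, a in zip(corrected, age))
```

Because only a constant is removed, a model with low variance but a bias towards the mean keeps its poor discriminative behavior after correction.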

Reproducibility
We demonstrate that the most accurate Model 1 is not necessarily the most reproducible, as can be clearly observed from Fig. 2. Specifically, Model 4 achieved the smallest average standard deviation of the predicted age difference, as well as one of the highest ICC values. Surprisingly, despite its poor accuracy, Model 2 exhibited the lowest average variability in age predictions for models trained with different weight initializations. Reproducibility metrics are invariant to offset by design, as the aim is to focus on a model's ability to reproduce the same prediction; models with low variance but potentially high bias will still perform well. Thus, these metrics should be viewed as complementary to accuracy metrics, rather than a replacement for them.
The reported ICC values above 0.94 are comparable to the 0.9 reported by Franke and Gaser (2012). Despite an ICC of up to 0.99, the standard deviation of age predictions for a single MRI was at best 1.97 years, which is comparable to the 1.88 years reported by He et al. (2021). The sensitivity of model training to becoming trapped in local optima might present a significant challenge to using brain age as an individualized clinical biomarker. Employing model ensembling appears to be a promising strategy to mitigate the effects of random model weight initialization.

Consistency
The evaluation of consistency encompasses the use of baseline and follow-up T1w MRIs, assessing the accuracy and robustness of the predicted age differences using the MdE, MAdE, and mMAdE metrics, analogous to the ME, MAE, and mMAE metrics. Despite achieving accurate and reproducible results, all tested models often fall short when predicting age differences across longitudinal data. We found the mean slope values to be statistically different from the ideal value of 1, with even the best-performing models exhibiting an average age difference error of 1.2 years, which is about half of the actual average time difference of 2.25 years.
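Assuming the MdE and MAdE mirror the ME and MAE applied to visit-to-visit age differences, a minimal sketch with hypothetical values is:

```python
from statistics import mean

# Hypothetical follow-up-minus-baseline age differences and their predictions.
actual_diff = [2.0, 2.5, 2.0, 2.5]
pred_diff   = [1.2, 1.9, 1.4, 2.1]

d_err = [p - a for p, a in zip(pred_diff, actual_diff)]
mde  = mean(d_err)                      # negative => differences underestimated
made = mean(abs(e) for e in d_err)      # accuracy of the predicted difference
```

A negative MdE of this kind corresponds to the compression of within-subject aging trajectories visible in Fig. 7.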
There is a clear need to design models specifically tailored to address consistency. Incorporating longitudinal data might offer a solution, as it could enable modeling individual aging trajectories (Levakov et al., 2020). Dartora et al. (2022) used multiple images per subject in the training dataset, and their visual results appear more desirable compared to the results of this study. However, an objective and quantitative evaluation using the proposed consistency metrics is needed before drawing conclusions.
Given that longitudinal data are scarce, DL-based data augmentation could be leveraged. For instance, Fu et al. (2023) developed a methodology for generating anatomically plausible images to fill in missing data in longitudinal cohorts. This approach could prove beneficial in enhancing the dataset for better model performance.

Study reproducibility: Data, code and BASE protocol
The standardized dataset comprises multi-site train, validation, and test T1w scans from 2504 healthy subjects. Additionally, there are two test sets: one with previously unseen site longitudinal T1w MRIs (1493 subjects, 2986 scans) and another with test-retest T1w MRIs from 316 subjects, ranging from 18 to 94 years of age. All T1w MRI scans used in this study were sourced from public datasets. Every scan underwent a rigorous visual quality assessment to exclude low-quality scans or those with unsuccessful T1w preprocessing.
To ensure the reproducibility of our study, we have disclosed in the public GitHub repository the subject ID lists, dataset splits, the implementations and dependencies of the T1w preprocessing routines, the brain age regression models, and scripts to re-run the experiments and carry out the performance evaluations and statistical analyses. Using the BASE implementation, other researchers may evaluate novel models and techniques in a standardized manner.
Although a large public dataset for brain age, dubbed OpenBHB (Dufumier et al., 2021, 2022), has recently become available, it falls short in certain critical aspects of brain age performance assessment. Specifically, the OpenBHB dataset lacks longitudinal and test-retest datasets, which are essential for the assessment of consistency and reproducibility as per the BASE protocol. Hence, there was a need to introduce a new dataset. Moreover, OpenBHB has a biased age structure, since 40% of its MRI scans are from subjects aged between 20 and 25 years, skewing the mean age to 25 years, as compared to 52 in our dataset. This pronounced age bias can lead to an artificially low MAE, as brain age predictions are generally more accurate at lower ages, thereby presenting an overoptimistic and biased evaluation of a model's performance across the 18-95 years age span.

Statistical framework
Point estimates of performance metrics like the MAE, which are usually reported in the brain age literature, need to be statistically evaluated to enable drawing generalizable conclusions. For this purpose we used LMEMs, as they account for repeated measures at the subject level by including the subject ID as a random effect. Our results show that despite an observed difference in MAE point estimates, the difference may not be statistically significant. For instance, when comparing the performances of Models 1 and 4 (cf. Table 2), the seemingly relevant difference in MAE values of about 0.3 years was not statistically significant (cf. Fig. 4).
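The core benefit of treating subject ID as a random effect, comparing models within subjects rather than on pooled point estimates, can be illustrated with a simple paired comparison on hypothetical per-subject absolute errors; a full LMEM would instead be fit with a package such as statsmodels (`MixedLM`) or lme4:

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical per-subject absolute errors for two models on the same subjects.
err_m1 = [2.0, 5.0, 1.0, 6.0, 2.5]
err_m4 = [3.5, 4.0, 2.5, 5.0, 3.0]

mae_gap = mean(err_m4) - mean(err_m1)           # 0.3-year MAE difference

# Within-subject (paired) differences, as a subject random effect induces.
d = [a - b for a, b in zip(err_m4, err_m1)]
t_paired = mean(d) / (stdev(d) / sqrt(len(d)))  # small despite the MAE gap
```

Here the 0.3-year gap in MAE point estimates is swamped by the within-subject variability, so the paired statistic is far from significant, echoing the Models 1 vs. 4 comparison above.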

Study limitation
In this study, we have concentrated our efforts on a select group of four CNN-based models, each showcasing significant variations in terms of input dimensionality, image resolution, and output representation. While this selection enables a clear and focused introduction of BASE, providing insights into its operation across different models and application scenarios, we acknowledge that it does not cover the exhaustive array of available model architectures, including various branches of convolutional networks and emerging transformer architectures. While a broader comparison could potentially yield a more comprehensive understanding of the BASE approach, our intention was to introduce BASE with clarity and precision, demonstrating its applicability. We encourage future work in this area to apply BASE, either partially for a specific application or in whole, across a broader spectrum of models.

Conclusion
In this study we proposed and demonstrated the application of the Brain Age Standardized Evaluation or BASE. BASE comprises the datasets, performance metrics and an evaluation protocol. Using BASE, we evaluated four state-of-the-art deep regression brain age models in aspects such as accuracy and robustness on multi-site, unseen site and differently preprocessed T1w MRIs, reproducibility on test-retest, and consistency on longitudinal T1w scans. Our study is fully reproducible, as the dataset information and code are made publicly available at https://github.com/AralRalud/BASE.git.

Data availability
Data used in the study were obtained from public data sources. Some require online registration to gain access to the MRI scans, while acquiring the UK Biobank dataset necessitates a fee payment.

Declaration of Generative AI and AI-assisted technologies in the writing process
During the preparation of this work the authors used ChatGPT-4 in order to improve the readability of this paper. After using this tool/service, the authors reviewed and edited the content as needed and take full responsibility for the content of the publication.

[...] Model 1 with batch sizes 4, 8, 16 and 24, and Model 4 with batch sizes 4 and 8. Hyperparameter values were selected based on their associated model performance, which was evaluated using the median MAE on the validation set across the last 10 epochs. The chosen hyperparameters are presented in Supplementary Table 7.

A.1.4. Data augmentation
In all our experiments, models were trained using the following data augmentation: (1) random shifting along all major axes with probability 0.3 by an integer sampled from [−s, s], where s = 3 for Model 3 and s = 5 for Models 1, 2, and 4; (2) random padding with probability 0.3 by an integer from the range [0, p], where p = 2 for Model 3 and p = 5 for Models 1, 2, and 4; (3) flipping over the central sagittal plane with probability 0.5. Note that the smaller values of the parameters s and p for Model 3 in comparison to the other three models are due to the difference in input sizes between the models as a result of downsampling.
The image size for the 2D Model 2 and for Model 3, both trained on downsampled images, was adapted during augmentation. Namely, for Model 2 the 15 axial slices (predefined in atlas space) were sampled to obtain an input image of size 157 × 189 × 15, while for Model 3 the input images were downsampled using sinc resampling and cropped to size 95 × 79 × 78.
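The augmentation policy above can be sketched as follows, reduced to a 1D toy signal for brevity; the circular shift and the parameter names s and p are simplifications of the 3D implementation:

```python
import random

# Toy 1D version of the augmentation: (1) shift, (2) pad, (3) flip.
# A circular shift stands in for the 3D translation used on real volumes.
def augment(x, s=5, p=5, rng=None):
    rng = rng or random.Random()
    if rng.random() < 0.3:                  # (1) random shift in [-s, s]
        k = rng.randint(-s, s)
        x = x[-k:] + x[:-k] if k else x
    if rng.random() < 0.3:                  # (2) random padding in [0, p]
        n = rng.randint(0, p)
        x = [0] * n + x + [0] * n
    if rng.random() < 0.5:                  # (3) flip (the sagittal mirror)
        x = x[::-1]
    return x
```

Whichever transforms fire, the original intensities are preserved (shifted, zero-padded, or mirrored), which is the property the 3D pipeline relies on.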

A.1.5. Weighted training
Weighted training is a strategy of assigning higher sampling probabilities to subjects in underrepresented age categories, such that the expected number of samples from each age category becomes equal. Since Model 4 is defined as a classification model, it is susceptible to higher errors in underrepresented age classes. The use of weighted training improved the predictions of Model 4 on the multi-site validation set for ages > 80 years, but not those of the other three models, which was confirmed by an LMEM (results not shown).
Specifically, we applied a weighted random sampler with replacement during training, assigning each subject in age category c a weight proportional to 1/n_c, where n_c denotes the number of samples in category c. Subjects were split into the age categories [18, 20), [20, 25), [25, 30), . . . , [85, 90), [90, 100), as previously proposed by Feng et al. (2020). The number of sampled subjects was kept equal to the total number of training subjects, so that the number of samples per training epoch was the same as in the experiments without weighted training.
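A sketch of such a weighted sampler, using a hypothetical toy cohort and coarser 5-year bins than the categories above:

```python
import random
from collections import Counter

# Toy young-skewed cohort; each subject's weight is proportional to 1/n_c,
# where n_c is the number of subjects in its (coarse, 5-year) age category.
ages = [19, 21, 22, 23, 24, 67, 88]
cats = [a // 5 for a in ages]
n_per_cat = Counter(cats)
weights = [1 / n_per_cat[c] for c in cats]

# Sample with replacement: every category is drawn equally often in expectation.
rng = random.Random(0)
draws = rng.choices(range(len(ages)), weights=weights, k=30000)
drawn_cats = Counter(cats[i] for i in draws)
```

With four occupied categories, each receives roughly a quarter of the 30,000 draws, even though five of the seven subjects fall in the early-20s bin.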

A.2. Detailed results of statistical analyses
In the following Tables 8-11 we show detailed results of the ANOVA tests and LMEMs as performed in the respective Sections 5.1-5.4.

Fig. 1. Outline of the tasks in the model training and tuning, and model evaluation phases. The former involves model training, hyperparameter tuning, repeated model training with different weight initializations, and prediction ensembling.


Fig. 2 .
Fig. 2. The principal results of BASE are visualized in the form of a radar plot.Values closer to the plot's center indicate better performance, therefore a tighter envelope indicates a better overall performance for a particular model.

Fig. 3 .
Fig. 3. Architecture of the four reimplemented CNN models for the task of brain age prediction.


Fig. 5 .
Fig. 5. Mean ensemble based age predictions on the unseen dataset with the same and unseen T1w preprocessing (upper and bottom rows, respectively). The corrected offset is marked with the red line.

Fig. 7 .
Fig. 7. Age trajectories between the baseline and the follow-up T1w scans based on the true and predicted age (for 60 randomly chosen subjects, one colored line per subject). The p-values reject the hypothesis that the average slope equals 1 (cf. text for details).

Fig. 8 .
Fig. 8. Density of the age distribution for each and the combined multi-site dataset, depicted for the train, test and validation set splits.

Table 1
Dataset information, including age statistics such as span, mean age, and associated standard deviation in years, is provided per dataset for the included T1w subject scans in the train, test, and validation datasets (top), as well as for the new unseen site, test-retest and longitudinal datasets (bottom).
h OASIS-1: https://www.oasis-brains.org/. i Data used in the preparation of this article were obtained from the Alzheimer's Disease Neuroimaging Initiative (ADNI) database (adni.loni.usc.edu). The ADNI was launched in 2003 as a public-private partnership, led by Principal Investigator Michael W. Weiner, MD. The primary goal of ADNI has been to test whether serial magnetic resonance imaging (MRI), positron emission tomography (PET), other biological markers, and clinical and neuropsychological assessment can be combined to measure the progression of mild cognitive impairment (MCI) and early Alzheimer's disease (AD).

Table 2
Evaluation of brain age prediction for the four DL models on the multi-site test set. The best metric results with respect to model architecture (in rows) are marked in bold. All numbers are in years.

Table 3
Accuracy and robustness metrics for the previously unseen UKB dataset. The best metric results with respect to model architecture (in rows) are marked in bold. All numbers are in years.

Table 4
Reproducibility metrics for the mean ensemble and the intraclass correlation coefficient (ICC) for the models trained with n = 5 different weight initializations. All values are in years; the best values are highlighted in bold.
Fig. 6. Predicted age difference (vertical axis) between two scans for a subset of subjects (horizontal axis). Each point represents one of the five models, trained with different weight initializations. Subjects were arranged in ascending order of age (from left to right), with every tenth individual selected for plotting. For each model, the average predicted age difference (computed over all subjects, not just the plotted ones) is marked in green, and the 95% CI is indicated by red lines.

Table 5
Consistency metrics for the mean ensemble age predictions on the longitudinal dataset. The MAdE intervals are set based on the age at baseline. All numbers are in years; the best are marked in bold.

Table 7
Proposed hyperparameter values in the original literature and the values implemented herein. Only the hyperparameters marked with * were reevaluated. The resulting model accuracy is reported as MAE in years. The test MAE of the implemented models is presented as the median, minimal and maximal values over the last 10 epochs of model training.

Table 9
Results of the LMEM and ANOVA of the ensemble model performance on the new site dataset (Section 5.2), with the same and different preprocessing than the one used in model training, with the absolute error as the response variable, model architecture as a fixed factor and subject ID as a random factor: |y′ − y| = model + (1 | subject ID).

Table 10
Results of the LMEM and ANOVA evaluating the four models on the test-retest dataset (Section 5.3). The predicted difference serves as the response variable, with model architecture as the fixed factor and subject ID as the random factor: d′ = model + (1 | subject ID).

Table 11
Results of the LMEM and ANOVA on the same site longitudinal dataset and the new site longitudinal dataset (Section 5.4), with the absolute difference error as the response variable, model architecture as a fixed factor and subject ID as a random factor: |Δy′ − Δy| = model + (1 | subject ID).