Reliability and sensitivity of two whole-brain segmentation approaches included in FreeSurfer – ASEG and SAMSEG

Accurate and reliable whole-brain segmentation is critical to longitudinal neuroimaging studies. We undertake a comparative analysis of two subcortical segmentation methods, Automatic Segmentation (ASEG) and Sequence Adaptive Multimodal Segmentation (SAMSEG), recently provided in the open-source neuroimaging package FreeSurfer 7.1, with regard to reliability, bias, sensitivity to detect longitudinal change, and diagnostic sensitivity to Alzheimer’s disease. First, we assess intra- and inter-scanner reliability for eight bilateral subcortical structures: amygdala, caudate, hippocampus, lateral ventricles, nucleus accumbens, pallidum, putamen and thalamus. For intra-scanner analysis we use a large sample of participants (n = 1629) distributed across the lifespan (age range = 4–93 years) and acquired on a 1.5T Siemens Avanto (n = 774) and a 3T Siemens Skyra (n = 855) scanners. For inter-scanner analysis we use a sample of 24 participants scanned on the day with three models of Siemens scanners: 1.5T Avanto, 3T Skyra and 3T Prisma. Second, we test how each method detects volumetric age change using longitudinal follow up scans (n = 491 for Avanto and n = 245 for Skyra; interscan interval = 1–10 years). Finally, we test sensitivity to clinically relevant change. We compare annual rate of hippocampal atrophy in cognitively normal older adults (n = 20), patients with mild cognitive impairment (n = 20) and Alzheimer’s disease (n = 20). We find that both ASEG and SAMSEG are reliable and lead to the detection of within-person longitudinal change, although with notable differences between age-trajectories for most structures, including hippocampus and amygdala. In summary, SAMSEG yields significantly lower differences between repeated measures for intra- and inter-scanner analysis without compromising sensitivity to changes and demonstrating ability to detect clinically relevant longitudinal changes.


Introduction
Automated techniques for whole-brain segmentation have become extremely useful in the study of a range of brain diseases and conditions, such as Alzheimer's disease (AD) (Chételat, 2018), and also normal changes such as in development (Ostby et al., 2009) and aging (Wonderlick et al., 2009). Automated techniques enable processing of large numbers of magnetic resonance imaging (MRI) scans with limited operator investments, enabling detailed segmentations of brains from large-scale brain imaging initiatives. One of the most extensively used whole-brain segmentation approaches is Automatic Segmentation (ASEG) (Fischl et al., 2002), distributed as part of FreeSurfer (http://freesurfer.net/) . FreeSurfer ASEG is a core tool in large-scale neuroimaging projects such as the UK Biobank (≈ 40.000 scans to date) (Alfaro-Almagro et al., 2018), ABCD (≈ 10.000 scans to date) (Hagler et al., 2019), ADNI (> 20.000 scans) (Jack et al., 2008), ENIGMA (> 50.000 scans) (Thompson et al., 2020), and Lifebrain (≈ 10.000 scans) (Walhovd et al., 2018). Although the accuracy of automated segmentation techniques such as ASEG is generally high and enables detection of longitudinal changes (Mulder et al., 2014;Worker et al., 2018), reports have suggested that segmentation accuracy may vary as a function of variables such as age (Wenger et al., 2014) and brain size (Herten et al., 2019;Schoemaker et al., 2016). Hence, continued efforts are undertaken to improve accuracy and reduce bias in the segmentations.
Similar to many other current whole-brain segmentation techniques, ASEG is based on supervised models of T1-weighted images. As signal intensities alone are not sufficient to distinguish between different neuroanatomical structures from a T1-weighted MRI, an atlas containing probabilistic information about the location of structures is used to determine the relationship between intensities and neuroanatomical labels in particular regions of the brain. The probabilistic atlas is generated from a set of manually labeled training images. The segmentation problem is then solved in a Bayesian framework in which local shape, position and appearance all contribute to the probability of a given label. Recently, an alternative approach was suggested -Sequence Adaptive Multimodal Segmentation (SAMSEG)which uses generative parametric models (Puonti et al., 2013(Puonti et al., , 2016. Unlike ASEG, SAMSEG uses a mesh-based computational atlas combined with a Gaussian appearance model to achieve independence of specific image contrast by grouping together voxels with similar intensities (Van Leemput, 2009). SAMSEG is less computationally demanding than other iterative segmentation methods since no preprocessing is needed and only a single, efficient non-linear registration of the atlas to the target image is required. Moreover, bias field estimation and correction are done simultaneous with segmentation and non-linear registration. Nevertheless, SAMSEG resulted in accuracy comparable to ASEG and three other state-of-the-art methods in segmenting T1-weighted MRIs (Puonti et al., 2016). Since SAMSEG does not rely on the specific intensity profiles of a separate training data set, it yields consistent segmentations across scanner platforms and pulse sequences. SAMSEG is included as part of the recent FreeSurfer 7.1 release (released May 11th, 2020), which enables its general use in the neuroimaging community. Therefore, a thorough analysis is necessary to direct the choice between these two utilities provided in the same widely used package of FreeSurfer.
In the present study we undertake a thorough comparative analysis of SAMSEG and ASEG in terms of reliability, bias, sensitivity to longitudinal change, and clinical sensitivity. Longitudinal SAMSEG is used in the present study, which was not available at the time of the Puonti et al. (2016) study. First, we assess intra-and inter-scanner reliability. Second, since higher reliability could come at the cost of lower sensitivity to biologically meaningful change, we test how ASEG and SAMSEG are able to detect neuroanatomic volumetric change in longitudinal follow up scans. Finally, we test how sensitive each method is to clinically relevant change by comparing the annual rate of hippocampal atrophy in a group of cognitively normal older adults (CN), patients with mild cognitive impairment (MCI) and patients with AD.

Materials and methods
2.1. Datasets 2.1.1. Lifespan scan-rescan dataset-We use scan-rescan dataset selected from several ongoing projects at the Center for Lifespan Changes in Brain and Cognition (LCBC), University of Oslo, approved by the Regional Committees for Medical and Health Research Ethics South of Norway. Participants were cognitively healthy, and all participants or their guardian provided informed consent (for details, see e.g. (Walhovd et al., 2016)). Images were acquired using two models of Siemens MRI scanners (Siemens Medical Solutions, Erlangen, Germany) -1.5T Avanto and 3T Skyra, at Rikshospitalet, Oslo University Hospital. A total of 890 participants (1643 sessions) and 887 participants (1739 sessions) were included in the initial within-session scan-rescan datasets for Avanto and Skyra scanners respectively. All images were visually inspected for motion artefacts, and sessions that had two images of no visual appearance of motion were included in further analysis. Fig. 1 illustrates examples of exclusion criterion. After discarding images with insufficient quality, the samples were reduced to 774 participants (427 females; 1362 sessions; age range = 4-93 years) for Avanto and 855 participants (563 females; 1646 sessions; age range = 14-84 years) for Skyra. Fig 2 summarizes age distributions of each scanner dataset. All data was acquired using Magnetization Prepared Rapid Gradient Echo (MPRAGE) sequence with parameters summarized in Table 1. The parameters for the scan-rescan datasets differed between the scanners but were identical for each session on the same scanner, except for the Skyra dataset where one image was acquired using parallel acquisition factor GRAPPA=1 and rescanned with GRAPPA=2. To acquire data with optimal comparability within each scanner, participants remained in the same position between scan and rescan acquisitions.

2.1.2.
Inter-scanner dataset-For inter-scanner dataset, we use a sample of 24 participants (19 females, age range between 20 and 36 years) scanned with three models of Siemens MRI scanners (Siemens Medical Solutions, Erlangen, Germany) on the same day -1.5T Avanto, 3T Skyra and 3T Prisma, at Rikshospitalet, Oslo University Hospital. Table 1 summarizes MRI T1w pulse sequence parameters of each scanner.

Lifespan longitudinal datasets-For
longitudinal LCBC datasets, we select participants from the scan-rescan dataset who also have a follow-up visit: 491 participants of the Avanto scanner and 245 participants of the Skyra scanner. Each participant has two visits with the follow-up ranging from 1 to 10 years for the Avanto dataset and 1 to 5 years for the Skyra dataset.

Clinical sensitivity dataset-
In addition to the longitudinal LCBC datasets, we also include scans from the Alzheimer's disease Neuroimaging Initiative (ADNI) database (adni.loni.usc.edu). The ADNI was launched in 2003 as a public-private partnership, led by Principal Investigator Michael W. Weiner, MD. The primary goal of ADNI has been to test whether serial MRI, positron emission tomography, other biological makers, and clinical and neuropsychological assessment can be combined to measure the progression of MCI and early AD. For up-to-date information, see www.adni-info.org. For our study, we randomly select three groups of participants with similar age distributions: CN, MCI and AD. Each group consist of 20 participants. The selected sample of ADNI data has been acquired at different sites using a Siemens Avanto 1.5T MRI scanner and MPRAGE sequence: TR = 2400 ms, TE = 3.54 ms, TI = 1000 ms, flip angle = 8°, voxel size = 1.25 × 1.25 × 1.2 mm 3 , 192 × 192 acquisition matrix, 160 slices, 180 Hz pixel bandwidth, GRAPPA = 1, 8 channel matrix coil. Each participant has two visits with a follow-up ranging from 6 months to 2 years for each group.

MRI processing
Due to the non-linearity of the magnetic fields from the imaging gradient coils, we first preprocess images to reduce geometrical variability of the same participants' brains between sessions. This is achieved by obtaining scanner-specific spherical harmonics expansions that represent the gradient coils (Jovicich et al., 2006).
We use two fully automated subcortical segmentation methods FreeSurfer v7.1 ASEG and SAMSEG to process MRI data and measure volumes of eight bilateral brain structures of interest: amygdala, caudate, hippocampus, lateral ventricles, nucleus accumbens, pallidum, putamen and thalamus. Briefly, the FreeSurfer ASEG pipeline includes Talairach transformation, intensity correction, the removal of nonbrain tissues and volumetric brain segmentation based upon the existence of an atlas containing information on the location of structures, whereas SAMSEG utilizes a mesh-based atlas and a Bayesian modeling framework to obtain volumetric segmentations without the need for skull-stripping. Moreover, SAMSEG does the bias field estimation and correction simultaneous with segmentation and non-linear registration which is not the case for ASEG where each step is performed separately. Both methods are fully automated and model-based that use a pre-built probabilistic atlas prior from 39 to 20 subjects, respectively. The 20 subjects used for SAMSEG are a subset of the 39 used for ASEG.
To extract reliable volume estimates, we process all datasets with the longitudinal stream in FreeSurfer ASEG and SAMSEG. For FreeSurfer ASEG, an unbiased within-subject template space and image (Reuter and Fischl, 2011) is created using robust, inverse consistent registration (Reuter et al., 2010). Several processing steps, such as skull stripping, Talairach transforms, atlas registration, and spherical surface maps and parcellations are then initialized with common information from the within-subject template, significantly increasing reliability and statistical power (Reuter et al., 2012). Longitudinal SAMSEG is based on a generative model of longitudinal data . In the forward model, a subject-specific atlas is obtained by generating a random warp from the usual population atlas, and subsequently each time point is again randomly warped from this subject-specific atlas. Bayesian inference is used to obtain the most likely segmentations, with the intermediate subject-specific atlas playing the role of latent variable in the model, whose function is to ensure that various time points have atlas warps that are similar between themselves, without having to define a priori what these warps should be similar to.

Statistical analysis
2.4.1. Scan-rescan reliability-We use multiple statistical approaches to describe and evaluate the magnitude of intra-and inter-scanner variability between repeated measurements. We calculate the absolute symmetrized percent difference (ASPD) as follows: where L 1 and L 2 are the segmented labels of the same structure but of different images and V(L) is the volume of the label. ASPD value of 0 indicates a perfect replicability, with increasing values indicating less reliable repeated measurements. We use Generalized additive models (GAM) (Wood, 2017) to characterize volume estimation variability trends of subcortical structures across the lifespan. GAMs are generalized linear models in which the predictors depend linearly or non-linearly on some smooth non-linear functions (Hastie and Tibshirani, 1990). The smooth functions are estimated from the data and enable a flexible smooth curve fitting across the lifespan. In addition to ASPD, we also calculate Dice scores (Dice, 1945), intraclass correlation coefficients (ICC) (McGraw and Wong, 1996;Koo and Li, 2016) and Bland-Altman plots (Bland and Altman, 1986). For ICC we use a 2-way mixed-effects model, single measurement and absolute agreement ICC form.

2.4.2.
Sensitivity to longitudinal change-First, to assess whether the estimated lifespan trajectories of the subcortical volumes differ depending on segmentation method, we use General Additive Mixed Models (GAMM) (Wood, 2017). In contrast to GAMs which treat each observation as independent, GAMMs take longitudinal information into account by explicitly modeling the correlation between repeated measurements of the same subject, yielding a model which captures cross-sectional and longitudinal information. Second, to assess longitudinal changes, we use the annualized percentage change (APC) values between the baseline and the follow-up visits for all participants with two scans separated by one or more years. We compare APC values for each segmentation method with paired samples t-tests. We divide the sample into development (< 20 years), adulthood (between 20 and 60 years) and aging (> 60 years) and compare APCs across age groups using t-tests and Cohen's D. Cohen's D is an effect size used to indicate the standardized difference between two means. Third, to address the clinical sensitivity of each segmentation method, we compute APC for the hippocampus for ADNI subjects, and assess differences in APC between groups (NC vs. MCI vs. AD) using Cohen's D. Finally, we use Receiver Operating Characteristic (ROC) curves and Area Under the Curve (AUC) to address the classification sensitivity based on the APC values of the longitudinal hippocampus estimates in different groups.
All statistical analyses described above is done using R statistical software package v3.6.3 (R Core Team, 2020) and its related packages: mgcv (Wood, 2017), ggplot2 (Wickham, 2016), ggpubr (Kassambara, 2020), cowplot (Wilke, 2019), irr (Gamer et al., 2019), effsize (Torchiano, 2020) and dplyr (Wickham et al., 2020). Fig. 4 show volume estimation differences between repeated intra-scanner acquisitions across the lifespan for the Avanto and Skyra datasets respectively. Although most of the subcortical structures indicate relatively flat lifespan trends, small deviations are observed in the Avanto dataset for the young children group (age < 10 years) when using ASEG. This is not present in the Skyra dataset as it does not include this age group. SAMSEG volumetric estimates are significantly lower (paired samples t-test, p < 0.05) for both datasets and all structures across the lifespan, see appendix (Table A.1) for summary statistics which also indicate lower standard deviations for SAMSEG. Fig. 6 indicate spatial overlap similarity in terms of dice scores for the Avanto and Skyra datasets respectively. Most of the structures show inverted u-shape trajectories except the lateral ventricles which demonstrate almost linearly increasing reliability with aging. ASEG yields significantly higher spatial agreement for putamen whereas the rest of the spatial overlaps are significantly better for SAMSEG (paired samples t-test, p < 0.01).

Fig. 5 and
The largest improvements are demonstrated for amygdala, pallidum and nucleus accumbens.
In general, all Dice scores are high for both the segmentation methods indicating a good spatial agreement between segmented volumes.
We also compute ICC to assess the agreement between the repeated measurements for each scanner dataset and segmentation method. Although we find the reliability of the repeated measurements very high (ICC > 0.95) for both methods, SAMSEG results in significantly higher (p < 0.01) ICC values for all subcortical structures. Bland-Altman plots of both methods do not indicate bias towards the estimated structure size, see appendix (Fig. A.1 and Fig. A.2). However, despite consistent volumetric estimations regardless of the structure size, the limits of agreement (average difference ± 1.96 standard deviation of the difference) are in favor of SAMSEG.

Inter-scanner differences
In Fig. 7, we present inter-scanner differences for three comparisons: Avanto vs. Prisma, Avanto vs. Skyra, and Prisma vs. Skyra. It is evident that the performance of both segmentation methods depends on the particular choice of comparison. Nevertheless, most of the estimated differences are in favor of SAMSEG, especially for amygdala, lateral ventricles and pallidum. A table of numerical results (means and standard deviations) is provided in the appendix (see Table A.2). Similar to scan-rescan reliability, spatial overlaps are also significantly better for SAMSEG except putamen, which has significantly better scores for ASEG, see Fig. 8.

Longitudinal changes
SAMSEG's higher intra-and inter-scanner reliability could be a result of a lower sensitivity to detect relevant changes in brain volumes. We, therefore, test the sensitivity of ASEG and SAMSEG to detect changes over time using longitudinal scans and previously documented effects. First, we run GAMMs to test whether ASEG and SAMSEG yields distinct estimated lifespan trajectories for the volume of each structure when both cross-sectional and longitudinal information is taken into account. For this, we use a part of the LCBC scan-rescan dataset where two observations separated by at least one year are available for each participant. Each volume's trajectory is modelled as a function of age, which varies within each participant with more than one test occasion. The resulting curves thus take into account both observed within-participant change and between participant differences in age. Fig. 9 shows the estimated lifespan trajectories for each method for the longitudinal Avanto dataset. Although there are similarities in estimated age-trajectories between segmentation methods, there are also marked differences. Specifically, ASEG estimates more prominent age-effects for the hippocampus, amygdala and thalamus structures, with apparent volumetric reductions starting at a much earlier age compared to the SAMSEG results. We observe similar results for the Skyra dataset as well.
Next, we analyze change as indexed by the APC between time-points. We divide the sample into 3 age groups: development, adulthood and aging as described in Section 2.4.2. segmentation methods for each age group are all significant (paired samples t-tests, p < 0.01) as well indicating that SAMSEG tends to estimate smaller longitudinal changes than ASEG. Fig. 10 illustrates Cohen's D effect sizes based on the APC values between development and adulthood, and between adulthood and aging groups for hippocampus. SAMSEG results in larger numeric effect sizes between development and adulthood whereas ASEG tends to estimate larger effect sizes for adulthood vs. aging group. However, none these differences are significant between segmentation methods.

Clinical sensitivity
The results of the longitudinal changes indicate that SAMSEG yields lower APC values than ASEG. However, there is no ground truth whether smaller or larger changes are more accurate. We, therefore, address the clinical sensitivity using a subsample of ADNI data. For the purpose of this analysis, we only consider a hippocampus since it is the most sensitive structure for detecting AD.
In Fig. 11, we present longitudinal left hippocampus volume changes for both segmentation methods. The observed differences are similar between the methods but SAMSEG yields notably larger changes in AD group. In addition, SAMSEG tends to estimate larger volumes compared to ASEG but this is consistent between the groups. Table 3 summarizes the same group comparisons in terms of mean and standard deviation of APC values. SAMSEG estimates significantly lower APC values for CN and MCI groups but larger for AD group as compared to ASEG. Nevertheless, SAMSEG leads to the detection of significant differences in atrophy rates between all clinical groups except for the left hippocampus MCI vs. AD comparison whereas the only significant difference for ASEG is seen for the right hippocampus CN vs. MCI contrast. Generally, ASEG demonstrates larger APC variability within each group which in turn hampers the detection of significant differences between the groups when sample sizes are small. Fig. 12 shows Cohen's D effect sizes and their 95% confidence intervals between the group comparisons. The effects are generally larger for SAMSEG than ASEG, but none are significantly different between the segmentation methods.

Discussion
The scan-rescan reliability indicates reliable volume estimation across the lifespan, scanners and segmentation methods. Slight deviations are observed for younger participants, presumably due to subtle head motion artifacts. It has previously been shown that younger age groups typically evidence increased motion artifacts, which can hinder the identification of the tissue boundaries (Blumenthal et al., 2002). Importantly, subtle motion artifacts can lead to systematic biases in automatic measurement of structural brain properties (Yendiki et al., 2014). Although different parallel imaging factors (GRAPPA) are used for the Skyra scan-rescan dataset (GRAPPA = 2 vs. GRAPPA = 1), it does not indicate sensitivity to lower signal-to-noise ratio and is comparable to the Avanto dataset. Similar effects of parallel imaging acceleration are shown by (Wonderlick et al., 2009).
The observed average volumetric differences across the lifespan for ASEG are similar to previous reports (Jovicich et al., 2009;Morey et al., 2010). Nevertheless, SAMSEG leads to significantly higher intra-scanner volume estimation reliability for all subcortical structures and higher spatial overlap except putamen, which has significantly higher spatial overlap for ASEG. This is likely a result of SAMSEG's probabilistic atlas, which currently does not include claustrum structure. Claustrum's thin shape and proximity to putamen structure makes it difficult to reliably segment at common image resolutions, for example, isotropic 1 mm 3 voxels. The probabilistic atlas used in ASEG does not include claustrum either, but it has its internal mechanism of removing it from the putamen segmentation. Fig. 14 shows an example segmentation of putamen using ASEG and SAMSEG, which outlines the inclusion of claustrum for SAMSEG.
Inter-scanner differences also support the findings of the intra-scanner reliability. Although inter-scanner differences depend on the particular comparison, SAMSEG in almost all the cases was able to estimate smaller ASPD values than ASEG. In addition, SAMSEG shows much lower variability of volumetric measures indicating improved reliability and sensitivity to detect meaningful changes. This is especially important when having small sample sizes as seen in the clinical sensitivity analysis.
Higher intra-scanner and inter-scanner reliability could come at the cost of less sensitivity to detect meaningful biological change, i.e. that SAMSEG over-regularizes. However, the present analyses of within-person longitudinal change suggest that SAMSEG does not achieve improved reliability by sacrificing sensitive to change. Longitudinal changes in hippocampal volume are detected by both methods, and the APC values are comparable. In the absence of the ground truth longitudinal changes, the present findings suggest that both methods are sensitive to changes in hippocampal volume over time.
We also mapped the lifespan trajectory of each of the structures of interest using GAMMs, taking both cross-sectional and longitudinal information into account. The segmentation differences between ASEG and SAMSEG have substantial effect on lifespan trajectories. In general, developmental trajectories are similar regardless of segmentation method, replicating previous findings (Ostby et al., 2009), although effect sizes for the hippocampus are larger for SAMSEG than ASEG when comparing development to adulthood. For adulthood and aging, however, marked differences are seen for most structures. For the hippocampus and amygdala, the ASEG results replicated earlier studies showing slight volumetric decline from young adulthood (Fjell et al., 2013), with acceleration of volume loss from the sixties, especially marked for the hippocampus. This is not observed for SAMSEG, where very little volume loss is seen before the accelerated decline in aging. For thalamus and pallidum, there are large offset effects, where the estimated volumes for the young children are much higher for ASEG, followed by a steady decline after development ends, extending throughout the rest of the lifespan. This pattern, which is in agreement with previous literature (Fjell et al., 2013), is not seen with SAMSEG. For these structures, as well as nucleus accumbens, SAMSEG yields modest decline across adulthood, with only some acceleration of volume loss in the oldest for thalamus. Interestingly, while the previously reported U-shaped trajectory for caudate (Fjell et al., 2013) is seen with ASEG, this is less evident with SAMSEG, which shows a more linear volume decline also in higher age. The implications of these findings await further explorations, but the present results show that the two segmentation methods have substantial effects on the estimated lifespan trajectories of most subcortical structures.
The longitudinal changes analyzed in the clinical setting suggest that SAMSEG tends to be more sensitive to differences in hippocampal atrophy between CN, MCI and AD groups. This is especially important for detecting the early accelerated hippocampal atrophy which is known to be one of the most sensitive biomarkers of Alzheimer's disease (Teipel et al., 2013). Expected group differences are more consistently observed for SAMSEG than ASEG. This is likely the result of larger variability between change estimates for ASEG which in turn reduces the power to detect significant differences between the groups. Therefore, based on the current study there is evidence that ASEG might require more samples per group in order to observe the expected group differences, whereas SAMSEG already shows greater sensitivity to detect relevant changes with the relatively modest number of 20 patients in each group that we use for assessment. This is well reflected in the Cohen's D effect sizes and ROC-AUC curves, which indicate the improved classifications based on SAMSEG's segmentations.
We have analyzed intra-scanner reliability of participants that were not repositioned before acquiring a repeated scan. This scenario is unlikely in the clinical setting where participants are usually taken out of the scanner before acquiring another repeated scan. This, in turn, might lead to an increased measurement variability and less reliable volumetric estimates compared to what was observed in the present work. We also acknowledge that a visual rating procedure is not the most appropriate approach of pre-selecting images for the intra-scanner analysis and the study is not informative with a view to the robustness of either segmentation method in the presence of common artefacts. Finally, we performed a comprehensive evaluation of longitudinal changes and sensitivity for the hippocampus structure. The remaining subcortical structures should be addressed in addition as it is not evident that similar longitudinal trends would be present.

Conclusions
Both whole-brain segmentation methods demonstrate high scan-rescan reliability. Although SAMSEG yields significantly lower differences between repeated measures for intraand inter-scanner analysis, it does not compromise sensitivity to detect changes and demonstrates ability to detect clinically relevant longitudinal changes. Therefore, the method has a potential to be widely used in neuroimaging research. The present findings will also direct many researchers who have the choice between these two utilities, leading to a downstream impact in clinical studies and laying the foundation for further studies that can build on this.

Acknowledgement
The present research was funded by a grant from Helse-Sør Øst (Grant No. 2018009)  A summary of the mean and standard deviation (in parenthesis) ASPD values for both segmentation methods and datasets for the intra-scanner reliability analysis. Abbreviations of structure names: PT (putamen), PA (pallidum), CA (caudate), TH (thalamus), HP (hippocampus), AM (amygdala), LV (lateral ventricles) and AA (accumbens area). Names are prefixed with L-and R-to indicate left and right hemispheres respectively. Bold numbers indicate the smallest mean and standard deviation values between the segmentation methods. Bland-Altman plots for the Avanto dataset and ASEG segmentation method. Limits of agreement (average difference ± 1.96 standard deviation of the difference) are shown by the red lines. Bland-Altman plots for the Avanto dataset and SAMSEG segmentation method. Limits of agreement (average difference ± 1.96 standard deviation of the difference) are shown by the red lines.   Box plots of Dice scores for inter-scanner comparisons. X-axis abbreviations: Avanto vs. Prisma (A vs. P), Avanto vs. Skyra (A vs. S) and Prisma vs. Skyra (P vs. S). All differences between segmentation methods are significant (p < 0.05).  Lifespan trajectories for the Avanto dataset. The trajectories are estimated by GAMM and represent a combination of cross-sectional and longitudinal information.  Cohen's D effect sizes (dots) and their 95% confidence intervals (vertical bars) for development vs. adulthood, and adulthood vs. aging groups for the Avanto dataset.     A summary of the mean and standard deviation hippocampus APC values for the age groups, segmentation methods and datasets. Standard deviations are indicated in the parenthesis next to the mean APC values and a sample size of each group is indicated in the parenthesis next to the group name.