Degenerative Adversarial NeuroImage Nets for Brain Scan Simulations: Application in Ageing and Dementia

Accurate and realistic simulation of high-dimensional medical images has become an important research area relevant to many AI-enabled healthcare applications. However, current state-of-the-art approaches lack the ability to produce satisfactory high-resolution and accurate subject-specific images. In this work, we present a deep learning framework, namely 4D-Degenerative Adversarial NeuroImage Net (4D-DANI-Net), to generate high-resolution, longitudinal MRI scans that mimic subject-specific neurodegeneration in ageing and dementia. 4D-DANI-Net is a modular framework based on adversarial training and a set of novel spatiotemporal, biologically-informed constraints. To ensure efficient training and overcome memory limitations affecting such high-dimensional problems, we rely on three key technological advances: i) a new 3D training consistency mechanism called Profile Weight Functions (PWFs), ii) a 3D super-resolution module and iii) a transfer learning strategy to fine-tune the system for a given individual. To evaluate our approach, we trained the framework on 9852 T1-weighted MRI scans from 876 participants in the Alzheimer's Disease Neuroimaging Initiative dataset and held out a separate test set of 1283 MRI scans from 170 participants for quantitative and qualitative assessment of the personalised time series of synthetic images. We performed three evaluations: i) image quality assessment; ii) quantifying the accuracy of regional brain volumes over and above benchmark models; and iii) quantifying visual perception of the synthetic images by medical experts. Overall, both quantitative and qualitative results show that 4D-DANI-Net produces realistic, low-artefact, personalised time series of synthetic T1 MRI that outperforms benchmark models.


Introduction
The increasing availability of big data in healthcare and medicine has produced a boom in AI-enabled healthcare tools, particularly in medical image analysis. However, in various contexts of this research area, there is a lack of ground-truth data, which presents challenges for the trust and reliability of the related AI-based tools. Therefore, medical image simulators able to generate accurate and realistic data for model validation can be a vital ingredient in the development of these new technologies. Such simulators are also important for data augmentation when training data-hungry AI models in situations where insufficient samples are available, e.g., in rarer diseases, or to recover missing images in longitudinal studies and predict future disease courses (virtual placebo). Here we introduce a novel, computationally efficient 4D brain image simulation approach and demonstrate its capabilities in a neuroimaging application.
Neurodegenerative diseases are a major challenge of 21st-century medicine, with the incidence of these age-related diseases expected to rise further as the global population ages. This has inspired an explosion in medical data-sharing initiatives, including from healthcare records (e.g., the Alzheimer's Disease Data Initiative (ADDI)) and large observational research studies such as the Alzheimer's Disease Neuroimaging Initiative (ADNI). Neuroimaging, such as Magnetic Resonance Imaging (MRI), is able to probe neurodegenerative diseases noninvasively and has provided well-established biomarkers for tracking disease progression in the clinic [1]. The data-sharing revolution has inspired the development of a suite of data-driven computational modelling methods for understanding and predicting disease progression [2,3], with imaging playing a key role. Despite these efforts, suitable medical image simulators are relatively few and lagging behind, because simulating realistic and accurate neuroimaging data presents multiple challenges, both biological and computational, many of which we address in this work.
Here, we introduce 4D-DANI-Net: a computationally-efficient framework for synthesizing a realistic, accurate, and personalized time series of high-resolution brain images for an individual conditioned on disease stage (clinical diagnosis) and age.
Our contributions can be summarized as follows: i) we designed a new pipeline that enables the simulation of 4D MRI in both ageing and disease; ii) we proposed a sequence of memory-efficient techniques designed to improve training stability, reduce image artefacts, and increase individualization; and iii) we proposed a new validation protocol based on volumetric comparison to assess the accuracy of such a system.
We demonstrate our framework in the context of Alzheimer's disease, and our experiments extensively analyze the capabilities of 4D-DANI-Net through quantitative and qualitative assessment, after training on a large dataset consisting of 9852 T1-weighted MRI from the ADNI and validating on a separate test set of 1283 MRI (also from the ADNI).
The paper is structured as follows: in Section 2, we describe relevant previous work; in Section 3, we summarize our new framework; in Section 4, we describe the data set and our training protocol. Experimental results are presented in Section 5, and we conclude in Section 6.

Background
Computational disease progression modelling is a discipline that studies biophysical mechanisms and observable patterns of pathology spread and symptoms in chronic diseases. Such models are motivated by one or more applications including predicting the future course and providing insight for disease staging, which could help to achieve early diagnosis and personalized care. For a review of data-driven disease progression models, see [2]. Briefly, the input to many disease progression models [4,5,6,7,8,9,10] is unstructured data such as scalar biomarkers, including those extracted from MRI for assessing neurodegeneration. Spatiotemporal models, e.g., [11,12], attempt to incorporate structural information from the MRI themselves. All these models aim to produce quantitative templates of disease progression that promise utility for, e.g., recruiting the right patients at the right time into clinical trials. An MRI simulator has a key role to play in validating such models for these important applications. Other potential clinical applications include enhancing AI interpretability by providing counterfactual visual examples that help humans identify errors in classifications made by AI systems to understand how marginal decisions come about [13,14,15]. Lastly, such simulators can be used to augment medical imaging datasets by creating new realistic samples required to train data-intensive AI algorithms when data collection is infeasible or too expensive [16,17,18].
Current MRI simulators can be divided into two categories: i) biomechanical/physics-based models, which describe brain deformations in mechanical terms such as strain, displacement and stress. These models consider geometry, boundary conditions, loading, and material properties in their definition [19,20]; ii) data-driven/learning-based models capable of understanding and predicting disease progression. These approaches often use machine learning, including deep learning techniques, to distil information from big data [21]. Among these, a type of neural network that is particularly useful for generative modelling and simulation is the Generative Adversarial Network (GAN) [22], which can generate new samples that plausibly come from an existing distribution of real data. To do this, GANs are trained using two neural network models: a generator that learns to generate new plausible samples, and a discriminator that learns to differentiate generated examples from real examples. However, due to the high spatial dimensionality (many voxels per scan) and temporal sparsity of MRI data (few time points per individual), training such networks is challenging and computationally expensive.
In particular, current MRI simulators suffer three key limitations that severely limit their utility: i) lack of individualization; ii) poor image resolution; iii) limited to 2D images.
Lack of individualisation precludes accurate modelling of individual trajectories because all the simulated MRI scans share the same, group-level deformation pattern. Approaches with this limitation usually create a spatiotemporal model that learns only one monotonic behaviour across all subjects [23,24,25,26] or a few morphological templates associated with specific sub-groups [27,28,29,30]. An early attempt to overcome these restrictions exploited the power of deep generative models to propose a framework based on GANs which uses image arithmetic to combine atrophy patterns and manipulate MRI directly [31]. However, this approach was restricted to linear (short-term) disease progression and was still based on learning group-level morphological changes that lose subject individuality over time.
While solutions lacking individualisation do not completely fit the purpose of disease progression modelling, [32] have shown that sharing synthetic images reproducing group-level statistics is an alternative solution when it is not possible to share patient data due to privacy or data protection issues.

The second and third limitations (poor image resolution and restriction to 2D images) are mainly due to the computational cost required by a simulator. In fact, implementing effective methods for 3D, high-resolution brain images often requires increased computation time due to memory issues [33].
One approach that suffers from these limitations is proposed in [20], which combines a biophysical model and a deformation field obtained by non-rigid registration of two real images. This approach is constrained by memory restrictions that force a trade-off between image resolution/dimensionality (e.g., 3D vs 2D) and computation time, ultimately precluding such an approach from scaling up to large, high-resolution datasets. Beyond the prohibitive computational cost, [20] also relies on an atrophy lookup table rather than learning atrophy patterns from the data.
Reducing the dimensionality from 3D to 2D MRI can ameliorate some of the computational limitations. For example, the simulator in [34] proposed a predictive regression model for only 2D images. Instead of directly predicting images, this model predicts a vector momentum sequence [35] associated with a baseline image, where a Long Short-Term Memory (LSTM) network is used to encode the time-varying changes in the vector-momentum sequence, and a Convolutional Neural Network (CNN) is used to encode the baseline image of the vector momenta. [36,37] instead proposed a GAN-based adversarial training, which aimed to learn an age-based progression model for 2D slices of brain MRI scans. Our own preliminary work introduced a GAN-based framework, still only for 2D MRI [38], which was inspired by a face-ageing model [39].

Here, we introduce a framework to address all these limitations. We decompose the 4D problem (3D plus time) into learning multiple separate (2D plus time) models based on the slice-wise framework presented in [38]. These separate models are unified using a new 3D training consistency strategy called Profile Weight Functions (PWFs) that preserves spatiotemporal continuity between 2D models. This memory-efficient strategy allows us to overcome limitation iii (restriction to 2D images), whereas a 3D super-resolution block is used to overcome limitation ii (poor image resolution). Lastly, we use a transfer learning strategy to obtain model individualisation, overcoming limitation i (lack of individualization).

Methods
4D-DANI-Net is a deep learning framework for synthesising high-resolution, longitudinal, subject-specific MRI scans. The core of the framework is a progression model based on adversarial training which includes biologically-informed spatiotemporal constraints to model neurodegeneration in ageing and dementia. Formally, 4D-DANI-Net generates the MRI sequence Y p,i with i ∈ {1...A} representing the simulated series of A time points for the subject p, initialised from a single input MRI X p,θ acquired at age θ ∈ R + .
Our framework consists of three main blocks depicted in Fig. 2: i) pre-processing; ii) progression model; and iii) 3D super-resolution. Pre-processing removes irrelevant variations in the data. Progression modelling is performed slice-wise (2D plus time) with 3D training consistency, as a set of DANI-Net models DN n , where n ∈ {1...T } represent different slice positions. Finally, our super-resolution block is a function that maps the resulting set of T lower-resolution image slices I p,i,n ∈ R 2 for subject p and time point i obtained from each DN n , to the high-resolution MRI Y p,i ∈ R 3 . Below, we describe each block in detail.

Pre-processing
We use four pre-processing steps to prepare each input MRI X p,θ for model training. This produces a set of n normalized slices X p,θ,n from each MRI. Samples with pre-processing failures were excluded from our experiments.
The four steps are i) linear co-registration to 1mm isotropic MNI template using FLIRT-FSL [40]; ii) skull-stripping using BET-FSL [41]; iii) extraction of the n ∈ {1...T } axial slices from X p,θ ; and iv) performing slice-wise intensity standardisation (zero mean, unit standard deviation). In combination, these steps reduce irrelevant variations in the data.
Such variations can be caused by, e.g., scanner peculiarities and image orientation, which are irrelevant to the biological processes of interest.
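The slice-wise intensity standardisation in step iv can be sketched as follows. This is a minimal pure-Python stand-in for illustration; a real pipeline would operate on NumPy slice arrays produced by the FSL steps:

```python
# Minimal sketch of step iv: slice-wise intensity standardisation
# (zero mean, unit standard deviation) applied to one 2D slice.
import math

def standardise_slice(slice_2d):
    """Shift a 2D slice to zero mean and scale it to unit std."""
    values = [v for row in slice_2d for v in row]
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    std = math.sqrt(var) or 1.0  # guard against constant slices
    return [[(v - mean) / std for v in row] for row in slice_2d]

out = standardise_slice([[10.0, 12.0], [14.0, 16.0]])
flat = [v for row in out for v in row]
print(round(sum(flat), 6))  # mean of the standardised slice is ~0
```

Standardising each slice independently removes scanner-dependent intensity offsets before the slices are fed to the per-slice models.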

Progression Model
For each axial slice in MNI space, we fit an independent 2D plus time progression model DN n (based on the original DANI-Net [38]). Each DANI-Net model consists of three different sub-blocks (see Fig. 3): a Conditional Deep Autoencoder (CDA) (coloured in pink); a set of adversarial networks (yellow); and a set of biological constraints (grey). We also introduce a novel PWFs strategy for unifying slice models into a 3D progression model during training (blue).

Conditional Deep Autoencoder (CDA)
This block aims to learn a mapping between an initial manifold (representing brain MRI) and a lower-dimensional space, which we refer to as the latent space. This latent space is conditioned on other factors associated with the subject (i.e., current diagnosis, age) to allow manipulation of the image prediction on the original manifold according to these metadata.
More specifically, this block is composed of two deep neural networks: an encoder E that embeds X p,θ,n in a latent space Z, and a generator G that projects samples in the latent space, back to the original manifold. The latent vector z is conditioned on two variables: d ∈ N + -a numerical representation [0 − 3] of diagnosis (cognitively normal CN, subjective memory concern SMC, early/late mild cognitive impairment E/LMCI, Alzheimer's disease AD); and a ∈ N + -an age index binned into A groups. This age binning allows learning of morphological changes between age groups and prevents the CDA from memorizing (in the latent space) the age θ as an individual representation for each sample and thereby overfitting to age.
The CDA is trained using a reconstruction loss L^rec_{p,n} that minimizes the difference between the input X_{p,θ,n} at age θ and the output sequence G_{p,i,n} with i ∈ {1...A}. This difference is weighted using a fuzzy Gaussian membership function μ_i[m_i, σ_i] centred on the average age m_i of each age bin, with width σ_i ∝ √δ_i proportional to the maximum age difference δ_i inside each bin. This preserves similarity between the input and the generated sequence, weighting nearer ages more heavily. Formally, L^rec_{p,n} is described as follows:

L^rec_{p,n} = Σ_{i=1}^{A} μ_i[m_i, σ_i](θ) · ‖ X_{p,θ,n} − G_{p,i,n} ‖
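The fuzzy Gaussian membership weighting can be illustrated with a toy computation. The proportionality constant k and the bin ages below are illustrative assumptions, not values from the paper:

```python
# Illustrative sketch of the fuzzy Gaussian membership mu_i used to
# weight the reconstruction loss; m_i is each bin's mean age and
# sigma_i grows with the bin's maximum internal age gap delta_i.
# The constant k is a hypothetical proportionality factor.
import math

def membership(theta, m_i, delta_i, k=1.0):
    sigma_i = k * math.sqrt(delta_i)
    return math.exp(-((theta - m_i) ** 2) / (2.0 * sigma_i ** 2))

bin_means = [66.0, 70.0, 74.0]  # hypothetical bin-average ages
weights = [membership(71.0, m, delta_i=4.0) for m in bin_means]
# the bin whose mean age is nearest to theta receives the largest weight
print(max(range(3), key=lambda i: weights[i]))  # 1
```

Time points near the subject's true age thus dominate the reconstruction term, while distant ages contribute only weakly.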

Adversarial Training
GANs are a class of adversarial deep neural networks that have been successfully used to generate high-quality images across a wide range of tasks.
We introduce a new adversarial training technique for the 4D-DANI-Net. In our case, the generator network G (the decoder of our CDA) learns how to create synthetic realistic brain images. Simultaneously, we use two discriminators, D z and D b , trained adversarially with the encoder E and the decoder G of our CDA.
More specifically, G is trained to fool D b , i.e., to generate brain MRI with a similar distribution to the initial true distribution. Simultaneously D b is trained to discriminate between empirical and synthetic brain MRI (generated by G).
To train D_b we use the following loss function:

L^{D_b}_n = − E[ log D_b(X_{p,θ,n}) ] − E[ log(1 − D_b(G(E(X_{p,θ,n}), a, d))) ]

where E[·] is the expectation, D_b estimates the probability that a slice contains a realistic brain, and E(X_{p,θ,n}) is the latent vector obtained from X_{p,θ,n}.
The second discriminator D_z is trained adversarially with the encoder E, to produce z with a uniform prior U and smooth temporal progression. To train D_z we use the following loss function:

L^{D_z}_n = − E[ log D_z(z*) ] − E[ log(1 − D_z(E(X_{p,θ,n}))) ]

where z* is a vector sampled from U and D_z estimates the probability that a vector comes from U.
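Both discriminator objectives are standard binary cross-entropies; as a hedged toy sketch (scalar probabilities standing in for network outputs), a confident, correct discriminator incurs a lower loss than a chance-level one:

```python
# Toy binary cross-entropy sketch of a discriminator objective:
# -log p on real samples, -log(1 - p) on generated samples.
import math

def discriminator_loss(p_real, p_fake):
    """Mean BCE over real and fake probability lists."""
    real_term = -sum(math.log(p) for p in p_real) / len(p_real)
    fake_term = -sum(math.log(1.0 - p) for p in p_fake) / len(p_fake)
    return real_term + fake_term

good = discriminator_loss(p_real=[0.9, 0.95], p_fake=[0.1, 0.05])
chance = discriminator_loss(p_real=[0.5, 0.5], p_fake=[0.5, 0.5])
print(good < chance)  # True
```

The generator is trained against this signal, pushing its synthetic slices toward the real-image distribution.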

Biological Constraints
To capture the patterns of image intensity changes that accompany disease progression across time, 4D-DANI-Net uses two separate loss functions at different spatial scales: voxel-level L vox and region-level L reg . These losses impose biological constraints that mimic neurodegeneration by ensuring monotonically decreasing intensity (brain tissue density [42]) that is consistent with normal ageing and/or dementia.
For the synthetic output G p,a,n with a equal to the bin index for age θ, the voxel-level loss function L vox p,n penalizes non-monotonic progression by imposing that all the voxels in G p,i,n with i < a have equal or higher intensity, and that all the voxels in G p,j,n with j > a, have equal or lower intensity (recall that intensity is normalized in the first block of Fig. 2).

L^vox_{p,n} is defined as follows:

L^vox_{p,n} = Σ_{i<a} Σ_voxels max(0, G_{p,a,n} − G_{p,i,n}) + Σ_{j>a} Σ_voxels max(0, G_{p,j,n} − G_{p,a,n})

L^vox_{p,n} models progression at the voxel level, but is incapable of modelling intensity changes that can occur at the global level (i.e., due to tissue deformation).
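The voxel-level monotonicity constraint described above can be sketched directly from its definition. This is a minimal illustration over flattened toy "slices", not the paper's implementation:

```python
# Minimal sketch of the voxel-level monotonicity constraint: frames
# earlier than the input age bin (i < a) may not be darker than the
# current frame, later ones (j > a) may not be brighter. Hinge terms
# accumulate only where non-increasing intensity is violated.
def voxel_monotonicity_penalty(frames, a):
    """frames: list of flattened slices over time; a: input age-bin index."""
    penalty = 0.0
    current = frames[a]
    for i, frame in enumerate(frames):
        for v_t, v_a in zip(frame, current):
            if i < a:                 # past frame must be >= current
                penalty += max(0.0, v_a - v_t)
            elif i > a:               # future frame must be <= current
                penalty += max(0.0, v_t - v_a)
    return penalty

monotone = [[1.0, 0.9], [0.8, 0.7], [0.6, 0.5]]
violating = [[0.2, 0.9], [0.8, 0.7], [0.9, 0.5]]
print(voxel_monotonicity_penalty(monotone, a=1))   # 0.0
print(voxel_monotonicity_penalty(violating, a=1))  # approximately 0.7
```

A perfectly monotone sequence incurs zero penalty; brightening over time (or past frames darker than the present) is penalized in proportion to the violation.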
Therefore, we introduce a region-level loss function L reg p,n that models slice-wise regional neurodegeneration through a set of pre-trained logistic regressors (LRs). Each regressor LR n,q is trained to predict intensity progression in fixed, overlapping region masks q. We describe how to generate these specific regions in Section 3.5.
For slice n, the regressor takes three input features: age at baseline, age at follow-up, and diagnosis. We restrict each LR to train on monotonically decreasing data by removing time points where regional intensity increases (treated as outliers). We also weight the errors made by each LR_{n,q} with the corresponding region size s_{n,q}, to induce consistent intensity within large regions. The contribution of s_{n,q} helps to make this loss robust to noise in the MRI.
Formally, L^reg_{p,n} is defined as follows:

L^reg_{p,n} = Σ_{q=1}^{R_n} s_{n,q} · | Σ*(r_{n,q} ⊙ G_{p,a,n}) / (Σ*(r_{n,q} ⊙ X_{p,θ,n}) + ε) − LR_{n,q}(θ, a, d) |

where R_n is the number of regions; r_{n,q} are the region masks; LR_{n,q}(o, a, d) is the corresponding intensity change predicted by the logistic regressor for age a, conditioned on diagnosis d, starting from the baseline age o; ε = 0.1 avoids numerical errors; ⊙ is the matrix Hadamard product (element-wise multiplication); and Σ* sums over brain voxels.

Total Loss
Each single-slice DANI-Net model DN_n is computed at slice position n and is trained to optimize all the losses (L^reg_{p,n}, L^vox_{p,n}, L^{D_b}_n, L^{D_z}_n, L^rec_{p,n}) at the same time. We illustrate this in the black block of Fig. 3. The total loss is the weighted sum

L^tot_n = w_reg L^reg_{p,n} + w_vox L^vox_{p,n} + w_b L^{D_b}_n + w_z L^{D_z}_n + w_rec L^rec_{p,n}

where L^{D_z}_n and L^{D_b}_n are the cross-entropies obtained respectively by the discriminators D_z and D_b, for slice position n over all subjects p.
The weights allow for framework customization, as follows:
- increasing w_reg increases the contribution of disease progression (the LRs);
- increasing w_vox regularizes voxel intensity changes for flat regions, but may increase the rigidity of brain structures;
- increasing w_b favours qualitatively realistic brain images, at the cost of decreased model generalization;
- increasing w_z reduces temporal smoothing to allow rapid progression, which can introduce temporal discontinuity;

Figure 4: Left: training separate slice-based models with multiple adversarial losses (i.e., 4D-DANI-Net), with each loss having constant relevance throughout training, can lead to local minima. Right: proposed PWFs strategy; the relevance of each loss during training is specified by profile functions with parameters learned via grid search. In this case, training follows a specific path in the manifold and avoids local minima.
- increasing w_rec increases similarity across ages, which diminishes the progression learned by the LRs.
Some loss functions optimize concurrent tasks, so finding the optimal configuration for these weights is nontrivial. Our strategy to accomplish this is via PWFs, that we describe in the next section.

Profiling Weight Functions (PWFs) for 3D Training Consistency
In this section, we introduce PWFs, which dynamically weight our five losses and unify the training of the 2D slice-wise models [38] in a computationally efficient manner.
Due to the complexity and non-convexity of our total loss function, training each DN n might be unstable. This is particularly problematic since convergence failures in a slice will generate spatial inconsistency artefacts in the synthetic 3D MRI. This is compounded by the adversarial components of DANI-Net (D z and D b ), as GANs are known to be prone to training instability [43,44].
The left block of Fig. 4 shows a hypothetical example that would create problems with classical adversarial training, as the competitor networks may reach a local minimum of the training manifold.
To overcome this type of instability, PWFs guide training. The approach is inspired by a multistage learning strategy in which humans solve a complex visual problem by optimizing simpler sub-tasks first. Explicitly, PWFs guide the system to focus on fewer loss functions at a time, i.e., providing greater regularization. This is achieved by dynamically weighting each component loss at every training epoch t.
To do so, we use a mean-reverting exponential function (its full form and parameters are reported in Section 4) with parameters (b_loss, v and u) optimized by a random search strategy on a grid, measuring training convergence using L^tot_n on a validation set. The right side of Fig. 4 depicts how PWFs help to avoid local minima and, in our case, ensure that different models avoid the spatial mismatch that can cause image artefacts.
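As a rough, hypothetical illustration of a mean-reverting exponential weight schedule in the same spirit (not the authors' exact parameterisation; `w_init`, `w_final` and `u` are assumed names), each loss weight reverts from an initial value toward an asymptote at a fixed rate per epoch:

```python
# Hedged sketch of a mean-reverting exponential weight schedule:
# the weight starts at w_init and decays (or grows) toward the
# asymptote w_final at rate u per training epoch t.
def pwf_weight(t, w_init, w_final, u=0.99):
    """Exponentially revert from w_init toward w_final."""
    return w_final + (w_init - w_final) * (u ** t)

# A weight meant to dominate early (e.g. the reconstruction loss)
# decays toward a small asymptote, handing over to the other losses:
early = pwf_weight(0, w_init=1.0, w_final=0.1)
late = pwf_weight(1000, w_init=1.0, w_final=0.1)
print(round(early, 3), round(late, 3))
```

Scheduling the weights this way lets the optimizer concentrate on a few losses at the start of training and shift emphasis smoothly as training proceeds.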
The final step for maintaining 3D consistency between consecutive slices is to smooth the slice-wise models using a Gaussian-weighted (σ = 1.5) average that includes the ±2 nearest-neighbour slices.
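The cross-slice smoothing step can be sketched as follows; scalar values stand in for whole 2D slices, and the window is clipped at the volume borders:

```python
# Sketch of the cross-slice smoothing step: each slice is replaced by
# a Gaussian-weighted (sigma = 1.5) average over itself and its +/-2
# nearest-neighbour slices, clipping the window at the borders.
import math

def smooth_across_slices(slices, sigma=1.5, radius=2):
    """slices: list of per-slice scalar values (stand-ins for 2D slices)."""
    out = []
    for n in range(len(slices)):
        num, den = 0.0, 0.0
        for k in range(-radius, radius + 1):
            if 0 <= n + k < len(slices):
                w = math.exp(-(k ** 2) / (2.0 * sigma ** 2))
                num += w * slices[n + k]
                den += w
        out.append(num / den)
    return out

# A lone outlier slice is pulled toward its neighbours:
print(smooth_across_slices([0.0, 0.0, 1.0, 0.0, 0.0])[2] < 1.0)  # True
```

Because the kernel weights are renormalized inside each window, constant intensity profiles pass through unchanged, while abrupt slice-to-slice discontinuities are attenuated.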
The workflow of the proposed progression model block is described schematically in Fig. 2.

3D Super-resolution
To recover lost anatomical detail due to the Gaussian smoothing (described in the preceding section), we include a 3D super-resolution block at the end of our pipeline (see Fig. 2). This is based on a modified 3D densely-connected super-resolution network [45] that uses pairs of low-resolution (LR) and high-resolution (HR) MRI for training a deep super-resolution neural network.
We train this super-resolution block separately from the rest of our framework. To do so, we use as HR images the X p,θ,n available in the training set, and as LR counterparts, the output obtained from the same input X p,θ,n , at the same age θ computed from our framework when the super-resolution block is disabled.
Once these pairs of LR/HR are created, we use them to train the super-resolution network. We then proceed by attaching this trained SR-network onto the backbone of our trained system. This allows us to super-resolve the stacked output I p,i,n (generated by each DN n ) and to produce the required high-resolution MRI Y p,i .

Post-processing Using Transfer Learning
At the inference stage, we use transfer learning to personalise and fine-tune the model on each test input brain. The aim of this procedure is to synthesize image evolution that reflects individual characteristics (here represented by brain morphology) and align each test MRI-slice to the model by age and diagnosis.
For each DN_n and for each test subject p, we perform fine-tuning with an additional 50 training iterations on each single test input image, where only the MRI from the test subject's first visit is used. Here, only the parameters of the super-resolution block are frozen, whilst all the other network parameters are fine-tuned to the specific morphology of the test individual's brain. The number of iterations required in the transfer-learning step was determined empirically by visual inspection of the obtained images.

Additional Information: Region Extraction Based on Atlas
The region masks used to train 4D-DANI-Net are pre-defined by the brain atlas proposed in [46] and imposed on each MRI using linear registration. The regions are extracted from the axial slices of this atlas. Additionally, to increase subject-specific variability, each region is augmented by applying morphological erosion and dilation operators. After this process, an average of 88 regions are generated for each slice position, and a total of 9113 regions are extracted over the entire MRI (the average region size is 517 voxels). As explained in Section 3.2.3, these regions represent each r_{n,q} and are used to train the logistic regressors LR_{n,q} required to embed the regional ratio of intensity changes in the system.

Dataset and Training Details
Data used in the preparation of this article were obtained from the ADNI database (adni.loni.usc.edu). The ADNI was launched in 2003 as a public-private partnership, led by Principal Investigator Michael W. Weiner, MD. The primary goal of ADNI has been to test whether serial Magnetic Resonance Imaging (MRI), Positron Emission Tomography (PET), other biological markers, and clinical and neuropsychological assessment can be combined to measure the progression of Mild Cognitive Impairment (MCI) and early AD.
In our experiments, we selected 12386 pre-processed T1-weighted MRI scans from N = 1216 participants in the ADNI dataset. The scans were obtained using different scanners and at multiple sites. Participants were aged between 63 and 87 years old; 28% were CN, 4% had been diagnosed with subjective memory concern, 54% with mild cognitive impairment, and 14% with AD. Each participant has on average 4.7 MRI spanning 3 years. We divided our dataset into a training set (MRI: 9852; participants: 876), a validation set (MRI: 1251; participants: 170), and a test set (MRI: 1283; participants: 170). In the test set, we made sure that participants have at least one follow-up visit two years after baseline, to allow sufficient time for observable neurodegeneration to occur; participants without such a follow-up were excluded.
The first step of our training procedure uses the random grid search on the validation set to find the optimal parameters for our PWFs. The full PWFs and related parameters obtained from this procedure are reported below.
where t is the training epoch; u = 0.99 determines how fast the profile functions converge; v = 10 controls the initial and final conditions of the weight functions; b_reg, b_vox, b_b, b_z, b_rec are the proposed parameters describing the "shape" of the slope of each function; and d_reg, d_vox, d_b, d_z, d_rec describe the "direction" of the slope (ascending or descending). The range of values used in the grid search is chosen so that the product between each weight and the average value of the corresponding loss is normalized in the range [0-1]. The best values obtained during the grid search, together with the average values of each loss (which come from one training run of our system with the PWF block disabled), are reported in Table 1. The exponent values obtained for each PWF describe the optimization strategy used to train our system: a negative/positive exponent tells the optimizer to focus on the corresponding loss earlier/later in the training process.
Our interpretation of the obtained values suggests that the training should first focus on L^rec_{p,n} (reconstruction loss) to learn a simple progression model based on conditional morphological deformations. As w_rec decays, the other weights increase towards their asymptotes, with our parameter-search results meaning that w_vox and w_reg dominate over w_z and w_b. In practical terms, this means that our PWFs favour optimization of the losses associated with biological considerations (voxel-wise and region-wise neurodegeneration) over and above temporal smoothing (w_z) and image realism (w_b). The converse (favouring spatiotemporal similarity over progression modelling) can result in overfitting to individual morphology and poor generalization when the model is applied to new individuals. This is because generated images might be quite different from the training samples; in this case, the discriminator D_b would recognize them as unrealistic although they could potentially be real. This behaviour can drive the generator G to avoid creating these brain structures although they are plausible, which amounts to overfitting to individual morphology.
Although traditional initialization techniques can be used within our pipeline (i.e. Xavier), we found better performance when all the models start from a common initialization (red cloud in Fig. 4). In practice, this common initialization consists of a model pre-trained with only 100 iterations on central axial slices of the training images.
Once the PWFs are defined, we proceed by training our T = 95 DN_n models, each associated with one of the different slices n. The number of time points A for the age is fixed to 10. Output MRI with an intermediate age value between these fixed points are obtained by a weighted linear interpolation of the two closest MRI.
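The inter-bin interpolation can be sketched as a voxel-wise linear blend of the two closest generated MRI; the tiny "images" and ages below are hypothetical:

```python
# Sketch of the inter-bin interpolation: an output MRI at an age that
# falls between two of the A fixed time points is a weighted linear
# blend (voxel-wise) of the two closest generated MRI.
def interpolate_age(age, bin_ages, bin_images):
    """bin_ages: sorted bin centres; bin_images: per-bin voxel lists."""
    for lo in range(len(bin_ages) - 1):
        a0, a1 = bin_ages[lo], bin_ages[lo + 1]
        if a0 <= age <= a1:
            w = (age - a0) / (a1 - a0)
            return [(1 - w) * v0 + w * v1
                    for v0, v1 in zip(bin_images[lo], bin_images[lo + 1])]
    raise ValueError("age outside the modelled range")

imgs = [[1.0, 2.0], [3.0, 6.0]]  # hypothetical per-bin "images"
print(interpolate_age(72.5, [70.0, 75.0], imgs))  # [2.0, 4.0]
```

The blend weight is simply the normalized distance of the requested age from the lower bin centre, so outputs vary smoothly between the fixed time points.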
The architectures of each network E, G, D_b, D_z are based on the implementation proposed in [39]. The size of the latent space Z is fixed to 200. Each DN_n is trained using the same PWFs and the same training configuration, based on the stochastic gradient descent solver ADAM (α = 0.0002, β1 = 0.5). We stop the training procedure after 300 epochs; each iteration uses a random mini-batch of 100 slices of size 128×128 pixels.

Experiments and Results
In our experiments, we first compare the proposed solution against state-of-the-art approaches using real follow-up scans as ground truth. More specifically, we perform a qualitative assessment (Section 5.1) based on the evaluation of image realism and artefacts, complemented with quantitative analyses (Section 5.2) that measure the ability to generate MRI with accurate volumetric biomarkers. We then perform an ablation study (Section 5.3) involving different configurations of 4D-DANI-Net to assess the contributions of each component block. Qualitative and quantitative assessments for the ablation study are presented in Section 5.3.1 and Section 5.3.2, respectively. We also evaluate the visual quality of our synthetic images via an evaluation survey (Section 5.4) given to expert image readers, i.e., radiologists and neurologists. Finally, we present the computation time required for training and running our simulator in Section 5.5, and an experiment on model generalization using a new cohort in Section 5.6.
Coloured boxes in the results figure show: spatial discontinuity artefacts (yellow boxes) generated by unstable training; missing anatomical detail (red boxes) when super-resolution is not included; artefacts caused by super-resolution in the presence of spatial discontinuity artefacts (green boxes); and inaccurate morphology (blue boxes) in the ventricles when individualization is omitted from the model.

Qualitative Comparison Study
Here we compare our framework to the two state-of-the-art solutions available for MRI synthesis: i) the baseline DANI-Net obtained by independent training (and stacking) of 2D slice models [38]; and ii) the biomechanical approach proposed in [20], which required down-sampling of the MRI resolution (by a factor of 2) for computationally feasible training times, followed by re-scaling to the original resolution using bilinear interpolation. Figure 5 shows that our approach provides the best results: fewer artefacts and superior resolution (less smoothing). Notably, images generated by [20] show excessive smoothing, whereas images generated by [38] contain notable artefacts.

Quantitative Comparison Study
Here we quantify the ability of the proposed 4D-DANI-Net to synthesize MRI scans that yield accurate regional brain volumes, expressed as percentages of total brain volume (a standard approach to controlling for person-to-person variability in head size). Accuracy is reported as the mean and standard deviation of the absolute error between synthetic and real images, across all 170 test cases. This is achieved by applying brain segmentation to both real and synthetic MRI scans and computing volumes using the FSL library [47] for regions of interest relevant to ageing and Alzheimer's disease: left hippocampus, right hippocampus, peripheral grey matter, ventricular cerebrospinal fluid (CSF), total grey matter, and total white matter. The error for test subject p in brain region x is formulated as:

E(p, x) = | FSL(Y*_p, x) / FSL(Y*_p, tb) - FSL(Y_p, x) / FSL(Y_p, tb) |,

where Y*_p is the real follow-up (ground truth), Y_p is the simulated MRI, FSL(Y_p, x) is the estimated volume of region x in Y_p obtained with the FSL library, and FSL(Y_p, tb) is the corresponding total brain volume. Table 2 contains the results of the quantitative comparison of our full model against other methods: DANI-Net [38], and several regression-based methods that have been used as benchmarks for predicting biomarker trajectories [48]. Specifically, we consider a naive support vector regressor (SVR), a linear mixed-effects (LME) model, and two optimized regressor models, SVR* and LME*, in which 20% of outliers were removed. Note that the regression-based models are trained directly on extracted brain volumes, with gender and diagnosis as covariates; these regressor approaches are incapable of generating simulated images. For the LME model, we split the training set into four groups based on diagnosis, with age and gender treated as both random and fixed effects.
For the SVR model, we used the RBF kernel with hyper-parameters C=10 and coef0=0, and age, gender and diagnosis as predictive features.

Table 3: Quantitative ablation study: mean absolute error (± standard deviation) in predicted regional volumes of the brain, expressed as a percentage of total brain volume.
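The volumetric error metric defined above can be sketched in a few lines. This is a minimal illustration using NumPy, with toy binary masks standing in for the FSL segmentation outputs; the mask shapes and region names are illustrative assumptions, not the actual pipeline.

```python
import numpy as np

def regional_volume_pct(region_mask: np.ndarray, brain_mask: np.ndarray) -> float:
    """Regional volume as a percentage of total brain volume,
    mimicking FSL(Y, x) / FSL(Y, tb)."""
    return 100.0 * region_mask.sum() / brain_mask.sum()

def volume_error(real_region, real_brain, sim_region, sim_brain) -> float:
    """Absolute error between real and simulated regional volume percentages,
    i.e. E(p, x) from the formula above."""
    return abs(regional_volume_pct(real_region, real_brain)
               - regional_volume_pct(sim_region, sim_brain))

# Toy 3D masks: a 10x10x10 "brain" with a 10- vs 8-voxel "hippocampus".
brain = np.ones((10, 10, 10), dtype=bool)                      # 1000 voxels
real_hip = np.zeros_like(brain); real_hip[0, 0, :10] = True    # 10 voxels -> 1.0%
sim_hip = np.zeros_like(brain); sim_hip[0, 0, :8] = True       # 8 voxels -> 0.8%
print(round(volume_error(real_hip, brain, sim_hip, brain), 3)) # prints 0.2
```

In the real pipeline these masks would come from FSL segmentations of the registered scans rather than hand-built arrays.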
Apart from adapting the baseline DANI-Net [38] so that we could stack the individual slices together and obtain simulations on 3D MRI, we were unable to perform fair comparisons (at the same image resolution) against other simulators (e.g., [20]) due to the limitations presented in the introduction. Table 2 shows that the worst-performing method is the original DANI-Net [38], which is unsurprising because it was not designed for 3D MRI.
The best-performing method varies with brain region size. For large regions, 4D-DANI-Net (the proposed approach) has the highest accuracy by a considerable margin: the average reduction in error is 33.2% against SVR* and 33.0% against LME*. For small regions, SVR* and LME* slightly outperform 4D-DANI-Net. From this, we surmise that simple models are adequate for small regions but are less capable of capturing the complexity of neurodegeneration in larger regions.
In summary, from the comparison study, we can see that 4D-DANI-Net produces state-of-the-art performance for modelling neurodegeneration in ageing and Alzheimer's disease progression.

Ablation Study
In this section, we analyse the contribution of each component of our framework.
The configurations of 4D-DANI-Net considered in our ablation studies involve the basic model (denoted L*), obtained by independently training (then stacking together) MRI slice models, plus combinations of the 3D training consistency strategy (denoted TC, obtained when PWFs are used), the super-resolution block (denoted SR), and the transfer learning block (denoted TL). See Section 3 for details of each.
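As an illustration of this naming convention, the possible configurations can be enumerated from the three on/off component flags. This is a hypothetical sketch of the labelling scheme only; the experiments evaluate just the subset of configurations reported in the tables.

```python
from itertools import product

BASE = "L*"  # basic model: independently trained, stacked 2D slice models
COMPONENTS = ["TC", "SR", "TL"]  # training consistency, super-resolution, transfer learning

def config_name(flags) -> str:
    """Build a configuration label such as 'L*_TC_SR_TL' from on/off flags."""
    parts = [BASE] + [comp for comp, on in zip(COMPONENTS, flags) if on]
    return "_".join(parts)

# All 2^3 = 8 combinations, from the bare base model to the full configuration.
all_configs = [config_name(flags) for flags in product([False, True], repeat=3)]
print(all_configs)
```

The full model studied in the ablation corresponds to the label `L*_TC_SR_TL`.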

Qualitative Ablation Study
Our qualitative ablation study compared artefacts in synthetic images obtained by different configurations of 4D-DANI-Net for three representative test cases. Figure 6 shows that the full configuration L*_TC_SR_TL produces visually superior synthetic MRI, i.e., fewer artefacts in comparison to synthetic MRI obtained by the other configurations. In the approaches lacking 3D consistency constraints (L*_TL), the independent training of 2D slice-wise models leads to notable artefacts in the sagittal and coronal planes when networks do not converge (yellow boxes in Fig. 6). As intended, such issues are almost eliminated by our 3D training consistency strategy TC (L*_TC_TL and L*_TC_SR_TL configurations). When TC is used without SR, anatomical details are often not visible (red boxes in Fig. 6) and the images appear overly smooth. Conversely, when SR is used without TC, the super-resolution of artefacts introduces false structures (green boxes in Fig. 6). Disabling the transfer learning procedure TL (configuration L*_TC_SR) produces inaccurate morphology, i.e., excessive ventricular expansion, caused by the lack of individualization (blue boxes in Fig. 6).
For completeness, Fig. 1 shows an example of an entire simulation obtained using the full configuration of 4D-DANI-Net. The expected neurodegeneration is apparent in the sequence, including ventricular expansion, hippocampal contraction, and cortical thinning.

Quantitative Ablation Study
Table 3 contains the results of our quantitative ablation study, which shows that the full model (L*_TC_SR_TL) produces the lowest absolute error in brain volume. Our 3D training consistency strategy TC reduces errors considerably: when TC is added to L*_TL, errors are reduced by an average (mean) of 7.6%; when TC is added to the L*_SR_TL configuration, errors are reduced by an average of 3.5%. Our super-resolution strategy SR improves accuracy significantly. In fact, SR was the largest contributor to accuracy by a considerable margin, reducing errors by an average of 53.9% when used with TL, and by 58.8% when used with TL and TC. However, this last result also shows that super-resolution alone is not sufficient to maximize accuracy.
It is noteworthy that the absolute errors in grey matter (gm) and white matter (wm) are identical, but they are in fact opposite in sign (not shown). This indicates that the source of volumetric errors is concentrated around the grey matter/white matter boundary, which is probably due to the well-known phenomenon of partial volume effects [49].
Looking at the results of the baseline [38] in Table 2, we note that every configuration of 4D-DANI-Net outperforms [38]; even the simplest configuration, L*_TL, reduces errors by an average of 2.8%. This is as expected, since the baseline DANI-Net [38] is similar to the simplest configuration of 4D-DANI-Net (L*_TL), except that the latter optimizes some of the loss functions and therefore provides better accuracy. Table 4 summarizes the percentage improvement in accuracy (error reduction) obtained when a specific component of our framework is included in or excluded from the full configuration. Super-resolution provided the largest contribution (58.8%), followed by transfer learning (42.5%) and the proposed training consistency (3.6%).
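The error-reduction percentages reported here follow standard relative-change arithmetic. A minimal sketch, under the assumption that reductions are computed as the relative change in mean absolute error between two configurations:

```python
def error_reduction(err_before: float, err_after: float) -> float:
    """Percentage reduction in error when a component is added:
    positive values mean the added component improved accuracy."""
    return 100.0 * (err_before - err_after) / err_before

# Example: a component that halves the mean absolute error
# corresponds to a 50% error reduction.
print(error_reduction(0.50, 0.25))  # prints 50.0
```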
Finally, Table 5 shows the percentage improvement due to each term of our combined loss (L_tot). Temporal smoothing (L_Dz) provided the largest contribution (+43.5%), closely followed by the adversarial loss related to brain realism (L_Db, +40.7%), then the reconstruction error used to train the deep autoencoder (L_rec, +30.2%) and the disease progression modelling losses (L_reg and L_vox, +21.9%).
Our ablation studies cumulatively show that each component block and each loss term of 4D-DANI-Net improves performance in synthesizing long time series of personalised, high-resolution medical images with no discernible artefacts.

Radiological Assessment of Visual Perception and Disease Stage
Finally, expert image readers evaluated simulated images against real images in terms of perceived visual artefacts as well as diagnostic accuracy.
To do so, we performed a survey in which we recruited 21 participants (4 neurologists, 4 neuro-radiologists, 10 neuroimaging experts and 3 medical imaging researchers, with an average of 9 years of experience) and asked them to evaluate 22 randomly selected cases from the test set. Of these 22 subjects, 3 progressed to a different diagnosis by the follow-up scan, whereas the remaining 19 maintained their initial diagnosis.
We set up an online web application (see Fig. 7) that shows, for each of the 22 cases, a T1-weighted brain MRI of a patient. Below this scan, two more images labelled A and B are shown in random order to avoid any selection bias: one is the real follow-up MRI of the same subject; the other is the synthetic image generated from the initial MRI for the same follow-up interval. Each participant was asked to identify the simulated image in each of the 22 cases.
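The randomized A/B labelling described above can be sketched as follows. This is a minimal illustration with hypothetical file names, not the actual web application code.

```python
import random

def assign_ab(real_path: str, synthetic_path: str, rng: random.Random):
    """Randomly assign the real and synthetic follow-up images to labels
    'A' and 'B', recording which label hides the synthetic image."""
    pair = [("real", real_path), ("synthetic", synthetic_path)]
    rng.shuffle(pair)  # random order -> no positional bias toward A or B
    labels = {"A": pair[0], "B": pair[1]}
    answer = "A" if labels["A"][0] == "synthetic" else "B"
    return labels, answer

# Hypothetical case: shuffle one real/synthetic pair with a seeded RNG.
rng = random.Random(42)
labels, answer = assign_ab("case01_followup.nii", "case01_synthetic.nii", rng)
```

Keeping the ground-truth `answer` on the server side lets the survey score each participant's discrimination accuracy afterwards.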
Additionally, during the survey, the participants were asked to identify and classify possible visual differences using 2 severity scores (minor and major) and 3 categories (noise/texture, structural differences and unrealistic artefacts), and to assign a clinical diagnosis to both A and B in order to verify that there is no clinical inconsistency between real and synthetic images. The age and diagnosis at the baseline scan and the age at the follow-up scan were displayed to help the participants assign the correct diagnosis. Finally, for each case and each task in the survey, the participants were asked to provide a confidence score selected from a predefined list.

Figure 9: Results on the participants' visual perception obtained during our survey. The considered artefacts are divided into 2 severity scores (minor and major) and 3 categories (noise/texture, structural differences and unrealistic artefacts). Results for the simulated images are in orange; results for the real images are in blue.
The results of this survey are presented in Fig. 8 where each task is represented with a different colour: i) bars in grey depict the results related to the discrimination task (real vs simulated images), ii) bars in orange depict the results related to the diagnosis using simulated images and iii) bars in blue depict the results related to the diagnosis using real images.
The leftmost graph of Fig. 8 shows that participants could discriminate real from simulated images with an accuracy of 59.3%. Neurologists achieved the highest accuracy (76.4% ± 2.6%), followed by neuro-radiologists (68.0% ± 7.1%), neuroimaging experts (54.6% ± 18.6%) and medical imaging researchers (45.4% ± 6.4%).
These results show that even the most highly trained participants had some difficulty discriminating between synthetic and real images (the best individual score was 86.4%), and all participants together achieved only 59.3% accuracy, just slightly above the 50% chance level that would be expected if the two classes were indistinguishable.

Figure 10: Two cases of model generalization issues causing blurred simulated images, occurring when the test subject is not well represented in the training set.
Fig. 8 also shows that diagnostic accuracy using synthetic images is almost identical to that using the real follow-up (57.9% vs 56.8%), supporting the idea that our system is able to capture key aspects of disease progression.
In terms of the confidence scores for the discrimination task (second graph in Fig. 8), the majority of experts selected a low or medium confidence score, confirming once again that the images cannot be easily discriminated.
For the confidence scores related to the assignment of diagnosis (last two graphs in Fig. 8), responses are distributed roughly equally between the 'none' and 'medium' confidence scores, and we found no differences between the results on simulated images and real ones.
In Fig. 9 we report visual perception results from the expert survey. These show that the majority of artefacts in the simulated images are minor noise/texture artefacts (31.6%) and minor morphological (structural) differences (29.3%). Only 2.6% were minor unrealistic artefacts, 2.6% major texture artefacts, 1.9% major structural differences, and 0% major unrealistic artefacts. The figure also shows that the simulated images have a slightly higher occurrence of artefacts than the real images: the last column indicates that 30.5% of real images have at least one artefact, against 49.6% of the synthetic images.
In conclusion, the highlights from our survey are as follows:
• Simulated MRI scans contain minimal noise/texture artefacts and minor structural differences, approaching the levels of artefacts contained in real MRI scans.
• Simulated MRI scans are diagnostically indistinguishable from real MRI scans.
• Simulated and real MRI scans are not easy to discriminate (average accuracy 59.3%), although experienced neurologists and neuro-radiologists performed better (averaging 76.4% and 68.0%, respectively).

Training and Inference Time
Having a GPU cluster with roughly as many GPUs as there are slices in the MRI allowed us to train the slice-based models in parallel. In our case, with 50 NVIDIA GTX TITAN-X GPUs and 95 MRI slices, the total training time was approximately 3 days. Inference was much faster: on the same cluster, the computation time required to simulate disease progression for a single MRI (including the transfer learning step) was of the order of a few minutes.
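The per-slice parallelism can be sketched as a round-robin assignment of slice models to the available GPUs. Here `train_slice_model` is a placeholder for the actual training routine, which is an assumption on our part; in practice each worker would pin its process to the device returned by `gpu_for_slice`.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

N_GPUS, N_SLICES = 50, 95  # cluster size and number of MRI slices, as reported

def gpu_for_slice(slice_idx: int) -> int:
    """Round-robin assignment of slice models to available GPUs."""
    return slice_idx % N_GPUS

def train_slice_model(slice_idx: int) -> tuple:
    """Placeholder for training one 2D slice model on its assigned GPU."""
    return slice_idx, gpu_for_slice(slice_idx)

# Dispatch all slice models; with more GPUs than workers here, this is
# only a scheduling sketch, not a performance claim.
with ThreadPoolExecutor(max_workers=8) as pool:
    assignments = dict(pool.map(train_slice_model, range(N_SLICES)))

# With 95 slices over 50 GPUs, each GPU trains at most ceil(95/50) = 2 models.
load = Counter(assignments.values())
print(max(load.values()))  # prints 2
```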

Model Generalization
The design of our system allows simulating MRI scans from a new dataset without the need to retrain the system. However, for decent generalization, the current implementation of our framework requires the same image modality (i.e., T1w MRI) at a similar image resolution (i.e., 1mm isotropic). By design, the normalization step makes our model quite robust to changes in scanner type or in the pre-processing pipeline. We tested this on the OASIS-3 cohort [50], selecting all subjects diagnosed as CN or AD, aged between 60 and 85, and having at least one follow-up 3 years after the baseline scan. After excluding all MRI scans where template registration or brain segmentation failed, we were able to evaluate 166 new subjects from this cohort. The results on this new dataset are presented in Table 6.
Our results show that the model generalizes well to new data, with only slightly reduced performance on average. Generally speaking, AI model generalization is known to be challenging, particularly when the training set is not fully representative of the target distribution. Indeed, reduced performance is expected here due to cohort differences, e.g., in the distributions of age and follow-up duration, which differed between the ADNI and OASIS-3 datasets. We found that performance declined most for large regions of the brain, such as the cortical surface, which is well known to vary considerably between individuals, suggesting that personalization is the most challenging aspect of model generalization here. To illustrate this, Fig. 10 shows two cases where the model did not generalize well: a moderate drop in image quality is evident in large regions of the obtained images. In both cases, the images are blurred due to the low number of training individuals aged 60-62. Despite the performance reduction in such outlier cases, our model performed well in most cases.
Lastly, our data-driven framework offers further potential to build generative models of other medical imaging modalities, e.g., tau PET in Alzheimer's disease. This would require some methodological work such as modifying the biological constraints, used here to model neurodegeneration, to instead generate tau PET signal (for example).

Conclusion and Future Work
The aim of our system is to produce a "digital twin" of the brain that can inform disease understanding and clinical decisions by predicting future evolution. Key clinical applications include: i) supporting earlier diagnosis by predicting the future brain appearance of a specific subject; and ii) providing a personalised, virtual placebo for clinical trials. More specifically, for the latter application, our experiments quantified the accuracy of future predictions in untreated individuals, which provides a practical baseline accuracy (with confidence intervals) for using our system as a virtual placebo: any treatment effect would have to exceed the confidence interval of the virtual placebo to be considered effective. The concept of virtual placebos proposed here could potentially revolutionise future clinical trials, since it would avoid recruiting real placebo groups and cut the total cost of a trial. In conclusion, our long-term vision for 4D-DANI-Net is a fully automatic system that can assess experimental treatments at the individual level, as well as suggest the right dose to minimise side effects [51].
In summary, in this paper, we presented a deep learning framework for brain image simulation in neurodegeneration, called 4D-DANI-Net, and demonstrated it in one of the biggest challenges of 21st-century healthcare: ageing and Alzheimer's disease. In particular, our work addresses a key gap in AI-enabled healthcare: generation of realistic and accurate synthetic medical images for model validation.
Current state-of-the-art MRI simulators suffer from three key limitations: i) lack of individualization; ii) poor image resolution; and iii) restriction to 2D images. These have precluded full 4D simulation of realistic and accurate high-resolution medical images, until now.
We addressed these limitations by introducing three memory-efficient components. First, the proposed profile weight functions control system instability; although the parameters obtained in this work are specific to this task, we believe the PWF strategy can be a valid solution in many complex systems that suffer instability caused by simultaneously optimizing multiple adversarial networks. This engineering contribution is important for stabilizing deep learning architectures with multiple networks, which are particularly complex in medical imaging and therefore inherently unstable. Second, the 3D super-resolution block overcomes the low-image-resolution limitation. Third, a new transfer learning strategy allowed us to personalise the synthetic images for each individual.
We used quantitative and qualitative experiments to demonstrate the importance of each component of our pipeline and also compared our full framework against baseline models.
We see multiple exciting avenues for future work. Firstly, our framework can accommodate more advanced models of neurodegenerative disease progression and ageing, e.g., by conditioning on other factors such as demographics, lifestyle, and phenotype/genotype information for personalised medicine. This idea may be extended to investigate and test hypotheses about neurodegenerative disease mechanisms in a uniquely deep manner, which may support global efforts, so far unsuccessful, to develop effective treatments. Finally, and most importantly, our modular system can generalise beyond MRI and brain diseases to other medical imaging modalities, diseases, and organs of the body.