Experiment title
: Multisource multiscale AI fusion framework for observationally constrained global climate datasets
Summary
: This project will leverage recent advances in artificial intelligence architectures to develop a multisource, multiscale fusion framework for building observationally constrained global climate datasets. Using incoming solar radiation at the surface as a first application, the framework will combine reanalysis backbones, satellite observations, and in-situ measurements into a single, consistent global product. The framework will be designed to be transferable to other climate and weather variables, making it broadly useful to the climate and weather research community.
Scientific motivation:
Many climate and weather applications require datasets that are simultaneously spatially complete, temporally consistent, and observationally accurate, yet no single data source satisfies all of these requirements. Numerical models and reanalysis products provide global, spatially complete coverage and long-term continuity, but their representation of surface fluxes is affected by systematic biases and structural model errors. Satellite-based products generally capture more realistic spatial patterns and variability, but they suffer from temporal gaps, shorter records, and retrieval uncertainties. In-situ observations, such as ground-based radiation measurements, provide the highest accuracy and are essential for defining observational truth, but they are expensive to maintain and spatially sparse, limiting their direct use for global analyses. Developing a fusion framework that can learn from the accuracy of in-situ observations while retaining the spatial completeness and temporal continuity of model and reanalysis data would therefore provide substantial scientific and practical benefits. Recent AI architectures now make it possible to combine these heterogeneous data sources in a physically consistent way, enabling the construction of global datasets that are constrained by observations while remaining spatially complete.
Experiment Name
: AI-enabled multisource, multiscale fusion for observationally constrained global surface solar radiation datasets.
People
: Sanaa Hobeichi, Hoaran Li, and Husnain Asif. Additional collaborators may be invited to join as specific expertise is required at different stages of the project.
Technical support requirements:
The project will require technical support from ACCESS-NRI to enable and maintain a suitable GPU-enabled software environment. In particular, the existing analysis3_edge-25.07 environment will need to be updated to support additional AI and geoscience packages required for model development, training, inference, and evaluation.
Model: The project will explore modern AI fusion architectures designed for multisource climate data that natively handle missing inputs and differing spatial scales, while capturing both spatial and temporal dependencies. Candidate approaches include multi-encoder convolutional and encoder-decoder (U-Net style) networks for multiscale spatial feature extraction, augmented where appropriate with temporal components to represent short-term dependencies in daily data (for example using ConvLSTM layers). The architectures will support explicit masking of unavailable data and may incorporate attention or gating mechanisms to adaptively balance information from reanalysis, satellite, and in-situ sources. Where appropriate, the framework will also include mechanisms for estimating uncertainty in the final product.
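As a minimal illustration of the explicit masking and gating idea described above, the sketch below fuses several gridded sources with a softmax gate that is restricted to the sources available at each grid cell. It is not the project's architecture. The function name `masked_gated_fusion`, the array shapes, and the gate scores (which a trained network would predict) are assumptions made for illustration only.

```python
import numpy as np

def masked_gated_fusion(sources, scores):
    """Fuse gridded sources with a softmax gate restricted to valid inputs.

    sources: array (n_sources, ny, nx); NaN marks unavailable data.
    scores:  array (n_sources, ny, nx) of gate logits (hypothetical; a
             trained network would predict these).
    Returns the fused field (ny, nx); NaN where no source is available.
    """
    sources = np.asarray(sources, dtype=float)
    scores = np.asarray(scores, dtype=float)
    valid = np.isfinite(sources)

    # Unavailable sources get a logit of -inf so their weight is exactly 0.
    masked_scores = np.where(valid, scores, -np.inf)

    # Numerically stable softmax over the source axis.
    peak = masked_scores.max(axis=0, keepdims=True)
    peak = np.where(np.isfinite(peak), peak, 0.0)  # cells with no valid source
    weights = np.exp(masked_scores - peak)
    total = weights.sum(axis=0, keepdims=True)
    weights = np.divide(weights, total, out=np.zeros_like(weights), where=total > 0)

    fused = np.sum(weights * np.where(valid, sources, 0.0), axis=0)
    return np.where(valid.any(axis=0), fused, np.nan)

# Toy example: the satellite value is missing in the second cell.
reanalysis = np.array([[200.0, 210.0]])
satellite = np.array([[190.0, np.nan]])
stack = np.stack([reanalysis, satellite])
equal_scores = np.zeros_like(stack)  # equal logits give equal weights
# 195 where both sources are valid, 210 where only the reanalysis is.
print(masked_gated_fusion(stack, equal_scores))
```

Setting the logits of missing inputs to minus infinity means the weights of the remaining sources still sum to one, so no imputation of the gaps is needed before fusion.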
Configuration: AI model configuration will be informed by iterative experimentation. Final model choices will be determined based on experimental evaluation of accuracy, robustness, and physical consistency.
Initial conditions:
Run plan: The project will be structured in three phases:
- Q1: Data processing and preparation, including harmonisation, regridding to a common spatial and temporal framework, and exploration of alternative preprocessing and masking strategies.
- Q2: Model development and experimentation, testing different AI fusion architectures and training strategies.
- Q3: Model evaluation and validation, synthesis of results, and finalisation of the production dataset.
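The regridding and masking step in Q1 could look something like the sketch below, which block-averages a finer regular latitude-longitude field onto a coarser grid while ignoring missing values and returning the fraction of valid area in each coarse cell. This is a simplified stand-in, not the project's chosen method: the function name `coarsen_with_mask`, the integer coarsening factor, and the cos(latitude) area weighting are all assumptions.

```python
import numpy as np

def coarsen_with_mask(field, lat, factor):
    """Block-average a regular lat-lon field by an integer factor.

    field:  array (ny, nx) with NaN for missing data.
    lat:    array (ny,) of cell-centre latitudes in degrees.
    factor: number of fine cells per coarse cell along each axis
            (assumes ny and nx are multiples of factor).
    Returns (coarse, valid_fraction), both (ny // factor, nx // factor).
    """
    field = np.asarray(field, dtype=float)
    ny, nx = field.shape
    # cos(latitude) approximates relative cell area on a regular grid.
    area = np.cos(np.deg2rad(lat))[:, None] * np.ones((1, nx))
    valid = np.isfinite(field)

    def blocks(a):
        # Reshape so each coarse cell's fine cells sit on axes 1 and 3.
        return a.reshape(ny // factor, factor, nx // factor, factor)

    weight_valid = blocks(area * valid).sum(axis=(1, 3))
    weight_all = blocks(area).sum(axis=(1, 3))
    weighted_sum = blocks(area * np.where(valid, field, 0.0)).sum(axis=(1, 3))

    # Area-weighted mean of the valid cells; NaN where a block has none.
    coarse = np.divide(weighted_sum, weight_valid,
                       out=np.full_like(weighted_sum, np.nan),
                       where=weight_valid > 0)
    return coarse, weight_valid / weight_all
```

The returned valid fraction could be passed to a model as an explicit mask channel, or thresholded to decide when a coarse cell is treated as missing.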
Simulation details:
This project involves AI model training and inference.
Input data (daily): ERA5, MERRA-2, CERES (where available), in-situ BSRN constraints.
Output: bias-corrected surface incoming solar radiation with uncertainty field.
Total KSUs required
: Approximately 310 KSUs: an estimated 100 KSUs in Q1 for data processing and initial experiments, 200 KSUs in Q2 for model training and fine-tuning, and a further 10 KSUs in Q3.
Total storage required
: Up to 20 TB during Q1 and Q2 to support intermediate processed datasets, model checkpoints, and experimental outputs. Storage requirements are expected to fall to approximately 5 TB or less in Q3, once intermediate training data are removed and only final products and essential diagnostics are retained. All project data and storage allocations on nm47 will be fully removed by Q4, following transfer of the final product to its long-term host project.
Storage lifetime
: A substantial fraction of the storage allocation will hold intermediate, processed datasets generated specifically for training and running the AI fusion model. These intermediate products are required only for the duration of the project and will be deleted at project completion (end of Q3). All preprocessing and model code required to reproduce the dataset from the raw inputs will be retained to support reproducibility; the raw inputs themselves are currently available on NCI and/or freely accessible online for research use.
Long term data plan
: Only the final data product will be preserved beyond the lifetime of the project. This final dataset will be transferred to and managed under projects owned by the Centre of Excellence for the Weather of the 21st Century, and will be included in the NCI data collection. The dataset will be made accessible through multiple access pathways, and a dedicated entry on Research Data Australia will be created to facilitate discovery and access by the broader research community (see Datasets in ‘Related articles’ for examples). This approach ensures long-term stewardship, accessibility, and reuse of the data.
Outputs:
- Dataset: Daily, observationally constrained global incoming surface solar radiation dataset on a 0.25° grid for 1980-2025.
- Code and model artefacts: The preprocessing, training, and inference scripts, together with the trained AI model weights used to generate the dataset, will be made available via a dedicated project GitHub repository (and associated release assets where appropriate). This will support reproducibility and reuse of the framework for other variables and applications.
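For a rough sense of the size of the final product, the back-of-envelope calculation below estimates its uncompressed volume. The grid dimensions (a cell-centred 720 x 1440 layout for 0.25°), the float32 storage type, the absence of compression, and the inclusion of exactly one uncertainty field alongside the radiation estimate are assumptions, not specifications from this proposal.

```python
from datetime import date

# Assumptions (not from the proposal): cell-centred 720 x 1440 global grid,
# float32 values, no compression, inclusive range 1980-01-01 to 2025-12-31.
N_LAT, N_LON = 720, 1440
BYTES_PER_VALUE = 4
N_FIELDS = 2  # radiation estimate plus one uncertainty field

n_days = (date(2025, 12, 31) - date(1980, 1, 1)).days + 1
total_bytes = n_days * N_LAT * N_LON * BYTES_PER_VALUE * N_FIELDS

print(f"{n_days} daily time steps")               # 16802 daily time steps
print(f"{total_bytes / 1e9:.1f} GB uncompressed")  # 139.4 GB uncompressed
```

Under these assumptions the two fields total roughly 139 GB before compression, well within the approximately 5 TB anticipated for Q3.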
Restarts:
Related articles:
These articles describe statistically rigorous data-fusion frameworks that integrate satellite, reanalysis, and in-situ observations to produce observationally constrained estimates of water and energy fluxes, providing the conceptual basis for the present effort to develop an AI-driven climate data fusion model.
Hobeichi S; Abramowitz G; Evans J, 2020, ‘Conserving land-atmosphere synthesis suite (CLASS)’, Journal of Climate, 33, pp. 1821 - 1844, http://dx.doi.org/10.1175/JCLI-D-19-0036.1
- Dataset: Hobeichi, S. et al. (2019). Conserving Land-Atmosphere Synthesis Suite (CLASS) v1.1. NCI National Research Data Collection.
Hobeichi S; Abramowitz G; Evans J; Beck HE, 2019, ‘Linear Optimal Runoff Aggregate (LORA): A global gridded synthesis runoff product’, Hydrology and Earth System Sciences, 23, pp. 851 - 870, http://dx.doi.org/10.5194/hess-23-851-2019
- Dataset: Hobeichi, S. et al. (2018). Linear Optimal Runoff Aggregate (LORA) v1.0. NCI National Research Data Collection.
Hobeichi S; Abramowitz G; Evans J; Ukkola A, 2018, ‘Derived Optimal Linear Combination Evapotranspiration (DOLCE): A global gridded synthesis ET estimate’, Hydrology and Earth System Sciences, 22, pp. 1317 - 1336, http://dx.doi.org/10.5194/hess-22-1317-2018
- Dataset: Hobeichi, S. et al. (2021). Derived Optimal Linear Combination Evapotranspiration - DOLCE v3.0. NCI National Research Data Collection.
Analysis:
Conclusion:
This project will deliver a daily, observationally constrained global incoming surface solar radiation dataset for 1980-2025 that is spatially complete, physically consistent, and anchored to high-quality observations. The resulting dataset will provide a valuable benchmark for evaluating historical climate simulations, reanalyses, and land–atmosphere model experiments. It will be particularly useful for the renewable energy sector, where accurate and temporally consistent representations of surface solar availability are critical for resource assessment, long-term variability analysis, and the evaluation of solar energy models.
Additionally, the project will develop a flexible AI-based fusion framework that can be transferred to other climate and weather variables, enabling the creation of observationally constrained global or regional datasets that retain the spatial completeness of model-based products while benefiting from the accuracy of observational data.
Together, these outcomes provide a significant contribution to both climate data development and the broader community’s capacity to integrate heterogeneous Earth system observations.