RobSense: A Robust Multi-modal Foundation Model for Remote Sensing with Static, Temporal, and Incomplete Data Adaptability

CVPR 2025
Minh Kha Do, Kang Han, Phu Lai, Khoa T. Phan, Wei Xiang
La Trobe University

Overview

We propose RobSense, a robust multi-modal foundation model designed for multi-spectral (MS) and Synthetic Aperture Radar (SAR) data. RobSense:

  • Supports diverse input types, ranging from static to temporal data, and from uni-modal to multi-modal formats
  • Handles incomplete data, including missing spectral bands and irregularities in temporal sequences (see the input sketch below)
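
RobSense's exact input interface is not shown on this page; the minimal Python sketch below illustrates one way such incomplete inputs could be represented, using explicit validity masks for missing bands and unobserved timesteps. All tensor shapes and field names here are illustrative assumptions, not the paper's API.

import torch

# Hypothetical shapes (not from the paper):
#   MS  : (T, 13, H, W)  Sentinel-2 sequence with 13 spectral bands
#   SAR : (T, 2, H, W)   Sentinel-1 sequence with VV/VH polarisations
T, H, W = 4, 128, 128
ms, sar = torch.randn(T, 13, H, W), torch.randn(T, 2, H, W)

# Explicit validity masks make incompleteness visible to the model
# instead of silently zero-filling missing observations.
band_mask = torch.ones(T, 13, dtype=torch.bool)  # which MS bands are present per timestep
time_mask = torch.ones(T, dtype=torch.bool)      # which timesteps were observed at all
band_mask[:, [10, 11]] = False                   # e.g. two bands missing in every acquisition
time_mask[2] = False                             # e.g. one acquisition missing from the sequence

sample = {
    "ms": ms, "sar": sar,
    "ms_band_mask": band_mask, "time_mask": time_mask,
    # Irregular temporal spacing is kept as raw timestamps (days since first acquisition).
    "timestamps": torch.tensor([0, 12, 31, 47]),
}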

How is RobSense pre-trained?

Pre-training dataset

RobSense is pre-trained on Satlas, a large-scale global dataset comprising approximately 12 million images from the Sentinel-1 and Sentinel-2 satellites. Figure 1 illustrates its worldwide geographic coverage.

Figure 1: Geographic coverage of RobSense's pre-training dataset, Satlas, which spans all continents except Antarctica (image adapted from the Satlas publication).

Pre-training workflow

Figure 2: Illustration of RobSense pre-training, which consists of two training phases: Temporal Multi-modal Learning, which trains the MS/SAR encoders, the multi-modal encoder, and the multi-modal decoder; and Latent Reconstruction Learning, which trains the latent MS/SAR reconstructors. TSD stands for Time-specific Distribution.
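
The precise objectives live in the paper; the sketch below only schematises the two-phase schedule under a masked-autoencoding reading of the figure. Every module is a toy nn.Linear stand-in, and the assumption that each latent reconstructor predicts one modality's latent from the other is ours, not a confirmed detail of the method.

import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy stand-ins for the modules named in the figure; real architectures are the paper's.
ms_encoder, sar_encoder = nn.Linear(13, 64), nn.Linear(2, 64)
mm_encoder, mm_decoder = nn.Linear(128, 64), nn.Linear(64, 15)
ms_reconstructor, sar_reconstructor = nn.Linear(64, 64), nn.Linear(64, 64)

def phase1_step(ms_tokens, sar_tokens):
    """Temporal Multi-modal Learning: encode both modalities jointly and
    reconstruct the inputs through the multi-modal decoder
    (token masking omitted here for brevity)."""
    z = mm_encoder(torch.cat([ms_encoder(ms_tokens), sar_encoder(sar_tokens)], dim=-1))
    return F.mse_loss(mm_decoder(z), torch.cat([ms_tokens, sar_tokens], dim=-1))

def phase2_step(ms_tokens, sar_tokens):
    """Latent Reconstruction Learning (our reading): train each reconstructor to
    predict one modality's latent from the other, so a missing modality can be
    recovered in latent space downstream. Phase-1 encoders are frozen here."""
    with torch.no_grad():
        z_ms, z_sar = ms_encoder(ms_tokens), sar_encoder(sar_tokens)
    return F.mse_loss(ms_reconstructor(z_sar), z_ms) + F.mse_loss(sar_reconstructor(z_ms), z_sar)

# Example steps on random token batches of shape (batch, tokens, channels):
loss1 = phase1_step(torch.randn(2, 16, 13), torch.randn(2, 16, 2))
loss2 = phase2_step(torch.randn(2, 16, 13), torch.randn(2, 16, 2))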

Fine-tuning

Figure 3: By combining different modules, the model accepts either static (sta) or sequential (seq) inputs and supports both complete (solid lines) and incomplete (dashed lines) inputs.
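
As a concrete illustration of that routing, here is a hedged sketch reusing the toy stand-ins from the pre-training sketch above: when a modality is absent, its latent is recovered through the corresponding reconstructor (the dashed path). The dispatch logic is our guess at the mechanism, not the released code.

import torch
import torch.nn as nn

# Toy stand-ins, as in the pre-training sketch; the real modules come pre-trained.
ms_encoder, sar_encoder = nn.Linear(13, 64), nn.Linear(2, 64)
ms_reconstructor, sar_reconstructor = nn.Linear(64, 64), nn.Linear(64, 64)
mm_encoder = nn.Linear(128, 64)

def encode(sample: dict) -> torch.Tensor:
    """Route a sample through whichever modules its inputs require (illustrative)."""
    z_ms = ms_encoder(sample["ms"]) if sample.get("ms") is not None else None
    z_sar = sar_encoder(sample["sar"]) if sample.get("sar") is not None else None
    if z_ms is None:   # MS missing: recover its latent from SAR (dashed path)
        z_ms = ms_reconstructor(z_sar)
    if z_sar is None:  # SAR missing: recover its latent from MS (dashed path)
        z_sar = sar_reconstructor(z_ms)
    return mm_encoder(torch.cat([z_ms, z_sar], dim=-1))

# Complete multi-modal input vs. incomplete input with SAR absent:
feat_complete = encode({"ms": torch.randn(1, 16, 13), "sar": torch.randn(1, 16, 2)})
feat_partial = encode({"ms": torch.randn(1, 16, 13), "sar": None})

# Static vs. sequential inputs would feed single-date vs. temporal token sequences
# through the same path; temporal aggregation is omitted here for brevity.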

Qualitative Results

Segmentation results

Quantitative Results

Segmentation results (mIoU ↑) on the Satlas dataset

Classification results (mAP ↑) on the BigEarthNet dataset

Comparison of foundation models fine-tuned on datasets with diverse input types, from uni-modal to temporal (T-) multi-modal data, across varying missing rates. Rand. indicates a random missing rate applied to each sequence.
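
For reference, here is a small sketch of how a random per-sequence missing rate (the Rand. setting) could be simulated when building an evaluation set. Whether masking applies to timesteps, bands, or both in the paper's protocol is not stated on this page, so dropping whole timesteps here is an assumption.

import torch

def drop_timesteps(seq: torch.Tensor, rate: float) -> tuple[torch.Tensor, torch.Tensor]:
    """Zero out roughly `rate` of the timesteps in a (T, C, H, W) sequence and
    return the corrupted sequence plus its validity mask (illustrative protocol)."""
    T = seq.shape[0]
    keep = torch.rand(T) >= rate
    keep[torch.randint(T, (1,))] = True  # always keep at least one observation
    return seq * keep.view(T, 1, 1, 1), keep

# "Rand.": each sequence draws its own missing rate uniformly from [0, 1).
seq = torch.randn(6, 13, 64, 64)
corrupted, valid = drop_timesteps(seq, rate=torch.rand(()).item())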

Related Work

Favyen Bastani et al. SatlasPretrain: A large-scale dataset for remote sensing image understanding. ICCV 2023.

Anthony Fuller et al. CROMA: Remote Sensing Representations with Contrastive Radar-Optical Masked Autoencoders. NeurIPS 2023.

Mubashir Noman et al. Rethinking transformers pre-training for multi-spectral satellite imagery. CVPR 2024.

Danfeng Hong et al. SpectralGPT: Spectral remote sensing foundation model. IEEE TPAMI 2024.

BibTeX


@inproceedings{robsense2025,
  title={RobSense: A Robust Multi-modal Foundation Model for Remote Sensing with Static, Temporal, and Incomplete Data Adaptability},
  author={Minh Kha Do and Kang Han and Phu Lai and Khoa T. Phan and Wei Xiang},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2025},
}