Spectrally Distilled Representations Aligned with Instruction-Augmented LLMs for Satellite Imagery

CVPR 2026
Minh Kha Do1 Wei Xiang1 Kang Han1 Di Wu1 Khoa Phan1 Yi-Ping Phoebe Chen1 Gaowen Liu2 Ramana Rao Kompella2
1. La Trobe University
2. Cisco Research

Overview

We propose SATtxt, a spectrum-aware vision-language foundation model (VLFM) for satellite imagery that leverages spectral priors while operating exclusively on RGB inputs at inference:

  • Spectral Representation Distillation (SRD), which transfers multi-spectral priors into an RGB-based representation space, enabling spectrum-aware reasoning without multi-spectral inputs at inference
  • Spectrally Grounded Alignment with Instruction-Augmented LLMs (SGI-LLM), an alignment stage that bridges spectrally distilled visual representations into the space of instruction-augmented LLM embeddings via lightweight projectors, thereby producing spectrally grounded and semantically expressive cross-modal representations
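To make the first stage concrete, here is a minimal PyTorch sketch of spectral representation distillation. It is an illustration under assumptions, not the paper's implementation: the tiny linear encoders, feature dimensions, and the cosine matching objective are all hypothetical stand-ins for the actual (ViT-scale) encoders and distillation loss.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical stand-in encoders; the real model uses far larger networks.
class TinyEncoder(nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.net = nn.Linear(in_dim, out_dim)
    def forward(self, x):
        return self.net(x)

rgb_dim, ms_dim, feat_dim = 3 * 16, 13 * 16, 32  # toy flattened patch sizes

rgb_encoder = TinyEncoder(rgb_dim, feat_dim)       # trainable RGB student
vision_projector = nn.Linear(feat_dim, feat_dim)   # trainable projector
ms_teacher = TinyEncoder(ms_dim, feat_dim).eval()  # frozen multi-spectral teacher
for p in ms_teacher.parameters():
    p.requires_grad_(False)

def srd_loss(rgb_patches, ms_patches):
    """Distillation: projected RGB features should match frozen MS teacher features."""
    student = vision_projector(rgb_encoder(rgb_patches))
    with torch.no_grad():
        teacher = ms_teacher(ms_patches)
    # Cosine matching here; an MSE feature-reconstruction loss is another common choice.
    return 1 - F.cosine_similarity(student, teacher, dim=-1).mean()

rgb = torch.randn(8, rgb_dim)  # RGB bands only
ms = torch.randn(8, ms_dim)    # all Sentinel-2 bands
loss = srd_loss(rgb, ms)
loss.backward()  # gradients flow only into the RGB encoder and projector
```

After this stage, the multi-spectral teacher can be discarded: the RGB encoder plus projector carries the distilled spectral prior, which is what allows RGB-only inference.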

How is SATtxt pre-trained?

Pre-training dataset

SATtxt is pre-trained on SSL4EO-S12 v1.1 with captions obtained from LLaMA3-SSL4EO-S12-v1.1-captions, a large-scale global dataset comprising approximately 1 million images from the Sentinel-2 satellite. Figure 1 illustrates its worldwide geographic coverage.

Figure 1: Geographic coverage of SATtxt's pre-training dataset, SSL4EO-S12 v1.1 (training samples in green, validation samples in magenta). Image adapted from the SSL4EO-S12 v1.1 publication.

Pre-training workflow

Illustration of SATtxt's two-stage pre-training pipeline. Stage 1 (SRD, dashed lines): a vision projector is trained to reconstruct multi-spectral representations from an RGB encoder by distilling a frozen MS teacher, transferring spectral knowledge so that MS inputs are unnecessary in Stage 2 and at inference. Stage 2 (SGI-LLM, solid lines): with the vision and text encoders frozen, distilled vision features are aligned with LLM-based text embeddings using instruction-augmented prompts, enhancing cross-modal representations while preserving pretrained capabilities.
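The Stage 2 alignment can be sketched as a CLIP-style symmetric contrastive objective between projected vision features and frozen LLM caption embeddings. This is a hedged illustration: the dimensions, the two linear projectors, and the InfoNCE loss below are assumptions standing in for the paper's lightweight projectors and actual alignment objective.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical dimensions; in Stage 2 the encoders and LLM are frozen,
# so only the two lightweight projectors receive gradients.
vis_dim, llm_dim, shared_dim, batch = 32, 64, 16, 8

vision_proj = nn.Linear(vis_dim, shared_dim)  # trainable
text_proj = nn.Linear(llm_dim, shared_dim)    # trainable

def alignment_loss(vis_feats, llm_embeds, temperature=0.07):
    """Symmetric InfoNCE aligning image features with LLM text embeddings."""
    v = F.normalize(vision_proj(vis_feats), dim=-1)
    t = F.normalize(text_proj(llm_embeds), dim=-1)
    logits = v @ t.T / temperature          # pairwise similarity matrix
    labels = torch.arange(v.size(0))        # matched pairs lie on the diagonal
    return (F.cross_entropy(logits, labels)
            + F.cross_entropy(logits.T, labels)) / 2

# Frozen upstream outputs are simulated with random tensors here.
vis_feats = torch.randn(batch, vis_dim)   # distilled RGB features (frozen encoder)
llm_embeds = torch.randn(batch, llm_dim)  # instruction-augmented caption embeddings
loss = alignment_loss(vis_feats, llm_embeds)
```

Because the upstream encoders stay frozen, only the small projectors are updated, which is what lets this stage add cross-modal grounding without eroding the pretrained representations.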

Related Work

Clive Tinashe Marimo et al. Beyond the Visible: Multispectral Vision-Language Learning for Earth Observation. ECML PKDD 2025

Johannes Jakubik et al. TerraMind: Large-Scale Generative Multimodality for Earth Observation. ICCV 2025

Danfeng Hong et al. SpectralGPT: Spectral Remote Sensing Foundation Model. IEEE TPAMI 2024

BibTeX


@inproceedings{sattxt2026,
  title={Spectrally Distilled Representations Aligned with Instruction-Augmented LLMs for Satellite Imagery},
  author={Minh Kha Do and Wei Xiang and Kang Han and Di Wu and Khoa Phan and Yi-Ping Phoebe Chen and Gaowen Liu and Ramana Rao Kompella},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2026},
}