Akashah Shabbir, Mohammed Zumri, Mohammed Bennamoun, Fahad Shahbaz Khan, Salman Khan
Mohamed bin Zayed University of Artificial Intelligence, The University of Western Australia, Linköping University, Australian National University
- 📦 Code, model checkpoints, and dataset will be released soon.
- Jan-24-2025: Technical report of the GeoPixel paper is released on [arXiv](https://arxiv.org/abs/2501.13925). 🔥🔥
GeoPixel is the first large multimodal model explicitly designed for high-resolution remote sensing (RS) image comprehension and pixel-level grounding. The model processes natural language user queries together with RS imagery to generate detailed outputs, incorporating interleaved masks that adapt dynamically to the spatial resolution and complexity of the input.
- We present GeoPixel, a pixel grounding Large Multimodal Model optimized for high-resolution remote sensing image comprehension. It features adaptive image partitioning into local and global regions, enabling efficient processing of resolutions up to 4K in any aspect ratio.
- A richly annotated dataset, GeoPixelD, is created to support Remote Sensing Grounded Conversation Generation (RS-GCG). This dataset combines scene-level context and object-level details through a scalable annotation pipeline that uses advanced visual prompting designed for RS imagery.
- A detailed evaluation benchmark is provided, containing 5,427 validated referring-expression-mask pairs and 61,384 annotated objects. With detailed descriptions averaging 647 characters, the benchmark establishes a standard for testing the fine-grained understanding and generation capabilities of remote sensing models.
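The adaptive partitioning mentioned above can be sketched as a grid search over tiling layouts whose aspect ratio best matches the input image. This is a minimal illustrative sketch, not the actual GeoPixel implementation; the tile size (336 px) and tile budget are assumptions:

```python
def adaptive_partition(width, height, tile=336, max_tiles=12):
    """Pick a rows x cols grid of local tiles whose aspect ratio best
    matches the image, alongside one downsampled global view.
    (Hypothetical parameters; the paper's exact scheme may differ.)"""
    best, best_err = (1, 1), float("inf")
    for rows in range(1, max_tiles + 1):
        for cols in range(1, max_tiles // rows + 1):
            err = abs(cols / rows - width / height)
            if err < best_err:
                best, best_err = (rows, cols), err
    rows, cols = best
    # local crops cover the full image; the global view is a single tile
    return {"grid": (rows, cols), "local": rows * cols, "global": 1,
            "canvas": (cols * tile, rows * tile)}

# A 4000x3000 (4:3) image maps to a 3x4 grid of local tiles plus one global view.
print(adaptive_partition(4000, 3000))
```

Matching the grid to the native aspect ratio is what lets inputs up to 4K be handled without distorting or cropping them to a fixed square.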
GeoPixel is fundamentally composed of five key blocks: (1) Adaptive Image Divider, (2) Vision Encoder, (3) Large Language Model, (4) Grounding Vision Encoder, and (5) Pixel Decoder. These modules are seamlessly integrated to facilitate high-resolution visual perception, fine-grained semantic interpretation, and precise pixel-level grounding of Remote Sensing (RS) imagery.
We propose a semi-automatic annotation pipeline for creating a remote sensing grounded conversation generation (RS-GCG) dataset. It employs a multi-level hierarchical strategy that includes holistic scene descriptions, individual instance annotations, and group-level semantic representations, enabling a comprehensive understanding of spatial relationships and object-level details. Advanced techniques, such as Set-of-Mark (SOM) prompting combined with spatial and categorical priors, are utilized to enhance the accuracy and granularity of object-specific annotations.
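A single record in such a hierarchy might look as follows. The field names and schema below are purely illustrative, not the actual GeoPixelD format; the `mark` field stands in for the Set-of-Mark number overlaid on the image during prompting:

```python
import json

# Hypothetical RS-GCG annotation record showing the three levels:
# holistic scene description, group-level phrases, and per-instance
# annotations. Schema is an assumption, not the GeoPixelD format.
record = {
    "scene": "A coastal harbor with moored vessels and a cargo terminal.",
    "groups": [
        {"phrase": "a row of small boats", "members": [0, 1, 2]},
    ],
    "instances": [
        {"id": 0, "category": "boat", "mark": 1, "mask_rle": "..."},
        {"id": 1, "category": "boat", "mark": 2, "mask_rle": "..."},
        {"id": 2, "category": "boat", "mark": 3, "mask_rle": "..."},
    ],
}
print(json.dumps(record, indent=2))
```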
GeoPixel processes user queries to produce comprehensive descriptive outputs while simultaneously grounding identified objects through interleaved, pixel-level masks, demonstrating its advanced understanding and precise interpretation of high-resolution remote sensing imagery.
Performance Comparison of various models on the Remote Sensing Grounded Conversation Generation (RS-GCG) task. LISA† and PixelLM† refer to pretrained models finetuned on GeoPixelD training data. GLaMM represents zero-shot performance, while GLaMM-FT denotes the pretrained model finetuned on GeoPixelD. GeoPixel demonstrates superior performance across all metrics.
GeoPixel demonstrates a robust capability to interpret referring expressions of varying complexity and lengths to accurately generate precise segmentation masks.
Performance comparison of GeoPixel in Referring Expression Segmentation on the RRSIS-D dataset: segmentation accuracy against referring expressions is reported as Precision at an IoU threshold of 0.5 (P@0.5), overall Intersection-over-Union (oIoU), and mean Intersection-over-Union (mIoU).
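These three metrics are standard for referring segmentation and can be computed from binary masks as below. This is a minimal sketch of the standard definitions (note some papers use ≥ rather than > at the threshold), not the benchmark's official evaluation script:

```python
import numpy as np

def iou(pred, gt):
    """Intersection-over-Union of two binary masks."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union if union else 0.0

def evaluate(preds, gts, thr=0.5):
    """P@0.5: fraction of samples with IoU above thr.
    mIoU: mean of per-sample IoUs (each sample weighted equally).
    oIoU: total intersection / total union (large objects dominate)."""
    ious = [iou(p, g) for p, g in zip(preds, gts)]
    inter = sum(np.logical_and(p, g).sum() for p, g in zip(preds, gts))
    union = sum(np.logical_or(p, g).sum() for p, g in zip(preds, gts))
    return {"P@0.5": float(np.mean([i > thr for i in ious])),
            "mIoU": float(np.mean(ious)),
            "oIoU": float(inter / union)}

# toy example: one perfect and one half-overlapping prediction
gt = np.zeros((8, 8), bool); gt[2:6, 2:6] = True
pred_good = gt.copy()
pred_part = np.zeros((8, 8), bool); pred_part[2:6, 2:4] = True
print(evaluate([pred_good, pred_part], [gt, gt]))
```

The distinction between mIoU and oIoU matters in RS imagery: oIoU is dominated by large objects, while mIoU treats every referred object equally regardless of size.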
@article{shabbir2025geopixel,
  title={GeoPixel: Pixel Grounding Large Multimodal Models in Remote Sensing},
  author={Shabbir, Akashah and Zumri, Mohammed and Bennamoun, Mohammed and Khan, Fahad S. and Khan, Salman},
  journal={arXiv preprint arXiv:2501.13925},
  year={2025},
  url={https://arxiv.org/abs/2501.13925}
}