Active Geo-Localization (AGL) within a goal-reaching reinforcement learning (RL) context.
(a) AGL focuses on localizing a target (goal) within a predefined search area (environment), presented in bird's-eye view, by navigating the agent towards it. At a given time, the agent observes a state, i.e., a patch representing a limited observation of the environment, and selects an action, i.e., a decision that modifies the agent's position and the observed state (a minimal sketch of this loop follows the caption).
(b) The location of the goal is unknown during inference, but its content can be described in various modalities:
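A minimal, illustrative sketch of this observe-act loop, assuming the search area is split into a regular grid of patches and each action moves the agent by one cell; the class and method names (AGLGridEnv, observe, step) are hypothetical and not part of the GeoExplorer code.

```python
import numpy as np

class AGLGridEnv:
    """Search area as a grid_size x grid_size grid of patches; one cell per action."""

    ACTIONS = {0: (-1, 0), 1: (1, 0), 2: (0, -1), 3: (0, 1)}  # up, down, left, right

    def __init__(self, search_area, grid_size, start_cell, goal_cell):
        self.area = search_area                      # bird's-eye image, shape (H, W, 3)
        self.grid = grid_size
        self.patch_h = search_area.shape[0] // grid_size
        self.patch_w = search_area.shape[1] // grid_size
        self.pos = start_cell                        # (row, col) of the agent
        self.goal = goal_cell                        # unknown to the agent at inference

    def observe(self):
        """State s_t: the image patch at the agent's current cell."""
        r, c = self.pos
        return self.area[r * self.patch_h:(r + 1) * self.patch_h,
                         c * self.patch_w:(c + 1) * self.patch_w]

    def step(self, action):
        """Apply an action, stay inside the grid, and report whether the goal is reached."""
        dr, dc = self.ACTIONS[action]
        r = int(np.clip(self.pos[0] + dr, 0, self.grid - 1))
        c = int(np.clip(self.pos[1] + dc, 0, self.grid - 1))
        self.pos = (r, c)
        return self.observe(), self.pos == self.goal
```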
In this work, we introduce GeoExplorer, an AGL agent that:
The learning process can be divided into three sequential stages: feature representation, Action-State Dynamics Modeling (DM), and Curiosity-Driven Exploration (CE).
(a) Feature Representation. The environment (s_t) and goal (s_goal) are encoded with different but aligned encoders, according to their modalities (e.g., aerial images (I_goal), ground-level images (G_goal), or text (T_goal)).
(b) Action-State Dynamics Modeling. A causal Transformer is trained to jointly capture action-state dynamics, guided by supervision from generated action-state trajectories for environment modeling.
(c) Curiosity-Driven Exploration. Based on the state prediction from (b), a curiosity-driven intrinsic reward (r_in) encourages the agent to explore the environment by measuring the difference between the predicted and the observed state at each step.
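A minimal sketch of how such a curiosity signal can be computed from the dynamics model's prediction error, assuming the model outputs a predicted next-state embedding from the action-state history; the function and argument names below are illustrative, not the released implementation.

```python
import torch

def intrinsic_reward(dynamics_model, state_embs, actions, next_state_emb):
    """r_in: prediction error between the predicted and the observed next-state embedding."""
    with torch.no_grad():
        pred_next = dynamics_model(state_embs, actions)   # predicted embedding, shape (B, D)
    # Poorly predicted (surprising) observations yield a larger intrinsic reward,
    # pushing the agent towards unexplored or semantically novel patches.
    return ((pred_next - next_state_emb) ** 2).sum(dim=-1)
```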
Our proposed SwissView dataset is constructed from Swisstopo’s SWISSIMAGE 10cm imagery, with two distinct components:
We evaluate GeoExplorer in four settings:
Average success rate of GeoExplorer and the baseline over start-goal distances from C=4 to C=8 on the validation set (Masa dataset) and in cross-domain generalization (xBD and SwissView100 datasets).
Average success rate of GeoExplorer and the baseline over start-goal distances from C=4 to C=8 in cross-modal generalization (MM-GAG dataset). Green, blue, and yellow denote aerial image, ground-level image, and text as the goal, respectively.
Average success rate of GeoExplorer and the baseline for C=4, C=5, and C=6 on the SwissViewMonuments dataset: (a) aerial image as the goal; (b) ground-level image as the goal.
To provide insights into the intrinsic reward and its impact on exploration, we design the following analyses and visualizations:
Generated path visualization on the SwissViewMonuments dataset. Given one {start (◦), goal (△)} pair per search area, the models generate four trials with a stochastic policy, shown in four different colors. Compared with the baseline, the paths generated by GeoExplorer are more robust (adapted to varied environments), diverse (different paths for the same {start, goal} pair), and content-aware (related to state observations).
Statistics of path endpoints and visited patches on the Masa dataset. (a) Path endpoints: we count the end locations of the 895 paths in the Masa test set for the ground truth (goal location), GOMAA-Geo, and GeoExplorer at C=4 and C=8. (b) Visited patches: we count all patches visited along the 895 paths in the Masa test set for GOMAA-Geo and GeoExplorer at C=4 and C=8.
Intrinsic reward visualization with images from the SwissViewMonuments dataset. For each sample, from left to right: the search area, the path visualization, and the intrinsic reward per patch. The patch with the highest intrinsic reward is highlighted with an orange rectangle in the search area. Patches with higher intrinsic reward turn out to be more “interesting”, i.e., their semantic content can hardly be predicted from the surrounding patches.
@misc{mi2025geoexplorer,
title = {GeoExplorer: Active Geo-localization with Curiosity-Driven Exploration},
author = {Mi, Li and Béchaz, Manon and Chen, Zeming and Bosselut, Antoine and Tuia, Devis},
url = {https://limirs.github.io/GeoExplorer/},
year = {2025}
}