We propose AirNav, a large-scale UAV vision-and-language navigation (VLN) benchmark constructed from real urban aerial data rather than synthetic environments, paired with natural and diverse instructions. We additionally introduce AirVLN-R1, which combines Supervised Fine-Tuning and Reinforcement Fine-Tuning to enhance performance and generalization, and we preliminarily evaluate the model's feasibility through real-world flight tests.
The AirNav pipeline is designed to bridge the gap between synthetic data and real-world complexity. It consists of three primary stages:
Data Collection & Scene Reconstruction: High-resolution aerial images are captured across diverse urban areas (residential, industrial, etc.) using professional drones, then processed with Structure-from-Motion (SfM) to create dense 3D reconstructions (a rough SfM sketch follows this list).
Graph Generation: Nodes are sampled within the reconstructed 3D space and linked into a navigable connectivity graph, with edges restricted to paths that are realistic for UAV flight (a minimal graph-building sketch also follows the list).
Instruction Generation: Using a combination of template-based methods and Large Language Models (LLMs), the pipeline generates diverse, natural-language instructions that describe landmarks and spatial relationships for each path (an illustrative generation sketch appears below as well).
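The paper's exact reconstruction toolchain is not specified above, so as a hedged illustration of the SfM stage, a sparse reconstruction pass could look like the pycolmap sketch below; the choice of COLMAP/pycolmap, the directory names, and the omission of the dense MVS step that would follow are all assumptions.

# Sketch of the SfM stage: sparse reconstruction from aerial images with pycolmap.
# The toolchain (COLMAP via pycolmap) and every path below are illustrative
# assumptions; AirNav's actual reconstruction pipeline is not reproduced here.
import os

import pycolmap

image_dir = "aerial_images"              # captured UAV frames (assumed layout)
database = "sfm_workspace/database.db"   # COLMAP feature/match database
sparse_dir = "sfm_workspace/sparse"      # output poses + sparse point cloud
os.makedirs(sparse_dir, exist_ok=True)

# 1) Detect local features in every image and store them in the database.
pycolmap.extract_features(database, image_dir)

# 2) Match features across image pairs (exhaustive matching suits small sets).
pycolmap.match_exhaustive(database)

# 3) Incremental mapping recovers camera poses and a sparse point cloud;
#    a dense MVS stage would then produce the dense reconstruction.
reconstructions = pycolmap.incremental_mapping(database, image_dir, sparse_dir)
for idx, rec in reconstructions.items():
    print(idx, rec.summary())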
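To make the graph-generation stage concrete, here is a minimal sketch of sampling 3D nodes and linking them into a navigable graph; the k-nearest-neighbour linking, the distance threshold, and the point-based clearance test are illustrative assumptions rather than AirNav's actual procedure.

# Sketch: building a navigable connectivity graph from sampled 3D nodes.
# The sampling strategy, clearance test, and thresholds are illustrative
# assumptions standing in for the real pipeline's collision/visibility checks.
import numpy as np
import networkx as nx

rng = np.random.default_rng(0)

def is_traversable(p, q, obstacle_points, clearance=2.0):
    """Rough edge check: sample along the segment and require every sample to
    stay at least `clearance` metres away from all obstacle points."""
    for t in np.linspace(0.0, 1.0, 20):
        x = (1 - t) * p + t * q
        if np.min(np.linalg.norm(obstacle_points - x, axis=1)) < clearance:
            return False
    return True

def build_nav_graph(free_space_points, obstacle_points, k=6, max_edge_len=30.0):
    """Connect each sampled node to nearby nodes whose straight-line segment is
    traversable, yielding a graph of flight-feasible paths."""
    nodes = free_space_points
    g = nx.Graph()
    for i, p in enumerate(nodes):
        g.add_node(i, pos=p)
    for i, p in enumerate(nodes):
        d = np.linalg.norm(nodes - p, axis=1)
        for j in np.argsort(d)[1 : k + 1]:            # k nearest candidates
            if d[j] <= max_edge_len and is_traversable(p, nodes[j], obstacle_points):
                g.add_edge(i, int(j), weight=float(d[j]))
    return g

# Toy usage with random points standing in for the reconstructed scene.
free_pts = rng.uniform(0, 100, size=(200, 3))
obst_pts = rng.uniform(0, 100, size=(500, 3))
graph = build_nav_graph(free_pts, obst_pts)
print(graph.number_of_nodes(), graph.number_of_edges())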
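Finally, a hedged sketch of two-stage instruction generation: a template turns path metadata into a draft instruction and an LLM rewrites it into more natural language. The template fields, the prompt, the model name, and the use of the OpenAI client are assumptions made purely for illustration.

# Sketch: template draft + LLM paraphrase for instruction generation.
# The template, prompt, and model below are illustrative assumptions,
# not the paper's actual generation setup.
from openai import OpenAI

TEMPLATE = "Take off near {start}, fly {direction} past {landmark}, and stop above {goal}."

def template_instruction(path_meta: dict) -> str:
    """Stage 1: fill a rigid template with landmarks extracted along the path."""
    return TEMPLATE.format(**path_meta)

def paraphrase_with_llm(draft: str, client: OpenAI) -> str:
    """Stage 2: ask an LLM to rewrite the draft into natural, varied phrasing."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model name, purely for the sketch
        messages=[
            {"role": "system", "content": "Rewrite drone navigation instructions so they sound natural and concise."},
            {"role": "user", "content": draft},
        ],
    )
    return resp.choices[0].message.content

# Toy usage: the template draft prints immediately; the LLM paraphrase needs an API key.
meta = {"start": "the parking lot", "direction": "north",
        "landmark": "the red warehouse", "goal": "the circular fountain"}
print(template_instruction(meta))
# print(paraphrase_with_llm(template_instruction(meta), OpenAI()))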
Dataset Statistics: (a) the number of scenes and trajectories in each split, (b) the distribution of trajectory lengths, (c) the distribution of instruction (description) lengths, (d) the start-to-goal distance distribution for the evaluation splits, (e) episode lengths for both shortest-path and human-demonstration trajectories, and (f) action histograms for the shortest path and the human demonstrations.
AirVLN-R1 is a multi-modal agent designed to process visual observations and textual instructions simultaneously:
Hybrid Training Strategy: It combines Supervised Fine-Tuning (SFT), which establishes a baseline of instruction-following behavior, with Reinforcement Fine-Tuning (RFT), which optimizes the agent's decision-making through trial and error (a minimal sketch of the two objectives follows this list).
Perception & Action: A visual encoder (often CLIP-based) extracts features from the UAV's camera feed, and a transformer-based policy network predicts the next movement (action) in the 3D environment (see the perception-to-action sketch after this list).
Reasoning: By leveraging the reasoning capabilities of large models, it can understand complex spatial commands like "Fly past the red building and stop near the circular fountain."
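A minimal sketch of how the two training phases could compose, assuming a discrete action space, an imitation (cross-entropy) SFT loss, and a REINFORCE-style RFT objective; the TinyPolicy stand-in, the reward signal, and the loss forms are illustrative assumptions, not AirVLN-R1's exact recipe.

# Sketch: SFT step followed by an RFT step on a toy policy.
import torch
import torch.nn.functional as F

class TinyPolicy(torch.nn.Module):
    """Stand-in policy: maps fused observation/instruction features to action logits."""
    def __init__(self, obs_dim=512, instr_dim=256, num_actions=6):
        super().__init__()
        self.head = torch.nn.Linear(obs_dim + instr_dim, num_actions)
    def forward(self, obs, instr):
        return self.head(torch.cat([obs, instr], dim=-1))

def sft_step(policy, optimizer, obs, instr, expert_actions):
    """Supervised fine-tuning: imitate expert (shortest-path / human demo) actions."""
    loss = F.cross_entropy(policy(obs, instr), expert_actions)
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()

def rft_step(policy, optimizer, obs, instr, sampled_actions, rewards):
    """Reinforcement fine-tuning: REINFORCE-style update on the agent's own rollouts,
    rewarding progress toward the goal described in the instruction."""
    log_probs = F.log_softmax(policy(obs, instr), dim=-1)
    chosen = log_probs.gather(1, sampled_actions.unsqueeze(1)).squeeze(1)
    advantage = rewards - rewards.mean()              # simple mean baseline
    loss = -(advantage.detach() * chosen).mean()
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()

# Toy usage with random tensors standing in for real features and rollouts.
policy = TinyPolicy()
opt = torch.optim.Adam(policy.parameters(), lr=1e-4)
obs, instr = torch.randn(8, 512), torch.randn(8, 256)
print(sft_step(policy, opt, obs, instr, torch.randint(0, 6, (8,))))
print(rft_step(policy, opt, obs, instr, torch.randint(0, 6, (8,)), torch.rand(8)))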
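And a hedged sketch of the perception-to-action interface, assuming a Hugging Face CLIPVisionModel as the visual encoder and a small transformer policy over six discrete UAV actions; the model name, feature dimensions, and action set are assumptions for illustration only.

# Sketch: CLIP-style visual features fused with an instruction embedding,
# scored by a small transformer policy. All dimensions and the action set
# are assumptions, not AirVLN-R1's actual architecture.
import torch
from PIL import Image
from transformers import CLIPImageProcessor, CLIPVisionModel

ACTIONS = ["forward", "turn_left", "turn_right", "ascend", "descend", "stop"]

processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-base-patch32")
vision = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch32").eval()

class VLNPolicy(torch.nn.Module):
    """Fuses visual features with an instruction embedding and scores actions."""
    def __init__(self, vis_dim=768, instr_dim=768, hidden=512):
        super().__init__()
        self.proj = torch.nn.Linear(vis_dim + instr_dim, hidden)
        layer = torch.nn.TransformerEncoderLayer(d_model=hidden, nhead=8, batch_first=True)
        self.encoder = torch.nn.TransformerEncoder(layer, num_layers=2)
        self.head = torch.nn.Linear(hidden, len(ACTIONS))

    def forward(self, vis_feat, instr_feat):
        x = self.proj(torch.cat([vis_feat, instr_feat], dim=-1)).unsqueeze(1)
        return self.head(self.encoder(x).squeeze(1))

@torch.no_grad()
def next_action(frame: Image.Image, instr_feat: torch.Tensor, policy: VLNPolicy) -> str:
    """Encode the current camera frame and pick the highest-scoring discrete action."""
    pixels = processor(images=frame, return_tensors="pt").pixel_values
    vis_feat = vision(pixel_values=pixels).pooler_output          # [1, 768]
    logits = policy(vis_feat, instr_feat)
    return ACTIONS[int(logits.argmax(dim=-1))]

# Toy usage: a blank frame and a random "instruction embedding" stand in for real inputs.
policy = VLNPolicy().eval()
frame = Image.new("RGB", (224, 224))
print(next_action(frame, torch.randn(1, 768), policy))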
To validate the model's practical utility, the researchers conducted physical flight tests:
Cross-Domain Testing: The model was deployed on actual UAV hardware to navigate both outdoor urban environments (similar to the training data) and unseen indoor scenes (testing generalization).
System Integration: The deployment relies on an onboard processing unit that handles real-time visual processing, local mapping, and obstacle avoidance while executing the VLN agent's high-level commands (a simplified control-loop sketch appears after this list).
Feasibility Results: The tests demonstrated that the model trained on AirNav's real-world aerial data generalizes significantly better to physical hardware than models trained purely on synthetic, "perfect" simulations.
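As a rough picture of how such a deployment could be wired together, the sketch below shows a high-level control loop in which the VLN agent picks discrete actions and a local planner turns them into collision-checked setpoints; the Camera/LocalPlanner/FlightController interfaces, the 2 m step size, and the 2 Hz loop rate are hypothetical placeholders, not the deployed system's actual stack.

# Sketch: onboard loop consuming the agent's high-level VLN actions.
# Every interface named here is a hypothetical placeholder.
import time
from dataclasses import dataclass

@dataclass
class Command:
    action: str          # e.g. "forward", "turn_left", "stop"
    magnitude: float     # metres or radians, depending on the action

def run_mission(agent, camera, planner, controller, instruction, hz=2.0):
    """Closed loop: perceive, let the VLN agent pick a discrete action, then let the
    local planner turn it into a collision-checked low-level setpoint."""
    period = 1.0 / hz
    while True:
        frame = camera.capture()                       # latest onboard image
        action = agent.act(frame, instruction)         # high-level VLN decision
        if action == "stop":
            controller.hover()
            break
        cmd = Command(action=action, magnitude=2.0)    # assumed 2 m per step
        setpoint = planner.plan(cmd, frame)            # local mapping + obstacle avoidance
        controller.send_setpoint(setpoint)
        time.sleep(period)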
@misc{cai2026airnav,
  title={AirNav: A Large-Scale Real-World UAV Vision-and-Language Navigation Dataset with Natural and Diverse Instructions},
  author={Hengxing Cai and Yijie Rao and Ligang Huang and Zanyang Zhong and Jinhan Dong and Jingjun Tan and Wenhao Lu and Renxin Zhong},
  year={2026},
  eprint={2601.03707},
  archivePrefix={arXiv}
}