AirNav: A Large-Scale Real-World UAV Vision-and-Language Navigation Dataset
with Natural and Diverse Instructions

Hengxing Cai1*, Yijie Rao2*, Ligang Huang3, Zanyang Zhong1, Jinhan Dong4,
Jingjun Tan1, Wenhao Lu5, Renxin Zhong1†
(* indicates equal contribution, † corresponding author)
1School of Intelligent Systems Engineering, Sun Yat-Sen University, 2Beihang University, 3Peking University,
4Beijing University of Posts and Telecommunications, 5National University of Defense Technology

Overview

We propose AirNav, a large-scale UAV vision-and-language navigation (VLN) benchmark built from real urban aerial data rather than synthetic environments, paired with natural and diverse instructions. We also introduce AirVLN-R1, an agent that combines Supervised Fine-Tuning (SFT) with Reinforcement Fine-Tuning (RFT) to improve performance and generalization. The model's feasibility is preliminarily validated through real-world flight tests.


Dataset Construction

Overview of the AirNav Benchmark Construction Pipeline

The AirNav pipeline is designed to bridge the gap between synthetic data and real-world complexity. It consists of three primary stages:
Data Collection & Scene Reconstruction: High-resolution aerial images are captured across diverse urban areas (residential, industrial, etc.) using professional drones. These are processed using Structure-from-Motion (SfM) to create dense 3D reconstructions.
Graph Generation: The system samples nodes within the 3D space to build a navigable connectivity graph, ensuring the paths are realistic for UAV flight.
Instruction Generation: Using a combination of template-based methods and Large Language Models (LLMs), the pipeline generates diverse, natural language instructions that describe landmarks and spatial relationships for each path.
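The graph-generation stage above can be sketched as a simple radius-based connectivity rule. The following is a minimal illustration, not the authors' exact algorithm: waypoints sampled from the reconstructed free space are linked whenever the distance between them is within a single feasible UAV hop (the `radius` parameter and the toy waypoints are assumptions for illustration).

```python
import math

def build_nav_graph(points, radius):
    """Connect sampled 3D waypoints whose Euclidean distance is at most
    `radius`, yielding an undirected adjacency list.

    `points` is a list of (x, y, z) waypoints sampled from the
    reconstructed free space; `radius` bounds a single UAV hop.
    """
    edges = {i: [] for i in range(len(points))}
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if math.dist(points[i], points[j]) <= radius:
                edges[i].append(j)
                edges[j].append(i)
    return edges

# Toy example: four waypoints at 10 m altitude, hops of <= 1.5 m allowed.
pts = [(0, 0, 10), (1, 0, 10), (2, 0, 10), (5, 0, 10)]
graph = build_nav_graph(pts, radius=1.5)
```

A real pipeline would additionally check line-of-sight against the reconstructed geometry before adding an edge, so that connections never pass through buildings.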


Dataset Statistics

Statistics of the AirNav dataset

Dataset Statistics: (a) summarizes the number of scenes and trajectories per split; (b) shows the distribution of trajectory lengths; (c) shows the distribution of instruction lengths; (d) shows the start-to-goal distance distribution for the evaluation splits; (e) compares episode lengths of shortest-path and human-demonstration trajectories; and (f) shows action histograms for both.
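Statistics like those in panels (b) and (c) reduce to per-episode path and instruction lengths. The sketch below shows how such quantities might be computed from episode records; the `episodes` structure and its field names are hypothetical, not the dataset's actual schema.

```python
import math

# Hypothetical episode records: each has a trajectory (list of 3D
# waypoints, in metres) and a natural-language instruction.
episodes = [
    {"path": [(0, 0, 10), (3, 4, 10)],
     "instruction": "Fly to the red roof."},
    {"path": [(0, 0, 10), (0, 0, 20), (6, 8, 20)],
     "instruction": "Ascend, then head to the fountain."},
]

def path_length(path):
    """Sum of Euclidean distances between consecutive waypoints."""
    return sum(math.dist(a, b) for a, b in zip(path, path[1:]))

traj_lengths = [path_length(ep["path"]) for ep in episodes]
instr_lengths = [len(ep["instruction"].split()) for ep in episodes]
```

Histogramming these two lists over the full dataset would reproduce the shape of panels (b) and (c).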


AirVLN R1 Architecture

Overview of the AirVLN-R1 architecture

AirVLN-R1 is a multi-modal agent designed to process visual observations and textual instructions simultaneously:
Hybrid Training Strategy: It combines Supervised Fine-Tuning (SFT), which establishes a baseline of instruction-following behavior from demonstrations, with Reinforcement Fine-Tuning (RFT), which optimizes the agent's decision-making through trial and error.
Perception & Action: The architecture utilizes a visual encoder (often CLIP-based) to extract features from the UAV’s camera feed and a transformer-based policy network to predict the next best movement (action) in the 3D environment.
Reasoning: By leveraging the reasoning capabilities of large models, it can understand complex spatial commands like "Fly past the red building and stop near the circular fountain."


Real-World UAV VLN Deployment

Real-world UAV deployment tests

To validate the model's practical utility, we conducted physical flight tests:
Cross-Domain Testing: The model was deployed on actual UAV hardware to navigate both outdoor urban environments (similar to the training data) and unseen indoor scenes (testing generalization).
System Integration: The deployment involves an onboard processing unit that handles real-time visual processing, local mapping, and obstacle avoidance while executing the VLN agent's high-level commands.
Feasibility Results: The tests demonstrated that the model trained on AirNav's real-world aerial data generalizes significantly better to physical hardware than models trained purely on synthetic, "perfect" simulations.
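The system-integration point above amounts to a thin translation layer between the agent's discrete commands and the onboard controller's setpoints. Here is a minimal sketch under assumed names: the command set, the velocity mapping, and the `obstacle_ahead` flag are all illustrative, not the actual flight stack's API.

```python
# Hypothetical mapping from the agent's discrete commands to body-frame
# velocity setpoints (vx, vy, vz) in m/s.
COMMAND_TO_VELOCITY = {
    "forward": (1.0, 0.0, 0.0),
    "left":    (0.0, -1.0, 0.0),
    "right":   (0.0, 1.0, 0.0),
    "ascend":  (0.0, 0.0, 1.0),
    "stop":    (0.0, 0.0, 0.0),
}

def execute(command, obstacle_ahead):
    """Translate a high-level VLN command into a velocity setpoint,
    letting the onboard avoidance layer veto unsafe forward motion."""
    if command == "forward" and obstacle_ahead:
        return (0.0, 0.0, 0.0)  # safety override: hold position
    return COMMAND_TO_VELOCITY.get(command, (0.0, 0.0, 0.0))
```

In practice, the avoidance layer runs at a much higher rate than the VLN agent, so the veto applies continuously between successive high-level commands.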


Aerial Navigation Results

BibTeX


      @misc{cai2026airnav,
        title={AirNav: A Large-Scale Real-World UAV Vision-and-Language Navigation Dataset with Natural and Diverse Instructions},
        author={Hengxing Cai and Yijie Rao and Ligang Huang and Zanyang Zhong and Jinhan Dong and Jingjun Tan and Wenhao Lu and Renxin Zhong},
        year={2026},
        eprint={2601.03707},
        archivePrefix={arXiv}
      }