We propose AirNav, a large-scale UAV vision-and-language navigation (VLN) benchmark constructed from real urban aerial data rather than synthetic environments, paired with natural and diverse instructions. We additionally introduce AirVLN-R1, which combines Supervised Fine-Tuning and Reinforcement Fine-Tuning to enhance performance and generalization, and we preliminarily evaluate the model's feasibility through real-world flight tests.
The AirNav pipeline is designed to bridge the gap between synthetic data and real-world complexity. It consists of three primary stages:
Data Collection & Scene Reconstruction: High-resolution aerial images are captured across diverse urban areas (residential, industrial, etc.) using professional drones, then processed with Structure-from-Motion (SfM) to create dense 3D reconstructions (a rough SfM sketch follows this list).
Graph Generation: Nodes are sampled within the reconstructed 3D space and linked into a navigable connectivity graph, with edges restricted to paths that are realistic for UAV flight (a minimal graph-building sketch also follows the list).
Instruction Generation: Using a combination of template-based methods and Large Language Models (LLMs), the pipeline generates diverse, natural-language instructions that describe landmarks and spatial relationships for each path (an illustrative generation sketch appears below as well).
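The paper's exact reconstruction toolchain is not specified above, so as a hedged illustration of the SfM stage, a sparse reconstruction pass could look like the pycolmap sketch below; the choice of COLMAP/pycolmap, the directory names, and the omission of the dense MVS step that would follow are all assumptions.

# Sketch of the SfM stage: sparse reconstruction from aerial images with pycolmap.
# The toolchain (COLMAP via pycolmap) and every path below are illustrative
# assumptions; AirNav's actual reconstruction pipeline is not reproduced here.
import os

import pycolmap

image_dir = "aerial_images"              # captured UAV frames (assumed layout)
database = "sfm_workspace/database.db"   # COLMAP feature/match database
sparse_dir = "sfm_workspace/sparse"      # output poses + sparse point cloud
os.makedirs(sparse_dir, exist_ok=True)

# 1) Detect local features in every image and store them in the database.
pycolmap.extract_features(database, image_dir)

# 2) Match features across image pairs (exhaustive matching suits small sets).
pycolmap.match_exhaustive(database)

# 3) Incremental mapping recovers camera poses and a sparse point cloud;
#    a dense MVS stage would then produce the dense reconstruction.
reconstructions = pycolmap.incremental_mapping(database, image_dir, sparse_dir)
for idx, rec in reconstructions.items():
    print(idx, rec.summary())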
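To make the graph-generation stage concrete, here is a minimal sketch of sampling 3D nodes and linking them into a navigable graph; the k-nearest-neighbour linking, the distance threshold, and the point-based clearance test are illustrative assumptions rather than AirNav's actual procedure.

# Sketch: building a navigable connectivity graph from sampled 3D nodes.
# The sampling strategy, clearance test, and thresholds are illustrative
# assumptions standing in for the real pipeline's collision/visibility checks.
import numpy as np
import networkx as nx

rng = np.random.default_rng(0)

def is_traversable(p, q, obstacle_points, clearance=2.0):
    """Rough edge check: sample along the segment and require every sample to
    stay at least `clearance` metres away from all obstacle points."""
    for t in np.linspace(0.0, 1.0, 20):
        x = (1 - t) * p + t * q
        if np.min(np.linalg.norm(obstacle_points - x, axis=1)) < clearance:
            return False
    return True

def build_nav_graph(free_space_points, obstacle_points, k=6, max_edge_len=30.0):
    """Connect each sampled node to nearby nodes whose straight-line segment is
    traversable, yielding a graph of flight-feasible paths."""
    nodes = free_space_points
    g = nx.Graph()
    for i, p in enumerate(nodes):
        g.add_node(i, pos=p)
    for i, p in enumerate(nodes):
        d = np.linalg.norm(nodes - p, axis=1)
        for j in np.argsort(d)[1 : k + 1]:            # k nearest candidates
            if d[j] <= max_edge_len and is_traversable(p, nodes[j], obstacle_points):
                g.add_edge(i, int(j), weight=float(d[j]))
    return g

# Toy usage with random points standing in for the reconstructed scene.
free_pts = rng.uniform(0, 100, size=(200, 3))
obst_pts = rng.uniform(0, 100, size=(500, 3))
graph = build_nav_graph(free_pts, obst_pts)
print(graph.number_of_nodes(), graph.number_of_edges())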
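Finally, a hedged sketch of two-stage instruction generation: a template turns path metadata into a draft instruction and an LLM rewrites it into more natural language. The template fields, the prompt, the model name, and the use of the OpenAI client are assumptions made purely for illustration.

# Sketch: template draft + LLM paraphrase for instruction generation.
# The template, prompt, and model below are illustrative assumptions,
# not the paper's actual generation setup.
from openai import OpenAI

TEMPLATE = "Take off near {start}, fly {direction} past {landmark}, and stop above {goal}."

def template_instruction(path_meta: dict) -> str:
    """Stage 1: fill a rigid template with landmarks extracted along the path."""
    return TEMPLATE.format(**path_meta)

def paraphrase_with_llm(draft: str, client: OpenAI) -> str:
    """Stage 2: ask an LLM to rewrite the draft into natural, varied phrasing."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model name, purely for the sketch
        messages=[
            {"role": "system", "content": "Rewrite drone navigation instructions so they sound natural and concise."},
            {"role": "user", "content": draft},
        ],
    )
    return resp.choices[0].message.content

# Toy usage: the template draft prints immediately; the LLM paraphrase needs an API key.
meta = {"start": "the parking lot", "direction": "north",
        "landmark": "the red warehouse", "goal": "the circular fountain"}
print(template_instruction(meta))
# print(paraphrase_with_llm(template_instruction(meta), OpenAI()))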
Dataset Statistics: (a) the number of scenes and trajectories in each split, (b) the distribution of trajectory lengths, (c) the distribution of instruction (description) lengths, (d) the start-to-goal distance distribution for the evaluation splits, (e) episode lengths for both shortest-path and human-demonstration trajectories, and (f) action histograms for the shortest path and the human demonstrations.
AirVLN-R1 is a multi-modal agent designed to process visual observations and textual instructions simultaneously:
Hybrid Training Strategy: It combines Supervised Fine-Tuning (SFT), which establishes a baseline of instruction-following behavior, with Reinforcement Fine-Tuning (RFT), which optimizes the agent's decision-making through trial and error (a minimal sketch of the two objectives follows this list).
Perception & Action: A visual encoder (often CLIP-based) extracts features from the UAV's camera feed, and a transformer-based policy network predicts the next movement (action) in the 3D environment (see the perception-to-action sketch after this list).
Reasoning: By leveraging the reasoning capabilities of large models, it can understand complex spatial commands like "Fly past the red building and stop near the circular fountain."
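A minimal sketch of how the two training phases could compose, assuming a discrete action space, an imitation (cross-entropy) SFT loss, and a REINFORCE-style RFT objective; the TinyPolicy stand-in, the reward signal, and the loss forms are illustrative assumptions, not AirVLN-R1's exact recipe.

# Sketch: SFT step followed by an RFT step on a toy policy.
import torch
import torch.nn.functional as F

class TinyPolicy(torch.nn.Module):
    """Stand-in policy: maps fused observation/instruction features to action logits."""
    def __init__(self, obs_dim=512, instr_dim=256, num_actions=6):
        super().__init__()
        self.head = torch.nn.Linear(obs_dim + instr_dim, num_actions)
    def forward(self, obs, instr):
        return self.head(torch.cat([obs, instr], dim=-1))

def sft_step(policy, optimizer, obs, instr, expert_actions):
    """Supervised fine-tuning: imitate expert (shortest-path / human demo) actions."""
    loss = F.cross_entropy(policy(obs, instr), expert_actions)
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()

def rft_step(policy, optimizer, obs, instr, sampled_actions, rewards):
    """Reinforcement fine-tuning: REINFORCE-style update on the agent's own rollouts,
    rewarding progress toward the goal described in the instruction."""
    log_probs = F.log_softmax(policy(obs, instr), dim=-1)
    chosen = log_probs.gather(1, sampled_actions.unsqueeze(1)).squeeze(1)
    advantage = rewards - rewards.mean()              # simple mean baseline
    loss = -(advantage.detach() * chosen).mean()
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()

# Toy usage with random tensors standing in for real features and rollouts.
policy = TinyPolicy()
opt = torch.optim.Adam(policy.parameters(), lr=1e-4)
obs, instr = torch.randn(8, 512), torch.randn(8, 256)
print(sft_step(policy, opt, obs, instr, torch.randint(0, 6, (8,))))
print(rft_step(policy, opt, obs, instr, torch.randint(0, 6, (8,)), torch.rand(8)))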
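And a hedged sketch of the perception-to-action interface, assuming a Hugging Face CLIPVisionModel as the visual encoder and a small transformer policy over six discrete UAV actions; the model name, feature dimensions, and action set are assumptions for illustration only.

# Sketch: CLIP-style visual features fused with an instruction embedding,
# scored by a small transformer policy. All dimensions and the action set
# are assumptions, not AirVLN-R1's actual architecture.
import torch
from PIL import Image
from transformers import CLIPImageProcessor, CLIPVisionModel

ACTIONS = ["forward", "turn_left", "turn_right", "ascend", "descend", "stop"]

processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-base-patch32")
vision = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch32").eval()

class VLNPolicy(torch.nn.Module):
    """Fuses visual features with an instruction embedding and scores actions."""
    def __init__(self, vis_dim=768, instr_dim=768, hidden=512):
        super().__init__()
        self.proj = torch.nn.Linear(vis_dim + instr_dim, hidden)
        layer = torch.nn.TransformerEncoderLayer(d_model=hidden, nhead=8, batch_first=True)
        self.encoder = torch.nn.TransformerEncoder(layer, num_layers=2)
        self.head = torch.nn.Linear(hidden, len(ACTIONS))

    def forward(self, vis_feat, instr_feat):
        x = self.proj(torch.cat([vis_feat, instr_feat], dim=-1)).unsqueeze(1)
        return self.head(self.encoder(x).squeeze(1))

@torch.no_grad()
def next_action(frame: Image.Image, instr_feat: torch.Tensor, policy: VLNPolicy) -> str:
    """Encode the current camera frame and pick the highest-scoring discrete action."""
    pixels = processor(images=frame, return_tensors="pt").pixel_values
    vis_feat = vision(pixel_values=pixels).pooler_output          # [1, 768]
    logits = policy(vis_feat, instr_feat)
    return ACTIONS[int(logits.argmax(dim=-1))]

# Toy usage: a blank frame and a random "instruction embedding" stand in for real inputs.
policy = VLNPolicy().eval()
frame = Image.new("RGB", (224, 224))
print(next_action(frame, torch.randn(1, 768), policy))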
To validate the model's practical utility, the researchers conducted physical flight tests:
Cross-Domain Testing: The model was deployed on actual UAV hardware to navigate both outdoor urban environments (similar to the training data) and unseen indoor scenes (testing generalization).
System Integration: The deployment relies on an onboard processing unit that handles real-time visual processing, local mapping, and obstacle avoidance while executing the VLN agent's high-level commands (a simplified control-loop sketch appears after this list).
Feasibility Results: The tests demonstrated that the model trained on AirNav's real-world aerial data generalizes significantly better to physical hardware than models trained purely on synthetic, "perfect" simulations.
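As a rough picture of how such a deployment could be wired together, the sketch below shows a high-level control loop in which the VLN agent picks discrete actions and a local planner turns them into collision-checked setpoints; the Camera/LocalPlanner/FlightController interfaces, the 2 m step size, and the 2 Hz loop rate are hypothetical placeholders, not the deployed system's actual stack.

# Sketch: onboard loop consuming the agent's high-level VLN actions.
# Every interface named here is a hypothetical placeholder.
import time
from dataclasses import dataclass

@dataclass
class Command:
    action: str          # e.g. "forward", "turn_left", "stop"
    magnitude: float     # metres or radians, depending on the action

def run_mission(agent, camera, planner, controller, instruction, hz=2.0):
    """Closed loop: perceive, let the VLN agent pick a discrete action, then let the
    local planner turn it into a collision-checked low-level setpoint."""
    period = 1.0 / hz
    while True:
        frame = camera.capture()                       # latest onboard image
        action = agent.act(frame, instruction)         # high-level VLN decision
        if action == "stop":
            controller.hover()
            break
        cmd = Command(action=action, magnitude=2.0)    # assumed 2 m per step
        setpoint = planner.plan(cmd, frame)            # local mapping + obstacle avoidance
        controller.send_setpoint(setpoint)
        time.sleep(period)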
@misc{cai2026airnav,
  title={AirNav: A Large-Scale Real-World UAV Vision-and-Language Navigation Dataset with Natural and Diverse Instructions},
  author={Hengxing Cai and Yijie Rao and Ligang Huang and Zanyang Zhong and Jinhan Dong and Jingjun Tan and Wenhao Lu and Renxin Zhong},
  year={2026},
  eprint={2601.03707},
  archivePrefix={arXiv}
}