ClaraVid: A Holistic Scene Reconstruction Benchmark From Aerial Perspective With Delentropy-Based Complexity Profiling


Overview of ClaraVid Dataset: (a) Multi-viewpoint UAV acquisition: High-resolution aerial imagery is captured from multiple altitudes, ensuring diverse perspectives. (b) High-fidelity, diverse environments: complex urban, suburban, and natural landscapes. (c) Multimodal ground truth: Pixel-level and scene-level multimodal data for holistic scene reconstruction and semantic mapping.

Abstract

The development of aerial holistic scene understanding algorithms is hindered by the scarcity of comprehensive datasets that enable both semantic and geometric reconstruction. While synthetic datasets offer an alternative, existing options exhibit task-specific limitations, unrealistic scene compositions, and rendering artifacts that compromise real-world applicability. We introduce ClaraVid, a synthetic aerial dataset specifically designed to overcome these limitations. Comprising 16,917 high-resolution images captured at 4032×3024 from multiple viewpoints across diverse landscapes, ClaraVid provides dense depth maps, panoptic segmentation, sparse point clouds, and dynamic object masks, while mitigating common rendering artifacts. To further advance neural reconstruction, we introduce the Delentropic Scene Profile (DSP), a novel complexity metric derived from differential entropy analysis, designed to quantitatively assess scene difficulty and inform reconstruction tasks. Utilizing DSP, we systematically benchmark neural reconstruction methods, uncovering a consistent, measurable correlation between scene complexity and reconstruction accuracy. Empirical results indicate that higher delentropy strongly correlates with increased reconstruction errors, validating DSP as a reliable complexity prior. ClaraVid will be publicly released to support UAV research.

Delentropic Scene Profile

Delentropy

Delentropy[1] is a measure of image complexity that captures how much structural detail is present, based on the distribution of image gradients. While it works well for single images, we extend it to entire scenes by introducing the Delentropic Scene Profile (DSP). This provides a quantitative summary of how complex a scene is, based on multiple images taken from different viewpoints. Formally, for a given scene \( S \) composed of images \( \{I_k\}_{k=1}^N \), we compute the delentropy \( H_{\text{del},k} \) for each image, ignoring regions that don't contribute to reconstruction (like moving objects). The resulting values form a set \( \{ H_{\text{del},k} \} \), whose distribution describes the scene's overall complexity.
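The per-image step can be sketched as follows. This is a minimal illustration of gradient-based delentropy, assuming central-difference gradients, a fixed 2D histogram binning, and the 1/2 factor from Larkin's formulation; the paper's exact implementation details (bin count, gradient operator, mask handling) are not specified here and are assumptions.

```python
import numpy as np

def delentropy(image, mask=None, bins=64):
    """Delentropy of a grayscale image: Shannon entropy of the joint
    gradient distribution (the "deldensity"). Binning and the 1/2
    factor follow Larkin's convention and are assumptions here."""
    img = np.asarray(image, dtype=np.float64)
    # Central-difference image gradients along rows and columns.
    fy, fx = np.gradient(img)
    if mask is not None:
        # Keep only pixels that contribute to reconstruction,
        # e.g. exclude dynamic-object regions.
        fx, fy = fx[mask], fy[mask]
    # 2D histogram of the joint gradient distribution.
    hist, _, _ = np.histogram2d(fx.ravel(), fy.ravel(), bins=bins)
    p = hist / hist.sum()
    p = p[p > 0]  # drop empty bins; 0 * log(0) is taken as 0
    return -0.5 * float(np.sum(p * np.log2(p)))
```

A flat image has zero delentropy (all gradient mass falls in one bin), while textured or structurally rich images score higher, which is what makes the measure usable as a per-image complexity score.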

To model this distribution, we fit a truncated Beta distribution:

\[ \text{DSP}_S \sim \text{Beta}(H_{\text{del}} \mid \alpha, \beta, a, b) = \frac{(H_{\text{del}} - a)^{\alpha - 1} (b - H_{\text{del}})^{\beta - 1}}{(b - a)^{\alpha + \beta - 1} B(\alpha, \beta)} \]

Here, \( \alpha \) and \( \beta \) shape the curve, while \( a \) and \( b \) define the observed range of delentropy values in the scene. These parameters are estimated using maximum likelihood. The DSP provides a compact, interpretable fingerprint of scene complexity, which we show correlates strongly with reconstruction difficulty.
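A fit of this form can be sketched with SciPy's four-parameter Beta, which models exactly the density above via its `loc`/`scale` parameters. Fixing the support to the observed range \([a, b]\) and estimating \((\alpha, \beta)\) by maximum likelihood is an assumption about the fitting procedure, not a statement of the authors' exact code.

```python
import numpy as np
from scipy import stats

def fit_dsp(delentropies, eps=1e-6):
    """Fit a scaled Beta distribution to per-image delentropy values.

    The support [a, b] is pinned to the observed range (padded by eps so
    no sample sits exactly on the boundary); alpha and beta are then
    estimated by maximum likelihood.
    """
    h = np.asarray(delentropies, dtype=np.float64)
    a, b = h.min() - eps, h.max() + eps
    # scipy parameterizes the support as [loc, loc + scale].
    alpha, beta, _loc, _scale = stats.beta.fit(h, floc=a, fscale=b - a)
    return alpha, beta, a, b
```

The returned \((\alpha, \beta, a, b)\) is the compact scene fingerprint: two shape parameters summarizing where the per-image complexity mass concentrates, and the range over which it was observed.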

ClaraVid Overview

ClaraVid is a synthetic dataset built for semantic and geometric reconstruction from UAV perspectives. It contains \(16{,}917\) high-resolution multimodal frames collected across 5 diverse environments—urban, urban high, rural, highway, and natural—through 8 coordinated UAV missions. Each mission features 3 distinct viewpoints and altitude levels of \(45\text{–}75\,\text{m}\), simulating multi-UAV operations over varied terrain. The dataset spans \(1.8\,\text{km}^2\), with an average scene coverage of \(0.22\,\text{km}^2\). Each frame provides RGB imagery at \(4032\times3024\) resolution, metric depth maps, and panoptic segmentation masks across 18 classes, supporting tasks such as 3D scene reconstruction, segmentation, and depth estimation. A key additional feature is the inclusion of dynamic object masks that isolate moving entities, enabling motion-aware processing—critical for tasks requiring temporal coherence.

An overview of the dataset is presented below through color, semantic, and instance-level scene point clouds:

The delentropic scene profile of each scenario is shown below:
