ClaraVid: A Holistic Scene Reconstruction Benchmark From Aerial Perspective With Delentropy-Based Complexity Profiling

Radu Beche, Sergiu Nedevschi
Technical University of Cluj-Napoca
Accepted ICCV 2025 🚀
ClaraVid Overview

Overview of ClaraVid Dataset: (a) Multi-viewpoint UAV acquisition: High-resolution aerial imagery is captured from multiple altitudes, ensuring diverse perspectives. (b) High-fidelity, diverse environments: complex urban, suburban, and natural landscapes. (c) Multimodal ground truth: Pixel-level and scene-level multimodal data for holistic scene reconstruction and semantic mapping.

Abstract

The development of aerial holistic scene understanding algorithms is hindered by the scarcity of comprehensive datasets that enable both semantic and geometric reconstruction. While synthetic datasets offer an alternative, existing options exhibit task-specific limitations, unrealistic scene compositions, and rendering artifacts that compromise real-world applicability. We introduce ClaraVid, a synthetic aerial dataset specifically designed to overcome these limitations. Comprising 16,917 images captured at 4032×3024 from multiple viewpoints across diverse landscapes, ClaraVid provides dense depth maps, panoptic segmentation, sparse point clouds, and dynamic object masks, while mitigating common rendering artifacts. To further advance neural reconstruction, we introduce the Delentropic Scene Profile (DSP), a novel complexity metric derived from differential entropy analysis, designed to quantitatively assess scene difficulty and inform reconstruction tasks. Utilizing DSP, we systematically benchmark neural reconstruction methods, uncovering a consistent, measurable correlation between scene complexity and reconstruction accuracy. Empirical results indicate that higher delentropy strongly correlates with increased reconstruction errors, validating DSP as a reliable complexity prior.

ClaraVid Overview

ClaraVid is a synthetic dataset built for semantic and geometric neural reconstruction from low-altitude UAV/aerial imagery. It contains 16,917 multimodal frames collected across 8 UAV missions over diverse environments: urban, urban high, rural, highway, and nature. Each mission features 3 viewpoints and altitude levels, simulating multi-UAV operations. The dataset spans 1.8 km², with an average mission coverage of 0.22 km². It includes visual measurements at 4032x3024 resolution: RGB images, metric depth maps, panoptic (semantic and instance) segmentation, and dynamic object masks. Additionally, it contains scene-level point clouds and camera calibrations (intrinsics and extrinsics). It supports tasks such as scene reconstruction, segmentation, and depth estimation.
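To make the per-frame modalities concrete, below is a minimal loading sketch in Python. The directory layout, file names, and the `Frame` fields are illustrative assumptions, not the dataset's actual on-disk format.

```python
from dataclasses import dataclass
from pathlib import Path

import numpy as np
from PIL import Image


@dataclass
class Frame:
    """One multimodal frame (hypothetical layout)."""
    rgb: np.ndarray           # (H, W, 3) uint8 RGB image, 4032x3024
    depth: np.ndarray         # (H, W) float32 metric depth
    semantic: np.ndarray      # (H, W) semantic class ids
    instance: np.ndarray      # (H, W) instance ids
    dynamic_mask: np.ndarray  # (H, W) bool, True where objects move
    K: np.ndarray             # (3, 3) camera intrinsics
    T_wc: np.ndarray          # (4, 4) world-from-camera extrinsics


def load_frame(mission_dir: Path, frame_id: str) -> Frame:
    """Assumed per-frame files under rgb/, depth/, semantic/, instance/, dynamic/, calib/."""
    rgb = np.asarray(Image.open(mission_dir / "rgb" / f"{frame_id}.png"))
    depth = np.load(mission_dir / "depth" / f"{frame_id}.npy")
    semantic = np.asarray(Image.open(mission_dir / "semantic" / f"{frame_id}.png"))
    instance = np.asarray(Image.open(mission_dir / "instance" / f"{frame_id}.png"))
    dynamic_mask = np.asarray(Image.open(mission_dir / "dynamic" / f"{frame_id}.png")) > 0
    calib = np.load(mission_dir / "calib" / f"{frame_id}.npz")
    return Frame(rgb, depth, semantic, instance, dynamic_mask, calib["K"], calib["T_wc"])
```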

An overview of the dataset is presented below through color, semantic, and instance-level scene point clouds:

Below is the delentropic scene profile of each scenario, a complexity descriptor computed only from 2D measurements (images):

Per-scenario delentropic scene profiles

Delentropic Scene Profile

Delentropy

Delentropy [1] is a measure of image complexity that captures how much structural detail is present, based on the distribution of image gradients. While it works well for single images, we extend it to entire scenes by introducing the Delentropic Scene Profile (DSP). This provides a quantitative summary of how complex a scene is, based on multiple images taken from different viewpoints. Formally, for a given scene \( S \) composed of images \( \{I_k\}_{k=1}^N \), we compute the delentropy \( H_{\text{del},k} \) for each image, ignoring regions that don't contribute to reconstruction (like moving objects). The resulting values form a set \( \{ H_{\text{del},k} \} \), whose distribution describes the scene's overall complexity.
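As a reference point, here is a minimal per-image delentropy sketch in NumPy, assuming a grayscale input, central-difference gradients, and a 256-bin joint histogram; the exact gradient kernel and binning used in the paper may differ.

```python
from typing import Optional

import numpy as np


def delentropy(gray: np.ndarray, valid_mask: Optional[np.ndarray] = None, bins: int = 256) -> float:
    """Delentropy of a grayscale image: half the Shannon entropy of the
    joint gradient distribution (the "deldensity").

    `valid_mask` optionally restricts the estimate to static pixels,
    e.g. to ignore dynamic objects as done for the DSP.
    """
    gray = gray.astype(np.float64)
    fy, fx = np.gradient(gray)  # central-difference gradients
    if valid_mask is not None:
        fx, fy = fx[valid_mask], fy[valid_mask]
    # Joint 2D histogram of the gradient components.
    hist, _, _ = np.histogram2d(fx.ravel(), fy.ravel(), bins=bins)
    p = hist / hist.sum()
    p = p[p > 0]
    return -0.5 * float(np.sum(p * np.log2(p)))
```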

To model this distribution, we fit a truncated Beta distribution:

\[ \text{DSP}_S \sim \text{Beta}(H_{\text{del}} \mid \alpha, \beta, a, b) = \frac{(H_{\text{del}} - a)^{\alpha - 1} (b - H_{\text{del}})^{\beta - 1}}{(b - a)^{\alpha + \beta - 1} B(\alpha, \beta)} \]

Here, \( \alpha \) and \( \beta \) shape the curve, while \( a \) and \( b \) define the observed range of delentropy values in the scene. These parameters are estimated using maximum likelihood. The DSP provides a compact, interpretable fingerprint of scene complexity, which we show correlates strongly with reconstruction difficulty.
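A minimal fitting sketch with SciPy follows, under the assumption that the support endpoints \( a, b \) are taken from the (slightly padded) observed range and only \( \alpha, \beta \) are optimized by maximum likelihood; the paper's exact estimation procedure may differ.

```python
import numpy as np
from scipy import stats


def fit_dsp(delentropies: np.ndarray, margin: float = 1e-3):
    """Fit a shifted/scaled Beta to the per-image delentropy values of one scene.

    Returns (alpha, beta, a, b): the support [a, b] is the padded observed
    range, and alpha, beta are maximum-likelihood estimates with loc/scale fixed.
    """
    x = np.asarray(delentropies, dtype=np.float64)
    a, b = x.min() - margin, x.max() + margin  # padding keeps samples strictly inside (a, b)
    alpha, beta, _, _ = stats.beta.fit(x, floc=a, fscale=b - a)
    return alpha, beta, a, b


# Usage: per_image_H = [delentropy(img, valid_mask=~dyn) for img, dyn in scene]
#        alpha, beta, a, b = fit_dsp(np.array(per_image_H))
```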

Results

Delentropy (blue) shows a strong correlation with the novel-view reconstruction error produced by NeRF and Gaussian Splatting methods, and provides significantly tighter predictive bounds than the baseline methods (red and green).

Delentropy Correlation Results
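For completeness, a small sketch of how such a complexity-versus-error relationship can be quantified; the choice of per-scene complexity summary and error metric here is an assumption for illustration, not necessarily the paper's protocol.

```python
import numpy as np
from scipy import stats


def complexity_error_correlation(scene_complexity: np.ndarray,
                                 scene_error: np.ndarray) -> tuple:
    """Pearson and Spearman correlation between a per-scene complexity summary
    (e.g. mean delentropy or the DSP mean) and a reconstruction error metric."""
    pearson_r, _ = stats.pearsonr(scene_complexity, scene_error)
    spearman_rho, _ = stats.spearmanr(scene_complexity, scene_error)
    return float(pearson_r), float(spearman_rho)
```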

Benchmark Results

Delentropy Correlation Results

Cite us

@article{beche2025claravid,
  title={ClaraVid: A Holistic Scene Reconstruction Benchmark From Aerial Perspective With Delentropy-Based Complexity Profiling},
  author={Beche, Radu and Nedevschi, Sergiu},
  journal={arXiv preprint arXiv:2503.17856},
  year={2025}
}