Cross-device Collaborative Dataset for Localization

Lamarr Institute / University of Bonn · ETH Zurich · Microsoft

CroCoDL: the first dataset to contain sensor recordings from real-world robots, phones, and mixed-reality headsets, covering a total of 10 challenging locations to benchmark cross-device and human-robot visual registration.

Abstract

Accurate localization plays a pivotal role in the autonomy of systems operating in unfamiliar environments, particularly when interaction with humans is expected. High-accuracy visual localization systems encompass various components, such as feature extractors, matchers, and pose estimation methods. This complexity makes robust evaluation settings and pipelines a necessity. However, existing datasets and benchmarks primarily focus on single-agent scenarios, overlooking the critical issue of cross-device localization. Different agents carry different sensors, each with its own strengths and weaknesses, and the data available to them varies substantially.

This work addresses this gap by enhancing an existing augmented reality visual localization benchmark with data from legged robots, and evaluating human-robot, cross-device mapping and localization. Our contributions extend beyond device diversity and include high environment variability, spanning ten distinct locations ranging from disaster sites to art exhibitions. Each scene in our dataset features recordings from robot agents, hand-held and head-mounted devices, and high-accuracy ground truth LiDAR scanners, resulting in a comprehensive multi-agent dataset and benchmark. This work represents a significant advancement in the field of visual localization benchmarking, with key insights into the performance of cross-device localization methods across diverse settings.
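To make the evaluation setting concrete: visual localization benchmarks of this kind typically score an estimated camera pose by its translation error and its rotation error against the ground truth, and report recall at thresholds such as 10cm/10°. The sketch below is not part of the dataset's tooling; it is a minimal NumPy illustration of this standard metric, with the function name `pose_errors` chosen by us.

```python
import numpy as np

def pose_errors(R_est, t_est, R_gt, t_gt):
    """Return (translation error, rotation error in degrees).

    R_* are 3x3 rotation matrices, t_* are 3-vectors; the translation
    error is in the same unit as the input translations.
    """
    t_err = np.linalg.norm(t_est - t_gt)
    # The angle of the relative rotation R_est^T R_gt is the geodesic
    # distance on SO(3); recover it from the trace, clipped for safety.
    cos_angle = (np.trace(R_est.T @ R_gt) - 1.0) / 2.0
    r_err = np.degrees(np.arccos(np.clip(cos_angle, -1.0, 1.0)))
    return t_err, r_err

# Example: identity estimate vs. a pose rotated 90° about z and
# shifted by (3, 4, 0) -> translation error 5.0, rotation error 90°.
R_gt = np.array([[0., -1., 0.], [1., 0., 0.], [0., 0., 1.]])
print(pose_errors(np.eye(3), np.zeros(3), R_gt, np.array([3., 4., 0.])))
```

A pose is then counted as correctly localized if both errors fall below the chosen thresholds, and recall is the fraction of query images localized this way.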

Renderings vs. real CroCoDL data


Qualitative Results. Good alignment between renderings and real images validates the camera poses with respect to the NavVis-based ground-truth scan from which we render here.


Locations

New locations of the CroCoDL dataset. Each location has high-quality meshes, obtained from LiDAR, which are registered with numerous phone, AR headset, and robotic sequences.

Commonly used datasets for visual localization and SLAM


| Dataset | Motion | Env. | Locations | Changes | Sensors | GT pose accuracy | Seqs. |
|---|---|---|---|---|---|---|---|
| KITTI | πŸš— | ⬛ | 1 | πŸƒπŸŒ¦οΈ | RGB, LiDAR, IMU | <10cm (RTK GPS) | 22 |
| TUM RGBD | βœ‹πŸ›Ό | ⬜ | 2 | πŸƒ | RGB-D, IMU | 1mm (mocap) | 80 |
| Malaga | πŸš— | ⬛ | 1 | πŸƒπŸŒ¦οΈ | RGB, IMU | (GPS) | 15 |
| EuRoC | πŸ›Έ | ⬜ | 2 | – | RGB, IMU | 1mm (mocap) | 11 |
| NCLT | πŸ›Ό | β¬œβ¬› | 1 | πŸƒπŸͺ‘πŸŒ’πŸŒ¦οΈ | RGB, LiDAR, IMU, GNSS | <10cm (GPS + IMU + LiDAR) | 27 |
| PennCOSYVIO | βœ‹ | β¬œβ¬› | 1 | πŸƒπŸŒ¦οΈ | RGB, IMU | 15cm (visual tags) | 4 |
| TUM VIO | βœ‹ | β¬œβ¬› | 4 | – | RGB, IMU | 1mm (mocap at ends) | 28 |
| UZH-FPV | πŸ›Έ | β¬œβ¬› | 2 | – | RGB, event camera, IMU | ~1cm (total station + VI-BA) | 28 |
| ETH3D SLAM | βœ‹ | ⬜ | 1 | – | RGB, depth, IMU | 1mm (mocap) | 96 |
| Newer College | βœ‹ | β¬œβ¬› | 1 | – | RGB, LiDAR, IMU | 3cm (LiDAR ICP) | 3 |
| OpenLoris-Scene | πŸ›Ό | ⬜ | 5 | πŸƒπŸͺ‘ | RGB-D, IMU, wheel odom. | <10cm (2D LiDAR) | 22 |
| TartanAir | syn. | β¬œβ¬› | 30 | – | RGB | perfect (synthetic) | 30 |
| UMA VI | βœ‹πŸš— | β¬œβ¬› | 2 | – | RGB, IMU | (visual tags) | 32 |
| UrbanLoco | πŸš— | ⬛ | 12 | πŸƒπŸͺ‘ | RGB, LiDAR, IMU, GNSS | SPAN-CPT | 12 |
| Naver Labs | πŸ›Ό | ⬜ | 5 | πŸƒπŸͺ‘ | RGB, LiDAR, IMU | <10cm (LiDAR SLAM and SfM) | 10 |
| HILTI SLAM | βœ‹ | β¬œβ¬› | 8 | – | RGB, LiDAR, IMU | <5mm (total station) | 12 |
| Graco | πŸ›ΌπŸ›Έ | ⬛ | 1 | πŸƒπŸŒ¦οΈ | RGB, LiDAR, GPS, IMU | β‰ˆ1cm (GNSS) | 14 |
| FusionPortable* | βœ‹πŸ¦ΏπŸ›ΌπŸš™ | β¬œβ¬› | 9 | – | RGB, event cameras, LiDAR, IMU, GPS | β‰ˆ1cm (GNSS RTK) | 41 |
| LaMAR | βœ‹πŸ₯½ | β¬œβ¬› | 3 | πŸƒπŸͺ‘πŸŒ’πŸ—οΈπŸŒ¦οΈ | RGB, LiDAR, depth, IMU, WiFi/BT | <10cm (LiDAR + PGO + PGO-BA) | 500 |
| CroCoDL | βœ‹πŸ₯½πŸ¦Ώ[πŸ›Έ] | β¬œβ¬› | 10 | πŸƒπŸͺ‘πŸŒ’πŸ—οΈπŸŒ¦οΈ | RGB, LiDAR, depth, IMU, WiFi/BT | ~10cm (LiDAR + PGO + PGO-BA) | 500+800 |

Legend

Environment: ⬜ inside, ⬛ outside;

Changes: πŸƒ Structural changes due to moving people, πŸͺ‘ long-term changes due to displaced furniture, 🌦️ weather, πŸŒ’ day-night, πŸ—οΈ construction work;

Trajectory motion from sensors mounted on: πŸ›Ό ground vehicle, 🦿 legged robot, πŸ›Έ drone, πŸš™ car, βœ‹ hand-held, πŸ₯½ head-mounted, β€˜syn.’ synthetic.

(noted with *: at most two devices are recorded in the same location; [πŸ›Έ]: drone recordings are not aligned and, due to safety/permission restrictions, could only be captured in 8 of the 10 locations)


BibTeX

@inproceedings{blum2025crocodl,
  author    = {Blum, Hermann and Mercurio, Alessandro and O'Reilly, Josua and Engelbracht, Tim and Dusmanu, Mihai and Pollefeys, Marc and Bauer, Zuria},
  title     = {CroCoDL: Cross-device Collaborative Dataset for Localization},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year      = {2025},
}

CroCoDL Team