Accurate localization plays a pivotal role in the autonomy of systems operating in unfamiliar environments, particularly when interaction with humans is expected. High-accuracy visual localization systems combine several components, such as feature extractors, matchers, and pose estimation methods, and this complexity calls for robust evaluation settings and pipelines. However, existing datasets and benchmarks primarily focus on single-agent scenarios and overlook the critical issue of cross-device localization: different agents carry different sensors, each with its own strengths and weaknesses, and the data available to them varies substantially.
This work addresses this gap by enhancing an existing augmented-reality visual localization benchmark with data from legged robots and by evaluating cross-device mapping and localization between human-operated and robotic agents. Our contributions go beyond device diversity and include high environment variability, spanning ten distinct locations that range from disaster sites to art exhibitions. Each scene features recordings from robot agents, hand-held and head-mounted devices, and high-accuracy ground-truth LiDAR scanners, resulting in a comprehensive multi-agent dataset and benchmark. The result is a significant step forward in visual localization benchmarking, with key insights into the performance of cross-device localization methods across diverse settings.
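As background for how such benchmarks are typically scored, visual localization evaluations compare each estimated query pose against its ground-truth pose and report recall at fixed translation/rotation thresholds. The sketch below is a minimal NumPy illustration of that generic metric under assumed conventions (4x4 camera-to-world poses); all function names and threshold values are placeholders, not the CroCoDL evaluation code.

```python
# Hedged sketch: pose-recall evaluation as commonly used in visual localization
# benchmarks. Function names, pose conventions, and thresholds are hypothetical.
import numpy as np

def pose_errors(T_est, T_gt):
    """Translation (m) and rotation (deg) error between two 4x4 camera-to-world poses."""
    dt = np.linalg.norm(T_est[:3, 3] - T_gt[:3, 3])
    R_delta = T_est[:3, :3].T @ T_gt[:3, :3]
    cos_angle = np.clip((np.trace(R_delta) - 1.0) / 2.0, -1.0, 1.0)
    dr = np.degrees(np.arccos(cos_angle))
    return dt, dr

def recall_at(errors, thresholds=((0.01, 1.0), (0.1, 5.0), (1.0, 10.0))):
    """Fraction of queries localized within each (translation m, rotation deg) threshold."""
    errors = np.asarray(errors)  # shape (N, 2): [translation error, rotation error]
    return {f"{t}m/{r}deg": float(np.mean((errors[:, 0] <= t) & (errors[:, 1] <= r)))
            for t, r in thresholds}

if __name__ == "__main__":
    # Usage with synthetic stand-in poses instead of real benchmark estimates.
    rng = np.random.default_rng(0)
    errs = []
    for _ in range(100):
        T_gt = np.eye(4); T_gt[:3, 3] = rng.normal(size=3)
        T_est = T_gt.copy(); T_est[:3, 3] += rng.normal(scale=0.05, size=3)
        errs.append(pose_errors(T_est, T_gt))
    print(recall_at(errs))
```

The threshold pairs above are only illustrative; the thresholds actually used should be taken from the benchmark's evaluation protocol.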
Dataset | Motion | Env. | Locations | Changes | Sensors | GT pose accuracy | Seqs. |
---|---|---|---|---|---|---|---|
KITTI | car | out | 1 | people, weather | RGB, LiDAR, IMU | <10cm (RTK GPS) | 22 |
TUM RGBD | hand-held, ground vehicle | in | 2 | people | RGB-D, IMU | 1mm (mocap) | 80 |
Malaga | car | out | 1 | people, weather | RGB, IMU | (GPS) | 15 |
EUROC | drone | in | 2 | - | RGB, IMU | 1mm (mocap) | 11 |
NCLT | ground vehicle | in + out | 1 | people, furniture, day-night, weather | RGB, LiDAR, IMU, GNSS | <10cm (GPS + IMU + LiDAR) | 27 |
PennCOSYVIO | hand-held | in + out | 1 | people, weather | RGB, IMU | 15cm (visual tags) | 4 |
TUM VIO | hand-held | in + out | 4 | - | RGB, IMU | 1mm (mocap at start/end) | 28 |
UZH-FPV | drone | in + out | 2 | - | RGB, event camera, IMU | ~1cm (total station + VI-BA) | 28 |
ETH3D SLAM | hand-held | in | 1 | - | RGB, depth, IMU | 1mm (mocap) | 96 |
Newer College | hand-held | in + out | 1 | - | RGB, LiDAR, IMU | 3cm (LiDAR ICP) | 3 |
OpenLoris-Scene | ground vehicle | in | 5 | people, furniture | RGB-D, IMU, wheel odom. | <10cm (2D LiDAR) | 22 |
TartanAir | syn. | in + out | 30 | - | RGB | perfect (synthetic) | 30 |
UMA VI | hand-held, car | in + out | 2 | - | RGB, IMU | (visual tags) | 32 |
UrbanLoco | car | out | 12 | people, furniture | RGB, LiDAR, IMU, GNSS, SPAN-CPT | (SPAN-CPT) | 12 |
Naver Labs | ground vehicle | in | 5 | people, furniture | RGB, LiDAR, IMU | <10cm (LiDAR SLAM and SfM) | 10 |
HILTI SLAM | hand-held | in + out | 8 | - | RGB, LiDAR, IMU | <5mm (total station) | 12 |
Graco | ground vehicle, drone | out | 1 | people, weather | RGB, LiDAR, GPS, IMU | ~1cm (GNSS) | 14 |
FusionPortable | 2 of {hand-held, legged robot, ground vehicle, car}* | in + out | 9 | - | RGB, event cameras, LiDAR, IMU, GPS | ~1cm (GNSS RTK) | 41 |
LaMAR | hand-held, head-mounted | in + out | 3 | people, furniture, day-night, construction, weather | RGB, LiDAR, depth, IMU, WiFi/BT | <10cm (LiDAR + PGO + PGO-BA) | 500 |
CroCoDL | hand-held, head-mounted, legged robot, [drone] | in + out | 10 | people, furniture, day-night, construction, weather | RGB, LiDAR, depth, IMU, WiFi/BT | ~10cm (LiDAR + PGO + PGO-BA) | 500 + 800 |
Legend
Environment: "in" = inside, "out" = outside;
Changes: "people" = structural changes due to moving people, "furniture" = long-term changes due to displaced furniture, "weather", "day-night", "construction" = construction work;
Motion: trajectories from sensors mounted on a ground vehicle, legged robot, drone, or car, carried hand-held or head-mounted, or generated synthetically ("syn.").
(*: at most 2 devices are recorded in the same location; [drone]: not aligned, due to safety / permission reasons - we could only capture drone footage in 8/10 locations.)
@inproceedings{blum2025crocodl,
author = {Blum, Hermann and Mercurio, Alessandro and O'Reilly, Josua and Engelbracht, Tim and Dusmanu, Mihai and Pollefeys, Marc and Bauer, Zuria},
title = {CroCoDL: Cross-device Collaborative Dataset for Localization},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
year = {2025},
}