This paper describes a multi-modal data association method for global localization using object-based maps and camera images. In global localization, or relocalization, with object-based maps, existing methods typically match all possible combinations of detected objects and landmarks that share the same object category, followed by inlier extraction using RANSAC or brute-force search. This approach becomes infeasible as the number of landmarks increases, because the number of correspondence candidates grows exponentially. In this paper, we propose labeling landmarks with natural language descriptions and extracting correspondences based on conceptual similarity with image observations using a Vision Language Model (VLM). By leveraging detailed text information, our approach extracts correspondences more efficiently than methods that use only object categories. Through experiments, we demonstrate that the proposed method achieves more accurate global localization with fewer iterations than baseline methods, confirming its efficiency.
We propose to match visual object observations with text-labeled map landmarks using multi-modal similarity computed by CLIP. This allows correspondence matching based on fine-grained semantic information given as arbitrary language descriptions of the landmarks. Observations and landmarks are matched by k-nearest-neighbor search in the shared feature space.
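As a rough illustration, the sketch below (not the actual implementation) shows how detection crops and landmark descriptions could be embedded with CLIP and matched by cosine similarity with a top-k selection. The backbone choice, landmark descriptions, and image file names are illustrative assumptions.

```python
# Minimal sketch of CLIP-based observation-landmark matching (not the authors' code).
# Assumptions: landmarks carry free-form text labels, detections are given as image
# crops, and OpenAI's "clip" package (https://github.com/openai/CLIP) is installed.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)  # backbone is illustrative

landmark_texts = ["a red fire extinguisher on the wall",   # hypothetical descriptions
                  "a tall green potted plant",
                  "a grey office chair"]
detection_crops = [Image.open(p) for p in ["det_0.png", "det_1.png"]]  # hypothetical crops

with torch.no_grad():
    text_feat = model.encode_text(clip.tokenize(landmark_texts).to(device))
    img_feat = torch.cat([model.encode_image(preprocess(c).unsqueeze(0).to(device))
                          for c in detection_crops])

# Cosine similarity between every detection and every landmark description.
text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
similarity = img_feat @ text_feat.T               # (num_detections, num_landmarks)

# Keep the k most similar landmarks per detection as correspondence candidates.
k = 2
scores, indices = similarity.topk(k, dim=-1)
candidates = [(det_idx, lm_idx.item(), s.item())
              for det_idx, (lm_row, s_row) in enumerate(zip(indices, scores))
              for lm_idx, s in zip(lm_row, s_row)]
```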
Conventional RANSAC relies on uniform random sampling, which is inefficient because the correspondence candidates generated in the previous step inherently contain a large fraction of outliers. To mitigate this problem, we use PROSAC, which samples the most promising correspondences first, leading to better efficiency. The similarity between a visual observation and the text assigned to a landmark serves as the score of the correspondence candidate.
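The simplified sketch below illustrates score-guided hypothesis sampling in the spirit of PROSAC: candidates are ranked by their CLIP similarity score and samples are drawn from a progressively enlarged pool of the highest-ranked candidates. The exact PROSAC growth schedule and stopping criterion are omitted, and the pose-estimation and inlier-counting helpers are hypothetical placeholders rather than the paper's implementation.

```python
# Simplified, score-guided sampling in the spirit of PROSAC (not the full algorithm).
# `candidates` is a list of (detection_idx, landmark_idx, similarity_score) tuples,
# e.g. the output of the matching sketch above. `estimate_pose` and `count_inliers`
# are placeholders for a pose solver (e.g. PnP) and an inlier-consistency check.
import random

def prosac_like_sampling(candidates, estimate_pose, count_inliers,
                         min_set_size=3, max_iters=1000):
    # Rank candidates by similarity score, most promising first.
    ranked = sorted(candidates, key=lambda c: c[2], reverse=True)

    best_pose, best_inliers = None, -1
    pool_size = min_set_size                  # start from the top-ranked candidates only
    for _ in range(max_iters):
        sample = random.sample(ranked[:pool_size], min_set_size)
        pose = estimate_pose(sample)          # hypothesize a pose from the minimal set
        inliers = count_inliers(pose, ranked) # count correspondences consistent with it
        if inliers > best_inliers:
            best_pose, best_inliers = pose, inliers
        # Gradually enlarge the pool so lower-ranked candidates are considered later.
        if pool_size < len(ranked):
            pool_size += 1
    return best_pose, best_inliers
```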
@inproceedings{Matsuzaki2024ICRA,
author = {Matsuzaki, Shigemichi and Sugino, Takuma and Tanaka, Kazuhito and Sha, Zijun and Nakaoka, Shintaro and Yoshizawa, Shintaro and Shintani, Kazuhiro},
booktitle = {IEEE International Conference on Robotics and Automation},
doi = {10.1109/ICRA57147.2024.10611393},
month = {may},
pages = {13673--13679},
publisher = {IEEE},
title = {{CLIP-Loc: Multi-modal Landmark Association for Global Localization in Object-based Maps}},
url = {http://arxiv.org/abs/2402.06092},
year = {2024}
}
This project page was developed and published solely for visualization of the publication ``CLIP-Loc: Multi-modal Landmark Association for Global Localization in Object-based Maps''. We do not guarantee future maintenance or monitoring of this page.
Contents may be updated or deleted without notice, depending on updates to the original manuscript or changes in policy.
This webpage template was adapted from DiffusionNOCS -- we thank Takuya Ikeda for additional support and making their source available.