CLIP-Loc: Multi-modal Landmark Association for Global Localization in Object-based Maps

IEEE International Conference on Robotics and Automation (ICRA), 2024
Shigemichi Matsuzaki, Takuma Sugino, Kazuhito Tanaka, Zijun Sha, Shintaro Nakaoka, Shintaro Yoshizawa, and Kazuhiro Shintani
Frontier Research Center,
Toyota Motor Corporation

In conventional global localization with object-based maps, only class information is used to construct the set of correspondence candidates, leading to exponential growth in the number of candidates as the number of landmarks increases. In this paper, we propose assigning a natural language description as a label to each landmark and matching the labels with visual observations using a Vision Language Model (VLM). This enables more efficient correspondence matching by leveraging the fine-grained information provided by the text labels.

Abstract


This paper describes a multi-modal data association method for global localization using object-based maps and camera images. In global localization, or relocalization, with object-based maps, existing methods typically resort to matching all possible combinations of detected objects and landmarks of the same object category, followed by inlier extraction using RANSAC or brute-force search. This approach becomes infeasible as the number of landmarks increases due to the exponential growth of correspondence candidates. In this paper, we propose labeling landmarks with natural language descriptions and extracting correspondences based on conceptual similarity with image observations using a Vision Language Model (VLM). By leveraging detailed text information, our approach extracts correspondences more efficiently than methods that use only object categories. Through experiments, we demonstrate that the proposed method enables more accurate global localization with fewer iterations than baseline methods, highlighting its efficiency.

Method


1. Correspondence generation using CLIP

We propose to match visual object observations with text-labeled map landmarks using multi-modal similarity computed by CLIP. This allows for correspondence matching based on fine-grained semantic information given as arbitrary language descriptions of the landmarks. Observations and landmarks are matched by k-nearest-neighbor search in the feature space.
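
As an illustration, the sketch below shows how such multi-modal matching could look, assuming PyTorch and the openai clip package; the image crops, text labels, and the choice of k are hypothetical placeholders, not the authors' implementation.

# Minimal sketch of CLIP-based correspondence generation (not the authors' code).
# Assumes: torch and the openai CLIP package (pip install git+https://github.com/openai/CLIP.git)
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Hypothetical inputs: crops of detected objects and text labels of map landmarks.
object_crops = [Image.open(p) for p in ["det_0.png", "det_1.png"]]
landmark_labels = ["a red fire extinguisher on the wall",
                   "a blue recycling bin next to the door"]

with torch.no_grad():
    # Embed detections (image modality) and landmark labels (text modality)
    # in the shared CLIP feature space.
    img_feats = model.encode_image(
        torch.stack([preprocess(c) for c in object_crops]).to(device))
    txt_feats = model.encode_text(clip.tokenize(landmark_labels).to(device))

    # Cosine similarity between every detection and every landmark label.
    img_feats = img_feats / img_feats.norm(dim=-1, keepdim=True)
    txt_feats = txt_feats / txt_feats.norm(dim=-1, keepdim=True)
    sim = img_feats @ txt_feats.T            # shape: (num_detections, num_landmarks)

# k-nearest-neighbor matching in the feature space:
# keep the top-k landmarks for each detection as correspondence candidates.
k = 3
scores, indices = sim.topk(min(k, sim.shape[1]), dim=1)
candidates = [(det, int(lm), float(s))
              for det, (lms, ss) in enumerate(zip(indices, scores))
              for lm, s in zip(lms, ss)]
print(candidates)  # (detection index, landmark index, similarity score) triplets

The resulting similarity scores are carried over to the next step, where they rank the correspondence candidates.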

2. Inlier extraction based on PROSAC

Conventional RANSAC relies on completely random sampling, which is inefficient because the correspondence candidates generated in the previous step inherently contain a significant number of outliers. To mitigate this problem, we propose to use PROSAC, which samples the most promising correspondences first, leading to better efficiency. The similarity between a visual observation and the text label assigned to a landmark is used as the score of each correspondence candidate.
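
The following is a simplified sketch of such a score-guided sampling loop, assuming the candidates and scores from the previous step; fit_pose and residual are hypothetical problem-specific callbacks (e.g., a PnP solver and a reprojection error), and the pool growth schedule is simplified relative to the original PROSAC formulation.

# Simplified PROSAC-style inlier extraction (a sketch, not the authors' implementation).
import random

def prosac_like(candidates, fit_pose, residual, min_set=3,
                inlier_thresh=0.05, max_iters=1000, seed=0):
    """candidates: (detection, landmark, score) triplets from the CLIP matching step.
    fit_pose / residual are hypothetical problem-specific callbacks."""
    if len(candidates) < min_set:
        return None, []
    rng = random.Random(seed)
    # Sort by similarity score, best first; this ordering is what PROSAC exploits.
    ordered = sorted(candidates, key=lambda c: c[2], reverse=True)

    best_pose, best_inliers = None, []
    for it in range(max_iters):
        # Progressively enlarge the sampling pool so that early iterations
        # draw only from the highest-ranked correspondences.
        pool = min(len(ordered), min_set + it * max(1, len(ordered) // max_iters))
        sample = rng.sample(ordered[:pool], min_set)

        pose = fit_pose(sample)              # e.g., a minimal PnP solve
        if pose is None:
            continue
        inliers = [c for c in ordered if residual(pose, c) < inlier_thresh]
        if len(inliers) > len(best_inliers):
            best_pose, best_inliers = pose, inliers
    return best_pose, best_inliers

Because promising candidates are sampled first, a good hypothesis tends to be found after far fewer iterations than with uniform random sampling.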


Results


More results, such as ablation studies, can be found in the paper.

Our Related Projects


An extended work of CLIP-Loc. We proposed a hybrid object descriptor that combines the advantages of a VLM (CLIP) and general object detectors via semantic graphs. To make inlier extraction more robust, we employed a graph-theoretic algorithm.

Citation


@inproceedings{Matsuzaki2024ICRA,
    author = {Matsuzaki, Shigemichi and Sugino, Takuma and Tanaka, Kazuhito and Sha, Zijun and Nakaoka, Shintaro and Yoshizawa, Shintaro and Shintani, Kazuhiro},
    booktitle = {IEEE International Conference on Robotics and Automation},
    doi = {10.1109/ICRA57147.2024.10611393},
    month = {may},
    pages = {13673--13679},
    publisher = {IEEE},
    title = {{CLIP-Loc: Multi-modal Landmark Association for Global Localization in Object-based Maps}},
    url = {http://arxiv.org/abs/2402.06092},
    year = {2024}
}

Notification


This project page was developed and published solely as part of the publication "CLIP-Loc: Multi-modal Landmark Association for Global Localization in Object-based Maps" for its visualization. We do not guarantee future maintenance or monitoring of this page.

Contents may be updated or deleted without notice, following updates to the original manuscript or changes in policy.

This webpage template was adapted from DiffusionNOCS -- we thank Takuya Ikeda for additional support and for making their source available.