Do Visual-Language Grid Maps Capture Latent Semantics?

Matti Pekkanen, Tsvetomila Mihaylova, Francesco Verdoja, and Ville Kyrki

Aalto University

Abstract

Visual-language models (VLMs) have recently been introduced in robotic mapping by using the latent representations, i.e., embeddings, of the VLMs to represent the natural language semantics in the map. The main benefit is moving beyond a small set of human-created labels toward open-vocabulary scene understanding. While there is anecdotal evidence that maps built this way support downstream tasks, such as navigation, a rigorous analysis of the quality of the maps using these embeddings is missing. In this paper, we propose a way to analyze the quality of maps created using VLMs by evaluating two critical properties: queryability and distinctness. We demonstrate the proposed method by evaluating the maps created by two state-of-the-art methods, VLMaps and OpenScene, with two encoders, LSeg and OpenSeg, on real-world data from the Matterport3D dataset.

Queryability

Method

  • Captures the ability to retrieve information from the embeddings.
  • The hypothesis is that from a map with high queryability, it is possible to find accurate matches to a wide range of queries.
  • We compare open-vocabulary querying with semantic segmentation.
Fig. 1: Queryability
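As a rough illustration of retrieving information from map embeddings, the sketch below labels each map cell by its most similar text embedding and, separately, answers a single query with a similarity threshold. It is a minimal sketch under stated assumptions, not the evaluated pipeline: the function names, the toy 3-dimensional vectors, and the threshold of 0.5 are all hypothetical, and the thresholding rule is not the querying method proposed in VLMaps.

```python
import numpy as np

def cosine_similarity_matrix(a, b):
    """Pairwise cosine similarity between rows of a (N, D) and rows of b (K, D)."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

def segment_by_labels(cell_embeddings, label_embeddings):
    """Closed-vocabulary labelling: each cell gets the index of its most similar label."""
    sims = cosine_similarity_matrix(cell_embeddings, label_embeddings)
    return np.argmax(sims, axis=1)

def query_map(cell_embeddings, query_embedding, threshold=0.5):
    """Single-query retrieval: boolean mask of cells above an assumed similarity threshold."""
    sims = cosine_similarity_matrix(cell_embeddings, query_embedding[None, :])[:, 0]
    return sims > threshold

# Toy vectors standing in for VLM embeddings (hypothetical values).
cells = np.array([[1.0, 0.1, 0.0], [0.0, 1.0, 0.1], [0.9, 0.0, 0.2]])
labels = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])  # e.g. text embeddings of two labels

predicted = segment_by_labels(cells, labels)
mask = query_map(cells, labels[0], threshold=0.5)
```

In practice the cell embeddings would come from the map and the label or query embeddings from the text encoder paired with LSeg or OpenSeg.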

Results

Method     Encoder  Segmentation F1  Segmentation IoU  Querying F1  Querying IoU
OpenScene  LSeg     0.645            0.480             0.174        0.095
OpenScene  OpenSeg  0.603            0.435             0.111        0.059
VLMaps     LSeg     0.541            0.375             0.425        0.271
VLMaps     OpenSeg  0.495            0.333             0.367        0.226
Table 1: Queryability results
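For readers less familiar with the two scores in Table 1, the following sketch computes per-class F1 and IoU from predicted and ground-truth cell labels using their standard definitions. How the scores are averaged over classes and maps is not stated on this page, so the unweighted class mean used here is an assumption, as are the function name and the toy labels.

```python
import numpy as np

def per_class_scores(pred, truth, num_classes):
    """Return (f1, iou) arrays with one entry per class; NaN where a class never occurs."""
    f1 = np.full(num_classes, np.nan)
    iou = np.full(num_classes, np.nan)
    for c in range(num_classes):
        tp = np.sum((pred == c) & (truth == c))  # cells correctly labelled c
        fp = np.sum((pred == c) & (truth != c))  # cells wrongly labelled c
        fn = np.sum((pred != c) & (truth == c))  # cells of class c that were missed
        if tp + fp + fn == 0:
            continue
        iou[c] = tp / (tp + fp + fn)
        f1[c] = 2 * tp / (2 * tp + fp + fn)
    return f1, iou

# Toy labels for six cells and three classes (hypothetical values).
truth = np.array([0, 0, 1, 1, 1, 2])
pred = np.array([0, 1, 1, 1, 2, 2])

f1, iou = per_class_scores(pred, truth, num_classes=3)
mean_f1, mean_iou = np.nanmean(f1), np.nanmean(iou)  # assumed unweighted class average
```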

Intra-map distinctness

Method

  • Captures the similarity of embeddings with the same semantic label within a single map.
  • The hypothesis is that embeddings sharing semantic meaning are clustered together in the latent space, allowing the distinguishing of different concepts.
Fig. 2: Intra-map distinctness
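One simple way to probe the clustering hypothesis is to compare the mean cosine similarity between cells that share a label with the mean similarity between cells of different labels. The sketch below does this for a single map. It is an assumed, simplified stand-in for the metric used in the paper, with a hypothetical function name and toy 2-dimensional embeddings.

```python
import numpy as np

def label_similarity_matrix(embeddings, labels, num_labels):
    """Entry (i, j) is the mean cosine similarity between cells of label i and label j.

    Diagonal entries exclude each cell's similarity to itself.
    Entries stay NaN when there are not enough cells to form a pair.
    """
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = unit @ unit.T
    out = np.full((num_labels, num_labels), np.nan)
    for i in range(num_labels):
        for j in range(num_labels):
            block = sims[np.ix_(labels == i, labels == j)]
            if i == j:
                n = block.shape[0]
                if n > 1:
                    out[i, j] = (block.sum() - np.trace(block)) / (n * (n - 1))
            elif block.size:
                out[i, j] = block.mean()
    return out

# Toy map with two labels and two cells per label (hypothetical values).
embeddings = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
labels = np.array([0, 0, 1, 1])

matrix = label_similarity_matrix(embeddings, labels, num_labels=2)
```

Under the hypothesis, the diagonal of `matrix` should be clearly larger than the off-diagonal entries.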

Results

Fig. 4: Intra-map distinctness results

Inter-map distinctness

Method

  • Captures the similarity of embeddings with the same label across different maps.
  • The hypothesis is that embeddings within the same label are closer to each other across maps than to those with different labels, therefore retaining their distinctness across maps.
Fig. 3: Inter-map distinctness
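The cross-map version of the hypothesis can be illustrated by summarising each label in each map with a mean embedding and comparing those summaries between two maps. Using label centroids is a simplification chosen for this sketch and is not necessarily how the paper measures inter-map distinctness. The function names and toy values are hypothetical.

```python
import numpy as np

def label_centroids(embeddings, labels, num_labels):
    """Unit-length mean embedding per label; rows stay NaN for labels absent from the map."""
    centroids = np.full((num_labels, embeddings.shape[1]), np.nan)
    for c in range(num_labels):
        members = embeddings[labels == c]
        if len(members):
            mean = members.mean(axis=0)
            centroids[c] = mean / np.linalg.norm(mean)
    return centroids

def inter_map_similarity(emb_a, labels_a, emb_b, labels_b, num_labels):
    """Entry (i, j): cosine similarity of label i's centroid in map A and label j's in map B."""
    cent_a = label_centroids(emb_a, labels_a, num_labels)
    cent_b = label_centroids(emb_b, labels_b, num_labels)
    return cent_a @ cent_b.T

# Two toy maps that share the same two labels (hypothetical values).
emb_a, labels_a = np.array([[1.0, 0.0], [0.0, 1.0]]), np.array([0, 1])
emb_b, labels_b = np.array([[0.8, 0.2], [0.2, 0.8]]), np.array([0, 1])

similarity = inter_map_similarity(emb_a, labels_a, emb_b, labels_b, num_labels=2)
```

If labels retain their distinctness across maps, each row of `similarity` should peak on the diagonal.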

Results

Fig. 5: Inter-map distinctness results of VLMaps

Resolution

  • We evaluate the effect of the grid map resolution by performing the semantic segmentation experiment on a range of cell sizes.
Fig. 6: The effect of resolution on queryability
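To make the role of the cell size concrete, the sketch below builds a grid map by averaging the embeddings of all points that fall into the same cubic cell, then rebuilds it at a coarser resolution. Averaging is an assumed fusion rule for illustration only. The function name, the cell sizes, and the toy points are hypothetical and do not describe how VLMaps or OpenScene construct their maps.

```python
import numpy as np

def build_grid_map(points, embeddings, cell_size):
    """Average the embeddings of all points that fall into the same cubic cell.

    Returns the integer cell indices (M, 3) and one mean embedding per cell (M, D).
    """
    indices = np.floor(points / cell_size).astype(np.int64)
    cells, inverse = np.unique(indices, axis=0, return_inverse=True)
    inverse = inverse.reshape(-1)  # the returned shape differs across NumPy versions
    sums = np.zeros((len(cells), embeddings.shape[1]))
    np.add.at(sums, inverse, embeddings)  # accumulate embeddings per cell
    counts = np.bincount(inverse, minlength=len(cells))
    return cells, sums / counts[:, None]

# Three toy points in metres with 2-dimensional embeddings (hypothetical values).
points = np.array([[0.02, 0.02, 0.0], [0.04, 0.01, 0.0], [0.12, 0.02, 0.0]])
embeddings = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])

fine_cells, fine_emb = build_grid_map(points, embeddings, cell_size=0.05)
coarse_cells, coarse_emb = build_grid_map(points, embeddings, cell_size=0.2)
```

Larger cells merge more points into one embedding, which is one reason segmentation quality may change with resolution.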

Conclusion

  • The 3D features used by OpenScene increase queryability. OpenScene has the best results in the semantic segmentation task.
  • Unlike image features, 3D features are not scale invariant. This means that a 3D network has to be trained for each resolution, making image-based methods more versatile.
  • The choice of encoder matters. LSeg performed better than OpenSeg with both methods in all of our experiments.
  • Open-vocabulary queryability is still an open problem. Currently, fully supervised semantic segmentation methods outperform all latent-semantic maps. The open-vocabulary querying method, proposed in VLMaps, cannot sufficiently partition the environment, yielding poor results.
  • While our quantitative evaluation was limited to a closed vocabulary, all the proposed metrics can be extended to open vocabulary.

Citation

If you find this work useful, please consider citing:

@inproceedings{pekkanen_2025_latent_semantics,
    title={Do Visual-Language Grid Maps Capture Latent Semantics?},
    author={Pekkanen, Matti and Mihaylova, Tsvetomila and Verdoja, Francesco and Kyrki, Ville},
    booktitle={2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)},
    publisher={IEEE},
    year={2025},
    month={October},
    address={Hangzhou, China},
    volume={},
    number={},
    pages={4059--4066},
    keywords={Measurement;Vocabulary;Three-dimensional displays;Image resolution;Accuracy;Navigation;Semantic segmentation;Semantics;Pipelines;Intelligent robots},
    doi={10.1109/IROS60139.2025.11247447}
}

Acknowledgements

Codebase

The codebase is based on the original works of VLMaps and OpenScene.

Support

This work was supported by Business Finland (decision 9249/31/2021), the Research Council of Finland (decision 354909), the Wallenberg AI, Autonomous Systems and Software Program (WASP), and Saab AB. We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan Xp GPUs used for this research.