QuASH: Using Natural-Language Heuristics to Query Visual-Language Robotic Maps

Matti Pekkanen, Francesco Verdoja, and Ville Kyrki

Aalto University

The problem

  • VLM embeddings are increasingly used as semantic descriptors in maps.
  • The key challenge is querying the embeddings.
  • State-of-the-art methods based on thresholding distances or using a single complementary query are insufficient for visual classification.

The solution

  • Training a classifier on text-description embeddings yields a good classifier in the visual domain, because textual descriptions and visual features are embedded in the same space.
  • QuASH uses a heuristic of natural-language synonyms and antonyms associated with the query to train a classifier in the embedding space.

Method

To find the target set related to a query q in the visual space:

  1. Sample words semantically similar to q to form a matching query set, and sample a non-matching query set in language space.
  2. Embed the matching and non-matching query sets with the text encoder.
  3. Train an off-the-shelf classifier in the shared embedding space.

The decision boundary of the classifier yields the estimated region of embeddings that match the query.
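
The three steps above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `toy_embed` helper is a hypothetical stand-in for a real text or visual encoder, the 16-dimensional space and the two orthogonal concept directions are assumptions chosen to keep the example small, and the linear SVM from scikit-learn is just one possible off-the-shelf classifier.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
DIM = 16  # toy embedding size (assumption); real encoders use far more dimensions


def toy_embed(direction, n, noise=0.1):
    """Hypothetical stand-in for an encoder: unit-norm noisy copies of a concept direction."""
    samples = direction + noise * rng.standard_normal((n, DIM))
    return samples / np.linalg.norm(samples, axis=1, keepdims=True)


# Two orthogonal concept directions in the shared embedding space (assumption).
match_dir = np.eye(DIM)[0]     # e.g. the query and its synonyms
nonmatch_dir = np.eye(DIM)[1]  # e.g. antonyms and unrelated words

# Steps 1-2: embedded matching and non-matching query sets.
pos_text = toy_embed(match_dir, 20)
neg_text = toy_embed(nonmatch_dir, 20)

# Step 3: train an off-the-shelf classifier on the text embeddings only.
X = np.vstack([pos_text, neg_text])
y = np.array([1] * len(pos_text) + [0] * len(neg_text))
clf = SVC(kernel="linear").fit(X, y)

# Apply the learned decision boundary to visual/map embeddings.
map_emb = np.vstack([toy_embed(match_dir, 5), toy_embed(nonmatch_dir, 5)])
pred = clf.predict(map_emb)  # 1 = matches the query, 0 = does not
```

No visual embedding is used for training here; the classifier only ever sees the text-side samples, which is the property the method relies on.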

QuASH method overview
QuASH trains a classifier from query-related language embeddings and applies the decision boundary in the visual/map embedding space.

Results: Image querying

[Figure: columns Sheep, Refrigerator, Person, Skateboard; rows Original, Ours (QuASH), Baseline]
Qualitative image-querying examples: original image, QuASH, and baseline.
Dataset      Method      F1      Precision   Recall
COCO         baseline    0.680   0.536       0.929
COCO         Our SVM     0.786   0.696       0.901
COCO         Our C-SVM   0.771   0.672       0.905
PC 459       baseline    0.654   0.540       0.831
PC 459       Our SVM     0.667   0.693       0.643
PC 459       Our C-SVM   0.728   0.680       0.783
Matterport   baseline    0.544   0.448       0.691
Matterport   Our SVM     0.621   0.626       0.616
Matterport   Our C-SVM   0.579   0.483       0.724
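
As a reading aid for the table, the sketch below recomputes F1 from the precision and recall columns. It assumes the F1 column is the harmonic mean of the reported precision and recall, which the page does not state explicitly; the `f1_score` helper and the choice of rows are illustrative only.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the image-querying table.
rows = {
    ("COCO", "baseline"): (0.536, 0.929, 0.680),
    ("COCO", "Our SVM"): (0.696, 0.901, 0.786),
    ("Matterport", "Our SVM"): (0.626, 0.616, 0.621),
}

# Absolute gap between the recomputed and the reported F1 for each row.
gaps = {key: abs(f1_score(p, r) - f1) for key, (p, r, f1) in rows.items()}
```

For these rows the gaps stay within rounding of the three-decimal values, which is consistent with the assumption. The COCO rows also show the trade-off behind the numbers: the baseline has the higher recall, while the SVM gains F1 through higher precision.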

Results: Map querying

[Figure: columns Chair, Shelving, Wall, Table; rows Ours (QuASH), Baseline]
Map-querying examples comparing QuASH and the baseline.
Encoder   Method             F1      Precision   Recall
LSeg      baseline           0.561   0.536       0.589
LSeg      our ground truth   0.574   0.567       0.583
LSeg      our LLM            0.396   0.289       0.656
OpenSeg   baseline           0.136   0.073       0.926
OpenSeg   our ground truth   0.445   0.622       0.369
OpenSeg   our LLM            0.242   0.154       0.586

Conclusion

  • QuASH achieves a higher F1-score than the baseline in all image experiments.
  • With ground-truth synonyms, QuASH achieves the highest F1-score in the map experiments.
  • With OpenSeg, QuASH achieves a higher F1-score than the baseline in the map experiments with either heuristic (ground-truth synonyms or LLM).
  • Leveraging semantic knowledge to heuristically sample from language space improves visual classification when the query-distribution approximation is good.

Citation

If you find this work useful, please consider citing:

@inproceedings{pekkanen_2026_quash,
    title = {{QuASH: Using Natural-Language Heuristics to Query Visual-Language Robotic Maps}},
    author = {Pekkanen, Matti and Verdoja, Francesco and Kyrki, Ville},
    booktitle = {IEEE International Conference on Robotics and Automation (ICRA)},
    year = {2026},
    month = {May},
    address = {Vienna, Austria},
    publisher = {IEEE},
    pages = {XX--YY},
    doi = {XXX}
}

Acknowledgements

Part of the codebase is based on the original works of VLMaps and OpenScene.