Our paper "InDiReCT: Language-Guided Zero-Shot Deep Metric Learning for Images" has been accepted at WACV 202313.10.2022
In this paper by K. Kobs, M. Steininger, and A. Hotho, we use language to guide an image embedding process such that the resulting embedding space is focused on a desired similarity notion.
In this paper, accepted to the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) 2023, we aim to create custom image embedding spaces without using any training images, but only by using text prompts. To this end, we propose a new Deep Metric Learning (DML) task for this exact purpose and also introduce and evaluate a new method to solve this task. Based on the popular multimodal model CLIP, we can specify text prompts that describe the desired similarity notion that should be reflected in the embedding space. The variance captured in these vectors is then captured using a dimensionality reduction technique that is then applied to the image embedding space.
The paper will be presented at WACV 2023, taking place from 3.–7. January 2023 in Waikoloa, Hawaii.
Common Deep Metric Learning (DML) datasets specify only one notion of similarity, e.g., two images in the Cars196 dataset are deemed similar if they show the same car model. We argue that depending on the application, users of image retrieval systems have different and changing similarity notions that should be incorporated as easily as possible. Therefore, we present Language-Guided Zero-Shot Deep Metric Learning (LanZ-DML) as a new DML setting in which users control the properties that should be important for image representations without training data by only using natural language. To this end, we propose InDiReCT (Image representations using Dimensionality Reduction on CLIP embedded Texts), a model for LanZ-DML on images that exclusively uses a few text prompts for training. InDiReCT utilizes CLIP as a fixed feature extractor for images and texts and transfers the variation in text prompt embeddings to the image embedding space. Extensive experiments on five datasets and overall thirteen similarity notions show that, despite not seeing any images during training, InDiReCT performs better than strong baselines and approaches the performance of fully-supervised models. An analysis reveals that InDiReCT learns to focus on regions of the image that correlate with the desired similarity notion, which makes it a fast to train and easy to use method to create custom embedding spaces only using natural language.
Konstantin Kobs, Michael Steininger, Andreas Hotho