Natural Language Processing

    Seminar Vision & Language


    The fields of Natural Language Processing and Computer Vision have both advanced greatly in recent years, thanks to improvements in hardware and the huge amounts of data available on the internet. At the intersection of the two modalities, text and image, lies the multimodal vision+language field, in which research interest has exploded in recent years. Vision+language deep learning covers a wide range of problems, including text-driven image generation and manipulation, automatic image captioning, image search, and reasoning and Q&A on images with text.

    The seminar covers the main research topics and challenges in the vision+language field: how tasks can be modeled for deep learning methods, which tasks and datasets have been created, and how models can be evaluated to measure how well they work. In this seminar, we review, explore, and debate these challenges based on recent research at the intersection of Machine Learning, Computer Vision, and Natural Language Processing.

    Each participant will be assigned one topic, including but not limited to those listed below. 

    Each participant is expected to prepare a written review report covering the state of the art on their topic, along with a corresponding oral presentation.

    At the same time, each participant is expected to read, comment on, and discuss the reports written and presented by the other participants.

    Each participant will gain skills in critical analysis, scientific discourse, and the preparation, writing, and presentation of a research topic. In addition, participants will become acquainted with state-of-the-art vision+language research.


    • Gain skills in:
      - critical analysis
      - scientific discourse
      - preparation (literature review), report writing (LaTeX), and presentation (PowerPoint) on a vision+language topic.


    • Basic concepts of mathematical analysis and linear algebra.
    • Basic knowledge of machine learning and deep learning is helpful.
    • The course language is English.


    Templates are provided for the report (max. 5-6 pages, excluding references) and the review (max. 1 page, excluding references).

    For the presentation slides, students are free to choose their own template.

    Dates and Locations

    The seminar (kickoff meeting and presentations) takes place either at our office at John Skilton Str. 8A (we are located on the 3rd floor of the Sensalight office there) or in a booked seminar room, depending on the course size.
    (The office is best reached from the Skyline Hill Center bus stop. There is also a new path through the fence that leads from the Oswald-Külpe-Weg bus stop directly to our office; it is not yet visible in Google Maps.)

    Week 3: Kickoff Meeting (time & place TBA)

    By the following week: topic assignment finalized
    09.06.2023: Deadline for Report Draft
    30.06.2023: Deadline for Reviews
    July: Presentations in blocks (depends on course size, details TBA)
    21.07.2023: Deadline for Final Report, Slides, and Reviews
    (Weeks are relative to the semester start week on 17.04.2023)

    Topics in Summer Semester 2023

    1. Real-world problem: Visual QA problems posed by blind people
    2. Probing vision+language models to understand what they know
    3. Beyond English - multilingual models
    4. Ethical problems in multimodal deep learning
    5. Image-text retrieval
    6. Stable Diffusion & co.: Text-conditional image generation
    7. Text-based image editing
    8. Making Large Language Models multimodal


    Gregor Geigle (email:
    Prof. Dr. Radu Timofte (email:
    Prof. Dr. Goran Glavaš  (email: