Data Science Chair


    You can contact any of our staff members about open topics for practica and for bachelor's and master's theses.

    In the case of excellent performance, there is also the chance to submit the thesis as an article to a computer science conference and to become a co-author of a scientific publication early in your studies!

    Open Topics:

    Deep Music Composition

    Recent transformer-based generative language models have achieved near human-like performance in generating natural text. In this work, the applicability of these models to deep music generation is investigated. The scope of this work includes the evaluation of different representation methods for symbolic music and their impact on quality. While the topic generally focuses on symbolic music (MIDI), musical genre, instruments, and musical complexity can be varied.
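
    To make "representation methods for symbolic music" concrete, the following minimal sketch shows one possible event-based token representation. The token names (NOTE_ON, NOTE_OFF, TIME_SHIFT), the tick resolution, and the function notes_to_events are assumptions chosen for illustration; they are not prescribed by the topic, and other representations are equally valid candidates for evaluation.

```python
def notes_to_events(notes):
    """Convert (pitch, start_tick, duration_ticks) notes into a token sequence.

    Hypothetical vocabulary: NOTE_ON_<pitch>, NOTE_OFF_<pitch>, TIME_SHIFT_<ticks>.
    """
    boundaries = []
    for pitch, start, duration in notes:
        # The middle element orders note-offs (0) before note-ons (1) at the same tick.
        boundaries.append((start, 1, f"NOTE_ON_{pitch}"))
        boundaries.append((start + duration, 0, f"NOTE_OFF_{pitch}"))
    boundaries.sort()

    tokens, now = [], 0
    for tick, _, name in boundaries:
        if tick > now:
            # Encode elapsed time since the previous event as its own token.
            tokens.append(f"TIME_SHIFT_{tick - now}")
            now = tick
        tokens.append(name)
    return tokens


# Two consecutive notes (MIDI pitches 60 and 64), each lasting 4 ticks.
melody = [(60, 0, 4), (64, 4, 4)]
tokens = notes_to_events(melody)
```

    A sequence like this can be fed to a language model in the same way as text tokens, which is what makes the choice of representation a variable worth comparing.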

    Supervisor: Daniel Schlör

    Deep Metric Learning with Varying Similarity Definitions

    Deep Metric Learning (DML) trains a neural network that represents input data (e.g. images or texts) as an n-dimensional vector such that similar input items are close together in embedding space, while dissimilar items are farther apart. For this, items are usually labeled as “similar” or “dissimilar”. For example, given car images, two images are similar if they show the same car model, regardless of color, orientation, or background. The neural network should then learn to identify the common visual features between images of the same class and represent both car images as similar embedding vectors, while images of different car models should lead to different vectors.
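
    One common way to express the "close together / farther apart" objective is a triplet margin loss. The sketch below is a plain-Python illustration with hand-picked 2-dimensional embeddings; the choice of triplet loss, the margin value, and the example vectors are assumptions for illustration, and an actual project would compute this on network outputs in PyTorch.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Zero once the negative is at least `margin` farther from the anchor than the positive."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


# Hypothetical embeddings: anchor and positive show the same car model.
anchor = [0.0, 0.0]
positive = [0.0, 1.0]   # distance 1 from the anchor
negative = [3.0, 4.0]   # distance 5 from the anchor

well_separated = triplet_loss(anchor, positive, negative)
# Swapping the roles simulates an embedding that places the wrong item nearby.
violating = triplet_loss(anchor, negative, positive)
```

    In the first case the loss is zero because the embedding already respects the similarity labels; in the second case the loss is positive and its gradient would push the embeddings apart.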

    DML can be used in many applications. One example is item retrieval, where we want to retrieve items based on content features, e.g. Google's reverse image search. DML can also be used in few-shot learning, face recognition, or person re-identification.

    However, different users usually define similarity differently. For example, one user might consider two car images similar if both cars have the same color, while another user might deem two images with cars from the same manufacturer more similar. Changing the similarity definition while keeping the computational overhead manageable has not yet been addressed in current DML research.
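
    The following toy sketch illustrates why the similarity definition matters: the same three images produce different "similar" pairs depending on which attribute a user cares about. The image names, attribute values, and helper functions are hypothetical and only serve to illustrate the problem, not a proposed solution.

```python
# Hypothetical annotations for three car images.
cars = {
    "img_a": {"model": "Golf", "manufacturer": "VW", "color": "red"},
    "img_b": {"model": "Polo", "manufacturer": "VW", "color": "blue"},
    "img_c": {"model": "Corsa", "manufacturer": "Opel", "color": "red"},
}


def similar(a, b, attribute):
    """Two images are similar if they agree on the chosen attribute."""
    return cars[a][attribute] == cars[b][attribute]


def similar_pairs(attribute):
    """All unordered image pairs that count as similar under one definition."""
    names = sorted(cars)
    return [
        (a, b)
        for i, a in enumerate(names)
        for b in names[i + 1:]
        if similar(a, b, attribute)
    ]


by_color = similar_pairs("color")
by_manufacturer = similar_pairs("manufacturer")
```

    A model trained with labels from one definition would have to be retrained for the other unless the similarity definition can be set dynamically.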

    In this thesis or project, you are going to develop and evaluate methods to dynamically set the similarity definition for DML models. This includes implementing neural network models and loss functions in PyTorch as well as training and testing models on different datasets and metrics.

    Supervisors: Konstantin Kobs, Albin Zehe

    Improved Identification of Brain Activity to Predict Human Intelligence by using Machine Learning

    Human intelligence is the best predictor of important life outcomes like educational and occupational success and even health and longevity. Cognitive neuroscience has recently shown that functional and structural brain data (as assessed with an MRI scanner) can predict individual intelligence scores. However, many preprocessing steps are involved, and they come with many degrees of freedom, so that the resulting “cleaned” brain signal is only a coarse approximation of the “true” underlying brain activity. This thesis aims to develop a new machine learning-based MRI artefact correction method. We will start by substituting single preprocessing steps and investigate how much of the common methodology can be replaced by more data-driven methods utilizing machine learning - without reducing the prediction accuracy of phenotypic measures like intelligence, personality and age.

    Supervisors: Andreas Hotho, Kirsten Hilger

    Data-efficient learning model on the intelligent bridge dataset

    As state-of-the-art machine learning methods in many fields rely on ever larger datasets, storing datasets and training models on them becomes significantly more expensive. In sensor-based monitoring of infrastructure, large amounts of data are collected from the installed sensor networks. The goal of the master's thesis is to propose a training set synthesis model for data-efficient learning that learns to condense the 6 TB intelligent bridge dataset into a small set of informative synthetic samples for training deep neural networks. The task can be formulated as a gradient matching problem between the gradients of deep neural network weights trained on the original data on the one hand and on the synthetic data on the other. The model should further be investigated in neural architecture search in order to show the advantages and disadvantages of its use under limited memory and computation.
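
    The gradient matching idea can be sketched on a tiny example. The code below computes the loss gradient of a one-parameter-pair linear model on "real" data and on two candidate synthetic sets, then compares them with a cosine distance. The linear model, the mean squared error loss, the cosine distance, and all data values are assumptions made for illustration; a condensation method would additionally optimize the synthetic samples to minimize this distance, typically for a deep network and at many points during training.

```python
import math


def mse_gradient(w, b, xs, ys):
    """Gradient [dL/dw, dL/db] of the mean squared error for the model y = w * x + b."""
    n = len(xs)
    gw = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
    gb = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
    return [gw, gb]


def cosine_distance(g1, g2):
    """1 - cosine similarity; small when the two gradients point the same way."""
    dot = sum(a * b for a, b in zip(g1, g2))
    norm = math.sqrt(sum(a * a for a in g1)) * math.sqrt(sum(b * b for b in g2))
    return 1.0 - dot / norm


# Hypothetical "large" dataset following y = 2x + 1.
real_x, real_y = [0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0]
# Two hypothetical condensed sets with only two samples each.
good_x, good_y = [0.0, 3.0], [1.0, 7.0]   # consistent with the real data
bad_x, bad_y = [0.0, 3.0], [7.0, 1.0]     # labels swapped

w, b = 0.0, 0.0  # current model parameters
g_real = mse_gradient(w, b, real_x, real_y)
good_dist = cosine_distance(g_real, mse_gradient(w, b, good_x, good_y))
bad_dist = cosine_distance(g_real, mse_gradient(w, b, bad_x, bad_y))
```

    The informative synthetic set yields a gradient that is nearly parallel to the one from the full data, so a training step on it moves the model in almost the same direction; the mislabeled set does not.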

    Commonly used methods for data-efficient learning, such as continual learning and active learning, have two main shortcomings: (1) they rely on heuristics that do not guarantee an optimal solution for the downstream task, and (2) they rely on the presence of representative samples, which is not guaranteed either. The Dataset Distillation (DD) method goes beyond these limitations by modelling the network parameters as a function of the synthetic training data. Based on DD, the Dataset Condensation (DC) method has been developed with a focus on learning to synthesize informative samples that are optimized to train neural networks for downstream tasks. DC has been shown to learn a small set of “condensed” synthetic samples such that a deep neural network trained on them not only achieves similar performance but also reaches a solution close, in network parameter space, to that of a network trained on the large training data. Nevertheless, most effort so far has gone into image datasets. The thesis should thus transfer the knowledge of DC and comparable methods to the time-series data of the intelligent bridge dataset.

    Supervisor: Melanie Schaller