    Data Science Chair


    You can write to any of our staff members about open topics for practica and bachelor's and master's theses.

    In the case of excellent performance, there is also the chance to submit the thesis as an article to a computer science conference and to become a co-author on a scientific publication early in your studies!

    Open Topics:

    Deep Music Composition

    Recent transformer-based generative language models have achieved near human-like performance in generating natural text. In this work, the applicability of these models to deep music generation is investigated. The scope of this work includes the evaluation of different representation methods for symbolic music and their impact on quality. While the topic generally focuses on symbolic music (MIDI), musical genre, instruments, and musical complexity can be varied.
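    One common family of representation methods mentioned above maps symbolic music to flat token sequences that a language model can be trained on. The following sketch shows an assumed, illustrative event-based tokenization (loosely in the spirit of REMI-style vocabularies); the function and token names are not part of the project description.

    ```python
    # Illustrative event-based tokenization of symbolic music. All names
    # (tokenize_notes, POS_/NOTE_/DUR_ tokens) are assumptions for this sketch.

    def tokenize_notes(notes):
        """Convert (pitch, start_beat, duration_beats) triples into a flat
        token sequence suitable for a transformer language model."""
        tokens = []
        current_beat = None
        for pitch, start, duration in sorted(notes, key=lambda n: n[1]):
            if start != current_beat:
                tokens.append(f"POS_{start}")   # time-position token
                current_beat = start
            tokens.append(f"NOTE_{pitch}")      # MIDI pitch number
            tokens.append(f"DUR_{duration}")    # note length in beats
        return tokens

    # A C-major triad struck together, then a single note one beat later:
    melody = [(60, 0, 1), (64, 0, 1), (67, 0, 1), (72, 1, 2)]
    print(tokenize_notes(melody))
    ```

    Evaluating several such vocabularies (e.g., with or without explicit duration tokens) and their impact on generation quality is exactly the kind of comparison the topic asks for.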

    Supervisor: Daniel Schlör

    From Ancient to modern: Tackling the big scary world of database migration

    Greetings prospective students, do you have a passion for database updates? Well then, have I got a project for you! We need to modernize the ancient MySQL database powering the website BibSonomy, which contains over 200 GB of data, by updating it to something not quite so ancient, like MariaDB or PostgreSQL. Your task will be to port the queries from the current backend to the new database and optimise them to make them more efficient and thundering fast. This will be a challenging project, but don't worry, we will be right here to provide guidance and support. So, gear up and embark on this exciting adventure with the promise of a delicious cake upon completion. Remember, the cake is a lie only if you don't put in the effort.
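    To give a flavour of the porting work: MySQL and PostgreSQL disagree on small but pervasive syntax details. The sketch below shows two such translations (backtick identifier quoting and the two-argument `LIMIT` shorthand); the function name and rule set are illustrative assumptions, and a real migration needs far more cases (`ON DUPLICATE KEY`, `GROUP_CONCAT`, date functions, ...).

    ```python
    import re

    # Minimal sketch of MySQL -> PostgreSQL query translation (assumed helper,
    # covering only two syntax differences for illustration).

    def mysql_to_postgres(query: str) -> str:
        # MySQL backtick-quoted identifiers -> standard double quotes
        query = re.sub(r"`([^`]+)`", r'"\1"', query)
        # MySQL shorthand "LIMIT offset, count" -> "LIMIT count OFFSET offset"
        query = re.sub(r"LIMIT\s+(\d+)\s*,\s*(\d+)",
                       r"LIMIT \2 OFFSET \1", query, flags=re.IGNORECASE)
        return query

    print(mysql_to_postgres("SELECT `title` FROM `posts` LIMIT 10, 5"))
    # SELECT "title" FROM "posts" LIMIT 5 OFFSET 10
    ```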

    Supervisor: Tobias Koopmann

    The last (unequal) batch – to drop or not?

    When batching training data, many frameworks provide a parameter that controls the handling of the last batch. In cases where the batch size does not evenly divide the total number of samples, this parameter determines whether the batch is dropped altogether (e.g., because the network expects equally shaped batches) or is still used. However, it has not been examined whether, in cases where a last batch with a different size is technically possible, such an element-reduced last batch affects performance.

    In this practice-oriented project, you will study this question. The first step is a search for related work. Second, you will create machine learning experiments in which you compare the test scores of a “normally” trained network with those of one trained under the “last-batch-not-dropped” regime (i.e., where the last batch’s size is unequal). Suitable datasets for this project are CIFAR and MNIST plus one larger dataset (ImageNet or similar).
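    The arithmetic behind the two regimes can be sketched in a few lines (PyTorch exposes the switch as `DataLoader(..., drop_last=...)`; the helper function below is an illustrative assumption):

    ```python
    import math

    # Sketch of the batch-count arithmetic behind the "drop last" switch.

    def num_batches(n_samples: int, batch_size: int, drop_last: bool) -> int:
        if drop_last:
            return n_samples // batch_size          # incomplete batch discarded
        return math.ceil(n_samples / batch_size)    # smaller last batch kept

    # 50,000 CIFAR training images with batch size 128:
    print(num_batches(50_000, 128, drop_last=True))   # 390 full batches
    print(num_batches(50_000, 128, drop_last=False))  # 391; the last batch has 80 samples
    ```

    The experiments then ask whether that one extra, smaller batch per epoch changes the final test score.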

    Supervisor: Pascal Janetzky

    Research assistant

    Are you interested in doing machine learning research and getting paid for it? Then contact me (janetzky@informatik.uni-wuerzburg.de), and we’ll find something interesting to do.

    Generally, there’s a variety of tasks that are of interest to me and my research:

    • Implementing neural networks in TensorFlow
    • Getting code to run on our compute cluster
    • Working with audio data and neural networks

    If you have good research ideas yourself, let’s discuss them together.

    Supervisor: Pascal Janetzky 

    Deep Metric Learning with Varying Similarity Definitions

    Deep Metric Learning (DML) trains a neural network that represents input data (e.g. images or texts) as an n-dimensional vector such that similar input items are close together in embedding space, while dissimilar items are farther apart. For this, usually items are labeled as “similar” or “dissimilar”. For example, given car images, two images of cars are similar if the shown car model is the same, regardless of the color, orientation, or background. The neural network then should learn to identify the common visual features between images of the same class and represent both car images as similar embedding vectors, while images of different car models should lead to different vectors.
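    The “close together vs. farther apart” objective is often realised with a triplet loss. The sketch below uses tiny hand-written 2-D vectors as stand-ins for the n-dimensional embeddings a neural network would produce; the function names are illustrative assumptions.

    ```python
    import math

    # Minimal sketch of the triplet loss commonly used in DML: pull the
    # anchor towards a similar ("positive") item and push it away from a
    # dissimilar ("negative") item by at least a margin.

    def euclidean(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    def triplet_loss(anchor, positive, negative, margin=1.0):
        return max(0.0, euclidean(anchor, positive)
                        - euclidean(anchor, negative) + margin)

    anchor   = [0.0, 0.0]   # e.g. embedding of a car image
    positive = [0.1, 0.0]   # same car model
    negative = [3.0, 4.0]   # different car model
    print(triplet_loss(anchor, positive, negative))  # 0.0: already well separated
    ```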

    DML can be used in many applications. One example is item retrieval, where we want to retrieve items based on content features, e.g., Google Reverse Image Search. DML can also be used in few-shot learning, face recognition, or person re-identification.

    Now, different users usually define similarity differently. For example, one user might find that two car images are similar if both cars have the same color, while another user might deem two images with cars from the same manufacturer more similar. Changing the similarity definition while keeping the computational overhead manageable has not yet been addressed in current DML research.

    In this thesis or project, you are going to develop and evaluate methods to dynamically set the similarity definition for DML models. This includes implementing neural network models and loss functions in PyTorch as well as training and testing models on different datasets and metrics.
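    One conceivable starting point for switching similarity definitions cheaply, inspired by conditional-similarity-network ideas, is a shared embedding combined with a per-definition mask that selects the relevant dimensions. Everything below (dimension layout, masks, values) is a purely illustrative assumption, not the method to be developed.

    ```python
    # Sketch: one shared embedding, plus a mask per similarity definition
    # (e.g. "same color" vs. "same manufacturer") selecting relevant dims.

    def masked_distance(a, b, mask):
        """Squared distance restricted to the dimensions the mask switches on."""
        return sum(m * (x - y) ** 2 for x, y, m in zip(a, b, mask))

    emb_red_bmw  = [1.0, 0.0, 5.0, 2.0]
    emb_red_audi = [1.0, 0.1, 9.0, 7.0]

    color_mask        = [1, 1, 0, 0]   # assume first dims encode color
    manufacturer_mask = [0, 0, 1, 1]   # assume last dims encode manufacturer

    print(masked_distance(emb_red_bmw, emb_red_audi, color_mask))         # small: same color
    print(masked_distance(emb_red_bmw, emb_red_audi, manufacturer_mask))  # large: different makes
    ```

    In the thesis, learned (rather than hand-set) masks or conditioning mechanisms would be developed and evaluated against such baselines.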

    Supervisors: Albin Zehe, Konstantin Kobs

    Improved Identification of Brain Activity to Predict Human Intelligence by using Machine Learning

    Human intelligence is the best predictor of important life outcomes like educational and occupational success and even health and longevity. Cognitive neuroscience has recently shown that functional and structural brain data (as assessed with an MRI scanner) can predict individual intelligence scores. However, many preprocessing steps are involved, each associated with many degrees of freedom, so that the resulting “cleaned” brain signal is only a coarse approximation of the “true” underlying brain activity. This thesis aims to develop a new machine learning-based MRI artefact correction method. We will start by substituting single preprocessing steps and investigate how much of the common methodology can be replaced by more data-driven methods utilizing machine learning - without reducing the prediction accuracy of phenotypic measures like intelligence, personality, and age.

    Supervisors: Andreas Hotho, Kirsten Hilger

    Data-efficient learning model on the intelligent bridge dataset

    As state-of-the-art machine learning methods in many fields rely on ever larger datasets, storing datasets and training models on them becomes significantly more expensive. In the case of sensor-based monitoring of infrastructure, large amounts of data are collected from the installed sensor networks. The goal of the master's thesis is to propose a training-set synthesis model for data-efficient learning that learns to condense the 6 TB intelligent bridge dataset into a small set of informative synthetic samples for training deep neural networks. The task can be formulated as a gradient matching problem between the gradients of deep neural network weights trained on the original data on the one hand and on the synthetic data on the other. The model should further be investigated in the context of neural architecture search in order to show the advantages and disadvantages of its usage under limited memory and computation.

    Commonly used methods for data-efficient learning, such as continual learning and active learning, have two main shortcomings: (1) they rely on heuristics that do not guarantee an optimal solution for the downstream task, and (2) they depend on the presence of representative samples, which is likewise not guaranteed. The Dataset Distillation (DD) method goes beyond these limitations by modelling the network parameters as a function of the synthetic training data. Based on DD, the Dataset Condensation (DC) method has been developed with the focus on learning to synthesize informative samples that are optimized to train neural networks for downstream tasks. DC has been shown to learn a small set of “condensed” synthetic samples such that a deep neural network trained on them obtains not only similar performance but also a close solution in network parameter space to a network trained on the large training data. Nevertheless, most prior work has focused on image datasets. The thesis should therefore transfer the knowledge of DC and comparable methods to the intelligent bridge time-series dataset.
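    The gradient-matching formulation can be illustrated on a deliberately tiny example: a one-parameter least-squares model where only the label of a single synthetic sample is learned so that its gradient matches the gradient on the real data. All values and the update rule are illustrative assumptions; real DC optimizes whole synthetic images over many network initializations.

    ```python
    # Toy sketch of the gradient-matching idea behind Dataset Condensation,
    # for a 1-D model y ~ w * x with the weight w held fixed.

    def grad_w(w, xs, ys):
        """Gradient of the mean squared error of y ~ w * x with respect to w."""
        return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

    real_x, real_y = [1.0, 2.0, 3.0, 4.0], [2.1, 3.9, 6.2, 7.8]  # roughly y = 2x
    syn_x, syn_y = [1.0], [1.0]  # one synthetic sample; only its label is learned
    w = 0.5

    for _ in range(200):  # gradient descent on the squared gradient mismatch
        diff = grad_w(w, syn_x, syn_y) - grad_w(w, real_x, real_y)
        # d(grad_w on synthetic data)/d(syn_y) = -2 * x / n, here -2
        syn_y[0] -= 0.1 * 2 * diff * (-2 * syn_x[0])

    # After training, the synthetic gradient matches the real one:
    print(round(grad_w(w, syn_x, syn_y), 3), round(grad_w(w, real_x, real_y), 3))
    ```

    In the thesis, the same matching objective would be applied to the weights of deep networks trained on bridge sensor time series rather than to a scalar toy model.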

    Supervisor:  Melanie Schaller