HydrAS is a DFG-funded project starting in 2022 and running for three years.
The increasing availability of large-scale digital trace data on human behavior requires the development of suitable algorithmic approaches in computer and data science. Such data often comes in the form of sequences, e.g., sequences of visited websites or of locations in a city. To analyze this kind of data and extract knowledge at scale, the applicants and others have presented a novel computational approach that uses Bayesian inference to compare hypotheses (derived from intuition, previous studies, or social theories) with respect to their plausibility given the observed sequences.
In this project, we will develop fundamentally new data analysis methods in that direction that overcome current shortcomings. First, we will (1) systematize and simplify the process of hypothesis elicitation by integrating (semi-)automatic procedures for deriving interpretable base hypotheses from background knowledge and for combining base hypotheses with each other. Second, to account for heterogeneity in the data, we aim to (2) develop methods that partition data sequences such that each partition can be succinctly described in terms of background information on the features, and the transition behavior within each partition can be explained by given hypotheses. Finally, we will (3) extend the general framework of hypothesis-based analysis of sequential data, which currently focuses on simple first-order Markov chain models, to more complex models such as hidden Markov models, continuous-time Markov chain models, or neural networks for sequential data. This would allow us to formalize more complex and more fine-grained hypotheses, to pick the models most suitable for a specific scenario, and to integrate additional information (e.g., time information) in an easily understandable way.
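To make the starting point concrete, the following is a minimal sketch of Bayesian hypothesis comparison on a first-order Markov chain. It is an illustration under stated assumptions, not the project's implementation: it assumes that each hypothesis is a row-stochastic matrix of believed transition probabilities, that a hypothesis is turned into a Dirichlet prior per row with a hypothetical concentration parameter `k`, and that hypotheses are ranked by their log marginal likelihood (evidence). The function names, the toy sequences, and the two hypotheses are all made up for this example.

```python
from math import lgamma


def count_transitions(sequences, states):
    """Count first-order transitions between states over all sequences."""
    idx = {s: i for i, s in enumerate(states)}
    n = len(states)
    counts = [[0] * n for _ in range(n)]
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            counts[idx[a]][idx[b]] += 1
    return counts


def log_evidence(counts, hypothesis, k):
    """Log marginal likelihood of the observed transitions.

    Assumption: each row of the transition matrix has a Dirichlet prior with
    parameters alpha = 1 + k * hypothesis_row, so a larger k expresses a
    stronger belief in the hypothesis.
    """
    total = 0.0
    for n_row, h_row in zip(counts, hypothesis):
        alpha = [1.0 + k * h for h in h_row]
        # Dirichlet-multinomial evidence for one row of the chain.
        total += lgamma(sum(alpha)) - lgamma(sum(alpha) + sum(n_row))
        total += sum(lgamma(a + n) - lgamma(a) for a, n in zip(alpha, n_row))
    return total


# Toy data (invented): three short sequences over the states A, B, C.
states = ["A", "B", "C"]
sequences = ["ABCABCABCA", "BCABCABCAB", "ABCACBABCA"]
counts = count_transitions(sequences, states)

# Two illustrative hypotheses about the transition behavior.
h_cycle = [[0, 1, 0], [0, 0, 1], [1, 0, 0]]  # A -> B -> C -> A
h_uniform = [[1 / 3] * 3 for _ in range(3)]  # every next state equally likely

k = 10  # hypothetical belief strength
ev_cycle = log_evidence(counts, h_cycle, k)
ev_uniform = log_evidence(counts, h_uniform, k)
print("log evidence (cycle):  ", ev_cycle)
print("log evidence (uniform):", ev_uniform)
```

In this toy setting, the hypothesis with the higher log evidence is the more plausible explanation of the observed transitions. Comparing the evidence over several values of `k` would show how the ranking depends on the assumed belief strength.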
In contrast to many recently proposed methods in data science and machine learning, our research will not focus on maximizing predictive power. Instead, we concentrate on finding potential explanations of the data generation process that human domain experts can understand, by incorporating their hypotheses directly into the analysis process. In that regard, the project will provide unique opportunities to integrate hypothesis-driven data analysis with advanced machine learning techniques in order to support the understanding of the underlying processes that generate the observed sequences. While this project focuses on developing new data science methods for analyzing human behavior, we expect these methods to be easily transferable to other application areas featuring sequential data.
We host a repository with related publications and code here.