    Data Science Chair

    SOOFI: Sovereign Open Source Foundational Models for European Intelligence

    We, the Data Science Chair and the Chair of Computer Philology and Modern German Literary History (CLS) at CAIDAS, are part of the new large-scale project "Sovereign Open Source Foundational Models for European Intelligence" (SOOFI), collaborating with ten partner institutions across Germany (find the official press release here). Within 10 months, the project aims to develop the foundations for a sovereign, transparent, and multilingual European AI base model, with a particular focus on strengthening German language capabilities and promoting European AI sovereignty.

    Building on our previous experience of training the LLäMmlein decoder and ModernGBERT encoder models from scratch (find our project page here), our work contributes to the entire model development pipeline, from data creation and preparation to large-scale pretraining, domain-specific fine-tuning, and reasoning.

    The Data Science Chair contributes to three key work packages, with a dedicated focus on systematically incorporating German-language resources and capabilities into the model:

    • Pretraining of GenAI models: This work package covers multi-stage data filtering and deduplication, a fine-grained data tracking system, and an interactive visualization platform for dataset statistics. Iterative evaluations help improve data quality, reduce bias, and stabilize training, while curated multilingual benchmarks enable continuous, data-driven refinement of the pretraining pipeline.
    • Specialization of large language models (LLMs): The goal of this work package is to adapt the general-purpose LLM to specific domains such as German legal language, aiming to support public administration workflows (e.g., translation, explanation, summarization) as well as applications in medicine. To this end, we curate domain-specific datasets and design and train LoRA-based adapters to achieve strong domain performance in low-resource fields with limited training data. Continuous benchmarking will ensure robust, real-world–aligned model performance.
    • Reasoning: To ultimately build a strong reasoning model, we are developing a modular training pipeline that can incorporate a wide range of state-of-the-art reinforcement learning algorithms and training recipes. Automated evaluation tools on multilingual, and especially German, datasets will provide continuous feedback on both general capabilities and the domain-specific skills we aim to explore.
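    To give a flavor of one step in the pretraining work package, exact deduplication can be sketched as hashing a normalized form of each document and keeping only the first occurrence. This is a minimal illustration under our own assumptions, not SOOFI's actual filtering code; production pipelines typically add fuzzy methods such as MinHash on top of exact matching.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Collapse whitespace and lowercase, so trivially different copies match."""
    return re.sub(r"\s+", " ", text.strip().lower())


def deduplicate(docs: list[str]) -> list[str]:
    """Keep the first occurrence of each document, compared by normalized hash."""
    seen: set[str] = set()
    unique: list[str] = []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique


corpus = ["Hello  World", "hello world", "Goodbye"]
deduped = deduplicate(corpus)  # the second document is a near-duplicate of the first
```

    Hashing normalized text rather than the raw string is a deliberate choice: it removes duplicates that differ only in casing or whitespace while keeping the original form of the retained document.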
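    The LoRA adapters mentioned in the specialization work package can likewise be sketched in a few lines: a frozen pretrained weight matrix W is augmented with a trainable low-rank update A @ B, scaled by alpha / r. The NumPy snippet below is a toy forward pass with made-up shapes, not the project's training code.

```python
import numpy as np


def lora_forward(x, W, A, B, alpha=16.0):
    """Forward pass with a LoRA adapter: y = x @ W + (alpha / r) * x @ A @ B.

    W is the frozen pretrained weight; only the low-rank factors
    A (d_in x r) and B (r x d_out) are trained on the target domain.
    """
    r = A.shape[1]
    return x @ W + (alpha / r) * (x @ A) @ B


rng = np.random.default_rng(0)
d_in, d_out, r = 8, 4, 2
x = rng.normal(size=(3, d_in))
W = rng.normal(size=(d_in, d_out))          # frozen base weight
A = rng.normal(size=(d_in, r)) * 0.01       # small random init
B = np.zeros((r, d_out))                    # zero init: adapter starts as a no-op
y = lora_forward(x, W, A, B)
```

    Initializing B to zero means the adapted model is exactly the base model at the start of fine-tuning, which is why LoRA training is stable even in the low-resource settings the work package targets.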

    The Chair of Computer Philology and Modern German Literary History focuses on post-training data creation: annotation schemas are designed to capture content depth, consistency, methodological diversity, and cultural adaptation to German contexts. Samples are manually annotated and quality-checked, including complex reasoning tasks such as chain-of-thought and multi-hop problems. Diverse variants are generated, translations are validated for logical coherence, and inter-annotator agreement is assessed, with adjudication where needed. The resulting high-quality datasets, enriched with metadata, are made available under an open license to support continuous model evaluation and improvement.
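    Inter-annotator agreement of the kind described above is commonly quantified with Cohen's kappa, which corrects raw agreement for chance. The following is a minimal sketch for two annotators with invented labels, not the chair's actual tooling.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators: (p_o - p_e) / (1 - p_e),
    where p_o is observed agreement and p_e is chance agreement."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(counts_a[k] * counts_b[k] for k in counts_a.keys() | counts_b.keys()) / (n * n)
    return (p_o - p_e) / (1 - p_e)


# Two annotators disagree on one of four samples.
kappa = cohens_kappa(["good", "good", "bad", "good"],
                     ["good", "bad", "bad", "good"])
```

    Values near 1 indicate strong agreement; samples with low kappa are exactly the ones that would go to adjudication.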

    The University of Würzburg has been granted 772,907 € in funding.

    The project is funded by the Federal Ministry for Economic Affairs and Energy. (Image: BMWE)