    Data Science Chair

    Our paper "On the Role of Embeddings in Diffusion-based Generation of scRNA-seq Data" has been accepted at ECML PKDD 2025

    21.07.2025

    We're excited to share that our paper "On the Role of Embeddings in Diffusion-based Generation of scRNA-seq Data" has been accepted at the New Frontiers in Mining Complex Patterns workshop at ECML PKDD 2025.

    Our work investigates how different embeddings influence the training of a diffusion model and the resulting generation of scRNA-seq data, and how meaningful common clustering metrics are for this task.

    Abstract:

    Single-cell RNA sequencing (scRNA-seq) enables fine-grained insight into the heterogeneity of tissues and cellular responses. However, the limited availability of high-quality datasets and the inherent noise in scRNA-seq measurements hinder downstream analyses. Generative models, particularly diffusion models, offer a promising approach to synthesizing realistic scRNA-seq data. This work builds upon the scDiffusion framework and investigates the influence of various embedding strategies on diffusion-based generation. We study three trainable approaches: an autoencoder as in scDiffusion, an scANVI model, and an scTAG model. Further, we investigate a feature selection approach using highly variable genes (HVGs). For the guided diffusion, we use a four-layer MLP as well as an scANVI-based classifier to explore conditional generation. We show that the scANVI model produces the top-performing embedding, and the diffusion model trained on this embedding yields the most realistic data, with a class distribution closely resembling real data. We also show that the used embedding metrics are not sufficient for deciding which embedding is best suited for training another model. These findings highlight the importance of representation choice when training a diffusion model to generate new scRNA-seq data.
