Development of a Federated Graph-based Representation Learning Framework to Aid Integration and Interpretation of Proteomics Data
Supervision
Alberto Santos
DTU, 1st Supervisor
Julio Saez Rodriguez
EMBL-EBI, 2nd Supervisor
Objectives
In a federated learning paradigm, data remains in its original location and format yet can be analyzed collaboratively as if it were part of a unified dataset. This approach is critical for generating actionable insights, such as novel precision therapies or clinical biomarkers, which often require access to heterogeneous data across multiple cohorts or institutions. The primary objective of this project is to design a federated knowledge graph framework that facilitates the querying, consolidation, analysis, and interpretation of distributed proteomics-focused clinical knowledge graphs. To achieve this, we will employ machine learning techniques to develop local graph representation models, which will be aggregated globally to enhance their predictive power and translational relevance, all while maintaining strict data privacy standards.
Methodology
To address these objectives, we will first design efficient data structures and algorithms tailored for querying, aggregating, and analyzing diverse proteomics-focused clinical knowledge graphs. These tools will ensure seamless interoperability across heterogeneous data sources. Next, we will implement federated learning strategies, including decentralized reinforcement learning, to enable collaborative model training without exposing sensitive data. This approach will balance the need for privacy with the ability to leverage distributed datasets effectively. Additionally, we will develop robust, scalable, and maintainable software components as part of a comprehensive knowledge graph analytical framework. These components will support end-to-end workflows, from data integration to model deployment, ensuring the framework’s adaptability to evolving research needs and its potential for real-world application in precision medicine.
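As a rough illustration of how locally trained models could be combined without moving the underlying data, the sketch below applies sample-size-weighted parameter averaging (federated averaging). The project description does not specify an aggregation scheme, so this choice is an assumption made for illustration only; the function name, the parameter name "protein_embedding", and the site values are all hypothetical.

```python
from typing import Dict, List, Sequence


def federated_average(
    local_weights: Sequence[Dict[str, List[float]]],
    sample_counts: Sequence[int],
) -> Dict[str, List[float]]:
    """Combine per-site parameter vectors into one global set of parameters.

    Each site contributes in proportion to its number of samples. Only the
    parameters and the counts are shared; raw data never leaves a site.
    """
    if not local_weights or len(local_weights) != len(sample_counts):
        raise ValueError("need at least one site and one sample count per site")
    total = sum(sample_counts)
    if total <= 0:
        raise ValueError("total sample count must be positive")

    aggregated: Dict[str, List[float]] = {}
    for name, reference in local_weights[0].items():
        # Weighted mean of each coordinate across all sites.
        aggregated[name] = [
            sum(site[name][i] * n for site, n in zip(local_weights, sample_counts)) / total
            for i in range(len(reference))
        ]
    return aggregated


# Hypothetical parameters from two sites holding 100 and 300 samples.
site_a = {"protein_embedding": [1.0, 2.0]}
site_b = {"protein_embedding": [3.0, 6.0]}
global_model = federated_average([site_a, site_b], sample_counts=[100, 300])
print(global_model)  # {'protein_embedding': [2.5, 5.0]}
```

In this sketch the larger site pulls the global parameters toward its own values, which is the usual motivation for weighting by sample count. A real deployment would likely add secure aggregation and several rounds of local training, neither of which is shown here.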
Expected Results
The project will deliver a federated knowledge graph framework that integrates proteomics data across decentralized, heterogeneous data sources. By using federated learning, the framework will improve data analysis, interpretation, and translatability, enabling collaborative model training on sensitive data that would otherwise be inaccessible, without exposing that data.
Planned Secondments
1) Host: EMBL-EBI (J. Saez), Duration: 2 Months; When: Year 1, Goal: Implementation of graph representation methods for proteomics data.
2) Host: TUM (M. Wilhelm), Duration: 1 Month; When: Year 2, Goal: Implementation of drug treatment prioritization methods.
3) Host: TAU (J. Hamari), Duration: 1 Month; When: Year 3, Goal: Development of high-content data visualization systems.
Required skills
The candidate should hold a Master’s degree in Computer Science, Data Science, Bioinformatics, Computational Biology, or a related field, with proven ability to apply machine learning to biological and biomedical problems. Experience in knowledge graph development, integration of heterogeneous omics data, and ontology-based modeling is highly valued.
They should be familiar with Graph Neural Networks (GNNs), graph embeddings, and MS-based proteomics or other omics data, as well as data preprocessing pipelines (e.g., Nextflow). Strong Python programming skills, experience with version control, cloud/HPC environments, and Docker are required.
The role demands effective collaboration in multidisciplinary teams and clear communication across fields. Experience with federated machine learning is an asset.
Enrolment in doctoral programs
Technical University of Denmark (DTU).
Project specific requirements
If you are interested in this project, make sure you also apply to the DTU call. PhD students can only be admitted at the host institute if they have passed the DTU selection process.
Reference publications
Santos, A., Colaço, A.R., Nielsen, A.B. et al. A knowledge graph to interpret clinical proteomics data. Nat Biotechnol 40, 692–702 (2022).