STSM Reports - ML4 Microbiome

Successful Applications

Applicants & Short Reports

Mr Oliver Aasmets

Mr Paul-Stefan Popescu

Dr Mikhail Kolev

Prof. Malik Yousef

Prof. Malik Y. Yousef is a data scientist with a focus on bioinformatics and its applications to various biomedical and biological problems. He has published more than 60 peer-reviewed articles in top journals and proceedings, with over 3700 citations, an h-index of 23 and an i10-index of 33 (based on Google Scholar). His international experience includes 3 years as a postdoc at The Wistar Institute, Cancer Center, USA [Prof Louise Showe Cancer Biology lab] and one year at the University of Pennsylvania [UPENN-Bioinformatics Center]. Currently, he is an Associate Professor at the Zefat Academic College in Israel and Pranab K. Sen Distinguished Visiting Professor at the University of North Carolina at Chapel Hill in the Department of Biostatistics at the Gillings School of Global Public Health.

STSM Title: MicroBiomeNet: Machine Learning Analysis of Metagenomics Datasets

Host: Department of Computer Engineering, Abdullah Gül University, Kayseri, Turkey

Dates: 12/07/2021 to 25/07/2021

Summary

The aim of this STSM was to develop a new tool called MicroBiomeNet that performs machine learning analysis of metagenomics datasets based on grouping and ranking. The objective of MicroBiomeNet is to detect significant microbiome groups that may serve as biomarkers for diseases. To this end, MicroBiomeNet will incorporate biological domain knowledge and group the features of the microbiome data according to the hierarchical taxonomic classification, so that each group contains the features that belong to a specific taxon. The next step is to score and rank those groups by their ability to classify the two-class data. MicroBiomeNet thus identifies significant features that are linked with specific biological categories of microbiome features, whereas traditional approaches search for individual significant features that distinguish between the two classes.

The data we used consist of 290 samples and 1455 microbial species: 135 samples are from type 2 diabetes (T2D) patients and 155 are from healthy subjects. The hierarchical taxonomic classification is: Domain, Kingdom, Phylum, Class, Order, Family, Genus and Species.

For grouping the microbiome species, we considered three categories: 1) Order; 2) Family; and 3) Genus.

Based on this type of grouping we identified 84 groups for Order, 177 for Family and 447 for Genus.

MicroBiomeNet tool

The following is the pseudocode of the MicroBiomeNet tool:
Input: D, a two-class microbiome dataset
Create groups(D)
    G = groups of features based on taxonomy (Family, Genus or Order)
    [G = {group_i = [f_i1, f_i2, ..., f_ik]}, i = 1, ..., nt]
    return G
Rank groups(G)
    For each group t in G
        add rank(t) to R
    [R is the collection of the groups and their ranks]
    return R

MicroBiomeNet consists of two main components. The first, “Create groups”, creates groups of features based on biological knowledge; each group defines its associated sub-dataset. The second, “Rank groups”, uses the sub-datasets for scoring and ranking. In this stage we used random forest with 5-fold cross-validation to score each sub-dataset. The score is the accuracy of separating the two classes using only the features belonging to a specific group. The significant groups should be considered for further analysis in order to deepen our understanding of the role of the microbiome in a specific disease.
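The two components described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the MicroBiomeNet implementation: the function names, the toy data, the taxonomy encoding (one dictionary per species) and the number of trees are all hypothetical, and scikit-learn is used here only as a convenient random forest and cross-validation library.

```python
from collections import defaultdict

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score


def create_groups(taxonomy, level):
    """Map each taxon at `level` (e.g. "Family") to the column indices of its species."""
    groups = defaultdict(list)
    for col, lineage in enumerate(taxonomy):
        groups[lineage[level]].append(col)
    return dict(groups)


def rank_groups(X, y, groups, n_folds=5, seed=0):
    """Score each group by mean cross-validated accuracy and sort best first."""
    scores = {}
    for name, cols in groups.items():
        # Assumption: 50 trees; the report does not state the forest size.
        clf = RandomForestClassifier(n_estimators=50, random_state=seed)
        fold_acc = cross_val_score(clf, X[:, cols], y, cv=n_folds, scoring="accuracy")
        scores[name] = fold_acc.mean()
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Synthetic toy data (hypothetical): 60 samples, 6 species in 3 families.
rng = np.random.default_rng(0)
y = np.array([0, 1] * 30)
X = rng.random((60, 6))
X[:, 0] += y * 2.0  # only the first species carries class signal
taxonomy = [
    {"Family": "FamA"}, {"Family": "FamA"},
    {"Family": "FamB"}, {"Family": "FamB"},
    {"Family": "FamC"}, {"Family": "FamC"},
]

groups = create_groups(taxonomy, "Family")
ranking = rank_groups(X, y, groups)
for name, accuracy in ranking:
    print(f"{name}: {accuracy:.2f}")
```

In this toy setting the family containing the informative species should rank first, while the other two families should score close to chance level.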

We also expect the performance of the tool to be higher than that of standard approaches, and we will compare it with traditional feature selection methods.

Dr Gianvito Pio

Gianvito Pio, PhD, Early Career Investigator (ECI), is an Assistant Professor at the Department of Computer Science, University of Bari, Italy. He has published 36 papers, including 18 in journals with high impact factors, such as Machine Learning, Data Mining and Knowledge Discovery, Bioinformatics, BMC Bioinformatics, Information Sciences, and IEEE Transactions on Knowledge and Data Engineering. He is on the editorial board of Springer Medical & Biological Engineering & Computing, has served as a reviewer for several international journals and has participated in the scientific committees of several international conferences. His research interests include Machine Learning, Big Data Analytics, Bioinformatics and Blockchain.

STSM Title: Standardizing the pipeline for the analysis of high dimensional noisy Microbiome data

Host: Prof. Sašo Džeroski, Jozef Stefan Institute, Ljubljana (Slovenia)

Dates: 01/03/2020 to 31/03/2020

Summary

Microbiome studies make use of different types of data that may show different characteristics. However, they usually share one characteristic: they are high-dimensional and noisy. Identifying a standard process for analysing this kind of data is fundamental, since it would enable the design of tools for performing clinical analyses without requiring specific machine learning expertise. In this context, the purpose of this STSM was to study existing works on the analysis of microbiome data in order to identify commonalities in the workflows they follow. An ontology tailored for defining and reasoning on data mining tasks (OntoDM) was considered during this analysis, with the aim of defining a set of guidelines and possibly a limited set of pipelines according to the classification provided by the ontology. Particular attention was paid to possible strategies for handling the high dimensionality of the data and the presence of noise, considering the introduction into the pipelines of specific pre-processing approaches for feature reduction/extraction, as well as the adoption of learning methods that are inherently able to work with high-dimensional noisy data.
The obtained results confirmed that microbiome data can be handled in the same way as high-dimensional noisy data in other application domains, without specific steps (except for the centred log-ratio (CLR) normalization). However, it is noteworthy that these conclusions apply to the specific task considered during the pilot study, namely multi-target regression in the specific case of the prediction of Type II diabetes.
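For readers unfamiliar with the CLR normalization mentioned above, the following is a minimal sketch of the centred log-ratio transform of a single sample: each value is replaced by its logarithm minus the mean of the logarithms of all values in that sample. The function name, the example counts and the use of a pseudocount to avoid log(0) are assumptions made for illustration; the report does not say how zeros were handled.

```python
import math


def clr(counts, pseudocount=1.0):
    """Centred log-ratio transform of one sample's taxon counts.

    The pseudocount is an assumption of this sketch (other zero-handling
    strategies exist); it prevents log(0) for taxa with zero counts.
    """
    logs = [math.log(c + pseudocount) for c in counts]
    mean_log = sum(logs) / len(logs)  # log of the geometric mean
    return [value - mean_log for value in logs]


sample = [120, 30, 0, 850]  # hypothetical read counts for four taxa
transformed = clr(sample)
print([round(v, 3) for v in transformed])
```

By construction the transformed values of a sample sum to zero, and with a pseudocount of zero the result does not change when all counts are multiplied by the same factor, which is why the transform is commonly applied to compositional data such as relative abundances.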

Prof Vladimir Trajkovik

Vladimir Trajkovik was born in Skopje, R.N. Macedonia in 1971. He received a PhD degree in 2003 from Ss. Cyril and Methodius University in Skopje. He is currently a professor at the Faculty of Computer Science and Engineering. He has published 4 books as author or editor, more than 60 papers in respected journals, and more than 160 conference papers. He has more than 1400 citations, with an h-index of 21. His research interests include Information Systems Analysis and Design, Distributed Systems, ICT-based Collaboration Systems and Mobile Services, with a special focus on Connected Health.

STSM Title: Evaluation of state-of-the-art research on the application of machine learning in human microbiome studies

Host: Prof. Tatjana Loncar Turukalo, University of Novi Sad, Serbia

Dates: 01/03/2020 – 07/03/2020

Summary

Thanks to sequencing technology development, microbiome studies with a large number of samples allow more sophisticated modelling using machine learning approaches to study relationships between microbiome and various health-related traits. In this STSM, we analysed the application of machine learning in analyses of the human microbiome from the perspective of possible application of machine learning paradigms.

We were interested in the preprocessing pipeline from 16S rRNA data to OTUs, the origin of the human data sources (e.g. skin, gut, saliva), their sample sizes, the main purpose of using machine learning techniques from a microbiological point of view, the way feature reduction is performed, the machine learning methods described, the results and performance obtained, and their combination with statistical methods.

Microbiome profiling is typically conducted using 16S rRNA amplicon sequencing or shotgun sequencing. The 16S rRNA gene sequences are clustered into groups, namely operational taxonomic units (OTUs), using workflows such as QIIME or mothur. Once the OTU clusters are determined, taxonomic information is assigned to the representative sequence of each OTU. Although analytical methods that use representative 16S rRNA sequences directly have recently been introduced, OTU cluster-based variables are still the most frequently used input features in microbiome analysis.

Supervised learning methods are typically used to build a model that predicts reported categorical outcomes, such as disease status. The most popular supervised learning methods used on OTU tables are support vector machines, Naïve Bayes, random forest, and k-nearest neighbours.

Papers published thanks to the work carried out during this STSM:

Tonkovic, Petar, Slobodan Kalajdziski, Eftim Zdravevski, Petre Lameski, Roberto Corizzo, Ivan Miguel Pires, Nuno M. Garcia, Tatjana Loncar-Turukalo, and Vladimir Trajkovik. “Literature on Applied Machine Learning in Metagenomic Classification: A Scoping Review.” Biology 9, no. 12 (2020): 453. https://doi.org/10.3390/biology9120453
Marcos-Zambrano, Laura Judith, Kanita Karaduzovic-Hadziabdic, Tatjana Loncar Turukalo, Piotr Przymus, Vladimir Trajkovik, Oliver Aasmets, Magali Berland et al. “Applications of machine learning in human microbiome studies: a review on feature selection, biomarker identification, disease prediction and treatment.” Frontiers in Microbiology 12 (2021): 313. https://doi.org/10.3389/fmicb.2021.634511

Miodrag Cekic

Bio & Affiliation

Professional software developer and architect working with cutting-edge technologies from and around the Microsoft .NET ecosystem. Machine learning and data science researcher using different tools and technologies for data visualisation, exploration, feature engineering and modelling. Currently a PhD candidate in the final stage of his studies in Computer Science and Engineering, in the area of Bioinformatics, at Ss. Cyril and Methodius University – Faculty of Computer Science and Engineering in Skopje, Macedonia. His research aims to understand the role of the microbiome in cancer diagnostics and therapeutics by creating and using ML models and working with large sets of microbiome-related data.

STSM Title: Creating and utilizing ML and Deep Learning models to understand the role of the microbiome in relation to cancer therapeutics and diagnostics

Host: Prof. Ugur Osman Sezerman, Acibadem University – Biostatistics and Medical Informatics Department, Kayışdağı Caddesi 32, 34752 Ataşehir, Istanbul, Turkey

Dates: 01/02/2021 – 07/03/2021

Publication: The STSM work was officially published in the MDPI Applied Sciences Journal, in the section of Applied Biosciences and Bioengineering (https://doi.org/10.3390/app12094094).

Brief Summary
Cancer is one of the leading causes of death worldwide. Colorectal cancer is among the most malignant tumours, and its burden can be reduced only through early detection and appropriate treatment. Increasing evidence indicates that the intestinal microbiota is related to, and can impact, colorectal carcinogenesis. The study in this STSM proposes a multidisciplinary, two-phase methodology for modelling and interpreting the key biomarkers that can play a significant role in understanding the drug-resistance mechanism and carcinogenesis in patients diagnosed with colorectal cancer. The proposed methodology was evaluated using a publicly accessible dataset and may serve clinicians as a complementary analysis tool in colorectal cancer diagnostics and therapeutics. The STSM work contributes to predictive modelling in healthcare and personalized medicine.

General Description
Recent studies have highlighted that the gut microbiota can alter colorectal cancer susceptibility and progression through its impact on colorectal carcinogenesis. Additionally, it can influence metabolic pathways and modulate anticancer drug efficacy. This STSM work presents a comprehensive technical approach to modelling and interpreting drug-resistance mechanisms and carcinogenesis from clinical data for patients diagnosed with colorectal cancer. To accomplish our aim, we developed a methodology based on evaluating high-performance machine learning models, among which a Python-based random forest classifier provided the best performance metrics, with an overall accuracy of more than 90%. Our approach identified and interpreted the most significant genera for the resistant groups and for cancer progression and susceptibility. Thus far, many studies have pointed out the importance of individual genera present in the microbiome and have tended to treat them separately. The symbiotic bacterial analysis generated different sets of joint feature combinations, providing a combined overview of the model’s predictiveness and uncovering additional data correlations in which the joint impact of different genera supports the therapy-resistant effect. This STSM work points to a different perspective on treatment, since our aggregate analysis gives precise results for the genera that are often found together in a resistant group of patients, meaning that resistance is not due to the presence of one pathogenic genus in the patient microbiome, but rather to several bacterial genera that live in symbiosis.

The findings concur with other related publications, indicating that the study within this STSM establishes a novel methodology for a more effective and scientific approach to understanding colorectal cancer therapy-resistance mechanisms and carcinogenesis. The approach can be used as a complementary analysis tool in colorectal cancer diagnostics and therapeutics, and can be applied to unseen microbiome data to help oncologists decide on treatment and post-treatment strategy with respect to immunotherapy and drug resistance.

Andrea Mihajlovic - winner of the Aleksandar "Saša" Popović Award For Best Student Paper - 2022

Bio & Affiliation

Andrea Mihajlovic is a mathematician and data science researcher at the BioSense Institute, Novi Sad, Serbia, and currently a second-year PhD student in Informatics at the Faculty of Sciences, Novi Sad, Serbia. Andrea's work focuses on applying machine learning and deep learning techniques in microbiome studies and on exploring different preprocessing pipelines for analysing amplicon and shotgun sequence data.

Thanks to the research work carried out during this ML4Microbiome STSM, Andrea won the Aleksandar “Saša” Popović Award For Best Student Paper – 2022 from the Faculty of Sciences, University of Novi Sad, Serbia.

Andrea Mihajlovic – First place winner

Presentations, papers and photos from the workshop held on 25.11.2022 are available here.

————————————————————————————————————————-

STSM Host Institute: the University of Bari, Department of Computer Science, Knowledge Discovery and Data Engineering research group

STSM title: Multi-view semi-supervised prediction on incomplete microbiome data

Dates: 05/06/2022 to 03/07/2022

Summary

A general problem when dealing with microbiome data is the choice of sequencing approach (e.g., 16S amplicon or shotgun) and preprocessing approach (e.g., pipeline and parameter settings) used to obtain the input data for learning predictive models with machine learning (ML). This STSM aimed to extend an existing multi-view learning approach to work in the semi-supervised multiclass setting with incomplete views, and to evaluate combinations of different types of microbiome views, i.e. sequencing and preprocessing approaches. The model established during the STSM improved prediction compared with two baseline models: one using only a single view, and one using feature concatenation with filling of the missing values (due to incomplete views). It was examined under different experimental settings on a benchmark dataset of 16S and shotgun microbiome profiles from participants with and without autism spectrum disorder (ASD). Further testing showed improved results for the multiclass case and for other types of views (different preprocessing techniques).

General description

Recent studies have shown promising results for multi-view learning approaches, a branch of ML that treats distinct feature sets related to the same observations as views and aims to adapt standard ML methods to take such multiple perspectives into account properly. Additionally, microbiome data come from various sources, and data incompleteness is often inevitable. Not only can the data be incomplete, but the target variable in ML tasks may also not always be available. When the value of the target variable is missing for some observations, the scenario is called a semi-supervised learning setting.

Our main task was disease prediction (classification); hence the targets are class labels. The benchmark dataset was an Autism Spectrum Disorder (ASD) dataset with two microbiome views, 16S rRNA and shotgun OTU tables, and two classes (ASD and healthy).

The algorithm chosen for extension, rBoostSH, relies on a partial-information game algorithm, the multi-armed bandit, in which a player can explore or exploit the views, and the reward is the only information provided to the player about the winning view.

First, the algorithm was modified to work with incomplete views, since the original implementation strictly required all views to be present for all instances in both the training and prediction phases. It was also modified to handle the multiclass case, since the original computations assume binary class labels. The new algorithm is named irBoostSH. Second, irBoostSH was extended to work in the semi-supervised setting. This was achieved through a clustering-based approach that provides initial pseudo-labels for unlabelled instances together with a confidence score, which is used to initialize the instance weights for rBoostSH.
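One possible way to obtain clustering-based pseudo-labels with a confidence score is sketched below. It is an illustration only and not the procedure used in irBoostSH: the choice of k-means, the majority-vote labelling rule, the definition of confidence as the majority share among labelled cluster members, the `-1` marker for missing labels and all names are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans


def pseudo_label(X, y, n_clusters=2, seed=0):
    """Fill in missing labels (marked as -1) from cluster membership.

    Returns (labels, confidence). Labelled instances keep their label with
    confidence 1.0. An unlabelled instance receives the majority label of the
    labelled instances in its cluster; its confidence is the share of those
    labelled instances that carry the majority label.
    """
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    clusters = km.fit_predict(X)
    labels = y.copy()
    confidence = np.ones(len(y), dtype=float)
    for c in np.unique(clusters):
        members = clusters == c
        known = y[members & (y != -1)]
        unlabelled = members & (y == -1)
        if known.size == 0:
            confidence[unlabelled] = 0.0  # no labelled evidence in this cluster
            continue
        values, counts = np.unique(known, return_counts=True)
        labels[unlabelled] = values[np.argmax(counts)]
        confidence[unlabelled] = counts.max() / known.size
    return labels, confidence


# Hypothetical toy data: two well-separated groups, two missing labels.
X = np.array([[0.0, 0.1], [0.2, 0.0], [0.1, 0.2],
              [5.0, 5.1], [5.2, 5.0], [5.1, 4.9]])
y = np.array([0, 0, -1, 1, -1, 1])
labels, confidence = pseudo_label(X, y)
print(labels, confidence)
```

The confidence values could then serve as initial instance weights for a boosting learner, so that uncertain pseudo-labels influence the first iterations less than true labels do.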

The proposed model, irBoostSH, performed better than the baselines (using only one view, and feature concatenation with filling of missing values): the F1 score increased by 7% on average. We also varied the base classifier between a decision tree and a random forest; results were similar in the two cases after enough iterations.

In simulated semi-supervised experiments, combining our model with the preprocessing step did not improve results relative to disregarding the unlabelled samples. F1 scores and AUC-ROC values remained relatively high (above 80%) after removing half of the training data, but dropped drastically after removing 80% of the training data. Nevertheless, this provides a good starting point for further research in the semi-supervised direction.

Ainhoa Garcia-Serrano

Ainhoa Garcia-Serrano is a bioinformatics researcher at IDIBELL (Spain), currently doing her PhD on the microbiome and cancer. Her main interests are metagenomics and metatranscriptomics using 16S, shotgun, and RNA-Seq data. She is also interested in applying ML techniques to microbial feature selection.

STSM title: Capturing different microbiome insights using NGS technologies and ML techniques

Host institution: Karolinska Institutet (Sweden)

Dates: 15/08/2022 to 22/10/2022

Summary

The human microbiome has become an important subject of debate in recent years in the context of health and disease. Next-generation sequencing (NGS) technologies such as 16S and shotgun sequencing have allowed significant growth in the number of microbiome studies of human diseases. However, other NGS approaches, such as RNA-Seq or whole-genome sequencing, have also proven helpful in the study and characterization of microbiomes. Combining and integrating different technologies applied to the same samples will contribute new insights into human microbiome studies. There is a clear need to explore and compare different methodologies in order to optimize microbiome data analyses depending on the research question. In this sense, exploring different data types, together with machine learning (ML) algorithms, can provide new ways to explore and optimize human microbiome data analyses. This STSM studied the microbiome profiles of RNA-Seq and 16S data from the same public colorectal cancer dataset (GSE165255). Random forest (RF) and support vector machine (SVM) models were also developed to classify tumour and healthy tissue in both data types. The results confirmed the feasibility of these analyses by setting up a pipeline to compare both types of data. This work has shown how microbial profiles vary depending on the type of data used. RNA-Seq can add an extra layer of information not captured by 16S, giving us information on active bacteria. RNA-Seq also allows us to characterize other microorganisms, such as viruses or fungi. Regarding ML, the RF and SVM models gave different results on RNA-Seq and 16S data, with RF performing better in both cases. However, these methods need to be applied and optimized on larger datasets of better quality to validate the results.

Eliana Ibrahimi

Eliana Ibrahimi is a multidisciplinary scientist working as an Assistant Professor at the Department of Biology, University of Tirana, Albania. Her expertise encompasses biological sciences and statistics. She works on statistical and machine learning modelling of biological and health data. Her current research focuses on applying statistical and machine-learning models to analyze microbiome and transcriptomics data.

STSM title: Generalized linear models with LASSO and sparse group LASSO to analyze the association of microbiome with colorectal cancer

Host institution: NOVA MATH, NOVA University of Lisbon, Lisbon, Portugal.

Dates: 12/09/2022 to 04/10/2022

Summary
During this STSM, Eliana Ibrahimi and Marta B. Lopes worked on colorectal cancer 16S data to optimize generalized linear models (GLM) with LASSO regularization and their parameters for two-class (healthy/cancer) and multiclass (healthy/adenoma/cancer) classification tasks. A cross-validation procedure was used for all models to select an optimal value for the shrinkage parameter, λ. The analysis was performed separately for each of the three studies included in the dataset (Zeller et al., 2014; Baxter et al., 2016; Zackular et al., 2014) and on the complete data from all the studies. Several transformation approaches were applied to the data, such as normalization and centred log-ratio. The preliminary results showed that the accuracy of the multinomial models was higher when working with separate datasets than with the merged dataset. The collaboration is ongoing to improve the accuracy of the models already applied and to add adaptive LASSO as an extra approach to the analysis.
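As an illustration of selecting the shrinkage parameter by cross-validation, the sketch below fits an L1-penalized (LASSO) logistic regression, i.e. a binomial GLM, for a two-class task. It uses scikit-learn on synthetic data purely for demonstration; the report does not state which software was used, and the data, grid size and variable names are assumptions. scikit-learn parameterizes the penalty through `C`, which plays the role of an inverse of λ (up to the library's scaling convention).

```python
import numpy as np
from sklearn.linear_model import LogisticRegressionCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a transformed abundance table (hypothetical).
rng = np.random.default_rng(0)
n_samples, n_features = 120, 30
X = rng.normal(size=(n_samples, n_features))
# Hypothetical ground truth: only the first two features drive the outcome.
logits = 3.0 * X[:, 0] - 3.0 * X[:, 1]
y = (logits + rng.normal(scale=0.5, size=n_samples) > 0).astype(int)

# 5-fold cross-validation over a grid of 10 penalty strengths.
model = make_pipeline(
    StandardScaler(),
    LogisticRegressionCV(
        Cs=10, cv=5, penalty="l1", solver="liblinear",
        scoring="accuracy", random_state=0,
    ),
)
model.fit(X, y)

lasso = model[-1]
best_lambda = 1.0 / float(np.ravel(lasso.C_)[0])  # inverse of the selected C
selected = np.flatnonzero(np.ravel(lasso.coef_))  # features with non-zero weight
print("selected lambda (1/C):", best_lambda)
print("features kept by the LASSO:", selected.tolist())
```

The non-zero coefficients indicate which features the penalty retained, which is the property that makes LASSO attractive for high-dimensional microbiome tables.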

References

Baxter, N. T., Ruffin, M. T., Rogers, M. A. M., et al. (2016). Microbiota-based model improves the sensitivity of fecal immunochemical test for detecting colonic lesions. Genome Medicine, 8, 37. https://doi.org/10.1186/s13073-016-0290-3

Zackular, J. P., Rogers, M. A. M., Ruffin, M. T., & Schloss, P. D. (2014). The human gut microbiome as a screening tool for colorectal cancer. Cancer Prevention Research, 7(11), 1112–1121. https://doi.org/10.1158/1940-6207.CAPR-14-0129

Zeller, G., Tap, J., Voigt, A. Y., et al. (2014). Potential of fecal microbiota for early-stage detection of colorectal cancer. Molecular Systems Biology, 10(11), 766. https://doi.org/10.15252/msb.20145645

Blanca Lacruz Pleguezuelos

Blanca Lacruz Pleguezuelos is a PhD student in the Computational Biology Group at the IMDEA Food Institute, Madrid (ES). Her research is focused on the application of machine learning methods to gut microbiome data, with the aim of understanding host-microbiome interactions and how they affect different diet-related diseases. On November 3rd 2022 Blanca enrolled in the Molecular Biosciences PhD Program at the Autonomous University of Madrid (UAM) under the supervision of Enrique Carrillo de Santa Pau, PhD, and Laura Judith Marcos Zambrano, PhD. Blanca’s tasks include DNA extraction from stool samples, bioinformatics analyses of gut microbiome data, database development and the application of machine learning and systems biology techniques to develop precision nutrition strategies. She collaborates with the GENYAL Platform for Clinical Trials in Nutrition and Health at the IMDEA Food Institute and with the Biometrics and Data Pattern Analytics (BiDA) Lab at UAM.
Regarding international experience, Blanca recently participated in the BioHackathon Europe organized by Elixir Europe (November 2022, France), where she developed a proof-of-concept study for the annotation of nutritional terms through different ontologies. She is also responsible for maintaining the FooDrugs database within the FNS-Cloud project, funded by Horizon 2020 (H2020; ID: 863059). She also participated in the 2021 edition of the ML4Microbiome Training School organised by the COST Action CA18131. This experience helped Blanca to acquire machine learning skills relevant to the analysis of gut microbiome data, which she applied to publicly available data for her Master’s thesis, also under the supervision of Enrique Carrillo and Laura Marcos (doi: 10.1101/2022.11.17.516892). Blanca also completed a semester-long stay at Boston University (Boston, MA, USA) during her Bachelor’s degree, where she took classes in BU’s College of Arts and Sciences and BU School of Medicine.
Other relevant experience includes her participation in the “BrainCode Games” hackathon organized by the Spanish Society of Neuroscience (SENC) together with the UAM within the DeepCode project by La Caixa. This hackathon involved the application of artificial intelligence techniques to neuroscience data, and the work carried out by her team earned them the Honorary Award.

STSM title: Machine learning techniques for the characterization and analysis of metagenomic and metabolic microbiome networks

Host institution: Vera Pancaldi, Cancer Research Center of Toulouse, France

Dates: 24/04/2023 to 28/07/2023

Summary

More comprehensive analyses are needed in gut microbiota (GM) research to characterize how changes in microbial composition are related to function, as well as the ecological interactions among microbes and between microbes and their host. In this STSM, we used network science methods to build microbial networks, predicted functional characteristics of gut microbes based on their metabolic capacity using unsupervised machine learning (ML), and worked on data integration through multilayer networks and supervised ML. We used data from a month-long dietary weight loss intervention in which lifestyle and medical data, as well as blood biochemistry and GM samples, were collected. First, we built co-occurrence networks for patients who did and did not respond to the intervention and analyzed them as a preliminary feature engineering step. We then built a metabolic network using the information available in the Virtual Metabolic Human resource. This allowed us to define a matrix of produced and consumed metabolites for each microbe, on which we computed Hamming distances and performed unsupervised ML via hierarchical clustering. As a result, we defined six functional clusters, which were separated from one another according to Multiple Correspondence Analysis. We used the microbial features in a random forest (RF) model, which we will improve by implementing feature engineering based on the functional clusters obtained. The final part of this STSM was dedicated to the construction of multilayer networks, integrating the different data types collected during the nutritional intervention and using the predictive features from the previous ML approaches. In the future, the application of unsupervised ML approaches that explore the different layers will allow us to find node communities in which the different data types are considered.
This STSM is aligned with the work of Working Groups WG1 and WG3 of the COST Action CA18131, contributing to Tasks T1.1, T1.2, T1.3 and T3.1, T3.2 and T3.3 by reviewing and applying new ML methods to GM data, working on their performance and interpretability through feature engineering, and integrating different data types in our analyses.
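The step of clustering microbes by their metabolic capacity, described in the summary above, can be sketched with SciPy as follows. The binary presence/absence encoding, the toy matrix, the average-linkage method and the choice of two clusters are assumptions made for this small example; the report itself mentions only Hamming distances, hierarchical clustering and six resulting clusters.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

# Hypothetical binary matrix: rows are microbes, columns are metabolites
# (1 = the microbe produces or consumes the metabolite, 0 = it does not).
microbes = ["m1", "m2", "m3", "m4", "m5", "m6"]
profile = np.array([
    [1, 1, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [1, 1, 1, 1, 0, 0],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 0, 1, 1],
    [0, 0, 1, 1, 1, 1],
])

# Pairwise Hamming distance: the fraction of metabolites on which two microbes differ.
distances = pdist(profile, metric="hamming")

# Agglomerative clustering on the condensed distance matrix.
tree = linkage(distances, method="average")

# Cut the tree into at most two flat clusters.
clusters = fcluster(tree, t=2, criterion="maxclust")
assignment = dict(zip(microbes, clusters.tolist()))
print(assignment)
```

In this toy matrix the first three microbes share most metabolites with each other and few with the last three, so the two groups end up in separate clusters.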
