The 14th International Symposium on Health Informatics and Bioinformatics
The International Symposium on Health Informatics and Bioinformatics (HIBIT) is in its fourteenth year. It aims to bring together academics, researchers, and practitioners from medical, biological, and information technology sectors to create a synergy. It is one of the few conferences emphasizing such a synergy. It provides a forum for discussion, exploration, and development of theoretical and practical aspects of health informatics and bioinformatics. Also, it gives researchers a chance to follow current research in their field by constructing networks.
University of Pittsburgh, USA
Carnegie Mellon University, USA
Case Western Reserve University, USA
Institut Pasteur, France
Hacettepe University, Turkey & EMBL-EBI, United Kingdom
Humboldt University of Berlin, Germany
EPFL, Switzerland
Genome Institute of Singapore, A*STAR & National University of Singapore, Singapore
Roche, Switzerland
EMBL-EBI, United Kingdom
University College London, United Kingdom
09:00 - 09:15
Opening Remarks
09:15 - 09:45
Martin Kircher
CADD & CADD-SV – scoring deleteriousness of all genomic variants
Approaches for the identification of disease causal mutations are widely applied in research and clinical settings, but interpretation and ranking of the resulting variants remains challenging. In 2014, we published a metric that objectively weights and integrates collections of annotations. Combined Annotation Dependent Depletion (CADD, https://cadd.gs.washington.edu) integrates annotations by contrasting variants that survived purifying selection along the human lineage with simulated mutations to score short sequence variants (SNVs, InDels, multi-allelic substitutions). CADD was well adopted by the community and minor adjustments and fixes were released since, including the native support of both GRCh37 and GRCh38 assemblies. Recently, we assessed existing deep neural network (DNN) models for splice effects with the Multiplexed Functional Assay of Splicing using Sort-seq dataset (MFASS, Cheung et al. Mol Cell. 2019). We selected two DNN models based only on genomic sequence, MMSplice and SpliceAI, which showed the best performance for integration into CADD. The DNN scores boosted CADD's predictions for splice effects and we noted that while the DNN scores have superior performance on splice variants, they do not compete in general variant effect prediction, as they do not account for nonsense and missense effects of the same variants. This suggests that variant prioritization will improve with more domain-specific information and underlines the importance of identifying additional such features, e.g. for regulatory sequences. With rapid advances in the identification of structural variants (SVs), we decided to apply the general concept of CADD to score them (CADD-SV). While methods utilizing individual mechanistic principles like the deletion of coding sequence or 3D architecture disruptions were available, a comprehensive tool that uses the broad spectrum of available SV annotations was missing. 
As a proof-of-principle, we show that CADD-SV scores are predictive of pathogenicity and population frequency and that CADD-SV's ability to prioritize pathogenic variants exceeds that of existing methods (0.981 AUROC compared to SVScore 0.967, AnnotSV 0.818). Our results highlight advantages of CADD, like profiting from a large training data set covering diverse and rare feature annotations without major ascertainment effects from historic and on-going variant collections.
09:45 - 10:15
Gioele La Manno
Molecular architecture of the developing mouse brain
The mammalian brain develops through a complex interplay of spatial cues generated by diffusible morphogens, cell–cell interactions and intrinsic genetic programs that result in probably more than a thousand distinct cell types. A complete understanding of this process requires a systematic characterization of cell states over the entire spatiotemporal range of brain development. The ability of single-cell RNA sequencing and spatial transcriptomics to reveal the molecular heterogeneity of complex tissues has therefore been particularly powerful in the nervous system. Previous studies have explored development in specific brain regions, the whole adult brain and even entire embryos. Here we report a comprehensive single-cell transcriptomic atlas of the embryonic mouse brain between gastrulation and birth. We identified almost eight hundred cellular states that describe a developmental program for the functional elements of the brain and its enclosing membranes, including the early neuroepithelium, region-specific secondary organizers, and both neurogenic and gliogenic progenitors. We also used in situ mRNA sequencing to map the spatial expression patterns of key developmental genes. Integrating the in situ data with our single-cell clusters revealed the precise spatial organization of neural progenitors during the patterning of the nervous system.
10:15 - 10:30
Break
10:30 - 11:30
AnGenoV: A toolbox for analysis of genomic variants
Sevcan Doğramacı, Mert Doğan, Mehmet Gürsel Arslan, Süleyman Emre Çelik, Tuğba Nur Korkmaz and Arda Söylev
Genomes accumulate variation through mutations and other sources of change, which creates differences among individuals within a population. Since the genome is the source of hereditary and genetic information, these variants have consequences for organisms, leading to genetic diseases, phenotypic abnormalities and more. Therefore, analyzing genomic variants is of crucial importance. The general approach to analyzing genomic variation includes variant calling and variant annotation: the former discovers variants and the latter associates them with genes and functional impacts. Finally, manual curation of the variants is usually performed with visualization tools. There are several algorithms that use high-throughput sequencing data to perform each of these tasks. However, one has to run each of them separately and draw the final conclusion by combining the results. Here we introduce AnGenoV (ANalysis of GENOmic Variants), which combines variant calling, variant annotation and visualization in a single application, thus enabling researchers to spend less time and effort while achieving much higher accuracy. It has a modular structure and offers a user-friendly graphical interface that is easy to use, even for people with a limited computational background.
Using AnGenoV, one can detect single nucleotide polymorphisms (SNPs), small insertions and deletions (indels) and large structural variations (SVs) using short Illumina reads, as well as third-generation long reads (PacBio or Oxford Nanopore). AnGenoV is configured with state-of-the-art tools by default; however, different algorithms can be added to its pipeline. Associating variations with their possible consequences is another critical feature of AnGenoV, accessible from the variant annotation module. It incorporates a number of popular variant annotation resources including dbSNP, dbVar and Ensembl Variant Effect Predictor (VEP). In addition, we embedded the Integrative Genomics Viewer (IGV) into AnGenoV so that selected variants can be visualized by inspecting the reads that are mapped to the reference genome.
Another important aspect of AnGenoV that increases the accuracy of variant calling is the ability to run several variant discovery tools together, merge the results and output the most reliable set of variations. These final variations can be displayed in a user-friendly manner, and desired variations can be searched (by chromosome number, locus, variation type, etc.) without any UNIX command-line knowledge. Then, the variants can be annotated by querying multiple annotation databases. Finally, the variations can be visualized using IGV by simply selecting the variation of interest from the list of candidates.
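The merging step described above can be illustrated with a minimal sketch. It assumes a simple majority-style rule (keep a variant if at least `min_support` callers report it) and exact matching on chromosome, position, reference and alternate allele. The abstract does not state how AnGenoV actually merges results, so the function name, the rule and the caller names below are hypothetical.

```python
from collections import Counter

# A variant is identified here by (chromosome, position, ref, alt).
# Assumption: exact matching. Structural variants would normally need
# a breakpoint tolerance, which this sketch does not model.


def consensus_variants(call_sets, min_support=2):
    """Return variants reported by at least `min_support` callers, sorted by locus."""
    support = Counter()
    for calls in call_sets.values():
        # set() so a caller that lists the same variant twice counts once
        support.update(set(calls))
    return sorted(v for v, n in support.items() if n >= min_support)


# Hypothetical output of three callers (placeholder names, not AnGenoV defaults).
call_sets = {
    "caller_a": [("chr1", 1000, "A", "G"), ("chr1", 2500, "T", "C")],
    "caller_b": [("chr1", 1000, "A", "G"), ("chr2", 300, "G", "GA")],
    "caller_c": [("chr1", 1000, "A", "G"), ("chr1", 2500, "T", "C")],
}

merged = consensus_variants(call_sets, min_support=2)
print(merged)
```

Raising `min_support` trades sensitivity for reliability: a variant seen by only one tool is dropped at `min_support=2`, and only variants seen by all three tools survive at `min_support=3`.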
AnGenoV is implemented using JavaScript and Python and is freely available on https://github.com/SevcanDogramaci/AnGenoV.
Assessment of the CASP14 Assembly Predictions
Burcu Ozden, Andriy Kryshtafovych and Ezgi Karaca
In CASP14, 39 research groups submitted more than 2,500 3D models on 22 protein complexes. In general, the community performed well in predicting the fold of the assemblies (for 80% of the targets), though it faced significant challenges in reproducing the native contacts. This is especially the case for the complexes without whole-assembly templates. The leading predictor, BAKER-experimental, used a methodology combining classical techniques (template-based modeling, protein docking) with deep learning-based contact predictions and a fold-and-dock approach. The Venclovas team achieved the runner-up position with template-based modeling and docking. By analyzing the target interfaces, we showed that the complexes with depleted charged contacts or dominating hydrophobic interactions were the most challenging ones to predict. We also demonstrated that if AlphaFold2 predictions were at hand, the interface prediction challenge could be alleviated for most of the targets. All in all, it is evident that new approaches are needed for the accurate prediction of assemblies, which undoubtedly will expand on the significant improvements in the tertiary structure prediction field.
Phylogeny-Aware Amino Acid Substitution Scoring
Nurdan Kuru, Onur Dereli, Emrah Akkoyun, Aylin Bircan, Öznur Taştan and Ogün Adebali
With the advancement of high-throughput sequencing technologies, our ability to detect genetic variation and to predict the effect of a variant in clinical diagnosis has been revolutionized. Single nucleotide polymorphisms (SNPs) in coding regions might change a single amino acid into another in the resulting protein (i.e., missense mutations). These mutations might have no effect on protein function, or they might alter protein function in a way that results in disease. Understanding whether a missense mutation has a neutral or disease-causing effect on protein function helps to diagnose rare diseases. Although the cost of genome sequencing has decreased, it is still a challenging task to assess the functional consequences of variations. For this purpose, several conservation-based statistical and machine learning approaches have been proposed in the literature to predict the potential consequence of a variant. SIFT and PolyPhen-2 are the most widely used of these tools. Although these methods do not yield the desired level of accuracy and are therefore not recommended for use in clinical studies, clinicians use them to prioritize variants and reduce the number to be analyzed.
Here, we introduce a novel phylogeny-dependent probabilistic approach, Phylas (Phylogeny-Aware Amino Acid Substitution Scoring), to predict the functional effects of missense mutations. Our approach exploits phylogenetic tree information to measure the deleteriousness of a given variant. Independent evolutionary events and phylogenetic relationships among species are derived from gene-based phylogenetic trees. With the help of ancestral reconstruction, we obtain the probability distribution at each internal node of the phylogenetic tree. Starting from the query species, which is human in our algorithm, we travel through the tree and record the probability change for each amino acid. While a positive change in probability is the result of an alteration, negative changes are observed as the effect of a substitution that belongs to the previously covered part of the tree. To include each dependent change only once, the negative changes are ignored in the computation. In addition to taking the dependent alterations into account, the effects of independent substitutions observed repeatedly during evolutionary time should be included. To account for this effect, we apply a correction to the score based on the count of independent substitutions, which guarantees that the harmfulness of the related alteration is decreased. It has been previously hypothesized that a variant in a human gene is more likely benign when it is observed in closely related species, whereas it is more likely deleterious when it only exists in distant ones. Thus, during the travel through the phylogenetic tree, all positive changes in amino acid probabilities are summed with a weighting inversely proportional to the distance between the node of change and human. After completion of the travel, we obtain substitution scores for each of the 20 amino acids for the given position of the query sequence.
The normalized resulting values give us the probability of observing any amino acid at the given position of the protein in question. This probability is used to measure the pathogenicity of a possible amino acid substitution. Although the query sequence is human in our experiments, the approach can be easily used for other species by changing the starting point and the direction of the travel through the tree.
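The accumulation of positive probability changes described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the Phylas implementation: it takes the "change" at a node to be the difference from the previously visited node, uses a weight of 1 / (1 + distance) so that the query itself (distance 0) has a finite weight, and leaves out the correction for independent substitutions. The function name and the toy path are hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def substitution_profile(path):
    """Turn a tree traversal into a normalized per-amino-acid profile.

    path: list of (distance_from_query, {amino_acid: probability}) in visiting
    order, starting with the query sequence at distance 0.
    """
    scores = {aa: 0.0 for aa in AMINO_ACIDS}
    previous = {}  # nothing observed before the first node
    for distance, probs in path:
        weight = 1.0 / (1.0 + distance)  # assumed form of the distance weighting
        for aa in AMINO_ACIDS:
            change = probs.get(aa, 0.0) - previous.get(aa, 0.0)
            if change > 0:  # negative changes are ignored
                scores[aa] += weight * change
        previous = probs
    total = sum(scores.values())
    if total == 0:
        return scores
    return {aa: s / total for aa, s in scores.items()}


# Toy traversal for one alignment position (values are made up for illustration).
path = [
    (0.0, {"A": 1.0}),
    (0.1, {"A": 0.8, "S": 0.2}),
    (0.5, {"A": 0.5, "S": 0.3, "T": 0.2}),
]
profile = substitution_profile(path)
```

In this toy case the query residue keeps the largest share of the profile, a residue that first appears close to the query gets a larger share than one that first appears farther away, and residues never observed get zero.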
We compare the predictive performance of our algorithm against SIFT and PolyPhen-2. We generated the benchmark datasets by combining variants from ClinVar, Humsavar, gnomAD, and four other datasets proposed in VariBench. Our algorithm outperforms SIFT and PolyPhen-2 in predicting the pathogenicity of missense mutations, improving the AUROC values by 3% and 7%, respectively.
11:30 - 12:00
Elif Özkırımlı
Chemical and Biomedical Language Processing - Challenges and Opportunities
A researcher shares biomedical findings with the scientific community via scientific publications using domain-specific language. The human-codified representation of biochemicals is also a domain-specific language that a researcher uses to study the mechanism of molecular interactions. Applying natural language processing methodologies to such domain-specific languages is often a challenge. However, a more challenging aspect of processing data in these domains is that the data do not sample all of the available knowledge space (publications) or molecule space (molecular interactions). This is a pity because the most interesting biology occurs at the edge of, or outside, the distribution. Identifying novel protein-compound pairs or finding rare or new information in publications are both limited by this imbalance problem in data sampling. In this talk, I will summarize our recent work on protein-compound affinity prediction and multilabel text classification of biomedical publications. I will briefly present two novel approaches that aim to address the "needle in a haystack" problem for these two tasks.
12:00 - 13:30
(2) DiMA: Protein Sequence Diversity Dynamics Analyser for Viruses
Yongli Hu, Shan Tharanga, Olivo Miotto, Eyyüb Selim Ünlü, Muhammet A. Çelik, Muhammed Miran Öncel, Hilal Hekimoğlu, Muhammad Farhan Sjaugi and Mohammad Asif Khan
(3) Convolutional Neural Network Approach to Distinguish and Characterize Tumor Samples Using Gene Expression Data
Büşra Nur Darendeli and Alper Yılmaz
(4) CoVrimer for SARS-CoV-2 Primer Prioritization
Merve Vural, Aslınur Aktürk, Mert Demirdizen, Ronaldo Leka, Rana Acar and Özlen Konu
(5) RNAseq Based Analysis for Investigation of Crosstalk among Estrogen and Drospirenone Mediated Signaling in Breast Cancer
Merve Vural, Kübra Çalışır, Farid Ahadli, Ronaldo Leka, Burçin Arıcı and Özlen Konu
(6) An Agent-Based Model to Evaluate the Effect of Test Kit Usage of Travelers in Sparsely Populated Areas during Pandemics
Baris Balaban, Erdem Berkay Bascura and Ugur Sezerman
(7) AnGenoV: A Toolbox for Analysis of Genomic Variants
Sevcan Doğramacı, Mert Doğan, Mehmet Gürsel Arslan, Süleyman Emre Çelik, Tuğba Nur Korkmaz and Arda Söylev
(8) On the Path to Reduce Sugar Intake: Sweet Plant Proteins
Nergiz Yuksel, Shokoufeh Yazdanıan Asr and Burcu Kaplan Turkoz
(9) Assessment of the CASP14 Assembly Predictions
Burcu Ozden, Ezgi Karaca and Andriy Kryshtafovych
(10) AMULET: A Novel Read Count-Based Method for Effective Multiplet Detection from Single Nucleus ATAC-seq Data
Duygu Ucar
(11) Candidate Antigen Enrichment Using scRNAseq Data Integration for CAR T Cell Therapy Against Non-Small Cell Lung Cancer
Mert Yıldız and Yasin Kaymaz
(12) Novel Approach for Microbiome Meta-Analysis
Farid Musa and Efe Sezgin
(13) Robust Prediction of Genetic Mutation Effects by Homology Analysis
Alperen Taciroğlu, Yeşim Aydin Son and Ogün Adebali
(14) Support Vector Machine Supported by Disease Ontology (SVM-DO) to Identify mRNA Signatures Discriminating Tumour Cells
Mustafa Erhan Özer, Pemra Özbek Sarıca and Kazım Yalçın Arga
(15) Bayesian Networks for Inter-Omics Analysis
Muntadher Jihad and İdil Yet
(16) Bioinformatic Analysis of Bifidobacterium breve TIR Domain
Bahar Bakar, Dicle Dilara Akpinar and Burcu Kaplan Türköz
(17) Comparison of the Performances of in Silico Pathogenicity Prediction Tools on Cancer-Related Variants
Metin Yazar and Pemra Özbek Sarıca
(18) Phylogeny-Aware Amino Acid Substitution Scoring
Nurdan Kuru, Onur Dereli, Emrah Akkoyun, Aylin Bircan, Öznur Taştan and Ogün Adebali
(19) Discovering Coding lncRNAs Using Deep Learning Training Dynamics
Afshan Nabi, Berke Dilekoğlu, Ogün Adebali and Öznur Taştan
(20) G-Protein Selective Activation Mechanisms in GPCRs
Berkay Selçuk, İsmail Erol, Serdar Durdağı and Ogün Adebali
(21) Analysis of Structural and Functional Impact of SNVs in hAKT1 Gene Using in Silico Tools
Ilayda Uzumcu and Elif Uz Yildirim
(22) Bioinformatic Analyses of Heparinase HepIII from Azospirillum brasilense
Seyhan İçier and Burcu Kaplan Türköz
(23) Co-Expression Networks from Transcriptome Data Reveal Molecular Mechanisms Playing Roles in the Progression of Parkinson’s Disease
Tunahan Çakır and Elif Emanetci
(24) Identification of Major Depression Related Transcriptional Changes Through Integration of Multiple Datasets
Berkay Selçuk, Tuana Aksu and Ogün Adebali
(25) A Story of an Online Internship in Computational Structural Biology
İrem Yılmazbilek and Ezgi Karaca
(26) Comparison and Assessment of Speed and Accuracy of AutoDock Vina and AutoDock CrankPep for Short Peptide Docking
Sefer Baday and Numan Nusret Usta
(27) miRModuleNet: Detecting miRNA-mRNA Regulatory Modules
Malik Yousef, Gokhan Goy and Burcu Bakir-Gungor
(28) Predicting Side Effects of Chemotherapy Using Drug-Induced Gene Expression Profiles and a Random Forest-Based Strategy
Ozlem Ulucan
(29) Identification of Novel Inhibitors Targeting Putative Dantrolene Binding Site for Ryanodine Receptor 2
Cemil Can Saylan and Sefer Baday
(30) Piperidine-Including Natural Drug Discovery for Inhibition of Type 4 Pili (T4P) in P. aeruginosa and N. meningitidis
Aslıhan Özcan Yöner, Özlem Keskin Özkaya, Berna Sarıyar Akbulut and Pemra Özbek Sarıca
(31) Studying Complex Human Diseases Using Time-Series Ancient DNA Data: Obesity and Type 2 Diabetes in Anatolia
İdil Taç, Gülşah Merve Kılınç, Ulaş Işıldak, Kıvılcım Başak Vural, Ezgi Altınışık, Yılmaz Selim Erdal, Mehmet Somel and Füsun Özer
(32) Sequence Diversity of Envelope Protein of Dengue Virus Serotype 1
Gökçen Şahin, Li Chuin Chong, Erdem Aybek and Mohammad Asif Khan
(33) Investigating Potential Interplay Between R-Loops and Nucleotide Excision Repair
Sezgi Kaya and Ogün Adebali
(34) PersonaDrive: A Computational Approach for Prioritization of Patient-specific Cancer Drivers
Cesim Erten, Aissa Houdjedj, Hilal Kazan and Ahmed Amine Taleb Bahmed
(35) PROT-ON: A Python Package for Redesigning the Protein-Protein Interfaces by Using EvoEF1
Mehdi Koşaca, Ayşe Berçin Barlas and Ezgi Karaca
(36) GeNetKEGG: Gene Expression based KEGG PathWay Grouping and Ranking
Malik Yousef, Fatma Ozdemir, Amhar Jabeer, Jens Allmer and Burcu Bakir-Gungor
(37) How Epstein-Barr Virus Envelope Glycoprotein gp350 Tricks the CR2? A Molecular Dynamics Study.
Ilgaz Taştekil, Cansu Yay, Nursena Keskin, Elif Naz Bingöl and Pemra Ozbek Sarica
(38) Metabolic Network-Driven Analysis of Yeast Metabolic Cycle through the Incorporation of RNA-seq and ATAC-seq Datasets
Müberra Fatma Cesur, Tunahan Çakır and Pınar Pir
(39) Peptide - Gold (111) Interactions: Mechanisms and Design
Didem Özkaya, Büşra Demir, Çağlanaz Akın, Zeynep Köker and Ersin Emre Ören
(40) The Mutation Profile of SARS-CoV-2 Is Primarily Shaped by the Host Antiviral Defense
Cem Azgari, Zeynep Kılınç, Berk Turhan, Defne Çirci and Ogün Adebali
(41) Application of Machine Learning Algorithm for The Accurate Diagnosis of Breast Cancer
Rumeysa Fayetörbay and Uğur Sezerman
(42) A Time-Efficient and User-Friendly Tool for Molecular Dynamics Analysis
Halil İbrahim Özdemir, Elif Naz Bingöl and Pemra Özbek-Sarıca
13:30 - 14:30
Discovering Coding lncRNAs Using Deep Learning Training Dynamics
Afshan Nabi, Berke Dilekoğlu, Ogün Adebali and Öznur Taştan
Genome-wide transcriptome analyses have revealed that the vast majority of the human genome is transcribed, but only 2% of the human genome is annotated as protein coding. A considerable fraction of transcripts are annotated as ncRNAs, and lncRNAs constitute the largest category of ncRNAs. While the lncRNAs studied so far are known to play vital roles in cellular processes, the functions of most lncRNAs remain unknown. Moreover, although lncRNAs, by definition, do not code for proteins, recent studies have shown that short open reading frames (sORFs) within some lncRNAs are translated into micropeptides with a median length of 23 amino acids. The translation events of lncRNAs were overlooked previously because the open reading frames (ORFs) present in lncRNAs do not meet the conventional criterion for an ORF: that it encodes at least 100 amino acids in eukaryotes. Despite this, recent studies have shown that micropeptides translated from lncRNAs perform vital functions across species, including bacteria, flies and humans. Therefore, identifying misannotated lncRNAs is a necessary step towards the functional characterization of this large class of transcripts.
We present a framework that leverages deep learning models’ training dynamics to determine whether a given lncRNA transcript in the dataset is misannotated. In the first step, we train convolutional neural network (CNN), long short-term memory (LSTM) and Transformer architectures to predict whether a given nucleotide sequence is non-coding or coding. Each input RNA sequence is represented as 3-mer ‘words’, obtained with a window that slides by 1 nucleotide at each step, and each 3-mer ‘word’ is fed to the model as a 100-dimensional embedding. Our models learn to distinguish between coding and non-coding RNAs with average AUC scores >91%. In the second step, we inspect the training dynamics of these deep sequence classifiers to identify possible misannotated lncRNAs. By inspecting lncRNAs that the model consistently and with high confidence predicts as coding through all training epochs, we identify candidate misannotated lncRNAs. Our results show a significant overlap with previous methods that use ribo-seq data to identify misannotated lncRNAs as well as with a set of experimentally validated misannotated lncRNAs. Moreover, we search for proteins similar in sequence to the candidates and curate a subset with high similarity to known proteins.
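The two steps above lend themselves to a small sketch: splitting a sequence into overlapping 3-mers with a stride of 1, and flagging transcripts that are predicted as coding with high confidence in every epoch. The 0.9 confidence threshold, the function names and the per-epoch probabilities below are assumptions for illustration; the abstract does not specify the actual selection rule.

```python
def kmer_tokens(sequence, k=3):
    """Split a nucleotide sequence into overlapping k-mers (window slides by 1)."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def consistently_coding(epoch_probs, threshold=0.9):
    """Return IDs whose predicted coding probability is >= threshold in every epoch.

    epoch_probs: {transcript_id: [coding probability at epoch 1, epoch 2, ...]}
    threshold: assumed cutoff for a "high confidence" prediction.
    """
    return sorted(
        tid
        for tid, probs in epoch_probs.items()
        if probs and all(p >= threshold for p in probs)
    )


tokens = kmer_tokens("AUGGCUAA")

# Hypothetical coding probabilities recorded for three lncRNAs over three epochs.
epoch_probs = {
    "lnc_001": [0.95, 0.97, 0.99],  # coding from the first epoch onward
    "lnc_002": [0.40, 0.85, 0.95],  # only becomes confident late in training
    "lnc_003": [0.10, 0.05, 0.02],  # confidently non-coding
}
candidates = consistently_coding(epoch_probs)
```

Requiring high confidence in every epoch, rather than only in the final one, is what separates a possible misannotation from an example the model merely memorized late in training.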
This work represents the first instance where deep learning model training dynamics are successfully applied to identify misannotated lncRNAs from nucleotide sequences. This approach can be applied to better curate datasets for training coding potential prediction models and can be applied alongside ribo-seq data to identify misannotated lncRNAs with high confidence.
Piperidine-including natural drug discovery for inhibition of Type 4 Pili (T4P) in P. aeruginosa and N. meningitidis
Aslıhan Özcan Yöner, Özlem Keskin Özkaya, Berna Sarıyar Akbulut and Pemra Özbek Sarıca
The rapidly increasing resistance to available antibiotics and the reduced rate of new antibiotic discovery call for different therapeutic approaches. An attractive strategy is to identify molecules that target bacterial virulence as an alternative to traditional antibiotics with low efficacies. Anti-virulence therapies strip pathogens of their weapons instead of killing the pathogens that cause infections in humans [1], [2].
The current work targets the type 4 pili (T4P) in P. aeruginosa and N. meningitidis. T4P is an important virulence factor in both and is the main target of this study. The two selected microorganisms are on the World Health Organization’s list of pathogenic bacteria for which there is an urgent need to develop new therapeutics. To this end, PilB of P. aeruginosa and PilF of N. meningitidis have been studied in detail. As a novel strategy, piperidine-containing natural product solutions were screened for their inhibition of pilB and pilE. Piperidine is a highly significant building block in the production of pharmacological substances and is the most frequently used heterocycle among US FDA-approved medications [3].
The strategy followed in this work started with homology modelling of pilB and pilE of P. aeruginosa and N. meningitidis, respectively, since their structures were not available. The binding sites in the target structures were determined by metaPocket 2.0. Here, the ADP binding regions were observed to play an important role in the inhibition of T4P. The COlleCtion of Open Natural prodUcTs (COCONUT) resource was used as the drug database, and FAF-Drugs4 was used for filtration. Finally, in order to select natural products that inhibit T4P by binding pilB and pilE, virtual library screening was performed. Ligands with binding energies better than -9.0 kcal/mol for both structures were accepted as potential inhibitors. Molecules selected in this work might have the potential to be used in novel therapeutic applications in combination with existing drugs and/or other virulence factor inhibitors.
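The final selection rule, a binding energy better than -9.0 kcal/mol for both structures, can be written as a short filter. The sketch below assumes docking scores are already available as dictionaries keyed by ligand ID. The ligand IDs and energies are invented for illustration, and "better" is read as strictly more negative.

```python
CUTOFF_KCAL_MOL = -9.0  # threshold quoted in the abstract


def dual_target_hits(energies_a, energies_b, cutoff=CUTOFF_KCAL_MOL):
    """Return ligands whose binding energy is below `cutoff` for both targets.

    energies_a, energies_b: {ligand_id: binding energy in kcal/mol}, one
    dictionary per target structure. Ligands missing from either dictionary
    are skipped because they cannot satisfy the "both structures" condition.
    """
    shared = energies_a.keys() & energies_b.keys()
    return sorted(
        ligand
        for ligand in shared
        if energies_a[ligand] < cutoff and energies_b[ligand] < cutoff
    )


# Hypothetical docking results for the two target structures.
target_1 = {"lig_01": -9.8, "lig_02": -9.4, "lig_03": -7.1, "lig_04": -10.2}
target_2 = {"lig_01": -9.3, "lig_02": -8.6, "lig_03": -9.9}

hits = dual_target_hits(target_1, target_2)
print(hits)
```

Requiring the cutoff on both targets is stricter than ranking each target separately: a ligand that binds one structure strongly but the other weakly is dropped.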
PersonaDrive: A Computational Approach for Prioritization of Patient-specific Cancer Drivers
Cesim Erten, Aissa Houdjedj, Hilal Kazan and Ahmed Amine Taleb Bahmed
A major challenge in cancer genomics is to distinguish the driver mutations that are causally linked to cancer from passenger mutations that do not contribute to cancer development. The majority of the methods proposed for this problem provide a single driver gene list for the entire cohort of patients. On the other hand, it is well known that the mutation profiles of patients with the same cancer type show a high degree of heterogeneity. Since each patient has a distinct set of driver genes, a better approach is to identify patient-specific drivers.
In this study, we propose a novel method that integrates genomic data, biological pathways, and protein connectivity information for personalized identification of driver genes. The method is formulated on a personalized bipartite graph which consists of the mutated genes of the patient in one partition, and the set of patient-specific dysregulated genes in the other. Our approach provides a personalized ranking of the mutated genes of a patient based on the sum of weighted ‘pairwise pathway coverage’ scores across all the patients, where a pairwise patient similarity score, computed on the sets of dysregulated genes of the pair, is employed as a weighting factor for the pathway coverage scores of mutant-dysregulated pairs. We compare our method against three state-of-the-art patient-specific cancer gene prioritization methods: Prodigy [1], SCS [2] and DawnRank [3]. The comparisons use a novel evaluation method which takes into account the personalized nature of the problem. Unlike previous evaluation methods employed in the literature, we assume that the results are reported as average values for the entire cohort as a function of the top k ranked genes, where k depends on the size of the personalized reference set and individuals with fewer than k ranked genes are excluded from the evaluation. Two main data sets are considered for the evaluations: TCGA and cell-line data (DepMAP [5]) for colon and lung cancers. For the TCGA data, the reference sets are defined based on the overlaps of mutated genes of the patient and the Cancer Gene Census (CGC) [6], Network of Cancer Genes (NCG) [7], and CancerMine [8] databases of known driver genes. For the cell-line data, we define novel reference gene sets by compiling the targets of drugs to which the cell lines are found to be sensitive based on data from the GDSC [4] and DepMAP [5] databases. We show that our approach outperforms the existing alternatives for both the TCGA and cell-line evaluations.
Additionally, we show that the KEGG/Reactome pathways enriched in our ranked genes and those that are enriched in cell lines reference sets overlap significantly when compared to the overlaps achieved by the rankings of the alternative methods. Fig 1 shows the performance of all methods on TCGA and CCLE cohorts. The findings of our approach can lead to the development of personalized treatments and therapies.
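The cohort-level evaluation described in the abstract (averaging over patients at the top k ranked genes and excluding patients with fewer than k ranked genes) can be sketched as below. The abstract does not name the metric, so precision is used here as an assumption, and k is passed in directly rather than derived from the size of each reference set. All patient and gene identifiers are placeholders.

```python
def mean_precision_at_k(rankings, references, k):
    """Average precision-at-k over patients that have at least k ranked genes.

    rankings: {patient_id: [gene, ...]} ordered from most to least likely driver.
    references: {patient_id: set of reference driver genes for that patient}.
    Returns None when no patient has k or more ranked genes.
    """
    values = []
    for patient, ranked in rankings.items():
        if len(ranked) < k:
            continue  # patients with fewer than k ranked genes are excluded
        reference = references.get(patient, set())
        hits = sum(1 for gene in ranked[:k] if gene in reference)
        values.append(hits / k)
    if not values:
        return None
    return sum(values) / len(values)


# Placeholder cohort: three patients with personalized rankings and reference sets.
rankings = {
    "p1": ["g1", "g2", "g3"],
    "p2": ["g4", "g5"],
    "p3": ["g6", "g1", "g7"],
}
references = {"p1": {"g1", "g3"}, "p2": {"g4"}, "p3": {"g7"}}

score_at_3 = mean_precision_at_k(rankings, references, k=3)
```

With k = 3, patient p2 is left out because only two genes are ranked for that patient, so the average is taken over p1 and p3 only.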
References
[1] Dinstag G, et al. (2020). Bioinformatics, 1831–1839.
[2] Guo WF, et al. (2018). Bioinformatics, 34(11):1893–1903.
[3] Hou JP, Ma J. (2014). Genome Med, 6:56.
[4] Yang W, et al. (2013). Nucleic Acids Research, 41:D955–D961.
[5] Corsello SM, et al. (2019). bioRxiv.
[6] Tate JG, et al. (2019). Nucleic Acids Research, 47:D941–D947.
[7] Repana D, et al. (2019). Genome Biol.
[8] Lever J, et al. (2019). Nat Methods, 16:505–507.
14:30 - 15:00
Irene Papatheodorou
Expression Atlas: Gene expression across cells, tissues, species
The Gene Expression Group at EMBL-EBI develops multiple functional genomics resources, from data submission to analysis and visualisation of bulk and single-cell sequencing data.
In this talk we will describe Expression Atlas and its single-cell component, Single Cell Expression Atlas (www.ebi.ac.uk/gxa), our resource for searching gene expression across cell types, tissues and species from re-analyses of publicly available bulk and single-cell RNA-Seq (scRNA-Seq) studies, including those from landmark projects. As of September 2021, Expression Atlas contains over 4000 datasets, across 65 species and over 5 million cells. Users can search for genes across these datasets, filter the results for particular cell types or tissues, and identify in which conditions and populations a gene can act as a marker gene.
In addition, we will describe our standardised analysis pipelines: a set of flexible, scalable and interoperable workflows for the analysis of scRNA-Seq data sets, providing access to data sets from the Single Cell Expression Atlas.
Finally, we will describe our current work to enable automated cell type annotation, power more informative searches, as well as intuitive visualisations using zoomable organ-specific anatomograms.
15:00 - 15:20
Break
15:20 - 16:20
Metabolic network-driven analysis of yeast metabolic cycle through the incorporation of RNA-seq and ATAC-seq datasets
Müberra Fatma Cesur, Tunahan Çakır and Pınar Pir
Saccharomyces cerevisiae, a well-established model organism in many industrial and medical applications, undergoes robust oscillations to regulate its physiology for adaptation and survival under nutrient-limited conditions. The rhythmic alterations in gene expression pattern and cell metabolism coordinate its response to environmental cues. The yeast metabolic cycle (YMC) is a remarkable example of coordinated and dynamic yeast behaviour, which is regulated through metabolic oscillations. It is divided into three phases based on periodic alterations in gene expression across varying oxygen consumption levels: the quiescence-related reductive charging (RC) phase, the growth-related oxidative (OX) phase, and the cell division-related reductive building (RB) phase. Thus, the YMC tracks the life cycle of yeast via an interplay among growth, proliferation, and quiescent phases.
Genome-scale metabolic network (GMN) models have been extensively used platforms to analyse yeast metabolism since 2003. They are stoichiometry-based mathematical representations of metabolism with all known chemical reactions, metabolites, and genes. GMN models provide a powerful platform for the systems-based understanding of metabolic processes within an organism. To date, different omics data integrated models have been developed for phenotypic characterizations and metabolic engineering. Despite the common use of transcriptome in the contextualization of GMN models, incorporation of epigenetic information is still a gap in the field of metabolic modelling. Besides, a clear interaction between metabolism and epigenetics was highlighted in many studies. Here, we investigated the contribution of combinatory use of transcriptomic and epigenomic information in the simulation of cellular metabolism via a recent yeast model, Yeast8 (3,991 reactions, 2,691 metabolites, and 1,147 genes). To this aim, we first employed hierarchical clustering for both RNA-seq and ATAC-seq datasets dedicated to each YMC phase. Thus, the pathways associated with each YMC phase were identified (data-based approach). We subsequently reconstructed diverse GMN models through mapping these datasets in both individual and combinatorial fashions. This facilitated the simulation of early RC, mid OX, and late RB phases. Thus, we evaluated the performance of each model using the experimental flux data derived from 13C metabolic flux analysis. We also characterized differential flux profiles and pathways through comparative analyses (model-based approach). Lastly, we compared the results obtained via data- and model-based approaches to each other and validated based on the literature.
Comparative analysis of the predicted and measured fluxes revealed that the use of ATAC-seq data considerably improved model performances. The pathways dedicated to each YMC phase were elucidated using data- and model-based approaches. As expected, over-representation of the growth-related processes (e.g., biosynthesis of amino acids and nucleotides) and tricarboxylic acid cycle were shown in mid OX phase. On the other hand, early RC and late RB phases were found to exhibit similar characteristics. Over-representation of various glycolytic processes, NADP metabolism, and pentose-phosphate shunt were determined in late RB phase in agreement with literature. To our knowledge, this is the first attempt to use chromatin accessibility data in the reconstruction of context-specific GMN models, despite the increasing popularity of ATAC-seq method. Thus, we demonstrated that integration of epigenomic data with transcriptomic profiling can pave the way for more realistic metabolic simulations.
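As a minimal illustration of the stoichiometry-based representation behind GMN models, the sketch below checks the steady-state mass balance S·v = 0 for a toy three-reaction network. The network, the metabolite and reaction names, and both helper functions are hypothetical; none of them is taken from Yeast8 or from the authors' pipeline.

```python
# Toy stoichiometric matrix S (hypothetical network, not Yeast8).
# Rows are metabolites, columns are reactions:
#   R1: -> A      R2: A -> B      R3: B ->
S = [
    [1, -1, 0],   # metabolite A: produced by R1, consumed by R2
    [0, 1, -1],   # metabolite B: produced by R2, consumed by R3
]


def mass_balance(S, v):
    """Return S·v, the net production rate of each metabolite."""
    return [sum(s_ij * v_j for s_ij, v_j in zip(row, v)) for row in S]


def is_steady_state(S, v, tol=1e-9):
    """A flux vector v is at steady state when no metabolite accumulates."""
    return all(abs(rate) <= tol for rate in mass_balance(S, v))


balanced = [2.0, 2.0, 2.0]     # same flux through the whole chain
unbalanced = [2.0, 1.0, 1.0]   # A is produced faster than it is consumed

print(is_steady_state(S, balanced))
print(mass_balance(S, unbalanced))
```

Context-specific models of the kind described in the abstract additionally restrict which reactions may carry flux, based on the omics data; that step is not shown here.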
mirDisNet: A novel approach for cancer classification using mir-disease associations
Amhar Jabeer, Burcu Bakir-Gungor and Malik Yousef
miRNAs (microRNAs) are a family of short non-coding RNAs that regulate gene expression post-transcriptionally in diverse species. They repress protein production by binding to the 3'-UTR (untranslated region) of their target mRNAs, silencing their translation and destabilizing them. Growing evidence shows that miRNAs perform a variety of crucial regulatory functions in all mammals and, because of their roles in cell growth, development, and differentiation, are associated with a wide variety of human diseases. They have been proposed as good candidates for cancer therapy since they have been associated with cancer biology: metastasis, angiogenesis, and proliferation. However, because traditional in vitro experiments are inherently time-consuming and expensive, the need for feasible and efficient computational methods to predict miRNA-disease associations has become apparent. We propose mirDisNet, a novel approach that detects miRNA biomarkers associated with diseases. In mirDisNet, biological domain knowledge about miRNA-disease associations serves as the grouping function for the tool. Each group consists of a disease name and the corresponding set of miRNAs related to that disease. We ranked the groups by scoring their importance in the two-class classification task. By integrating miRNA-disease associations, mirDisNet showed promising results of 95% accuracy, 92% sensitivity, 96% specificity, and 98% AUC across 11 datasets obtained from TCGA. Additionally, the most significant miRNAs and disease groups, ranked with RobustRankAggreg, were validated using external datasets, databases, and the literature. We hypothesize that mirDisNet has the potential to aid the understanding of disease prognosis as well as diagnosis by finding potential biomarkers and disease relationship networks.
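The grouping step described above can be sketched as follows. This is only an illustration under stated assumptions: the miRNA and disease names are placeholders, the expression values are made up, and the function names are hypothetical rather than part of mirDisNet.

```python
from collections import defaultdict


def group_mirnas_by_disease(associations):
    """Turn (miRNA, disease) pairs into one miRNA set per disease."""
    groups = defaultdict(set)
    for mirna, disease in associations:
        groups[disease].add(mirna)
    return dict(groups)


def select_group_features(sample, group):
    """Keep only the expression values of the miRNAs in one disease group."""
    return {mirna: value for mirna, value in sample.items() if mirna in group}


# Placeholder associations; a real run would read them from a curated resource.
associations = [
    ("miR-a", "disease_X"),
    ("miR-b", "disease_X"),
    ("miR-b", "disease_Y"),
    ("miR-c", "disease_Y"),
]
groups = group_mirnas_by_disease(associations)

# One sample's (made-up) expression profile, restricted to a single group.
sample = {"miR-a": 5.1, "miR-b": 0.3, "miR-c": 2.2}
subset = select_group_features(sample, groups["disease_X"])
print(sorted(subset))
```

Each group-restricted feature set could then be scored by how well a two-class classifier performs on it; the scoring itself is not shown because the abstract does not specify it.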
Functional Stratification of Small Molecule Drugs through Integrated Network Similarity
Seyma Unsal Beyge and Nurcan Tuncbag
Identifying an optimal cancer treatment strategy is challenging due to the inter- and intra-tumor heterogeneity of tumor samples. Modulations in signaling pathways and interactions between various bio-entities are critically important in multi-stage tumor cell formation. Hence, multi-omic data integration is vital for understanding the molecular interactions in the cancer cell and for developing an optimal treatment strategy. Since the development and approval of new drugs are both expensive and time-consuming, drug repurposing is an advantageous strategy in cancer treatment. Classifying currently available therapeutic agents is a complex procedure, but it is necessary for indexing candidate drugs for repurposing. Moreover, determining the molecular mechanisms of available drugs in different cancer types reveals repurposing opportunities given the heterogeneity of cancer. Conventionally, drugs are classified based on their primary targets, therapeutic actions, target specificity, nature of interaction, molecular type and chemical structure similarity. However, the effects of small-molecule drugs are highly dependent on cellular and physiological factors, and many drugs have multiple targets with variable binding affinities. Depending on the features selected, the groups of similar drugs may change. Even when two drugs fall in the same group, they may modulate different signaling mechanisms within the cell.
In this study, the transcriptomic and phosphoproteomic data of cell lines treated with small molecule drugs are used to reconstruct networks by integrating the data with the drug targetome and the human interactome. Data integration is a challenging task since the proper integration method depends on the nature of the data and the network reconstruction principles. Here, transcriptomic data are used to trace back the regulatory elements that act on the experimental hits. Phosphoproteomic hits allow the selection of functionally active proteins that may be closely related to the drug modulatory effects. The human interactome is used as the reference by the network reconstruction software, Omics Integrator, to map the seed proteins and find the optimum connected subnetwork. The human interactome is known to be incomplete and biased toward well-studied proteins; it therefore contains many false negatives and false positives. In this study, the human interactome is processed to mitigate these drawbacks. First, high-degree nodes are eliminated, and for each cell line – drug condition the interactome is filtered for very lowly expressed genes with the aid of the associated transcriptomic data. Afterwards, each interactome is enriched with a link prediction approach followed by localization filters. The final processed interactome used for each cell line – drug condition is thus both tissue- and drug-specific.
A total of 250 cell line- and drug-specific networks are reconstructed, covering 70 drugs and six cell lines. A rigorous topological and pathway analysis of these reconstructed networks provided insight into the drug modulations occurring in different cell lines, most of which are cancer models. We found that chemically and functionally different drugs may modulate overlapping networks. Moreover, the target selectivity of a drug is an important factor that leads to separate networks for drugs with the same mode of action. Network-based analysis coupled with multi-omic data integration helped to reveal hidden pathways modulated in a cell line- and drug-specific manner; for example, drugs with overlapping networks generally modulate the transcriptional misregulation in cancer pathway. In addition, the topological distance and active pathways of drug networks may guide the choice of efficient drug combinations. Finally, the separation between the networks of a drug across cell lines can help to infer whether a cell line is resistant, sensitive, or unresponsive to that drug.
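One simple way to quantify how much two reconstructed networks overlap is the Jaccard index over their edge sets, sketched below. The abstract does not state which similarity measure was used, so this choice is an assumption, and the protein names P1 to P5 are placeholders.

```python
def edge_set(edges):
    """Represent an undirected network as a set of unordered node pairs."""
    return {frozenset(edge) for edge in edges}


def jaccard(a, b):
    """Jaccard index: shared elements divided by all distinct elements."""
    if not a and not b:
        return 1.0  # two empty networks are treated as identical
    return len(a & b) / len(a | b)


# Two hypothetical drug-specific networks.
net_drug1 = edge_set([("P1", "P2"), ("P2", "P3"), ("P3", "P4")])
net_drug2 = edge_set([("P2", "P1"), ("P3", "P4"), ("P4", "P5")])

similarity = jaccard(net_drug1, net_drug2)
print(similarity)
```

Using frozenset for each edge makes ("P1", "P2") and ("P2", "P1") count as the same interaction, which suits undirected protein-protein interaction networks.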
16:20 - 17:20
Ziv Bar-Joseph
Reconstructing dynamic regulatory and signaling networks from time-series single-cell data
Biological processes, including those involved in immune response, disease progression and development, are often dynamic. Fully understanding and modeling the regulatory and signaling networks that are activated as part of these processes requires the integration of static and time-series bulk and single-cell data. I will discuss statistical and active learning methods for designing experiments to study such systems, and methods that use continuous-state hidden Markov models for the analysis and integration of the profiled data. I will also discuss applications of these methods to improving protocols for the differentiation of iPSCs to lung cells and to predicting cell-type interactions in liver organoid development.
09:00 - 09:30
Niranjan Nagarajan
Assembling and modelling complex microbiomes mediating antibiotic-resistance transfer
Human and environmental microbial communities mediating host-pathogen interactions often have complex genetic architectures and dynamics. Unravelling these needs new approaches for metagenome assembly at the strain level and microbiome modelling from limited relative abundance profiles. We propose a hybrid assembly framework that leverages long read sequencing to generate high-quality, near-complete strain genomes from complex metagenomes (OPERA-MS [1]). Applying this approach to human and environmental communities enabled recovery of 100s of novel genomes, plasmid and phage sequences, direct analysis of transmission patterns and investigation of antibiotic resistance gene combinations [1, 2]. Furthermore, we show how microbial community dynamics can be modelled accurately from sparse relative abundance data (BEEM [3]), providing insights into pathogen-commensal interactions in skin dermotypes [4]. Data from several studies tracking the emergence and transmission of multi-drug resistant pathogens across environmental and human microbiomes will be used to illustrate the utility of these methods [2,5].
References
[1] Bertrand et al. “Hybrid metagenomic assembly enables high-resolution analysis of resistance determinants and mobile elements in human microbiomes.” Nature Biotechnology 2019
[2] Chng et al. “Cartography of opportunistic pathogens and antibiotic resistance genes in a tertiary hospital environment.” Nature Medicine 2020
[3] Li et al. “An expectation-maximization algorithm enables accurate ecological modeling using longitudinal microbiome sequencing data.” Microbiome 2019
[4] Tay et al. “Atopic dermatitis microbiomes stratify into ecologic dermotypes enabling microbial virulence and disease severity.” Journal of Allergy and Clinical Immunology 2020
[5] Cheng et al. “Metagenome-wide association analysis identifies microbial determinants of post-antibiotic ecological recovery in the gut.” Nature Ecology & Evolution 2020
09:30 - 10:30
Towards integrative mechanistic models of mammalian cell responses to extracellular perturbations: growth factors, hormones, and cytokines
Cemal Erdem, Sean M. Gross, Laura M. Heiser and Marc R. Birtwistle
A critical missing capability in current cancer research is the ability to predict how a particular single cancer cell will respond to microenvironmental cues or a drug cocktail. Yet, it is not even possible to perform this task well for normal healthy cells. This work builds on the hypothesis that first principles, mechanistic models of how cells respond to different perturbagens, will ultimately improve drug combination response predictions. However, building such single-cell models of complex, large-scale, and incompletely understood systems remains an extremely challenging task. To address this issue, we defined an open-source pipeline for scalable, single-cell mechanistic modeling from simple, annotated input files (structured lists of species, parameters, and reaction types). The input files are converted into an SBML (Systems Biology Markup Language) model file. Using this pipeline, we:
1. Re-created one of the largest pan-cancer signaling models in the literature (774 species, 141 genes, 8 ligands, 2400 reactions)
2. Enlarged the model to include the Interferon-γ (IFNγ) signaling pathway (950 species, 150 genes, 9 ligands, 2500 reactions)
3. Re-parametrized the model to test and prioritize candidate mechanisms for experimental observations
Specifically, we used the enlarged model to test alternative mechanistic hypotheses for the experimental observation that IFNγ inhibits epidermal growth factor (EGF)-induced cell proliferation. We ran stochastic single-cell simulations for two different crosstalk mechanisms and compared the number of cycling cells in each case. Our model-based analysis suggested, and experiments support, that these observations are better explained by IFNγ-induced SOCS1 expression sequestering activated EGF receptors, thereby downregulating AKT activity, than by direct IFNγ-induced upregulation of p21 expression. Finally, our new modeling format is available online (github.com/birtwistlelab/SPARCED) and is compatible with high-performance (Kubernetes) computing platforms, enabling us to study virtual cell population responses. Overall, our new model enables easy modification of large mechanistic models and simulation of thousands of single-cell responses to multiple ligands and drug combinations.
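To give a flavour of what a stochastic single-cell simulation involves, the sketch below runs the Gillespie algorithm on a one-species birth-death process (constant production, first-order degradation). It is a generic textbook example, not the SPARCED implementation; the rate constants, the function name, and the choice of algorithm are all assumptions made for illustration.

```python
import random


def gillespie_birth_death(k_make, k_degrade, x0, t_end, seed=0):
    """Simulate  0 -> X (rate k_make)  and  X -> 0 (rate k_degrade * x)."""
    rng = random.Random(seed)   # seeded so that a run is reproducible
    t, x = 0.0, x0
    trajectory = [(t, x)]
    while True:
        a_make = k_make             # propensity of the production event
        a_degrade = k_degrade * x   # propensity of the degradation event
        a_total = a_make + a_degrade
        if a_total == 0:
            break                   # nothing can happen any more
        t += rng.expovariate(a_total)   # exponential waiting time
        if t > t_end:
            break
        # Choose which event fires, in proportion to its propensity.
        if rng.random() * a_total < a_make:
            x += 1
        else:
            x -= 1
        trajectory.append((t, x))
    return trajectory


# Each seed plays the role of one virtual cell.
cells = [gillespie_birth_death(5.0, 1.0, 0, 10.0, seed=s) for s in range(3)]
print([cell[-1][1] for cell in cells])
```

Running many seeds gives a population of trajectories whose spread reflects cell-to-cell variability, which is the kind of quantity a count of cycling cells would summarize in a full model.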
Runs of homozygosity show that human inbreeding has decreased in time through the Holocene
Kanat Gürün, Francisco C. Ceballos, Ezgi Altınışık, Hasan Can Gemici, Cansu Karamurat, Dilek Koptekin, Kıvılcım Başak Vural, Igor Mapelli, Ekin Sağlıcan, Elif Sürer, Yılmaz Selim Erdal, Anders Götherström, Füsun Özer, Çiğdem Atakuman and Mehmet Somel
Runs of homozygosity (ROH) are long homozygous stretches of the genome, presence of which indicates inbreeding due to small population sizes and genetic drift, and/or mating between close relatives, i.e., consanguinity [1]. We developed a method to optimize the parameters of PLINK to detect ROH, which relies on a model-free, observational approach that does not require a reference panel [2]. We were able to tune ROH calling parameters to suit low genomic coverages and correct for ROH overestimation. Our method works efficiently down to 3x SNP coverage and reliably calls ROH > 1 Mb in genomes across >1 million SNPs [3]. We confirmed the power and accuracy using simulations and by comparison with a recently published method, which relies on a reference haplotype panel [4]. We used our approach to study the controversial history of human inbreeding by systematically analyzing for the first time the ROH levels in 411 published ancient genomes of the last 15,000 years from West and Central Eurasia. The Neolithic Transition to food production and the development of sedentary and/or agricultural societies may have influenced overall inbreeding levels, relative to those of hunter-gatherer communities. This transition could have had opposite effects on average ROH levels: ROH might decrease by increasing population size [5], or ROH might increase due to higher consanguinity and endogamy in farming communities compared to forager communities [6-7]. We estimated the genomic inbreeding coefficient FROH per genome as the sum of ROH >1.5 [8] and showed that the frequency of inbreeding, as measured by FROH, has decreased over time throughout the Holocene. This result was robust to the SNP list used and was reproducible in downsampling experiments. The result was also supported by a multiple regression model that includes time (age), cultural groups, and technical covariates. 
Individuals with high FROH were rare in the sample and included both hunter-gatherers and farmers. The main shift in FROH happens after the Neolithic, but the trend has since continued, indicating a population size effect on ROH and inbreeding prevalence. The post-Neolithic increase in population admixture [9] may also play a role. We further show that most inbreeding in our historical sample can be attributed to small population size and drift rather than to consanguinity. Such high-drift individuals were mainly hunter-gatherers. Extreme consanguineous matings did occur, but they were rare and were only observed among members of farming societies in our sample, in line with ethnographic work [6-7]. Despite the lack of evidence for common consanguinity in our ancient sample, consanguineous traditions are prevalent today in various Eurasian societies, suggesting that such practices may have become widespread within the last few millennia.
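The genomic inbreeding coefficient FROH is commonly computed as the total length of ROH above a length threshold divided by the length of the genome that was analysed. The sketch below follows that common definition. It assumes the threshold of 1.5 is in megabases and uses made-up ROH lengths and a made-up genome length, so it illustrates the formula rather than the authors' exact procedure.

```python
def f_roh(roh_lengths_mb, genome_length_mb, min_length_mb=1.5):
    """Fraction of the analysed genome covered by ROH longer than the threshold.

    All lengths are in megabases (Mb). The 1.5 Mb default is an assumption
    about the unit of the threshold quoted in the abstract.
    """
    if genome_length_mb <= 0:
        raise ValueError("genome_length_mb must be positive")
    total_roh = sum(length for length in roh_lengths_mb if length > min_length_mb)
    return total_roh / genome_length_mb


# Made-up ROH calls for one genome, measured against a 100 Mb toy genome.
roh_calls_mb = [0.5, 2.0, 3.0, 1.5]
print(f_roh(roh_calls_mb, genome_length_mb=100.0))
```

Only the 2.0 Mb and 3.0 Mb segments pass the strict threshold here, so short ROH that mostly reflect background homozygosity do not inflate the estimate.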
References
[1] Ceballos et al. 2018 Nat Rev Genet;
[2] Chang et al. 2015 Gigascience;
[3] Ceballos et al. 2020 Biorxiv;
[4] Ringbauer et al. 2020 Biorxiv;
[5] Gignoux et al. 2011 PNAS;
[6] Walker 2014 Evol & Human Behav;
[7] Hill et al 2011 Science;
[8] McQuillan et al. 2008 AJHG;
[9] Lazaridis et al. 2016 Nature
ProFAB – Open Protein Functional Annotation Benchmark
Ahmet Samet Özdilek, Ahmet Atakan, Tunca Doğan, Rengül Çetin-Atalay, Mehmet Volkan Atalay and Ahmet Süreyya Rifaioğlu
As the number of protein sequences in protein databases increases, accurate computational methods are required to annotate the available data. For this purpose, several machine learning methods have been proposed in recent years [1]. However, two main issues in the evaluation of computational prediction methods are the construction of reliable positive and negative training/validation datasets and the fair evaluation of performance based on predefined experimental settings. Recently, several benchmarking platforms have been proposed in various fields to overcome similar issues. For example, Therapeutics Data Commons provides ready-to-use biomedical datasets for drug, toxicity, screening, and antibody development along with the appropriate evaluation metrics [2]. Open Graph Benchmark is a platform that provides social and biological network datasets and experimental settings for the fair comparison of algorithms [3]. In the field of protein function prediction, the Critical Assessment of Functional Annotation (CAFA) challenge [4] is an important initiative that aims to evaluate the performance of automated protein function prediction methods. The CAFA challenge is organised about every two years; however, each edition is a one-time challenge, and it is not trivial to repeat it with the same experimental settings afterwards. In addition, the CAFA challenge does not provide any training dataset, machine learning models or different data splitting strategies (e.g., temporal, similarity-based and random split settings). To overcome these limitations, we propose an open-source Python package called ProFAB, the Open Protein Functional Annotation Benchmark platform. The aim of ProFAB is to create a fair comparison platform for protein function prediction methods based on Gene Ontology (GO) (861,299 proteins annotated to 8360 function terms) and Enzyme Commission (EC) numbers (563,554 proteins annotated to 269 enzymatic functions). 
ProFAB supplies both positive and negative datasets separately for each function. To obtain numerical features from protein sequence data, various protein descriptors are provided: amino acid composition (AAC), pseudo amino acid composition (PAAC), sequence order coupling number (SOCNumber), conjoint triad (CTriad) and grouped amino acid composition (GAAC). ProFAB consists of 5 main functionalities: (i) Training dataset construction, where several training/test/validation datasets are created using the UniProtKB and UniRef databases, (ii) Data splitting, where random, temporal and similarity-based splitting methods are provided, (iii) Standardization, where several methods are used to scale the input features, (iv) Training and tuning of the classifiers, which comprise support vector machine (SVM), random forest, k-nearest neighbor (KNN), decision trees, naive Bayes, multilayer perceptron, gradient boosting, logistic regression and the ridge classifier (tuning of these classifiers is done automatically by the modules with predefined parameters that are specific and modifiable for each machine learning algorithm), (v) Evaluation, where several metrics are used to assess the predictive performance of trained models. To perform the above functionalities, ProFAB uses Python modules and interfaces such as NumPy, scikit-learn, RDKit, pickle and tqdm. To convert protein sequences into numerical feature vectors, the iLearn web tool [5] was used. With these implementations, we believe that ProFAB is useful both for computer scientists looking for ready-to-use biological datasets and for wet-lab researchers who want to use ready-to-use machine learning algorithms to gain prior knowledge about the functional view of proteins. ProFAB is available at https://github.com/Sametle06/ProFAB.git. To access the use case, please see: https://github.com/Sametle06/ProFAB/blob/master/test_file.ipynb.
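Of the descriptors listed above, amino acid composition (AAC) is the simplest: it is the fraction of each of the 20 standard amino acids in a sequence. The sketch below computes it directly. It is an independent illustration that does not use the ProFAB or iLearn APIs; the function name and the handling of non-standard residues (they are skipped) are assumptions.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, fixed order


def amino_acid_composition(sequence):
    """Return a 20-dimensional vector of residue frequencies."""
    counts = {aa: 0 for aa in AMINO_ACIDS}
    n_counted = 0
    for residue in sequence.upper():
        if residue in counts:       # skip non-standard symbols such as X
            counts[residue] += 1
            n_counted += 1
    if n_counted == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [counts[aa] / n_counted for aa in AMINO_ACIDS]


features = amino_acid_composition("AACD")
print(len(features), features[:3])
```

Because the vector has the same length for every protein, it can be fed directly to classifiers such as the SVM or random forest mentioned in the abstract.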
10:30 - 10:40
Break
10:40 - 11:10
Tunca Doğan
AI-centric approaches for integrating, associating, and analyzing large-scale and heterogeneous biomedical data
The recent availability of inexpensive technologies has led to a surge in biological/biomedical data production and accumulation in public servers. These noisy, complex and large-scale data should be analyzed in order to understand the mechanisms that constitute life and to develop new and effective treatments against prevalent diseases. A key concept in this endeavor is the prediction of unknown attributes and properties of biomolecules, such as their molecular functions and physical interactions, together with their relationships to high-level biomedical concepts such as systems and diseases. Lately, cutting-edge data-driven approaches have started to be applied to biological data to aid the development of novel and effective in silico solutions to problems in biomedicine. In this talk, I’ll summarize our efforts to integrate and represent heterogeneous data from different biological/biomedical data resources using graph-based technologies, together with the development and application of deep learning-based discriminative and generative computational methods for enriching the data by predicting the unknown properties and drug discovery-centric ligand interactions of biomolecules.
11:10 - 11:50
Language Models can Learn Complex Functional Properties of Proteins
Serbulent Unsal, Heval Ataş, Muammer Albayrak, Kemal Turhan, Aybar Acar and Tunca Dogan
Proteins are essential macromolecules for life. To understand and manipulate biological mechanisms, the functions of proteins should be understood, which is possible by studying their relationship with the amino acid sequence and 3-D structure. So far, only a small percentage of proteins have been functionally characterized (currently ~0.5% according to UniProt) due to the cost and time requirements of wet-lab-based procedures. Lately, protein function prediction (PFP), which can be defined as the annotation of proteins with functional definitions using statistical/computational methods, has gained importance for exploring the uncharacterized protein space and/or protein variants carrying function-altering changes. Among the many algorithmic approaches proposed so far, machine learning (ML) techniques, especially deep learning (DL), have become popular in PFP due to their high predictive performance. The input data used by these ML/DL methods are numerical feature vectors representing the protein (i.e., protein representations), and they are mostly generated from the amino acid sequences of proteins, which are readily available in databases (e.g., UniProt).
Early protein representation construction methods were built upon evolutionary relationships, the composition of amino acids in the sequence, and/or the physicochemical features of amino acids, which are directly related to the biochemical function of the protein. These methods can be considered classical “model-driven” approaches. Recently, researchers started to utilize ML models to automatically learn protein representations from available protein data (e.g., protein sequence, protein-protein interactions, biomedical literature/texts), in the context of a “data-driven” approach called protein representation learning (PRL). Most of the novel PRL models are based on algorithms from the natural language processing (NLP) field (e.g., word2vec, doc2vec, LSTM/transformer-based BERT, XLNet, etc.), which were originally developed to model languages for automated and simultaneous translation, context-based text generation, etc. with elevated performance. In recent years, the number of PRL methods has multiplied, and they are starting to be used in various areas from biomedicine to biotechnology. However, there is no comprehensive study or tool available to assess these representation methods in the context of modeling the functional properties of proteins and to help researchers choose a suitable method for the task at hand.
In this study, we evaluated protein representation methods for the prediction of functional attributes of proteins and benchmarked these methods in 4 challenging tasks, namely: (i) Semantic similarity inference (we calculated pairwise semantic similarities between human proteins using their gene ontology annotations and compared them with representation vector similarities to observe the correlation in-between), (ii) Ontological protein function prediction (we built GO term categories based on term specificities and the sample sizes which reflects different levels of predictive difficulty and evaluated representation methods by training/validating ML models on these datasets), (iii) Drug target protein family classification (five major target families are selected and methods are evaluated in terms of classifying proteins to families via ML models), and (iv) Protein-protein binding affinity estimation (we used the SKEMPI dataset to evaluate methods in estimating protein-protein binding affinity changes upon mutations). We evaluated 23 protein representation methods in total, including both classical approaches and cutting-edge representation learning methods, to observe whether these novel approaches have advantages over classical ones, in terms of extracting high-level/complex properties of proteins that are hidden in their sequence. Finally, we provide an open-access tool, PROBE (Protein RepresentatiOn BEnchmark), where the user can assess new protein representation models over the above-mentioned benchmarking tasks with only a line of code.
The benchmark results showed that numerous DL-based PRL methods, especially large-scale protein language models, performed significantly better than classical representation methods on function and structural feature prediction-related tasks. The results also indicated that the model architecture and the training data types/sources are the two key factors affecting performance. We also inspected possible data leakage from training to test, i.e., cases where the tasks used during pre-training are biologically related to the benchmark tasks (e.g., models that are constructed using Pfam protein family annotation data are good at predicting structural features since these two are directly related). Furthermore, we discussed current challenges in PRL, such as differences between problems in the NLP domain and those in protein informatics in terms of data structures and model interpretability. Finally, we discussed future applications of PRL in the fields of automated protein design and engineering. The details of the methodology and results can be found in our preprint (https://doi.org/10.1101/2020.10.28.359828), which will be examined in detail and discussed further. PROBE/ProtBench is available at https://github.com/serbulent/TrainableRepresentationAnalysis.
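The semantic similarity inference task compares two lists of pairwise scores: annotation-based semantic similarities and similarities between representation vectors. A minimal sketch of that comparison is given below. The protein identifiers, vectors, and semantic similarity values are made up, and the use of cosine similarity and Pearson correlation is an assumption; the PROBE tool may use different measures.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


# Hypothetical 2-D representation vectors for three proteins.
embeddings = {"P1": [1.0, 0.0], "P2": [1.0, 1.0], "P3": [0.0, 1.0]}
# Hypothetical GO-based semantic similarities for each protein pair.
semantic = {("P1", "P2"): 0.8, ("P1", "P3"): 0.1, ("P2", "P3"): 0.7}

vector_sims = [cosine_similarity(embeddings[a], embeddings[b]) for a, b in semantic]
correlation = pearson(list(semantic.values()), vector_sims)
print(round(correlation, 3))
```

A high correlation would indicate that proteins with similar annotations also lie close together in the representation space.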
DebiasedDTA: Model Debiasing to Boost Drug - Target Affinity Prediction
Rıza Özçelik, Alperen Bağ, Berk Atıl, Arzucan Özgür and Elif Özkırımlı
In silico prediction of drug-target affinity (DTA) can significantly accelerate the drug discovery process. Many in silico models rely on drug-target interaction datasets, since they aim to learn the binding mechanisms between biomolecules from the information the datasets contain. However, these datasets also contain surface patterns, or dataset biases, that prevent models from generalizing to novel biomolecules [1, 2, 3]. Here we present DebiasedDTA, a model debiasing approach to boost the performance of DTA models on novel biomolecules. DebiasedDTA comprises a weak learner and a strong learner to identify and avoid dataset biases during training. Because simple models can easily memorize dataset biases, a weak learner is used to quantify the biases in the training samples. Once the bias is quantified, the strong learner avoids the biases by adjusting the training sample weights during training.
We experiment with 2 different weak learners to identify different bias sources: ID-DTA and BoW-DTA. ID-DTA is an identity-based weak learner that represents the biomolecules with one-hot encoding. On the other hand, BoW-DTA is a biomolecule-word-based approach that vectorizes the biomolecules with bag-of-words method, based on their chemical and protein words. Both weak learners concatenate chemical and protein representations to represent the interaction and use decision tree for regression.
We also experiment with 3 strong learners to observe the effect of debiasing with different strong learner architectures: DeepDTA, BPE-DTA, and LM-DTA. DeepDTA is a character-level convolution-based DTA model that has been used frequently in the literature. DeepDTA uses SMILES strings of ligands and amino-acid sequences of proteins for biomolecule representation. We also design BPE-DTA, which uses the same model architecture as DeepDTA but segments sequences with Byte-Pair Encoding instead of characters. Finally, we design LM-DTA, which uses pre-trained biomolecule language model embeddings to represent the chemicals and proteins. These representations are then concatenated and fed into a 2-layer fully connected network.
We test DebiasedDTA on two datasets and evaluate the effect of debiasing on known and novel molecules separately. The results show that DebiasedDTA improves prediction performance in 44 of 48 experiments, suggesting that the proposed approach is useful in most scenarios. Both known and novel biomolecules benefit from the performance boost, and the boost is amplified when the test biomolecules are dissimilar to the training set. The experiments also highlight that both identity- and word-based biases are prevalent in the datasets and that each strong learner we tested can leverage the novel training scheme in DebiasedDTA, indicating that the proposed approach generalizes across models. As such, we believe that DebiasedDTA will be an influential work for drug-target affinity prediction and will be used to debias many future models.
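The idea of turning weak-learner behaviour into training sample weights can be sketched as follows. The abstract does not give the weighting formula, so the rule used here (weight proportional to the weak learner's squared error, normalized to a mean of 1) is purely an assumption, and the affinity values are made up.

```python
def importance_weights(y_true, weak_pred, eps=1e-8):
    """Down-weight samples the weak learner already predicts well.

    Assumed rule: weight_i is proportional to (y_i - weak_pred_i)^2,
    rescaled so that the weights average to 1.
    """
    errors = [(y - p) ** 2 for y, p in zip(y_true, weak_pred)]
    total = sum(errors)
    n = len(errors)
    if total <= eps:
        return [1.0] * n    # weak learner is perfect: fall back to uniform
    return [n * e / total for e in errors]


# Made-up affinities and weak-learner predictions for three training pairs.
affinities = [1.0, 2.0, 3.0]
weak_predictions = [1.0, 1.0, 1.0]

weights = importance_weights(affinities, weak_predictions)
print(weights)
```

The intuition is that a sample a simple model already fits is likely explained by a dataset shortcut, so the strong learner is asked to focus on the remaining samples. In a library such as scikit-learn, weights like these would typically be passed through a `sample_weight` argument when fitting a model.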
References
[1] L. Chen, A. Cruz, S. Ramsey, C. J. Dickson, J. S. Duca, V. Hornak, D. R. Koes, and T. Kurtzman. Hidden bias in the DUD-E dataset leads to misleading performance of deep learning in structure-based virtual screening. PLoS ONE, 14(8):e0220113, 2019.
[2] J. Scantlebury, N. Brown, F. Von Delft, and C. M. Deane. Data set augmentation allows deep learning-based virtual screening to better generalize to unseen target classes and highlight important binding interactions. Journal of Chemical Information and Modeling, 60(8):3722–3730, 2020.
[3] J. Yang, C. Shen, and N. Huang. Predicting or pretending: artificial intelligence for protein-ligand interactions lack of sufficiently large and unbiased datasets. Frontiers in Pharmacology, 11:69, 2020.
11:50 - 13:30
(43) PhosProViz: A Web-based Tool to Generate and Interactively Explore Phosphoproteomics Networks
Irene Font Peradejordi, Shreya Chandrasekar, Berk Turhan, Selim Kalayci, Jeffrey Johnson and Zeynep Hülya Gümüş
(44) Metatranscriptome Analysis of Human Gut Microbiome by ASAIM Workflow
Ceyda Demirtaş and Seda Koldaş
(45) mirDisNet: A Novel Approach for Cancer Classification Using mir-Disease Associations
Amhar Jabeer, Burcu Bakir-Gungor and Malik Yousef
(46) Controversy Detection on Health-Related Tweets
Emine Ela Küçük, Selçuk Takır and Dilek Küçük
(47) Application of Machine Learning for the Identification of Novel Diagnostic Biomarkers for COVID-19 by Using Transcriptomic Data
Didem Ökmen, Athanasia Pavlopoulou and Eralp Doğu
(48) MicroBiomeNet: Machine Learning Analysis of Metagenomics Datasets: Colon Cancer Dataset
Malik Yousef, Anas Nadifi, Amhar Jabeer and Burcu Bakir
(49) Protein Sequence Diversity Dynamics of Primate Erythroparvovirus 1
Pendy Tok, Li Chuin Chong and Mohammad Asif Khan
(50) Protein Sequence Diversity of Human Respiratory Syncytial Virus
Faruk Üstünel and M. Asif Khan
(51) Prediction of Regulatory Network Interactions with CNN Model using Human RNA-Seq Data
Gülce Çelen and Alper Yılmaz
(52) Integrating Multi-Omics Data and Deep Learning for Discovering New Subtypes of Breast Cancer
Hüseyin Uyar and Özgür Gümüş
(53) Survival Prediction of Sepsis Patients in an Intensive Care Unit
Beste Kaysi and Ozgur Gumus
(54) Potential Implementation of Amino Acid Conjugates as Novel Micronutrient Fertilizers
Emre Aksoy
(55) Functional Stratification of Small Molecule Drugs through Integrated Network Similarity
Seyma Unsal Beyge and Nurcan Tuncbag
(56) Towards Integrative Mechanistic Models of Mammalian Cell Responses to Extracellular Perturbations: Growth Factors, Hormones, and Cytokines
Cemal Erdem, Sean M. Gross, Laura M. Heiser and Marc R. Birtwistle
(57) Sequence Diversity of M Proteins of Influenza A (H7N9) Virus
Gizem Yılmaz, Li Chuin Chong, Hasiba Karimi, Eyyüb Selim Ünlü, Muhammed Miran Öncel and Mohammad Asif Khan
(58) Explainable Artificial Intelligence Perspective to the Computational Drug Discovery Process
Kevser Kübra Kırboğa and Ecir Uğur Küçüksille
(59) Classifying Antibiotic Resistance Mechanisms in Dihydrofolate Reductase by Tracking Dynamical Shifts in Hydrogen Bond Occupancies
Ebru Çetin, Ali Rana Atilgan and Canan Atilgan
(61) Expression Profile Survey of Circular RNAs and Their Parent Genes in Context of Tissue Specificity
Elif İrem Keleş and Alper Yılmaz
(62) Investigation of Radicals Present in Biological Systems by Molecular Modeling Methods
Buşra Baş and Cenk Selçuki
(63) Runs of Homozygosity Show That Human Inbreeding Has Decreased in Time through the Holocene
Kanat Gürün, Francisco C. Ceballos, Ezgi Altınışık, Hasan Can Gemici, Cansu Karamurat, Dilek Koptekin, Kıvılcım Başak Vural, Igor Mapelli, Ekin Sağlıcan, Elif Sürer, Yılmaz Selim Erdal, Anders Götherström, Füsun Özer, Çiğdem Atakuman and Mehmet Somel
(64) Biomarker Prediction for Parkinson’s Disease by Transcriptome Mapping on a Genome-Scale Metabolic Model
Ecehan Abdik and Tunahan Çakır
(65) Constraint-Based Modelling and Machine Learning Identifies Metabolic Alterations in the Substantia Nigra in Parkinson’s Disease
Regan Odongo and Tunahan Çakır
(66) Reconstruction and Transcriptome-based Analysis of Rat Brain Specific Genome Scale Metabolic Network Model for Parkinson’s Disease
Orhan Bellur and Tunahan Çakır
(67) Discovery of Latent Drivers from Double Mutations in Pan-Cancer Data Reveal their Clinical Impact
Bengi Ruken Yavuz, Chung-Jung Tsai, Ruth Nussinov and Nurcan Tunçbağ
(68) ProFAB – Open Protein Functional Annotation Benchmark
Ahmet Samet Özdilek, Ahmet Atakan, Tunca Doğan, Rengül Çetin-Atalay, Mehmet Volkan Atalay and Ahmet Süreyya Rifaioğlu
(69) Language Models can Learn Complex Functional Properties of Proteins
Serbulent Unsal, Heval Ataş, Muammer Albayrak, Kemal Turhan, Aybar Acar and Tunca Dogan
(70) Consensus Clustering Analysis as a Sample Selection Method in Biomarker Discovery: Lung Cancer Case-Study
Nehir Kızılilsoley and Emrah Nikerel
(71) Interaction Energy Analysis of Lidocaine and Papaverine with the Drug Carrier Pectin
Nesrin Işıl Yaşar, Tuğçe İnan, Ayşe Özge Kürkçüoğlu Levitas and Fethiye Aylin Sungur
(72) DebiasedDTA: Model Debiasing to Boost Drug-Target Affinity Prediction
Rıza Özçelik, Alperen Bağ, Berk Atıl, Arzucan Özgür and Elif Özkırımlı
(73) Methylation Deviation as a Marker of Intratumor Heterogeneity and Cancer Progression
Ersin Onur Erdoğan, Ömer Çinal and Mehmet Baysan
(74) Predicting the Impact of Cancer Somatic Mutations on Protein-Protein Interactions
Ibrahim Berber, Cesim Erten and Hilal Kazan
(75) Drug-Target Interaction Prediction Using Transfer Learning
Alperen Dalkiran, Ahmet Süreyya Rifaioğlu, Aybar Can Acar, Tunca Dogan, Rengul Atalay and Volkan Atalay
(76) Prediction of Resistance to Drugs in Triple Negative Breast Cancer Based on Gene Expression Levels
Bengisu Karaköse, Berk Gürdamar and Osman Uğur Sezerman
(77) Predicting Oral Health Using Machine Learning
Emrah Kırdök and Andres Aravena
(78) Archaeogenetic Analysis of Neolithic Sheep from Anatolia
Erinç Yurtman, Onur Özer, Eren Yüncü, Nihan Dilşad Dağtaş, Dilek Koptekin, Yasin Gökhan Çakan, Mustafa Özkan, Ali Akbaba, Damla Kaptan, Gözde Atağ, Kıvılcım Başak Vural, Can Yumni Gündem, Louise Martin, Gülşah Merve Kılınç, Ayshin Ghalichi, Sinan Can Açan, Reyhan Yaka, Ekin Sağlıcan, Vandela Kempe Lagerholm, Maja Krzewinska, Torsten Günther, Pedro Morell Miranda, Evangelia Piskin, Müge Şevketoğlu, C. Can Bilgin, Çiğdem Atakuman, Yılmaz Selim Erdal, Elif Sürer, N. Ezgi Altınışık, Johannes Lenstra, Sevgi Yorulmaz, Mohammad Foad Abazari, Javad Hoseinzadeh, Douglas Baird, Erhan Bıçakçı, Özlem Çevik, Fokke Gerritsen, Rana Özbal, Anders Götherström, Mehmet Somel, İnci Togan and Füsun Özer
(79) Potential Inhibitor Identification for Deoxyhypusine Synthase
Ayşenur Öztürk and Fethiye Aylin Sungur
(80) Inter-Tissue Convergence of Gene Expression and Loss of Cellular Identity During Ageing
Hamit İzgi, Dingding Han, Ulas Isildak, Shuyun Huang, Ece Kocabiyik, Philipp Khaitovich, Mehmet Somel and Handan Melike Dönertaş
(81) Ensemble Learning Approach for Computational Drug Repurposing
Ismail Denizli, Oğuzhan Şahin, Özgür Doğan, Tuğba Süzek and Baris Süzek
(82) Extraction of Herb-Drug Interactions
Erkan Yaşar, Remzi Çelebi and Özgür Gümüş
(83) Identification of Autophagy-Related miRNA–mRNA Regulatory Network in Calorie-Restricted Mouse Brain
Atakan Ayden, Elif Yılmaz, Bilge G. Tuna, Ayşegül Kuskucu, Ömer F. Bayrak, Andres Aravena and Soner Doğan
13:30 - 14:00
Maria Secrier
Genomic triggers and evolutionary context of cancer dormancy
Tumour dormancy, a state in which cancer cells are reversibly arrested in the cell cycle, is frequently reported as a contributing factor of resistance to chemotherapy and other treatments that target cycling cells. However, despite its crucial role in cancer progression, dormancy is still poorly characterised and the molecular changes enabling the transition to this state remain largely unknown. Cellular stress can drive cells into dormancy as a mechanism to avoid further DNA damage and maintain genomic stability. How these shifts occur due to various mutational processes and how they impact cancer progression has not been elucidated. I will present an integrated computational framework for evaluating tumour dormancy and its genomic triggers across a variety of cancers. We show that dormancy preferentially emerges in the context of more stable, less mutated genomes which maintain TP53 integrity and lack the hallmarks of DNA damage repair deficiency. Using an ensemble elastic net regression model, we uncover several novel genomic dependencies of this process that could be exploited to maintain or promote exit from this state. We also use single cell data to demonstrate quiescence is a key resistance mechanism to a wide range of cancer therapies. Finally, we demonstrate broad reorganisation of the tumour tissue in the context of dormancy, which could inform therapeutic strategies.
14:00 - 14:30
Rayan Chikhi
A tale of optimizing the space usage of de Bruijn graphs
Over the last decade, many computational works in bioinformatics have studied the de Bruijn graph, a graph data structure used to represent genomic data. It is closely tied to the problem of genome assembly, i.e. the reconstruction of an organism’s chromosomes from a large collection of overlapping short fragments. We start by highlighting this connection, noting that assembling genomes is a computationally intensive task, and then focus on the various techniques developed to reduce the space taken by de Bruijn graph data structures. This talk is a retrospective intended to be accessible without prior knowledge of this area.
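For readers new to the structure, the sketch below builds a small de Bruijn graph from short reads using a plain dictionary. It assumes the common convention in which nodes are (k-1)-mers and each distinct k-mer contributes one edge; the function name and toy reads are hypothetical, and none of the space-saving techniques covered in the talk are shown.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in some k-mer."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Edge from the k-mer's prefix to its suffix; duplicates collapse in the set.
            graph[kmer[:-1]].add(kmer[1:])
    return graph


reads = ["ACGT", "CGTA"]  # toy overlapping fragments
graph = de_bruijn_graph(reads, k=3)
for node in sorted(graph):
    print(node, "->", sorted(graph[node]))
```

Storing every node and edge as explicit strings, as done here, is the naive baseline whose memory cost motivates more compact representations.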
14:30 - 14:45
Compecta
14:45 - 15:00
Break
15:00 - 16:00
Ivet Bahar
Network models in biology: Molecular machinery, chromosomal dynamics, and pathogenicity of missense variants
Network models have proven in recent years to improve our understanding of the coupled dynamics of biomolecules, from individual proteins to supramolecular systems and even the entire chromatin. Among the network models that have been developed for biological applications, elastic network models (ENMs) have found wide use in molecular biology [1]. The global motions predicted by ENMs have been shown in numerous applications over the last two decades to provide a good description of molecular machinery and allosteric behavior. Application to supramolecular structures, including cryo-EM structures, has been a major utility of these models. More recently, ENMs proved useful for exploring chromosomal dynamics, using data from Hi-C experiments to reconstruct in silico the connectivity of the chromatin and provide a physical basis for gene regulation, transcription, and cell type differentiation [2]. Finally, machine learning algorithms that incorporate ENM predictions provide an improved assessment of the effect of mutations on function, compared to those based on sequence and structure exclusively [3,4]. These recent developments and future biomedical and pharmacological applications will be discussed.
References
[1] Krieger JM, Doruker P, Scott AL, Perahia D, Bahar I (2020) Towards Gaining Sight of Multiscale Events: Utilizing Network Models and Normal Modes in Hybrid Methods. Curr Opin Struct Biol 64:34-41.
[2] Zhang S, Chen F, Bahar I (2020) Differences in the Intrinsic Spatial Dynamics of the Chromatin Contribute to Cell Differentiation. Nucleic Acids Res 48:1131-1145.
[3] Ponzoni L, Bahar I (2018) Structural dynamics is a determinant of the functional significance of missense variants. Proc Natl Acad Sci USA 115:4164-4169.
[4] Ponzoni L, Penaherrera DA, Oltvai ZN, Bahar I (2020) Rhapsody: Predicting the pathogenicity of human missense variants. Bioinformatics 36:3084-3092.
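As background for the elastic network models discussed in the abstract above, the sketch below builds the connectivity (Kirchhoff) matrix used by the Gaussian network model, one common ENM variant: nodes within a distance cutoff are joined by a spring. The coordinates and the cutoff value are hypothetical toy inputs, and the normal mode analysis that would follow (eigendecomposition of this matrix) is not shown.

```python
import math


def kirchhoff_matrix(coords, cutoff):
    """Return the connectivity matrix: -1 for node pairs in contact, degree on the diagonal."""
    n = len(coords)
    gamma = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(coords[i], coords[j]) <= cutoff:
                gamma[i][j] = gamma[j][i] = -1
                gamma[i][i] += 1
                gamma[j][j] += 1
    return gamma


# Three toy nodes on a line, 3 units apart; the cutoff of 4 is an arbitrary example value.
coords = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (6.0, 0.0, 0.0)]
gamma = kirchhoff_matrix(coords, cutoff=4.0)
for row in gamma:
    print(row)
```

Each row sums to zero, which reflects the fact that a rigid translation of the whole network costs no energy in such a model.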
16:00 - 16:40
Inter-tissue convergence of gene expression and loss of cellular identity during ageing
Hamit İzgi, Dingding Han, Ulas Isildak, Shuyun Huang, Ece Kocabiyik, Philipp Khaitovich, Mehmet Somel and Handan Melike Dönertaş
Studying gene expression changes during ageing gives insight into age-related molecular and cellular processes. Recent molecular studies covering both development and ageing have reported a reversal of the ageing transcriptome towards pre-adult levels in primate brain and mouse liver and kidney. Several major questions remain to be answered. The prevalence of reversal phenotypes across tissues is still unclear and has not been studied extensively, as most research has been conducted in the brain. To address this, we generated RNA-seq transcriptomes of 16 mice from four tissues (cortex, lung, liver and muscle), covering development and ageing intervals. First, we revealed that approximately 50% of the genes showed expression reversals in each tissue, although these proportions are not significant in permutation tests, suggesting that the expression trajectories of the genes do not necessarily continue linearly into the ageing period. Functional analysis of the gene sets showing reversal patterns identified pathways related to development, metabolism and inflammation. Next, we asked whether different tissues show similarities in their reversal patterns. We found no significant overlap in reversal genes among tissues, suggesting that expression reversals might be tissue-specific. In concordance with the tissue-specific reversals, we showed that during development, tissues become more distinct (diverge) and, interestingly, during ageing, tissues become more similar (converge) in their gene expression levels. We confirmed this observation using two other independent datasets: human RNA-seq and mouse microarray. Moreover, the divergence-convergence pattern is enriched among tissue-specific genes which show either decreased expression in their native tissue or increased expression in non-native tissues during ageing.
Further, using publicly available single-cell transcriptome data, we showed that the divergence-convergence pattern is driven both by alterations in cell type proportions and by cell-autonomous expression changes. This supports our previous hypothesis that loss of cellular identity might be a general phenomenon in mammalian ageing.
Comparison and assessment of speed and accuracy of AutoDock Vina and AutoDock CrankPep for short peptide docking
Sefer Baday and Numan Nusret Usta
Protein-peptide interactions (PPIs) are among the most interesting areas of molecular biology because of their importance in biological processes. In computational biology, molecular docking is a commonly used technique for investigating PPIs. However, because peptides are flexible, docking them to proteins is very challenging owing to their high conformational degrees of freedom. Various molecular docking tools are available, with different search and scoring algorithms for different docking needs such as protein-protein, protein-ligand, and nucleic acid-ligand docking. Several research groups have carried out benchmark studies to assess the effectiveness of molecular docking tools. For protein-peptide docking in particular, however, these studies are limited in the number of complexes studied and in the diversity of peptide sequence lengths. We therefore chose PepBDB, a comprehensive structural database of biological protein-peptide complexes. PepBDB contains more than 13000 complexes, with peptide lengths ranging from 1 to 50. In this study, we worked with two of the most common free docking programs, AutoDock CrankPep (ADCP) and AutoDock Vina. The accuracy of the results was determined with RMSD. We performed local docking for all complexes and evaluated the top pose for each. We analyzed the docking results by peptide sequence length. For both tools, docking results are better for shorter peptides than for longer ones. Longer peptides also take considerably more time, as measured in peptides docked per minute. The success rate of ADCP is noticeably better than that of Vina for shorter peptides; however, the gap between the programs narrows as peptide length increases. We also determined the optimal parameters for both programs as a function of sequence length.
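Since accuracy in the study above is reported as RMSD, a minimal sketch of that measure may help. It assumes the docked and reference peptide coordinates are already in the same receptor frame and that atoms are listed in the same order, so no superposition is performed; the function name and toy coordinates are hypothetical, and the abstract does not state which atoms or success threshold were used.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equally long lists of 3D points."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    squared = sum(math.dist(a, b) ** 2 for a, b in zip(coords_a, coords_b))
    return math.sqrt(squared / len(coords_a))


# Toy example: every atom of the docked pose is shifted by 1 unit along z.
docked = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
value = rmsd(docked, reference)
print(value)
```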
16:40 - 17:00
Break
17:00 - 18:00
Mehmet Koyutürk
Overcoming Bias in Computational Characterization of Cellular Signaling
Protein phosphorylation is a key regulator of protein function in signal transduction pathways. Kinases are the enzymes that catalyze the phosphorylation of other proteins in a target-specific manner. The dysregulation of phosphorylation is associated with many diseases, including cancers, neurodegenerative disorders, and autoimmune diseases. Consequently, characterization of kinase activity in the context of these diseases is essential for the development of effective treatments. Indeed, in the last decade, kinase inhibitors have become central to the treatment of a broad range of cancers. Although technological advances enable the identification of phosphorylated sites for thousands of proteins, most of the phosphoproteome is still in the dark: more than 95% of the reported phosphorylation sites in humans have no known kinases. The incompleteness of kinase annotations and the limitations of data acquisition techniques give rise to two important computational problems in the study of cellular signaling: 1) How can we identify the kinases that target a given phosphorylation site? 2) How can we identify the kinases with altered activity based on the changes in the phosphorylation levels of their targets? In this talk, we demonstrate that the computational tools developed to address these fundamental problems have a common flaw that limits the expansion of knowledge: both benchmarking data and algorithms are biased toward well-studied kinases. We then describe the algorithms we develop to overcome these hurdles by integrating a broad range of functional data. Finally, we present comprehensive results that systematically assess the robustness of algorithms with respect to missing data. Our results show that network algorithms can significantly enhance the utility of phosphoproteomic data and the ability of computational tools to generate new knowledge about cellular signaling.
18:00 - 18:15
Closing Remarks