Many of the machine learning (ML) models used in bioinformatics and computational biology to predict the function or structure of proteins rely on evolutionary information as summarized in multiple sequence alignments (MSAs) or the resulting position-specific scoring matrices (PSSMs), as generated by PSI-BLAST. Because of the exhaustive database search needed to retrieve this evolutionary information, the current procedure for protein structure and function prediction is computationally expensive and time-consuming. The problem is worsening as protein sequence databases grow exponentially over time, driving up PSI-BLAST runtime with them. In previous experiments we conducted, PSI-BLAST took on average 14 minutes per query protein to build a PSSM profile; to train even a simple ML model, one may therefore have to wait a couple of months for PSI-BLAST to generate PSSMs for a few thousand query proteins. This runtime bottleneck calls for an efficient alternative.

A protein sequence is a string of contiguous tokens, or characters, called amino acids (AAs). This analogy to natural language allows us to exploit recent advances in Natural Language Processing (NLP) and to transfer state-of-the-art NLP algorithms to bioinformatics. A prominent recent alternative to PSSMs as input to prediction methods is Embeddings from Language Models (ELMo), which converts a protein sequence into a numerical vector representation. ELMo/SeqVec is a state-of-the-art pre-trained deep learning model that embeds a protein sequence into a 3-dimensional tensor of numerical values. SeqVec trained a 2-layer bidirectional Long Short-Term Memory (LSTM) network in a two-path architecture, one path for the forward and one for the backward pass, on the unsupervised task of predicting the next AA from the previously seen residues in the sequence. The embedder was then evaluated on downstream tasks, such as secondary structure and subcellular localization prediction. The results showed that the embeddings capture biochemical and biophysical properties of a protein, but do not reach state-of-the-art performance.

By merging the idea of PSSMs with the concept of transfer learning during pre-training, we aim to deploy a new ELMo with better embedding power than SeqVec by training a novel single-branch bidirectional language model (bi-LM) with four times fewer free parameters. This is the first time an ELMo is trained not only to predict the next AA but, simultaneously, to predict the probability distribution of the next AA derived from similar yet different sequences as summarized in a PSSM (multi-task training), thereby also learning evolutionary information of protein sequences. To train our novel embedder, we compiled the largest curated dataset to date of sequences with their corresponding PSSMs: 1.83 million proteins (~0.8 billion amino acids). The dataset is redundancy-reduced to 40% sequence identity with respect to the validation/test sets and contains sequences ranging from 18 to 9858 residues in length.
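To make the PSSM-generation step concrete: a single profile of the kind discussed above is typically produced with a PSI-BLAST (BLAST+) call along the following lines. This is a minimal sketch; the database name, iteration count, and E-value threshold are illustrative assumptions, not the exact settings used in our experiments.

    import subprocess

    # Build an ASCII PSSM for one query protein with PSI-BLAST (BLAST+).
    # Database name and parameter values below are illustrative assumptions.
    subprocess.run([
        "psiblast",
        "-query", "query.fasta",          # single protein sequence in FASTA format
        "-db", "uniref90",                # pre-formatted protein database
        "-num_iterations", "3",           # iterative profile refinement
        "-evalue", "0.001",               # inclusion threshold
        "-out_ascii_pssm", "query.pssm",  # the resulting PSSM profile
    ], check=True)

Since this search must be repeated for every query protein, the per-protein runtime quoted above multiplies directly with dataset size.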
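For reference, retrieving the 3-dimensional SeqVec embedding of a single sequence follows the published SeqVec/AllenNLP interface roughly as below. The model directory path and the toy sequence are placeholders.

    from pathlib import Path
    from allennlp.commands.elmo import ElmoEmbedder

    # Placeholder paths to the released SeqVec model files.
    model_dir = Path("uniref50_v2")
    embedder = ElmoEmbedder(options_file=str(model_dir / "options.json"),
                            weight_file=str(model_dir / "weights.hdf5"),
                            cuda_device=-1)  # -1 = run on CPU

    seq = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"  # toy sequence
    # One token per amino acid; the result is a numpy array of shape
    # (3, L, 1024): 3 model layers x sequence length x embedding dimension.
    embedding = embedder.embed_sentence(list(seq))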
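The multi-task objective can be summarized as a weighted combination of a cross-entropy term on the observed next residue and a divergence term against the PSSM-derived distribution over the next residue. The following PyTorch sketch illustrates one way to set this up; the architecture dimensions, the loss weighting, and all names are assumptions for illustration, not the exact published configuration.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    NUM_AA = 20  # standard amino-acid alphabet

    class BiLM(nn.Module):
        """Single-branch 2-layer bidirectional LSTM over amino-acid tokens.
        NOTE: a real next-token bi-LM must shift/mask the backward direction
        so it cannot see the target residue; this sketch omits that detail."""
        def __init__(self, embed_dim=128, hidden_dim=512):
            super().__init__()
            self.embed = nn.Embedding(NUM_AA, embed_dim)
            self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers=2,
                                bidirectional=True, batch_first=True)
            self.head = nn.Linear(2 * hidden_dim, NUM_AA)

        def forward(self, tokens):            # tokens: (B, L) residue indices
            hidden, _ = self.lstm(self.embed(tokens))
            return self.head(hidden)          # logits: (B, L, NUM_AA)

    def multitask_loss(logits, next_aa, pssm_probs, alpha=0.5):
        # Task 1: cross-entropy against the observed next residue.
        ce = F.cross_entropy(logits.transpose(1, 2), next_aa)
        # Task 2: KL divergence against the PSSM-derived distribution over
        # the next residue (PSSM rows normalized to sum to 1).
        kl = F.kl_div(F.log_softmax(logits, dim=-1), pssm_probs,
                      reduction="batchmean")
        return alpha * ce + (1.0 - alpha) * kl

In this formulation the PSSM acts as a soft, evolution-informed target alongside the hard next-residue label, which is what lets the embedder absorb evolutionary information during pre-training.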