Ein Konstruktionsalgorithmus einer Indexstruktur für Genome

Witterstein, Gabriele

Gabriele Witterstein

If you experience problems opening the document, please try this link.

Original title:: Ein Konstruktionsalgorithmus einer Indexstruktur für Genome
Translated title:: A Construction Algorithm of an Index Structure for Genomes
Author:: Witterstein, Gabriele
Year:: 2005
Document type:: Dissertation
Faculty/School:: Fakultät für Mathematik
Advisor:: Hoffmann, Karl-Heinz (Prof. Dr. Dr. h.c. mult.)
Referee:: Hoffmann, Karl-Heinz (Prof. Dr. Dr. h.c. mult.); Kramer, Stefan (Prof. Dr.); Leser, Ulf (Prof. Dr.)
Format:: Text
Language:: de
Subject group:: BIO Biowissenschaften; DAT Datenverarbeitung, Informatik
Controlled terms:: Genom; DNS-Sequenz; Indexierung ; Algorithmus
TUM classification:: BIO 180d; DAT 530d; DAT 455d
Abstract:: Infolge weltweiter Genom-Sequenzierungsprojekte entstehen grosse Mengen an DNA Rohdaten. Eine grundlegende Fragestellung ist deren textuelle Analyse. Im Rahmen dieser Arbeit wird eine Indexstruktur über DNA-Sequenzen vorgestellt, welche grundlegende Aufgaben, wie schnelles approximatives Stringmatching, exaktes Stringmatching usw. unterstützt.
Da derartige Strukturen sehr speicherintensiv sind, wird ein Design einer Suffix Tree Variante besprochen, das sich sowohl an den wichtigsten biologischen Anwendungen ausrichtet, als auch an den Gebrauch eines relativ langsamen, dafür unbegrenzten Speichermediums. Das Ziel ist eine möglichst platzsparende Repräsentation bei maximaler Effizienz im Zugriff.
Die hauptsächliche wissenschaftliche Fragestellung ist die Konstruktion einer solchen Datenstruktur unter einem 'Two-Level-Memory Model' mit geringem Hauptspeicher. Es wird ein Konstruktionsalgorithmus vorgestellt, welcher auf der Partitionierung der Datenmenge basiert. Im Rahmen einer stochastischen Analyse, unter gewissen Modellannahmen an die DNA Sequenzen, kann gezeigt werden, dass die Komplexität des Algorithmus durchschnittlich von der Ordnung O(n log n) ist.
Eine Implementierung und Testläufe mit realen Sequenzen belegen die Anwendbarkeit des Algorithmus in der Biologie. Vergleiche mit ähnlichen Algorithmen auf diesem Gebiet dokumentieren einen beträchtlichen Performancegewinn.
Translated abstract:: Due to world-wide Genome sequence projects, large quantities of DNA raw data arise. The fundamental task emerging is their textual analysis. In the frame of this work, an index structure over DNA sequences is introduced, which supports important string analysis tasks, such as fast approximative string matching and fast exact string matching, and so on.
Because such structures are very main-memory expensiv, a design of a variant of a suffix tree is discussed. This design is orientated on the most important biological needs as well as on the usage of a relatively slow, but unlimited storage medium. The goal is a preferably space-saving representation with a maximum efficiency in the access.
Beside that, the main scientific question is the construction of such a data structure under a 'Two-Level Memory Model' with small main storage. A construction algorithm is presented, which is based on the partitioning of the data set. In the context of a stochastic analysis, under certain model assumptions to the DNA sequences, it can be shown that the complexity of the algorithm is of order O(n log n) in average.
An implementation and test runs with real sequences demonstrate the applicability of the algorithm in biology. Comparisons with similar algorithms in this field document a considerable performance benefit.
Publication :: Universitätsbibliothek der Technischen Universität München
WWW:: https://mediatum.ub.tum.de/?id=602028
Date of submission:: 06.12.2004
Oral examination:: 01.07.2005
File size:: 696203 bytes
Pages:: 105
Urn (citeable URL):: https://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:bvb:91-diss20050831-1825266245
Last change:: 18.07.2007
BibTeX