Automatic speech recognition requires substantial processing power and memory. The main algorithms in speech recognition (the computation of emission probabilities and the Viterbi search) are very memory- and computation-intensive. Modern workstations, personal computers and servers have sufficient memory and processing power, but embedded devices are limited in these resources. Speech recognition on embedded devices must therefore strike an acceptable trade-off between memory consumption, processing power and recognition quality. Several memory-saving approaches and fast algorithms were investigated, and the following results were achieved:
The memory consumption of the acoustic models after coding is decreased by 67% (from 104 KB to 34 KB), while the relative increase in word error rate is less than 10%. The fast computation of emission probabilities requires three times fewer operations than the baseline algorithm: on an ARM microcontroller, the emission computation for recognition with a 30-word vocabulary requires only 8.2 MHz, whereas the baseline algorithm requires at least 28.9 MHz. The new search process for isolated word recognition with a 1500-word vocabulary requires less than 17 MHz and 160 KB of memory on an ARM processor.
Both the fast computation of emission probabilities and the compact coding of the acoustic model parameters are based on a stream approach. The set of 24-dimensional vectors from the acoustic models is divided into streams: with 3-dimensional (3-D) streams, the first stream contains the 1st, 2nd and 3rd components (dimensions) of the vectors, the second stream contains the 4th, 5th and 6th components, and so on. All 3-D stream vectors are coded by means of vector quantization. A single shared codebook is used for all streams instead of a separate codebook for each stream, which decreases the memory consumption further.
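The stream split and the shared codebook can be sketched as follows. This is a minimal illustration, not the original implementation: the function names, the codebook size and the simple k-means quantizer are assumptions chosen for clarity.

```python
import numpy as np

def split_into_streams(vectors, stream_dim=3):
    """Split 24-dimensional model vectors into consecutive 3-D stream vectors."""
    n, d = vectors.shape
    assert d % stream_dim == 0
    # Each 24-D vector yields d // stream_dim stream vectors of length stream_dim.
    return vectors.reshape(n * (d // stream_dim), stream_dim)

def build_shared_codebook(stream_vectors, codebook_size=256, iterations=10):
    """Quantize all stream vectors with ONE shared codebook (k-means sketch).

    A single codebook serves every stream, so only the codebook plus one
    index per stream vector has to be stored.
    """
    rng = np.random.default_rng(0)
    # Initialize codewords from randomly chosen stream vectors.
    codebook = stream_vectors[rng.choice(len(stream_vectors),
                                         codebook_size, replace=False)]
    for _ in range(iterations):
        # Assign each stream vector to its nearest codeword (squared distance).
        dists = ((stream_vectors[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
        labels = dists.argmin(axis=1)
        # Move each codeword to the mean of its assigned vectors.
        for k in range(codebook_size):
            members = stream_vectors[labels == k]
            if len(members):
                codebook[k] = members.mean(axis=0)
    return codebook, labels
```

After coding, each model vector is represented by eight small indices into the shared codebook instead of 24 floating-point components, which is where the memory saving comes from.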
During recognition, the distances between the feature vector and the vectors from the acoustic models must be computed. This process is performed every 15 ms and requires a large number of operations. With stream-based acoustic models these computations are accelerated. In the first step, the partial distances between the streams of the feature vector and all stream vectors in the codebook are computed and stored in memory; this is feasible because the codebook contains only a limited number of vectors. In the second step, the distance between the feature vector and each model vector is obtained as the sum of the precomputed partial distances of its stream vectors. For 3-D streams the computation costs are reduced by 66%.
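The two-step distance computation described above can be sketched as follows, assuming the models are stored as codeword indices into the shared codebook; the function names and array layout are illustrative assumptions.

```python
import numpy as np

def precompute_partial_distances(feature, codebook, stream_dim=3):
    """Step 1: squared distance from each feature stream to every codeword.

    Returns table[s, k] = ||stream s of the feature vector - codeword k||^2.
    Computed once per frame; its size is (streams x codebook size), which is
    small because the codebook has a limited number of vectors.
    """
    n_streams = len(feature) // stream_dim
    streams = feature.reshape(n_streams, stream_dim)       # (S, 3)
    diff = streams[:, None, :] - codebook[None, :, :]      # (S, K, 3)
    return (diff ** 2).sum(axis=-1)                        # (S, K)

def model_distances(model_indices, table):
    """Step 2: full distance = sum of precomputed partial distances.

    model_indices: (M, S) array holding, for each of M model vectors,
    the codeword index of each of its S streams.
    """
    stream_ids = np.arange(model_indices.shape[1])         # 0..S-1
    # Table lookups replace per-model distance arithmetic.
    return table[stream_ids, model_indices].sum(axis=1)    # (M,)
```

The saving comes from step 2: each model vector costs only S table lookups and additions instead of a full 24-dimensional distance computation.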
To accelerate the search process, a tree structure is combined with a word stem structure; the new search algorithm takes advantage of both approaches. In a tree structure, words starting with identical phonemes are processed together: the merged word parts with identical phonemes are processed only once per search iteration, which accelerates the computation. The tree structure also requires less memory than the linear structure, because phonemes in shared word parts are stored in memory only once. From the word stem search the new algorithm inherits the stems themselves (linear sequences of HMM states): the regular linear structure of a stem is fast to process, and the data of each stem is stored compactly in memory, so the memory cache is used efficiently. The presented algorithms were tested; with them, large vocabulary speech recognition becomes feasible on embedded devices.
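The merging of shared word-initial phonemes can be illustrated with a small prefix tree over a toy lexicon; the node representation and phoneme symbols below are assumptions for the example, not the thesis's data structures.

```python
def build_prefix_tree(lexicon):
    """Merge shared word-initial phoneme sequences into one tree.

    lexicon: dict mapping word -> list of phoneme symbols.
    Each node is a dict from phoneme to child node; a '#' entry
    marks the end of a word and stores its identity.
    """
    root = {}
    for word, phonemes in lexicon.items():
        node = root
        for ph in phonemes:
            node = node.setdefault(ph, {})  # reuse existing branch if present
        node["#"] = word
    return root

def count_nodes(node):
    """Count stored phoneme arcs; shared prefixes are stored only once."""
    return sum(1 + count_nodes(child)
               for ph, child in node.items() if ph != "#")
```

For the toy lexicon {"cat": k-ae-t, "can": k-ae-n, "car": k-aa-r}, a linear lexicon stores 9 phonemes, while the tree stores only 6, since "k" and "k ae" are shared; during search, the shared arcs are likewise evaluated only once per iteration.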