Automatic speech recognition requires substantial processing power and memory. The main algorithms in speech recognition (the computation of emission probabilities and the Viterbi search) are very memory- and computation-intensive. Modern workstations, personal computers and servers have sufficient memory and processing power, but embedded devices are limited in these resources. Speech recognition on embedded devices must therefore strike an acceptable trade-off between memory consumption, processing power and recognition quality. Several memory-saving approaches and fast algorithms were investigated, and the following results were achieved:
The memory consumption of the acoustic models after coding is decreased by 67% (from 104 KB to 34 KB), while the relative increase in word error rate is less than 10%. The fast computation of emission probabilities requires three times fewer operations than the baseline algorithm: on an ARM microcontroller, the emission computation for recognition with a 30-word vocabulary requires only 8.2 MHz, whereas the baseline algorithm requires at least 28.9 MHz. The new search process for isolated word recognition with a 1500-word vocabulary requires less than 17 MHz and 160 KB of memory on an ARM processor.
Both the fast computation of emission probabilities and the compact coding of the acoustic model parameters are based on a stream approach. The set of 24-dimensional vectors from the acoustic models is divided into streams: with 3-dimensional (3-D) streams, the first stream contains the 1st, 2nd and 3rd components (dimensions) of the vectors, the second stream contains the 4th, 5th and 6th components, and so on. All 3-D stream vectors are coded by means of vector quantization. A single shared codebook is used for all streams instead of a separate codebook for each stream, which decreases the memory consumption further.
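The stream split and the shared codebook can be sketched as follows. This is a minimal illustration, not the original implementation: the function names, the codebook size and the simple k-means quantizer are assumptions chosen for clarity.

```python
import numpy as np

def split_into_streams(vectors, stream_dim=3):
    """Split 24-dimensional model vectors into consecutive 3-D stream vectors."""
    n, d = vectors.shape
    assert d % stream_dim == 0
    # Each 24-D vector yields d // stream_dim stream vectors of length stream_dim.
    return vectors.reshape(n * (d // stream_dim), stream_dim)

def build_shared_codebook(stream_vectors, codebook_size=256, iterations=10):
    """Quantize all stream vectors with ONE shared codebook (k-means sketch).

    A single codebook serves every stream, so only the codebook plus one
    index per stream vector has to be stored.
    """
    rng = np.random.default_rng(0)
    # Initialize codewords from randomly chosen stream vectors.
    codebook = stream_vectors[rng.choice(len(stream_vectors),
                                         codebook_size, replace=False)]
    for _ in range(iterations):
        # Assign each stream vector to its nearest codeword (squared distance).
        dists = ((stream_vectors[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
        labels = dists.argmin(axis=1)
        # Move each codeword to the mean of its assigned vectors.
        for k in range(codebook_size):
            members = stream_vectors[labels == k]
            if len(members):
                codebook[k] = members.mean(axis=0)
    return codebook, labels
```

After coding, each model vector is represented by eight small indices into the shared codebook instead of 24 floating-point components, which is where the memory saving comes from.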
During recognition, the distances between the feature vector and the vectors from the acoustic models must be computed. This process is performed every 15 ms and requires a large number of operations. With stream-based acoustic models these computations are accelerated. In the first step, the partial distances between the streams of the feature vector and all stream vectors in the codebook are computed and stored in memory; this is feasible because the codebook contains only a limited number of vectors. In the second step, the distance between the feature vector and each model vector is obtained as the sum of the precomputed partial distances of its stream vectors. For 3-D streams the computation costs are reduced by 66%.
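The two-step distance computation described above can be sketched as follows, assuming the models are stored as codeword indices into the shared codebook; the function names and array layout are illustrative assumptions.

```python
import numpy as np

def precompute_partial_distances(feature, codebook, stream_dim=3):
    """Step 1: squared distance from each feature stream to every codeword.

    Returns table[s, k] = ||stream s of the feature vector - codeword k||^2.
    Computed once per frame; its size is (streams x codebook size), which is
    small because the codebook has a limited number of vectors.
    """
    n_streams = len(feature) // stream_dim
    streams = feature.reshape(n_streams, stream_dim)       # (S, 3)
    diff = streams[:, None, :] - codebook[None, :, :]      # (S, K, 3)
    return (diff ** 2).sum(axis=-1)                        # (S, K)

def model_distances(model_indices, table):
    """Step 2: full distance = sum of precomputed partial distances.

    model_indices: (M, S) array holding, for each of M model vectors,
    the codeword index of each of its S streams.
    """
    stream_ids = np.arange(model_indices.shape[1])         # 0..S-1
    # Table lookups replace per-model distance arithmetic.
    return table[stream_ids, model_indices].sum(axis=1)    # (M,)
```

The saving comes from step 2: each model vector costs only S table lookups and additions instead of a full 24-dimensional distance computation.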
To accelerate the search process, a tree structure is combined with a word stem structure; the new search algorithm takes advantage of both approaches. In a tree structure, words starting with identical phonemes are processed together: the merged word parts with identical phonemes are processed only once per search iteration, which accelerates the computation. The tree structure also requires less memory than the linear structure, because phonemes in shared word parts are stored in memory only once. From the word stem search the new algorithm inherits the stems themselves (linear sequences of HMM states): the regular linear structure of a stem is fast to process, and the data of each stem is stored compactly in memory, so the memory cache is used efficiently. The presented algorithms were tested; with them, large vocabulary speech recognition becomes feasible on embedded devices.
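The merging of shared word-initial phonemes can be illustrated with a small prefix tree over a toy lexicon; the node representation and phoneme symbols below are assumptions for the example, not the thesis's data structures.

```python
def build_prefix_tree(lexicon):
    """Merge shared word-initial phoneme sequences into one tree.

    lexicon: dict mapping word -> list of phoneme symbols.
    Each node is a dict from phoneme to child node; a '#' entry
    marks the end of a word and stores its identity.
    """
    root = {}
    for word, phonemes in lexicon.items():
        node = root
        for ph in phonemes:
            node = node.setdefault(ph, {})  # reuse existing branch if present
        node["#"] = word
    return root

def count_nodes(node):
    """Count stored phoneme arcs; shared prefixes are stored only once."""
    return sum(1 + count_nodes(child)
               for ph, child in node.items() if ph != "#")
```

For the toy lexicon {"cat": k-ae-t, "can": k-ae-n, "car": k-aa-r}, a linear lexicon stores 9 phonemes, while the tree stores only 6, since "k" and "k ae" are shared; during search, the shared arcs are likewise evaluated only once per iteration.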