Data stream processing is currently gaining importance due to the rapid increase in data volumes and developments in novel application areas like e-science, e-health, and e-business. In this thesis, we propose an architecture for a data stream management system and investigate methods for query processing on data streams in such systems. In contrast to traditional database management systems (DBMSs), queries on data streams constitute continuous subscriptions for retrieving interesting data rather than one-time ad-hoc queries. To meet the challenges of the new "streaming" paradigm, we propose StreamGlobe, a distributed data stream management system for efficiently querying and processing XML data streams in the spirit of a traditional DBMS. Beyond processing XQuery subscriptions, StreamGlobe in particular addresses the problem of efficiently distributing data streams in Peer-to-Peer networks by means of data stream sharing, which avoids network and peer congestion. For the evaluation of XQuery subscriptions, StreamGlobe employs our novel streaming XQuery processor, FluX. FluX is an extension of the XQuery language that supports event-based query processing and the conscious handling of main memory buffers to achieve scalable execution of queries on data streams. XQueries are rewritten into the event-based FluX language by exploiting order constraints from the schema of a data stream to schedule event processors and thereby minimize the amount of buffering required for evaluating a query. Performance experiments demonstrate the effectiveness of our approach. StreamGlobe further allows the use of user-defined operators to enable expressive query processing. We discuss the implementation of such operators using our novel class of best-match join operators as an example. These operators address the problem of finding best matching pairs of data objects in multi-dimensional spaces. Considering multiple dimensions leads to a partial order on the pairs of objects. Since a partial order can have more than one minimal element, traditional approaches that aim at determining a single "best" pair are likely to fail to produce satisfying results. In contrast, our best-match join computes the best matching pairs having a maximum similarity on each individual dimension. We assess the effectiveness of this approach by means of performance experiments.
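
To illustrate why a partial order on pairs yields several "best" matches, the following minimal Python sketch computes the pairs that no other pair dominates. It is not the thesis's operator: it assumes that similarity on each dimension is measured by the absolute coordinate difference (smaller is better) and uses a naive quadratic scan, and the names `distance_vector`, `dominates`, and `best_match_pairs` are hypothetical.

```python
from itertools import product
from typing import Sequence

Point = Sequence[float]


def distance_vector(a: Point, b: Point) -> tuple[float, ...]:
    # Assumption: per-dimension dissimilarity is the absolute difference.
    return tuple(abs(x - y) for x, y in zip(a, b))


def dominates(u: Sequence[float], v: Sequence[float]) -> bool:
    # u dominates v if it is at least as good on every dimension
    # and strictly better on at least one.
    return all(x <= y for x, y in zip(u, v)) and any(x < y for x, y in zip(u, v))


def best_match_pairs(left: list[Point], right: list[Point]) -> list[tuple[Point, Point]]:
    # Naive illustration: build all pairs, then keep those whose
    # distance vector is not dominated by any other pair's vector.
    candidates = [(a, b, distance_vector(a, b)) for a, b in product(left, right)]
    return [
        (a, b)
        for a, b, d in candidates
        if not any(dominates(e, d) for _, _, e in candidates)
    ]


left = [(1.0, 5.0)]
right = [(1.0, 9.0), (4.0, 5.0), (5.0, 9.0)]
result = best_match_pairs(left, right)
```

Here the pairs with `(1.0, 9.0)` and `(4.0, 5.0)` are both kept, because each is closest on a different dimension and neither dominates the other, while the pair with `(5.0, 9.0)` is dropped. A single scalar ranking would have to pick one of the two surviving pairs arbitrarily.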