Doctor's Theses (authored and supervised):
"Distributed Heterogeneous Web Data Sources Integration - DeXIN Approach";
Supervisor, Reviewer: R. Pichler, U. Zdun;
Institut für Informationssysteme,
oral examination: 08-26-2011.
Modern business enterprises frequently develop integrated applications that provide uniform access to multiple existing information systems running inside or outside the enterprise. Data integration is a pervasive challenge for such applications, which need to query across multiple autonomous and heterogeneous data sources. Integrating such diverse information systems is particularly challenging when different applications use data formats and query languages that are not compatible with each other.
With the growing popularity of web technologies and the availability of huge amounts of data on the web, the requirements for data integration have changed from those of traditional database integration approaches. The large scale of web data sources has not only led to high levels of distribution and heterogeneity, with different data formats and query languages; the data is also associated with data concerns such as privacy, licensing, pricing, and quality of data. Hence, data integration tools not only have to mitigate the heterogeneity in data formats and query languages, but must also preserve the various data concerns when data is published and utilized. Moreover, data service selection and data selection should be based on these data concerns.
The goal of this thesis is to provide better means to dynamically integrate distributed heterogeneous web data sources (particularly XML and RDF data sources), so that the user can easily build data integration applications while ensuring that all data concerns associated with the data are respected.
This work is devoted to distributed heterogeneous data integration for web data sources. To address the challenge of XML and RDF data integration, we propose "DeXIN (Distributed extended XQuery for heterogeneous data INtegration)", an extensible framework for distributed query processing over heterogeneous, distributed and autonomous data sources. DeXIN takes one data format as the basis (the so-called "aggregation model") and extends the corresponding query language to execute queries over heterogeneous data sources in their respective query languages. We present an extension of XQuery which covers the full SPARQL language and supports the decentralized execution of both XQuery and SPARQL in a single query.
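The idea of an aggregation model can be illustrated with a minimal Python sketch. It is not DeXIN's implementation or syntax: the names `SubQuery`, `run_integrated_query`, and the stub executors are hypothetical, and XML is assumed as the aggregation format. Each sub-query is handed to an executor for its own language, and the results are wrapped in one XML tree.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class SubQuery:
    language: str  # e.g. "xquery" or "sparql"
    source: str    # hypothetical identifier of the remote data source
    text: str      # query text in the source's native query language


# An executor runs a sub-query at its source and returns rows of name/value pairs.
Executor = Callable[[SubQuery], List[Dict[str, str]]]


def rows_to_xml(sub: SubQuery, rows: List[Dict[str, str]]) -> ET.Element:
    """Wrap result rows in XML, the aggregation format assumed in this sketch."""
    result = ET.Element("result", {"source": sub.source, "language": sub.language})
    for row in rows:
        row_el = ET.SubElement(result, "row")
        for name, value in row.items():
            ET.SubElement(row_el, name).text = value
    return result


def run_integrated_query(subqueries: List[SubQuery],
                         executors: Dict[str, Executor]) -> ET.Element:
    """Dispatch each sub-query to the executor for its language and merge the results."""
    root = ET.Element("results")
    for sub in subqueries:
        executor = executors[sub.language]
        root.append(rows_to_xml(sub, executor(sub)))
    return root


# Stub executors standing in for real XQuery and SPARQL engines (assumptions).
def fake_xquery_executor(sub: SubQuery) -> List[Dict[str, str]]:
    return [{"title": "Data Integration"}]


def fake_sparql_executor(sub: SubQuery) -> List[Dict[str, str]]:
    return [{"author": "Alice"}]


demo = run_integrated_query(
    [
        SubQuery("xquery", "books.xml", "for $b in //book return $b/title"),
        SubQuery("sparql", "authors-endpoint", "SELECT ?author WHERE { ?s ?p ?author }"),
    ],
    {"xquery": fake_xquery_executor, "sparql": fake_sparql_executor},
)
print(ET.tostring(demo, encoding="unicode"))
```

In this sketch the coordinator never translates one query language into another; it only routes sub-queries and converts their results into the common format.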
To ensure that the data concerns associated with data published on the web are respected, we introduce a "Data Concerns Aware Querying System". A data concerns-aware querying system incorporates several data concerns into a query language, thus enabling data service integration systems to handle the data concerns associated with the data services. Our concerns-aware querying system extends the XQuery language to make it concerns-aware, introducing special keywords for specifying data concerns within the query.
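Selecting data services by their data concerns can be sketched in Python as follows. This is an illustration only and does not reproduce the thesis's XQuery keywords: `DataService`, `select_services`, the concern names, and the sample values are all hypothetical. A service that does not declare a required concern is treated as not satisfying it, which is an assumption of this sketch.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class DataService:
    name: str
    # Concern metadata published with the service, e.g. price or license.
    concerns: Dict[str, Any] = field(default_factory=dict)


# A requirement maps a concern name to a predicate over the declared value.
Requirements = Dict[str, Callable[[Any], bool]]


def satisfies(service: DataService, required: Requirements) -> bool:
    """Return True only if every required concern is declared and acceptable."""
    for concern, predicate in required.items():
        if concern not in service.concerns:
            return False  # assumption: undeclared concerns fail the check
        if not predicate(service.concerns[concern]):
            return False
    return True


def select_services(services: List[DataService],
                    required: Requirements) -> List[DataService]:
    """Keep the services whose declared concerns meet all requirements."""
    return [s for s in services if satisfies(s, required)]


# Hypothetical services and requirements for illustration.
services = [
    DataService("weather-a", {"price": 0.0, "license": "open", "quality": 0.7}),
    DataService("weather-b", {"price": 9.5, "license": "commercial", "quality": 0.9}),
    DataService("weather-c", {"license": "open"}),
]
required: Requirements = {
    "price": lambda p: p <= 5.0,
    "license": lambda lic: lic == "open",
}
selected = select_services(services, required)
print([s.name for s in selected])
```

Expressing each requirement as a predicate keeps the selection logic independent of any particular concern, so further concerns can be added without changing `select_services`.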
In the last part of this thesis, we design a mashup tool on top of DeXIN. We propose a query-based aggregation of multiple heterogeneous data sources that combines the powerful querying features of XQuery and SPARQL with the easy interface of a mashup tool for data sources in XML and RDF. Our mashup editor allows for the automatic generation of mashups through an easy-to-use visual interface. For the dynamic integration of heterogeneous web data sources, we utilize the concept of data mashups, which use the extension of XQuery proposed in DeXIN.
Created from the Publication Database of the Vienna University of Technology.