Data Integration Tool

Level

Undergraduate, Advanced Computing MSc (or any student with suitable experience)

Objective

This project would involve implementing a data integration tool around the AutoMed API. The tool would allow a user to view the schemas and contents of databases, and would guide the user through a data integration process that builds a set of mappings between data sources. These mappings allow data and queries to be migrated and transformed between databases.

Description

Many organisations now have several independent information systems which hold related information. It is often desirable that this information be processed as a unified database. This requires that we are able to perform data integration on the various databases, each represented by its local schema, to form a single database represented by a global schema.

This advanced project involves building a tool to support the application of already developed techniques for performing database schema integration. Some of these techniques were studied in the Advanced Databases course, and others have been developed within the AutoMed project.

The AutoMed API provides the ability to store descriptions of data modelling languages, to store descriptions of mappings between data sources using the BAV approach ([McBrien, Poulovassilis (2003)]), and to process queries over those data sources. It also provides a simple GUI to view the information stored.

This project would involve extending this platform with a wizard tool to guide a user through the process of creating BAV mappings between the data sources. At the core of the project would be a schema transformation macro language to allow the 'programming' of standard patterns of schema mapping. For example, a well-known mapping in the relational model is normalisation: if a relation person(id, dept, hod) has id as its key and hod is dependent on dept, then integrating the schema with other schemas will often require that the person table be normalised to person(id, dept) and dept(dept, hod). With the macro language, a user would call:

normalise(<<person,dept>>,<<person,hod>>,)

and the macro would expand into the BAV rules:

AddTable(<<dept>>, [{y} | {x,y} <- <<person,dept>>])
AddColumn(<<dept,dept>>, [{y,y} | {x,y} <- <<person,dept>>])
AddPrimaryKey(<<dept>>,<<dept,dept>>)
AddColumn(<<dept,hod>>, [{x,y} | {z,x} <- <<person,dept>>; {z,y} <- <<person,hod>>])
DeleteColumn(<<person,hod>>, [{x,y} | {x,z} <- <<person,dept>>; {z,y} <- <<dept,hod>>])
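As a rough illustration of the macro expansion described above, the following Java sketch turns a normalise call into the five BAV rules as plain strings. Everything in it is an assumption made for illustration: the class NormaliseMacro, the helper parseColumn and the method signature are hypothetical and are not part of the AutoMed API, and the new table is simply named after the determinant column. The comprehension variables are named k (key of the original table), d (determinant) and h (dependent) for readability.

```java
import java.util.List;

/** Hypothetical sketch of a normalisation macro; not part of the AutoMed API. */
final class NormaliseMacro {

    /** Splits a column scheme such as "<<person,dept>>" into {"person", "dept"}. */
    static String[] parseColumn(String scheme) {
        String s = scheme.trim();
        if (!s.startsWith("<<") || !s.endsWith(">>")) {
            throw new IllegalArgumentException("Not a scheme: " + scheme);
        }
        String[] parts = s.substring(2, s.length() - 2).split(",");
        if (parts.length != 2) {
            throw new IllegalArgumentException("Expected <<table,column>>: " + scheme);
        }
        return new String[] { parts[0].trim(), parts[1].trim() };
    }

    /**
     * Expands normalise(determinant, dependent) into BAV rules, assuming the
     * dependent column is functionally dependent on the determinant column.
     */
    static List<String> normalise(String determinant, String dependent) {
        String[] det = parseColumn(determinant);
        String[] dep = parseColumn(dependent);
        if (!det[0].equals(dep[0])) {
            throw new IllegalArgumentException("Columns must belong to the same table");
        }
        String table = det[0];
        String newTable = det[1]; // assumption: new table is named after the determinant

        String detScheme = "<<" + table + "," + det[1] + ">>";       // e.g. <<person,dept>>
        String depScheme = "<<" + table + "," + dep[1] + ">>";       // e.g. <<person,hod>>
        String newKey = "<<" + newTable + "," + det[1] + ">>";       // e.g. <<dept,dept>>
        String newDep = "<<" + newTable + "," + dep[1] + ">>";       // e.g. <<dept,hod>>

        return List.of(
            // New table holds one entry per distinct determinant value.
            "AddTable(<<" + newTable + ">>, [{d} | {k,d} <- " + detScheme + "])",
            "AddColumn(" + newKey + ", [{d,d} | {k,d} <- " + detScheme + "])",
            "AddPrimaryKey(<<" + newTable + ">>," + newKey + ")",
            // Dependent values are moved to the new table, keyed by the determinant.
            "AddColumn(" + newDep + ", [{d,h} | {k,d} <- " + detScheme
                + "; {k,h} <- " + depScheme + "])",
            // The old column is recoverable by joining through the determinant.
            "DeleteColumn(" + depScheme + ", [{k,h} | {k,d} <- " + detScheme
                + "; {d,h} <- " + newDep + "])");
    }
}
```

Calling NormaliseMacro.normalise("<<person,dept>>", "<<person,hod>>") yields a list of five rule strings for the person example. A real macro language would presumably build transformation objects through the AutoMed API rather than strings; strings are used here only to keep the sketch self-contained.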

Once the macro language has been written and used to write rules in at least the relational data model, this project would develop a tool that allows users to view data held in the data sources and guides them in applying templates to generate the mappings between local schemas and the global schema.

Additional capabilities might include one or more of the following:

Query the databases to check that the information present matches the transformations being applied. For the example above, this would involve checking that no dept appears with more than one hod value (i.e. that the functional dependency is obeyed).
Demonstrate the macro language being used to generate mapping rules for schemas in other data modelling languages, such as the ER, ORM, XML or UML languages.
Provide tools to convert schemas between different data modelling languages, such as between relational and ER models, and between XML and ER models, and hence allow heterogeneous data sources to be integrated using a single data modelling language.
Use techniques from data mining, knowledge discovery and schema matching to automate the process of determining which constructs in different data sources are related.
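The first capability in the list above, checking that the data obeys a functional dependency before normalising, could be sketched in Java roughly as follows. This is a minimal sketch under stated assumptions: the class FunctionalDependencyCheck and its method are hypothetical names, and the rows are held in memory as string arrays, whereas a real tool would fetch them from the data source.

```java
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.Set;

/** Hypothetical sketch of a functional dependency check; not part of the AutoMed API. */
final class FunctionalDependencyCheck {

    /**
     * Returns the determinant values that occur with more than one dependent
     * value, i.e. the values that violate determinant -> dependent.
     * An empty result means the functional dependency holds for these rows.
     */
    static Set<String> violations(List<String[]> rows, int determinantIdx, int dependentIdx) {
        Map<String, String> firstSeen = new HashMap<>();
        Set<String> violating = new LinkedHashSet<>();
        for (String[] row : rows) {
            String det = row[determinantIdx];
            String dep = row[dependentIdx];
            if (firstSeen.containsKey(det)) {
                // Same determinant seen before: the dependent value must match.
                if (!Objects.equals(firstSeen.get(det), dep)) {
                    violating.add(det);
                }
            } else {
                firstSeen.put(det, dep);
            }
        }
        return violating;
    }
}
```

For person(id, dept, hod) rows, violations(rows, 1, 2) returns the dept values that appear with more than one hod. The same check could instead be pushed to a relational source as a query that groups by dept and keeps the groups having more than one distinct hod value.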

The tool would have to be implemented in Java in order to use the AutoMed API.

For other projects, see Peter McBrien's teaching page