AutoMed Publications

Updated 9th October 2018

  1. P.J. McBrien and A. Poulovassilis,
    Towards Data Visualisation Based on Conceptual Modelling, ER 2018, Pages 91-99 (Longer Version) (Presentation)

    Abstract: Selecting data, transformations and visual encodings in current data visualisation tools is undertaken at a relatively low level of abstraction - namely, on tables of data - and ignores the conceptual model of the data. Domain experts, who are likely to be familiar with the conceptual model of their data, may find it hard to understand tabular data representations, and hence hard to select appropriate data transformations and visualisations to meet their exploration or question-answering needs. We propose an approach that addresses these problems by defining a set of visualisation schema patterns that each characterise a group of commonly-used data visualisations, and by using knowledge of the conceptual schema of the underlying data source to create mappings between it and the visualisation schema patterns. To our knowledge, this is the first work to propose a conceptual modelling approach to matching data and visualisations.

  2. R. Brownlow and A. Poulovassilis,
    Intersection Schemas as a Dataspace Integration Technique. EDBT/ICDT Workshops 2014, Pages 92-99

    Abstract: This paper introduces the concept of Intersection Schemas in the field of heterogeneous data integration and dataspaces. We introduce a technique for incrementally integrating heterogeneous data sources by specifying semantic overlaps between sets of extensional schemas using bidirectional schema transformations, and automatically combining them into a global schema at each iteration of the integration process. We propose an incremental data integration methodology that uses this technique and that aims to reduce the amount of up-front effort required. Such approaches to data integration are often described as pay-as-you-go. A demonstrator of our technique is described, which utilizes a new graphical user tool implemented using the AutoMed heterogeneous data integration system. A case study is also described, and our technique and integration methodology are compared with a classical data integration strategy.

  3. N.J. Martin, A. Poulovassilis and J. Wang
    A Methodology and Architecture Embedding Quality Assessment in Data Integration. J. Data and Information Quality 4(4): 17:1-17:40, 2014

    Abstract: Data integration aims to combine heterogeneous information sources and to provide interfaces for accessing the integrated resource. Data integration is a collaborative task that may involve many people with different degrees of experience, knowledge of the application domain, and expectations relating to the integrated resource. It may be difficult to determine and control the quality of an integrated resource due to these factors. In this article, we propose a data integration methodology that has embedded within it iterative quality assessment and improvement of the integrated resource. We also propose an architecture for the realisation of this methodology. The quality assessment is based on an ontology representation of different users' quality requirements and of the main elements of the integrated resource. We use description logic as the formal basis for reasoning about users' quality requirements and for validating that an integrated resource satisfies these requirements. We define quality factors and associated metrics which enable the quality of alternative global schemas for an integrated resource to be assessed quantitatively, and hence the improvement which results from the refinement of a global schema following our methodology to be measured. We evaluate our approach through a large-scale real-life case study in biological data integration in which an integrated resource is constructed from three autonomous proteomics data sources.

  4. Jianing Wang, Nigel J. Martin, Alexandra Poulovassilis
    An Ontology-Based Quality Framework for Data Integration. BIR Workshops: Pages 196-208, 2011

    Abstract: The data integration (DI) process involves multiple users with roles such as administrators, integrators and end-users, each of whom may have requirements which have an impact on the overall quality of an integrated resource. Users' requirements may conflict with each other, and so a quality framework for the DI context has to be capable of representing the variety of such requirements and of providing mechanisms to detect and resolve the possible inconsistencies between them. This paper presents a framework for the specification of DI quality criteria and associated user requirements. This is underpinned by a Description Logic formalisation with associated reasoning capabilities which enables a DI setting to be tested to identify those elements that are inconsistent with users' requirements. The application of the framework is illustrated with an example showing how it can be used to improve the quality of an integrated resource.

  5. L. Zamboulis, N.J. Martin, A. Poulovassilis,
    Query performance evaluation of an architecture for fine-grained integration of heterogeneous grid data sources, Future Generation Computer Systems, Volume 26, Number 8, 2010, Pages 1073-1091

    Abstract Grid data sources may have schema- and data-level conflicts that need to be addressed using data transformation and integration technologies not supported by the current generation of Grid data access and querying middleware. We present an architecture that combines Grid data access and distributed querying with fine-grained data transformation/integration technologies, and the results of a query performance evaluation on this architecture. The performance evaluation indicates that it is indeed feasible to combine such technologies while achieving acceptable query performance. We also discuss the significance of our results for the further development of query performance over heterogeneous Grid data sources.

  6. P.J. McBrien, N. Rizopoulos, and A.C. Smith,
    SQOWL: Type Inference in an RDBMS (PostScript)
    Proceedings of ER10

    Abstract In this paper we describe a method to perform type inference over data stored in an RDBMS, where rules over the data are specified using OWL-DL. Since OWL-DL is an implementation of the Description Logic (DL) called SHOIN(D), we are in effect implementing a method for SHOIN(D) reasoning in relational databases. Reasoning may be broken down into two processes of classification and type inference. Classification may be performed efficiently by a number of existing reasoners, and since classification alters the schema, it need only be performed once for any given relational schema as a preprocessor of the schema before creation of a database schema. However, type inference needs to be performed for each data value added to the database, and hence needs to be more tightly coupled with the database system. We propose a technique to meet this requirement based on the use of triggers, which is the first technique to fully implement SHOIN(D) as part of normal transaction processing.
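
    The trigger-based coupling described above can be pictured with a small sketch. The following is a minimal, hypothetical illustration of the general idea only - not the SQOWL implementation itself - assuming PostgreSQL 11+, unary class tables person(id) and student(id) with id as primary key, and a single axiom stating that Student is a subclass of Person; asserting a student then causes the corresponding person fact to be inferred within the same transaction.

        // Hypothetical sketch only, not the SQOWL code: install a trigger that
        // performs type inference for one subclass axiom (Student subclass-of Person).
        // Assumes PostgreSQL 11+ and tables person(id), student(id) with id as primary key.
        import java.sql.Connection;
        import java.sql.DriverManager;
        import java.sql.Statement;

        public class SubclassTriggerSketch {
            public static void main(String[] args) throws Exception {
                // Connection details are placeholders.
                try (Connection c = DriverManager.getConnection(
                        "jdbc:postgresql://localhost/onto", "user", "password");
                     Statement s = c.createStatement()) {

                    // Trigger function: every tuple asserted to be a student is
                    // also inferred to be a person (subsumption as type inference).
                    s.execute(
                        "CREATE OR REPLACE FUNCTION infer_person() RETURNS trigger AS $$ " +
                        "BEGIN " +
                        "  INSERT INTO person(id) VALUES (NEW.id) ON CONFLICT DO NOTHING; " +
                        "  RETURN NEW; " +
                        "END; $$ LANGUAGE plpgsql;");

                    // Fire the inference as part of normal transaction processing.
                    s.execute(
                        "CREATE TRIGGER student_isa_person AFTER INSERT ON student " +
                        "FOR EACH ROW EXECUTE FUNCTION infer_person();");
                }
            }
        }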

  7. N. Rizopoulos,
    Schema Matching and Schema Merging based on Uncertain Semantic Mappings
    PhD Thesis, 2010, Imperial College London

    Abstract This dissertation lies in the research area of schema integration: the problem of combining the data of different data sources by creating a unified representation of these data. Two core issues in schema integration are schema matching, i.e. the identification of correspondences, or mappings, between input schema objects, and schema merging, i.e. the creation of a unified schema based on the identified mappings. Examples of mappings found in the literature include semantic mappings, e.g. author represents the same concept as writer, and data mappings, e.g. each data value of name is equal to the concatenation of a first-name value and a last-name value. In this dissertation, we propose a schema integration framework which (1) is only concerned with semantic mappings (that associate schema objects based on simple set-based comparisons of the objects' instances) and which (2) explicitly represents and manages the uncertainty as to which semantic relationship is the correct one to use in any mapping. In our framework, we adopt a wide set of semantic mappings that allow for a precise, rigorous and formal schema merging process. Our merging process produces a sound and complete integrated schema for each pair of input schemas, and in addition it generates view definitions between the input schemas and the integrated schema.

  8. A.C. Smith,
    A Schema Transformation Based Approach to Generic Model Management
    PhD Thesis, 2010, Imperial College London

    Abstract Model Management (MM) is an approach that provides a way of overcoming the problems with data-level solutions in metadata-intensive application areas. The motivation behind MM is to raise the level of abstraction in these application areas from the data level to the schema level. The key idea is to develop a set of operators that can be applied to schemas, and the mappings between them, as a whole rather than to individual data elements. The operators should be applicable to a wide range of problems in database management and work on schemas and mappings specified in a wide range of DDLs. Solutions to database management problems can then be specified at a high level of abstraction by combining these operators into a concise and reusable script.

    A system that implements the MM operators is called a Model Management System (MMS). Two key abstractions are required for such a system: first, a common language that can describe the schemas from the different DDLs, called a Common Data Model (CDM), and second, a way of describing the mappings between those schemas, called a mapping language. The main contributions of this thesis are:

  9. P.J. McBrien and N. Rizopoulos,
    Schema Merging Based on Semantic Mappings, (PostScript)
    In Proceedings of BNCOD 2009, pages 192-198

    Abstract In model management, the Merge operator takes as input a pair of schemas, together with a set of mappings between their objects, and returns an integrated schema. In this paper we present a new approach to implementing the Merge operator based on semantic mappings between objects. Our approach improves upon previous work by (1) using formal low-level transformation rules that can be translated into higher-level rules and (2) examining a much wider range of semantic mappings between schema objects. Our precise mappings and rules enable us to automate Merge and provide a sound and complete framework where schemas are merged without any information loss or gain.
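
    To convey the flavour of such a Merge, the sketch below (Java 17 syntax) shows how a merge step might branch on the kind of semantic mapping holding between two schema objects. The mapping kinds and the actions taken are invented for illustration and are not the transformation rules defined in the paper.

        // Hypothetical sketch of a Merge step driven by semantic mappings between
        // pairs of schema objects; the mapping kinds and actions are illustrative.
        import java.util.List;

        public class MergeSketch {
            enum SemanticMapping { EQUIVALENT, SUBSUMES, SUBSUMED_BY, OVERLAPS, DISJOINT }

            record Mapping(String leftObject, String rightObject, SemanticMapping kind) {}

            static void merge(List<Mapping> mappings) {
                for (Mapping m : mappings) {
                    switch (m.kind()) {
                        case EQUIVALENT ->
                            // Keep one object and redirect the other onto it.
                            System.out.println("unify " + m.leftObject() + " with " + m.rightObject());
                        case SUBSUMES, SUBSUMED_BY ->
                            // Keep both objects and record an is-a constraint between them.
                            System.out.println("add subset constraint between " + m.leftObject() + " and " + m.rightObject());
                        case OVERLAPS ->
                            // Introduce a new object for the shared instances.
                            System.out.println("add intersection object for " + m.leftObject() + " and " + m.rightObject());
                        case DISJOINT ->
                            // Keep both objects and assert their disjointness.
                            System.out.println("add disjointness constraint between " + m.leftObject() + " and " + m.rightObject());
                    }
                }
            }

            public static void main(String[] args) {
                merge(List.of(new Mapping("author", "writer", SemanticMapping.EQUIVALENT)));
            }
        }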

  10. A.C. Smith, N. Rizopoulos and P.J. McBrien,
    AutoMed Model Management (PostScript)
    Proceedings of ER08, 542-543

    Abstract Model Management (MM) is a way of raising the level of abstraction in metadata intensive application areas. The key idea behind Model Management is to develop a set of generic algorithmic operators that work on schemas and mappings between schemas, rather than individual schema elements. In this demonstration we present a new approach to the implementation of MM operators based on schema transformation that provides some important advantages over existing methods.

  11. D.M. Le, A.C. Smith and P.J. McBrien,
    Robust Data Exchange for Unreliable P2P Networks (PostScript)
    In Proceedings of GRep08, DEXA08 Workshop Proceedings, Pages 352-356, ISBN 978-0-7695-3299-8

    Abstract The aim of this work is to provide a robust way for peers with heterogeneous data sources to exchange information in an unreliable network. We address this problem in two ways. Firstly, we define a set of application-layer data exchange protocols to facilitate the discovery of, and communication between, peers. Secondly, we provide a query processing component with a cache-driven query processor that allows nodes on the network to cache queries and their results on demand, and to use the data caches to give partial or complete answers to a query if the original data sources are unavailable.

  12. A.C. Smith and P.J. McBrien,
    A Generic Data Level Implementation of ModelGen (PostScript) (Presentation) (Bibtex)
    In Proceedings of BNCOD08,
    Pages 63-74, 2008, Springer-Verlag, LNCS 5071, ISBN-978-3-540-70503-1

    Abstract The model management operator ModelGen translates a schema expressed in one modelling language into an equivalent schema expressed in another modelling language, and in addition produces a mapping between those two schemas. This paper presents an implementation of ModelGen which in addition allows for the translation of data instances from the source to the target schema, and vice versa. The translation mechanism is distinctive from others in that it takes a generic approach that can be applied to any modelling language.

  13. A.C. Smith and P.J. McBrien,
    AutoModelGen: A Generic Data Level Implementation of ModelGen (PostScript) (Bibtex)
    In Proceedings of the Forum at the CAiSE'08 Conference,
    Pages 65-68

    Abstract The model management operator ModelGen translates a schema expressed in one modelling language into an equivalent schema expressed in another modelling language, and in addition produces a mapping between those two schemas. AutoModelGen is a generic data level implementation of ModelGen that meets these desiderata. Our approach is distinctive in that (i) it takes a generic approach that can be applied to any modelling language, and (ii) it does not rely on knowing the modelling language in which the source schema is expressed.

  14. D.M. Le, A.C. Smith and P.J. McBrien,
    Inter Model Data Integration in a P2P environment (PostScript)
    In Proceedings of DBISP2P07, co-located with VLDB07

    Abstract The wide range of data sources available today means that the integration of heterogeneous data sources is now a common and important problem. It is even more challenging in a P2P environment where peers often do not know in advance which schemas of other peers will suit their information needs and there is potentially a greater diversity of data modelling languages in use. In this paper, we propose a new approach to P2P inter model data integration which supports multiple data models whilst allowing peers the flexibility of choosing how to integrate their schemas.

  15. L. Zamboulis, A. Poulovassilis, and G. Roussos
    Flexible Data Integration and Ontology-Based Data Access to Medical Records (PostScript)
    In Proceedings of BIBE'08
    IEEE, October, 2008

    Abstract The ASSIST project aims to facilitate cervical cancer research by integrating medical records containing both phenotypic and genotypic data, and residing in different medical centres or hospitals. The goal of ASSIST is to enable the evaluation of medical hypotheses and the conduct of association studies in an intuitive manner, thereby allowing medical researchers to identify risk factors that can then be used at the point of care to identify women who are at high risk of developing cervical cancer.

    This paper presents the current status of the ASSIST medical knowledgebase. In particular, we discuss the challenges faced in constructing the ASSIST integrated resource and in enabling query processing through a domain ontology, and the solutions provided using the AutoMed heterogeneous data integration system. We focus on data cleansing issues, on data integration issues related to integrating relational medical data sources into an independent domain ontology and also on query processing. Of particular interest is the challenge of providing an easily maintainable integrated resource in a setting where the data sources and the domain ontology are developed independently and are therefore both highly likely to evolve over time.

  16. L. Zamboulis, A. Poulovassilis, J. Wang
    Ontology-Assisted Data Transformation and Integration (PostScript)
    In Proceedings of ODBIS'08

    Abstract Schema-based data transformation and integration (DTI) has been an active research area for some time, while more recent advances in ontologies have led to significant research in ontology-based DTI. These two approaches present some overlaps and some differences, and in this paper we investigate possible synergies between them. In particular, we show how ontologies can enhance schema-based DTI approaches by providing richer semantics for schema constructs. We also illustrate one way in which schema-based DTI approaches can be used together with ontology-based approaches in a heterogeneous data integration setting.

  17. L. Zamboulis, N. Martin, A. Poulovassilis,
    Bioinformatics Service Reconciliation By Heterogeneous Schema Transformation (PostScript)
    In Proceedings of DILS'07
    Springer-Verlag, LNBI 4544, Pages 89-104, 2007

    Abstract This paper focuses on the problem of bioinformatics service reconciliation in a generic and scalable manner so as to enhance interoperability in a highly evolving field. Using XML as a common representation format, but also supporting existing flat-file representation formats, we propose an approach for the scalable semi-automatic reconciliation of services, possibly invoked from within a scientific workflow tool. Service reconciliation may use the AutoMed heterogeneous data integration system as an intermediary service, or may use AutoMed to produce services that mediate between services. We discuss the application of our approach for the reconciliation of services in an example bioinformatics workflow. The main contribution of this research is an architecture for the scalable reconciliation of bioinformatics services.

  18. L. Zamboulis, N. Martin, A. Poulovassilis,
    A Uniform Approach to Workflow and Data Integration (PostScript)
    In Proceedings of U.K. e-Science All Hands Conference
    September, 2007

    Abstract Data integration in the life sciences requires resolution of conflicts arising from the heterogeneity of data resources and from incompatibilities between the inputs and outputs of services used in the analysis of the resources. This paper presents an approach that addresses these problems in a uniform way. We present results from the application of our approach for the integration of data resources and for the reconciliation of services within the ISPIDER and the BioMap bioinformatics projects. The ISPIDER integration setting demonstrates an architecture in which the AutoMed heterogeneous data integration system interoperates with grid access and query processing tools for the virtual integration of a number of proteomics resources, while the BioMap integration setting demonstrates the materialised integration of structured and semi-structured functional genomics resources using XML as a unifying data model. The service reconciliation scenario discusses the interoperation of the AutoMed system with a scientific workflow tool. The work presented here is part of the ISPIDER project, which aims to develop a platform using grid and e-science technologies for supporting in silico analyses of proteomics data.

  19. P.J. McBrien and A. Poulovassilis,
    P2P query reformulation over Both-as-View data transformation rules, (PostScript)
    In Proceedings of DBISP2P 2006
    Selected Papers, Springer Verlag LNCS, Volume 4125, Pages 310-322, 2006, ISBN 978-3-540-71660-0

    Abstract The both-as-view (BAV) approach to data integration has the advantage of specifying mappings between schemas in a bidirectional manner, so that once a BAV mapping has been established between two schemas, queries may be exchanged in either direction between the schemas. By defining public schemas shared between peers, this approach allows peers to exchange queries via a public schema without requiring any one peer to hold the public schema data.

    In this paper we discuss the reformulation of queries over BAV transformation pathways, and demonstrate the use of this reformulation in two modes of query processing. In the first mode, public schemas are shared between peers and queries posed on the public schema can be reformulated into queries over any data sources that have been mapped to the public schema. In the second, queries are posed on the schema of a data source, and are reformulated into queries on another data source via any public schema to which both data sources have been mapped.

  20. L. Zamboulis, H. Fan, K. Belhajjame, J. Siepen, A. Jones, N. Martin, A. Poulovassilis, S. Hubbard, S. M. Embury and N. W. Paton
    Data Access and Integration in the ISPIDER Proteomics Grid (PostScript)
    In Proceedings of DILS'06
    Springer-Verlag, LNBI 4075, Pages 3-18, 2006, ISSN: 0302-9743, ISBN-10 3-540-36593-1

    Abstract Grid computing has great potential for supporting the integration of complex, fast changing biological data repositories to enable distributed data analysis. One scenario where Grid computing has such potential is provided by proteomics resources which are rapidly being developed with the emergence of affordable, reliable methods to study the proteome. The protein identifications arising from these methods derive from multiple repositories which need to be integrated to enable uniform access to them. A number of technologies exist which enable these resources to be accessed in a Grid environment, but the independent development of these resources means that significant data integration challenges, such as heterogeneity and schema evolution, have to be met. This paper presents an architecture which supports the combined use of Grid data access (OGSA-DAI), Grid distributed querying (OGSA-DQP) and data integration (AutoMed) software tools to support distributed data analysis. We discuss the application of this architecture for the integration of several autonomous proteomics data resources.

  21. L. Zamboulis and A. Poulovassilis,
    Information Sharing for the Semantic Web - a Schema Transformation Approach (PostScript),
    Proceedings of DISWeb06, CAiSE06 Workshop Proceedings,
    Pages 275-289, 2006, Presses Universitaires de Namur, ISBN-13 978-2-87037-525-9

    Abstract This paper proposes a framework for transforming and integrating heterogeneous XML data sources, making use of known correspondences from them to ontologies expressed in the form of RDFS schemas. The paper first illustrates how correspondences to a single ontology can be exploited. The approach is then extended to the case where correspondences may refer to multiple ontologies, themselves interconnected via schema transformation rules. The contribution of this research is an XML-specific approach to the automatic transformation and integration of XML data, making use of RDFS ontologies as a 'semantic bridge'.

  22. A.C.Smith and P. McBrien,
    Inter Model Data Exchange of Type Information via a Common Type Hierarchy (PostScript),
    Proceedings of DISWeb06, CAiSE06 Workshop Proceedings,
    Pages 307-321, 2006, Presses Universitaires de Namur, ISBN-13 978-2-87037-525-9

    Abstract Data exchange between heterogeneous schemas is a difficult problem that becomes more acute if the source and target schemas are from different data models. The data type of the objects to be exchanged can be useful information that should be exploited to help the data exchange process. So far little has been done to take advantage of this in inter model data exchange. Using a common data model has been shown to be effective in data exchange in general. This work aims to show how the common data model approach can be useful specifically in exchanging type information by use of a common type hierarchy.

  23. Z. Bellahsène, C. Lazanitis, P.J. McBrien and N. Rizopoulos,
    iXPeer: Implementing layers of abstraction in P2P Schema Mapping using AutoMed (PostScript),
    Proceedings of IWI2006,
    2006

    Abstract The task of model based data integration becomes more complicated when the data sources to be integrated are distributed, heterogeneous, and large in number. One recent solution to the issues of distribution and scale is to perform data integration using peer-to-peer (P2P) networks. Current P2P data integration architectures have mostly been flat, only specifying mappings directly between peers. Some do form the schemas into hierarchies, but none provide any abstraction of the schemas. This paper describes a set of general purpose P2P meta-data and data exchange primitives provided by an extended version of the AutoMed toolkit, and uses the primitives to implement a new architecture called iXPeer. iXPeer deals with integration on several levels of abstraction, where the lower levels define precise mappings between data source schemas, but the higher levels are looser associations based on keywords.

  24. M. Boyd and P.J. McBrien,
    Comparing and Transforming Between Data Models via an Intermediate Hypergraph Data Model (PostScript),
    Journal on Data Semantics IV,
    Pages 69-109, Springer-Verlag, 2005, ISBN-13 978-3-540-31001-3, ISSN 0302-9743

    Abstract Data integration is frequently performed between heterogeneous data sources, requiring that not only a schema, but also the data modelling language in which that schema is represented must be transformed between one data source and another.

    This paper describes an extension to the hypergraph data model (HDM), used in the AutoMed data integration approach, that allows constraint constructs found in static data modelling languages to be represented by a small set of primitive constraint operators in the HDM. In addition, a set of five equivalence preserving transformation rules are defined that operate over this extended HDM. These transformation rules are shown to allow a bidirectional mapping to be defined between equivalent relational, ER, UML and ORM schemas.

    The approach we propose provides a precise framework in which to compare data modelling languages, and precisely identifies what semantics of a particular domain one data model may express that another data model may not express. The approach also forms the platform for further work in automating the process of transforming between different data modelling languages. The use of the both-as-view approach to data integration means that a bidirectional association is produced between schemas in the data modelling language. Hence a further advantage of the approach is that composition of data mappings may be performed such that mapping two schemas to one common schema will produce a bidirectional mapping between the original two data sources.
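
    To make the idea of constraints as primitive operators concrete, here is a small hypothetical sketch (Java 17 syntax) of a hypergraph-style schema: named nodes, edges connecting them, and constraints built from a handful of primitive operators. The operator names are illustrative stand-ins rather than the actual set defined for the extended HDM.

        // Illustrative sketch of a hypergraph-style schema with constraints expressed
        // through a small set of primitive operators; all names are hypothetical.
        import java.util.ArrayList;
        import java.util.List;

        public class HdmSketch {
            record Node(String name) {}
            record Edge(String name, List<Node> connects) {}
            enum ConstraintOp { MANDATORY, UNIQUE, INCLUSION, EXCLUSION }
            record Constraint(ConstraintOp op, List<Object> over) {}

            public static void main(String[] args) {
                Node person = new Node("person");
                Node name = new Node("name");
                Edge hasName = new Edge("person_name", List.of(person, name));

                // Every person participates in the person_name edge (mandatory),
                // and does so at most once (unique) - i.e. a 1:1 attribute.
                List<Constraint> constraints = new ArrayList<>();
                constraints.add(new Constraint(ConstraintOp.MANDATORY, List.of(person, hasName)));
                constraints.add(new Constraint(ConstraintOp.UNIQUE, List.of(person, hasName)));

                System.out.println(hasName + " carries " + constraints.size() + " constraints");
            }
        }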

  25. M. Magnani, N. Rizopoulos, P.J. McBrien and D. Montesi,
    Schema Integration based on Uncertain Semantic Mappings (PostScript),
    In Proceedings of ER05,
    LNCS Vol 3716, Pages 31-46, 2005, ISSN 0302-9743, ISBN-10 3-540-29389-2

    Abstract Schema integration is the activity of providing a unified representation of multiple data sources. The core problems in schema integration are: schema matching, i.e. the identification of correspondences, or mappings, between schema objects, and schema merging, i.e. the creation of a unified schema based on the identified mappings. Existing schema matching approaches attempt to identify, between each pair of objects, a single mapping of whose correctness they are 100% certain. However, this is impossible in general, so a human expert always has to validate or modify the result. In this paper, we propose a new schema integration approach in which the uncertainty in the identified mappings that is inherent in the schema matching process is explicitly represented; this uncertainty propagates to the schema merging process and is finally reflected in the resulting integrated schema.

  26. H. Fan
    Using Schema Transformation Pathways for Incremental View Maintenance (PostScript)
    In Proceedings of DaWak'05
    Springer-Verlag, LNCS XXX, Pages XXX, 2005

    Abstract With the increasing amount and diversity of information available on the Internet, there has been a huge growth in information systems that need to integrate data from distributed, heterogeneous data sources. Incrementally maintaining the integrated data is one of the problems being addressed in data warehousing research. This paper presents an incremental view maintenance approach based on schema transformation pathways. Our approach is not limited to one specific data model or query language, and would be useful in any data transformation/integration framework based on sequences of primitive schema transformations.

  27. M. Maibaum, L. Zamboulis, G. Rimon, C. Orengo, N. Martin and A. Poulovassilis
    Cluster based Integration of Heterogeneous Biological Databases using the AutoMed toolkit (PostScript)
    In Proceedings of DILS'05
    Springer-Verlag, LNCS 3615, Pages 191-207, 2005, ISSN: 0302-9743, ISBN-10 3-540-27967-9

    Abstract This paper presents an extensible architecture that can be used to support the integration of heterogeneous biological data sets. In our architecture, a clustering approach has been developed to support distributed biological data sources with inconsistent identification of biological objects. The architecture uses the AutoMed data integration toolkit to store the schemas of the data sources and the semi-automatically generated transformations from the source data into the data of an integrated warehouse. AutoMed supports bi-directional, extensible transformations which can be used to update the warehouse data as entities change, are added, or are deleted in the data sources. The transformations can also be used to support the addition or removal of entire data sources, or evolutions in the schemas of the data sources or of the warehouse itself. The results of using the architecture for the integration of existing genomic data sets are discussed.

  28. N. Rizopoulos, M. Magnani, P.J. McBrien and D. Montesi,
    Uncertainty in Semantic Schema Integration (PostScript),
    In Proceedings of BNCOD'05 Vol 2,
    Pages 13-16, Univ. Sunderland Press, 2005, ISBN 1-873757-55-7

    Abstract In this paper we present a new method of semantic schema integration, based on uncertain semantic mappings. The purpose of semantic schema integration is to produce a unified representation of multiple data sources. First, schema matching is performed to identify the semantic mappings between the schema objects. Then, an integrated schema is produced during the schema merging process based on the identified mappings. If all semantic mappings are known, schema merging can be performed (semi-)automatically.

  29. D. Williams
    Combining Data Integration and Information Extraction Techniques (PostScript)
    Workshop on Data Mining and Knowledge Discovery
    In Proceedings of BNCOD'05 Vol. 2
    Pages 96-101, University of Sunderland Press, 2005

  30. H. Fan and A. Poulovassilis
    Using Schema Transformation Pathways for Data Lineage Tracing (PostScript)
    In Proceedings of BNCOD'05 Vol. 1
    Springer, LNCS 3567, Pages 133-144, 2005

    Abstract With the increasing amount and diversity of information available on the Internet, there has been a huge growth in information systems that need to integrate data from distributed, heterogeneous data sources. Tracing the lineage of the integrated data is one of the problems being addressed in data warehousing research. This paper presents a data lineage tracing approach based on schema transformation pathways. Our approach is not limited to one specific data model or query language, and would be useful in any data transformation/integration framework based on sequences of primitive schema transformations.
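
    The step-wise tracing idea can be pictured with a toy, hypothetical sketch (Java 17 syntax): each pathway step that added a construct carries a function mapping a tuple of that construct back to candidate source tuples, and tracing simply walks the pathway in reverse. This is an illustration of the idea rather than the algorithms of the paper, and the example data is invented.

        // Toy sketch of lineage tracing over a transformation pathway; the step
        // representation and the example data are hypothetical.
        import java.util.List;
        import java.util.function.Function;

        public class LineageSketch {
            // A primitive step: the construct it added and a "reverse" function that
            // maps a tuple of that construct back to candidate source tuples.
            record AddStep(String construct, Function<String, List<String>> traceBack) {}

            static List<String> lineage(String tuple, String construct, List<AddStep> pathway) {
                List<String> current = List.of(tuple);
                // Walk the pathway from the integrated schema back towards the sources.
                for (int i = pathway.size() - 1; i >= 0; i--) {
                    AddStep step = pathway.get(i);
                    if (step.construct().equals(construct)) {
                        // A full tracer would also switch attention to the constructs
                        // mentioned in the step's query; this sketch keeps it flat.
                        current = current.stream()
                                         .flatMap(t -> step.traceBack().apply(t).stream())
                                         .toList();
                    }
                }
                return current;
            }

            public static void main(String[] args) {
                AddStep step = new AddStep("totalSales", t -> List.of("sale#17", "sale#42"));
                System.out.println(lineage("totalSales(59)", "totalSales", List.of(step)));
            }
        }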

  31. S. Kittivoravitkul and P.J. McBrien,
    Integrating Unnormalised Semi-Structured Data Sources (PostScript),
    In Proceedings of CAiSE05,
    Springer Verlag LNCS Vol 3520, Pages 460-474, 2005, ISBN-10 3-540-26095-1, ISBN-13 978-3-540-26095-0

    Abstract Semi-structured data sources, such as XML, HTML or CSV files, present special problems when performing data integration. In addition to the hierarchical structure of the semi-structured data, the data integration must deal with the redundancy in semi-structured data, where the same fact may be repeated in a data source, but should map into a single fact in a global integrated schema. We term semi-structured data containing such redundancy an unnormalised data source, and we define a normal form for semi-structured data that may be used when defining global schemas. We introduce special functions to relate object identifiers used in the global data model to object identifiers in unnormalised data sources, and demonstrate how to use these functions in query processing, update processing and integration of these data sources.

  32. N. Rizopoulos and P.J. McBrien,
    A General Approach to the Generation of Conceptual Model Transformations (PostScript),
    In Proceedings of CAiSE05,
    Springer Verlag LNCS Vol 3520, Pages 326-341, 2005, ISBN-10 3-540-26095-1, ISBN-13 978-3-540-26095-0

    Abstract In data integration, a Merge operator takes as input a pair of schemas in some conceptual modelling language, together with a set of correspondences between their constructs, and produces as an output a single integrated schema. In this paper we present a new approach to implementing the Merge operator that improves upon previous work by considering a wider range of correspondences between schema constructs and defining a generic and formal framework for the generation of schema transformations. This is used as a basis for deriving transformations over high level models. The approach is demonstrated in this paper to generate transformations for ER and relational models.

  33. H. Fan and A. Poulovassilis,
    Schema Evolution in Data Warehousing Environments - A Schema Transformation Based Approach (PostScript),
    In Proceedings of ER'04,
    Springer Verlag LNCS 3288, Pages 639-653, 2004, ISBN 3-540-23723-2

    Abstract In heterogeneous data warehousing environments, autonomous data sources are integrated into a materialised integrated database. The schemas of the data sources and the integrated database may be expressed in different modelling languages. It is possible for either the data source schemas or the warehouse schema to evolve. This evolution may include evolution of the schema, or evolution of the modelling language in which the schema is expressed, or both. In such scenarios, it is important for the integration framework to be evolvable, so that the previous integration effort can be reused as much as possible. This paper describes how the AutoMed heterogeneous data integration toolkit can be used to handle the problem of schema evolution in heterogeneous data warehousing environments. This problem has been addressed before for specific data models, but AutoMed has the ability to cater for multiple data models, and for changes to the data.

  34. D. Williams and A. Poulovassilis,
    An Example of the ESTEST Approach to Combining Unstructured Text and Structured Data (PostScript),
    In Proceedings of DEXA Workshops,
    Pages 191-195, 2004

  35. L. Zamboulis,
    XML Data Integration by Graph Restructuring (PostScript),
    In Proceedings of BNCOD'04,
    Springer Verlag LNCS 3112, Pages 57-71, 2004, ISBN 3-540-22382-7

    Abstract This paper describes the integration of XML data sources within the AutoMed heterogeneous data integration system. The paper presents a description of the overall framework, as well as an overview of and comparison with related work and implemented solutions by other researchers. The main contribution of this research is an algorithm for the integration of XML data sources, based on graph restructuring of their schemas.

  36. M. Boyd, S. Kittivoravitkul, C. Lazanitis, P.J. McBrien and N. Rizopoulos,
    AutoMed: A BAV Data Integration System for Heterogeneous Data Sources (PostScript),
    In Proceedings of CAiSE04,
    Springer Verlag LNCS Vol 3084, Pages 82-97, 2004, ISSN 0302-9743, ISBN 3-540-22151-4

    Abstract This paper describes the AutoMed repository and some associated tools, which provide the first implementation of the both as view (BAV) approach to data integration. Apart from being a highly expressive data integration approach, BAV in addition provides a method to support a wide range of data modelling languages, and describes transformations between those data modelling languages. This paper documents how BAV has been implemented in the AutoMed repository, and how several practical problems in data integration between heterogeneous data sources have been solved. We illustrate the implementation with examples in the relational, ER, and semi-structured data models.

  37. L. Zamboulis and A. Poulovassilis,
    Using AutoMed for XML Data Transformation and Integration (PostScript),
    In Proceedings of DIWeb'04,
    CAiSE Workshop Proceedings Volume 3, Pages 58-69, Ed. J. Grundspenkis and M. Kirikova

    Abstract This paper describes how the AutoMed data integration system is being extended to support the integration of heterogeneous XML documents. So far, the contributions of this research have been the development of two algorithms. One restructures the schema describing an XML document into another schema, and the other materialises an integrated schema resulting from the transformation of several source XML schemas, using the source XML data and the transformation pathways between the source and integrated schemas.

  38. M. Boyd and P. McBrien,
    Towards a Semi-Automated Approach to Intermodel Transformation (PostScript),
    In Proceedings of EMMSAD'04,
    CAiSE Workshop Proceedings Volume 1, Pages 175-188, Ed. J. Grundspenkis and M. Kirikova

    Abstract This paper introduces an extension to the hypergraph data model used in the AutoMed data integration approach that allows constraints common in static data modelling languages to be represented by a small set of primitive constraint operators. A set of equivalence rules are defined for this set of primitive constraint operators, and demonstrated to allow a mapping between relational, ER or UML models to be defined. The approach provides both a precise framework in which to compare data modelling languages, and forms the platform for further work in automating the process of transforming between different data modelling languages.

  39. E. Jasper, N. Tong, P. McBrien, and A. Poulovassilis,
    Generating and optimising Views from Both as View Data Integration Rules (pdf),
    In Proceedings of DBIS'04
    Scientific Papers of the University of Latvia Vol 972, Pages 13-30, ISSN 1407-2157, ISBN 9984-770-11-7

    This paper describes view generation and view optimisation in the AutoMed heterogeneous data integration framework. In AutoMed, schema integration is based on the use of reversible schema transformation sequences. We show how views can be generated from such sequences, for global-as-view (GAV), local-as-view (LAV) and GLAV query processing. We also present techniques for optimising these generated views, firstly by optimising the transformation sequences, and secondly by optimising the view definitions generated from them.

  40. N. Rizopoulos
    Automatic discovery of semantic relationships between schema elements
    In Proc. of 6th ICEIS 2004

    Abstract The identification of semantic relationships between schema elements, or schema matching, is the initial step in the integration of data sources. Existing approaches in automatic schema matching have mainly been concerned with discovering equivalence relationships between elements. In this paper, we present an approach to automatically discover richer and more expressive semantic relationships based on a bidirectional comparison of the elements' data and metadata. The experiments that we have performed on real-world data sources from several domains show promising results, considering the fact that we do not rely on any user or external knowledge.

  41. H. Fan and A. Poulovassilis
    Using AutoMed MetaData in Data Warehousing Environments (PostScript)
    In Proceedings of DOLAP'03

    Abstract What kind of metadata can be used for expressing the multiplicity of data models and the data transformation and integration processes in data warehousing environments? How can this metadata be further used for supporting other data warehouse activities? We examine how these questions are addressed by AutoMed, a system for expressing data transformation and integration processes in heterogeneous database environments.

  42. D. Williams and A. Poulovassilis,
    Combining Data Integration with Natural Language Technology for the Semantic Web (PostScript),
    In Proceedings of the Workshop on Human Language Technology for the Semantic Web and Web Services, at ISWC'03, 2003

    Abstract Current data integration systems allow a variety of heterogeneous structured or semi-structured data sources to be combined and queried by providing an integrated view over them. The Semantic Web also requires us to be able to integrate information from a variety of heterogeneous information sources. However, these information sources will also include natural language (e.g. web pages) and ontologies. In this paper we describe an architecture which combines the data integration approach with techniques from Information Extraction in order to allow information from ontologies and natural language sources to be integrated with other, semantically related, structured or semi-structured data. Our architecture is based on the AutoMed data integration system, and in this paper we describe several extensions which have been made to AutoMed in order to support this work, including adding RDF to the data sources supported by AutoMed and providing a repository for the data and metadata discovered by the Information Extraction process.

  43. P.J. McBrien and A. Poulovassilis,
    Defining Peer-to-Peer Data Integration using Both as View Rules, (PostScript)
    In Proceedings of DBISP2P 2003 Revised Papers,
    Springer Verlag LNCS, Volume 2944, Pages 91-107, 2004, ISBN 3-540-20968-9

    Abstract The loose and dynamic association between peers in a peer-to-peer integration has meant that, to date, peer-to-peer systems have been based on exchange of files identified with a very limited set of attributes, and no schema is used to describe the data within those files. This paper extends an existing approach to data integration, called both-as-view, to be an efficient mechanism for defining peer-to-peer integration at the schema level, and demonstrates how the data integration can be used for the exchange of messages and queries between peers.

  44. N. Tong,
    Database Schema Transformation Optimisation Techniques for the AutoMed System (PostScript),
    In Proceedings of BNCOD20, Springer Verlag LNCS, Volume 2712, Pages 157-171, 2003

    Abstract AutoMed is a database integration system that is designed to support the integration of schemas expressed in a variety of high-level conceptual modelling languages. It is based on the idea of expressing transformations of schemas as a sequence of primitive transformation steps, each of which is a bi-directional mapping between schemas. To become an efficient schema integration system in practice, where the number and size of schemas involved in the integration may be very large, the amount of time spent on the evaluation of transformations must be reduced to a minimal level. It is also important that the integrity of a set of transformations is maintained during the process of transformation optimisation. This paper discusses a new representation of schema transformations which facilitates the verification of the well-formedness of transformation sequences, and the optimisation of transformation sequences.

  45. E. Jasper, N. Tong, P.J. McBrien and A. Poulovassilis,
    View Generation and Optimisation in the AutoMed Data Integration Framework (PostScript),
    In Proceedings of CAiSE03 Forum,
    Editors: J. Eder and T. Welzer, Univ. of Maribor Press, Pages 29-32, 2003, ISBN 86-435-0549-8

    Abstract In AutoMed, data integration is based on the use of reversible sequences of schema transformations. We discuss how views can be generated from these sequences. We also discuss techniques for optimising the views, firstly by simplifying the transformation sequences, and secondly by optimising the view definitions generated from them.

  46. P.J. McBrien and A. Poulovassilis,
    Data Integration by Bi-Directional Schema Transformation Rules (PostScript),
    In Proceedings of ICDE03,
    IEEE, Pages 227-238, 2003, ISBN 0-7803-7665-X

    Abstract In this paper we describe a new approach to data integration which subsumes the previous approaches of local as view (LAV) and global as view (GAV). Our method, which we term both as view (BAV), is based on the use of reversible schema transformation sequences. We show how LAV and GAV view definitions can be fully derived from BAV schema transformation sequences, and how BAV transformation sequences may be partially derived from LAV or GAV view definitions. We also show how BAV supports the evolution of both global and local schemas, and we discuss ongoing implementation of the BAV approach within the AutoMed project.
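
    To convey the flavour of a reversible transformation sequence, the sketch below (Java 17 syntax) represents a pathway as primitive add/delete/rename steps, each carrying a defining query; reading the pathway forwards gives the global-schema (GAV-style) definitions, while reversing each step gives the opposite direction. The representation and the toy schema are invented for illustration; only the general shape of the primitive steps follows the BAV papers.

        // Hypothetical sketch of a BAV-style pathway of primitive, reversible steps.
        import java.util.List;

        public class BavPathwaySketch {
            enum Kind { ADD, DELETE, RENAME }

            record Step(Kind kind, String construct, String query) {
                Step reverse() {
                    // add and delete are each other's inverses; rename swaps its arguments.
                    return switch (kind) {
                        case ADD    -> new Step(Kind.DELETE, construct, query);
                        case DELETE -> new Step(Kind.ADD, construct, query);
                        case RENAME -> new Step(Kind.RENAME, query, construct);
                    };
                }
            }

            public static void main(String[] args) {
                // A toy pathway from a source schema with staff and student constructs
                // to a global schema with a single person construct (all names invented).
                List<Step> pathway = List.of(
                    new Step(Kind.ADD, "person", "staff UNION student"),
                    new Step(Kind.DELETE, "staff", "person WHERE role = 'staff'"),
                    new Step(Kind.DELETE, "student", "person WHERE role = 'student'"));

                // Forwards: the ADD step carries the GAV-style definition of person.
                pathway.forEach(System.out::println);

                // Backwards: reversing each step yields the pathway in the other direction.
                for (int i = pathway.size() - 1; i >= 0; i--) {
                    System.out.println(pathway.get(i).reverse());
                }
            }
        }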

  47. H. Fan and A. Poulovassilis
    Tracing Data Lineage Using Schema Transformation Pathways
    In Knowledge Transformation for the Semantic Web
    Editors: B. Omelayenko and M. Klein
    IOS Press, 2003

  48. D. Theodoratos
    Semantic Integration and Querying of Heterogeneous Data Sources Using a Hypergraph Data Model
    In Proceedings of BNCOD19
    Springer Verlag LNCS, Volume 2405, Pages 166-182, 2002

    Abstract Information integration in the World-Wide-Web has evolved to a new framework where the information is represented and manipulated using a wide range of modelling languages. Current approaches to data integration use wrappers to convert the different modelling languages into a common data model. In this work we use a nested hypergraph based data model (called HDM) as a common data model for integrating different structured or semi-structured data. We present a hypergraph query language (HQL) that allows the integration of the wrapped data sources through the creation of views for mediators, and the querying of the wrapped data sources and the mediator views by the end users. We also show that the HQL queries (views) can be constructed from other views and/or source schemas using a set of primitive transformations. Our integration architecture is flexible and allows some (or all) of the views in a mediator to be materialized.

  49. H. Fan
    Data Lineage Tracing Using Automed Schema Transformation Pathways
    In Proceedings of BNCOD19
    Springer Verlag LNCS, Volume 2405, Pages 50-53, 2002

    Abstract With the increasing amount and diversity of information available on the Internet, there has been a huge growth in information systems that need to integrate data from distributed, heterogeneous data sources. The problem of data lineage is one of the focuses of research in data warehousing. In this paper, we discuss the classification of data lineage and propose a new approach to trace data lineage using AutoMed's schema transformation pathways. We discuss how the individual transformation steps in an AutoMed transformation pathway can be used to trace the derivation of data in the integrated database in a step-wise fashion, thus simplifying the lineage tracing process.

  50. E. Jasper
    Global Query Processing in the AutoMed Heterogeneous Database Environment
    In Proceedings of BNCOD19
    Springer Verlag LNCS, Volume 2405, Pages 46-49, 2002

    Abstract One of the requirements for implementing a heterogeneous database system is global query processing. The global query processor takes as input a query expressed on the constructs of a global schema of the heterogeneous database system. This query must be translated so that it is expressed over the constructs of the local database schemata. Appropriate local sub-queries are then sent to the local data sources for processing there. The global query processor must then take the results it receives from the local data sources and perform any necessary post-processing to obtain the result of the originally submitted global query.

    This paper discusses global query processing within the AutoMed heterogeneous database environment, and describes the first version of the architecture of AutoMed's Global Query Processor and the implementation that has been completed so far (May 2002).

    The AutoMed approach to heterogeneous database systems is a novel one and so in turn our work on global query processing approaches the problem from a novel standpoint. In AutoMed, a low level data model - the hypergraph data model - is used as the common data model. Furthermore, the AutoMed schema integration approach is based on schema transformations rather than view definitions. While some advantages of using AutoMed's automatically reversible schema transformation approach over view definition approaches are documented, it is an open question what benefits, if any, will arise from a global query processing point of view, and our longer term aim will be to investigate this question.

  51. H. Fan and A. Poulovassilis
    Tracing Data Lineage Using Schema Transformation Pathways
    In Proceedings of the Workshop on Knowledge Transformation for the Semantic Web at ECAI-2002, Lyon, France, July 2002

    Abstract A data warehouse consists of a set of materialized views defined over a number of data sources. It collects copies of data from remote, distributed, autonomous and heterogeneous data sources into a central repository to enable analysis and mining of the integrated information. Data warehousing is popularly used for on-line analytical processing (OLAP), decision support systems, on-line information publishing and retrieving, and digital libraries. However, sometimes what we need is not only to analyse the data in the warehouse, but also to investigate how certain warehouse information was derived from the data sources. Given a tuple t in the warehouse, finding the set of source data items from which t was derived is termed the data lineage problem.

    In this paper, we propose a new approach for tracing data lineage based on schema transformation pathways. We show how the individual transformation steps in a transformation pathway can be used to trace the derivation of the integrated data in a step-wise fashion, thus simplifying the lineage tracing process. We present definitions for data lineage based on why-provenance and where-provenance, which we term affect-pool and origin-pool, respectively, and we give algorithms for tracing the affect-pool and the origin-pool for data derived from sequences of schema transformations.

  52. M. Boyd, P.J. McBrien and N. Tong
    The AutoMed Schema Integration Repository
    In Proceedings of BNCOD19
    Springer Verlag LNCS, Volume 2405, Pages 42-45, 2002

    Abstract In this paper we describe the first version of the repository of the AutoMed toolkit. This is a Java API that uses an RDBMS to provide persistent storage for data modelling language descriptions in the HDM, database schemas, and transformations between those schemas. The repository also provides some of the shared functionality that tools accessing the repository may require.

    The AutoMed repository has two logical components, accessed via one API. The model definitions repository (MDR) allows for the description of how a data modelling language is represented as combinations of nodes, edges and constraints in the HDM. It is used by AutoMed `experts' to configure AutoMed so that it can handle a particular data modelling language. The schema transformation repository (STR) allows for schemas to be defined in terms of the data modelling concepts in the MDR. It also allows for transformations to be specified between those schemas. Most AutoMed tools and users will be concerned with editing this repository, as new databases are added to the AutoMed repository, or those databases evolve.

    This paper outlines how the MDR and STR APIs function, and gives an example of how to use the repository to describe two schemas in a variant of the ER modelling language, together with a sequence of transformations which map between the two schemas.
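
    A rough impression of how client code might interact with the two repository components is sketched below; the data structures and names are invented purely for illustration and are not the actual AutoMed repository API.

        // Hypothetical in-memory stand-in for an MDR (model definitions) and an STR
        // (schemas and transformations); not the real AutoMed API.
        import java.util.ArrayList;
        import java.util.HashMap;
        import java.util.List;
        import java.util.Map;

        public class RepositorySketch {
            // MDR: how each modelling language construct is represented in the HDM.
            static final Map<String, List<String>> modelDefinitions = new HashMap<>();
            // STR: named schemas and the primitive transformation steps between them.
            static final Map<String, String> schemas = new HashMap<>();
            static final List<String> pathway = new ArrayList<>();

            public static void main(String[] args) {
                // Tell the MDR that an ER "entity" is represented as an HDM node.
                modelDefinitions.put("er:entity", List.of("node"));

                // Register two schemas in the STR and one step of the pathway between them.
                schemas.put("staff_db", "er");
                schemas.put("global", "er");
                pathway.add("addEntity(person, ...)"); // defining query elided in this sketch

                System.out.println(schemas + " " + pathway);
            }
        }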

  53. P.J. McBrien and A. Poulovassilis,
    Schema Evolution in Heterogeneous Database Architectures, A Schema Transformation Approach, (PostScript)
    In Proceedings of CAiSE02,
    Springer Verlag LNCS, Volume 2348, Pages 484-499, 2002

    Abstract This paper presents a new approach to schema evolution, which combines the activities of schema integration and schema evolution into one framework. In previous work we have developed a general framework to support schema transformation and integration in heterogeneous database architectures. The framework consists of a hypergraph-based data model and a set of primitive schema transformations defined for this model. Higher-level data models and primitive schema transformations for them can be defined in terms of this lower-level model. A key feature of the framework is that both primitive and composite schema transformations are automatically reversible. We have shown in earlier work how this allows automatic query translation from a global schema to a set of source schemas. This paper shows how this framework also readily supports evolution of source schemas, allowing the global schema and the query translation pathways to be easily repaired, as opposed to having to be regenerated, after changes to source schemas.

  54. P.J. McBrien and A. Poulovassilis,
    A Semantic Approach to Integrating XML and Structured Data Sources,
    In Proceedings of CAiSE01,
    Springer Verlag LNCS, Volume 2068, Pages 330-345, 2001

    Abstract XML is fast becoming the standard for information exchange on the WWW. As such, information expressed in XML will need to be integrated with existing information systems, which are mostly based on structured data models such as relational, object-oriented or object/relational data models. This paper shows how our previous framework for integrating heterogeneous structured data sources can also be used for integrating XML data sources with each other and/or with other structured data sources. Our framework allows constructs from multiple modelling languages to co-exist within the same intermediate schema, and allows automatic translation of data, queries and updates between semantically equivalent or overlapping heterogeneous schemas.

Student Projects on AutoMed

  1. D. Fourkiotis,
    Implementation of parallel and distributed query processing in the AutoMed heterogeneous data integration toolkit,
    MSc Project Report, Birkbeck College, University of London, 2007

    Abstract: AutoMed is a heterogeneous data transformation and integration system, capable of handling virtual, materialized and indeed hybrid data integration across multiple data models. Queries submitted in a global schema are transformed and evaluated against a set of data sources, exploiting the functionality of the various components of the AutoMed Global Query Processor (AQP). Our motivation for this project is to enhance the AQP functionality, adding parallel and distributed query processing, in order to improve its overall performance.

  2. J. Walters,
    Implementation of an SQL to IQL query translation component for the AutoMed toolkit,
    MSc Project Report, Birkbeck College, University of London, 2007

    Abstract: IQL, AutoMed's query language, presents itself as a potential barrier to users. As a functional language it is more expressive than SQL but is also more complicated to learn and understand. Its lower-level implementation, however, allows it to be readily translated to and from higher-level query languages. This project, through background research, sought to establish the most suitable approach towards implementing a solution for translating input queries from SQL to IQL. As the most suitable approach, JavaCC, a parser generator, was used to specify the grammar and translation rules. The complexity of these rules varied with the types of queries being translated; however, the underlying support for relational algebra in both IQL and SQL created a common base for the translation process. Although the resulting translator supports only a subset of SQL queries, its modularity allows it to be readily extended to support any subset of ANSI-compliant SQL.
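
    As a toy illustration of the kind of mapping such a translator performs, the sketch below turns a single-table SELECT into a comprehension-style string. The comprehension only approximates IQL's syntax, and this string-level translation merely stands in for the JavaCC-generated parser described above.

        // Toy sketch: translating one restricted SQL form into a comprehension-style
        // query string; the target syntax only approximates IQL.
        public class SqlToIqlSketch {
            static String translate(String column, String table, String whereAttr, String whereValue) {
                // SELECT column FROM table WHERE whereAttr = 'whereValue'
                // becomes a list comprehension over the table's extent.
                return "[" + column + " | {" + column + ", " + whereAttr + "} <- <<" + table + ">>; "
                        + whereAttr + " = '" + whereValue + "']";
            }

            public static void main(String[] args) {
                System.out.println(translate("name", "person", "dept", "cs"));
                // prints: [name | {name, dept} <- <<person>>; dept = 'cs']
            }
        }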

  3. C. Lazanitis,
    Schema Based Peer-to-Peer Data Integration,
    MEng Project Report, Imperial College London, 2005

    Abstract: Peer to peer data integration (P2PDI) is the process of integrating heterogeneous data sources in a peer to peer (P2P) network. The traditional approach to P2PDI achieves integration by defining pairwise mappings between the schemas of peers, something that we believe limits its scalability.

    This report describes the analysis, design and implementation of a prototype P2PDI system that uses the BAV approach. In this system, peers can make schemas public, and peers in the network can provide pathways to any of the public schemas. Integration between two schemas is achieved when both of them have a pathway to some published schema.

  4. A.C. Smith,
    Translating between XML and Relational Databases using XML Schema and AutoMed (PostScript),
    MSc Thesis, Dept. of Computing, Imperial College, September 2004

    Abstract: XML and relational databases are currently two of the most important mechanisms for storing and transferring data. A reliable and flexible way of moving data between them is a very desirable goal. The way data is stored in each is very different, which makes the translation process difficult. To abstract some of these differences away, a low-level common data model can be used. To successfully move data from one model to the other, a way of describing the schema is needed. Until recently there was no widely accepted way of doing this for XML; XML Schema has now taken on this role. This project takes XML conforming to XML Schema definitions and transforms it into relational databases via the low-level modelling language HDM. In the other direction, a relational database is transformed into an XML Schema document and an XML instance document containing the data from the database. The transformations are done within the AutoMed framework, providing a sound theoretical basis for the work. A visual tool that represents the XML Schema in a tree structure and allows some manipulation of the schema is also described.

    The software described in the report is part of the AutoMed API. XML Schemas may be read using the uk.ac.ic.doc.automed.wrappers.XMLSchemaWrapper class.
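
    For intuition, an HDM schema consists of nodes, edges and constraints. The Java sketch below is a hypothetical simplification (invented names, not the AutoMed API classes) of how a relational table or an XML element nesting might be lowered onto such a structure: roughly, a node for the table, a node per column, unnamed edges linking them, and a constraint recording the key, with element nesting represented by an edge between element nodes.

        import java.util.*;

        // Hypothetical sketch of an HDM-like schema; names are invented and do not
        // correspond to the AutoMed API.
        record Node(String name) {}
        record Edge(String name, Node from, Node to) {}
        record Constraint(String expression) {}

        class HdmSchema {
            final Set<Node> nodes = new LinkedHashSet<>();
            final Set<Edge> edges = new LinkedHashSet<>();
            final Set<Constraint> constraints = new LinkedHashSet<>();

            // Roughly how a relational table might be lowered onto the HDM: a node for
            // the table, a node per column, an unnamed edge linking them, plus a key constraint.
            void addTable(String table, List<String> columns, List<String> keyColumns) {
                Node t = new Node("<<" + table + ">>");
                nodes.add(t);
                for (String c : columns) {
                    Node col = new Node("<<" + table + ":" + c + ">>");
                    nodes.add(col);
                    edges.add(new Edge("<<_," + table + "," + table + ":" + c + ">>", t, col));
                }
                constraints.add(new Constraint(table + " is keyed by " + keyColumns));
            }

            // Roughly how an XML element with a child element might be lowered: nodes for
            // parent and child, and an edge for the nesting relationship.
            void addElementNesting(String parent, String child) {
                Node p = new Node("<<" + parent + ">>");
                Node c = new Node("<<" + child + ">>");
                nodes.add(p);
                nodes.add(c);
                edges.add(new Edge("<<" + parent + "," + child + ">>", p, c));
            }
        }

        class HdmDemo {
            public static void main(String[] args) {
                HdmSchema s = new HdmSchema();
                s.addTable("person", List.of("id", "name"), List.of("id"));
                s.addElementNesting("library", "book");
                System.out.println(s.nodes);
                System.out.println(s.edges);
            }
        }

    Because both source models are reduced to the same three kinds of construct, the XML-to-relational and relational-to-XML directions described in the abstract become traversals of one shared representation rather than two bespoke converters.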

  5. M. Asaria,
    Translating Between XML and Relational Databases
    MEng Project, Dept. of Computing, Imperial College, June 2002

    Abstract This project aims to implement a framework for inter-model transformations of data, built on the Hypergraph Data Model (HDM). The work takes two data models as examples, XML and Relational Databases, and shows how they can be implemented within this framework.

  6. N. Rizopoulos,
    Database Schema Integration Tool,
    MSc Thesis, Dept. of Computing, Imperial College, September 2001

    Abstract Several database schema integration methodologies defining transformations between equivalent schemas have been proposed in the literature. This project adopts one of these methodologies and develops upon it to define a way of representing composite transformations between database schemas. A simple, high-level Application Program Interface (API) has been implemented that enables the definition of complex transformations and their execution on existing schemas. Finally, a graphical representation language is recommended that provides a data-model independent way of displaying database schemas.

    This project succeeds in defining templates of composite transformations on database schemas, i.e. transformations irrespective of existing schemas. It employs a language that is model and schema independent and provides a way of expressing composite transformations in this language using the developed API. Essentially, this project extends the work described in [BT01] that enables the description of models, database schemas and primitive transformations.

    A graphical tool has been built that demonstrates the utilization of the graphical representation language and allows the execution of template composite transformations on existing database schemas.

  7. E. Jasper,
    Query Translation in Heterogeneous Database Environments,
    MSc Thesis, Dept. of Computing, Birkbeck College, September 2001

    Abstract One of the key capabilities required to implement a heterogeneous database system is global query processing. Global queries expressed on the global schema must be translated into queries that can be processed by the local databases. Using a low-level graph-based data model as the common data model, this project investigates the automatic translation of such global queries. This automatic translation is accomplished by using the transformation pathways from the local schemas to the global schema.

    The goal of this project is to produce software to translate global queries into queries expressed in terms of the local database constructs. A secondary goal is to produce software to optimise these translated queries for faster execution.

Previous Related Publications by Project Members

  1. P.J. McBrien and A. Poulovassilis,
    Automatic migration and wrapping of database applications - a schema transformation approach,
    In Proceedings of ER99
    Springer Verlag LNCS 1728, 96-113, 1999

    Abstract Integration of heterogeneous databases requires that semantic differences between schemas are resolved by a process of schema transformation. Previously, we have developed a general framework to support the schema transformation process, consisting of a hypergraph-based common data model and a set of primitive schema transformations defined for this model. Higher-level common data models and primitive schema transformations for them can be defined in terms of this lower-level model.

    In this paper, we show that a key feature of our framework is that both primitive and composite schema transformations are automatically reversible. We show how these transformations can be used to automatically migrate or wrap data, queries and updates between semantically equivalent schemas. We also show how to handle transformations between non-equivalent but overlapping schemas. We describe a prototype schema integration tool that supports this functionality. Finally, we briefly discuss how our approach can be extended to more sophisticated application logic such as constraints, deductive rules, and active rules.

  2. P.J. McBrien and A. Poulovassilis
    A Uniform Approach to Inter-Model Transformations
    In Advanced Information Systems Engineering, 11th International Conference CAiSE'99
    Springer Verlag LNCS 1626, 333-348, 1999

    Abstract Whilst it is a common task in systems integration to have to transform between different semantic data models, such inter-model transformations are often specified in an ad hoc manner. Further, they are usually based on transforming all data into one common data model, which may not contain suitable data constructs to model directly all aspects of the data models being integrated. Our approach is to define each of these data models in terms of a lower-level hypergraph-based data model. We show how such definitions can be used to automatically derive schema transformation operators for the higher-level data models. We also show how these higher-level transformations can be used to perform inter-model transformations, and to define inter-model links.

  3. P.J. McBrien and A. Poulovassilis
    A Formalisation of Semantic Schema Integration
    Information Systems 23(5) 307-334, 1998

    Abstract Several methodologies for the semantic integration of databases have been proposed in the literature.  These often use a variant of the Entity-Relationship (ER) model as the common data model.  To aid the schema conforming, merging and restructuring phases of the semantic integration process, various transformations have been defined that map between ER representations which are in some sense equivalent. Our work aims to formalise the notion of schema equivalence and to provide a formal underpinning for the schema integration process.
    We show how transformational, mapping and behavioural schema equivalence are all variants of a more general definition of schema equivalence.  We propose a semantically sound set of primitive transformations and show how they can be used to express the transformations commonly used during the schema integration process and to define new transformations.  We differentiate between transformations which apply to any instance of a schema and those which require knowledge-based reasoning since they apply only for certain instances; this distinction could serve to enhance the performance of transformation tools since it identifies which transformations must be verified by inspection of the schema extension; it also serves to identify when intelligent reasoning is required during the schema integration process.

  4. A. Poulovassilis and P.J.McBrien
    A General Formal Framework for Schema Transformation (PostScript),
    Data and Knowledge Engineering 28(1) 47-71, 1998.

    Abstract Several methodologies for integrating database schemas have been proposed in the literature, using various common data models (CDMs). As part of these methodologies transformations have been defined that map between schemas which are in some sense equivalent. This paper describes a general framework for formally underpinning the schema transformation process. Our formalism clearly identifies which transformations apply for any instance of the schema and which only for certain instances. We illustrate the applicability of the framework by showing how to define a set of primitive transformations for an extended ER model and by defining some of the common schema transformations as sequences of these primitive transformations. The same approach could be used to formally define transformations on other CDMs.

  5. P. McBrien and A. Poulovassilis
    A Formal Framework for ER Schema Transformation
    In Proceedings of ER'1997
    Springer-Verlag LNCS 1331, 408-421, 1997

    Abstract Several methodologies for semantic schema integration have been proposed in the literature, often using some variant of the ER model as the common data model. As part of these methodologies, various transformations have been defined that map between ER schemas which are in some sense equivalent. This paper gives a unifying formalisation of the ER schema transformation process and shows how some common schema transformations can be expressed within this single framework. Our formalism clearly identifies which transformations apply for any instance of the schema and which only for certain instances.

  6. S. Hild and A. Poulovassilis
    Implementing Hyperlog, a Graph-Based Database Language (PostScript)
    In Journal of Visual Languages & Computing
    Vol 7(3), 267-289, 1996

    Abstract We describe the implementation of Hyperlog, a graph-based declarative language which supports both queries and updates over a graph-based data model called the Hypernode Model. This model is capable of representing arbitrarily complex data structures by means of nested and recursively defined graphs, while the Hyperlog language is computationally complete. By requiring only a very small number of graphical constructs, Hyperlog is well-suited to non-expert database users. In this paper we describe the graphical aspects of the Hyperlog implementation including novel techniques for: representing and updating data, queries and programs; representing and browsing the database; and representing and browsing the output from queries.
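
    For readers unfamiliar with the Hypernode Model, the key idea is that a database is a set of named graphs whose nodes are either atomic values or references (by label) to other named graphs, which is what allows arbitrarily nested and recursive structures. The Java sketch below is a hypothetical rendering of that idea with invented names; it is not the Hyperlog implementation described in the paper.

        import java.util.*;

        // Hypothetical sketch of the Hypernode Model idea: a database is a set of named
        // graphs (hypernodes) whose nodes are atoms or label-references to other hypernodes.
        interface GraphNode {}
        record Atom(Object value) implements GraphNode {}
        record HypernodeRef(String label) implements GraphNode {}   // reference by label permits recursion
        record GraphEdge(GraphNode from, GraphNode to) {}

        class Hypernode {
            final String label;
            final Set<GraphNode> nodes = new LinkedHashSet<>();
            final Set<GraphEdge> edges = new LinkedHashSet<>();
            Hypernode(String label) { this.label = label; }
        }

        class HypernodeDatabase {
            final Map<String, Hypernode> graphs = new LinkedHashMap<>();

            public static void main(String[] args) {
                HypernodeDatabase db = new HypernodeDatabase();

                // A "person1" hypernode whose nodes include an atomic value, a reference
                // to another hypernode, and a recursive reference to itself.
                Hypernode person = new Hypernode("person1");
                Atom name = new Atom("Ada");
                HypernodeRef address = new HypernodeRef("address1");
                HypernodeRef self = new HypernodeRef("person1");    // recursive reference
                person.nodes.add(name);
                person.nodes.add(address);
                person.nodes.add(self);
                person.edges.add(new GraphEdge(name, address));

                Hypernode addr = new Hypernode("address1");
                addr.nodes.add(new Atom("London"));

                db.graphs.put(person.label, person);
                db.graphs.put(addr.label, addr);
                System.out.println("graphs: " + db.graphs.keySet());
            }
        }

    Nesting is achieved purely through the label references, so a hypernode can appear inside many others, or inside itself, without duplication; this is the property that lets the model represent the "arbitrarily complex data structures" mentioned in the abstract.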

Page maintained by Peter McBrien (pjm@doc.ic.ac.uk)