Contents
1 Introduction
1.1 Motivation
1.2 Overview
2 Federated Information Systems
2.1 Why not one big database?
2.2 Basic Architecture
2.3 Distribution, Autonomy, and Heterogeneity
2.3.1 Distribution
2.3.2 Autonomy
2.3.3 Heterogeneity
2.4 Integration Challenges
2.4.1 Schema and Data Integration
2.4.2 Entity Resolution
2.4.3 Global Integrity
2.4.4 Global Transaction Management
2.5 Common Integration Architectures
2.5.1 Federated Database Systems
2.5.2 Mediator-based Information Systems
2.5.3 Peer Data Management Systems
3 Enhanced Active Database Systems
3.1 Definition
3.2 Enhanced Activity
3.3 External Program Calls
3.4 Discussion
3.5 Current EADBS
4 Active Event Notification
4.1 Monitoring Concepts
4.1.1 The Event Monitor
4.1.2 Change Capture Methods
4.1.3 Data Delivery Options
4.2 Related Work
4.2.1 Research Projects
4.2.2 Commercial Change Capture Products
4.3 Active Event Notification
4.3.1 Pull-based Asynchronous Notification
4.3.2 Push-based Synchronous Notification
4.3.3 Push-based Asynchronous Notification
4.3.4 Pull-based Synchronous Notification
5 Global Integrity Maintenance
5.1 Active Component Database Systems
5.2 Partial Integrity Constraints
5.2.1 Definition of Partial Integrity Constraints
5.2.2 Partial Integrity Constraints as ECA Rules
5.2.3 System Interaction
5.3 Checking Global Integrity Constraints
5.3.1 Attribute Constraints
5.3.2 Key Constraints
5.3.3 Referential Integrity Constraints
5.3.4 Aggregated Constraints
5.4 Discussion
5.5 Global Constraints with COMICS
5.5.1 System Overview
5.5.2 Checking Constraints with COMICS
5.6 Related Work
6 Tightly coupled Wrappers
6.1 Wrapper Architecture
6.2 Event Detection Subsystem
6.3 Application Fields
6.4 Related Work
7 The Dígame Architecture
7.1 Introduction
7.2 Basic Functionality
7.3 Dígame Architecture Components
7.4 Characteristics
7.5 Implementation Details
7.6 Related Work
8 Link Patterns
8.1 Motivation
8.2 The Data Link Modeling Language (DLML)
8.2.1 Introduction
8.2.2 DLML Components
8.2.3 Example
8.3 Link Patterns
8.3.1 Elements of a Link Pattern
8.3.2 Classification
8.3.3 Usage
8.4 Link Pattern Catalog
8.5 Example
8.6 Related Work
9 Conclusion and Future Work
9.1 Summary
9.2 Future Work
Bibliography
List of Figures
Preface
This thesis summarizes my research at the database group of the Department of Computer Science at the University of Düsseldorf, where I have been working as a research assistant since October 2002. This work was primarily motivated by my diploma thesis, which I wrote at the Ludwig-Maximilians-University of Munich, likewise under the supervision of Prof. Dr. Stefan Conrad.
I would like to thank all the persons that supported me in writing the thesis. In particular, I would like to express my sincere thanks to my supervisor and first referee Prof. Dr. Stefan Conrad for giving me the opportunity to work for my doctoral degree at his chair. I appreciate his support and confidence in my work and enjoyed working under his supervision. I would also like to thank Prof. Dr. Martin Mauve for his interest in my work and willingness to be the second referee.
I want to extend my compliments to my colleagues at the database group for the pleasant atmosphere, and especially to Cristian Pérez de Laborda, my longtime fellow student and coauthor of some nice papers, Evguenia Altareva, Johanna Vompras, and Tobias Riege for the stimulating tea breaks after lunch, as well as the members of the IGFZS community for the recreational activities.
Another word of thanks goes to the students that have contributed to my work with their bachelor theses, which are (in alphabetical order): Ludmila Himmelspach, Krasimir Kutsarov, Sandra Suljic, and Alexander Tchernin.
My special thanks go to Andrea Führer for supporting me throughout the time with her confidence and encouragement and for accompanying me through all the ups and downs.
Finally, I want to express my deepest thanks to my parents for their support, encouragement, and advice on my way so far, and especially for the care packages that made my time here a lot more comfortable.
Düsseldorf, April 2006
Abstract
Federated information systems provide access to interrelated data that is distributed over multiple autonomous and heterogeneous data sources. The integration of these sources demands flexible and extensible architectures that balance both the highest possible autonomy and a reasonable degree of information sharing. In current federated information systems, the integrated data sources have only passive functionality with regard to the federation. However, continuous improvements take the functionality of modern databases beyond former limits. The significant improvement on which this work is based is the ability of modern active database systems to execute programs written in a standalone programming language as user-defined functions or stored procedures from within their database management systems.
We introduce Enhanced Active Database Systems as a new subclass of active databases that are able to interact with other components of a federation using external program calls from within triggers. We present several concepts and architectures that are specifically developed for Enhanced Active Databases to improve interoperability and consistency in federated information systems. As the basic concept, we describe Active Event Notifications, which provide an information system with synchronous and asynchronous update notifications in real time. Based on this functionality, Enhanced Active Databases are able to actively participate in global integrity maintenance by executing partial constraint checks on interrelated remote data. Furthermore, we present an architecture for a universal wrapper component that especially supports Active Event Notifications, which makes it particularly suitable for event-based federated systems with real-time data processing. This tightly coupled wrapper architecture is used to build the Dígame architecture for a peer data management system with push-based data and schema replication. Finally, we propose a Link Pattern Catalog as a guideline for modeling and analyzing P2P-based information systems.
Chapter 1 Introduction
1.1 Motivation
Since the first centralized databases found their way into enterprises in the late 1960s, the needs and requirements have shifted towards a more distributed management of data. Today, many corporations and organizations possess large numbers of databases, often spread over different regions or countries and typically connected to a network. These databases grew in an autonomous and independent manner to fit the special needs of users at the local sites, which led to logical and physical differences. However, local applications produce or modify data that is often semantically related to data stored on other sources. The integration of these sources allows a company to keep track of its distributed data and thereby improve data quality and availability. One of the main challenges in such environments is the autonomy of the integrated sources. This autonomy implies the ability of a source to choose its own database design and operational behavior, making it harder to integrate the source into a company-wide information system. Most companies have just started to see their distributed data as a valuable resource. Unfortunately, this data exists in various data models and formats. A study in 2004 [101] showed that although relational databases are by far the most popular databases in a commercial environment, other formats like flat files, XML, and object-oriented data sources are also widely used in practice. The need for an integrated information platform is increasing steadily.
Federated information systems integrate information from multiple autonomous and heterogeneous data sources and provide centralized access to their data. Unlike distributed databases with homogeneous structures, they allow the integrated sources to retain a certain level of autonomy. Thus, a federated architecture for distributed data has to balance both the highest possible local autonomy and a reasonable degree of information sharing. Depending on the application field, different federated architectures support different operations on the local and global level, but they always have to address the problems imposed by distribution, autonomy, and heterogeneity to ensure consistency within the system. In current federated information systems, the integrated data sources have only passive functionality with regard to the federation. Like repositories, they provide access to their data and only respond to external requests. Although active databases are able to react to certain local events, their possibilities to actively support the federation are very limited or simply nonexistent. Continuous improvements of (mainly relational) database systems and query languages resulted in the definition of SQL-invoked routines in the SQL:1999 standard. These routines are stored procedures or user-defined functions that can be defined as external routines written in an external programming language like C or Java. We believe that this innovation takes data management and processing in federated information systems to a higher level. To fully exploit the new capabilities of these Enhanced Active Databases for information sharing in a federated environment, new techniques and architectures are required that are specifically designed for this data source activity class.
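To make this idea concrete, the following sketch shows what such an external routine might look like: a Java method that an SQL:1999-style DBMS could register as an external stored procedure and invoke from an AFTER trigger to push an update notification to a remote federation component. The DDL shown in the comments, the NotifyRoutine class, and the event-monitor endpoint are illustrative assumptions and not taken from a particular product.

// Illustrative sketch only: a Java routine that an SQL:1999-style DBMS could
// register as an external stored procedure and invoke from an AFTER trigger to
// push a local update notification to a remote federation component.
// The DDL below is schematic; exact syntax and the event-monitor endpoint
// (localhost:4711) are assumptions.
//
//   CREATE PROCEDURE notify_update(tab_name VARCHAR(128), op VARCHAR(16), key_val VARCHAR(128))
//     LANGUAGE JAVA PARAMETER STYLE JAVA
//     EXTERNAL NAME 'NotifyRoutine.notifyUpdate';
//
//   CREATE TRIGGER customer_changed AFTER UPDATE ON customer
//     REFERENCING NEW ROW AS n
//     FOR EACH ROW CALL notify_update('customer', 'UPDATE', n.cust_no);

import java.io.OutputStreamWriter;
import java.io.Writer;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

public class NotifyRoutine {

    /** Sends a one-line update notification to an external event monitor. */
    public static void notifyUpdate(String table, String operation, String key) {
        try (Socket socket = new Socket("localhost", 4711);
             Writer out = new OutputStreamWriter(socket.getOutputStream(), StandardCharsets.UTF_8)) {
            out.write(table + ";" + operation + ";" + key + "\n");
            out.flush();
        } catch (Exception e) {
            // In a synchronous notification the trigger could abort the local
            // transaction here; in an asynchronous one the failure is only logged.
            System.err.println("notification failed: " + e.getMessage());
        }
    }
}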
1.2 Overview
In this thesis we introduce Enhanced Active Database Systems (EADBS) as an extended class of active databases that are able to execute external routines written in an external programming language in order to react to local events in a more complex way. We present concepts and architectures that are specifically developed for Enhanced Active Databases to support information sharing in federated information systems.
Following the introduction, we start with a general overview of federated information systems and their characteristics in Chapter 2. We motivate the need for an integrated environment and introduce the theoretical background for the following chapters. We discuss distribution, autonomy, and heterogeneity as the main dimensions of federated information systems, summarize important integration challenges that must be addressed during system development, and sketch some well-known federated architectures.
In Chapters 3 and 4 we provide an in-depth discussion of Enhanced Active Database Systems and our novel Active Event Notification mechanism as the basis for the following chapters. We first give a basic definition of Enhanced Active Databases and describe the enhanced activity that distinguishes this particular class of databases from others. The chapter also provides an overview of current databases with enhanced activity. Second, we present the concept of Active Event Notification, which enables EADBSs to signal local data modifications to external components immediately after an update has occurred. Besides a detailed description of this notification process, we provide a general overview of monitoring techniques and properties, and distinguish our approach from existing event detection solutions.
In Chapter 5 we show how EADBSs are able to actively participate in global integrity maintenance in federated information systems. Active Event Notifications allow them to perform synchronous constraint checks on interrelated data stored on remote database systems. We introduce partial integrity constraints as a new type of constraints suitable for EADBSs and explain the checking mechanism for commonly used constraints. Furthermore, we present the COMICS constraint management architecture that extends the basic concept by introducing an external constraint manager that performs the remote part of partial constraint checks. EADBSs directly interact with the constraint manager during constraint checks using Active Event Notifications.
Since wrappers are an essential component in most federated information systems, we have developed a tightly coupled wrapper architecture for various types of data sources. Chapter 6 gives a detailed description of the wrapper that comprises an event detection subsystem which is used to extract data changes from the encapsulated source. It particularly provides a Notification Interface to support Active Event Notifications from Enhanced Active Databases, which makes the wrapper perfectly suitable for event-based information systems with push-based real-time event delivery.
Based on tightly coupled wrappers and EADBSs, we introduce the Dígame architecture for a peer-to-peer information system with push-based data and schema replication. Chapter 7 describes the basic functionality and components of this architecture, and discusses its major characteristics and application fields.
Finally, in Chapter 8 we propose a Link Pattern Catalog as a modeling guideline for recurring problems in information sharing environments like our Dígame architecture or similar P2P data management systems. We introduce the Data Link Modeling Language for describing and modeling data flows and describe commonly used Link Patterns and their applications.
Chapter 9 summarizes the concepts proposed and their contribution to the development of federated information systems. We conclude with an outlook on possible future work.
Chapter 2 Federated Information Systems
In this introductory chapter we describe the main characteristics of federated information systems (FIS) and the challenges that we face during their design and implementation. We start with a motivation for distributed information systems and answer the question why, in many scenarios, it is neither reasonable nor feasible to maintain a centralized database. We continue with a description of distribution, autonomy, and heterogeneity as three important dimensions of an information system architecture, followed by a summary of concrete problems that must be addressed during the integration of autonomous and heterogeneous data sources. The chapter closes with an overview of selected integration architectures.
2.1 Why not one big database?
In general, an information system integrates multiple sources from several network nodes, which emerged autonomously to fit the special needs at a local site. Local applications typically imply specific hardware and software requirements with regard to the data sources. For example, the design department of a company could use a third-party CAD application to develop a new product. This CAD software possibly requires a specific type of database system to store its application data or ships with its own internal data management system. Other departments may have their own special purpose software to carry out their tasks, e.g. tools for accounting and billing, workflow management, or personnel administration. Furthermore, if a department retains a high level of autonomy, it is able to set up its own database systems according to its needs and abilities. For example, a department could favor a certain database system simply due to financial reasons.
Centralization of data holds many advantages considering technical or economic aspects. It limits the costs of redundant systems and increases data consistency and integrity based on uniform standards. However, besides the problem that most top-down data planning efforts do not meet user expectations, technical considerations are only one factor regarding data centralization in a corporate environment. The more crucial factor for the success of an information system is the question of data ownership, as discussed in [16, 116].
With the introduction of databases into business processes, the traditional definition of data ownership has changed. Data ownership used to mean total control over the creation, maintenance, and processing of the data. Consequently, data sharing and data integration implied a loss of data ownership and with it the loss of total control over the content. Thus, the need for data sharing inevitably collides with the individual demand for ownership. According to Van Alstyne et al. [116], a key factor for the importance of ownership is self-interest, which means that data owners have a greater interest in the success of an information system than non-owners. Thus, databases are maintained more conscientiously by their owners than by non-owners. Furthermore, they state that data quality can only be ensured if data ownership and data origination are not separated. Data should only be created and maintained by users with expert knowledge of the application field. For example, consider a research department that is willing to share the results of its experiments with the community. Cooperating departments should be able to process the data but not to manipulate it, since the results are specific to a certain experimental setup. If the department loses control over its data, it might not be willing to share further results.
In contrast to the ownership demands of individuals or individual departments, the company requires control of its data resources to know about the present use of the data, predict its future use, and constantly adjust that use to meet the company's goals. Data has to be considered a corporate resource just like natural resources, finances, or personnel. If a company bears the costs of data, i.e. the costs for collecting, maintaining, and processing data, the company should be considered the owner of that data. A decentralization of data means to delegate, transfer, and grant functional rights and privileges to individual departments and users so that these individuals can assist the company in achieving its goals. This decentralization gives the individuals a sense of data ownership, allowing them to plan and carry out their functional mandate autonomously.
Putting it all together, data ownership is the key reason for decentralization and autonomy. Interrelated data gets spread over multiple autonomous and possibly heterogeneous data sources. However, it is crucial for a company to keep track of its distributed data to make the right decisions and increase its productivity. An information system that accesses and processes this data has to address the problems arising from distribution, autonomy, and heterogeneity with regard to data ownership to ensure a high level of data quality.
2.2 Basic Architecture
Federated information systems integrate various autonomous and heterogeneous sources and provide access to their interrelated data. In the following we give a description of the basic architecture of a federated information system and its basic functionality based on [20]. A federated information system consists of a set of distinct and autonomous information system components that give up part of their autonomy to participate in the federation and share parts of their information. An information system component offers one or more interfaces to its data, which can be stored on a single computer or distributed over multiple, possibly heterogeneous network nodes. Individual information systems can be database systems with a DBMS or non-database systems like flat files, spreadsheets, or document collections without a standardized data model, predefined schema, or query language. However, non-database systems can be treated as database systems if they enforce a strict format and offer some kind of declarative access language [1].
Figure 2.1: Basic architecture of federated information systems (based on [20])
Figure 2.1 depicts the general architecture of a federated information system. A federation layer provides uniform access to various autonomous and heterogeneous information system components that store interdependent data. These components are usually integrated into the infrastructure using wrappers, which overcome source-specific differences and provide the federation layer with a uniform interface (wrapper layer). While the individual data sources may still be manipulated by local applications, global users and applications are able to access the integrated sources via the federation layer. The federation layer is a software component that implements a specific interoperation strategy that can be based on, for example, a federated schema, a uniform query language, or a set of source and content descriptors. The mechanisms depend on the composition and implementation of the federation layer. A selection of concrete FIS architectures can be found in Section 2.5.
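As an illustration of the wrapper layer described above, the uniform interface a wrapper offers to the federation layer could be pictured as follows; the interface and its method names are hypothetical and merely indicate the kind of functionality involved.

// Illustrative sketch of the uniform interface a wrapper might expose to the
// federation layer; all type and method names are hypothetical.
import java.util.List;
import java.util.Map;

public interface Wrapper {
    /** Describes the encapsulated source's data in the federation's common data model. */
    String exportSchema();

    /** Executes a query phrased in the federation's common query language. */
    List<Map<String, Object>> executeQuery(String query);

    /** Applies a write operation, if the source permits external modifications. */
    void executeUpdate(String update);
}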
In the following, a single information system component that is integrated into a federated information system is called a component database or component database system (CDBS). The data source comprised by a component database is called a local database. A local transaction is an operation that is submitted directly to a CDBS by a local user or application. It only affects data on the respective source. In contrast, global transactions (also denoted as external transactions from the point of view of a component database) are submitted by global applications and affect data stored on multiple CDBSs. In general, a global transaction is split into a sequence of local transactions which are executed on the affected component databases. Query results from multiple sources are transformed into a common representation format and sent to the global user or application. The access to interdependent data using global transactions is managed by the federation layer of the federated information system. An explicit schema definition of a local database system or the implicit structure of a local non-database system enforcing a strict format is called the local schema of a component database.
Depending on the operations that are supported globally and locally, we can distinguish between the following operational types of federated information systems:
- Global read-only: The information system provides read-only access to the integrated information sources. Data modifications are only supported locally, preserving a high level of local autonomy for the participating component databases. The main challenge in a global read-only environment is a reasonable integration of the local schemas, while the problem of global transaction management is avoided.
- Local read-only: Data can exclusively be modified globally via the federation layer. Local applications can only read interdependent data, revoking autonomy from the component databases. This requires the local schemas to be properly integrated into a global schema and a global transaction mechanism to execute updates. Due to the restrictions on local operations, the global transaction manager has full control over all update operations in the federation, avoiding the problems of global deadlocks and serializability.
- Mixed: This operational type allows data to be modified by both global and local applications and is therefore the most complicated type. The system has to cope with the entire set of problems relating to database transaction management, including serializability, deadlock detection, and atomic commit.
The concrete composition of the federation layer and the supported operations determine the problem areas to cope with during the integration of individual component databases. Before we deal with common integration challenges, we discuss the main dimensions of federated information systems in the next section.
2.3 Distribution, Autonomy, and Heterogeneity
To address the problems arising in a federated information system, we first have to identify and understand the characteristics that impose them. The architecture of a federated information system can be classified according to three dimensions: distribution, heterogeneity, and autonomy. A fully centralized information system, for example, processes data from a single data source, whereas a fully distributed system could integrate multiple heterogeneous sources running autonomously on different network nodes. Each dimension has advantages and disadvantages for the overall system, and the dimensions cannot be treated independently. A system with centralized data storage, for instance, will not have to deal with heterogeneity and autonomy problems. The following short descriptions of each dimension are based on [52, 84, 102].
2.3.1 Distribution
Data of an information system may be located on a single data source or distributed among several databases on one or more physical machines. The benefits of distributing data are clearly the increase in availability and reliability and the improvement of access times. Furthermore, distribution is often required to satisfy data ownership and to adapt an information platform to the data policy of a company or organization (see Section 2.1). Data can be distributed over multiple sources in a non-redundant or redundant way. Non-redundant distribution requires a partition of the data, such as horizontal or vertical partitioning in relational terms. Redundant storage of data, i.e. the replication of data, requires mechanisms in the information system to ensure the consistency of all replicas of a data item when data is modified locally.
In a system of autonomous data sources, the distribution of data is mainly introduced in an uncontrolled and unintended way. The data sources are often designed autonomously with regard to the local needs and requirements but store data that is interrelated with data on other sources. The main problems arising in an information system due to the distribution of data are the planning, scheduling, and execution of global read and write operations, the allocation of data, the maintenance of global integrity constraints, and replication management. Distributed databases address these problems in a homogeneous environment with strong centralized control and without any autonomy of the data nodes. The addition of the dimensions autonomy and heterogeneity to the existing problems of data distribution imposes the main challenges of federated information systems.
2.3.2 Autonomy
The organizational structure of an information system generally reflects the organizational structure of the collaborating partners themselves. In many companies the departments retain a high level of autonomy, which means that they are allowed to organize, execute, and monitor their tasks on their own. This autonomy directly affects the data sources managed by the departments, which are therefore often separately and independently controlled by them. In order to design and compose an information system that comprises several autonomous data sources, we have to understand and address the problems arising from the autonomy of the component databases. The autonomy of a component database of an information system is also termed local autonomy [33]. A widely accepted classification of local autonomy is summarized in [102], which distinguishes between three types of local autonomy: design, communication, and execution autonomy.
Design autonomy is the ability of a local database to choose its own database design independently from the design of other component databases. This particularly means that they retain their local schemas during the integration process and that they cannot be forced to change their designs.
In particular, design autonomy allows a CDBS to freely choose
- the data it manages,
- the naming of data elements and the data representation including the data model and query language,
- the conceptual modeling of real world objects and the semantic interpretation of the data,
- the constraints to ensure consistency of the data,
- the set of supported operations for data access and manipulation, and
- the concrete implementation of the system.
Design autonomy is the main reason for heterogeneity in an information system. In particular, the ability of a CDBS to choose its own conceptualization of real world objects leads to the problem field of semantic heterogeneity, which is discussed later (see Section 2.3.3).
Execution autonomy enables a CDBS to execute local transactions without interference from external transactions. A system with execution autonomy cannot be forced by an external component to execute transactions according to a certain schedule. Operations may also be rejected at any time, for example, if they violate local integrity constraints. A data source with execution autonomy does not have to inform external components about the execution order of external transactions. Basically, the CDBS treats external transactions like local transactions. The problems arising from execution autonomy mainly concern global transaction management and consistency. Since a global component is unable to manipulate the scheduling and execution of transactions at a CDBS with execution autonomy, it is unable to ensure a global atomic commit [77]. In particular, such CDBSs cannot be forced to provide a prepare-to-commit state as required for multi-phase commit protocols (e.g. 2PC).

The third type of autonomy is communication autonomy. It allows a CDBS to decide when and how to communicate with other components. This includes the ability of a CDBS to join or leave the information system at any time. In particular, a CDBS may go offline at any time and rejoin the system later on. The problem of volatile data nodes especially has to be addressed in information system architectures resembling peer-to-peer concepts. Further taxonomies and types of autonomy can be found in the literature, like operational autonomy and service autonomy [33], naming autonomy and transaction-control autonomy [40], or association autonomy [5]. They basically define subsets or combinations of the autonomy types listed above and can thus be described using the given classification.
As stated by Heimbigner and McLeod [52], the aim of an information system that integrates autonomous data sources is to achieve a feasible trade-off between the local autonomy of the CDBSs and a reasonable degree of information sharing. Without the constraint of a central authority, information sharing is realized by cooperating component databases, which can communicate in three ways:
- Data communication: A component database provides access to its data or a subset of its data to other components directly. The information system thus has to provide mechanisms to support data sharing among the participating CDBSs.
- Transaction sharing: A component database may not allow other components to directly access its data. It rather provides a set of operations that can be executed upon its data stock. This requires components to be able to define transactions.
- Cooperative activities: Without the constraint of a central authority, the autonomous components need to cooperate to share information. They must be able to initiate, monitor, and control a series of actions that involve cooperation with other components using appropriate protocols (negotiated data sharing).
Cooperation certainly demands agreements among the partners, and in most cases it means restrictions of their local autonomy. Agreements among autonomous component databases for information sharing concern the data they are willing to share and the set of operations they support, but also additional cooperation parameters such as uptime guarantees.
2.3.3 Heterogeneity
The third dimension of an information system architecture is heterogeneity. As mentioned above, heterogeneity is mainly caused by the design autonomy of distributed, collaborating component data sources. The ability of an administrator of a CDBS to choose the type of database, including its data model and query language, as well as the individual conceptualization of the data leads to heterogeneity on the global system level. This heterogeneity has to be addressed during the integration of the CDBSs to provide a consistent global view of the partitioned data.
Heterogeneity can basically be divided into two classes: heterogeneity due to differences in the data sources and heterogeneity due to the semantic interpretation of the data. Both classes are described in the following sections.
Differences in data sources
Departments with a high level of local autonomy can choose their own database system depending on their specific environment and requirements, which leads to differences at the system level and in data models. Heterogeneity at the system level includes transaction management primitives and techniques like concurrency control, commit protocols, and recovery mechanisms. Furthermore, the hardware and operating system on which the CDBS resides may induce heterogeneity concerning file systems, data formats and representation, transaction support, or communication capabilities. System aspects are especially important during the integration of data sources without a database management system, since the integration layer of the information system has to provide system-specific solutions for required operations, like file access or transaction management.
Heterogeneity in the data model describes the differences in structures, constraints, and query languages. Different CDBSs can use different data models like relational, object-oriented, or semi-structured. Each data model provides different modeling constructs (e.g. inheritance and generalization in the object-oriented model), which will lead to different structures on the schema level. Even if two CDBSs use the same data model, the probability that a real world object will be modeled differently in the data sources rises with the spectrum of available modeling constructs.
Besides differences in structure, the data models may support different integrity constraints. Some integrity constraints might be inherent in one data model, but must be explicitly formulated in another one. For example, a specialization or generalization constraint could be expressed inherently in an object-oriented data model using an inheritance relationship, whereas in the relational model it must explicitly be expressed by a referential integrity constraint. Furthermore, active databases may use triggers to check complex constraints which cannot be expressed in a passive database, although both might be relational data sources.
Heterogeneity can also be caused by differences in the query languages that are used to manipulate data represented in different data models. Two CDBSs with the same data model might use different query languages or support different versions or functionalities of the same query language.
Semantic Heterogeneity
A database can be considered an image of the real world. During the database design, an administrator models real world objects in the database using the modeling constructs of the data model. More concretely, modeling a real world object means to name the conceptual image of the object (entity), to choose a set of attributes which describe the object with regard to the specific requirements of the database application, and to assign a domain to each attribute. Furthermore, since real world objects can be related to each other, the administrator has to model relationships between entities in the database.
In a collection of autonomous databases, each administrator may have a different view of the same real world objects depending on his or her own understanding of things and the local needs regarding the data. This individual semantic interpretation of the data and its usage is the main reason for semantic heterogeneity. Unlike differences in data sources, semantic heterogeneity is harder to detect and address. Differences can occur due to the selection and naming of entities and attributes as well as the selection of attribute domains or the interpretation of attribute values.
2.4 Integration Challenges
As described in the previous section, the three main characteristics of an information system architecture are distribution, heterogeneity, and autonomy. These dimensions impose several problem fields during the integration of autonomous and heterogeneous data sources into an information system. This section describes the main integration challenges which have to be addressed by information system architects to create a reliable system assuring a high level of data quality and consistency. A more detailed overview can be found in [30].
2.4.1 Schema and Data Integration
The first and probably most difficult problem consists of the integration conflicts resulting from the heterogeneity of the integrated data sources. Differences in data sources and semantic heterogeneity lead to different views and models of the same real world objects in the CDBSs. Database designers might have individual information needs and use their own tools to satisfy them. During the integration process, these differences must be resolved to provide a uniform global view (global schema) on the entire data and to enable system interoperability and data sharing. The basic approach to build a global schema is to select several independently developed schemas from component databases with interdependent data (local schemas), resolve syntactic and semantic conflicts among them, and create an integrated schema comprising all their information. A model-independent classification of integration conflicts is presented in [105]. This taxonomy distinguishes four conflict classes: semantic, descriptive, heterogeneity, and structural conflicts. They are briefly discussed in the following.
Semantic Conflicts: Two database designers might have different perceptions of a set of real world objects (entities). An object class employees might in one CDBS be used to represent the employees of the entire company, whereas in another CDBS it might represent only the employees of a single department. Although the object classes could be semantically equivalent in both CDBSs, they represent different sets of real world objects. The extensions of two object classes can be disjoint, equivalent, or overlapping, or one class extension can be strictly included in another.
Descriptive Conflicts: Descriptive conflicts arise from different conceptualizations of the same set of real world objects. Two database designers might be interested in different properties of the same object and thus create schemas with different sets of attributes. Descriptive conflicts also include naming conflicts due to homonyms and synonyms, as well as conflicts regarding attribute scales, domains, constraints, and operations.
Heterogeneity Conflicts: Database designers could use different data models for their databases, which results in heterogeneity conflicts. In general, the integration of heterogeneous data models also implies structural conflicts, since the data models provide different constructs to model real world objects.
Structural Conflicts: Even if designers use the same data model, there might be structural differences between the schemas. The same real world objects can be modeled using different modeling constructs. The more constructs are available, the more possibilities the designers have to represent the same object. As an example of a structural conflict, consider the star and snowflake schema as relational representations of a multidimensional data model for data warehouses. Although both schemas store the same information, the snowflake schema uses normalized tables to reflect hierarchies in the dimensions, whereas the star schema has a single non-normalized table for each dimension, but with redundant storage of information.
A well-known approach to schema integration uses assertions as the main concept. Assertions (or mappings) express correspondences between the schemas or parts of the schemas to be integrated. They define dependencies and integration rules for the schemas as well as transformation rules for the corresponding data instances. The mappings can be defined between a local and a global schema or between two local schemas, as required for information systems without a global schema (see Section 2.5). For detailed information on integration using assertions we refer to [104, 105]. From the exhaustive list of work in this area, we only want to present a small selection. [42], for instance, discusses a method for schema integration that detects class similarities by comparing previously enriched schemas along the generalization, specialization, and aggregation dimensions. Similarly, [34] proposes a simple unified language for the specification of three fragmentation conflict types (classification, decomposition, and aggregation conflicts) together with techniques to solve them. Schema integration using a global data structure is presented in [18]. The authors propose the Summary Schemas Model to aid semantic identification: users access local data via imprecise queries on the global schema, while the system matches the users' terms to the semantically closest system terms. A similar approach is discussed in [36], where a semantically expressive common data model is used to capture the intended meanings of conceptual schemas. This Kernel Object Data Model describes structures, constraints, and operations on the shared data. An approach for the integration of integrity constraints is presented in [31]. The authors apply rules to a set of elementary operations for schema integration and restructuring. Finally, [75] presents an algorithm that discovers mappings between schema elements based on their names, data types, constraints, and schema structure using linguistic, structural, and context-dependent matching techniques.
2.4.2 Entity Resolution
Semantic heterogeneity in federated information systems imposes multiple challenges considering the different representations of real world objects in the local databases. If two database designers model overlapping views of the same real world entity, the resulting schemas will store redundant information concerning this entity. The problem field of entity resolution (also referred to as record linkage or deduplication [14]) in the context of federated information systems deals with the identification of corresponding records referring to the same real world entity in multiple databases, possibly with different schemas.
The aim is to merge corresponding records into one record with more complete information. For example, two departments of a company could store customer information in their autonomous local databases. During the integration of these customer databases into a company-wide information system with one global customer database, corresponding records that refer to the same customer have to be merged and duplicates have to be removed to ensure a consistent customer data stock. This join can be problematic even if the customers are globally identified by a company-wide key (e.g. a customer number), and most often there will not be a unique key that can be used to join the records.
The basic mathematical model for entity resolution was introduced by Fellegi and Sunter [37]. Suppose the records are stored in the sources A and B. Furthermore, an individual real world entity is assumed to be identified by multiple attributes (key attributes) of a record in A and B, like name, address, date of birth, and gender. Two disjoint sets M and U are defined from the cross-product A × B and denote the set of record pairs (a, b) from A and B that are matched ((a, b) ∈ M) and the set of pairs that are not matched ((a, b) ∈ U). The record linkage process tries to determine whether a pair belongs to M or to U. One of the standard algorithms for this task uses a probabilistic model with expectation-maximization to calculate probabilities for a match or non-match of a record pair by comparing the values of the key attributes [124, 48]. A comparison or agreement vector γ represents the level of agreement between a and b by calculating the matching weights of their key attributes. Attributes can be weighted in the comparison depending on their importance or value distribution. The composite weight (or score) for a comparison vector γ is calculated from the conditional probabilities for a match, m(γ), and a non-match, u(γ), as follows:
w(γ) = log( m(γ) / u(γ) ),  where  m(γ) = P(γ | (a, b) ∈ M)  and  u(γ) = P(γ | (a, b) ∈ U).

Given two threshold values Tµ and Tλ with Tµ > Tλ, a record pair (a, b) is classified using its score w(γ) as follows: it is declared a match if w(γ) ≥ Tµ, a possible match if Tλ < w(γ) < Tµ, and a non-match if w(γ) ≤ Tλ.
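A minimal sketch of this decision rule is given below, assuming that the m- and u-probabilities of the key attributes have already been estimated (e.g. with expectation-maximization); all attribute names, probabilities, and threshold values are purely illustrative.

// Minimal sketch of the Fellegi-Sunter decision rule. The m- and u-probabilities
// per key attribute are assumed to be already estimated (e.g. with EM);
// all attribute names, probabilities, and thresholds are illustrative.
public class RecordLinkage {

    enum Decision { MATCH, POSSIBLE_MATCH, NON_MATCH }

    /** Composite weight: sum over attributes of log2(m_i/u_i) on agreement
     *  and log2((1-m_i)/(1-u_i)) on disagreement. */
    static double compositeWeight(boolean[] agrees, double[] m, double[] u) {
        double w = 0.0;
        for (int i = 0; i < agrees.length; i++) {
            w += agrees[i] ? log2(m[i] / u[i]) : log2((1 - m[i]) / (1 - u[i]));
        }
        return w;
    }

    static Decision classify(double weight, double tUpper, double tLower) {
        if (weight >= tUpper) return Decision.MATCH;
        if (weight <= tLower) return Decision.NON_MATCH;
        return Decision.POSSIBLE_MATCH;
    }

    static double log2(double x) { return Math.log(x) / Math.log(2); }

    public static void main(String[] args) {
        boolean[] agreement = { true, true, false, true };      // name, address, birth date, gender
        double[] m = { 0.95, 0.90, 0.85, 0.98 };                // P(agreement | match)
        double[] u = { 0.05, 0.10, 0.02, 0.50 };                // P(agreement | non-match)
        double w = compositeWeight(agreement, m, u);
        System.out.println(w + " -> " + classify(w, 6.0, 0.0)); // Tµ = 6, Tλ = 0
    }
}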
The algorithm performs multiple blocking passes over the non-matched record pairs, in which it selects one or more blocking attributes to calculate the matching weight. If the value distribution of an attribute is not uniform, the values can be weighted accordingly. The following overview, based on [48], briefly concretizes the challenges that arise during the record linkage process:
Standardization: Without standardization, many records could be wrongly classified as non-matches due to typographical errors (e.g. 'stret' instead of 'street'), homonyms and synonyms (e.g. 'name', 'lastname', 'fullname'), or alternative representations of the same concept (e.g. 'M/F' or '0/1' for a 'gender' attribute). During the data cleaning process, attribute values are transformed into a standardized representation and spelling or combined with other values to satisfy global standards.
Attribute Selection: This problem field concerns the selection of the common attributes on which the matching weight is calculated. It requires identifying the common attributes, or the optimal subset of common attributes, that have sufficient information content to support the linkage quality.
Comparison: The actual comparison of two attributes is mainly based on distance-based metrics to compute matches. Since text or string values are most commonly used as matching attributes, this problem field deals with the development of efficient string comparators. Well-known comparison techniques are based on the edit distance, the N-gram distance, vector space representations of fields, or adaptive comparator functions (a generic example is sketched after this list).
Decision Model: After the matching weights of the individual attributes have been calculated, they are combined into a composite score to determine whether the record pair is a match, non-match, or possible match. This classification is performed using a decision model. Besides the probability model described above, other models have been proposed, like statistical models that compute statistical characteristics of errors or predictive models for learning threshold values and attribute weights.
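As a generic example for the comparison step mentioned above, the following sketch normalizes the classic edit distance (Levenshtein distance) into a similarity score in [0, 1]; it is a textbook comparator, not a specific one from the cited work.

// Generic Levenshtein comparator as an example of a distance-based string
// comparison for the matching step; not a specific comparator from the cited work.
public class EditDistanceComparator {

    /** Classic dynamic-programming edit distance between two strings. */
    static int editDistance(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                                   d[i - 1][j - 1] + cost);
            }
        }
        return d[a.length()][b.length()];
    }

    /** Normalized similarity in [0,1]; 1.0 means the values agree exactly. */
    static double similarity(String a, String b) {
        int max = Math.max(a.length(), b.length());
        return max == 0 ? 1.0 : 1.0 - (double) editDistance(a, b) / max;
    }

    public static void main(String[] args) {
        System.out.println(similarity("street", "stret"));   // ~0.83, likely agreement
        System.out.println(similarity("Smith", "Taylor"));   // low similarity
    }
}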
The linkage of corresponding records is essential for the consistency of the overall system. Only if two records can be classified as a match or a non-match is the system able to perform the global integrity checks that are essential for data quality in the information system.
2.4.3 Global Integrity
The integration of autonomous and heterogeneous information system components into a single federated information system inevitably raises the question of how to ensure consistency of the data from a global point of view. For example, two CDBSs might maintain locally consistent data stocks but store contradictory information about a real world entity from a global point of view. These conflicts must be resolved during integration and prevented in the future by the federated information system to ensure high data quality. Consistency in a federated information system can only be violated by write operations on interdependent data. Data on a local database that is not interrelated with remote data on another CDBS can be modified in accordance with local integrity constraints and does not compromise global consistency. Thus, we restrict our further considerations to operations that modify interrelated data locally and globally:
Local data modifications: A local data modification is a local transaction that inserts, deletes, or updates objects in a local database which are represented by a local schema. An update operation modifies one or more attribute values of a local object.
Global data modifications: A global data modification is a global transaction that is issued against a global schema. The insertion, update, or deletion of a global object results in a sequence of local write operations executed on the affected local databases.
To ensure global consistency, the information system has to monitor and check local and global write operations against global integrity constraints that express the dependencies among the interrelated data. While global transactions can easily be monitored by the federation layer, the detection of constraint violations caused by local transactions and their compensation is a major problem of information system engineering. On the global level, we distinguish between the following three types of integrity constraints, which result from different situations and impose different challenges for integrity maintenance in an information system:
Constraints resulting from schema integration: These global integrity constraints result from the integration of the local schemas. The CDBSs might model equivalent, overlapping, or disjoint views of a real world entity. The information system has to ensure that objects are stored correctly in their corresponding extensions. For example, if an object is inserted into an extension that is semantically equivalent to an extension on another CDBS, the object must be inserted into all affected extensions. If the insertion fails in one extension, then the overall operation must be rolled back. Semantically equivalent and overlapping local extensions require the system to maintain replicas of objects to ensure consistency. Update operations on equivalent or overlapping extensions must be executed concurrently on all replicated objects.
Constraints derived from local constraints: Local integrity constraints belong to the semantics of the data and must also be enforced on the global level. They can be separated into explicit and implicit constraints. Explicit constraints are formulated and managed separately by the DBMS, whereas implicit constraints are inherent properties of the data model. If the implicit constraints are not supported in the global data model, then they have to be translated and managed explicitly by the federation layer (see [31] for details).
Additional global constraints: In addition to the constraints resulting from schema integration and the integration of local constraints, there might be further constraints that are formulated explicitly on the global schema to enforce certain rules on the interdependent data. These could be global key or aggregate constraints that express specific business rules or existence dependencies between local extensions. The information system must be able to monitor and enforce such explicit integrity constraints to ensure consistency (a schematic check of a global key constraint is sketched below).
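The following sketch indicates, under hypothetical interfaces, how a federation-layer component might check such an explicit global key constraint when a component database signals a local insert; it is a simplified illustration and not the mechanism proposed later in this thesis.

// Schematic sketch of a federation-layer check of an explicit global key
// constraint, triggered by a reported local insert; all interfaces and names
// are hypothetical and heavily simplified.
import java.util.List;

public class GlobalKeyConstraintCheck {

    /** Minimal view of a component database as seen by the federation layer. */
    interface ComponentDatabase {
        String name();
        /** True if an object with the given global key exists in the local extension. */
        boolean containsKey(String extension, String globalKey);
    }

    private final List<ComponentDatabase> components;

    GlobalKeyConstraintCheck(List<ComponentDatabase> components) {
        this.components = components;
    }

    /**
     * Called when CDBS 'origin' reports the insertion of 'globalKey' into 'extension'.
     * The global key constraint is violated if any other component already stores
     * an object with the same key in a semantically equivalent extension.
     */
    boolean insertViolatesGlobalKey(ComponentDatabase origin, String extension, String globalKey) {
        for (ComponentDatabase cdbs : components) {
            if (cdbs != origin && cdbs.containsKey(extension, globalKey)) {
                System.err.println("global key violation: " + globalKey
                        + " already exists on " + cdbs.name());
                return true;   // the federation layer would now compensate or reject
            }
        }
        return false;
    }
}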
As already mentioned, an essential task of global integrity maintenance is the detection of local write operations on interdependent data. The concept we propose later in this thesis (see Chapter 5) addresses this problem using the functionality of Enhanced Active Database systems.
2.4.4 Global Transaction Management
The last challenge we discuss is the planning and execution of global transactions. Global transactions are issued against a global schema and affect data of at least two different component databases of the federated information system. In general, a global transaction is decomposed and executed as a sequence of local read and write operations (subtransactions) for each affected local database (nested distributed transactions [84]). A transaction management component in the federation layer has the responsibility to generate this decomposition and to monitor the execution of the individual local transactions. Like transactions in a single information system, global transactions in a federated information system should satisfy the following ACID properties to ensure a consistent and reliable system [84]:
Atomicity: A transaction is always executed as a single unit of operations. Either all actions of a transaction are completed or none of them. If one action cannot be completed successfully, then all other actions of that transaction must be undone. In particular, intermediate results of a transaction that has not yet completed successfully must not be visible to other transactions.
Consistency: A transaction must map one consistent database state to another and may not violate the consistency of the data stock. Therefore, a transaction is checked against existing integrity constraints and aborted if one of them is violated.
Isolation: This property requires a transaction to see a consistent database at all times. A transaction cannot reveal its result to concurrent transactions before its commitment. This ensures that a transaction does not access data that is concurrently updated by another transaction.
Durability: Once a transaction commits, its results are stored permanently in the database and can be processed by subsequent transactions. The durability of a transaction also refers to the ability of a system to recover the last consistent state after a system failure.
While these properties are well understood and guaranteed in centralized or homogeneous distributed database systems, their implementation imposes great challenges in an environment of heterogeneous data sources with a high level of autonomy. In the following we describe the three major problem fields concerning transaction management in federated systems, based on [17]:
Global Serializability: Since each participating component database may use its individual concurrency control protocol, existing serialization protocols for homogeneous distributed databases cannot be used. If the global transaction manager is unaware of local transactions, then it can only guarantee a serial execution of global transactions, which does not automatically guarantee global serializability due to indirect conflicts caused by local transactions.
Global Atomicity: The atomicity property of a transaction dictates that all subtransactions of that transaction are either committed or aborted. In a homogeneous distributed environment this is ensured using an atomic commit protocol such as 2PC (the two-phase commit protocol, whose coordinator side is sketched after this list). It requires the participating data sources to provide a prepare-to-commit state for each subtransaction and to guarantee that they remain in this state until a coordinator sends a global commit or abort. Obviously, this strongly limits the execution autonomy of the CDBSs, and not all data sources are able to provide a prepare-to-commit state. So, if execution autonomy is to be preserved, we cannot force a CDBS to export a prepare-to-commit state. However, this allows a CDBS to abort a subtransaction at any time before it is committed, resulting in non-atomic global transactions and incorrect global schedules. As stated in [77], an atomic commit in an environment of autonomous and heterogeneous components is impossible without either violating local autonomy, limiting the types of transactions allowed, or using a new or relaxed transaction/correctness model.
Global Deadlocks: The third major problem field concerns the detection and prevention of global deadlocks. In an autonomous environment where the participating component databases use locking mechanisms to ensure local serializability, there can be a sequence of subtransactions that leads to a global wait-for cycle and thus to a global deadlock. To detect and prevent deadlocks, the global transaction manager needs information about local transactions and locks. On the other hand, since the CDBSs are unwilling to exchange their control information with the federation layer (design autonomy), the global transaction manager remains unaware of global deadlocks.
Solutions to global integrity and global transaction management require information on the state of the participating data sources. As the main contribution of this thesis, we describe how the functionality of Enhanced Active Databases can be used to interact with other component databases or the federation layer to signal changes of their state and to coordinate their actions. In particular, they support immediate notifications, which makes them perfectly suitable for real-time information systems.
2.5 Common Integration Architectures
The basic architecture presented in Section 2.2 gives only a general overview of the composition of a federated information system. In detail, the design and implementation of the federation layer, including the specific interoperation strategy, the supported operations and transactions, the integration strategy, as well as the wrapper functionalities and supported data sources, may vary strongly with the intended application field. An important criterion for the classification of a federated information system is the composition of its federation layer. In the following we present three architectures for building federated information systems with different levels of distribution of their federation layer: centralized, modular, and fully decentralized.
2.5.1 Federated Database Systems
A Federated Database System as defined by [102] is a collection of autonomous but cooperating database systems that are integrated into the federation and controlled by a federated database management system (FDBMS). Components give up parts of their autonomy to participate in the federation, depending on the needs and desires of federation users and administrators. The FDBMS implements DBMS functionality of centralized or distributed database systems and controls access to and manipulation of the data on the integrated component databases. It provides location, distribution, and replication transparency and commonly supports query language access to the data. To support read-write operations, it contains a query processor and optimizer, as well as a global transaction manager to ensure global consistency while allowing concurrent updates across multiple databases. The FDBS maintains either one single or multiple federated schemas which are mapped to the local schemas of the structured sources.
illustration not visible in this excerpt
Figure 2.2: Five-level schema architecture of an FDBS[102]
Figure 2.2 depicts the well-known five-level schema architecture of an FDBS. The dashed lines between the federated schemas symbolize the option of a single or multiple federated schemas. Referring to the general architecture of a federated information system (Figure 2.1), the component and export schemas are defined and maintained by the wrapper components, providing the federation layer with a common data model and query language. The global federated schema(s) and the external schemas reside in the federation layer. The federation layer of an FDBS typically consists of a non-distributed, monolithic federation service implementing the DBMS-like functionality. The static structure of FDBSs makes it harder to add or remove components or to react to changes in their schemas.
2.5.2 Mediator-based Information Systems
One of the first descriptions of mediator systems was introduced in [103] as a "pseudo intelligent software controller which [...] mediates between an Information Retrieval System and its end-user". This basic three-layer architecture consisting of users, mediators, and data sources is also reflected in [120], where mediators are defined as small and simple active software modules that implement dynamic interface functions between users' workstations and database servers. Typical tasks performed by mediators are
- transformation and subsetting of databases,
- abstraction and generalization of underlying data,
- providing intelligent directories to information bases, and
- providing access to and merging data from multiple sources.
Each mediator implements a specific set of mediation functions using one or a few databases. A user task will most likely require multiple distinct mediators to accomplish it. Figure 2.3 depicts the basic architecture of a mediator-based information system.
A data source is typically encapsulated by a wrapper component to provide a homogeneous interface to the mediation layer. The wrappers convert query results into a common (or canonical) data model before they are sent to the mediators. The mediators communicate with one or a few wrappers to accomplish a specific mediation task that is offered to the application layer. Mediators can also use the functionality of other mediators as a data source, thus supporting a hierarchical composition of the mediation layer. Each mediator maintains its own federated schema, which is ideally represented in the common data model and supports a common query language. The users access the mediators that offer the required functionality to access the sources. The data is preprocessed using the mediation functions and returned for individual processing.
Figure 2.3: Architecture of mediator-based information systems [41, 20]
Contrary to federated database systems, mediator-based systems incorporate a modular structure and abstain from centralization. A data source can be accessed by multiple mediators, and new mediators and sources can be added to the system at any time, making the system more dynamic than FDBSs. On the other hand, mediators typically provide read-only access to their data sources, since decentralization complicates global transaction management and concurrency control.
2.5.3 Peer Data Management Systems
Peer-to-Peer (P2P) information systems or Peer Data Management Systems [49, 112] provide concepts and mechanisms for fully decentralized sharing and administration of data. P2P systems are highly dynamic and scalable, allowing autonomous and heterogeneous network nodes (peers) to join or leave the network at any time. Such systems are more flexible than mediator-based systems, since there are no central global components that have to be maintained by administrators. Peers store data that the users are willing to share with other participants. Although not necessarily required, many P2P network topologies use super-peers to increase network performance (see Figure 2.4). Super-peers are used for peer aggregation, query routing, or query mediation [80].
Since P2P systems are decentralized, there exists no global schema but a collection of pairwise mappings between peer schemas that are typically created using schema mapping languages (e.g. [50]). Contrary to FDBSs or mediator-based systems, the schema mediation does not follow a tree-like integration hierarchy with source schemas at the leaves and mediated schemas as inner nodes but an arbitrary graph of interconnected schemas. The set of mappings defines the semantic network (or topology) of the system. Queries are reformulated using the
illustration not visible in this excerpt
Figure 2.4: Architecture of Peer-to-peer information systems
schema mappings and routed to all the peers that might have answers. In turn, results are converted along the schema mappings from the remote into the local representation. Semantically related results from multiple peers are integrated either completely at the peer that issued the query or using intermediate integration results created by super-peers.
Referring to our basic architecture for federated information systems, the federation layer is fully decentralized and distributed over the participating peers. Each peer data source is wrapped to provide a homogeneous query interface, whereas pairwise connections between peers are established via P2P network interfaces. Each peer decides autonomously which data it is willing to share and maintains its own set of schema mappings. In general, a P2P system supports read-only operations with the option to cache query results locally. However, in Chapter 7 we describe an architecture for a P2P-based information system that implements a push-based replication strategy among autonomous and heterogeneous peer databases.
Chapter 3 Enhanced Active Database Systems
Enhanced Active Databases build the basis for the concepts we propose in this thesis. In this chapter we introduce Enhanced Active Database systems and present their specific functionality that contributes to solutions to common problems in federated information systems. After a definition of Enhanced Active Databases in Section 3.1, we describe the new functionalities that can be added to component databases using External Program Calls. We introduce remote state queries, injected transactions, and external notifications as three new operations of component databases that enable the interaction of component databases within a federation. A detailed description of an External Program Call is the subject of Section 3.3. We present the required components and explain the basic steps to communicate with external components. Section 3.4 discusses the effects of External Program Calls on the local autonomy of the component databases. Finally, the chapter closes with an overview of current Enhanced Active Databases that are widely used in practice to confirm the practicability of our concepts.
3.1 Definition
Traditionally, database systems have been regarded as passive data providers that manage the storage of data and respond to read and write requests issued by the users. More complex requirements regarding the integrity and consistency of the data had to be implemented in the applications themselves. But with the association of databases to highly complex information processing scenarios, with huge amounts of data or high performance requirements, database systems were extended by more comprehensive facilities to model structural and behavioral aspects of data to support the applications. Active database systems were introduced that assist applications by migrating reactive behavior from the application to the DBMS. They are able to observe special requirements of applications and react in a convenient way if necessary to preserve data consistency and integrity. The integration of active behavior in relational database systems is not particularly new, and currently most commercial database systems support ECA rules, whereas the execution of triggers is mainly activated by operations on the database (e.g. inserting or updating a tuple) rather than by user-defined operations [86]. Unfortunately, the ability to react to events, especially from within the scope of trigger conditions and actions, has until recently been limited to the isolated databases they were defined on. Subsequent developments integrated special purpose programming languages (e.g. PL/SQL [74]) into the database management system to overcome some limitations of the query language and to provide a more complex programming solution for critical applications. But again, the scope of these extensions was strictly limited to the system borders of the database system, so an interaction with its environment was impossible. However, the support of ECA rules in the form of triggers is necessary, but not sufficient for the concepts we propose here.
The significant improvement on which this work is based is the ability of modern active database systems to execute programs written in a standalone programming language as user-defined functions or stored procedures (also referred to as external routines) from within their database management systems. This enhancement takes the functionality of active databases beyond former limits. Thus, we define a new subclass of active databases as follows:
Definition 1 The ability of a database system to execute programs or methods from within its DBMS to interact with software or hardware components beyond its system border shall be called enhanced activity. A database with enhanced activity is an Enhanced Active Database System (EADBS). The execution of a program or method in this context shall be called an External Program Call (EPC).
The execution of external programs (EPs) from inside the DBMS offers new perspectives to data management and processing in a federated information system. The database herewith has access to the entire functionality of the programming language, including user-created libraries and extensions. EADBSs are active databases that are able to actively invoke methods or programs from within their database management system. An Enhanced Active Database that participates in a federation as a component database can offer its enhanced activity to improve interoperability in the federation. Which particular functionality can be provided by such component databases is presented in the next section.
3.2 Enhanced Activity
The enhanced activity allows Enhanced Active Database systems to execute external programs to interact with hardware or software components beyond the system borders of the database. In the context of federated information systems, this functionality allows component databases with enhanced activity to communicate with specific components of the federation, like, for instance, a wrapper component, a constraint or transaction manager, an event broker, or another component database or additional data source. Communication is established using the APIs and libraries provided by the programming language that is used to code the external program. In particular, we focus on the database connectivity and client-server APIs for sockets or remote procedure calls (RPC) to add the following functionalities to a component database that participates in a federated information system:
Query the state of a remote database: The main functionality, which is elementary for our approach, is the ability of a CDBS to query a remote data source directly during the execution of a database trigger. After a connection has been established by the EP, we can perform any read operation on the remote schema items we are allowed to access. Depending on the query language, we can formulate complex queries with group and aggregate functions (e.g. as in SQL). The query result of the remote database is used locally to evaluate conditions of ECA rules. We call this kind of query a remote state query (a small sketch follows at the end of this list).
Manipulating a remote database: After a connection is set up by the program, a CDBS is basically able to modify the data stock of the remote database directly during the execution of a database trigger. Assuming the appropriate permissions, any operation supported by the query language can be executed, including data insertions, updates, and deletions. Depending on the query language, a CDBS is thus basically able to modify even the schema of a remote database using, for example, ALTER TABLE statements in SQL. In the following, a manipulation of remote data or schema items from within a database trigger shall be called an injected transaction, since its execution depends on a triggering transaction on a local relation. From the point of view of the remote database, a remote state query or injected transaction is handled like a request of an ordinary application.
Notification of external components: Besides the database connectivity, we use client-server APIs of the programming language to establish a connection to a remote server component of an arbitrary software application. The database acts as a client and opens a communication channel via sockets or remote procedure calls. Thus, it is able to interact with remote applications and use their services during the execution of triggers or stored procedures. In particular, those connections are used to send notifications to external components via External Notification Programs (ENP), to either simply signal the manipulation of local data or to actually propagate the modified data items themselves.
Within recent commercial database systems, a commonly supported programming language that provides the technology we need to implement these enhanced functionalities is Java. It contains JDBC, a common database connectivity framework that provides a standardized interface for a multitude of different data sources like relational databases or even flat files. JDBC is part of the Java core API since version 1.1 and is supported by all major database manufacturers [119]. Furthermore, Java comprises APIs to establish client-server connections via sockets or Remote Method Invocation (RMI). Its platform independence is particularly useful in a heterogeneous environment such as a federated information system. Java functions can be migrated between component databases without much code rewriting. Remote state queries and injected transactions are executed via JDBC using SQL as the standard query language, bridging the heterogeneity. Although we cast the remainder of this work in the context of Java UDFs using JDBC and RMI, the concepts certainly adapt to Enhanced Active Databases supporting other programming languages that meet the requirements just mentioned.
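To make the notion of a remote state query more concrete, the following minimal Java sketch shows how such a query could be implemented as the body of a user-defined function using JDBC. The class name, connection URL, credentials, and the queried relation S are purely illustrative assumptions and not taken from a concrete system; error handling is reduced to returning -1 so that the calling trigger can decide how to proceed.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;

    public class RemoteStateQuery {

        // Body of a hypothetical UDF: counts the rows in a remote relation S
        // whose key column matches the given value, or returns -1 on failure.
        public static int countRemoteMatches(String key) {
            // Connection data for the remote CDBS; URL and credentials are
            // placeholders that depend on the concrete installation.
            String url = "jdbc:db2://remotehost:50000/REMOTEDB";
            try (Connection con = DriverManager.getConnection(url, "user", "password");
                 PreparedStatement stmt =
                         con.prepareStatement("SELECT COUNT(*) FROM S WHERE S.KEY = ?")) {
                stmt.setString(1, key);
                try (ResultSet rs = stmt.executeQuery()) {
                    rs.next();
                    // The result is returned to the calling trigger, where it is
                    // used to evaluate the trigger condition.
                    return rs.getInt(1);
                }
            } catch (Exception e) {
                // Signal the failure to the calling trigger instead of throwing,
                // so the local DBMS can decide whether to abort the transaction.
                return -1;
            }
        }
    }

Registered as a UDF, such a method can be invoked from a trigger condition to compare local and remote states before a local transaction commits; an External Notification Program would look similar but open a socket or RMI connection instead of a JDBC connection.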
3.3 External Program Calls
The enhanced functionalities like remote state queries, injected transactions, and update notifications are realized using external program calls, which are described in detail in the following. Although an external program could also be explicitly executed by a user as a stored procedure, we focus on external program calls from within database triggers as part of a database transaction. The de-facto standard for managing and querying databases is SQL, currently in version SQL-2003. Its predecessor SQL-1999 has already defined the concept of SQL-invoked routines in the form of stored procedures and user-defined functions (UDF) [6]. The standard allows both types to be defined as external routines in an external programming language like C or Java. Such routines are already supported by major database systems (e.g. Java Stored Procedures or Java UDFs [73, 78]). They can be called from triggers during their execution as part of a database transaction.
Figure 3.1 displays a schematic overview of an external program call. In general, triggers are activated by transactions that execute write operations (updates) on the data stock. In our example, an update on a relation R fires a trigger that in turn sequentially calls one or more external programs. The EPs interact with external hardware or software components, making requests and eventually waiting for responses. After the EPs terminate, the trigger returns from its call, resulting in a commit or abort of the corresponding transaction. The execution of a trigger including the external programs is typically synchronous, i.e. the DBMS holds a lock on the affected data until the trigger terminates. The concrete locking mechanism strongly depends on the implementation of the concurrency control protocol and thus varies with the database management system.
illustration not visible in this excerpt
Figure 3.1: Schematic overview of an external program call
Obviously, to use an external program practically, we must be able to pass parameters to the EP and to access the corresponding program output from inside the trigger. This output can be used to evaluate trigger conditions or to determine subsequent trigger actions. Since the EPCs are embedded directly inside the DBMS of the local system, we are able to delay or abort transactions depending on the result of an external program call. Just like common triggers that exclusively use local data to evaluate their trigger conditions, the DBMS autonomously schedules the execution of the trigger that encapsulates the EP. In particular, we do not force a component database to provide an atomic commitment protocol like 2PC (see Section 3.4 for a discussion).
We now illustrate the call of an external program using a simple example. Unfortunately, the concrete statements and mechanisms to load and register external programs in a database vary among different database products. Thus, we use Java and the database DB2 as concrete representatives to give an example of an EPC. Before an external program can be called from a trigger, it has to be loaded into the database and registered as a user-defined function or stored procedure. The program must define a specific method, procedure, or function that should be callable by the database and execute the required operations. As an example, we assume that the following Java function someClass.someFunction shall be registered in the database:
illustration not visible in this excerpt
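The listing referenced above is not reproduced in this excerpt. Judging from the surrounding description, a minimal sketch of the function could look as follows; the argument types (a string and an integer, matching the attributes A and B of the relation R used later) and the concrete computation are assumptions made for illustration only.

    // Hypothetical reconstruction of the omitted listing; names follow the text.
    public class someClass {

        // Takes two arguments and computes an integer function value.
        // In the trigger example below, a return value of -1 signals that
        // the corresponding update should be rejected.
        public static int someFunction(String arg1, int arg2) {
            if (arg1 == null || arg2 < 0) {
                return -1;
            }
            return arg1.length() + arg2;
        }
    }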
The function takes two arguments arg1 and arg2 with the given data types. It calculates an integer as a function value of the parameters. Depending on the database system, the compiled class someClass.class can be loaded directly into the database or it must be added to a Java archive first. We assume the class to be included in a jar file ep.jar, which is loaded into the database using a database-specific mechanism and registered as archive ep. Thereafter, the external function has to be registered as a Java UDF in the database. The following statement exemplarily creates a new UDF in a DB2 database and maps it to the someFunction function:
illustration not visible in this excerpt
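The registration statement is likewise not visible in this excerpt. A statement of the kind described in the text might look like the following sketch; the exact clauses and their order vary with the DB2 version and installation, and the UDF name someUDF is taken from the trigger example below.

    -- Hypothetical registration of the external Java function as a UDF in DB2.
    CREATE FUNCTION someUDF(arg1 VARCHAR(255), arg2 INTEGER)
      RETURNS INTEGER
      EXTERNAL NAME 'ep:someClass.someFunction'
      LANGUAGE JAVA
      PARAMETER STYLE JAVA
      NOT DETERMINISTIC
      NO SQL
      FENCED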
The statement specifies the signature of the function (name, arguments, return type) and additional parameters required by the database system to successfully register the function as a UDF, e.g. the language type or security settings. The UDF is mapped to the external function in the registered jar archive ep. It is now accessible from within the database and can in particular be called by a trigger. The following trigger is executed before an update occurs on the relation R(A, B), with A and B being string and integer attributes respectively:
illustration not visible in this excerpt
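The trigger definition is also omitted in this excerpt. In DB2 syntax, a trigger with the behavior described below could be sketched as follows; the trigger name and the SQLSTATE value are arbitrary placeholders.

    -- Hypothetical BEFORE UPDATE trigger on R(A, B) calling the external UDF.
    CREATE TRIGGER check_r
      NO CASCADE BEFORE UPDATE ON R
      REFERENCING NEW AS n
      FOR EACH ROW MODE DB2SQL
      WHEN (someUDF(n.A, n.B) = -1)
        SIGNAL SQLSTATE '75001' ('Update rejected by external program call')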
The trigger calls the external program someUDF for each updated row in R, while the update transaction is blocked by the DBMS and waits for the trigger to return. The update operation only commits if the function call for the updated values of A and B does not yield -1. Otherwise, an error is raised and the corresponding update transaction is aborted.
3.4 Discussion
We now discuss the restrictions to local autonomy that are induced by the execution of external programs. We refer to the well-known classification of local autonomy summarized in Section 2.3.2.
The execution of external programs clearly violates local design autonomy, since the creation of triggers and user-defined functions requires changes to the database system. Communication autonomy, as it is defined in [102], allows the CDBS to decide when and how to respond to external requests. According to this definition, EPCs do not impose restrictions on communication autonomy, since we do not force the CDBS to answer requests immediately. As we will see later (see Chapter 5), a component database may go offline during a global integrity check without compromising global integrity, if pessimistic checks are implemented.
Furthermore, a request via an EPC is initiated by the database itself without being forced to execute, which allows a CDBS to retain a high level of execution autonomy. The DBMS decides when to schedule a local transaction, the execution of a trigger, or an external program. EPs do not interfere with local serialization of transactions or concurrency control. As with local constraint checks implemented by triggers, the DBMS waits for the termination of the EP, which returns a value to commit or abort a transaction. A timeout value typically limits the time that a trigger, function, or procedure may take to execute. The decision to abort the execution of a trigger belongs exclusively to the DBMS.
As already motivated in the previous chapter, reasonable information sharing in a heterogeneous and autonomous environment demands certain arrangements and assurances among the partners. The enhanced activity allows a component database to interact with other components of a federated information system while retaining a high level of local autonomy. Using triggers, we are able to commit or abort a local transaction depending on the state of a remote data source. In particular, using Java with JDBC and SQL we can implement portable solutions to overcome heterogeneity in different types of data sources. As we will see in the next section, there is a comprehensive list of current EADBSs that support Java as a standard language for coding external procedures and functions.
3.5 Current EADBS
In this section we give a brief overview of current Enhanced Active Database systems and their supported programming languages. Commonly supported programming languages meeting the requirements of data connectivity and client-server connections are C, C++, and Java. Like Java, which comprises JDBC as a database connectivity framework, C and C++ support data connections via the ODBC (Open Database Connectivity) interface using SQL as query language. Besides, we are able to implement client-server connections, which makes C and C++ perfectly suitable for the concepts we propose in this work. In general, the external programs are packed into shared libraries (e.g. jar, so, dll) and loaded into the database using a product-specific installation routine. The following list contains database products that support triggers and user-defined functions written in at least one of the languages C, C++, or Java and can thus be classified as Enhanced Active Database systems. The list is ordered according to their market share as presented in [43] and does not claim to be complete.
Oracle Database: The object-relational database Oracle runs on various platforms and provides a comprehensive set of tools for data management [7, 73]. As an active database it supports triggers on row, statement, schema, and database level. The basic version of Oracle was developed in 1979 and has been constantly enhanced since then [82]. Since version 6.0 it comprises PL/SQL, a procedural language for database scripting, to provide a more complete programming solution for database applications. The first, although very limited, possibility of interaction with its execution environment was introduced in version 7.3 with the UTL_FILE package, allowing PL/SQL scripts to read and write external files sequentially. Access to operating system operations could be provided using the DBMS_PIPE package, which allows a PL/SQL script to put a request in a database pipe from which it can be picked up and processed by a listener written in Perl or the Oracle Call Interface (OCI). In the current version 10g, Oracle supports stored procedures and user-defined functions written in C and Java that are callable directly from within triggers. Java Stored Procedures were introduced in version 8i, which was released in 1999. Oracle comprises its own J2SE 1.4.x compliant Java Virtual Machine as well as a couple of extensions to the JDBC connectivity framework like JDBC Thin and server-side internal drivers.
IBM DB2 UDB: The IBM Universal Database 2 (DB2) was originally released in 1983 and is now available in version 8.2 [4, 11, 59]. Like Oracle, it runs on various operating system platforms and has active capabilities in the form of triggers, although only supported on row and statement level. In the current version, stored procedures and functions can be written as SQL stored procedures based on procedural extensions to the SQL language (similar to Oracle PL/SQL), or based on high-level languages on the host system, such as RPG (Report Program Generator), COBOL, C, or Java. Non-SQL procedures in languages like C and COBOL were introduced with DB2 version 6, released in 1999, whereas Java was not supported prior to version 7 from 2001. COBOL is a high-level language suitable for data processing in business applications. Initially designed for the handling of huge amounts of data stored in a specific record format, COBOL is also able to access databases directly via specific COBOL database bridges like [28].
Microsoft SQL Server: The MS SQL Server is currently in the version 2005 and, like Oracle and DB2, supports triggers and external procedures and functions [76, 106]. The first version of the SQL Server was developed for the OS/2 platform by Sybase and released in 1988. In 1994 Microsoft ended the marketing partnership with Sybase and bought a copy of the source code to independently develop their own database server designed for Windows NT. After the release of the first SQL Server version 4.3, the versions 6.0 (in 1995), 6.5 (in 1996), 7 (in 1998), and 2000 followed. The database uses its own SQL dialect Transact-SQL to manipulate the data. Stored procedures can either be a collection of Transact-SQL statements or a reference to a Microsoft .NET Framework common language runtime (CLR) method. The CLR component was integrated into SQL Server 2005 and allows the execution of stored procedures, triggers, or functions written in a compatible .NET programming language like Visual Basic .NET or Visual C#.
Connections to remote databases are established using the ADO.NET framework based on the ActiveX Data Object (ADO) technology. Furthermore, Visual Basic and C# support remote procedure calls. In the previous versions (since 6.5), SQL Server supported extended stored procedures to load and execute a function within a dynamic-link library (DLL). The development of such extended stored procedures is treated like any other DLL development. DLLs are shared objects written in C or C# that can be accessed by multiple threads at the same time. They can be called from within triggers like common Transact-SQL statements, using the data connectivity of the host language to connect to remote data sources.
Informix Dynamic Server: The Informix Dynamic Server (IDS) is the database system of the Informix company, which was taken over by IBM in 2001 [60]. The current version 10.0 runs on various platforms and is designed for online transactional processing (OLTP) applications. Like Sybase ASE, IDS descends from the Ingres relational database developed at the University of California, Berkeley. The database supports triggers on row and statement level. External functions and stored procedures in the languages C and Java have been supported by the database since its version 9.2, released in 1999. C programs are loaded into the database as DLLs or shared libraries, depending on the operating system. The support of Java requires the J/Foundation extension, which contains the Sun JVM. The mechanisms and syntax to load, register, and execute external functions and procedures are similar to those of DB2.
Sybase ASE: Sybase Adaptive Server Enterprise (ASE) 15.0 is the current version of the Sybase relational database, which was first released in 1984 as Sybase SQL Server and has carried its current name since version 11.5, released in 1997 [110, 111]. Like Informix, the database is a descendant of the Ingres database and was developed by Sybase in cooperation with Microsoft until 1994. After their marketing partnership ended in 1994, both databases were further developed independently. Due to their common history, the products share many basic foundations, particularly the SQL dialect Transact-SQL, although now in slightly different versions. The database runs on various platforms and implements active capabilities in the form of triggers. Since version 12.0, which was released in 1999, Sybase ASE supports external stored procedures that are registered in the database as common procedures but implemented by an Open Server application called XP Server. The procedural functions, written in C or a language capable of calling C functions, are loaded into the database as shared libraries like in Oracle, DB2, or Informix Dynamic Server. Also since version 12.0, the database comprises an internal Java Virtual Machine to execute Java methods as functions and stored procedures.
PostgreSQL: PostgreSQL was initially developed, again as a successor of the Ingres database, at the University of California at Berkeley [96]. In 1996 it started as an open source project and was soon replaced by a radically transformed and enhanced version known under its current name, PostgreSQL. It is currently in version 8.1 and supports per-row and per-statement triggers as well as external stored procedures and functions. They can be written in procedural languages like PL/pgSQL, PL/Tcl, PL/Perl, and PL/Python, as well as C and, since version 8.0, also Java. The trigger definition in PostgreSQL differs strongly from that of other DBMSs. Trigger events are specified in SQL, but the actual trigger action is implemented as an external trigger function, one for each trigger. The trigger is executed by a trigger manager that passes arguments to the trigger via specific trigger data structures.
This list shows that most database vendors already support external program calls and implies that the technology will most likely be included in most database products in the near future. In the following chapters we show how the enhanced functionality of Enhanced Active Databases can contribute to solving common problems in federated information systems.
Chapter 4 Active Event Notification
The integration of data sources into federated information systems is a difficult task, especially when they shall be loosely coupled to the system and retain the highest possible degree of autonomy. According to the basic architecture described in Section 2.2, a database is generally integrated using a wrapper component which encapsulates a source and provides a common interface to the federation layer. A difficult problem in such an environment with autonomous component databases is the detection of events in the integrated data sources in order to react to these events on the global level. Particularly in event-based systems of event producers and consumers, we need a mechanism to detect events in the attached sources and propagate those changes to the corresponding event processing components. A special application scenario is real-time (or zero latency [81]) data warehousing, where updates are propagated and integrated into the warehouse immediately after the update occurred. The implementation of such types of data warehouses demands real-time event delivery mechanisms for the integrated sources.
A common approach to event detection in databases is monitoring. The source is scanned at specific intervals to poll events using change extraction algorithms based on, for example, snapshots or log files. Although this method is widely used in practice, it is only able to provide periodic or deferred updates to an information system, since the Event Monitor does not know exactly when an update occurred. A truly immediate update notification requires the notification process to be integrated into the update process at the data source itself. This requires firstly that the database is able to detect and react to local events, which is the case for active databases, and secondly that the database is able to actively notify an external component about the local update. The latter functionality is provided by external program calls, which can use client-server APIs of the external programming language to open the required connections.
In this chapter we present a concept that allows Enhanced Active Databases to actively notify an external notification interface about updates in their local data stock. The concept fully exploits the enhanced activity of the database to provide an information system with immediate update notifications. We describe the interaction of the database with a Notification Interface, which is specifically designed to support Active Event Notifications invoked by Enhanced Active Databases. The concept is particularly suitable for real-time data warehousing and global integrity maintenance, supporting both asynchronous and synchronous event delivery to a monitoring component. Active Event Notifications build the basis for the concept of global integrity maintenance and the tightly coupled wrapper component introduced later in this thesis.
We start with a general overview of monitoring concepts including a description of event detection phases, concrete change capture methods, and additional change delivery options in Section 4.1. Section 4.2 summarizes related work presenting solutions to event detection in research projects and major commercial database products. Our concept of Active Event Notification for immediate synchronous and asynchronous update delivery is described in Section 4.3.
4.1 Monitoring Concepts
Before we describe our concept of Active Event Notification using Enhanced Active Databases, we briefly summarize popular monitoring techniques and properties, mainly developed in the context of data warehouses for incremental view maintenance. Updates are extracted from the sources (base relations) and sent to the data warehouse where they are incrementally integrated and stored. However, since a data warehouse can be considered as a federated information system with read-only operations on autonomous operational sources, the techniques are also applicable to other forms of federations, where the federation layer must be aware of local updates.
Which monitoring techniques are applicable to a data source strongly depends on the type and activity class of the data source. With the definition of Enhanced Active Database systems as an extended type of active databases we distinguish between three data source activity classes:
Passive Data Sources: Passive data sources are still widely used in practice, and a lot of important data is stored in flat files (CSV) or spreadsheets. With the rise of XML and the semantic web, the number of semistructured information sources grows steadily. Unstructured and semistructured information is typically stored in text files without being managed by a database management system (except XML databases like [61]). Thus, such flat files, spreadsheets, or XML files do not provide transaction management and integrity checks. This activity class also includes passive database systems which in fact comprise a database management system but do not support triggers to react to local events.
Active Databases (ADBS): Active databases comprise an integrated active mechanism to react to local events and execute integrity checks on the local data to ensure consistency. They commonly support triggers based on ECA rules, which can be set up to fire on a certain event (insert, update, or delete), evaluate a trigger condition, and determine subsequent trigger actions. With triggers we are able to implement constraint checks that involve more than one entity in the database.
Enhanced Active Databases (EADBS): Enhanced Active Databases as introduced in Chapter 3 are active databases with enhanced activity. The main difference is their ability to execute external routines written in external programming languages, which enables them to actively interact with external components in a complex way, like calling remote procedures or querying remote databases directly.
The active capabilities determine the capture methods and data delivery options that can be used to monitor the data source. Changes in data sources are typically captured using an event monitor that implements a concrete monitoring concept suitable for the underlying data source activity class.
4.1.1 The Event Monitor
Figure 4.1 depicts an Event Monitor and its interaction with other architectural components in an event-based information system. The Event Monitor can be implemented in any component that has access to the data source and wants to stay informed about events in the underlying data source. In a federated information system, the monitor is usually part of the wrapper or federation layer, depending on the concrete design and intended functionality of the system. If implemented in the wrappers, the changes are extracted from the sources directly by the wrappers and propagated to the federation layer. Otherwise, updates are extracted from the sources by the federation layer via the wrapper components.
An event monitor basically consists of a Change Capturer component and a clock to trigger periodic data extractions. The capturer knows how to access the data source and implements a specific capture method suitable for the data model and activity class of the underlying source. The captured changes are propagated to an event processor where they could, for example, be integrated into an existing data stock (like a data warehouse), replicated to another data source, or checked against integrity constraints. In many real-time scenarios, the Event Processor is a messaging system that distributes updates in a publish-subscribe fashion. Besides being invoked by the internal clock, the capture process can also be triggered by an external user or application or directly by the data source itself. To better understand the sequential execution of the event detection process using an Event Monitor, we distinguish between the following three phases:
illustration not visible in this excerpt
Figure 4.1: Interaction of the Event Monitor.
1. Notification: During the notification phase, the change capturer receives a notification that changes should be extracted from the data source. Thereupon, the capturer executes the subsequent change extraction phase. The invocation can either be triggered by a user or application, a clock, or the data source itself. The type of notification is determined by the data delivery schedule that is discussed in detail in Section 4.1.3.
2. Change Extraction: During the change extraction phase, the change capturer uses a specific capture method to identify and retrieve updates from the database. The concrete change capture technique depends on the data model and the activity class of the source, i.e. whether the source is a passive, active, or enhanced active data source. A list of popular change capture methods is presented in Section 4.1.2.
There are several approaches for change extraction in different kinds of data sources. For example, well-known solutions for passive relational or XML data sources are presented in [69] and [118] respectively, where updates are detected by comparing the updated data with a previously created snapshot. Other approaches use log files or timestamps to identify updated data items.
3. Change Propagation: After updates have been extracted from the source, they are sent to an event processor component for further processing. In a data warehouse scenario this component would most likely be a data integrator that loads the updates into the staging area of the data warehouse. In the context of other information system architectures this could be a constraint manager, a replication manager, or an event broker that processes the updates in an application-dependent way. A minimal structural sketch of these phases and components is given below.
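The following Java sketch outlines how the three phases could be tied together in an Event Monitor; all interface and class names are hypothetical and serve only to illustrate the structure described above.

    import java.util.List;

    // Hypothetical interfaces reflecting the phases described above.
    interface ChangeCapturer {
        // Change extraction phase: identify and retrieve updates from the source.
        List<String> extractChanges();
    }

    interface EventProcessor {
        // Change propagation phase: hand the extracted updates over, e.g. to a
        // data integrator, constraint manager, replication manager, or event broker.
        void process(List<String> changes);
    }

    class EventMonitor {
        private final ChangeCapturer capturer;
        private final EventProcessor processor;

        EventMonitor(ChangeCapturer capturer, EventProcessor processor) {
            this.capturer = capturer;
            this.processor = processor;
        }

        // Notification phase: invoked by a clock, a user or application,
        // or the data source itself.
        void notifyMonitor() {
            List<String> changes = capturer.extractChanges();
            if (!changes.isEmpty()) {
                processor.process(changes);
            }
        }
    }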
Before we discuss the data delivery options including the delivery schedule that determines the notification phase, we give an overview of commonly used approaches for the extraction of changes during the change extraction phase.
4.1.2 Change Capture Methods
There is a number of concepts for event detection in various types of data sources. They can basically be divided into static and incremental capture methods. While static capture is usually associated with taking snapshots of the data periodically, the incremental capture methods consider only the updated data items and not the entire data stock. Both classes have different requirements towards the capabilities of the data sources and thus entail different advantages and disadvantages. In the context of federated information systems, we assume that monitor components of the federation layer monitor the source to detect and extract local updates.
Static Capture Methods
Static capture methods are intended for monitoring sources that store a manageable amount of data. They are less suitable for sources with high data load, since the data capture process could significantly influence their performance, especially if they store large data sets. However, except for the file comparison capture, the techniques are rather simple and thus easy to implement.
Static Capture: The idea of this simple monitoring technique is to periodically take a snapshot of the entire base relation in the source and load it into the data warehouse. The information in the warehouse can either be replaced by the snapshot or the data can be appended to an existing table. Thus, the warehouse either holds a replica of the base relation at a given time or an accumulation of all data items over a certain time period. Unless the federation layer needs to maintain copies of the data as required for data warehousing, this capture method is rather inappropriate. In general, the federation layer needs to be provided with the updated records only rather than the entire data stock. These changes can be computed using a snapshot copy maintained by the federation layer (see file comparison capture below).
Timestamp Capture: The timestamp capture method assumes that every record contains information about the time at which it was last updated. This temporal information can be used to select updates from the base relations. The monitors select only those records which have changed since the last scan. This capture method is independent of the database type but does not capture all changes of state of the records that occur in the time period between two scans. Furthermore, deleted records are not considered unless they are marked as deleted in the base relation and purged from the database after the capturing process. Thus, the timestamp capture is not applicable in federated information systems if the federation layer must be aware of all changes of state (as required for global integrity maintenance) or if the sources do not implement a marked-as-deleted status for deleted records.
File Comparison Capture: The file comparison capture (also known as snapshot differential method [69]) detects updates in the base relation by comparing it with a previously taken snapshot. The records in the base relation are compared to the entries in the snapshot to reveal inserted tuples, which are not present in the snapshot, and deleted tuples, which only exist in the snapshot. Records existing in both the base relation and the snapshot have to be scanned for changed attribute values to identify data updates. Typically, to speed up this comparison operation, the snapshot stores only the key attributes together with a hash value that is calculated from the attribute values of the record. If the hash value calculated from the base relation does not match the corresponding hash value in the snapshot, then the record has changed and is identified as an update (a small sketch of this comparison follows below). The snapshot management and record comparison have to be implemented in the federation layer, and the snapshot should ideally be stored in an additional repository maintained by the federation layer to preserve the design autonomy of the source. Popular representatives of file comparison methods are the comparison method for unstructured strings [57], relational and hierarchically structured data as presented in [69] and [23] respectively, or algorithms for the comparison of semi-structured data like XML files as described in [27, 118].
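The hash-based comparison mentioned above can be sketched in Java as follows; the data structures are deliberately simplified (keys and concatenated attribute values as strings), and hashCode() merely stands in for the checksum function used in practice.

    import java.util.HashSet;
    import java.util.Map;
    import java.util.Set;

    public class SnapshotDiff {

        // Container for the detected changes (keys only, for brevity).
        public static class Delta {
            public final Set<String> inserted = new HashSet<>();
            public final Set<String> deleted  = new HashSet<>();
            public final Set<String> updated  = new HashSet<>();
        }

        // Compares the current base relation (key -> concatenated attribute
        // values) with a snapshot that stores only key -> hash value.
        public static Delta compare(Map<String, String> baseRelation,
                                    Map<String, Integer> snapshot) {
            Delta delta = new Delta();
            for (Map.Entry<String, String> row : baseRelation.entrySet()) {
                Integer oldHash = snapshot.get(row.getKey());
                if (oldHash == null) {
                    delta.inserted.add(row.getKey());          // not in snapshot
                } else if (oldHash != row.getValue().hashCode()) {
                    delta.updated.add(row.getKey());           // hash mismatch
                }
            }
            for (String key : snapshot.keySet()) {
                if (!baseRelation.containsKey(key)) {
                    delta.deleted.add(key);                    // only in snapshot
                }
            }
            return delta;
        }
    }

After the comparison, the snapshot is refreshed with the new keys and hash values so that the next run detects only subsequent changes.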
Incremental Capture Methods
Incremental capture methods provide the integration layer with updates without taking a look at the entire base relation. Updates are stored in a persistent area (delta sets [46]) where they are captured by the monitor and deleted after processing. Thus, unlike with static capture methods, all changes of state of the records are recorded and accessible by the monitor. Incremental methods are closely tied to the capabilities of the source (DBMS) and more complex than static methods. They are particularly suitable for data sets where the amount of changes in a given update window is significantly smaller than the size of the entire base relation.
Application-assisted Capture: This incremental capture mode is implemented in the applications that modify the local data sources. Every time an update operation is executed by the application on the local database, it concurrently writes the changes to a persistent delta set (file, database table, etc.) for further processing. The updates can be polled from the storage area immediately. The method has several drawbacks: all applications that modify the data source have to be recoded to write the changes to delta sets. Furthermore, an application might only maintain key information of the records, whereas additional information is added by the database (e.g. a timestamp or other default values). This requires the application to fetch the entire record from the source before writing it to the delta set. In the context of federated information systems, an implementation of the application-assisted capture method requires the recoding of all local applications that modify data on the component databases to maintain delta sets. These can then be retrieved periodically by a monitor in the federation layer to capture the data updates. Like the timestamp technique, this method is independent of the database type but is difficult to apply to legacy systems.
Transaction Log Capture: This capture method exploits the logging and recovery capabilities of database management systems to capture updates. The DBMSs maintain these log files for transaction management and system recovery, so they in particular store all the information about write operations on the database. The files can be monitored to periodically extract changes of interest without limiting source performance. The main drawbacks of this method are that the monitor must be able to identify and extract only information that has already been committed by the database and that the transaction logs must be available until all changes of interest have been captured. Obviously, this method is only applicable to sources with a DBMS, and the log files must be accessible by the federation layer. This could cause security problems, since log file access enables the federation layer to basically read all local transactions, including those that should not be visible to the federation. The transaction log-based capture method is also the basis for popular database replication techniques, so replication-based monitoring approaches are basically represented by this method.
Trigger-based Capture: The last capture method we discuss is based on the active capabilities of active databases and uses triggers to maintain the delta sets which are typically stored directly in the database. A trigger is invoked by a certain condition or event and writes a copy of the changes of interest to the delta set. The monitor can now retrieve the updates from the delta sets similar to the application-assisted approach. Since the invocations of the triggers significantly affect the system performance, this method should only be applied to sources that are capable of processing the expected number of events. The use of triggers implies a significant limitation of design and execution autonomy. The database administrators must agree to set up the triggers and store delta sets in their database.
The main advantage is that the delta sets are maintained directly by the database, independently of local and global applications, and contain only changes that have already been committed by the database. A sketch of such a delta-maintaining trigger is given below.
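The delta-set maintenance mentioned above could, for instance, be realized with a definition like the following DB2-style sketch; the delta table layout and all names are illustrative, and comparable definitions are possible in the other systems discussed in Section 3.5.

    -- Hypothetical delta set and maintenance trigger for a relation R(A, B).
    CREATE TABLE r_delta (
      op        CHAR(1)      NOT NULL,                 -- 'I', 'U', or 'D'
      a         VARCHAR(255),
      b         INTEGER,
      captured  TIMESTAMP    NOT NULL DEFAULT CURRENT TIMESTAMP
    );

    CREATE TRIGGER capture_r_update
      AFTER UPDATE ON R
      REFERENCING NEW AS n
      FOR EACH ROW MODE DB2SQL
      INSERT INTO r_delta (op, a, b) VALUES ('U', n.A, n.B)

The monitor then periodically reads and purges the delta table, similar to the retrieval step in the application-assisted approach.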
The capture methods only describe the requirements and algorithms to retrieve updated records from the sources. The interaction of the Event Monitor with the source and other event consuming components is defined by the delivery options described next.
4.1.3 Data Delivery Options
Besides the capture methods that define the techniques to identify and extract updated information from the sources, there are additional delivery options that must be considered during source monitoring. Data delivery options define the interaction of the Event Monitor with the data source and the Event Processor, and significantly influence the behavior of the entire information system. The options we discuss here are the delivery schedule, which determines the notification phase and therewith the freshness of the system, the delivery mode, which differentiates between push- and pull-based data delivery, and the coupling mode of the notification process and the local write operation. The descriptions in the following subsections are mainly based on [38].
Coupling Mode
The first delivery option that significantly influences the behavior of the entire system is the coupling mode. It defines the way of interaction between a local transaction process that modifies data, and the change extraction process executed by the data capturer. The coupling between these two processes can either be asynchronous or synchronous:
Asynchronous: The change extraction process is completely detached from the update operation process. The update operation is not blocked while changes are extracted and transferred to the Event Processor. Changes must be committed to the source before the change extraction is started. Since the extraction is detached from the update, there is typically a latency between the time the changes are committed and the time they are extracted and propagated to the Event Processor.
Synchronous: The change capture process is executed as part of the update process. The DBMS blocks the local transaction during the change extraction and event processing phase. Afterwards, the locks held by the local transaction are released again.
[...]