How to Integrate Data | Data Integration Primer | Resources | TPM | Federal Highway Administration
Posted on: 22 May 2015
Data Integration Primer
How to Integrate Data
Comprehensive Transportation Asset Management (TAM) depends upon the availability of fully integrated data. The process of integrating data is complex and can be quite challenging. This is especially true when organizations are used to standalone records or database systems that rarely communicate with each other.
Where to begin? A thorough analysis of an agency’s Transportation Asset Management activities is the ideal place to start. This helps an organization pinpoint needs, priorities, and existing capabilities for data integration. Before the analysis begins, however, it is wise to establish a data integration team consisting of all stakeholders in the TAM and data management processes.
The Team
Figure 2 provides a general outline of the key activities in the Data Integration Process, along with things to consider for each activity. Analyzing requirements is the first step in the process, followed by data and process flow modeling, then the definition, evaluation, and selection of alternatives. After this, database design and specification can be pursued. Finally, the chosen database integration strategy can be developed, tested, and implemented.
Figure 2: The Data Integration Process
Evaluating the cost of the data integration process depends upon several key factors, including the availability of existing:
- Location referencing systems (LRS): Is a standard LRS being used by the agency?
- Geographic information systems (GIS) tools: Is the agency already using GIS databases and software?
- Quantity and quality of data: Do new data items need to be collected?
- Management systems: What systems are already in place for managing pavements, bridges, safety, signs, pavement markings, etc.?
- Hardware and software: Are legacy components sufficient to support the task of integrating large and complex sets of information?
Change is a constant in today’s transportation agency. Maintaining momentum during organizational shifts is a key ingredient for the success of a long-term process such as data integration. In fact, agencies may wish to leverage windows of opportunity presented when top management changes or new requirements are mandated by the Federal government regarding Transportation Asset Management. Once established, the data integration system quickly becomes so integrated into business processes that future changes in the management or budget environment are unlikely to undermine support for it, or its value.
Requirements Analysis
The philosopher Plato famously noted that the beginning is the most important part of the work. The same is said of data integration, whose most important stage occurs at the beginning: the requirements analysis. Depending upon the size and extent of the integration, this can be a complex and time-intensive step. Several areas must be examined to develop criteria for the best integration strategy.
Business Processes
Integrated databases can support a variety of functions, typically: inventory, data handling, decision-support processes, and systems for creating, acquiring, or maintaining pavements, bridges, tunnels, roadway hardware, equipment, and other physical transportation assets. To begin the requirements analysis, each business process is characterized according to the types of information it uses and produces, and the individuals who must be involved in it. A system to support sign inventory and condition assessment, for instance, would identify key types of related information. These might include location, sign type and reflectivity, sign maintenance history, sign age, and the staff involved in assessing these data items (e.g., field crews, sign managers, and district and headquarters maintenance managers).
User Requirements
In any system supported by data integration, the requirements of data users such as field, technical, and management staff must be considered. Cooperation and involvement at all staff levels is critical to a successful integration strategy.
Requirements analysis includes ascertaining from a variety of staff where and how they obtain data, the business processes and information systems supported by that data, and any concerns they have about integrating databases. The very act of collecting this preliminary information helps to promote cooperation. Every data user must see that the strategy includes information relevant to his or her requirements and beneficial to the work at hand.
Organizational Characteristics
Each transportation organization is unique, and a requirements analysis should reflect the individual agency’s characteristics. This includes recognition of the various groups that will be impacted by data integration. Each group’s business process needs to be understood, along with factors such as the relationships between and within groups, staff skills and capabilities, availability of staff to collect additional data, and how receptive staff members are to data integration (how much they feel it will improve their effectiveness). The broad operational climate of the agency must also be taken into account. Is decision making in the organization, by nature and practice, centralized or decentralized? How can an integrated data system best support either framework?
It is critical to recognize these realities and involve all stakeholders in, first, evaluating the optimum process for integrating the agency’s data, then, migrating the data from traditional information structures to the integrated environment, and finally, testing and using the new data system. This creates the maximum level of trust, cooperation, and enthusiasm for the benefits of data integration. Full stakeholder involvement drives the highest possible return on the integration investment.
Information Systems Infrastructure
One critical area served by optimal stakeholder buy-in is the mapping of current information systems infrastructure. A clear view of the current picture helps the agency determine which software, hardware, and communications strategies will be required to integrate databases. From this analysis, the agency can then gauge its level of readiness for data integration. Most importantly, the analysis helps identify which potential data integration strategy can best marry the existing resources with the new infrastructure.
Useful information at this stage of the process includes an inventory of existing computer programming environments and database management or mapping software or servers, as well as computer hardware and operating systems.
Software systems are a particularly important component in planning the most efficient collection and reorganization of current data into the new structure. Many agencies use GIS software to manage a wide range of data inputs. It is important to ask what other software platforms contain information that must be identified, understood in context, and eventually harvested to build the integrated system.
Database and Database Management Characteristics
Key questions to ask when analyzing existing data and database systems within an agency might include:
- Where do the data come from and who collects them?
- How often, and how, are the data collected?
- What reference system or systems are used?
- What is the structure, format, and size of the data?
- How are the data currently transmitted, processed and stored?
- What is the general quality of the data? Are they accurate? Complete? Recent? Unique or redundant?
- How are the data used, and in what business processes?
- What applications draw data from the databases (e.g. bridge management system, pavement management system)?
- What types of reports are produced currently? What types are needed?
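Several of the data quality questions above (complete, recent, unique or redundant) can be answered with a simple profiling pass over a data extract. The following is a minimal sketch only: the function `profile_records`, the field names, and the sample sign records are all hypothetical and invented for illustration, not drawn from any agency system.

```python
from datetime import date


def profile_records(records, key_field, required_fields, date_field, as_of, max_age_days):
    """Summarize completeness, redundancy, and recency for a list of dict records."""
    # Completeness: records with every required field populated.
    complete = sum(
        all(rec.get(f) not in (None, "") for f in required_fields) for rec in records
    )
    # Redundancy: records whose key value has already been seen.
    seen, duplicates = set(), 0
    for rec in records:
        key = rec.get(key_field)
        if key in seen:
            duplicates += 1
        seen.add(key)
    # Recency: records collected within max_age_days of the as_of date.
    recent = sum(
        rec.get(date_field) is not None
        and (as_of - rec[date_field]).days <= max_age_days
        for rec in records
    )
    return {
        "total": len(records),
        "complete": complete,
        "duplicates": duplicates,
        "recent": recent,
    }


# Hypothetical sign inventory rows (assumed values, for illustration only).
signs = [
    {"sign_id": "S-1", "route": "SR-10", "milepost": 1.2, "inspected": date(2015, 3, 1)},
    {"sign_id": "S-2", "route": "SR-10", "milepost": None, "inspected": date(2009, 6, 15)},
    {"sign_id": "S-1", "route": "SR-10", "milepost": 1.2, "inspected": date(2015, 3, 1)},
]

report = profile_records(
    signs,
    key_field="sign_id",
    required_fields=["sign_id", "route", "milepost"],
    date_field="inspected",
    as_of=date(2015, 5, 22),
    max_age_days=730,
)
print(report)
```

Accuracy is deliberately left out of the sketch, since checking it generally requires comparison against field observations or another trusted source.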
Data and Process Flow Modeling
The objective of data and process flow modeling is to create a picture of the relationships between information and the business functions that the information supports. Data flow diagrams help database engineers and analysts determine the design specifications for the data integration system. All data and business processes identified in the requirements analysis can be captured in flow diagrams. A variety of software products exist to support this function.
To understand how data flows through an organization or agency division, analysts must know who collects the data, where it is stored, who uses it, and what levels of access users need (i.e. whether they need to modify, to view, or to update the data). It is also important to ascertain who owns the data, to provide guidelines or structure for its stewardship, and to establish a system of governance that protects the integrity of the data.
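The facts gathered in this step (who collects the data, where it is stored, who owns it, and what each user group may do with it) can be recorded in a simple catalog entry per data set. The sketch below is one possible shape for such an entry, under stated assumptions: the class `DataSetEntry`, its field names, and the bridge inventory values are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class DataSetEntry:
    """One catalog entry describing how a data set flows through the agency."""
    name: str
    collected_by: str
    stored_in: str
    steward: str  # group accountable for the integrity of the data
    # Maps a user group to the set of actions it is allowed to perform.
    access: dict = field(default_factory=dict)

    def can(self, group, action):
        """Return True if the group has been granted the given action."""
        return action in self.access.get(group, set())


# Hypothetical entry (assumed values, for illustration only).
bridge_inventory = DataSetEntry(
    name="bridge_inventory",
    collected_by="bridge inspection crews",
    stored_in="bridge management system",
    steward="bridge division",
    access={
        "bridge division": {"view", "update", "modify"},
        "planning": {"view"},
    },
)

print(bridge_inventory.can("planning", "view"))    # planning may read the data
print(bridge_inventory.can("planning", "update"))  # but may not change it
```

Groups that are not listed get no access by default, which keeps the steward in control of who can change the data.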
The current type, status, form, location, and uses of this information are first examined to determine data integration needs and opportunities. A path is mapped that allows specific information within each location or category to be accessed. Information extracted from a Bridge Inventory Record, for instance, will be maintained according to a protocol that allows it to relate to and be accessible across a wide platform of users for a broad set of purposes. In a well-designed data integration system, inventory information might flow smoothly into a set of data that helps staff assess the structure’s condition, leading to decisions about maintenance and rehabilitation. In a less cohesive environment, these categories of information are unlikely to be aligned in such a way that they can be readily synthesized to produce the most accurate picture from which to make sound decisions.
In the bridge example shown in Figure 3, basic information required to monitor the status of a bridge structure might reside in several general locations managed by a variety of departments.
Figure 3: Data and Process Flow: A Bridge Example
Alternatives Definition, Evaluation, and Selection
Once requirements are analyzed and the flow of data is diagrammed, feasible integration alternatives can be identified. Two general approaches are available: fused databases and interoperable databases.
Data fusion (also known as data warehousing) combines information from multiple sources in a one-time effort to make it accessible for data integration. The sources of fused data can be eliminated once the data is migrated to a centralized location, or they can continue to exist independently to serve various business processes. Ultimately, all fused data reside in a single database server with substantial processing and data storage capacity. All personal computers and terminals go through this server to access the data and perform the functions supported by the data warehouse.
Interoperable databases (also known as federated or distributed systems) consist of a series of data sources that communicate among themselves through a multi-database query. This requires a new interface through which a data source, such as an existing database, can be viewed and manipulated. In this environment, data reside in computers or database servers located in a variety of places, but each is linked through a computer network and viewed via the master interface. With interoperable databases, one computer can access or add to another’s information.
Figure 4 shows how each of these options supplies access to data in response to a specific question. In this example, an integrated data system helps the agency conduct a proactive safety analysis.
Figure 4: Data Integration Alternatives
Fused Databases
In data fusion, the data is gathered, cleaned to remove inconsistencies, and exported to a centralized database. There it is stored in a format that replicates the way the data would be viewed in the source location. This allows users access to vast stores of data. The data fusion, or warehousing, program relies on a common user interface to organize the relevant subsets of all component databases from which it is fed, and specifies the rules for fusing the data it acquires.
Often, this requires converting a database and its applications from one format to another. Data are then shipped from the legacy system to the new one, using data reengineering or other integration methods. An agency can choose to continue to use the data in the old format after it is made available to the centralized data warehouse, or the old infrastructure can be abandoned when the warehouse is complete.
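The gather, clean, and export sequence described above can be sketched with Python's standard `sqlite3` module standing in for the centralized warehouse. This is an illustrative sketch, not a recommended design: the table `asset_condition`, the helper names, and the legacy rows are hypothetical, and a real warehouse would involve far more extensive cleaning rules.

```python
import sqlite3

# Hypothetical legacy extracts with inconsistent formats (assumed values).
pavement_rows = [("sr-10", "12.5", "Good"), ("SR-10 ", "13.0", "poor")]
bridge_rows = [{"Route": "SR-10", "MP": 12.7, "Rating": "FAIR"}]


def clean(route, milepost, condition):
    """Normalize one record to the warehouse's common format."""
    return (route.strip().upper(), float(milepost), condition.strip().lower())


def build_warehouse(pavement, bridges):
    """Load cleaned records from both sources into one central table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE asset_condition ("
        "asset_type TEXT, route TEXT, milepost REAL, condition TEXT)"
    )
    rows = [("pavement", *clean(r, mp, c)) for r, mp, c in pavement]
    rows += [("bridge", *clean(b["Route"], b["MP"], b["Rating"])) for b in bridges]
    conn.executemany("INSERT INTO asset_condition VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    return conn


warehouse = build_warehouse(pavement_rows, bridge_rows)
# One query now spans data that originated in two separate systems.
result = warehouse.execute(
    "SELECT asset_type, milepost, condition FROM asset_condition "
    "WHERE route = 'SR-10' ORDER BY milepost"
).fetchall()
print(result)
```

The source lists are left untouched here, which mirrors the option of keeping the legacy systems in use after the warehouse is loaded.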
Regardless of the method used to achieve data fusion, the database management system is key: it must be able to handle the accumulation and management of a large amount of data while still ensuring that it can be accessed quickly and easily. In this sense, the interface with a fused database is something like using an Internet search engine to learn about a given topic: it delivers rapid access to information from a variety of sources, without requiring the user to have advance knowledge of each of those sources.
When fusing data, the variety of databases or formats, as well as sources and applications, can make it difficult to ensure the integrity of the information in each database. This complicates the task of mapping the movement of data from old systems to the new one, and it is here that skilled database managers and information technology professionals, whether agency staff or consultants, provide critical solutions.
There is no single approach to data fusion that will meet all agencies’ needs. The approach to data warehousing will continue to evolve as agency experience and technology advance.
Peer Perspectives
Arizona DOT faced technical, cultural, and business process challenges in its data integration effort. The agency chose a data warehousing approach, and pulling data from many sources into a single repository exposed quality issues and data disconnects that had to be addressed at the source. As a result, the agency’s strategy targeted cultural and process issues concurrently with technology changes. (15)
Interoperable Databases
As the name implies, an interoperable, or federated, database approach is one in which a variety of databases are linked through a communications network so that all appear to form a single source. Users can access and manipulate data from a variety of original sources, without harming the integrity of each source, and without having to learn the data model or write their transactions in the language of the source. Planners, for instance, can easily access and apply to their processes information from environmental engineers within their agency whose data might be maintained in a different format.
Interoperable databases allow a user to make a query without concern for where the data resides or how it is organized at its source location. A federated view is somewhat akin to shopping online at a department store website: it provides access to the product a user seeks without that person having to search for or be familiar with the universe of possible suppliers.
A federated view is created when an appropriate data interface is set up to link individual databases. The integrated format hides the complexity and distribution of the underlying component databases. Such a system can support different data models and execute transactions written in various data languages. It does not require the migration of all agency data into a single format, so it leaves intact the complexity and depth of data at its source.
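One way to picture such an interface is a thin layer of adapters, one per source, behind a single query method. The sketch below assumes two hypothetical sources: a bridge database held in SQLite and pavement records held as Python dictionaries. The class names, table layout, and field names are illustrative assumptions, and a production federation would also need to handle networking, security, and performance tuning.

```python
import sqlite3


class SqlSource:
    """Adapter for a source that stays in its own SQLite database."""

    def __init__(self, conn):
        self.conn = conn

    def find(self, route):
        cur = self.conn.execute(
            "SELECT bridge_no, rating FROM bridges WHERE route = ?", (route,)
        )
        return [{"asset": f"bridge {no}", "condition": rating} for no, rating in cur]


class ListSource:
    """Adapter for a source that keeps records as in-memory dictionaries."""

    def __init__(self, records):
        self.records = records

    def find(self, route):
        return [
            {"asset": f"pavement section {r['section']}", "condition": r["pci_band"]}
            for r in self.records
            if r["rte"] == route
        ]


class FederatedView:
    """Single interface that fans a query out to every registered source."""

    def __init__(self, sources):
        self.sources = sources

    def find(self, route):
        results = []
        for source in self.sources:
            results.extend(source.find(route))
        return results


# Hypothetical sources, each left in its own format (assumed values).
bridge_db = sqlite3.connect(":memory:")
bridge_db.execute("CREATE TABLE bridges (bridge_no TEXT, route TEXT, rating TEXT)")
bridge_db.execute("INSERT INTO bridges VALUES ('B-7', 'SR-10', 'fair')")
pavement_records = [
    {"section": "P-3", "rte": "SR-10", "pci_band": "good"},
    {"section": "P-9", "rte": "SR-22", "pci_band": "poor"},
]

view = FederatedView([SqlSource(bridge_db), ListSource(pavement_records)])
answer = view.find("SR-10")
print(answer)
```

The caller asks one question in one vocabulary; each adapter translates it into the terms of its own source, and no data is migrated.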
Decentralized agencies are a natural fit for interoperable databases. While information may be structured consistently within a division, there is often great variation in data management formats across divisions. An interoperable approach works well when multiple well-organized subject databases exist (pavement, bridges, etc.), but there is a substantial need for access to the data for agency-wide applications, such as maintenance management.
In short, the advantages of interoperable over fused (or centralized) database systems are that they provide:
- Easier access to resources on the network
- Improved database availability
- The ability to share data widely without relinquishing local database control
It is easy to imagine the myriad complications of developing a platform in which a wide variety of data sources converse in a universal language. In fact, the disadvantages of interoperable databases include:
- The difficulty of maintaining a functioning global model when thousands of source databases are involved (along with their query dialects, variations in supported functions, periodic updates, and differing data types and database versions)
- The expertise required to configure such a complex interface
- The ongoing tuning necessary to maintain acceptable performance of such a system
Federation does, however, allow agencies to preserve their investment in legacy systems while substantially improving data sharing capabilities. This, in turn, improves access to the information needed to maximize service to the highway customer.
Evaluation and Selection of Alternatives
Table 2 provides a quick reference guide to the chief advantages and disadvantages of fused and interoperable databases. More generally, however, agencies will want to consider four key elements when evaluating integration alternatives:
- What is the required level of effort to develop either approach?
- How much time is involved in moving to this type of system?
- What is the estimated cost of adopting this system, including the risks?
- What are the benefits or improvements the agency anticipates from implementing the chosen system?
Additional evaluation factors might arise from the requirements analysis, including the identification of unique agency needs.
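The four elements above can be combined in a simple weighted scoring exercise. The sketch below is purely illustrative: the weights and the 1-to-5 ratings are placeholder values chosen to show the arithmetic, not an assessment of either approach, and each agency would substitute its own criteria and judgments.

```python
def score_alternative(ratings, weights):
    """Weighted average of ratings, where a higher rating is always better."""
    total_weight = sum(weights.values())
    return sum(ratings[c] * w for c, w in weights.items()) / total_weight


# Placeholder weights: how much each element matters to the agency (assumed).
weights = {"effort": 1, "time": 1, "cost": 2, "benefit": 2}

# Placeholder ratings on a 1-to-5 scale, 5 being most favorable (assumed).
ratings = {
    "fused": {"effort": 2, "time": 2, "cost": 3, "benefit": 4},
    "interoperable": {"effort": 3, "time": 4, "cost": 3, "benefit": 3},
}

scores = {name: score_alternative(r, weights) for name, r in ratings.items()}
for name, score in scores.items():
    print(f"{name}: {score:.2f}")
```

Additional factors identified in the requirements analysis can be added as further keys in `weights` and in each alternative's ratings.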
Table 2: Comparison of Fused and Interoperable Databases