LANDMARK TECHNICAL PAPER

The Challenges of Integrating Structured and Unstructured Data

By Jeffrey W. Pferd, PhD, Sr. Vice President Strategic Consulting Practice at Petris

Presented at the 14th Petroleum Network Education Conference (PNEC), 2010
Introduction

All organizations are aware that a considerable amount of technical and business information and knowledge resides both in structured databases and in unstructured repositories (e.g., documents and emails). Simply enabling independent searches of these does not produce the most value. Valuable conclusions are recorded in reports that were developed from investigating structured data, and these are lost from view when searching only databases. Together, the two sources provide the facts and the conclusions. When the searches are combined, however, they produce a plethora of information. Usability and workflows are important design considerations in achieving clear, quick access to multiple disparate information sources.

The power of combined access to integrated structured and unstructured data is becoming the focus of many international oil and gas companies. This paper describes the technical and usability issues of integrating disparate sources into a single, clear information environment.

Background

The E&P scientific and technical business is information and data intensive. As the computerization of work has progressed, more and more of the information has migrated to electronic form. Initial data management focused on the workgroup, with each discipline dealing with its own area of expertise. The first efforts to manage data beyond the workgroup concentrated on shared structured data, such as well logs and seismic data. Interpretation results, such as horizons, picks and faults, were shared next. Recognizing that project speed and quality would increase by sharing data, organizations made an effort to define massive multidisciplinary data models. These persist today and stand alongside the active project data stores.
Most of these large-scale repositories are used to store the raw incoming information that feeds the interpretation systems.

At the same time, document management systems were growing. They started on the business side and migrated toward the scientific and technical side by storing interpretation reports. Over time, emails and the documents located on shared and personal computer disks began to hold larger and larger volumes of important and valuable information. We conducted an informal web survey among a number of E&P data management staff and consultants, asking where they believed valuable information resides.

Author note: Landmark Software & Services acquired Petris in 2012, at which time the author of this paper joined as Sr. Technical Advisor for Information Management.
[Figure 1 - Informal Poll Results. Bar chart of responses to the question "Where do you think the information with the biggest value for your organization hides?" Categories: Application Databases; DCM (Document, FileNET, etc.); Wild Files (S-drives, etc.); Corporate Databases; SharePoint and the like. Bar values shown in the chart: 42%, 30%, 17%, 2%, 7%.]

Scientific and technical users believe that traditional repositories hold valuable information. But they also believe that the unstructured, ad hoc repositories of SharePoint and wild files hold nearly as much valuable information. Many people understand that most of the knowledge dialog in our organizations takes place in the emails and casual documents that are written and exchanged on a daily basis. These are not stored in databases, nor indexed in document management systems. They are commonly stored on shared drives, email servers and file backups. This information is usually also found on individual PCs and desktop devices.

Significant knowledge is stored in this mass of unstructured material. There are questions and answers, photos and diagrams, short statements and multipage reports. All are produced with expertise and effort. These repositories are becoming the new targets for mining valuable information in an organization.

There are good reasons to include the unstructured information in an enterprise data management solution. These files contain interpretations, descriptions and decisions. The structured data stores contain mostly raw primary data, such as well logs and seismic traces. One could say that the unstructured data stores hold the intellectual capital and the structured data stores hold the valuable, basic factual data. Integrating these two information sources into the enterprise data management strategy therefore makes a lot of sense.

Extracting Information from Unstructured Data Stores

Increasingly capable technical solutions are enabling access to these attractive knowledge stores. Systems can access the email and documents.
Systems can extract patterned information out of them, such as zip codes, addresses and telephone numbers. Search engines that operate in a Google-like manner have been implemented and are enabling the delivery of hundreds, if not thousands, of documents to the desktop.

Just indexing the documents and files can be a daunting task and is frequently a roadblock to success in these projects. Great taxonomies are built and then languish on paper because classifying each document is too time consuming and costly. Fortunately, new technologies make it possible to use rules to apply standard taxonomies to the documents and electronic records. However, propagating this taxonomy across the enterprise, including structured data, requires a unique architecture.
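As a rough illustration of the rule-based classification described above, the following Python sketch applies taxonomy terms to a document whenever a simple keyword rule matches its text. The taxonomy terms, the rule patterns and the function name are hypothetical and chosen only for illustration; they do not describe any particular product or the rule engine used in the systems discussed here.

```python
import re

# Hypothetical taxonomy rules: each term maps to regular expressions that,
# when found in a document's text, cause the term to be applied.
TAXONOMY_RULES = {
    "Well Log": [r"\bwell logs?\b", r"\bgamma ray\b"],
    "Seismic": [r"\bseismic\b", r"\bshot points?\b"],
    "Production": [r"\b(daily|monthly) production\b"],
}


def classify(text, rules=TAXONOMY_RULES):
    """Return the sorted list of taxonomy terms whose rules match the text."""
    matched = set()
    for term, patterns in rules.items():
        # A term is applied as soon as any one of its patterns matches.
        if any(re.search(p, text, flags=re.IGNORECASE) for p in patterns):
            matched.add(term)
    return sorted(matched)


if __name__ == "__main__":
    memo = "Attached are the well logs and the monthly production summary."
    print(classify(memo))
```

Keeping the rules in a plain data structure, separate from the matching code, is one way a taxonomy could be maintained by subject experts without changes to the software.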
Problems of Integrated Searches

We and others have taken steps to include structured data records in these searches and have added map displays of the location of all these information records. Now a keyword search or a map search can bring a massive number of documents, emails and database records to the desktop.

We now have a different problem. We have found information; now how do we find meaning and value? How can we navigate through the massive piles of files? How can we be informed of information being developed right now? The emails never stop. The presentations do not get fewer. We rarely get rid of data; rather, we obtain more. The tsunami continues!

The capability to deliver to the desktop is outstripping the ability of end users to extract value. This is becoming the most pressing of our challenges. Just getting information to the desktop does not streamline decision-making, isolate the most relevant information for the problem at hand, or deliver timely insights.

Search Results, Problems, and Information Overload

We are faced with a number of issues when working with both structured and unstructured sources. They are presented as different media. The search cycle that we use may differ at different times in our work efforts. We will get many results from a search and are not assured that we are getting the most current or relevant ones.

Mitigation Strategies

The problems faced in the enterprise integration of structured and unstructured data require multiple approaches to resolution. Some focus on understanding the search patterns, others on deciding what is included, others on aggregation techniques and still others on the presentation mechanisms. We describe these individually below.
Understanding Search Patterns

We find that enterprise search is not the same for all people, and not the same for people who are in different stages of a project or operation.

[Figure 2 - Searching Lifecycles. Query life cycle stages: Project Initiation, Within Project, Concluding Project. Functions, in the order shown: Seeking, Finding, Filtering, Displaying/Analyzing, Return to found set, Displaying, Deciding, Assembling, Deciding, Assembling, Presenting, Archiving.]
During the life of a project, the functions needed by a user change from stage to stage. Each function can be optimized for each episode. For example, Seeking can be enriched by adding both attribute and spatial selection functions. Returning to a Found Set can be assisted by enabling a rapid return to the previously selected information, and can also include alerts about data that came into the system after the original search. This new data should be highlighted in some manner so that the end user can recognize it as new. Assembling can be enhanced by the use of a collection attribute that crosses data types and sources. This enables completeness criteria to be applied to ensure that all the necessary data is available for Presentation. Lastly, the value of Archiving can be enhanced by including collection and completeness attributes along with the identity of the user and a record of the heritage of the information. By examining the needs of each functional step and considering the preceding and following operations, enterprise data management can produce the maximum value for the organization.

Search Results Integration

Integration of search results is key to the success of enterprise data management. The high-level diagram in Figure 3 shows how structured data items, map displays and indexed unstructured data are brought together in a portal environment for the end users. The user interface was designed so that the user can navigate with free text search, taxonomy filters and map-based selection to obtain the data of interest.
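The combination of free text search, taxonomy filters and map-based selection can be sketched as a single filter over a mixed list of documents and database records. This is a minimal Python sketch under stated assumptions: the `Item` fields, the `search` function and the bounding-box convention are hypothetical and are not taken from the portal described in this paper.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Item:
    """A search result; may be a document or a database record (hypothetical)."""
    title: str
    text: str
    terms: set = field(default_factory=set)  # taxonomy terms applied to the item
    lon: Optional[float] = None               # longitude, if geo-referenced
    lat: Optional[float] = None               # latitude, if geo-referenced


def search(items, keyword=None, term=None, bbox=None):
    """Keep the items that satisfy every filter that was supplied.

    bbox is (min_lon, min_lat, max_lon, max_lat), as from a map selection.
    """
    results = []
    for item in items:
        haystack = (item.title + " " + item.text).lower()
        if keyword and keyword.lower() not in haystack:
            continue
        if term and term not in item.terms:
            continue
        if bbox:
            if item.lon is None or item.lat is None:
                continue  # items without a location cannot match a map selection
            min_lon, min_lat, max_lon, max_lat = bbox
            if not (min_lon <= item.lon <= max_lon and min_lat <= item.lat <= max_lat):
                continue
        results.append(item)
    return results
```

Because each filter is optional, the same function serves a user who starts from a keyword and one who starts from the map, which mirrors the navigation choices described above.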
[Figure 3 - High-Level Integration Architecture. User tiers labeled in the diagram: Analysts, Executives and Casual Users (Search, Collaborate, Workflow); Work teams and Technical Users (Data Capture, Analysis, Store & Report); Individual Technical Users (Best of Breed Apps). Components labeled in the diagram: Business Intelligence, Microsoft SharePoint, PetrisWINDS OneTouch, PetrisWINDS Enterprise Structured Data Access, ESRI GIS, Microsoft FAST, Specialized Apps, Unstructured Files.]

The user experience and screen layout are shown in Figure 4. When the cursor hovers over a data item, its location on the map blinks. This is designed to accentuate the relationships between map positions and the information items found as documents or database records in the list.
Figure 4 - User View of Unstructured and Structured Information in Enterprise Information Portal

The thumbnails identify geo-referenced documents that are listed in the center panel. The taxonomy on the left allows filtering of the found set. The documents can be read from the links provided, and structured digital data can be viewed using plug-in viewers. Check boxes have been placed by each unstructured document so that the selected image and document files can be placed in a more convenient Image Navigation Tool, shown in Figure 5.

Image Navigation Tools

The Image Navigation Tool gives users a familiar way to thumb through a set of documents and images quickly and efficiently, and to select the data that they would like to examine in more detail. This is another tool for dealing with the hundreds or thousands of items that a search can return from an enterprise information system.

Figure 5 - Document Navigation Tool
Focused Indexing

When faced with a whole enterprise full of unstructured files, it may seem hard to know where to start. We have been fortunate to have had access to the usage patterns of an early online document indexing system covering the AAPG Bulletins and Special Publications. This scientific and technical information was updated monthly and aggregated over the many years of publication. In order to assess system loading, we asked about the pattern of use by the technical public. We learned two important things and have brought that insight to our enterprise search approach.

We found that the Bulletins and the Special Publications had different access patterns. Interest in the monthly Bulletins was lost as time went on. The Special Publications, which are compendiums of papers on single topics, mostly retained their interest to the community even when the data was decades old.

[Charts: "Online Access Patterns to SpecPubs" and "Online Access Patterns to Bulletins". Each plots Number downloaded against Year.]

What we learned was that knowledge-intensive, focused-topic documents and newer documents seemed to provide the highest value to the scientists who made the effort to download the articles. This gave us the ability to prioritize our indexing and even to weight the value of certain types of documents when they are presented to a user.

Relationship Mapping

A number of visualization tools are emerging that can provide some unique and even useful representations of the information found from an enterprise search. A website worth visiting is Many Eyes from IBM. We have generated a few navigation diagrams from information found in our enterprise search demonstration. The first diagram is a relationship network between various E&P data types.
This diagram can be used to summarize the content of data types that have been found.

[Figure 6 - Data Type Relationship Diagram. Node labels: Daily Production, Monthly Production, Casing Segments, Lease, Casing String, Field, Well Logs, Owners, Seismic Line, Shot Points, Directional Survey, Reports, Curves, Attribute Values, Authors, Depth Values.]
The next diagram is a Word Tree diagram. This is a visual search tool that lets you pick a word or phrase and shows you the different contexts in which it appears. The contexts are arranged in branching structures that show recurrent associations in the text. In Figure 7, we see that for the lease 00063 well, data is available with core summaries and plugs. The highlight bar allows the user to zoom to a complete phrase or sentence that may be of interest.

Figure 7 - Word Tree Diagram

The last visualization is a bit stylistic, but remarkably useful. This is a Word Cloud, which lets you see how frequently words appear in a given text. It can give a quick view of the contents of a report that has been found in an enterprise search.

Figure 8 - Word Cloud of a Technical Document

Central Catalog/Index

Key to the propagation of a consistent taxonomy is a central catalog that is synchronized with the federated data sources across the enterprise. It allows the organization to make use of the effort invested in constructing its taxonomy. It also ensures speedy query responses and easy maintenance of the taxonomy, and reduces application license costs when searching across the enterprise. A central catalog architecture is at the heart of fast Web search engines such as Google and Yahoo!, which also deal with massive amounts of information.
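The idea of a central catalog that is synchronized with federated sources can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `CentralCatalog` class, its `sync` and `find` methods and the source names are hypothetical, and a production catalog would need persistence, security and incremental updates that are omitted here.

```python
from collections import defaultdict


class CentralCatalog:
    """Hypothetical central index over several federated data sources."""

    def __init__(self):
        self._entries = {}                # (source, item_id) -> set of taxonomy terms
        self._by_term = defaultdict(set)  # taxonomy term -> {(source, item_id)}

    def sync(self, source, records):
        """Replace the catalog's view of one source with its current records.

        records maps item_id -> iterable of taxonomy terms.
        """
        # Drop stale entries for this source before loading the new ones.
        for key in [k for k in self._entries if k[0] == source]:
            for term in self._entries.pop(key):
                self._by_term[term].discard(key)
        for item_id, terms in records.items():
            key = (source, item_id)
            self._entries[key] = set(terms)
            for term in self._entries[key]:
                self._by_term[term].add(key)

    def find(self, term):
        """Return sorted (source, item_id) pairs tagged with the term."""
        return sorted(self._by_term.get(term, set()))
```

Queries are answered from the catalog alone, so the individual sources are contacted only when they are synchronized or when a found item is actually opened. That separation is one reason such a design can give quick responses across many repositories.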
Socialization of Data

The socialization of data brings to enterprise data the concepts of Twitter, Facebook and alerts that are available to commercial users. We are introducing subscriptions and comment capture in our enterprise data management solutions. The first step is the subscription, and Figure 9 is an example of a subscription management user interface from a seismic data management solution.

Figure 9 - Subscription Management User Interface

A user may wish to set up a variety of subscriptions. We described a situation in one of the Query Lifecycles where a returning user would like to know whether new data is available. As can be seen here, a rich array of conditions can be specified for messages to be sent. Some are data processing state changes, others are availability conditions, and still others relate to comments added by other users on the information being monitored.

Notification can be implemented within the data management solution, via telecommunication channels such as SMS feeds, or in emails that arrive at the desk of the user. Figure 10 shows a consolidated summary of the notification information that this user has subscribed to.
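The matching of events to subscriptions, and their consolidation into one summary per user, can be sketched as follows. This is a minimal Python sketch under stated assumptions: the `Event` and `Subscription` types, the event kinds and the `build_digests` function are hypothetical and do not reflect the actual subscription interface shown in the figures.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    item_id: str
    kind: str    # hypothetical kinds: "state_change", "available", "comment"
    detail: str


@dataclass(frozen=True)
class Subscription:
    user: str
    item_id: str
    kinds: frozenset  # the event kinds this user wants to hear about


def build_digests(subscriptions, events):
    """Return {user: [message, ...]} with one consolidated list per user."""
    digests = {}
    for event in events:
        for sub in subscriptions:
            # An event is reported only if the user monitors this item
            # and has asked for this kind of change.
            if sub.item_id == event.item_id and event.kind in sub.kinds:
                message = f"{event.item_id}: {event.kind} - {event.detail}"
                digests.setdefault(sub.user, []).append(message)
    return digests
```

Each user's list could then be delivered through whichever channel is configured, such as a single summary email.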
Figure 10 - Email Notification Summarizing Subscription

Summary

With the success of being able to access nearly all the information in an organization comes a real danger of overwhelming the user with a tsunami of data. The value of this information is real, and it will make a significant contribution to the expert decision-making required in the E&P business. Instead of being overwhelmed by this information, we can use wise indexing and document weighting. We can provide speedy return of results and use multiple filters, visualizations and social interaction to work with our information and associates. This will allow enterprise data management to achieve new levels of productivity, collaboration and sound decision-making for our organizations.
www.halliburton.com 2013 Halliburton. All rights reserved. Sales of Halliburton products and services will be in accord solely with the terms and conditions contained in the contract between Halliburton and the customer that is applicable to the sale. H010419 2013