Data Search: Searching and Finding Information in Unstructured and Structured Data Sources
Erik Fransen, Senior Business Consultant
Centennium BI expertisehuis, The Hague, The Netherlands
e.fransen@centennium.nl
IRM UK, DW/BI 2009, London, November 3, 11.00-12.00 P.M.
Agenda
- Introduction;
- Industry models;
- Combining structured & unstructured data:
  - Pure Portal
  - Index it all
  - Structure it all
- Summary.
Erik Fransen: Profile
- Background: Knowledge Engineering, Middlesex University;
- Expertise areas:
  - Business Intelligence
  - Knowledge engineering
  - Knowledge & Content management
  - Data warehousing
  - Analytics
  - CBIP.
Introduction
Combining BI with unstructured data
- Integrated access to relevant information (provide the complete picture);
- Unstructured data such as documents provide valuable context to numerical data:
  - Customer complaints
  - Competitors' press releases
  - Marketing documents
- Insurance fraud analysis (i.e. claim statistics and claim forms);
- Competitive Intelligence (i.e. market share data and competitor news);
- Customer retention (i.e. sales data and customer complaints);
- Data Search acts as a bridge between structured and unstructured data.
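(Un)structured data keeps growing
[Figure: data volume in gigabytes plotted against a timeline of information milestones: cave paintings and bone tools (40,000 BC), writing (3500 BC), paper (105), printing (1450), electricity and telephone (1870), transistor (1947), computing (1950), Internet (DARPA, late 1960s), the Web (1993), then 1999, 2000, 2001, 2005, and 2009. Database milestones marked: SQL-70, Oracle-79, SQL-89, SQL-92, SQL-99, SQL-03. More than 80% of the data is labeled unstructured. Source: Forrester]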
Industry Model: Bill Inmon's DW 2.0
- Hold data at the lowest level of detail;
- Hold data to infinity;
- Have integrity of data and have online high-performance transaction processing;
- Tightly couple metadata to the data warehouse environment;
- Link structured data and unstructured data.
[Figure label: Text Data]
Industry Model: Information Access Architecture (Gartner)
Industry Model: Enterprise Search Platform (Forrester)
Data Search Scenarios: Searching and Finding Information in Unstructured and Structured Data Sources
Global architecture
[Figure: architecture diagram with the layers Structured, Unstructured, Middleware, Portal, and Master & Meta Data. Components shown: OLTP, ODS, DWH, Data Marts, Cubes, Reports, OLAP, Mining, Financial Apps, Content Management System, Fileservers, Email, Intranet/internet, Search Index, Database Search, Text Mining, Visualisation.]
Three data search scenarios
[Figure: the global architecture diagram with the three scenarios marked: Pure Portal, Index it all, and Structure it all.]
Scenario 1: Pure Portal
- Many portlets, one user interface;
- The business user manually combines content from several independent sources;
- Risk: too complex for the user.
1: Pure Portal
[Figure: the global architecture diagram with Scenario 1, Pure Portal, highlighted.]
Integrate news with BI information (Source: Aruba)
Structured BI info
... and Photos, Files and Maps
Scenario 2: Index it all
- Enterprise Search from one user interface;
- The business user knows what to look for and expects a complete picture as a result;
- Risk: many irrelevant search results due to the nature of document indexing.
2: Index it all
[Figure: the global architecture diagram with Scenario 2, Index it all, highlighted.]
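Scenario 2: Index it all (architecture)
[Figure: components shown are the unstructured data sources, the search index, the search application, and the user interface on one side, and the structured data sources, the data warehouse architecture, the reports, and the BI application on the other. Annotation: a BI report is indexed as if it were a document.]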
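The idea that a BI report is indexed as if it were a document can be illustrated with a minimal sketch. It is not the implementation of any product mentioned in this deck: the item ids, the sample texts, and the helper names `tokenize`, `build_index`, and `search` are all hypothetical, and the index is a plain in-memory inverted index.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(items):
    """Map each token to the set of item ids whose text contains it."""
    index = defaultdict(set)
    for item_id, item in items.items():
        for token in tokenize(item["text"]):
            index[token].add(item_id)
    return index


def search(index, query):
    """Return the ids of items containing every query token (AND semantics)."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result


# Hypothetical content: two documents and the title text of one BI report.
# The report goes into the same index as the documents.
items = {
    "doc-1": {"kind": "document", "text": "Customer complaint about delayed claim payment"},
    "doc-2": {"kind": "document", "text": "Competitor press release on new pricing"},
    "rpt-1": {"kind": "bi_report", "text": "Claim statistics by region and quarter"},
}

index = build_index(items)
hits = sorted(search(index, "claim"))
for item_id in hits:
    print(item_id, items[item_id]["kind"])
```

A single query for "claim" returns both the complaint document and the claim statistics report, which is the complete picture this scenario aims for. The same mechanism also explains the stated risk: any item that merely mentions a query term is returned, relevant or not.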
Example: IBM Cognos 8 Go! Search
- Integration with enterprise search applications (IBM OmniFind, Google OneBox for Enterprise, Yahoo, Autonomy);
- Search results return all relevant structured content (reports, analyses, etc.) and unstructured content (Word documents, PDFs, etc.) within a single interface.
Example: IBM OmniFind
SAP BusinessObjects Intelligent Search
Scenario 3: Structure it all
- Generate structure using document warehousing and text mining;
- The business user knows exactly what to look for;
- Risk: limited flexibility for the user.
3: Structure it all
[Figure: the global architecture diagram with Scenario 3, Structure it all, highlighted.]
Generating structure in a document warehouse (Source: Dan Sullivan)
1. Identify Sources: sources are not fixed; this is an iterative process in which sources lead to new sources.
2. Retrieve Documents: internal source retrieval from file servers and CMS/DMS; external source retrieval using crawlers and spiders.
3. Preprocess Documents: format documents in a consistent manner; files must be in a suitable form for text analysis.
4. Text Mining: linguistic analysis; key features are extracted; indexing documents; summarizing documents.
5. Compile Metadata: carefully attach metadata to each document; used for querying, matching, and navigation support; store in the document warehouse.
[Figure: the document warehouse architecture and the data warehouse architecture, linked by combining (meta)data.]
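The steps above can be sketched end to end in a few lines. This is a toy illustration under stated assumptions, not Dan Sullivan's method or any vendor's tool: the sources are in-memory strings, "text mining" is reduced to term counting plus a first-sentence summary, and every function name, the stopword list, and the sample text are hypothetical.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "are", "for", "on", "too"}


def retrieve(sources):
    """Step 2: retrieve documents; here each source already holds raw text."""
    return [{"source": name, "raw": raw} for name, raw in sources.items()]


def preprocess(doc):
    """Step 3: bring the text into a consistent form by collapsing whitespace."""
    doc["text"] = " ".join(doc["raw"].split())
    return doc


def mine(doc, top_n=2):
    """Step 4: extract key features (most frequent terms) and a naive summary."""
    tokens = [t for t in re.findall(r"[a-z]+", doc["text"].lower()) if t not in STOPWORDS]
    doc["keywords"] = [term for term, _ in Counter(tokens).most_common(top_n)]
    # Naive summary: the first sentence of the document.
    doc["summary"] = re.split(r"(?<=[.!?])\s+", doc["text"])[0]
    return doc


def compile_metadata(doc):
    """Step 5: attach the metadata to the document in one warehouse record."""
    return {
        "source": doc["source"],
        "summary": doc["summary"],
        "keywords": doc["keywords"],
        "text": doc["text"],
    }


# Step 1: identify sources (hypothetical call-center note).
sources = {
    "call-001": "Roaming   charges are too high.\nThe customer disputes roaming charges on the invoice.",
}

document_warehouse = {}
for doc in retrieve(sources):
    record = compile_metadata(mine(preprocess(doc)))
    document_warehouse[record["source"]] = record

print(document_warehouse["call-001"]["keywords"])
print(document_warehouse["call-001"]["summary"])
```

The extracted keywords are the kind of key features that can later act as dimension values when the document warehouse is combined with the data warehouse.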
Document warehouse
- Contains complete documents or URLs;
- Metadata about documents: summaries, authors' names, publication dates, titles, sources, keywords, etc.;
- Translations of documents;
- Thematic clustering of similar documents;
- Topical or thematic indexes;
- Extracted key features (structure): dimensions and facts, linked to documents, summaries, etc.;
- Combine with the data warehouse.
[Figure: document warehouse architecture.]
BI reporting on dimensional model
[Figure: dimensional model spanning the data warehouse and the document warehouse, with the fact tables Sales Facts and Call Facts and the dimensions Product, Customer, Action, Competitor, Sales person, Time, and Telco Term.]
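A report on such a model can place figures from the data warehouse next to facts derived from documents, because both fact tables share dimensions. The sketch below uses Python's built-in `sqlite3` module. The table names follow the slide (Customer dimension, Sales Facts, Call Facts), but the columns, the sample rows, and the simplification of the Telco Term dimension to a plain column are assumptions made for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE dim_customer (customer_id INTEGER PRIMARY KEY, name TEXT);
    -- Data warehouse side: numerical sales facts.
    CREATE TABLE sales_facts (customer_id INTEGER, amount REAL);
    -- Document warehouse side: one row per term extracted from a call note.
    CREATE TABLE call_facts (customer_id INTEGER, telco_term TEXT, document_id TEXT);

    INSERT INTO dim_customer VALUES (1, 'Acme'), (2, 'Globex');
    INSERT INTO sales_facts VALUES (1, 100.0), (1, 50.0), (2, 75.0);
    INSERT INTO call_facts VALUES
        (1, 'roaming', 'call-001'),
        (1, 'roaming', 'call-002'),
        (2, 'billing', 'call-003');
    """
)

# Each fact table is aggregated in its own subquery so that joining
# through the shared customer dimension does not multiply rows.
query = """
SELECT c.name,
       (SELECT COALESCE(SUM(s.amount), 0)
          FROM sales_facts s
         WHERE s.customer_id = c.customer_id) AS total_sales,
       (SELECT COUNT(*)
          FROM call_facts k
         WHERE k.customer_id = c.customer_id
           AND k.telco_term = ?) AS term_calls
  FROM dim_customer c
 ORDER BY c.name
"""
rows = conn.execute(query, ("roaming",)).fetchall()
for name, total_sales, term_calls in rows:
    print(name, total_sales, term_calls)
```

Keeping `document_id` on the call facts is what lets a user drill from a number in the report back to the underlying document or its summary.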
Generate structure using text mining tools
Example taken from SPSS PASW Text Analytics; many other tools are available: IBM, SAS, Oracle, SAP BO, Microsoft, etc.
Generating structure using UIMA
- Unstructured Information Management Architecture;
- Originates from IBM, now Apache UIMA: http://incubator.apache.org/uima/
- UIMA is supported by all main BI vendors.
Source: IBM
Example: Generating structure using UIMA
[Figure: a text analyzed by a collection of text analytics, with the detected semantic entities and relations highlighted and represented in the UIMA Common Analysis Structure (CAS).]
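The core idea behind a common analysis structure is stand-off annotation: the original text stays untouched, and each analytic adds typed annotations that point into it by character offsets. The sketch below illustrates only that idea in plain Python. It does not use the Apache UIMA API; the classes `Annotation` and `AnalysisStructure`, the helper `annotate_terms`, the type names, and the sample sentence are all hypothetical, and relations between entities are left out.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Annotation:
    """A typed span over the text, identified by character offsets."""
    type: str
    begin: int
    end: int


@dataclass
class AnalysisStructure:
    """Holds the unchanged text plus every annotation added by the analytics."""
    text: str
    annotations: list = field(default_factory=list)

    def covered_text(self, annotation):
        """Return the piece of text that the annotation spans."""
        return self.text[annotation.begin:annotation.end]


def annotate_terms(cas, type_name, terms):
    """A toy analytic: annotate every literal occurrence of the given terms."""
    for term in terms:
        for match in re.finditer(re.escape(term), cas.text):
            cas.annotations.append(Annotation(type_name, match.start(), match.end()))


cas = AnalysisStructure("Acme complained about roaming charges from Globex.")
# Two independent analytics write into the same structure.
annotate_terms(cas, "Company", ["Acme", "Globex"])
annotate_terms(cas, "TelcoTerm", ["roaming"])

found = [
    (a.type, cas.covered_text(a))
    for a in sorted(cas.annotations, key=lambda a: a.begin)
]
print(found)
```

Because every analytic reads from and writes to the same structure, a later step can turn the typed annotations into dimension values and facts without parsing the text again.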
Summary
- Growing business need for combining BI with unstructured data;
- Data Search bridges the gap between both worlds:
  - Scenario 1: Pure Portal
  - Scenario 2: Index it all
  - Scenario 3: Structure it all
- Scenarios can be combined.
Questions?