Big Data Maturity - The Photo and The Movie Mike Ferguson Managing Director, Intelligent Business Strategies BA4ALL Big Data & Analytics Insight Conference Stockholm, May 2015
About Mike Ferguson Mike Ferguson is Managing Director of Intelligent Business Strategies Limited. As an independent analyst and consultant he specializes in business intelligence, analytics, data management and big data. With over 33 years of IT experience, Mike has consulted for dozens of companies, spoken at events all over the world and written numerous articles. Formerly he was a principal and cofounder of Codd and Date Europe Limited, the inventors of the Relational Model, a Chief Architect at Teradata on the Teradata DBMS and European Managing Director of DataBase Associates. www.intelligentbusiness.biz mferguson@intelligentbusiness.biz Twitter: @mikeferguson1 Tel/Fax (+44)1625 520700
New Data Sources Have Emerged Inside And Outside The Enterprise That Business Now Wants To Analyse Data volume Data velocity E.g. RFID tag sensor networks Customers Front Office Service Sales Marketing Credit Verification Product/ service line 1 Product line 2 Product line 3 Product line 4 BackOffice Finance Procurement HR Supply Chain Suppliers Planning Product line n Operations Data volume Data variety Number of sources weather data 3
Popular Types of Data That Businesses Now Want to Analyse Web data Clickstream data, e-commerce logs Social networks data e.g., Twitter Semi-structured data e.g., e-mail Unstructured content How much is TEXT worth to you Sensor data Temperature, light, vibration, location, liquid flow, pressure, RFIDs Vertical industries structured transaction data E.g. Telecom call data records, retail 4
Big Data Analytics Has Taken Us Beyond The Traditional DW
New Big Data Analytical Workloads
1. Analysis of data in motion
2. Complex analysis of structured data
3. Exploratory analysis of un-modeled multi-structured data
4. Graph analysis e.g. social networks
5. Accelerating ETL and analytical processing of un-modeled data to enrich data in a data warehouse or analytical appliance
6. Data warehouse optimisation (offloading ETL processing)
7. The storage and re-processing of archived data
The Changing Landscape We Now Have Different Platforms Optimised For Different Analytical Workloads Big Data workloads result in multiple platforms now being needed for analytical processing Real-time stream processing & decision management Graph analysis Investigative analysis, Data refinery Data mining, model development Advanced Analytics (multi-structured data) DW & marts Advanced Analytics (structured data) MDM R Streaming data NoSQL DB e.g. graph DB NoSQL DBMS Hadoop data store EDW mart Data Warehouse RDBMS DW Appliance Analytical RDBMS C Prod Cust Asset D U Traditional query, reporting & analysis
The Photo - Big Data Workloads Mean Multiple Platforms Are Now Needed IBS Enterprise Analytical Ecosystem Graph Analytics tools Custom Analytic applications MR & Spark BI tools Search based BI tools BI tools platform & data visualisation tools Advanced Analytics (multi-structured data) Data Virtualisation and optimization DW & marts Advanced Analytics (structured data) MDM System R actions NoSQL DB e.g. graph DB EDW mart DW Appliance C Prod Cust Asset D U Filtered Stream processing data Enterprise Information Management sensors XML, clickstream JSON Web logs web services RDBMS feeds social Cloud Files office docs 7
Making The Movie - How Do You Bring This To Life? - Key Questions When Building A Big Data Roadmap Why do you need Big Data? What is the business purpose? What kinds of data do you need to analyse to achieve your business goals and where is that data? What kinds of characteristics does that data have? What kinds of analytical workload do you need to support to derive insight from that data? What skills do you need to do this? What technology choices best fit your needs? How will you deploy big data technologies? How will you organise and manage implementation? How will you integrate with your existing analytical investment? 8
What Are You Trying to Achieve And What Data Do You Need?- Industry Use Case Examples And Data Required Source: Hortonworks 9
Do You Need Real-Time Streaming Analytics? - Popular Streaming Analytic Applications Use Cases Extreme real-time analytics Source: Hortonworks 10
Business Example - Oil and Gas [Diagram: the IBS Enterprise Analytical Ecosystem annotated with oil and gas use cases.] Stream processing: real-time sensor data analysis of drilling, equipment monitoring, well integrity, pipeline flows, market trade monitoring. Advanced Analytics (multi-structured data): seismic analysis, well integrity, pipeline analysis, data refinery. DW & marts (EDW, mart): financial planning, financial reporting, spend analysis, production reporting, field service maintenance. Advanced Analytics (structured data) / DW Appliance: production forecasting, equipment failure prediction development. Data Virtualisation and optimization: simplified access, virtual marts, information services. Other components shown: Graph Analytics tools, Custom Analytic applications, MR & Spark BI tools, Search based BI tools, BI tools platform & data visualisation tools, NoSQL DB e.g. graph DB, MDM System (CRUD on Prod, Cust, Asset), Enterprise Information Management Tool Suite. Data sources: sensors, XML, JSON, clickstream, Web logs, web services, RDBMS, feeds, social, Cloud, Files, office docs.
Identify Candidate Big Data Projects, Categorise Them and Align With Priorities in Your Business Strategy [Diagram] Business Strategy: Objectives, KPIs, KPI targets, Priorities, Initiatives, Budgets. Candidate Big Data Projects are grouped by business area (Customer, Operations, Risk, Finance, Sustainability) and aligned with the business strategy. Each group enriches master data (cust, prod, asset) and the DW & marts (EDW, mart).
Why New Data And Big Data Analytics? Example: Enrich Customer Data and Insight Source: IBM Redbook - Information Governance Principles and Practices for a Big Data Landscape 13
Closing the Loop Feeding Insight Back Into MDM to Create Competitive Customer Data New insights from big data e.g. social intelligence, online behaviour customer intelligence C R Enriched customer customer D U MDM system with master data services reporting CPM alerts OLAP/Mine BI system mart DW mart mart historical data DW/BI system social networks 14
Assessing New Sources To Enrich Master Data Has To Be A Collaborative Process You Need Business In The Loop We need all relevant people to help determine high value data sources Business data expert IT Data Architect Data Scientist Goal: Enrich CUSTOMER data for better marketing Data Steward Business analyst Business data expert sandbox sandbox Business data expert IT Data Architect IT Developer We need to capture discussions, share exploratory results, rate data, prioritise projects 15
Once You Have Assessed The Value You Can Recruit & Start Project(s) To Acquire And Analyse New Data For example additional data about customers could come from: Social media data Professional life Lifestyle Relationships Likes/dislikes Sentiment - positive or negative opinion Intent - wants to buy, travel etc. Ownership - products owned (could be from competitors) Interests - Could be short-lived In-bound customer email Sentiment Call centre notes 16
Going Beyond Basic Customer Identity E.g. Extending / Enriching Customer Insight. Customer interaction data: Email, Chat / transcripts, Call centre notes, Click stream, Person-to-person dialogue. Customer attitude data: Opinions, Preferences, Needs and desires. Customer behaviour data: Orders, Payments, Transaction history, Usage history. Customer descriptive data: Attributes, Characteristics, Relationships, Demographics. All four feed an MDM system with master data services (CRUD) that turns the customer record into an enriched customer record. The objective is to create the best Customer dimension possible using additional internal and external data sources. Source: MDM
Enriching Customer MDM - Which Data Sources Potentially Require Big Data Analytics to Derive Insight? Potential big data sources are shown in brackets. Customer interaction data (CRM, web logs): Email, Chat / transcripts, Call centre notes, Click stream, Person-to-person dialogue. Customer attitude data (CRM, social media data, review web sites): Opinions, Preferences, Needs and desires. Customer behaviour data (sensor data, web logs): Orders, Payments, Transaction history, Usage history. Customer descriptive data (social media data, SEC filings): Attributes, Characteristics, Relationships, Demographics. All four feed an MDM system with master data services (CRUD) that holds the enriched customer record. The objective is to create the best Customer dimension possible using additional internal and external data sources. Source: MDM
Enriching Customer MDM - Need to Consider Volume, Variety and Velocity of Valuable New Data Sources. The same four categories (customer interaction, attitude, behaviour and descriptive data) and their potential big data sources are annotated with data characteristics: high volume undiscovered structured data; unstructured data; semistructured data; high velocity, high volume semistructured data. The objective is to create the best Customer dimension possible using additional internal and external data sources. Source: MDM
New Data Sources Example - What Are We Looking To Extract From Social Media Data Sources? Social Data Platforms Do you have people with these skills? Do you have all the technology needed? Additional Organisation data Additional Person data e.g. hobbies, Interests, desires Professional data e.g. employers Product ownership data Intent Sentiment Unknown Relationships Requires several techniques: 1. JSON schema extraction 2. Text analytics for entity extraction 3. Clickstream analysis 4. Graph analytics for relationship discovery analysis enrich R C customer U D MDM System HDFS files 20
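The first technique listed on this slide, JSON schema extraction, can be sketched in a few lines. This is only an illustration under stated assumptions: the sample profile and its field names (`name`, `employers`, `interests`) are hypothetical and do not come from any particular social platform.

```python
import json

def infer_schema(value):
    """Return a nested description of the types found in a parsed JSON value."""
    if isinstance(value, dict):
        return {key: infer_schema(child) for key, child in value.items()}
    if isinstance(value, list):
        # Describe a list by the schema of its first element; empty lists stay unknown.
        return [infer_schema(value[0])] if value else []
    return type(value).__name__

# Hypothetical social profile record (field names are assumptions for illustration).
sample = json.loads(
    '{"name": "A. Person",'
    ' "employers": [{"company": "Acme", "since": 2012}],'
    ' "interests": ["sailing"]}'
)
schema = infer_schema(sample)
print(schema)
```

Knowing the structure up front makes it easier to decide which attributes (employers, interests) are candidates for enriching the customer record.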
Social Media Data Challenges A Person Could Have Multiple Social Personas 21
Enriching Customer Data - Extracting LinkedIn Social Profile Data Via Their REST API Most social media sites have APIs to access information Additional Person data e.g. education, interests Professional data e.g. employers, skills LinkedIn returns data in JSON or XML formats Source: LinkedIn
Key Point! Several Different Types Of Big Data Analytic Workloads Can Be Used to Produce New Insights Text analytics to get new structured data attributes from millions of documents e.g. SEC filings, tweets, reviews Sentiment analytics for customer opinion Graph analytics for discovery of new customer relationships Clickstream analytics for customer interaction behaviour You can also combine these to find new data E.g. Text analytics to extract new data feeding graph analytics to find relationships in extracted data Do you know what types of analytical workloads you need to implement? Do you know what technologies you already have and what you need?
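Combining workloads, as in the example of text analytics feeding graph analytics, might look like the following toy sketch. The extraction pattern ("<Person> works at <Company>") and the sample sentences are hypothetical; real text analytics would use far richer extraction than a single regular expression.

```python
import re
from collections import defaultdict

# Assumed pattern for illustration only: "<First Last> works at <Company>".
PATTERN = re.compile(r"([A-Z][a-z]+ [A-Z][a-z]+) works at ([A-Z][A-Za-z]+)")

def extract_edges(documents):
    """Text analytics step: pull (person, company) pairs out of free text."""
    edges = []
    for text in documents:
        edges.extend(PATTERN.findall(text))
    return edges

def colleagues(edges):
    """Graph step: link people who share an employer."""
    by_company = defaultdict(set)
    for person, company in edges:
        by_company[company].add(person)
    links = set()
    for people in by_company.values():
        ordered = sorted(people)
        for i, first in enumerate(ordered):
            for second in ordered[i + 1:]:
                links.add((first, second))
    return links

docs = [
    "Ann Lee works at Acme and likes sailing.",
    "Bob Roy works at Acme.",
    "Cat Moe works at Globex.",
]
edges = extract_edges(docs)
links = colleagues(edges)
print(links)
```

The relationship between the two Acme employees is not stated in any single document; it only appears once the extracted data is analysed as a graph.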
Increased Data and Analytical Complexity Has Created A Need For A New Role The Data Scientist Image source: www.computing.co.uk Source: Wikipedia Data Science is the process of investigative / exploratory analysis of multi-structured data to discover and produce new business insights 24
Ensure People In Different Roles In The Analytical Landscape Work Together To Deliver Business Value Business Strategy strategic objectives and targets including sustainability targets Strategic Business Objective Priority KPI Current KPI Value What is +1% worth? 1 $$$ 2 3 4 KPI Target Executive Accountable Business Initiatives (projects) Project Project Project Budget Allocation x Million Action Plan Data Scientist Business Analyst Business Manager / Operations worker / Customer Exploratory analysis Predictive / statistical model producer Model consumer Data visualisation Information Producer Build reports Build and publish dashboards Information consumer Decision maker Action taker 25
Chaos Is NOT An Option Business Alignment of Information Being Produced is Critical To Success Projects without alignment run a high risk of failure or cancellation Strategic Objectives Business Strategy What problem are you trying to solve? What data do you need? What kind(s) of analytic workload are needed Project Project Project Project Project 26
Organising Your Data Science Projects In A Data Reservoir This Needs To Be Done Incrementally Do you have this already planned and organised? Enterprise Local Data marts Data Ingest zone Trusted Data e.g. Master Data DW Archive zone Exploratory analysis zone (prepare & analyse data) sandbox New Insights zone archive Txns DW insights NoSQL DB Graph DBMS Analytical DBMS DW Appliance MDM R C D U How will you know what data you have? 27
Governance Policies Will Apply More To Refined Data In A Reservoir. Inside the corporate firewall, data moves from Raw data (untrusted) through In-Progress data (sandbox, data refinery) to Refined data (trusted, fit for use). Governance steps: 1. Rate/classify as sensitive 2. Define privacy policies 3. Define access policies. What governance have you got in place? Classifying data in a catalog helps determine what governance policies to apply.
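How a catalog classification could drive policy selection is sketched below. The `CatalogEntry` type, the zone names and the policy labels are all hypothetical, chosen only to mirror the three steps on this slide; a real data catalog would carry much more metadata.

```python
from dataclasses import dataclass

@dataclass
class CatalogEntry:
    name: str
    zone: str        # assumed values: "raw", "in-progress" or "refined"
    sensitive: bool  # outcome of step 1: rate/classify as sensitive

def policies_for(entry):
    """Pick governance policies from the catalog classification (illustrative rules)."""
    policies = []
    if entry.sensitive:
        policies.append("mask-personal-data")  # step 2: a privacy policy
    if entry.zone == "refined":
        policies.append("role-based-access")   # step 3: access policy for trusted data
    else:
        policies.append("sandbox-only")        # untrusted data stays with data scientists
    return policies

print(policies_for(CatalogEntry("customer_emails", "refined", True)))
print(policies_for(CatalogEntry("web_logs", "raw", False)))
```

The point is only that once data is classified, policy assignment can be rule-driven instead of decided case by case.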
Technology Options for Refining (Preparing) Data IT developed ELT processing on Hadoop and analysis by data scientists Self-service data integration and analysis by data scientists Multi-role data management platforms with analytics A combination of the above 29
Hadoop As A Data Reservoir and Data Refinery. ELT workflow: 1. Load data into Hadoop (Data Reservoir, raw data) 2. Discover data in Hadoop 3. Parse & Prepare Data in Hadoop (MapReduce) 4. Transform & Cleanse Data in Hadoop (MapReduce). The Data Refinery, with its sandboxes and other data, contains clean, high value data. New high value insights are published (pub/sub) to the Graph DBMS, EDW and DW appliance.
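The four ELT stages on this slide can be illustrated with a small in-memory sketch. It is not Hadoop or MapReduce code; the record layout (customer id, country, amount) and the function names are assumptions made purely to show the order of the steps.

```python
from collections import Counter

def load(raw_lines):
    """Load: land the data unchanged in the reservoir."""
    return list(raw_lines)

def discover(lines):
    """Discover: profile the data, here by counting fields per record."""
    return Counter(len(line.split(",")) for line in lines)

def parse(lines, expected_fields=3):
    """Parse & prepare: keep only records with the expected structure."""
    split = [line.split(",") for line in lines]
    return [fields for fields in split if len(fields) == expected_fields]

def cleanse(records):
    """Transform & cleanse: normalise values and drop exact duplicates."""
    seen, clean = set(), []
    for customer, country, amount in records:
        row = (customer.strip(), country.strip().upper(), float(amount))
        if row not in seen:
            seen.add(row)
            clean.append(row)
    return clean

raw = ["c1, se ,10.5", "c2,uk,3", "broken line", "c1, se ,10.5"]
landed = load(raw)
profile = discover(landed)
refined = cleanse(parse(landed))
print(profile, refined)
```

Loading first and transforming afterwards is what distinguishes ELT from ETL: the raw data stays available in the reservoir for re-processing.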
Exploratory Analysis of Clickstream Data in Hadoop E.g. Weblog Data in HortonWorks 31
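A common first step in exploratory clickstream analysis is grouping raw clicks into sessions. The sketch below assumes a hypothetical record layout of (user id, Unix timestamp, URL) and a 30-minute inactivity gap; neither comes from the slide.

```python
from collections import defaultdict

SESSION_GAP_SECONDS = 30 * 60  # assumed inactivity threshold

def sessionize(clicks):
    """Group (user_id, unix_timestamp, url) clicks into per-user sessions."""
    by_user = defaultdict(list)
    for user, timestamp, url in clicks:
        by_user[user].append((timestamp, url))
    sessions = {}
    for user, events in by_user.items():
        events.sort()
        user_sessions = [[events[0][1]]]
        for (previous, _), (current, url) in zip(events, events[1:]):
            if current - previous > SESSION_GAP_SECONDS:
                user_sessions.append([])  # long gap: start a new session
            user_sessions[-1].append(url)
        sessions[user] = user_sessions
    return sessions

clicks = [
    ("u1", 0, "/home"),
    ("u1", 60, "/product"),
    ("u1", 4000, "/home"),
    ("u2", 10, "/cart"),
]
sessions = sessionize(clicks)
print(sessions)
```

Session paths such as these are the raw material for questions like which pages precede a purchase or an abandoned cart.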
Sentiment Analytics - Deriving Structure From Unstructured Content Additional Person data e.g. hobbies, Interests, desires Intent Sentiment (source: Crunchbase) 32
Using Text Analytics To Extract Additional Data From Unstructured Content Requirement is automatic recognition of people, organisations, addresses. Text analytics can derive information from unstructured content, e.g. emails. This can be a computationally intensive process involving complex character-level operations such as pattern matching. On large volumes, scalability matters.
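The character-level pattern matching mentioned on this slide can be illustrated with two simple regular expressions. Both patterns are deliberately naive assumptions (email addresses, and capitalised names ending in Ltd, Inc or AB); production entity extraction would use dictionaries, grammars or trained models.

```python
import re

# Illustrative patterns only; they will miss many real-world variants.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
ORGANISATION = re.compile(r"\b((?:[A-Z][A-Za-z]+ )+(?:Ltd|Inc|AB))\b")

def extract_entities(text):
    """Return the email addresses and organisation names found in free text."""
    return {
        "emails": EMAIL.findall(text),
        "organisations": ORGANISATION.findall(text),
    }

message = "Please contact anna@example.com about the Acme Widgets Ltd contract."
entities = extract_entities(message)
print(entities)
```

Each pattern is scanned against every character position of every document, which is why this kind of processing becomes expensive on large volumes.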
The Sentiment Analytics Process Data and Insights Can Be Matched To Master Data Using Fuzzy Matching Social Data Platforms Text Analysis Customer Engagement Management Social Media Aggregators HDFS files MapReduce or Spark sentiment scoring application Hive tables Scored sentiment and Social profile data Analyse / Index / Deliver Twitter Firehose MySpace Klout Amazon Facebook reddit Flickr Youtube bit.ly CRM applications critical fields Probabilistic ( fuzzy ) matching R C customer U D enrich C R enriched customer D MDM System U MDM System 34
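The two key steps in this process, scoring sentiment and fuzzy-matching the result to master data, can be sketched with the Python standard library. The word lists, the 0.85 threshold and the sample customer records are hypothetical; `difflib.SequenceMatcher` stands in here for whatever probabilistic matching engine an MDM system provides.

```python
from difflib import SequenceMatcher

# Tiny assumed lexicon; real sentiment scoring uses much larger models.
POSITIVE = {"love", "great", "good"}
NEGATIVE = {"hate", "poor", "bad"}

def sentiment_score(text):
    """Count positive words minus negative words."""
    words = [word.strip(".,!?").lower() for word in text.split()]
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

def match_to_master(profile_name, master_records, threshold=0.85):
    """Return the id of the most similar master record, or None if too dissimilar."""
    best_id, best_score = None, 0.0
    for customer_id, name in master_records.items():
        score = SequenceMatcher(None, profile_name.lower(), name.lower()).ratio()
        if score > best_score:
            best_id, best_score = customer_id, score
    return best_id if best_score >= threshold else None

master = {"C001": "Jonathan Smith", "C002": "Maria Garcia"}
post = {"author": "Jonathon Smith", "text": "I love this great phone!"}

customer_id = match_to_master(post["author"], master)
score = sentiment_score(post["text"])
print(customer_id, score)
```

Matching on a name alone is rarely enough in practice; the slide's reference to critical fields suggests combining several attributes before accepting a match.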
Key Use Case - Enriching Customer Master Data With New Relationships Using Graph Analysis. Customer interaction data: Email, Chat / transcripts, Call centre notes, Clickstream, Person-to-person dialogue. Customer attitude data: Opinions, Preferences, Needs and desires. Customer behaviour data: Orders, Payments, Transaction history, Usage history, Click stream navigation. Customer descriptive data: Attributes, Characteristics, Relationships, Demographics. All four feed an MDM system with master data services (CRUD) that turns the customer record into an enriched customer record. The objective is to create the best Customer dimension possible using additional internal and external data sources. Source: MDM
Example Identifying New Relationships Using Information Extracted From SEC Filings Subsidiaries list subsidiaries of a company Forms 8-K Current Events merger and acquisition bankruptcy change of officers and directors material definitive agreements Forms 3/4/5, SC 13D, SC 13G, 10-K, FDIC Call Report subsidiaries, insider, 5%, 10% owner, banking subsidiaries Shareholders related institutional managers Holdings in different securities Forms 10-K, DEF 14A, 8-K, 3/4/5, 13F, SC 13D, SC 13G, FDIC Call Report Forms 10-K, 10-Q, 8-K Loan Agreements loan summary details counterparties (borrower, lender, other agents) commitments borrower, lender Company Loan Reference SEC table Event Forms 3/4/5, SC 13D, SC 13G employment, director, officer insider, 5% owner, 10% owner Insider filings transactions holdings Insider relationship Security holdings, transactions Person Forms 13F, Forms 3/4/5 Forms 10-K, DEF 14A, 8-K, 3/4/5 5% beneficial ownership owner issuer % owned date Officers & Directors mention bio range, age, current position, past position signed by committee membership Source: IBM 36
Analysing Enriched Customer Master Data Can Improve Accuracy of Next Best Action To Be Taken Life events Additional Organisation data Additional Person data e.g. hobbies, Interests, desires Professional data e.g. employers Behaviour Product ownership data Intent score Sentiment score Unknown Relationships enrich C R Enriched customer D U Enriched MDM System analyse Next best action 37
New Insights Can Be Added Into A Data Warehouse To Enrich What You Already Know R C MDM U Operational systems D Data Scientists Web logs sandbox D I DW social web cloud new insights e.g. Deriving insight from social web sites for sentiment analytics
Alternatively New Insights In Hadoop Can Be Integrated With A DW Using Data Virtualization To Provide Enriched Information OLTP systems D I DW Web logs Data Scientists sandbox SQL on Hadoop Data Virtualisation social web cloud new insights e.g. Deriving insight from social web sites for sentiment analytics
Conclusions - Where Are You On The Maturity Model? Taking Photos Or Making Movies? Source: IBM
Thank You! Big Data and Analytics Stockholm, November 26-27, 2015