Big Data Analytics
DAMA NY DAMA Day, October 17, 2013
IBM, 590 Madison Avenue, 12th floor, New York, NY
Tom Haughey
InfoModel, LLC
868 Woodfield Road, Franklin Lakes, NJ 07417
201 755 3350
tom.haughey@infomodelusa.com
Agenda
  Definition
  Types of data
    Structured
    Semi-structured
    Unstructured
  Why Big Data is important
  Sources of Big Data
  Levels of Big Data
  Use cases for Big Data
  Big Data analytics
    Data mining
    Predictive analysis
  NOSQL and Big Data Landscape
  The new business intelligence architecture
  How to prepare for Big Data
  Pitfalls
  Conclusions
InfoModel, LLC. 2013
Big Data Definition
  Big data consists of high-volume, high-velocity, high-variety and high-value data and processes that demand cost-effective, innovative forms of information processing for enhanced insight and decision making.
  Source: modified from Gartner Glossary
Big Data in the Past
  A decade ago, Big Data was:
    A scalability problem
    A performance problem
  Added to that was the difficulty of making sense of it
  That is where today's Big Data and Big Data Analytics come into play
Why Now?
  Why can we achieve this now?
  The four-minute mile syndrome
    Nobody could do it until Roger Bannister did it
    Now lots of us can do it (!)
  Before, we didn't have:
    The hardware technology
    The software systems
    The data management systems
    The thought processes
Sources of Big Data
  Streaming data (e.g., stock market)
  Video archives
  Large-scale e-commerce
  Social and professional networks
  Internet text and documents
  Internet search indexing
  Call detail records
  Web logs
  RFID
  Medical records
  Sensor networks
  Social networks
  Military surveillance
  Astronomy
  Video and music archives
  Atmospheric science
  Genomics, biogeochemical
  Biological & other complex data
  Interdisciplinary scientific research
Big Data Support Technologies
  The emergence of commodity servers
  NOSQL file management systems and Hadoop
  Inverted column databases
  In-memory databases and analytics
  Convergence of machine learning and data mining
  Management of structured and unstructured content
  Support of Hadoop and MapReduce by major RDBMS vendors
Types of Data
  Structured: having a fixed and external structure (external to the data structure itself)
  Semi-structured: having a structure embedded in the data. Instances contain data values and metadata. The structure may vary but still needs to be planned and modeled
  Unstructured: having no known structure. Often transformed to structured or semi-structured data for processing. The structure may vary but still needs to be planned and modeled
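The three types can be illustrated with a minimal Python sketch. The field names, sample values and the extraction pattern below are assumptions made for illustration only; they do not come from the slides.

```python
import csv
import io
import json
import re

# Structured: the schema (column names) lives outside the data rows.
SCHEMA = ["customer_id", "amount"]
rows = list(csv.reader(io.StringIO("101,25.50\n102,13.00\n")))
structured = [dict(zip(SCHEMA, row)) for row in rows]

# Semi-structured: each instance carries its own field names (metadata),
# and the set of fields may vary from one instance to the next.
semi_structured = [
    json.loads('{"customer_id": 101, "amount": 25.5}'),
    json.loads('{"customer_id": 102, "amount": 13.0, "coupon": "FALL13"}'),
]

# Unstructured: free text with no known structure. Here it is transformed
# into a structured record with a hypothetical pattern before processing.
note = "Customer 103 called about a charge of $42.10 on the last invoice."
match = re.search(r"Customer (\d+).*?\$(\d+\.\d+)", note)
extracted = {"customer_id": int(match.group(1)), "amount": float(match.group(2))}
```

Even in the unstructured case the target record had to be decided in advance, which is the sense in which the structure still needs to be planned and modeled.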
Big Data and Business Analytics
  These levels affect data structure, data access and scale
  Level 4: Roll Your Own
  Levels 2 and 3:
    Unknown, unstructured or semi-structured. Processed directly or at small to large scale
    Unknown, unstructured. Transformed to structured. Processed at small to large scale
  Level 1: Known, structured. Processed at small to large scale
  Adapted from McKinsey
Sample Use Cases by Data Level
  Level 1:
    Pricing: targeted price setting
    Campaign lead generation
    Customer experience
    Pricing based on customer value
  Level 2:
    Impact of marketing on sales
    Market basket to determine risk
    Next product to buy (NPTB)
    Cross-channel integration
  Level 3:
    Fraud prevention
    Discount targeting based on location, likelihood-to-leave, web analytics
  Level 4:
    Targeted advertising, right landing page
    Pricing and targeted advertising, right price and landing page
    Credit line management
  Adapted from McKinsey
Business Analytics
  Solutions used to build analytical, historical models and simulations to create scenarios, understand current status and predict future states
  Business analytics includes data mining, predictive analytics, applied analytics and statistics, and is delivered as an application suitable for a business user
  Big Data Analytics is the convergence of Big Data and Business Analytics
  [Diagram labels: Big Data, Big Data Analytics, Business Analytics]
  Without Big Data Analytics, big data is just a lot of data
Should You Kill Your Data Warehouse?
  See Forbes, 8/24/2011 [Maybe don't see!]
Try This Query
  Try this query on semi-structured or unstructured data in a NOSQL or other multi-structured data environment:
    Give me a breakdown of sales revenue and volume by household by month, order it by the org unit that sold the product and the org unit that owned the product, summarize it from product type, to product subgroup and product group, for the last 5 years
  In a DW containing this data, this query can be run efficiently and coded fairly easily in SQL
  How do you do this on enormous quantities of semi-structured or unstructured data using existing technologies?
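A minimal sketch of how the warehouse version of this query might look, run here through Python's built-in sqlite3 module so that it is self-contained. The sale and product tables, their columns and the sample rows are assumptions made for illustration; they are not from the slides. SQLite has no GROUP BY ROLLUP, so this sketch groups at the product type level and carries the subgroup and group along as columns; the higher roll-ups would need ROLLUP or further queries on a platform that supports them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product (
    product_type TEXT PRIMARY KEY,
    product_subgroup TEXT,
    product_group TEXT
);
CREATE TABLE sale (
    household_id INTEGER,
    sale_date TEXT,          -- ISO date, e.g. 2013-06-30
    selling_org TEXT,
    owning_org TEXT,
    product_type TEXT REFERENCES product(product_type),
    revenue REAL,
    quantity INTEGER
);
INSERT INTO product VALUES ('checking', 'deposits', 'retail banking');
INSERT INTO sale VALUES
    (1, '2013-06-03', 'branch-a', 'retail', 'checking', 100.0, 1),
    (1, '2013-06-20', 'branch-a', 'retail', 'checking', 50.0, 2),
    (2, '2005-01-15', 'branch-b', 'retail', 'checking', 75.0, 1);
""")

# Revenue and volume by household and month, ordered by selling and owning
# org unit, restricted to the five years before the supplied as-of date.
QUERY = """
SELECT s.household_id,
       strftime('%Y-%m', s.sale_date) AS sale_month,
       s.selling_org, s.owning_org,
       p.product_group, p.product_subgroup, p.product_type,
       SUM(s.revenue) AS revenue,
       SUM(s.quantity) AS volume
FROM sale AS s
JOIN product AS p ON p.product_type = s.product_type
WHERE s.sale_date >= date(?, '-5 years')
GROUP BY s.household_id, sale_month, s.selling_org, s.owning_org,
         p.product_group, p.product_subgroup, p.product_type
ORDER BY s.selling_org, s.owning_org, s.household_id, sale_month
"""

result = conn.execute(QUERY, ("2013-10-17",)).fetchall()
```

The query depends on every row having a known household, date, org unit and product hierarchy, which is exactly what semi-structured and unstructured sources do not guarantee.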
Forms of Analytics
  Traditional BI and OLAP
    Will stay
    Consumers already use these
    Consumers will add to them
    Well known and required. Works well with most EDWs. Many levels and styles of BI
  Big Data Analytics
    Discovery oriented
    Shows value in Big Data
    Can leverage new platforms, e.g., Analytics DB
    Undergoing strong acceptance by consumers
  Advanced SQL
    Well-known SQL-based tools and techniques. Can result in long, complex SQL statements to gather, aggregate and model data
  Predictive Analytics
    Data mining and statistics to understand the past and predict future events. Requires special tools and rock stars
  New Analytic Methods/Tools
    Visualization, artificial intelligence, natural language processing. Analytical DB functions: inverted column DBs, DW appliances, MapReduce, etc.
Data Mining
  The use of mathematical algorithms to find hidden relationships in the data
  It can be used to:
    Find rules or approaches that worked well in the past
    Identify dependencies or relationships between things
    Segment or classify customers based on how well they match something you care about
    Group and cluster things that are similar to each other
    Spot and identify anomalies buried in the data
  Source: James Taylor
Techniques Used to Mine the Data
  Just as the popularity of new tools is exploding, so are the capabilities in data mining
  Data-mining techniques fall into four major categories:
    Classification: such as targeted marketing
    Association: such as market basket analysis
    Sequencing: those who bought this bought that
    Clustering: developing conclusions using space and distance
  NOTE: In Hadoop, querying and mining can be done through Hive, Mahout and Pig
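As an illustration of the association category, here is a minimal market basket sketch in plain Python (not Hive, Mahout or Pig). The baskets are made-up sample data, and support and confidence are the standard association-rule measures rather than anything specific to the slides.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactions: each set is one shopping basket.
baskets = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
]

# Count how many baskets contain each item and each unordered pair of items.
item_counts = Counter(item for basket in baskets for item in basket)
pair_counts = Counter(
    pair for basket in baskets for pair in combinations(sorted(basket), 2)
)


def support(pair):
    """Fraction of all baskets that contain both items of the pair."""
    return pair_counts[tuple(sorted(pair))] / len(baskets)


def confidence(antecedent, consequent):
    """Of the baskets containing the antecedent, the fraction that also
    contain the consequent."""
    both = pair_counts[tuple(sorted((antecedent, consequent)))]
    return both / item_counts[antecedent]
```

With this sample data, support(("beer", "diapers")) is 0.5 and confidence("beer", "diapers") is 1.0: every basket with beer also has diapers, while only two of the three diaper baskets have beer.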
Predictive Analytics
  Applies mathematical techniques to historical data to build an analytic model of the future. It predicts:
    How likely something is to be true
    Its likely value
    The likely sequence
  For instance, instead of:
    Finding dependencies true in historical data, find dependencies likely to be true in the future
    Grouping customers based on historical similarities, group them on the likelihood that they will behave similarly in the future
  Some regard data mining as the first step in predictive analytics
  Some use the terms synonymously
  Source: James Taylor
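A minimal sketch of the "how likely something is to be true" case, assuming a one-variable logistic regression fitted by plain gradient descent. The history data, the learning rate and the churn framing are hypothetical choices for illustration; the slides do not name a specific technique.

```python
import math

# Hypothetical history: (months since last purchase, 1 if the customer left).
history = [(1, 0), (2, 0), (3, 0), (4, 0), (6, 1), (7, 1), (8, 1), (9, 1)]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit(data, rate=0.1, epochs=10000):
    """Fit weight w and bias b by gradient descent on the logistic loss."""
    w, b = 0.0, 0.0
    n = len(data)
    for _ in range(epochs):
        grad_w = sum((sigmoid(w * x + b) - y) * x for x, y in data) / n
        grad_b = sum((sigmoid(w * x + b) - y) for x, y in data) / n
        w -= rate * grad_w
        b -= rate * grad_b
    return w, b


w, b = fit(history)


def churn_probability(months):
    """Estimated likelihood that a customer with this recency will leave."""
    return sigmoid(w * months + b)
```

The model is built from historical data but is applied to customers whose outcome is not yet known, for example churn_probability(5) for a customer five months since the last purchase.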
Uses of Predictive Analysis
  Its major uses are to:
    Improve efficiency
    Reduce risk
    Increase profitability
  Examples:
    First case: professional sports (Moneyball)
      Who should guard LeBron?
      What are individual players really worth to the team?
    Second case: banking
      Customers are using a new free business checking system for personal checks as well, increasing the cost of those accounts
      Will it be more profitable to pay them to leave?
    Third case: 7% of customers account for 43% of revenue
      What should we offer them?
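A figure like the one in the third case comes from a revenue concentration calculation, sketched below. The revenue values and the function name top_share are illustrative assumptions, not data from the slides.

```python
def top_share(revenues, top_fraction):
    """Share of total revenue contributed by the top fraction of customers."""
    ordered = sorted(revenues, reverse=True)
    top_n = max(1, round(len(ordered) * top_fraction))
    return sum(ordered[:top_n]) / sum(ordered)


# Hypothetical revenue per customer for ten customers.
revenues = [400, 100, 100, 100, 50, 50, 50, 50, 50, 50]
share = top_share(revenues, 0.10)
```

Here the top 10% of customers (one customer) account for 40% of revenue, so share is 0.4.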
Who Uses Big Data
  Data Scientists / Data Teams
  Knowledge workers
  BI consumers
  Decision makers at all levels of the business
The Big Data Analytics Environment
  [Architecture diagram. Recoverable labels:]
  Big Data Analytics
    Complex analysis of structured data
    Analysis of irregularly structured data in Hadoop
    Social sentiment and social network analysis
  Big Data Enterprise Data Warehouse Environment
  Traditional Reporting and Analysis
  Traditional Data Warehouse Environment
  Platforms: RDBMSs, Appliance, NOSQL, Hadoop, DW, Mart, OLAP
  Services: Data Integration, Real-time Analytics
  Sources: RYO data, Streams, Sensors, Events, Docs, XML/JSON, Files, Cloud, Tables, Web Logs
  Consumers
Means to Achieving Value in Big Data
  Create integrated, analytic sandboxes
  Use Hadoop as a complement to previous systems, not a replacement
  Derive data from Big Data as it is needed
    Less emphasis on pre-aggregation and pre-summarization
  As has been said since the opening days of client-server, send the function to the data
    Not the data to the function (as in some vertical DBMSs today)
  Learn to use parallel, distributed, commodity servers
  Use Big Data for staging as well as a live archive
  Virtualize Big Data
    For reuse across multiple analytical applications
    For easy access to the data when it is needed
Preparing for Big Data (BD)
  Define the business objectives
    Big Data (BD) will yield business advantages
    But not without business involvement
  Understand and prepare the data for BD as in any environment
    It is not just about slamming the data into some humongous staging area
    Data modeling is here to stay, but new methods are needed
    Costs and technology frustrations will increase
    But business advantages will go up as well
  Get the right staff
    Both BD and Analytics are new skills
    Organizations will need to hire, train, and learn accordingly
  Source the right suppliers and technology
    BD Analytics will be mainstream, not just for giant web firms
    Tools and platforms will improve so there will be less coding
    Plus improvement in scalability, performance, real-time availability
  Expect Hadoop and other Big Data infrastructure to become common
    Hadoop will not replace anything
    Data Warehousing and BI will continue
Pitfalls
  Potential pitfalls that can trip up organizations on big data analytics initiatives include:
    Absence of clear business purpose
    Jettisoning data management principles and practices
    Absence of internal analytics skills (you need rock stars)
    The high cost of hiring experienced analytics professionals
    High costs of the new infrastructure (hardware and software)
    Challenges in integrating Hadoop systems and data warehouses
    Selecting the right vendors who offer software connectors across and to Big Data technologies
Conclusions
  Big Data must deliver Business Value
    That is the sine qua non of Big Data Analytics
  Reporting, analysis and OLAP will stay
    You also need discovery analytics, predictive analysis and data mining
  Plan your entry into big data and implement it in sensible increments
    Be clear up front on the business goals
    Select key sources (data from Web, other systems, social networks)
  You will have to make some upgrades:
    Add new BI/DW technologies
    Train your staff
  Change is inevitable
  Give the business what it needs
    Discovery analytics to understand change, find opportunities
    Broader, more complete views of all relevant entities (e.g., customer)
    Analytics targeting your industry and your organization's specific needs and unique collection of big data