Taming the Beast of Big Data Jeff Zakrzewski Vice President Sogeti USA Local Touch, Global Reach 1
Agenda
- What is Big Data?
- Some Sources of Big Data
- Approaches to Big Data
- The Hadoop Buzz
- Vertical Perspective
- Vendor Perspective
- Role of the Future
- Q & A
What is Big Data?
What is Big Data?
Big data is a term applied to data sets whose size is beyond the ability of commonly used software tools to capture, manage, and process within a tolerable elapsed time.
As much as 80% of the world's data is now in unstructured formats, much of it created and held on the web. This data is increasingly associated with genuine Cloud-based services used externally to Enterprise IT. The part of Big Data that relates to the expected explosive growth and creation of new value is the unstructured data, mostly arising from these external sources.
Data sets are growing at a staggering pace: they are expected to grow by 100% every year for at least the next 5 years. Most of this data is unstructured or semi-structured, generated by servers, network devices, social media, and distributed sensors.
Big Data refers to such data because the volume (petabytes and exabytes), the type (semi-structured and unstructured, distributed), and the speed of growth (exponential) make traditional data storage and analytics tools insufficient and cost-prohibitive. An entirely new set of processing and analytic systems is required for Big Data, with Apache Hadoop being one example of a Big Data processing system that has gained significant popularity and acceptance.
According to a recent McKinsey Big Data report, Big Data can provide up to $300 billion in annual value to the US healthcare industry, and can increase US retail operating margins by up to 60%. It's no surprise that Big Data analytics is quickly becoming a critical priority for large enterprises across all verticals.
Big Data V3 Characteristics
The usual big data characteristics are:
- Volume: there is a lot of data to be analyzed and/or the analysis is extremely intense; either way, a lot of hardware is needed.
- Variety: the data is not organized into simple, regular patterns as in a table; rather, text, images, and highly varied structures or structures unknown in advance are typical.
- Velocity: the data comes into the data management system rapidly and often requires quick analysis or decision making.
Big Data Trend Overview: Drivers
- Volume, variety, velocity, and complexity of incoming data streams
- Growth of the Internet of Things results in an explosion of new data
- Commoditization of inexpensive terabyte-scale storage hardware is making storage less costly, so why not store it?
- Enterprises increasingly need to store non-traditional and unstructured data in a way that is easily queried
- Desire to integrate all the data into a single source
- The power of compression
Big Data Trend Overview: Challenges
- Data comes from many different sources (enterprise apps, web, search, video, mobile, social conversations, and sensors)
- All of this information has become increasingly difficult to store in traditional relational databases and even data warehouses
- Unstructured or semi-structured text is difficult to query. How does one query a table with a billion rows?
- Culture, skills, and business processes
- Conceptual data modeling
- Data quality management
Big Data Trend Overview: Implications
- Emerging capabilities to process vast quantities of structured and unstructured data are bringing about changes in technology and business landscapes
- As data sets get bigger and the time allotted to their processing shrinks, look for ever more innovative technology to help organizations glean the insights they'll need to face an increasingly data-driven future
Have you processed your Yottabyte today?
With the advent of big data comes even bigger storage capacity: now we can deal in yottabytes!
The National Security Agency (NSA) is already building a gigantic supercomputer to process this gigantic amount of information in the biggest spy center ever (bigger than 17 football fields). The million-square-foot center will be more than five times the size of the US Capitol and be able to sift through literally all electronic communications all over the world.
The Utah-based facility, which can process yottabytes (a quadrillion gigabytes) of data according to the Gizmodo technology blog, is designed to intercept, decipher, analyze, and store vast swaths of the world's communications as they zap down from satellites and zip through the underground and undersea cables of international, foreign, and domestic networks. It will be the centerpiece for the Global Information Grid and is set to go live in September 2013.
Big Data: The Byte Scale
The file size conversion table below shows the relationship between the file storage sizes that computers use. Binary calculations are based on units of 1,024, and decimal calculations are based on units of 1,000.
File size measures the size of a computer file. Typically it is measured in bytes with a prefix. The actual amount of disk space consumed by the file depends on the file system. The maximum file size a file system supports depends on the number of bits reserved to store size information and the total size of the file system. For example, with FAT32, the size of one file cannot be equal to or larger than 4 GiB.

Name       Symbol  Binary  Decimal  Number of Bytes (binary)                 Equal to
kilobyte   KB      2^10    10^3     1,024                                    1,024 bytes
megabyte   MB      2^20    10^6     1,048,576                                1,024 KB
gigabyte   GB      2^30    10^9     1,073,741,824                            1,024 MB
terabyte   TB      2^40    10^12    1,099,511,627,776                        1,024 GB
petabyte   PB      2^50    10^15    1,125,899,906,842,624                    1,024 TB
exabyte    EB      2^60    10^18    1,152,921,504,606,846,976                1,024 PB
zettabyte  ZB      2^70    10^21    1,180,591,620,717,411,303,424            1,024 EB
yottabyte  YB      2^80    10^24    1,208,925,819,614,629,174,706,176        1,024 ZB
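The byte scale above follows a simple rule: each step up multiplies by 1,024 in binary terms or by 1,000 in decimal terms. The sketch below recomputes both columns and shows how far they drift apart at each step. The function names are hypothetical and chosen only for illustration; the prefixes and exponents are the ones from the table.

```python
# Prefixes in the order the byte-scale table lists them.
PREFIXES = ["kilo", "mega", "giga", "tera", "peta", "exa", "zetta", "yotta"]


def binary_bytes(step):
    """Bytes in the binary unit at a given step (1 = kilobyte): 2^(10 * step)."""
    return 2 ** (10 * step)


def decimal_bytes(step):
    """Bytes in the decimal unit at a given step (1 = kilobyte): 10^(3 * step)."""
    return 10 ** (3 * step)


def scale_table():
    """Return (name, binary bytes, decimal bytes) for every prefix in the table."""
    return [
        (prefix + "byte", binary_bytes(step), decimal_bytes(step))
        for step, prefix in enumerate(PREFIXES, start=1)
    ]


if __name__ == "__main__":
    for name, binary, decimal in scale_table():
        # The ratio shows how much larger the binary unit is than the decimal one.
        print(f"{name:<10} {binary:>36,} {decimal:>36,} ratio={binary / decimal:.3f}")
```

The gap between the two conventions widens with each prefix, which is why it matters to state which one a storage figure uses.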
Some Sources of Big Data
A Connected World
An Explosion in Data in Recent History!
- 1.8 billion RFID tags in 2005; 4 billion RFID tags in 2009; 30 billion RFID tags in 2010
- Over 2.3 billion Internet users
- 24 petabytes of data processed in a single day
- Billions of financial transactions daily: TBs of data!
- 6 billion mobile phones worldwide
- 100s of millions of videos: 10s of petabytes of data
- World Data Centre for Climate: 220 terabytes of Web data, 9 petabytes of additional data
- Twitter processes 12 terabytes of data every day (230 million tweets)
- Facebook processes 25 terabytes of data every day
- The Human Genome Project: fully mapped in 2003; petabytes of data; ~1 GB per human, non-compressed
What do we do with all of this data?
The Challenge: Bring Together a Large Volume and Variety of Data to Find New Insights
- Analyzing a variety of data at enormous volumes
- Insights on streaming data
- Large-volume structured data analysis
- Multi-channel customer sentiment and experience analysis
- Detect life-threatening conditions at hospitals in time to intervene
- Predict weather patterns to plan optimal wind turbine usage, and optimize capital expenditure on asset placement
- Make risk decisions based on real-time transactional data
- Identify criminals and threats from disparate video, audio, and data feeds
Approaches to Big Data
The Big Data Approach: Information Sources Drive Creative Discovery
- Business and IT identify the information sources available
- New insights drive integration to traditional technology
- IT delivers a platform that enables creative exploration of all available data and content
- Business determines what questions to ask by exploring the data and relationships
Big Data Enterprise Data Platform
- Manage Big Data from the instant it enters the enterprise
- High fidelity: no changes to original format
- Available for new uses, analyses, and integrations
[Diagram: Big Data Platform. Labels: Business Analytic Applications and Solutions; Big Data Applications; Operational Data Store; Big Data Solutions; Big Data User Environment (Developers, End Users, Admin.); Client and Partner Solutions; Warehouse and Appliances; Big Data Enterprise Engine (Streaming analytics, Internet-scale analytics); Traditional data sources; Source data (Web, sensors, logs, media, etc.); Govern: Quality, Lifecycle Management, Security, Privacy]
Data Processing and Analytics: The Old Way
Traditionally, data processing for analytic purposes follows a fairly static blueprint. Through the regular course of business, enterprises create modest amounts of structured data with stable data models via enterprise applications like CRM, ERP, and financial systems. Data integration tools are used to extract, transform, and load the data from enterprise applications and transactional databases to a staging area where data quality and data normalization (hopefully) occur and the data is modeled into neat rows and tables. The modeled, cleansed data is then loaded into an enterprise data warehouse. This routine runs on a scheduled basis, usually daily or weekly, sometimes more frequently.
[Figure: Traditional Data Processing/Analytics. Source: Wikibon 2011]
Big Data Analytics Complements the DW
Transactional Big-data projects cannot use Hadoop, as it is not real-time. For transactional systems that do not need a database with ACID (2) guarantees, NoSQL databases can be used, though there are constraints such as weak consistency guarantees (e.g., eventual consistency) or restricting transactions to a single data item.
For big-data transactional SQL databases that need the ACID guarantees, the choices are limited. Traditional scale-up databases are usually too costly for very large-scale deployment, and don't scale out very well. Most social media databases have had to hand-craft solutions. Recently a new breed of scale-out SQL databases has emerged, such as Clustrix, with architectures that move the processing next to the data (in the same way as Hadoop). These scale out more readily. This area is growing extremely fast, with many new entrants into the market expected over the next few years.
(2) ACID stands for atomicity, consistency, isolation, durability.
Merging Traditional and Big Data Approaches
Traditional Approach: Structured & Repeatable Analysis
- Business users determine what question to ask
- IT structures the data to answer that question
- Examples: monthly sales reports, profitability analysis, customer surveys
Big Data Approach: Iterative & Exploratory Analysis
- IT delivers a platform to enable creative discovery
- Business explores what questions could be asked
- Examples: brand sentiment, product strategy, maximum asset utilization, preventative care
Data Flow and Processes Compared
Enterprise Integration: Trusted Information & Governance
- Companies need to govern what comes in, and the insights that come out
- Data management: insights from Big Data must be incorporated into the warehouse
[Diagram labels: Data Warehouse; Enterprise Integration; Big Data Platform; Traditional Sources; New Sources]
Big Data and Hadoop: What is Hadoop?
The best-known technology used for Big Data is Hadoop. It was inspired by Google publications on MapReduce, GoogleFS, and BigTable. Because Hadoop can be hosted on commodity hardware (usually an Intel PC on Linux with one or two CPUs and a few TB of HDD, without any RAID replication technology), it allows organizations to store huge quantities of data (petabytes or even more) at very low cost compared to SAN systems.
Hadoop is an open-source implementation of Google's MapReduce framework. It is a free, Java-based programming framework that supports the processing of large data sets in a distributed computing environment. It is an Apache project sponsored by the Apache Software Foundation: http://hadoop.apache.org/.
The Hadoop brand contains many different tools. Two of them are core parts of Hadoop:
- Hadoop Distributed File System (HDFS) is a virtual file system that looks like any other file system, except that when you move a file onto HDFS, the file is split into blocks, and each block is replicated and stored on 3 servers (the usual default, which may be customized) for fault tolerance.
- Hadoop MapReduce is a way to split every request into smaller requests which are sent to many small servers, allowing a truly scalable use of CPU power.
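The two core ideas above, splitting a file into replicated pieces and splitting a request into map and reduce steps, can be sketched in a few lines of plain Python. This is an in-memory illustration only, not Hadoop's API: every function name is hypothetical, the round-robin replica placement is an assumption made for simplicity (real HDFS placement is more involved), and word counting is used simply because it is the customary MapReduce example.

```python
from collections import defaultdict

REPLICATION = 3  # the usual default mentioned above; assumed configurable


def split_into_blocks(lines, block_size):
    """Split a 'file' (a list of lines) into fixed-size blocks."""
    return [lines[i:i + block_size] for i in range(0, len(lines), block_size)]


def place_replicas(num_blocks, nodes, replication=REPLICATION):
    """Assign each block to `replication` nodes, round-robin (simplified).

    Replicas of a block are distinct only when len(nodes) >= replication.
    """
    return {
        block: [nodes[(block + r) % len(nodes)] for r in range(replication)]
        for block in range(num_blocks)
    }


def map_phase(block):
    """Map step: emit a (word, 1) pair for every word in one block."""
    for line in block:
        for word in line.lower().split():
            yield word, 1


def shuffle(pairs):
    """Shuffle step: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce step: sum the values collected for each key."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(lines, block_size=2):
    """Run map over every block, then shuffle and reduce the results."""
    blocks = split_into_blocks(lines, block_size)
    pairs = [pair for block in blocks for pair in map_phase(block)]
    return reduce_phase(shuffle(pairs))


if __name__ == "__main__":
    text = ["big data", "big hadoop", "data data"]
    print(place_replicas(2, ["n1", "n2", "n3", "n4"]))
    print(word_count(text))
```

In a real cluster each block's map step would run on a server holding a replica of that block, which is what lets the work scale with the number of machines.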
How does Hadoop help? What problems can Hadoop solve?
The Hadoop framework is used by major players including Google, Yahoo, IBM, eBay, LinkedIn, and Facebook, largely for applications involving search engines and advertising. The preferred operating systems are Windows and Linux, but Hadoop can also work with BSD and OS X. Hadoop was originally the name of a stuffed toy elephant belonging to a child of the framework's creator, Doug Cutting.
Mike Olson (Cloudera): "The Hadoop platform was designed to solve problems where you have a lot of data, perhaps a mixture of complex and structured data, and it doesn't fit nicely into tables. It's for situations where you want to run analytics that are deep and computationally extensive, like clustering and targeting. That's exactly what Google was doing when it was indexing the web and examining user behavior to improve performance algorithms.
Hadoop applies to a bunch of markets. In finance, if you want to do accurate portfolio evaluation and risk analysis, you can build sophisticated models that are hard to jam into a database engine. But Hadoop can handle it. In online retail, if you want to deliver better search answers to your customers so they're more likely to buy the thing you show them, that sort of problem is well addressed by the platform Google built. Those are just a few examples."
What does the Hadoop architecture look like?
[Figure: Hadoop Internal Software Architecture]
Enterprise Hadoop Vendors
The free open source application, Apache Hadoop, is available for enterprise IT departments to download, use, and change however they wish. But for many business users, the need for support and technical expertise often overshadows the lure of free do-it-yourself applications, especially when there are critical IT systems at stake. That's where supported, enterprise-ready versions of Hadoop can be a better, more realistic option.
Here is a sampling of some of the major commercial vendors that can help your company get started with Hadoop. Some offer on-premises software packages; others sell Hadoop in the cloud. There are also some Hadoop database appliances beginning to appear, including the recently announced joint effort by Oracle and Cloudera.
- Amazon Web Services runs Amazon Elastic MapReduce, a hosted Hadoop framework running on Amazon's Elastic Compute Cloud and its Simple Storage Service
- The Cloudera Enterprise subscription service
- The Datameer Analytics Solution using Hadoop
- The DataStax Enterprise Hadoop software
- Greenplum, a Division of EMC, offers Greenplum HD Enterprise-Ready Apache Hadoop
- The Hortonworks Data Platform
- BigInsights, an unstructured-data cloud service from IBM based on Hadoop
- Karmasphere Analyst, a toolkit to help produce data using Hadoop
- MapR provides an enterprise-ready M5 edition of its Hadoop software
This list features only some of the many vendors offering enterprise Hadoop products and services today. The number of vendors is constantly growing as Hadoop gains steady traction in the data marketplace.
WHY HADOOP?
Hadoop
- Open source platform supporting large-scale parallel processing (1000s of servers)
- Massive-scale distributed file system (petabytes of data)
Customer Requirements
- Very affordable, scalable storage (petabytes)
- Want to store complete transaction data
- Flexible schema: new datasets with new schema created regularly
- Scalable, flexible analytics: generation of models of fraudulent card usage
- Job fault-tolerance
Hadoop Benefits
- We showed that jobs that took multiple weeks were reduced to hours with Hadoop
- Fundamentally changes what they are able to do
Vendor Perspective
Big Data Vendor Landscape
Big Data Market
The Big Data market is on the verge of a rapid growth spurt that will see it top the $50 billion mark worldwide within the next five years. As of early 2012, the Big Data market stands at just over $5 billion based on related software, hardware, and services revenue. Increased interest in and awareness of the power of Big Data and related analytic capabilities to gain competitive advantage and to improve operational efficiencies, coupled with developments in the technologies and services that make Big Data a practical reality, will result in a super-charged CAGR of 58% between now and 2017.
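As a quick sanity check on the forecast figures, the sketch below applies the standard compound-growth formula to the numbers quoted on this slide (a start of roughly $5 billion, a 58% CAGR, and five years). The function names are hypothetical, and the $5.0 billion starting value is a rounding assumption since the slide says "just over $5 billion".

```python
def project(start, rate, years):
    """Compound a starting value at a constant annual growth rate."""
    return start * (1 + rate) ** years


def implied_cagr(start, end, years):
    """Constant annual growth rate that takes `start` to `end` in `years`."""
    return (end / start) ** (1 / years) - 1


if __name__ == "__main__":
    # Figures in billions of US dollars, taken from the slide (start rounded to 5.0).
    print(f"Projected market after 5 years: {project(5.0, 0.58, 5):.1f}")
    print(f"CAGR implied by 5 -> 50 in 5 years: {implied_cagr(5.0, 50.0, 5):.1%}")
```

Growing from exactly $5 billion at 58% a year lands slightly under $50 billion, so a starting point "just over" $5 billion is consistent with the forecast topping that mark.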
Big Data Market Forecast
Big Data is the new definitive source of competitive advantage across all industries. For those organizations that understand and embrace the new reality of Big Data, the possibilities for new innovation, improved agility, and increased profitability are nearly endless. Below is Wikibon's five-year forecast for the Big Data market as a whole:
[Figure: five-year Big Data market forecast. Source: Wikibon 2012]
Big Data Pure-Play Vendors: Annual Revenue
Below is a worldwide revenue breakdown of the top Big Data pure-play vendors as of February 2012.
[Figure: pure-play vendor revenue. Source: Wikibon 2012]
Big Data Pure-Play Vendors: Market Share
Below is a breakdown of market share among the pure-play segment of the Big Data market.
[Figure: pure-play vendor market share. Source: Wikibon 2012]
Components of Big-data Processing
Big-data projects have a number of different layers of abstraction, from abstraction of the data through to running analytics against the abstracted data. Figure 1 shows the common components of analytical Big-data and their relationship to each other. The higher-level components help make big data projects easier and more productive. Hadoop is often at the center of Big-data projects, but it is not a prerequisite.
[Figure 1: Analytical Big-data Components. Source: Wikibon 2011]
The Forrester Wave: Enterprise Hadoop Solutions, Q1 2012
The Forrester Wave is copyrighted by Forrester Research, Inc. Forrester and Forrester Wave are trademarks of Forrester Research, Inc. The Forrester Wave is a graphical representation of Forrester's call on a market and is plotted using a detailed spreadsheet with exposed scores, weightings, and comments. Forrester does not endorse any vendor, product, or service depicted in the Forrester Wave. Information is based on best available resources. Opinions reflect judgment at the time and are subject to change.
Cloudera
IBM Big Data Platform
[Diagram labels: IBM Big Data Solutions; Client and Partner Solutions; Marketing (IBM Unica); Content Analytics (ECM); Business Analytics (Cognos & SPSS); Warehouse Appliance (IBM Netezza); Master Data Management (InfoSphere MDM); Data Warehouse (InfoSphere Warehouse); Database (DB2); Data Growth Management (InfoSphere Optim); InfoSphere Information Server; Big Data Accelerators (Text, Statistics, Financial, Geospatial, Acoustic, Image/Video, Mining, Time Series, Mathematical); Connectors; Applications; Blueprints; Big Data Enterprise Engines (InfoSphere Streams, InfoSphere BigInsights); Productivity Tools and Optimization (Workload Management and Optimization, Consumability and Management Tools); Open Source Foundation Components (Eclipse, Oozie, Hadoop, HBase, Pig, Lucene, Jaql)]
BigInsights Platform and Roadmap
[Diagram legend: IBM unique value; IBM differentiating value; IBM complementary value; Open source]
[Diagram labels: Performance; Manageability; Consumability; Analytics; Integration; BigInsights Enterprise Console (Data Explorer, Application Flows, Dashboards/Reports, Administration); users (DBA, Analyst, Programmer); BigInsights Enterprise Engine: Analytics (machine learning, text), Languages (Jaql, Pig, Hive, HBase), Workflow orchestration, Map-reduce (Hadoop), File system (GPFS+, HDFS), Indexing, Workload Prioritization; sources and connectors (DBs, Crawlers, Streams, DataStage, DB2, Netezza, JMS, HTTP, Web & Application logs); SPSS, Cognos, Unica...]
IBM PureSystems
Oracle's Big Data Solutions
Oracle's Big Data Appliance and Exadata
Microsoft BI Connectivity to Hadoop
Microsoft Big Data Stack
The Microsoft Big Data Solution
The Informatica Approach
[Diagram labels: Data Warehouse; Data Migration; Test Data Management & Archiving; Data Consolidation; Master Data Management; Data Synchronization; B2B Data Exchange (SWIFT, NACHA, HIPAA); sources: Cloud Computing, Application, Database, Unstructured, Partner Data]
Informatica Big Data Unleashed
EMC Greenplum's MPP Shared-Nothing Architecture
Pentaho and DataStax
Pentaho and DataStax will offer the first Cassandra-based big data analytics solution that combines the highly scalable, low-latency performance of Cassandra with Kettle's visual interface for high-performance data extract, transformation, and load, as well as integrated reporting, visualization, and interactive analysis capabilities. This will make it easier for developers and data scientists to operationalize, integrate, and analyze both big data and traditional data sources.
DataStax Cassandra Enterprise
DataStax Enterprise: real-time, analytic, and search capabilities in one integrated big data platform
Vertical Perspective
Enhancing Fraud Detection for Banks and Credit Card Companies
- Scenario: Build up-to-date models from transactional data to feed real-time risk-scoring systems for fraud detection.
- Requirement: Analyze volumes of data with response times that are not possible today. Apply analytic models to the individual client, not just the client segment.
- Benefits: Detect transaction fraud in progress; allow fraud models to be updated in hours rather than weeks.
Social Media Analysis for Products, Services and Brands
- Scenario: Monitor data from various sources such as blogs, boards, news feeds, tweets, and social media for information pertinent to brand and products, as well as competitors.
- Requirement: Extract and aggregate relevant topics and relationships, discover patterns, and reveal up-and-coming topics and trends.
- Benefits: Brand management for marketing campaigns; brand protection for ad placement networks.
Store Clustering Analysis in the Retail Industry
- Scenario: A retailer with a large number of stores needs to understand cluster patterns of shoppers.
- Requirement: Use shopping patterns for multiple characteristics like location, income, and family size for better product placement.
- Benefits: Store-specific clustering of products; clustering specific types of products by location.
[Figure labels: Age Range; Education; Income; Children; Assets; Urbanicity]
Healthcare and Energy Industry
- Healthcare: IBM Stream Computing for Smarter Healthcare. InfoSphere Streams-based analytics can alert hospital staff to impending life-threatening infections in premature infants up to 24 hours earlier than current practices.
- Energy: Vestas Wind Systems uses IBM big data analytics software and powerful IBM systems to improve wind turbine placement for optimal energy output.
Big Data Value Potential Index
The Role of the Future
Data Science and the Data Scientist
Data Science and the Data Scientist
Some References
Big Data: Some References
- Forrester: The Forrester Wave: Enterprise Hadoop Solutions, Q1 2012
- IBM Software: Big Data and Data Management
- IBM Systems: Big Data
- IBM: Big Data and Better Business Outcomes, A Strategic Foundation for Analytics
- International Data Corporation (IDC)
- Oracle: Big Data
- McKinsey Global Institute
- Microsoft: Big Data
- EMC Greenplum: Big Data
- Cloudera.com
- Hadoop.com
- Wikibon: Big Data
- Wikipedia: Big Data
Q & A
Prize & Thank you!