Big Data Concepts. Considerations Gayane IBM Client Center



Similar documents
IBM Big Data Platform

Exploiting Data at Rest and Data in Motion with a Big Data Platform

Are You Ready for Big Data?

Industry Impact of Big Data in the Cloud: An IBM Perspective

Are You Ready for Big Data?

Deploying Big Data to the Cloud: Roadmap for Success

How the oil and gas industry can gain value from Big Data?

How to Leverage Big Data in the Cloud to Gain Competitive Advantage

IBM Data Warehousing and Analytics Portfolio Summary

Luncheon Webinar Series May 13, 2013

BAO & Big Data Overview Applied to Real-time Campaign GSE. Joel Viale Telecom Solutions Lab Solution Architect. Telecom Solutions Lab

Addressing Open Source Big Data, Hadoop, and MapReduce limitations

Big Data and Trusted Information

BIG DATA TECHNOLOGY. Hadoop Ecosystem

Big Data overview. Livio Ventura. SICS Software week, Sept Cloud and Big Data Day

IBM Big Data in Government

Hadoop Beyond Hype: Complex Adaptive Systems Conference Nov 16, Viswa Sharma Solutions Architect Tata Consultancy Services

IBM AND NEXT GENERATION ARCHITECTURE FOR BIG DATA & ANALYTICS!

IBM Big Data Platform

End to End Solution to Accelerate Data Warehouse Optimization. Franco Flore Alliance Sales Director - APJ

Big Data & Analytics. The. Deal. About. Jacob Büchler jbuechler@dk.ibm.com Cand. Polit. IBM Denmark, Solution Exec IBM Corporation

Beyond Watson: The Business Implications of Big Data

The Future of Data Management

A New Era Of Analytic

Big Data and the new trends for BI and Analytics Juha Teljo Business Intelligence and Predictive Solutions Executive IBM Europe

Issues in Big Data: Analytics

VIEWPOINT. High Performance Analytics. Industry Context and Trends

Raul F. Chong Senior program manager Big data, DB2, and Cloud IM Cloud Computing Center of Competence - IBM Toronto Lab, Canada

How To Create A Data Science System

Hadoop and ecosystem * 本 文 中 的 言 论 仅 代 表 作 者 个 人 观 点 * 本 文 中 的 一 些 图 例 来 自 于 互 联 网. Information Management. Information Management IBM CDL Lab

Klarna Tech Talk: Mind the Data! Jeff Pollock InfoSphere Information Integration & Governance

BIG DATA ANALYTICS REFERENCE ARCHITECTURES AND CASE STUDIES

AGENDA. What is BIG DATA? What is Hadoop? Why Microsoft? The Microsoft BIG DATA story. Our BIG DATA Roadmap. Hadoop PDW

Managing Big Data with Hadoop & Vertica. A look at integration between the Cloudera distribution for Hadoop and the Vertica Analytic Database

Real World Use of BIG DATA. Tim Brown Information Management Technical Pre-Sales Aruna Kolluru Information Management Technical Pre-Sales 04/2013

The Enterprise Data Hub and The Modern Information Architecture

How to make BIG DATA work for you. Faster results with Microsoft SQL Server PDW

Architecting for Big Data Analytics and Beyond: A New Framework for Business Intelligence and Data Warehousing

Big Data Are You Ready? Jorge Plascencia Solution Architect Manager

Transforming the Telecoms Business using Big Data and Analytics

IBM Information Management Overview

The New Normal: Get Ready for the Era of Extreme Information Management. John Mancini President, DigitalLandfill.

Poslovni slučajevi upotrebe IBM Netezze

Using Big Data for Smarter Decision Making. Colin White, BI Research July 2011 Sponsored by IBM

Big Data Use Case Deep Dive 5 Game Changing Use Cases for Big Data

Big Data and Your Data Warehouse Philip Russom

BIG Data Analytics Move to Competitive Advantage

5 Keys to Unlocking the Big Data Analytics Puzzle. Anurag Tandon Director, Product Marketing March 26, 2014

YOU VS THE SENSORS. Six Requirements for Visualizing the Internet of Things. Dan Potter Chief Marketing Officer, Datawatch Corporation

Big Data. Fast Forward. Putting data to productive use

Demystifying Big Data Government Agencies & The Big Data Phenomenon

Big Data and Data Quality - Mutually Exclusive?

Get Ready for Big Data with IBM System z

WHITE PAPER LOWER COSTS, INCREASE PRODUCTIVITY, AND ACCELERATE VALUE, WITH ENTERPRISE- READY HADOOP

Analytics and the Context Multiplier

W H I T E P A P E R. Architecting A Big Data Platform for Analytics INTELLIGENT BUSINESS STRATEGIES

Chapter 7. Using Hadoop Cluster and MapReduce

Advanced In-Database Analytics

BIG DATA TRENDS AND TECHNOLOGIES

SQLSaturday #399 Sacramento 25 July, Big Data Analytics with Excel

Datenverwaltung im Wandel - Building an Enterprise Data Hub with

III JORNADAS DE DATA MINING

Ganzheitliches Datenmanagement

Il mondo dei DB Cambia : Tecnologie e opportunita`

An Integrated Analytics & Big Data Infrastructure September 21, 2012 Robert Stackowiak, Vice President Data Systems Architecture Oracle Enterprise

The Future of Data Management with Hadoop and the Enterprise Data Hub

Smarter Analytics. Barbara Cain. Driving Value from Big Data

Big Data Analytics. with EMC Greenplum and Hadoop. Big Data Analytics. Ofir Manor Pre Sales Technical Architect EMC Greenplum

Native Connectivity to Big Data Sources in MicroStrategy 10. Presented by: Raja Ganapathy

Getting Started Practical Input For Your Roadmap

IBM System x reference architecture solutions for big data

Information systems architecture for the Oil and Gas industry

The Future of Business Analytics is Now! 2013 IBM Corporation

Oracle Big Data Building A Big Data Management System

Hortonworks & SAS. Analytics everywhere. Page 1. Hortonworks Inc All Rights Reserved

Smarter Analytics Leadership Summit Big Data. Real Solutions. Big Results.

Associate Professor, Department of CSE, Shri Vishnu Engineering College for Women, Andhra Pradesh, India 2

Beyond the Single View with IBM InfoSphere

Hur hanterar vi utmaningar inom området - Big Data. Jan Östling Enterprise Technologies Intel Corporation, NER

Data Refinery with Big Data Aspects

Teradata s Big Data Technology Strategy & Roadmap

Big Data & QlikView. Democratizing Big Data Analytics. David Freriks Principal Solution Architect

Information Architecture

Well packaged sets of preinstalled, integrated, and optimized software on select hardware in the form of engineered systems and appliances

A holistic approach to Big Data

International Journal of Advanced Engineering Research and Applications (IJAERA) ISSN: Vol. 1, Issue 6, October Big Data and Hadoop

IBM InfoSphere BigInsights Enterprise Edition

Executive Summary... 2 Introduction Defining Big Data The Importance of Big Data... 4 Building a Big Data Platform...

BIG DATA What it is and how to use?

How To Handle Big Data With A Data Scientist

Big Data Technology ดร.ช ชาต หฤไชยะศ กด. Choochart Haruechaiyasak, Ph.D.

Operational Intelligence: Real-Time Business Analytics for Big Data Philip Russom

<Insert Picture Here> Oracle and/or Hadoop And what you need to know

How to use Big Data in Industry 4.0 implementations. LAURI ILISON, PhD Head of Big Data and Machine Learning

Big Data & Analytics for Semiconductor Manufacturing

Complexity and Scalability in Semantic Graph Analysis Semantic Days 2013

Big Data Strategies with IMS

W H I T E P A P E R. Deriving Intelligence from Large Data Using Hadoop and Applying Analytics. Abstract

How Big Is Big Data Adoption? Survey Results. Survey Results Big Data Company Strategy... 6

How To Get More Data From Your Computer

Transcription:

Big Data Concepts. Considerations Gayane IBM Client Center 82190210@ru.ibm.com 2015 IBM Corporation

Big Data Concepts Agenda What is Big Data? Data at Rest: Hadoop and InfoSphere BigInsights Data in Motion: InfoSphere Streams Considerations for BigInsights and Streams Concluding Thoughts 2

Big Data Concepts and Hardware Considerations The Big Data Opportunity Extracting insight from an immense volume, variety and velocity of data, in a timely and cost-effective manner. Think Big! Variety: Velocity: Volume: Verasity All kinds of data All kinds of analytics Streaming data and large volume data movement Scale from terabytes to zettabytes Truthfulness a certainty of data 3

Where is this data coming from? data every day Big Data Concepts and Hardware Considerations 12+ TBs of tweet data every day 30 billion RFID tags today (1.3B in 2005) 4.6 billion camera phones world wide? TBs of 100s of millions of GPS enabled devices sold annually 25+ TBs of log data every day 76 million smart meters in 2009 200M by 2014 2+ billion people on the Web by end 2011 4

Big Data Concepts and Hardware Considerations What is BIG DATA? Where do I find it? Throw it away vs. Storing it? Log files practically every system creates and stores some kind of log typically not examined unless there is trouble Typically text and log wraps around on a regular basis to save space HTTP All web content Web content based on xml format and is highly variable No evident pattern to store in DB. DB cannot capture context Metering and Instrumentation Usually depicts real time status or cumulative value Variable over time Usually collect over interval Summary over interval or recorded peak. Patterns not analyzed Video & Audio Either streamed or stored in large files Detail cannot be stored in DB Only segments are of interest and require processing to analyze Personal profiles Volumes of texts and pictures Can be external Can receive block or interval Analysis requires cognitive parsing Large volumes of retrieved data are irrelevant Metadata Information that is in addition to actual data stored in DB Additional detail of a transaction that does not relate directly to billing End to end story of events that relates to a transaction Large volumes of data that is difficult to store over time Big Data technologies describe a new generation of technologies and architectures, designed to economically extract value from very large volumes of a wide variety of data, by enabling high velocity capture, discovery and/or analysis. Source: Matt Eastwood, IDC 5

Concept Associations for Old Data and New Big Data Standard DBW (Warehouse) Structured Schema Ad-hoc queries Reports Indexes Repeatable Optimized queries ETL Cleansed data Transactions High availability MPP SMP Complex analytics Data models Master data Model building SQL New Big Data (BigInsights, Streams) Unstructured Streaming Discovery Programming Text analytics Video Time series Sensors Log files Noisy data Commodity hardware Cluster Real-time analytics Complex analytics Tweets Sentiment analysis Social network analysis Model-driven optimization NoSQL

Data Warehouse and BigInsights Comparison Chart Data Warehouse Hadoop / Streams Data Types Largely structured data Any type of data, structured or unstructured Data Loading Data is cleansed/structured before going into the warehouse to maximize its utility Reliability ACID compliant Not ACID compliant Raw data may be ingested as is, without any modification, as the relationships may not be understood or defined Integrity Database maintains integrity Applications code integrity Analytic Approach High value, structured data Repeated operations and processes (e.g. transactions, reports, BI, etc.) Relatively stable sources Well-understood requirements Optimized for fast access and analysis Highly variable data and content Iterative, exploratory analysis (e.g. scientific research, behavioral modeling) Volatile sources Ill-defined questions and changing requirements Optimized for flexibility Hardware Powerful appliance and optimized hardware Inexpensive, commodity hardware

Big Data Concepts and Hardware Considerations IBM s Value: Complementary Analytics Traditional Approach Structured, analytical, logical New Approach Creative, holistic thought, intuition Transaction Data Data Warehouse Hadoop Streams Web Logs Internal App Data Structured Repeatable Mainframe Data Linear Monthly sales reports Profitability analysis Customer surveys OLTP System Data Structured Repeatable Linear Enterprise Integration Unstructured Exploratory Iterative Social Data Unstructured Exploratory Iterative Text Data: emails Brand sentiment Product strategy Maximum asset utilization Sensor data: images ERP data Traditional Sources New Sources RFID 8

The Big Data Ecosystem: Interoperability is Key Streaming Data Internet- Scale Data Sets Non-Traditional / Non-Relational Data Feeds Non-Traditional / Non-Relational Data Sources Traditional / Relational Data Sources Streams RTAP: Analytics on Data in Motion BigInsights Analytics on Data at Rest Traditional Warehouse Traditional / Relational Data Sources Data Warehouse Analytics on Structured Data

Classic OLTP/Data Warehouse Environment Transactional Systems Transactional Data ETL Replication ETL Warehouse ETL Data Mart OLAP SQL Business Intelligence

Hadoop / Streams Environment Ingest Scripts Streams Ingest BigInsights Platform Nodes MapReduce Analytic Engines: SystemT SystemML Warehouse OLAP Reports Visualizations Classic BI

New Consolidated Environment Ingest Scripts Streams Ingest BigInsights Platform Nodes ETL Transactional Systems Transactional Data Replication ETL MapReduce Analytic Engines: SystemT SystemML Warehouse ETL Data Mart OLAP SQL Reports Visualizations Business Intelligence

Big Data Concepts and Hardware Considerations What can you do with big data? Financial Services Fraud detection Risk management 360 View of the Customer Utilities Weather impact analysis on power generation Transmission monitoring Smart grid management Transportation Weather and traffic impact on logistics and fuel consumption Traffic congestion IT System log analysis Cybersecurity Health & Life Sciences Epidemic early warning ICU monitoring Remote healthcare monitoring Retail 360 View of the Customer Click-stream analysis Real-time promotions Telecommunications CDR processing Churn prediction Geomapping / marketing Network monitoring Law Enforcement Real-time multimodal surveillance Situational awareness Cyber security detection 13

Big Data Concepts and Hardware Considerations The IBM Big Data Platform InfoSphere BigInsights Hadoop-based low latency analytics for variety and volume Data-At-Rest Hadoop Apache Hadoop: open source framework for the distributed processing of large data sets across clusters of computers using a simple programming model InfoSphere Information Server High volume data integration and transformation Information Integration MPP Data Warehouse Stream Computing InfoSphere Streams Low Latency Analytics for streaming data Velocity, Variety & Volume Data-In-Motion Netezza High Capacity Appliance PureData for Analytics 14 InfoSphere Warehouse Large volume structured data analytics Informix Timeseries Time-structured analytics

New Architecture to Leverage All Data and Analytics Streams Real-time Analytics Intelligence Analysis Data in Motion Data at Rest Information Ingestion and Operational Information Stream Processing Data Integration Master Data Video/Audio Network/Sensor Entity Analytics Predictive Landing Area, Analytics Zone and Archive Raw Data Structured Data Text Analytics Data Mining Entity Analytics Machine Learning Exploration, Integrated Warehouse, and Mart Zones Discovery Deep Reflection Operational Predictive Decision Management BI and Predictive Analytics Navigation and Discovery Data in Many Forms Information Governance, Security and Business Continuity

Entry points are accelerated by products within the big data platform 1 Discover, understand and navigate all big data sources Data Explorer (Vivisimo) 2 Take advantage of streaming data and find the valuable insights BI / Reporting Exploration / Visualization Visualization & Discovery Analytic Applications Functional App Industry App Predictive Analytics IBM Big Data Platform Application Development Content BI / Reporting Analytics Systems Management 5 Optimize infrastructure to lower cost and improve performance PDA/PDOA (Netezza/ISAS) InfoSphere Streams 3 - Extend your data warehouse to incorporate new data types InfoSphere BigInsights Hadoop System Accelerators Stream Computing Data Warehouse 6 Establish and maintain an accurate, secure and trusted view of all available data InfoSphere Information Server & Optim 4 Reduce the cost of your data warehouse by utilizing Hadoop InfoSphere BigInsights Information Integration & Governance Master Data Management Databases & Tools 7 Manage and leverage trusted customer and product information InfoSphere MDM

Big Data - Hadoop

Big Data Concepts and Hardware Considerations What is Hadoop? Traditonal computation model Bring data to the function Load data into memory, and process on a central server Does not scale well for Big Data problems Apache Hadoop: open source framework for data-intensive applications Inspired by Google technologies (MapReduce, GFS) Well-suited to batch-oriented, read-intensive applications Yahoo! Adopted these technologies and open sourced them into the Apache Hadoop project Enables applications to work with thousands of nodes and petabytes of data in a highly parallel, cost effective manner CPU + disks of commodity box = Hadoop node Boxes can be combined into massive clusters New nodes can be added as needed without changing 18 Data formats How data is loaded How jobs are written

Big Data Concepts and Hardware Considerations Hadoop Explained, Two Key Concepts: Map Reduce Hadoop computation model Data stored in a distributed file system spanning many inexpensive computers Bring function to the data Distribute application to the compute resources where the data is stored Scalable to thousands of nodes and petabytes of data public static class TokenizerMapper extends Mapper<Object,Text,Text,IntWritable> { private final static IntWritable one = new IntWritable(1); private Text word = new Text(); Hadoop Data Nodes public void map(object key, Text val, Context StringTokenizer itr = new StringTokenizer(val.toString()); while (itr.hasmoretokens()) { word.set(itr.nexttoken()); context.write(word, one); } } } public static class IntSumReducer extends Reducer<Text,IntWritable,Text,IntWrita private IntWritable result = new IntWritable(); public void reduce(text key, Iterable<IntWritable> val, Context context){ int sum = 0; for (IntWritable v : val) { sum += v.get();... MapReduce Application Distribute map tasks to cluster Shuffle 1. Map Phase (break job into small parts) 2. Shuffle (transfer interim output for final processing) 3. Reduce Phase (boil all output down to a single result set) 19 Result Set Return a single result set

Big Data Concepts and Hardware Considerations Hadoop Explained, Two Key Concepts: HDFS HDFS stores data across multiple nodes HDFS achieves reliability by replicating data across multiple nodes (typically 3 or more) file system is a cluster of data nodes serves up blocks of data over the network using a block protocol specific to HDFS HDFS Name Node is a single point of failure IBM eliminates this single point of failure while improving file system performance with GPFS-SNC (available as beta code) IBM Value for Hadoop! Nodes HDFS Cluster Operating System NameNode (Metadata store) Nodes GPFS-SNC Cluster Kernel Level 20

Big Data Concepts and Hardware Considerations InfoSphere BigInsights = Hadoop + IBM Innovation + IBM Innovation Scalable New nodes can be added on the fly Affordable Massively parallel computing on commodity servers Flexible Hadoop is schema-less, and can absorb any type of data Fault Tolerant Through MapReduce software framework Performance & Reliability Adaptive MapReduce, Compression, BigIndex, Flexible Scheduler Analytic Accelerators Productivity Accelerators Web-based UIs Tools to leverage existing skills End-user visualization Enterprise Integration To extend & enrich your information supply chain 22

Big Data Concepts and Hardware Considerations How IBM BigInsights extends Hadoop capability Manageability Single Click Integrated Install Browser Based Cluster Mgmt GPFS-SNC Developer Value New Query Language (jaql) Eclipse Tools for Analytics Broad Integration with other Information Management Technologies (DW, DataStage, RDBMS, et al) Integration with InfoSphere Streams Performance & Availability GPFS-SNC (Data Replication) Splittable Compression Improved Map/Reduce Job Scheduling Improvements Large Scale Indexing Advanced Analytics Bundled Scalable Text Analytics (AQL Sentiment Analysis NLP) BigData Scale Visualization (BigSheets) Bundled Scalable Machine Learning (DML) Security Secure File System (GPFS-SNC) Cluster Hardening 23

Providing competitive advantage Faster analysis, design and simulation while managing costs Financial Services Banking, financial markets, insurance Risk management, compliance, investment decisions, liquidity management Manufacturing Automotive, Aerospace and Defense and Engineering New designs, more complex products and higher quality Electronics Electronics and Semiconductor More complex designs and simulations Petroleum Research & Academic Government & Intelligence Oil and gas exploration and production Reserve identification, imaging and reclamation Life sciences, research, higher education Drug development, sequencing, cross department collaboration Scientific research, classified/defense, weather/ environmental sciences Intelligence gathering and insight development 24

Dynamic resource sharing among heterogeneous tenants A B C D Existing Analytic workload (SPSS) Big Data App #1 Big Data App #2 Big Data App #3 Workload Manager C C C C C C B B A A A A A A A A C C C C C C B B A A A A A A A A C C C C C C B D D D D D D B B B B B B B B B B D D D D D D B B B Resource Orchestration

Streams

Big Data Concepts and Hardware Considerations Stream Computing Analyze Data in Motion Traditional Computing Stream Computing Historical fact finding Find and analyze information stored on disk Batch paradigm, pull model Query-driven: submits queries to static data Current fact finding Analyze data in motion before it is stored Low latency paradigm, push model Data driven: bring the data to the query Query Data Results Data Query Results 27

Big Data Concepts and Hardware Considerations InfoSphere Streams: Massively Scalable Stream Analytics Linear Scalability Clustered deployments unlimited scalability Automated Deployment Automatically optimize operator deployment across clusters Performance Optimization JVM Sharing minimize memory use Fuse operators on same cluster Telco client 25 Million messages per second Analytics on Streaming Data 28 Analytic accelerators for a variety of data types (text, acoustic, image, video, geospatial, etc) Optimized for real-time performance Streaming Data Sources Source Adapters Deployments Analytic Operators Streams Runtime Sync Adapters Streams Studio IDE Automated and Optimized Deployment Visualization

Big Data Concepts and Hardware Considerations How Streams Works Continuous ingestion Continuous analysis Filter / Sample Infrastructure provides services for Scheduling analytics across hardware hosts, Establishing streaming connectivity Transform Annotate Correlate Classify Achieve scale: By partitioning applications into software components By distributing across stream-connected hardware hosts Where appropriate: Elements can be fused together for lower communication latency 29

IBM Consumer Oriented Analytic Architecture

Industrial references Banking Insurance Customer Intimacy & Offer Optimization Integration into Enterprise Risk Cross-sell & Retention Risk-based Claims Processing Government Crime Prevention Analytics for Smarter Cities Education Student Performance Retail Market Basket Analysis Assortment Planning Telco Churn Management Industrial Predictive Maintenance 35

Thank You Your questions 36