Big Data projects and use cases. Claus Samuelsen IBM Analytics, Europe csa@dk.ibm.com





IBM Software: Overview of BigInsights
Free Quick Start (non-production): IBM Open Platform; BigInsights Analyst and Data Scientist features; community support.
IBM BigInsights Data Scientist: Text Analytics; Machine Learning on Big R; Big R (R support).
IBM BigInsights Analyst: industry-standard SQL (Big SQL); spreadsheet-style tool (BigSheets).
IBM BigInsights Enterprise Management: POSIX distributed filesystem; multi-workload, multi-tenant scheduling...
IBM Open Platform with Apache Hadoop* (YARN, MapReduce, Ambari, HBase, Hive, Oozie, Parquet, Parquet Format, Pig, Snappy, Solr, Spark, Sqoop, Zookeeper, Open JDK, Knox, Slider)
*IBM Open Platform with Apache Hadoop is a 100% open source Apache Hadoop distribution. IBM will include the Open Platform common kernel once available.
2014 IBM Corporation

IBM Big SQL runs 100% of the queries
Other environments require significant effort at scale.
Key points:
With Impala and Hive, many queries needed to be rewritten, some significantly.
Owing to various restrictions, some queries could not be rewritten or failed at run time.
Rewriting queries in a benchmark scenario where results are known is one thing; doing this against real databases in production is another.
Results for 10TB scale shown here.

Hadoop-DS benchmark: Single user performance @ 10TB
Big SQL is 3.6x faster than Impala and 5.4x faster than Hive 0.13 for a single query stream using 46 common queries.
Based on IBM internal tests comparing BigInsights Big SQL, Cloudera Impala and Hortonworks Hive (current versions available as of 9/01/2014) running on identical hardware. The test workload was based on the latest revision of the TPC-DS benchmark specification at 10TB data size. Successful executions measure the ability to execute queries a) directly from the specification without modification, b) after simple modifications, c) after extensive query rewrites. All minor modifications are either permitted by the TPC-DS benchmark specification or are of a similar nature. All queries were reviewed and attested by a TPC certified auditor. Development effort measured the time required by a skilled SQL developer familiar with each system to modify queries so they will execute correctly. Performance test measured scaled query throughput per hour of 4 concurrent users executing a common subset of 46 queries across all 3 systems at 10TB data size. Results may not be typical and will vary based on actual workload, configuration, applications, queries and other variables in a production environment.
Cloudera, the Cloudera logo, Cloudera Impala are trademarks of Cloudera. Hortonworks, the Hortonworks logo and other Hortonworks trademarks are trademarks of Hortonworks Inc. in the United States and other countries.
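The two speedup factors can be put on a common scale with a little arithmetic. The sketch below is only an illustration: it assumes that "N x faster" means the other engine needs N times the Big SQL elapsed time, which the slide does not state explicitly, and the helper name relative_elapsed is made up for this example.

```python
# Speedup factors quoted on the slide (single query stream, 10TB).
SPEEDUP_VS_IMPALA = 3.6
SPEEDUP_VS_HIVE = 5.4


def relative_elapsed(speedup: float, baseline: float = 1.0) -> float:
    """Elapsed time of the slower engine in units of the Big SQL elapsed time.

    Assumption (not from the slide): "N x faster" means the other engine
    takes N times as long as Big SQL on the same workload.
    """
    return baseline * speedup


impala_time = relative_elapsed(SPEEDUP_VS_IMPALA)
hive_time = relative_elapsed(SPEEDUP_VS_HIVE)

# Ratio of the two normalized times: how much faster Impala would be than
# Hive under the same assumption.
impala_vs_hive = hive_time / impala_time

print(f"Impala:    {impala_time:.1f}x the Big SQL elapsed time")
print(f"Hive 0.13: {hive_time:.1f}x the Big SQL elapsed time")
print(f"Implied Impala speedup over Hive: {impala_vs_hive:.2f}x")
```

Under that reading, the quoted figures imply that Impala was roughly 1.5x faster than Hive 0.13 on the same 46 queries. The slide itself only reports the two comparisons against Big SQL.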

Big Data Projects
Stock Trade Analysis
Positive side effects of drugs
Log File Root Cause Analysis
CRM analysis
360 Degree Customer View
Ontologies
Gamers Behaviour
Document classification
Weather Analysis
Roaming Log Analysis
Sensitive Access
Connected Cars
Tax Fraud Investigation
Historical Archive Research
Warehouse Augmentation
DNA sequencing
2009 IBM Corporation

Warehouse Augmentation: Banking Industry Fraud Analysis
The customer wanted to implement two different kinds of fraud analysis: transaction fraud and social engineering fraud.
Problem:
The existing data warehouse does not allow for long-running jobs.
Extending the data warehouse has a huge cost.

Warehouse Augmentation: Banking Industry Fraud Analysis
Solution:
Moving data to IBM BigInsights reduces the cost significantly.
No limitations on long-running jobs.
Obtaining the data from the various sources is the most time-consuming process.
Using Big SQL, we can run the same queries in Hadoop as in the traditional warehouse.
With Big SQL, the customer can connect using their standard JDBC/ODBC-based SQL tools.
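The point about running the same queries on both sides can be illustrated with a small sketch. It is not Big SQL code: two in-memory SQLite databases stand in for the traditional warehouse and the Hadoop side, and the table, column names, query, and threshold are all hypothetical. What it shows is that one SQL text and one client function can serve any backend that exposes a standard SQL interface (here Python's DB-API 2.0; JDBC/ODBC tools rely on the same idea).

```python
import sqlite3

# Hypothetical fraud-style query: accounts whose total amount exceeds a limit.
# The "?" placeholder is SQLite's parameter style; other drivers may use a
# different marker.
SUSPICIOUS_TOTALS_SQL = """
SELECT account_id, SUM(amount) AS total
FROM transactions
GROUP BY account_id
HAVING SUM(amount) > ?
ORDER BY account_id
"""


def make_demo_connection(rows):
    """Create an in-memory SQLite database standing in for one SQL backend."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE transactions (account_id TEXT, amount REAL)")
    conn.executemany("INSERT INTO transactions VALUES (?, ?)", rows)
    return conn


def suspicious_accounts(conn, threshold):
    """Run the same SQL text against whichever connection is passed in."""
    cur = conn.cursor()
    cur.execute(SUSPICIOUS_TOTALS_SQL, (threshold,))
    return cur.fetchall()


# Made-up sample rows, loaded into both stand-in backends.
rows = [("A", 100.0), ("A", 950.0), ("B", 20.0)]
warehouse = make_demo_connection(rows)
hadoop_side = make_demo_connection(rows)

# The unchanged query returns the same answer from both backends.
print(suspicious_accounts(warehouse, 1000.0))
print(suspicious_accounts(hadoop_side, 1000.0))
```

In a real deployment, the two connections would come from the respective JDBC/ODBC drivers rather than from SQLite, and only the connection setup would differ.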

Document Classification: Insurance Industry, Automatic Classification
Problem:
Insurance documents are not standardized. They are typically free-form documents written as e-mails, MS Word files, etc.
Incoming documents are not classified and are therefore often sent to the wrong department or the wrong person, resulting in unacceptably long processing times.

Document Classification
Solution:
Using BigInsights Text Analytics, new documents can be classified automatically.
The customer had described the characteristics of the different classes that the documents had to be put into.
Using these descriptions, we could implement the rules in BigInsights in three weeks, to a degree that satisfied the customer.
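The idea of turning class descriptions into rules can be sketched in a few lines. This is a deliberately simple keyword scorer in plain Python, not BigInsights Text Analytics, and the class names and keywords are hypothetical; the real rules came from the customer's own descriptions.

```python
import re
from collections import Counter

# Hypothetical classes and keywords, invented for this illustration.
RULES = {
    "claims": ["claim", "damage", "accident"],
    "billing": ["invoice", "payment", "premium"],
    "cancellation": ["cancel", "terminate"],
}


def classify(text, rules=RULES, default="unclassified"):
    """Return the class whose keywords occur most often in the text.

    Matching is on whole lowercase words only, so "cancelled" does not
    match the keyword "cancel"; a real rule set would handle word forms.
    """
    words = Counter(re.findall(r"[a-z]+", text.lower()))
    scores = {
        label: sum(words[keyword] for keyword in keywords)
        for label, keywords in rules.items()
    }
    best = max(scores, key=scores.get)
    # With no keyword hits at all, fall back to the default class.
    return best if scores[best] > 0 else default


print(classify("I want to report an accident and file a claim for the damage."))
print(classify("Please find my payment for the latest invoice attached."))
print(classify("Hello, could someone call me back?"))
```

Documents that match no rule fall into a default class, so they can be routed to manual handling instead of being sent to the wrong department.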

IBM big data: An IBM Proof of Technology
2013 IBM Corporation

IBM Software: Distinguishing characteristics
Application Portability & Integration: Data shared with Hadoop ecosystem; comprehensive file format support; superior enablement of IBM and third-party software.
Performance: Modern MPP runtime; powerful SQL query rewriter; cost-based optimizer; optimized for concurrent user throughput; results not constrained by memory.
Rich SQL: Comprehensive SQL support; IBM SQL PL compatibility; extensive analytic functions.
Federation: Distributed requests to multiple data sources within a single SQL statement. Main data sources supported: DB2 LUW, Teradata, Oracle, Netezza, Informix, SQL Server.
Enterprise Features: Advanced security/auditing; resource and workload management; self-tuning memory management; comprehensive monitoring.
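The federation point, one SQL statement spanning several data sources, can be illustrated with a stand-in. The sketch below does not use Big SQL or any of the listed databases: it attaches a second in-memory SQLite database to one connection and joins across the two, and the schema names, tables, and rows are all hypothetical.

```python
import sqlite3


def federated_click_counts():
    """Join tables from two separate databases in one SQL statement.

    SQLite's ATTACH DATABASE is used here only as an analogy for a
    federated query; it is not how Big SQL federation is configured.
    """
    conn = sqlite3.connect(":memory:")
    # A second, independent in-memory database under the schema "warehouse".
    conn.execute("ATTACH DATABASE ':memory:' AS warehouse")

    conn.execute("CREATE TABLE main.clicks (customer_id INTEGER, page TEXT)")
    conn.execute("CREATE TABLE warehouse.customers (customer_id INTEGER, name TEXT)")
    conn.executemany(
        "INSERT INTO main.clicks VALUES (?, ?)",
        [(1, "/home"), (1, "/offers"), (2, "/home")],
    )
    conn.executemany(
        "INSERT INTO warehouse.customers VALUES (?, ?)",
        [(1, "Ada"), (2, "Bo")],
    )

    # A single statement that reads from both data sources.
    sql = """
    SELECT c.name, COUNT(*) AS clicks
    FROM main.clicks AS k
    JOIN warehouse.customers AS c ON c.customer_id = k.customer_id
    GROUP BY c.name
    ORDER BY c.name
    """
    rows = conn.execute(sql).fetchall()
    conn.close()
    return rows


print(federated_click_counts())
```

The caller writes one query and gets one result set, without moving either table into the other store first.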

IBM Software: Big SQL, behind the scenes
Big SQL is derived from an existing IBM shared-nothing RDBMS.
A very mature MPP architecture.
Already understands distributed joins and optimization.
Behavior is sufficiently different:
Certain SQL constructs are disabled.
Traditional data warehouse partitioning is unavailable.
New SQL constructs introduced.
On the surface, porting a shared-nothing RDBMS to a shared-nothing cluster (Hadoop) seems easy, but ...
[Diagram: Traditional Distributed RDBMS Architecture, showing four database partitions]

IBM Software: Architecture Overview
[Diagram: Database Service, Big SQL Scheduler, Big SQL Master, and Hive Metastore (DDL), together with several Big SQL Worker Nodes. Each worker node is shown with a Big SQL Worker using native I/O and Java I/O, an MR Task Tracker, HBase, other services, and temp storage.]