Big projects and use cases
Claus Samuelsen, IBM Analytics, Europe
csa@dk.ibm.com
IBM Software
Overview of BigInsights

Free Quick Start (non production):
- IBM Open Platform
- BigInsights Analyst, Scientist features
- Community support

IBM BigInsights Scientist:
- Text Analytics
- Machine Learning on Big R
- Big R (R support)
- Big SQL
- BigSheets
- ...

IBM BigInsights Analyst:
- Industry standard SQL (Big SQL)
- Spreadsheet-style tool (BigSheets)

IBM BigInsights Enterprise Management:
- POSIX Distributed Filesystem
- Multi-workload, multi-tenant scheduling

IBM Open Platform with Apache Hadoop* (, YARN, MapReduce, Ambari, HBase, Hive, Oozie, Parquet, Parquet Format, Pig, Snappy, Solr, Spark, Sqoop, Zookeeper, Open JDK, Knox, Slider)

*IBM Open Platform with Apache Hadoop is a 100% open source Apache Hadoop distribution. IBM will include the Open Platform common kernel once available.

2014 IBM Corporation
IBM Big SQL: Runs 100% of the queries
Other environments require significant effort at scale

Key points:
- With Impala and Hive, many queries needed to be re-written, some significantly
- Owing to various restrictions, some queries could not be re-written or failed at run-time
- Re-writing queries in a benchmark scenario where results are known is one thing; doing this against real databases in production is another

Results for 10TB scale shown here
Hadoop-DS benchmark: Single user performance @ 10TB

Big SQL is 3.6x faster than Impala and 5.4x faster than Hive 0.13 for a single query stream using 46 common queries.

Based on IBM internal tests comparing BigInsights Big SQL, Cloudera Impala and Hortonworks Hive (current versions available as of 9/01/2014) running on identical hardware. The test workload was based on the latest revision of the TPC-DS benchmark specification at 10TB data size. Successful executions measure the ability to execute queries a) directly from the specification without modification, b) after simple modifications, c) after extensive query rewrites. All minor modifications are either permitted by the TPC-DS benchmark specification or are of a similar nature. All queries were reviewed and attested by a TPC certified auditor. Development effort measured time required by a skilled SQL developer familiar with each system to modify queries so they will execute correctly. Performance test measured scaled query throughput per hour of 4 concurrent users executing a common subset of 46 queries across all 3 systems at 10TB data size. Results may not be typical and will vary based on actual workload, configuration, applications, queries and other variables in a production environment. Cloudera, the Cloudera logo, Cloudera Impala are trademarks of Cloudera. Hortonworks, the Hortonworks logo and other Hortonworks trademarks are trademarks of Hortonworks Inc. in the United States and other countries.
Big Projects
- Stock Trade Analysis
- Positive side effects of drugs
- Log File Root Cause Analysis
- CRM analysis
- 360 Degree Customer View
- Ontologies
- Gamers Behaviour
- Document classification
- Weather Analysis
- Roaming Log Analysis
- Sensitive Access
- Connected Cars
- Tax Fraud Investigation
- Historical Archive Research
- Warehouse Augmentation
- DNA sequencing

2009 IBM Corporation
Warehouse Augmentation
Banking Industry: Fraud Analysis

The customer wanted to implement two different kinds of fraud analysis: transaction fraud and social engineering fraud.

Problem:
- The existing data warehouse does not allow for long running jobs
- Extending the data warehouse has a huge cost
Warehouse Augmentation
Banking Industry: Fraud Analysis

Solution:
- Moving data to IBM BigInsights reduces the cost significantly
- No limitations on long running jobs
- Obtaining the data from the various sources is the most time consuming process
- Using Big SQL we can run the same queries in Hadoop as in the traditional warehouse
- With Big SQL the customer can connect using their standard JDBC/ODBC based SQL tools
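One way to picture "the same queries" running against both systems is the minimal Python sketch below. It assumes each system is reachable through a DB-API 2.0 connection (in practice through a JDBC or ODBC driver, which is not shown). Two in-memory SQLite databases stand in for the traditional warehouse and for Big SQL; the `transactions` table, its data, and the query are hypothetical and do not come from the customer project.

```python
import sqlite3

# Hypothetical query: total amount per account for large transactions.
QUERY = """
SELECT account_id, SUM(amount) AS total
FROM transactions
WHERE amount > 1000
GROUP BY account_id
ORDER BY account_id
"""


def make_stand_in(rows):
    """Build an in-memory SQLite database as a stand-in for a real system."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE transactions (account_id TEXT, amount REAL)")
    conn.executemany("INSERT INTO transactions VALUES (?, ?)", rows)
    return conn


def run_query(conn, sql):
    """Run unmodified SQL text on any DB-API 2.0 connection."""
    cur = conn.cursor()
    cur.execute(sql)
    return cur.fetchall()


rows = [("A1", 1500.0), ("A1", 2500.0), ("B2", 200.0), ("B2", 5000.0)]
warehouse = make_stand_in(rows)  # stands in for the traditional warehouse
hadoop = make_stand_in(rows)     # stands in for Big SQL on Hadoop

# The same SQL text is sent to both connections without any rewrite.
warehouse_result = run_query(warehouse, QUERY)
hadoop_result = run_query(hadoop, QUERY)
```

The point of the sketch is only that the SQL text and the calling code stay identical; what changes between the two systems is the connection.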
Document Classification
Insurance Industry: Automatic classification

Problem:
- Insurance documents are not standardized. They are typically free form documents written as e-mails, MS Word documents, etc.
- Incoming documents are not classified and are therefore often sent to the wrong department or the wrong person, resulting in unacceptably long processing times.
Document Classification

Solution:
- Using BigInsights Text Analytics, new documents can be classified automatically.
- The customer had described the characteristics of the different classes the documents had to be put into.
- Using these descriptions, we could implement the rules in BigInsights within three weeks, to a degree that satisfied the customer.
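The idea of turning class descriptions into rules can be sketched in plain Python. This is an illustration only and not BigInsights Text Analytics itself; the class names, the keywords, and the `classify` function are all hypothetical.

```python
import re

# Hypothetical class descriptions: each class is characterized by keywords.
RULES = {
    "motor_claim": ["car", "vehicle", "collision", "windscreen"],
    "home_claim": ["house", "burglary", "water damage", "roof"],
    "policy_change": ["change of address", "cancel", "new policy"],
}


def classify(text, rules=RULES, default="unclassified"):
    """Return the class whose keywords occur most often in the text."""
    lowered = text.lower()
    scores = {}
    for label, keywords in rules.items():
        # Count whole-word (or whole-phrase) occurrences of each keyword.
        scores[label] = sum(
            len(re.findall(r"\b" + re.escape(kw) + r"\b", lowered))
            for kw in keywords
        )
    best = max(scores, key=scores.get)
    # Documents that match no rule are left for manual routing.
    return best if scores[best] > 0 else default
```

For example, `classify("My car was damaged in a collision on Monday.")` returns `"motor_claim"`, while a text that matches no keyword returns `"unclassified"`.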
IBM big data
An IBM Proof of Technology

2013 IBM Corporation
Distinguishing characteristics

Application Portability & Integration:
- Shared with Hadoop ecosystem
- Comprehensive file format support
- Superior enablement of IBM and Third Party software

Performance:
- Modern MPP runtime
- Powerful SQL query rewriter
- Cost based optimizer
- Optimized for concurrent user throughput
- Results not constrained by memory

Rich SQL:
- Comprehensive SQL Support
- IBM SQL PL compatibility
- Extensive Analytic Functions

Federation:
- Distributed requests to multiple data sources within a single SQL statement
- Main data sources supported: DB2 LUW, Teradata, Oracle, Netezza, Informix, SQL Server

Enterprise Features:
- Advanced security/auditing
- Resource and workload management
- Self tuning memory management
- Comprehensive monitoring
Big SQL: Behind the scenes

Big SQL is derived from an existing IBM shared-nothing RDBMS:
- A very mature MPP architecture
- Already understands distributed joins and optimization

Behavior is sufficiently different:
- Certain SQL constructs are disabled
- Traditional data warehouse partitioning is unavailable
- New SQL constructs introduced

On the surface, porting a shared nothing RDBMS to a shared nothing cluster (Hadoop) seems easy, but ...

[Figure: Traditional Distributed RDBMS Architecture, showing four database partitions]
Architecture Overview

[Figure: architecture diagram. Labels on the master side: Big SQL Master, Big SQL Scheduler, Hive Metastore, DDL. Labels on each Big SQL Worker Node: Native I/O, Java I/O, Temp, HBase, MR Task Tracker, Other Service.]