A brief introduction to IBM's work around Hadoop: BigInsights

Yuan Hong Wang
Manager, Analytics Infrastructure Development
China Development Lab, IBM
yhwang@cn.ibm.com
Adding IBM Value To Hadoop

- Business Analyst (most users interact here): collection manipulation/visualization; catalog of collections
- Developer: custom development, hybrid models, etc.; job / workflow creation using the available resources/functions (Pig, Jaql, Hive)
- IT / Infrastructure admin: hardware, system management
- IBM value-add is layered over time on top of IBM Hadoop
BigInsights Software Stack

- Applications / Solutions / Partners / Community: BigSheets (included in BigInsights), Toro, Gumshoe, next-generation credit risk analytics, custom applications
- BigInsights Application Server: SPSS mining and scoring, unstructured analytics (SystemT), MetaTracker, Jaql
- BigInsights Core: install & configuration, monitoring, management console, DB & warehouse integration
- Enabling Infrastructure: IBM Distribution of Apache Hadoop (passed IBM legal and IP review, safe to use; enhancements: Flex Scheduler, HA, GPFS, and more)
Flex Scheduler
FIFO, FAIR and Flex

- FIFO concentrates on makespan
  - Simple; works well enough for batch jobs, but has well-known job starvation problems in interactive environments
- FAIR (Hadoop Fair Scheduler) concentrates on fairness
  - Avoids starvation by respecting minimum slot allocations per job, and proportionally sharing the remaining slots (slack) across the jobs
  - Has become the standard MapReduce scheduler
  - Does not really optimize for common scheduling metrics
- Flex: a new scheduler that can easily be optimized for different standard scheduling metrics within the given constraints
  - Fits naturally above the FAIR scheduler
  - Metrics (16 in total) include: weighted response time, weighted number of tardy jobs, SLA cost
- Example for response time: minimizing the average time until jobs complete
  - Good: 2-minute job, then 100-minute job. Average time: 52 minutes
  - Bad: 100-minute job, then 2-minute job. Average time: 101 minutes
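The response-time example above can be checked with a few lines of Python: a toy back-to-back schedule of the two jobs from the slide, run in each order.

```python
def avg_response_time(durations):
    """Average completion time when jobs run back-to-back in the given order."""
    t, total = 0, 0
    for d in durations:
        t += d          # this job finishes at cumulative time t
        total += t      # sum of completion times
    return total / len(durations)

print(avg_response_time([2, 100]))   # shortest first -> 52.0
print(avg_response_time([100, 2]))   # longest first  -> 101.0
```

Running the short job first cuts the average response time roughly in half, which is exactly the effect a metric-aware scheduler like Flex can exploit.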
Two ideas the Flex Scheduler is based on

1. Given a priority ordering of jobs, we compute a high-quality malleable packing scheme
   - Fact: for any of our possible metrics, this packing will actually be optimal for some priority ordering
2. We can find a high-quality ordering for any of our possible metrics by optimally solving an appropriate but generic Resource Allocation Problem (RAP)
   - The RAP scheme will actually be optimal in the context of moldable scheduling, assuming positive minima for each job
   - We can create better, possibly optimal schemes that are specific to selected metrics and to minimum (and maximum) range values
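As a toy illustration of the first idea (this is not the actual Flex algorithm), the sketch below packs jobs malleably for a given priority ordering: every unfinished job keeps a minimum slot guarantee, all remaining slack goes to the highest-priority unfinished job, and the allocation is recomputed whenever a job completes. The job parameters and the slack rule are illustrative assumptions.

```python
def malleable_schedule(jobs, total_slots):
    """Toy malleable packing for a fixed priority ordering.

    jobs: list of (work_in_slot_seconds, min_slots), in priority order.
    Assumes the minima sum to at most total_slots.
    Returns each job's completion time.
    """
    remaining = [w for w, _ in jobs]
    mins = [m for _, m in jobs]
    done = [None] * len(jobs)
    t = 0.0
    while any(r > 1e-9 for r in remaining):
        active = [i for i, r in enumerate(remaining) if r > 1e-9]
        alloc = {i: mins[i] for i in active}           # guaranteed minima
        slack = total_slots - sum(alloc.values())
        alloc[active[0]] += slack                      # slack to top priority
        # advance time to the next job completion under this allocation
        dt = min(remaining[i] / alloc[i] for i in active if alloc[i] > 0)
        for i in active:
            remaining[i] -= alloc[i] * dt
        t += dt
        for i in active:
            if remaining[i] <= 1e-9 and done[i] is None:
                done[i] = t
    return done

# Two jobs of 10 slot-seconds each, min 1 slot, on a 4-slot cluster:
# the top-priority job gets 3 slots and finishes at ~3.33s, then the
# second job gets all 4 slots and finishes at 5.0s.
print(malleable_schedule([(10, 1), (10, 1)], total_slots=4))
```

Each recomputation point corresponds to one "layer" of the packing; the final schedule is the stack of layers.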
Malleable Packing

[Figure: a malleable packing built up layer by layer: first layer, second layer, third layer, and the final packing]
Allocation Layer Model / Assignment Layer Reality

- Allocation layer: the model
- Assignment layer: the reality; lots of independent small tasks that get assigned to slots when other tasks complete

[Figure: slots vs. time, comparing the allocation-layer model with the assignment-layer reality]
Current Status

- FLEX code is integrated with Hadoop
- Experiments
  - Extensive simulation results: 50% improvement in average response time and maximum stretch; >5x improvement for other, harder metrics
  - Experiments with the GridMix2 workload: 30% improvement in average response time over FAIR
  - Detailed experiment runs with a customer workload: up to 50% improvement in average response time over FAIR
- Paper accepted at Middleware 2010
MetaTracker
MetaTracker Background

- Challenges of managing production analytics flows: they are complex, failure-prone, long-running, and continuous (24x7)
- MetaTracker: a data-centric workflow orchestration system for production analytics flows, providing mechanisms for
  - Defining flows that mix time- and data-triggered jobs (Hadoop jobs; Jaql/Pig/Hive jobs; arbitrary scripts, programs, or Java code)
  - Defining continuous flows that never stop
  - Modifying flow parameters on the fly
  - Injecting data into a running flow
  - Managing permanent data (HDFS, GPFS, and NFS support; multi-version concurrency control for data stored in Lucene indexes)
  - Recovering from failures (prevents data loss or data corruption; creates stand-alone test cases for unrecoverable software errors)
MetaTracker: Architecture

- Flow description: a job graph expressed in JSON, submitted through a Java API
- Scheduler: maintains the set of active job instances (JI)
- State manager: persists state to a state store (database)
- Storage: temp, working, and permanent directories on the distributed and local filesystems
Example Job: Compute scores for documents

- Data trigger: create a new job instance when instances of all input directories are ready
- script variable: location of the Jaql script that computes scores (analytics.jaql)
- Docs input: placeholder for a directory containing a batch of documents
- DocScores output: placeholder for a directory containing a batch of document scores
- JaqlJob class: a Java class that knows how to create Jaql job instances
Example Job Graph: Crawl some documents, then compute their scores

- A time-triggered crawl job (CrawlerJob) writes a batch of crawled documents (CrawlDocs)
- A data-triggered analytics job (JaqlJob, script: analytics.jaql) reads the documents (Docs) and writes their scores (DocScores)
- Any output can be directed to a permanent location on disk (here, DocScores goes to docscoresdir)
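The slides say flows are expressed as job graphs in JSON, but the actual MetaTracker schema is not shown. The following Python sketch therefore builds a purely hypothetical JSON description of the crawl-then-score graph; all field names are illustrative, not the real MetaTracker format.

```python
import json

# Hypothetical job graph: a time-triggered crawl job feeding a
# data-triggered Jaql scoring job. Field names are invented for
# illustration; only the shape of the flow comes from the slides.
job_graph = {
    "jobs": [
        {
            "name": "CrawlJob",
            "class": "CrawlerJob",
            "trigger": {"type": "time", "interval": "1h"},
            "outputs": ["CrawlDocs"],
        },
        {
            "name": "AnalyticsJob",
            "class": "JaqlJob",
            "trigger": {"type": "data", "inputs": ["Docs"]},
            "params": {"script": "analytics.jaql"},
            "inputs": ["Docs"],
            "outputs": ["DocScores"],
        },
    ],
    # Any output can be directed to a permanent location on disk.
    "permanent": {"DocScores": "docscoresdir"},
}

print(json.dumps(job_graph, indent=2))
```

The data trigger on the second job is what makes the flow continuous: each new batch of crawled documents spawns a fresh scoring job instance.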
MetaTracker: Comparison with Oozie

Oozie is a Hadoop workflow system from Yahoo.

Capabilities of the MetaTracker not currently supported in Oozie:
- Defining flows that mix time- and data-triggered jobs
- Defining continuous flows that never stop (Oozie v2 provides limited support: a separate coordinator system that invokes a one-shot flow multiple times)
- Modifying flow parameters on the fly
- Injecting data into a running flow
- Managing permanent data and recovering from failures (Oozie's scheduler is fault tolerant, but all data operations in an Oozie workflow are performed by the flow itself; preventing data corruption is the responsibility of the flow's author)

Capabilities of Oozie not currently supported in the MetaTracker:
- Support for multiple users submitting workflows
- Web UI for monitoring flow status
JAQL

http://code.google.com/p/jaql/
Jaql: Reusable scripts for massive, semi-structured data

- Dataflows for conceptual JSON data
- Key differentiators:
  1. Functions: reusability + abstraction
  2. Physical transparency: precise control when needed
  3. Data model: semi-structured, based on JSON
- Architecture: a flexible scripting language (Jaql) compiled down to a scalable MapReduce runtime over a fault-tolerant DFS
Jaql Basics: Writing a pipeline in Jaql

A pipeline flows from a source through operators to a sink. Example: find users in zip 94114.

Input:
  [ { id: 12, name: "Joe Smith", bday: date("1971-03-07"), zip: 94114 },
    { id: 17, name: "Ann Jones", bday: date("1973-02-04"), zip: 94110 },
    { id: 19, name: "Alicia Fox", bday: date("1975-04-20"), zip: 94114 } ]

Output:
  [ { id: 12, name: "Joe Smith" },
    { id: 19, name: "Alicia Fox" } ]
Jaql Basics: Writing a pipeline in Jaql (read, filter, transform, write)

Find users in zip 94114:

Query:
  read(hdfs("users"))
    -> filter $.zip == 94114
    -> transform { $.id, $.name }
    -> write(hdfs("inzip"));

Data:
  [ { id: 12, name: "Joe Smith" },
    { id: 19, name: "Alicia Fox" } ]

Jaql supports the common data operators: filter, transform, join, group, sort, union (see http://code.google.com/p/jaql/).
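To make the semantics of the filter and transform steps concrete, here is the same pipeline written as plain Python over an in-memory copy of the records (the HDFS read/write and the date fields are omitted).

```python
users = [
    {"id": 12, "name": "Joe Smith", "zip": 94114},
    {"id": 17, "name": "Ann Jones", "zip": 94110},
    {"id": 19, "name": "Alicia Fox", "zip": 94114},
]

# filter $.zip == 94114  ->  transform { $.id, $.name }
inzip = [{"id": u["id"], "name": u["name"]} for u in users if u["zip"] == 94114]
print(inzip)
```

Jaql evaluates the same dataflow, but compiles it to MapReduce jobs so it scales across a cluster instead of a single process.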
Jaql New Features

- Java/Python API: support for Jaql being called from Java and Python programs
- Exception handling / logging: timeout, fence
- Native MapReduce jobs: allow existing MapReduce programs to be evaluated from Jaql
- External call: allow existing programs (legacy programs or additional tools) to be called from Jaql
- Parallel operators: union, tee, sort, etc.
- R/SPSS integration:
  - Run R & SPSS in parallel over partitions of Hadoop data
  - Launch Jaql jobs from R & SPSS
  - Load Hadoop data into R & SPSS
- More exploratory features...
SPSS Script with Embedded Jaql

  * Jaql script to load one month of data for one station.
  BEGIN PROGRAM JAQL.
  fd = del("co2.dat",   // data in a CSV file in HDFS
           { schema: schema { station: string, year: long, month: long,
                              day: long, co2: double? } });
  read(fd)
    -> filter $.station == 'Hongkong'
    -> filter $.month == 7
    -> filter not isnull($.co2)
    -> spssdataset();   // Jaql function to set the current SPSS dataset
  END PROGRAM.

  * Compute descriptive statistics (quartiles, min, max) on the station-month of data.
  EXAMINE VARIABLES=co2
    /PERCENTILES
    /NOTOTAL.

This produces the corresponding SPSS dataset, pivot tables, and chart.
Jaql Script with Embedded SPSS Statistics

  boxstats = fn(nums) (
    s = spssproc(proc = "EXAMINE VARIABLES= VAR1 /PERCENTILES /NOTOTAL.",
                 command = "Explore",
                 subtypes = ["Percentiles", "Descriptives"],
                 data = nums),
    { min: s.descriptives[7].statistic,   // extracted from pivot tables in the SPSS output log
      q25: s.percentiles[1].@25,
      q50: s.percentiles[1].@50,
      q75: s.percentiles[1].@75,
      max: s.descriptives[8].statistic }
  );

  read(fd)
    -> filter $.station == 'Hongkong'
    -> filter $.month == 7
    -> filter not isnull($.co2)
    -> transform $.co2
    -> boxstats();   // pass CO2 values to SPSS

Result:
  { min: 334.97, q25: 337.07, q50: 338.07, q75: 341.17, max: 349.57 }
BigSheets

http://www-01.ibm.com/software/ebusiness/jstart/bigsheets/index.html
BigSheets Background

What is it?
- A web application, similar to a traditional spreadsheet, used by the domain expert for performing ad-hoc analysis at web scale on unstructured and structured content
- Formulas, functions, and custom functions
- Puts MapReduce & Hadoop to work for the line-of-business user

How does it work?
- Gather content either statically (e.g. crawl) or dynamically through connectors
- Extract local or document-level information (e.g. a congressperson's name), cleanse, normalize
- Explore, analyze, annotate, and navigate content; filter on existing and new relationships; generate results and visualize them
- Iterate at any and all steps
- Employs a browser-based visual front end with a spreadsheet metaphor to create worksheets for exploring/visualizing the big data
BigSheets Logical view

- User interface & front-end server (JSP container: Jetty + JDBC): create, monitor, import/export, extend, visualize
- REST API for the customer's choice of analytic service/engine; REST API for choice of visualization; export content as feeds, JSON, CSV, ...
- MetaTracker server and job controller driving standalone BigSheets + Hadoop
- Underlying components: MapReduce (Hadoop), distributed filesystem (HDFS/GPFS), Jaql/Pig/Hive (scripting MapReduce), Nutch (web crawler), other tools (LanguageWare, ICM, etc.)
- The stack mixes Apache projects, IBM analytic products, IBM Hadoop common components, and IBM products enabling Apache projects
BigSheets dynamic view

The front-end server (web app) and the BigSheets job server interact with the Hadoop & HDFS cluster over four channels:

1. Main communication to the Hadoop cluster: job & collection lifecycle, job monitoring
2. Reads data for collections directly from the DFS (FileSystem)
3. Job controller: job and collection management
   - A collection may consist of many MapReduce jobs; the job server tracks each job and monitors progress (upon request)
   - Executes the entire job chain
   - Starts, stops, and monitors running jobs
   - Manages collection versions and job versions
4. Completion channel: when a job completes, Hadoop informs the job server so that the next step can be executed or the job marked complete