A brief introduction to IBM's work around Hadoop: BigInsights

Yuan Hong Wang
Manager, Analytics Infrastructure Development
China Development Lab, IBM
yhwang@cn.ibm.com
Adding IBM Value To Hadoop

- Business Analyst (most users interact here): collection manipulation/visualization; catalog of collections
- Developer: custom development, hybrid models, etc.; job / workflow creation using the available resources/functions (Pig, Jaql, Hive)
- IT / Infrastructure admin: hardware, system management
- IBM value-add is layered over time on top of IBM Hadoop
BigInsights Software Stack

- Applications / Solutions / Partners / Community: BigSheets (included in BigInsights), Toro, Gumshoe, next-generation credit risk analytics, custom applications
- BigInsights Application Server: SPSS mining and scoring, unstructured analytics (SystemT), MetaTracker, Jaql
- BigInsights Core: install & configuration, monitoring, management console, DB & warehouse integration
- Enabling Infrastructure: IBM Distribution of Apache Hadoop (passed IBM legal and IP review, safe to use; enhancements: Flex Scheduler, HA, GPFS, and more)
Flex Scheduler
FIFO, FAIR and Flex

- FIFO concentrates on makespan
  - Simple; works well enough for batch jobs, but has well-known job starvation problems in interactive environments
- FAIR (Hadoop Fair Scheduler) concentrates on fairness
  - Avoids starvation by respecting minimum slot allocations per job, and proportionally sharing the remaining slots (slack) across the jobs
  - Has become the standard MapReduce scheduler
  - Does not really optimize for common scheduling metrics
- Flex: a new scheduler that can easily be optimized for different standard scheduling metrics within the given constraints
  - Fits naturally above the FAIR scheduler
  - Metrics (16 in total) include: weighted response time, weighted number of tardy jobs, SLA cost
- Example for response time: minimizing the average time until jobs complete
  - Good: 2-minute job, then 100-minute job. Average time: 52 minutes
  - Bad: 100-minute job, then 2-minute job. Average time: 101 minutes
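The response-time example above can be checked with a few lines of Python: a toy back-to-back schedule of the two jobs from the slide, run in each order.

```python
def avg_response_time(durations):
    """Average completion time when jobs run back-to-back in the given order."""
    t, total = 0, 0
    for d in durations:
        t += d          # this job finishes at cumulative time t
        total += t      # sum of completion times
    return total / len(durations)

print(avg_response_time([2, 100]))   # shortest first -> 52.0
print(avg_response_time([100, 2]))   # longest first  -> 101.0
```

Running the short job first cuts the average response time roughly in half, which is exactly the effect a metric-aware scheduler like Flex can exploit.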
Two ideas the Flex Scheduler is based on

1. Given a priority ordering of jobs, we compute a high-quality malleable packing scheme
   - Fact: for any of our possible metrics, this packing will actually be optimal for some priority ordering
2. We can find a high-quality ordering for any of our possible metrics by optimally solving an appropriate but generic Resource Allocation Problem (RAP)
   - The RAP scheme will actually be optimal in the context of moldable scheduling, assuming positive minima for each job
   - We can create better, possibly optimal schemes that are specific to selected metrics and to minimum (and maximum) range values
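As a toy illustration of the first idea (this is not the actual Flex algorithm), the sketch below packs jobs malleably for a given priority ordering: every unfinished job keeps a minimum slot guarantee, all remaining slack goes to the highest-priority unfinished job, and the allocation is recomputed whenever a job completes. The job parameters and the slack rule are illustrative assumptions.

```python
def malleable_schedule(jobs, total_slots):
    """Toy malleable packing for a fixed priority ordering.

    jobs: list of (work_in_slot_seconds, min_slots), in priority order.
    Assumes the minima sum to at most total_slots.
    Returns each job's completion time.
    """
    remaining = [w for w, _ in jobs]
    mins = [m for _, m in jobs]
    done = [None] * len(jobs)
    t = 0.0
    while any(r > 1e-9 for r in remaining):
        active = [i for i, r in enumerate(remaining) if r > 1e-9]
        alloc = {i: mins[i] for i in active}           # guaranteed minima
        slack = total_slots - sum(alloc.values())
        alloc[active[0]] += slack                      # slack to top priority
        # advance time to the next job completion under this allocation
        dt = min(remaining[i] / alloc[i] for i in active if alloc[i] > 0)
        for i in active:
            remaining[i] -= alloc[i] * dt
        t += dt
        for i in active:
            if remaining[i] <= 1e-9 and done[i] is None:
                done[i] = t
    return done

# Two jobs of 10 slot-seconds each, min 1 slot, on a 4-slot cluster:
# the top-priority job gets 3 slots and finishes at ~3.33s, then the
# second job gets all 4 slots and finishes at 5.0s.
print(malleable_schedule([(10, 1), (10, 1)], total_slots=4))
```

Each recomputation point corresponds to one "layer" of the packing; the final schedule is the stack of layers.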
Malleable Packing

[Figure: a malleable packing built up layer by layer: first layer, second layer, third layer, and the final packing]
Allocation Layer Model / Assignment Layer Reality

- Allocation layer: the model
- Assignment layer: the reality; lots of independent small tasks that get assigned to slots when other tasks complete

[Figure: slots vs. time, comparing the allocation-layer model with the assignment-layer reality]
Current Status

- FLEX code is integrated with Hadoop
- Experiments
  - Extensive simulation results: 50% improvement in average response time and maximum stretch; >5x improvement for other, harder metrics
  - Experiments with the GridMix2 workload: 30% improvement in average response time over FAIR
  - Detailed experiment runs with a customer workload: up to 50% improvement in average response time over FAIR
- Paper accepted at Middleware 2010
MetaTracker
MetaTracker Background

- Challenges of managing production analytics flows: they are complex, failure-prone, long-running, and continuous (24x7)
- MetaTracker: a data-centric workflow orchestration system for production analytics flows, providing mechanisms for
  - Defining flows that mix time- and data-triggered jobs (Hadoop jobs; Jaql/Pig/Hive jobs; arbitrary scripts, programs, or Java code)
  - Defining continuous flows that never stop
  - Modifying flow parameters on the fly
  - Injecting data into a running flow
  - Managing permanent data (HDFS, GPFS, and NFS support; multi-version concurrency control for data stored in Lucene indexes)
  - Recovering from failures (prevents data loss or data corruption; creates stand-alone test cases for unrecoverable software errors)
MetaTracker: Architecture

- Flow description: a job graph expressed in JSON, submitted through a Java API
- Scheduler: maintains the set of active job instances (JI)
- State manager: persists state to a state store (database)
- Storage: temp, working, and permanent directories on the distributed and local filesystems
Example Job: Compute scores for documents

- Data trigger: create a new job instance when instances of all input directories are ready
- script variable: location of the Jaql script that computes scores (analytics.jaql)
- Docs input: placeholder for a directory containing a batch of documents
- DocScores output: placeholder for a directory containing a batch of document scores
- JaqlJob class: a Java class that knows how to create Jaql job instances
Example Job Graph: Crawl some documents, then compute their scores

- A time-triggered crawl job (CrawlerJob) writes a batch of crawled documents (CrawlDocs)
- A data-triggered analytics job (JaqlJob, script: analytics.jaql) reads the documents (Docs) and writes their scores (DocScores)
- Any output can be directed to a permanent location on disk (here, DocScores goes to docscoresdir)
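The slides say flows are expressed as job graphs in JSON, but the actual MetaTracker schema is not shown. The following Python sketch therefore builds a purely hypothetical JSON description of the crawl-then-score graph; all field names are illustrative, not the real MetaTracker format.

```python
import json

# Hypothetical job graph: a time-triggered crawl job feeding a
# data-triggered Jaql scoring job. Field names are invented for
# illustration; only the shape of the flow comes from the slides.
job_graph = {
    "jobs": [
        {
            "name": "CrawlJob",
            "class": "CrawlerJob",
            "trigger": {"type": "time", "interval": "1h"},
            "outputs": ["CrawlDocs"],
        },
        {
            "name": "AnalyticsJob",
            "class": "JaqlJob",
            "trigger": {"type": "data", "inputs": ["Docs"]},
            "params": {"script": "analytics.jaql"},
            "inputs": ["Docs"],
            "outputs": ["DocScores"],
        },
    ],
    # Any output can be directed to a permanent location on disk.
    "permanent": {"DocScores": "docscoresdir"},
}

print(json.dumps(job_graph, indent=2))
```

The data trigger on the second job is what makes the flow continuous: each new batch of crawled documents spawns a fresh scoring job instance.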
MetaTracker: Comparison with Oozie

Oozie is a Hadoop workflow system from Yahoo.

Capabilities of the MetaTracker not currently supported in Oozie:
- Defining flows that mix time- and data-triggered jobs
- Defining continuous flows that never stop (Oozie v2 provides limited support: a separate coordinator system that invokes a one-shot flow multiple times)
- Modifying flow parameters on the fly
- Injecting data into a running flow
- Managing permanent data and recovering from failures (Oozie's scheduler is fault tolerant, but all data operations in an Oozie workflow are performed by the flow itself; preventing data corruption is the responsibility of the flow's author)

Capabilities of Oozie not currently supported in the MetaTracker:
- Support for multiple users submitting workflows
- Web UI for monitoring flow status
JAQL

http://code.google.com/p/jaql/
Jaql: Reusable scripts for massive, semi-structured data

- Dataflows for conceptual JSON data
- Key differentiators:
  1. Functions: reusability + abstraction
  2. Physical transparency: precise control when needed
  3. Data model: semi-structured, based on JSON
- Architecture: a flexible scripting language (Jaql) compiled down to a scalable MapReduce runtime over a fault-tolerant DFS
Jaql Basics: Writing a pipeline in Jaql

A pipeline flows from a source through operators to a sink. Example: find users in zip 94114.

Input:
  [ { id: 12, name: "Joe Smith", bday: date("1971-03-07"), zip: 94114 },
    { id: 17, name: "Ann Jones", bday: date("1973-02-04"), zip: 94110 },
    { id: 19, name: "Alicia Fox", bday: date("1975-04-20"), zip: 94114 } ]

Output:
  [ { id: 12, name: "Joe Smith" },
    { id: 19, name: "Alicia Fox" } ]
Jaql Basics: Writing a pipeline in Jaql (read, filter, transform, write)

Find users in zip 94114:

Query:
  read(hdfs("users"))
    -> filter $.zip == 94114
    -> transform { $.id, $.name }
    -> write(hdfs("inzip"));

Data:
  [ { id: 12, name: "Joe Smith" },
    { id: 19, name: "Alicia Fox" } ]

Jaql supports the common data operators: filter, transform, join, group, sort, union (see http://code.google.com/p/jaql/).
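To make the semantics of the filter and transform steps concrete, here is the same pipeline written as plain Python over an in-memory copy of the records (the HDFS read/write and the date fields are omitted).

```python
users = [
    {"id": 12, "name": "Joe Smith", "zip": 94114},
    {"id": 17, "name": "Ann Jones", "zip": 94110},
    {"id": 19, "name": "Alicia Fox", "zip": 94114},
]

# filter $.zip == 94114  ->  transform { $.id, $.name }
inzip = [{"id": u["id"], "name": u["name"]} for u in users if u["zip"] == 94114]
print(inzip)
```

Jaql evaluates the same dataflow, but compiles it to MapReduce jobs so it scales across a cluster instead of a single process.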
Jaql New Features

- Java/Python API: support for Jaql being called from Java and Python programs
- Exception handling / logging: timeout, fence
- Native MapReduce jobs: allow existing MapReduce programs to be evaluated from Jaql
- External call: allow existing programs (legacy programs or additional tools) to be called from Jaql
- Parallel operators: union, tee, sort, etc.
- R/SPSS integration:
  - Run R & SPSS in parallel over partitions of Hadoop data
  - Launch Jaql jobs from R & SPSS
  - Load Hadoop data into R & SPSS
- More exploratory features...
SPSS Script with Embedded Jaql

  * Jaql script to load one month of data for one station.
  BEGIN PROGRAM JAQL.
  fd = del("co2.dat",   // data in a CSV file in HDFS
           { schema: schema { station: string, year: long, month: long,
                              day: long, co2: double? } });
  read(fd)
    -> filter $.station == 'Hongkong'
    -> filter $.month == 7
    -> filter not isnull($.co2)
    -> spssdataset();   // Jaql function to set the current SPSS dataset
  END PROGRAM.

  * Compute descriptive statistics (quartiles, min, max) on the station-month of data.
  EXAMINE VARIABLES=co2
    /PERCENTILES
    /NOTOTAL.

This produces the corresponding SPSS dataset, pivot tables, and chart.
Jaql Script with Embedded SPSS Statistics

  boxstats = fn(nums) (
    s = spssproc(proc = "EXAMINE VARIABLES= VAR1 /PERCENTILES /NOTOTAL.",
                 command = "Explore",
                 subtypes = ["Percentiles", "Descriptives"],
                 data = nums),
    { min: s.descriptives[7].statistic,   // extracted from pivot tables in the SPSS output log
      q25: s.percentiles[1].@25,
      q50: s.percentiles[1].@50,
      q75: s.percentiles[1].@75,
      max: s.descriptives[8].statistic }
  );

  read(fd)
    -> filter $.station == 'Hongkong'
    -> filter $.month == 7
    -> filter not isnull($.co2)
    -> transform $.co2
    -> boxstats();   // pass CO2 values to SPSS

Result:
  { min: 334.97, q25: 337.07, q50: 338.07, q75: 341.17, max: 349.57 }
BigSheets

http://www-01.ibm.com/software/ebusiness/jstart/bigsheets/index.html
BigSheets Background

What is it?
- A web application, similar to a traditional spreadsheet, used by the domain expert for performing ad-hoc analysis at web scale on unstructured and structured content
- Formulas, functions, and custom functions
- Puts MapReduce & Hadoop to work for the line-of-business user

How does it work?
- Gather content either statically (e.g. crawl) or dynamically through connectors
- Extract local or document-level information (e.g. a congressperson's name), cleanse, normalize
- Explore, analyze, annotate, and navigate content; filter on existing and new relationships; generate results and visualize them
- Iterate at any and all steps
- Employs a browser-based visual front end with a spreadsheet metaphor to create worksheets for exploring/visualizing the big data
BigSheets Logical view

- User interface & front-end server (JSP container: Jetty + JDBC): create, monitor, import/export, extend, visualize
- REST API for the customer's choice of analytic service/engine; REST API for choice of visualization; export content as feeds, JSON, CSV, ...
- MetaTracker server and job controller driving standalone BigSheets + Hadoop
- Underlying components: MapReduce (Hadoop), distributed filesystem (HDFS/GPFS), Jaql/Pig/Hive (scripting MapReduce), Nutch (web crawler), other tools (LanguageWare, ICM, etc.)
- The stack mixes Apache projects, IBM analytic products, IBM Hadoop common components, and IBM products enabling Apache projects
BigSheets dynamic view

The front-end server (web app) and the BigSheets job server interact with the Hadoop & HDFS cluster over four channels:

1. Main communication to the Hadoop cluster: job & collection lifecycle, job monitoring
2. Reads data for collections directly from the DFS (FileSystem)
3. Job controller: job and collection management
   - A collection may consist of many MapReduce jobs; the job server tracks each job and monitors progress (upon request)
   - Executes the entire job chain
   - Starts, stops, and monitors running jobs
   - Manages collection versions and job versions
4. Completion channel: when a job completes, Hadoop informs the job server so that the next step can be executed or the job marked complete