Distributed Computing and Big Data: Hadoop and MapReduce


Bill Keenan, Director
Terry Heinze, Architect
Thomson Reuters Research & Development

Agenda
- R&D Overview
- Hadoop and MapReduce Overview
- Use Case: Clustering Legal Documents

Thomson Reuters
- Leading source of intelligent information for the world's businesses and professionals
- 55,000+ employees across more than 100 countries
- Financial, Legal, Tax and Accounting, Healthcare, Science and Media markets
- Powered by the world's most trusted news organization (Reuters)

Overview of Corporate R&D
- 40+ computer scientists
  - Research scientists, Ph.D. or equivalent
  - Software engineers, architects, project managers
- Highly focused areas of expertise
  - Information retrieval, text categorization, financial research
  - Financial analysis
  - Text & data mining, machine learning
  - Web service development, Hadoop

Our International Roots

Role of Corporate R&D
- Anticipate
- Research
- Partner
- Deliver

Hadoop and MapReduce

Big Data and Distributed Computing
- Big Data at Thomson Reuters
  - More than 10 petabytes in Eagan alone
  - Major data centers around the globe: financial markets, tick history, healthcare, public records, legal documents
- Distributed Computing
  - Multiple architectures and use cases
  - Focus today: using multiple servers, each working on part of the job, each doing the same task
- Key Challenges
  - Work distribution and orchestration
  - Error recovery
  - Scalability and management

Hadoop & MapReduce
- Hadoop: a software framework that supports distributed computing using MapReduce
  - Distributed, redundant file system (HDFS)
  - Job distribution, balancing, recovery, scheduler, etc.
- MapReduce: a programming paradigm that is composed of two functions (~ relations)
  - Map
  - Reduce
  - Both are quite similar to their functional programming cousins
- Many add-ons

Hadoop Clusters
- NameNode: stores location of all data blocks
- Job Tracker: work manager
- Task Tracker: manages tasks on one Data Node
- Client accesses data on HDFS, sends jobs to Job Tracker
[Diagram: Client, NameNode, Secondary NameNode, Job Tracker, and a Task Tracker on each of Data Nodes 1 to N]
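The two functions named on the slide can be illustrated with a minimal, single-process sketch. This is not Hadoop code: `map_words`, `reduce_counts`, and `run_mapreduce` are hypothetical names chosen for illustration, and the grouping step stands in for the shuffle that Hadoop performs between the map and reduce phases.

```python
from collections import defaultdict

def map_words(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1

def reduce_counts(word, counts):
    """Reduce: combine all values that share a key into one output pair."""
    return word, sum(counts)

def run_mapreduce(inputs, map_fn, reduce_fn):
    """Run map, group by key (the shuffle), then reduce, all in one process."""
    groups = defaultdict(list)
    for record in inputs:
        for key, value in map_fn(record):
            groups[key].append(value)
    return dict(reduce_fn(key, values) for key, values in groups.items())

docs = ["the court ruled", "the court adjourned"]
word_counts = run_mapreduce(docs, map_words, reduce_counts)
print(word_counts)
```

In a real cluster the map calls run on many task nodes and the grouped values arrive at reducers over the network; the shape of the two functions stays the same.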

HDFS Key Concepts
- Google File System
- Small # of large files
- Streaming batch processes
- Redundant, rack aware
- Failure resistant
- Write-once (usually), read many
- Single point of failure
- Incomplete security
- Not only for MapReduce

Map/Reduce Key Concepts
- <key,value>
- Mappers: input -> intermediate kv pairs
- Reducers: intermediate -> output kv pairs
- InputSplits
- Progress reporting
- Shuffling, partitioning
- Scheduling
- Task distribution
- Topology aware
- Distributed cache
- Recovery
- Compression
- Bad records
- Speculative execution
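As a rough illustration of the "shuffling, partitioning" item, the sketch below routes intermediate kv pairs to reducers by hashing the key. Hadoop's default partitioner also hashes the key modulo the number of reduce tasks; the CRC32 hash and the names `partition` and `shuffle` here are assumptions made for the example, not Hadoop's implementation.

```python
import zlib

def partition(key, num_reducers):
    """Pick the reducer for a key; equal keys always land on the same reducer."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers

def shuffle(pairs, num_reducers):
    """Route intermediate (key, value) pairs into one bucket per reducer."""
    buckets = [[] for _ in range(num_reducers)]
    for key, value in pairs:
        buckets[partition(key, num_reducers)].append((key, value))
    return buckets

intermediate = [("court", 1), ("ruling", 1), ("court", 1), ("appeal", 1)]
buckets = shuffle(intermediate, num_reducers=3)
for reducer_id, bucket in enumerate(buckets):
    print(reducer_id, bucket)
```

Because the partition depends only on the key, every value for a given key reaches the same reducer, which is what lets a reducer see the complete group.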

Use Cases
- Query log processing
- Query mining
- Text mining
- XML transformations
- Classification
- Document clustering
- Entity extraction

Case Study: Large Scale ETL
- Big Data: Public Records
- Warehouse loading: long process, expensive infrastructure, complex management
- Combine data from multiple repositories (extract, transform, load)
- Idea: use Hadoop's natural ETL capabilities
- Use existing shared infrastructure

Why Hadoop
- Big data: billions of documents
- Needed to process each document, combine information
- Expected multiple passes, multiple types of transformations
- Minimal workflow coding

Use Case: Language Modeling
- Build language models from clusters of legal documents
- Large initial corpus: 7,000,000 XML documents
- Corpus grows over time

Process: Prepare the Input
- Remove duplicates from the corpus
- Remove stop words (common English, high frequency terms)
- Stem
- Convert to binary (sparse TF vector)
- Create centroids for seed clusters

Process (diagram)
- Input: list of document IDs belonging to the seed clusters; encoded seed clusters; encoded documents
- Output: cluster centroids; C-values for each document
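The preparation steps above can be sketched for a single document as follows. Everything specific here is an assumption for illustration: the stop word list is a tiny sample, `crude_stem` is a stand-in for whatever stemmer the authors used, and the sparse vector is a plain dictionary rather than the binary encoding the slide mentions.

```python
import re
from collections import Counter

STOP_WORDS = {"the", "a", "of", "and", "to", "in"}  # assumed, tiny illustrative list

def crude_stem(token):
    """Strip a few common suffixes; a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def tokenize(text):
    """Lowercase, split on non-letters, drop stop words, then stem."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]

def to_sparse_tf(tokens, vocabulary):
    """Sparse term-frequency vector: {term_id: count}, zeros are omitted."""
    counts = Counter(tokens)
    return {vocabulary[t]: n for t, n in counts.items() if t in vocabulary}

tokens = tokenize("The court ruled and the courts adjourned")
vocabulary = {term: i for i, term in enumerate(sorted(set(tokens)))}
vector = to_sparse_tf(tokens, vocabulary)
print(tokens, vector)
```

Storing only the non-zero terms keeps each document vector small, which matters when millions of vectors are written back to the file system between steps.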

Process: Clustering
- Iterate until number of clusters equals goal
- Multiply matrix of document vectors and matrix of cluster centroids
- Assign document to best cluster
- Merge clusters and re-compute centroids

Process (diagram)
- Input: seed clusters, C-vectors
- Generate cluster vectors
- Run algorithm
- Merge clusters (merge list, W-values)
- Repeat the loop until all the clusters are merged
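One iteration of the assignment step can be sketched with sparse dot products, which is what the matrix multiplication amounts to row by row. The slides do not give the scoring function or the merge criterion, so this sketch assumes a plain dot product as the score and shows only how a centroid is recomputed once clusters are merged; `dot`, `assign`, and `centroid` are hypothetical names.

```python
def dot(u, v):
    """Dot product of two sparse vectors stored as {term_id: weight}."""
    if len(u) > len(v):
        u, v = v, u
    return sum(w * v.get(t, 0.0) for t, w in u.items())

def assign(documents, centroids):
    """Score every document against every centroid and keep the best cluster."""
    return [max(range(len(centroids)), key=lambda c: dot(doc, centroids[c]))
            for doc in documents]

def centroid(vectors):
    """Mean of a list of sparse vectors."""
    total = {}
    for vec in vectors:
        for t, w in vec.items():
            total[t] = total.get(t, 0.0) + w
    return {t: w / len(vectors) for t, w in total.items()}

documents = [{0: 1.0, 1: 2.0}, {2: 3.0}, {0: 2.0, 1: 1.0}]
centroids = [{0: 1.0, 1: 1.0}, {2: 1.0}]
labels = assign(documents, centroids)
merged = centroid([documents[0], documents[2]])
print(labels, merged)
```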

Process: Validate and Analyze Clusters
- Create classifier from clusters
- Assign all non-clustered documents to clusters using the classifier
- Build language model for each cluster

HDFS Sample Flow
[Diagram: data flows from HDFS to six task nodes, each running a mapper, and back to HDFS]

Prepare Input using Hadoop
- Fits the Map/Reduce paradigm
- Each document is atomic: documents can be equally distributed within the HDFS
- Each mapper removes stop words, tokenizes, and stems
- Mappers emit token counts, hashes, and tokenized documents
- Reducers build Document Frequency dictionary (basically, the word count example)
- Reducers reduce the hashes to a single document (de-duplication)
- Additional Map/Reduce converts tokenized documents to sparse vectors using the DF dictionary
- Additional Map/Reduce maps document vectors and seed cluster ids, and reduce generates centroids

HDFS Sample Flow
[Diagram: documents are distributed from HDFS to six task nodes; mappers remove stop words, stem, and emit filtered documents and word counts; the reduce step emits the dictionary and filtered documents back to HDFS]
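The document-frequency and de-duplication steps can be sketched as one map function and one reduce function. This is an assumption-laden illustration: the key layout (`"df"` and `"hash"` tags), the SHA-1 digest, and the rule of keeping the smallest document id per hash are all choices made for the example, and the in-memory grouping replaces Hadoop's shuffle.

```python
import hashlib
from collections import defaultdict

def map_document(doc_id, tokens):
    """Emit one DF pair per distinct token, plus a content hash for dedup."""
    for token in set(tokens):
        yield ("df", token), 1
    digest = hashlib.sha1(" ".join(tokens).encode("utf-8")).hexdigest()
    yield ("hash", digest), doc_id

def reduce_pair(key, values):
    """DF keys sum their counts; hash keys keep a single representative document."""
    kind, _ = key
    return sum(values) if kind == "df" else min(values)

corpus = {"d1": ["court", "rul"], "d2": ["court", "rul"], "d3": ["court", "appeal"]}
groups = defaultdict(list)
for doc_id, tokens in corpus.items():
    for key, value in map_document(doc_id, tokens):
        groups[key].append(value)
reduced = {key: reduce_pair(key, values) for key, values in groups.items()}

doc_frequency = {k[1]: v for k, v in reduced.items() if k[0] == "df"}
unique_docs = sorted(v for k, v in reduced.items() if k[0] == "hash")
print(doc_frequency, unique_docs)
```

Note that in this single pass the document frequencies still count the duplicate; whether the original pipeline counted before or after de-duplication is not stated in the slides.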

HDFS Sample Flow
[Diagram: filtered documents are distributed from HDFS to six task nodes; mappers compute vectors and emit them by seed cluster id; the reduce step computes cluster centroids and emits vectors and seed cluster centroids back to HDFS]

Clustering using Hadoop
- Map/Reduce paradigm
- Each document vector is atomic: documents can be equally distributed within the HDFS
- Mapper initialization required loading large matrix of cluster centroids
- Large memory utilization to hold matrix multiplications
- Decompose matrices into smaller chunks and run multiple map/reduce steps to obtain final result matrix (new clusters)
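The idea of decomposing the matrices into smaller chunks can be sketched as a block matrix product: the map side computes one partial product per pair of compatible blocks, and the reduce side sums the partials that share an output block. The block size, the function names, and the in-memory grouping are assumptions for illustration; the slides do not describe how the authors actually chunked their matrices.

```python
from collections import defaultdict

def split_blocks(matrix, size):
    """Cut a matrix (list of rows) into blocks keyed by (block_row, block_col)."""
    blocks = {}
    for r in range(0, len(matrix), size):
        for c in range(0, len(matrix[0]), size):
            blocks[(r // size, c // size)] = [row[c:c + size] for row in matrix[r:r + size]]
    return blocks

def multiply(a, b):
    """Plain dense product of two small blocks."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def add(a, b):
    """Element-wise sum of two blocks of the same shape."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def block_multiply(a, b, size):
    """Map: one partial product per (i, k, j). Reduce: sum partials sharing (i, j)."""
    a_blocks, b_blocks = split_blocks(a, size), split_blocks(b, size)
    partials = defaultdict(list)
    for (i, k), a_blk in a_blocks.items():
        for (k2, j), b_blk in b_blocks.items():
            if k == k2:
                partials[(i, j)].append(multiply(a_blk, b_blk))
    result = {}
    for key, blocks in partials.items():
        total = blocks[0]
        for blk in blocks[1:]:
            total = add(total, blk)
        result[key] = total
    return result

docs = [[1, 0, 2, 0], [0, 3, 0, 1]]             # 2 documents x 4 terms
centroids_t = [[1, 0], [0, 1], [1, 1], [0, 2]]  # 4 terms x 2 clusters
scores = block_multiply(docs, centroids_t, size=2)
print(scores)
```

Each partial product needs only two blocks in memory at a time, at the cost of extra passes to sum the partials, which is the trade-off behind the multiple map/reduce steps mentioned above.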

Validate and Analyze Clusters using Hadoop
- Map/Reduce paradigm
- A document classifier based on the documents within the clusters was built (n.b. the classifier itself was trained using Hadoop)
- Un-clustered documents (still in the HDFS) are classified in a mapper and assigned a cluster id
- A reduction step then takes each set of original documents in a cluster and creates a language model for each cluster

HDFS Sample Flow
[Diagram: documents are distributed from HDFS to six task nodes; mappers extract n-grams and emit them by cluster id; the reduce step builds a language model for each cluster and writes it to HDFS]
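The n-gram flow can be sketched as a mapper that emits n-grams keyed by cluster id and a reducer that turns each cluster's n-grams into a model. The slides do not say what kind of language model was built, so this sketch assumes the simplest case, unsmoothed relative frequencies of bigrams; `map_ngrams` and `reduce_language_model` are hypothetical names.

```python
from collections import Counter, defaultdict

def map_ngrams(cluster_id, tokens, n=2):
    """Mapper: emit (cluster_id, n-gram) for every n-gram in one document."""
    for i in range(len(tokens) - n + 1):
        yield cluster_id, tuple(tokens[i:i + n])

def reduce_language_model(ngrams):
    """Reducer: turn one cluster's n-grams into relative frequencies."""
    counts = Counter(ngrams)
    total = sum(counts.values())
    return {gram: count / total for gram, count in counts.items()}

clustered_docs = [
    (7, ["motion", "to", "dismiss"]),
    (7, ["motion", "to", "compel"]),
    (9, ["breach", "of", "contract"]),
]
by_cluster = defaultdict(list)
for cluster_id, tokens in clustered_docs:
    for key, gram in map_ngrams(cluster_id, tokens):
        by_cluster[key].append(gram)
models = {cid: reduce_language_model(grams) for cid, grams in by_cluster.items()}
print(models[7])
```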

Using Hadoop: Other Experiments
- WestlawNext Log Processing
  - Billions of raw usage events are generated
  - Used Hadoop to map raw events to a user's individual session
  - Reducers created complex session objects
  - Session objects reducible to XML for XPath queries for mining user behavior
- Remote Logging
  - Provide a way to create and search centralized Hadoop job logs, by host, job, and task ids
  - Send the logs to a message queue
  - Browse the queue, or pull the logs from the queue and retain them in a db

Remote Logging: Browsing Client
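The log-processing pattern can be sketched as a mapper that keys each raw event by session and a reducer that assembles a session object. The event fields (`session_id`, `timestamp`, `action`) and the shape of the session object are hypothetical; the actual WestlawNext event schema is not given in the slides.

```python
from collections import defaultdict

def map_event(event):
    """Mapper: key each raw event by its session id."""
    return event["session_id"], event

def reduce_session(session_id, events):
    """Reducer: order one session's events by time and summarise them."""
    ordered = sorted(events, key=lambda e: e["timestamp"])
    return {
        "session_id": session_id,
        "start": ordered[0]["timestamp"],
        "end": ordered[-1]["timestamp"],
        "actions": [e["action"] for e in ordered],
    }

raw_events = [
    {"session_id": "s1", "timestamp": 30, "action": "view"},
    {"session_id": "s2", "timestamp": 12, "action": "search"},
    {"session_id": "s1", "timestamp": 10, "action": "search"},
]
grouped = defaultdict(list)
for event in raw_events:
    key, value = map_event(event)
    grouped[key].append(value)
sessions = {sid: reduce_session(sid, evts) for sid, evts in grouped.items()}
print(sessions["s1"])
```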

Lessons Learned: State of Hadoop
- Weak security model, changes in the works
- Cluster configuration, management and optimization still sometimes difficult
- Users can overload a cluster. Need to balance optimization and safety.
- Learning curve moderate
  - Quick to run first naïve MR programs
  - Skill/experience required for advanced or optimized processes

Lessons Learned
- Loading HDFS is time consuming: wrote multithreaded loader to reduce I/O-bound load time
- Multiple step process needed to be re-run using different test corpuses: wrote parameterized Perl script to submit jobs to the Hadoop cluster
- Test Hadoop on a single node cluster first:
  - Install Hadoop locally
  - Local mode within Eclipse (Windows, Mac)
  - Pseudo-distributed mode (Mac, Cygwin, VMware) using Hadoop plugin (Karmasphere)

Lessons Learned
- Tracking intermediate results:
  - Detect bad or inconsistent results after each iteration
  - Record messages to Hadoop node logs
  - Create remote logger (event detector) to broadcast status
- Regression tests:
  - Small, sample corpus run through local Hadoop and distributed Hadoop
  - Intermediate and final results compared against reference results created by baseline Matlab application

Lessons Learned: Performance Evaluations
- Detect bad or inconsistent results after each iteration
- Don't accept long duration tasks as normal
  - Regression test on Hadoop took 4 hours while same Matlab test took seconds (because of the need to spread the matrix operations over several map/reduce steps)
  - Re-evaluated core algorithm and found ways to eliminate and compress steps related to cluster merging
  - Direct conversion of mathematics, as developed, to Java structures and map/reduce was not efficient
  - New clustering process no longer uses Hadoop: 6 hours on single machine vs. 6 days on 20 node Hadoop cluster
  - As size of corpus grows, we will need to migrate new cluster algorithm back to Hadoop
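The regression-test idea, comparing a run's results against reference values from a baseline, can be sketched as a tolerance-based comparison. The function name, the per-document score dictionaries, and the tolerance values are assumptions for illustration; the slides do not describe the format of the Matlab reference results.

```python
import math

def compare_results(actual, reference, rel_tol=1e-6):
    """Return the keys whose values are missing or differ beyond the tolerance."""
    mismatches = []
    for key in sorted(set(actual) | set(reference)):
        if key not in actual or key not in reference:
            mismatches.append(key)
        elif not math.isclose(actual[key], reference[key], rel_tol=rel_tol, abs_tol=1e-12):
            mismatches.append(key)
    return mismatches

reference = {"doc1": 0.25, "doc2": 0.75, "doc3": 0.5}
hadoop_run = {"doc1": 0.25000000001, "doc2": 0.70, "doc4": 0.5}
bad_keys = compare_results(hadoop_run, reference)
print(bad_keys)
```

A tolerance is used rather than exact equality because floating-point sums computed in a different order, as happens when work is spread over several map/reduce steps, rarely match bit for bit.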

Lessons Learned: Performance Evaluations
- Leverage combiners and mapper statics to reduce the amount of data during shuffle

Lessons Learned
- Releases are still volatile
  - Core API changed significantly from release .19 to .20
  - Functionality related to distributed cache changed (application files loaded to each node at runtime)
- Eclipse Hadoop plugins
  - Source code only with release .20 and only works in older Eclipse versions on Windows
  - Karmasphere plugin (Eclipse and NetBeans) more mature but still more of a concept than productive
  - Just use Eclipse for testing Hadoop code in local mode
  - Develop alternatives for handling distributed cache
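The effect of a combiner on shuffle volume can be shown with a small word-count sketch. It runs in a single process and only counts records; `map_tokens`, `combine`, and `reduce_all` are hypothetical names, and each string in `splits` stands in for the input handled by one mapper.

```python
from collections import Counter

def map_tokens(document):
    """Mapper without a combiner: one (token, 1) record per occurrence."""
    return [(token, 1) for token in document.split()]

def combine(pairs):
    """Combiner: pre-sum counts on the map side before anything is shuffled."""
    local = Counter()
    for token, count in pairs:
        local[token] += count
    return list(local.items())

def reduce_all(shuffled):
    """Reducer: sum the (possibly pre-combined) counts per token."""
    totals = Counter()
    for token, count in shuffled:
        totals[token] += count
    return dict(totals)

splits = ["a b a a", "b b a"]
plain = [pair for doc in splits for pair in map_tokens(doc)]
combined = [pair for doc in splits for pair in combine(map_tokens(doc))]
print(len(plain), len(combined))  # records shuffled without and with a combiner
```

The final counts are identical either way; only the number of records crossing the network changes. This works because addition is associative and commutative, which is the usual condition for a combiner to be safe.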

Reading
- Hadoop: The Definitive Guide, Tom White, O'Reilly
- Data-Intensive Text Processing with MapReduce, Jimmy Lin and Chris Dyer, University of Maryland, http://www.umiacs.umd.edu/~jimmylin/book.html

Questions?

Thank you.