Distributed Computing and Big Data: Hadoop and MapReduce. Bill Keenan, Director; Terry Heinze, Architect. Thomson Reuters Research & Development
Agenda: R&D Overview; Hadoop and MapReduce Overview; Use Case: Clustering Legal Documents
Thomson Reuters. Leading source of intelligent information for the world's businesses and professionals. 55,000+ employees across more than 100 countries. Financial, Legal, Tax and Accounting, Healthcare, Science and Media markets. Powered by the world's most trusted news organization (Reuters).
Overview of Corporate R&D. 40+ computer scientists: research scientists (Ph.D. or equivalent), software engineers, architects, project managers. Highly focused areas of expertise: information retrieval, text categorization, financial research, financial analysis, text and data mining, machine learning, web service development, Hadoop.
Our International Roots
Role of Corporate R&D: Anticipate, Research, Partner, Deliver
Hadoop and MapReduce
Big Data and Distributed Computing. Big Data at Thomson Reuters: more than 10 petabytes in Eagan alone; major data centers around the globe covering financial markets, tick history, healthcare, public records, and legal documents. Distributed computing: multiple architectures and use cases; the focus today is using multiple servers, each working on part of the job and each doing the same task. Key challenges: work distribution and orchestration, error recovery, scalability and management.
Hadoop & MapReduce. Hadoop: a software framework that supports distributed computing using MapReduce; a distributed, redundant file system (HDFS); job distribution, balancing, recovery, scheduling, etc. MapReduce: a programming paradigm composed of two functions (~ relations), Map and Reduce, both quite similar to their functional programming cousins. Many add-ons.
Hadoop Clusters. NameNode: stores the location of all data blocks. Job Tracker: the work manager. Task Tracker: manages tasks on one Data Node. The client accesses data on HDFS and sends jobs to the Job Tracker. (Diagram: a client, the NameNode, a Secondary NameNode, and the Job Tracker coordinating Task Tracker / Data Node pairs 1 through N.)
HDFS Key Concepts. Based on the Google File System. A small number of large files. Streaming batch processes. Redundant, rack aware, failure resistant. Write-once (usually), read many. Single point of failure (the NameNode). Incomplete security. Not only for MapReduce.
Map/Reduce Key Concepts. <key,value> pairs. Mappers: input -> intermediate key/value pairs. Reducers: intermediate -> output key/value pairs. InputSplits. Progress reporting. Shuffling and partitioning. Scheduling and distribution, topology aware. Distributed cache. Recovery. Compression. Bad records. Speculative execution.
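To make the <key,value> flow concrete, here is a minimal sketch of the canonical word-count job in the 0.20-era Java API (class names are illustrative, not from the deck): mappers emit intermediate <token, 1> pairs, the framework shuffles and partitions them by key, and reducers sum the counts.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
  public static class TokenMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);                 // intermediate <token, 1> pair
      }
    }
  }

  public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) sum += v.get();
      context.write(key, new IntWritable(sum));   // output <token, count>
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenMapper.class);
    job.setCombinerClass(SumReducer.class);       // combiner shrinks shuffle traffic
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Reusing the reducer as a combiner is the same idea the "Leverage combiners" lesson returns to later: partial sums are computed on the map side so less data crosses the shuffle.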
Use Cases: query log processing, query mining, text mining, XML transformations, classification, document clustering, entity extraction.
Case Study: Large-Scale ETL. Big Data: public records. Warehouse loading is a long process with expensive infrastructure and complex management; it combines data from multiple repositories (extract, transform, load). Idea: use Hadoop's natural ETL capabilities on existing shared infrastructure.
Why Hadoop? Big data: billions of documents. Needed to process each document and combine information. Expected multiple passes and multiple types of transformations. Minimal workflow coding.
Use Case: Language Modeling. Build language models from clusters of legal documents. Large initial corpus: 7,000,000 XML documents. The corpus grows over time.
Process: Prepare the Input. Remove duplicates from the corpus. Remove stop words (common English, high-frequency terms). Stem. Convert to binary (sparse TF vector). Create centroids for the seed clusters.
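A minimal, non-Hadoop sketch of the per-document preparation, assuming a hypothetical stop-word list and a placeholder stemmer (the deck does not say which stemmer or list was used): tokenize, filter, stem, and count terms into a sparse TF map. The real pipeline then serialized such vectors into a binary sparse format.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class DocumentPrep {
  // Illustrative stop-word list; the real list covered common, high-frequency English terms.
  private static final Set<String> STOP_WORDS =
      new HashSet<String>(Arrays.asList("the", "of", "and", "a", "to", "in"));

  /** Placeholder for the stemmer actually used; a real job would plug in a proper stemmer. */
  private static String stem(String token) {
    return token.endsWith("s") ? token.substring(0, token.length() - 1) : token;
  }

  /** Tokenize, drop stop words, stem, and build a sparse term-frequency vector. */
  public static Map<String, Integer> toSparseTfVector(String text) {
    Map<String, Integer> tf = new HashMap<String, Integer>();
    for (String token : text.toLowerCase().split("\\W+")) {
      if (token.isEmpty() || STOP_WORDS.contains(token)) continue;
      String term = stem(token);
      Integer count = tf.get(term);
      tf.put(term, count == null ? 1 : count + 1);
    }
    return tf;
  }
}
```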
Process (diagram). Input: the list of document IDs belonging to each seed cluster, the encoded seed clusters, and the encoded documents. Output: cluster centroids and C-values for each document.
Process: Clustering. Iterate until the number of clusters equals the goal: multiply the matrix of document vectors by the matrix of cluster centroids, assign each document to its best cluster, then merge clusters and re-compute centroids.
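The per-document step of each iteration amounts to a sparse row-times-matrix product followed by an argmax. A small sketch, assuming term-indexed sparse document vectors and dense centroid rows (names and the dot-product scoring are illustrative, not the deck's exact math):

```java
import java.util.Map;

public class ClusterAssigner {
  /** Returns the index of the centroid with the largest dot product against the document. */
  public static int bestCluster(Map<Integer, Double> docVector, double[][] centroids) {
    int best = -1;
    double bestScore = Double.NEGATIVE_INFINITY;
    for (int c = 0; c < centroids.length; c++) {
      double score = 0.0;
      for (Map.Entry<Integer, Double> term : docVector.entrySet()) {
        score += term.getValue() * centroids[c][term.getKey()];  // sparse dot product
      }
      if (score > bestScore) {
        bestScore = score;
        best = c;
      }
    }
    return best;
  }
}
```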
Process (diagram). Input: seed clusters and C-vectors. Generate cluster vectors, run the algorithm, then merge clusters using the merge list and W-values; repeat the loop until all the clusters are merged.
Process: Validate and Analyze Clusters. Create a classifier from the clusters. Assign all non-clustered documents to clusters using the classifier. Build a language model for each cluster.
HDFS Sample Flow (diagram): data on HDFS is spread across nodes; a mapper runs on each node and feeds a smaller number of reducers, whose output is written back to HDFS.
Prepare Input Using Hadoop. Fits the Map/Reduce paradigm: each document is atomic, so documents can be distributed evenly within HDFS. Each mapper tokenizes, removes stop words, and stems; mappers emit token counts, hashes, and tokenized documents. Reducers build the document-frequency dictionary (basically the word count example) and reduce each hash to a single document (de-duplication). An additional Map/Reduce converts tokenized documents to sparse vectors using the DF dictionary. Another Map/Reduce maps document vectors to seed cluster IDs, and its reducer generates the centroids.
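A hedged sketch of the de-duplication step described above, assuming <docId, tokenizedDocument> pairs arrive from a SequenceFile: the mapper keys each document by an MD5 hash of its content, and the reducer keeps one document per hash. The hash choice, input format, and class names are assumptions, not the deck's implementation.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class DeDup {
  public static class HashMapper extends Mapper<Text, Text, Text, Text> {
    // Input: <docId, tokenizedDocument>; output: <contentHash, tokenizedDocument>.
    @Override
    protected void map(Text docId, Text doc, Context context)
        throws IOException, InterruptedException {
      try {
        byte[] digest = MessageDigest.getInstance("MD5")
            .digest(doc.toString().getBytes(Charset.forName("UTF-8")));
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) hex.append(String.format("%02x", b));
        context.write(new Text(hex.toString()), doc);
      } catch (NoSuchAlgorithmException e) {
        throw new IOException(e);
      }
    }
  }

  public static class FirstOnlyReducer extends Reducer<Text, Text, Text, Text> {
    // Identical documents share a hash; keep just the first copy per hash.
    @Override
    protected void reduce(Text hash, Iterable<Text> docs, Context context)
        throws IOException, InterruptedException {
      for (Text doc : docs) {
        context.write(hash, doc);
        break;
      }
    }
  }
}
```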
HDFS Sample Flow (diagram): documents are distributed across nodes; each mapper removes stop words and stems, emitting filtered documents and word counts; reducers emit the dictionary and the filtered documents back to HDFS.
HDFS Sample Flow (diagram): the filtered documents are distributed across nodes; mappers compute vectors and emit them by seed cluster ID; reducers compute the cluster centroids and emit the vectors and seed cluster centroids to HDFS.
Clustering Using Hadoop. Map/Reduce paradigm: each document vector is atomic, so documents can be distributed evenly within HDFS. Mapper initialization required loading a large matrix of cluster centroids, and holding the matrix multiplications required a lot of memory. Solution: decompose the matrices into smaller chunks and run multiple map/reduce steps to obtain the final result matrix (the new clusters).
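One way the chunked decomposition can come together is a final merge reduce. A sketch under the assumption that each chunk-level pass scored documents against its slice of the centroid matrix (as in the earlier assignment sketch) and emitted <docId, "clusterId:score"> for its chunk-local winner; the record format and class name are illustrative, not the deck's implementation.

```java
import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class MergeChunkWinners extends Reducer<Text, Text, Text, Text> {
  // Each value is one chunk's best candidate for this document, encoded "clusterId:score".
  @Override
  protected void reduce(Text docId, Iterable<Text> chunkWinners, Context context)
      throws IOException, InterruptedException {
    int bestCluster = -1;
    double bestScore = Double.NEGATIVE_INFINITY;
    for (Text winner : chunkWinners) {
      String[] parts = winner.toString().split(":");
      double score = Double.parseDouble(parts[1]);
      if (score > bestScore) {
        bestScore = score;
        bestCluster = Integer.parseInt(parts[0]);
      }
    }
    context.write(docId, new Text(Integer.toString(bestCluster)));  // final assignment
  }
}
```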
Validate and Analyze Clusters Using Hadoop. Map/Reduce paradigm: a document classifier was built from the documents within the clusters (n.b. the classifier itself was trained using Hadoop). Un-clustered documents (still in HDFS) are classified in a mapper and assigned a cluster ID. A reduction step then takes the set of original documents in each cluster and creates a language model for that cluster.
HDFS Sample Flow (diagram): documents are distributed across nodes; mappers extract n-grams and emit them by cluster ID; reducers build a language model for each cluster and write it to HDFS.
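A sketch of that language-model pass, assuming the classification step already produced <clusterId, documentText> pairs: the mapper emits bigrams keyed by cluster, and the reducer accumulates the counts a model can be estimated from. The bigram choice and unsmoothed count table are assumptions; the deck does not specify the model type.

```java
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class ClusterLanguageModel {
  public static class NgramMapper extends Mapper<Text, Text, Text, LongWritable> {
    private static final LongWritable ONE = new LongWritable(1);

    // Input: <clusterId, documentText>; output: <"clusterId\tw1 w2", 1> for each bigram.
    @Override
    protected void map(Text clusterId, Text document, Context context)
        throws IOException, InterruptedException {
      String[] tokens = document.toString().toLowerCase().split("\\s+");
      for (int i = 0; i + 1 < tokens.length; i++) {
        context.write(new Text(clusterId + "\t" + tokens[i] + " " + tokens[i + 1]), ONE);
      }
    }
  }

  public static class CountReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
    // Sums bigram counts per cluster; relative frequencies give a simple bigram model.
    @Override
    protected void reduce(Text clusterAndBigram, Iterable<LongWritable> counts, Context context)
        throws IOException, InterruptedException {
      long sum = 0;
      for (LongWritable c : counts) sum += c.get();
      context.write(clusterAndBigram, new LongWritable(sum));
    }
  }
}
```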
Using Hadoop: Other Experiments. WestlawNext log processing: billions of raw usage events are generated; Hadoop was used to map raw events to a user's individual session, and reducers created complex session objects. The session objects are reducible to XML, enabling XPath queries for mining user behavior. Remote logging: provide a way to create and search centralized Hadoop job logs by host, job, and task IDs; send the logs to a message queue, then browse the queue or pull the logs from the queue and retain them in a database.
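A hedged sketch of the sessionization idea: key each raw event by its user, then let the reducer order that user's events by timestamp and fold them into one session record (here rendered as simple XML). The tab-separated log format, field order, and sortable timestamps are assumptions for illustration.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class Sessionizer {
  public static class EventMapper extends Mapper<LongWritable, Text, Text, Text> {
    // Assumed log format: userId \t timestamp \t eventType
    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      String[] fields = line.toString().split("\t");
      if (fields.length < 3) return;                      // skip malformed events
      context.write(new Text(fields[0]), new Text(fields[1] + "\t" + fields[2]));
    }
  }

  public static class SessionReducer extends Reducer<Text, Text, Text, Text> {
    // Collects a user's events, orders them by timestamp, and emits a simple XML session.
    @Override
    protected void reduce(Text userId, Iterable<Text> events, Context context)
        throws IOException, InterruptedException {
      List<String> ordered = new ArrayList<String>();
      for (Text e : events) ordered.add(e.toString());
      Collections.sort(ordered);                          // timestamp is the leading field
      StringBuilder xml = new StringBuilder("<session user=\"" + userId + "\">");
      for (String e : ordered) {
        String[] parts = e.split("\t");
        xml.append("<event time=\"").append(parts[0]).append("\">")
           .append(parts[1]).append("</event>");
      }
      xml.append("</session>");
      context.write(userId, new Text(xml.toString()));
    }
  }
}
```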
Remote Logging: Browsing Client
Lessons Learned: State of Hadoop. Weak security model; changes are in the works. Cluster configuration, management, and optimization are still sometimes difficult; users can overload a cluster, so optimization and safety need to be balanced. The learning curve is moderate: quick to run first naïve MR programs, but skill and experience are required for advanced or optimized processes.
Lessons Learned. Loading HDFS is time-consuming: wrote a multithreaded loader to reduce the IO-bound load time. The multi-step process needed to be re-run with different test corpora: wrote a parameterized Perl script to submit jobs to the Hadoop cluster. Test Hadoop on a single-node cluster first: install Hadoop locally; local mode within Eclipse (Windows, Mac); pseudo-distributed mode (Mac, Cygwin, VMware) using a Hadoop plugin (Karmasphere).
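For the single-node testing mentioned above, a job can be forced into local mode from code. A sketch using the 0.20-era configuration keys ("fs.default.name", "mapred.job.tracker"); later releases rename these, so treat the key names as assumptions to verify against the installed version.

```java
import org.apache.hadoop.conf.Configuration;

public class LocalModeConf {
  /** Returns a configuration that runs the job in-process against the local filesystem. */
  public static Configuration localConfiguration() {
    Configuration conf = new Configuration();
    conf.set("fs.default.name", "file:///");    // use the local filesystem instead of HDFS
    conf.set("mapred.job.tracker", "local");    // run map and reduce tasks in a single JVM
    return conf;
  }
}
```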
Lessons Learned. Tracking intermediate results: detect bad or inconsistent results after each iteration; record messages to the Hadoop node logs; create a remote logger (event detector) to broadcast status. Regression tests: a small sample corpus is run through local Hadoop and distributed Hadoop, and intermediate and final results are compared against reference results created by a baseline Matlab application.
Lessons Learned: Performance Evaluations. Detect bad or inconsistent results after each iteration. Don't accept long-duration tasks as normal: the regression test on Hadoop took 4 hours while the same Matlab test took seconds (because the matrix operations had to be spread over several map/reduce steps). Re-evaluated the core algorithm and found ways to eliminate and compress the steps related to cluster merging; a direct conversion of the mathematics, as developed, to Java structures and map/reduce was not efficient. The new clustering process no longer uses Hadoop: 6 hours on a single machine vs. 6 days on a 20-node Hadoop cluster. As the corpus grows, we will need to migrate the new clustering algorithm back to Hadoop.
Lessons Learned: Performance Evaluations. Leverage combiners and mapper statics to reduce the amount of data moved during the shuffle.
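"Mapper statics" points at in-mapper combining: aggregate counts in memory across a mapper's input records and emit them once in cleanup(), so far fewer records cross the shuffle than with one emit per occurrence (the Lin and Dyer book listed under Reading covers this pattern). A sketch, assuming the set of distinct tokens per mapper fits in memory:

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class InMapperCombiningMapper extends Mapper<Object, Text, Text, IntWritable> {
  // Per-mapper aggregation buffer; assumes the distinct-token set fits in memory.
  private final Map<String, Integer> counts = new HashMap<String, Integer>();

  @Override
  protected void map(Object key, Text value, Context context)
      throws IOException, InterruptedException {
    for (String token : value.toString().split("\\s+")) {
      if (token.isEmpty()) continue;
      Integer c = counts.get(token);
      counts.put(token, c == null ? 1 : c + 1);   // aggregate in memory, emit nothing yet
    }
  }

  @Override
  protected void cleanup(Context context) throws IOException, InterruptedException {
    // Emit one record per distinct token for this mapper, instead of one per occurrence.
    for (Map.Entry<String, Integer> e : counts.entrySet()) {
      context.write(new Text(e.getKey()), new IntWritable(e.getValue()));
    }
  }
}
```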
Lessons Learned: Releases Are Still Volatile. The core API changed significantly from release 0.19 to 0.20, and functionality related to the distributed cache changed (application files loaded onto each node at runtime). Eclipse Hadoop plugins: the plugin ships as source code only with release 0.20 and only works in older Eclipse versions on Windows; the Karmasphere plugin (Eclipse and NetBeans) is more mature but still more of a concept than productive. Just use Eclipse for testing Hadoop code in local mode, and develop alternatives for handling the distributed cache.
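The distributed cache mentioned above ships side files (for example a stop-word list or a centroid chunk) to every task node. A sketch against the 0.20-era API, where the driver registers an HDFS file and each mapper reads the local copy in setup(); the file path, one-word-per-line format, and class names are illustrative.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.net.URI;
import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class StopWordMapper extends Mapper<Object, Text, Text, Text> {
  private final Set<String> stopWords = new HashSet<String>();

  /** Driver side: register the side file so Hadoop copies it to every task node. */
  public static void addStopWords(Configuration conf) throws IOException {
    DistributedCache.addCacheFile(URI.create("/shared/stopwords.txt"), conf); // illustrative path
  }

  @Override
  protected void setup(Context context) throws IOException {
    Path[] cached = DistributedCache.getLocalCacheFiles(context.getConfiguration());
    if (cached == null || cached.length == 0) return;       // nothing registered
    BufferedReader reader = new BufferedReader(new FileReader(cached[0].toString()));
    try {
      String line;
      while ((line = reader.readLine()) != null) {
        stopWords.add(line.trim());                          // one stop word per line
      }
    } finally {
      reader.close();
    }
  }

  @Override
  protected void map(Object key, Text value, Context context)
      throws IOException, InterruptedException {
    StringBuilder filtered = new StringBuilder();
    for (String token : value.toString().split("\\s+")) {
      if (!stopWords.contains(token)) filtered.append(token).append(' ');
    }
    context.write(new Text("doc"), new Text(filtered.toString().trim())); // placeholder key
  }
}
```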
Reading: Tom White, Hadoop: The Definitive Guide (O'Reilly). Jimmy Lin and Chris Dyer, Data-Intensive Text Processing with MapReduce (University of Maryland), http://www.umiacs.umd.edu/~jimmylin/book.html
Questions? Thank you.