Large Scale Text Analysis Using the Map/Reduce Hierarchy




Large Scale Text Analysis Using the Map/Reduce Hierarchy
David Buttler
Lawrence Livermore National Laboratory
This work is performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344.

Large scale computing with commodity hardware
- Origins
  - Google GFS, Map/Reduce, BigTable
  - Microsoft Azure
  - Hadoop: Yahoo! / Open Source Software
- Why do we care: k-mer lexing
  - 10 hours on a single fat node
  - 1 hour on an old cluster

The M/R stack of open source software
- Workflow (Cascading / Azkaban)
- Katta
- Solr / Lucene
- Pig
- Hive
- Zookeeper
- HBase
- Map / Reduce
- HDFS

The M/R stack of open source software: HDFS

HDFS
- Replicated
- Distributed
- Centrally managed
- SPOF
- Limited number of files
- Not POSIX compliant
- Rack-aware
[Diagram: a Name Node holding "File: path, {Blocks}" entries, with Data Node 1, Data Node 2, ..., Data Node n]

The M/R stack of open source software: Map / Reduce

Map/Reduce is functional programming distributed over a cluster
- Distributed computation
- Two phase computation
- Built-in shuffle/sort between phases
- Canonical example: word frequency count for the web
[Diagram: Input Documents -> Map emits (to, 1), (be, 1), (or, 1), (not, 1), (to, 1), (be, 1) -> shuffle / sort groups (be, 1), (be, 1), (not, 1), (to, 1), (to, 1) -> Reduce outputs (be, 2+), (not, 1+), (to, 2+)]
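The word count example can be sketched on a single machine in plain Python. This is only an illustration of the map, shuffle/sort, and reduce phases, not Hadoop code; the function names (map_phase, shuffle, reduce_phase, word_count) are hypothetical and chosen for this sketch.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle/sort: group all values by key and order the keys."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))


def reduce_phase(key, values):
    """Reduce: sum the counts collected for one word."""
    return key, sum(values)


def word_count(documents):
    """Run the three phases in sequence over an iterable of documents."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(k, vs) for k, vs in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["to be or not to be"]))
```

On a cluster the map calls run in parallel over input splits and the framework performs the shuffle; here everything runs in one process to keep the data flow visible.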

More interesting M/R examples
- Map input: document; Map output: raw text
- Map input: text; Map output: named entity annotations

The M/R stack of open source software: Zookeeper

Zookeeper
A highly available, scalable, distributed, configuration, consensus, group membership, leader election, naming, and coordination service
- Uses:
  - HBase: row locking; region key ranges; region server addresses
  - Katta: shard location information
  - Message queues
- Not: a large scale data store

Zookeeper Guarantees
1. Clients will never detect old data.
2. Clients will get notified of a change to data they are watching within a bounded period of time.
3. All requests from a client will be processed in order.
4. All results received by a client will be consistent with results received by all other clients.

Zookeeper Data Model
- Hierarchical namespace
- Each znode has data and children
- Data is read and written in its entirety
- Nodes store < 1MB data
- Writes go to all nodes
[Diagram: example znode tree rooted at "/" with nodes labeled services, YaView, servers, locks, Name 1, Name n, read-1, HBase, Katta]
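A toy in-memory model of the namespace described above may help fix the ideas. This is a sketch only and does not use the ZooKeeper client API; the class ZNodeTree, its methods, and the example paths are hypothetical names made up for this illustration.

```python
MAX_DATA_BYTES = 1024 * 1024  # the slide states that nodes store < 1MB of data


class ZNodeTree:
    """Hierarchical namespace where every node holds data and may have children."""

    def __init__(self):
        self._nodes = {"/": b""}

    def create(self, path, data=b""):
        parent = path.rsplit("/", 1)[0] or "/"
        if parent not in self._nodes:
            raise KeyError("parent znode does not exist: " + parent)
        if path in self._nodes:
            raise KeyError("znode already exists: " + path)
        self._check_size(data)
        self._nodes[path] = data

    def get(self, path):
        # Data is always read in its entirety.
        return self._nodes[path]

    def set(self, path, data):
        # Data is always replaced in its entirety; there are no partial writes.
        if path not in self._nodes:
            raise KeyError("no such znode: " + path)
        self._check_size(data)
        self._nodes[path] = data

    def children(self, path):
        prefix = path.rstrip("/") + "/"
        names = (p[len(prefix):] for p in self._nodes if p.startswith(prefix))
        return sorted(n for n in names if n and "/" not in n)

    @staticmethod
    def _check_size(data):
        if len(data) >= MAX_DATA_BYTES:
            raise ValueError("znode data must stay below 1MB")


if __name__ == "__main__":
    tree = ZNodeTree()
    tree.create("/services")
    tree.create("/services/HBase", b"region server addresses")
    tree.create("/services/Katta", b"shard locations")
    print(tree.children("/services"))
```

The flat dictionary keyed by full path keeps the sketch short; a real server also tracks versions, watches, and ephemeral nodes, none of which are modeled here.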

ZooKeeper Service
- All servers store a copy of the data (in memory)
- A leader is elected at startup
- Followers service clients, all updates go through leader
- Update responses are sent when a majority of servers have persisted the change
[Diagram: a ZooKeeper Service of five servers, one marked Leader, with clients connected to the servers]
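The majority rule for update responses comes down to simple arithmetic, sketched here. The helper names (quorum_size, is_committed, tolerated_failures) are hypothetical and are not part of any ZooKeeper API.

```python
def quorum_size(n_servers):
    """Smallest number of servers that forms a strict majority."""
    return n_servers // 2 + 1


def is_committed(acks, n_servers):
    """An update can be acknowledged once a majority has persisted it."""
    return acks >= quorum_size(n_servers)


def tolerated_failures(n_servers):
    """Servers that may be down while a majority can still be formed."""
    return n_servers - quorum_size(n_servers)


if __name__ == "__main__":
    for n in (3, 5, 7):
        print(n, quorum_size(n), tolerated_failures(n))
```

For the five-server ensemble drawn on the slide, three servers must persist a change before the client gets its response, so two servers can fail without blocking updates.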

The M/R stack of open source software: HBase

HBase
- Distributed column oriented data store
- Only supports one data type
- Tables are broken into regions
- Regions are automatically split and redistributed
- All data is local
- Scales to > 1M row / second insert rate (20 node cluster)
- Tightly integrated with Hadoop -> rows can be input/output for map/reduce tasks

HBase Data model
- Table
- Row ID
- Column Family Map
- Family
- Qualifier Map
- Qualifier
- Version
- Value
- Physical files

HBase Data model (simplified)
- Table: Row ID, Key (Family: Column: Version), Value
- Regions partitioned on row key
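The simplified model can be pictured as nested maps from row ID to (family, column) to version to value, with regions chosen by row key. The sketch below is a plain Python illustration, not the HBase client API; ToyTable, region_for, and the split keys are hypothetical, and storing values as bytes reflects the assumption that the single supported data type is an uninterpreted byte array.

```python
import bisect


class ToyTable:
    """Row ID -> (family, column) -> version -> value, all held in memory."""

    def __init__(self):
        self._rows = {}

    def put(self, row_id, family, column, version, value):
        cells = self._rows.setdefault(row_id, {})
        cells.setdefault((family, column), {})[version] = value

    def get(self, row_id, family, column, version=None):
        """Return the requested version, or the newest one when none is given."""
        versions = self._rows.get(row_id, {}).get((family, column), {})
        if not versions:
            return None
        if version is None:
            version = max(versions)
        return versions.get(version)


def region_for(row_key, split_keys):
    """Index of the region holding row_key, given sorted region start keys."""
    return bisect.bisect_right(split_keys, row_key)


if __name__ == "__main__":
    table = ToyTable()
    table.put("doc1", "text", "raw", 1, b"first draft")
    table.put("doc1", "text", "raw", 2, b"second draft")
    print(table.get("doc1", "text", "raw"))
    print(region_for("doc1", ["g", "p"]))
```

Because regions are contiguous ranges of sorted row keys, locating a row is a binary search over the split points, which is what bisect does here.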

HBase System Architecture
[Figure] From http://www.larsgeorge.com/2009/10/hbase-architecture-101-storage.html

HBase Master manages region servers

HBase clients directly access region servers for data

The M/R stack of open source software: Hive

Hive provides an SQL-like interface to data
- Components
  - Shell: SQL-like command line; Web; JDBC
  - Driver: API interface
  - Compiler: parse, plan, optimize
  - Execution Engine: DAG of stages (M/R, HDFS, or metadata)
  - Metastore: schema, location in HDFS, SerDe
http://www.cloudera.com/videos/introduction_to_hive

The M/R stack of open source software: Solr / Katta

Solr
- Faceted text search interface built on top of Lucene
- Built as a native web app: drops into any web server

Faceted search is a foundational component for ad hoc document analysis

Solr architecture
[Diagram by Yonik Seeley: HTTP Request Servlet, Update Servlet, Admin Interface; Standard Request Handler, Disjunction Max Request Handler, Custom Request Handler; XML Response Writer, XML Update Interface; Solr Core (Config, Schema, Caching, Analysis, Concurrency, Update Handler, Replication); Lucene]

Katta provides vertical and horizontal scalability
[Diagram: a text search interface (Solr) sends 1) Query to Katta nodes Node 1 ... Node n, each holding shards Shard A ... Shard Z and coordinated through Zookeeper, and receives 5) Combined Results]
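The query and combined-results steps follow a scatter-gather pattern, which can be sketched in a few lines. This does not use Katta or Lucene; the shard layout (a dict of document ID to text), the term-count scoring, and the function names are assumptions made only for this illustration.

```python
import heapq


def search_shard(shard, term):
    """Score each document in one shard by how often the term occurs."""
    hits = []
    for doc_id, text in shard.items():
        score = text.lower().split().count(term.lower())
        if score:
            hits.append((score, doc_id))
    return hits


def search_all(shards, term, k=3):
    """Send the query to every shard, then merge the partial results into a top-k list."""
    hits = [hit for shard in shards for hit in search_shard(shard, term)]
    return heapq.nlargest(k, hits)


if __name__ == "__main__":
    shards = [
        {"a": "hadoop hadoop hbase"},
        {"b": "hadoop solr", "c": "katta"},
    ]
    print(search_all(shards, "hadoop"))
```

In a real deployment each shard lives on a different node and is searched in parallel; the loop over shards here stands in for that fan-out.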

Projects using Hadoop at LLNL
- Student projects
  - Bioinformatics [James Leek]
  - Continuous time LDA [Kurt Miller and Tina Elliasi-Rad]
- Advanced R&D projects
  - Network analytics
  - Keyword tagging and entity extraction
  - Faceted Search
- Research projects
  - READ LDRD
- Program deployments
  - BKMS

Bioinformatics (student project)
- KPATH: produce DNA signatures for detection of pathogens
- k-mer lexing: produce set of unique DNA sequences of length 15-60
  - Sliding window
- Discover k-mers that are unique between bacteria and viruses
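Sliding-window k-mer lexing can be illustrated on short strings. This is a minimal single-machine sketch, not the KPATH or Hadoop implementation; the function names and the toy sequences are hypothetical.

```python
def kmers(sequence, k):
    """Slide a window of width k over the sequence, one base at a time."""
    for i in range(len(sequence) - k + 1):
        yield sequence[i:i + k]


def unique_kmers(sequence, k):
    """Set of distinct k-mers occurring in one sequence."""
    return set(kmers(sequence, k))


def distinguishing_kmers(target, background, k):
    """k-mers that occur in the target sequence but not in the background."""
    return unique_kmers(target, k) - unique_kmers(background, k)


if __name__ == "__main__":
    print(list(kmers("ACGTA", 3)))
    print(distinguishing_kmers("ACGTA", "CGTAC", 3))
```

A sequence of length n yields n - k + 1 windows, so the number of k-mers grows with the input size; that volume is what makes the full-scale problem a natural fit for a cluster.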

K-mer parsing performance comparisons
- Lexing bacteria file, 30 k-mer length [120 GB]
- Optimized suffix tree [C implementation] on single node, 256 GB RAM, 16 processor system: 10.5 hours
- Custom Hadoop implementation, 85 nodes, 8 GB RAM, dual processor [old]: ~1 hour

Unique K-mer grouping performance
- Group unique k-mers of length 15 [13 GB data]
- Pig implementation using outer joins [10 LOC]: more than 9 hours
- Custom Hadoop implementation: 3 hours 26 minutes

Network data
- HDFS provides storage layer for large repositories of network data
- Hive provides an SQL interface
- Performance on single query for 6 months of data:
  - Tuned Oracle DB: hours to days
  - Hive: minutes

Hadoop-based document management architecture
[Diagram: sources (RSS, PubMed, Email) pass through Ingest into an HBase document table (Source, Text, Metadata, Annotations); a Process step feeds Katta/Solr indexes (Pubmed, NYT, RSS, etc.); Tomcat provides access through a Document Viewer and Faceted Search]

Example Data Flow
- Initial Load: Load original documents into document table
- Parser: Custom map code to extract text and meta data
- NLP: Named Entity Extraction (SNER); Parsing / Coreference
- Topics: Send corpus slices to LDA for topic modeling
- Index: Write specific HBase columns to faceted Solr index shards
- Serve: Manage indexes with Katta over HDFS

Keyword tagging & Entity extraction
- Keyword tagging
  - Large dictionaries (100K terms)
  - Finite state machine to store dictionary
- Named Entity Recognition
  - Stanford NER CRF model [People, Organizations, Location]
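One simple way to store a phrase dictionary as a finite state machine is a token-level trie, where each state is a dict of outgoing transitions. The sketch below illustrates that idea only; it is an assumption about the general technique, not the tagger used in this work, and build_trie, tag, and the sample phrases are hypothetical.

```python
END = object()  # marker key for states where a dictionary phrase ends


def build_trie(phrases):
    """Compile dictionary phrases into a trie keyed by lowercased tokens."""
    root = {}
    for phrase in phrases:
        node = root
        for token in phrase.lower().split():
            node = node.setdefault(token, {})
        node[END] = phrase
    return root


def tag(text, trie):
    """Return (start, end, phrase) token spans for every dictionary match."""
    tokens = text.lower().split()
    matches = []
    for start in range(len(tokens)):
        node = trie
        for end in range(start, len(tokens)):
            node = node.get(tokens[end])
            if node is None:
                break  # no transition for this token, stop extending the span
            if END in node:
                matches.append((start, end + 1, node[END]))
    return matches


if __name__ == "__main__":
    trie = build_trie(["hadoop", "named entity", "named entity recognition"])
    print(tag("Hadoop runs named entity recognition", trie))
```

Lookup cost depends on the text length and the longest phrase rather than on the dictionary size, which is why this structure suits dictionaries with many thousands of terms.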

Performance of Keyword tagging & Entity extraction
- 21M Pubmed entries + 1M news articles
- 11M Pubmed abstracts
- 55K dictionary key phrases
- 6 node cluster [16 core, 96 GB RAM, 6 TB disk]
- Keyword Tagging: 8 minutes, 34 seconds
- Named Entity Annotation: 1 hr 58 minutes

Faceted Search Indexing Performance
- Creating 1 Solr index on 1M news articles: 8 hrs 16 min
  - Map: 37 min
  - Reduce: 8 hrs 14 min
- Creating 50 Solr indexes on 1M news articles: 55 min
  - Map: 7 min
  - Reduce: 54 min

Open-sourced products and others in the open source pipeline
- iscore
  - Content-based personalization [pre-hadoop]
- Reconcile
  - Coreference resolution software built on open source tools [with Cornell and U. Utah]
  - Additional adaptation to Hadoop
- Dunk
  - An elegant Java annotation system that allows you to have the fields of a Java object serialized (deserialized) to (from) an HBase table
  - Simplifies queries, object construction, and map/reduce formulation

Questions?

Disclaimer
This document was prepared as an account of work sponsored by an agency of the United States government. Neither the United States government nor Lawrence Livermore National Security, LLC, nor any of their employees makes any warranty, expressed or implied, or assumes any legal liability or responsibility for the accuracy, completeness, or usefulness of any information, apparatus, product, or process disclosed, or represents that its use would not infringe privately owned rights. Reference herein to any specific commercial product, process, or service by trade name, trademark, manufacturer, or otherwise does not necessarily constitute or imply its endorsement, recommendation, or favoring by the United States government or Lawrence Livermore National Security, LLC. The views and opinions of authors expressed herein do not necessarily state or reflect those of the United States government or Lawrence Livermore National Security, LLC, and shall not be used for advertising or product endorsement purposes.