A stream computing approach towards scalable NLP



Similar documents
Hadoop Architecture. Part 1

Real-time Big Data Analytics with Storm

SGT Technology Innovation Center Dasvis Project

Real Time Fraud Detection With Sequence Mining on Big Data Platform. Pranab Ghosh Big Data Consultant IEEE CNSV meeting, May Santa Clara, CA

Big Data JAMES WARREN. Principles and best practices of NATHAN MARZ MANNING. scalable real-time data systems. Shelter Island

Architectures for massive data management

Hadoop IST 734 SS CHUNG

NoSQL Data Base Basics

Introducing Storm 1 Core Storm concepts Topology design

Developing Scalable Smart Grid Infrastructure to Enable Secure Transmission System Control

BIG DATA. Using the Lambda Architecture on a Big Data Platform to Improve Mobile Campaign Management. Author: Sandesh Deshmane

Big Data Analysis: Apache Storm Perspective

Architectural patterns for building real time applications with Apache HBase. Andrew Purtell Committer and PMC, Apache HBase

Distributed File System. MCSN N. Tonellotto Complements of Distributed Enabling Platforms

NoSQL Performance Test In-Memory Performance Comparison of SequoiaDB, Cassandra, and MongoDB

Maximizing Hadoop Performance and Storage Capacity with AltraHD TM

Big Data Analytics - Accelerated. stream-horizon.com

Lecture 32 Big Data. 1. Big Data problem 2. Why the excitement about big data 3. What is MapReduce 4. What is Hadoop 5. Get started with Hadoop

Resource Aware Scheduler for Storm. Software Design Document. Date: 09/18/2015

A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM

Client Overview. Engagement Situation. Key Requirements

Openbus Documentation

Hadoop on a Low-Budget General Purpose HPC Cluster in Academia

Very Large Enterprise Network, Deployment, Users

HPC performance applications on Virtual Clusters

Big Data Analytics for Cyber

CSE-E5430 Scalable Cloud Computing Lecture 2

Towards Smart and Intelligent SDN Controller

BigData. An Overview of Several Approaches. David Mera 16/12/2013. Masaryk University Brno, Czech Republic

Predictive Analytics with Storm, Hadoop, R on AWS

Introduction to Hadoop. New York Oracle User Group Vikas Sawhney

Welcome to the unit of Hadoop Fundamentals on Hadoop architecture. I will begin with a terminology review and then cover the major components

Topology Aware Analytics for Elastic Cloud Services

NOT IN KANSAS ANY MORE

Networking in the Hadoop Cluster

Benchmark Study on Distributed XML Filtering Using Hadoop Distribution Environment. Sanjay Kulhari, Jian Wen UC Riverside

Time series IoT data ingestion into Cassandra using Kaa

Bangladesh Voter Registration Duplicate Search System Implemented by the Bangladesh Army and Dohatec Based on MegaMatcher Technology

Cloud Computing through Virtualization and HPC technologies

From Spark to Ignition:

Future Internet Technologies

BIG DATA IN THE CLOUD : CHALLENGES AND OPPORTUNITIES MARY- JANE SULE & PROF. MAOZHEN LI BRUNEL UNIVERSITY, LONDON

Architectures for Big Data Analytics A database perspective

BENCHMARKING CLOUD DATABASES CASE STUDY on HBASE, HADOOP and CASSANDRA USING YCSB

Software Scalability Issues in Large Clusters

Cisco Data Preparation

BIG DATA TOOLS. Top 10 open source technologies for Big Data

Big Data Use Case. How Rackspace is using Private Cloud for Big Data. Bryan Thompson. May 8th, 2013

Image Analytics on Big Data In Motion Implementation of Image Analytics CCL in Apache Kafka and Storm

Scaling from 1 PC to a super computer using Mascot

Integrating Big Data into the Computing Curricula

Very Large Enterprise Network Deployment, 25,000+ Users

Parallels Plesk Automation


A Brief Introduction to Apache Tez

Big Data in Test and Evaluation by Udaya Ranawake (HPCMP PETTT/Engility Corporation)

Big Data and Analytics: Getting Started with ArcGIS. Mike Park Erik Hoel

Tableau Server Scalability Explained

Hybrid Software Architectures for Big

Chapter 7. Using Hadoop Cluster and MapReduce

Multilingual Event Detection using the NewsReader Pipelines

An Introduction to High Performance Computing in the Department

Hadoop s Entry into the Traditional Analytical DBMS Market. Daniel Abadi Yale University August 3 rd, 2010

A survey of big data architectures for handling massive data

Final Project Proposal. CSCI.6500 Distributed Computing over the Internet

Tutorial: Big Data Algorithms and Applications Under Hadoop KUNPENG ZHANG SIDDHARTHA BHATTACHARYYA

[Hadoop, Storm and Couchbase: Faster Big Data]

Lambda Architecture. Near Real-Time Big Data Analytics Using Hadoop. January Website:

Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data

Dragon Medical Enterprise Network Edition Technical Note: Requirements for DMENE Networks with virtual servers

FAQs. This material is built based on. Lambda Architecture. Scaling with a queue. 8/27/2015 Sangmi Pallickara

STREAM PROCESSING AT LINKEDIN: APACHE KAFKA & APACHE SAMZA. Processing billions of events every day

Fault Tolerance in Hadoop for Work Migration

Microsoft HPC. V 1.0 José M. Cámara (checam@ubu.es)

Cloud Storage. Parallels. Performance Benchmark Results. White Paper.

Finding Insights & Hadoop Cluster Performance Analysis over Census Dataset Using Big-Data Analytics

Parallel Programming at the Exascale Era: A Case Study on Parallelizing Matrix Assembly For Unstructured Meshes

Cray Gemini Interconnect. Technical University of Munich Parallel Programming Class of SS14 Denys Sobchyshak

McAfee Agent Handler

MapReduce and Hadoop Distributed File System V I J A Y R A O

HPC Wales Skills Academy Course Catalogue 2015

Unified Batch & Stream Processing Platform

Outline. High Performance Computing (HPC) Big Data meets HPC. Case Studies: Some facts about Big Data Technologies HPC and Big Data converging

ANALYTICS ON BIG FAST DATA USING REAL TIME STREAM DATA PROCESSING ARCHITECTURE

Hadoop and Map-Reduce. Swati Gore

Exploring Oracle E-Business Suite Load Balancing Options. Venkat Perumal IT Convergence

Parallel Databases. Parallel Architectures. Parallelism Terminology 1/4/2015. Increase performance by performing operations in parallel

Big Data Course Highlights

HADOOP ON ORACLE ZFS STORAGE A TECHNICAL OVERVIEW

How To Test For Performance And Scalability On A Server With A Multi-Core Computer (For A Large Server)

HADOOP. Revised 10/19/2015

HDP Hadoop From concept to deployment.

GraySort on Apache Spark by Databricks

Tapping Into Hadoop and NoSQL Data Sources with MicroStrategy. Presented by: Jeffrey Zhang and Trishla Maru

A Middleware Strategy to Survive Compute Peak Loads in Cloud

Lecture 5: GFS & HDFS! Claudia Hauff (Web Information Systems)! ti2736b-ewi@tudelft.nl

Technical Overview Simple, Scalable, Object Storage Software

A very short Intro to Hadoop

Transcription:

A stream computing approach towards scalable NLP Xabier Artola, Zuhaitz Beloki, Aitor Soroa <zuhaitz.beloki@ehu.es> IXA group. University of the Basque Country. LREC, Reykjavík 2014

Table of contents 1 NewsReader project Goals Storm 2 3 Experiment setting 1. experiment 2. experiment 4 Conclusion Future work

NewsReader project Goals Storm Overwhelming flow of textual data available (Big Data) Computational power needs increased Solution: distributed computing Xabier Artola, Zuhaitz Beloki, Aitor Soroa A stream computing approach towards scalable NLP 1/16

NewsReader project NewsReader project Goals Storm Base project for our experiments NewsReader project goals: Perform real-time event detection Extract from text what happened to whom, when, where... Estimation of 2 million news items per day to process Xabier Artola, Zuhaitz Beloki, Aitor Soroa A stream computing approach towards scalable NLP 2/16

Requirements NewsReader project Goals Storm NLP modules distributed across a cluster Distribution of data Use of a stream computing framework Synchronisation between nodes Load balancing Fault tolerance Xabier Artola, Zuhaitz Beloki, Aitor Soroa A stream computing approach towards scalable NLP 3/16

Goals NewsReader project Goals Storm Create a distributed system to process large amount of documents in parallel Xabier Artola, Zuhaitz Beloki, Aitor Soroa A stream computing approach towards scalable NLP 4/16

Goals NewsReader project Goals Storm Create a distributed system to process large amount of documents in parallel Use of third party NLP modules Xabier Artola, Zuhaitz Beloki, Aitor Soroa A stream computing approach towards scalable NLP 4/16

Goals NewsReader project Goals Storm Create a distributed system to process large amount of documents in parallel Use of third party NLP modules Several possible levels of parallelisation: Xabier Artola, Zuhaitz Beloki, Aitor Soroa A stream computing approach towards scalable NLP 4/16

Goals NewsReader project Goals Storm Create a distributed system to process large amount of documents in parallel Use of third party NLP modules Several possible levels of parallelisation: Redesign and reimplementation of the core algorithms of each NLP module (out of scope) Xabier Artola, Zuhaitz Beloki, Aitor Soroa A stream computing approach towards scalable NLP 4/16

Goals NewsReader project Goals Storm Create a distributed system to process large amount of documents in parallel Use of third party NLP modules Several possible levels of parallelisation: Redesign and reimplementation of the core algorithms of each NLP module (out of scope) Deploy multiple instances of each NLP module Xabier Artola, Zuhaitz Beloki, Aitor Soroa A stream computing approach towards scalable NLP 4/16

Goals NewsReader project Goals Storm Create a distributed system to process large amount of documents in parallel Use of third party NLP modules Several possible levels of parallelisation: Redesign and reimplementation of the core algorithms of each NLP module (out of scope) Deploy multiple instances of each NLP module Goal: test and measure performance improvements with this approach Xabier Artola, Zuhaitz Beloki, Aitor Soroa A stream computing approach towards scalable NLP 4/16

Apache Storm NewsReader project Goals Storm Distributed stream computing system Open source Horizontal scalability Fault-tolerant Guarantees all data will be processed Large and active user community Xabier Artola, Zuhaitz Beloki, Aitor Soroa A stream computing approach towards scalable NLP 5/16

Apache Storm NewsReader project Goals Storm Storm concepts Topology: a graph of computation, composed by spouts and bolts Spout: input processing modules Bolt: rest of processing modules Tuple: structure of the data to transfer between processing modules Xabier Artola, Zuhaitz Beloki, Aitor Soroa A stream computing approach towards scalable NLP 6/16

(NAF) Common annotation format: NAF (Fokkens et al., 2014) Used in NewsReader project to integrate all the NLP modules Layered, stand-off format References between annotations in different layers Specifically designed to work on distributed environments +10 layers: raw, text (tokens), terms, chunks, entities... Xabier Artola, Zuhaitz Beloki, Aitor Soroa A stream computing approach towards scalable NLP 7/16

NAF example

Experiment setting Experiment setting 1. experiment 2. experiment Storm concepts in our system: Spout: reads text documents and sends to the first bolt Bolts: wrappers of NLP modules (tokenizer, POS tagger...) Tuple: <doc id, NAF doc> Small pipeline (4 modules): Tokenizer => POS tagger => NERC => WSD Xabier Artola, Zuhaitz Beloki, Aitor Soroa A stream computing approach towards scalable NLP 9/16

Experiment setting Experiment setting 1. experiment 2. experiment Input document sets 1. set: 10 documents (16.208 words, 682 sentences) 2. set: 100 documents (138.803 words, 5.416 sentences) 3. set: 1000 documents (1.185.933 words, 48.746 sentences) Hardware for testing: single commodity PC (Linux), Intel Core i5-3570, 3.4GHz (x4), 4GB RAM Xabier Artola, Zuhaitz Beloki, Aitor Soroa A stream computing approach towards scalable NLP 10/16

1. experiment Experiment setting 1. experiment 2. experiment Baseline system: the four modules sequentially processed Experiment: Storm topology implementation Pipeline approach, but: When a module finishes processing a document, starts with the next one Parallelisation level: number of NLP modules Xabier Artola, Zuhaitz Beloki, Aitor Soroa A stream computing approach towards scalable NLP 11/16

1. experiment results Experiment setting 1. experiment 2. experiment Total time words/s sent/s Gain 10 documents pipeline 2m42s 99.8 4.2 - Storm 2m25s 111.5 4.7 %10.4 100 documents pipeline 21m16s 108.8 4.2 - Storm 18m43s 123.5 4.8 %12.0 1000 documents pipeline 3h15m16s 101.2 4.2 - Storm 2h50m21s 116.0 4.8 %12.8 Little performance gain 96% of the processing time was spent in the WSD module Xabier Artola, Zuhaitz Beloki, Aitor Soroa A stream computing approach towards scalable NLP 12/16

2. experiment Experiment setting 1. experiment 2. experiment Multiple instances of the WSD module Parallelisation level: configurable Only 4 CPU cores available: test with 2x, 3x, 4x, 5x and 6x WSD instances Xabier Artola, Zuhaitz Beloki, Aitor Soroa A stream computing approach towards scalable NLP 13/16

2. experiment results Experiment setting 1. experiment 2. experiment Total time words/s sent/s Gain 10 documents pipeline 2m42s 99.8 4.2 - Storm 2m25s 111.5 4.7 %10.4 Storm 2xWSD 1m29s 182.9 7.7 %45.4 Storm 5xWSD 1m28s 182.5 7.7 %45.3 Storm 6xWSD 1m22s 195.4 8.2 %49.0 100 documents pipeline 21m16s 108.8 4.2 - Storm 18m43s 123.5 4.8 %12.0 Storm 2xWSD 10m48s 214.3 8.4 %49.3 Storm 5xWSD 7m44s 299.1 11.7 %63.7 Storm 6xWSD 7m48s 296.1 11.6 %63.3 1000 documents pipeline 3h15m16s 101.2 4.2 - Storm 2h50m21s 116.0 4.8 %12.8 Storm 2xWSD 1h40m37 196.5 8.1 %48.5 Storm 5xWSD 1h10m45s 279.3 11.5 %63.8 Storm 6xWSD 1h11m37s 276.0 11.3 %63.3 Xabier Artola, Zuhaitz Beloki, Aitor Soroa A stream computing approach towards scalable NLP 14/16

Conclusion Conclusion Future work We proposed a new approach for scalable distributed NLP using Storm Performance gain of 63% with a single commodity PC Significant boost in performance expected with large clusters Big room for improvements in overall NLP performance Xabier Artola, Zuhaitz Beloki, Aitor Soroa A stream computing approach towards scalable NLP 15/16

Future work Conclusion Future work Test with a multi-node cluster More real scenario Much larger input set Enhance general system architecture Distributed message queue system (Kafka) Use of a NoSQL database to store/retrieve data (MongoDB) Topology design improvements Non-linear topologies Granularity-based splitting of documents Xabier Artola, Zuhaitz Beloki, Aitor Soroa A stream computing approach towards scalable NLP 16/16

A stream computing approach towards scalable NLP Xabier Artola, Zuhaitz Beloki, Aitor Soroa <zuhaitz.beloki@ehu.es> IXA group. University of the Basque Country. LREC, Reykjavík 2014