
Outline
High Performance Computing (HPC)
  Towards exascale computing: a brief history
  Challenges in the exascale era
Big Data meets HPC
  Some facts about Big Data
  Technologies
  HPC and Big Data converging
Case Studies:
  Bioinformatics
  Natural Language Processing (NLP)

High Performance Computing (HPC)
What types of big problem might require a big computer?
Compute intensive: a single problem requiring a large amount of computation.
Memory intensive: a single problem requiring a large amount of memory.
High throughput: many unrelated problems to be executed over a long period.
Data intensive: operations on a large amount of data.

High Performance Computing (HPC)
Compute intensive: distribute the work across multiple CPUs to reduce the execution time as far as possible.
Each thread performs a part of the work on its own CPU, concurrently with the others.
CPUs may need to exchange data rapidly, using specialized hardware.
Large systems running multiple parallel jobs also need fast access to storage.
Many use cases from Physics, Chemistry, Energy, Engineering, Astronomy, Biology...
This is the traditional domain of HPC and supercomputers.

High Performance Computing (HPC)
Memory intensive: aggregate sufficient memory to enable a solution at all.
This is technically more challenging if the program cannot be parallelized.
High throughput: distribute work across multiple CPUs to reduce the overall execution time as far as possible.
The workload is trivially (or embarrassingly) parallel: it breaks up naturally into independent pieces, and each piece is performed by a separate process on a separate CPU, concurrently (a small Python sketch follows below).
The emphasis is on throughput over a period of time, rather than on performance on a single problem.
Obviously a supercomputer can do this too.
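A minimal sketch of an embarrassingly parallel workload in plain Python, using the standard multiprocessing module; the simulate function and its inputs are hypothetical placeholders for the independent pieces of work.

```python
from multiprocessing import Pool

def simulate(seed):
    # Hypothetical independent work item: each input is processed in
    # isolation, so the workers never need to communicate.
    total = 0
    for i in range(100_000):
        total += (seed * i) % 7
    return total

if __name__ == "__main__":
    inputs = range(32)        # independent pieces of work
    with Pool() as pool:      # one worker process per CPU core by default
        results = pool.map(simulate, inputs)
    print(sum(results))       # throughput matters, not any single task
```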

High Performance Computing (HPC)
Data intensive: distribute the data across multiple CPUs to process it in a reasonable time.
Note that the same work may be done on each data segment.
Rapid movement of data in and out of (disk) storage becomes important.
Big Data, and how to process it efficiently, currently occupies much thought.

High Performance Computing (HPC)
Challenges in the exascale era: FLOPS no longer rule.
Once upon a time, the FPU was the most expensive and precious resource in a supercomputer, and the metrics were FLOPS, FLOPS and FLOPS.
But the energy efficiency of data movement is not improving as fast as the energy efficiency of floating-point operations.
Algorithm designers should therefore be thinking in terms of spending the inexpensive resource (flops) to reduce data movement: communication-avoiding algorithms (a back-of-the-envelope sketch follows below).
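A back-of-the-envelope sketch of why data movement dominates the energy budget; the per-operation energy constants below are illustrative assumptions, not measurements of any real machine.

```python
# Assumed energy costs, loosely in picojoules: moving a word from DRAM
# is taken to be two orders of magnitude more expensive than a flop.
E_FLOP = 10          # pJ per floating-point operation (assumption)
E_DRAM_WORD = 1000   # pJ per 8-byte word fetched from DRAM (assumption)

def energy_pj(n_flops, n_words_moved):
    # Simple energy model: compute cost plus data-movement cost.
    return n_flops * E_FLOP + n_words_moved * E_DRAM_WORD

# Naive algorithm: one DRAM access per flop.
naive = energy_pj(n_flops=1e9, n_words_moved=1e9)
# Communication-avoiding variant: twice the flops, 10x fewer accesses.
avoiding = energy_pj(n_flops=2e9, n_words_moved=1e8)
print(f"{naive / avoiding:.1f}x less energy")  # ~8.4x, despite extra flops
```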


Big Data meets HPC
Some facts about Big Data.
Data in 2013: 4.4 zettabytes (4.4 × 10^21 bytes); estimated to reach 44 zettabytes by 2020.
Searching for one element, using a single PC:
in a 1 MB file: < 0.1 seconds
in a 1 GB file: a few minutes
in a 1 TB file: about 30 hours
in a 1 PB file: > 3 years
in a 1 EB file: 30 centuries
in a 1 ZB file: 3,000 millennia
(A sketch of the arithmetic behind these estimates follows below.)
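These estimates are consistent with a sequential scan at roughly 10 MB/s, an assumed effective rate for a single PC including I/O; the short Python sketch below reproduces them under that assumption.

```python
SCAN_RATE = 10e6  # bytes per second: assumed effective scan rate of a PC

SIZES = {"1 GB": 1e9, "1 TB": 1e12, "1 PB": 1e15, "1 EB": 1e18, "1 ZB": 1e21}

for name, size in SIZES.items():
    seconds = size / SCAN_RATE
    years = seconds / (365 * 24 * 3600)
    print(f"{name}: {seconds:,.0f} s (~{years:,.1f} years)")
# 1 TB -> ~28 hours; 1 PB -> ~3.2 years; 1 ZB -> ~3.2 million years,
# i.e. on the order of 3,000 millennia, matching the slide.
```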

Big Data meets HPC
Technologies: how to process all these data?
The Map/Reduce paradigm, introduced by Dean and Ghemawat (Google, 2004).
It is as simple as providing:
a MAP function that processes a key/value pair to generate a set of intermediate key/value pairs, and
a REDUCE function that merges all intermediate values associated with the same intermediate key (see the word-count sketch below).
The runtime takes care of:
partitioning the input data (parallelism),
scheduling the program's execution across a set of machines (parallelism),
handling machine failures, and
managing inter-machine communication (parallelism).
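A minimal, self-contained sketch of the paradigm in Python: word count expressed as a MAP and a REDUCE function, with the shuffle (group-by-key) step that the runtime would normally perform across machines done here in a dictionary.

```python
from collections import defaultdict

def map_fn(key, value):
    # MAP: (document_id, text) -> list of intermediate (word, 1) pairs.
    return [(word, 1) for word in value.split()]

def reduce_fn(key, values):
    # REDUCE: (word, [counts]) -> (word, total).
    return (key, sum(values))

documents = {"doc1": "big data meets hpc", "doc2": "big data big compute"}

# Shuffle: group intermediate values by intermediate key.
groups = defaultdict(list)
for doc_id, text in documents.items():
    for k, v in map_fn(doc_id, text):
        groups[k].append(v)

print(sorted(reduce_fn(k, vs) for k, vs in groups.items()))
# [('big', 3), ('compute', 1), ('data', 2), ('hpc', 1), ('meets', 1)]
```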

Big Data meets HPC
Technologies: Apache Hadoop
Apache Hadoop is an open-source implementation of the Map/Reduce paradigm: a framework for large-scale data processing.
It is designed to run on cheap commodity hardware, and it automatically handles data replication and node failures.
Hadoop provides:
an API and implementation for working with Map/Reduce,
job configuration and efficient scheduling,
browser-based monitoring of important cluster stats, and
a distributed filesystem optimized for HUGE amounts of data (HDFS).

Big Data meets HPC
Technologies: Apache Hadoop
Pros:
All the advantages mentioned previously.
Hadoop is written in Java, but it can execute code written in other programming languages via Hadoop Streaming (a Python example follows below).
A rich ecosystem (Pig, Hive, HBase, etc.).
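A sketch of how Hadoop Streaming runs non-Java code: the mapper and reducer are ordinary executables that read stdin and write tab-separated key/value pairs to stdout. The file names and the invocation in the comment are illustrative assumptions.

```python
#!/usr/bin/env python3
# mapper.py -- word-count mapper. Hadoop Streaming pipes input splits to
# stdin and shuffles/sorts our stdout lines by key before the reducer.
# Illustrative invocation (jar and paths are assumptions):
#   hadoop jar hadoop-streaming.jar -input in/ -output out/ \
#       -mapper mapper.py -reducer reducer.py
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")
```

```python
#!/usr/bin/env python3
# reducer.py -- input arrives sorted by key, so equal words are adjacent
# and a running total per word is enough.
import sys

current, count = None, 0
for line in sys.stdin:
    word, value = line.rstrip("\n").split("\t")
    if word != current:
        if current is not None:
            print(f"{current}\t{count}")
        current, count = word, 0
    count += int(value)
if current is not None:
    print(f"{current}\t{count}")
```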

Big Data meets HPC
Technologies: Apache Hadoop
Cons:
The problem must fit the Map/Reduce paradigm (embarrassingly parallel problems).
Bad for iterative applications.
Significant performance degradation when using Hadoop Streaming (i.e. when the code is written in languages such as Fortran, C, Python, Perl, etc.).
Intermediate results are always stored on disk (in-memory MapReduce, IMMR, variants address this).
No reuse of computation across jobs with similar input data: for example, a job that runs every day to find the most frequently read news over the past week.
Hadoop was not designed by HPC people (joke!!!).

Big Data meets HPC
Technologies: Apache Spark
Apache Spark is an open-source project that started as a research project at Berkeley.
It is a cluster computing framework designed to be fast and general-purpose.
Pros (I):
It extends the Map/Reduce paradigm to support more types of computation (interactive queries and stream processing).
APIs in Python, Java and Scala.
Spark has the ability to run computations in memory (Resilient Distributed Datasets, RDDs); a PySpark sketch follows below.
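A minimal PySpark sketch of in-memory computation (the HDFS path is a placeholder): the cached RDD is materialized once and then reused from memory by subsequent actions instead of being re-read from disk.

```python
from pyspark import SparkContext

sc = SparkContext(appName="rdd-sketch")

# Build an RDD and ask Spark to keep it in memory across actions.
lines = sc.textFile("hdfs:///data/corpus.txt")  # placeholder path
words = lines.flatMap(lambda line: line.split()).cache()

# Both actions below reuse the cached RDD rather than rescanning the file.
print(words.count())
counts = words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)
print(counts.take(5))

sc.stop()
```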

Big Data meets HPC
Technologies: Apache Spark
Pros (II):
It supports different workloads in the same engine: batch applications, iterative algorithms, streaming and interactive queries (an iterative sketch follows below).
Good integration with Hadoop.
Cons:
Memory requirements.
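A sketch of why caching matters for iterative algorithms (the data and the update rule are hypothetical): every pass over the cached RDD runs from memory, whereas Hadoop would write intermediate results to disk between jobs.

```python
from pyspark import SparkContext

sc = SparkContext(appName="iterative-sketch")

# Hypothetical data: cached once, then scanned on every iteration.
points = sc.parallelize(range(100_000)).map(float).cache()

# Toy iterative algorithm: gradient descent toward the mean of the data.
estimate, rate = 0.0, 0.5
for _ in range(10):  # one in-memory pass over the cached RDD per step
    grad = points.map(lambda x: estimate - x).mean()
    estimate -= rate * grad

print(estimate)  # converges to the mean, ~49999.5
sc.stop()
```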

Big Data meets HPC
Other technologies:
Apache Flink: a European project whose functionality is very similar to that of Spark.
If you like R, try these:
RHIPE: released as an R package; Map and Reduce functions are written as R code.
Big R: it hides the Map/Reduce details from the programmer (O. D. Lara et al., "Big R: Large-scale Analytics on Hadoop Using R", IEEE Int. Congress on Big Data, 2014).

Big Data meets HPC
HPC and Big Data converging
These technologies were designed to run on cheap commodity clusters, but there is more to Big Data than large amounts of information.
It is also about massive distributed activities such as complex queries and computation (analytics, or data-intensive scientific computing): High Performance Data Analytics (HPDA).

Big Data meets HPC
HPC and Big Data converging: InfiniBand
InfiniBand is the standard interconnect technology used in HPC supercomputers, while commodity clusters use 1 Gbps or 10 Gbps Ethernet.
Hadoop is very network-intensive (e.g. DataNodes and TaskTrackers exchange a lot of information).
56 Gbps FDR InfiniBand can be up to 100x faster than 10 Gbps Ethernet thanks to its superior bandwidth and latency.
It allows the big data platform to scale to the desired size without worrying about network bottlenecks.

Big Data meets HPC
HPC and Big Data converging: accelerators
Hadoop and Spark only work on homogeneous (CPU-only) clusters, while HPC is moving to heterogeneous platforms whose nodes include GPUs, co-processors (Intel Xeon Phi) and/or FPGAs: the most promising systems to reach the exaflop.
Accelerators can boost performance (although not all applications fit well), e.g. for complex analytics or data-intensive scientific computing.
There is a lot of interest in this direction: HadoopCL, Glasswing, GPMR, MapCG, MrPhi, etc.


Case Studies: Natural Language Processing (NLP)
NLP is well suited to structuring and organizing the textual information accessible through the Internet.
Linguistic processing is a complex task that requires several subtasks organized in interconnected modules, and most existing NLP modules are programmed in Perl (regular expressions).
We have integrated three NLP modules into Hadoop via Hadoop Streaming (a toy sketch of such a streaming module follows below):
Named Entity Recognition (NER): identifies as a single unit (or token) those words or chains of words denoting an entity, e.g. New York, University of San Diego, Herbert von Karajan, etc.
PoS tagging: assigns each token of the input text a single PoS tag carrying morphological information, e.g. singular masculine adjective, past participle verb, etc.
Named Entity Classification (NEC): classifies entities into classes such as People, Organizations, Locations, or Miscellaneous.
J. M. Abuín, J. C. Pichel, T. F. Pena, P. Gamallo and M. García, "Perldoop: Efficient Execution of Perl Scripts on Hadoop Clusters", IEEE Int. Conf. on Big Data, 2014.
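A toy NER mapper in Hadoop Streaming style, standing in for the Perl modules described above: it reads raw text on stdin and emits candidate entities as key/value pairs. The capitalization-based pattern is a deliberately naive assumption; the real modules are far more sophisticated.

```python
#!/usr/bin/env python3
# Toy NER mapper: a candidate entity is a maximal run of capitalized
# words, optionally joined by connectors such as "of", "von" or "de"
# (so it catches "New York", "University of San Diego" and
# "Herbert von Karajan"). Naive by design.
import re
import sys

ENTITY = re.compile(r"[A-Z][a-z]+(?:\s+(?:(?:of|von|de)\s+)?[A-Z][a-z]+)*")

for line in sys.stdin:
    for match in ENTITY.finditer(line):
        entity = match.group(0)
        if " " in entity:          # keep multi-word candidates only, to
            print(f"{entity}\t1")  # skip ordinary sentence-initial words
```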

Case Studies: Natural Language Processing (NLP)
HPC people (including myself) are obsessed with performance: Flops, Flops and Flops.
Hadoop Streaming is really nice, but it shows a significant performance degradation with respect to Java code.
Perldoop: we designed an automatic source-to-source translator from Perl to Java.
It is not general-purpose: the Perl scripts must be in Map/Reduce format and must follow some simple programming rules.
J. M. Abuín, J. C. Pichel, T. F. Pena, P. Gamallo and M. García, "Perldoop: Efficient Execution of Perl Scripts on Hadoop Clusters", IEEE Int. Conf. on Big Data, 2014.

Thank you!
juancarlos.pichel@usc.es
citius.usc.es