Alexander Reinefeld, Thorsten Schütt, ZIB 1



Similar documents
Scalable Cloud Computing Solutions for Next Generation Sequencing Data

Outline. High Performance Computing (HPC) Big Data meets HPC. Case Studies: Some facts about Big Data Technologies HPC and Big Data converging

An Oracle White Paper June High Performance Connectors for Load and Access of Data from Hadoop to Oracle Database

QoS-Aware Storage Virtualization for Cloud File Systems. Christoph Kleineweber (Speaker) Alexander Reinefeld Thorsten Schütt. Zuse Institute Berlin

GraySort and MinuteSort at Yahoo on Hadoop 0.23

Computational infrastructure for NGS data analysis. José Carbonell Caballero Pablo Escobar

Jeffrey D. Ullman slides. MapReduce for data intensive computing

Use of Hadoop File System for Nuclear Physics Analyses in STAR

HPCHadoop: MapReduce on Cray X-series

Big Data Challenges in Bioinformatics

Big Data and Apache Hadoop s MapReduce

Parallel Databases. Parallel Architectures. Parallelism Terminology 1/4/2015. Increase performance by performing operations in parallel

Leveraging BlobSeer to boost up the deployment and execution of Hadoop applications in Nimbus cloud environments on Grid 5000

Constructing a Data Lake: Hadoop and Oracle Database United!

I/O Considerations in Big Data Analytics

Research of Railway Wagon Flow Forecast System Based on Hadoop-Hazelcast

Optimization and analysis of large scale data sorting algorithm based on Hadoop

Hadoop SNS. renren.com. Saturday, December 3, 11

ORACLE BIG DATA APPLIANCE X3-2

Log Mining Based on Hadoop s Map and Reduce Technique

Performance Comparison of Intel Enterprise Edition for Lustre* software and HDFS for MapReduce Applications

Lambda Architecture. CSCI 5828: Foundations of Software Engineering Lecture 29 12/09/2014

Task Scheduling in Hadoop

Distributed Framework for Data Mining As a Service on Private Cloud

Cloud Computing. Lectures 10 and 11 Map Reduce: System Perspective

Developing MapReduce Programs

Cloud Computing For Bioinformatics

Performance Comparison of SQL based Big Data Analytics with Lustre and HDFS file systems

Big Data and Natural Language: Extracting Insight From Text

Apache Hadoop. Alexandru Costan

Package hive. January 10, 2011

This exam contains 13 pages (including this cover page) and 18 questions. Check to see if any pages are missing.

MapReduce and Distributed Data Analysis. Sergei Vassilvitskii Google Research

A bit about Hadoop. Luca Pireddu. March 9, CRS4Distributed Computing Group. (CRS4) Luca Pireddu March 9, / 18

RAMCloud and the Low- Latency Datacenter. John Ousterhout Stanford University

Hadoop: Embracing future hardware

HPC and Big Data. EPCC The University of Edinburgh. Adrian Jackson Technical Architect

From GWS to MapReduce: Google s Cloud Technology in the Early Days

Data-intensive computing systems

Evaluating Task Scheduling in Hadoop-based Cloud Systems

Real Time Big Data Processing

Seeking Opportunities for Hardware Acceleration in Big Data Analytics

Intro to Map/Reduce a.k.a. Hadoop

Agenda. Some Examples from Yahoo! Hadoop. Some Examples from Yahoo! Crawling. Cloud (data) management Ahmed Ali-Eldin. First part: Second part:

MapReduce and Hadoop. Aaron Birkland Cornell Center for Advanced Computing. January 2012

Hadoop Architecture. Part 1

Oracle s Big Data solutions. Roger Wullschleger. <Insert Picture Here>

Lecture 24: WSC, Datacenters. Topics: network-on-chip wrap-up, warehouse-scale computing and datacenters (Sections )

Enabling High performance Big Data platform with RDMA

Bringing Big Data Modelling into the Hands of Domain Experts

Data-Flow Awareness in Parallel Data Processing

5 SCS Deployment Infrastructure in Use

2015 The MathWorks, Inc. 1

Data Mining with Hadoop at TACC

Hadoop on a Low-Budget General Purpose HPC Cluster in Academia

Lecture 10 - Functional programming: Hadoop and MapReduce

MapReduce. MapReduce and SQL Injections. CS 3200 Final Lecture. Introduction. MapReduce. Programming Model. Example

Management & Analysis of Big Data in Zenith Team

HADOOP PERFORMANCE TUNING

Hadoop MapReduce over Lustre* High Performance Data Division Omkar Kulkarni April 16, 2013

BookKeeper. Flavio Junqueira Yahoo! Research, Barcelona. Hadoop in China 2011

Introduction to MapReduce and Hadoop

MapReduce, Hadoop and Amazon AWS

Xiaoming Gao Hui Li Thilina Gunarathne

Tutorial: Big Data Algorithms and Applications Under Hadoop KUNPENG ZHANG SIDDHARTHA BHATTACHARYYA

Hadoop Parallel Data Processing

Big Data Analytics for Net Flow Analysis in Distributed Environment using Hadoop

Integration of Virtualized Workernodes in Batch Queueing Systems The ViBatch Concept

Open source Google-style large scale data analysis with Hadoop

Analysis and Optimization of Massive Data Processing on High Performance Computing Architecture

BENCHMARKING CLOUD DATABASES CASE STUDY on HBASE, HADOOP and CASSANDRA USING YCSB

DATA MINING WITH HADOOP AND HIVE Introduction to Architecture

Big Data and Scripting map/reduce in Hadoop

MapReduce (in the cloud)

Eoulsan Analyse du séquençage à haut débit dans le cloud et sur la grille

Welcome to the unit of Hadoop Fundamentals on Hadoop architecture. I will begin with a terminology review and then cover the major components

Distributed Apriori in Hadoop MapReduce Framework

High Availability of the Polarion Server

Distributed Image Processing using Hadoop MapReduce framework. Binoy A Fernandez ( ) Sameer Kumar ( )

A Novel Cloud Based Elastic Framework for Big Data Preprocessing


From Internet Data Centers to Data Centers in the Cloud

Deploying Flash- Accelerated Hadoop with InfiniFlash from SanDisk

Introduction to Cloud Computing

CS246: Mining Massive Datasets Jure Leskovec, Stanford University.

Scaling Out With Apache Spark. DTL Meeting Slides based on

Big Data and Cloud Computing for GHRSST

Variations in Performance and Scalability when Migrating n-tier Applications to Different Clouds

Convex Optimization for Big Data: Lecture 2: Frameworks for Big Data Analytics

Adaptive Task Scheduling for Multi Job MapReduce

Hadoop/MapReduce. Object-oriented framework presentation CSCI 5448 Casey McTaggart

Can t We All Just Get Along? Spark and Resource Management on Hadoop

# Not a part of 1Z0-061 or 1Z0-144 Certification test, but very important technology in BIG DATA Analysis

Transcription:

Alexander Reinefeld, Thorsten Schütt, ZIB 1

Solving Puzzles with MapReduce Alexander Reinefeld and Thorsten Schütt Zuse Institute Berlin Hadoop Get Together, 29.09.2009, Berlin

How to solve very large combinatorial problems which cannot be solved on a single computer which do not fit into main memory which require massively parallel computers (or clusters) or distributed computers (cloud computing, datacenters) Alexander Reinefeld, Thorsten Schütt, ZIB 3

15-Puzzle 16!/2 = 10,461,394,944,000 states Alexander Reinefeld, Thorsten Schütt, ZIB 4

What is the hardest configuration? Alexander Reinefeld, Thorsten Schütt, ZIB 5

Disclaimer: Talk is about Map-Reduce! Not about Hadoop!

1500 Compute Nodes Large Parallel File System Alexander Reinefeld, Thorsten Schütt, ZIB 7

No Local Disk! Just CPU Memory Infiniband (20/40 Gbit) Alexander Reinefeld, Thorsten Schütt, ZIB 8

Alexander Reinefeld, Thorsten Schütt, ZIB 9

Hadoop No Local Disks Parallel File System (Lustre) Batch System Was unstable (2 years ago) Waste of Disk Space Alexander Reinefeld, Thorsten Schütt, ZIB 10

Breadth-First Search

Breadth-First Search main() { Queue queue = [start_node]; // work-queue Set visited = {}; // visited set while (queue!= []) { foreach(position pos in neighbors(queue.pop_first())) { if(pos not in visited) { // check is visited? visit(pos); // visit visited.add(pos); } } } } Alexander Reinefeld, Thorsten Schütt, ZIB 12

Breadth-First Search main() { Queue queue = [start_node]; // work-queue Set visited = {}; // visited set while (queue!= []) { foreach(position pos in neighbors(queue.pop_first())) { if(pos not in visited) { // check is visited? visit(pos); // visit visited.add(pos); } } } } Imagine visited has a size of 10,461,394,944,000-1 Alexander Reinefeld, Thorsten Schütt, ZIB 13

Breadth-First Frontier Search Key: Position Value: Store used moves U, D, L, R 4 Bit Alexander Reinefeld, Thorsten Schütt, ZIB 14

Key/value pairs in the 15-puzzle key = position value = move(s) that generated that pos. Alexander Reinefeld, Thorsten Schütt, ZIB 15

Breadth-First Frontier Search with MapReduce void mapper(position position, set usedmoves) { foreach((successor, move) in successors(position)) if(inverse(move) & usedmoves == mt) emit(successor, move); } void reducer(position position, set<set> usedmoves) { moves := 0; foreach(move in usedmoves) moves := moves move; emit(position, moves); } main() { front = [(start_node, 0)]; while (front!= mt) { intermediate = map(front, mapper()); front = reduce(itermediate, reducer()); } } Alexander Reinefeld, Thorsten Schütt, ZIB 16

How to store a Puzzle Position? 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 used 17 Bytes * 10,461,394,944,000 = 170TB 16 Bytes 1 Byte Alexander Reinefeld, Thorsten Schütt, ZIB 17

How to store a Puzzle Position? 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 used 17 Bytes * 10,461,394,944,000 = 170TB 16 Bytes 1 Byte 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 used 16 Nibble (4 Bit) = Int64 1 Byte 9 Bytes * 10,461,394,944,000 = 90TB 16 Bytes * 10,461,394,944,000 = 160 TB Alexander Reinefeld, Thorsten Schütt, ZIB 18

How to store a Puzzle Position? 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 used 17 Bytes * 10,461,394,944,000 = 170TB 16 Bytes 1 Byte 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 used 16 Nibble (4 Bit) = Int64 1 Byte 9 Bytes * 10,461,394,944,000 = 90TB 16 Bytes * 10,461,394,944,000 = 160 TB 1 2 3 4 5 6 7 8 9 10 11 12 13 14 used 8 Bytes * 10,461,394,944,000 = 80TB 15 Nibble (4 Bit) 4 Bit Int64 for Key + Value Alexander Reinefeld, Thorsten Schütt, ZIB 19

66 h later. Alexander Reinefeld, Thorsten Schütt, ZIB 20

Result 16!/2 = 10,461,394,944,000 states ~66 hours execution time Alexander Reinefeld, Thorsten Schütt, ZIB 21

17 Positions @ Depth 80 Alexander Reinefeld, Thorsten Schütt, ZIB 22

Solution for FA8CBE9D72513640 Alexander Reinefeld, Thorsten Schütt, ZIB 23

Solving Single Instances Shortest Path Admisible Heuristic Estimates the Distance to the start Position Will never overestimate the Distance Used for pruning Nodes 32 Opteron Cores + 256GB main memory Alexander Reinefeld, Thorsten Schütt, ZIB 24

Demo

Applications All Kinds of Optimization Problems DNA Sequence Alignment Model Checking Alexander Reinefeld, Thorsten Schütt, ZIB 26

References A. Reinefeld, T. Schütt. Out-of-Core Parallel Heuristic Search with MapReduce. High-Performance Computing Symposium HPCS 2009, Kingston, Ontario. Korf et al. (2000, 2005, ): frontier search, delayed duplicate detection Zhou et al. (2004, ): BFHS, BF-IDA*, structured duplicate detection Edelkamp et al. (2004): External A* Culberson et al. (1994,..): pattern databases Alexander Reinefeld, Thorsten Schütt, ZIB 27

Thanks! Questions?