Data Mining with Hadoop at TACC

Data Mining with Hadoop at TACC. Weijia Xu, Data Mining & Statistics

Data Mining & Statistics Group Main activities: Research and development: developing new data mining and analysis solutions for practical problems from various domains through collaborations. Consulting: providing advice on suitable analytic approaches for users; helping users scale up existing solutions. Active areas of collaborative projects: data-intensive computing, statistical computing with R, visual analytics.

Outline: Background: common HPC systems; Hadoop. Enabling a dynamic Hadoop cluster on Longhorn. A project example.

Background: Typical HPC Architecture

Background: HPC Architecture All data (raw input, intermediate data, output) are stored in a shared high-performance parallel file system. Each compute node can access all the data. There is no or minimal local storage at each compute node. Data are transferred from storage to compute-node memory over high-performance network connections.

Background: Hadoop Hadoop is a framework that makes efficient distributed data processing easy. Hadoop is an open-source implementation of the MapReduce programming model in Java, with interfaces to other programming languages such as C/C++ and Python. Hadoop includes several subprojects: HDFS, a distributed file system based on the Google File System (GFS), serving as its shared file system; Mahout, a scalable machine learning and data mining library; and Pig, a high-level data-flow language and execution framework for parallel computation.

Background: MapReduce Programming Model Data storage is distributed among multiple nodes. Bring the computation to the data. Automatic parallelization & distribution. Users implement an interface of two functions: map (in_key, in_value) -> (out_key, intermediate_value) list; reduce (out_key, intermediate_value list) -> out_value list.

[MapReduce data-flow diagram] Each data store feeds input key/value pairs to a map task, which emits (key, values...) pairs. A barrier step aggregates the intermediate values by output key; each reduce task then takes (key, intermediate values) and produces the final values for that key.
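To make the two-function interface concrete, here is a minimal sketch in Java using Hadoop's org.apache.hadoop.mapreduce API, with word counting as a stand-in problem; the class names are illustrative and not part of any TACC code.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // map(in_key, in_value) -> (out_key, intermediate_value) list
    public static class TokenMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(line.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);   // emit (word, 1) for every token
            }
        }
    }

    // reduce(out_key, intermediate_value list) -> out_value list
    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable c : counts) {
                sum += c.get();             // aggregate all values grouped under this key
            }
            context.write(word, new IntWritable(sum));
        }
    }
}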

Hadoop vs. HPC Data storage: HPC uses a shared, global, high-performance parallel file system; in Hadoop, data is replicated across the data nodes. Parallelism: HPC is compute parallel; Hadoop is data parallel. Which one is better? Is there enough memory? How many times does the data need to be read and written?

Enabling a Dynamic Hadoop Cluster on Longhorn Motivations: to run existing Hadoop code; to test and learn whether Hadoop is the right solution; to experiment with what the best Hadoop cluster setup is. Funded through the Longhorn Innovation Fund for Technology (LIFT) at UT.

Goal Enable users to dynamically start a Hadoop cluster session through SGE. Expand the local storage of the Longhorn nodes; data is distributed to local storage for further processing. Support data analysis jobs written with the Hadoop library. The Hadoop cluster is fully customizable by users.

Background: Typical HPC Architecture

Compute Nodes on Longhorn 240 Dell R610: 2 Intel Nehalem quad-core processors (8 cores) @ 2.53 GHz, 48GB RAM, 73GB local disk. 16 Dell R710: 2 Intel Nehalem quad-core processors (8 cores) @ 2.53 GHz, 144GB RAM, 73GB local disk.

Hadoop on Longhorn Local storage expansion: 192 500GB 7.2k drives are installed on 48 R610 nodes on Longhorn; 112 146GB 15k drives are installed on 16 R710 nodes on Longhorn. /hadoop file system: the expanded storage is available as /hadoop on each node. R610 and R710 nodes are accessible via different queues: 48 R610 nodes with ~2TB available per node (96TB total) through the hadoop queue; 16 R710 nodes with ~1TB available per node (16TB total) through the largemem queue.

Hadoop job queue on Longhorn Queue priority on Longhorn: the normal queue has the highest priority; the hadoop queue has the same priority as the long and largemem queues. For a job request, compute nodes are assigned in the following order to minimize the use of Hadoop-capable nodes: 1. 192 R610 nodes without /hadoop 2. 48 R610 nodes with /hadoop 3. 16 R710 nodes with /hadoop

Running Hadoop on Longhorn Default job submission scripts are provided for users. Users can start a Hadoop session using any number of nodes up to 48 (384 cores). The name node and data nodes of the Hadoop cluster are configured dynamically for each job. The configuration files are stored in the user's home directory so that users can customize the settings.

Running Hadoop on Longhorn Details at https://sites.google.com/site/tacchadoop/ In a nutshell: get and install the TACC version of Hadoop locally; submit the Hadoop job script; start a VNC session (optional); the Hadoop cluster starts; the user interacts within the VNC session and runs data processing with existing Hadoop code and scripts; the user closes the VNC session or the job terminates; the Hadoop cluster and the VNC server are shut down.
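For a sense of what "running data processing with existing Hadoop code" looks like in practice, below is a hedged sketch of a driver class that configures and submits a job against the per-session HDFS, reusing the mapper and reducer sketched earlier. The class names and path arguments are illustrative, not TACC-specific conventions.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();      // picks up the per-job cluster settings
        Job job = new Job(conf, "word count");         // Hadoop 0.20/1.x-era API
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCount.TokenMapper.class);
        job.setCombinerClass(WordCount.SumReducer.class);
        job.setReducerClass(WordCount.SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Input and output live in the per-session HDFS built on the /hadoop local drives
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

With a Hadoop session running, such a job would typically be launched with something like: hadoop jar wordcount.jar WordCountDriver <hdfs input dir> <hdfs output dir>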

Distributed, Scalable Clustering for Detecting Halos in Terascale Astronomy Problem: ~200 terabytes of data have been generated through simulation by astronomy researchers to study the evolution of the universe. A 1GB data file can contain ~15 million data objects, which is difficult to analyze or visualize. Goal: explore computational solutions to automatically identify regions of interest as halo candidates for detailed rendering, and to automatically identify substructure within halo candidates. Algorithmic approach: a density-clustering-based approach that identifies dense areas iteratively, shaving off points in sparse areas during each iteration.

Scale of the problem An image demonstrating the scale of the problem: http://www.utexas.edu/cio/itgovernance/lift/articles/images/cloud_large_1.png The image shows a potentially interesting halo area out of a 6k-cubed simulation result. Image courtesy of Paul Navratil.

Distributed, Scalable Clustering for Detecting Halos in Terascale Astronomy Methods and status Parallel data-flow approach on Longhorn: distribute the data to each compute node for independent processing, leveraging the multiple cores at each node for computation; ~57k data points can be processed per compute node per hour. Distributed computation using Hadoop on Longhorn: data is stored in a dynamically constructed hierarchical distributed file system using the local hard drives at each compute node; using the MapReduce computing model implemented with Hadoop, up to ~600k data points can be processed per compute node per hour.
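As one way to picture the Hadoop formulation (an illustrative reduction, not the published implementation), the sketch below expresses a single density pass as a map/reduce job: the mapper bins each particle into a coarse grid cell, and the reducer keeps only the points in cells whose count reaches a threshold, i.e. it shaves off sparse areas. The record format (one "x y z" particle per line), the grid resolution, and the threshold are assumptions.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class DensityPass {

    // Assumed input: one particle per line, "x y z", with coordinates in [0, BOX).
    static final double BOX = 1.0;      // simulation box size (assumption)
    static final int CELLS = 512;       // grid cells per dimension (assumption)
    static final int MIN_POINTS = 100;  // density threshold per cell (assumption)

    public static class BinMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            String[] p = line.toString().trim().split("\\s+");
            if (p.length < 3) return;   // skip malformed records
            int ix = (int) (Double.parseDouble(p[0]) / BOX * CELLS);
            int iy = (int) (Double.parseDouble(p[1]) / BOX * CELLS);
            int iz = (int) (Double.parseDouble(p[2]) / BOX * CELLS);
            // key = grid cell, value = the particle record itself
            context.write(new Text(ix + "," + iy + "," + iz), line);
        }
    }

    public static class DenseCellReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text cell, Iterable<Text> particles, Context context)
                throws IOException, InterruptedException {
            // Buffer this cell's particles so we can count before deciding to keep them.
            List<String> kept = new ArrayList<String>();
            for (Text t : particles) {
                kept.add(t.toString());
            }
            if (kept.size() >= MIN_POINTS) {     // dense cell: keep its points
                for (String s : kept) {
                    context.write(cell, new Text(s));
                }
            }
            // sparse cells emit nothing and are shaved off for the next iteration
        }
    }
}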

Distributed, Scalable Clustering for Detecting Halos in Terascale Astronomy (Dhandapani et al., LNCS 2011)

Result demonstration A movie demonstrating the clustering results of a 20k-particle data set: http://taccweb.austin.utexas.edu/staff/home/xwj/public_html/20k.mpeg First you see the original dataset, with each point plotted as a dot. Points from one cluster (cluster 11) are then shown with isosurface rendering at a density of 10.0, using a 32^3 radial Gaussian kernel on a 512^3 grid for the density calculation. The points of the cluster are then added to the isosurface, followed by a volume rendering of the density function with opacity ramping up from density=10 to the data set maximum. Movie courtesy of Greg Abram.

Other Ongoing Projects Utilizing Hadoop on Longhorn TACC-involved projects: large-scale document collection summarization and analysis with iterative density-based clustering (Bae et al., ICDAR 2011); classification of web page archives with supervised learning. User projects: improving blog ranking by mining memes, PI: Matt Lease, iSchool; TextGrounder, geospatial tagging of texts, PI: Jason Baldridge, UTCL.

Acknowledgment -- Enabling Cloud Computing at UT Austin People: TACC staff: Weijia Xu, Data Mining & Statistics. External collaborators: Dr. Matthew Lease, Department of Computer Science / iSchool; Dr. Jason Baldridge, Computational Linguistics. Related references and links: Hadoop @ TACC wiki site: https://sites.google.com/site/tacchadoop/ Feature story on recent project progress: http://www.utexas.edu/cio/itgovernance/lift/articles/cloud-computing-update.php Funding is provided through the Longhorn Innovation Fund for Technology (LIFT).

Acknowledgement -- Halo Detection Project People: TACC staff: Weijia Xu, Data Mining & Statistics; Paul Navratil, Visualization & Data Analysis. External collaborators: Dr. Joydeep Ghosh, ECE, UT Austin; Sankari Dhandapani, graduate student; Sri Vatsava, graduate student; Dr. Nena Marin, Pervasive Inc., Austin, TX; Dr. Ilian Iliev, Astronomy, University of Sussex, UK. Related publications: Presented at KDCloud 10 during ICDM 10: Daruru, S., Dhandapani, S., Gupta, G., Iliev, I., Xu, W., Navratil, P., Marin, N. and Ghosh, J. (2010) Distributed, Scalable Clustering for Detecting Halos in Terascale Astronomy, in workshop proceedings of the IEEE International Conference on Data Mining: Knowledge Discovery Using Cloud and Distributed Computing Platforms (KDCloud 10), Dec. 2010, Sydney, Australia, IEEE Computer Society, Washington, DC, USA, pp. 138-147. Extended version to appear in LNCS: Dhandapani, S., Daruru, S., Gupta, G., Iliev, I., Xu, W., Navratil, P., Marin, N. and Ghosh, J. (2011) Distributed, Scalable Clustering for Detecting Halos in Terascale Astronomy Datasets, Lecture Notes in Computer Science (LNCS), in press.

Question and Discussion

Additional References Hadoop @ TACC wiki site: https://sites.google.com/site/tacchadoop/ Feature story on recent project progress: http://www.utexas.edu/cio/itgovernance/lift/articles/cloud-computing-update.php