PaRFR : Parallel Random Forest Regression on Hadoop for Multivariate Quantitative Trait Loci Mapping. Version 1.0, Oct 2012



This document describes PaRFR, a Java package that implements a parallel random forest algorithm for regression tasks with multivariate responses. The package has been designed for quantitative trait mapping with a very large number of genetic markers (SNPs) and high-dimensional traits. The software can be run on a single machine as well as on a private or commercial cluster.

We assume that the user has some familiarity with Hadoop. Some guidance can be found on the Apache Hadoop web site:
http://hadoop.apache.org/common/docs/r0.20.2/quickstart.html

Usage:

The PaRFR software (DRF.jar) is run in Hadoop as follows:

hadoop jar DRF.jar -ss sample_size -gp genotype_path -pp phenotype_path -op output_path -m mtry -n ntree -tgp test_genotype_path -tpp test_phenotype_path -o oob_flag -vp varprox_flag -d distance_flag -c covar_flag -nm #maptasks -nr #reducetasks -ms #minimum_split_size

Argument explanation:

sample_size (required) is an integer giving the sample size of the dataset. This is a required argument.

genotype_path (required) is the name of the file containing the genotype data. Assuming that the sample size is N, the file must contain N rows (one for each subject), and each row must contain P values, each one representing the minor allele dosage at one SNP, separated by a space. This format is called "raw" in the Plink software for genetic analysis; for more information, the user is recommended to consult the Plink documentation:
http://pngu.mgh.harvard.edu/~purcell/plink/dataman.shtml#recode

An example is given below:

0 1 2 1
2 1 0 0
...
2 1 0 2

This is a required argument.

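A file in this layout can typically be produced from a Plink binary fileset with Plink's --recodeA option. The commands below are only a sketch and are not part of PaRFR; they assume a hypothetical fileset named mydata and that the resulting .raw file is space-delimited with a one-line header and six leading pedigree columns (FID, IID, PAT, MAT, SEX, PHENOTYPE) before the dosage values, which therefore need to be stripped (missing dosages appear as NA and must be handled separately):

plink --bfile mydata --recodeA --out geno_raw
# drop the header line and the six pedigree columns, keeping only the N x P dosage matrix
tail -n +2 geno_raw.raw | cut -d' ' -f7- > genotype.txt
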
phenotype_path (required) is the name of the file containing the phenotype data, i.e. the quantitative traits. Assuming that the sample size is N, the file must contain N rows (one for each subject), and each row must contain Q real-valued responses representing the quantitative traits, separated by a space. An example is given below:

1.2 2.4 1.5
1.8 2.9 1.3
...
3.1 0.9 2.5

Each row is the phenotype of one individual, and the phenotypes within a row are separated by a space. For a univariate phenotype, each row contains a single value and no space is needed. This is a required argument.

output_path (required) is the name of the folder in which to store the output files. Please make sure the output folder does not already exist. This is a required argument.

mtry (optional) is an integer specifying how many predictors to use at each node. If not provided, the default value is P/3.

ntree (optional) is an integer specifying how many trees to build. If not provided, the default value is 500.

If the user wants to run the regression on test data, the following two arguments are needed. They are supported only when oob_flag is on and varprox_flag is off, so only the OOB error estimation is available for test data.

test_genotype_path (optional) is the path to the test genotype file.

test_phenotype_path (optional) is the path to the test phenotype file.

oob_flag (optional) is a 0/1 indicator of whether to calculate the OOB error of the forest.
0: calculation of the OOB error is disabled.
1: calculation of the OOB error is enabled.

varprox_flag (optional) is a 0/1/2 indicator of whether to calculate variable importance and the proximity matrix.
0: calculation of variable importance is disabled.
1: calculate the information-gain importance score and the proximity matrix.
2: calculate the permutation importance score and the proximity matrix.
If not provided, the default value is 0.

distance_flag (optional) is a 0/1 indicator of whether to use the distance-based RF.
0: use the standard RF.
1: use the distance-based RF.

covar_flag (optional) is a 0/1 indicator of whether to consider the dependence between phenotypes in the forest.
0: calculation of the covariance between phenotypes is disabled.
1: calculation of the covariance between phenotypes is enabled.

#maptasks (optional) is the number of map tasks to launch for the job. If it is not given, a default value of 10 is used.

#reducetasks (optional) is the number of reduce tasks to launch for the job. If it is not given, the cluster's default number of reduce tasks is used.

#minimum_split_size (optional) is the sample-size threshold used to decide whether a node should be split further; a node whose sample size falls below this value is returned as a terminal node. For multivariate regression analysis, the default value of 20 is recommended.

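Before launching a job, it can be worth checking that the input files are consistent with the declared sample size. The commands below are just a suggested sanity check, assuming the input files are named genotype.txt and phenotype.txt:

# both files should contain exactly sample_size (N) lines
wc -l genotype.txt phenotype.txt
# number of values per row (P SNPs / Q traits) in the first line of each file
head -1 genotype.txt | wc -w
head -1 phenotype.txt | wc -w
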
Example run:

hadoop jar DRF.jar -ss 100 -gp data/genotype.txt -pp data/phenotype.txt -op output/ -n 500 -m 300 -o 1 -v 1 -d 1 -c 1 -nm 10 -nr 2 -ms 5

The order of the parameters does not matter. In this example we run PaRFR using the genotype and phenotype files from the data/ folder (-gp data/genotype.txt -pp data/phenotype.txt); the output folder is output/ (-op output/); the number of trees is 500 (-n 500); and the number of variables to select at each node is 300 (-m 300). The calculation includes the OOB error (-o 1) as well as the information-gain importance score (-v 1), using the distance-based random forest (-d 1), which considers the dependence between phenotypes (-c 1).

Output:

The output folder stores the output of all the reduce tasks. Generally these files have the format part-r-0000x. It is recommended to merge all the resulting output files into a single file using, for instance, the following command:

cat part-r-000* > results.txt

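If the output folder lives on HDFS, the part files can also be fetched and concatenated in one step with Hadoop's getmerge command; the line below assumes the output folder used in the example above:

hadoop fs -getmerge output/ results.txt
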
The results.txt file contains each sample's OOB error and each SNP's importance score. To sort the file, use the following command:

sort -gk 1 results.txt > sorted.txt

Remember that this file holds two kinds of results: the OOB errors and the variable importance scores. For post-processing you need to sort the file so that the top part contains the OOB errors and the bottom part contains the variable importance scores.

The sorted.txt file contains two columns: the first column is the id and the second column is the value. Since the file contains both OOB errors and importance scores, the first sample_size rows give each sample id and its MSE, and the remaining rows give each SNP id and its importance score. The true SNP id in the original data is obtained by subtracting (sample_size - 1) from the id in the first column of sorted.txt.

Below is the output format obtained with -o 1 -v 1. Suppose there are 100 samples and 1000 SNPs; the sorted file then has 1100 lines, the first 100 containing the OOB error of each sample and the last 1000 containing the SNP importance scores. The output looks like this:

0 12.3   (first sample)
1 2.3
...
99 1.2
100 23   (first SNP)
101 34
...
1099 38

Because the indices start from 0, the first SNP (whose true id should be 1) appears on row 101 but has id 100 in this file, so 99 must be subtracted to recover the true id, which is 1.

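For convenience, the two parts of the sorted file can be split and the SNP ids remapped with a short script. The commands below are only a sketch, not part of PaRFR, and assume the 100-sample example above (sample_size = 100, so the offset is 99):

# rows 1..100: per-sample OOB errors (sample id, MSE)
awk 'NR <= 100' sorted.txt > oob_error.txt
# remaining rows: importance scores, with ids shifted back to true SNP ids
awk 'NR > 100 { print $1 - 99, $2 }' sorted.txt > snp_importance.txt
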
How to run PaRFR on your Hadoop cluster

First make sure that the namenode, datanodes, tasktrackers and jobtracker are all launched correctly.

1) Create the input folder in the HDFS filesystem:

hadoop fs -mkdir data/

2) Put the data files into the input folder in the file system:

hadoop fs -put genotype.txt data/
hadoop fs -put phenotype.txt data/

3) Run the program using the following command:

hadoop jar DRF.jar -ss 100 -gp data/genotype.txt -pp data/phenotype.txt -op output/

Note: in the MapR distribution, you should use the full path prefixed with maprfs:/ for the file paths instead of the relative paths above. For example, data/genotype.txt is the relative path to /user/username/data/genotype.txt; in the MapR distribution the path should be given as maprfs:/user/username/data/genotype.txt. This is a known issue with MapR that will hopefully be fixed in a future version.

How to run PaRFR on a commercial cluster

Besides running the program on your private Hadoop cluster, users can run it on Hadoop clusters provided by cloud providers, such as Amazon Elastic Compute Cloud (EC2) or Amazon Elastic MapReduce.

1) Get a Hadoop cluster running in the cloud.

a) For how to use Amazon Elastic Compute Cloud, users can consult the documentation at http://docs.amazonwebservices.com/awsec2/latest/gettingstartedguide/. We suggest using the tool shipped with the Hadoop package to launch a cluster with Hadoop already set up; the tool is located under /src/contrib/ec2/bin in the Hadoop package, and instructions can be found there or at http://wiki.apache.org/hadoop/amazonec2

b) For Amazon Elastic MapReduce, users can consult the documentation at http://aws.amazon.com/elasticmapreduce/

2) How to run our program.

a) Upload your jar file and your data to the cloud.

b) Run the jar file exactly as you would on your own cluster; instructions can be found above.

Credits

The software was designed by Giovanni Montana and Yue Wang, and written by Yue Wang while visiting the Department of Mathematics at Imperial College, with financial support from an NGS scholarship. GM acknowledges support from the EPSRC and the Wellcome Trust.