PaRFR: Parallel Random Forest Regression on Hadoop for Multivariate Quantitative Trait Loci Mapping
Version 1.0, Oct 2012

This document describes PaRFR, a Java package that implements a parallel random forest algorithm for regression tasks with multivariate responses. The package has been designed for quantitative trait mapping with a very large number of genetic markers (SNPs) and high-dimensional traits. The software can be run on a single machine as well as on a private or commercial cluster. We assume that the user has some familiarity with Hadoop. Some guidance can be found on the Apache Hadoop web site:
http://hadoop.apache.org/common/docs/r0.20.2/quickstart.html

Usage:

The PaRFR software (DRF.jar) is run in Hadoop as follows:

hadoop jar DRF.jar -ss sample_size -gp genotype_path -pp phenotype_path -op output_path -m mtry -n ntree -tgp test_genotype_path -tpp test_phenotype_path -o oob_flag -vp varpro_flag -d distance_flag -c covar_flag -nm #maptasks -nr #reducetasks -ms #minimum_split_size

Argument explanation:

sample_size (required) is an integer representing the sample size of the dataset.

genotype_path (required) is the name of the file containing the genotype data. Assuming that the sample size is N, the file must contain N rows (one for each subject), and each row must contain P values, each one representing the minor allele dosage at a SNP, separated by a space. This format is called raw in the Plink software for genetic analysis; for more information, the user is recommended to consult the Plink documentation:
http://pngu.mgh.harvard.edu/~purcell/plink/dataman.shtml#recode
An example is given below:

0 1 2 1
2 1 0 0
...
2 1 0 2
phenotype_path (required) is the name of the file containing the phenotype data, i.e. the quantitative traits. Assuming that the sample size is N, the file must contain N rows (one for each subject), and each row must contain Q real-valued responses representing the quantitative traits, separated by a space. An example is given below:

1.2 2.4 1.5
1.8 2.9 1.3
...
3.1 0.9 2.5

Each row is the phenotype of an individual, and the values within a row are separated by a space. When the phenotype is univariate, each row contains a single value and no spaces are needed. (A quick way to check that both input files have the expected dimensions is sketched after the argument list below.)

output_path (required) is the name of the folder in which to store the output files. Please make sure that this folder does not already exist.

mtry (optional) is an integer specifying how many predictors to consider at each node. If not provided, the default value is P/3.

ntree (optional) is an integer specifying how many trees to build. If not provided, the default value is 500.

If the user wants to run the regression on test data, the following two arguments are needed. This is supported only when oob_flag is on and varpro_flag is off, so only the OOB error estimation is available for test data.

test_genotype_path (optional) is the path to the test genotype file.

test_phenotype_path (optional) is the path to the test phenotype file.

oob_flag (optional) is a 0 or 1 indicator of whether to calculate the OOB error of the forest.
0: calculation of the OOB error is disabled.
1: calculation of the OOB error is enabled.

varpro_flag (optional) is a 0, 1 or 2 indicator of whether to calculate variable importance and the proximity matrix.
0: calculation of variable importance is disabled.
1: calculate the information gain importance score and the proximity matrix.
2: calculate the permutation importance score and the proximity matrix.
If not provided, the default value is 0.

dist_flag (optional) is a 0 or 1 indicator of whether to use the distance-based RF.
0: use the standard RF.
1: use the distance-based RF.

covar_flag (optional) is a 0 or 1 indicator of whether to consider the dependence between phenotypes.
0: calculation of the covariance between phenotypes is disabled.
1: calculation of the covariance between phenotypes is enabled.

#maptasks (optional) is the number of map tasks to launch for the job. If not given, a default value of 10 is used.

#reducetasks (optional) is the number of reduce tasks to launch for the job. If not given, the cluster's default configuration for the number of reduce tasks is used.

#minimum_split_size (optional) is the minimum sample size used to decide whether a node should be split further; a node whose sample size falls below this value is returned as a terminal node. For multivariate regression analysis, the default value of 20 is recommended.
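Before launching a job, it can be worth checking that the input files have the dimensions described above (N rows each; P columns for the genotype file, Q for the phenotype file). Below is a minimal sketch using standard Unix tools; the file names genotype.txt and phenotype.txt match the example run that follows:

# the line counts should both equal the sample size N
wc -l genotype.txt phenotype.txt
# each command should print a single number: the column count (P or Q)
awk '{print NF}' genotype.txt | sort -u
awk '{print NF}' phenotype.txt | sort -u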
Example run:

hadoop jar DRF.jar -ss 100 -gp data/genotype.txt -pp data/phenotype.txt -op output/ -n 500 -m 300 -o 1 -v 1 -d 1 -c 1 -nm 10 -nr 2 -ms 5

The order of the parameters does not matter. In this example, we run PaRFR using the genotype and phenotype files from the data/ folder (-gp data/genotype.txt -pp data/phenotype.txt), with output/ as the output folder (-op output/). The forest contains 500 trees (-n 500) and 300 variables are considered at each node (-m 300). The run calculates the OOB error (-o 1) as well as the information-gain-based importance score (-v 1), using the distance-based random forest (-d 1), which considers the dependence between phenotypes (-c 1). Ten map tasks (-nm 10) and two reduce tasks (-nr 2) are launched, and the minimum split size is 5 (-ms 5).

Output:

The output folder stores the output of all the reduce tasks. Generally these files have names of the form part-r-0000x. It is recommended to merge all the resulting output files into one file using, for instance, the following command:

cat part-r-000* > results.txt
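If the part files are still stored in HDFS rather than on the local file system, they can be copied and merged in a single step with Hadoop's getmerge command. A minimal sketch, assuming the output/ folder from the example run above:

hadoop fs -getmerge output/ results.txt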
The results.txt file contains each sample's OOB error rate and each SNP's importance score. To sort the file, use the following command:

sort -gk 1 results.txt > sorted.txt

Remember that this file contains two kinds of results: OOB error rates and variable importance scores. For post-processing, you need to sort the file so that the top part contains the OOB error rates and the bottom part contains the variable importance scores.

The sorted.txt file contains two columns: the first column is an id and the second column is a value. Since the file contains both OOB errors and importance scores, the first sample_size rows give each sample's id and its MSE; the remaining rows give each SNP's id and its importance score. However, the true SNP id in the original data must be obtained by subtracting sample_size - 1 from the id in the first column of sorted.txt.

Below is the output format when using -o 1 -v 1. Suppose there are 100 samples and 1000 SNPs. The sorted file then has 1100 lines: the first 100 lines contain the OOB error rates for the 100 samples, and the last 1000 lines contain the SNP importance scores. The output looks like this:

0 12.3    (first sample)
1 2.3
...
99 1.2
100 23    (first SNP)
101 34
...
1099 38

Because the index starts from 0, the first SNP (whose true id should be 1) appears on the 101st row of this file, but its id in the file is 100; subtracting 99 gives the true id, which is 1.
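To split the sorted file into its two parts and recover the original SNP ids, something like the following can be used. This is a minimal sketch assuming the example above with 100 samples; oob_errors.txt, importance.txt and importance_by_snp.txt are just illustrative file names, and N should be set to the sample size used in your run:

N=100
# first N rows: sample id and its OOB error (MSE)
head -n $N sorted.txt > oob_errors.txt
# remaining rows: internal SNP id and its importance score
tail -n +$((N+1)) sorted.txt > importance.txt
# subtract N-1 from the internal id to recover the original SNP id
awk -v n=$N '{print $1-(n-1), $2}' importance.txt > importance_by_snp.txt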
How to run PaRFR on your Hadoop cluster

First make sure that the namenode, datanode, tasktracker and jobtracker are all launched correctly.

1) Create the input folder in the HDFS file system:

hadoop fs -mkdir data/

2) Put the data files into the input folder in the file system:

hadoop fs -put genotype.txt data/
hadoop fs -put phenotype.txt data/

3) Run the program using the following command:

hadoop jar DRF.jar -ss 100 -gp data/genotype.txt -pp data/phenotype.txt -op output/

Note: with the MapR distribution, you should use the full path prefixed with maprfs:/ instead of the relative paths used above. For example, data/genotype.txt is the relative form of /user/username/data/genotype.txt; with MapR the path should be written as maprfs:/user/username/data/genotype.txt. This is a known issue of MapR and will hopefully be fixed in a new version.

How to run PaRFR on a commercial cluster

Besides running the program on your private Hadoop cluster, users can run it on any Hadoop cluster offered by a cloud provider, such as Amazon Elastic Compute Cloud (EC2) or Amazon Elastic MapReduce.

1) Get a Hadoop cluster running on the cloud
a) For Amazon Elastic Compute Cloud, users can download the documentation from http://docs.amazonwebservices.com/awsec2/latest/gettingstartedguide/. We suggest using the tool shipped with the Hadoop package, under /src/contrib/ec2/bin, to launch a cluster with Hadoop already set up; the instructions can be found there, or at http://wiki.apache.org/hadoop/amazonec2
b) For Amazon Elastic MapReduce, users can download the documentation from http://aws.amazon.com/elasticmapreduce/
2) Run our program
a) Upload your jar file and your data to the cloud.
b) Run the jar file exactly as on your own cluster; the instructions can be found above.

Credits

The software was designed by Giovanni Montana and Yue Wang, and written by Yue Wang while visiting the Department of Mathematics at Imperial College, financially supported by an NGS scholarship. GM acknowledges support from the EPSRC and the Wellcome Trust.