cloud-kepler Documentation
Release 1.2
Scott Fleming, Andrea Zonca, Jack Flowers, Peter McCullough, El
July 31, 2014
Contents

1 System configuration
    1.1 Python and Virtualenv setup
    1.2 Hadoop setup
    1.3 Lein setup
    1.4 LEMUR setup
    1.5 References
2 Quickstart Guide
    2.1 Specifying the data to download
    2.2 Configuration file options
3 Retrieving and downloading data
    3.1 get_data: Get data from MAST or hard disk
    3.2 join_quarters: Stitch multiple quarters of data together
4 BLS pulse algorithm
    4.1 drive_bls_pulse: Driver interface to BLS pulse
    4.2 bls_pulse_python: Naive pure Python implementation
    4.3 bls_pulse_vec: Vectorized Python implementation
    4.4 bls_pulse_cython: Optimized Cython implementation
5 detrend: Detrend lightcurve data
6 clean_signal: Signal cleaning (removal of strong periodic signals)
7 postprocessing: Analyze output from BLS pulse
8 utils: Utility functions
cloud-kepler is a cloud-enabled Kepler planet-searching pipeline.
CHAPTER 1
System configuration

1.1 Python and Virtualenv setup

To set up Python and Virtualenv, run the following commands from a terminal:

cd ~/temp
curl -L -o virtualenv.py https://raw.github.com/pypa/virtualenv/master/virtualenv.py
python virtualenv.py cloud-kepler --no-site-packages
. cloud-kepler/bin/activate
pip install numpy
pip install simplejson
pip install pyfits

Test that the basic Python code is working:

cat {DIRECTORY_WITH_CLOUD_KEPLER}/test/test_q1.txt | python {DIRECTORY_WITH_CLOUD_KEPLER}/python/down

If it starts downloading and spewing base64-encoded numpy arrays, then you're good.

1.2 Hadoop setup

Install Oracle VM VirtualBox 4.2.14 (VirtualBox-4.2.14-86644-win) from https://www.virtualbox.org/wiki/downloads

Extract cloudera-quickstart-demo-vm-4.3.0-virtualbox.tar.gz from https://ccp.cloudera.com/display/support/cloudera+quickstart+v

Enter the created folder and extract cloudera-quickstart-demo-vm-4.3.0-virtualbox.tar; you should end up with cloudera-quickstart-demo-vm.ovf and cloudera-quickstart-demo-vm.vmdk in whatever folder you extracted to.

Open Oracle VM VirtualBox Manager and select the New icon; the Create Virtual Machine window opens. For operating system, select Linux and Ubuntu. For memory size, select 4096 MB. For Hard Drive, select "Use an existing virtual hard drive" and give the path to cloudera-quickstart-demo-vm.vmdk. Press Create. The virtual machine is now selectable in the main window of the VirtualBox Manager. Press the Settings button to open the settings window, then choose the System tab.
Change the chipset to ICH9 and make sure Enable IO APIC is checked. Select the virtual machine and press Start; booting begins, and this part takes a little while. If it gets stuck on any one step for more than 20 minutes, you can assume something is wrong. Eventually the boot sequence will end and you will see a desktop in your virtual machine. Success!

1.2.1 WordCount Example

Note that this assumes a Cloudera VM distribution of Hadoop. Inside your virtual machine, go to the Cloudera Hadoop Tutorial at http://www.cloudera.com/content/clouderacontent/cloudera-docs/hadooptutorial/cdh4/hadoop-tutorial/ht_topic_5_1.html

Copy the source code for WordCount and paste it into the gedit text editor. Save it as WordCount.java in the cloudera user's home folder. Per the instructions there, open a terminal, cd to the home directory, then run:

mkdir wordcount_classes
javac -cp /usr/lib/hadoop/*:/usr/lib/hadoop/client-0.20/* -d wordcount_classes WordCount.java

Right-click on the wordcount_classes folder you made (it will be in the home directory) and select Compress. Choose .jar as the file format and wordcount as the filename. Then:

echo "Hello World Bye World" > file0
echo "Hello Hadoop Goodbye Hadoop" > file1
hadoop fs -mkdir /user/cloudera /user/cloudera/wordcount /user/cloudera/wordcount/input
hadoop fs -put file* /user/cloudera/wordcount/input
hadoop jar wordcount.jar org.myorg.WordCount /user/cloudera/wordcount/input output

According to the Cloudera tutorial, this should be all you need to do, but I got an error message here, so everything is not quite right yet. When you first log onto the virtual machine, it should begin with a Firefox window open to some kind of Cloudera page. Go to this and click the Cloudera Manager link. Enter admin as both the username and password to access it. Now you can see the health of your setup's various components. mapreduce1 will probably be listed as in poor health; click on it, and you should see that the jobtracker is the problem.
Return to the terminal:

sudo -u hdfs hadoop fs -mkdir /tmp/mapred/system
sudo -u hdfs hadoop fs -chown mapred:hadoop /tmp/mapred/system

Then restart the jobtracker by clicking the Instances tab, clicking on jobtracker, going to the Processes tab, selecting the Actions menu in the corner, and selecting Restart. Then rerun:

hadoop jar wordcount.jar org.myorg.WordCount /user/cloudera/wordcount/input output

This time it should work:

hadoop fs -cat output/part-00000

This prints the output of the Hadoop run. It should look like this:
Bye 1
Goodbye 1
Hadoop 2
Hello 2
World 2

If it looks like that, then you are good. It is worth noting that Hadoop won't work unless the directory you set as your output both does not currently exist and is in your Hadoop fs home directory.

1.3 Lein setup

Note that this assumes a Cloudera VM distribution of Hadoop. You can find Lein at https://github.com/technomancy/leiningen

Download the script from https://raw.github.com/technomancy/leiningen/stable/bin/lein and place it wherever you want. Then:

export HOME=/home
cd
cd ..
cd etc/profile.d
sudo vim lein.sh

On one line of the file write:

export PATH=$PATH:{wherever your lein file is located}

(in my case /home/cloudera/desktop). Save the file and exit. Exit and reenter the terminal to get back to your home directory. Then:

chmod 755 {location of lein}

Lein should now be functioning; call lein in the terminal to test.

1.4 LEMUR setup

Note that this assumes a Cloudera VM distribution of Hadoop. LEMUR can be downloaded from http://download.climate.com/lemur/releases/lemur-1.3.1.tgz; follow that link and the file should appear in your download folder. Extract it, and then put it wherever you want it to be:

export HOME=/home
cd
cd ..
cd etc/profile.d
sudo vim lemur.sh

You are now writing a file which will allow your system to recognize LEMUR. On the first line of the file write:

export LEMUR_HOME={wherever you saved your lemur file}

(in my case /home/cloudera/desktop/lemur). On the second line of the file write:

export LEMUR_AWS_ACCESS_KEY={your aws access key}
On the third line of the file write:

export LEMUR_AWS_SECRET_KEY={your aws secret key}

On the fourth line of the file write:

export PATH=$PATH:$LEMUR_HOME/bin

Save the file and exit. LEMUR should now work; call lemur in the terminal to test.

1.5 References

Koch, D.G., Borucki, W.J., Basri, G., et al. 2010, The Astrophysical Journal Letters, 713, L79, doi:10.1088/2041-8205/713/2/L79
Kovacs, G., Zucker, S., & Mazeh, T. 2002, Astronomy & Astrophysics, 391, 369, doi:10.1051/0004-6361:20020802
Still, M., & Barclay, T. 2012, Astrophysics Source Code Library, 8004
LEMUR launcher, Limote, M., et al. 2012, The Climate Corporation
CHAPTER 2
Quickstart Guide

A normal run of cloud-kepler can be started by:

more input.txt | python get_data.py mast | python join_quarters.py | python drive_bls_pulse.py -c con

This sequence downloads all the data from MAST and runs it through the algorithm with the parameters in a configuration file.

2.1 Specifying the data to download

The input file (or lines typed directly to stdin) should include the KIC ID, quarter number, and cadence identifier on each line, such as:

011013072 1 llc
011013072 2 slc
011600006 * llc

The special quarter identifier * will download all available quarters for the given KIC ID. slc indicates short-cadence data and llc indicates long-cadence data. The Python script get_data.py also accepts the keyword data followed by an absolute or relative filepath of a top-level data directory with the same structure as the Kepler archive on MAST; use this option instead of mast if your data is stored locally.

2.2 Configuration file options

There are several options that can be specified in a configuration file; the same options can be specified via command-line options, but they will be overridden by the file if it is provided (with the -c flag). A standard configuration file looks like:

[DEFAULT]
segment = 2
min_duration = 0.01
max_duration = 0.5
n_bins = 1000
direction = 0
mode = cython
print_format = encode
verbose = no
profiling = off
Additional options will be added as needed, such as for detrending flags.
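The options above can be loaded with Python's standard configparser module. The sketch below shows one way to read them, with the type of each option assumed from the example values; how drive_bls_pulse.py actually parses its configuration may differ:

```python
import configparser
from io import StringIO

CONFIG_TEXT = """\
[DEFAULT]
segment = 2
min_duration = 0.01
max_duration = 0.5
n_bins = 1000
direction = 0
mode = cython
print_format = encode
verbose = no
profiling = off
"""

def read_bls_config(fp):
    # Parse the [DEFAULT] section into typed values; types here are
    # inferred from the example file, not from the actual driver code.
    cfg = configparser.ConfigParser()
    cfg.read_file(fp)
    d = cfg['DEFAULT']
    return {
        'segment': d.getfloat('segment'),
        'min_duration': d.getfloat('min_duration'),
        'max_duration': d.getfloat('max_duration'),
        'n_bins': d.getint('n_bins'),
        'direction': d.getint('direction'),
        'mode': d.get('mode'),
        'print_format': d.get('print_format'),
        'verbose': d.getboolean('verbose'),   # 'no'/'off' parse as False
        'profiling': d.getboolean('profiling'),
    }

opts = read_bls_config(StringIO(CONFIG_TEXT))
```

In a real run the same function would be given an open handle to the file passed with -c instead of the in-memory example.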
CHAPTER 3
Retrieving and downloading data

3.1 get_data: Get data from MAST or hard disk

3.2 join_quarters: Stitch multiple quarters of data together
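As a sketch of how the input specification from the Quickstart Guide (KIC ID, quarter number or *, and cadence identifier) might be parsed, consider the hypothetical helper below; it illustrates the documented line format, not the actual get_data code:

```python
def parse_input_line(line):
    # Each line: KIC ID, quarter number (or '*' for all quarters),
    # cadence identifier ('llc' = long cadence, 'slc' = short cadence).
    kic_id, quarter, cadence = line.split()
    if cadence not in ('llc', 'slc'):
        raise ValueError('cadence must be llc or slc, got %r' % cadence)
    return {
        'kic_id': kic_id,
        'quarter': None if quarter == '*' else int(quarter),  # None = all
        'cadence': cadence,
    }

spec = parse_input_line('011600006 * llc')
```

Note that the KIC ID is kept as a string so that its leading zeros survive.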
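Section 1.1 mentions that a working pipeline spews base64-encoded numpy arrays. The exact record layout cloud-kepler uses is not documented in this guide, so the round trip below is only an illustrative guess (raw little-endian float64 bytes wrapped in base64), not the pipeline's actual format:

```python
import base64
import numpy as np

def encode_array(arr):
    # Serialize a numpy array as a base64 string; the assumed wire
    # format is raw little-endian float64 bytes.
    return base64.b64encode(arr.astype('<f8').tobytes()).decode('ascii')

def decode_array(s):
    # Inverse of encode_array: base64 string back to a float64 array.
    return np.frombuffer(base64.b64decode(s), dtype='<f8')

flux = np.array([1.0, 0.999, 1.001])
restored = decode_array(encode_array(flux))
```

A single base64 string per record keeps the stream line-oriented, which is what Hadoop streaming expects between mapper and reducer stages.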
CHAPTER 4
BLS pulse algorithm

4.1 drive_bls_pulse: Driver interface to BLS pulse

4.2 bls_pulse_python: Naive pure Python implementation

4.3 bls_pulse_vec: Vectorized Python implementation

4.4 bls_pulse_cython: Optimized Cython implementation
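As an illustration of the kind of search these modules perform (not the actual cloud-kepler code), a naive box search bins a segment of the lightcurve into n_bins and, for every box start and every duration between min_duration and max_duration (the options from the configuration file in chapter 2), scores how far the box's mean flux sits from the segment mean:

```python
import numpy as np

def naive_box_search(time, flux, n_bins=100, min_duration=0.01, max_duration=0.5):
    # Bin the (time, flux) segment, then try every box whose binned width
    # corresponds to a duration in [min_duration, max_duration] (in the
    # units of `time`), keeping the box whose mean deviates most from the
    # overall mean. Returns (depth, start_bin, width_in_bins).
    t0, t1 = time.min(), time.max()
    bin_width = (t1 - t0) / n_bins
    idx = np.minimum(((time - t0) / bin_width).astype(int), n_bins - 1)
    binned = np.array([flux[idx == i].mean() if np.any(idx == i) else np.nan
                       for i in range(n_bins)])
    base = np.nanmean(binned)
    wmin = max(1, int(min_duration / bin_width))
    wmax = max(wmin, int(max_duration / bin_width))
    best = (0.0, None, None)
    for w in range(wmin, wmax + 1):
        for s in range(0, n_bins - w + 1):
            depth = base - np.nanmean(binned[s:s + w])  # positive for a dip
            if abs(depth) > abs(best[0]):
                best = (depth, s, w)
    return best
```

The real implementations are far more careful (weighting by per-point uncertainties, searching both dips and blips via the direction option, and vectorizing or compiling the double loop), but the structure of the search is the same.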
CHAPTER 5
detrend: Detrend lightcurve data
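The cloud-kepler detrend module itself is not reproduced here, but a common approach to detrending Kepler quarters is to fit a low-order polynomial to the raw flux and divide it out, leaving a flat lightcurve with values near 1; a minimal generic sketch:

```python
import numpy as np

def polyfit_detrend(time, flux, order=3):
    # Fit a low-order polynomial trend to the raw flux and divide it
    # out. This is a generic illustration of detrending, not the
    # cloud-kepler detrend module.
    coeffs = np.polyfit(time, flux, order)
    trend = np.polyval(coeffs, time)
    return flux / trend

rng = np.random.default_rng(42)
t = np.linspace(0.0, 30.0, 500)
# Synthetic quarter: a linear instrumental trend times small noise.
raw = (1000.0 + 5.0 * t) * (1.0 + 0.001 * rng.standard_normal(t.size))
flat = polyfit_detrend(t, raw)
```

Dividing (rather than subtracting) the trend preserves relative transit depths, which is what a box search on normalized flux needs.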
CHAPTER 6
clean_signal: Signal cleaning (removal of strong periodic signals)
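One simple way to remove a strong periodic signal (again a generic illustration, not the actual clean_signal implementation) is to locate the dominant frequency with an FFT and subtract a least-squares sinusoid fit at that frequency:

```python
import numpy as np

def remove_dominant_sinusoid(time, flux):
    # Assumes evenly sampled data. Find the strongest Fourier component
    # of the mean-subtracted flux, then subtract a least-squares
    # sin/cos fit at that single frequency.
    dt = time[1] - time[0]
    resid = flux - flux.mean()
    freqs = np.fft.rfftfreq(time.size, d=dt)
    power = np.abs(np.fft.rfft(resid))
    f0 = freqs[np.argmax(power[1:]) + 1]  # skip the zero-frequency bin
    A = np.column_stack([np.sin(2 * np.pi * f0 * time),
                         np.cos(2 * np.pi * f0 * time)])
    coef, *_ = np.linalg.lstsq(A, resid, rcond=None)
    return flux - A @ coef

t = np.arange(1024) * (20.0 / 1024)
signal = 1.0 + 0.05 * np.sin(2 * np.pi * 0.5 * t + 0.3)
cleaned = remove_dominant_sinusoid(t, signal)
```

In practice this step would be iterated until no strong peak remains, so that stellar variability does not swamp the much shallower transit boxes.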
CHAPTER 7
postprocessing: Analyze output from BLS pulse
CHAPTER 8
utils: Utility functions