Genomatic: an R package for DNA fragment analysis project management

Similar documents

LRmix tutorial, version 4.1

How to Import Data into Microsoft Access

Note: With v3.2, the DocuSign Fetch application was renamed DocuSign Retrieve.

Data Analysis Tools. Tools for Summarizing Data

Lesson 07: MS ACCESS - Handout. Introduction to database (30 mins)

R Commander Tutorial

ACCESS Importing and Exporting Data Files. Information Technology. MS Access 2007 Users Guide. IT Training & Development (818)

SPSS: Getting Started. For Windows

Create a Pipe Network with Didger & Voxler

Using the Bulk Export/Import Feature

Tutorial 2: Reading and Manipulating Files Jason Pienaar and Tom Miller

Data exploration with Microsoft Excel: univariate analysis

User Manual - Sales Lead Tracking Software

CHAPTER 8: MANAGING LEADS

AES Crypt User Guide

Query 4. Lesson Objectives 4. Review 5. Smart Query 5. Create a Smart Query 6. Create a Smart Query Definition from an Ad-hoc Query 9

MicroStrategy Desktop

SPSS for Windows importing and exporting data

IRA Pivot Table Review and Using Analyze to Modify Reports. For help,

COGNOS Query Studio Ad Hoc Reporting

Calc Guide Chapter 9 Data Analysis

DiskPulse DISK CHANGE MONITOR

Step by Step Guide to Importing Genetic Data into JMP Genomics

NDSR Utilities. Creating Backup Files. Chapter 9

FileMaker Pro and Microsoft Office Integration

SeattleSNPs Interactive Tutorial: Web Tools for Site Selection, Linkage Disequilibrium and Haplotype Analysis

Vendor: Crystal Decisions Product: Crystal Reports and Crystal Enterprise

SnapLogic Tutorials Document Release: October 2013 SnapLogic, Inc. 2 West 5th Ave, Fourth Floor San Mateo, California U.S.A.

Rational Rational ClearQuest

Using R for Windows and Macintosh

Microsoft Excel Tips & Tricks

Pharmacy Affairs Branch. Website Database Downloads PUBLIC ACCESS GUIDE

How to Download Census Data from American Factfinder and Display it in ArcMap

Importing Data into R

Data Export User Guide

Statistical Analysis for Genetic Epidemiology (S.A.G.E.) Version 6.2 Graphical User Interface (GUI) Manual

Radius Maps and Notification Mailing Lists

Cal Answers Analysis Training Part I. Creating Analyses in OBIEE

Tivoli Access Manager Agent for Windows Installation Guide

Data exploration with Microsoft Excel: analysing more than one variable

Appendix 2.1 Tabular and Graphical Methods Using Excel

STATGRAPHICS Online. Statistical Analysis and Data Visualization System. Revised 6/21/2012. Copyright 2012 by StatPoint Technologies, Inc.

Installation of ADS SiMKit startup script and designkit on Windows for SiMKit version 4.4

Practice Fusion API Client Installation Guide for Windows

SAP Predictive Analysis Installation

Novell ZENworks Asset Management 7.5

Introduction to the data.table package in R

Librarian. Integrating Secure Workflow and Revision Control into Your Production Environment WHITE PAPER

Creating and Using Databases with Microsoft Access

Access Queries (Office 2003)

UTILITIES BACKUP. Figure 25-1 Backup & Reindex utilities on the Main Menu

Eventia Log Parsing Editor 1.0 Administration Guide

Spreadsheet - Introduction

Results CRM 2012 User Manual

Basics Series-4004 Database Manager and Import Version 9.0

standarich_v1.00: an R package to estimate population allelic richness using standardized sample size

An Introduction to Using the Command Line Interface (CLI) to Work with Files and Directories

Importing and Exporting With SPSS for Windows 17 TUT 117

Dataframes. Lecture 8. Nicholas Christian BIOST 2094 Spring 2011

The Power Loader GUI

Creating Codes with Spreadsheet Upload

User Guide. Analytics Desktop Document Number:

G563 Quantitative Paleontology. SQL databases. An introduction. Department of Geological Sciences Indiana University. (c) 2012, P.

Requirement for Data Migration: Access to both the SMS and SAM servers; A separate application called the SMS Data Preparation Tool (page 9).

SPSS 12 Data Analysis Basics Linda E. Lucek, Ed.D

SAS Analyst for Windows Tutorial

EXCEL IMPORT user guide

Introduction to Microsoft Access 2003

System Administration and Log Management

Help File. Version February, MetaDigger for PC

Microsoft Office Access 2007 Basics

An Introduction to Using the Command Line Interface (CLI) to Work with Files and Directories

SAS 9.4 PC Files Server

Time Clock Import Setup & Use

Market Pricing Override

Decision Support AITS University Administration. Web Intelligence Rich Client 4.1 User Guide

Advanced Excel 10/20/2011 1

Microsoft Access Basics

Groundwater Chemistry

Polynomial Neural Network Discovery Client User Guide

MAS 500 Intelligence Tips and Tricks Booklet Vol. 1

EndNote Beyond the Basics

Microsoft Dynamics GP. Extender User s Guide

CLC Server Command Line Tools USER MANUAL

Vendor: Brio Software Product: Brio Performance Suite

Integration Guide Version 5.3 6/25/2015

Call Recorder User Guide

IBM WebSphere Application Server Version 7.0

MS Access Lab 2. Topic: Tables

Jet Data Manager 2012 User Guide

****Also, if you have done previous promotions and have multiple mailing lists, make sure you select the most recent one.

Exploring Microsoft Office Access Chapter 2: Relational Databases and Multi-Table Queries

Generating a Custom Bill of Materials

Transcription:

Genomatic: an R package for DNA fragment analysis project management Brian J. Knaus Department of Botany and Plant Pathology Oregon State University http://oregonstate.edu/~knausb knausb at science dot oregonstate dot edu genomatic 0.2-0 R 2.7.X Linux and Windows November 2008 1

Acknowledgments: This package would not have been possible without the efforts of the team involved in the Purshia tridentata project. In particular, acknowledgments need to be made to Dr. Rich Cronn (USDA PNW), Dr. Matt Horning (USDA PNW), Angie Rodriguez (Oregon State University) and Jane Marler (Oregon State University). This project would not have been possible without their help! I'd also like to thank Dr. Peter Dolan (Oregon State University) for helpful conversation on the R language, computer science, and mathematics in general. 2

Table of Contents 1 Philosophy, input/output files 4 2 Installation 5 Pre-genotyping tools 3 Creating 96-well plate maps. 7 4 Creating ABI submission forms from 96-well plate maps. 9 Genotype management tools 5 Scoring Genotyper data for input into the Genomatic. 10 6 Formatting Genotyper tables for the Genomatic. 11 7 Managing reruns filling holes. 13 8 Combining multiplexes to a single file. 14 Allele characterization and binning tools 9 Allele characterization. 15 10 Scoring your data into bins. 17 Exporting to statistical software 11 Exporting your data to other software formats. 18 GenAlEx NTSYSpc Create Graphical User Interface 12 tcl/tk Graphical User Interface 19 Appendices A Genomatic functions 20 B User supplied files 21 C Genomatic flowcharts 22 3

1 Philosophy, input/output files A general philosophy of this software is to provide modularity and copious output. Modularity allows the user to utilize either the entire package, or more importantly, only certain elements which the user may find useful. The provision of copious output also allows the user to use various steps in this software as inputs or outputs to their preferred software. Throughout this set of programs commadelimited files (.csv) are used as input and output as much as possible. This file format is considered convenient as it is opened by default by various spreadsheets. It is the author's experience that text files are the common currency among many softwares. Through the provision of comma-delimited text files at many steps throughout the Genomatic process it is believed that the user can use these files for easy evaluation or importing/exporting into other softwares. In order to manage names of individual samples and populations a convention has been established in this software. Names should be similar to: PUTR_077_01. The underscore ('_') is used to delimit fields within the name. The first field ('PUTR') is a taxon field and is relaively arbitrary but must be present. The second field ('077') is a population code, while the third field ('01') is an individual code. Input to the genomatic frequently only includes the full name and the genomatic parses these names into populations for subsequent sorting. The Genomatic was created during the summer of 2007 as part of a team involved with a DNA microsatellite project on Purshia tridentada (Pursh) DC., a diploid member of the Rose family. The sample involved over 600 individuals which we genotyped at eight loci. Immediately it became obvious that organization would be critical to implementing this project. Furthermore, many of the steps we had been using in the past involved manual 'copying' and 'pasting' which was monotonous and consequently error prone. The Genomatic was created with the specific need of processing a large quantity of diploid microsatellite data, but also with the goal of creating reusable code that could be implemented in other studies. For the most part the directions provided in this document are intended to be executed from the R console. Occasionally (particularly for installation processes) commands need to be executed from the shell (i.e., Windows console, bash, etc.). In order to differentiate these commands the following syntax has been adhered to: R prompt: > command() Shell prompt: $ command 4

2 Installation Obtaining R and its packages The scripting language 'R' is a free interpreted language covered by the GNU public license (http://www.gnu.org/copyleft/gpl.html). R is available at http://www.r-project.org/. Introductory materials can be found at: http://www.r-project.org/ and http://oregonstate.edu/~knausb/r_group/r_user.html. Because R is an interpreted language it is easier to learn than lower level languages such as 'C' or 'fortran'. However, the price to pay for more easily interpreted code is longer execution time. Large datasets, or other computationally intensive operations, may require significant execution time on the order of minutes or longer. It is the experience of the author that the functions in this package can be executed on the order of minutes (at most, usually much faster) and are thus reasonable. Exceptionally large datasets, in samples or markers, may require substantial execution times. Installing the Genomatic It is planned that the Genomatic will be made available through CRAN when it reaches its first release. The Genomatic is currently in 'beta' status. The package is currently available at the author's website at: http://oregonstate.edu/~knausb/. Download the archive appropriate for your platform. It is assumed that the user has a basic understanding of the R language. For more background see the documentation provided at the R website (http://www.r-project.org/) or at the author's website (http://oregonstate.edu/~knausb/r_group/r_user.html). Under Linux (I currently use Ubuntu 8 the Hardy Heron) download the tarball to a local directory. Enter R using superuser privileges (e.g, $ sudo R). Once in R navigate to the directory containing the tarball and install: > setwd( /media/hdb2 ) # The directory containing the tarball. > install.packages("genomatic_0.2-0.tar.gz", repos=null) Under Windows download the zip file to a local directory. Open R, navigate to the directory containing the zip archive and install (in Windows this can also be accomplished through the dropdown menu): > setwd( c:/ ) > install.packages("genomatic_0.2-0.zip", repos=null) After installation it may be important to know where the Genomatic has been installed. The following R command will tell you where to look: 5

>.libpaths() I've included an R script to install an example to a directory of your choice (where you assumedly have read and write permissions). This is the file 'purshia_example_setup.r' in the directory 'genomatic/doc/'. Navigate to this directory, open the script and execute the appropriate parts (some are platform specific). You will neeed to specify a destination directory for the exmaple and make sure this directory exists. This should create a version of the example in your desired directory. This also leaves a copy in the 'genomatic' directory so that you can replace your example directory if you feel it becomes necessary. In order to use the genomatic open an instance of R and use the command library('genomatic'). This loads the Genomatic and makes its functions available for use. Removing the Genomatic As with all R packages, the Genomatic is installed within the R environment and is installed to a subdirectory of the R installation. In order to remove the package enter the following command: > remove.packages( genomatic ) This should remove the Genomatic from your system. 6

3 Creating 96-well plate maps. Functions: pop2indiv.r plater.r plate_write.r A first step in any molecular genetic study involves specifying a sample which includes individuals nested within populations. These individuals need to be plated out into 96-well plates for genotyping. This chapter describes how to convert a comma delimited file containing population names and sample sizes into maps of 96-well plates for genotyping. First we'll need to open R and set the working directory. The working directory is where R will look for files when we tell it to, and will be where files are written to when we write them. The working directory can be set with the following command (an example, note that these directories must exist): > setwd( C:/purshia_example/ ) # Windows > setwd( /home/knausb/purshia_example ) # Linux In the Linux environment the 'working directory' is usually handled by the directory in which files are opened. In Windows the 'working directory' needs to be specified. Recall that Windows uses backwards slashes (\) but because R has its roots in Unix, it therefore uses forward slashes (/). If you 'copy' and 'paste' your directory location out of the Windows Explorer you'll have to change the slashes. Next we read in a file containing our population data. Examples of all the files you need are included in the./genomatic/doc directory which should have been written to a user specified directory when you used the script 'purshia_example_setup.r'. Remember that R writes over files without asking, which depending on the context may be good or bad. First we read in the file and save it as the object 'purshia'. Note that we've included the information that the first line is a header and that it is a comma delimited file. Open this file in a spreadsheet and you will see two columns, one for the population name and a second for the sample size of each population. The second command here uses the sample size to create n individuals for each population where n is the sample size. > purshia <- read.table(file= purshia_pops.csv, header=t, sep=, ) > p.indiv <- pop2indiv(purshia) Now we organize these samples into an 8 X 12 map (actually a 'matrix'), and finally write these maps to comma delimited files. The second argument in the call to 'plate_write' ('putr_plate') is a prefix that will be used in the naming of your files (e.g., putr_plate_1.csv, putr_plate_2.csv). These files are used later in the Genomatic process but may also be useful in creating worksheets to aid the 7

initial plating of samples. > p.plates <- plater(p.indiv) > plate_write(p.plates, putr_plate ) Note that the function 'plate_write' parses your data into units of 96-well plates. Therefore you can give it any number of individuals (more or less than 96) and it will give you the correct number of plates. Empty wells are specified by 'NA's. 8

4 Creating ABI submission forms from 96-well plate maps. Functions: abi_sub.r In preparation for genotyping an ABI genotyping data file needs to be generated. This includes the sample name information as well as the filterset and loci. First read in the filterset file and store it as the object 'fset'. Examples for ABI filterset 'D' and 'G5' are included in the /Genomatic/doc directory. Next read in the plate map, note that the first column contains row names (A, B,..., H). Lastly, we send the plate, filterset, and a prefix for the outfile (in this case 'purshia_1'). The function abi_sub creates the appropriate information and prints it to a file in the working directory. > fset <- read.table(file="purshia_loci_fsetd.csv", header=t, + sep=",") > p.plate <- read.table(file="putr_plate_1.csv", header=t, sep=",", + row.names=1) > > abi_sub(p.plate, fset, "purshia_1") The Genomatic currently only supports comma delimited files. The ABI genotyping software may require an MS Excel file. This can easily be accomplished by opening the comma delimited file in MS Excel as saving it as a file of type '.xls.' 9

5 Scoring Genotyper data for input into the Genomatic. Functions: Non-Genomatic process We score our data using ABI's Genotyper, but you can probably use other software as long as you can get it into our format. Data is scored such that each locus is a separate 'category,' which results in a table where each 'category' is recorded on a separate row. This results in an individual being spread out over several rows, a problem the Genomatic will fix prior to data analysis. The first column of this file should be the file name of the trace (e.g., A10_PUTR_068_01_A10_02.fsa), second the lane number, third the dye, fourth the sample info, fifth the sample comment, sixth the category, and columns seven and eight contain the size in base pairs of the two peaks (this software assumes a diploid organism). ABI's genotyper software includes the option to score homozygotes twice, resulting in two identical calls for each homozygote appearing in two columns. This is expected by the Genomatic. Example files can be found in the /Genomatic/doc/ directory and are named: plate_1_i.txt plate_1_ii.txt plate_5_i.txt plate_5_ii.txt 10

6 Formatting Genotyper tables for the Genomatic. Functions: genotyper2genomatic.r cat_sorter.r File.Name_2_sample.r pop_get.r order_sample.r After your data has been scored in ABI's Genotyper we need to get it into a format that the Genomatic can easily use. First we initailize a list to hold our data in, in this instance we call this object 'purshia.mpi'. Next we read in our data and store it as elements of the list. Each file contains data for a 96-well plate in a tab delimited file where the first row is a header. > purshia.mpi <- list() > purshia.mpi[[1]] <- read.table(file="../data/plate_1_i.txt", + header=t, sep="\t") > purshia.mpi[[2]] <- read.table(file="../data/plate_5_i.txt", + header=t, sep="\t") Now we create an object called fnames to store the file names for our samples. This information will be associated with each sample so that after we organize the data by sample name so we can still track each sample back to its plate. This information is intended to help with rerun management where an individual may have been run for the same locus multiple times due to failed reactions or other unexpected issues. This could also be some other sort of plate or run identifier (e.g., plate1_take2). After we've created fnames we can send the list of plates and fnames to the function genotyper2genomatic and the result should be a sorted data.frame. We'll save this as a comma delimited file saved to the working directory. Note that because we have sample names as the first column we'll suppress the row names (which would just be a numeric index). > fnames <- c("plate_1_i.txt", "plate_5_i.txt") > putr_data <- genotyper2genomatic(purshia.mpi, fnames) > write.table(putr_data, file="putr_data.csv", sep=",", + row.names=false) The function genotyper2genomatic calls several functions to help it with its task of converting the GenoTyper tables to a format we can deal with. You may never notice them, but I've included some information on them here in case you find an alternate use for them. The function cat_sorter takes Genotyper tables and sorts them so that each row contains a single individual with several loci. Called by the function genotyper2genomatic. This function assumes that there is a column called 'Category' which contains a locus identifier. 11

The function filename2sample takes a column of file names from the fsa files and extracts the sample name. The expected format of the file name looks like this: 'A10_PUTR_068_01.fsa' where 'A10' is the well, and 'PUTR_068_01' is the sample name (here its population 068 and individual 01 from that population). This function uses the underscore (_) to parse this string and keeps the second, third and fourth elements (here they are 'PUTR_068_01' as the sample name). The function pop_get works in a similar way as filename2sample does. This function takes as input the sample names (e.g., PUTR_068_01) and,using the underscores as delimiters, extracts the population identifier (e.g., PUTR_068). The function order_sample takes the data.frame and sorts it based on the first column. In the Genomatic process this column is the sample name. 12

7 Managing reruns filling holes. Functions: dup_checker dup_replace order_sample Throughout the course of a project certain samples may be rerun due to failed reactions or other issues. In a large project it can be somewhat of a task to sort through a large matrix of samples to find these duplicated samples. Here I've created a couple of functions to help automate this task. The function dup_checker searches your dataset for duplicates. The first column of your data.frame should be the file's name. The function takes the first sample name and checks for duplicate names in the entire dataset. If a duplicate is found then they are saved in a new data.frame which is sorted and returned for you to scrutinize. At the end of execution the function it prints a quick message telling you whether duplicates were found. If duplicates are found you should save the data.frame as a comma delimited file so you can manually sort through the file and subsequently create a file with no duplicates. > putr_data <- read.table(file="../data/putr_data.csv", header=t, + sep=",") > putr_dups <- dup_checker(putr_data) > write.table(putr_data, file="../data/putr_dups.csv", sep=",", + row.names=false) Once we have our file of duplicates we can open it in our favorite spreadsheet and edit it so that the file contains a single instance of each sample. We can then read this file back into the Genomatic and use the function dup_replace to remove the duplicates from our data file and add our edited samples. Lastly we order the samples and save them as a comma delimited file. > putr_dups <- read.table(file="../data/putr_dups.csv", header=t, + sep=",") > putr_data <- dup_replace(putr_data, putr_dups) > putr_data <- order_sample(putr_data) > write.table(putr_data, file="../data/putr_data.csv", sep=",", + row.names=false) The file you have edited to remove duplicates can be reused or added to through multiple instances of this process, for example, if you have several rounds of reruns. It is however important to rename this file if you want to keep it because R will write over files without asking. 13

8 Combining multiplexes to a single file. Functions: multiplex_cbind pop_get Now that we have all the duplicates removed we can combine the multiplexes into one file. First we initialize a list and read each multiplex in as an element. The function multiplex_cbind needs an index of all the sample names because it matches the sample names in each multiplex before combining them. Here we've used the first column of the first multiplex. We send the function this index and our list of data and it returns our data sorted. This step is somewhat computationally intense, so it may take a bit to execute. > purshia <- list() > purshia[[1]] <- read.table(file="../data/putr_data_mpi.csv", + header=t, sep=",") > purshia[[2]] <- read.table(file="../data/putr_data_mpii.csv", + header=t, sep=",") > putr_data <- multiplex_cbind(purshia[[1]][,1], purshia) > write.table(putr_data, file="../data/putr_data.csv", sep=",", + row.names=false) The function multiplex_cbind calls the function pop_get, which is described in chapter 6 (Formatting Genotyper tables for the Genomatic). 14

9 Allele characterization. Functions: allele_process bin_flagger bin_diag allele_char bin_by_num The function allele_process calls several other functions to generate graphics of the data and stores them in pdf files as well as a comma delimited file for each of the loci containing summary statistics which will later be used to bin the data. The function requires a data.frame of data, a data.frame (of one column) containing locus names, and a specification of the minimum gap size between bins (in units of base pairs). > putr_data <- read.table(file="../data/putr_data.csv", header=t, + sep=",") > loci <- read.table(file="../data/purshia_loci.csv", header=t, + sep=",") > gap <- c(0.3) > allele_process(putr_data, loci, gap) The file 'allele_characterizations.pdf' includes a plot for each locus where the alleles are sorted by size and the bins are coded by alternating colors. The file 'locus_characterizations.pdf' inclueds a plot of each locus where the range of each called bin as a function of the number of peaks in each bin. A regression line is also included. A series of comma delimited files names for the locus and 'alleles.csv' include summary statistics to be examined by the user for suitability and will be used for scoring the dataset. The user may never need to interact with the following functions which are called by the function 'allele_process'. I have provided some information here on these functions for the user who may be interested in modifying the genomatic. The function bin_flagger flags boundaries for bins. Bins are defined by sorting the alleles from smallest to largest and searching for gaps of size gaps. Whenever a gap is encountered a new bin is called. Bins are first identified by a binary code which marks the beginning of each bin. These bins are subsequently numbered and returned by the function. The function bin_diag plots bin diagnostics. The sample data is plotted in a index of alleles (abscissa) by base pairs (ordinate) plot. This function can be used to plot to a graphics device or to a graphics file (e.g., pdf files). The function allele_char calculates summary statistics for each allele in a locus. This 15

information can subsequently be used to bin the data. Statistics computed (in base pairs) are the mean size, standard deviation, minimum size included in the bin, maximum size, and the count of samples from the dataset that are in the bin. The function bin_by_num plots bins by the number of samples represented in each bin. The data are plotted by count of samples in the allele (abscissa) by the range (in base pairs) of the bin. A heavy line is plotted at one base pair and thinner lines are plotted at 1.1, 1.2, 1.3, and 1.4 base pairs. It seems intuitive that a bin should not range more than a base pair, however experience suggests that bins that range more than a single base pair are frequently acceptable. This could be due to error in the size calling (how close the peak is to a size standard), among capillary error (capillary quality may induce error), or other unforeseen issues. These plots are intended to help the user assess bin quality and to decide what may be appropriate. 16

10 Scoring your data into bins. Functions: bin_score bin_caller pop_get sample_sorter The function bin_score controls the bin scoring process, however it needs some information to do its job. First, it needs data, which we've produced and formatted in previous chapters (chapters 8 & 9). Next we need a table of locus names. The allele characterization process (chapter 9) created a file for each locus containing a characterization for each allele. The function bin_score automates the input of these files but needs the allele names in order to do this. Use the same file you used in chapter 8 (Combining multiplexes into a single file). We also need to specify the width of each bin in standard deviations. And lastly we need to name an outfile. > putr_data <- read.table(file="../data/putr_data.csv", header=t, + sep=",") > loci <- read.table(file="../data/purshia_loci.csv", header=t, + sep=",") > bin_w <- c(2) > outfile <- c("purshia_v1.csv") We give this information to bin_score and it scores our data and creates a comma delimited file. This is an iterative process which will take a bit to execute, in my experience this may be on the order of minutes. > bin_score(putr_data, loci, bin_w, outfile) As with any automated scoring process there may be some calls which are unsatisfactory. It is recommended that this is used as a first draft of your scoring process. There will undoubtedly be some 'difficult alleles' to call in your dataset which will require human intervention to score appropriately. 17

11 Exporting your data to other software formats. Functions: genobins genomatic2genalex genomatic2ntsys The Genomatic was not created to analyse your data, there already exist numerous softwares that accomplish this. However, we do need to get the data in a format that other softwares accept. GenAlEx v6 (Peakall and Smouse 2006) is not only a very nice analysis package but it also provides export of data into numerous formats for other softwares. Therefore it was decided this would be a good format for the Genomatic to export to as it could not only provide analysis but also serve as an intermediary to other packages. First we input our data and create an object to hold the prefix to our outfile. Then we need to remove the allele calls (in decimal base pairs) and leave just the bin calls (integers), which is accomplished with the function genobins. > putr_data <- read.table(file="purshia_v1.csv", header=t, sep=",") > outfile <- c("purshia_v1") > putr_data <- genobins(putr_data) Now we convert the file to GenAlEx format. > genomatic2genalex(putr_data, outfile) Or to NTSYS format. > genomatic2ntsys(putr_data, outfile) Peakall, R., Smouse, P.E., 2006. GENALEX 6: genetic analysis in Excel. Population genetic software for teaching and research. Molecular Ecology Notes 6: 288-295. 18

12 Graphical User Interface. Functions: genogui In order to increase usability of the Genomatic a graphical user interface (GUI) was created. This GUI was written using the R extension of the tcl/tk language and therefore should be relatively platform independent. It is currently know to work properly on WindowsXP and Ubuntu Linux platforms. The GUI is essentially a large function which calls the functions described in the previous chapters. In order to use the GUI you will need to start an instance of R, load the genomatic library and start the GUI. > library(genomatic) > genogui() An instance of the GUI should start on your desktop. A drop-down menu titled 'Genomatic' contains an option for each of the previous chapters. Selection of one of these options starts a new window where the user is prompted for necessary information (e.g., input files, output filenames/prefixes) and a final button to exectute the task. An essential first task every time you use the genomatic is to designate an input directory (where user created data is kept) and an output directory (where Genomatic output is written to). This is the first option in the Genomatic dropdown menu. After designating these directories you should be able to enter any part of the Genomatic process where you have the necessary data. You can also work through all the examples outlined above through use of the GUI instead of the command line. The Purshia example discussed in the previous chapters can be repeated. If you adhered to the directions you can simply delete all the output from the directory./purshia_example/output/ and all necessary files should remain in the./purshia_example/ directory. Alternatively the example can be re-installed (see Chapter 2: Installation). First load the library and start the GUI by using the above commands. In the 'Genomatic' dropdown menu select 'Input/Output.' Set the input directory to./purshia_example/ and the output to./purshia_example/output/. Now push the 'Set Directory' button and select 'OK' on the confirmation window. To make 96-well plate maps select that option from the 'Genomatic' drop-down menu. For the 'population file' option select the file purshia_pops.csv. You can verify the file was read in properly by pushing the 'Check File' button, an option available throughout the GUI. For a prefix enter 'putr_plate.' Now press the 'Create 96-well Maps' button to create your maps, press 'OK' on the confirmation window. 19

We can now use these plate maps to create ABI submission files. Select 'ABI Submission' from the 'Genomatic' drop-down menu. Select the file 'purshia_loci_fsetd.csv' file for the filterset file. Note that the 'loci' column of this file is essentially a user comment value, here I've used it to designate the loci in each color. Select 'putr_plate_1.csv' for the sample plate file and enter 'purshia_abi1' as the output prefix. The script adds '_D_abi_cub.csv' to the output (the 'D' comes from the second column in the filterset file). When you score your data each diploid locus should appear on a seperate row of a tab-delimited file such that a sample will be spread across several rows (unless only a single locus was run). See the example file 'plate_1_i.txt' for an example of input file structure. Before we can score data we need to convert it to a format which the Genomatic can use. Select Input from Genotyper from the drop-down menu. Here we can input multiple files. First we choose a file to initialize the list with (removing all others) and then add as many files as we want (which contain the same filterset). We select an output prefix: putr_data_mpi, push the convert button and the ok button on the acknowledgement window. Now that we have the data in a format we can use we need to manage reruns. 20

Appendix A Genomatic Functions This appendix lists the functions included in the genomatic package. This list is also present in the 'INDEX' file. abi_sub allele_char allele_process bin_by_num bin_caller bin_diag bin_flagger bin_score cat_sorter condense dup_checker dup_replace filename2sample genobins genogui Genomatic2genalex genomatic2ntsys genotyper2genomatic init_gmtc order_sample plater plate_write pop2indiv pop_get sample_sorter Creates ABI submission forms Characterizes bins Processes alleles into bins Graphical output of bin quality Bins peaks Diagnostic plots for bins Flags alleles Automates bin scoring Sorts categories to their sample Merges samples with non-redundant data. Checks for duplicate samples Replaces duplicate samples with a single sample Extracts sample names from file names Remove peaks from files and leave bins Graphical user interface. Converts data to GenAlEx format Converts data to NTSYSpc format Import genotyper tables Initialize genomatic data file. Order samples alphanumerically Organizes individual names into 96-well plates Write 96-well maps Population Name to Individual Name Conversion Extracts the population name from a sample's name Sorts data from separate lists into a data.frame 25 total functions 21

Appendix B User supplied files In order to complete the entire Genomatic process several files need to be provided by the user. Examples of these files are included in the /genomatic/doc/ directory. Here is a list of these files. purshia_pops.csv Site name and sample size. purshia_loci_fsetd.csv Filterset and locus information. purshia_loci_fsetg5.csv plate_1_i.txt Exported Genotyper tables. plate_1_ii.txt These four files include two plates (1 & 5) plate_5_i.txt with two different multiplexes (I & II). plate_5_ii.txt purshia_loci.csv Lists each locus. An example script containing the commands in the manual can be found in the /genomatic/doc/ directory called purshia_example.r. When you run this script it will create numerous output files. In the example I have chosen to place these files in the 'output' directory, this directory must exist or an error will be generated (the script purshia_example_setup.r will create an output directory). Keep in mind that R will overwrite files without asking, this can be both good and bad depending on the circumstances. 22

Appendix C Genomatic flowcharts 3 Creating 96-well maps Population names & samples sizes pop2indiv plater plate_write 96-well plates 4 ABI submission 96-well plate Filterset abi_sub ABI submission form 23

6 Genotyper Reformatting Plate data #1 Plate data #n cat_sorter File.Name_2_sample genotyper2genomatic pop_get order_sample Formatted output: putr_data_mpi.csv 24

7 Managing Reruns Formatted output: putr_data_mpi.csv dup_checker Print duplicates: putr_dups_mpi.csv Remove duplicates Are duplicates present? dup_replace order_sample Formatted output: dups_removed_mi.csv 25

8 Multiplex Combination Formatted data: putr_data_mpi.csv Formatted data: putr_data_mpii.csv multiplex_cbind pop_get Formatted output: purshia_ssr_8loci.csv 26

9 Allele Characterization Formatted Data: purshia_data.csv File of Loci: purshia_loci.csv Designate gap size Default = 0.3 bp allele_process bin_flagger bin_diag allele_char Choose a working directory bin_by_num Write each locus characterization to a separate file: FAM_CT2_18_alleles.csv 27

10 Bin Scoring Data file putr_data.csv Locus names purshia_loci.csv Bin width s.d. = 2 Designate Outfile purshia_v1.csv bin_caller bin_score sample_sorter pop_get Scored data purshia_v1.csv 28

11 Export Scored data Outfile purshia_v1 genobins genomatic2genalex genomatic2ntsys GenAlEx formatted data NTSYS formatted data 29