standarich_v1.00: an R package to estimate population allelic richness using standardized sample size

Author: Filipe Alberto, CCMAR, University of the Algarve, Portugal
Date: 3-2006

What it does

The purpose of the package is to standardize population sample size before comparing allelic richness (Â) estimates among populations. Unequal sample size is typical in clonal species, where G, the number of different genotypes (genetic individuals), is the relevant statistic to standardize. Whatever sampling design is used for a clonal organism, G is unpredictable at the sampling stage, even when N, the number of sample units, is kept constant across populations. For clonal species the data file should therefore contain a single copy of each multilocus genotype (MLG) present in each population. G usually varies widely among populations in clonal species data sets, so G must be standardized to obtain comparable estimates of Â. For non-clonal species all MLGs should be used, because the standardization is for N; it is normally necessary when sample size varies among populations.

Graphical tools are also available to: 1) plot allele distribution and frequency across populations for each locus (Fig 1A and B), and 2) plot the relationship between allelic richness and the number of genets/individuals in a population (Fig 2A and B).

Using an R package

R is free software for statistical analysis and complex computation, with numerous graphical applications and its own programming language (R Development Core Team 2004). It is available for several platforms (Unix, Windows and Macintosh). Learning R is typically a slow process, but it pays off by offering a large body of functionality, graphics and data handling procedures. To use this package you need to know little about R; just follow the instructions in this document.
Start by installing R on your system. It is available from http://www.r-project.org/; follow the instructions there. Once you have installed the program and set an Rgui shortcut to a specific working directory (see the PDF help manual An Introduction to R, section Introduction and preliminaries), open the R console, choose the option to install packages from local zip files in the Packages menu on the top bar, and select the standarich_1.0.zip file. You then only need to load the package into the R environment by selecting it in the Load package window.

Data file

The input to standarich is a simple tab-delimited text file. Lines correspond to individual genotypes. The 1st column has the population codes and the 2nd the individual codes. The following columns have the allele codes (i.e. two columns per locus), as integers, using three or two digits (but be consistent!). Missing data is represented by the value 999 (don't use 99, even if you use 2 digits to code alleles!). There is no header in this file.

Example (a data file with 2 populations and 3 loci):

    a  1  181 181  222 222  175 177
    a  2  181 181  222 222  177 177
    a  3  181 181  222 222  177 177
       1  181 181  224 230  175 175
       2  181 181  230 230  175 175
       3  181 203  224 224  175 175
       4  181 203  224 230  175 175
       5  181 203  224 230  175 175

To load such a file into the R environment, use the read.table( ) function. In the R console type:

    test <- read.table("myfile.txt", header = FALSE)

The object test now holds your data (note that R is case sensitive: Test is different from test!).

standarich functions

standarich contains two functions for data manipulation and two functions for plotting.

rgenotypes.arich( test, n_of_replicates )

This function should be used first. It performs a multiple random reduction (Leberg 2002) of the number of individuals/genets in each population, generating random subsamples of every size from g = 1 to g = G, where G is the total number of different genotypes observed in that population (G = N in non-clonal species). Each subsample of a given size g is randomly replicated n_of_replicates times. The function takes two arguments: the first is the R object with your data (e.g. test) and the second, n_of_replicates, is the number of times each subsample of size g is replicated. The function returns a table (actually an R object of type data.frame) with the multiple random reduction results:

    # send the results of the function to the object redtest
    redtest <- rgenotypes.arich( test, 5 )
    redtest
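The multiple random reduction described above can be sketched in base R as follows. This is a minimal illustration under stated assumptions, not the code inside rgenotypes.arich( ): the function names random_reduction and alleles_per_locus, the toy data and the output column names are hypothetical, and Â is assumed here to be the mean number of distinct alleles per locus.

```r
# Illustrative sketch only: NOT the implementation used by standarich.
# Toy input in the documented layout: population, individual, two columns per locus.
toy <- data.frame(
  pop = c("a", "a", "a", "b", "b", "b", "b"),
  ind = c(1, 2, 3, 1, 2, 3, 4),
  L1a = c(181, 181, 181, 181, 181, 181, 181),
  L1b = c(181, 181, 203, 181, 203, 203, 999),  # 999 = missing data
  L2a = c(222, 222, 224, 224, 230, 224, 224),
  L2b = c(222, 224, 224, 230, 230, 224, 230)
)

# Number of distinct alleles at each locus in a set of rows, ignoring 999.
alleles_per_locus <- function(rows) {
  n_loci <- (ncol(rows) - 2) / 2
  sapply(seq_len(n_loci), function(k) {
    a <- unlist(rows[, 2 * k + c(1, 2)], use.names = FALSE)
    length(unique(a[a != 999]))
  })
}

# For every population, draw n_rep random subsamples of each size g = 1..G.
random_reduction <- function(dat, n_rep) {
  pop_codes <- as.character(dat[[1]])
  out <- list()
  for (p in unique(pop_codes)) {
    pop_rows <- dat[pop_codes == p, , drop = FALSE]
    G <- nrow(pop_rows)
    for (g in seq_len(G)) {
      for (r in seq_len(n_rep)) {
        idx <- sample.int(G, g)  # g genotypes, drawn without replacement
        counts <- alleles_per_locus(pop_rows[idx, , drop = FALSE])
        out[[length(out) + 1]] <- data.frame(
          pop = p, g = g,
          A = mean(counts),      # assumed definition of allelic richness
          sd_loci = sd(counts)   # standard deviation across loci
        )
      }
    }
  }
  do.call(rbind, out)
}

set.seed(1)
red <- random_reduction(toy, n_rep = 5)
head(red)
```

With 3 and 4 genotypes in the two toy populations and 5 replicates, the sketch yields (3 + 4) x 5 rows, one per replicate of each subsample size.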

The 1st column refers to population, the 2nd to the subsample size in that randomization, the 3rd to the Â value for that subsample, the 4th to the standard deviation across loci, and the remaining columns to the number of alleles found at each locus. Note how each subsample of size Ind = g is repeated n_of_replicates times.

stand.arich( redtest, g )

The second function takes as its first argument the data.frame produced above (e.g. redtest) and returns a summarized data.frame with Â after standardization of all populations to a given g, the second argument to this function. You can run the function with different g values on the same redtest data.frame to standardize to different sample sizes. The results of this method are similar to those of the rarefaction method (Petit et al. 1998).

    stand.arich( redtest, 15 )
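The standardization step amounts to summarising, for each population, the replicates drawn at one subsample size. The following base R sketch illustrates the idea; it is not the code inside stand.arich( ). The function name standardize, the column names pop, g and A, the toy values, and the use of NA for populations with G < g are all assumptions made for illustration.

```r
# Illustrative sketch only: NOT the implementation used by standarich.
# Toy random-reduction table: population "a" has G = 3, population "b" has G = 2,
# with 2 replicates per subsample size.
red <- data.frame(
  pop = rep(c("a", "b"), times = c(6, 4)),
  g   = c(1, 1, 2, 2, 3, 3, 1, 1, 2, 2),
  A   = c(1.0, 1.5, 1.5, 2.0, 2.0, 2.0, 1.0, 1.0, 1.5, 2.5)
)

# Mean and standard deviation of A across replicates at subsample size g.
standardize <- function(red, g) {
  pops <- unique(red$pop)
  res <- data.frame(pop = pops, A = NA_real_, sd = NA_real_)
  for (i in seq_along(pops)) {
    a <- red$A[red$pop == pops[i] & red$g == g]
    if (length(a) > 0) {   # populations with G < g keep NA
      res$A[i]  <- mean(a)
      res$sd[i] <- sd(a)
    }
  }
  res
}

s2 <- standardize(red, 2)  # both populations have values
s3 <- standardize(red, 3)  # population "b" has G < 3, so no value
s2
s3
```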

    stand.arich( redtest, 25 )

The 1st column refers to population, the 2nd to Â and the last to the standard deviation across replicated random subsamples of size g. Note how in the second example (g = 25) some populations have no values; this is because those populations have fewer individuals than the value used for the standardization (G < g). Choose g as a compromise: not so high that many populations are lost, but not so low that differences in Â become undetectable.

Both functions described above also write their results tables to files in the working directory, named results random reduction.txt and standardized allelic richness.txt respectively. Note that if you use stand.arich( ) with different g arguments you need to rename the standardized allelic richness.txt file each time.

Plotting functions

allele.freq.plot( test, tpop=1, tpoint=2, tall=1 )

The function opens as many plot windows as there are loci in your data file. Each plot represents a table of allele frequencies for a given locus, in which the values are shown as dots of varying diameter. Allele codes are indicated on the x axis and population names on the y axis. The first argument of this function is the object with your data file (e.g. test). The remaining arguments control the text size of alleles (tall) and populations (tpop) and the relative size of the allele frequency dots (tpoint). All have default values, although you may have to change them according to the number of populations in your data file.
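The table of allele frequencies that each plot window represents can be sketched for a single locus in base R. This is only an illustration of the quantity being plotted, not the code inside allele.freq.plot( ); the toy data and the column names pop, ind, a1 and a2 are hypothetical.

```r
# Illustrative sketch only: NOT the implementation used by standarich.
# One locus (two allele columns) for two populations.
toy <- data.frame(
  pop = c("a", "a", "a", "b", "b"),
  ind = 1:5,
  a1  = c(175, 177, 177, 175, 175),
  a2  = c(177, 177, 999, 175, 175)  # 999 marks missing data
)

# Stack the two allele columns, keeping the population label of each allele copy.
long <- data.frame(
  pop    = rep(toy$pop, times = 2),
  allele = c(toy$a1, toy$a2)
)
long <- long[long$allele != 999, ]  # drop missing data

# Rows are populations, columns are alleles; each row sums to 1.
freq <- prop.table(table(long$pop, long$allele), margin = 1)
freq
```

In a plot of this kind, each cell of freq would correspond to one dot whose diameter scales with the frequency.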

    allele.freq.plot( test, tpoint=4 )

Fig 1A. The plot for locus 5 (L 5) for a data set with 3 populations. x axis: Alleles.

    allele.freq.plot( rteste, tpop=0.65, tall=0.8 )

Fig 1B. The plot for locus 5 (L 5) for a data set with 37 populations. x axis: Alleles.

allele.genotype.plot( redtest, g=0, xmin=0, xmax=50, xmark=10, ymax=5 )

This function uses the data.frame produced by rgenotypes.arich( ) (e.g. redtest) to plot the relationship between allelic richness and sample size. The 2nd argument is a value of g at which a vertical line is drawn from the x axis, e.g. to mark the value used in the standardization; the line cuts the population curves at their standardized Â values (read on the y axis). The arguments xmin and xmax define the limits of the x axis, xmark sets the space between consecutive tick marks on the x axis, and ymax defines the size of the y axis. All arguments except the data.frame have default values.

    # let teste37pop be an input table with data from 37 populations
    red37pop_5rep <- rgenotypes.arich( teste37pop, 5 )
    allele.genotype.plot( red37pop_5rep, g=10, xmax=40 )

Fig 2A. Relationship between allelic richness (Â) and nº of genotypes in a sample. Each line represents a different population. Each point on a line is the mean of all replicates for that subsample size g; here 5 replicates were used. x axis: Nº of genotypes; y axis: Allelic richness.

    # now using 100 replicates
    red37pop_100rep <- rgenotypes.arich( teste37pop, 100 )
    allele.genotype.plot( red37pop_100rep, g=10, xmax=40 )

Fig 2B. Relationship between allelic richness (Â) and nº of genotypes in a sample. Each line represents a different population. Each point on a line is the mean of all replicates for that subsample size g; here 100 replicates were used to smooth the lines. x axis: Nº of genotypes; y axis: Allelic richness.

Notice how the lines are smoother with a higher number of replicates, although the processing time increases considerably.

Additional help

In the Rgui console you can find help documentation in addition to this manual. After loading the standarich package, use the help function to open a window with the documentation for any R function:

    help( rgenotypes.arich )
    # or simply
    ?rgenotypes.arich

You can access the same information in HTML format: from the Help menu select HTML help, then select Packages and look for the standarich link.

Example data files

Two data sets are available for testing the functions in the standarich package. The data set exdata is available to test rgenotypes.arich( ):

    data(exdata)
    a <- rgenotypes.arich(exdata, 50)

To test stand.arich( ), the data.frame exresults, with the results from rgenotypes.arich( ), is available:

    data(exresults)
    b <- stand.arich(exresults, 10)

References

Alberto F, Arnaud-Haond S, Duarte CM, Serrao EA (2006) Genetic diversity of a clonal angiosperm near its range limit: the case of Cymodocea nodosa in the Canary Islands. Marine Ecology Progress Series 309: 117-29.

Leberg PL (2002) Estimating allelic richness: effects of sample size and bottlenecks. Molecular Ecology 11: 2445-2449.

Petit RJ, El Mousadik A, Pons O (1998) Identifying populations for conservation on the basis of genetic markers. Conservation Biology 12: 844-855.

R Development Core Team (2004) R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-00-3, URL http://www.R-project.org.