STATISTICAL LABORATORY, USING R FOR BASIC STATISTICAL ANALYSIS

Manuela Cattelan
(Based on material prepared by Prof. Mario Romanazzi)

1 ABC OF R

1.1 ARITHMETIC AND LOGICAL OPERATORS. VARIABLES AND ASSIGNMENT OPERATOR

R works just like a pocket calculator when performing elementary computations. The arithmetic operators of addition, subtraction, multiplication, division and power are

ADDITION  SUBTRACTION  MULTIPLICATION  DIVISION  POWER
+         -            *               /         ^

When necessary, (round) parentheses can be used to force a given order of the computations.

> 1 + 2 + 3/3
[1] 4
> (1 + 2 + 3)/3
[1] 2
> 2 * (10-20)^3
[1] -2000
> (0.2 * 10 + 0.3 * 12 + 0.5 * 7)/2
[1] 4.55

In most cases it is convenient to store constants or intermediate results in the computer memory for later use in the same session. The corresponding areas of the computer memory are identified by names chosen by the user. The assignment operator <- is obtained by typing the < and - characters in sequence.

> a <- 100
> b <- a * (5-2)/5
> b
[1] 60

The operators for binary comparisons are listed in the table below. The logical operators ! (NOT), & (AND) and | (OR) are used to form complex logical expressions. Within R, the logical constants TRUE and FALSE are used to show that a given logical expression is true or false.

LOWER  LOWER OR EQUAL  GREATER  GREATER OR EQUAL  EQUAL  NOT EQUAL
<      <=              >        >=                ==     !=

> neg <- -1
> pos <- 10
> neg <= 0
[1] TRUE
> neg * pos < 0
[1] TRUE
> neg < 0 & pos > 0
[1] TRUE
> pos >= 0 | pos < 5
[1] TRUE

1.2 MATHEMATICAL FUNCTIONS

The usual mathematical functions are available. Some common functions are listed below.

FUNCTION                               R NAME
Absolute value                         abs
Square root                            sqrt
Logarithm (natural, base e)            log
Logarithm (base 10)                    log10
Exponential                            exp
Trigonometric: sine, cosine, tangent   sin, cos, tan
Factorial                              factorial
Binomial coefficient                   choose

The following results reflect the very definitions of the functions. Note that cos(pi/2) is returned as a tiny number rather than exactly 0, because pi is stored with finite precision.

> sqrt(100)
[1] 10
> log10(10000)
[1] 4
> 10^(log10(10000))
[1] 10000
> sin(pi/2)
[1] 1
> cos(pi/2)
[1] 6.123032e-17
> factorial(5)

[1] 120
> factorial(10)
[1] 3628800
> choose(10, 2)
[1] 45

1.3 USER DEFINED FUNCTIONS

The user can define specific functions. For example, the following code defines two functions that compute the circumference and the area of a circle from its radius.

> circ_l <- function(x) 2 * pi * x
> circ_a <- function(x) pi * x^2
> circ_l(1)
[1] 6.283185
> circ_a(0.5)
[1] 0.7853982

1.4 DATA STRUCTURES: VECTORS

The most important data structure is the vector, an ordered collection of $n \ge 1$ items of the same type (numerical, alphanumeric, logical). Note that a scalar is a numerical vector with just one element. There are several ways to define vectors. The most general one is through the c function (c means concatenate). The length function gives the size (number of components) of a vector.

> itp_age <- c(71, 74, 63, 71, 66, 63, 82, 58, 74, 79, 81)
> itp_name <- c("E. De Nicola", "L. Einaudi", "G. Gronchi", "A. Segni",
+     "G. Saragat", "G. Leone", "S. Pertini", "F. Cossiga", "O. L. Scalfaro",
+     "C. A. Ciampi", "G. Napolitano")
> usp_age <- c(61, 63, 44, 55, 56, 61, 53, 70, 65, 47, 55, 47)
> usp_name <- c("H. S. Truman", "D. D. Eisenhower", "J. F. Kennedy",
+     "L. B. Johnson", "R. Nixon", "G. Ford", "J. Carter", "R. Reagan",
+     "G. Bush", "B. Clinton", "G. W. Bush", "B. Obama")
> length(itp_age)
[1] 11
> length(itp_age) == length(itp_name)
[1] TRUE

Let x denote the vector $(x_1, x_2, \ldots, x_n)$. Special vectors can be defined by using the colon operator : (which gives $x_{i+1} - x_i = 1$, $i = 1, \ldots, n-1$), the seq function ($x_{i+1} - x_i = c$, $i = 1, \ldots, n-1$, where c is a constant) and the rep function ($x_i = c$, $i = 1, \ldots, n$, where c is a constant). The names seq and rep are abbreviations of sequence and repeat, respectively.

> -2:10
 [1] -2 -1  0  1  2  3  4  5  6  7  8  9 10
> -2.5:10

 [1] -2.5 -1.5 -0.5  0.5  1.5  2.5  3.5  4.5  5.5  6.5  7.5  8.5  9.5
> seq(1, 20, 2)
 [1]  1  3  5  7  9 11 13 15 17 19
> seq(-5, 5, 0.5)
 [1] -5.0 -4.5 -4.0 -3.5 -3.0 -2.5 -2.0 -1.5 -1.0 -0.5  0.0  0.5  1.0  1.5  2.0
[16]  2.5  3.0  3.5  4.0  4.5  5.0
> rep(0, 5)
[1] 0 0 0 0 0
> c(-0.5, 1:4, rep(5, 3))
[1] -0.5  1.0  2.0  3.0  4.0  5.0  5.0  5.0

1.5 VECTOR OPERATIONS

Vector components or subvectors are obtained by giving either their positions within the vector or a property they satisfy. In both cases, the subsetting operator [] is used.

> itp_name[1]
[1] "E. De Nicola"
> itp_name[1:3]
[1] "E. De Nicola" "L. Einaudi"   "G. Gronchi"
> itp_name[c(1, length(itp_name))]
[1] "E. De Nicola"  "G. Napolitano"
> itp_name[itp_age < 60]
[1] "F. Cossiga"
> usp_name[usp_age > 40 & usp_age < 50]
[1] "J. F. Kennedy" "B. Clinton"    "B. Obama"

In general, mathematical functions transform vectors component-wise. Moreover, arithmetic operations involving several vectors are normally applied to vectors of the same length.

> v1 <- -2:5
> v2 <- 6:13
> abs(v1)
[1] 2 1 0 1 2 3 4 5
> v1 + v2
[1]  4  6  8 10 12 14 16 18
> 2 * v1 - v2
[1] -10  -9  -8  -7  -6  -5  -4  -3
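When the lengths differ, R does not stop but recycles the shorter vector. The following minimal sketch illustrates this; the names `long` and `short` are illustrative and not part of the original notes. It also shows why an expression that mixes a vector with a scalar works.

```r
# Recycling rule: the shorter operand is repeated to match the longer one.
long  <- 1:6
short <- c(10, 20)

# short is used as 10 20 10 20 10 20
recycled_sum <- long + short
print(recycled_sum)

# A scalar is a vector of length 1, so it is recycled over all components.
scaled <- 2 * long - 1
print(scaled)
```

If the longer length is not a multiple of the shorter one, R still returns a result but issues a warning, which is usually a sign of a mistake.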

> 2 * v1 - 1
[1] -5 -3 -1  1  3  5  7  9

To arrange the vector components from minimum to maximum, the sort function can be used. The functions min, max, which.min, which.max produce the minimum and maximum entries and the corresponding positions within the vector.

> sort(itp_age)
 [1] 58 63 63 66 71 71 74 74 79 81 82
> c(min(itp_age), max(itp_age))
[1] 58 82
> c(which.min(itp_age), which.max(itp_age))
[1] 8 7
> sort(c("Carla", "Francesco", "Paola", "Matteo", "Maria"))
[1] "Carla"     "Francesco" "Maria"     "Matteo"    "Paola"

2 BASIC STATISTICS

We use the Presidents' age data to illustrate how to produce a statistical report with R. The following table lists some basic functions (both analytical and graphical). Note that they all take the data vector as their main argument.

STATISTICAL FUNCTION                    R NAME
Sample size                             length
Frequency table                         table
Stem-and-leaf                           stem
Order statistic                         sort
Basic location statistics               summary
Quantile                                quantile
Median                                  median
Mean                                    mean
Variance (unbiased version)             var
Standard deviation (unbiased version)   sd
Box-plot                                boxplot
Histogram                               hist

The stem-and-leaf display shows the general features of the distribution: range, location, dispersion, shape, possible outliers. It is mainly useful with small sample sizes, as in the present case.

> stem(itp_age, scale = 0.5)

  The decimal point is 1 digit(s) to the right of the |

  5 | 8
  6 | 336
  7 | 11449
  8 | 12

> stem(usp_age)


  The decimal point is 1 digit(s) to the right of the |

  4 | 477
  5 | 3556
  6 | 1135
  7 | 0

The summary function, when applied to a numerical vector, gives basic statistics to evaluate location: minimum and maximum values, quartiles and mean. In computing empirical (sample) quantiles, R employs a more refined interpolation algorithm than the one described in the textbook (which is meant for hand computations). With small samples, the two methods can give different results. The range and the interquartile range are easily evaluated from these results. Another dispersion statistic is the standard deviation, to be used together with the sample mean.

> summary(itp_age)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  58.00   64.50   71.00   71.09   76.50   82.00
> summary(usp_age)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  44.00   51.50   55.50   56.42   61.50   70.00
> sd(itp_age)
[1] 7.905119
> sd(usp_age)
[1] 7.925314

A very useful graphical comparison of the two samples is given by the paired box-plot. Whereas the stem-and-leaf displays the full order statistic, the box-plot displays only the quartiles and the extreme statistics.

> boxplot(itp_age, usp_age, horizontal = TRUE, xlab = "Presidents' age (years)",
+     names = c("Italy", "US"), col = "lavender", main = "Italy vs US, 1945-2010")

[Figure: paired horizontal box-plots, "Italy vs US, 1945-2010"; groups: Italy, US; x-axis: Presidents' age (years), from 50 to 80.]

Here, the most important feature is that the Italian distribution is shifted to the right with respect to the US one. The statistical tendency is that Italian Presidents are older than US Presidents: the difference between the medians is about 15.5 years. The dispersion does not seem very different (e.g., compare the standard deviations and the IQRs).

3 EXPLORATION OF DATA WITH FREQUENCY TABLES AND DISTRIBUTIONAL PLOTS

The file regioni.txt contains the required information and can be used to simulate the collection of the data. First, we read the file into R.

> regio <- read.table("http://venus.unive.it/romanaz/statistics/data/regioni.txt",
+     header = TRUE, na.strings = "NA")
> str(regio)
'data.frame':   20 obs. of  7 variables:
 $ name   : Factor w/ 20 levels "Abruzzo","Basilicata",..: 12 19 8 9 17 20 6 5 16 18...
 $ area   : Factor w/ 3 levels "C","N","S": 2 2 2 2 2 2 2 2 1 1...
 $ capo   : Factor w/ 20 levels "Ancona","Aosta",..: 17 2 8 10 18 20 19 4 7 13...
 $ totsurf: int  25399 3263 5421 23861 13607 18379 7844 22123 22997 8456...
 $ coast  : int  NA NA 346 NA NA 156 110 130 573 NA...
 $ res09  : int  4432571 127065 1615064 9742676 1018657 4885548 1230936 4337979 3707818 894222...
 $ car    : num  623 2104 467 558 555...

> regio
                    name area            capo totsurf coast   res09    car
 1              Piemonte    N          Torino   25399    NA 4432571  623.3
 2         Valle d'Aosta    N           Aosta    3263    NA  127065 2104.3
 3               Liguria    N          Genova    5421   346 1615064  467.3
 4             Lombardia    N          Milano   23861    NA 9742676  558.5
 5   Trentino-Alto Adige    N          Trento   13607    NA 1018657  555.0
 6                Veneto    N         Venezia   18379   156 4885548  422.6
 7 Friuli-Venezia Giulia    N         Trieste    7844   110 1230936  525.9
 8        Emilia-Romagna    N         Bologna   22123   130 4337979  534.7
 9               Toscana    C         Firenze   22997   573 3707818  541.9
10                Umbria    C         Perugia    8456    NA  894222  689.6
11                Marche    C          Ancona    9694   172 1569578  616.2
12                 Lazio    C            Roma   17208   357 5626710  699.7
13               Abruzzo    S        L'Aquila   10799   124 1334675  696.2
14                Molise    S      Campobasso    4438    34  320795  658.5
15              Campania    S          Napoli   13595   461 5812962  568.1
16                Puglia    S            Bari   19363   830 4079702  560.7
17            Basilicata    S         Potenza    9992    59  590601  702.7
18              Calabria    S Reggio Calabria   15080   710 2008709  609.5
19               Sicilia    S         Palermo   25707  1425 5037799  594.4
20              Sardegna    S        Cagliari   24089  1636 1671001  657.2

The structure is a data table with N = 20 rows (the regions, which are the statistical units) and 7 columns (the variables):

1. name, name of region (identifier)
2. area, geographical area (stratification v.)
3. capo, region capital (categorical v.)
4. totsurf, region total surface area (numerical v., km²)
5. coast, region total coast length (numerical v., km)
6. res09, total number of residents, as of 1/1/2009 (numerical v., count)
7. car, number of cars per 1000 residents (numerical v.)

We start data exploration with the study of the stratification variable, geographical area. The frequency distribution is obtained with the R function table and the corresponding graphical display with barplot.

> N <- dim(regio)[1]
> abs_f <- table(regio$area)
> abs_f

C N S
4 8 8
> rel_f <- 100 * (table(regio$area)/N)
> rel_f

 C  N  S
20 40 40
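As a cross-check of the percentages above, the same relative frequencies can be obtained with the base R function prop.table, which divides a table by its total. The sketch below does not read regioni.txt: it assumes an illustrative vector `area`, rebuilt from the counts just shown (4 central, 8 northern and 8 southern regions).

```r
# Geographical area of the 20 regions, rebuilt from the counts shown above
# (illustrative vector, not read from regioni.txt).
area <- rep(c("C", "N", "S"), times = c(4, 8, 8))

abs_freq <- table(area)                 # absolute frequencies
rel_freq <- 100 * prop.table(abs_freq)  # relative frequencies (%)
print(rel_freq)
```

Using prop.table avoids having to compute and store the sample size separately.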

> barplot(rel_f, xlab = "Geographical Area", ylab = "Frequency (%)",
+     main = "Distribution of Regions According to Geographical Area",
+     col = "lavender")

[Figure: bar plot, "Distribution of Regions According to Geographical Area"; x-axis: Geographical Area (C, N, S); y-axis: Frequency (%), from 0 to 40.]

The population density is the ratio of the total number of residents in a geographical area to the corresponding surface area. In the present situation we obtain the population density of each region (residents per km²) by dividing res09 by totsurf.

> dens <- regio$res09/regio$totsurf
> dens
 [1] 174.51754  38.94116 297.92732 408.30963  74.86272 265.82230 156.92708
 [8] 196.08457 161.23051 105.75000 161.91232 326.98222 123.59246  72.28369
[15] 427.58088 210.69576  59.10739 133.20351 195.96993  69.36780
> stem(dens)

  The decimal point is 2 digit(s) to the right of the |

  0 | 46777
  1 | 1236667
  2 | 0017
  3 | 03
  4 | 13

Note that the decimal point does not appear in the stem-and-leaf, but the legend supplies the necessary information. In the present case, only the hundreds and tens digits are retained and the data are rounded accordingly. The stems are the classes [0, 100), [100, 200), etc. (after rounding to the nearest ten). The distribution is unimodal, with the highest frequency in the second class, and asymmetric: most regions fall in the low-density classes, with a longer tail towards high densities (positive skewness).

To obtain the joint distribution of geographical area and density, we use the table function again. Note that the variable dens is first divided into classes with the function cut, to avoid an uninformative proliferation of entries in the frequency table.

> table(regio$area, cut(dens, breaks = c(0, 150, 300, 450), include.lowest = TRUE))

    [0,150] (150,300] (300,450]
  C       1         2         1
  N       2         5         1
  S       5         2         1

The result suggests that northern and central regions concentrate in the middle density class, southern regions in the lowest class.

4 THE UNIFORM DISTRIBUTION

Is there any pattern in the statistical distribution of the decimal digits of real numbers? As an example we use the first 49 decimal digits of π and e:

π  3.1415926535897932384626433832795028841971693993751
e  2.7182818284590452353602874713526624977572470937000

The appropriate tools are the frequency distribution or the stem-and-leaf plot of the data.

> dig_pi <- "3.1415926535897932384626433832795028841971693993751"
> dig_pi <- unlist(strsplit(dig_pi, split = ""))[-c(1, 2)]
> dig_pi
 [1] "1" "4" "1" "5" "9" "2" "6" "5" "3" "5" "8" "9" "7" "9" "3" "2" "3" "8" "4"
[20] "6" "2" "6" "4" "3" "3" "8" "3" "2" "7" "9" "5" "0" "2" "8" "8" "4" "1" "9"
[39] "7" "1" "6" "9" "3" "9" "9" "3" "7" "5" "1"
> table(as.numeric(dig_pi))

0 1 2 3 4 5 6 7 8 9
1 5 5 8 4 5 4 4 5 8
> stem(as.numeric(dig_pi))

  The decimal point is at the |

  0 | 0
  1 | 00000
  2 | 00000
  3 | 00000000
  4 | 0000
  5 | 00000
  6 | 0000
  7 | 0000
  8 | 00000
  9 | 00000000

> dig_e <- "2.7182818284590452353602874713526624977572470937000"
> dig_e <- unlist(strsplit(dig_e, split = ""))[-c(1, 2)]
> dig_e
 [1] "7" "1" "8" "2" "8" "1" "8" "2" "8" "4" "5" "9" "0" "4" "5" "2" "3" "5" "3"
[20] "6" "0" "2" "8" "7" "4" "7" "1" "3" "5" "2" "6" "6" "2" "4" "9" "7" "7" "5"
[39] "7" "2" "4" "7" "0" "9" "3" "7" "0" "0" "0"
> table(as.numeric(dig_e))

0 1 2 3 4 5 6 7 8 9
6 3 7 4 5 5 3 8 5 3
> stem(as.numeric(dig_e))

  The decimal point is at the |

  0 | 000000
  1 | 000
  2 | 0000000
  3 | 0000
  4 | 00000
  5 | 00000
  6 | 000
  7 | 00000000
  8 | 00000
  9 | 000

Note that the last four digits of the string used for e (7000) result from rounding: the exact expansion continues ...0936999. With the exact digits, the frequencies of 0, 6, 7 and 9 would be 3, 4, 7 and 6, respectively; the conclusions below are unaffected.

In both cases the distribution exhibits neither a clear mode, as in the unimodal situation, nor a clear increasing or decreasing pattern. A theoretical model could be a uniform distribution on the integers 0, 1, 2, ..., 9, giving equal weight 1/10 to all digits. The discrepancies of the observed distribution from this model are plausibly due to the small sample size.

5 ITALIAN VS NON ITALIAN AGE DISTRIBUTION

The age classes have different widths, hence density (percentage frequency divided by class width) must be used, not frequency. We show the results in the table below.

                    Italian Residents                   Non Italian Residents
                    Veneto            Italy             Veneto            Italy
Age Class   Width   Freq. %  Dens. %  Freq. %  Dens. %  Freq. %  Dens. %  Freq. %  Dens. %
[0, 15)     15      13.9     0.93     14.1     0.94     21.0     1.40     19.1     1.27
[15, 30)    15      15.7     1.05     16.8     1.12     26.6     1.77     25.2     1.68
[30, 45)    15      25.5     1.70     24.0     1.60     38.4     2.56     38.7     2.58
[45, 65)    20      25.8     1.29     25.3     1.27     12.7     0.64     15.0     0.75
[65, 80)    15      14.1     0.94     14.6     0.97      1.2     0.08      1.8     0.12
[80, 110)   30       5.1     0.17      5.1     0.17      0.2     0         0.3     0

As expected, there are no very important differences between Veneto and Italy, for either Italian or non-Italian residents. The age distribution of Italian residents is unimodal, with the density peak in the class [30, 45). In contrast, the highest frequency is in the class [45, 65).
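The densities in the table can be reproduced by dividing each percentage frequency by the width of its class. The following minimal sketch does this for the Italian residents of Veneto; the variable names (`breaks`, `freq`, `width`, `density`) are illustrative, and the figures are copied from the table above.

```r
# Class limits and percentage frequencies (Italian residents, Veneto),
# copied from the table above.
breaks <- c(0, 15, 30, 45, 65, 80, 110)
freq   <- c(13.9, 15.7, 25.5, 25.8, 14.1, 5.1)

width   <- diff(breaks)   # class widths: 15 15 15 20 15 30
density <- freq / width   # percentage per year of age
print(round(density, 2))

# The class with the highest density need not be the class with the
# highest frequency, because the classes have different widths.
modal_by_density   <- which.max(density)
modal_by_frequency <- which.max(freq)
print(c(modal_by_density, modal_by_frequency))
```

Here the third class, [30, 45), has the highest density while the fourth class, [45, 65), has the highest frequency, which is the contrast noted in the text.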

The age distribution of non-Italian residents is still unimodal, with the density peak in the class [30, 45), but here the concentration of the data in the modal class is much higher (for Italy, the density is 2.58 against 1.60). The statistical tendency is clear: non-Italian residents are younger. With reference to Italy, 19.7% of Italian residents are 65 or older, against 2.1% of non-Italian residents. What is the explanation of this finding? Do you expect it to be a permanent feature or to change in the future?

The four histograms are shown in the figure.

> layout(matrix(1:4, 2, 2))
> plot(0, 0, type = "p", xlab = "Age (Years)", ylab = "Density (%)",
+     main = "Veneto (Italian Res.)", xlim = c(0, 115), ylim = c(0,
+         2.6))
> rect(c(0, 15, 30, 45, 65, 80), c(0, 0, 0, 0, 0, 0), c(15, 30,
+     45, 65, 80, 110), c(0.93, 1.05, 1.7, 1.29, 0.94, 0.17), col = "lavender")
> plot(0, 0, type = "p", xlab = "Age (Years)", ylab = "Density (%)",
+     main = "Italy (Italian Res.)", xlim = c(0, 115), ylim = c(0,
+         2.6))
> rect(c(0, 15, 30, 45, 65, 80), c(0, 0, 0, 0, 0, 0), c(15, 30,
+     45, 65, 80, 110), c(0.94, 1.12, 1.6, 1.27, 0.97, 0.17), col = "lavender")
> plot(0, 0, type = "p", xlab = "Age (Years)", ylab = "Density (%)",
+     main = "Veneto (Non Italian Res.)", xlim = c(0, 115), ylim = c(0,
+         2.6))
> rect(c(0, 15, 30, 45, 65, 80), c(0, 0, 0, 0, 0, 0), c(15, 30,
+     45, 65, 80, 110), c(1.4, 1.77, 2.56, 0.64, 0.08, 0), col = "lavender")
> plot(0, 0, type = "p", xlab = "Age (Years)", ylab = "Density (%)",
+     main = "Italy (Non Italian Res.)", xlim = c(0, 115), ylim = c(0,
+         2.6))
> rect(c(0, 15, 30, 45, 65, 80), c(0, 0, 0, 0, 0, 0), c(15, 30,
+     45, 65, 80, 110), c(1.27, 1.68, 2.58, 0.75, 0.12, 0), col = "lavender")

[Figure: four histograms of the age distribution. Top row: Veneto (Italian Res.), Veneto (Non Italian Res.); bottom row: Italy (Italian Res.), Italy (Non Italian Res.). x-axis: Age (Years), from 0 to 100; y-axis: Density (%), from 0.0 to 2.0.]
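The four panels are drawn with the same plot and rect calls, so the repetition can be avoided with a small user-defined function. The sketch below is one possible way to do it, not the method used in the notes: the function name `density_hist` and its arguments are hypothetical, and it uses type = "n" (an empty frame) where the original calls use type = "p". The densities are those of the table in this section.

```r
# Hypothetical helper: draw a histogram from class limits and densities,
# so that the four panels share one piece of code.
density_hist <- function(breaks, dens, main) {
  k <- length(dens)
  stopifnot(length(breaks) == k + 1)
  # Empty frame (type = "n" draws the axes only), then one rectangle per class
  plot(0, 0, type = "n", xlab = "Age (Years)", ylab = "Density (%)",
       main = main, xlim = c(0, 115), ylim = c(0, 2.6))
  rect(breaks[-(k + 1)], rep(0, k), breaks[-1], dens, col = "lavender")
  # Total area of the rectangles: close to 100 (%) up to rounding
  invisible(sum(diff(breaks) * dens))
}

breaks <- c(0, 15, 30, 45, 65, 80, 110)
layout(matrix(1:4, 2, 2))  # panels are filled column by column
area_ven_it <- density_hist(breaks, c(0.93, 1.05, 1.70, 1.29, 0.94, 0.17),
                            "Veneto (Italian Res.)")
area_ita_it <- density_hist(breaks, c(0.94, 1.12, 1.60, 1.27, 0.97, 0.17),
                            "Italy (Italian Res.)")
area_ven_no <- density_hist(breaks, c(1.40, 1.77, 2.56, 0.64, 0.08, 0),
                            "Veneto (Non Italian Res.)")
area_ita_no <- density_hist(breaks, c(1.27, 1.68, 2.58, 0.75, 0.12, 0),
                            "Italy (Non Italian Res.)")
```

The value returned invisibly is the total area of the rectangles. It should be close to 100, which gives a quick check that the densities were entered correctly.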