The R Environment: A high-level overview. Deepayan Sarkar, Indian Statistical Institute, Delhi. 22 July 2013
1 The R Environment: A high-level overview
Deepayan Sarkar, Indian Statistical Institute, Delhi
22 July 2013
2 The R language
• Tutorials in the Astrostatistics workshop will use R
• ...in the form of silent hands-on tutorials
• This talk gives a very high-level introduction to R
3 What exactly is R?
• R is a language and environment for statistical computing and graphics.
• It is a Free Software project which is similar to the S language and environment which was developed at Bell Laboratories by John Chambers and colleagues.
• R can be considered as a different implementation of S.
5 The origins of S
• Developed at Bell Labs (statistics research department, 1970s onward)
• Primary goals
  • Interactivity: Exploratory Data Analysis vs batch mode
  • Flexibility: Novel vs routine methodology
  • Practical: For actual use, not (just) academic research
6 R
1993  Started as a teaching tool by Robert Gentleman & Ross Ihaka at the University of Auckland... because S didn't run on the Apple computers they had
1995  Convinced by Martin Maechler to release it as Free Software
2000  Version 1.0 released
9 R
• Not really that different from S, but the Free Software/Open Source development model has made it a larger success, particularly in academia
10 Why the success?
• Rapid prototyping
• Interfaces to external software
• Easy dissemination of research (through packages)
13 Rapid prototyping
"S is a programming language and environment for all kinds of computing involving data. It has a simple goal: To turn ideas into software, quickly and faithfully."
John Chambers, Programming with Data
14 Example: Fibonacci sequence
Sequence $\{F_n\}$ defined by the recurrence relation $F_n = F_{n-1} + F_{n-2}$ with seed values $F_0 = 0$, $F_1 = 1$
15 S is a programming language
> fibonacci <- function(n) {
      x <- c(0, 1)
      while (length(x) < n) {
          x <- c(x, sum(tail(x, 2)))
      }
      return (x)
  }
> fib10 <- fibonacci(10)
> fib10
 [1]  0  1  1  2  3  5  8 13 21 34
16 Basic usage of R
• Think of R as a (sophisticated) calculator
• Expressions are typed at the command line
• R evaluates each expression and prints the result
• ...unless the result is stored in a variable (using <- or =)
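A minimal sketch of the calculator-style usage described above. The variable names radius and area are hypothetical and chosen only for illustration.

```r
# An expression typed at the prompt is evaluated and its value is printed.
sqrt(2) * 3

# Assigning with <- (or =) stores the value instead of printing it.
radius <- 2.5            # hypothetical variable, for illustration only
area = pi * radius^2     # '=' also works for assignment at the top level

# Typing the variable name (or calling print()) displays the stored value.
area
print(round(area, 2))
```

Nothing is printed by the two assignment lines; only the bare expressions and the print() call produce output.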
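18 S is a programming language
> fibonacci(20)
 [1]    0    1    1    2    3    5    8   13   21   34   55   89
[13]  144  233  377  610  987 1597 2584 4181
> fib20 <- fibonacci(20)
> fib20
 [1]    0    1    1    2    3    5    8   13   21   34   55   89
[13]  144  233  377  610  987 1597 2584 4181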
19 Easy to call C for efficiency
File fib.c:
#include <Rdefines.h>

/* Fill x[0..n-1] with the first n Fibonacci numbers (assumes n >= 2). */
void cfib(int *x, int n)
{
    int i;
    x[0] = 0; x[1] = 1;
    for (i = 2; i < n; i++)
        x[i] = x[i-1] + x[i-2];
}

/* Entry point for .Call(): nr is an integer vector of length 1. */
SEXP do_fibonacci(SEXP nr)
{
    SEXP ans = PROTECT(NEW_INTEGER(INTEGER_VALUE(nr)));
    cfib(INTEGER_POINTER(ans), LENGTH(ans));
    UNPROTECT(1);
    return ans;
}
20 Easy to call C for efficiency
$ R CMD SHLIB fib.c
gcc -std=gnu99 -I/home/deepayan/Rinstall/R-devel/lib/R/incl
gcc -std=gnu99 -shared -L/usr/local/lib -o fib.so fib.o -L/
> dyn.load("fib.so")
> cfib10 = .Call("do_fibonacci", as.integer(10))
> cfib10
 [1]  0  1  1  2  3  5  8 13 21 34
21 Vectorized computation
The Fibonacci series has a closed-form expression as well:
$F(n) = \frac{\phi^n - (1 - \phi)^n}{\sqrt{5}}$, where $\phi = \frac{1 + \sqrt{5}}{2}$
> phi <- (1 + sqrt(5)) / 2
> n <- 0:9
> n
 [1] 0 1 2 3 4 5 6 7 8 9
> (phi^n - (1 - phi)^n) / sqrt(5)
 [1]  0  1  1  2  3  5  8 13 21 34
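As a rough check, the vectorized closed form can be compared with the iterative function from the earlier slide. This is only a sketch: the name closed and the choice of 30 terms are assumptions made for illustration.

```r
# Iterative definition, as in the talk's fibonacci() function.
fibonacci <- function(n) {
    x <- c(0, 1)
    while (length(x) < n) {
        x <- c(x, sum(tail(x, 2)))
    }
    x
}

# Closed form, evaluated for all n at once (no explicit loop).
phi <- (1 + sqrt(5)) / 2
n <- 0:29
closed <- (phi^n - (1 - phi)^n) / sqrt(5)

# The closed form is computed in floating point, so compare after rounding.
all(round(closed) == fibonacci(30))

# Largest absolute discrepancy before rounding.
max(abs(closed - fibonacci(30)))
```

The arithmetic operators work elementwise on the whole vector n, which is what "vectorized" refers to here.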
22 S is a programming language
• ...designed for interactive use
• ...with a focus on data analysis
• Basic data structures are vectors
• Large collection of statistical functions
• Advanced statistical graphics capabilities
24 Ideal for simulation studies
Example: How many heads when we toss a fair coin 500 times?
> rbinom(1, size = 500, prob = 0.5)
[1] 234
> rbinom(10, size = 500, prob = 0.5)
 [1] ...
25 Ideal for simulation studies
> library(lattice)   # provides histogram()
> x <- rbinom(1000, size = 500, prob = 0.5)
> histogram(x)
[Figure: histogram of x; vertical axis: Percent of Total]
26 What if the coin is not fair?
> x <- rbinom(1000, size = 500, prob = 0.1)
> histogram(x)
[Figure: histogram of x; vertical axis: Percent of Total]
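The two simulations can also be summarized numerically and compared with the binomial mean and standard deviation. This is a sketch under stated assumptions: the helper sim_summary(), the seed, and the number of replications are not from the talk.

```r
set.seed(2013)   # arbitrary seed, only so that the run is repeatable

# Simulate nsim experiments of 'size' tosses each and compare the simulated
# mean and sd of the head counts with the binomial theory.
# sim_summary() is a hypothetical helper, not part of R.
sim_summary <- function(prob, size = 500, nsim = 10000) {
    x <- rbinom(nsim, size = size, prob = prob)
    c(prob        = prob,
      sim.mean    = mean(x),
      theory.mean = size * prob,
      sim.sd      = sd(x),
      theory.sd   = sqrt(size * prob * (1 - prob)))
}

# One row per coin: fair (0.5) and unfair (0.1).
res <- t(sapply(c(0.5, 0.1), sim_summary))
print(round(res, 2))
```

The unfair coin gives both a smaller mean and a smaller spread, which matches the narrower histogram.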
27 Statistical models
• S can fit all kinds of statistical models
• A brief example:
  • Michelson's 1879 experiment on measuring the speed of light
  • Measurements of the speed of light in air
  • Made between 5th June and 2nd July, 1879
  • Five experiments, each consisting of 20 consecutive runs
  • Speed recorded in km/s (with 299,000 subtracted)
  • The currently accepted value on this scale is 734.5
29 The Michelson data
[Figure: the Michelson data; variables Speed, Run, Expt]
30 The R approach: A typical analysis
> data(michelson, package = "MASS")
> michelson
  Speed Run Expt
  ...
31
> str(michelson)
'data.frame': 100 obs. of 3 variables:
 $ Speed: int ...
 $ Run  : Factor w/ 20 levels "1","2","3","4",..: ...
 $ Expt : Factor w/ 5 levels "1","2","3","4",..: ...
32
> bwplot(Speed ~ Expt, data = michelson, aspect = 1)
[Figure: box-and-whisker plot of Speed for each Expt]
33 Fitting models
Model: $y_{ij} = \mu + \alpha_i + \varepsilon_{ij}$, where
  $y_{ij}$ = measured speed of light
  $\mu$ = overall mean (actual speed)
  $\alpha_i$ = additive effect of the $i$-th experiment
  $\varepsilon_{ij} \sim N(0, \sigma^2)$ = error
To fit this model in R, use
> fm <- lm(Speed ~ Expt, data = michelson)
34 Hypothesis testing (ANOVA)
Test whether experiments are different (should not be)
> anova(fm)
Analysis of Variance Table

Response: Speed
          Df Sum Sq Mean Sq F value Pr(>F)
Expt
Residuals
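To see what the ANOVA table reports, the F statistic can be recomputed from between-experiment and within-experiment sums of squares. This is a sketch, assuming the MASS package is installed; names such as ss.between and F.by.hand are made up for illustration.

```r
data(michelson, package = "MASS")

fm  <- lm(Speed ~ Expt, data = michelson)
tab <- anova(fm)

# Recompute the F statistic from the sums of squares.
grand.mean  <- mean(michelson$Speed)
group.means <- ave(michelson$Speed, michelson$Expt)   # each run's experiment mean
ss.between  <- sum((group.means - grand.mean)^2)
ss.within   <- sum((michelson$Speed - group.means)^2)
df.between  <- nlevels(michelson$Expt) - 1
df.within   <- nrow(michelson) - nlevels(michelson$Expt)

F.by.hand <- (ss.between / df.between) / (ss.within / df.within)
p.by.hand <- pf(F.by.hand, df.between, df.within, lower.tail = FALSE)

# Compare with the values stored in the anova() table.
c(F.anova = tab[["F value"]][1], F.by.hand = F.by.hand)
c(p.anova = tab[["Pr(>F)"]][1], p.by.hand = p.by.hand)
```

The table returned by anova() is itself a data frame, so its columns can be extracted like those of any other data frame.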
35
> xyplot(Speed ~ Run | Expt, data = michelson, layout = c(2, 3))
[Figure: Speed against Run, one panel per Expt]
36 Summary
The S approach is to work with objects.
• Model fits produce objects, usually stored as variables
• Queried interactively for further analysis
> anova(fm)
> summary(fm)
> residuals(fm)
• Use graphics as a natural component of workflow
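A small sketch of what "working with objects" looks like in practice, using simulated data rather than the Michelson data. The data frame d, its columns, and the object name fit are all hypothetical.

```r
set.seed(1)   # arbitrary seed

# Simulated one-way layout: three groups of ten observations (hypothetical data).
d <- data.frame(group = gl(3, 10, labels = c("a", "b", "c")))
d$y <- c(10, 12, 15)[d$group] + rnorm(nrow(d))

# The fit is an ordinary object that can be stored and queried later.
fit <- lm(y ~ group, data = d)
class(fit)
names(fit)[1:4]

# Extractor functions pull out pieces of the object.
coef(fit)
head(residuals(fit))

# Fitted values plus residuals reconstruct the response.
all.equal(unname(fitted(fit) + residuals(fit)), d$y)

# With the default contrasts, the intercept is the mean of the first group.
tapply(d$y, d$group, mean)
```

Because the fit is stored in a variable, further questions (plots, tests, predictions) can be asked without refitting the model.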
37 Even graphics is programmable
Globular cluster data: data/globclus_prop.dat
> GlobClust <- read.table("data/globclus_prop.dat", header = TRUE)
> xyplot(Mv ~ r.tidal, GlobClust, grid = TRUE)
[Figure: scatter plot of Mv against r.tidal]
38 Even graphics is programmable
> xyplot(Mv ~ r.tidal, GlobClust, grid = TRUE,
         scales = list(x = list(log = 2)))
[Figure: Mv against r.tidal, x-axis on a log base 2 scale]
39 Adding a regression line
> xyplot(Mv ~ r.tidal, GlobClust,
         scales = list(x = list(log = 2)),
         panel = function(x, y, ...) {
             panel.grid(h = -1, v = -1)
             panel.points(x, y, ...)
             panel.abline(lm(y ~ x), col = "black")
         })
[Figure: Mv against r.tidal (log base 2 x-axis) with the fitted regression line]
40
• Doesn't seem to be a good fit
• One solution: use locally linear fits
• At each location, fit regression line based on nearby points
• A generalization of this idea is known as LOESS
[Figure: Mv against r.tidal (log base 2 x-axis) with the fitted regression line]
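The "fit a line to nearby points" idea can be sketched by hand on simulated data. This is a deliberately simplified, unweighted version and not the LOESS algorithm itself, which also downweights distant points. The function local_linear(), the neighbourhood size k, and the simulated curve are all assumptions for illustration.

```r
set.seed(42)   # arbitrary seed

# Simulated data with a curved relationship (hypothetical, not the talk's data).
x <- sort(runif(200, 0, 10))
y <- sin(x) + rnorm(200, sd = 0.3)

# Local linear estimate at x0: fit a straight line to the k nearest points
# and evaluate it at x0. local_linear() is a hypothetical helper.
local_linear <- function(x0, x, y, k = 40) {
    nearest <- order(abs(x - x0))[1:k]
    fit <- lm(y ~ x, data = data.frame(x = x[nearest], y = y[nearest]))
    unname(predict(fit, newdata = data.frame(x = x0)))
}

grid <- seq(0.5, 9.5, by = 0.5)
local.fit  <- sapply(grid, local_linear, x = x, y = y)
global.fit <- predict(lm(y ~ x), newdata = data.frame(x = grid))

# Root mean squared error against the true curve sin(x) on the grid.
rmse <- function(est) sqrt(mean((est - sin(grid))^2))
c(global = rmse(global.fit), local = rmse(local.fit))
```

On curved data like this, the single global line misses the shape, while the local fits follow it.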
43
> xyplot(Mv ~ r.tidal, GlobClust,
         scales = list(x = list(log = 2)),
         panel = function(x, y, ...) {
             panel.grid(h = -1, v = -1)
             panel.points(x, y, ...)
             panel.loess(x, y, col = "black")
         })
[Figure: Mv against r.tidal (log base 2 x-axis) with the fitted LOESS curve]
44
• Smooth curve denotes fitted LOESS model
• Estimates average acceleration as function of distance
• But how confident are we about the fitted estimate?
[Figure: Mv against r.tidal (log base 2 x-axis) with the fitted LOESS curve]
45 One approach: Bootstrap
• Resample with replacement to mimic underlying randomness
• Use same modeling process to obtain estimate
• Repeat many times to assess variability
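The three steps can be sketched numerically for a LOESS estimate at a single location. The data are simulated, and the names loess_at(), x0, and B are assumptions chosen for illustration; the globular cluster data are not used here.

```r
set.seed(123)   # arbitrary seed

# Simulated data (hypothetical): a smooth trend plus noise.
n   <- 100
dat <- data.frame(x = runif(n, 0, 10))
dat$y <- sqrt(dat$x) + rnorm(n, sd = 0.2)
x0  <- 5        # location at which the fitted curve is examined

# The modeling process: fit LOESS and read off the fitted value at x0.
# loess_at() is a hypothetical helper.
loess_at <- function(d, x0) {
    fit <- loess(y ~ x, data = d)
    unname(predict(fit, newdata = data.frame(x = x0)))
}

estimate <- loess_at(dat, x0)

# Resample rows with replacement, refit, and repeat B times.
B <- 500
boot.est <- replicate(B, {
    bs.id <- sample(seq_len(n), replace = TRUE)   # SRSWR
    loess_at(dat[bs.id, ], x0)
})

# Variability of the estimate across bootstrap resamples.
boot.se <- sd(boot.est)
boot.ci <- quantile(boot.est, c(0.025, 0.975))    # percentile interval
estimate
boot.se
boot.ci
```

The spread of boot.est plays the same role as the band of faint bootstrap curves in a plot: where the resampled fits disagree, the estimate is less certain.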
46
> xyplot(Mv ~ r.tidal, GlobClust,
         scales = list(x = list(log = 2)),
         panel = function(x, y, ...) {
             panel.grid(h = -1, v = -1)
             n <- length(x)
             for (i in 1:1000) {
                 bs.id <- sample(1:n, replace = TRUE) ## SRSWR
                 panel.loess(x[bs.id], y[bs.id], col = "red", alpha = 0.02)
             }
             panel.points(x, y, ...)
             panel.loess(x, y, col = "black")
         })
47
[Figure: Mv against r.tidal (log base 2 x-axis), with bootstrap LOESS curves in red and the original LOESS fit in black]
48
Powerful built-in tools + Programming language ⇒ Flexibility
49 Dissemination of research
• Rapid prototyping: quick implementation of research ideas
• Well-structured packaging system allows dissemination
• CRAN (Comprehensive R Archive Network): more than 3000 packages
• Other specialized collections (Bioconductor, Omegahat)
50 Sweave: reproducible documents
• Inspired by literate documents (Knuth)
• Enables mixing of R code and LaTeX / HTML (etc.)
• Source file reproduces both analysis and report
• Reproducible research + convenience
• rnw/bootstrap.rnw
52 Summary
• R is a feature-rich interactive language + environment ideally suited to data analysis as well as other kinds of numerical computations
• Some learning required before it can be used effectively
• Typical mind-blocks for newcomers:
  • R is not C! More like LISP / Scheme
  • Vectorization (easy to get past with a little experience)
  • Functional approach to programming
54 Questions?
Modern Optimization Methods for Big Data Problems MATH11146 The University of Edinburgh Peter Richtárik Week 3 Randomized Coordinate Descent With Arbitrary Sampling January 27, 2016 1 / 30 The Problem
More informationDoing Multiple Regression with SPSS. In this case, we are interested in the Analyze options so we choose that menu. If gives us a number of choices:
Doing Multiple Regression with SPSS Multiple Regression for Data Already in Data Editor Next we want to specify a multiple regression analysis for these data. The menu bar for SPSS offers several options:
More informationNormal distribution. ) 2 /2σ. 2π σ
Normal distribution The normal distribution is the most widely known and used of all distributions. Because the normal distribution approximates many natural phenomena so well, it has developed into a
More information9.07 Introduction to Statistical Methods Homework 4. Name:
1. Estimating the population standard deviation and variance. Homework #2 contained a problem (#4) on estimating the population standard deviation. In that problem, you showed that the method of estimating
More informationTheory at a Glance (For IES, GATE, PSU)
1. Forecasting Theory at a Glance (For IES, GATE, PSU) Forecasting means estimation of type, quantity and quality of future works e.g. sales etc. It is a calculated economic analysis. 1. Basic elements
More informationA LOGNORMAL MODEL FOR INSURANCE CLAIMS DATA
REVSTAT Statistical Journal Volume 4, Number 2, June 2006, 131 142 A LOGNORMAL MODEL FOR INSURANCE CLAIMS DATA Authors: Daiane Aparecida Zuanetti Departamento de Estatística, Universidade Federal de São
More informationt Tests in Excel The Excel Statistical Master By Mark Harmon Copyright 2011 Mark Harmon
t-tests in Excel By Mark Harmon Copyright 2011 Mark Harmon No part of this publication may be reproduced or distributed without the express permission of the author. mark@excelmasterseries.com www.excelmasterseries.com
More informationChapter 4 and 5 solutions
Chapter 4 and 5 solutions 4.4. Three different washing solutions are being compared to study their effectiveness in retarding bacteria growth in five gallon milk containers. The analysis is done in a laboratory,
More informationNormality Testing in Excel
Normality Testing in Excel By Mark Harmon Copyright 2011 Mark Harmon No part of this publication may be reproduced or distributed without the express permission of the author. mark@excelmasterseries.com
More informationstatistics Chi-square tests and nonparametric Summary sheet from last time: Hypothesis testing Summary sheet from last time: Confidence intervals
Summary sheet from last time: Confidence intervals Confidence intervals take on the usual form: parameter = statistic ± t crit SE(statistic) parameter SE a s e sqrt(1/n + m x 2 /ss xx ) b s e /sqrt(ss
More informationPackage empiricalfdr.deseq2
Type Package Package empiricalfdr.deseq2 May 27, 2015 Title Simulation-Based False Discovery Rate in RNA-Seq Version 1.0.3 Date 2015-05-26 Author Mikhail V. Matz Maintainer Mikhail V. Matz
More informationEconometrics Simple Linear Regression
Econometrics Simple Linear Regression Burcu Eke UC3M Linear equations with one variable Recall what a linear equation is: y = b 0 + b 1 x is a linear equation with one variable, or equivalently, a straight
More informationSIMULATION STUDIES IN STATISTICS WHAT IS A SIMULATION STUDY, AND WHY DO ONE? What is a (Monte Carlo) simulation study, and why do one?
SIMULATION STUDIES IN STATISTICS WHAT IS A SIMULATION STUDY, AND WHY DO ONE? What is a (Monte Carlo) simulation study, and why do one? Simulations for properties of estimators Simulations for properties
More informationHYPOTHESIS TESTING: POWER OF THE TEST
HYPOTHESIS TESTING: POWER OF THE TEST The first 6 steps of the 9-step test of hypothesis are called "the test". These steps are not dependent on the observed data values. When planning a research project,
More informationChapter 3 Quantitative Demand Analysis
Managerial Economics & Business Strategy Chapter 3 uantitative Demand Analysis McGraw-Hill/Irwin Copyright 2010 by the McGraw-Hill Companies, Inc. All rights reserved. Overview I. The Elasticity Concept
More information3. Regression & Exponential Smoothing
3. Regression & Exponential Smoothing 3.1 Forecasting a Single Time Series Two main approaches are traditionally used to model a single time series z 1, z 2,..., z n 1. Models the observation z t as a
More informationLean Certification Program Blended Learning Program Cost: $5500. Course Description
Lean Certification Program Blended Learning Program Cost: $5500 Course Description Lean Certification Program is a disciplined process improvement approach focused on reducing waste, increasing customer
More informationLogit Models for Binary Data
Chapter 3 Logit Models for Binary Data We now turn our attention to regression models for dichotomous data, including logistic regression and probit analysis. These models are appropriate when the response
More information