STATISTICAL DATA ANALYSIS



Similar documents
Appendix B Checklist for the Empirical Cycle

STATISTICS and PROBABILITY Definition Statistics is the science and practice of developing human knowledge through the use of empirical data

Organizing Your Approach to a Data Analysis

Understanding Confidence Intervals and Hypothesis Testing Using Excel Data Table Simulation

How To Check For Differences In The One Way Anova

2. Simple Linear Regression

Intro to Data Analysis, Economic Statistics and Econometrics

IBM SPSS Direct Marketing 23

Fairfield Public Schools

Foundation of Quantitative Data Analysis

Class 19: Two Way Tables, Conditional Distributions, Chi-Square (Text: Sections 2.5; 9.1)

IBM SPSS Direct Marketing 22

This book serves as a guide for those interested in using IBM SPSS

3. Data Analysis, Statistics, and Probability

What is Data Analysis. Kerala School of MathematicsCourse in Statistics for Scientis. Introduction to Data Analysis. Steps in a Statistical Study

Randomization in Clinical Trials

Regression Clustering

4. Are you satisfied with the outcome? Why or why not? Offer a solution and make a new graph (Figure 2).

Excel Charts & Graphs

Appendix G STATISTICAL METHODS INFECTIOUS METHODS STATISTICAL ROADMAP. Prepared in Support of: CDC/NCEH Cross Sectional Assessment Study.

Data Mining: Overview. What is Data Mining?

Tutorial for proteome data analysis using the Perseus software platform

STATS8: Introduction to Biostatistics. Data Exploration. Babak Shahbaba Department of Statistics, UCI

In this presentation, you will be introduced to data mining and the relationship with meaningful use.

MAT 155. Chapter 1 Introduction to Statistics. Key Concept. Basics of Collecting Data. 155S1.5_3 Collecting Sample Data.

Chapter 2 Introduction to SPSS

STAT355 - Probability & Statistics

Statistical & Technical Team

Department/Academic Unit: Public Health Sciences Degree Program: Biostatistics Collaborative Program

Appendix 2.1 Tabular and Graphical Methods Using Excel

Vertical Alignment Colorado Academic Standards 6 th - 7 th - 8 th

How To Run Statistical Tests in Excel

R Commander Tutorial

Introduction to Longitudinal Data Analysis

Clinical Study Design and Methods Terminology

Didacticiel Études de cas

SAS Enterprise Guide in Pharmaceutical Applications: Automated Analysis and Reporting Alex Dmitrienko, Ph.D., Eli Lilly and Company, Indianapolis, IN

The Science and Art of Market Segmentation Using PROC FASTCLUS Mark E. Thompson, Forefront Economics Inc, Beaverton, Oregon

Cluster analysis with SPSS: K-Means Cluster Analysis

Summarizing and Displaying Categorical Data

Basic Concepts in Research and Data Analysis

SPSS Workbook 1 Data Entry : Questionnaire Data

Rachel J. Goldberg, Guideline Research/Atlanta, Inc., Duluth, GA

Power Calculation Using the Online Variance Almanac (Web VA): A User s Guide

Scatter Plots with Error Bars

Importing Data into R

Data Analysis and Statistical Software Workshop. Ted Kasha, B.S. Kimberly Galt, Pharm.D., Ph.D.(c) May 14, 2009

SAS Software to Fit the Generalized Linear Model

The Research Proposal

A Guide to Stat/Transfer File Transfer Utility, Version 10

SPSS and AM statistical software example.

INTRODUCTION TO DATA MINING SAS ENTERPRISE MINER

Creating a Database. Frank Friedenberg, MD

Now we begin our discussion of exploratory data analysis.

WHO STEPS Surveillance Support Materials. STEPS Epi Info Training Guide

In-Depth Guide Advanced Spreadsheet Techniques

BNG 202 Biomechanics Lab. Descriptive statistics and probability distributions I

Recommend Continued CPS Monitoring. 63 (a) 17 (b) 10 (c) (d) 20 (e) 25 (f) 80. Totals/Marginal

Better decision making under uncertain conditions using Monte Carlo Simulation

Chapter 4 Displaying and Describing Categorical Data

How to Get More Value from Your Survey Data

Descriptive Methods Ch. 6 and 7

CHANCE ENCOUNTERS. Making Sense of Hypothesis Tests. Howard Fincher. Learning Development Tutor. Upgrade Study Advice Service

COMMON CORE STATE STANDARDS FOR

HLM software has been one of the leading statistical packages for hierarchical

Two-Group Hypothesis Tests: Excel 2013 T-TEST Command

Motor and Household Insurance: Pricing to Maximise Profit in a Competitive Market

Chapter 23. Inferences for Regression

Using Excel (Microsoft Office 2007 Version) for Graphical Analysis of Data

Using Excel for Analyzing Survey Questionnaires Jennifer Leahy

Selecting Research Participants

Inclusion and Exclusion Criteria

Chapter 5. Microsoft Access

Dynamic Process Modeling. Process Dynamics and Control

Introduction to SPSS 16.0

NON-PROBABILITY SAMPLING TECHNIQUES

MARS STUDENT IMAGING PROJECT

Using Excel for descriptive statistics

Chi Squared and Fisher's Exact Tests. Observed vs Expected Distributions

NCSS Statistical Software

6 3 The Standard Normal Distribution

Data Quality Assessment: A Reviewer s Guide EPA QA/G-9R

InfiniteInsight 6.5 sp4

Formulas & Functions in Microsoft Excel

SA 530 AUDIT SAMPLING. Contents. (Effective for audits of financial statements for periods beginning on or after April 1, 2009) Paragraph(s)

Survey Data Analysis in Stata

Chapter 7. One-way ANOVA

IT462 Lab 5: Clustering with MS SQL Server

Analyzing and interpreting data Evaluation resources from Wilder Research

What is Data Mining? Data Mining (Knowledge discovery in database) Data mining: Basic steps. Mining tasks. Classification: YES, NO

Study Design and Statistical Analysis

Virtual Site Event. Predictive Analytics: What Managers Need to Know. Presented by: Paul Arnest, MS, MBA, PMP February 11, 2015

Advanced Microsoft Excel 2010

Module 223 Major A: Concepts, methods and design in Epidemiology

Transcription:

STATISTICAL DATA ANALYSIS INTRODUCTION Fethullah Karabiber YTU, Fall of 2012

The role of statistical analysis in science This course discusses some statistical methods, which involve applying statistical methods to biological problems. We use empirical evidence to study populations and make informed decisions. To study a population, we measure a set of characteristics, which we refer to as variables. The objective of many scientific studies is to learn about the variation of a specific characteristic (e.g., BMI, disease status) in the population of interest.

The role of statistical analysis in science In many studies, we are interested in possible relationships among different variables. We refer to the variables that are the main focus of our study as the response (or target) variables. In contrast, we call variables that explain or predict the variation in the response variable as explanatory variables or predictors depending on the role of these variables. Statistical analysis begins with a scientific problem usually presented in the form of a hypothesis testing or a prediction problem.

Sampling The samples are selected randomly (i.e., with some probability) from the population. Unless stated otherwise, these randomly selected members of populations are assumed to be independent. The selected members (e.g., people, households, cells) are called sampling units. The individual entities from which we collect information are called observation units, or simply observations. Our sample must be representative of the population, and their environments should be comparable to that of the whole population.

Sampling Some of the most widely used sampling designs Simple Random Sampling: the chance of being selected is the same for any group of n members in the popilation Stratified Sampling: The population is first partitioned in to subpopulation,a.k.a. strata, and sampling is performed seperately within each subpopulation. Cluster Sampling: Group observations units into clusters and then sample from these clusters.

Observational studies and experiments After obtaining the sample, the next step is gathering the relevant information from the selected members. In observational studies, researchers are passive examiners, trying to have the least impact on the data collection process. Observational studies are quite helpful in detecting relationships among characteristics. When studying the relationships between characteristics, it is important to distinguish between association and causality. The realationship is casual if one characteristic influences the other one. It is usually easier to establish causality by using experiments. In experiments, researchers attempt to control the process as much as possible.

Observational studies and experiments Observational Studies: Retrospective: look into history Prospective: observation over time Randomization and Replication Collecting Data Cross-Sectional: at some fixed time Longitudinal: follow the samples over time and repeatedly collect information and take measurements Time Series: over a period of time. It is collected more frequently, but on smaller samples

Data exploration After collecting data, the next step towards statistical inference and decision making is to perform data exploration, which involves visualizing and summarizing the data. The objective of data visualization is to obtain a high level understanding of the sample and their observed (measured) characteristics. To make the data more manageable, we need to further reduce the amount of information in some meaningful ways so that we can focus on the key aspects of the data. Summary statistics are used for this purpose.

Data exploration Using data exploration techniques, we can learn about the distribution of a variable. Informally, the distribution of a variable tells us the possible values it can take, the chance of observing those values, and how often we expect to see them in a random sample from the population. Through data exploration, we might detect previously unknown patterns and relationships that are worth further investigation. We can also identify possible data issues, such as unexpected or unusual measurements, known as outliers.

Statistical inference We collect data on a sample from the population in order to learn about the whole population. For example, Mackowiak, et al. (1992) measure the normal body temperature for 148 people to learn about the normal body temperature for the entire population. In this case, we say we are estimating the unknown population average. However, the characteristics and relationships in the whole population remain unknown. Therefore, there is always some uncertainty associated with our estimations.

Statistical inference In Statistics, the mathematical tool to address uncertainty is probability. The process of using the data to draw conclusions about the whole population, while acknowledging the extent of our uncertainty about our findings, is called statistical inference. The knowledge we acquire from data through statistical inference allows us to make decisions with respect to the scientific problem that motivated our study and our data analysis.

Computation We usually use computer programs to perform most of our statistical analysis and inference. The computer programs commonly used for this purpose are SAS, STATA, SPSS, MINITAB, MATLAB, and R. R is free and arguably the most common software among statisticians. For the purpose of this course, we use R-Commander, which allows us to do basic statistical analysis without necessarily learning the programing language of R. You are however encouraged to learn R for additional flexibility in your data analysis.

Using R-Commander The R-Commander window. Notice the menu bar, Data set box, Edit data set button, View data set button, Script window, Submit button, and Output window

Using R-Commander Importing thepima.tr data from the MASS package. Select the appropriate package and data set Viewing the Pima.tr data in R-Commander. Each row corresponds to a Pima Indian woman in our sample, and each column corresponds to a Variable

Using R-Commander Importing the BodyTemperature data into R-Commander. Enter BodyTemperature as the name of the data set. Accept all other defaults Viewing the BodyTemperature data in R-Commander