1. Data and information




- Random sampling application
- Information base and data collection

References

Data sampling:
Cochran, W.G. (1977) Sampling Techniques. John Wiley & Sons, Inc., New York.
Lohr, S.L. (1999) Sampling: Design and Analysis. Brooks/Cole Publishing Company.

Random Sampling of Farming Systems

- Sampling has to cope with reality rather than with controlled conditions (= trial, laboratory). Certain violations of the rules for random sampling may be unavoidable.
- Lists of all units (farms, households) in a total population rarely exist, particularly in developing countries.
- The design of random sampling plans may depend more on ensuring equal probabilities of selection for the survey units than on decreasing variances within strata.
- The application of justified sampling for gaining representative knowledge is usually hampered by limited prior knowledge of the existing types of farming systems.
- Total populations are far from being infinite.

Basics in data sampling on farming systems

- Select a sample survey plan and identify the respective formulae for estimators before you start the survey.
- Calculate the required or possible size of your sample and the optimal distribution between the strata, taking your resources into account.
- Plan the steps of your survey as precisely as possible in advance. This includes the methodological parts (intended analyses, questionnaire, database structure, sampling plan) as well as the logistic issues (budget, time, staff, transport, required software, data input).
- Plan and conduct the survey as closely as possible to the requirements of the theory of your chosen type of sample survey, but stay pragmatic. Try to adapt the theory to reality and document unavoidable divergence (= limitations of inferences from the sample to the total population).
- Low resources are no excuse for bad planning of a survey!

Sampling Plan

The choice of the sampling plan (simple sampling, stratified sampling, more complex sampling schemes) depends on the specific situation. Criteria are:
- available prior knowledge
- possibilities to assure controlled probabilities of selection
- possibilities to minimize variance between surveyed units (e.g. farming systems)
- capacity limits (funding, time, transport, staff etc.)

The chosen sampling plan determines the required estimators (= formulae for calculating estimates for the total population from the surveyed units).

Estimators 1

An estimator is a function (a formula) that allows the situation in the total population to be estimated from the information in a sample.

Sample -> (information) -> Estimator -> (estimation) -> Total population

Estimators 2

The correct estimators (= formulae) depend on the applied sampling plan.
Estimators are required for:
- parameters of location:
  - "averages" (e.g. mean values)
  - totals
  - frequencies and proportions
- parameters of spread (e.g. variance, standard deviation)

Estimators 3: Example of correct estimators, quantitative data

Sampling plan: simple random sampling (SRS)
- mean value: $\bar{y} = \frac{\sum_i y_i}{n}$
- variance: $s^2 = \frac{\sum_i (y_i - \bar{y})^2}{n - 1}$

Sampling plan: stratified sampling (STS)
- mean value: $\bar{y}_{vp} = \frac{\sum_z N_z \bar{y}_z}{N_{vp}}$
- variance: $s^2_{vp} = \frac{\sum_z N_z s_z^2}{N_{vp}}$

whereby:
- $y_i$ = information from farm i
- $\bar{y}$, $s^2$ = mean and variance (SRS)
- $\bar{y}_z$, $s_z^2$ = mean and variance in stratum z
- $\bar{y}_{vp}$, $s^2_{vp}$ = mean and variance (STS)
- $n$ = number of interviewed farmers
- $N_z$ = total number of farmers in stratum z
- $N_{vp}$ = total number of farmers in the study region
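The estimators above can be tried out with a few lines of Python. This is only a minimal sketch: the function names and the farm-size figures are hypothetical, and the stratified variance is assumed to be the N_z-weighted average of the stratum variances, which is how the slide's formula appears to read.

```python
from statistics import mean, variance


def srs_estimates(values):
    """Mean and variance of a simple random sample (n - 1 in the denominator)."""
    return mean(values), variance(values)


def stratified_estimates(strata):
    """strata: list of (N_z, sample_values) pairs, one pair per stratum.

    Each stratum's sample mean and variance is weighted by N_z / N_vp.
    Assumption: this weighting follows the slide's formula as reconstructed.
    """
    n_vp = sum(n_z for n_z, _ in strata)
    y_vp = sum(n_z * mean(vals) for n_z, vals in strata) / n_vp
    s2_vp = sum(n_z * variance(vals) for n_z, vals in strata) / n_vp
    return y_vp, s2_vp


# Hypothetical data: farm sizes (ha) sampled in two strata
small_farms = (300, [1.0, 2.0, 3.0])     # N_z = 300 farms in the stratum
large_farms = (100, [10.0, 12.0, 14.0])  # N_z = 100 farms in the stratum

y_srs, s2_srs = srs_estimates(small_farms[1])
y_vp, s2_vp = stratified_estimates([small_farms, large_farms])
print(y_srs, s2_srs, y_vp, s2_vp)
```

Note that the large-farm stratum pulls the stratified mean upwards only in proportion to its share of the population, not its share of the sample.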

Sample size (1)

Criteria for sample size:
1. Variance of the desired information within the total population
2. Acceptable non-systematic sampling error
3. Available resources for the survey (logistics, time, etc.)

Two basic approaches:
1. Calculating the sample size based on the accepted error
2. Determining the sample size by available resources and calculating the obtained error in retrospect

Sample size (2)

Calculating the required sample size:
- The required sample size has to be calculated for every single criterion under research.
- Attention: calculating the required sample size with the following formulae presupposes that the criterion concerned is normally distributed in the total population.
- The variance of the criteria concerned in the total population has to be estimated in advance. This can be done by applying general characteristics of a normal distribution.

Example for quantitative criteria: standard deviation = 1/6 of range [= maximal value - minimal value], variance = (range/6)²

Formulae for sample size: Quantitative criteria

small ratio sample/total population:
$n_0 = \frac{t_{1-\alpha/2}^2 \, s^2}{e^2}$

large ratio sample/total population ($n_0$ > 5% of N):
$n = \frac{n_0 N}{n_0 + N - 1}$

whereby: n, n_0 = sample size, t = quantile of t-distribution, s² = variance according to chosen sampling plan, N = total population, e = accepted error in units of the criterion (±)

Formulae for sample size: Qualitative criteria (proportions, frequencies)

small ratio sample/total population:
$n_0 = \frac{t_{1-\alpha/2}^2 \, pq}{e^2}$

large ratio sample/total population ($n_0$ > 5% of N):
$n = \frac{n_0 N}{n_0 + N - 1}$

whereby: n, n_0 = sample size, t = quantile of t-distribution, p = proportion of total population with criterion, q = proportion of total population without criterion, pq = variance according to chosen sampling plan, N = total population, e = accepted error in % (±)

N.B.: These formulae do not consider budget restrictions. Respective considerations (example: Neyman-Tschupow formula) deal with the minimization of error under a given budget in stratified sampling.
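The sample-size calculation can be sketched in Python as follows. This is an illustration under stated assumptions rather than the course's spreadsheet: the function names and example figures are hypothetical, the standard normal quantile is used as a stand-in for the t quantile (reasonable only for larger samples), and the accepted error for proportions is entered as a fraction (0.05) rather than in percent.

```python
import math
from statistics import NormalDist


def quantile(confidence=0.95):
    # Assumption: normal quantile replaces the t quantile (large samples)
    return NormalDist().inv_cdf(1 - (1 - confidence) / 2)


def variance_from_range(minimum, maximum):
    """Rule of thumb from the text: standard deviation = range / 6."""
    return ((maximum - minimum) / 6) ** 2


def sample_size_quantitative(s2, e, N=None, confidence=0.95):
    """Required n for variance s2 and accepted error e (units of the criterion)."""
    n0 = quantile(confidence) ** 2 * s2 / e ** 2
    # Apply the finite-population formula only if n0 exceeds 5% of N
    if N is not None and n0 > 0.05 * N:
        n0 = n0 * N / (n0 + N - 1)
    return math.ceil(n0)


def sample_size_proportion(p, e, N=None, confidence=0.95):
    """Same calculation with variance p * q, where q = 1 - p."""
    return sample_size_quantitative(p * (1 - p), e, N, confidence)


# Hypothetical example: farm sizes between 0 and 60 ha, accepted error +/- 2 ha
s2 = variance_from_range(0, 60)
n_large_pop = sample_size_quantitative(s2, e=2)
n_small_pop = sample_size_quantitative(s2, e=2, N=500)
n_prop = sample_size_proportion(p=0.5, e=0.05)
print(s2, n_large_pop, n_small_pop, n_prop)
```

Halving the accepted error roughly quadruples the required sample, which is worth checking against the available resources before fixing e.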

PC-Exercise: file C31sampDB.xls

Introduction to PC in general and Excel (if required)

Create a copy of the exercise file:
- Open Windows Explorer
- Switch to partition "user of <name of PC-room>", usually partition U:
- Choose subdirectory: kurs\m319
- Copy file sampling1.xls from this subdirectory to your partition (usually H:)
- Start Excel
- Open file sampling1.xls (resp. H:\[your subdirectory]\sampling1.xls)
- Explain the structure of Excel files (multiple worksheets)

1. Exercises with spreadsheet "sample size":
- explain the calculation and interpret the results
- vary (parametrize) the data in the example and the allowed errors, and discuss the resulting impacts on the sample size

Determination of obtained precision

Confidence limits of estimators (see 1.1b) express the obtained precision. The calculation of the confidence limits corresponds to the calculation of the sample size, with the difference that n is known and e is required.

Quantitative criteria:

small ratio sample/total population:
$\pm\, t_{1-\alpha/2} \frac{s}{\sqrt{n}}$

large ratio sample/total population:
$\pm\, t_{1-\alpha/2} \frac{s}{\sqrt{n}} \sqrt{\frac{N - n}{N - 1}}$

whereby: n = sample size, t = quantile of t-distribution, s = standard deviation according to chosen sampling plan, N = total population
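The obtained precision for a quantitative criterion can be computed in retrospect along these lines. A minimal sketch, assuming a simple random sample: the function name and the data are hypothetical, the normal quantile again replaces the t quantile, and the finite-population factor is applied whenever N is supplied.

```python
import math
from statistics import NormalDist, stdev


def half_width(values, N=None, confidence=0.95):
    """Half-width e of the confidence interval around the sample mean."""
    n = len(values)
    # Assumption: normal quantile instead of the t quantile
    t = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    e = t * stdev(values) / math.sqrt(n)
    if N is not None:
        # Correction for a large ratio sample / total population
        e *= math.sqrt((N - n) / (N - 1))
    return e


# Hypothetical sample of eight farms
sample = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
e_infinite = half_width(sample)
e_finite = half_width(sample, N=50)
print(e_infinite, e_finite)
```

When the sample covers the whole population (n = N), the correction factor is zero and no sampling error remains.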

Determination of obtained precision

Qualitative criteria (proportions, frequencies):

large total populations:
$\pm\, t_{1-\alpha/2} \sqrt{\frac{pq}{n}}$

small total populations:
$\pm\, t_{1-\alpha/2} \sqrt{\frac{pq}{n} \cdot \frac{N - n}{N - 1}}$

whereby: n = sample size, t = quantile of t-distribution, p = proportion of total population with criterion, q = proportion of total population without criterion, pq = variance according to chosen sampling plan, N = total population

Data entry and editing

1. Transfer of data from questionnaires and data sheets to files
2. Coding of alphanumeric information
3. Check for data entry errors and correction
4. Identify data problems (extreme values, missing data)
5. Solve data problems and provide operational database
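Step 4 (identifying extreme values) can be sketched with the box-plot rule, which flags values lying more than 1.5 h-spreads beyond the hinges. The code below is an illustration, not the course's Excel or SPSS procedure: the function names and data are hypothetical, and the hinges are computed as medians of the lower and upper halves of the ordered data, which is one common convention among several.

```python
from statistics import median


def hinges(values):
    """Lower and upper hinge: medians of the lower and upper half of the data.

    Assumption: the overall median belongs to both halves when n is odd.
    """
    data = sorted(values)
    half = (len(data) + 1) // 2
    return median(data[:half]), median(data[-half:])


def extreme_values(values, factor=1.5):
    """Values outside [lower hinge - factor*h, upper hinge + factor*h]."""
    lower, upper = hinges(values)
    h_spread = upper - lower
    low_fence = lower - factor * h_spread
    high_fence = upper + factor * h_spread
    return [v for v in values if v < low_fence or v > high_fence]


# Hypothetical revenue figures with one suspicious entry
revenues = [1, 2, 3, 4, 5, 6, 7, 8, 100]
print(hinges(revenues), extreme_values(revenues))
```

A flagged value is only a candidate for checking against the questionnaire; whether it is a data entry error or a genuine observation still has to be decided case by case.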

Identification of outliers - exploratory data analysis

[Figure: box-and-whisker plot of the variable REVPV, marking the upper hinge, the median, the lower hinge, the h-spread, the range of 1.5 x h-spread and one outlier]

Rule for the identification of outliers in box-and-whisker plots: all values above the upper hinge + 1.5 x h-spread and below the lower hinge - 1.5 x h-spread are extreme observations.

Missing values

- The mechanisms that lead to missing data determine the possible solutions for further analyses of the data sets concerned.
- The use of imputation values requires that these mechanisms are ignorable, i.e. not linked to the information content.

Imputation procedures:
- Advantage: use of standard methods for calculations possible
- Disadvantage: no consideration of added uncertainty
- Method: replacement of individual missing values by values that are derived from complete sets of the sample
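Two simple imputation procedures can be sketched as follows: replacing each missing value by the mean of the observed values, and drawing a random value from comparable complete cases of the same survey (hot deck). The function names, the grouping variable and the data are hypothetical, and missing values are assumed to be coded as None.

```python
import random
from statistics import mean


def impute_mean(values):
    """Replace None by the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


def impute_hot_deck(values, groups, seed=0):
    """Replace None by a random observed value from the same group.

    Raises KeyError if a group has no complete case to donate a value.
    """
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    donors = {}
    for value, group in zip(values, groups):
        if value is not None:
            donors.setdefault(group, []).append(value)
    return [
        rng.choice(donors[group]) if value is None else value
        for value, group in zip(values, groups)
    ]


# Hypothetical yields with two missing entries; groups are farm types
yields = [10.0, None, 50.0, None, 54.0]
farm_type = ["a", "a", "b", "b", "b"]
by_mean = impute_mean(yields)
by_hot_deck = impute_hot_deck(yields, farm_type)
print(by_mean, by_hot_deck)
```

Both variants leave the data set formally complete, but neither reflects the added uncertainty, so variances computed afterwards tend to be too small.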

Generation of Imputation Values

- use of mean values, medians or modes
- variance-neutral imputation
- regression values (if covariate data available)
- hot deck imputation (random choice of values from comparable cases in the same survey)
- cold deck imputation (random choice of values from other sources)
- nearest-neighbour imputation (value from next record)

2. Classification (Cluster analyses)

References

Bi- and Multivariate Analysis:
Aldenderfer, M.S.; Blashfield, R.K. (1984) Cluster Analysis. West Hilcrest.
Backhaus, K.; Erichson, B.; Plinke, W.; Weiber, R. (1994) Multivariate Analysemethoden. Springer Verlag, Berlin (in German).
Henze, A. (1994) Marktforschung. UTB 179, Ulmer Verlag, Stuttgart (in German).
Tukey, J.W. (1977) Exploratory Data Analysis. Addison-Wesley Publishing Company, Inc., Reading.
SPSS online help, SPSS version 10.0 upward.

Steps in the Computer Application

The computer exercises follow the most common sequence of application in practice rather than the steps in learning econometrics:
- univariate classification: ordering and identifying the best class boundaries
- multivariate classification: checking for overly strong relationships between the selected classification criteria and application of cluster procedures (cluster algorithm + distance measure)
- testing for differences: statistical tests for checking significant differences between classes (χ²-test, nonparametric tests) and between observations in time (z-test, t-test)
- models of linear dependencies: linear regression, multiple regression, probit and logit models

Univariate classification

PC-Exercise: file C3class.xls, sheet: univ, Software: Excel
- sort the selected (quantitative) classification criterion in ascending order
- determine the differences (Δ) from one value to the next within this order
- check for disproportionately large steps (graphically and/or numerically) -> preliminary class borders
- determine the homogeneity (coefficient of variation within classes = standard deviation / mean value) and the heterogeneity (distance between class means) of the preliminary classes
- check whether moving border cases improves the measures of heterogeneity and homogeneity

Multivariate classification

PC-Exercise: file C3class.xls, sheet: multi (+ derived sheets), Software: Excel, SPSS
- import "multi" into SPSS, check for linear correlations and exclude one of each pair of too highly correlated variables from the further process
- set up the cluster procedure (selection of distance measure, cluster algorithm and standardization of classification variables)
- the exploratory approach: interpret results (dendrograms) from different sets of procedures (size and number of clusters, development of homogeneity within clusters)
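The univariate procedure (sort, take differences, look for large steps, then check homogeneity) can be sketched in Python. The text does not define when a step counts as disproportionately large, so the rule used here, a step greater than twice the mean step, is purely an assumption, as are the function names and the data.

```python
from statistics import mean, stdev


def classify_by_gaps(values, gap_factor=2.0):
    """Split sorted values wherever a step exceeds gap_factor * mean step.

    Assumption: gap_factor is an arbitrary threshold chosen for illustration.
    """
    data = sorted(values)
    steps = [b - a for a, b in zip(data, data[1:])]
    threshold = gap_factor * mean(steps)
    classes, current = [], [data[0]]
    for value, step in zip(data[1:], steps):
        if step > threshold:  # large step: close the class, start a new one
            classes.append(current)
            current = []
        current.append(value)
    classes.append(current)
    return classes


def coefficient_of_variation(cls):
    """Homogeneity measure: standard deviation / mean within one class."""
    return stdev(cls) / mean(cls) if len(cls) > 1 else 0.0


# Hypothetical farm sizes (ha)
sizes = [6.4, 1.0, 15.5, 1.5, 6.0, 15.0, 1.2, 6.9]
classes = classify_by_gaps(sizes)
for cls in classes:
    print(cls, round(mean(cls), 2), round(coefficient_of_variation(cls), 3))
```

Border cases can then be moved by hand between neighbouring classes to see whether the coefficients of variation fall and the class means move further apart.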

Testing for differences

PC-Exercise: file C3class.xls, sheet: multi (+ derived sheets), Software: Excel, SPSS
- Checking significant differences between clusters for selected results: nonparametric tests for quantitative variables, χ²-test for qualitative variables in SPSS (tables for statistical tests built in)
- Interpretation of test results towards a description of the assumed clusters
- Checking significant differences between the current sample and information from the past (z-test, t-test) in Excel (use of a printed table of test values for the standard normal distribution)

probability   test value
90%           1.2816
92.5%         1.4395
95%           1.6449
97.5%         1.9600

- Design of family models
- Application of family models
- Uncertainty and risk
- Gap analysis and interpretation
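The comparison of the current sample with a value known from the past can be sketched as a z-test in Python. The standard normal quantiles from the printed table are reproduced with the standard library; the function name, the data and the past mean are hypothetical, and with a sample as small as the one below a t-test would normally be preferred.

```python
import math
from statistics import NormalDist, mean, stdev

# Test values of the standard normal distribution for selected probabilities
table = {p: round(NormalDist().inv_cdf(p), 4) for p in (0.90, 0.925, 0.95, 0.975)}


def z_test(sample, past_mean, confidence=0.95):
    """Two-sided z-test of the sample mean against a past mean.

    Assumption: the sample standard deviation stands in for the
    population standard deviation.
    """
    n = len(sample)
    z = (mean(sample) - past_mean) / (stdev(sample) / math.sqrt(n))
    critical = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return z, critical, abs(z) > critical


# Hypothetical current yields compared with a past mean of 50
current = [52, 48, 55, 60, 50, 58, 53, 56]
z, critical, significant = z_test(current, past_mean=50)
print(table, round(z, 4), round(critical, 4), significant)
```

For a two-sided test at the 95% level the relevant table entry is the 97.5% quantile, because the error probability is split between both tails.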

References

Modelling in General:
Dantzig, G.B. (1963) Linear Programming and Extensions. Princeton University Press, Princeton, N.J.
France, J.; Thornley, J.H.M. (1984) Mathematical Models in Agriculture. Butterworth, London.
http://www.solver.com/tutorial.htm

MOTAD:
Hazell, P.B.R. (1971) A linear alternative to quadratic and semivariance programming for farm planning under uncertainty. American Journal of Agricultural Economics, 53, pp. 53-62.
Doppler, W.; Salman, A.Z.; Al-Karablieh, E.K.; Wolff, H.-P. (2002) The impact of water price strategies on the allocation of irrigation water - the case of the Jordan Valley. Agricultural Water Management, 55, Elsevier Science Ltd., pp. 171-182.

LP-Models in EXCEL

- An LP matrix can be set up in several ways in EXCEL. The one used within M319 is just one alternative. The required elements are, however, always the same.
- SOLVER is provided as a standard add-in to EXCEL, which is why it is used in the module.
- Other software (e.g. XA or GAMS) is suitable as well, in some respects even better, but it has to be purchased and its use has to be learned.

Set-Up of the LP Matrix

PC-Exercise: file C33mod1.xls, sheet: LP1, Software: Excel

Exercises with spreadsheet "LP (1)":
- set up a planning matrix
- transform the planning matrix for use with the EXCEL Solver
- explain the EXCEL Solver settings: cells, constraints etc.
- explain the reports on solution, sensitivity and limits
- explain the use of ranges and the SUMPRODUCT command
- run the basic model and discuss the resulting reports on (1) the optimal solution and (2) the sensitivity analysis

Parameterization

PC-Exercise: file C33mod1.xls, sheet: LP2, Software: Excel

Exercises with spreadsheet "LP (2)":
- explain and demonstrate the approach of parametrizing
- explain and apply integer constraints
- add additional activities and constraints
- run the model with changed parameters and discuss the resulting reports on (1) the optimal solution and (2) the sensitivity analysis
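The role of SUMPRODUCT in the planning matrix can be illustrated outside Excel. In the sketch below, the objective value and each constraint's resource use are the sum of products of a coefficient row and the activity levels, which is what the Solver evaluates for every candidate plan. All names, coefficients and capacities are hypothetical, and the code only evaluates given plans; it does not search for the optimum.

```python
def sumproduct(xs, ys):
    """Equivalent of Excel's SUMPRODUCT for two equally long ranges."""
    return sum(x * y for x, y in zip(xs, ys))


# Hypothetical planning matrix: two activities (maize, beans), levels in ha
gross_margin = [400.0, 300.0]              # objective row, per ha
requirements = {
    "land": [1.0, 1.0],                    # ha required per ha of activity
    "labour": [20.0, 10.0],                # days required per ha of activity
}
capacities = {"land": 10.0, "labour": 150.0}


def evaluate(plan):
    """Total gross margin, resource use and feasibility of one plan."""
    use = {name: sumproduct(row, plan) for name, row in requirements.items()}
    feasible = all(use[name] <= capacities[name] + 1e-9 for name in use)
    return sumproduct(gross_margin, plan), use, feasible


total, use, feasible = evaluate([5.0, 5.0])
print(total, use, feasible)
print(evaluate([10.0, 0.0]))  # exceeds the labour capacity
```

Parametrizing the model corresponds to changing entries in `gross_margin`, `requirements` or `capacities` and evaluating again.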

A brief introduction to MOTAD models

PC-Exercise: file C33mod.xls, Software: Excel, Solver
- The MOTAD approach is a linear approximation of the (µ, σ) criterion (which is also referred to in the literature as the E-V model, cf. also lecture chapter 6).
- MOTAD = Minimization Of Total Absolute Deviation (i.e. it uses a deviation measure rather than the variance to measure the variability of return).
- Advantage over quadratic programming (E-V models): the solution of the model requires a linear algorithm only.

Required data for a MOTAD model

- Available capacities, required capacities per realized unit of the activities, contribution of the alternative activities to the objective function (= data requirements of an E-model, i.e. a model based on one expected value per activity only)
- The distribution (= variation) of the alternatives' contributions to the objective function
- A "realistic" idea of the desired total expected value (e.g. total gross margin). "Realistic" means equal to or less than the maximum return from an LP model that is based on expected values.

Applied Method

- Set up your basic LP model, but formulate the contribution of your activities to the objective function as a constraint that is forced to yield the expected total gross margin (the respective cells in the objective function stay 0).
- Add constraints for the absolute deviation of the values in your time series (= mean value of the total time series - observed value in t_n). These constraints must be larger than their RHS value of 0 in the optimal solution.
- Add adjustment activities (columns) that allow for a stepwise (1 step = 1) reduction of the absolute deviation and deliver 1 "unit of variation" to the objective function.

Results

- MOTAD allows for the calculation of a series of optimal combinations of activities for different attitudes towards risk.
- The final selection among the optimal combinations depends on the individual farmer's choice and refers to his specific utility function (i.e. his preference with regard to the combination of expected income and related uncertainty).
- For the mathematical background and justification refer to Hazell (1971).
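The deviation measure that a MOTAD model minimizes can be computed for a given plan as sketched below. This only evaluates plans; it does not solve the LP, and it is not Hazell's full formulation. The function name, the gross-margin time series and the plans are hypothetical.

```python
from statistics import mean


def expected_and_deviation(margins_by_year, plan):
    """Expected total gross margin and total absolute deviation of a plan.

    margins_by_year: one row per year, one gross margin per activity.
    plan: activity levels, in the same order as the columns.
    """
    n_act = len(plan)
    means = [mean(year[j] for year in margins_by_year) for j in range(n_act)]
    # Yearly deviation of the plan's total return from its expected value
    deviations = [
        sum((year[j] - means[j]) * plan[j] for j in range(n_act))
        for year in margins_by_year
    ]
    expected = sum(m * x for m, x in zip(means, plan))
    return expected, sum(abs(d) for d in deviations)


# Hypothetical gross margins per ha over three years: [maize, beans]
series = [[300, 300], [500, 280], [400, 320]]
mixed = expected_and_deviation(series, [5, 5])
beans_only = expected_and_deviation(series, [0, 10])
print(mixed, beans_only)
```

In this made-up example the mixed plan has the higher expected gross margin but also the larger total absolute deviation, which is the kind of trade-off a farmer would choose along according to his attitude towards risk.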