Business Statistics: Intorduction Donglei Du (ddu@unb.edu) Faculty of Business Administration, University of New Brunswick, NB Canada Fredericton E3B 9Y2 September 23, 2015 Donglei Du (UNB) AlgoTrading September 23, 2015 1 / 45
Table of contents 1 Why Statistics? 2 What is Statistics? 3 Two methodologies in Science 4 Variables and types 5 Sources of Statistical Data 6 Software for Statistical Analysis 7 Materials to learn R 8 A brief tutorial of R with a case study Donglei Du (UNB) AlgoTrading September 23, 2015 2 / 45
Section 1 Why Statistics? Donglei Du (UNB) AlgoTrading September 23, 2015 3 / 45
Figure: Source: http://mechanicalforex.com/2015/05/ building-algorithmic-trading-systems-for-the-forex-market-par html Donglei Du (UNB) AlgoTrading September 23, 2015 4 / 45
Why Statistics? This is a required course for your degree. It is a prerequisite for many other topics. Data everywhere, particularly in this big data era. Sampling vs censusing. Decision Making: Statistics will help you make important decision. Donglei Du (UNB) AlgoTrading September 23, 2015 5 / 45
Data everywhere: Example 1: productivity and standard of living of a nation 80 8 0.8 wages Bulgaria Egypt India Pakistan Indonesia Italy France Germany Canada Japan 2 20 200 Labor productivity Figure: High productivity is a key to high standard of living U.K. US Two important statistics that a nation is most concerned are the productivity and standard of living. Productivity is usually measured in terms of output per worker, and standard of living is measured in terms of wages per worker. They are usually strongly related. Countries with high productivity in general are seen with high standard of living (Left figure) Donglei Du (UNB) AlgoTrading September 23, 2015 6 / 45
Data everywhere: Example 2: Consumer Price Index (CPI) in Canada The Consumer Price Index measures the rate of price change for goods and services bought by consumers in a country. It is a statistic constructed using the prices of a sample of representative items whose prices are collected periodically. For instance, the CPI All-items for Canada for the month of July 2013 was 123.1 (2000=100), meaning that consumer prices were 23.1% higher in July 2013 than in 2000. [hyphens]http://www.statcan.gc.ca/tables-tableaux/sumsom/l01/cst01/cpis01a-eng.htm Donglei Du (UNB) AlgoTrading September 23, 2015 7 / 45
Data everywhere: Example 2: Consumer Price Index (CPI) in Canada The CPI directly or indirectly affects nearly all Canadians. Old Age Security pensions, Canada Pension Plan payments, and other forms of social and welfare payments are adjusted periodically to take account of changes in the CPI. Rental agreements, spousal and child support payments and other forms of contractual and price-setting arrangements are frequently tied in some manner to movements in the CPI. Cost-of-living adjustment (COLA) clauses link wage increases to movements in the CPI. Labour contracts governing the wages of many Canadian workers include COLA clauses. Donglei Du (UNB) AlgoTrading September 23, 2015 8 / 45
Sampling vs censusing? Costs of surveying the entire population may be too large or prohibitive e.g., Television networks monitor the popularity of their programs; Destruction of elements during investigation e.g., Manufacturers estimate the average lifetime of light bulbs; doctors take a blood sample to check for disease. Unknown future e.g., stock index; temperature tomorrow Donglei Du (UNB) AlgoTrading September 23, 2015 9 / 45
Decision Making How do large retailers (like COSTO, Walmart) fill their store shelves so as to meet the customer demand while minimizing their operating cost? How do doctors in the hospital make diagnose? How do political leaders run their campaign? How do Investment Banks in Wall Street (or Bay Street) decide which stock (or stocks) to invest? How do insurance companies decide the premium for a particular client? How do car dealers decide how many car models of each brand to be kept in their locations? Donglei Du (UNB) AlgoTrading September 23, 2015 10 / 45
Section 2 What is Statistics? Donglei Du (UNB) AlgoTrading September 23, 2015 11 / 45
What is Statistics? It is the science and art of collecting, organizing, and representing data in such a way that the characteristics and patterns of the data can be easily captured (Descriptive Statistics); also, of estimating attributes and drawing inference from a sample about the entire population (Inferential Statistics). Donglei Du (UNB) AlgoTrading September 23, 2015 12 / 45
Examples: descriptive or inferential? In 1995, 45% of Canadian households owned a computer and 25% were connected to internet; On average, Canadians spend 1.3 hours per day commuting, and 1.5 hours per day with their children. descriptive The accounting department of a firm will select a sample of invoices to check for accuracy of all the invoices of the company. inferential Donglei Du (UNB) AlgoTrading September 23, 2015 13 / 45
Terminologies Population vs sample Population: a set of all elements of interests in an investigation Sample: a subset of all elements of a population Parameter vs Statistic Parameter: a measurable characteristic of a Population (such as average, extreme, variation, proportion) Statistic: a measurable characteristic of a Sample Donglei Du (UNB) AlgoTrading September 23, 2015 14 / 45
Section 3 Two methodologies in Science Donglei Du (UNB) AlgoTrading September 23, 2015 15 / 45
Two methodologies in Science Deductive: general particular Mathematics: Axioms + logic Inductive: particular general Most natural sciences, like physics, chemistry, and Statistics... a.k.a, the method of Experimentation Donglei Du (UNB) AlgoTrading September 23, 2015 16 / 45
The method of Experimentation 1 Define the experimental goal or a working hypothesis 2 Design an experiment 3 Collect and represent data 4 Estimate the values/relations 5 Draw inferences 6 Predict and prepare policy analysis Donglei Du (UNB) AlgoTrading September 23, 2015 17 / 45
Section 4 Variables and types Donglei Du (UNB) AlgoTrading September 23, 2015 18 / 45
Variable 1 A variable is a characteristic/attribute of a population/sample that is interest in a particular investigation and the value of the variable can "vary" from one entity to another. A person s gender is a variable, which could have the value of "Male" for one person and "Female" for another. The rank of faculty members in Business Administration is a variable, which could have the value of "Full Professor" for one person, "Associate Professor" for another, and " Assistant Professor" for yet another. Temperatures in this classroom is a variable which could have the value of "20" or "100". Annual salary of NBA players (or hockey players in Canada), which could have the value of "10M" or "5M". Donglei Du (UNB) AlgoTrading September 23, 2015 19 / 45
Types of Data: Qualitative vs. Quantitative Variables Qualitative (a.k.a. categorical) Qualitative variables take on values that are names or labels. Examples: gender, country names, color Quantitative (a.k.a, numeric) Quantitative variables are numeric. Examples: number of bedrooms in houses, number of minutes to the end of this class, distance between two cities Donglei Du (UNB) AlgoTrading September 23, 2015 20 / 45
Types of Data: Discrete vs. Continuous Variables Quantitative variables can be further classified as discrete or continuous Continuous Variable: a variable can take on any value within a range; Examples: the number of minutes to the end of this class, distance between two cities, pressure in a tire, weight of a pork chop, height of students in a class Discrete variable: a variable can take only certain value (finite or countably infinite) within a range. Examples: number of bedrooms in houses Donglei Du (UNB) AlgoTrading September 23, 2015 21 / 45
Data/variable representation Language such as English, French, Chinese: This is the natural and direct way Such as "All the people in Canada" Numbers: This is the mathematical and indirect way such as, data representation in computer: binary numbers: 0,1 Donglei Du (UNB) AlgoTrading September 23, 2015 22 / 45
Level of measurement Suppose now we represent all data /variable using numbers, then Level of measurement (a.k.a. scales of measure) are the different ways numbers can be used. There are four levels of measurements Nominal Ordinal Interval Ratio However, representing variables as numbers does not give you the license to perform the regular logical/arithmetic operations all the time (such as comparison, addition, subtraction, multiplication, and division etc.); or to infer anything about the magnitude or quantitative difference between the numbers. Donglei Du (UNB) AlgoTrading September 23, 2015 23 / 45
Nominal level A variable is at the nominal level if none of the five operations (namely comparison, addition, subtraction, multiplication, and division) is meaningful. At the nominal level of measurement, numbers are assigned to a set of mutually exclusive and exhaustive categories for the purpose of naming, labeling, or classifying the observations, but no arithmetic operation is meaningful; where Mutually exclusive: any individual object is included in ONLY ONE CATEGORY Exhaustive: any individual object MUST APPEAR in one of the categories Donglei Du (UNB) AlgoTrading September 23, 2015 24 / 45
Examples Examples Barcodes Social insurance numbers (SIN) Student IDs Phones numbers The fact that the barcode for one product is higher than that for another, or that your SIN is higher than mine tells us nothing. In surveys we often use arbitrary numbers to code variables such as religion, ethnicity, major in college or gender. Donglei Du (UNB) AlgoTrading September 23, 2015 25 / 45
Ordinal level A variable is at the ordinal level if only comparison is meaningful. Donglei Du (UNB) AlgoTrading September 23, 2015 26 / 45
Examples Examples Rank of faculty members Maclean ranking of Canadian colleges: [hyphens]http://oncampus.macleans.ca/education/2012/11/01/20 comprehensive/ US News Ranking of US/world colleges http://www.usnews.com/rankings The differences between data values cannot be determined or are meaningless. For instance, first class is better than economy and that business is in between. Just how much better first class is compared to business, and business compared to economy varies from airline to airline, and even from flight to flight. Donglei Du (UNB) AlgoTrading September 23, 2015 27 / 45
Interval A variable is at the interval level if only division is meaningless. Reason: Interval data have meaningful intervals between measurements, but there is no true starting point (zero). Donglei Du (UNB) AlgoTrading September 23, 2015 28 / 45
Examples Examples Temperatures in Celsius and Fahrenheit are interval data Certainly order is important and intervals are meaningful. However, a 20 C dashboard is not twice as hot as the 10 C outside. A conversion formula between Celsius and Fahrenheit F = 9 5 C + 32 So 10 C = 50 F and 20 C = 68 F. Obviously 20 C 10 C = 68 F 50 F Calender year: arbitrary 0 year: the birth of Jesus IQ score: non-arbitrary 0 score: Donglei Du (UNB) AlgoTrading September 23, 2015 29 / 45
Ratio A variable is at the ratio level if all logical/arithmetic operations are meaningful. Reason: Ratios between measurements as well as intervals are meaningful because there is a non-arbitrary zero point. Donglei Du (UNB) AlgoTrading September 23, 2015 30 / 45
Examples Examples Income Distance Height Donglei Du (UNB) AlgoTrading September 23, 2015 31 / 45
Summary of level of measurement comparison addition subtraction multiplication division nominal x x x x x ordinal x x x x interval x x ratio Donglei Du (UNB) AlgoTrading September 23, 2015 32 / 45
A general method for identifying the level of measurement Ask yourself the following three questions: Is order meaningful? No! then the data is nominal Is difference meaningful? No! then the data is ordinal Is zero meaningful No! then the data is interval Yes! then the data is ratio Donglei Du (UNB) AlgoTrading September 23, 2015 33 / 45
Section 5 Sources of Statistical Data Donglei Du (UNB) AlgoTrading September 23, 2015 34 / 45
Sources of Statistical Data Statistics Canada: http://www.statcan.ca Twitter: www.twitter.com Facebook: www.facebook.com.... Donglei Du (UNB) AlgoTrading September 23, 2015 35 / 45
Section 6 Software for Statistical Analysis Donglei Du (UNB) AlgoTrading September 23, 2015 36 / 45
Software for Statistical Analysis Many, but the most popular ones are EXCEL (proprietary) SAS (proprietary) R (open-source).... Donglei Du (UNB) AlgoTrading September 23, 2015 37 / 45
Section 7 Materials to learn R Donglei Du (UNB) AlgoTrading September 23, 2015 38 / 45
Some online resources to learn R I R in a nutshell An introduction to R R for Beginners Try R Quick-R The art of R Programming www.coursera.org offers some nice courses on R Such as this one Donglei Du (UNB) AlgoTrading September 23, 2015 39 / 45
Section 8 A brief tutorial of R with a case study Donglei Du (UNB) AlgoTrading September 23, 2015 40 / 45
R for Statistical Analysis We will retrieve some stock data via package quantmod from Yahoo Finance Donglei Du (UNB) AlgoTrading September 23, 2015 41 / 45
Load package rm(list=ls()) require("quantmod") Donglei Du (UNB) AlgoTrading September 23, 2015 42 / 45
Parameters startdate <- '1957-03-04' # start of data enddate <- '2014-01-06' # end of data Sys.setenv(TZ = "UTC") Donglei Du (UNB) AlgoTrading September 23, 2015 43 / 45
Retrieve data symbols<-c("^gspc") #symbols<-c("aapl","fb", "YHOO") if(file.exists("data/gspc.rdata")) { load("data/gspc.rdata") #GSPC_csv <- read.csv("data/gspc.csv") } else { getsymbols(symbols, src = "yahoo", # OHLC format from = startdate, #to = enddate, index.class=c("posixt","posixct"), # Recommended warnings = FALSE, adjust=true) dir.create(file.path('data'), showwarnings = TRUE) save(list="gspc", file="data/gspc.rdata") Donglei Du (UNB) AlgoTrading September 23, 2015 44 / 45
Plot options(width=60) chartseries(gspc["2004::2014"]) addmacd() addbbands() Donglei Du (UNB) AlgoTrading September 23, 2015 45 / 45