Statistical Analysis of Gene Expression Data With Oracle & R (- data mining)

1 Statistical Analysis of Gene Expression Data With Oracle & R (- data mining)
Patrick E. Hoffman, Sc.D.
Senior Principal Analytical Consultant
pat.hoffman@oracle.com

2 Agenda (Oracle & R Analysis)
Tools
Loading Data
Statistical Analysis
  Most Important Genes (MDL)
  Correlations
  t statistics, etc.
R Visualizations
Predictive models?
Other?

3 Analysis Tools
Tools for DB and PL/SQL development
  ODM (Oracle Data Miner)
  TOAD (from Quest)
  JDeveloper (free from Oracle): Java & PL/SQL development
  Enterprise Manager Console (Oracle client): managing the whole DB
  SQL*Plus (command-line SQL or PL/SQL)
R Project: open source clone of S-Plus
TextPad (editor)

4 The data
Affy gene expression (old AML/ALL)
7129 genes
72 patients (combined train/test)
Can be expanded to the new chips (50,000 genes)

6 Loading Data into Oracle DB?
From flat file?
  SQL*Loader
  Oracle Warehouse Builder: production (auto sql*ldr), all types of data
  Oracle Data Miner (ODM): quick & easy
From R-Project tables

7 Screen Shots: ODMr (Oracle Data Miner) to load a CSV file

14 Load Data From R to DB
Set up ODBC driver (in Windows)
Download RODBC pkg (from CRAN)
Load CSV file
sqlSave routine to store in DB

15 R code: send CSV to DB
# R script to load a CSV file into an Oracle DB.
# Install the RODBC package from CRAN first (menu option); it connects R to a database.
library(RODBC)

# Use a standard Microsoft ODBC connection.
# Set up a DSN name that connects to the correct service;
# ODBC must be set up to connect to the Oracle DB.
chan1 <- odbcConnect(dsn = "")
odbcGetInfo(chan1)  # make sure the channel is OK

# Load in a CSV file
filename <- "D:\\clients\\affy\\gs_EXPRESSION.csv"
csv <- read.csv(filename)

# Drop the old table
sqlDrop(chan1, "GS_EXPRESSION", errors = TRUE)

# Load the CSV data into a table in the Oracle DB
sqlSave(chan1, csv, tablename = "GS_EXPRESSION",
        rownames = FALSE, colnames = TRUE,
        fast = FALSE)  # fast did not work on 10g

16 7000 genes? 1000 column DB limit? Gone!
Transactional Format or Nested Columns
Convert flat file table to one of these formats

17 Analysis: PL/SQL Package affy
affy_to_trans: flat to transactional
affy_ai: calculate MDL attribute importance
trans_to_affy: convert top genes to flat form
corr_genes: correlate genes to genes
corr_cases: correlate cases to cases
corr_target: correlate genes to target
T statistic: calculate t-statistics
anova: stats for multiple classes

18 Convert gene expression table to transactional format
Load data into table GS_EXPRESSION
Use Affy package:
exec affy.affy_to_trans('gs_expression');
OUTPUT will be in the table AFFY_TRANS

19 Normal format

20 Transactional Format (72*700)
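
The flat-to-transactional conversion can also be sketched in base R. This is only an illustration on made-up values, not the affy_to_trans procedure itself. It assumes a flat layout with a gene column plus one column per case (as the later R correlation slide suggests) and uses the caseid, attrib and attr_value column names that appear in the SQL slides.

```r
# Toy flat table: one row per gene, one column per case (values are made up).
flat <- data.frame(
  gene  = c("G1", "G2"),
  case1 = c(10, 5),
  case2 = c(20, 15),
  case3 = c(30, 25)
)

# Stack the case columns into (caseid, attrib, attr_value) rows.
cases <- setdiff(names(flat), "gene")
trans <- data.frame(
  caseid     = rep(cases, each = nrow(flat)),
  attrib     = rep(flat$gene, times = length(cases)),
  attr_value = unlist(flat[cases], use.names = FALSE)
)
print(trans)
```

Each (case, gene) pair becomes one row, so the row count grows to cases times genes while the column count stays at three.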

21 PLSQL Affy_to_Trans

22 ODM will give stats

23 And Histograms

24 For Transactional & Flat Tables

25 Histogram for One Gene

26 Attribute Importance
Target is ALL or AML
This is the Minimum Description Length (MDL) algorithm

27 ODM GUI or plsql API

28 Top Genes by MDL (no normalization)

29 Classification & Clustering (ODM)
Transactional or flat tables
SVM, Naïve Bayes, Adaptive Bayes
Java, PL/SQL API
Java-based GUI
Advanced K-means, Orthogonal Clustering
Lift and ROC curves

30 10g Statistics & SQL Analytics
Ranking functions: rank, dense_rank, cume_dist, percent_rank, ntile
Window aggregate functions (moving and cumulative): avg, sum, min, max, count, variance, stddev, first_value, last_value
LAG/LEAD functions: direct inter-row reference using offsets
Reporting aggregate functions: sum, avg, min, max, variance, stddev, count, ratio_to_report
Statistical aggregates: correlation, linear regression family, covariance
Linear regression: fitting of an ordinary-least-squares regression line to a set of number pairs. Frequently combined with the COVAR_POP, COVAR_SAMP, and CORR functions.
Descriptive statistics: average, standard deviation, variance, min, max, median (via percentile_cont), mode, group-by & roll-up
DBMS_STAT_FUNCS: summarizes numerical columns of a table and returns count, min, max, range, mean, stats_mode, variance, standard deviation, median, quantile values, +/- 3 sigma values, top/bottom 5 values
Correlations: Pearson's correlation coefficients, Spearman's and Kendall's (both nonparametric)
Cross tabs: enhanced with % statistics: chi squared, phi coefficient, Cramer's V, contingency coefficient, Cohen's kappa
Hypothesis testing: Student t-test, F-test, binomial test, Wilcoxon signed ranks test, chi-square, Mann-Whitney test, Kolmogorov-Smirnov test, one-way ANOVA
Distribution fitting: Kolmogorov-Smirnov test, Anderson-Darling test, chi-squared test, normal, uniform, Weibull, exponential
Pareto analysis (documented): 80:20 rule, cumulative results table

31 Analytical Functions with TX format
Correlating cases with cases
Correlating genes with genes
Correlating genes with TARGET (ALL/AML)
T-statistics
ANOVA

32 Correlating Cases
-- put the caseids to correlate in aftran2
create view aftran2 as
  select distinct(caseid) from affy_trans;

create table pcor1 as
select a.caseid p1, b.caseid p2,
       corr(a.attr_value, b.attr_value) corr
from affy_trans a, affy_trans b
where a.caseid < b.caseid
  and a.attrib = b.attrib
  and a.caseid in (select * from aftran2)
group by a.caseid, b.caseid
having corr(a.attr_value, b.attr_value) > .93
    or corr(a.attr_value, b.attr_value) < -.93;

33 199 case pairs correlated > .93
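
The same case-to-case filter can be sketched in base R for comparison. This is a toy example on made-up values, not the corr_cases routine. It assumes the transactional column names caseid, attrib and attr_value and reuses the .93 cutoff from the query.

```r
# Toy transactional data: 3 cases x 4 genes (values are made up).
trans <- data.frame(
  caseid     = rep(c("c1", "c2", "c3"), each = 4),
  attrib     = rep(c("g1", "g2", "g3", "g4"), times = 3),
  attr_value = c(1, 2, 3, 4,   2, 4, 6, 9,   3, 1, 4, 2)
)

# Pivot to a genes x cases matrix, then correlate the case columns.
wide <- tapply(trans$attr_value, list(trans$attrib, trans$caseid), sum)
cm <- cor(wide)

# Keep each pair once (upper triangle) and apply the |r| > .93 cutoff.
idx <- which(upper.tri(cm) & abs(cm) > 0.93, arr.ind = TRUE)
hits <- data.frame(
  p1   = rownames(cm)[idx[, "row"]],
  p2   = colnames(cm)[idx[, "col"]],
  corr = cm[idx]
)
print(hits)
```

Taking only the upper triangle plays the role of the a.caseid < b.caseid condition: each pair is reported once and self-correlations are skipped.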

34 Correlating Genes
select a.attrib p1, b.attrib g2,
       corr(a.attr_value, b.attr_value) corr
from affy_trans a, affy_trans b
where a.attrib < b.attrib
  and a.seqnum = b.seqnum
  and a.attrib = 'X95735_at'  -- zyxin
group by a.attrib, b.attrib
having corr(a.attr_value, b.attr_value) > .5
    or corr(a.attr_value, b.attr_value) < -.5;

35 208 genes correlate with zyxin across 72 patients

36 Correlating Genes with TARGET
select a.attrib p1, b.attrib g2,
       corr(a.attr_value, b.attr_value) corr
from affy_trans a, affy_trans b
where a.attrib < b.attrib
  and a.seqnum = b.seqnum
  and a.attrib = 'TARGET'  -- AML/ALL
group by a.attrib, b.attrib
having corr(a.attr_value, b.attr_value) > .5
    or corr(a.attr_value, b.attr_value) < -.5;

37 57 genes correlate with TARGET
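
As a rough R counterpart, each gene can be correlated with a numerically coded class label. The data below are made up, and the 0/1 coding of ALL/AML is an assumption; the slides do not say how TARGET is encoded in AFFY_TRANS.

```r
# Made-up expression matrix: 6 cases (rows) x 3 genes (columns).
expr <- cbind(
  g1 = c(1, 2, 1, 8, 9, 8),
  g2 = c(5, 4, 6, 5, 4, 6),
  g3 = c(9, 8, 9, 2, 1, 2)
)
# Class label coded numerically: 0 = ALL, 1 = AML (assumed coding).
target <- c(0, 0, 0, 1, 1, 1)

# Pearson correlation of every gene column with the target.
r <- cor(expr, target)[, 1]

# Same cutoff as the query: keep genes with |r| > .5.
selected <- r[abs(r) > 0.5]
print(selected)
```

With a two-level target, Pearson correlation against the 0/1 label is the point-biserial correlation, so a large absolute value indicates a gene whose mean expression differs between the classes.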

38 R Correlations
library(RODBC)
chan1 <- odbcConnect(dsn = "")
odbcGetInfo(chan1)  # make sure the channel is OK

# Get gene expression data from the DB, but drop the gene name
gs1 <- sqlQuery(chan1, query = "create table gs1 as SELECT * FROM GS_EXPRESSION")
gs1 <- sqlQuery(chan1, query = "alter table gs1 drop column gene")
gs <- sqlQuery(chan1, query = "SELECT * FROM GS1")

# Get the correlation matrix
cm <- cor(gs, use = "pairwise.complete.obs")

# Write the new table to a file
write.table(cm, file = "d:\\clients\\affy\\gs_cm.csv",
            append = FALSE, quote = FALSE, sep = ",", eol = "\n",
            na = "", dec = ".", row.names = TRUE, col.names = TRUE)

heatmap(cm, Rowv = NA, Colv = NA, symm = TRUE)

39 Case Correlation Matrix

40 Heatmap from R Correlation Matrix

41 Other Statistics
CORR_S: Spearman's rho correlation coefficient
CORR_K: Kendall's tau-b correlation coefficient
STATS_T_TEST_INDEP: equal variance
STATS_T_TEST_INDEPU: unequal variance
STATS_ONE_WAY_ANOVA
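
For readers working on the R side, the rank-based correlations have direct counterparts in base R's cor(). A minimal sketch on made-up values shows how they differ from Pearson's coefficient on data that are monotone but not linear:

```r
x <- c(1, 2, 3, 4, 5)
y <- c(1, 4, 9, 16, 100)   # increases with x, but far from linearly

pearson  <- cor(x, y)                       # linear association
spearman <- cor(x, y, method = "spearman")  # rank-based
kendall  <- cor(x, y, method = "kendall")   # concordant vs. discordant pairs
print(c(pearson = pearson, spearman = spearman, kendall = kendall))
```

The rank-based coefficients depend only on ordering, which makes them less sensitive to the extreme values common in raw expression data.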

42 T statistics for each gene (tx format)
create table t_stats as
select a.attrib, count(*) cnt,
       avg(a.attr_value) avg_atr,
       avg(b.attr_value) avg_trg,
       STATS_T_TEST_INDEP(b.attr_value, a.attr_value, 'STATISTIC') t_observed,
       STATS_T_TEST_INDEP(b.attr_value, a.attr_value) * 7130 two_sided_p_value
from affy_trans a, affy_trans b
where a.attrib in (select * from aftran1)
  and b.attrib = 'TARGET'
  and a.caseid = b.caseid
group by a.attrib;

43 T-statistic output table (Bonferroni correction)
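
The per-gene t-test with a Bonferroni multiplier can be sketched in base R as follows. The data are made up and the multiplier 7130 is taken from the query on slide 42. Unlike the SQL, p.adjust() caps the corrected p-value at 1.

```r
# Made-up data: 2 genes measured on 6 cases in two classes.
grp <- c("ALL", "ALL", "ALL", "AML", "AML", "AML")
expr <- list(
  g1 = c(1.0, 1.2, 0.8, 5.0, 5.3, 4.9),
  g2 = c(2.0, 3.0, 2.5, 2.4, 2.9, 2.2)
)
n_tests <- 7130   # Bonferroni multiplier used on the slide

res <- t(sapply(expr, function(v) {
  tt <- t.test(v ~ grp, var.equal = TRUE)   # pooled (equal-variance) t-test
  c(t_observed = unname(tt$statistic),
    p_raw      = tt$p.value,
    p_bonf     = p.adjust(tt$p.value, method = "bonferroni", n = n_tests))
}))
print(res)
```

Setting var.equal = TRUE mirrors STATS_T_TEST_INDEP; leaving it at the default FALSE gives the Welch test, which corresponds to STATS_T_TEST_INDEPU.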

44 F-distribution one-way ANOVA
drop table anova;
create table anova as
select a.attrib, count(*) cnt,
       avg(a.attr_value) avg_atr,
       avg(b.attr_value) avg_trg,
       STATS_ONE_WAY_ANOVA(b.attr_value, a.attr_value, 'F_RATIO') f_ratio,
       STATS_ONE_WAY_ANOVA(b.attr_value, a.attr_value, 'SIG') * 7130 p_value
from affy_trans a, affy_trans b
where a.attrib in (select * from aftran1)
  and b.attrib = 'TARGET'
  and a.caseid = b.caseid
group by a.attrib;

45 Top genes by ANOVA
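
For a single gene, the F ratio and its significance can be reproduced in base R. This is a sketch on made-up values with three hypothetical classes A, B and C, to show the multi-class case the anova routine is meant for.

```r
# Made-up expression values for one gene across three classes.
grp  <- factor(c("A", "A", "A", "B", "B", "B", "C", "C", "C"))
expr <- c(1, 2, 3, 2, 3, 4, 7, 8, 9)

# One-way ANOVA: between-class variance relative to within-class variance.
fit <- anova(lm(expr ~ grp))
f_ratio <- fit[["F value"]][1]
p_value <- fit[["Pr(>F)"]][1]
print(c(f_ratio = f_ratio, p_value = p_value))
```

With only two classes, the F ratio equals the square of the equal-variance t statistic, so the ANOVA and t-test rankings of genes agree in that case.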

46 R for Plotting and Visualization
# Get data and plot all variables
a1 <- sqlQuery(chan1, query = "select * from anova")
plot(a1)

47 Scatterplot Matrix in R

48 Histograms of Expression Distribution
# Generate histograms of gene expression cases, with labels and cutoffs
pam1 <- 40     # bars
pam2 <- 5000   # ceiling
pam3 <-        # floor (value missing in the source)
wx <- 1000     # width
hx <- 714      # height
N <- ncol(csv) # gene expression data is in csv
N <- 3         # do only 3 histograms
R <- nrow(csv)
nom <- attr(csv, "names")
par(mfrow = c(N - 1, 1))
# Make the break points
b <- 0:pam1
c <- pam1 / (pam2 - pam3)
b <- b / c
b <- b + pam3
for (num in 2:N) {
  h1 <- csv[, num]
  h1[h1 > pam2] <- pam2   # clamp at the ceiling
  h1[h1 < pam3] <- pam3   # clamp at the floor
  h <- hist(h1, breaks = b,
            main = paste("histogram of", nom[num], "clamp at", pam2, pam3),
            xlab = nom[num], col = 5)
}

49 3 columns of expression dist.

50 Histogram no labels
N <- ncol(csv) # number of columns
pam1 <- 40     # bars
pam2 <- 5000   # ceiling
pam3 <-        # floor (value missing in the source)
R <- nrow(csv)
# Make the break points
b <- 0:pam1
c <- pam1 / (pam2 - pam3)
b <- b / c
b <- b + pam3
par(mfrow = c(N, 1), mar = c(0, 0, 0, 0))  # space on graph
for (num in 2:N) {
  h1 <- csv[, num]
  h1[h1 > pam2] <- pam2
  h1[h1 < pam3] <- pam3
  h <- hist(h1, breaks = b, main = "", xlab = "", axes = FALSE, col = 5)
}

51 All 72 cases of expression dist.
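
The break-point and clamping logic in the histogram code can be checked without plotting. The sketch below uses made-up values, and the floor of 0 is an assumption, since the slides' pam3 value did not survive in the source.

```r
bars <- 40     # number of bars (pam1 on the slides)
hi   <- 5000   # ceiling (pam2 on the slides)
lo   <- 0      # floor: assumed value, pam3 is missing on the slides

# Equally spaced break points from floor to ceiling: bars + 1 edges.
breaks <- seq(lo, hi, length.out = bars + 1)

# Clamp values into [lo, hi] so every value falls inside the breaks.
x <- c(-200, 10, 2500, 4999, 12000)   # made-up expression values
clamped <- pmin(pmax(x, lo), hi)

h <- hist(clamped, breaks = breaks, plot = FALSE)
print(h$counts)
```

Clamping first matters because hist() stops with an error when any value lies outside the supplied breaks; out-of-range values end up piled into the first and last bars instead.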

53 More?
Many other applications of R
  Bioconductor
Many other applications of Oracle
Other code is available
Oracle Informatics Consulting

54 Life Sciences DM Workshop
A one-day onsite technical session educating organizations on how to leverage one of their most valuable assets to provide insight into the operations of their business, the behavioral patterns of their customers, and hidden relationships found deep within corporate data that can have a direct impact on the bottom line.
Life Sciences DM Blueprint
A documented technical roadmap providing the organization with the strategy to integrate and deploy Life Sciences technology. This includes recommendations based on feedback from the Life Sciences workshop, focusing on source data preparation, mining methodologies and supporting architecture.
Life Sciences DM Insight
A five-day onsite engagement focused on providing a detailed analysis of the business problem, data preparation, model build and analysis, and knowledge deployment. It extends the analysis of the Life Sciences workshop and culminates in a technical roadmap with a strategy to integrate and deploy Life Sciences technology.
Life Sciences DM Quickstart
A thirty-day engagement focused on taking a business problem and transforming it into a Life Sciences solution. This includes transforming the business problem, preparing the data, creating the mining model, and knowledge deployment. Upon completion, results will be delivered mapped to the initial business problem.
Life Sciences DM Services
A series of custom services focused on delivering Life Sciences methodologies and solutions to provide insight into the operations of their business, the behavioral patterns of their customers, and hidden relationships found deep within corporate data that can have a direct impact on the bottom line.

55 Life Science Informatics experience
Gene expression analysis
Sequence analysis (BLAST, exon/intron prediction)
Clinical/medical data analysis
QSAR/cheminformatics: Isis, Molconz, predictive tox
Animal studies
Protein analysis (arrays, mass spec)
Ontologies and text mining

56 Data Mining & Informatics Services
Contact Richard Solari
Contact Patrick Hoffman

