1 Data Science: Will computer science and informatics eat our lunch? Thomas Lumley, University of Auckland. @tslumley, statschat.org.nz, notstatschat.tumblr.com
2 In the 1920s, the computing labs helped establish statistics on the American continent. Without them, even a modest study was beyond the ability of an individual statistician. At the same time, statistics labs often had the most powerful computing machines within their larger institution. They showed how organized computing could benefit science and provided a place for the earliest of computer scientists to test their ideas. -- Grier The origins of statistical computing, Amstat Online
3 Fig. 2. The Hollerith Electric Tabulating System
4 Iowa State Statistical Computing Service
6 Iowa State Statistical Computing Service
7 Iowa State statistics PhD prelim exam. Two eight-hour written-on-paper exams covering: Theory of Probability and Statistics I; Theory of Probability and Statistics II; Statistical Methods I; Statistical Methods II; Advanced Statistical Methods; Advanced Probability Theory; Advanced Theory of Statistical Inference. They do require a stat computing course: 1 credit/30.
8 What is data science? and where can we get some?
9 Data Science is just a fancy name for statistics. Fitting simple models to messy and sometimes large data sets. Combination of standard black-box fitting tools and good graphics. Doesn't require any fundamental knowledge our students don't have. Needs good computing skills, which our students can learn.
10 Need to avoid going overboard with computing. Data Wrangling isn't statistics: cleaning, tidying, querying, reformatting, transforming, getting in and out of databases, ...
11 "Data Science is just a fancy name for statistics." "Data Wrangling isn't statistics." "If you value self-consistency, you can hold at most one of these opinions." A/Prof Jenny Bryan, UBC (less than one is good)
12 Data science is statistics in the same way that: epidemiology is statistics; opinion polling is statistics; ag. field trials are statistics.
13 I did think, however, that many well-known applied statisticians attacked problems without the necessary mathematical knowledge and manipulative skill. Moreover, I believed that a principal cause of failure among medical research scientists was the lack of basic scientific knowledge in their special chosen field. H. O. Lancaster
14 Computing is easier to steal Define and explain the relevance to applied statistics of: Suffix trees Supernodal Cholesky factorization Column-store database Translation look-aside buffer
15 Computing is easier to steal. Need to teach our data science students: A bit about databases and SQL. A statistical programming language (eg R). Abstractions such as tidy data, sparse, map/reduce. Reproducible data analysis (eg rmarkdown). Collaborative version control (eg git/github). Force students to work with a wild-caught data set. Permit interested students to learn the high-tech data structures and algorithms stuff. "and I'm still pretty sure some of the data is missing, but it could still be here, in this ONE HUNDRED SHEET excel file" (a PhD student on Twitter)
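A minimal sketch of the first and third bullets (a bit of SQL, tidy data), using Python's built-in sqlite3 module. The table name `obs`, its columns, and the measurements are invented for illustration; they are not from the talk.

```python
import sqlite3


def mean_by_group(rows):
    """Load tidy records (one observation per row) into an in-memory
    SQLite database and return the mean measurement per subject."""
    con = sqlite3.connect(":memory:")
    # Hypothetical schema: one row per subject per visit.
    con.execute("CREATE TABLE obs (subject TEXT, visit INTEGER, sbp REAL)")
    con.executemany("INSERT INTO obs VALUES (?, ?, ?)", rows)
    result = con.execute(
        "SELECT subject, AVG(sbp) FROM obs GROUP BY subject ORDER BY subject"
    ).fetchall()
    con.close()
    return result


# Hypothetical measurements, already in tidy form.
rows = [("a", 1, 120.0), ("a", 2, 130.0), ("b", 1, 110.0)]
print(mean_by_group(rows))
```

Because the data are tidy, the grouping is a one-line query; the same records spread across many spreadsheet tabs would first need reshaping.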
16 But we don't know this stuff! [Image: "Let me Google that for you" search page] The computing folks are way better at dissemination than us. Unlike statistics, the computer can tell you if you get it wrong.
17 Free online courses: The Data Scientist's Toolbox; Getting and Cleaning Data; Exploratory Data Analysis; Reproducible Research; Statistical Inference; Regression Models; Developing Data Products; Data Analysis and Statistical Inference. Books: Dynamic Documents with R and knitr (Yihui Xie); Practical Data Science with R (Nina Zumel, John Mount); Doing Data Science: Straight Talk from the Frontline (Cathy O'Neil & Rachel Schutt). People who make their notes available: UBC STAT 545A and 547M, "Data wrangling, exploration, and analysis with R" ("Learn how to explore, groom, visualize, and analyze data [and] make all of that reproducible, reusable, ... using R"). Software tools: "Open source environment for deep analysis of large complex data"; "The Power of R with Big Data"; Software Carpentry.
18 What do we have to offer? Popularity? Romance? Excitement?
19 Big. Complex. Messy. Badly Sampled. Creepy. Vital to ask the right questions.
20 Big Data. Computer folks are better at this than us, but statistical insights are important, eg: Noel Cressie: fast computation for spatial models. Bill Cleveland: optimising the divide/recombine strategy.
21 Big Data. Computer folks are better at this than us. Big doesn't mean gigabytes.
22 Complex Data Models for complex data Summaries (parameters, estimators) that answer the real questions Robustness of meaning, not just of power and level.
23 Complex Data: networks. $F(x) \propto 1 - x^{-\alpha}$. Power laws: come from network, queue, Matthew effect process. Examples: blog links; page views; long tail sales data; citations to papers; word frequencies; earthquake sizes.
24 Complex Data: networks. $F(x) \propto 1 - x^{-\alpha}$. Power laws: come from network, queue, Matthew effect process. Examples: blog links; page views; long tail sales data; citations to papers; word frequencies; earthquake sizes. All fit lognormal better, some much better [Clauset et al, SIAM Rev. 2009].
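A rough sketch of what "fits lognormal better" can mean: compare the maximised log-likelihoods of a power-law (Pareto) model and a lognormal model on the same data. This is a simplification, not the procedure of Clauset et al, who also choose the lower cutoff carefully and use formal likelihood-ratio tests. Here the cutoff is assumed to be the sample minimum, and the data are simulated.

```python
import math
import random


def pareto_loglik(xs):
    """Maximised log-likelihood of a continuous power law with density
    alpha * xmin**alpha / x**(alpha + 1), taking xmin = min(xs)."""
    xmin = min(xs)
    n = len(xs)
    alpha = n / sum(math.log(x / xmin) for x in xs)  # MLE of the exponent
    return sum(
        math.log(alpha) + alpha * math.log(xmin) - (alpha + 1) * math.log(x)
        for x in xs
    )


def lognormal_loglik(xs):
    """Maximised log-likelihood of a lognormal fitted to xs."""
    logs = [math.log(x) for x in xs]
    n = len(logs)
    mu = sum(logs) / n
    var = sum((v - mu) ** 2 for v in logs) / n  # MLE of the log-scale variance
    return sum(
        -v - 0.5 * math.log(2 * math.pi * var) - (v - mu) ** 2 / (2 * var)
        for v in logs
    )


# Simulated (not real) data: one lognormal sample, one Pareto sample.
rng = random.Random(1)
lognormal_sample = [rng.lognormvariate(0.0, 1.0) for _ in range(2000)]
pareto_sample = [rng.paretovariate(2.0) for _ in range(2000)]

for name, xs in [("lognormal", lognormal_sample), ("pareto", pareto_sample)]:
    diff = lognormal_loglik(xs) - pareto_loglik(xs)
    print(name, "data: lognormal minus power-law log-likelihood =", round(diff, 1))
```

A positive difference favours the lognormal and a negative one favours the power law. A real analysis would also account for sampling variability in that difference.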
25 Complex Data: networks. Random graph models for connections: Erdős-Rényi graphs; Exponential Random Graph Models (ERGMs): meaningful parameters, nice likelihood. ERGMs are not consistent under sampling. [Shalizi et al, Ann Stat]
26 Complex Data. Robustness of meaning can be hard: Suppose a Wilcoxon test shows X > Y and Y > Z. What does this tell us about: Means of X and Z? Medians of X and Z? Wilcoxon test of X and Z?
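One way to see why the answer can be "nothing": in large samples the Wilcoxon (Mann-Whitney) test compares P(X > Y) with 1/2, and that ordering need not be transitive. The sketch below computes the probability exactly for three small discrete distributions. The values are the classic "nontransitive dice", chosen for illustration and not taken from the talk.

```python
from fractions import Fraction
from itertools import product
from statistics import mean, median


def prob_greater(x, y):
    """P(X > Y) + 0.5 * P(X = Y) for independent uniform draws from
    the lists x and y (the quantity the Wilcoxon test compares to 1/2)."""
    score = sum(
        Fraction(1) if a > b else Fraction(1, 2) if a == b else Fraction(0)
        for a, b in product(x, y)
    )
    return score / (len(x) * len(y))


# Illustrative populations (equally likely values).
X = [2, 2, 4, 4, 9, 9]
Y = [1, 1, 6, 6, 8, 8]
Z = [3, 3, 5, 5, 7, 7]

print("P(X > Y) =", prob_greater(X, Y))
print("P(Y > Z) =", prob_greater(Y, Z))
print("P(X > Z) =", prob_greater(X, Z))
print("means:  ", mean(X), mean(Y), mean(Z))
print("medians:", median(X), median(Y), median(Z))
```

Here X tends to beat Y and Y tends to beat Z, yet Z tends to beat X. The three means are equal, and the medians are ordered Y > Z > X.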
27 Messy Data. Good applied statisticians know from messy data. "and I'm still pretty sure some of the data is missing, but it could still be here, in this ONE HUNDRED SHEET excel file" (a PhD student on Twitter) [Figure: diastolic blood pressure against age (years)]
28 Badly Sampled. "Whom the Gods Would Destroy, They First Give Real-time Analytics" [Dan McKinley, Etsy]: "This line of thinking is a trap. It's important to divorce the concepts of operational metrics and product analytics. Confusing how we do things with how we decide which things to do is a fatal mistake." Because: non-representativeness of short time slices.
30 Creepy. What questions should data answer? [Map: income, Mount Eden. Statistics NZ / Chris McDowall] Based on census meshblock: not actual household data.
31 Creepy (and Evil) What questions should data answer? Familiar issues: Bioethics Statistical disclosure/confidentiality New, but statistical issues: algorithm audit/accountability We also talk to social scientists more. (not enough)
32 Creepy (and Evil). How do we learn more? [Image: "Let me Google that for you" search page] Cathy O'Neil (mathbabe.org); Ed Felten; danah boyd.
33 Summary The hard problems in data science are hard. Many of the computational ones are solved (ish) Many of the unsolved ones are closer to statistics
34 Data Science Will computer science and informatics eat our lunch? Only if we let them, and it would be bad for data science, too