Data Science Will computer science and informatics eat our lunch?


1 Data Science: Will computer science and informatics eat our lunch? Thomas Lumley, University of Auckland. @tslumley, statschat.org.nz, notstatschat.tumblr.com
2 "In the 1920s, the computing labs helped establish statistics on the American continent. Without them, even a modest study was beyond the ability of an individual statistician. At the same time, statistics labs often had the most powerful computing machines within their larger institution. They showed how organized computing could benefit science and provided a place for the earliest of computer scientists to test their ideas." Grier, "The origins of statistical computing", Amstat Online
3 Fig. 2. The Hollerith Electric Tabulating System
4 Iowa State Statistical Computing Service
5 CSIRAC
6 Iowa State Statistical Computing Service
7 Iowa State statistics PhD prelim exam. Two eight-hour written-on-paper exams covering: Theory of Probability and Statistics I; Theory of Probability and Statistics II; Statistical Methods I; Statistical Methods II; Advanced Statistical Methods; Advanced Probability Theory; Advanced Theory of Statistical Inference. They do require a stat computing course: 1 credit/30
8 What is data science? and where can we get some?
9 Data Science is just a fancy name for statistics. Fitting simple models to messy and sometimes large data sets. Combination of standard black-box fitting tools and good graphics. Doesn't require any fundamental knowledge our students don't have. Needs good computing skills, which our students can learn.
10 Need to avoid going overboard with computing. Data Wrangling isn't statistics: cleaning, tidying, querying, reformatting, transforming, getting in and out of databases, ...
11 "Data Science is just a fancy name for statistics." "Data Wrangling isn't statistics." If you value self-consistency, you can hold at most one of these opinions. A/Prof Jenny Bryan, UBC (less than one is good)
12 Data science is statistics in the same way that: epidemiology is statistics; opinion polling is statistics; ag. field trials are statistics.
13 "I did think, however, that many well-known applied statisticians attacked problems without the necessary mathematical knowledge and manipulative skill. Moreover, I believed that a principal cause of failure among medical research scientists was the lack of basic scientific knowledge in their special chosen field." H. O. Lancaster
14 Computing is easier to steal. Define and explain the relevance to applied statistics of: suffix trees; supernodal Cholesky factorization; column-store database; translation lookaside buffer.
15 Computing is easier to steal. Need to teach our data science students: a bit about databases and SQL; a statistical programming language (eg R); abstractions such as tidy data, sparse, map/reduce; reproducible data analysis (eg rmarkdown); collaborative version control (eg git/github). Force students to work with a wild-caught data set. Permit interested students to learn the high-tech data structures and algorithms stuff. "and I'm still pretty sure some of the data is missing, but it could still be here, in this ONE HUNDRED SHEET excel file" (a PhD student on Twitter)
16 But we don't know this stuff! [Screenshot: Let Me Google That For You] The computing folks are way better at dissemination than us. Unlike statistics, the computer can tell you if you get it wrong.
17 Free online courses: The Data Scientist's Toolbox; Getting and Cleaning Data; Exploratory Data Analysis; Reproducible Research; Statistical Inference; Regression Models; Developing Data Products; Data Analysis and Statistical Inference. Books: Dynamic Documents with R and knitr (Yihui Xie); Practical Data Science with R (Nina Zumel, John Mount); Doing Data Science: Straight Talk from the Frontline (Cathy O'Neil & Rachel Schutt). People who make their notes available: UBC STAT 545A and 547M, "Data wrangling, exploration, and analysis with R". Software tools: "Open source environment for deep analysis of large complex data"; "The Power of R with Big Data"; Software Carpentry.
18 What do we have to offer? Popularity? Romance? Excitement?
19 Big; Complex; Messy; Badly Sampled; Creepy. Vital to ask the right questions.
20 Big Data: Computer folks are better at this than us, but statistical insights are important, eg: Noel Cressie: fast computation for spatial models; Bill Cleveland: optimising the divide/recombine strategy.
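To make the divide/recombine idea on this slide concrete, here is a minimal sketch. It is not Cleveland's method, just a toy case chosen because the recombination is exact: for simple linear regression the sufficient statistics add across chunks. The data, chunk size, and function names are all illustrative assumptions.

```python
from functools import reduce


def summarize(chunk):
    """Divide step: sufficient statistics for y ~ x on one chunk of (x, y) pairs."""
    n = len(chunk)
    sx = sum(x for x, _ in chunk)
    sy = sum(y for _, y in chunk)
    sxx = sum(x * x for x, _ in chunk)
    sxy = sum(x * y for x, y in chunk)
    return (n, sx, sy, sxx, sxy)


def combine(a, b):
    """Recombine step: the statistics simply add across chunks."""
    return tuple(u + v for u, v in zip(a, b))


def fit(stats):
    """Least-squares intercept and slope from the pooled statistics."""
    n, sx, sy, sxx, sxy = stats
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return intercept, slope


# Toy data lying exactly on y = 1 + 2x (an assumption for illustration).
data = [(x, 1 + 2 * x) for x in range(12)]
chunks = [data[i:i + 4] for i in range(0, len(data), 4)]

intercept, slope = fit(reduce(combine, map(summarize, chunks)))
print(intercept, slope)
```

For estimators without additive sufficient statistics, recombining per-chunk results is only approximate, and how to divide and recombine well becomes a statistical question rather than a purely computational one.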
21 Big Data: Computer folks are better at this than us. Big doesn't mean gigabytes.
22 Complex Data: Models for complex data. Summaries (parameters, estimators) that answer the real questions. Robustness of meaning, not just of power and level.
23 Complex Data: networks. F(x) ∝ 1 − x^(−a). Power laws: come from network, queue, Matthew-effect process. Examples: blog links; page views; long tail sales data; citations to papers; word frequencies; earthquake sizes.
24 Complex Data: networks. F(x) ∝ 1 − x^(−a). Power laws: come from network, queue, Matthew-effect process. Examples: blog links; page views; long tail sales data; citations to papers; word frequencies; earthquake sizes. All fit lognormal better, some much better. Clauset et al, SIAM Rev. 2009
25 Complex Data: networks. Random graph models for connections: Erdős-Rényi graphs; Exponential Random Graph Models (ERGMs): meaningful parameters, nice likelihood. ERGMs are not consistent under sampling. [Shalizi et al, Ann Stat]
26 Complex Data: Robustness of meaning can be hard. Suppose a Wilcoxon test shows X > Y, Y > Z. What does this tell us about: means of X and Z? Medians of X and Z? Wilcoxon test of X and Z?
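The short answer to the slide's questions is "nothing". A small sketch using the classic non-transitive dice shows why. The three distributions below are an illustrative assumption, not taken from the talk, and are treated as populations: the code computes the exact probability P(A > B) that the Wilcoxon (Mann-Whitney) test is sensitive to, rather than running a test on samples.

```python
from itertools import product
from statistics import mean, median


def prob_greater(a, b):
    """Exact P(A > B) when A and B are drawn uniformly from the lists a and b.

    This is the Mann-Whitney U statistic divided by len(a) * len(b);
    the values used below contain no ties.
    """
    wins = sum(1 for x, y in product(a, b) if x > y)
    return wins / (len(a) * len(b))


# Three populations, each uniform on three values (non-transitive dice).
X = [2, 4, 9]
Y = [1, 6, 8]
Z = [3, 5, 7]

p_xy = prob_greater(X, Y)  # X tends to beat Y
p_yz = prob_greater(Y, Z)  # Y tends to beat Z
p_zx = prob_greater(Z, X)  # ...and yet Z tends to beat X

print("P(X>Y), P(Y>Z), P(Z>X):", p_xy, p_yz, p_zx)
print("means:  ", mean(X), mean(Y), mean(Z))
print("medians:", median(X), median(Y), median(Z))
```

All three pairwise probabilities exceed one half, so with large enough samples a Wilcoxon test would favour X over Y, Y over Z, and Z over X. Meanwhile the three means are equal, and the medians are ordered Y > Z > X, which disagrees with the first comparison.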
27 Messy Data: Good applied statisticians know from messy data. "and I'm still pretty sure some of the data is missing, but it could still be here, in this ONE HUNDRED SHEET excel file" (a PhD student on Twitter) [Figure: plot with axes "Diastolic blood" (label garbled) and "Age (years)"]
28 Badly Sampled: "Whom the Gods Would Destroy, They First Give Real-time Analytics" [Dan McKinley, Etsy]. "This line of thinking is a trap. It's important to divorce the concepts of operational metrics and product analytics. Confusing how we do things with how we decide which things to do is a fatal mistake." Because: non-representativeness of short time slices.
29 Badly Sampled: Statisticians know about: sampling design; weighting; matching; selection models.
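As a minimal illustration of the weighting item on this slide, the sketch below uses a made-up two-stratum population in which one stratum is heavily oversampled. All numbers are hypothetical. Weighting each sampled unit by the inverse of its sampling probability recovers the population mean, while the raw sample mean does not.

```python
# Hypothetical population: two strata, every unit in a stratum has the same value.
strata = [
    {"value": 10, "pop": 1000, "sampled": 100},  # sampled with probability 0.1
    {"value": 20, "pop": 9000, "sampled": 90},   # sampled with probability 0.01
]

true_mean = (
    sum(s["value"] * s["pop"] for s in strata) / sum(s["pop"] for s in strata)
)

# Each sampled unit carries (value, weight), where weight = 1 / sampling probability.
sample = [
    (s["value"], s["pop"] / s["sampled"])
    for s in strata
    for _ in range(s["sampled"])
]

unweighted = sum(y for y, _ in sample) / len(sample)
weighted = sum(w * y for y, w in sample) / sum(w for _, w in sample)

print("true mean:      ", true_mean)
print("unweighted mean:", round(unweighted, 2))
print("weighted mean:  ", weighted)
```

The unweighted mean is pulled towards the oversampled stratum. The weighted version is the usual ratio (Hájek-style) estimator, which matches the population mean exactly here only because the toy strata have no within-stratum variation.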
30 Creepy: What questions should data answer? [Map: income, Mount Eden. Statistics NZ; Chris McDowall] Based on census meshblock: not actual household data.
31 Creepy (and Evil): What questions should data answer? Familiar issues: bioethics; statistical disclosure/confidentiality. New, but statistical, issues: algorithm audit/accountability. We also talk to social scientists more. (not enough)
32 Creepy (and Evil): How do we learn more? [Screenshot: Let Me Google That For You] Cathy O'Neil (mathbabe.org); Ed Felten; danah boyd.
33 Summary: The hard problems in data science are hard. Many of the computational ones are solved (ish). Many of the unsolved ones are closer to statistics.
34 Data Science: Will computer science and informatics eat our lunch? Only if we let them, and it would be bad for data science, too.