Data Science Will computer science and informatics eat our lunch? Thomas Lumley University of Auckland @tslumley statschat.org.nz notstatschat.tumblr.com
In the 1920s, the computing labs helped establish statistics on the American continent. Without them, even a modest study was beyond the ability of an individual statistician. At the same time, statistics labs often had the most powerful computing machines within their larger institution. They showed how organized computing could benefit science and provided a place for the earliest of computer scientists to test their ideas. -- Grier The origins of statistical computing, Amstat Online
Fig. 2. The Hollerith Electric Tabulating System
Iowa State Statistical Computing Service
CSIRAC
Iowa State statistics PhD prelim exam Two eight-hour written-on-paper exams covering: Theory of Probability and Statistics I; Theory of Probability and Statistics II; Statistical Methods I; Statistical Methods II; Advanced Statistical Methods; Advanced Probability Theory; Advanced Theory of Statistical Inference. They do require a stat computing course: 1 credit/30
What is data science? and where can we get some?
Data Science is just a fancy name for statistics. Fitting simple models to messy and sometimes large data sets. Combination of standard black-box fitting tools and good graphics. Doesn't require any fundamental knowledge our students don't have. Needs good computing skills, which our students can learn.
Need to avoid going overboard with computing. Data Wrangling isn't statistics: cleaning, tidying, querying, reformatting, transforming, getting in and out of databases,
Data Science is just a fancy name for statistics. Data Wrangling isn't statistics. If you value self-consistency, you can hold at most one of these opinions. A/Prof Jenny Bryan, UBC (less than one is good)
Data science is statistics in the same way that: epidemiology is statistics; opinion polling is statistics; ag. field trials are statistics.
I did think, however, that many well-known applied statisticians attacked problems without the necessary mathematical knowledge and manipulative skill. Moreover, I believed that a principal cause of failure among medical research scientists was the lack of basic scientific knowledge in their special chosen field. H. O. Lancaster
Computing is easier to steal Define and explain the relevance to applied statistics of: suffix trees; supernodal Cholesky factorization; column-store database; translation lookaside buffer.
Computing is easier to steal Need to teach our data science students: a bit about databases and SQL; a statistical programming language (eg R); abstractions such as tidy data, sparse, map/reduce; reproducible data analysis (eg rmarkdown); collaborative version control (eg git/github). Force students to work with a wild-caught data set. Permit interested students to learn the high-tech data structures and algorithms stuff. "and I'm still pretty sure some of the data is missing, but it could still be here, in this ONE HUNDRED SHEET excel file" (a PhD student on Twitter)
But we don't know this stuff! [Image: "Let me Google that for you" search page] The computing folks are way better at dissemination than us. Unlike statistics, the computer can tell you if you get it wrong.
Free online courses: Exploratory Data Analysis; Reproducible Research; Statistical Inference; Getting and Cleaning Data; Regression Models; Developing Data Products; The Data Scientist's Toolbox; Data Analysis and Statistical Inference. Books: Dynamic Documents with R and knitr (Yihui Xie); Practical Data Science with R (Nina Zumel, John Mount); Doing Data Science: Straight Talk from the Frontline (Cathy O'Neil & Rachel Schutt). People who make their notes available: UBC STAT 545A and 547M, Data wrangling, exploration, and analysis with R ("Learn how to explore, groom, visualize, and analyze data and make all of that reproducible, reusable, ... using R"). Software tools: Open source environment for deep analysis of large complex data; The Power of R with Big Data; Software Carpentry.
What do we have to offer? Popularity? Romance? Excitement?
Big, Complex, Messy, Badly Sampled, Creepy. Vital to ask the right questions.
Big Data Computer folks are better at this than us, but statistical insights are important, eg: Noel Cressie: fast computation for spatial models; Bill Cleveland: optimising the divide/recombine strategy.
Big Data Computer folks are better at this than us. Big doesn't mean gigabytes.
Complex Data Models for complex data Summaries (parameters, estimators) that answer the real questions Robustness of meaning, not just of power and level.
Complex Data: networks $F(x) \propto 1 - x^{-a}$ Power laws: come from network, queue, Matthew effect process. Blog links; page views; long tail sales data; citations to papers; word frequencies; earthquake sizes. All fit lognormal better, some much better. Clauset et al, SIAM Rev. 2009
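One way to make the "fits lognormal better" comparison concrete is to compare maximised log-likelihoods. The sketch below is a deliberate simplification, not the procedure from Clauset et al: it takes x_min to be the sample minimum, fits both models to the whole sample, and uses simulated data (the seed, sample size, and parameter values are all assumptions chosen for illustration). The power-law form is the one on the slide, with CDF 1 - (x/x_min)^(-a).

```python
import math
import random


def pareto_loglik(xs):
    """Maximised log-likelihood of a Pareto fit with CDF 1 - (x/x_min)^(-a).

    Simplification: x_min is fixed at the sample minimum rather than estimated.
    """
    x_min = min(xs)
    n = len(xs)
    a = n / sum(math.log(x / x_min) for x in xs)  # MLE of the exponent a
    # Pareto density: a * x_min^a * x^-(a+1)
    return sum(math.log(a) + a * math.log(x_min) - (a + 1) * math.log(x) for x in xs)


def lognormal_loglik(xs):
    """Maximised log-likelihood of a lognormal fit (MLE of mu and sigma^2)."""
    logs = [math.log(x) for x in xs]
    n = len(logs)
    mu = sum(logs) / n
    var = sum((v - mu) ** 2 for v in logs) / n
    # Lognormal density: exp(-(log x - mu)^2 / (2 var)) / (x * sqrt(2 pi var))
    return sum(
        -v - 0.5 * math.log(2 * math.pi * var) - (v - mu) ** 2 / (2 * var)
        for v in logs
    )


# Simulated samples (illustrative only, not real data).
rng = random.Random(1)
lognormal_sample = [rng.lognormvariate(0.0, 1.0) for _ in range(2000)]
pareto_sample = [rng.paretovariate(1.5) for _ in range(2000)]

for name, xs in [("lognormal data", lognormal_sample), ("Pareto data", pareto_sample)]:
    diff = lognormal_loglik(xs) - pareto_loglik(xs)
    better = "lognormal" if diff > 0 else "power law"
    print(f"{name}: log-likelihood difference {diff:.1f}, better fit: {better}")
```

A real analysis would estimate x_min, compare the models on the tail only, and attach a significance test to the likelihood ratio; the point here is only that the two fits can be compared on the same data rather than judged by eye from a log-log plot.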
Complex Data: networks Random graph models for connections: Erdős-Rényi graphs; Exponential Random Graph Models (ERGMs): meaningful parameters, nice likelihood. ERGMs are not consistent under sampling. [Shalizi et al, Ann Stat]
Complex Data Robustness of meaning can be hard: Suppose a Wilcoxon test shows X > Y and Y > Z. What does this tell us about: means of X and Z? medians of X and Z? a Wilcoxon test of X and Z?
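The answer to all three questions can be "nothing". A minimal sketch, using a standard set of three non-transitive dice as hypothetical distributions for X, Y and Z: the Wilcoxon (Mann-Whitney) comparison is driven by P(X > Y), and the code computes that probability exactly by enumerating all pairs of faces.

```python
from fractions import Fraction
from itertools import product
from statistics import mean, median


def prob_greater(a, b):
    """Exact P(A > B) when A and B are independent uniform draws from a and b."""
    wins = sum(1 for x, y in product(a, b) if x > y)
    return Fraction(wins, len(a) * len(b))


# Hypothetical distributions: three non-transitive dice (no ties between dice).
X = [2, 2, 4, 4, 9, 9]
Y = [1, 1, 6, 6, 8, 8]
Z = [3, 3, 5, 5, 7, 7]

print("P(X > Y) =", prob_greater(X, Y))  # above 1/2: X tends to beat Y
print("P(Y > Z) =", prob_greater(Y, Z))  # above 1/2: Y tends to beat Z
print("P(Z > X) =", prob_greater(Z, X))  # above 1/2: yet Z tends to beat X
print("means:  ", mean(X), mean(Y), mean(Z))
print("medians:", median(X), median(Y), median(Z))
```

In this example each pairwise probability is 5/9, so with enough data the test would favour X over Y, Y over Z, and Z over X. The three means are equal, and the median of X is below the median of Z.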
Messy Data Good applied statisticians know from messy data. "and I'm still pretty sure some of the data is missing, but it could still be here, in this ONE HUNDRED SHEET excel file" (a PhD student on Twitter) [Figure: diastolic blood pressure against age (years)]
Badly Sampled "Whom the Gods Would Destroy, They First Give Real-time Analytics" [Dan McKinley, Etsy] "This line of thinking is a trap. It's important to divorce the concepts of operational metrics and product analytics. Confusing how we do things with how we decide which things to do is a fatal mistake." Because short time slices are not representative.
Badly Sampled Statisticians know about: sampling design; weighting; matching; selection models.
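As one small illustration of the weighting item, the sketch below uses an entirely hypothetical two-stratum population in which the rarer stratum is heavily over-sampled. The stratum sizes, values, and sample sizes are made up for the example. Each sampled unit gets an inverse-probability weight (stratum size divided by stratum sample size), and the weighted mean is compared with the naive one.

```python
# Hypothetical population: stratum -> (number of units, value of every unit).
population = {"A": (800, 10.0), "B": (200, 50.0)}
total_units = sum(size for size, _ in population.values())
true_mean = sum(size * value for size, value in population.values()) / total_units

# Hypothetical design: stratum B is sampled at a far higher rate than stratum A.
sample_sizes = {"A": 40, "B": 100}

# Build the sample as (value, weight) pairs; weight = 1 / sampling probability.
sample = []
for stratum, m in sample_sizes.items():
    size, value = population[stratum]
    weight = size / m
    sample.extend((value, weight) for _ in range(m))

unweighted_mean = sum(y for y, _ in sample) / len(sample)
weighted_mean = sum(w * y for y, w in sample) / sum(w for _, w in sample)

print(f"true mean:       {true_mean:.2f}")
print(f"unweighted mean: {unweighted_mean:.2f}")
print(f"weighted mean:   {weighted_mean:.2f}")
```

The unweighted mean is pulled towards the over-sampled stratum, while the weighted mean matches the population mean. That only works because the sampling probabilities are known here; with a badly sampled data set they usually are not, which is where design, matching, and selection models come in.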
Creepy What questions should data answer? [Map: income, Mount Eden. Statistics NZ data; Chris McDowall (@fogonwater)] Based on census meshblock: not actual household data
Creepy (and Evil) What questions should data answer? Familiar issues: Bioethics Statistical disclosure/confidentiality New, but statistical issues: algorithm audit/accountability We also talk to social scientists more. (not enough)
Creepy (and Evil) How do we learn more? [Image: "Let me Google that for you" search page] Cathy O'Neil (mathbabe.org), Ed Felten, danah boyd
Summary The hard problems in data science are hard. Many of the computational ones are solved (ish). Many of the unsolved ones are closer to statistics.
Data Science Will computer science and informatics eat our lunch? Only if we let them, and it would be bad for data science, too