Data Science: Will computer science and informatics eat our lunch?




Thomas Lumley, University of Auckland
@tslumley, statschat.org.nz, notstatschat.tumblr.com

"In the 1920s, the computing labs helped establish statistics on the American continent. Without them, even a modest study was beyond the ability of an individual statistician. At the same time, statistics labs often had the most powerful computing machines within their larger institution. They showed how organized computing could benefit science and provided a place for the earliest of computer scientists to test their ideas."
-- Grier, "The origins of statistical computing", Amstat Online

Fig. 2. The Hollerith Electric Tabulating System

Iowa State Statistical Computing Service

CSIRAC


Iowa State statistics PhD prelim exam
Two eight-hour written-on-paper exams covering:
- Theory of Probability and Statistics I
- Theory of Probability and Statistics II
- Statistical Methods I
- Statistical Methods II
- Advanced Statistical Methods
- Advanced Probability Theory
- Advanced Theory of Statistical Inference
They do require a stat computing course: 1 credit/30

What is data science? and where can we get some?

"Data Science is just a fancy name for statistics."
- Fitting simple models to messy and sometimes large data sets
- Combination of standard black-box fitting tools and good graphics
- Doesn't require any fundamental knowledge our students don't have
- Needs good computing skills, which our students can learn

Need to avoid going overboard with computing
"Data Wrangling isn't statistics"
Cleaning, tidying, querying, reformatting, transforming, getting in and out of databases, ...

"Data Science is just a fancy name for statistics."
"Data Wrangling isn't statistics."
"If you value self-consistency, you can hold at most one of these opinions."
-- A/Prof Jenny Bryan, UBC
(less than one is good)

Data science is statistics in the same way that
- epidemiology is statistics
- opinion polling is statistics
- ag. field trials are statistics

"I did think, however, that many well-known applied statisticians attacked problems without the necessary mathematical knowledge and manipulative skill. Moreover, I believed that a principal cause of failure among medical research scientists was the lack of basic scientific knowledge in their special chosen field."
-- H. O. Lancaster

Computing is easier to steal
Define and explain the relevance to applied statistics of:
- Suffix trees
- Supernodal Cholesky factorization
- Column-store database
- Translation look-aside buffer

Computing is easier to steal
Need to teach our data science students:
- A bit about databases and SQL
- A statistical programming language (eg R)
- Abstractions such as tidy data, sparse, map/reduce
- Reproducible data analysis (eg rmarkdown)
- Collaborative version control (eg git/github)
Force students to work with a wild-caught data set
Permit interested students to learn the high-tech data structures and algorithms stuff.
"and I'm still pretty sure some of the data is missing, but it could still be here, in this ONE HUNDRED SHEET excel file" -- a PhD student on Twitter

But we don't know this stuff!
[Image: "Let me Google that for you"]
- The computing folks are way better at dissemination than us
- Unlike statistics, the computer can tell you if you get it wrong.

Free online courses
- The Data Scientist's Toolbox
- Getting and Cleaning Data
- Exploratory Data Analysis
- Reproducible Research
- Statistical Inference
- Regression Models
- Developing Data Products
- Data Analysis and Statistical Inference

Books
- Dynamic Documents with R and knitr, Yihui Xie
- Practical Data Science with R, Nina Zumel & John Mount
- Doing Data Science: Straight Talk from the Frontline, Cathy O'Neil & Rachel Schutt

People who make their notes available
- UBC STAT 545A and 547M: Data wrangling, exploration, and analysis with R. "Learn how to explore, groom, visualize, and analyze data and make all of that reproducible, reusable, ... using R"

Software tools
- Open source environment for deep analysis of large complex data ("The Power of R with Big Data")
- Software Carpentry

What do we have to offer? Popularity? Romance? Excitement?

- Big
- Complex
- Messy
- Badly Sampled
- Creepy
Vital to ask the right questions

Big Data
Computer folks are better at this than us, but statistical insights are important, eg:
- Noel Cressie: fast computation for spatial models
- Bill Cleveland: optimising the divide/recombine strategy
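As a rough illustration of the divide/recombine idea named above, the sketch below splits a data set into subsets, fits the same simple regression to each, and recombines by averaging the estimates. It is only a toy under stated assumptions: the simulated data, the helper names `ols_fit` and `divide_recombine`, and the choice of plain averaging as the recombination step are all invented here and are not taken from Cleveland's work.

```python
import random

def ols_fit(xs, ys):
    """Least-squares slope and intercept for a simple linear regression."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def divide_recombine(xs, ys, n_subsets):
    """Divide: fit each subset separately. Recombine: average the estimates."""
    fits = [ols_fit(xs[k::n_subsets], ys[k::n_subsets]) for k in range(n_subsets)]
    slope = sum(f[0] for f in fits) / n_subsets
    intercept = sum(f[1] for f in fits) / n_subsets
    return slope, intercept

# Hypothetical data: y = 1 + 2x + noise
rng = random.Random(1)
xs = [rng.uniform(0, 10) for _ in range(10_000)]
ys = [1.0 + 2.0 * x + rng.gauss(0, 1) for x in xs]

dr_slope, dr_intercept = divide_recombine(xs, ys, n_subsets=10)
full_slope, full_intercept = ols_fit(xs, ys)
print(dr_slope, full_slope)
```

The statistical question is how much is lost by recombining subset estimates instead of fitting all the data at once. For this easy model the two answers should be very close; for harder models the choice of division and recombination method matters more.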

Big Data
Computer folks are better at this than us.
Big doesn't mean gigabytes.

Complex Data
- Models for complex data
- Summaries (parameters, estimators) that answer the real questions
- Robustness of meaning, not just of power and level.


Complex Data: networks
$F(x) \propto 1 - x^{-a}$
Power laws: come from network, queue, Matthew effect process
- blog links
- page views
- long tail sales data
- citations to papers
- word frequencies
- earthquake sizes
All fit lognormal better, some much better
Clauset et al, SIAM Rev. 2009
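To make the power-law versus lognormal comparison concrete, here is a minimal sketch that fits both distributions by maximum likelihood and compares log-likelihoods. It is a simplification and not the procedure of Clauset et al.: the lower cutoff is simply taken to be the sample minimum, the lognormal is not truncated, no formal likelihood-ratio test is done, and the data are simulated. The function names are hypothetical.

```python
import math
import random

def pareto_loglik(xs):
    """MLE of the tail exponent a (cutoff = sample minimum) and its log-likelihood."""
    xmin = min(xs)
    log_ratios = [math.log(x / xmin) for x in xs]
    a = len(xs) / sum(log_ratios)
    # Density: (a / xmin) * (x / xmin) ** -(a + 1)
    loglik = sum(math.log(a) - math.log(xmin) - (a + 1) * lr for lr in log_ratios)
    return a, loglik

def lognormal_loglik(xs):
    """MLE of (mu, sigma) on the log scale and the lognormal log-likelihood."""
    logs = [math.log(x) for x in xs]
    n = len(logs)
    mu = sum(logs) / n
    sigma2 = sum((v - mu) ** 2 for v in logs) / n
    loglik = sum(
        -v - 0.5 * math.log(2 * math.pi * sigma2) - (v - mu) ** 2 / (2 * sigma2)
        for v in logs
    )
    return mu, math.sqrt(sigma2), loglik

rng = random.Random(2009)
lognormal_data = [rng.lognormvariate(0.0, 1.0) for _ in range(5000)]
pareto_data = [rng.paretovariate(1.5) for _ in range(5000)]

ll_pareto_on_ln = pareto_loglik(lognormal_data)[1]
ll_lognormal_on_ln = lognormal_loglik(lognormal_data)[2]
ll_pareto_on_par = pareto_loglik(pareto_data)[1]
ll_lognormal_on_par = lognormal_loglik(pareto_data)[2]
print(ll_lognormal_on_ln - ll_pareto_on_ln, ll_pareto_on_par - ll_lognormal_on_par)
```

Both differences should come out positive: each simulated data set is fit better by the model that generated it. A straight-looking line on a log-log plot cannot make that distinction, which is why the comparison is done on likelihoods.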

Complex Data: networks
Random graph models for connections
- Erdős–Rényi graphs
- Exponential Random Graph Models (ERGMs): meaningful parameters, nice likelihood
ERGMs are not consistent under sampling. [Shalizi et al, Ann Stat]
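For readers who have not met random graph models, the simplest one on the list above can be sketched in a few lines: in an Erdős–Rényi G(n, p) graph every possible edge is present independently with probability p, so the likelihood is binomial and the MLE of p is the observed edge density. The function name, graph size, and seed below are illustrative assumptions only; ERGMs themselves need specialised software and are not attempted here.

```python
import itertools
import random

def erdos_renyi(n, p, seed=None):
    """Sample a G(n, p) graph: each of the n*(n-1)/2 possible edges appears
    independently with probability p. Returns a list of (i, j) pairs, i < j."""
    rng = random.Random(seed)
    return [
        (i, j)
        for i, j in itertools.combinations(range(n), 2)
        if rng.random() < p
    ]

n, p = 200, 0.05
edges = erdos_renyi(n, p, seed=42)

# Degree of each node, and the maximum-likelihood estimate of p
degree = [0] * n
for i, j in edges:
    degree[i] += 1
    degree[j] += 1
mean_degree = sum(degree) / n
p_hat = len(edges) / (n * (n - 1) / 2)
print(len(edges), mean_degree, p_hat)
```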

Complex Data
Robustness of meaning can be hard:
Suppose a Wilcoxon test shows X > Y, Y > Z
What does this tell us about
- Means of X and Z?
- Medians of X and Z?
- Wilcoxon test of X and Z?
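One way to see why the answer can be "nothing" is a standard non-transitive dice construction. The Wilcoxon (Mann-Whitney) test is essentially asking whether P(X > Y) differs from 1/2, and that ordering need not be transitive. The three distributions below are a textbook-style example chosen for illustration, not something taken from the slides; each puts equal probability on its six listed values.

```python
from fractions import Fraction
from itertools import product
from statistics import mean, median

# Three hypothetical discrete distributions (uniform over the listed values)
X = [2, 2, 4, 4, 9, 9]
Y = [1, 1, 6, 6, 8, 8]
Z = [3, 3, 5, 5, 7, 7]

def prob_greater(a, b):
    """Exact P(A > B) for independent draws from a and b (no ties occur here)."""
    wins = sum(1 for u, v in product(a, b) if u > v)
    return Fraction(wins, len(a) * len(b))

p_xy = prob_greater(X, Y)   # X tends to beat Y
p_yz = prob_greater(Y, Z)   # Y tends to beat Z
p_xz = prob_greater(X, Z)   # ...but X tends to lose to Z

print(p_xy, p_yz, p_xz)
print(mean(X), mean(Y), mean(Z))
print(median(X), median(Y), median(Z))
```

Here P(X > Y) and P(Y > Z) are both 5/9, yet P(X > Z) is 4/9, so with enough data the test would rank Z above X. The three means are all equal, and the median of X is below the median of Y even though X "beats" Y in the Wilcoxon sense.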

Messy Data
Good applied statisticians know from messy data.
"and I'm still pretty sure some of the data is missing, but it could still be here, in this ONE HUNDRED SHEET excel file" -- a PhD student on Twitter
[Figure: diastolic blood pressure by age (years)]

Badly Sampled
"Whom the Gods Would Destroy, They First Give Real-time Analytics" [Dan McKinley, Etsy]
"This line of thinking is a trap. It's important to divorce the concepts of operational metrics and product analytics. Confusing how we do things with how we decide which things to do is a fatal mistake."
Because short time slices are non-representative.

Badly Sampled
Statisticians know about
- sampling design
- weighting
- matching
- selection models
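As a small illustration of what weighting buys you, the sketch below builds a toy population with two groups, samples one group five times as heavily as the other, and compares the naive sample mean with an inverse-probability-weighted mean. All numbers and names are invented for the example, and the sampling probabilities are assumed known, which is the easy case.

```python
# Hypothetical population: 900 people with value 10, 100 people with value 50
population = [("A", 10.0)] * 900 + [("B", 50.0)] * 100
true_mean = sum(y for _, y in population) / len(population)

# Unequal-probability sample: 10% of group A, 50% of group B
sampling_prob = {"A": 0.1, "B": 0.5}
sample = [("A", 10.0)] * 90 + [("B", 50.0)] * 50

# Naive mean ignores the design and over-represents group B
naive_mean = sum(y for _, y in sample) / len(sample)

# Weighted mean: each unit counts for 1 / (its sampling probability) people
weights = [1.0 / sampling_prob[g] for g, _ in sample]
weighted_mean = sum(w * y for w, (_, y) in zip(weights, sample)) / sum(weights)

print(true_mean, naive_mean, weighted_mean)
```

The naive mean is pulled well above the population mean, while the weighted mean recovers it. With real data the sampling probabilities usually have to be modelled, which is where matching and selection models come in.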

Creepy
What questions should data answer?
[Figure: income, Mount Eden. Statistics NZ; Chris McDowall (@fogonwater)]
Based on census meshblock: not actual household data

Creepy (and Evil)
What questions should data answer?
Familiar issues:
- Bioethics
- Statistical disclosure/confidentiality
New, but statistical issues:
- algorithm audit/accountability
We also talk to social scientists more. (not enough)

Creepy (and Evil)
How do we learn more?
[Image: "Let me Google that for you"]
- Cathy O'Neil (mathbabe.org)
- Ed Felten
- danah boyd

Summary
- The hard problems in data science are hard.
- Many of the computational ones are solved (ish).
- Many of the unsolved ones are closer to statistics.

Data Science: Will computer science and informatics eat our lunch?
Only if we let them, and it would be bad for data science, too.