The Emerging Discipline of Data Science. Principles and Techniques For Data- Intensive Analysis

Similar documents
How To Understand The Big Data Paradigm

Open Science, Big Data and Research Reproducibility. Tony Hey Senior Data Science Fellow escience Ins>tute University of Washington

Data analysis in Par,cle Physics

Mission. To provide higher technological educa5on with quality, preparing. competent professionals, with sound founda5ons in science, technology

Project Por)olio Management

Big Data. The Big Picture. Our flexible and efficient Big Data solu9ons open the door to new opportuni9es and new business areas

The Data Reservoir. 10 th September Mandy Chessell FREng CEng FBCS Dis4nguished Engineer, Master Inventor Chief Architect, Informa4on Solu4ons

AVOIDING SILOED DATA AND SILOED DATA MANAGEMENT

Realm of Big Data Ini0a0ves

Developing the Agile Mindset for Organiza7onal Agility. Shannon Ewan Managing

MSc Data Science at the University of Sheffield. Started in September 2014

Big Data Processing Experience in the ATLAS Experiment

Experiments on cost/power and failure aware scheduling for clouds and grids

Information and Communications Technology Supply Chain Risk Management (ICT SCRM) AND NIST Cybersecurity Framework

CONTENTS. Introduc on 2. Undergraduate Program 4. BSC in Informa on Systems 4. Graduate Program 7. MSC in Informa on Science 7

CMMI for High-Performance with TSP/PSP

How To Teach Physics At The Lhc

From Data to Foresight:

Stream Deployments in the Real World: Enhance Opera?onal Intelligence Across Applica?on Delivery, IT Ops, Security, and More

An Open Dynamic Big Data Driven Applica3on System Toolkit

Data Mining. Supervised Methods. Ciro Donalek Ay/Bi 199ab: Methods of Sciences hcp://esci101.blogspot.

Effec%ve AX 2012 Upgrade Project Planning and Microso< Sure Step. Arbela Technologies

Transla6ng from Clinical Care to Research: Integra6ng i2b2 and OpenClinica

Graduate Systems Engineering Programs: Report on Outcomes and Objec:ves

Applying Machine Learning to Network Security Monitoring. Alex Pinto Chief Data Scien2st

Poten&al Impact of FDA Regula&on of EMRs. October 27, 2010

benefit of virtualiza/on? Virtualiza/on An interpreter may not work! Requirements for Virtualiza/on 1/06/15 Which of the following is not a poten/al

BENCHMARKING V ISUALIZATION TOOL

Everything You Need to Know about Cloud BI. Freek Kamst

The Real Score of Cloud

Introduc)on to the IoT- A methodology

Clusters in the Cloud

Science Gateways What are they and why are they having such a tremendous impact on science? Nancy Wilkins- Diehr wilkinsn@sdsc.edu

NERSC Data Efforts Update Prabhat Data and Analytics Group Lead February 23, 2015

Data Management in the Cloud: Limitations and Opportunities. Annies Ductan

Testing the In-Memory Column Store for in-database physics analysis. Dr. Maaike Limper

High Value Manufacturing Catapult. Paul John, Business Director

HIPAA Breaches, Security Risk Analysis, and Audits

Integrating Genetic Data into Clinical Workflow with Clinical Decision Support Apps

Performance Management in Big Data Applica6ons. Michael Kopp, Technology

Big Data Use Cases. At Salesforce.com. Narayan Bharadwaj Director, Product Management

Program Model: Muskingum University offers a unique graduate program integra6ng BUSINESS and TECHNOLOGY to develop the 21 st century professional.

How To Use Splunk For Android (Windows) With A Mobile App On A Microsoft Tablet (Windows 8) For Free (Windows 7) For A Limited Time (Windows 10) For $99.99) For Two Years (Windows 9

Strategy and Architecture to Establish 'Smart Plants'

Solving Healthcare Problems with Big Data. Alicia d Empaire

The 3 questions to ask yourself about BIG DATA

How To Change Medicine

Transcription:

The Emerging Discipline of Data Science Principles and Techniques For Data- Intensive Analysis

What is Big Data Analy9cs? Is this a new paradigm? What is the role of data? What could possibly go wrong? What is Data Science?

Big Data is Hot!

Big Data Is Important Market Hot Results, products, jobs Poten9al 4 th Paradigm Accelerates discovery [urgent] BeLer: cost, speed, specificity Change 80% of processes [Gartner] Government Policy (45+) White House; most US Govt agencies Adop9on: Most Human Endeavors All academic disciplines Computa9onal X Cool Low effec9ve adop9on [EMC] 60% opera9onal 20% significant change < 1% effec9ve Results not opera9onal In its infancy þ lacking Understanding Concepts, tools, techniques (methods) 21 st Century Sta9s9cs Theory: principles, guidelines

Healthcare Poten9al: BeLer Health; Faster, Cheaper Remedies

What could go Wrong? When are Correla9ons Spurious?

Or Just Wrong? Eg Google Flu Trends Allegedly Real- 9me, Reliable Predic9ons High 100 out of 108 weeks

Future of Life: Ins9tute to mi;gate existen;al risks facing humanity

US Legal Community Pursuing Algorithmic Accountability

Do We Know / Can We Prove? DIA Result: correct, complete, efficient? What machines / algorithms / Machine Learning / Black Boxes / DIA do? Emergent Data- Driven Society with High Reward: Cancer cures, drug discovery, personalized medicine, Risk: errors in any of the above

The search for truth evidence- based causality evidence- based correla9ons

Model / Theory Hypotheses Data Analysis

Long Illustrious Histories Data Analysis Mathema9cs Babylon (17 th - 12 th C BCE) India (12 th C BCE) Mathema9cal analysis (17 th C, Scien9fic Revolu9on) Sta9s9cs (5 th C BCE, 18 th C) ~4,000 years Scien1fic Method Empiricism Aristotle (384-322 BCE) Ptolemy (1 st C) Bacons (13 th, 16 th C) ~2,000 years Scien9fic Discovery Paradigms 1 Theory 2 Experimenta9on 3 Simula9on 4 escience / Big Data ~ 1,000 years

Fourth Paradigm Modern Compu1ng Hardware: 40s- 50s FORTRAN: 50s Spreadsheets: 70s Databases: 70s- 80s World Wide Web: 90s ~ 60 years Data- Intensive Analysis of Everything escience (~2000) Big Data (~2007) Par9cle physics, drug discovery, ~ 15 years Paradigms Long developments Significant shiss Conceptual Theore9cal Procedural

Precision Oncology Scans Normal skin cell Original cancer cell Biopsy Biomarkers Monitor Sequence Treated cell Sequencing Machines Treat Compare Patient Chromosomes Test Target Cancer cell Normal cell In vivo In vitro In silico Source: Marty Tenebaum, Cancer Commons

Accelerating Scientific Discovery Probabilistic Results What: Correlation Experiment Model Why: Causation Correlations/ Hypotheses

Accelerating Scientific Discovery Probabilistic Results What: Correlation Baylor Scientists Experiment Model Why: Causation Watson Correlations/ Hypotheses

Profound Changes: Paradigm Shis [Kuhn] New reasoning / problem solving model Data è Data- Intensive (Big Data 4 Vs) Why è What Strategic (theory- based) è Tac9cal (evidence- based) Theory- driven (top- down) è Data- driven (bolom- up) Hypothesis tes9ng è Hypothesis genera9on Enabling Paradigm Shiss in most disciplines Science è escience Accelera9ng (scien9fic / engineering) discovery Most domains Personalized medicine Urban Planning Drug interac9ons Social and Economic Planning Beyond Data- Driven: Symbiosis What + Why Human intelligence + machine intelligence

Big Data and Data- Intensive Analysis THE BIG PICTURE: MY PERSPECTIVE

DIA Pipelines / Ecosystem Q: What Big Data technologies do you see becoming very popular within the next five years? A: I don t like to say that there s a specific technology, there are pipelines that you would build that have pieces to them How do you process the data, how do you represent it, how do you store it, what inferen9al problem are you trying to solve There s a whole toolbox or ecosystem that you have to understand if you are going to be working in the field Michael Jordan, Pehong Chen Dis;nguished Professor at the University of California, Berkeley

Data- Intensive Analysis Analy9cal Models Analy9cal Methods Analy9cal Results Data- Intensive Analysis Data Science

Data- Intensive Analysis Analy9cal Models Analy9cal Methods Analy9cal Results Data- Intensive Analysis Data Science

Data Management for Data- Intensive Analysis Data- Intensive Analysis Data Sources Internal Shared Data Repository Global Data Catalogue & Grid Access Analy9cal Models Analy9cal Methods Analy9cal Results Raw Data Acquisi9on & Cura9on External Shared Repository Catalogue En99es Analy9cal Data Acquisi9on Rela9onships Data- Intensive Analysis Data Science Data Science

Research Method: Examine Complex, Large- Scale Use Cases that push limits DATA- INTENSIVE ANALYSIS (DIA) DIA PROCESS (WORKFLOW / PIPELINE) DIA USE CASE RANGE

Data Analysis è Data- Intensive Analysis Common defini9on far too simplis;c : extract knowledge from data DIA: the ac;vity of using data to inves;gate phenomena, to acquire new knowledge, and to correct and integrate previous knowledge DIA Process/Workflow/Pipeline: a sequence of opera;ons that cons;tute an end- to- end DIA from source data to a quan;fied, qualified result

My Focus is Not common DIA Use Cases

Nor High Impact Organiza9onal DIA

Data(- Intensive) Analysis Range * Small Data (volume, velocity, variety) 98% Conven9onal data analysis: 1 K years - sta9s9cs, spreadsheets, databases, Big Data = (volume, velocity, variety) 2% Simple DIA: most data science is simple Jeff Leek 96% Focus Simple models & methods, single user, short dura9on: 65+ self- service tools, ML, widest- usage Rela9ve simplicity: sales, marke9ng, & social trends, defects, Complex DIA 4% Domains: par9cle physics, economics, stock market, genomics, drug discovery, weather, boiling water, psychology, Models & Methods: large, collabora9ve community, long dura9on, very large scale Why? A: This is where things obviously break * Many more factors

Example Scien9fic Workflow (Arvados)

Complex DIA Use Case #1 escience, Big Science, Networked Science, Community Compu9ng, Science Gateway TOP- QUARK, LARGE HADRON COLLIDER, CERN, SWITZERLAND

Higg s Boson: 40 Year Search LHC Data from proton proton collisions at centre- of- mass energies of 7 TeV (2011) and 8 TeV (2012)

How do you Prove Higgs Boson Exists? Standard model of physics predicts (30 years) Higgs Boson characteris9cs Mass ~125 GeV Decays to γγ, WW and ZZ boson pairs Couplings to W and Z bosons Spin parity Couples to up- type top- quark Couples to down- type fermions? Decays to bolom quarks and τ leptons

5 Sigma=000001% possible error (2012) 10 Sigma (2014)

Original Big Data Applica9on, eg, ATLAS high- energy physics (CERN) Source Data Sets Hardware filtering Root Model Tier 1: long- term cura9on; RAW reprocessing to derived formats Root Model ROOT in Oracle Boson Model ROOT Top Quark Model 1,000s Tier 0: Calibra9on and Alignment Express Stream Analysis ROOT Raw Data Acquisi9on & Cura9on Analy9cal Data Acquisi9on Data- Intensive Analysis

Worldwide LHC Computing Grid Tier 0 (CERN) Data recording Ini9al data reconstruc9on Data distribu9on Tier 1 (13 centers) Permanent storage Re- processing Analysis Tier 2 (~160 centers) Simula9on End- user analysis April 2012 Author/more info 37

Original Big Data Applica9on, eg, ATLAS high- energy physics (CERN); Oracle + DQ2 + ROOT ATLAS Distributed Data Management System (DQ2) (Pig, Hive, Hadoop) 2007+ On the Worldwide LHC Compu9ng Grid (WLCG) Source Data Sets Hardware filtering Root Model Tier 1: long- term cura9on; RAW reprocessing to derived formats Root Model ROOT (Oracle) Meta- Data / Access DQ2 (HDFS) Boson Model ROOT Top Quark Model 1,000s Tier 0: Calibra9on and Alignment Express Stream Analysis ROOT Raw Data Acquisi9on & Cura9on Analy9cal Data Acquisi9on Data- Intensive Analysis

Based on ~30 Large- Scale DIA Use Cases LESSONS LEARNED

DIA Lessons Learned (What) A Sosware Ar9fact: a workflow / pipeline Data- Intensive Analysis Workflow Data Management (80%) (Raw) Data Acquisi9on and Cura9on Analy9cal Data Acquisi9on Data- Intensive Analysis (20%) Objec9ve: switch 80:20 to 20:80 è Let scien;sts do science Explore (DIA) vs Build (sosware engineering) Dura9on: years Emerging Paradigm New programming paradigm Experiments over data Convergence Scien9fic / engineering discovery ~10 programming paradigms: database, IR, BI, DM,

Result Types DIA Lessons Learned (How) Provable <- > Probabilis9c <- > Specula9ve Nature Analy9cal Empirical: complete meta- data Abstract: incomplete meta- data Phases: Explora9on, Analysis, Interpreta9on Users Exploratory, Itera9ve, and Incremental Individual Workgroup Organiza9on / Enterprise Community

DIA Lessons Learned (People) Machine + Human Intelligence Symbiosis op9mized Domain knowledge cri9cal Mul9- disciplinary, Collabora9ve, Itera9ve Community Compu9ng: DIA Ecosystems sharing Massive resources Knowledge Costs Many (~60): escience, Science Gateways, Networked Science, High- energy physics (CERN: ROOT) Astrophysics (Gaia) Scien9fic Workflow Systems: ~30 Macroeconomics Global Alliance for Genomics and Health Enterprise Ecosystems, eg, Informa9on Services: Thomson Reuters, Bloomberg, Open- Science- Grid The Cancer Genome Atlas The Cancer Genomics Hub

DIA Lessons Learned (Essence) The value and role of What data is adequate evidence for Q? truth evidence- based causality evidence- based correla9ons

Complex DIA Use Case #2 Informa9on Services DOW JONES, BLOOMBERG, THOMSON REUTERS, PEARSON,

Informa9on Services Business Collect, curate, enrich, augment (IP) & disseminate informa9on Financial & Risk Investors News & press releases Brokerage research Instruments: stocks, bonds, loans, Legal Dockets Case Law Public records Law firms Global businesses Intellectual Property & Science Scien9fic ar9cles Patents Trademarks Domain names Clinical trials Tax & Accoun9ng Corporate Government Solu9ons

Consequences of Errors

Internal & External Data Sources Enterprise- Scale Big Data Architecture (Informa9on Services) Pre- Big Data Directory Big Data Big Data Catalogue Domain specific Data Marts Domain1 DS1 En9ty1 En9ty2 En9ty3 Raw Data Acquisi9on & Cura9on DSn data data data Organiza1on Knowledge Graph (linked En99es) Knowledge Graph Data Analy9cal Data Acquisi9on Domaink Data- Intensive Analysis Open Data Org data Opinion IP Open Registry KG1 Domain specific Knowledge Graphs

DIA Lessons Learned Modelling Analy9cal Models and Methods Selec9on / crea9on, fi ng / tuning Result verifica9on Model / method management Data Models En99es dominate Data Lakes Named En99es + En9ty (Graph) Models Ontologies (genomics), Ensembles, Emerging DIA Ecosystems Technology Languages (~30) Analy9cs Suites / Pla orms (~60) Big Data Management (~30)

Veracity WHAT COULD POSSIBLY GO WRONG?

Do We Know / Can We Prove DIA Result: correct, complete, efficient? What machines / algorithms / Machine Learning / Black Boxes / DIA do? High Risk / High Reward Data- Driven Society Risk: drugs or medical advice that cause harm Reward: faster, cheaper, more effec9ve cancer cures, drug discovery, personalized medicine,

Professional Cau9ons Experienced prac99oners Medicine Few data- driven results opera9onalized Mount Sinai: no black box solu9ons Authorita9ve organiza9ons NIH, HHS, EOCD, Na9onal Sta9s9cal Organiza9ons, Legal: Algorithmic Accountability: John ZiLrain, Harvard Law School

Q: What could possibly go wrong? A: Every step Data Sets All measurements approximate: availability, quality, requirements, sparse/dense, ; How much can we tolerate? What is the impact on the result? Models: All models wrong George Box 1974 Methods Select (1,000s), tune, verify Different methods è radically different results Results: Probabilis9c, error bounds, verifica9on, Data Analysis is 20% of the story

Pre Big Data Challenges Science: Experimental design: hypotheses, null hypotheses, dependent and independent variables, controls, blocking, randomiza9on, repeatability, accuracy Analysis: models, methods Resources: cost, 9me, precision

+Big Data Challenges Pre- Big Data Challenges @ scale: volume, velocity, and variety Complexity Data: sources, meta- data, 3Vs Models (reflec9ng the domain) Methods (mul9variate palerns) beyond human cogni9on Results: Massive numbers of correla9ons Unreliability (sta9s9cs @ scale) Reliability decreases as the number of variables increases (mul9variate analysis) << 10 variables (science &drug discovery) è 1,000s to millions (Machine Learning) In science and medical research, we ve always known that Misunderstood: Self- service, Automated Data Science- in- a- Box 80% unfamiliar with sta9s9cs, error bars, causa9on/correla9on, probabilis9c reasoning, automated data cura9on and analysis Widespread use of DIA: self- service, democra9za9on of analy9cs Like monkeys playing with loaded guns

Somewhere over there

DIA Verifica9on Principles & Techniques Conven9onal disciplines Man- machine symbiosis DIA Result è Empirical evidence Flashlight analogy: DIA reduces hypothesis space Cross- valida9on Validate predic9ve model: avoid overfi ng, will model work on unseen data sets? Data par99ons: Training Set, Test /Valida9on Set, ground truth K- fold cross valida9on Research Direc9on New measures of significance, the next genera9on P value 21 st Century sta9s9cs

I Proposal SCIENTIFIC METHOD è EMPIRICISM DATA SCIENCE è DATA- INTENSIVE ANALYSIS

Data Science is A body of principles and techniques for applying data- intensive analysis for inves;ga;ng phenomena, acquiring new knowledge, and correc;ng and integra;ng previous knowledge with measures of correctness, completeness, and efficiency DIA: an experiment over data

Conclusions Big Data & Data- Intensive Analysis Value of evidence (from data) Emerging reasoning and problem solving paradigm High risk / high reward Substan9al results already In its infancy, not yet understood, decades to go Overhyped (short term) but may change our world (long term) è Need for Data Science = principles & guidelines We re now at the what are the principles? point in 9me M Jordan Decades of research and prac9ce

Thank You!