How To Teach Data Science



Similar documents
Computational Science and Informatics (Data Science) Programs at GMU

Learning from Big Data in

The Tonnabytes Big Data Challenge: Transforming Science and Education. Kirk Borne George Mason University

Conquering the Astronomical Data Flood through Machine

LSST and the Cloud: Astro Collaboration in 2016 Tim Axelrod LSST Data Management Scientist

CYBERINFRASTRUCTURE FRAMEWORK FOR 21 st CENTURY SCIENCE AND ENGINEERING (CIF21)

College of Science George Mason University Fairfax, VA 22030

Data analysis of L2-L3 products

Challenges in e-science: Research in a Digital World

Undergraduate Studies Department of Astronomy

Astrophysics with Terabyte Datasets. Alex Szalay, JHU and Jim Gray, Microsoft Research

Department of Science Education and Mathematics Education

Example application (1) Telecommunication. Lecture 1: Data Mining Overview and Process. Example application (2) Health

Data Science at the University of Virginia

Data Driven Discovery In the Social, Behavioral, and Economic Sciences

SYLLABUS FORM WESTCHESTER COMMUNITY COLLEGE Valhalla, NY l0595. l. Course #:PHYSC NAME OF ORIGINATOR /REVISOR: PAUL ROBINSON

CYBERINFRASTRUCTURE FRAMEWORK FOR 21 ST CENTURY SCIENCE, ENGINEERING, AND EDUCATION (CIF21)

COGNITIVE SCIENCE AND NEUROSCIENCE

Information Management course

NASA s Big Data Challenges in Climate Science

CYBERINFRASTRUCTURE FRAMEWORK FOR 21 ST CENTURY SCIENCE, ENGINEERING, AND EDUCATION (CIF21) $100,070,000 -$32,350,000 / %

Survey of Canadian and International Data Management Initiatives. By Diego Argáez and Kathleen Shearer

Exploring the roles and responsibilities of data centres and institutions in curating research data a preliminary briefing.

DAME Astrophysical DAta Mining Mining & & Exploration Exploration GRID

Proposal for New Program: BS in Data Science: Computational Analytics

Databases & Data Infrastructure. Kerstin Lehnert

National Nursing Informatics Deep Dive Program

Information Visualization WS 2013/14 11 Visual Analytics

How To Become A Data Scientist

Using Data Mining and Machine Learning in Retail

INFORMATION SYSTEMS (INFO)

Science Investigations: Investigating Astronomy Teacher s Guide

BSc in Information Technology Degree Programme. Syllabus

OUTLINE FOR AN INTERDISCIPLINARY CERTIFICATE PROGRAM

Teaching Computational Thinking using Cloud Computing: By A/P Tan Tin Wee

Ph.D. in Bioinformatics and Computational Biology Degree Requirements

Size and Scale of the Universe

Program Approval Form

CS Matters in Maryland CS Principles Course

Accelerated Undergraduate/Graduate (BS/MS) Dual Degree Program in Computer Science

Statistics for BIG data

Danny Wang, Ph.D. Vice President of Business Strategy and Risk Management Republic Bank

An interdisciplinary model for analytics education

A Taxonomy for Space Curricula

Software Engineering for Big Data. CS846 Paulo Alencar David R. Cheriton School of Computer Science University of Waterloo

Data UNC. Vinayak Deshpande

Big Data a threat or a chance?

Big Data Hope or Hype?

Core Ideas of Engineering and Technology

UR Undergraduate Physics and Astronomy Program Learning Objectives and Assessment Plan

Manjula Ambur NASA Langley Research Center April 2014

3D Point Cloud Analytics for Updating 3D City Models

CHAPTER 1 INTRODUCTION

Big Data R&D Initiative

Guidelines for Establishment of Contract Areas Computer Science Department

Unit 1.7: Earth and Space Science The Structure of the Cosmos

ANALYTICS CENTER LEARNING PROGRAM

RFI Summary: Executive Summary

A Capability Maturity Model for Scientific Data Management

NITRD and Big Data. George O. Strawn NITRD

Integrating a Big Data Platform into Government:

BIG DATA IN THE CLOUD : CHALLENGES AND OPPORTUNITIES MARY- JANE SULE & PROF. MAOZHEN LI BRUNEL UNIVERSITY, LONDON

International Journal of Computer Science Trends and Technology (IJCST) Volume 2 Issue 3, May-Jun 2014

SURVEY REPORT DATA SCIENCE SOCIETY 2014

BIG DATA & ANALYTICS. Transforming the business and driving revenue through big data and analytics

This Symposium brought to you by

Transcription:

The Past, Present, and Future of Data Science Education Kirk Borne @KirkDBorne http://kirkborne.net George Mason University School of Physics, Astronomy, & Computational Sciences

Outline Research and Application My POV : Data Science Relationship to Big Data Data Science Programs at Mason (GMU): Past, Present, and Future PhD Past and Future BS, and an undergraduate minor Future MS professional masters degree Challenges and Reflections http://kirkborne.net/ 2

Outline Research and Application My POV : Data Science Relationship to Big Data Data Science Programs at Mason (GMU): Past, Present, and Future PhD Past and Future BS, and an undergraduate minor Future MS professional masters degree Challenges and Reflections http://kirkborne.net/ 3

Astronomy Example Before we look at Data Science... Let us look at an astronomy example The LSST (Large Synoptic Survey Telescope) Mason is a partner institution and our scientists are involved with the science, data management, and education programs of the LSST

LSST = Large Synoptic Survey Telescope http://www.lsst.org/ (mirror funded by private donors) 8.4-meter diameter primary mirror = 10 square degrees! Hello!

LSST = Large Synoptic Survey Telescope http://www.lsst.org/ (mirror funded by private donors) 8.4-meter diameter primary mirror = 10 square degrees! Hello!

LSST = Large Synoptic Survey Telescope http://www.lsst.org/ (mirror funded by private donors) 8.4-meter diameter primary mirror = 10 square degrees! 100-200 Petabyte image archive 20-40 Petabyte database catalog Hello!

Observing Strategy: One pair of images every 40 seconds for each spot on the sky, then continue across the sky continuously every night for 10 years (~2022-2032), with time domain sampling in log(time) intervals (to capture dynamic range of transients). LSST (Large Synoptic Survey Telescope): Ten-year time series imaging of the night sky mapping the Universe! ~10,000,000 events each night anything that goes bump in the night! Cosmic Cinematography! The New Sky! @ http://www.lsst.org/ 8

LSST Key Science Drivers: Mapping the Dynamic Universe Solar System Inventory (moving objects, NEOs, asteroids: census & tracking) Nature of Dark Energy (distant supernovae, weak lensing, cosmology) Optical transients (of all kinds, with alert notifications within 60 seconds) Digital Milky Way (proper motions, parallaxes, star streams, dark matter) LSST in time and space: When? ~2022-2032 Where? Cerro Pachon, Chile Architect s design of LSST Observatory 9

3-Gigapixel camera LSST Summary http://www.lsst.org/ One 6-Gigabyte image every 20 seconds 30 Terabytes every night for 10 years 100-Petabyte final image data archive anticipated all data are public!!! 20-Petabyte final database catalog anticipated Real-Time Event Mining: ~10 million events per night, every night, for 10 yrs Follow-up observations required to classify these Repeat images of the entire night sky every 3 nights: Celestial Cinematography 10

The LSST Data Challenges 11

Mason (GMU) is an LSST member institution Borne is chairman of the LSST Astroinformatics and Astrostatistics research team http://www.lsst.org/ @KirkDBorne Architect s design of LSST Observatory

Data Science: A National Imperative 1. National Academies report: Bits of Power: Issues in Global Access to Scientific Data, (1997) http://www.nap.edu/catalog.php?record_id=5504 2. NSF (National Science Foundation) report: Knowledge Lost in Information: Research Directions for Digital Libraries, (2003) downloaded from http://www.sis.pitt.edu/~dlwkshop/report.pdf 3. NSF report: Cyberinfrastructure for Environmental Research and Education, (2003) downloaded from http://www.ncar.ucar.edu/cyber/cyberreport.pdf 4. NSB (National Science Board) report: Long-lived Digital Data Collections: Enabling Research and Education in the 21st Century, (2005) downloaded from http://www.nsf.gov/nsb/documents/2005/llddc_report.pdf 5. NSF report with the Computing Research Association: Cyberinfrastructure for Education and Learning for the Future: A Vision and Research Agenda, (2005) downloaded from http://archive.cra.org/reports/cyberinfrastructure.pdf 6. NSF Atkins Report: Revolutionizing Science & Engineering Through Cyberinfrastructure: Report of the NSF Blue-Ribbon Advisory Panel on Cyberinfrastructure, (2005) downloaded from http://www.nsf.gov/od/oci/reports/atkins.pdf 7. NSF report: The Role of Academic Libraries in the Digital Data Universe, (2006) downloaded from http://www.arl.org/storage/documents/publications/digital-data-report-2006.pdf 8. NSF report: Cyberinfrastructure Vision for 21st Century Discovery, (2007) downloaded from http://www.nsf.gov/od/oci/ci_v5.pdf 9. JISC/NSF Workshop report on Data-Driven Science & Repositories, (2007) downloaded from http://www.sis.pitt.edu/~repwkshop/nsf-jisc-report.pdf 10. DOE report: Visualization and Knowledge Discovery: Report from the DOE/ASCR Workshop on Visual Analysis and Data Exploration at Extreme Scale, (2007) downloaded from http://www.sci.utah.edu/vaw2007/doe-visualization-report-2007.pdf 11. DOE report: Mathematics for Analysis of Petascale Data Workshop Report, (2008) downloaded from http://science.energy.gov/~/media/ascr/pdf/program-documents/docs/peta_scaled_at_a_workshop_report.pdf 12. NSTC Interagency Working Group on Digital Data report: Harnessing the Power of Digital Data for Science and Society, (2009) downloaded from http://www.nitrd.gov/about/harnessing_power_web.pdf 13. National Academies report: Ensuring the Integrity, Accessibility, and Stewardship of Research Data in the Digital Age, (2009) downloaded from http://www.nap.edu/catalog.php?record_id=12615 14. NSF report: Data-Enabled Science in the Mathematical and Physical Sciences, (2010) downloaded from https://www.nsf.gov/mps/dms/documents/data-enabledscience.pdf 15. National Big Data Research and Development Initiative, (2012) downloaded from http://www.whitehouse.gov/sites/default/files/microsites/ostp/big_data_press_release_final_2.pdf 16. National Academies report: Frontiers in Massive Data Analysis, (2013) downloaded from http://www.nap.edu/catalog.php?record_id=18374

Research and Application Outline My POV : Data Science Relationship to Big Data Data Science Programs at Mason (GMU): Past, Present, and Future PhD Past and Future BS, and an undergraduate minor Future MS professional masters degree Challenges and Reflections http://kirkborne.net/ 14

What is Data Science? It is a collection of mathematical, computational, scientific, and domain-specific methods, tools, and algorithms to be applied to Big Data for discovery, decision support, and data-to-knowledge transformation Statistics Data Mining (Machine Learning) & Analytics (KDD) Data & Information Visualization Semantics (Natural Language Processing, Ontologies) Data-intensive Computing (e.g., Hadoop, Cloud, ) Modeling & Simulation Metadata for Indexing, Search, & Retrieval Advanced Data Management & Data Structures Domain-Specific Data Analysis Tools 15

From Wikipedia: Big Data refers to any collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications. The challenges include capture, curation, storage, search, sharing, transfer, analysis, and visualization. What is Big Data?

Definitions of Big Data From Wikipedia: Big Data refers to any collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications. The challenges include capture, curation, storage, search, sharing, transfer, analysis, and visualization. My suggestion: Big Data refers to Everything, Quantified and Tracked! According to the standard (Wikipedia) definition, even the Ancient Romans had Big Data! That s ridiculous! See my article Today's Big Data is Not Yesterday's Big Data at: http://bit.ly/1axb7hd

Definitions of Big Data From Wikipedia: Big Data refers to any collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications. The challenges include capture, curation, storage, search, sharing, transfer, analysis, and visualization. My suggestion: Big Data refers to Everything, Quantified and Tracked! The challenges do not change but their scale, scope, scariness, discovery potential do change! Examples: Big Data Science Projects Social Networks IoT = Internet of Things M2M = Machine-to-Machine

Data-Oriented Discovery Experiments can now be run against the big data collection Hypotheses are inferred, questions are posed, experiments are designed & run, results are analyzed, hypotheses are tested & refined! This is the 4 th Paradigm of Science This is Data Science This is especially (and correctly) true if the data collection is the full data set for a given domain: astronomical sky surveys, human genome (the 1000 Genomes Project), social networks, large-scale simulations, earth observing system, ocean observatories initiative, banking, retail, national security, cybersecurity, and the list goes on and on

Outline Research and Application My POV : Data Science Relationship to Big Data Data Science Programs at Mason (GMU): Past, Present, and Future PhD Past and Future BS, and an undergraduate minor Future MS professional masters degree Challenges and Reflections http://kirkborne.net/ 20

CSI Graduate Program at Mason http://spacs.gmu.edu/content/academic-programs CSI = Computational Science & Informatics CSI graduate program has existed at GMU since 1992 Over 200 PhD s graduated in past 20 years Approximately 95 students currently enrolled About 10% of students end with M.S. (Masters in Computational Science) We have a Graduate Certificate in Computational Techniques and Applications (non-degree professional certification program) Note that there is no specific Data Science concentration in CSI. However, students can enroll in other departments to study specific X-Informatics disciplines: Geoinformatics (including Geospatial Intelligence) Bioinformatics Health Informatics X-informatics = Application of Data Science to discipline X 21

CSI Graduate Program at Mason http://spacs.gmu.edu/content/academic-programs CSI = Computational Science & Informatics Students can choose a concentration from several choices: Computational Astrophysics Space Science Computational Physics Computational Fluid Dynamics Computational Statistics Computational Learning Computational Mathematics Computational Materials Science (Physical Chemistry) A student may create their own concentration, such as one of these previously approved concentrations: Computational Finance Remote Sensing Computational Economics 22

CSI Graduate Program at Mason http://spacs.gmu.edu/content/academic-programs CSI = Computational Science & Informatics Students must complete 4 core courses from this set of 5: CSI 700 Numerical Methods CSI 701 Foundations of Computational Science CSI 702 High Performance Computing CSI 703 Scientific and Statistical Visualization CSI 710 Scientific Databases There are also many electives (at least 5 additional CSI courses are required, plus concentration science electives). For example: Data Science Courses! Data Mining, Knowledge Mining, Computational Learning, Statistical Learning, Computational Statistics, Statistical Graphics, Data Exploration, etc. 23

Example Course Syllabus: CSI 710 Scientific Databases CSI 710 Scientific Databases (taught by K.B.) lectures include: Relational Databases: Modeling, Schemas, Normalization, SQL Scientific Databases, Big Data in Science, The 4 th Paradigm E-Science, Ontologies, Semantic E-Science, X-Informatics Distributed Data, Federated Data, Virtual Observatories Citizen Science with Big Data Scientific Data Mining I Scientific Data Mining II Astroinformatics and Astro databases Bioinformatics and Bio databases Geoinformatics and Geo databases Health Informatics Online Science (Jim Gray s KDD-2003 lecture) Intelligent Archives of the Future 24

Outline Research and Application My POV : Data Science Relationship to Big Data Data Science Programs at Mason (GMU): Past, Present, and Future PhD Past and Future BS, and an undergraduate minor Future MS professional masters degree Challenges and Reflections http://kirkborne.net/ 25

CDS Undergraduate Program at Mason http://spacs.gmu.edu/content/academic-programs CDS = Computational and Data Sciences Undergraduate B.S. degree program at Mason since 2007 Currently in hiatus, pending program modifications Originally, students could choose the general CDS degree, or else choose one of these concentrations: Physics Chemistry Biology A student could create their own concentration. For example: Environmental Science Anticipated modifications to the program only 2 emphasis areas (maximizing student s ability to create their own domain-specific course of study, in addition to a small set of required courses): Modeling and Simulation Data Science 26

CDS Undergraduate Program at Mason http://spacs.gmu.edu/content/academic-programs CDS = Computational and Data Sciences The DATA SCIENCE component of the curriculum was developed with the support of a grant (2007) from the NSF (National Science Foundation): CUPIDS = Curriculum for an Undergraduate Program In Data Sciences Primary Goal: to increase student s understanding of the role that data plays across the sciences as well as to increase the student s ability to use the technologies associated with data acquisition, mining, analysis, and visualization. Objectives students are trained: to access large distributed data repositories to conduct meaningful inquiries into the data to mine, visualize, and analyze the data to make objective data-driven inferences, discoveries, and decisions 27

CDS Undergraduate Program at Mason http://spacs.gmu.edu/content/academic-programs CDS = Computational and Data Sciences Core courses that students can choose from: CDS 101 Introduction to Computational Data Sciences CDS 130 Computing for Scientists CDS 251 Introduction to Scientific Programming CDS 301 Scientific Information and Data Visualization CDS 302 Scientific Data and Databases CDS 401 Scientific Data Mining CDS 410 Modeling and Simulations I CDS 411 Modeling and Simulations II Additional required courses include Math, Statistics, Computer Science, Physics I and II, plus courses in student s chosen science concentration 28

CDS Undergraduate Program at Mason http://spacs.gmu.edu/content/academic-programs CDS = Computational and Data Sciences We are also infiltrating the entire undergraduate program at Mason through 3 of our courses that satisfy university General Education graduation requirements for all students at the university: CDS 101 Introduction to Computational Data Sciences Satisfies GMU s Natural Science requirement CDS 130 Computing for Scientists Satisfies GMU s I.T. requirement CDS 151 Data Ethics Satisfies GMU s Ethics requirement CDS undergraduate minor (open to any student at the university) 12 CDS credits (focus on Data Science or Modeling & Simulation) plus one additional science class http://kirkborne.net/ 29

Example of Learning Objectives: CDS 401 Scientific Data Mining Be able to explain the role of data mining within scientific knowledge discovery. Be able to describe the most well known data mining algorithms and correctly use data mining terminology. Be able to express the application of statistics, similarity measures, and indexing to data mining tasks. Identify appropriate techniques for classification and clustering applications. Determine approaches used for mining large scientific databases (e.g., genomics, virtual observatories). Recognize techniques used for spatial and temporal data mining applications. Express the steps in a data mining project (e.g., cleaning, transforming, indexing, mining, analysis). Analyze classic data mining examples and use cases, and assess the applicatio of different data mining techniques. Effectively prepare data for mining. Effectively use software packages for data exploration, visualization, and mining. http://kirkborne.net/ 30

Outline Research and Application My POV : Data Science Relationship to Big Data Data Science Programs at Mason (GMU): Past, Present, and Future PhD Past and Future BS, and an undergraduate minor Future MS professional masters degree Challenges and Reflections http://kirkborne.net/ 31

MS Professional Masters Program at Mason http://data-informed.com/george-mason-university-updates-masters-program-data-science/ This will take effect in Fall 2014: http://goo.gl/0l4gno This program will be a restructuring of our existing MS in Computational Science (i.e., what our website says today is not the way the program will look starting in Fall 2014) The changes include: focus on serving the professional community (not the academic research community) focus on meeting workforce skills demands (especially for Data Scientists) deployment of 3 new Areas of Emphasis that our customers are demanding: Data Science Transportation Safety (associated with the new National Center for Collision Safety and Analysis within our school) Modeling and Simulation 32

Outline Research and Application My POV : Data Science Relationship to Big Data Data Science Programs at Mason (GMU): Past, Present, and Future PhD Past and Future BS, and an undergraduate minor Future MS professional masters degree Challenges and Reflections http://kirkborne.net/ 33

Challenges and Reflections Attracting students: after we started, we realized that no student ever says I want to be a data scientist when I grow up. ( not true any more!) Visibility: most other science departments are not aware of the importance of our courses for their majors. (Note: Biology and Neuroscience now require our Science Computing course.) Scientific computing course: this was identified 3 years ago as a necessary course to attract students we now have this course, and it is very popular (nearly 200 students each semester, and growing ) Program evolution: our journey (PhD to BS to MS) reflects the way the Big Data and Data Science world has evolved from academic research, to general education, to meeting workforce demands here and now, but then it will come back to the non-graduate program focus... Future expectations for Data Science Education: the general education focus will become essential, spreading to K-12 (eventually) http://kirkborne.net/ 34

Data Science Education: 2 Perspectives Data Science in Education introduce data in all learning settings: Informatics (Data Science) enables transparent reuse and analysis of data in inquiry-based classroom learning. Learning is enhanced when students work with real data and information (especially online data) that are related to the topic (any topic) being studied. http://serc.carleton.edu/usingdata/ ( Using Data in the Classroom ) http://www.oceansofdata.org/ (EDC s Oceans of Data Institute) An Education in Data Science students are specifically trained: to access large distributed data repositories to conduct meaningful inquiries into the data to mine, visualize, and analyze the data to make objective data-driven inferences, discoveries, and decisions http://kirkborne.net/ 35

There are many programs now See Data Informed s Map of University Programs in Big Data Analytics at: http://data-informed.com/bigdata_university_map/

follow @KirkDBorne The KirkDBorne Ultimatum** Data Literacy for all! [ **no connection to Robert Ludlum s Bourne Ultimatum ] http://kirkborne.net/