Development of a Computational and Data-Enabled Science and Engineering Ph.D. Program Paul T. Bauman, Varun Chandola, Abani Patra Matthew Jones University at Buffalo, State University of New York EduHPC 2014 Workshop November 16, 2014 P. T. Bauman UB CDSE Ph.D. Program November 16, 2014 1 / 16
What is CDSE? P. T. Bauman UB CDSE Ph.D. Program November 16, 2014 2 / 16
Need for Data Science Training 140,000 190,000 additional trained personnel for deep analytical talent positions, and 1.5 million more data-savvy managers needed to take full advantage of big data in the United States. National shortage for such talent is at least 60% McKinsey Global Institute Universities can hardly turn out data scientists fast enough. New York Times P. T. Bauman UB CDSE Ph.D. Program November 16, 2014 3 / 16
Need for Data Science Training Many new discoveries in science, new constructs in engineering and efficiencies/opportunities in business are made possible by CDSE methodologies e.g. LHC generates 2-6 GB data/second mining of which led to the Higgs discovery! LSST is hailed as a breakthrough in data enabled science. Demand students in multiple departments (MAE, CBE, CSEE, CSE, Math, Physics, Chemistry,...) are in ad hoc programs of study aligned with this theme with gaps, poor sequencing and preparation P. T. Bauman UB CDSE Ph.D. Program November 16, 2014 4 / 16
A Little History! Analog with Computational Science program development Many scientists doing computing, but little training in numerical mathematics, algorithms, software engineering Major advancements came not just from hardware, but from better algorithms Taken from David Keyes, Computational and applied mathematics in scientific discovery P. T. Bauman UB CDSE Ph.D. Program November 16, 2014 5 / 16
Interdisciplinary Approach Transaction Performance vs. Moore s Law: A Trend Analysis 117 We expect similar trends in the emergence of Big-Data Currently, trends essentially follow Moore s Law Opportunities for major advancements emerging from interdisciplinary efforts We want history to repeat itself here! Gains will come from scientists who understand the problems on both sides Average NtpmC per Year 1000000 100000 10000 1000 100 Fig. 3. TPC-C primary performance metric trend vs. Moore s Law During the years 1996 to 2000, Moore s Law slightly underestimates NtpmC, while starting with 2008 it slightly overestimates NtpmC. In the period between 1996 and 2000 Transaction the publications used a Performance very diverse combination of vs. database Moore s and hardware platforms (about 12 hardware and 8 software platforms). Between 2001 and 2007 Moore s Law: estimates ANtpmC Trend pretty accurately. Analysis, In this period TPCTC most publications 2010 were on x86 platforms running Microsoft SQL Server. Starting in 2008 Moore s Law over estimated NtpmC. At the same time the number of Microsoft SQL Server publications started decreasing. 2008 saw 21 benchmark publications, 2009 saw 6 benchmark publications and, so far, 2010 3 has seen 5 benchmark publications. This is mostly because P. T. Bauman UB CDSE Ph.D. of Microsoft s Programincreased interest in the new TPC-E November benchmark, 16, introduced 2014 in 2007, 6 / 16 1993 1994 1995 Average tpmc per processor Moore's Law TPC-C Revision2 1996 1997 Firstresult using 15K RPM SCSI disk drives Firstclustered result 1998 1999 First Linux result First multi core result First result using storage area network (SAN) Firstresult using 7.2K RPM SCSI disk drives TPC-C Revision 3, First Windows result, First x86 result 2000 2001 2002 2003 Publication Year Firstresult using solid state disks (SDD) First result using 15K RPM SAS disk drives Intel introduces multithreading TPC-C Revision 5, First x86-64 bit result Taken from Nambiar and Poess, 2004 2005 2006 2007 2008 2009 2010
UB CDSE Ph.D. Program Students will get heavy doses of graduate applied mathematics/statistics, computer science, and HPC Build on a strong foundation in their core science/engineering Will have end-to-end expertise in innovating, developing, and delivering computational and data science products and tools P. T. Bauman UB CDSE Ph.D. Program November 16, 2014 7 / 16
Other Data Science Programs NYU: Many MS programs, Ph.D. CS specializations Columbia: MS Data Science Stanford: MS CME, Data Science track Data Science CDSE Computational Science P. T. Bauman UB CDSE Ph.D. Program November 16, 2014 8 / 16
UB CDSE Ph.D. Program MS 30 Hours 42 hrs=(30 from areas above; minimum 9 in each) P. T. Bauman UB CDSE Ph.D. Program November 16, 2014 9 / 16
P. T. Bauman UB CDSE Ph.D. Program November 16, 2014 10 / 16
Core Courses Data Science CSE 574: Introduction to Machine Learning EE 634: Principles of Information Theory and Coding IE 512: Decision Analysis MAE 674: Optimal Estimation Methods STA 502: Statistical Inference STA 503: Regression Analysis STA 521: Introduction to Theoretical Statistics I STA 536: Statistical Design and Analysis of Experiments STA 567: Bayesian Statistics Applied and Numerical Mathematics IE 572: Linear Programming IE 575: Stochastic Methods MAE 702: Applied Functional Analysis MTH 539: Methods of Applied Mathematics I MTH 540: Methods of Applied Mathematics II MTH 543: Fundamentals of Applied Mathematics MTH 631: Analysis I MTH 649: Partial Differential Equations P. T. Bauman UB CDSE Ph.D. Program November 16, 2014 11 / 16
Core Courses High Performance and Data Intensive Computing CSE 562: Database Systems CSE 587: Data Intensive Computing CSE 603: Parallel and Distributed Processing CSE 999: Large-Scale Distributed Data Systems IE 551: Simulation and Stochastic Models MAE 609: High Performance Computing I MAE 610: High Performance Computing II STA 545: Data Mining I STA 546: Data Mining II P. T. Bauman UB CDSE Ph.D. Program November 16, 2014 12 / 16
Core Faculty Co-Directors Venu Govindaraju (CSE), Abani Patra (MAE) Lead Faculty Gino Biondini (Math), Vipin Chaudhary (CSE), Marianthi Markatou (Biostats), Murali Ramanathan (Pharm), Christian Tiu (Mgt Sci) New CDSE Faculty Paul Bauman (MAE), Varun Chandola (CSE), Michele Dupuis (CBE) P. T. Bauman UB CDSE Ph.D. Program November 16, 2014 13 / 16
Affiliated Faculty School of Engineering and Applied Sciences A. Aref (CSEE), C.W. Chen (CSE), G. Dargush (MAE), S. Das(MAE), P. Desjardin (MAE), M. Demirbas (CSE), J. Errington (CBE), E. Furlani (CBE), T. Furlani (CCR), K. Hoffmann (Phys & BME), D. Kofke (CBE), T. Kosar (CSE), K. Lewis (CSE), D. Pados (EE), C. Qiao (CSE), D. Salac (MAE), G. Scutari(EE),... College of Arts and Sciences L. Biang (GEO), M. Bursik (GLY), J.-H. Jung (Math), O. Khan (SAP), J. Ringland(Math),S. Metcalf (GEO), C. Renschler (GEO), B. Spencer (Math), G. Valentine (GLY), S. Rapoccio(Phy) School of Management S. Das Smith (MSS), R. Ramesh (MSS), H. R. Rao (MSS), A. Zhang (CSE) P. T. Bauman UB CDSE Ph.D. Program November 16, 2014 14 / 16
Center for Computational Research Leading Academic Supercomputing Center More than 14 years experience delivering HPC to campus 100 Tflops Total Aggregate Compute Capacity 8000 cores; 600 TB of high performance storage Major Upgrades Planned for 2014 $1.2M ESD CFA Award $100M Genome Initiative with NY Genome Center in NYC Utilization in 2013 454 users (147 faculty), 2.1M jobs NYS HPC2 Consortium RPI,UB, Stony Brook, Brookhaven, NYSERNet Bringing HPC to NYS Industry and Academia NSF TAS Award 5-year $7.5M XSEDE-based Award XDMoD P. T. Bauman UB CDSE Ph.D. Program November 16, 2014 15 / 16
http://cdse.buffalo.edu P. T. Bauman UB CDSE Ph.D. Program November 16, 2014 16 / 16