The Big Gift of Big Data Valerio Pascucci Director, Center for Extreme Data Management Analysis and Visualization Professor, SCI institute and School of Computing, University of Utah Laboratory Fellow, Pacific Northwest National Laboratory Pascucci-1
Center for Extreme Data Management, Analysis, and Visualization 10 Faculty + scientists, developers, students, Primary partners: UU, PNNL, LLNL Other partnerships: NSA, INL, ANL,. Involvement in national Initiatives $1.6B NSA data center (1.5 million-square-foot facility) Pascucci-2
What is Big Data? Big Data is like teenage sex: everyone Talks About It, nobody really knows how to do it, everyone thinks everyone else is doing it, so everyone claims they are doing it. (Dan Ariely). eventually everyone will do it! Pascucci-3
Are Global Warming Trends Associated to Human Activities? Pascucci-4
Can Devastating Material Failures Such as Earthquakes Travel at Supersonic Speed? Understanding the strength of new materials is critical in creating structures as small as microprocessors, buildings, or airplanes that withstand real-world forces Pascucci-5
Change the World of Fuels and Engines to Increase Efficiency and Reduce Pollution Fuel streams are rapidly evolving: Heavy hydrocarbons Oil sands Oil shale Coal New renewable fuel sources Ethanol Biodiesel New engine technologies: Direct Injection (DI) Homogeneous Charge Compression Ignition (HCCI) Low-temperature combustion Mixed modes of combustion (dilute, high-pressure, low-temp.) Pascucci-6
Can We Explain the Origin and the Evolution of the Universe? Pascucci-7
Can We Develop a New Healthcare Process that is Fully Personalized Pascucci-8
Intermezzo to talk about a convicted felon Pascucci-10
Galileo Galilei Jailed in Apr 12, 1633 because "gravely suspect of heresy The case was quickly reviewed by the Catholic Church 1564-1642 Received excuses in 1992 Pascucci-11
Galileo Galilei Led the Modern Revolution of the Scientific Method Make observations Propose a hypothesis of theoretical model Design and perform anrepetible experiment to test the hypothesis Analyze your data to determine whether to accept or reject the hypothesis If necessary, iterate (propose and test a new hypothesis) Pascucci-12
Galileo Galilei Led the Modern Revolution of the Scientific Method Make observations Propose a hypothesis of theoretical model Design and perform anrepetible experiment to test the hypothesis Analyze your data to determine whether to accept or reject the hypothesis If necessary, iterate (propose and test a new hypothesis) Pascucci-13
Galileo Galilei Led the Modern Revolution of the Scientific Method Make observations Propose a hypothesis of theoretical model Design and perform anrepetible experiment to test the hypothesis Analyze your data to determine whether to accept or reject the hypothesis If necessary, iterate (propose and test a new hypothesis) Pascucci-14
Galileo Galilei Led the Modern Revolution of the Scientific Method Make observations Propose a hypothesis of theoretical model Design and perform anrepetible experiment to test the hypothesis Analyze your data to determine whether to accept or reject the hypothesis Invention If necessary, of the iterate (propose and test a new hypothesis) Telescope Pascucci-15
Galileo Galilei Led the Modern Revolution of the Scientific Method Make observations Propose a hypothesis of theoretical model Design and perform anrepetible experiment to test the hypothesis Analyze your data to determine whether to accept or reject the hypothesis If necessary, iterate (propose and test a new hypothesis) Pascucci-16
Galileo Galilei Led the Modern Revolution of the Scientific Method Make observations Propose a hypothesis of theoretical model Design and perform anrepetible experiment to test the hypothesis Analyze your data to determine whether to accept or reject the hypothesis If necessary, iterate Development of formal (propose and test a new hypothesis) model description Pascucci-17
New Data Collection and Processing Resources Challenged This Model Large Synoptic Survey Telescope 15TB/day 100PB in 10-year Pascucci-18
New Data Collection and Processing Resources Challenged This Model PayPal s Data Volumes Pascucci-19
New Data Collection and Processing Resources Challenged This Model Earth System Grid Tens of Petabytes of climate data Pascucci-20
New Data Collection and Processing Resources Challenged This Model Year Annual Number of Google Searches Average Searches Per Day 2013 2,161,530,000,000 5,922,000,000 2012 1,873,910,000,000 5,134,000,000 2011 1,722,071,000,000 4,717,000,000 2010 1,324,670,000,000 3,627,000,000 2009 953,700,000,000 2,610,000,000 2008 637,200,000,000 1,745,000,000 2007 438,000,000,000 1,200,000,000 2000 22,000,000,000 60,000,000 1998 3,600,000 9,800 Pascucci-21
New Data Collection and Processing Resources Challenged This Model 1 st paradigm, empirical science 2 nd paradigm, model based theoretical science 3 rd paradigm, computational science (simulations) 4 th paradigm, data driven investigation (escience) Pascucci-23
New Data Collection and Processing Resources Challenged This Model 1 st paradigm, empirical science 2 nd paradigm, model based theoretical science "Essentially, all models are wrong, but some are useful George Edward Pelham Box (1987) 3 rd paradigm, computational science (simulations) 4 th paradigm, data driven investigation (escience) The End of Theory: The Data Deluge Makes the Scientific Method Obsolete Chris Anderson (2011) Pascucci-24
New Data Collection and Processing Resources Challenged This Model 1 st paradigm, empirical science 2 nd paradigm, model based theoretical science "Essentially, all models are wrong, but some are useful George Edward Pelham Box (1987) 3 rd paradigm, computational science (simulations) 4 th paradigm, data driven investigation (escience) The End of Theory: The Data Deluge Makes the Scientific Method Obsolete Chris Anderson (2011) Pascucci-25
New Data Collection and Processing Resources Challenged This Model 1 st paradigm, empirical science 2 nd paradigm, model based theoretical science "Essentially, all models are wrong, but some are useful George Edward Pelham Box (1987) 3 rd paradigm, computational science (simulations) 4 th paradigm, data driven investigation (escience) The End of Theory: The Data Deluge Makes the Scientific Method Obsolete Chris Anderson (2011) Pascucci-26
New Data Collection and Processing Resources Challenged This Model 1 st paradigm, empirical science 2 nd paradigm, model based theoretical science "Essentially, all models are wrong, but some are useful George Edward Pelham Box (1987) 3 rd paradigm, computational science (simulations) 4 th paradigm, data driven investigation (escience) The End of Theory: The Data Deluge Makes the Scientific Method Obsolete Chris Anderson (2011) Pascucci-27
New Data Collection and Processing Resources Challenged This Model 1 st paradigm, empirical science 2 nd paradigm, model based theoretical science 3 rd paradigm, computational science (simulations) 4 th paradigm, data driven investigation (escience) Pascucci-28
The True Revolution is in the New Challenges that We Can Try to Tackle Milky Way crashes into Andromeda system. Can we test Empirically multiple scenarios? In four billion years Pascucci-29
The True Revolution is in the New Challenges that We Can Try to Tackle Global warming expectation Can we test Empirically multiple scenarios? Predict the outcome in 2100 Pascucci-30
The True Revolution is in the New Challenges that We Can Try to Tackle Evolution of the stock market Can we test Empirically multiple scenarios? Value of investments in 10 years Pascucci-31
The True Revolution is in the New Challenges that We Can Try to Tackle Predicting 100 year aging effects on new elements Can we test Empirically multiple scenarios? Plutonium discovered in 1940 Pascucci-32
Would you go to space with a vehicle that has been developed only based on simulations? Ariane 5's first flight (Flight 501) on 4 June 1996 $370 million in 37 seconds Software bug Pascucci-33
How Would You Maintain an Arsenal of nuclear Bombs that are not tested? Comprehensive Test-Ban Treaty (CTBT) United Nations General Assembly on 10 Sept. 1996 Is the nuclear arsenal aging properly or is it becoming dangerous and ineffective? Pascucci-34
How Would You Maintain an Arsenal of nuclear Bombs that are not tested? Comprehensive Test-Ban Treaty (CTBT) United Nations General Assembly on 10 Sept. 1996 Is the nuclear arsenal aging properly or is it becoming dangerous and ineffective? Pascucci-35
Unprecedented Evolution of High Performance Computing Resources M-5: Los Alamos National Lab No 1. system in June 1993C Pascucci-36
Unprecedented Evolution of High Performance Computing Resources Num. Wind Tunnel: National Aerospace Laboratory of Japan No. 1 system in November 1993 and November 1994 Pascucci-37
Unprecedented Evolution of High Performance Computing Resources Intel XP/S 140 Paragon: Sandia National Labs No. 1 system in June 1994 Pascucci-38
Unprecedented Evolution of High Performance Computing Resources Hitachi SR2201: University of Tokyo No. 1 system in June 19 Pascucci-39
Unprecedented Evolution of High Performance Computing Resources CP-PACS: University of Tsukuba Pascucci-40
Unprecedented Evolution of High Performance Computing Resources ASCI Red: Sandia National Laboratory No. 1 system from June 1997 to June 2000 Pascucci-41
Unprecedented Evolution of High Performance Computing Resources ASCI White: LLNL No. 1 system from Nov. 2000 to Nov. 2001 Pascucci-42
Unprecedented Evolution of High Performance Computing Resources The Earth Simulator No. 1 from June 2002 to June 2004 Pascucci-43
Unprecedented Evolution of High Performance Computing Resources BlueGene/L: LLNL No. 1 from November 2004 to November 2007 Pascucci-44
Unprecedented Evolution of High Performance Computing Resources Roadrunner: Los Alamos National Laboratory No. 1 from June 2008 to June 2009 Pascucci-45
Unprecedented Evolution of High Performance Computing Resources Jaguar: Oak ridge National Laboratory No. 1 from November 2009 to June 2010 Pascucci-46
Unprecedented Evolution of High Performance Computing Resources Tianhe-1A: National SC in Tianjin No. 1 in November 2010 Pascucci-47
Unprecedented Evolution of High Performance Computing Resources K Computer: RIKEN Institute for No. 1 June 2011 to November 2011 Pascucci-48
Unprecedented Evolution of High Performance Computing Resources Sequoia: LLNL No. 1 in June 2012 Pascucci-49
Unprecedented Evolution of High Performance Computing Resources Titan: Oak Ridge National Laboratory No. 1 in November 2012 Pascucci-50
Unprecedented Evolution of High Performance Computing Resources Tianhe-2 (MilkyWay-2) No. 1 system since June 2013 Pascucci-51
Unprecedented Evolution of High Performance Computing Resources Tianhe-2 (MilkyWay-2) No. 1 system since June 2013 Pascucci-52
The Fundamental Paradigm Shift of escience The data is a major driver of the investigations Having the data does NOT guarantee to have the right questions and answers NOT having the data guarantees that you CANNOT develop the right questions and answers HPC, scientific computing, machine learning, data analysis, mining, exploration and visualization are critical activities to knowledge discovery. Pascucci-53
High Performance Data Movements for Real-Time Access to Large Scale Experimental Data Experiment run at Advance Photon Source at ANL Scientists located at PNNL Pascucci-54
High Performance Data Movements for Real-Time Access to Large Scale Simulation Data Data streams that allow merging multiple datasets in real time Time interpolation of and concurrent visualization of climate data ensembles defined on different time scales Server side and client side computation of statistical functions such as median, average, standard deviation,. Standard Deviation and Average of ten climate models Pascucci-55
Interactive Remote Analysis and Visualization of 6TB Imaging Data EM datasets of resolution: 130Kx130Kx340 6.4 TB of raw data Web Server Pascucci-56
SC12/13 Demonstration of data streaming analytics and visualization Live demonstration from Argonne National Laboratory to Supercomputing exhibit floor Infrastructure that scales gracefully with available hardware resources Cores available Pascucci-57
Massive Precomputations Can Avoid the Need for Real Time Processing Problem: Need accurate automated phone quotes in 100ms. They couldn t do these calculations nearly fast enough on the fly. Solution: Each weekend, use a new HPC cluster to precalculate quotes for every American adult and household (60 hour run time) Pascucci-58
The Fundamental Paradigm Shift of escience Development and curation of massive data collections from simulations and sensing will be crucial to any scientific progress Need to develop scientific knowledge and actionable information that cannot be tested empirically Uncertainty Quantification Verification and Validation Pascucci-59
Assessing the Uncertainty in Fuel Design For New Clean Burning Devices Exploration of high dimensional space of possible configurations. High Performance Combustion 10 dimensional landscape Poor Combustion to be avoided Pascucci-60
Topological Analysis of the Space of Composite Materials of a Given Class Features in experimental data show unexpected structures and are used to plan future experiments. Stakeholder: A. Karim, PNNL. Palladium Tin Molybdenum Tungsten Pascucci-61
How Can We Achieve True Verification and Validation? Verification: Are we building the software right? Pascucci-62
How Can We Achieve True Verification and Validation? Verification: Are we building the software right? Is my software bug free? Pascucci-63
How Can We Achieve True Verification and Validation? Verification: Are we building the software right? Is my software bug free? Chimera Pascucci-64
How Can We Achieve True Verification and Validation? Validation: Is the process providing accurate predictions? Pascucci-65
How Can We Achieve True Verification and Validation? Validation: Is the process providing accurate predictions? Pascucci-66
How Can We Achieve True Verification and Validation? Validation: Is the process providing accurate predictions? U.S. Securities and Exchange Commission (SEC) requires to tell: investors that a fund's: past performance does not necessarily predict future results Pascucci-67
Example: Validation: Is the process providing accurate predictions? Sidharth Kumar Aaditya Landge U.S. Securities and Exchange Commission (SEC) requires to tell: investors that a fund's: past performance does not necessarily predict future results Pascucci-68
Validation: Example: ATPESC summer school 2013 Is the process providing accurate predictions? Sidharth Kumar Aaditya Landge U.S. Securities and Exchange Commission (SEC) requires to tell: investors that a fund's: past performance does not necessarily predict future results Pascucci-69
Validation: Example: Is the process providing accurate predictions? Sidharth Kumar Aaditya Landge SC14 ATPESC summer school 2013 U.S. Securities and Exchange Efficient I/O and Storage of Adaptive Resolution Data Commission (SEC) requires to tell: investors that a fund's: SC14 In-Situ Feature Extraction of Large Scale Combustion Simulations Using Segmented Merge Trees past performance does not necessarily predict future results Pascucci-70
Validation: Example: Is the process providing accurate predictions? Sidharth Kumar Aaditya Landge SC14 ATPESC summer school 2013 U.S. Securities and Exchange Efficient I/O and Storage of Adaptive Resolution Data Commission (SEC) requires to tell: investors that a fund's: SC14 In-Situ Feature Extraction of Large Scale Combustion Simulations Using Segmented Merge Trees past performance does not necessarily predict future results Pascucci-71
Validation: Example: Is the process providing accurate predictions? Sidharth Kumar Aaditya Landge SC14 ATPESC summer school 2013 U.S. Securities and Exchange Efficient I/O and Storage of Adaptive Resolution Data Commission (SEC) requires to tell: investors that a fund's: SC14 In-Situ Feature Extraction of Large Scale Combustion Simulations Using Segmented Merge Trees past performance does not necessarily predict future results Pascucci-72
Rethinking Multi-Scale Representation of Massive Data Models Multi-resolution representations are insufficient to deal with big data: Data preprocessing is typically too long Wavelet-like averaging looses information Data analysis results often do not represent well important trends (e.g. multi-modal distributions) New data abstractions are needed for Big Data Multi-resolution Abstraction Pascucci-73
Rethinking Multi-Scale Representation of Massive Data Models Pascucci-74
Topological Analysis of Massive Combustion Simulations Non-premixed DNS combustion (J. Chen, SNL): Analysis of the time evolution of extinction and reignition regions for the design of better fuels time Pascucci-75
Topology Has Been Successful for Analysis and Visualization of Massive Scientific Data 76 Pascucci-76
Running Efficiently Big Data Computations is a Big Data Problem Massive logs Complex Memory Hierarchies Complex and Diverse Interconnects 3D, 4D, 5D tori Fat tree Complex I/O pathways Cost of Power Dominated by Data Movements Pascucci-77
Running Efficiently Big Data Computations is a Big Data Problem Growing Community involving Performance analysis and vis community Workshop at last IEEE VIS Dagstuhl Perspective Workshop Workshop at SC14 Conference Pascucci-78
Large Team Requiring Multi-disciplinary and Multi-institutional Collaboration Challenging collaborations among: Government laboratories Industry Academia Close collaborations with domain scientists requiring to cross language and cultural barriers: New education needed Communicate problems not tasks!!!!!! Pascucci-79
The Big Gift of Big Data A great opportunity to achieve new scientific discoveries and engineering innovations A great opportunity for the Computer Science community to become a central player in the development of modern science A great challenge for all communities to become strongly engaged in interdisciplinary collaboration A great opportunity for our community to become the data generation, processing and exploration telescope of modern science and engineering Pascucci-80
The Big Gift of Big Data A great opportunity to achieve new scientific discoveries and engineering innovations A great opportunity for the Computer Science community to build the infrastructure needed to drive modern science A great challenge for all communities to become strongly engaged in interdisciplinary collaboration A great opportunity for our community to become the data generation, processing and exploration telescope of modern science and engineering Pascucci-81
END Pascucci-82