NASA Earth Science Research in Data and Computational Science Technologies
Report of the ESTO/AIST Big Data Study Roadmap Team
September 2015

I. Background

Over the next decade, the dramatic growth of NASA's Earth Science data collections is projected to outpace the ability of scientists to analyze that data meaningfully. What have been termed the "Vs" of data (volume, variety, velocity, etc.) pose significant challenges for both Earth Science missions and researchers, and traditional methods for developing science data pipelines, distributing scientific datasets, and performing effective analysis will need to be replaced by new approaches. The Intergovernmental Panel on Climate Change's (IPCC) Assessment Report 6, for example, predicts the growth of data to tens of petabytes. Future remote sensing projects include an increasing set of data-intensive instruments that will pose severe challenges to existing systems. Likewise, for Earth Science data archives, these massive increases will shift the focus from distributing whole data sets to providing online services for computation and analysis. In addition, instruments flown on NASA Earth-observing satellites will continue to generate data and stress the boundaries of end-to-end data systems. These challenges taken together require new thinking in data capture, management, processing, and analysis, both onboard and on the ground.

"Big Data" and "Data Science" have been used as terms to describe this data deluge and the discipline devoted to addressing it. For the purposes of this document, Big Data is a term used to describe the state of collection and analysis of data that exceeds conventional methods or software systems. This state of affairs necessitates new approaches that will change the paradigm by which data is collected and analyzed.
Data Science focuses on the use of systematic architectural, software, methodological, and algorithmic approaches (e.g., data management, intelligent algorithms, statistics and visualization) for generation, capture, management, analysis, and discovery from massive data sets and streams (i.e., big data), including those from multiple sensors, models, archives, and other sources, to enable research and decision support. Data Science is a term that is being applied at a growing number of universities that are establishing programs in this area. Data Science is thus emerging as a critical area of research and technology to advance scientific discovery. The sheer increase in data volume over the next decade, coupled with the highly distributed and heterogeneous nature of scientific data sets, requires these new approaches. Numerous technologies are under development in multiple communities to address Big Data challenges. While NASA can leverage some of this capability (e.g., map-reduce to scale computation), significant technologies still need to be developed that scale to address the data-intensive challenges underlying modern observing systems and the science questions they were created to investigate. Existing techniques also do not address the end-to-end nature of NASA's science-driven observational environment. NASA needs to develop a multi-year plan to address these challenges in total, rather than with incremental, isolated improvements that are unlikely to scale. Such an approach promises significant scientific yield from NASA's missions, instruments, archives, and research community and will be necessary in order to remain on the critical path to accomplish science in the Big Data era.

II. Current State: Mission, Instrument and Large-scale Data Analysis Challenges

Currently, the analysis of large data collections from NASA or other agencies is executed through traditional computational and data analysis approaches, which require users to bring data to their desktops and perform local data analysis. Alternatively, data are hauled to large computational environments that provide centralized data analysis via traditional High Performance Computing (HPC). Scientific data archives, however, are not only growing massive, but are also becoming highly distributed. As a consequence, neither traditional approach provides a good solution for optimizing analysis into the future. Further, the historical assumption across the NASA mission and science data lifecycle that all data can be collected, transmitted, processed, and archived will not scale as more capable instruments stress legacy-based systems. A new paradigm is needed in order to increase the productivity and effectiveness of scientific data analysis.
This paradigm must recognize that architectural and analytical choices are interrelated, and must be carefully coordinated in any system that aims to allow efficient, interactive scientific exploration and discovery to exploit massive data collections, from point of collection (e.g., onboard) to analysis and decision support. Expanding on this point, the most effective approach to analyzing a distributed set of massive data may involve some exploration and iteration, putting a premium on the flexibility afforded by the architectural framework. The framework should enable scientist users to assemble workflows efficiently, manage the uncertainties related to data analysis and inference, and optimize deep-dive analytics to enhance scalability.

These challenges are not limited to NASA. Multiple agencies are confronted with the question of how to draw scientific inference from growing, distributed archives, as identified in the appendix. NASA has already made significant investments in capturing and sharing data from massive, online data systems. The NASA Earth Science Data and Information System (ESDIS), most prominently through the Distributed Active Archive Centers (DAACs), provides an excellent foundation for capturing and building high-quality repositories. Technology investments to date by NASA (e.g., ROSES ACCESS, ROSES AIST, etc.) and the DOE (e.g.,
Earth System Grid) have focused on developing software services that have improved access to distributed data. The challenge now is how to integrate these capabilities with careful architectural approaches and emerging technologies that will allow NASA to continue to scale the entire data lifecycle and support the data analysis needs of the Earth Science community. The existing infrastructures are prime candidates for grounding a more distributed, scalable computational environment to optimize scientific data analysis and move to the era of Big Data Analytics. An integrated architecture and ecosystem will address the modern data and computing challenges across the NASA mission and science data lifecycle: reproducibility, uncertainty, data fusion, data reduction, data movement, data visualization, cost, and performance.

III. Use Cases

Several use cases have been identified which present challenges to future NASA missions and science. These use cases identify the challenges and required capabilities that are needed in order to not just keep pace but increase the science yield from NASA Earth Science missions, instruments and data collections. The following use cases [1] are listed with their data science challenges.

Use Case: Climate Modeling
Description: Formulate hypotheses from observed empirical relationships; simulate current and past conditions under those hypotheses using climate models; test hypotheses by comparing simulations to observations; evaluate uncertainty of predictions originating from statistical sampling of models and observations.
Data Science Challenge: Highly distributed data sources; fusion of different observations; moving computation to the data; data reduction.
Enabling Mission/Capability: CMIP6 will move towards exascale archives requiring new approaches to evaluating models relative to observational data.

Use Case: Satellite Missions
Description: Missions such as NI-SAR and SWOT will generate massive observational data. However, they have different architectural patterns including compute intensive, data intensive, heterogeneous, etc.
Data Science Challenge: Massive data rates, data movement challenges, computational scalability, archiving and distribution; onboard processing for data reduction/analysis; high-volume data transfer for ground processing.
Enabling Mission/Capability: NI-SAR and SWOT require new approaches for computation, data movement, data archiving and distribution, and analytics.

Use Case: Applications - Hydrology (Central Valley of California)
Description: Understanding groundwater dynamics on a regional scale using satellite, airborne and in-situ measurements. Compare against predictive models.
Data Science Challenge: Distributed computation; highly distributed data sources; data fusion of multiple products; massive new satellite observations.
Enabling Mission/Capability: Integration of data from PALSAR-2, Sentinel, GRACE-FO, ASO, and SMAP. Scale to support NI-SAR and SWOT. Comparison against models. Requires new architectural approaches for distributed data analytics.

Use Case: Airborne Missions
Description: Airborne missions tend to be much more agile and on-demand. Integrating this into a data ecosystem provides new opportunities to quickly generate and understand various measurements.
Data Science Challenge: On-demand architectures; distributed data sources; on-the-fly data processing; onboard processing for data reduction/analysis; high-volume data transfer for ground processing.
Enabling Mission/Capability: Current missions such as CARVE and Airborne Snow Observatory; future missions such as proposed EVI-3 and ASO follow-on missions.

[1] NIST has an active working group developing Big Data use cases that includes NASA contributions in

Other science disciplines such as biology carry similar use cases and have identified similar technology needs to those of NASA. These are identified in the section on benchmarking.

IV. Conceptual Architecture: The Data Lifecycle for NASA Earth Science

The data lifecycle is a critical perspective for developing a comprehensive set of capabilities to increase scientific yield from missions. This encompassing approach must include developing capabilities from the point of collection all the way to analysis and extracted understanding. Figure 1 below shows this concept and exposes the need to improve data science capabilities at multiple points of the data lifecycle: from onboard computing, to data triage of massive data streams, to data analysis.
This perspective requires architectural considerations for determining and integrating methodologies and infrastructure for capturing and analyzing data across the full lifecycle. To frame the end-to-end concept, the model below provides guidance for NASA investments and capabilities that will be required to support effectiveness and scalability in the Big Data era. The data lifecycle view reveals that choices and results of operating on data at one point in the lifecycle inevitably shape, and possibly limit, the possibilities for working with the data at a later point in the lifecycle. Coordination must be addressed in both directions: ultimately, objectives for scientific data understanding and reproducibility must back-drive the design and development of capabilities early in the data lifecycle so as to enable desired results to the greatest extent possible. See Figure 2. There is also an outer loop to the data lifecycle: successful data understanding (and possibly, disappointments) drives the next round of science instrument and mission proposals.

Figure 1: Data Lifecycle for NASA Science Missions
Figure 2: End-to-End NASA Mission/Science Data Lifecycle

The data lifecycle views shown in Figures 1 and 2 identify a number of challenges across the mission lifecycle that affect the various scientific disciplines that are core to NASA. Earth Science, in particular, benefits from new approaches to improving data analysis given the complex, heterogeneous, high-volume, and distributed nature of the remote sensing data acquired from Earth-observing missions. As an example, use of these data for comparison against and verification of climate model simulations is critical to supporting research in global climate change. However, much of this data is managed in highly distributed repositories with different data representations, formats, access methods, and owners. Significant time and effort are required to access, move, and rectify data from different repositories to support specific analyses of interest to particular researchers. These requirements constrain both the type and scope of analyses that can be performed, and therefore ultimately the scientific hypotheses that can be tested.

A major goal for data science is to reduce the cost of large-scale, interactive data analysis and, at the same time, leverage the richness of massive data sets to reduce uncertainty in the answers to crucial scientific questions. This is not possible in the current environment due to lack of scalability. Data science seeks to enable not just more discipline science (e.g., more data incorporated in analysis, more parameters to compare, etc.), but better science (e.g., quantitative assessments of uncertainty, repeatability of results, etc.), to be performed in a similar or shorter amount of time despite growing sets of data.
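One way to picture the alternative to moving data out of distributed repositories is to have each repository compute a small summary locally and ship only that summary for merging. The sketch below is illustrative only and is not drawn from any NASA system: the `PartialStats` record and the `summarize`, `merge`, and `global_stats` helpers are hypothetical names, and the statistics shown (count, mean, sample variance) stand in for whatever reduction an analysis actually needs. It uses the standard pairwise update for combining means and variances.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class PartialStats:
    """Hypothetical summary a repository computes locally instead of shipping raw data."""
    count: int
    mean: float
    m2: float  # sum of squared deviations from the mean


def summarize(values: Iterable[float]) -> PartialStats:
    """Runs where the data lives; one pass, constant memory (Welford's update)."""
    count, mean, m2 = 0, 0.0, 0.0
    for x in values:
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)
    return PartialStats(count, mean, m2)


def merge(a: PartialStats, b: PartialStats) -> PartialStats:
    """Combine two summaries without revisiting the underlying data."""
    if a.count == 0:
        return b
    if b.count == 0:
        return a
    count = a.count + b.count
    delta = b.mean - a.mean
    mean = a.mean + delta * b.count / count
    m2 = a.m2 + b.m2 + delta * delta * a.count * b.count / count
    return PartialStats(count, mean, m2)


def global_stats(summaries: List[PartialStats]) -> Tuple[int, float, float]:
    """Return (count, mean, sample variance) over all repositories."""
    total = PartialStats(0, 0.0, 0.0)
    for s in summaries:
        total = merge(total, s)
    variance = total.m2 / (total.count - 1) if total.count > 1 else float("nan")
    return total.count, total.mean, variance


if __name__ == "__main__":
    # Three made-up "repositories", each holding part of one variable.
    repositories = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0]]
    print(global_stats([summarize(r) for r in repositories]))
```

Only three numbers per repository cross the network here, regardless of how many values each repository holds. Real analyses would also need to reconcile formats, grids, and units first, which this sketch does not attempt.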
VI. Capability Gaps and Drivers

A major gap has been the focus on localized analysis of data from individual instruments rather than a holistic consideration of all the observing systems and how they fit into a big data analytics view and capability. Data systems today are organized around the capture and archiving of data from specific missions or observing capabilities. However, the integration of data from multiple instruments (spaceborne, airborne, ground-based) is important for supporting scientific research, in particular in moving from isolated data analysis to knowledge discovery through the use of a big data analytics approach. This includes making data available from multiple sources and integrating those data using intelligent algorithms and methods. In addition, as data grows and more automated methods are in place for data discovery, there are opportunities to improve the efficiency and effectiveness of ongoing mission operations and move them towards a data-driven approach where data is reduced onboard or at ground stations, prior to archive, as well as during fully offline science analysis. Interpreting data across the lifecycle allows informed decisions to be made at arbitrary points in the lifecycle, so that mission plans can be updated, new relevant data products can be generated, etc. This same full view of the data lifecycle will also inform provenance practices, so that the relevant details all the way back to the point of collection can be captured to provide a basis for reproducing the end results of scientific understanding. This is a paradigm shift from how mission and science operations vs. analysis are performed today, largely in separate arenas [2].

Figure 3 below shows the current approach for organizing NASA Earth Science data systems. The approach focuses extensively on the stewardship of data, by collecting data into organized and community-accessible archives.
This approach has enabled NASA to build high-quality data archives, but as the need for a shift towards systematic approaches to data analysis increases, it is important to take a broader view of how activities across the entire data lifecycle can be organized and integrated. Today, onboard computing, mission operations, science data processing and archiving, and data analysis are performed as generally independent architectures, systems and components of the data lifecycle.

[2] This state of affairs is reflected in the disjoint NASA programs for science missions vs. research and analysis.
[Figure 3 diagram labels: On-Board Processing; TDRS Network; Acquisition and Command; Mission Operations; Instrument Operations; L0A Processing; EDOS/GDS Ground Systems; Science Processing (L0B, L1, L2, L3, L4); SDS Science Systems; Science Management; Archive and Distribution; EOSDIS DAAC Centers; Science Teams; Research; Outreach]

Figure 3: Today, Stewardship Model of NASA Data Systems

Taking a broader view, cyber-infrastructures, machine learning and other intelligent algorithms (analytics), statistics, and visualization should be brought together as integrated architectural solutions that can be scaled to meet NASA's data challenges in enabling missions and science. Figure 4 below shows a shift from data generation and stewardship of archives to enabling data analysis through an integrated cyber-infrastructure where data, algorithms, computation, and visualization are brought together across a highly distributed data environment to quickly stand up new analytic capabilities for different stakeholders, measurements, science questions, and applications. This requires new thinking as follows:

1. New architectural approaches across the entire science mission and data lifecycle in order to increase the science yield by scaling and improving the integration of each of the various components of the lifecycle.
2. Increasing capability onboard to support data reduction, autonomous mission (re-)planning, etc., as part of managing bandwidth capabilities.
3. Integrated analytics as part of the real-time mission pipeline.
4. Ability to quickly construct new analytic centers that can unify archives, computing capabilities, and software services to bring heterogeneous data sets as well as models/simulations together for analysis.
5. Systems that can react to increased velocity of data across the lifecycle with new technology approaches for data triage and capture.
6. Interoperability with other agencies including capture of data from their instruments and integration with their ground systems, archives, and analytic capabilities.
7. Quantitative uncertainty management at all stages of the data lifecycle in order to provide uncertainties required for scientific inference.

[Figure 4 diagram labels: On-Board Processing; On-Demand Algorithms; TDRS Network; Acquisition and Command; Mission Operations; Instrument Operations; L0A Processing; EDOS/GDS Ground Systems; Data Capture; Satellite Instrument Science Management Systems; Airborne Science Management; NASA Archives; Other Data Systems (In-Situ, Other Agency, e.g. NOAA); Data Science Infrastructure (Data, Algorithms, Machines); Analytics Centers; Data Analysis; Science Teams; Research; Applications; Decision Support]

Figure 4: Future, NASA Mission-Science Data Ecosystem

V. Proposed NASA Data and Computational Science Technology Areas

As discussed above, as the data-intensive nature of NASA science and exploration missions increases, there is a greater need to consider the data lifecycle from the point of collection all the way to extracted understanding of the data in order to support scalability and full utilization of the data. Furthermore, missions may have no choice but to require that data reduction and intelligent triage be performed across the data lifecycle to identify which data are to be captured and archived. In addition to considerations of the data and software, it is critical that common information models be developed and defined in order to ensure that consistent definitions of the data are applied (what are "definitions of the data"?). In particular, a rigorous, probabilistic definition of uncertainty should be adopted to cover all NASA missions and data so that uncertainty is defined in the same way for all data that are to be used in scientific studies. This practice ensures that the data is reliably managed, is supportive of discovery, and is fully utilized.

It is important to point out that data across this entire lifecycle view should not be considered data at rest, but rather data that is discoverable, accessible, and utilizable to update plans and inform other decisions, support local operations, and of course enable science. A well-architected data system from onboard data capture to ground-based operations through data analysis must be in place to support all of these objectives. Critical areas of the lifecycle (with considerations at each stage) include:

a) Data Generation: Perform original processing at the sensor/instrument
b) Data Triage: Make choices at the collection point about which data to keep
c) Data Compression: Maximize information throughput against available bandwidth
d) Data Transport: Improve resource efficiencies to enable moving the most data
e) Data Processing: Increase computation availability
f) Data Archiving: Scale the capture, management and distribution of data
g) Data Visualization: Develop and apply visualization techniques to enable data exploration and insights
h) Data Analytics: Create services to integrate the analysis of massive, distributed, heterogeneous data, and propagate uncertainties

In addition, there are relevant technologies and architectural approaches that are inherently cross-cutting in that they apply equally to several Earth Science areas of investigation. This perspective is particularly important for architecture, which shapes how the various technologies can be integrated to construct a general data-intensive approach for the NASA Earth Science enterprise.
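To make the idea of propagating a probabilistic uncertainty through an analytics step concrete, the sketch below fuses several estimates of the same quantity using inverse-variance weighting, a standard statistical technique. It is an illustration, not a method this report prescribes: the `fuse` function is a hypothetical name, and it assumes each input is an unbiased estimate with independent errors described by a one-sigma standard uncertainty. Correlated errors or biases between instruments would require a fuller treatment.

```python
import math
from typing import Sequence, Tuple


def fuse(estimates: Sequence[Tuple[float, float]]) -> Tuple[float, float]:
    """Fuse (value, standard uncertainty) pairs by inverse-variance weighting.

    Assumes independent, unbiased estimates of the same quantity.
    Returns the fused value and its propagated standard uncertainty.
    """
    if not estimates:
        raise ValueError("need at least one estimate")
    weights = []
    for _, sigma in estimates:
        if sigma <= 0:
            raise ValueError("standard uncertainty must be positive")
        weights.append(1.0 / sigma ** 2)
    total = sum(weights)
    value = sum(w * v for w, (v, _) in zip(weights, estimates)) / total
    # Variance of the weighted mean under the independence assumption.
    return value, math.sqrt(1.0 / total)


if __name__ == "__main__":
    # Two made-up measurements of one quantity; the second is less certain.
    print(fuse([(10.0, 1.0), (14.0, 2.0)]))
```

The fused uncertainty is never larger than the smallest input uncertainty, which is the quantitative sense in which combining data sets can reduce uncertainty. Carrying the uncertainty alongside the value at every step is what allows it to reach the final scientific inference.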
Table 1: Proposed AIST Technology Areas

Technology Name: Big Data Architecture for Earth Science Remote Sensing
Data Lifecycle Area(s): Cross-Cutting
Description: Definition of a scalable big data lifecycle architecture for Earth observing systems, identifying how Big Data can scale from onboard computing to data analysis to increase science yield.

Technology Name: Big Data Information Models and Semantics
Data Lifecycle Area(s): Cross-Cutting
Description: Advanced semantic technologies for defining, deriving, and integrating heterogeneous ontologies and information models as applied across the entire data lifecycle (onboard, ground-based operations, archives, analysis).

Technology Name: Onboard data science methods for data triage
Data Lifecycle Area(s): Data Triage
Description: Onboard data science methods for real-time event detection and planning.

Technology Name: Onboard data science methods for data reduction
Data Lifecycle Area(s): Data Compression
Description: Onboard data science methods for data reduction.

Technology Name: Massive Data Movement Technologies
Data Lifecycle Area(s): Data Transport
Description: Massive data movement technologies for ground-based networks from operations through analysis.

Technology Name: Real-time ground-based data science methods
Data Lifecycle Area(s): Data Processing
Description: Real-time ground-based data science methods for data reduction and real-time event detection for massive data streams as part of the data lifecycle architecture.

Technology Name: Open source data processing frameworks
Data Lifecycle Area(s): Data Processing
Description: Open source data processing and workflow frameworks that can massively scale to computational infrastructures (HPC, public cloud, etc.), handling large data streams and products, including near-real-time constraints, as part of the data lifecycle architecture.

Technology Name: Reusable data science methodologies for missions and science
Data Lifecycle Area(s): (1) Data Processing; (2) Data Analytics
Description: Development of reusable data science methodologies for analysis of data on the ground as part of the data lifecycle architecture. This includes on-demand data analytics for massive data repositories.

Technology Name: Federated data access
Data Lifecycle Area(s): Data Archives
Description: Federation of data access from distributed repositories as part of the data lifecycle architecture, moving towards on-demand distributed data analytics.

Technology Name: Massive Data Distribution
Data Lifecycle Area(s): Data Archives
Description: Massive data distribution for large-scale repositories and archives including methods for data reduction, computation, etc., as integrated, on-demand data analytics.

Technology Name: Intelligent search and mining
Data Lifecycle Area(s): (1) Data Archives; (2) Data Analytics
Description: Provide methods for intelligent search and mining of massive data. This may include integration of on-demand analytics to perform deep searches.

Technology Name: Visualization of massive data sets
Data Lifecycle Area(s): Visualization
Description: Visualization of massive data sets including data reduction methods that are driven by domain.

Technology Name: On-demand distributed data analytics
Data Lifecycle Area(s): Data Analytics
Description: On-demand data analytics that can integrate data from archives, repositories, etc., applying data science methods (data reduction, fusion, feature detection, etc.) provided through a computational infrastructure.

Technology Name: Distributed data analytics
Data Lifecycle Area(s): Data Analytics
Description: Analysis of data across distributed archives to support Earth system science.

Technology Name: Uncertainty Quantification; Measurement Science
Data Lifecycle Area(s): Data Analytics
Description: Quantification and management of uncertainty through all data processing steps, and subsequently through analytical algorithms, as part of a measurement science strategy for data fusion and data science.

Technology Name: Open source data management/science frameworks
Data Lifecycle Area(s): (1) Data Archives; (2) Data Analytics
Description: Open source data management/science frameworks that can massively scale to handle and manage large data streams and products, including near-real-time constraints, as part of the data lifecycle architecture, for archiving and analytics as part of a big data cyber-infrastructure.

Technology Name: Computational Infrastructures
Data Lifecycle Area(s): (1) Data Processing; (2) Data Analytics
Description: Computational infrastructures to scale data analytics using HPC and public cloud. This includes on-demand massive HPC and storage for integration to drive analytics.

Figure 5 shows the mapping of these technologies to the Big Data Analytics architecture concept. A key element, as mentioned, is the integration of the various technologies based on a data-driven architecture view.
[Figure 5 diagram: the Figure 4 ecosystem overlaid with technology groups labeled Cross-Cutting, Onboard, Ground System, Cyberinfrastructure, and Data Analytics and Viz]

Figure 5: Architecture-Technology Mapping

VIII. Benchmarking

Several U.S. agencies have initiated efforts in Big Data. These include technology investments, new initiatives and new organizations. These efforts are captured in the appendix and are summarized below.

Agency: National Institutes of Health (NIH)
Big Data Overview and Strategy: The NIH has initiated a new program in Data Science and appointed an Associate Director to lead the effort. It is establishing an NIH Commons (computation, software, standards, etc.) through its Big Data to Knowledge (BD2K) initiative. The NIH Commons will provide capabilities to various NIH institutes that support directed research efforts. Efforts focus on enabling data management and big data analytics capabilities.

Agency: National Science Foundation (NSF)
Big Data Overview and Strategy: The NSF has several initiatives coordinated through the Office of Cyberinfrastructure (OCI). OCI coordinates with various disciplines within the NSF including the EarthCube program that seeks to build a national Geosciences Cyberinfrastructure. Goals of the NSF include: 1) Derive knowledge from data; 2) Develop new cyber-infrastructures to manage, curate and serve data; 3) Develop new approaches for education and workforce development; and 4) Enable new types of inter-disciplinary collaboration, community building.

Agency: DARPA
Big Data Overview and Strategy: DARPA has several programs in Big Data including the XDATA and Memex Programs that are developing data science frameworks for big data analytics and mechanisms to explore deep searching of the Internet. DARPA is working to explore the use of open source technologies and their application to these programs.

Agency: NOAA
Big Data Overview and Strategy: NOAA is working to explore commercial opportunities to build cyber-infrastructures. This includes the use of cloud-based computing capabilities and support to scale data management and computation. NOAA is also participating in the Big Earth Data Initiative (BEDI) project.

Agency: Department of Energy (DOE)
Big Data Overview and Strategy: DOE has been exploring programs in extreme scale science, particularly as it relates to high performance computing (HPC). The goal is to address the combined challenges of Big Data and Big Compute to develop an exascale computing environment for simulation and data analysis at scale, cutting across various disciplines in energy, biology and climate.

Agency: USGS
Big Data Overview and Strategy: USGS is exploring its role in big data, focusing on data capture and integration, as well as sharing and leveraging HPC infrastructure across the agency. Programs such as EROS are exploring new architectural approaches to scale for the future.

Agency: National Institute of Standards and Technology (NIST)
Big Data Overview and Strategy: NIST has established a program in Data Science focusing on the development of architectures, use cases, standards and interoperability. It is also focused on areas including measurement foundations/principles to increase the accuracy of derived inferences from massive data.

Science in the broadest terms unifies the data science technology needs of NASA and other agencies. Beyond NASA science disciplines such as climate science, other disciplines such as biological science are declaring a direct need for new technologies to support science data analysis (often captured as "data-intensive science" or "data-driven science").
This should not be surprising, given the vast quantities of data being produced in such areas as genomics and proteomics. Data science methods that automate the extraction, classification, reduction and discovery from massive data sets will have direct value across several science disciplines and agencies.

International agencies, such as the European Space Agency (ESA), also have initiated several efforts in Big Data. In 2013 and 2014, ESA held a workshop on Big Data from Space [3] covering several of the technologies that are described in this report. Results of the recent conference highlight movement towards the big data lifecycle, enabling integrated analytics, integrating high-performance computing (HPC) with data infrastructures, new techniques for machine learning, and a shift towards new cyber-infrastructures. International groups such as the International Virtual Observatory Alliance (IVOA), International Planetary Data Alliance (IPDA), and the Global Organization for Earth System Science Portals (GO-ESSP) have developed efforts to share data and technology across international boundaries in order to support world-wide science data analysis.

[3] http://congrexprojects.com/2014-events/bigdatafromspace/introduction

IX. Ten Year Roadmap of Capabilities

What is this table? It's not explained.

Today:
- Highly curated data repositories
- Access and search methods for distributed repositories
- Core data standards
- Machine Learning Techniques (Is this about machine learning?)
- Data visualization methods

Near-Term (2-5 Years):
- Scalable data science framework
- Integrated data analytics
- Data provenance for reproducibility
- Unified architecture for data exploration and analysis
- On-demand data science methods integrated with data repositories
- Scalable computing infrastructure

Far Term (5-10 Years):
- Onboard data science and reduction methods
- Distributed data analytics and computation
- Integrated uncertainty analysis
- Virtual observational missions
- Virtual, Immersive Visualization Environments for Massive Data

Disruptive Approaches Leveraging Computational Science (5-10 Years)

Far-Term Disruptive Capability: Onboard data science and reduction methods
Approach: Increase onboard autonomy, planning and data processing as close to the instrument as possible as part of a broader big data strategy to increase science return.

Far-Term Disruptive Capability: Distributed data analytics and computation
Approach: Shift to a data analytics ecosystem to enable analysis of distributed data from multiple instruments (ground-based, airborne, satellite) as well as comparison against climate models.

Far-Term Disruptive Capability: Integrated uncertainty analysis
Approach: Develop an architectural approach that manages uncertainty levels as data and computational demands change.

Far-Term Disruptive Capability: Virtual observational missions
Approach: Shift towards a data-driven approach for missions where new instruments integrate into an overall virtual infrastructure; support planning of new science goals and observations from a combination of instruments and archival data.

Far-Term Disruptive Capability: Virtual, Immersive Visualization Environments for Massive Data
Approach: Enable immersive environments for scientific analysis as well as outreach using observational data from NASA missions.

I would like to see an additional row in this table devoted to the interaction between the architecture and the (statistical) methods that define analytical algorithms: new statistical methods optimized for a given architecture; what architecture is best given a set of target analyses; trade-off between uncertainty and cost, etc.
Technology Prioritization

Technology Name | Priority | Applicable Far-Term Capability
Big Data Architecture for Earth Science Remote Sensing | Highest | Cross Cutting
Big Data Information Models and Semantics | Medium | Distributed data analytics and computation
Onboard data science methods for data triage | Medium | Onboard data science and reduction
Onboard data science methods for data reduction | Medium | Onboard data science and reduction
Massive Data Movement Technologies | Medium | Distributed data analytics and computation
Real-time ground-based data science methods | Highest | Virtual observational missions
Open source data processing frameworks | Highest | Distributed data analytics and computation; Virtual observational missions; Virtual, immersive visualization environments for massive data
Reusable data science methodologies for missions and science | Highest | Virtual observational missions; Distributed data analytics and computation
Federated data access | Medium | Distributed data analytics and computation
Massive Data Distribution | Medium | Distributed data analytics and computation
Intelligent search and mining | Medium | Distributed data analytics and computation
Visualization of massive data sets | Medium | Virtual, immersive visualization environments for massive data
On-demand distributed data analytics | Highest | Distributed data analytics and computation
Distributed data analytics | Highest | Distributed data analytics and computation
Uncertainty Quantification; Measurement Science | Medium [Reviewer comment: I think this should be Highest] | Integrated uncertainty analysis
Open source data management/science frameworks | Medium | Distributed data analytics and computation; Virtual observational missions; Virtual, immersive visualization environments for massive data
Computational Infrastructures | Highest | Distributed data analytics and computation; Virtual observational missions; Virtual, immersive visualization environments for massive data

X. Overall Recommendations

NASA data technology needs to shift from ad hoc investments across the mission and science data lifecycle to an integrated architecture where technology investments fit into a broader capability and vision to enable Earth Science.

The NASA data ecosystem needs to shift from a stewardship model to a data-driven discovery model where both data management and data discovery are enabled through a systematic data architecture, computational infrastructure, and lifecycle model.

Data architectures should be modeled and assessed as a whole to plan for technology capabilities and improvements, to ensure scalability, and to address performance, cost, and uncertainty management goals.

Data discovery methods should be applied across the entire data lifecycle to support scalability and discovery at each point, from onboard computing, to data processing and archiving, to distributed data analytics.

Architectures should enable flexible tradeoffs of where and how to compute, including improved integration of HPC and data infrastructures.

New capabilities should ensure reproducibility of derived scientific results.
Computational and data science should play an important role in planning new missions, including identification of how data, algorithms, and computing can be applied and integrated to improve overall data discovery.

References

[1] 2014 NASA Office of the Chief Technologist Roadmap (under development).
[2] National Research Council, Frontiers in Massive Data Analysis, 2013.
[3] 2011 PCAST Report on Information Technology.
[4] American Geophysical Union, Trends in Earth and Space Science.

Appendix A. NASA Mission/Science Data Lifecycle

As the data-intensive nature of NASA science and exploration missions increases, there is an increasing need to consider the data lifecycle from the point of collection all the way to the application and use of the data. Considerations across the entire lifecycle need to be made in order to support scalability and use of the data. Furthermore, missions may require that data reduction and intelligent triage be done on the data itself across the lifecycle to identify which data should be captured and archived. In addition to considerations of the data and software, it is critical that common information models be developed and defined so that consistent definitions of the data are applied and the data itself can be accurately managed, discovered, and used.

It is important to point out that data across this entire lifecycle should not be considered data at rest, but rather data that should be discoverable, accessible, and usable to update plans, support local operations, and enable science. As a result, a well-architected data system, from onboard data capture through ground-based operations and data analysis, must be in place to enable scalability at multiple points across that lifecycle.
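The triage idea described above (deciding which data to capture and archive) can be illustrated with a minimal sketch. Everything in it is a hypothetical assumption made for illustration: the `Frame` type, the scoring rule (how far a frame's mean departs from a known baseline), and the threshold value. It is not drawn from any NASA flight or ground software.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List, Sequence


@dataclass
class Frame:
    """One hypothetical instrument readout (illustrative only)."""
    frame_id: int
    samples: Sequence[float]


def triage(frames: List[Frame], baseline: float, threshold: float) -> List[Frame]:
    """Keep only frames whose mean departs from the baseline by more than threshold.

    This is an assumed, deliberately simple interest score; a real system
    would use a science-driven detector instead.
    """
    kept = []
    for frame in frames:
        score = abs(mean(frame.samples) - baseline)
        if score > threshold:
            kept.append(frame)
    return kept


# Three made-up frames: only the middle one departs from the baseline.
frames = [
    Frame(0, [1.0, 1.1, 0.9]),
    Frame(1, [5.0, 5.2, 4.8]),
    Frame(2, [1.0, 1.0, 1.0]),
]
kept = triage(frames, baseline=1.0, threshold=0.5)
kept_ids = [f.frame_id for f in kept]
print(kept_ids)
```

The point of the sketch is only that triage reduces what must be transported and archived; where the threshold sits is a trade-off between bandwidth and the risk of discarding useful data.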
Critical areas of the lifecycle (with considerations at each stage) include:

a) Data Generation: Perform original processing at the sensor/instrument
b) Data Triage: Make choices at the collection point about which data to keep
c) Data Compression: Maximize information throughput against available bandwidth
d) Data Transport: Improve resource efficiencies to enable moving the most data
e) Data Processing: Increase computation availability
f) Data Archiving: Scale the capture, management, and distribution of data
g) Visualization: Develop and apply analysis techniques to enable data understanding from visualization of massive data
h) Data Analytics: Create analytics services to integrate massive, distributed, heterogeneous data

Appendix B. ESTO/AIST Proposed Big Data Technology Thrust Areas

Technology Name: Big Data Architecture for Earth Science Remote Sensing
Data Lifecycle Area(s): Cross-Cutting
Description: Definition of a scalable big data lifecycle architecture for Earth observing systems, identifying how Big Data can scale from onboard computing to data analysis to increase science yield.

Technology Name: Big Data Information Models and Semantics
Data Lifecycle Area(s): Cross-Cutting
Description: Advanced semantic technologies for defining, deriving, and integrating heterogeneous ontologies and information models as applied across the entire data lifecycle (onboard, ground-based operations, archives, analysis).

Technology Name: Onboard data science methods for data triage
Data Lifecycle Area(s): Data Triage
Description: Onboard data science methods for real-time event detection and planning.

Technology Name: Onboard data science methods for data reduction
Data Lifecycle Area(s): Data Compression
Description: Onboard data science methods for data reduction.

Technology Name: Massive Data Movement Technologies
Data Lifecycle Area(s): Data Transport
Description: Massive data movement technologies for ground-based networks from operations through analysis.

Technology Name: Real-time ground-based data science methods
Data Lifecycle Area(s): Data Processing
Description: Real-time ground-based data science methods for data reduction and real-time event detection for massive data streams as part of the data lifecycle architecture.

Technology Name: Open source data processing frameworks
Data Lifecycle Area(s): Data Processing
Description: Open source data processing and workflow frameworks that can massively scale to computational infrastructures (HPC, public cloud, etc.), handling large data streams and products, including near-real-time constraints, as part of the data lifecycle architecture.

Technology Name: Real-time ground-based data science methods
Data Lifecycle Area(s): Data Processing
Description: Real-time ground-based data science methods for data reduction and real-time event detection for massive data streams as part of the data lifecycle architecture.
[Reviewer comment: Duplicated?]

Technology Name: Reusable data science methodologies for missions and science
Data Lifecycle Area(s): (1) Data Processing; (2) Data Analytics
Description: Development of reusable data science methodologies for analysis of data on the ground as part of the data lifecycle architecture. This includes on-demand data analytics for massive data repositories.

Technology Name: Federated data access
Data Lifecycle Area(s): Data Archives
Description: Federation of data access from distributed repositories as part of the data lifecycle architecture, moving towards on-demand distributed data analytics.

Technology Name: Massive Data Distribution
Data Lifecycle Area(s): Data Archives
Description: Massive data distribution for large-scale repositories and archives, including methods for data reduction, computation, etc., as integrated, on-demand data analytics.

Technology Name: Intelligent search and mining
Data Lifecycle Area(s): (1) Data Archives; (2) Data Analytics
Description: Provide methods for intelligent search and mining of massive data. This may include integration of on-demand analytics to perform deep searches.

Technology Name: Visualization of massive data sets
Data Lifecycle Area(s): Visualization
Description: Visualization of massive data sets, including data reduction methods that are driven by domain.

Technology Name: On-demand distributed data analytics
Data Lifecycle Area(s): Data Analytics
Description: On-demand data analytics that can integrate data from archives, repositories, etc., applying data science methods (data reduction, fusion, feature detection, etc.) provided through a computational infrastructure.

Technology Name: Distributed data analytics
Data Lifecycle Area(s): Data Analytics
Description: Analysis of data across distributed archives to support Earth system science, including application of novel machine learning, statistical methods (e.g., data fusion), and other computational capabilities.

Technology Name: Uncertainty Quantification; Measurement Science
Data Lifecycle Area(s): Data Analytics
Description: Management of uncertainty in all phases of the data lifecycle and analysis as part of a measurement science strategy for data fusion and data science.

Technology Name: Open source data management/science frameworks
Data Lifecycle Area(s): (1) Data Archives; (2) Data Analytics
Description: Open source data management/science frameworks that can massively scale to handle and manage large data streams and products, including near-real-time constraints, as part of the data lifecycle architecture, for archiving and analytics as part of a big data cyberinfrastructure.

Technology Name: Computational Infrastructures
Data Lifecycle Area(s): (1) Data Processing; (2) Data Analytics
Description: Computational infrastructures to scale data analytics using HPC and public cloud. This includes on-demand massive HPC and storage for integration to drive analytics.

Appendix C. Mapping 2014 OCT TA-11 to Proposed ESTO/AIST Big Data Technology Thrust Areas

The 2014 NASA OCT TA-11 Roadmap team identified the following areas:

Flight Computing: Includes technologies to support greater computation and data management at the point of collection onboard. In some cases, data reduction at the point of collection through intelligent triage methods may be required. Flight computing technologies include ultra-reliable, radiation-hardened platforms, which, until recently, have been extremely costly and limited in performance.
Ground Computing: Includes exascale supercomputing and data storage, as well as quantum, cognitive, and other types of advanced computing, for Big Data analysis and high-fidelity physics-based simulations for Earth and space science, and aerospace research and engineering.

Science, Engineering, and Mission Data Lifecycle: As the data-intensive nature of NASA science and exploration missions increases, there is an increasing need to consider the data lifecycle from the point of collection all the way to the application and use of the data.

Intelligent Data Understanding: Intelligent data understanding (IDU) refers to the capability to automatically mine and analyze datasets that are large, noisy, and of varying modalities (discrete, continuous, text, graph, etc.), in order to extract or discover information that can be used for further analysis or in decision making, on the ground or onboard. It is closely coupled to the capability to detect and respond to interesting events and/or to generate alerts.

Semantic Technologies: Technologies that enable data understanding, analysis, and automated consulting and operations.

Each of these areas has cross-cutting challenges that are relevant to AIST, as follows.

Flight Computing

TA 11.1.1.3
Technology Name: High Performance Flight Software
Description: Enables onboard, high-performance autonomy and data processing.
Applicable Technology Area: Onboard data science methods for data reduction, real-time event detection, and planning.

Ground Computing

TA 11.1.2.1
Technology Name: Exascale Supercomputer
Description: Provides peak computational capability of 1 exaflop (10^18 floating point operations per second) for exascale performance of NASA computations, with excellent energy efficiency and reliability, to support NASA's exponentially growing high-end computational needs.
Applicable Technology Area: Computational infrastructures to scale data analytics.

TA 11.1.2.2
Technology Name: Automated Exascale Software Development Toolset
Description: Provides automated, exascale application performance monitoring, analysis, tuning, and scaling.
Applicable Technology Area: Computational infrastructures to scale data analytics.

TA 11.1.2.3
Technology Name: Exascale Supercomputing File System
Description: Provides online data storage capacity of 1 exabyte, enabling data storage for exascale modeling and simulation (M&S) and data analysis, with sufficient performance and reliability to maintain productivity for a broad array of NASA applications.
Applicable Technology Area: Computational infrastructures to scale data analytics.

TA 11.1.2.5
Technology Name: Public Cloud Supercomputer
Description: Provides additional resources for NASA supercomputer users, such as for mission-critical computing in an emergency.
Applicable Technology Area: Computational infrastructures to scale data analytics.

Mission, Science and Engineering Data Lifecycle
TA 1.1
Technology Name: Reference Information System Architecture Frameworks
Description: Provide reference architectures for the end-to-end science and engineering data lifecycle.
Applicable Technology Area: Definition of a scalable big data lifecycle architecture for Earth observing systems, identifying how Big Data can scale from onboard computing to data analysis to increase science yield.

TA 1.2
Technology Name: Distributed Information Architecture Frameworks
Description: Provide reference information architectures to define data across the end-to-end engineering and science data lifecycle.
Applicable Technology Area: Advanced semantic technologies for auto-generating ontologies and information models from existing large, distributed data.

TA 1.4
Technology Name: Onboard Data Capture and Triage Methodologies
Description: Apply novel machine learning capabilities onboard to support data reduction, model-based compression, and triage of massive data sets.
Applicable Technology Area: Onboard data science methods for data reduction, real-time event detection, and planning.

TA 1.5
Technology Name: Real-time Data Triage and Data Reduction Methodologies
Description: Apply novel machine learning capabilities in ground data processing systems to support data reduction and triage of massive data sets.
Applicable Technology Area: Real-time ground-based data science methods for data reduction and real-time event detection for massive data streams as part of the data lifecycle architecture.

TA 1.6
Technology Name: Scalable Data Processing Frameworks
Description: Provide scalable software processing frameworks for processing scientific and engineering data sets.
Applicable Technology Area: Open source data processing frameworks that can massively scale to handle large data streams and products, including near-real-time constraints, as part of the data lifecycle architecture.

TA 1.7
Technology Name: Massive Engineering and Science Data Analysis Methodologies
Description: Provide scalable methodologies for analysis of massive data.
Applicable Technology Area: Development of reusable data science methodologies for analysis of data on the ground as part of the data lifecycle architecture. This includes on-demand data analytics for massive data repositories.

TA 1.8
Technology Name: Remote Data Access Framework
Description: Provides access to and sharing of distributed data sources in a secure environment.
Applicable Technology Area: Federation of data access from distributed repositories as part of the data lifecycle architecture, moving towards on-demand distributed data analytics.

TA 1.9
Technology Name: Massive Data Movement Services
Description: Develop new technologies for the movement of massive, multi-petabyte data over the network.
Applicable Technology Area: Data movement (observations, climate model output, etc.) across the data lifecycle is critical to scaling for Big Data. Develop capabilities to improve movement of data between points of the data lifecycle, from networks to parallel methods for data transfer.

TA 1.10
Technology Name: Large-Scale Data Dissemination Environments
Description: Enable scaling of the data infrastructures (software, computation, networks, etc.) that are required to support large-scale data dissemination.
Applicable Technology Area: Massive data distribution for large-scale repositories and archives, including methods for data reduction, computation, etc., as integrated, on-demand data analytics.

TA 1.11
Technology Name: Toolset for Massive Model Data
Description: Makes data/information transparent, scalable, and usable when infusing multiple large and diverse datasets in complex models.
Applicable Technology Area: Scale tools that provide on-demand data analytics that bring together distributed models and observational data, and provide a computational infrastructure to enable analysis and intercomparison.

Technology Name: On Demand Data Analytics
Description: Provides analytics service coupled with computational infrastructures.
Applicable Technology Area: On-demand data analytics that can integrate data from archives, repositories, etc., applying data science methods (data reduction, fusion, feature detection, etc.) provided through a computational infrastructure.

Intelligent Data Understanding

TA 2.1
Technology Name: Intelligent Data Collection and Prioritization Toolset
Description: Provides a means to reduce the size of the data (e.g., removing clouds), remove corrupted data, and/or collect complementary data for value-added information.
Applicable Technology Area: Development of reusable data science methodologies for analysis of data on the ground as part of the data lifecycle architecture; in particular, intelligent algorithms for data reduction, classification, generation, etc.

TA 2.2
Technology Name: Event Detection and Intelligent Action Toolset
Description: Provides computational mechanisms to identify high-information-content data (prespecified, novel, or anomalous), including multi-spacecraft collaborative event detection, and to take an autonomous or assisted onboard decision as a result of data analysis.
Applicable Technology Area: Onboard and ground-based data science methods for data reduction, real-time event detection, and planning.

TA 2.3
Technology Name: Data on Demand Toolset
Description: Enables users and models to task sensors and to leverage sensor webs to develop "on-demand" products.
Applicable Technology Area: This is representative of multiple AIST areas, including architecture, on-demand data processing, etc.

TA 2.4
Technology Name: Intelligent Data Search and Mining Toolset
Description: Develops search services and engines for massive, distributed data holdings; enables the application of different searching rules/schemes that learn from past searches and develop agents to find and create the most relevant products. It includes rich queries, including fact-based and free-text searches and web-service-based indexing, as well as anomaly/novelty detection, where the system suggests items of interest to the user without the user necessarily prescribing the information being sought.
Applicable Technology Area: Provide methods for intelligent search and mining of massive data. This may include integration of on-demand analytics to perform deep searches.

TA 2.5
Technology Name: Data Fusion Toolset
Description: Combines data from multiple sources (e.g., remote sensing, in-situ, models) in order to make inferences that might otherwise not be possible with single data sources, or to improve the uncertainty characteristics of these inferences over what might be achieved with single data sources.
Applicable Technology Area: Ground-based data science methods for data fusion, including the uncertainty quantification required to make scientific inferences.
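The claim in the Data Fusion Toolset entry, that combining sources can improve the uncertainty characteristics of an inference, can be made concrete with a small sketch. It uses inverse-variance weighting, a standard textbook way to combine independent, unbiased estimates of the same quantity. This is an assumed example technique, not the fusion method the roadmap prescribes, and the function name `fuse` and the sample numbers are hypothetical.

```python
from typing import Iterable, Tuple


def fuse(estimates: Iterable[Tuple[float, float]]) -> Tuple[float, float]:
    """Fuse (value, variance) pairs by inverse-variance weighting.

    Assumes the estimates are independent and unbiased. Returns the fused
    value and its variance, which is never larger than the smallest input
    variance.
    """
    pairs = list(estimates)
    if not pairs:
        raise ValueError("need at least one estimate")
    if any(var <= 0 for _, var in pairs):
        raise ValueError("variances must be positive")
    weights = [1.0 / var for _, var in pairs]
    total = sum(weights)
    value = sum(w * x for w, (x, _) in zip(weights, pairs)) / total
    return value, 1.0 / total


# Made-up example: a noisier satellite retrieval and a more precise
# in-situ reading of the same temperature (kelvin, variance in K^2).
fused_value, fused_variance = fuse([(290.0, 4.0), (292.0, 1.0)])
print(fused_value, fused_variance)  # about 291.6 with variance 0.8
```

The fused variance (0.8) is below that of either input (4.0 and 1.0), which is the sense in which fusion can improve on any single source under these assumptions.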
Semantic Technologies

TA 3.1
Technology Name: Semantic Enabler for Data (Text, Binary and Databases)
Description: Completely ingests data of all types and produces a data model and a precision ontology; improves the quality of any existing metadata, including provenance and quality of the source document; disambiguates words in the context of the document, database, or file; and visualizes the results.
Applicable Technology Area: Advanced semantic technologies for auto-generating ontologies and information models from existing large, distributed data. This could be extremely useful for reverse engineering a large semantic model for a scientific discipline or sub-discipline.

TA 3.2
Technology Name: Ultra Large-Scale Visualization and Incremental Toolset
Description: Enables and automates analysis of ultra-large-scale datasets and visualizes the results in the context of the knowledge domain.
Applicable Technology Area: Visualization of massive data sets, including data reduction methods that are driven by domain.

TA 3.3
Technology Name: Semantic Bridge Framework
Description: Enables the alignment of two or more data sources based on their respective ontological descriptions to achieve semantic interoperability, and facilitates calculations being properly made using data from each dataset.
Applicable Technology Area: Advanced semantic technologies for ontology integration. This is critical as NASA moves to more system science that spans multiple missions, instruments, systems, environments, etc. Constructing on-demand data analytics capabilities will require it.

Collaborative Systems

TA 4.3
Technology Name: Distributed Collaborative Science Data Analysis Frameworks
Description: Enable data, computation, and services to be brought together to support distributed data analysis in collaborative environments for science.
Applicable Technology Area: On-demand data analytics that can integrate data from archives, repositories, etc., applying data science methods (data reduction, fusion, feature detection, etc.) provided through a computational infrastructure enabling scientific collaboration.

Mission Systems

TA 5.2
Technology Name: Adaptive Systems Framework
Description: Manages a set of interacting or interdependent entities, real or abstract, forming an integrated whole that together are able to respond to environmental changes or changes in the interacting parts.
Applicable Technology Area: Open source data processing frameworks that can massively scale to handle large data streams and products, including near-real-time constraints, as part of the data lifecycle architecture.

Cyberinfrastructure

TA 6.1
Technology Name: On-Demand, Multi-Mission Data Storage and Computation
Description: Provides scalable storage and computing available on demand across projects, including both internal and external (hybrid) clouds at center and NASA levels.
Applicable Technology Area: Computational infrastructures to scale data archives and analytics. Note: this could be an area for joint exploration between ESTO/AIST and NASA OCIO.

TA 6.2
Technology Name: Scalable Data Management Frameworks
Description: Extensible, scalable data management frameworks that can take advantage of massive storage and computing resources.
Applicable Technology Area: Open source data management/science frameworks that can massively scale to handle and manage large data streams and products, including near-real-time constraints, as part of the data lifecycle architecture for mission data management.

TA 6.3
Technology Name: Scalable Data Archives Systems
Description: Support scalable archives that can capture, manage, distribute, and preserve massive data sets, both engineering and science.
Applicable Technology Area: Open source data management/science frameworks that can massively scale to handle and manage large data streams and products, including near-real-time constraints, as part of the data lifecycle architecture, for archiving.

TA 6.4
Technology Name: High Performance Networking
Description: Provide terabit data networks to handle movement of massive datasets, particularly those for scientific research.
Applicable Technology Area: Data movement (observations, climate model output, etc.) across the data lifecycle is critical to scaling for Big Data. Provide high-speed networks as core infrastructure. Note: this could be an area for joint exploration between ESTO/AIST and NASA OCIO.

Human System Interaction

TA 7.6
Technology Name: Assistive tool for heterogeneous data integration
Description: Allows integration of heterogeneous data sources to enable querying and linking data.
Applicable Technology Area: Federation of data access from distributed repositories as part of the data lifecycle architecture, moving towards on-demand distributed data analytics. Advanced semantic technologies for ontology integration.

Appendix D. Classification of Agency Big Data Efforts

Based on NITRD and other presentations.
Agency: NIH
Strategy: Decentralized structure with NIH Data Commons to support reuse and sharing.
Data Mgmt: NIH Commons (computation, software, standards, etc.) through the NIH BD2K initiative.
Data Models: Reuse of shared ontologies and models; metadata standards.
Data Analytics: Programs/projects develop.
Computation: NIH Commons to provide more centralized compute.
Projects: NIH BD2K; NCI CBIIT.
Comments: Data science education a priority; large investments in BD2K; creation of a new Associate Director for Data Science position.

Agency: NSF
Strategy: 1) Derive knowledge from data; 2) New cyberinfrastructure to manage, curate, and serve data; 3) New approaches for education and workforce development; 4) New types of interdisciplinary collaboration and community building.
Data Mgmt: Rely on programs to develop.
Data Models: Rely on programs to develop.
Data Analytics: Rely on programs to develop.
Computation: Rely on programs to develop.
Projects: 1) Big Data Regional Hubs; 2) Big Data and NSFCloud; 3) Connecting NSF large efforts (NEON, OOI, LSST); 4) NSF Big Data Programs (Big Data, DIBBS); 5) EarthCube.
Comments: Data science education a priority goal for NSF; sustainability concerns for the Big Data ecosystem.

Agency: DARPA
Strategy: Moving from correlation-based to causal modeling; analytics that empower users; new methods of visualization.
Data Mgmt: Data management technology development is part of DARPA big data initiatives; DARPA doesn't curate data long-term.
Data Models: Rely on independent programs to evaluate and develop technology.
Data Analytics: Rely on independent programs to evaluate and develop technology.
Computation: Rely on independent programs to evaluate and develop technology.
Projects: Big Mechanism; Deep Extraction from Text; Insight; Memex; Network Defense; Probabilistic Programming for Advancing Machine Learning; XDATA.
Comments: DARPA largely researches and explores big data technologies (e.g., low to mid TRL) rather than building a long-term scalable infrastructure.
Agency: NOAA
Strategy: Looking to the commercial community to build cyberinfrastructure; also shared use of resources.
Data Mgmt: RFI to industry seeking collaboration to make NOAA data more accessible.
Data Models: Standards for NOAA catalogs and metadata.
Data Analytics: Looking for shared analytics coupled with HPC.
Computation: Looking for cross-agency, shared use of assets including HPC, data archives, modeling data, and analytics.
Projects: BEDI; Hurricane Forecasting, Weather Forecasting, Climate, etc.; investments in applied research projects in advanced computing, communications, and information technology.
Comments: New mission drivers (from GOES-R, JPSS, etc.); pushing a NOAA Big Data Partnership via RFI; creation of new leadership positions for data and geospatial information (e.g., CDO, GIO).

Agency: DOE
Strategy: Advanced Extreme Scale Science: address the combined challenges of Big Data and Big Compute to develop an exascale computing environment for simulation and data analysis at scale.
Data Mgmt: New and planned environments for Climate and Environment, Materials, and Office of Science.
Data Models: Coordination on data and metadata standards with programs and other agencies.
Data Analytics: Rely on programs to develop; leveraging DOE compute-intensive environments.
Computation: Research network advances through ESnet.
Projects: Materials Genome Initiative; USGCRP; BEDI.
Comments: Collaborations with other groups including NASA, NOAA, and NSF (via EarthCube).

Agency: USGS
Strategy: Focus on data capture and integration, leveraging HPC infrastructure.
Data Mgmt: Digital repository capabilities for long-tail data.
Data Models: Data models for integration.
Data Analytics: On-demand processing.
Computation: Exploit HPC on campuses.
Projects: BEDI; Climate resilience; ScienceBase; Applied Research Computing.
Comments: ESIP; shared role with LP-DAAC.

Agency: NIST
Strategy: Standards development; interoperability; data science measurement foundations/principles.
Data Mgmt: Focus on reference architectures and standards for interoperability.
Data Models: Data standards are part of NIST strategy.
Data Analytics: Foundations for measurement; best practices for integrating data analytics technology.
Computation: N/A
Projects: Materials Genome Initiative.
Comments: NIST's major focus is providing standards around big data and data science/analytics methodology and uncertainty quantification.

Appendix E. ESTO/AIST Proposed Big Data Technology Thrust Areas: Mappings to Other Government Agency Investments
Based on NITRD and other presentations.

Technology Name: Big Data Architecture for Earth Science Remote Sensing
Applicable Agency: NOAA
Technology: Some overlap with NOAA; however, not a major explicit focus.

Technology Name: Big Data Information Models and Semantics
Applicable Agency: Multiple agencies
Technology: Development of both discipline models and related semantic technologies.

Technology Name: Onboard data science methods for data triage
Applicable Agency: N/A
Technology: Unique to NASA.

Technology Name: Onboard data science methods for data reduction
Applicable Agency: N/A
Technology: Unique to NASA.

Technology Name: Massive Data Movement Technologies
Applicable Agency: DOE
Technology: ESnet initiative focuses on sharing large climate data; major data movement challenges for groups such as the CERN/LHC.

Technology Name: Real-time ground-based data science methods
Applicable Agency: DARPA
Technology: Data triage methods for streaming data, etc.

Technology Name: Open source data processing frameworks
Applicable Agency: NSF, NIH
Technology: NSF Cyberinfrastructure, NIH Commons investments.

Technology Name: Reusable data science methodologies for missions and science
Applicable Agency: NSF, NIH, DARPA
Technology: Some data science approaches (e.g., machine learning, statistical analysis) may be discipline specific, but there are opportunities for infusion of common methods (e.g., pattern recognition, detection, classification, etc.).

Technology Name: Federated data access
Applicable Agency: Most agencies
Technology: Climate/BEDI.

Technology Name: Massive Data Distribution
Applicable Agency: Multiple agencies
Technology: Exploration of how to distribute massive data from repositories and data sets.

Technology Name: Intelligent search and mining
Applicable Agency: DARPA
Technology: DARPA/Memex efforts.

Technology Name: Visualization of massive data sets
Applicable Agency: NOAA, DARPA, NSF, etc.
Technology: Exploration of methods and techniques for visualizing and communicating massive data.

Technology Name: On-demand distributed data analytics
Technology: Limited exploration of unifying data from highly distributed data repositories and systems for analytics.

Technology Name: Uncertainty Quantification; Measurement Science
Applicable Agency: NIST
Technology: Focus on methods and standards for measurement science in deriving inferences from massive data.

Technology Name: Open source data management/science frameworks
Applicable Agency: NSF, NIH
Technology: Cyber/data infrastructure development and sharing for constructing scalable data science systems.

Technology Name: Computational Infrastructures
Applicable Agency: Multiple agencies, particularly DOE and NOAA
Technology: For DOE, Big Data = Big Compute, supporting exascale computing. NOAA to take advantage of shared HPC at centers. NIH investigating cloud and other mechanisms for enabling science research. Major need to integrate HPC with the cyberinfrastructure.
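To give a rough sense of scale for the data movement challenges listed in this appendix, the following back-of-the-envelope sketch estimates how long a bulk transfer takes at a given link rate. The data volume, the two link rates, and the optional efficiency factor are illustrative assumptions only; they are not figures reported by any agency.

```python
def transfer_time_seconds(data_bytes: float,
                          link_bits_per_second: float,
                          efficiency: float = 1.0) -> float:
    """Idealized transfer time: bits to move divided by usable link rate.

    `efficiency` is an assumed fraction (0 < efficiency <= 1) of the nominal
    rate that is actually achieved; protocol and storage limits are ignored.
    """
    if data_bytes < 0 or link_bits_per_second <= 0:
        raise ValueError("volume must be non-negative and rate positive")
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    return data_bytes * 8 / (link_bits_per_second * efficiency)


PETABYTE = 10**15  # decimal petabyte, in bytes

# One petabyte over an assumed 10 gigabit/s link versus a 1 terabit/s link.
slow = transfer_time_seconds(PETABYTE, 10e9)   # 800,000 s, a bit over 9 days
fast = transfer_time_seconds(PETABYTE, 1e12)   # 8,000 s, a bit over 2 hours
print(f"{slow / 86400:.1f} days at 10 Gb/s, {fast / 3600:.1f} hours at 1 Tb/s")
```

Even under these idealized assumptions, multi-petabyte holdings take days to move on a 10 gigabit link, which is one way to see why the roadmap pairs data movement technologies with on-demand analytics located near the data.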