NASA Earth Science Research in Data and Computational Science Technologies Report of the ESTO/AIST Big Data Study Roadmap Team September 2015

I. Background

Over the next decade, the dramatic growth of NASA's Earth Science data collections is projected to outpace the ability of scientists to analyze that data meaningfully. The so-called Vs of data (volume, variety, velocity, etc.) pose significant challenges for both Earth Science missions and researchers, as traditional methods for developing science data pipelines, distributing scientific datasets, and performing effective analysis will require new approaches. The Intergovernmental Panel on Climate Change's (IPCC) Assessment Report 6, for example, predicts the growth of data to tens of petabytes. Future remote sensing projects include an increasing set of data-intensive instruments that will pose severe challenges to existing systems. Likewise, with Earth Science data archives, these massive increases will shift the focus from distributing whole data sets to providing online services for computation and analysis. In addition, instruments flown on NASA Earth-observing satellites will continue to generate data and stress the boundaries of end-to-end data systems. These challenges taken together require new thinking in data capture, management, processing, and analysis, both onboard and on the ground.

Big Data and Data Science have been used as terms to describe this data deluge and the discipline devoted to addressing it. For the purposes of this document, Big Data is a term used to describe the state of collection and analysis of data that exceeds conventional methods or software systems. This state of affairs necessitates new approaches that will change the paradigm by which data is collected and analyzed.
Data Science focuses on the use of systematic architectural, software, methodological, and algorithmic approaches (e.g., data management, intelligent algorithms, statistics, and visualization) for generation, capture, management, analysis, and discovery from massive data sets and streams (i.e., big data), including those from multiple sensors, models, archives, and other sources, to enable research and decision support. Data Science is a term that is being applied at a growing number of universities that are establishing programs in this area. Data Science is thus emerging as a critical area of research and technology to advance scientific discovery. The sheer increase in data volume over the next decade, coupled with the highly distributed and heterogeneous nature of scientific data sets, requires these new approaches.

Numerous technologies are under development in multiple communities to address Big Data challenges. While NASA can leverage some of this capability (e.g., map-reduce to scale computation), significant technologies still need to be developed that scale to address the data-intensive challenges underlying modern observing systems and the science questions they were created to investigate. Existing techniques also do not address the end-to-end nature of NASA's science-driven observational environment. NASA needs to develop a multi-year plan to address these challenges in total, rather than with incremental, isolated improvements that are unlikely to scale. Such an approach promises significant scientific yield from NASA's missions, instruments, archives, and research community and will be necessary in order to remain on the critical path to accomplish science in the Big Data era.

II. Current State: Mission, Instrument and Large-scale Data Analysis Challenges

Currently, the analysis of large data collections from NASA or other agencies is executed through traditional computational and data analysis approaches, which require users to bring data to their desktops and perform local data analysis. Alternatively, data are hauled to large computational environments that provide centralized data analysis via traditional High Performance Computing (HPC). Scientific data archives, however, are not only growing massive, but are also becoming highly distributed. As a consequence, neither traditional approach provides a good solution for optimizing analysis into the future. Further, the historical assumption across the NASA mission and science data lifecycle that all data can be collected, transmitted, processed, and archived will not scale as more capable instruments stress legacy-based systems. A new paradigm is needed in order to increase the productivity and effectiveness of scientific data analysis.
This paradigm must recognize that architectural and analytical choices are interrelated, and must be carefully coordinated in any system that aims to allow efficient, interactive scientific exploration and discovery to exploit massive data collections, from point of collection (e.g., onboard) to analysis and decision support. Expanding on this point, the most effective approach to analyzing a distributed set of massive data may involve some exploration and iteration, putting a premium on the flexibility afforded by the architectural framework. The framework should enable scientist users to assemble workflows efficiently, manage the uncertainties related to data analysis and inference, and optimize deep-dive analytics to enhance scalability.

These challenges are not limited to NASA. Multiple agencies are confronted with the question of how to draw scientific inference from growing, distributed archives, as identified in the appendix. NASA has already made significant investments in capturing and sharing data from massive, online data systems. The NASA Earth Science Data and Information System (ESDIS), most prominently the Distributed Active Archive Centers (DAACs), provides an excellent foundation for capturing and building high quality repositories. Technology investments to date by NASA (e.g., ROSES ACCESS, ROSES AIST, etc.) and the DOE (e.g.,

Earth System Grid) have focused on developing software services that have improved access to distributed data. The challenge now is how to integrate these capabilities with careful architectural approaches and emerging technologies that will allow NASA to continue to scale the entire data lifecycle and support the data analysis needs of the Earth Science community. The existing infrastructures are prime candidates for grounding a more distributed, scalable computational environment to optimize scientific data analysis and move to the era of Big Data Analytics. An integrated architecture and ecosystem will address the modern data and computing challenges across the NASA mission and science data lifecycle: reproducibility, uncertainty, data fusion, data reduction, data movement, data visualization, cost, and performance.

III. Use Cases

Several use cases have been identified that present challenges to future NASA missions and science. These use cases identify the challenges and required capabilities needed not just to keep pace but to increase the science yield from NASA Earth Science missions, instruments, and data collections. The use cases [1] and their data science challenges are listed below.

Use Case: Climate Modeling. Formulate hypotheses from observed empirical relationships; simulate current and past conditions under those hypotheses using climate models; test hypotheses by comparing simulations to observations; evaluate uncertainty of predictions originating from statistical sampling of models and observations.
Data Science Challenge: Highly distributed data sources; fusion of different observations; moving computation to the data; data reduction.
Enabling Mission/Capability: CMIP6 will move towards exascale archives requiring new approaches to evaluating models relative to observational data.

Use Case: Satellite Missions. Missions such as NI-SAR and SWOT will generate massive observational data. However, they have different architectural patterns including compute intensive, data intensive, heterogeneous, etc.
Data Science Challenge: Massive data rates, data movement challenges, computational scalability, archiving and distribution; onboard processing for data reduction/analysis; high-volume data transfer for ground processing.
Enabling Mission/Capability: NI-SAR and SWOT require new approaches for computation, data movement, data archiving and distribution, analytics.

Use Case: Applications - Hydrology (Central Valley of California). Understanding groundwater dynamics on a regional scale using satellite, airborne and in-situ measurements. Compare against predictive models.
Data Science Challenge: Distributed computation; highly distributed data sources; data fusion of multiple products; massive new satellite observations.
Enabling Mission/Capability: Integration of data from PALSAR-2, Sentinel, GRACE-FO, ASO, and SMAP. Scale to support NI-SAR and SWOT. Comparison against models. Requires new architectural approaches for distributed data analytics.

Use Case: Airborne Missions. Airborne missions tend to be much more agile and on-demand. Integrating this into a data ecosystem provides new opportunities to quickly generate and understand various measurements.
Data Science Challenge: On-demand architectures; distributed data sources; on-the-fly data processing; onboard processing for data reduction/analysis; high-volume data transfer for ground processing.
Enabling Mission/Capability: Current missions such as CARVE and Airborne Snow Observatory; future missions such as the proposed EVI-3 and ASO follow-on missions.

[1] NIST has an active working group developing Big Data use cases that includes NASA contributions in

Other science disciplines such as biology carry similar use cases and have identified similar technology needs to those of NASA. These are identified in the section on benchmarking.

IV. Conceptual Architecture: The Data Lifecycle for NASA Earth Science

The data lifecycle is a critical perspective for developing a comprehensive set of capabilities to increase scientific yield from missions. This encompassing approach must include developing capabilities from the point of collection all the way to analysis and extracted understanding. Figure 1 below shows this concept and exposes the need to improve data science capabilities at multiple points of the data lifecycle: from onboard computing, to data triage of massive data streams, to data analysis.
This perspective requires architectural considerations for determining and integrating methodologies and infrastructure for capturing and analyzing data across the full lifecycle. To frame the end-to-end concept, the model below provides guidance for NASA investments and capabilities that will be required to support effectiveness and scalability in the Big Data era. The data lifecycle view reveals that choices and results of operating on data at one point in the lifecycle inevitably shape and possibly limit the possibilities for working with the data at a later point in the lifecycle. Coordination must be addressed in both directions: ultimately, objectives for scientific data understanding and reproducibility must back-drive the design and development of capabilities early in the data lifecycle so as to enable desired results, to the greatest extent possible. See Figure 2. There is also an outer loop to the data lifecycle: successful data understanding (and possibly, disappointments) drives the next round of science instrument and mission proposals.

Figure 1: Data Lifecycle for NASA Science Missions

Figure 2: End-to-End NASA Mission/Science Data Lifecycle

The data lifecycle views shown in Figures 1 and 2 identify a number of challenges across the mission lifecycle that affect the various scientific disciplines that are core to NASA. Earth Science, in particular, benefits from new approaches to improving data analysis given the complex, heterogeneous, high-volume, and distributed nature of the remote sensing data acquired from Earth-observing missions. As an example, use of these data for comparison against and verification of climate model simulations is critical to supporting research in global climate change. However, much of this data is managed in highly distributed repositories with different data representations, formats, access methods, and owners. Significant time and effort are required to access, move, and rectify data from different repositories to support specific analyses of interest to particular researchers. These requirements constrain both the type and scope of analyses that can be performed, and therefore ultimately the scientific hypotheses that can be tested.

A major goal for data science is to reduce the cost of large-scale, interactive data analysis and, at the same time, leverage the richness of massive data sets to reduce uncertainty in the answers to crucial scientific questions. This is not possible in the current environment due to lack of scalability. Data science seeks to enable not just more discipline science (e.g., more data incorporated in analysis, more parameters to compare, etc.), but better science (e.g., quantitative assessments of uncertainty, repeatability of results, etc.), to be performed in a similar or shorter amount of time despite growing sets of data.
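
As a minimal sketch of why moving computation to distributed repositories can reduce the data-movement cost described above, the Python example below has each repository compute a small summary locally, and only those summaries are combined centrally. The class and function names (`PartialStats`, `summarize_locally`, `combine`) and the sample values are hypothetical illustrations, not part of any NASA system or of this report.

```python
from dataclasses import dataclass
from math import sqrt
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class PartialStats:
    """Sufficient statistics that a repository can compute next to its data."""
    count: int
    total: float
    total_sq: float


def summarize_locally(values: Iterable[float]) -> PartialStats:
    """Runs at the repository; only three numbers need to leave the site."""
    count, total, total_sq = 0, 0.0, 0.0
    for v in values:
        count += 1
        total += v
        total_sq += v * v
    return PartialStats(count, total, total_sq)


def combine(parts: List[PartialStats]) -> Tuple[float, float]:
    """Merge per-repository summaries into a global mean and population std."""
    n = sum(p.count for p in parts)
    if n == 0:
        raise ValueError("no observations to combine")
    mean = sum(p.total for p in parts) / n
    # Clamp at zero to guard against tiny negative values from rounding.
    variance = max(sum(p.total_sq for p in parts) / n - mean * mean, 0.0)
    return mean, sqrt(variance)


# Hypothetical holdings at three separate archives (made-up numbers).
archive_a = [1.0, 2.0, 3.0]
archive_b = [4.0, 5.0]
archive_c = [6.0]

summaries = [summarize_locally(a) for a in (archive_a, archive_b, archive_c)]
mean, std = combine(summaries)
```

The design choice illustrated here is that the amount of data crossing the network is independent of archive size. Real analyses would need numerically more robust accumulators and agreement on units and gridding across repositories, which this sketch does not address.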

VI. Capability Gaps and Drivers

A major gap has been the focus on localized analysis of data from individual instruments rather than a holistic consideration of all the observing systems and how they fit into a big data analytics view and capability. Data systems today are organized around the capture and archiving of data from specific missions or observing capabilities. However, the integration of data from multiple instruments (spaceborne, airborne, ground-based) is important for supporting scientific research, in particular in moving from isolated data analysis to knowledge discovery through the use of a big data analytics approach. This includes making data available from multiple sources and integrating those data using intelligent algorithms and methods. In addition, as data grows and more automated methods are in place for data discovery, there are opportunities to improve the efficiency and effectiveness of ongoing mission operations and move them towards a data-driven approach where data is reduced onboard or at ground stations, prior to archive, as well as during fully offline science analysis. Interpreting data across the lifecycle with these new approaches allows for informed decisions at arbitrary points in the lifecycle, so that mission plans can be updated, new relevant data products can be generated, etc. This same full view of the data lifecycle will also inform provenance practices, so that the relevant details all the way back to the point of collection can be captured to provide a basis for reproducing the end results of scientific understanding. This is a paradigm shift from how mission and science operations vs. analysis are performed today, largely in separate arenas [2].

Figure 3 below shows the current approach for organizing NASA Earth Science data systems. The approach focuses extensively on the stewardship of data, by collecting data into organized and community-accessible archives. This approach has enabled NASA to build high quality data archives, but as the need for a shift towards systematic approaches to data analysis increases, it is important to take a broader view of how activities across the entire data lifecycle can be organized and integrated. Today, onboard computing, mission operations, science data processing and archiving, and data analysis are performed as generally independent architectures, systems and components of the data lifecycle.

[2] This state of affairs is reflected in the disjoint NASA programs for science missions vs. research and analysis.

Figure 3: Today, Stewardship Model of NASA Data Systems
(Diagram labels: On-Board Processing; TDRS Network; Acquisition and Command; Mission Operations; Instrument Operations; L0A Processing; EDOS/GDS Ground Systems; Science Processing L0B, L1, L2, L3, L4; SDS Science Systems; Science Management; Archive and Distribution; EOSDIS DAAC Centers; Science Teams; Research; Outreach.)

Taking a broader view, cyber-infrastructures, machine learning and other intelligent algorithms (analytics), statistics, and visualization should be brought together as integrated architectural solutions that can be scaled to meet NASA's data challenges in enabling missions and science. Figure 4 below shows a shift from data generation and stewardship of archives to enabling data analysis through an integrated cyberinfrastructure where data, algorithms, computation, and visualization are brought together across a highly distributed data environment to quickly stand up new analytic capabilities for different stakeholders, measurements, science questions, and applications. This requires new thinking as follows:

1. New architectural approaches across the entire science mission and data lifecycle in order to increase the science yield by scaling and improving the integration of each of the various components of the lifecycle.
2. Increasing capability onboard to support data reduction, autonomous mission (re-)planning, etc., as part of managing bandwidth capabilities.
3. Integrated analytics as part of the real-time mission pipeline.
4. Ability to quickly construct new analytic centers that can unify archives, computing capabilities, and software services to bring heterogeneous data sets as well as models/simulations together for analysis.

5. Systems that can react to increased velocity of data across the lifecycle with new technology approaches for data triage and capture.
6. Interoperability with other agencies including capture of data from their instruments and integration with their ground systems, archives, and analytic capabilities.
7. Quantitative uncertainty management at all stages of the data lifecycle in order to provide uncertainties required for scientific inference.

Figure 4: Future, NASA Mission-Science Data Ecosystem
(Diagram labels: On-Board Processing; On-Demand Algorithms; TDRS Network; Acquisition and Command; Mission Operations; Instrument Operations; L0A Processing; EDOS/GDS Ground Systems; Satellite Instrument; Airborne; Science Systems; NASA Archives; Other Data Systems (In-Situ, Other Agency, e.g., NOAA); Data Science Infrastructure (Data, Algorithms, Machines); Data Capture; Data Analysis; Analytics Centers; Science Teams; Research; Applications; Decision Support.)

V. Proposed NASA Data and Computational Science Technology Areas

As discussed above, as the data-intensive nature of NASA science and exploration missions increases, there is a greater need to consider the data lifecycle from the point of collection all the way to extracted understanding of the data in order to support scalability and full utilization of the data. Furthermore, missions may have no choice but to require that data reduction and intelligent triage be performed across the data lifecycle to identify which data are to be captured and archived. In addition to considerations of the data and software, it is critical that common information models be developed and defined in order to ensure that consistent definitions of the data are applied (what are the definitions of the data?). In particular, a rigorous, probabilistic definition of uncertainty should be adopted to cover all NASA missions and data so that uncertainty is defined in the same way for all data that are to be used in scientific studies. This practice ensures that the data is reliably managed, is supportive of discovery, and is fully utilized.

It is important to point out that data across this entire lifecycle view should not be considered data at rest, but rather data that is discoverable, accessible, and utilizable to update plans and inform other decisions, support local operations, and of course enable science. A well-architected data system from onboard data capture to ground-based operations through data analysis must be in place to support all of these objectives. Critical areas of the lifecycle (with considerations at each stage) include:

a) Data Generation: Perform original processing at the sensor/instrument
b) Data Triage: Make choices at the collection point about which data to keep
c) Data Compression: Maximize information throughput against available bandwidth
d) Data Transport: Improve resource efficiencies to enable moving the most data
e) Data Processing: Increase computation availability
f) Data Archiving: Scale the capture, management and distribution of data
g) Data Visualization: Develop and apply visualization techniques to enable data exploration and insights
h) Data Analytics: Create services to integrate the analysis of massive, distributed, heterogeneous data, and propagate uncertainties

In addition, there are relevant technologies and architectural approaches that are inherently cross-cutting in that they apply equally to several Earth Science areas of investigation. This perspective is particularly important for architecture, which can enable how various technologies can be integrated to construct a general data-intensive approach for the NASA Earth Science enterprise.
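
To make the idea of a shared probabilistic definition of uncertainty that is propagated through processing and analytics more concrete, the following Python sketch carries a 1-sigma Gaussian uncertainty through a linear calibration step and an inverse-variance fusion of two independent measurements. It is only an illustration under stated assumptions: the `Measurement` type, the `calibrate` and `fuse` helpers, the Gaussian and independence assumptions, and all numbers are hypothetical and are not taken from this report.

```python
from dataclasses import dataclass
from math import sqrt
from typing import Sequence


@dataclass(frozen=True)
class Measurement:
    """A value with a 1-sigma Gaussian uncertainty (assumed convention)."""
    value: float
    sigma: float  # must be positive


def calibrate(m: Measurement, gain: float, offset: float) -> Measurement:
    """Linear processing step y = gain * x + offset; sigma scales by |gain|."""
    return Measurement(gain * m.value + offset, abs(gain) * m.sigma)


def fuse(measurements: Sequence[Measurement]) -> Measurement:
    """Inverse-variance weighted mean of independent measurements."""
    if not measurements:
        raise ValueError("nothing to fuse")
    weights = [1.0 / (m.sigma ** 2) for m in measurements]
    total_weight = sum(weights)
    value = sum(w * m.value for w, m in zip(weights, measurements)) / total_weight
    # The variance of the weighted mean is the reciprocal of the summed weights.
    return Measurement(value, sqrt(1.0 / total_weight))


# Made-up numbers for two instruments observing the same quantity.
raw_a = Measurement(10.0, 1.0)
raw_b = Measurement(5.5, 1.0)
cal_b = calibrate(raw_b, gain=2.0, offset=0.0)  # value 11.0, sigma 2.0
fused = fuse([raw_a, cal_b])
```

Because every stage returns the same `Measurement` type, the uncertainty is defined the same way at each point in the pipeline, which is the property the text asks for. Correlated errors or non-Gaussian distributions would require a richer representation than this sketch provides.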
Table 1: Proposed AIST Technology Areas

Technology Name | Data Lifecycle Area(s) | Description
Big Data Architecture for Earth Science Remote Sensing | Cross-Cutting | Definition of a scalable big data lifecycle architecture for Earth observing systems identifying how Big Data can scale from onboard computing to data analysis to increase science yield.
Big Data Information Models and Semantics | Cross-Cutting | Advanced semantic technologies for defining, deriving, and integrating heterogeneous ontologies and information models as applied across the entire data lifecycle (onboard, ground-based operations, archives, analysis).
Onboard data science methods for data triage | Data Triage | Onboard data science methods for real-time event detection and planning.
Onboard data science methods for data reduction | Data Compression | Onboard data science methods for data reduction.
Massive Data Movement Technologies | Data Transport | Massive data movement technologies for ground-based networks from operations through analysis.
Real-time ground-based data science methods | Data Processing | Real-time ground-based data science methods for data reduction and real-time event detection for massive data streams as part of the data lifecycle architecture.
Open source data processing frameworks | Data Processing | Open source data processing and workflow frameworks that can massively scale to computational infrastructures (HPC, public cloud, etc.), handling large data streams and products, including near-real-time constraints, as part of the data lifecycle architecture.
Reusable data science methodologies for missions and science | (1) Data Processing; (2) Data Analytics | Development of reusable data science methodologies for analysis of data on the ground as part of the data lifecycle architecture. This includes on-demand data analytics for massive data repositories.
Federated data access | Data Archives | Federation of data access from distributed repositories as part of the data lifecycle architecture, moving towards on-demand distributed data analytics.
Massive Data Distribution | Data Archives | Massive data distribution for large-scale repositories and archives including methods for data reduction, computation, etc., as integrated, on-demand data analytics.
Intelligent search and mining | (1) Data Archives; (2) Data Analytics | Provide methods for intelligent search and mining of massive data. This may include integration of on-demand analytics to perform deep searches.
Visualization of massive data sets | Visualization | Visualization of massive data sets including data reduction methods that are driven by domain.
On-demand distributed data analytics | Data Analytics | On-demand data analytics that can integrate data from archives, repositories, etc., applying data science methods (data reduction, fusion, feature detection, etc.) provided through a computational infrastructure.
Distributed data analytics | Data Analytics | Analysis of data across distributed archives to support Earth system science.
Uncertainty Quantification; Measurement Science | Data Analytics | Quantification and management of uncertainty through all data processing steps, and subsequently through analytical algorithms, as part of a measurement science strategy for data fusion and data science.
Open source data management/science frameworks | (1) Data Archives; (2) Data Analytics | Open source data management/science frameworks that can massively scale to handle and manage large data streams and products, including near-real-time constraints, as part of the data lifecycle architecture, for archiving and analytics as part of a big data cyber-infrastructure.
Computational Infrastructures | (1) Data Processing; (2) Data Analytics | Computational infrastructures to scale data analytics using HPC and public cloud. This includes on-demand massive HPC and storage for integration to drive analytics.

Figure 5 shows the mapping of these technologies to the Big Data Analytics architecture concept. A key element, as mentioned, is the integration of the various technologies based on a data-driven architecture view.

Figure 5: Architecture-Technology Mapping
(Diagram: the Figure 4 ecosystem overlaid with technology labels: Cross-Cutting; Onboard Data Triage; Onboard Data Reduction; Real-Time Data Triage; Reusable Data Science Methods; On-demand workflows, computation; Integrated Cyberinfrastructure; Data System Architectures; Information Architectures; Open Source Data Management/Processing Frameworks; Data Movement; Federated Data Access; Scalable Computation and Storage; Distributed Data Analytics; On-demand computation; Intelligent search and mining; Uncertainty Quantification; Visualization of massive data sets.)

VIII. Benchmarking

Several U.S. agencies have initiated efforts in Big Data. These include technology investments, new initiatives and new organizations. These efforts are captured in the appendix and are summarized below.

Agency | Big Data Overview and Strategy
National Institutes of Health (NIH) | The NIH has initiated a new program in Data Science and appointed an Associate Director to lead the effort. They are establishing an NIH Commons (computation, software, standards, etc.) through its Big Data to Knowledge (BD2K) initiative. The NIH Commons will provide capabilities to various NIH institutes that support directed research efforts. Efforts focus on enabling data management and big data analytics capabilities.
National Science Foundation (NSF) | The NSF has several initiatives coordinated through the Office of Cyberinfrastructure (OCI). OCI coordinates with various disciplines within the NSF, including the EarthCube program that seeks to build a national Geosciences Cyberinfrastructure. Goals of the NSF include: 1) Derive knowledge from data; 2) Develop new cyber-infrastructures to manage, curate and serve data; 3) Develop new approaches for education and workforce development; and 4) Enable new types of inter-disciplinary collaboration and community building.
DARPA | DARPA has several programs in Big Data, including the XDATA and Memex programs that are developing data science frameworks for big data analytics and mechanisms to explore deep searching of the Internet. DARPA is working to explore the use of open source technologies and their application to these programs.
NOAA | NOAA is working to explore commercial opportunities to build cyber-infrastructures. This includes the use of cloud-based computing capabilities and support to scale data management and computation. NOAA is also participating in the Big Earth Data Initiative (BEDI) project.
Department of Energy (DOE) | DOE has been exploring programs in extreme scale science, particularly as it relates to high performance computing (HPC). The goal is to address the combined challenges of Big Data and Big Compute to develop an exascale computing environment for simulation and data analysis at scale, cutting across various disciplines in energy, biology and climate.
USGS | USGS is exploring its role in big data, focusing on data capture and integration, as well as sharing and leveraging HPC infrastructure across the agency. Programs such as EROS are exploring new architectural approaches to scale for the future.
National Institute of Standards and Technology (NIST) | NIST has established a program in Data Science focusing on the development of architectures, use cases, standards and interoperability. They are also focused on areas including measurement foundations/principles to increase the accuracy of derived inferences from massive data.

Science in the broadest terms unifies the data science technology needs of NASA and other agencies. Beyond NASA science disciplines such as climate science, other disciplines such as biological science are declaring a direct need for new technologies to support science data analysis (often captured as "data-intensive science" or "data-driven science").
This should not be surprising, given the vast quantities of data being produced in such areas as genomics and proteomics. Data science methods that automate the extraction, classification, reduction and discovery from massive data sets will have direct value across several science disciplines and agencies.

International agencies, such as the European Space Agency (ESA), also have initiated several efforts in Big Data. In 2013 and 2014, ESA held workshops on Big Data for Space [3] covering several of the technologies that are described in this report. Results of the recent conference highlight movement towards the big data lifecycle, enabling integrated analytics, integrating high-performance computing (HPC) with data infrastructures, new techniques for machine learning, and a shift towards new cyber-infrastructures. International groups such as the International Virtual Observatory Alliance (IVOA), International Planetary Data Alliance (IPDA), and the Global Organization for Earth System Science Portals (GO-ESSP) have developed efforts to share data and technology across international boundaries in order to support world-wide science data analysis.

[3] http://congrexprojects.com/2014-events/bigdatafromspace/introduction

IX. Ten Year Roadmap of Capabilities

[Comment: What is this table? It's not explained.]

Columns: Today; Near-Term (2-5 Years); Far-Term (5-10 Years).
Entries: Highly curated data repositories; Scalable data science framework; Onboard data science and reduction methods; Access and search methods for distributed repositories; Integrated data analytics; Distributed data analytics and computation; Core data standards; Data provenance for reproducibility; Integrated uncertainty analysis; Machine Learning Techniques; Unified architecture for data exploration and analysis; On-demand data science methods integrated with data repositories; Data visualization methods; Scalable computing infrastructure; Virtual observational missions; Virtual, Immersive Visualization Environments for Massive Data.

[Comment: Is this about machine learning?]

Disruptive Approaches Leveraging Computational Science (5-10 Years)

Far-Term Disruptive Capabilities
Onboard data science and reduction methods | Increase onboard autonomy, planning and data processing as close to the instrument as possible as part of a broader big data strategy to increase science return.
Distributed data analytics and computation | Shift to a data analytics ecosystem to enable analysis of distributed data from multiple instruments (ground-based, airborne, satellite) as well as comparison against climate models.
Integrated uncertainty analysis | Develop an architectural approach that manages uncertainty levels as data and computational demands change.
Virtual observational missions | Shift towards a data-driven approach for missions where new instruments integrate into an overall virtual infrastructure; support planning of new science goals and observations from a combination of instruments and archival data.
Virtual, Immersive Visualization Environments for Massive Data | Enable immersive environments for scientific analysis as well as outreach using observational data from NASA missions.

[Comment: I would like to see an additional row in this table devoted to the interaction between the architecture and the (statistical) methods that define analytical algorithms: new statistical methods optimized for a given architecture; what architecture is best given a set of target analyses; trade-off between uncertainty and cost, etc.]
Technology Prioritization

Technology Name | Priority | Applicable Far-Term Capability
Big Data Architecture for Earth Science Remote Sensing | Highest | Cross Cutting
Big Data Information Models and Semantics | Medium | Distributed data analytics and computation
Onboard data science methods for data triage | Medium | Onboard data science and reduction
Onboard data science methods for data reduction | Medium | Onboard data science and reduction
Massive Data Movement Technologies | Medium | Distributed data analytics and computation
Real-time ground-based data science methods | Highest | Virtual observational missions
Open source data processing frameworks | Highest | Distributed data analytics and computation; Virtual observational missions; Virtual, immersive visualization environments for massive data
Reusable data science methodologies for missions and science | Highest | Virtual observational missions; Distributed data analytics and computation
Federated data access | Medium | Distributed data analytics and computation
Massive Data Distribution | Medium | Distributed data analytics and computation
Intelligent search and mining | Medium | Distributed data analytics and computation
Visualization of massive data sets | Medium | Virtual, immersive visualization environments for massive data
On-demand distributed data analytics | Highest | Distributed data analytics and computation
Distributed data analytics | Highest | Distributed data analytics and computation
Uncertainty Quantification; Measurement Science | Medium [Reviewer comment: I think this should be Highest] | Integrated uncertainty analysis
Open source data management/science frameworks | Medium | Distributed data analytics and computation; Virtual observational missions; Virtual, immersive visualization environments for massive data
Computational Infrastructures | Highest | Distributed data analytics and computation; Virtual observational missions; Virtual, immersive visualization environments for massive data

X. Overall Recommendations

NASA data technology needs to shift from ad hoc investments across the mission and science data lifecycle to an integrated architecture where technology investments fit into a broader capability and vision to enable Earth Science. The NASA data ecosystem needs to shift from a stewardship model to a data-driven discovery model where both data management and data discovery are enabled through a systematic data architecture, computational infrastructure and lifecycle model. Data architectures should be modeled and assessed overall to plan for technology capabilities and improvements to ensure scalability, and to address performance, cost, and uncertainty management goals. Data discovery methods should be applied across the entire data lifecycle to support scalability and discovery at each point, from onboard computing, to data processing and archiving, to distributed data analytics. Architectures should enable flexible tradeoffs of where and how to compute, including improved integration of HPC and data infrastructures. New capabilities should ensure reproducibility of derived scientific results.
Computational and data science should play an important role in planning new missions, including identification of how data, algorithms, and computing can be applied and integrated to improve overall data discovery.

References

[1] 2014 NASA Office of the Chief Technologist Roadmap (under development).
[2] National Research Council, Frontiers in Massive Data Analysis, 2013.
[3] 2011 PCAST Report on Information Technology.
[4] American Geophysical Union, Trends in Earth and Space Science.

Appendix A. NASA Mission/Science Data Lifecycle

As the data-intensive nature of NASA science and exploration missions increases, there is an increasing need to consider the data lifecycle from the point of collection all the way to the application and use of the data. Considerations across the entire lifecycle need to be made in order to support scalability and use of the data. Furthermore, missions may require that data reduction and intelligent triage be done on the data itself across the lifecycle to identify which data should be captured and archived. In addition to considerations of the data and software, it is critical that common information models be developed and defined so that consistent definitions of the data are applied and the data itself can be accurately managed, discovered and used. It is important to point out that data across this entire lifecycle should not be considered data at rest, but rather data that should be discoverable, accessible, and usable to update plans, support local operations, and enable science. As a result, a well-architected data system, from onboard data capture through ground-based operations and data analysis, must be in place to enable scalability at multiple points across that lifecycle.
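As a rough illustration of the triage idea described in this appendix (deciding at the collection point which data to keep when downlink capacity is limited), the following Python sketch ranks candidate granules by value per megabyte and keeps as many as fit in a fixed budget. It is a minimal sketch only: the Granule type, the science_value score, the sample granule names and the 1000 MB budget are all hypothetical and do not come from the report, and the greedy rule is a simple heuristic rather than a method proposed by the study team.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Granule:
    name: str
    size_mb: float        # size after any onboard compression (assumed)
    science_value: float  # hypothetical score from an onboard detector


def triage(granules, budget_mb):
    """Greedy triage: keep the best value-per-megabyte granules that fit.

    This is a heuristic; it does not guarantee the optimal selection.
    """
    ranked = sorted(
        granules,
        key=lambda g: g.science_value / g.size_mb,
        reverse=True,
    )
    kept, used = [], 0.0
    for g in ranked:
        if used + g.size_mb <= budget_mb:
            kept.append(g)
            used += g.size_mb
    return kept, used


# Illustrative inputs only; names, sizes and scores are made up.
granules = [
    Granule("cloudy_scene", 400.0, 1.0),
    Granule("volcanic_plume", 300.0, 9.0),
    Granule("clear_ocean", 200.0, 2.0),
    Granule("wildfire_front", 500.0, 10.0),
]

kept, used = triage(granules, budget_mb=1000.0)
print([g.name for g in kept], used)
```

In a real mission the score would come from an onboard event detector or science model, and the budget from the available contact time and link rate; both are placeholders here.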
Critical areas of the lifecycle (with considerations at each stage) include:
a) Data Generation: Perform original processing at the sensor/instrument
b) Data Triage: Make choices at the collection point about which data to keep
c) Data Compression: Maximize information throughput against available bandwidth
d) Data Transport: Improve resource efficiencies to enable moving the most data
e) Data Processing: Increase computation availability
f) Data Archiving: Scale the capture, management and distribution of data
g) Visualization: Develop and apply analysis techniques to enable data understanding from visualization of massive data
h) Data Analytics: Create analytics services to integrate massive, distributed, heterogeneous data

Appendix B. ESTO/AIST Proposed Big Data Technology Thrust Areas.

Technology Name | Data Lifecycle Area(s) | Description
Big Data Architecture for Earth Science Remote Sensing | Cross-Cutting | Definition of a scalable big data lifecycle architecture for Earth observing systems identifying how Big Data can scale from onboard computing to data analysis to increase science yield.
Big Data Information Models and Semantics | Cross-Cutting | Advanced semantic technologies for defining, deriving, and integrating heterogeneous ontologies and information models as applied across the entire data lifecycle (onboard, ground-based operations, archives, analysis).
Onboard data science methods for data triage | Data Triage | Onboard data science methods for real-time event detection and planning.
Onboard data science methods for data reduction | Data Compression | Onboard data science methods for data reduction.
Massive Data Movement Technologies | Data Transport | Massive data movement technologies for ground-based networks from operations through analysis.
Real-time ground-based data science methods | Data Processing | Real-time ground-based data science methods for data reduction and real-time event detection for massive data streams as part of the data lifecycle architecture.
Open source data processing frameworks | Data Processing | Open source data processing and workflow frameworks that can massively scale to computational infrastructures (HPC, public cloud, etc.), handling large data streams and products, including near-real-time constraints, as part of the data lifecycle architecture.
Real-time ground-based data science methods | Data Processing | Real-time ground-based data science methods for data reduction and real-time event detection for massive data streams as part of the data lifecycle architecture. [Reviewer comment: Duplicated?]
Reusable data science methodologies for missions and science | (1) Data Processing; (2) Data Analytics | Development of reusable data science methodologies for analysis of data on the ground as part of the data lifecycle architecture. This includes on-demand data analytics for massive data repositories.
Federated data access | Data Archives | Federation of data access from distributed repositories as part of the data lifecycle architecture, moving towards on-demand distributed data analytics.
Massive Data Distribution | Data Archives | Massive data distribution for large-scale repositories and archives, including methods for data reduction, computation, etc., as integrated, on-demand data analytics.
Intelligent search and mining | (1) Data Archives; (2) Data Analytics | Provide methods for intelligent search and mining of massive data. This may include integration of on-demand analytics to perform deep searches.
Visualization of massive data sets | Visualization | Visualization of massive data sets, including data reduction methods that are driven by domain.
On-demand distributed data analytics | Data Analytics | On-demand data analytics that can integrate data from archives, repositories, etc., applying data science methods (data reduction, fusion, feature detection, etc.) provided through a computational infrastructure.
Distributed data analytics | Data Analytics | Analysis of data across distributed archives to support Earth system science, including application of novel machine learning, statistical methods (e.g., data fusion) and other computational capabilities.
Uncertainty Quantification; Measurement Science | Data Analytics | Management of uncertainty in all phases of the data lifecycle and analysis as part of a measurement science strategy for data fusion and data science.
Open source data management/science frameworks | (1) Data Archives; (2) Data Analytics | Open source data management/science frameworks that can massively scale to handle and manage large data streams and products, including near-real-time constraints, as part of the data lifecycle architecture, for archiving and analytics as part of a big data cyberinfrastructure.
Computational Infrastructures | (1) Data Processing; (2) Data Analytics | Computational infrastructures to scale data analytics using HPC and public cloud. This includes on-demand massive HPC and storage for integration to drive analytics.

Appendix C. Mapping 2014 OCT TA-11 to Proposed ESTO/AIST Big Data Technology Thrust Areas.

The 2014 NASA OCT TA-11 Roadmap team identified the following areas:

Flight Computing: Includes technologies to support greater computation and data management at the point of collection onboard. In some cases, data reduction at the point of collection through intelligent triage methods may be required. Flight computing technologies include ultra-reliable, radiation-hardened platforms, which, until recently, have been extremely costly and limited in performance.
Ground Computing: Includes exascale supercomputing and data storage, as well as quantum, cognitive, and other types of advanced computing, for Big Data analysis and high-fidelity physics-based simulations for Earth and space science, and aerospace research and engineering.

Science, Engineering, and Mission Data Lifecycle: As the data-intensive nature of NASA science and exploration missions increases, there is an increasing need to consider the data lifecycle from the point of collection all the way to the application and use of the data.

Intelligent Data Understanding: Intelligent data understanding (IDU) refers to the capability to automatically mine and analyze datasets that are large, noisy, and of varying modalities (discrete, continuous, text, graph, etc.), in order to extract or discover information that can be used for further analysis or in decision making, on the ground or onboard. It is closely coupled to the capability to detect and respond to interesting events and/or to generate alerts.

Semantic Technologies: Technologies that enable data understanding, analysis, and automated consulting and operations.

Each of these areas has cross-cutting challenges which are relevant to AIST as follows.

Flight Computing

TA | Technology Name | Description | Applicable Technology Area
11.1.1.3 | High Performance Flight Software | Enables on-board, high performance autonomy and data processing. | Onboard data science methods for data reduction, real-time event detection, and planning

Ground Computing

TA | Technology Name | Description | Applicable Technology Area
11.1.2.1 | Exascale Supercomputer | Provides peak computational capability of 1 exaflop (10^18 floating point operations per second) for exascale performance of NASA computations, with excellent energy efficiency and reliability, to support NASA's exponentially growing high end computational needs. | Computational infrastructures to scale data analytics
11.1.2.2 | Automated Exascale Software Development Toolset | Provides automated, exascale application performance monitoring, analysis, tuning, and scaling. | Computational infrastructures to scale data analytics
11.1.2.3 | Exascale Supercomputing File System | Provides online data storage capacity of 1 exabyte, enabling data storage for exascale M&S and data analysis, with sufficient performance and reliability to maintain productivity for a broad array of NASA applications. | Computational infrastructures to scale data analytics
11.1.2.5 | Public Cloud Supercomputer | Provides additional resources for NASA supercomputer users, such as for mission-critical computing in an emergency. | Computational infrastructures to scale data analytics

Mission, Science and Engineering Data Lifecycle

TA | Technology Name | Description | Applicable Technology Area
1.1 | Reference Information System Architecture Frameworks | Provide reference architectures for the end-to-end science and engineering data lifecycle. | Definition of a scalable big data lifecycle architecture for Earth observing systems identifying how Big Data can scale from onboard computing to data analysis to increase science yield.
1.2 | Distributed Information Architecture Frameworks | Provide reference information architectures to define data across the end-to-end engineering and science data lifecycle. | Advanced semantic technologies for auto-generating ontologies and information models from large, distributed, existing data.
1.4 | Onboard Data Capture and Triage Methodologies | Apply novel machine learning capabilities on board to support data reduction, model-based compression and triage of massive data sets. | Onboard data science methods for data reduction, real-time event detection, and planning.
1.5 | Real-time Data Triage and Data Reduction Methodologies | Apply novel machine learning capabilities in ground data processing systems to support data reduction and triage of massive data sets. | Real-time ground-based data science methods for data reduction and real-time event detection for massive data streams as part of the data lifecycle architecture.
1.6 | Scalable Data Processing Frameworks | Provide scalable software processing frameworks for processing scientific and engineering data sets. | Open source data processing frameworks that can massively scale to handle large data streams and products, including near-real-time constraints, as part of the data lifecycle architecture.
1.7 | Massive Engineering and Science Data Analysis Methodologies | Provide scalable methodologies for analysis of massive data. | Development of reusable data science methodologies for analysis of data on the ground as part of the data lifecycle architecture. This includes on-demand data analytics for massive data repositories.
1.8 | Remote Data Access Framework | Provides access to and sharing of distributed data sources in a secure environment. | Federation of data access from distributed repositories as part of the data lifecycle architecture, moving towards on-demand distributed data analytics.
1.9 | Massive Data Movement Services | Develop new technologies for the movement of massive, multi-petabyte data over the network. | Data movement (observations, climate model output, etc.) across the data lifecycle is critical to scaling for Big Data. Develop capabilities to improve movement of data between points of the data lifecycle, from networks to parallel methods for data transfer.
1.10 | Large-Scale Data Dissemination Environments | Enable scaling of the data infrastructures (software, computation, networks, etc.) that are required to support large-scale data dissemination. | Massive data distribution for large-scale repositories and archives, including methods for data reduction, computation, etc., as integrated, on-demand data analytics.
1.11 | Toolset for Massive Model Data | Makes data/information transparent, scalable and usable when infusing multiple large and diverse datasets in complex models. | Scale tools that provide on-demand data analytics that bring together distributed models and observational data, and provide a computational infrastructure to enable analysis and intercomparison.
(TA number not given) | On Demand Data Analytics | Provides analytics service coupled with computational infrastructures. | On-demand data analytics that can integrate data from archives, repositories, etc., applying data science methods (data reduction, fusion, feature detection, etc.) provided through a computational infrastructure.

Intelligent Data Understanding

TA | Technology Name | Description | Applicable Technology Area
2.1 | Intelligent Data Collection and Prioritization Toolset | Provides a means to reduce the size of the data (e.g., removing clouds), remove corrupted data and/or collect complementary data for value- | Development of reusable data science methodologies for analysis of data on the ground as part of the data lifecycle architecture. In