Data Requirements from NERSC Requirements Reviews
Richard Gerber and Katherine Yelick
Lawrence Berkeley National Laboratory

Summary

Department of Energy scientists represented by the NERSC user community have growing requirements for data storage, I/O bandwidth, networking bandwidth, and data software and services. Over the next five years, these requirements are well above what would be provided by increases that follow historical trends.

This report focuses primarily on the data needs of the modeling and simulation community, which dominates NERSC usage, although NERSC also hosts projects that involve analysis of observational data sets. Regardless of the source of the data, scientists across all fields recognized that qualitatively new challenges are arising in storing, accessing, sharing, managing, and analyzing massive data sets. Researchers also point to the need for new data analysis algorithms, networking capabilities, and the ability to use emerging hardware and software platforms. Data analytics is subject to the same architectural challenges plaguing the rest of computing, namely the end of clock-speed scaling and a growing memory and I/O imbalance, but with algorithms that may have an even greater need for high data rates and random access than physical simulations.

The growth in demand for data systems and services is coming from multiple sources:

- The increased size, resolution, and complexity of large-scale simulations are producing hundreds of terabytes of data per simulation. In some cases, in-situ analysis is being developed to analyze data as the simulation runs.
- Data is being imported into NERSC from experimental facilities and remote observations, with data rates growing faster than Moore's Law.
- A growing appreciation for the intrinsic value of scientific data, both simulated and observed, has increased interest in serving scientific data sets to the broader community. This improves the reproducibility of science, enables multiple independent discoveries from a single data set, and allows for analysis across data sets, e.g., combining observed and simulated data.
- Massive ensemble simulations, increasingly used to quantify uncertainty or to screen related sets of materials (the Materials Genome), proteins, or other candidates, require new techniques for storing, indexing, and searching.

These challenges touch on what are commonly referred to as the five V's of data-intensive computations (some lists combine variability and veracity, or omit value, to produce a list of four V's):
- Volume: petabyte data sets projected to exceed normal growth plans;
- Velocity: the speed with which data arrives, which leads to on-demand data processing for experiments and in-situ analysis for simulations;
- Variability: the ability to combine distinct data sets, e.g., simulation and experimental data or data from different types of simulations;
- Veracity: data that is noisy or incomplete, arising from inaccuracies in data measurement (e.g., short-read sequencers, the CMB signal masked by the Milky Way), floating-point approximations, or missing features in a computational model; and
- Value: irreplaceable observational data and community data sets that require particular expertise, hardware, and software configurations, making it impractical for an individual scientist to reproduce them.

Background

In a series of six High Performance Computing and Storage Requirements Reviews, NERSC, DOE program managers, and leading domain scientists derived a set of production computing needs for the Office of Science. The results are available in a set of reports on the NERSC website. This first round of reviews targeted needs for 2014. A second round is underway, gathering requirements for 2017. Second-round reviews with the offices of Biological and Environmental Research (BER) and High Energy Physics (HEP) were completed in 2012, and those reports are in preparation. Reviews with the remaining four offices are still to come.

The first round of reviews revealed a need for 15.6 billion hours of computing time in 2014, more than 10 times greater than can be provided by Hopper, the current production supercomputer for the Office of Science. Data requirements for 2014 were a secondary consideration in these reviews, primarily because they concentrated on users with the largest computational demands, most of them from NERSC's traditional simulation community. Nonetheless, scientists in every review expressed concerns regarding data: they fully and explicitly recognized that data issues (storage volume, I/O rates, data management, and data analytics) were quickly growing beyond their ability to deal with them effectively. Data, once an afterthought to many, was becoming a factor in their simulation workflows that they could no longer ignore.

In addition to the data problems arising from the modeling and simulation workload, NERSC also hosts substantial projects that involve experimental or observational data sets, including those from the Joint Genome Institute, the Large Hadron Collider at CERN (ALICE and ATLAS), the Palomar Transient Factory, the Planck project for measuring the Cosmic Microwave Background (CMB), the 20th Century Climate Re-analysis Project, the Daya Bay Neutrino Experiment (housed in China), and the STAR experiment at BNL/RHIC. These projects use hardware, staff, and
services that rely on NERSC's data infrastructure, including the NGF high-performance parallel filesystem, HPSS archival tape storage, optimized data transfer nodes, and Science Gateway services for convenient external data access. These projects are not entirely supported by the NERSC Facility budget from ASCR, but instead involve direct support from other program offices (BER, HEP, NP, NASA) for hardware, software, and staff.

In the second round of requirements reviews, NERSC is including representatives from data-intensive projects and is also gathering input from new communities that have extreme unmet data requirements. The goal is to make data forecasts as reliable as the projected needs for computing cycles.

Requirements

Given these caveats, it is still possible to use results from the requirements reviews and reasonable extrapolations to arrive at both quantitative and qualitative production data needs for the Office of Science. These requirements largely reflect the needs of the simulation community, with some influence from data-intensive projects currently using NERSC. They do not include the needs of communities that have no presence at NERSC today, even though these groups, e.g., BES light sources and HEP users of accelerators and telescopes, likely have unmet data requirements that outstrip even those of existing NERSC users. For reference, as of late 2012, NERSC provides about 10 PB of spinning disk and about 24 PB of archival storage for its users, with an archive capacity of over 40 PB.

Qualitative Needs

Attendees at the requirements reviews expressed needs for data storage, I/O bandwidth, and data management tools far beyond today's capabilities. Major concerns included having enough fast-access data storage (not tape), I/O bandwidth to support checkpointing and simulation output (I/O should not exceed approximately 10% of a simulation's total run time), the current lack of data management tools, and a growing need for hardware and software to support data analytics. Data needs were prominent in the high-level summary of findings and requirements in each report from the requirements reviews. Below is a brief summary from each program office.

BER (target 2017, draft): The key needs are access to more computational and storage resources and the ability to access, read, and write data at a rate far beyond that available today.
HEP (target 2017, draft): The key needs are more computing cycles and fast-access storage, plus support for data-intensive science, including improvements to archival storage, analytics (parallel tools, databases, services, gateways, etc.), data sharing, curation, and provenance. Slow access to data stored on tape is a major concern for many HEP science teams; fast-access data storage was deemed a priority.

ASCR (2014): Applications will need to be able to read, write, and store hundreds of terabytes of data for each simulation run. Many petabytes of long-term storage will be required to store and share data with the scientific community.

BER (2014): Many BER projects have mission-critical time constraints; examples include predictions for the next IPCC climate change report and the Joint Genome Institute's need to maintain a four-month update cycle for genome datasets. Such projects demand a computational infrastructure that includes powerful, yet highly reliable, resources and resource reservation policies.

BER (2014): Data manipulation and analysis is itself becoming a problem that can be addressed only by large HPC systems. Simulation output will become too large to move to home institutions; therefore, NERSC needs to integrate robust workflow, data portal, and database technology into its computational environment and significantly increase real-time-accessible data storage capacity.

BES (2014): [There is a need to support] huge volumes of data from the ramp-up of the SLAC LINAC Coherent Light Source (LCLS) [and other experimental facilities in BES].

HEP (2014): Science teams need to be able to read, write, transfer, store online, archive, analyze, and share huge volumes of data.
1. The projects considered here collectively estimate needing a 10-fold increase in online disk storage space in three to five years.
2. HEP researchers need efficient, portable libraries for performing parallel I/O. Parallel HDF5 and netCDF are commonly used and must be supported (a minimal parallel HDF5 sketch appears after these summaries).
3. Project teams need well-supported, configurable, scriptable, parallel data analysis and visualization tools.
4. Researchers require robust workflow tools to manage data sets that will consist of hundreds of TB. Science teams need tools for sharing large data sets among geographically distributed collaborators.
5. The NERSC Global File System currently enables high-performance data sharing, and HEP scientists request that it be expanded in both size and performance.
6. Scientists need to run data analysis and visualization software that often requires a large, shared global memory address space of GB or more.
7. Researchers anticipate needing support for parallel databases and access to databases from large parallel jobs.

FES (2014): [Researchers need] data storage systems that can support high-volume/high-throughput I/O.

NP (2014): [Needs include] useable methods for cross-correlating across large databases and grid infrastructure, including the Open Science Grid (OSG) interface. Increased [computing] capacity has resulted in a significant increase in I/O demands on both intermediate and long-term storage.
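As an illustration of the parallel I/O libraries called out in the HEP summary, the following is a minimal sketch of a single-shared-file write with Parallel HDF5 via h5py and mpi4py. It assumes an h5py build with MPI support; the file name, dataset name, and sizes are illustrative, not taken from the reviews.

    # Minimal Parallel HDF5 write sketch (assumes h5py built with MPI support).
    # All names and sizes here are illustrative, not from the NERSC reports.
    from mpi4py import MPI
    import numpy as np
    import h5py

    comm = MPI.COMM_WORLD
    rank, nranks = comm.rank, comm.size

    n_local = 1000000                           # elements owned by each MPI rank
    data = np.full(n_local, rank, dtype="f8")   # stand-in for simulation output

    # Every rank opens the same file; the 'mpio' driver coordinates access.
    with h5py.File("checkpoint.h5", "w", driver="mpio", comm=comm) as f:
        dset = f.create_dataset("field", shape=(nranks * n_local,), dtype="f8")
        # Each rank writes its own contiguous slice of the global dataset.
        dset[rank * n_local:(rank + 1) * n_local] = data

Launched with, e.g., mpiexec -n 8 python write_checkpoint.py, each rank writes its slab into one shared file, the access pattern the HEP teams describe for checkpoint and analysis data.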
Quantitative Requirements

We chose to focus on archival storage: data that is deemed to be of permanent value and will be saved indefinitely. This will serve as a proxy for all data, both archival and live. Archival data at NERSC is currently stored in an HPSS system that is fronted by a large disk cache but ultimately relies on tape storage for the long term. We can derive quantitative numbers for archival storage because NERSC has detailed historical data on HPSS usage, and most attendees at the requirements reviews gave projections for their archival storage needs.

The figure below shows historical HPSS usage through 2012 and linear projections for each office (calculated independently) through 2017. Tentative estimates from the second round of requirements reviews of BER and HEP are also shown. This plot represents only user data, omitting spinning-disk backup data (currently more than 12 PB) as well as data stored by NERSC as overhead (several PB).

Archival data storage has consistently grown by a factor of 1.7 year over year for more than a decade. Tentative results of the second round of reviews indicate that this will not be adequate moving forward: needs from HEP and BER alone will be much greater than what would be provided by following the historical trend. In BER, the demand comes largely from climate and genomics research, while the need in HEP is driven by simulated data required to support and interpret measurements from accelerators, telescopes, and satellite missions.
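For intuition, a compound-growth model makes the 1.7x-per-year trend concrete. The sketch below projects archive capacity from the roughly 24 PB of user data stored as of late 2012 (the figure quoted under Requirements); treating 1.7x as a constant annual factor is an illustrative assumption, not a NERSC forecast.

    # Project archival storage under the historical 1.7x/year growth trend.
    # The ~24 PB baseline is taken from this report; extrapolating the trend
    # as a constant factor is an illustrative assumption, not a forecast.
    BASELINE_PB = 24.0       # user data archived at NERSC, late 2012
    GROWTH_PER_YEAR = 1.7    # historical year-over-year factor cited above

    def projected_archive_pb(year, base_year=2012):
        """Capacity implied by compound growth from the base year."""
        return BASELINE_PB * GROWTH_PER_YEAR ** (year - base_year)

    for year in range(2012, 2018):
        print(f"{year}: {projected_archive_pb(year):7.1f} PB")
    # 2014 -> ~69 PB and 2017 -> ~341 PB; the tentative HEP and BER review
    # projections exceed even this trend line.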
The following table gives the amount of archival data currently stored at NERSC for each office, together with projections from the workshops, given all the caveats related above. The 2014 numbers are very rough estimates; the projections for 2017 from the second round of workshops are expected to more faithfully represent all Office of Science production data needs.

Table 1. Archival Storage on NERSC's User HPSS System

Office   2012 Usage (PB)   2014 Projected Need (PB)   2017 Projected Need (PB)
ASCR
BER*
BES
FES
HEP*
NP

*Preliminary results from 2012 requirements reviews.

Research Problems in Data-Intensive Science (ASCR Report)

More detailed information about the data needs of particular science projects appears in the case studies in each of the six phase 1 reports. Here (verbatim) is an overview of data analytics needs from the ASCR report, as it is particularly relevant and touches on some of the overarching data challenges across science disciplines and the associated research problems.

With the anticipated improvements in both experimental and computational capabilities, the amount of data (from observations, experiments, and simulations) will be unprecedented. For example, by 2014 fusion simulations will use 1 billion cells and 1 trillion particles. Based on mean-time-between-failure concerns when running on a million cores, these codes will need to output 2 GB per core, or 2 PB of checkpoint data, every 10 minutes. This amounts to an unprecedented input/output rate of 3.5 terabytes per second (the arithmetic is worked through below). The data questions to consider at the extreme scale fall into two main categories: data generated and collected during the production phase, and data that need to be accessed during the analysis phase.

Another example is from climate modeling, where, based on current growth rates, data sets will reach hundreds of exabytes. To provide the international climate community with convenient access to data and to maximize scientific productivity, data will need to be replicated and cached at multiple locations around the globe. These examples illustrate the urgent need to refine and develop methods and technologies to move, store, and understand data.
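As a check on the checkpoint figures quoted above, the arithmetic works out as follows; the last line applies this report's 10% I/O budget rule of thumb from the Qualitative Needs section, which is our extrapolation rather than a figure from the ASCR report.

    \[ 10^{6}\ \text{cores} \times 2\ \text{GB/core} = 2\ \text{PB per checkpoint}, \qquad
       \frac{2\ \text{PB}}{10\ \text{min}} = \frac{2\times 10^{15}\ \text{B}}{600\ \text{s}}
       \approx 3.3\ \text{TB/s}. \]
    \[ \text{With I/O limited to } 10\% \text{ of run time:}\quad
       \frac{2\ \text{PB}}{60\ \text{s}} \approx 33\ \text{TB/s of burst bandwidth}. \]

The 3.3 TB/s decimal figure is consistent with the roughly 3.5 TB/s quoted above once binary prefixes (PiB, TiB) are assumed.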
The data issue cuts across all fields of science and all DOE Office of Science program offices. Currently, each research program has its own data-related portfolio; ASCR program managers envision an integrated data analytics and management program that will bring multi-disciplinary solutions to many of the issues encountered in dealing with scientific data.

In Applied Mathematics Research, data analytics needs include:
- Improved methods for data and dimension reduction to extract pertinent subsets, features of interest, or low-dimensional patterns from large raw data sets;
- Better understanding of uncertainty, especially in messy and incomplete data sets; and
- The ability to identify, in real time, anomalies in streaming and evolving data, in order to detect and respond to phenomena that are either short-lived or urgent (a minimal sketch of one such technique appears at the end of this section).

In Computer Science Research, issues being examined include:
- Extreme-scale data storage and access systems for scientific computing that minimize the need for scientists to have detailed knowledge of system hardware and operating systems;
- Scalable data triage, summarization, and analysis methods and tools for in-situ data reduction and/or analysis of massive multivariate data sets;
- Semantic integration of heterogeneous scientific data sets;
- Data mining, automated machine reasoning, and knowledge representation methods and tools that support automated analysis and integration of large scientific data sets, especially those that include tensor flow fields; and
- Multi-user visual analysis of extreme-scale scientific data, including methods and tools for interactive visual steering of computational processes.

Next-Generation Networking Research is concerned with:
- Deploying high-speed networks for effective and easy data transport;
- Developing real-time network monitoring tools to maximize throughput; and
- Managing collections of extreme-scale data across a distributed network.
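To make the real-time anomaly detection need above concrete, here is a minimal sketch of one standard technique, a rolling z-score over a streaming signal. The window size and threshold are illustrative assumptions; production detectors for DOE data streams would be considerably more sophisticated.

    # Rolling z-score anomaly detector: a minimal sketch of the kind of
    # streaming analysis named in the Applied Mathematics list above.
    # Window size and threshold are illustrative assumptions.
    from collections import deque
    import math
    import random

    def detect_anomalies(stream, window=50, threshold=4.0):
        """Yield (index, value) for points far outside the recent trend."""
        recent = deque(maxlen=window)
        for i, x in enumerate(stream):
            if len(recent) == window:
                mean = sum(recent) / window
                var = sum((v - mean) ** 2 for v in recent) / window
                std = math.sqrt(var)
                if std > 0 and abs(x - mean) / std > threshold:
                    yield i, x
            recent.append(x)

    # Example: a flat noisy signal with one injected spike at index 500.
    signal = [random.gauss(0.0, 1.0) for _ in range(1000)]
    signal[500] += 25.0
    print(list(detect_anomalies(signal)))  # expect roughly [(500, ...)]

Because the statistics are computed over a short trailing window, the detector adapts to slow drift in the signal while flagging short-lived excursions, the case the Applied Mathematics list singles out as urgent.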