Tackling the Big Data Deluge in Science with Metadata




The term Big Data has become virtually synonymous with schema-on-read (where data is applied to a plan or schema as it is ingested or pulled out of a stored location) unstructured data analysis and handling techniques like Hadoop. These schema-on-read techniques have been most famously exploited on relatively ephemeral, human-readable data like retail trends, Twitter sentiment, social network mining, and log files.

But what if you have unstructured data that, on its own, is hugely valuable, enduring, and created at great expense? Data that may not immediately be human readable or indexable by search? That is exactly the kind of data most commonly created and analyzed in science and HPC. Research institutions are awash with such data from large-scale experiments and extreme-scale computing used for high-consequence applications. As the recent DOE High Performance Computing Operational Review 2014 [1] noted: "The value and cost of data relative to computation is growing and, with it, a recognition that concerns such as reproducibility, provenance, curation, unique referencing, and future availability are going to become the rule rather than the exception in scientific communities."

Schema-on-write (where data is mapped to a plan or schema when it is written), metadata-centric data analysis and handling techniques are critical to meeting research goals of ensuring data consistency and trustworthiness, and they are the only techniques capable of meeting these scientific concerns of data identification, access, reproducibility, provenance, curation, unique referencing, and future data availability. In fact, a case can be made that metadata-based techniques are the most valuable Big Data analysis and handling techniques in scientific computing and HPC. Increased sensor resolution from ever more sequencers, cameras, microscopes, telescopes, scanners and instruments of all types is driving a deluge in scientific data.
And while today's high-capacity scale-out file and object storage systems can accommodate the sheer volume of data produced, they can't help identify what data is valuable, how to move and process that data through a global workflow, how to share it securely with geographically dispersed collaborators, or how to make sure that data is authentic and unaltered. Reducing the cost of storing increasingly immense datasets is paramount. Cost-effectively managing and ensuring continued access to valuable data is critical to driving results and new discovery. The alternative is that important scientific data becomes effectively lost forever, even as more and more data is being stored.

[1] DOE High Performance Computing Operational Review (HPCOR), Enabling Data-Driven Scientific Discovery at DOE HPC Facilities, www.nersc.gov/assets/hpcor/hpcor-data-2014.pdf

Metadata is the Key

Metadata is the key to keeping track of all this unstructured scientific data. Metadata is data about data. In the case of scientific data, metadata is structured data (written in a prescribed schema or order) that describes what the data is, how it was derived, and where it is located. Metadata makes scientific data easy to find, track, share, move and manage at low cost.

Unfortunately, today's high-capacity storage systems provide only bare-bones system metadata, consisting of as little as file name, owner and creation/access timestamps. Data-intensive scientific workflows need supplemental, enhanced metadata, along with access rights and security safeguards. Workflow constituents can then find and access valuable data by querying such extensive metadata. With the increasing data deluge across all scientific domains, rich workflow-specific metadata is essential to enable collaborators to find and share the valuable data crucial to their endeavors.

Key Benefits of Metadata in Scientific Computing

- Eliminate Data Junkyards
- Keep track of, and never lose, valuable data
- Access and manage data on any storage device, anywhere in the world
- Orchestrate global data-intensive workflows, achieving faster time to results
- Maintain data provenance, audit, security, and access control
- Reduce storage costs

Metadata can now be easily exploited in a wide range of scientific data-intensive workflows in life sciences, government, energy exploration and media & entertainment with the metadata creation and management facilities of the broadly deployable General Atomics Nirvana Metadata Centric Intelligent Storage System. Nirvana is a software product that works with existing storage systems. It was developed by the General Atomics Magnetic Fusion Energy and Advanced Concepts group in San Diego, California, growing out of a joint effort with the San Diego Supercomputer Center on the Storage Resource Broker (SRB).
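As a sketch of the difference between bare-bones system metadata and the enriched, workflow-specific metadata described above, consider a small example. The field names and values below are hypothetical illustrations, not Nirvana's actual schema or API:

```python
# Illustrative only: field names and values are invented, not Nirvana's schema.

# Bare-bones system metadata, as a typical filesystem provides it:
system_metadata = {
    "name": "run_0042.fastq",
    "owner": "jsmith",
    "size_bytes": 48_318_382_080,
    "mtime": "2014-06-03T09:12:44Z",
}

# Enriched, workflow-specific metadata layered on top:
enhanced_metadata = {
    **system_metadata,
    "project": "tumor-genome-screen",
    "instrument": "sequencer-03",
    "sample_id": "S-2014-0187",
    "pipeline_stage": "raw",
}

def find(records, **criteria):
    """Return records whose metadata matches every given attribute."""
    return [r for r in records
            if all(r.get(k) == v for k, v in criteria.items())]

catalog = [enhanced_metadata]
hits = find(catalog, project="tumor-genome-screen", pipeline_stage="raw")
print(len(hits))  # 1
```

The point of the sketch: a collaborator who knows only the project name and pipeline stage can locate the file, which would be impossible with the system metadata alone.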
Without an application-specific metadata management system like Nirvana, researchers will likely spend most of their time just looking for data, instead of exploiting that data to accelerate discovery.

Eliminating Data Junkyards

Today's high-capacity scale-out file (GPFS, Lustre, Isilon) and object (Cleversafe, WOS, Amplidata) storage systems make it easy to store tremendous quantities of data. As a result, many organizations are accumulating Data Junkyards: tens of millions of files from hundreds of past and present users, piling up for years with no way of identifying their technical or business value. And because there is no detailed description of what the data consists of, organizations are afraid to remove any of it. As capacity is exhausted, the only solution seems to be to repeatedly purchase additional storage. This is shown in the figure on the next page. Look familiar?

Reducing storage costs by removing data that's no longer needed with Nirvana

Perhaps not too surprisingly, storage vendors provide no tools to evaluate what is being stored, and analyzing the Data Junkyard by hand is hopeless. Running scanning scripts over millions of files can take days and slows the production file system. But what if you had a tool that could extract a simple inventory of the Data Junkyard into a separate database explicitly designed for fast query and analysis?

Using existing storage system metadata, General Atomics Nirvana can inventory Data Junkyards, characterizing the data according to size, owner, time of last access, and so on. Nirvana exposes what types of files are being stored and usage trends, by file type and user. Nirvana shows how much storage is consumed by duplicate, temporary or improper file types. With this information, administrators and managers can decide what data to keep, move or even delete.

How it Works

Nirvana is a metadata management and file tracking system that inventories the Data Junkyard using a relational database explicitly designed for fast query and analysis. Nirvana performs background scans on Data Junkyard file systems and records its findings in the Nirvana Metadata Catalog. Administrators can quickly formulate and execute queries and reports over the entire catalog with no performance impact on the production file system. Nirvana identifies what data is valuable and reduces the cost of storing data by keeping only what is needed. How many files are there? Who owns how many of which types of files? How old are they? When were they last used? Nirvana builds up information to answer those questions and more, provides information to justify data retention policies, and provides continuing analysis of storage consumption.
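The scan-then-query pattern can be sketched with a small relational catalog. This illustrates the general technique using SQLite; Nirvana's actual MCAT runs on databases such as Postgres or Oracle, and the table and column names here are hypothetical:

```python
# Sketch of an inventory catalog built for fast query, separate from the
# production filesystem. Schema and data are illustrative only.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE inventory (
    path TEXT, owner TEXT, size_bytes INTEGER,
    file_type TEXT, last_access TEXT)""")

# In a real deployment these rows would come from background scans of the
# storage systems; here we insert a few sample records by hand.
rows = [
    ("/data/p1/a.dat",   "alice",  10_000_000, "dat",   "2011-02-01"),
    ("/data/p1/b.tmp",   "alice",   5_000_000, "tmp",   "2010-07-15"),
    ("/data/p2/c.fastq", "bob",   900_000_000, "fastq", "2014-05-30"),
]
con.executemany("INSERT INTO inventory VALUES (?,?,?,?,?)", rows)

# Who owns how much, by file type?
usage = con.execute("""SELECT owner, file_type, SUM(size_bytes)
                       FROM inventory GROUP BY owner, file_type""").fetchall()

# How much space is held by files untouched since 2012?
(stale_bytes,) = con.execute("""SELECT COALESCE(SUM(size_bytes), 0)
                                FROM inventory
                                WHERE last_access < '2012-01-01'""").fetchone()
print(stale_bytes)  # 15000000
```

Because the queries run against the separate catalog, answering "how much of this is stale?" costs milliseconds instead of a multi-day filesystem walk.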

With Nirvana's reports, retention policies can be formulated and applied across all users. Files that are important to keep, but not current, can be moved off to lower-cost storage; violations of storage policy can be detected; and terabytes of unused and unwanted data can be cleaned away. Buying additional expensive storage because you do not know what else to do is no longer necessary.

General Atomics Fusion Research stores key research data on expensive, enterprise-class storage. Data growth necessitated the purchase of even more expensive storage. Nirvana was used to characterize the data stored and found that 65% of it hadn't been accessed for years, multiple copies of the same data existed, and one user had generated over 80% of the files, limiting storage system performance. With this analysis in hand, superfluous data was removed, eliminating the need to buy more storage.

Keeping Track of Valuable Data

Like a needle in a haystack, high-value data stored in large-scale storage systems can be effectively lost over time, losing its value forever. Nirvana prevents this by creating and exploiting detailed, additional, application-specific metadata about data valuable to all stakeholders. Data can be discovered and accessed through an object's attributes such as creation date, size, frequency of access, author, keywords, projects, devices and more. With Nirvana, valuable information can be found and analyzed even if it resides on very different, incompatible platforms anywhere in the world. Nirvana can execute complex queries for quick and simple data discovery and can also generate summary reports characterizing large data collections. The following figure illustrates standard storage file system metadata and example application-specific (in this case, genome sequencing) metadata added with Nirvana.

Nirvana example comprehensive application-specific (genomics) metadata

Orchestrating Global Data Intensive Workflows; Access and Manage Data on any Storage Device, Anywhere in the World

With collaborative scientific research done on a global scale, today's data-intensive workflows are geographically dispersed, spanning multiple storage tiers across many vendors' storage devices, over multiple administrative domains, among different organizations. By presenting a single global namespace across any storage device, anywhere in the world, Nirvana allows data to be easily and securely shared among globally distributed teams. Nirvana also automatically moves data to various workflow resources, based on policies, so data is always available at the right place, at the right time, and at the right cost, while keeping an audit trail as data is ingested, transformed, accessed, and archived through its complete lifecycle. Nirvana can also create multiple copies of the data, stored at multiple locations, for extra data protection and disaster recovery. Nirvana global workflow orchestration is shown in the following figure.

Nirvana orchestrates global data intensive workflows
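One way to picture a single global namespace with policy-driven replication is as a logical path mapped to one or more physical locations. The toy model below is purely illustrative of that idea; the resource names and functions are invented, not Nirvana's internals:

```python
# Toy model of a global namespace: one logical path, many physical replicas.
# Resource names and the replication step are invented for illustration.
namespace = {}  # logical path -> list of (storage_resource, physical_path)

def register(logical, resource, physical):
    """Record that a copy of the object lives on the given resource."""
    namespace.setdefault(logical, []).append((resource, physical))

def replicate(logical, target_resource):
    """Policy action: add a copy at another site for protection/locality."""
    resource, physical = namespace[logical][0]
    # ... actual data movement between sites would happen here ...
    register(logical, target_resource, physical)

register("/home/genomics/run_0042.fastq",
         "lustre-sandiego", "/lus/fs1/run_0042.fastq")
replicate("/home/genomics/run_0042.fastq", "s3-us-east")

print(len(namespace["/home/genomics/run_0042.fastq"]))  # 2
```

Users and applications see only the logical path; which site actually serves the bytes, and how many protective copies exist, is a policy decision hidden behind the namespace.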

Data Provenance

Provenance (from the French provenir, "to come from") is the chronology of the ownership, custody or location of a historical object, used to verify its authenticity. In the context of electronic data, provenance is the information on how data was created, accessed and processed throughout its processing pipeline and workflow. In many fields, data sets are useless without knowing the exact provenance and processing pipeline used to produce derived results.

Nirvana tracks data within workflows, through all transformations, analyses, and interpretations, ensuring data authenticity. With Nirvana, data is optimally managed, shared, and reused with verified provenance of the data and the conditions under which it was generated, so results are reproducible and analyzable for defects. An example of how provenance is used in genomics research [2] is shown in the figure below. The provenance of the genomics data produced is a record of all the hardware components (sequencer, computers) and software (including versions) used in the processing pipeline (base conversion, alignment, pre-processing and variant analysis) to create the final data product. Provenance is essentially like hitting the record button during the process, so derived results can be authenticated and reproduced as needed. Provenance metadata is stored in the Nirvana Metadata Catalog (MCAT) so it can be queried and validated.

Genomics sequencing processing and data provenance metadata

[2] IBM Reference Architecture for Genomics Brings Power to Research, Frank Lee, Ph.D., IBM Systems Magazine, August 2014, http://www.ibmsystemsmag.com/power/trends/ibmresearch/reference-architecture-genomics
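The "record button" idea above can be sketched as a chain of per-step records, each naming the tool and version used and linking its input to the previous step's output. Step and tool names below are illustrative, not the actual genomics pipeline components:

```python
# Sketch of provenance capture for a processing pipeline. Tool names,
# versions, and step names are invented for illustration.
import hashlib

def run_step(step_name, tool, version, input_ref, output_data):
    """Run one pipeline stage and return its provenance record."""
    out_hash = hashlib.sha256(output_data).hexdigest()
    return {
        "step": step_name,
        "tool": tool,
        "version": version,
        "input": input_ref,   # reference/hash of the upstream product
        "output": out_hash,   # content hash of this step's product
    }

provenance = []
raw = b"ACGT" * 1000
rec = run_step("base_conversion", "converter", "1.8", "sequencer-03", raw)
provenance.append(rec)
rec = run_step("alignment", "aligner", "0.7", rec["output"], b"aligned-reads")
provenance.append(rec)

# The chain lets a derived result be traced back step by step:
print([p["step"] for p in provenance])  # ['base_conversion', 'alignment']
```

Because each record's input field is the previous output hash, a final data product can be authenticated and, given the same tool versions, reproduced; storing these records in a queryable catalog is what makes validation practical at scale.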

Audit, Security, and Access Control

Nirvana can audit every transaction within a workflow with automated metadata capture and amendment. An audit trail can contain information such as the date of the transaction, success or error code, the user performing the transaction, the type of transaction, and notes. Audit trails, like everything else in Nirvana, can be easily queried and filtered. Nirvana maintains multiple authentication mechanisms for access to data and metadata. Nirvana creates access control lists for every user-level metadata attribute, ensuring a high level of security. Single sign-on and access through one common set of APIs, LDAP or Active Directory provide for complete access transparency. Data can only be viewed and modified by users authorized to do so. Nirvana enforces authentication to protect data through Challenge Response Mechanism, Grid Security Infrastructure (GSI), and Kerberos. Support is also built in for data encryption through GSI and Kerberos.

Reducing Storage Costs

Nirvana restrains the cost of data growth in four ways:

- Nirvana helps prevent worthless data from entering the workflow and being stored.
- Nirvana migrates data to lower-cost storage tiers using workflow policies, not just file policies.
- Nirvana removes data that's no longer valuable.
- Nirvana consolidates and automates complete data lifecycle management.

The best way to avoid accumulating unwanted data is to prevent its introduction into the workflow in the first place. Nirvana can help make sure ingested data is valuable and usable by using metadata about instrument conditions, configuration, data quality, error rates and missing elements to decide whether to store it. Nirvana also reduces storage costs by orchestrating the migration of data to lower-cost storage from multiple vendors, with automated data lifecycle management incorporating detailed workflow policies, not just file policies, for data migration, retention and disposal.
For example, data associated with a specific project or person that hasn't been accessed in a year could be migrated, or even deleted. Nirvana can consolidate and lower management costs by migrating data from multiple distributed locations to one central archive storage system. Data can be analyzed and managed no matter where it's located. Backup or migration from a local administrative domain can be done centrally, enterprise-wide, with Nirvana. And all this happens behind the scenes, transparent to users and applications.
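A retention policy of the kind just described ("not accessed in a year: migrate; temporary files: delete") can be sketched as a simple rule function applied to catalog records. The policy vocabulary and thresholds below are invented for illustration, not Nirvana's policy language:

```python
# Sketch of a tiering/retention policy evaluated against inventory records.
# Action names, field names, and thresholds are illustrative only.
from datetime import date

def apply_policy(record, today):
    """Return the lifecycle action for one catalog record."""
    idle_days = (today - record["last_access"]).days
    if record["file_type"] == "tmp":
        return "delete"                      # temp files are never retained
    if idle_days > 365:
        return "migrate_to_archive_tier"     # cold data moves to cheap storage
    return "keep"

today = date(2014, 9, 1)
files = [
    {"path": "/p1/a.dat",   "file_type": "dat",   "last_access": date(2012, 3, 1)},
    {"path": "/p1/b.tmp",   "file_type": "tmp",   "last_access": date(2014, 8, 1)},
    {"path": "/p2/c.fastq", "file_type": "fastq", "last_access": date(2014, 8, 30)},
]
actions = {f["path"]: apply_policy(f, today) for f in files}
print(actions["/p1/a.dat"])  # migrate_to_archive_tier
```

Running such rules against the metadata catalog rather than the live filesystem is what allows the policy to be evaluated continuously, enterprise-wide, with no load on production storage.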

Nirvana Components

Shown in the figure below, the heart of Nirvana is the Metadata Catalog (MCAT), which implements the Global Namespace and maps data objects to Storage Resources. MCAT metadata is stored in a relational database such as Postgres or Oracle to support fast, efficient queries. A Storage Resource is any file or object storage system, located anywhere in the world. Location Agents directly access Storage Resources. The Nirvana Client allows the user to interact with the MCAT and Location Agents via browser GUI, Java, Python or custom SDK. Files can also be viewed and accessed as familiar Windows and Linux folders and files through Nirvana. Nirvana can scale Storage Resources horizontally and linearly. Multiple MCATs can be deployed to ensure uninterrupted operation and disaster recovery.

Nirvana components

A Comprehensive, Affordable, Easily Deployed, Fully Supported Solution

General Atomics developed Nirvana in a joint effort with the San Diego Supercomputer Center, leveraging successful methods proven in data-intensive, multinational Big Science research. Nirvana responds to the needs of researchers in collaborative projects to share access to data on remote, heterogeneous storage systems as part of a global workflow. Nirvana is a comprehensive, easily deployed solution with enhanced metadata capabilities, bolstering advanced data discovery. Nirvana is tuned for performance and scalability, focusing on secure operation, ease of use, ease of administration, and reliability. Nirvana is fully supported, ensuring our customers' success.

About General Atomics

General Atomics ("GA") and its affiliated companies constitute one of the world's leading resources for high-technology systems, ranging from electromagnetic systems, remotely operated surveillance aircraft and airborne sensors to advanced electronic, wireless and laser technologies. GA carries out the largest and most successful nuclear fusion program in private industry.

For More Information

www.ga.com/nirvana
Phone: +1 866.312.8896
Email: info@nirvanastorage.com

Technical Data

Nirvana is a distributed application: multiple software components (daemon services and user-launched clients) communicate across network connections, within or between computer hosts, to implement and coordinate Nirvana functions. Each component type has a supported set of host environments (operating system and processor hardware) on which it can be installed and operated.

The Nirvana clients interact with the user via GUI or command line, and with their host's filesystems to send client host data to an Agent, to receive data from an Agent and write it locally to the client host, and for local logging. The Nirvana Agents interact with their host's filesystems for logging and typically use some portion of them as Nirvana Storage Resources; they also interact with data stores external to their host, such as object store systems and relational databases, and even remote from their host, such as SFTP sites and Amazon S3. The Nirvana Metadata Catalog program utilizes a standard relational database as its back end. Since the database can run on a separate server from the Metadata Catalog, its host environment is not important to Nirvana.
Nirvana Key Features

Product Delivery: Comprehensive, supported, commercial solution
Metadata: Expansive, workflow-specific, user-defined; an entire database row can be assigned to an object, allowing simpler, faster queries and far more effective provenance, access control and auditing than simple key/value pairs
Authentication: LDAP, AD, Challenge Response Mechanism, Grid Security Infrastructure (GSI), Kerberos
Encryption: GSI, Kerberos
Data Protection: Multiple geographically dispersed replicas
SDK: C, C++, Java, .NET, Python
Scale: Up to 500 million data objects and beyond
Cloud Support: Nirvana supports S3 and runs on AWS; the Nirvana Global Namespace connects private storage, private clouds, and public clouds into a single searchable entity

Nirvana Daemon Components (Supported Environment; Processor Hardware)

Metadata Catalog Program: Linux: RHEL 6+, CentOS 6+, SLES 11 SP3+; x86_64
Database: PostgreSQL 9.2+, MySQL 5.5+, Oracle 11g; x86_64 (PostgreSQL)
Client Library for MCAT: Linux: RHEL 6+, CentOS 6+, SLES 11 SP3+; x86_64
Storage, ILM and Metadata Agents: Windows Server 2008+; x86_64

For daemon components, only 64-bit versions are supported. Other operating systems and processors are supported on request. Minimum recommended enterprise-scale entry MCAT server hardware is 2 sockets and 64 GB of RAM. Minimum recommended enterprise-scale entry database server hardware is 2 sockets and 128 GB of RAM.

Nirvana Clients (Supported Environment; Processor Hardware)

Vdisk for Windows: Windows client 7+ (x86); Windows Server 2008+, Windows client 7+ (x86_64)
Windows Gateway: Windows client 7+ (x86); Windows Server 2008+, Windows client 7+ (x86_64)
Windows CLI: Windows client 7+ (x86)
Command Line: Linux: RHEL 6+, CentOS 6+, SLES 11 SP3+ (x86); Linux: RHEL 6+, CentOS 6+, SLES 11 SP3+, MacOS X 10.8+ (x86_64)
FUSE for Linux: Linux: RHEL 6+, CentOS 6+, SLES 11 SP3+ (x86_64)
Grid Gateway (server-side emulation): Linux: RHEL 6+, CentOS 6+, SLES 11 SP3+, Globus Toolkit v4.0.1+ (x86_64)
Internet (WebDAV) Gateway: Linux: RHEL 6+, CentOS 6+, SLES 11 SP3+, Apache Web Server v2.0.52+ (x86_64)
Java User and Admin: Java JRE 7+ 64-bit (N/A)
Web server CGI module: Linux: RHEL 6+, CentOS 6+, SLES 11 SP3+, Apache Web Server v2.0.52+ (x86_64)
Nirvana SDK: C on Linux: RHEL 6+, CentOS 6+, SLES 11 SP3+ (x86_64); Java JRE 7+ 64-bit (N/A); C# Microsoft .NET v3+ (x86); C# Microsoft .NET v3+ (x86_64)

Additional clients, operating systems, processor hardware, and 32/64-bit versions are available on request.

Nirvana contains a driver framework for adapting internally standardized data-store operations to the specific function interfaces and behaviors offered by a variety of physical data stores.

Storage Resources

Local Host Filesystems: POSIX and POSIX-similar, NTFS
Network Share Mounts: Any network share mount and protocol presenting as a local filesystem: NFS, GPFS, Panasas PanFS, CIFS
Cluster Filesystems: Aggregate logical construct available to understand multi-node access to the same filesystem and file objects
URL-addressed Resources: Remote resources addressable using the URL <protocol>:// convention, including FTP, FTPS, SFTP, HTTP, HTTPS, AWS S3
RDBMS Table Data: SQL-accessible table data for supported database products
RDBMS Large Object Data: SQL-accessible large object data for supported database products
Hierarchical Storage Management Filesystems: Oracle SAM-FS 5.3+ (deep integration), SGI Data Migration Facility 5.3+
Object Stores: DDN Web Object Scaler

Other resources are available on request.

Networking

Network Services Use: Nirvana components and clients default to TCP/IP v4 port 5625, support IPv6, and are clients of DNS
Network Time: All computers hosting Nirvana programs must have low system clock skew to avoid timestamp anomalies and possible authentication system dysfunction
Built-in Authentication Mechanisms for User Identity: Encrypted challenge-response (password), Grid Security Infrastructure, Kerberos v5 (RFC 4120)
External Authentication Mechanisms: Credentials are collected from a user and passed to host mechanisms such as MS AD, LDAP, PAM and NIS for pass/fail validation
Authorization Mechanisms for Users and Groups: Inheritable and object-specific Discretionary Access Control and Access Control Lists (user-controlled); Mandatory Access Control (user cannot override) configured by administrator; limited-use tickets (by count and/or time) for specific object access