Intro to Data Management
Chris Jordan, Data Management and Collections Group, Texas Advanced Computing Center

Why Data Management?
- Digital research, above all, creates files
  - Lots of files
- Without a plan, protecting, sharing, and even locating data can be a challenge
- With a plan, researchers can focus on their areas of expertise
- Many management policies can be automated
  - Replication, open access, cross-collection links

TACC and Data Management
- Data management has become integral to the conduct of research
  - Smart grids, genomics, medical imaging
  - Astronomical imaging, social science, digitization
- Bringing infrastructure and expertise together
  - Capacity is never a concern
  - Performance (almost) never a concern
  - Allows focus on policy and practices

Basic Principles of Data Management
- Think in terms of the whole collection
- Understand the life cycle of the data beforehand
- Plan for both use and re-use
- Open Access is always the best choice
  - Access can be subject to embargoes
- Don't try to do it all yourself!
  - Multi-collection repositories are often available
  - Can make providing access much easier

Life Cycles for Data
Components of the life cycle:
- Generation for specific purposes
- Creation of metadata
- Direct use in research/experimentation
- Provision of open access
- Retirement of inaccurate/outmoded data
- Archival of not immediately useful data
- Long-term preservation
- Incorporation into larger repositories

Data Structure
- Think about the entire collection of data to be generated or used in the research workflow
- Understand relationships between data of varying types and structures
- Impose structure on data in a consistent way
- Example: data in dated directories based on generation time, or in topical directories based on type?
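The dated-versus-topical choice on this slide can be made concrete with a small sketch. The root name `collection`, the file name, and both helper functions are hypothetical, chosen only for illustration; the point is that whichever layout is picked, paths are built by one function so the structure stays consistent.

```python
from datetime import date
from pathlib import PurePosixPath


def dated_path(root: str, generated: date, filename: str) -> PurePosixPath:
    """Place a file under root/YYYY/MM/DD, keyed on its generation date."""
    return (
        PurePosixPath(root)
        / f"{generated:%Y}"
        / f"{generated:%m}"
        / f"{generated:%d}"
        / filename
    )


def topical_path(root: str, data_type: str, filename: str) -> PurePosixPath:
    """Place a file under root/<data_type>, e.g. images or catalogs."""
    return PurePosixPath(root) / data_type / filename


# Hypothetical file and date, for illustration only.
print(dated_path("collection", date(2016, 1, 26), "specimen_0001.tif"))
print(topical_path("collection", "images", "specimen_0001.tif"))
```

Either layout works; what matters is that the rule is written down once and applied everywhere, rather than living in someone's head.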

Data Structure 2
- Heterogeneous collections
  - Comprised of both structured data (catalogs) and unstructured data (images/movies/3D)
- Ideally, catalogs and digitization products should be linked
- Ideally, these linkages are available via open network mechanisms such as the web
- Much easier to do this at the time of digitization

"Hidden Knowledge"/Idiosyncrasy
- Everyone has their own "natural" way of organizing data
- Often use internal conventions or hidden knowledge in file/directory names
- Example: use of dates. Generation date? Transfer date? Post-processing date?
- Not specific to the digital realm (cf. lab notebooks)
- Consistency, transparency, clarity

DM Planning and Execution
- Designate one or more individuals w/ primary responsibility for data management
- Where possible, partner w/ experts: information scientists, existing repository managers, TACC and similar organizations
- Develop a plan before data is generated
- Workflows should include data destinations, linkage to databases, etc.

Types of Data and Analysis
- Structured vs. unstructured
- Visualization vs. analysis/reduction
- SQL vs. NoSQL
- Data format may be related to analysis
- Data may undergo multiple transformations
  - Example: CSV, XML, 4D grids vs. visualization
- Data may be reduced/subsetted

Structured vs Unstructured Storage
- The most common storage metaphor is the file system, based on hierarchical collections of binary objects without consistent structure
  - Images, sound, office documents
- Structured data has different storage needs
- File systems are tuned for general use, capacity, sequential read/write
- Structured stores are tuned for strided or random I/O, IOPS

Structured Data
- Basic idea: your data consists of a number of records, themselves composed of elements in a common structure
- You want to be able to extract/compare records based partially on the structure
- Increasing prevalence of semi-structured data (e.g., HTML/XML)
- Structured Query Language: SQL/RDBMS
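As a minimal sketch of records with a common structure queried by that structure, the example below uses Python's built-in `sqlite3` module. The table name, column names, and the three specimen rows are invented for illustration and do not come from any real catalog.

```python
import sqlite3

# In-memory database; nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE specimens ("
    "catalog_id TEXT PRIMARY KEY, species TEXT, collected TEXT)"
)
# Made-up example records sharing one structure.
conn.executemany(
    "INSERT INTO specimens VALUES (?, ?, ?)",
    [
        ("A-001", "Lepomis cyanellus", "2015-06-01"),
        ("A-002", "Micropterus salmoides", "2015-06-03"),
        ("A-003", "Lepomis cyanellus", "2016-02-11"),
    ],
)


def ids_for_species(connection: sqlite3.Connection, species: str) -> list:
    """Return catalog IDs for one species, oldest collection date first."""
    rows = connection.execute(
        "SELECT catalog_id FROM specimens WHERE species = ? ORDER BY collected",
        (species,),
    )
    return [row[0] for row in rows]


print(ids_for_species(conn, "Lepomis cyanellus"))  # ['A-001', 'A-003']
```

The same question asked of a directory of loosely named files would require parsing file names by hand; the shared structure is what makes the query a one-liner.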

Relational Database Systems
- Metadata, search, filtering and sorting
- Transactional scenarios such as websites
- Ensuring data quality, allowing audit trail
- Hosted database solutions over desktop
  - Automated and routine backup
  - Multiuser and distributed collaboration
  - Single system of record, no mysterious out-of-sync copies
  - Stable, sustainable
- Mostly but not exclusively SQL-based

Metadata
- Metadata = data about data
- Many types of metadata:
  - Technical (size, type)
  - Legal (license, ownership)
  - Format-specific (structure, bit depth, dimensions)
  - Discipline-specific (species, collection time, observation time)
- Metadata facilitates organization, search and re-use
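Technical metadata is the easiest kind to capture automatically. The sketch below, using only the Python standard library, gathers size, guessed type, and modification time for one file. The function name and the dictionary keys are assumptions made for this example, not an established schema.

```python
import mimetypes
import os
import tempfile
from datetime import datetime, timezone


def technical_metadata(path: str) -> dict:
    """Collect basic technical metadata for a single file."""
    info = os.stat(path)
    mime_type, _ = mimetypes.guess_type(path)  # guessed from the extension only
    return {
        "name": os.path.basename(path),
        "size_bytes": info.st_size,
        "mime_type": mime_type,  # None if the extension is not recognized
        "modified_utc": datetime.fromtimestamp(
            info.st_mtime, tz=timezone.utc
        ).isoformat(),
    }


# Demonstration on a throwaway file.
with tempfile.TemporaryDirectory() as workdir:
    sample = os.path.join(workdir, "notes.txt")
    with open(sample, "w") as handle:
        handle.write("hello")
    print(technical_metadata(sample))
```

Legal and discipline-specific metadata usually cannot be derived from the file itself and still has to be supplied by a person or an instrument.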

Provenance and Metadata
- Provenance = an audit trail for research
- Reproducibility is key to scientific progress
- Without detailed provenance of digital data, how can results be verified?
- Observational data: location and details of circumstances of collection
- Simulation data: detailed information on software and hardware

Metadata Practices
- Metadata (esp. provenance) often cannot be reconstructed after the fact
- Crucial to record metadata as early as possible in the process
- Automate extraction/creation where possible
- Enforce good recording practices for metadata/provenance details
- Software can help with policy enforcement
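One way to record provenance as early as possible is to have the code that produces an output also emit a provenance record at the same moment. The following is a sketch under assumptions: the field names and the example parameters (`timestep`, `seed`) are hypothetical, and a real project would likely add software versions, input file identifiers, and the like.

```python
import hashlib
import json
import platform
import sys
from datetime import datetime, timezone


def provenance_record(output_bytes: bytes, parameters: dict) -> dict:
    """Describe how and where a piece of output was produced."""
    return {
        "created_utc": datetime.now(timezone.utc).isoformat(),
        "sha256": hashlib.sha256(output_bytes).hexdigest(),  # identifies the output
        "parameters": parameters,                            # inputs to the run
        "python_version": sys.version.split()[0],            # software details
        "platform": platform.platform(),                     # OS details
        "machine": platform.machine(),                       # hardware architecture
    }


# Hypothetical output and parameters, for illustration only.
record = provenance_record(b"simulation output", {"timestep": 0.01, "seed": 42})
print(json.dumps(record, indent=2))
```

Writing such a record next to every output file costs almost nothing at generation time, while reconstructing the same details months later is often impossible.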

Archiving Research Data
- Repositories (centralized model)
  - Domain-specific and institutional
  - Static data
  - Limitations on collection size and functionality
- Researcher gives access (decentralized model)
  - Issues with maintenance and permanence
- Evolving data storage (semi-centralized)
  - Flexibility to configure/curate your collection
  - Seamless transition between research stages
  - Long-term agreements

What Not to Do
- Do not shoot first, ask questions later
- Do not keep only one copy
- Do not go to Fry's/Newegg/Best Buy
- Do not use a commercial cloud provider as a primary data store
  - Fine for archival copies, costly for access
- Do not use Excel or Access to build a catalog
  - Good for development, bad for stewardship
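Keeping more than one copy only helps if the copies are known to match. A minimal sketch of verifying a replica against its primary by checksum follows; the function names are made up for this example, and SHA-256 is just one reasonable choice of digest.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copies_match(primary: Path, replica: Path) -> bool:
    """Return True when both files have identical content."""
    return sha256_of(primary) == sha256_of(replica)
```

Storing the checksum alongside the data at creation time also lets a later check detect silent corruption in either copy, not just disagreement between them.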

TACC Resources for Data
- Corral: 4PB replicated resource for data collections/management/sharing
- Stockyard: 20PB file system for large-scale storage of data used in computation/analysis
- Ranch: 160PB tape archive for long-term storage of relatively low-value data
- iRODS: software data grid allowing a single namespace spanning many resources

Collections at TACC
- Corral supports both structured and unstructured data stores
- VM capabilities allow for hosting of websites, application development, etc.
- Collection/discipline-specific websites
  - http://www.fishesoftexas.org
  - http://arctos.database.museum

iRODS in Data Management
- Integrated support for metadata and policies
- Numerous APIs: script, compiled, REST
- Web, command-line, GUI interfaces
- Automated enforcement of rules
  - Metadata extraction
  - Format transformation
- Appropriate for large, complex data collections

Supercomputing for Collections?
- Most collections-related problems are embarrassingly parallel
- Image conversion, resizing, OCR, etc. scale linearly with added cores/nodes
- Can process tens of thousands of files within hours or days rather than weeks
- Potential for interesting new analysis applications using aggregated collection data
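"Embarrassingly parallel" means each file can be processed without reference to any other, so the work maps directly onto a pool of workers. The sketch below illustrates the pattern with `concurrent.futures`; `process_file` is a placeholder that only renames, standing in for real per-file work such as image conversion or OCR, and a thread pool is used purely to keep the example small.

```python
from concurrent.futures import ThreadPoolExecutor


def process_file(name: str) -> str:
    """Placeholder for per-file work; here it only derives an output name."""
    return name.rsplit(".", 1)[0] + ".jpg"


def process_collection(names, max_workers: int = 4) -> list:
    """Apply process_file to every name; no task depends on another."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() returns results in the same order as the inputs.
        return list(pool.map(process_file, names))


print(process_collection(["a.tif", "b.tif", "c.tif"]))  # ['a.jpg', 'b.jpg', 'c.jpg']
```

For CPU-bound work in Python, a process pool or a batch scheduler spreading files across nodes would be the more realistic choice; the structure of the code stays the same because the tasks share no state.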

Example: Bioinformatics
- Diverse data types: structured and unstructured
- Data sharing critical, both openly and within collaborations
- Complex decisions regarding provenance and data retention
  - Sequencer output and derived data products
  - Variable value, huge data sizes
- Well-developed metadata and other standards
  - Perhaps too many of them?
- Infrastructure can be a special challenge
  - Planning is key to successful research
- Many open repositories and group efforts
  - GenBank, TAIR, etc.: uncertain sustainability models
  - Have your own resources, share data when possible

Example: Arctos/Collections Management
- Arctos hosted entirely at TACC
- Web application with catalog/media linkage/open access capabilities
- Semantic web, export to GBIF, etc.
- Many collections in Arctos, including:
  - Museum of Vertebrate Zoology, UC Berkeley
  - Museum of Southwestern Biology, U New Mexico
  - University of Alaska Museum and Herbarium

Example: Simulation Data
- Consider how data furthers research goal
- Software often evolving, community-based, or self-developed
  - Output validation could be important
  - Plan for validation and analysis
- Consider ease of regenerating output
- Consider time and cost of retention (file management is effort-intensive)
- "Keep everything" is not a plan
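The regenerate-versus-retain trade-off can be framed as simple arithmetic. The sketch below is a deliberately crude model: it assumes a flat storage price per terabyte per year and a single lump-sum regeneration cost, and every number in the example call is a placeholder rather than a real price.

```python
def cheaper_to_regenerate(
    size_tb: float,
    storage_cost_per_tb_year: float,
    years: float,
    regeneration_cost: float,
) -> bool:
    """Compare the cost of keeping output against the cost of re-running."""
    retention_cost = size_tb * storage_cost_per_tb_year * years
    return regeneration_cost < retention_cost


# Placeholder figures: 10 TB kept for 5 years at 100 per TB-year costs 5000,
# which exceeds a regeneration cost of 2000.
print(cheaper_to_regenerate(10, 100.0, 5, 2000.0))  # True
```

The model ignores staff effort for file management and the risk that evolving software can no longer reproduce old output, both of which the slide flags and both of which can dominate the decision.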

We're Not Alone
- TACC is not the only possible partner
- Similar advanced computing centers exist at many universities
- Many large-scale repositories may have relevant expertise and infrastructure
- Aggregator sites facilitate search/use of data
- Data is more valuable the more of it there is, and the easier it is to access

Contacts and Questions
- Chris Jordan: ctjordan@tacc.utexas.edu
- Questions about Data @ TACC? E-mail data@tacc.utexas.edu
- http://www.tacc.utexas.edu
- https://portal.tacc.utexas.edu