iRODS for Big Data Management in Research-Driven Organizations. Charles Schmitt, CTO & Director of Informatics, RENCI
1 iRODS for Big Data Management in Research-Driven Organizations
Charles Schmitt, CTO & Director of Informatics, RENCI
2 Acknowledgements
Presented work funded in part by grants from NIH, NSF, NARA, and DHS, as well as funding from UNC.
Teams involved include:
- DICE team at UNC and UCSD
- Networking team at RENCI and Duke
- Data sciences team at RENCI
- UNC Dept of Genetics, Research Computing, Lineberger Comprehensive Cancer Center, NC TraCS Institute, Center for Bioinformatics, Institute for Pharmacogenetics and Personalized Treatment
- UNC HealthCare
- Multiple members of the iRODS community
3 RENCI Researches, Develops, and Deploys Cyberinfrastructure
Joint venture between UNC, Duke, NCSU, and the State of North Carolina
[Diagram labels: Tools, Networks, Virtual Organizations, Visualization, Data Science of Cyberinfrastructure, High Performance Computing, Software, Analytics; Projects, Collaborators, Funding, Scholarship, Innovation, Engagement; evaluate, improve]
4 RENCI Key Initiatives
Environmental/Coastal Sciences:
- E1: Storm Surge Modeling
- E2: NSF SSI (HydroShare)
- E3: PIRE
Biomedical and Health Sciences:
- H1: CTSA
- H2: Sequencing
- H3: Secure Med. Workspace
- H4: Decision Support
Data Science of Cyberinfrastructure:
- C1: S2I2
- C2: REACH NC
- C3: E-iRODS
- C4: DataNet
- C5: CIBER (NARA)
- C6: ORCA/BEN (NSF GENI)
[Diagram labels: HPC, Virtual Organizations, Visualization, Networks, Software, Analytics, Tools (E-iRODS, GeoViz, SRW)]
5 Use Case: Informatics for Next-Generation Genomics
~ Whole and Exomic Sequences generated
~10,000 Sequences stored
RENCI Next Gen Sequencing
Sequencing Informatics: Computational Workflows, High Performance Computing, Distributed Data, Security
Informatics & Cyberinfrastructure R&D
Clinical Practice
- Identifying genomic variants relevant to clinical care
- Exploring ethical/legal issues around reporting genomic findings
Clinical Research
- Determining relationships between genomic variants and disease
Basic Research
- Finding new ways to understand the relationship between genes and disease/behavior
In collaboration with UNC Research Computing, UNC Dept of Medical Genetics, Lineberger Comprehensive Cancer Center, Institute for Pharmacogenetics and Personalized Treatment, UNC High Throughput Sequencing Core, UNC Center for Bioinformatics
6 Managing Research Data: Genomics
[Diagram labels: Sequencers; Tape Archive; Initial Pipeline, QC; Alignment Pipeline, QC; Variant Detection; Analysis: Phasing, Imputation, IBD, Phenotype Correlation; R&D New Methods; Variant Database, Hadoop; Clinical Binning; Clinical Validation; Clinical Review; Clinical Decision Support & Presentation; Archives (NIH, Library?), Replication; External Data Feeds (RefSeq, OMIM, PolyPhen, ...)]
Data/information flow managed by:
1) Multiple custom workflow management systems
2) Multiple custom Laboratory Information Management Systems (LIMS)
3) E-iRODS
7 The research data ecosystem: challenges
[Diagram labels, resources: UNC Storage (Tape, Drives), RENCI Storage (Tape, Drives), Genomics Storage, Lab Machines, Open Science Grid, TeraGrid, External Partner Resources, UNC HPC, RENCI HPC, RENCI Hadoop, Genomics HPC, Genomics Hadoop, IT Machines, Clouds]
[Diagram labels, actors: Analysts, Automated Processes, Developers, Data Providers, External Partners, IT Staff, Students, Compliance; spectrum from "Wild West" to "Controlled"]
Data management challenges:
- Tracking data and metadata
- Data movement and migration
- Enforcing policies, compliance, security
- Encouraging managed automation
- Cost, disk and IT time
- Failures
While not disrupting access to data
While goals, processes, users, and software change
8 Big data and new stressors
- More data munging, more tools and processes involved, more hardware, more people (esp. IT and CS people)
- More security and compliance concerns
- More QC and QA concerns
  - Data too big for review + more people/processes = more mistakes
- More infrastructure breakages: storage systems, software tools
- Time slows down and mistakes are more costly
  - Moving data is a planned IT event; analyses take days to months
9 What's needed?
A multitude of technologies that play well together
- LIMS, analysis workflow engines, HPC queues, RDBMS, archival and library systems, web reporting/submission sites, ...
Middleware that:
- Ties together the technologies
- Automates data-related chores
- Virtualizes the IT data infrastructure
- Securely manages the data at scale
- Works within a dynamically changing research environment
10 Integrated Rule-Oriented Data System (iRODS)
Proven in production use: NASA, NOAA, National Archives, Max Planck Society, Broad Institute, Wellcome Trust Sanger Institute, Lineberger Comprehensive Cancer Center, Beijing Genomics Institute, Dow Chemical, Merck, International Neuroinformatics Coordinating Facility, ...
Proven at scale:
- iPlant: 10k users
- French National Institute for Nuclear Physics and Particle Physics: 6 PB
- Australian Research Collaboration Service storage resources
- NASA Center for Climate Simulations million attributes
- CineGrid sites across Japan-US-Europe
Solid foundation:
- SRB: initial product (developed by DICE Group, owned by General Atomics) in 1997
- iRODS: rewrite of SRB by DICE Group in 2006; currently on version 3.3
- Enterprise iRODS: mission-critical distribution co-developed by RENCI and DICE in 2012
Support:
- Community of developers from groups worldwide
- Independent groups offering consulting, support, and development
- iRODS Consortium offering formal support, training, involvement, and development help
11 iRODS: high-level view
[Diagram labels: Research Community; Research Group A ... Research Group N; Institution A repository; Institution B repository; Archivals; PI data sets; Community data collections; Data Services]
- Unified logical interface to data and metadata resources (single namespace)
- Policy-based management of access
- Policy-driven management of data (replication, deletion, ...)
12 iRODS Key Features
Unified and consistent name space for digital objects
Centralized metadata system
- Tagging, queries, used for process and security controls
Manages digital objects stored in a variety of systems
- NFS, HDFS, S3, DDN WOS, HPSS, ...
- Instantiated via web service call, REST call, SQL query, Hadoop job, ...
Multiple clients and APIs
Policy-enforcing distributed rule engine
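The idea of a single logical name space backed by a central metadata catalog can be sketched in a few lines of Python. This is only an illustration of the concept, not the iRODS catalog or its API: the `DataObject` and `Catalog` classes, the `qc_status` attribute, and all paths are hypothetical names invented for this sketch.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class DataObject:
    logical_path: str
    # Each replica is (storage system, physical location on that system).
    replicas: List[Tuple[str, str]] = field(default_factory=list)
    metadata: Dict[str, str] = field(default_factory=dict)


class Catalog:
    """Toy central catalog: one logical name space over many storage systems."""

    def __init__(self) -> None:
        self._objects: Dict[str, DataObject] = {}

    def register(self, logical_path: str, storage: str, physical: str) -> DataObject:
        obj = self._objects.setdefault(logical_path, DataObject(logical_path))
        obj.replicas.append((storage, physical))
        return obj

    def tag(self, logical_path: str, attribute: str, value: str) -> None:
        self._objects[logical_path].metadata[attribute] = value

    def resolve(self, logical_path: str) -> List[Tuple[str, str]]:
        """Return every physical replica behind one logical name."""
        return list(self._objects[logical_path].replicas)

    def query(self, attribute: str, value: str) -> List[str]:
        """Find logical paths by metadata, independent of where the bytes live."""
        return sorted(
            path for path, obj in self._objects.items()
            if obj.metadata.get(attribute) == value
        )


catalog = Catalog()
catalog.register("/zone/genomics/s1.bam", "nfs", "/mnt/lab/s1.bam")
catalog.register("/zone/genomics/s1.bam", "s3", "bucket/key/s1.bam")
catalog.register("/zone/genomics/s2.bam", "hdfs", "/data/s2.bam")
catalog.tag("/zone/genomics/s1.bam", "qc_status", "passed")
```

Users and tools address `/zone/genomics/s1.bam` only; which storage system serves the bytes is a catalog lookup, and the same metadata can later drive process and security controls.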
13 Principles of policy-driven data management
Relational model from late 60s/early 70s
- Foundation for SQL and RDBMS systems
Policy model
- Foundation for policy-based data management systems
14 Policy-based Data Management
[Diagram. Nodes: Purpose, Collection, Property, Policy, Procedure, Persistent State Information (e.g., QC check run; file integrity validated), Periodic Assessment Criteria. Edge labels: Defines, Controls, Updates, SubType. Source: Reagan Moore]
15 Policy-based Data Management: Collection
[Same diagram, adding under Collection: Digital Object, Attribute]
16 Policy-based Data Management: Collection Properties
[Same diagram, adding properties (edge label: Feature): Integrity, Authenticity, Access control, Completeness, Correctness, Consensus, Consistency]
17 Policy-based Data Management: Collection Policies
[Same diagram, adding policies: Replication, Checksum, Quota, Data Type]
18 Policy-based Data Management: Collection Procedures
[Same diagram, adding procedures (Workflow Chains, Function, Operation): GetUserACL, SetDataType, SetQuota, DataObjRepl, SysChksumDataObj]
19 Policy-based Data Management: Persistent State
[Same diagram, adding persistent state attributes: DATA_ID, DATA_REPL_NUM, DATA_CHECKSUM]
20 Policy-based Data Management: Enforcement
[Same diagram, adding: Client Action, Invokes, Policy Enforcement Point]
21 Policy-based Data Management: Implementation in iRODS
[Same diagram annotated with iRODS counts: Purpose (5 main types: Archive, Data grid, Collection, Digital Library, Processing Pipeline); Policy (11 default); Policy Enforcement Points (70); Clients (50); Micro-service (317); Persistent State Information (338). Example micro-services: msiGetUserACL, msiSetDataType, msiSetQuota, msiDataObjRepl, msiSysChksumDataObj. Source: Reagan Moore]
22 Recap: Policy-Based Data Management
- Purpose: reason a collection is assembled
- Properties: attributes needed to ensure the purpose
- Policies: enforce and maintain collection properties
- Procedures: functions that implement the policies
- Persistent state information: results of applying procedures
- Property assessment criteria: validation that state information conforms to the desired purpose
- Federation: controlled sharing of logical name spaces
These are the necessary elements for collection management
24 Default Policies in iRODS Data Grid
1. Set up a collection and trash directory for each account
2. Set up membership in public account
3. Manage deletion of account
4. Manage renaming of the data grid
5. Manage path permission checking
6. Manage resource quota
7. Manage use of parallel I/O streams for large files
8. Manage selection of default storage location
9. Manage selection of storage location for replication
10. Manage selection of number of processes to use when multitasking
11. Manage selection of physical path name
25 iRODS Rules: defining the policies
Server-side workflows
Action | condition | workflow chain | recovery chain
- Condition: test on any attribute (collection, file name, storage system, file type, user group, elapsed time, IRB approval flag, descriptive metadata)
- Workflow chain: micro-services / rules that are executed at the storage system
- Recovery chain: micro-services / rules that are used to recover from errors
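The four-part rule structure above can be illustrated with a small Python sketch. It is not iRODS rule-language syntax and does not use any iRODS API: the `Rule` class, `apply_rule`, and the sample step functions are hypothetical, and the recovery behavior shown (undo the attempted steps in reverse order) is an assumption made for this sketch.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Context = Dict[str, object]        # attributes of the event being handled
Step = Callable[[Context], None]   # stand-in for a micro-service


@dataclass
class Rule:
    action: str                              # event the rule responds to
    condition: Callable[[Context], bool]     # test on any attribute
    workflow: List[Step] = field(default_factory=list)
    recovery: List[Step] = field(default_factory=list)  # recovery[i] undoes workflow[i]


def apply_rule(rule: Rule, action: str, ctx: Context) -> bool:
    """Run the workflow chain when the rule matches the action and condition.

    If a step raises, run the recovery steps for every attempted step,
    last one first, and report failure.
    """
    if action != rule.action or not rule.condition(ctx):
        return False
    attempted = 0
    try:
        for step in rule.workflow:
            attempted += 1
            step(ctx)
    except Exception:
        for undo in reversed(rule.recovery[:attempted]):
            undo(ctx)
        return False
    return True


# Hypothetical micro-service stand-ins.
def compute_checksum(ctx: Context) -> None:
    ctx["checksum"] = "abc123"


def replicate(ctx: Context) -> None:
    if ctx.get("resource_offline"):
        raise RuntimeError("replica target unavailable")
    ctx["replicas"] = 2


def clear_checksum(ctx: Context) -> None:
    ctx.pop("checksum", None)


def clear_replicas(ctx: Context) -> None:
    ctx.pop("replicas", None)


fastq_rule = Rule(
    action="put",
    condition=lambda ctx: str(ctx["name"]).endswith(".fastq"),
    workflow=[compute_checksum, replicate],
    recovery=[clear_checksum, clear_replicas],
)
```

Calling `apply_rule(fastq_rule, "put", {"name": "s1.fastq"})` runs both steps; the same call with `"resource_offline": True` in the context fails in `replicate` and leaves no checksum behind.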
26 iRODS Micro-Services
Function snippets that wrap a well-defined process
- Compute checksum
- Replicate file
- Integrity check
- Zoom image
- Get tiff image cutout
- Search PubMed
Written in C or Python
- Recovery micro-services to handle failure
- Web services and external applications can be wrapped as micro-services
Can be chained to perform complex tasks
Micro-services invoked by rule engine
27 iRODS Micro-Services
Over 300 published micro-services
Pluggable: write, publish, re-use
28 Example: unified view of data
[Screenshot: iDrop web client]
Spread across: 1) disk storage at UNC, 2) disk storage at RENCI, 3) tape storage at RENCI
29 Example: unified view of data
30 Example: data replication policy
[Diagram labels: UNC Data Center, RENCI Data Center, Isilon, E-iRODS iCAT Server, E-iRODS Server, DDN9900, StorNext Appliance, Tape Library]
Two working copies kept
- For data recovery and to allow analysis at both sites
"Copy me" and "Data copied" metadata control the copy process
- Only on certain files (fastq, finished bam files)
iRODS rule run nightly does the copy
- Performs copy, verifies copy successful, resets "copy me" attribute
Versioning to allow for re-runs of patient samples
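The nightly copy described above (select flagged files, copy, verify, reset the flag) can be sketched in plain Python. This is a stand-in under stated assumptions, not the iRODS rule used at RENCI: local directories play the role of the two data centers, a dictionary plays the role of the metadata catalog, and the attribute names `copy_me` and `data_copied`, the suffix list, and the function names are all hypothetical. Versioning is left out.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path
from typing import Dict, List

REPLICABLE_SUFFIXES = (".fastq", ".bam")  # "only on certain files"


def sha256_of(path: Path) -> str:
    return hashlib.sha256(path.read_bytes()).hexdigest()


def nightly_replication(src_dir: Path, dst_dir: Path,
                        metadata: Dict[str, Dict[str, str]]) -> List[str]:
    """Copy every flagged file, verify the copy, then reset the flag."""
    dst_dir.mkdir(parents=True, exist_ok=True)
    copied: List[str] = []
    for src in sorted(src_dir.iterdir()):
        tags = metadata.get(src.name, {})
        wanted = (tags.get("copy_me") == "true"
                  and src.name.endswith(REPLICABLE_SUFFIXES))
        if not (src.is_file() and wanted):
            continue
        dst = dst_dir / src.name
        shutil.copy2(src, dst)
        if sha256_of(src) != sha256_of(dst):
            dst.unlink()  # discard a bad copy; the flag stays set for the next run
            continue
        tags["copy_me"] = "false"
        tags["data_copied"] = "true"
        copied.append(src.name)
    return copied


# Small demonstration: temporary directories stand in for the two sites.
with tempfile.TemporaryDirectory() as tmp:
    site_a, site_b = Path(tmp, "site_a"), Path(tmp, "site_b")
    site_a.mkdir()
    (site_a / "sample1.fastq").write_text("@read1\nACGT\n+\nIIII\n")
    (site_a / "notes.txt").write_text("not replicated")
    demo_metadata = {
        "sample1.fastq": {"copy_me": "true"},
        "notes.txt": {"copy_me": "true"},
    }
    demo_copied = nightly_replication(site_a, site_b, demo_metadata)
    demo_replica_exists = (site_b / "sample1.fastq").exists()
```

Because the flag is reset only after the checksum comparison succeeds, a failed or partial copy is simply retried on the next nightly run.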
31 Example: data access policy
Challenge
- Millions of files across different projects, growing daily
- Hundreds of users across different labs, changing frequently
How to control access
- UNIX ACLs became too unwieldy
- Moving data means reproducing permission and group settings
Policy: access given if user and data belong to the same groups
- Tag data with group metadata (e.g., Lab X lung tumor study)
- Access rule: user's group must match data group (e.g., user y is a member of Lab X lung tumor study)
Advantage:
- Data group tag generated as part of workflows, automatically
- Data can be moved without breaking permission model
- User-data linkage not based on directory and file names
Thanks to Sai Balu at LCCC
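The group-matching access rule above reduces to a set intersection. The sketch below is a minimal Python illustration, assuming group tags are attached to a data object's identifier rather than to its path; the dictionaries, identifiers, and the `can_access` function are hypothetical and do not reflect how LCCC or iRODS stores this information.

```python
from typing import Dict, Set

# Hypothetical group memberships and data objects.
user_groups: Dict[str, Set[str]] = {
    "user_y": {"Lab X lung tumor study"},
    "user_z": {"Lab Q"},
}
data_objects: Dict[str, dict] = {
    "obj-1": {
        "path": "/projects/labx/sample1.bam",
        "groups": {"Lab X lung tumor study"},  # tag written by the workflow
    },
}


def can_access(user: str, object_id: str) -> bool:
    """Grant access when the user shares at least one group with the data."""
    memberships = user_groups.get(user, set())
    tags = data_objects[object_id]["groups"]
    return bool(memberships & tags)


before_move = can_access("user_y", "obj-1")
# Moving the data changes only its path; the group tag travels with the object.
data_objects["obj-1"]["path"] = "/archive/2013/sample1.bam"
after_move = can_access("user_y", "obj-1")
```

Since the decision never looks at the path, relocating data does not require reproducing permission or group settings.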
32 The Data Life Cycle - Collections
Each data life cycle stage increases the value and usability of the original collection
- Project Collection (Private; Local): Jeff gets data from a sensor
- Data Grid (Shared; Distribution): Jeff shares data with colleagues
- Data Processing Pipeline (Analyzed; Service): together with colleagues, analyzes data and produces results
- Digital Library (Published; Description): results peer-reviewed and published
- Reference Collection (Preserved; Representation): Jeff et al. hit jackpot: collection now accepted as ref collection for decades
- Federation (Sustained; Re-purposing): Hydrology Datagrid grows in value to ecology and biology and federated
33 Lifecycles in an R&D data-driven ecosystem
[Diagram labels, resources: UNC Storage (Tape, Drives), UNC HPC, RENCI Storage (Tape, Drives), RENCI HPC, RENCI Hadoop, Genomics Storage, Genomics HPC, Genomics Hadoop, Lab Machines, IT Machines]
[Diagram labels, actors: Analysts, Data Providers, Automated Processes, External Partners, Developers, IT Staff; spectrum from "Wild West" onward as processes mature]
Control over:
- Data movement and replication
- Metadata standards
- Archival, deletion, and retention
Policies: as much control as needed
iRODS:
- Integration with workflows, Hadoop, databases
- Hiding complexities
- Automation, all policy driven
All while transitioning ad hoc practices to production processes
34 iRODS Clients
APIs: Java, C, C++, Fortran, PHP, Python
General interfaces:
- iCommands: UNIX and Windows command line interface
- iDrop: GUI interface
- iDrop Web: web version of iDrop interface
- Windows browser
- WebDAV
- FUSE
- Parrot
Domain-specific clients:
- Grid tools (GridFTP, SAGA)
- Portals (EngineFrame)
- Web services (VOSpace, irods-rest)
- Workflows (Kepler, Taverna, NCSA Cyberintegrator)
- Digital libraries (DSpace, Fedora)
35 Storage Resources
[Diagram labels: iRODS, UNIX file system, POSIX Driver, Local Cache, Universal Mass Storage System, HPSS, Tivoli Storage Manager, Windows file system, DBO, SQL RDBMS, Microservice Objects, SRB, DDN WOS, Z39.50, HTTP, FTP, Amazon S3, Thredds, HDFS/Hadoop]
36 Pluggable Storage Resources
[Diagram: iRODS Smart Pluggable Resource tree]
- Resource 1 (e.g. high performance drive)
- Resource 2 (e.g. NFS drive)
- Resource 3 (e.g. archive)
  - Resource 3a (cheap array of disks)
  - Resource 3b (tape)
Tree-based approach allows for extending horizontally and vertically
Greater range of customized solutions: hierarchical storage management, load balancing, high availability, tailored interfaces with high performance storage environments, ...
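The tree-based approach above is essentially the composite pattern: interior nodes decide how to route a write, and leaves store the data. The following Python sketch illustrates that idea only. The classes `StorageLeaf`, `ReplicateAll`, and `RoundRobin`, and the choice of which node replicates and which balances load, are assumptions for illustration and are not the iRODS resource plugin API.

```python
from typing import Dict, List


class StorageLeaf:
    """A storage resource that actually holds data."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.objects: Dict[str, bytes] = {}

    def put(self, path: str, data: bytes) -> List[str]:
        self.objects[path] = data
        return [self.name]  # names of leaves that received the data


class ReplicateAll:
    """Interior node: write to every child (high availability)."""

    def __init__(self, name: str, children: list) -> None:
        self.name = name
        self.children = children

    def put(self, path: str, data: bytes) -> List[str]:
        written: List[str] = []
        for child in self.children:
            written.extend(child.put(path, data))
        return written


class RoundRobin:
    """Interior node: send each write to the next child (load balancing)."""

    def __init__(self, name: str, children: list) -> None:
        self.name = name
        self.children = children
        self._next = 0

    def put(self, path: str, data: bytes) -> List[str]:
        child = self.children[self._next % len(self.children)]
        self._next += 1
        return child.put(path, data)


# Tree shaped like the slide: three resources, the third with two children.
archive = ReplicateAll("resource3_archive", [
    StorageLeaf("resource3a_disk_array"),
    StorageLeaf("resource3b_tape"),
])
root = RoundRobin("root", [
    StorageLeaf("resource1_fast_drive"),
    StorageLeaf("resource2_nfs"),
    archive,
])

first = root.put("/zone/a.bam", b"a")
second = root.put("/zone/b.bam", b"b")
third = root.put("/zone/c.bam", b"c")
```

Adding a sibling extends the tree horizontally and nesting another interior node extends it vertically; callers keep using the same `put` call on the root.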
37 Policy-Managed Pluggable Resources
[Diagram labels: E-iRODS Resource Wrapper, Pluggable Resource, Local drive, Remote drive, PEPs, iRODS Rules Engine, Resource-specific rules]
- User develops pluggable resource
- Code inspection allows for autogenerated policy enforcement points (PEPs)
- Grid admin can then develop standard iRODS policy-enforcing rules specific to the resource
Use case example:
- Pluggable resource by default replicates to ensure high availability
- iRODS rule informs resource to turn off high availability on ingested files tagged with Protected Health Information metadata
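The combination of code inspection, autogenerated enforcement points, and a resource-specific rule can be mimicked in Python with the standard `inspect` module. This is a conceptual sketch, not how E-iRODS generates PEPs: the `HighAvailabilityResource` and `PepWrapper` classes, the `pep_<operation>_pre` naming, and the `contains_phi` metadata attribute are all hypothetical.

```python
import inspect
from typing import Callable, Dict


class HighAvailabilityResource:
    """Hypothetical pluggable resource that replicates by default."""

    def __init__(self) -> None:
        self.copies: Dict[str, int] = {}

    def ingest(self, path: str, metadata: Dict[str, str],
               replicate: bool = True) -> None:
        self.copies[path] = 2 if replicate else 1


class PepWrapper:
    """Expose every public method of a resource behind a 'pre' enforcement point."""

    def __init__(self, resource: object,
                 rules: Dict[str, Callable[[dict], None]]) -> None:
        self._rules = rules
        # Inspect the resource and generate one guarded method per operation.
        for name, method in inspect.getmembers(resource, inspect.ismethod):
            if not name.startswith("_"):
                setattr(self, name, self._guard(name, method))

    def _guard(self, name: str, method: Callable) -> Callable:
        signature = inspect.signature(method)

        def guarded(*args, **kwargs):
            bound = signature.bind(*args, **kwargs)
            bound.apply_defaults()
            rule = self._rules.get(f"pep_{name}_pre")
            if rule is not None:
                rule(bound.arguments)  # the rule may rewrite the arguments
            return method(*bound.args, **bound.kwargs)

        return guarded


def no_replication_for_phi(arguments: dict) -> None:
    """Resource-specific rule: keep a single copy of PHI-tagged files."""
    if arguments["metadata"].get("contains_phi") == "true":
        arguments["replicate"] = False


resource = HighAvailabilityResource()
wrapped = PepWrapper(resource, {"pep_ingest_pre": no_replication_for_phi})
wrapped.ingest("/zone/patient_reads.bam", {"contains_phi": "true"})
wrapped.ingest("/zone/reference.fa", {})
```

The resource author writes only `ingest`; the wrapper discovers it by inspection, and the grid administrator supplies the rule separately.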
38 Lifecycles in an R&D ecosystem
[Diagram labels, resources: UNC Storage (Tape, Drives), RENCI Storage (Tape, Drives), Genomics Storage, Lab Machines, UNC HPC, RENCI HPC, RENCI Hadoop, Genomics HPC, Genomics Hadoop, IT Machines]
[Diagram labels, iRODS layer: iRODS policy control; pluggable; NFS, Hadoop, DDN WOS, RDBMS; Programmatic APIs, Data Services, Data Workflows, Web services, iRODS Clients]
[Diagram labels, actors: Analysts, Data Providers, Automated Processes, External Partners, IT Staff; spectrum from "Wild West" onward as processes mature]
39 iRODS in clinical and translational research
40 Secure Medical Workspace
Combines virtualization, endpoint Data Leakage Protection (DLP), and standard security such as use of VPNs, network sniffing, antivirus, group policies, ...
41 Secure Access to Data on the Clinical Side
[Diagram labels: Research Systems, Clinical Systems, Clinician, Researcher, iRODS-enabled samtools, E-iRODS, Portal, Sequence Data, Data Sets, Secure Medical Workspace, NCGenes, EMR, Clinical Studies]
1) Clinician request for sequence reads on patient X
2) Patient ID lookup to obtain subject ID
3) Subject ID lookup in E-iRODS
4) Data sets packaged in zip file and retrieved
5) Data unzipped and displayed within secure workspace
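The five steps above amount to two lookups followed by packaging and unpacking. A minimal Python sketch of that flow, using only the standard library, might look as follows. The in-memory dictionaries stand in for the clinical ID mapping and the E-iRODS catalog, and every identifier, file name, and function name is hypothetical.

```python
import io
import zipfile
from typing import Dict

# Hypothetical stand-ins for the clinical ID mapping and the data catalog.
PATIENT_TO_SUBJECT: Dict[str, str] = {"patient-X": "subject-0042"}
DATA_SETS_BY_SUBJECT: Dict[str, Dict[str, bytes]] = {
    "subject-0042": {"reads_1.fastq": b"@read1\nACGT\n+\nIIII\n"},
}


def fetch_sequence_package(patient_id: str) -> bytes:
    """Steps 1 to 4: resolve the patient, find the data sets, return a zip."""
    subject_id = PATIENT_TO_SUBJECT[patient_id]      # step 2: patient ID to subject ID
    data_sets = DATA_SETS_BY_SUBJECT[subject_id]     # step 3: subject ID to data sets
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for name, payload in data_sets.items():      # step 4: package
            archive.writestr(name, payload)
    return buffer.getvalue()


def unpack_in_workspace(package: bytes) -> Dict[str, bytes]:
    """Step 5: unzip inside the secure workspace (here, in memory only)."""
    with zipfile.ZipFile(io.BytesIO(package)) as archive:
        return {name: archive.read(name) for name in archive.namelist()}


package = fetch_sequence_package("patient-X")        # step 1: clinician request
displayed = unpack_in_workspace(package)
```

Keeping the patient-to-subject mapping separate from the data catalog means the research side only ever sees subject identifiers.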
42 Loosely coupled distributed SMWs
[Diagram labels: DW, Deduce, i2b2, SAS, iRODS client, Research Workspace, iRODS Data Server, iRODS Data Catalogue]
Researchers can access data via local clinical information system (CIS) or as shared files.
Sharing between sites is managed by a combination of CIS federation and data grid middleware (more flexible, less CIS lock-in)
43 Questions?
Migration Scenario: Migrating Backend Processing Pipeline to the AWS Cloud Use case Figure 1: Company C Architecture (Before Migration) Company C is an automobile insurance claim processing company with
More informationWOS for Research. ddn.com. DDN Whitepaper. Utilizing irods to manage collaborative research. 2012 DataDirect Networks. All Rights Reserved.
DDN Whitepaper WOS for Research Utilizing irods to manage collaborative research. 2012 DataDirect Networks. All Rights Reserved. irods and the DDN Web Object Scalar (WOS) Integration irods, an open source
More informationData Grid Landscape And Searching
Or What is SRB Matrix? Data Grid Automation Arun Jagatheesan et al., University of California, San Diego VLDB Workshop on Data Management in Grids Trondheim, Norway, 2-3 September 2005 SDSC Storage Resource
More information2011 FileTek, Inc. All rights reserved. 1 QUESTION
2011 FileTek, Inc. All rights reserved. 1 QUESTION 2011 FileTek, Inc. All rights reserved. 2 HSM - ILM - >>> 2011 FileTek, Inc. All rights reserved. 3 W.O.R.S.E. HOW MANY YEARS 2011 FileTek, Inc. All rights
More informationData grid storage for digital libraries and archives using irods
Data grid storage for digital libraries and archives using irods Mark Hedges, Centre for e-research, King s College London eresearch Australasia, Melbourne, 30 th Sept. 2008 Background: Project History
More informationCommVault Simpana Archive 8.0 Integration Guide
CommVault Simpana Archive 8.0 Integration Guide Data Domain, Inc. 2421 Mission College Boulevard, Santa Clara, CA 95054 866-WE-DDUPE; 408-980-4800 Version 1.0, Revision B September 2, 2009 Copyright 2009