Adding Intelligence to Conventional NAS and File Systems: Metadata, Backups, and Data Life Cycle Management PASIG May 12, 2012 Presented by: Jacob Farmer, CTO Cambridge Computer Copyright 2009-2011, Cambridge Computer Services, Inc. All Rights Reserved www.cambridgecomputer.com 781-250-3000
My Background and My Company Jacob Farmer, CTO, Cambridge Computer 25 years experience with data storage My company: Cambridge Computer Founded in 1991 (20 years this July) Roughly 70 people, spread around the country Expertise in data storage Unusual business model like a broker or agent We help our clients select and deploy the appropriate storage technologies There are typically no fees or additional costs to our service Popular business model for higher education and research! 2
Who Are My Clients: People Who Like Free Help and Special Deals Universities Research institutions Independent labs Divisions in the big government labs Libraries, museums, cultural institutions Some government agencies Industry Manufacturing Pharmaceutical Finance Healthcare Oil and gas Etc. 3
Focus Areas for Novel Ways of Managing Data Scientific research, in particular Life Sciences University labs Independent labs Research divisions in pharmaceutical Digital asset management Especially with home grown software applications Especially for institutions with multiple stove-piped DAM systems 4
Our Work: Defining and Refining Use Cases for SRB and IRODS The Cambridge Computer team is working with SRB (Storage Resource Broker) and IRODS (Integrated Rules-Oriented Data System) to solve common storage management problems: Backup, Life Cycle Management, Collaboration Our goal is to make these platforms easy to deploy and to solve the low-hanging-fruit problems first. We are looking for collaborators, potential guinea pigs, and general feedback. Ultimately (later this year or next) we are looking for customers! 5
SRB / IRODS History 1995/97 Storage Resource Broker (SRB) developed and deployed at San Diego Supercomputer Center DICE Group Data Intensive Cyber Infrastructure Academic license used by roughly 200 government and academic applications 2001 SRB forks into commercial version (Nirvana Storage) developed by General Atomics Available as commercial software 10 years of commercial-grade development and deployment 2008 IRODS replaces the Academic SRB (Integrated Rules Oriented Data System) Features integrated rules engine Open source under Berkeley License 6
Fundamental Concepts of IRODS Inventory your files by crawling the file system and making a database entry for each file and directory Associate metadata with directories Apply storage management rules based on file system and extended metadata. Federate storage devices and user directories with a virtual global file system Rules engine runs in real time or in batch. Micro-services: routines that are called by the rules engine. Make them simple and discrete. Run a bunch of small micro-services together to carry out the full suite of functionality that you seek. Replace or update functionality at a micro-level 7
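The inventory-then-rules idea on this slide can be sketched in a few lines. This is an illustration in plain Python with SQLite, not the iRODS API: the table layout, the function names (`build_inventory`, `run_batch_rule`, `ms_flag_empty`, `ms_flag_large`), and the 1024-byte threshold are all assumptions made up for the example.

```python
import os
import sqlite3
import tempfile


def build_inventory(root, conn):
    """Crawl `root` and record one database row per file and directory."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS inventory ("
        "path TEXT PRIMARY KEY, is_dir INTEGER, size INTEGER, mtime REAL)"
    )
    rows = []
    for dirpath, _dirnames, filenames in os.walk(root):
        st = os.stat(dirpath)
        rows.append((dirpath, 1, 0, st.st_mtime))  # directories stored with size 0
        for name in filenames:
            full = os.path.join(dirpath, name)
            st = os.stat(full)
            rows.append((full, 0, st.st_size, st.st_mtime))
    conn.executemany("INSERT OR REPLACE INTO inventory VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    return len(rows)


# Hypothetical micro-services: each one does a single small, discrete job.
def ms_flag_empty(path, is_dir, size, actions):
    if not is_dir and size == 0:
        actions.append(("empty_file", path))


def ms_flag_large(path, is_dir, size, actions, limit=1024):
    if not is_dir and size > limit:
        actions.append(("large_file", path))


def run_batch_rule(conn, microservices):
    """Batch-mode rule: run every micro-service against every inventory row."""
    actions = []
    query = "SELECT path, is_dir, size FROM inventory"
    for path, is_dir, size in conn.execute(query):
        for microservice in microservices:
            microservice(path, is_dir, size, actions)
    return actions


def demo():
    with tempfile.TemporaryDirectory() as root:
        os.mkdir(os.path.join(root, "raw"))
        with open(os.path.join(root, "raw", "run23.dat"), "wb") as f:
            f.write(b"x" * 2048)
        open(os.path.join(root, "empty.txt"), "w").close()
        conn = sqlite3.connect(":memory:")
        count = build_inventory(root, conn)
        actions = run_batch_rule(conn, [ms_flag_empty, ms_flag_large])
        conn.close()
        return count, sorted(kind for kind, _path in actions)


if __name__ == "__main__":
    print(demo())
```

Swapping one micro-service for another changes behavior without touching the crawler or the rule runner, which is the point of keeping each routine small.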
What Can You Do with IRODS? Everything and nothing!!! It is not an application. It is middleware It is a framework for how to manage data It is not commercial grade software The core of the system is well-written It is not fully documented It is missing features that are critical for most enterprise IT shops The grant money pays for new ideas and new features, not for refining code or adding ho-hum features. 8
Common Pain Points In Storage Management for Research Data Migrating files between storage systems Data Protection (Backup, Replication, Data Integrity) Satisfying the NSF Requirement for Data Management Plans Separating important data from not-so-important data and ensuring preservation of important data Finding data: Machines, Users, Applications Especially after it moves Disposing of data Collaboration Cost: leveraging lower-cost storage devices 9
Problem: Conventional File System Metadata is Insufficiently Descriptive Problem -- Conventional file systems are not descriptive enough for defining policies or for describing data beyond a single individual's memory. Solution -- Associate descriptive metadata with files, and apply data management policies based on that metadata. \\myserver\mydirectory\stuff\ \\myserver\mydirectory\copy_of_stuff\ \\myserver\mydirectory\more_stuff_do_not_delete\ \\myserver\mydirectory\yet_more_stuff_save_for_comparison\ \\myserver\mydirectory\raw_results_from_experiment3_run23_march-10\ 10
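A minimal sketch of the proposed solution, using three of the example paths from this slide. The metadata fields (`kind`, `owner`, `experiment`, `run`), their values, and the policy names are hypothetical; the point is only that the policy reads the metadata, not the directory name.

```python
# Hypothetical catalog: descriptive metadata keyed by path.
# Paths come from the slide; every metadata value is invented for illustration.
CATALOG = {
    r"\\myserver\mydirectory\stuff": {},
    r"\\myserver\mydirectory\copy_of_stuff": {
        "kind": "duplicate",
        "owner": "unknown",
    },
    r"\\myserver\mydirectory\raw_results_from_experiment3_run23_march-10": {
        "kind": "raw_results",
        "experiment": 3,
        "run": 23,
    },
}


def policy_for(meta):
    """Pick a data management policy from metadata alone."""
    kind = meta.get("kind")
    if kind == "raw_results":
        return "replicate_and_preserve"
    if kind == "duplicate":
        return "candidate_for_disposal"
    # No descriptive metadata: a person has to decide.
    return "review_with_owner"


def plan(catalog):
    """Return {path: policy} for every cataloged path."""
    return {path: policy_for(meta) for path, meta in catalog.items()}


if __name__ == "__main__":
    for path, policy in plan(CATALOG).items():
        print(f"{policy:24} {path}")
```

A directory with no metadata falls through to manual review, which is where a plain file system leaves every directory.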
Problem: When Data Moves, Things Break If you move someone else's data, they may never find it again. At the very least they will complain vocally If you move data, you may break essential links to metadata Some content management applications know how to move files or can be updated when files are moved. Others cannot. Often data tracking applications are written by amateur programmers who are 100% focused on algorithms, not data management. Users may have created applications as simple as spreadsheets that reference UNC paths. Often complex files contain hard links to other files or objects 11
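One common way to keep references from breaking is a layer of indirection: users and applications hold a stable logical name, and a catalog maps it to the current physical location. The sketch below assumes that approach; `LogicalCatalog` and the logical name `/lab/exp3/run23` are hypothetical and do not describe how SRB or IRODS implement it.

```python
import os
import shutil
import tempfile


class LogicalCatalog:
    """Maps stable logical names to whatever physical path currently holds the data."""

    def __init__(self):
        self._locations = {}

    def register(self, logical_name, physical_path):
        self._locations[logical_name] = physical_path

    def resolve(self, logical_name):
        return self._locations[logical_name]

    def migrate(self, logical_name, new_dir):
        """Move the file and update the mapping so the logical name stays valid."""
        old_path = self._locations[logical_name]
        os.makedirs(new_dir, exist_ok=True)
        new_path = os.path.join(new_dir, os.path.basename(old_path))
        shutil.move(old_path, new_path)
        self._locations[logical_name] = new_path
        return new_path


def demo():
    with tempfile.TemporaryDirectory() as root:
        tier1 = os.path.join(root, "tier1")
        tier2 = os.path.join(root, "tier2")
        os.mkdir(tier1)
        original = os.path.join(tier1, "run23.dat")
        with open(original, "w") as f:
            f.write("results")

        catalog = LogicalCatalog()
        catalog.register("/lab/exp3/run23", original)
        catalog.migrate("/lab/exp3/run23", tier2)

        # The physical path changed, but the logical name still resolves.
        with open(catalog.resolve("/lab/exp3/run23")) as f:
            content = f.read()
        return content, os.path.exists(original)


if __name__ == "__main__":
    print(demo())
```

A spreadsheet that stores the UNC path directly gets none of this protection; only references that go through the catalog survive the move.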
Problem: How Do You Get the Metadata? Can you extract it from tags embedded in the files? Can you infer it from the way or the frequency that data is used? Can you pick it up at the point of creation? Can you capture it at various stages in a pipeline? Can you get the users to do data entry? How? Can you beat it out of them? (the stick) Can you give them incentives? (the carrot) Some combination of both? 12
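One cheap source of metadata is the naming habits users already have. As a sketch only, the function below infers a few fields from directory names like the ones on the earlier slide; the regular expression, the field names, and the `infer_metadata` helper are assumptions, and real naming conventions would need their own patterns.

```python
import re

# Hypothetical naming convention: "...experiment<N>_run<M>..."
_EXPERIMENT_RUN = re.compile(r"experiment(\d+)_run(\d+)")


def infer_metadata(dirname):
    """Guess metadata from a directory name; return {} when nothing matches."""
    meta = {}
    match = _EXPERIMENT_RUN.search(dirname)
    if match:
        meta["experiment"] = int(match.group(1))
        meta["run"] = int(match.group(2))
    if "do_not_delete" in dirname:
        meta["retain"] = True
    if dirname.startswith("copy_of_"):
        meta["duplicate_of"] = dirname[len("copy_of_"):]
    return meta


if __name__ == "__main__":
    for name in (
        "stuff",
        "copy_of_stuff",
        "more_stuff_do_not_delete",
        "raw_results_from_experiment3_run23_march-10",
    ):
        print(name, "->", infer_metadata(name))
```

Inference like this only goes so far: a name such as `stuff` yields nothing, which is where data entry by users comes in.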
File System Middleware 13
Typical Content Management Stack 14
Typical Content Management Stack with Conventional Data Protection 15
Inserting File System Middleware 16
Where Do We Live: In-Band or Out-of-Band? In-Band: If the solution sits in the data path it will introduce latency. This is okay for: Archiving solutions Desktop file access WAN file access where WAN latency dwarfs the virtualization layer's latency Other applications where performance does not matter Out-of-Band: No impact on performance, but Some lag time for the system to synchronize Lots of file system crawling Need really slick user interfaces to entice users to embrace the system. Need some kind of carrot/stick mechanisms to get users to do your bidding 17
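The out-of-band trade-off (no latency in the data path, but lag and repeated crawling) can be sketched as a periodic snapshot-and-compare loop. This is an assumed design for illustration, not a description of any product: `snapshot` and `diff` are hypothetical helpers, and changes are detected only from file size and modification time.

```python
import os
import tempfile


def snapshot(root):
    """Crawl `root` and record (size, mtime_ns) for every file, keyed by relative path."""
    state = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            st = os.stat(full)
            state[os.path.relpath(full, root)] = (st.st_size, st.st_mtime_ns)
    return state


def diff(old, new):
    """Compare two snapshots; anything that happened between crawls shows up here."""
    added = sorted(new.keys() - old.keys())
    removed = sorted(old.keys() - new.keys())
    changed = sorted(p for p in new.keys() & old.keys() if new[p] != old[p])
    return added, removed, changed


def demo():
    with tempfile.TemporaryDirectory() as root:
        for name in ("a.txt", "b.txt"):
            with open(os.path.join(root, name), "w") as f:
                f.write("v1")
        before = snapshot(root)

        # Users keep working directly against the file system between crawls.
        with open(os.path.join(root, "a.txt"), "a") as f:
            f.write(" plus more")
        os.remove(os.path.join(root, "b.txt"))
        with open(os.path.join(root, "c.txt"), "w") as f:
            f.write("new")

        after = snapshot(root)
        return diff(before, after)


if __name__ == "__main__":
    print(demo())
```

The catalog is only as fresh as the last crawl, and every crawl touches every file: that is the lag time and the file system crawling listed on the slide.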
Where Do We Sit: In-Band or Out-of-Band? Our initial goal is to sit outside of the data path Unobtrusive If our product breaks, we don't take systems down Quality assurance is also a lot easier Research computing will not tolerate in-band latency Someday we hope to sit in the data path FUSE NFS/CIFS Ideally, we would be a hybrid of both 18