Columbia University Digital Library Architecture
Robert Cartolano, Director
Library Information Technology Office
October 2009

Agenda
- Technology Architecture
- Off-site NYSERNet Facility
- Ingest, curation and support tools
- Academic Commons - DSpace to Fedora Migration

Goal
- Scale up efforts to catalog, digitize & publish to the Web unique, distinctive collection holdings that have significant value for teaching or research.
- Design & implement a coherent & comprehensive preservation program for ensuring survival & continued accessibility of the Libraries' digital content.
- Develop & budget for a long-term digital archiving strategy for content created by the Libraries, whether born-digital or converted from analog formats.
- Collaborate with other stakeholders to develop affordable cooperative solutions to ensure long-term preservation of licensed content.
Source: Libraries Strategic Plan, 2006-2009

Who
Columbia University Libraries/Information Services
- Digital Program and Technology Services
- Center for Digital Research and Scholarship
- Center for New Media Teaching and Learning
- Copyright Advisory Office
- Digital Preservation and Conversion
- Libraries Digital Program Division
- Library Information Technology Office
Columbia University Libraries: http://www.columbia.edu/library/

Technical Team
- Ben Armintor
- Terry Catapano
- Robert Cartolano
- Stephen Davis
- Jack Donovan
- Sarah Holsted
- Risa Karaviotis
- Rebecca Kennison
- Stuart Marquis
- Nada O'Neal
- Alberto Ortiz
- Patricia Renfro
- James Stuart

Questions in 2007
- Build a single consolidated system?
  - Digital Library Collections
  - DSpace Academic Commons (institutional repository)
  - Long-term Archive
- What is storage impact of large-scale digitization?
- How many copies?
- What level of offsite storage?
- What are budget requirements?

Requirements
- Stable, secure storage for large-scale access & long-term preservation
- Support efficient creation & management of administrative, descriptive, structural, preservation & rights metadata
- Support object relationships, actions, behaviors
- Support fine-grained access control policies
- Administrative tools (e.g., statistics, reporting)

Key Decision Points
- Build integrated system to support:
  - Digital Library Collections
  - Academic Commons (institutional repository)
  - Long-term Archive
- Use Fedora 3 as platform
- Two copies on disk, two copies on tape
- Offsite storage
- Scalable storage
- Central IT supported infrastructure
- Sustainable funding

Technology Approach
- Risk averse: use tried and true technologies
- Use mature, commodity products as much as possible
- Choose first-tier vendor (Sun) with sustainable support models, proven reliability, stability
- Open to maximize sustainability and flexibility
- Open Data, Open Formats, Open Source, Open Protocols, Open Community
- Entrance and Exit Strategy
Hancock, Mara, UC Berkeley, New Structures and Efficiencies, Exploring New Potential Collaborations in the Field, http://opencontent.ccnmtl.columbia.edu/presentations/mhancock.html

Technology
- Storage Technology: Sun Storage Archive Manager (SAM)
  - Policy-based, tiered storage approach
  - Released as open source
  - TAR archive format - no proprietary archive format on disks
  - Commercial support from Sun
- Access Method: NFS access for application use
- Solaris 10, ZFS
  - Highly reliable, scalable file system model
  - Released as open source
  - Commercial support from Sun

Technology
Sun Storage Archive Manager (SAM) Platform
- 70TB effective storage, expandable to 400TB
- Policy-based, tiered storage, commercial support
- Two front-end SAM T5240 Solaris servers
- Tier I disk cache: 9.6 terabytes (TB), expandable to 60TB
- Tier II disk storage: 192TB raw, mirrored for 70TB net storage
- Tier II tape storage: 80TB, expandable to 460TB
- Four copies (2 disk, 2 tape) for preservation
- As open as possible to maximize sustainability

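The disk figures on this slide can be related with a little arithmetic. The sketch below uses only the quoted numbers (192TB raw, 70TB net, 400TB expansion ceiling) and assumes that "mirrored" means two equal disk copies. The slide does not say where the remaining capacity goes, so the code only reports the ratio and does not attribute it to any particular overhead.

```python
# Figures quoted on the slide, in terabytes.
RAW_TIER2_DISK_TB = 192.0
NET_TIER2_DISK_TB = 70.0
EXPANDED_NET_TB = 400.0

# Assumption: "mirrored" means two equal disk copies.
MIRROR_COPIES = 2


def raw_per_copy(raw_tb: float, copies: int) -> float:
    """Raw capacity available to each mirrored copy."""
    return raw_tb / copies


def usable_fraction(net_tb: float, raw_per_copy_tb: float) -> float:
    """Share of one copy's raw capacity that ends up as net storage."""
    return net_tb / raw_per_copy_tb


per_copy_tb = raw_per_copy(RAW_TIER2_DISK_TB, MIRROR_COPIES)
fraction = usable_fraction(NET_TIER2_DISK_TB, per_copy_tb)
growth_factor = EXPANDED_NET_TB / NET_TIER2_DISK_TB

print(f"Raw capacity per mirrored copy: {per_copy_tb:.0f} TB")
print(f"Net/raw ratio per copy: {fraction:.1%}")
print(f"Expansion headroom: {growth_factor:.1f}x")
```

Under these assumptions each mirrored copy has 96TB raw, of which roughly 73% is reported as net storage, and the platform can grow to a little under six times its initial effective size.
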
Storage Architecture (diagram)
- Columbia Digital Library Applications, Columbia Academic Commons Applications
- Fedora Servers
- SAM Servers
- Disk Cache
- Copy 1: On-Site Disk, Campus Data Center
- Copy 2: On-Site Tape, Campus Data Center
- Copy 3: Off-Site Disk, NYSERNet Data Center
- Copy 4: Off-Site Tape, Off-line, Off-Site Facility

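The four-copy layout in this diagram (on-site disk, on-site tape, off-site disk, off-site off-line tape) can be written down as data and checked against the "two copies on disk, two copies on tape" decision. This is an illustrative model only: the `Copy` class and `meets_policy` function are hypothetical names, not part of SAM or Fedora, and the off-site requirement in the check is an assumption drawn from the "Offsite storage" decision point.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Copy:
    number: int
    medium: str    # "disk" or "tape"
    site: str      # "on-site" or "off-site"
    location: str


# The four copies as labeled in the storage architecture diagram.
COPIES = [
    Copy(1, "disk", "on-site", "Campus Data Center"),
    Copy(2, "tape", "on-site", "Campus Data Center"),
    Copy(3, "disk", "off-site", "NYSERNet Data Center"),
    Copy(4, "tape", "off-site", "Off-line, Off-Site Facility"),
]


def meets_policy(copies: Iterable[Copy], disk: int = 2, tape: int = 2) -> bool:
    """True if there are enough disk and tape copies and at least one is off-site."""
    copies = list(copies)
    media = Counter(c.medium for c in copies)
    has_offsite = any(c.site == "off-site" for c in copies)
    return media["disk"] >= disk and media["tape"] >= tape and has_offsite


print(meets_policy(COPIES))      # all four copies present
print(meets_policy(COPIES[:2]))  # only the on-site copies
```
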
Offsite: NYSERNet Data Center
- Colocation facility in Syracuse, NY
- 24x7x365 support
- Battery/Diesel Backup
- Dual-Power Grid
- High Speed Fiber (adj. to NYSERNet POP)
- New Machine Room
NYSERNet Data Center Overview: http://www.nysernet.org/services/bcc/

Storage Architecture (network diagram)
- Columbia Data Center, New York, NY
- Campus Private Network, 1 Gigabit/sec
- NYSERNet Data Center, Syracuse, New York
- Sun T5240 SAM Metadata Servers
- 10TB FC Disk Storage
- Copy 1: 70TB SATA Onsite Disk Storage
- Copy 2: 80TB Onsite Tape
- Copy 3: 70TB SATA Offsite Disk Storage
- Copy 4: Offline Tape To Offsite Facility

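The 1 Gigabit/sec link and the 70TB disk copies give a rough sense of how long a full replication would take. The sketch below assumes the link carries the off-site copy traffic, uses decimal terabytes, and treats link utilization as a free parameter; none of these details are stated on the slide.

```python
def transfer_days(size_tb: float, link_gbps: float = 1.0, utilization: float = 1.0) -> float:
    """Days needed to move size_tb terabytes over a link of link_gbps gigabits/sec.

    utilization is the fraction of the link actually available (0 < utilization <= 1).
    """
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    bits = size_tb * 1e12 * 8                      # decimal terabytes to bits
    seconds = bits / (link_gbps * 1e9 * utilization)
    return seconds / 86_400                        # seconds per day


full_copy = transfer_days(70)                      # whole 70TB disk copy, ideal link
half_link = transfer_days(70, utilization=0.5)     # same copy at 50% utilization
print(f"70 TB at 1 Gbit/s: {full_copy:.1f} days")
print(f"70 TB at 50% utilization: {half_link:.1f} days")
```

Even on an ideal link, a full 70TB copy takes about six and a half days, so incremental replication matters more than raw link speed once the archive is populated.
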
Fedora Platform and Library/Information Services (diagram)
- Public facing: Columbia Digital Library, Columbia Academic Commons
- Fedora Asset Repository & Long-Term Archive
- Library facing: Internal Workflow Management Systems, Internal Data Management Systems

Fedora Platform and Library/Information Services (diagram, with examples)
- Columbia Digital Library (public facing). For example: digitized collections, born-digital collections (e.g., spatial data), online exhibitions
- Columbia Academic Commons (public facing). For example: Columbia-produced content, rich collaboration spaces, faculty profiles
- Fedora Asset Repository & Long-Term Archive
- Internal Workflow Management Systems (library facing). For example: Hypatia, batch ingest tools, digitization workflow, preservation workflow, online exhibition workflow
- Internal Data Management Systems (library facing). For example: backup, data migration, SAM-FS

Open Source Benefit
- Columbia Library/IS staff added capability to Fedora to accept content via a locally mounted file system
- Provides better integration with SAM-FS
- Patch to be incorporated into Fedora 3.3
- Benefit of the open source approach: make changes to meet local requirements, with benefit to the larger community
FCREPO-453 - Allow the retrieval of content via the file URI scheme: https://fedora-commons.org/jira/browse/fcrepo-453

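To illustrate what a file URI for content on a locally mounted file system looks like, here is a minimal sketch using only the Python standard library. It does not show Fedora's own API or the FCREPO-453 patch; the `/samfs/...` mount path and the two helper functions are hypothetical, chosen only to show the mapping between a path and a `file:` URI.

```python
from urllib.parse import quote, unquote, urlparse


def path_to_file_uri(path: str) -> str:
    """Build a file: URI from an absolute POSIX path, percent-encoding as needed."""
    if not path.startswith("/"):
        raise ValueError("path must be absolute")
    return "file://" + quote(path)


def file_uri_to_path(uri: str) -> str:
    """Recover the local path from a file: URI that names the local host."""
    parsed = urlparse(uri)
    if parsed.scheme != "file":
        raise ValueError("not a file URI")
    if parsed.netloc not in ("", "localhost"):
        raise ValueError("URI refers to a remote host")
    return unquote(parsed.path)


# Hypothetical path on a SAM-FS mount; the real mount point is not given in the slides.
local_path = "/samfs/archive/ldpd 0001.tif"
uri = path_to_file_uri(local_path)
print(uri)
print(file_uri_to_path(uri))
```

A repository that accepts such URIs can reference content already sitting on the archive file system instead of receiving it over HTTP, which is the kind of integration the slide describes.
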
Progress To Date: July through December 2008
- Purchased hardware
- Completed initial hardware installation
- Sun professional services, training
- Finalized Fedora software plan and configuration
- Completed initial Fedora installation
- Evaluated multiple tools
- Revised technology roadmap

Progress To Date: January through September 2009
- Implemented Academic Commons in Fedora
- Migrated from DSpace to Academic Commons
- Built Hypatia cataloging tool for Columbia needs
- Designing metadata and content models in coordination with Cornell
- Inventoried and stabilized legacy digital content from hard drives and CDs to staging storage (approx. 8 terabytes)
- Began metadata remediation of legacy digital assets
- Began batch ingest of Digital Collections for long-term archive
- Developed initial requirements for long-term archive

Curation (chart)
- Curation levels: None, Automated, Self-Service Cataloging, Assisted Cataloging, Full-Service Cataloging
- Scales: Ingest Rate (Fast to Slow), Complexity (Simple to Complex), Cost ($ to $$$), Effort (Low to High)
- Services plotted: Academic Commons, Digital Library
Derived from: Goble, Carole, University of Manchester, Curating Services and Workflows: the Good, the Bad and the Ugly, http://www.slideshare.net/carolegoble/dcc-keynote-2007/

Hypatia - Ingest Tool
- Developed by Columbia Library/IS staff
- Enables non-programmers to create input forms and workflows for metadata schemas and then catalog items in a secure, controlled environment.
- Supports multiple projects, collections, and workflows, with secure, granular access controls, in "Hypatia Spaces."
- Initial support for Academic Commons:
  - Assisted curation support for Library/IS staff
  - Faculty self-service deposit

Hypatia - Ingest Tool (workflow diagram): Describe, Deposit, Disseminate
- Work Spaces. Users: Describe, Upload, Delete
- Users: Submit / Withdraw
- Review. Admin: complete metadata record, validate bitstream, verify rights, validate fields
- Admin: Reject with Comments
- Admin: Approve / Un-Approve
- Approve. Admin: Verify record, assign IDs + collection + export template
- Admin: Commits
- Fedora. Automatic: Email notification to user

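The deposit workflow in this diagram (users submit or withdraw; admins reject with comments, approve or un-approve, and commit to Fedora) can be modeled as a small state machine. The sketch below is one reading of the diagram, not Hypatia's implementation: the state names, the `TRANSITIONS` table, and the `advance` function are all hypothetical, and the exact source and target of each arrow are assumptions.

```python
from enum import Enum


class State(Enum):
    WORK_SPACE = "work space"
    REVIEW = "review"
    APPROVED = "approved"
    COMMITTED = "committed to Fedora"


# (current state, role, action) -> next state; an assumed reading of the diagram.
TRANSITIONS = {
    (State.WORK_SPACE, "user", "submit"): State.REVIEW,
    (State.REVIEW, "user", "withdraw"): State.WORK_SPACE,
    (State.REVIEW, "admin", "reject"): State.WORK_SPACE,
    (State.REVIEW, "admin", "approve"): State.APPROVED,
    (State.APPROVED, "admin", "un-approve"): State.REVIEW,
    (State.APPROVED, "admin", "commit"): State.COMMITTED,
}


def advance(state: State, role: str, action: str) -> State:
    """Return the next state, or raise ValueError if the move is not allowed."""
    try:
        return TRANSITIONS[(state, role, action)]
    except KeyError:
        raise ValueError(f"{role} cannot {action} an item in state {state.value!r}") from None


# A deposit that is submitted, approved, and committed.
state = State.WORK_SPACE
for role, action in [("user", "submit"), ("admin", "approve"), ("admin", "commit")]:
    state = advance(state, role, action)
print(state.value)
```

Keeping the allowed moves in one table makes the role checks explicit: a user cannot approve, and nothing leaves the committed state.
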
Academic Commons
http://academiccommons.columbia.edu
- Migrated from DSpace to Fedora for Fall 2009
- Open access to DSpace and Fedora backends very helpful
- Custom export/import code written to move data in exactly the format we wanted to meet Columbia metadata requirements
- Multiple iterations and extensive testing
- Match existing public interface with enhancements
- Provides foundation to rapidly increase deposits

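The slides do not describe how the migration was tested. One common check in an export/import migration, shown here purely as an assumption, is to compare checksums of each item's content before and after the move. The item identifiers and the `find_mismatches` function are hypothetical, and in-memory byte strings stand in for the exported and imported files.

```python
import hashlib
from typing import Dict, List


def sha256_hex(data: bytes) -> str:
    """Hex SHA-256 digest of a byte string."""
    return hashlib.sha256(data).hexdigest()


def find_mismatches(exported: Dict[str, bytes], imported: Dict[str, bytes]) -> List[str]:
    """Return IDs of exported items that are missing or altered after import."""
    problems = []
    for item_id, payload in exported.items():
        migrated = imported.get(item_id)
        if migrated is None or sha256_hex(migrated) != sha256_hex(payload):
            problems.append(item_id)
    return sorted(problems)


# Hypothetical items: one intact, one altered, one missing after import.
exported = {"ac:1": b"thesis.pdf bytes", "ac:2": b"dataset bytes", "ac:3": b"article bytes"}
imported = {"ac:1": b"thesis.pdf bytes", "ac:2": b"dataset bytes (truncated)"}

mismatches = find_mismatches(exported, imported)
print(mismatches)
```

Running a check like this after each iteration gives a concrete list of items to re-export, which fits an approach built on multiple iterations and extensive testing.
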
Next Steps
Application Development
- Develop and implement staff digital collection viewer
- Continue Academic Commons development
- Continue Hypatia development
- Expand to support additional media types (e.g., video)
Research and Development
- Investigate Fedora administration tools
- Investigate media servers (e.g., djatoka JPEG2000 server)
- Develop broad strategy for persistent identifiers (e.g., handles)
- Investigate advanced search and discovery for Academic Commons (e.g., Blacklight evaluation)

Final Thoughts
- Team collaboration across disparate groups
- Technology platform is working well for our needs
- Success with Fedora 3 platform
- Importance of communications and awareness building