
The Odum Institute Background Paper
International Data Technology Alliance Workshop, July 2009
Version: 7/01/2009

Introduction

This background paper outlines the data archiving operation and technical schematics for the Odum Institute Social Science Data Archive and Information Technology Services. Part A gives an operational overview of the data archive. Part B describes the data and software infrastructure. Part C lists the relevant staff and their skills, which indicate the scale of the support.

Part A Odum Archive Operational Overview

Key Concepts: Survey data, panel data, qualitative data, longitudinal data, electoral data, subarchives, thematic archives, data curation, OAIS model, workflows, search, textual analysis, GIS, spatial analysis, data collections services, computational analysis, federated

A.1. General Overview

Figure 1 below presents a functional decomposition of the services provided by the Odum Archives and Information Technology group. Our general overview is much the same as that of the Australian Social Science Data Archive in that there are three major components: Tools and Services, Archives, and Specialty Archives. Currently our services support the common social science data types, as described in the next section, but we anticipate a shift to more graphical and visual types of data in the very near future. Our core archival functions are supported in large part by the Dataverse Network, which is detailed in Part B. In addition to this technology we provide numerous other services throughout the life cycle of the social science research process. These range from research planning and data collection to data analysis and data archiving.

Figure 1

A.2. Data Types

Briefly, these are:

Survey data - these data are the products of surveys of individuals. The data files are usually organized as one observation per individual, although some data (census type particularly) can be organized hierarchically. These data almost always include personal and demographic characteristics of the individual and/or household. They are commonly stored either in raw form (ASCII text) or as portable system files (SPSS portable or SAS export). The Dataverse Network software is used to access, download, subset, and analyze these data. Examples include Harris public opinion data, National Network of State Polling data, and North Carolina vital statistics data.

Panel data - these data are gathered as time series data. Individuals or households are interviewed at various points in time, and the data are stored either in separate datasets for each time period (one record per individual per interview) or in a single hierarchical dataset (one observation per individual per interview for each time period, i.e. multiple observations per individual). These data are stored and accessed as above. The Computer Assisted Panel Study done here at the Odum Institute is an example of this type of data.

Text data - these data consist of ASCII text files. This type of data is usually manipulated with text analysis software like Atlas.ti or NVivo. The Dataverse Network software can be used to access and download these types of data, but the analysis is done outside the DVN. The UNC-Poynter 1996 National Election Study, which has a large collection of newspaper stories, is an example of this type of data.

A.3. Specialty Archives

The specialty archives shown in Figure 1 are largely topic- or discipline-specific collections. In general these combine data sets that can take any, or all, of the above forms. There are several reasons for establishing a specialty archive: a large, coherent collection of data sets exists on a specialist topic and serves a specific research group or network; the collection has restrictions or protocols that users must sign up to; or access to the data requires a specific tool for download and/or on-line analysis.

A.4. Relationship of the Odum Archive workflow to OAIS Reference Model

Figure 2 below illustrates the current Odum archive workflow in OAIS model terminology. This model was adopted and modified from work at the Consultative Committee for Space Data Systems (CCSDS). Part B provides more technical detail on how it has been implemented in the Odum Archive environment.

Figure 2

A.5. Data Curation and Data Discovery Software

The Odum Institute's data curation software is a combination of the Dataverse Network catalog and custom tools. The Dataverse Network (DVN) serves as a catalog for Odum's own studies as well as those received from faculty and students at the University of North Carolina at Chapel Hill. The DVN also serves as a catalog for thousands of studies from partner organizations such as the National Archives and Records Administration and the Institute for Quantitative Social Science at Harvard University. Figure 3 gives an overview of this federated environment.

Upon ingest of a study, the Odum Institute assembles the data files that need to accompany it. These include SPSS and SAS files as well as PDF files of any codebooks or questionnaires that accompany the data. This allows files to be analyzed and subsetted in the Dataverse. The Odum Institute asks producers for data in the traditional social science data file types but will also convert files into the appropriate SAS and SPSS formats when needed. In addition, the institute uses SQL to fix variable labels in order to ensure that the entire label appears when the data file is downloaded. Finally, Dataverse data files as well as the original files are downloaded to Odum file systems as backup and preservation copies.

The search capabilities of the Dataverse Network allow users to search across several networks throughout the country. As Figure 3 shows, the DVN catalogs studies from five major organizations as well as individual researchers, and it offers federated searching across these DVNs. Users can search the DVN by cataloging information, title, author, study ID, variable information, and most of the other metadata fields, including but not limited to geographic coverage and unit, dates, kind of data, producer and distributor of the data, and the abstract.

Most of these metadata fields, such as author and title, are self-explanatory; those that are not are defined here. The default search is cataloging information, which essentially functions as the "search all" option in the DVN: it searches all fields for the term entered in the search box. The study ID search looks for a specific study ID, which is assigned to every study on ingest into the DVN. The variable information search looks in the variable name and description fields of the studies. Advanced searching allows the user to search any of the metadata fields in the studies across the Dataverse Network to get more specific results.
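The three search modes described above can be illustrated with a minimal sketch. This is not the DVN's Lucene-based implementation; the `Study` class, its field names, and the `search` function are hypothetical, and treating the cataloging search as covering variable names and descriptions as well is an assumption made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Study:
    # Hypothetical, simplified catalog record (not the DVN schema).
    study_id: str
    title: str
    author: str
    abstract: str = ""
    variables: dict = field(default_factory=dict)  # variable name -> description


def _contains(text, term):
    """Case-insensitive substring match."""
    return term.lower() in text.lower()


def search(studies, term, mode="cataloging"):
    """Return the IDs of studies matching `term` under the given search mode."""
    hits = []
    for s in studies:
        if mode == "study_id":
            # Exact match on the identifier assigned at ingest.
            matched = s.study_id == term
        elif mode == "variable":
            # Only variable names and descriptions are searched.
            matched = any(
                _contains(name, term) or _contains(desc, term)
                for name, desc in s.variables.items()
            )
        elif mode == "cataloging":
            # Assumed "search all" behavior: every field is searched.
            fields = [s.study_id, s.title, s.author, s.abstract]
            fields += list(s.variables.keys()) + list(s.variables.values())
            matched = any(_contains(f, term) for f in fields)
        else:
            raise ValueError(f"unknown search mode: {mode}")
        if matched:
            hits.append(s.study_id)
    return hits
```

For example, searching a small in-memory catalog for "carolina" in the default mode returns every study whose title, author, abstract, or variables mention the term, while `mode="variable"` restricts the match to variable names and descriptions.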

Figure 3

A.6. Qualitative Analysis Resources

The Odum Institute provides access to several software packages that allow researchers to analyze textual, graphical, video, and audio data. These packages (QSR NVivo, ATLAS.ti, MAXQDA, and QDA Miner) allow users to code large amounts of data and then search the coded text using Boolean logic. Currently these resources are provided outside of the Odum archive technical infrastructure, by our hands-on computing labs. The software also allows researchers to create diagrams and tables showing relationships among codes, such as co-occurrences and sequential links. Users can also ask questions based on demographics (e.g., year, county), thereby combining conceptual searches with demographic queries. Finally, the software provides sophisticated tools for managing memos, supporting teamwork, and merging datasets and coded data.

A.7. Spatial Analysis of Data

The Odum Institute maintains a consulting service, staffed by advanced graduate students who are on call in the Odum Institute computer laboratories to answer questions regarding GIS software and related analysis. In addition, the staff of the Odum Spatial Analysis group has more advanced expertise in GIS-related analysis. The software available in the GIS Lab is listed below.

ArcGIS Desktop 9.3 SP1 (ArcInfo): ArcCatalog, ArcGlobe, ArcMap, ArcReader, ArcScene
ArcView 3.3 (with Network Analyst and Spatial Analyst, nothing else)
Erdas IMAGINE 9.1
CrimeStat III
GeoDa 0.9.5-i
Geographically Weighted Regression 3 (GWR)
Google Earth
SaTScan 8.0
SpaceStat

A.8. Computational Analysis and Modelling support

Built into the Dataverse Network system is an online analytical engine. Users can quickly examine common descriptive statistics as well as perform more complex statistical analysis. Behind this service is an R statistical server that performs online analysis of the selected data. In addition, the R server can pull open source statistical routines from the R Zelig collection of programs. These routines can be customized and submitted to Zelig, and they are then systematically pushed to other R servers that use the Zelig routines.

In addition to the Web-based online analytical system, the Odum Institute leverages numerous in-person and virtual resources for computational analysis. In two computing labs, the Institute makes available approximately forty seats with quantitative analysis software (such as SAS, SPSS, Stata, MATLAB, Mplus, R, and other applications), GIS software (including ArcGIS Desktop, Google Earth, and others), and qualitative text analysis software (for instance, Atlas.ti, NVivo, and MAXqda). A number of highly experienced research professionals and graduate students assist users with their problems.

For many tasks, however, a single workstation is insufficient in terms of processing capacity and the memory available in a 32-bit architecture. The Institute has partnered with the Renaissance Computing Institute (RENCI) and the university's Research Computing group to provide access and resources to the new Tar Heel Grid. The Tar Heel Grid uses the Condor Project software to match jobs to idle computers for high throughput computing (HTC). For jobs that are not suitable for this environment (jobs that are highly coupled or non-serializable), the university's Research Computing group provides a number of Load Sharing Facility (LSF) clusters for high performance computing needs.
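As an illustration of the kind of work that suits high throughput computing, the sketch below builds a Condor submit description for a batch of independent runs of the same script. It is only a sketch: the function name, the executable path, and the script name are hypothetical, and the actual Tar Heel Grid configuration may require other settings. Only standard submit-description commands (`universe`, `executable`, `arguments`, `output`, `error`, `log`, `queue`) are used.

```python
def condor_submit_description(executable, script, n_jobs, universe="vanilla"):
    """Build the text of a Condor submit description for n_jobs independent runs.

    Each run receives its own index through the $(Process) macro, so the
    runs do not need to communicate with one another.
    """
    lines = [
        f"universe = {universe}",
        f"executable = {executable}",
        f"arguments = {script} $(Process)",
        "output = run.$(Process).out",   # one stdout file per run
        "error = run.$(Process).err",    # one stderr file per run
        "log = run.log",                 # shared event log for the batch
        f"queue {n_jobs}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical example: 50 independent bootstrap replications in R.
    print(condor_submit_description("/usr/bin/Rscript", "bootstrap.R", 50))
```

Because every queued run is independent, the scheduler is free to place each one on whichever idle machine becomes available. Tightly coupled jobs do not decompose this way, which is why they are directed to the clusters instead.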

A.9. Data Collection Support

The Odum Institute offers full-service data collection for telephone and Web surveys.

A.9.1. Telephone Survey Data Collection (The Odum Call Center)

We offer full-service data collection for telephone surveys. Our 12-station call center is located in the heart of UNC's main campus. The call center uses state-of-the-art CATI (computer assisted telephone interviewing) technology. The interviewing stations are networked to a central server, and all use Blaise software for interviewing, case management, and automated call scheduling. A silent monitoring system allows supervisors to unobtrusively monitor ongoing interviews for quality control and training purposes.

The Odum Institute has been conducting telephone surveys since the 1970s. Best known for studies of public opinion, the Odum Institute conducted the Southern Focus Poll twice yearly for the Atlanta Journal-Constitution from 1990 to 2000. Together with the UNC School of Journalism, the Odum Institute conducted a state-wide public opinion survey (The Carolina Poll) twice yearly from the early 1980s until 2000. Today, we conduct telephone surveys for clients both within and beyond the UNC community on a cost-recovery basis.

A.9.2. Web Survey Data Collection

The Odum Institute offers two types of support for Web survey data collection. Students and researchers who want to develop and administer their own Web surveys are invited to use the Qualtrics.com software free of charge through a software grant from Qualtrics.com. For persons or groups (within and outside UNC-CH) who want someone else to handle data collection, we offer full-service Web survey data collection on a cost-reimbursement basis.

Part B Archive Architecture

Key Concepts: Dataverse Network, Postgres, PgAdmin, Virtual Machines, DDI, iRODS, LOCKSS, Linux, R, Zelig, Apache, Glassfish, Lucene, AWStats, Google Analytics, OAI-PMH, preservation.

Figure 4 below gives a diagrammatic overview of the technical components used to provide the Odum Archive and Information Technology services.

Figure 4

B.1. Odum Archive Systems

The core of the Odum archive is the Dataverse Network (DVN), which is a Glassfish application with a web interface for both administration and end use. When a study is ingested, Dataverse assigns it a unique handle (www.handle.net) identifier. It stores descriptive information (including metadata) in a PostgreSQL database. Although the Dataverse is designed to hide its PostgreSQL back end, Odum staff sometimes access the database directly in order to correct question label issues arising in the ingest process. At Odum, the database and data objects are stored on a RAID-5 hard disk array and dumped to tape backup nightly.

Users can perform searches (implemented by Lucene) and advanced statistical operations (implemented by R and Zelig) from the web interface. This usage is logged both to local files (analysed by AWStats) and to Google Analytics.

Dataverse is designed to harvest metadata from other OAI-PMH providers, including other Dataverse networks. The Odum archive periodically harvests the metadata from the NARA, IQSS, ICPSR and Roper archives. It also provides an interface to respond to OAI requests. This OAI handler exposes metadata to other applications, including the TPAP and SSP prototype preservation environments described below.

The Odum Institute's Dataverse Network resides on a large physical server. There are also about 30 terabytes of disk packs available for archival research and disk-to-tape backup. Separately, there are 36 CPUs and 204 gigabytes of RAM reserved for virtual machines, which are implemented by binary translation in ESX. Odum staff members have taken advantage of the virtual machines' point-in-time snapshots and clones in order to test archive systems designed to interact with many peers. Most recently, the virtual machines have provided the infrastructure for the SSP and TPAP preservation environment prototypes (explained in the next section).

B.2. Preservation Prototypes

The preservation and storage layer is one of the identified areas of weakness for the Odum archive. We currently use typical tape backup and off-site storage to augment a few manual copies on disk stored at our partner sites. We are participating in two distinct efforts to develop a comprehensive preservation system. Our current developmental efforts are summarized below.

B.2.1. NARA TPAP

iRODS (Integrated Rule-Oriented Data System) is a data grid program that provides a layer of abstraction between an archive and its storage resources. For example, iRODS allows a collection of files to appear in a single logical space even if each file is stored on different media at different sites. iRODS is a glass box framework, meaning its internal services are exposed so that each site can define the behaviour of its digital collection. One such service, developed within the Transcontinental Persistent Archives Prototype (TPAP), gives iRODS the ability to preserve the Odum Institute's Dataverse archive. The service first populates the iRODS database with metadata harvested from Dataverse using the OAI Protocol for Metadata Harvesting. Then it uses HTTP to transfer the data objects into iRODS. This program has successfully produced a deep copy of much of the Odum Institute's Dataverse archive, including searchable metadata.

B.2.2. Data-PASS SSP

LOCKSS (Lots of Copies Keep Stuff Safe) is a replication and auditing program that copies data from a central location into a loosely coupled cluster of servers that protect each other from data corruption by periodic polling. LOCKSS ingests data from any website using a site-specific plugin. One such plugin, developed within the Syndicated Storage Project (SSP), allows LOCKSS to download the data objects from the Odum Institute's Dataverse archive and subject them to LOCKSS's frequent integrity checking. With this plugin, it is possible to distribute the data geographically, a process we have successfully tested with the Inter-university Consortium for Political and Social Research. This prototype system is illustrated in Figure 5.
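The idea of peers protecting each other by periodic polling can be illustrated with a simplified sketch. This is not the LOCKSS polling protocol, which is considerably more elaborate; it is a plain majority vote over content hashes, and the function names and peer names are hypothetical.

```python
import hashlib
from collections import Counter


def digest(data: bytes) -> str:
    """SHA-256 fingerprint of one replica's content."""
    return hashlib.sha256(data).hexdigest()


def poll_and_repair(replicas: dict) -> list:
    """Compare replicas of one data object and repair the minority copies.

    `replicas` maps a peer name to the bytes that peer holds. Peers whose
    copy disagrees with the majority are overwritten with the majority
    copy. Returns the sorted names of the repaired peers.
    """
    votes = Counter(digest(content) for content in replicas.values())
    winner, count = votes.most_common(1)[0]
    if count <= len(replicas) / 2:
        # Without a strict majority there is no trustworthy copy to repair from.
        raise RuntimeError("no majority among replicas")
    good_copy = next(c for c in replicas.values() if digest(c) == winner)
    repaired = []
    for peer, content in replicas.items():
        if digest(content) != winner:
            replicas[peer] = good_copy  # replace the damaged copy
            repaired.append(peer)
    return sorted(repaired)
```

With three peers holding a data object and one copy silently corrupted, a single poll identifies the damaged peer and restores its copy from the majority. A second poll then reports nothing to repair.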

Figure 5

B.3. Consumer and Administrative Services

The Dataverse Network is the distribution product for the Odum Institute's studies as well as the studies of the other partner organizations. Studies in the DVN can be downloaded in SPSS, SAS, R, or .txt formats. The studies usually also include codebooks and guides in PDF format, which can be downloaded and used together with the data. The DVN also allows for subsetting and analyzing the data within the catalog. Before downloading any data, users must accept a user agreement created by the Odum Institute to protect the data and the producers of the data within the DVN. Users can also create their own DVNs for free, upload their own studies, and either keep them private or allow other users to access them within the larger Dataverse Network.

Behind the scenes, the administration of the DVN allows the Odum Institute to correct any errors in studies, restrict certain studies, and make general changes to the study metadata and files. The back end of the DVN allows administrators to create new collections of studies and add new studies to existing collections. After new collections are added, their studies can remain restricted while administrators continue to edit the metadata fields and the files. Once these changes are done, the studies can be released for public use. This gives administrators control over new acquisitions as well as existing ones. Additionally, the DVN allows administrators to restrict the use of certain files and collections to users within the University of North Carolina system by requiring a login for these files.

The Dataverse Network allows us to restrict data to particular users and/or to particular groups. We use IP authentication particularly for group restrictions, e.g. allowing only UNC users to access data from ICPSR or Roper (due to contractual obligations). Data access can also be limited to individual users based on a login in the DVN.
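The two kinds of restriction described above, group access by IP address and individual access by login, can be sketched as follows. This is a hypothetical illustration, not the DVN's access-control code: the function, the restriction labels, and the network ranges (reserved documentation ranges used as placeholders, not UNC's real address space) are all assumptions.

```python
import ipaddress

# Placeholder networks from the reserved documentation ranges; a real
# deployment would list the institution's actual address blocks.
CAMPUS_NETWORKS = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]


def may_access(restriction, client_ip, user=None, allowed_users=frozenset()):
    """Decide whether a request may download a restricted file.

    restriction: "public", "campus" (IP-based group restriction), or
    "login" (restricted to named, logged-in users).
    """
    if restriction == "public":
        return True
    if restriction == "campus":
        addr = ipaddress.ip_address(client_ip)
        return any(addr in net for net in CAMPUS_NETWORKS)
    if restriction == "login":
        return user is not None and user in allowed_users
    raise ValueError(f"unknown restriction: {restriction}")
```

A request from an on-campus address passes the "campus" check without a login, which matches the group-level contractual restrictions, while "login" restrictions depend only on the authenticated user.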
We hope to adopt a common authentication system in the future to allow more control and better reporting of user statistics. One such system under investigation is Shibboleth.

B.4. Support for OAIS model and workflows

Figure 6 shows the technical implementation of the OAIS model at the Odum Archive.

Figure 6

Below are typical workflow interactions with the Odum Archival systems.

Consumer
  Interacts with the DVN
  Searches one or more federated DVNs
  Reviews metadata
  Selects studies of interest
  Analyses/subsets/downloads data
  Requests administration support

Producer
  Data is received in a variety of formats (SPSS, SAS, .txt, PDF, etc.) from diverse producers

Reports
  Google Analytics
  AWStats

Administration
  Pre-process non-standard data

    Check deposit for completeness and ask producer for more information (e.g. questionnaire, summary reports, deposit agreements, etc.)
    Check data for any identifying information and remove (with producer's authority)
    Scan any paper documentation
    File all original materials in Depository directory and update depository record (depository.xls)
  Create archive copies of data and documentation
    Create PDF(s) of any documentation
    Create SPSS portable and SAS export files from the original data (using database conversion software, SAS, and/or SPSS as needed)
    Create text files of questionnaires (for survey data or any data with full text questions) and mark up for later SQL updates of Dataverse databases (full question text)
  Prepare metadata
    Create a catalog record in DVN
  Automated ingest via DVN
    Upload all data and documentation files to the Dataverse
    Ingest any SPSS portable files (and/or Stata files) in the Dataverse to create subset/analysis files in the Dataverse
    Query Postgres database to determine record number for later SQL update
    Run SAS script on the questionnaire text files, using the record number obtained from the Postgres query, and then use the resulting SQL update script to replace the SPSS (or Stata) variable labels from ingest with the complete question wording for each variable
    Check Dataverse record for completeness and correctness
  Construct collections in DVN
  Create and copy Dataverse data files to the /pub/irss/ directory (as backup materials)
  Update depository record
  Link to other partner organizations' DVNs
  Respond to user queries
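The "query, then update" steps in the ingest workflow above can be sketched as follows. The Odum workflow uses a SAS script to generate SQL for the Dataverse PostgreSQL database; this sketch instead uses Python with an in-memory SQLite database as a stand-in, and the table names, column names, and sample study are hypothetical rather than the actual Dataverse schema.

```python
import sqlite3


def make_demo_db():
    """Create a tiny in-memory stand-in for the catalog database (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE study (id INTEGER PRIMARY KEY, title TEXT);
        CREATE TABLE variable (study_id INTEGER, name TEXT, label TEXT);
        INSERT INTO study VALUES (101, 'Demo Poll');
        INSERT INTO variable VALUES (101, 'Q1', 'APPROVE GOV');
        INSERT INTO variable VALUES (101, 'Q2', 'APPROVE LEG');
        """
    )
    return conn


def update_question_text(conn, study_title, question_text):
    """Replace truncated variable labels with full question wording.

    Step 1 looks up the study's record number; step 2 issues one
    parameterized UPDATE per variable. Returns the number of rows changed.
    """
    cur = conn.cursor()
    row = cur.execute(
        "SELECT id FROM study WHERE title = ?", (study_title,)
    ).fetchone()
    if row is None:
        raise LookupError(f"no study titled {study_title!r}")
    study_id = row[0]
    updated = 0
    for name, text in question_text.items():
        cur.execute(
            "UPDATE variable SET label = ? WHERE study_id = ? AND name = ?",
            (text, study_id, name),
        )
        updated += cur.rowcount
    conn.commit()
    return updated
```

Looking up the record number first keeps the updates scoped to a single study, so identically named variables (such as Q1) in other studies are left untouched.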

Part C Relevant Staff

Person: Jonathan Crabtree
Special Roles: Assistant Director for Archives & Information Technology
Skills: Overall strategic direction and management of Odum Information Technology and Archive group. Shape overall archival and preservation policy. Guide and manage preservation technology development activities.

Person: David Sheaves
Special Roles: Applications Programmer and Public Opinion Data Specialist
Skills: Programming languages: SAS, SQL, PERL. Operating systems: MVS, CMS, Unix, Linux. Database experience: Spires, OpenText, Postgres.

Person: Edward Bachmann
Special Roles: Applications Programmer and Census Data Specialist
Skills: Programming languages: Perl, PHP. Other languages: XML, SQL. Relational databases: Postgresql, MySQL. Operating systems: Linux.

Person: Rodney Hodson
Special Roles: Systems Administrator and Network Manager
Skills: Languages: DOS scripting, VBscript, SQL. Administration: Redhat, XP, Vista, Windows Server 2003, Active Directory. Database experience: MS SQL, Access DB.

Person: Paul Mihas
Special Roles: Qualitative Research Consultant, Odum Editorial Specialist, and Webmaster
Skills: Qualitative analysis: teach courses on qualitative research and consult with students and faculty members regarding their analysis strategies. Packages: ATLAS.ti, QSR NVivo, MAXQDA, QDA Miner. Maintain institute Web site. Schemas and software: XHTML, XML, Dreamweaver.

Person: Patrick King
Special Roles: Information Technologies Consultant and Computing Lab Supervisor
Skills: Languages: C, Java, Perl, Intel assembly, XML/XSLT, DOS scripting. IDE: Eclipse (with FIT & JUnit testing in Java; oxygen). Administration: XP, Vista, Windows Server 2003, Active Directory.

Person: Teresa Edwards
Special Roles: Programmer/data manager for Computer-assisted Data Collection survey applications
Skills: Packages: Blaise, DatStat Illume, SAS, MS Access, browser-based web survey systems (e.g. Qualtrics, SurveyMonkey).

Person: Mason Chua
Special Roles: System administrator and Systems Programming Specialist
Skills: Human syntax and natural language processing. Mathematical routines. OAI-PMH. Preservation programs: LOCKSS, iRODS, SRB. Infrastructure: Linux, ESX, enterprise storage. Languages: Java, Perl, Lua, TCL, shell script.