Building a Strong Foundation with Dublin Core



Similar documents
Data Archiving at the U.S. Central Bank

Information and documentation The Dublin Core metadata element set

Workflow Analysis E-THESES BEST PRACTICE SUMMARIES. Josh Brown & Kathy Sadler

Modernizing Financial Data Collection with XBRL

Selection and Management of Open Source Software in Libraries.

ESRC Research Data Policy

Statement of Work (SOW) for Web Harvesting U.S. Government Printing Office Office of Information Dissemination

Progress Report Template -

Building next generation consortium services. Part 3: The National Metadata Repository, Discovery Service Finna, and the New Library System

Geospatial Information in the Statistical Business Cycle 1

Institutional Repositories: Staff and Skills requirements

System Requirements for Archiving Electronic Records PROS 99/007 Specification 1. Public Record Office Victoria

User research for information architecture projects

Comprehensive Exam Analysis: Student Performance

How To Useuk Data Service

Digital Asset Management Workgroup Reed Digital Collections (powered by CONTENTdm) Art & Architecture Collection Data Dictionary Updated 8/27/08

data.bris: collecting and organising repository metadata, an institutional case study

STORRE: Stirling Online Research Repository Policy for etheses

Position Classification Standard for Management and Program Clerical and Assistance Series, GS-0344

Project InterParty: from library authority files to e-commerce. Andrew MacEwan The British Library

Smartphone Use on an Academic Library Website Featuring Responsive Web Design

Case Study. Introduction

Network Working Group

Oracle Utilities Mobile Workforce Management Business Intelligence

USER EDUCATION: ACADEMIC LIBRARIES

HydroDesktop Overview

OPENGREY: HOW IT WORKS AND HOW IT IS USED

Enterprise Resource Planning Analysis of Business Intelligence & Emergence of Mining Objects

National Geospatial Data Asset Management Plan

Report of the Ad Hoc Committee for Development of a Standardized Tool for Encoding Archival Finding Aids

REACCH PNA Data Management Plan

Notes about possible technical criteria for evaluating institutional repository (IR) software

Adaptive Automated GUI Testing Producing Test Frameworks to Withstand Change

LONG INTERNATIONAL. Long International, Inc Whistling Elk Drive Littleton, CO (303) Fax: (303)

This paper was presented at the 1995 CAUSE annual conference. It is part of the proceedings of that conference, "Realizing the Potential of

Data documentation and metadata for data archiving and sharing. Data Management and Sharing workshop Vienna, April 2010

Kristine Brancolini Loyola Marymount University 16 September 2009

Queensland recordkeeping metadata standard and guideline

".,!520%'$!#)!"#$%&!>#($!#)!

Building integration environment based on OAI-PMH protocol. Novytskyi Oleksandr Institute of Software Systems NAS Ukraine

Strategic Management of Learning Assets

Visualizing Relationships and Connections in Complex Data Using Network Diagrams in SAS Visual Analytics

Summary of Responses to the Request for Information (RFI): Input on Development of a NIH Data Catalog (NOT-HG )

Generic Statistical Business Process Model

Chapter 10 Practical Database Design Methodology and Use of UML Diagrams

Using Dublin Core for DISCOVER: a New Zealand visual art and music resource for schools

Database support for research in public administration

METADATA STANDARDS AND GUIDELINES RELEVANT TO DIGITAL AUDIO

Documenting the research life cycle: one data model, many products

OUTSOURCING, QUALITY CONTROL, AND THE ACQUISITIONS PROFESSIONAL

BEST PRACTICES FOR OPEN SOURCE TECHNOLOGY MANAGEMENT IN LIBRARY AND INFORMATION CENTRES.

College of Communication and Information. Library and Information Science

IFS-8000 V2.0 INFORMATION FUSION SYSTEM

April Online Payday Loan Payments

2015 Global Identity and Access Management (IAM) Market Leadership Award

Category: Nomination for NASPE s Eugene H. Rooney, Jr. Award Innovative State Human Resource Management Program

IT and CRM A basic CRM model Data source & gathering system Database system Data warehouse Information delivery system Information users

DIIMS Records Classifier Guide

The ECB Statistical Data Warehouse: improving data accessibility for all users

How collaboration can save [more of] the web: recent progress in collaborative web archiving initiatives

An Introduction to. Metrics. used during. Software Development

2016 Firewall Management Trends Report

Customised Web-based Services at SERC Library with Special Reference to Alert Services

The Metadata in the Data Catalogue DBK

Checklist for a Data Management Plan draft

The Czech Digital Library and Tools for the Management of Complex Digitization Processes

Analysing log files. Yue Mao Supervisor: Dr Hussein Suleman, Kyle Williams, Gina Paihama. University of Cape Town

All You Wanted To Know About the Management of Digital Resources in Alma

Core Components Data Type Catalogue Version October 2011

Essentials of financial consolidation applications

Web OPAC: An Effective Tool for Management of Reprints of ARI Scientists

Digital Library Content and Course Management Systems: Issues of Interoperation

Integrated Library Systems (ILS) Glossary

The Rutgers Workflow Management System. Workflow Management System Defined. The New Jersey Digital Highway

USING LOGISTIC REGRESSION TO PREDICT CUSTOMER RETENTION. Andrew H. Karp Sierra Information Services, Inc. San Francisco, California USA

Metadata Quality Control for Content Migration: The Metadata Migration Project at the University of Houston Libraries

Newspaper Digitization Brief Background

Towards research data cataloguing at Southampton using Microsoft SharePoint and EPrints: a progress report

U.S. Department of the Interior/Geological Survey. Earth Science Information System (ESIS)

INSURANCE-CANADA.CA INC. Report. Doug Grant 9/26/2011

Data Management Plan. Name of Contractor. Name of project. Project Duration Start date : End: DMP Version. Date Amended, if any

CROATIAN BUREAU OF STATISTICS REPUBLIC OF CROATIA MAIN (STATISTICAL) BUSINESS PROCESSES INSTRUCTIONS FOR FILLING OUT THE TEMPLATE

DIGITAL STEWARDSHIP SUPPLEMENTARY INFORMATION FORM

C o p yr i g ht 2015, S A S I nstitute Inc. A l l r i g hts r eser v ed. INTRODUCTION TO SAS TEXT MINER

Monitoring and Reporting Drafting Team Monitoring Indicators Justification Document

The Institutional Repository at West Virginia University Libraries: Resources for Effective Promotion

Electronic publishing and institutional archives: utilising open-source software

Twente Grants Week: Data management. Maarten van Bentum (Library & Archive)

Data quality and metadata

SEARCHING THE SUNDIAL MAIL ARCHIVE by Carl Sabanski / edited by Mac Oglesby

Metadata for Research Data: Current Practices and Trends

Extend Table Lens for High-Dimensional Data Visualization and Classification Mining

GUIDELINES FOR THE CREATION OF DIGITAL COLLECTIONS

DATA MASKING A WHITE PAPER BY K2VIEW. ABSTRACT K2VIEW DATA MASKING

Numbering Systems. InThisAppendix...

Educating the Health Librarians in Africa today: Competencies, Skills and Attitudes required in a Changing Health Environment

Search Engine Optimization (SEO): Improving Website Ranking

Management of Journals Through KOHA Open Source Software: an Overview. Asheesh Kamal Assistant Librarian

Sales Analysis User Manual

Guideline for Implementing the Universal Data Element Framework (UDEF)

Transcription:

Building a Strong Foundation with Dublin Core Linda Powell Board of Governors of the Federal Reserve System, U.S.A. Linda.Powell@frb.gov Abstract When the Board of Governors of the Federal Reserve System needed to create a new metadata system, staff decided to start with a time-tested standard, Dublin Core. The U.S. central bank consumes a variety of metadata, including metadata that define collections of data and metadata that describe variable-level data. This paper follows the process used to create collection-level metadata and variable-level metadata, then to retrofit existing metadata to make them all usable by economists, financial analysts, and information technology professionals. Keywords: metadata; financial data; banking. 1. Introduction The Board of Governors of the Federal Reserve System (the Board) collects, gathers, and purchases data from a variety of sources. Advances in technology and industry standards over the past 10 years have enabled firms, including the Board, to improve data repositories. For a data repository to be complete, it needs good metadata, which are generally defined as data about data, or the electronic data dictionary for the data of interest. Although there are many types of metadata, for the purposes of this paper, we will focus on collection-level metadata (sometimes called meta-metadata), which describe groups of data or data sets, and variable-level metadata, which define each item (or ) in the data set. For example, the financial data of companies can be broken into collection-level metadata and variable-level metadata. The collection- level data could include (1) a description indicating that the data represent individual companies financial statements, (2) how the data are compiled, and (3) the date range available. The variable-level metadata would include information about the data within the financial statements such as (1) a description for each variable (for example, total assets, total liabilities, or capital ratio), (2) the data type (for example, monetary, percent, text), and (3) formulas used to calculate the numbers. 2. The Business Problem Over time, as technology has grown, the availability of data from the private sector has expanded and the financial markets have increased in complexity, resulting in changes to the types of data the Board uses. Historically, when the Board purchased data, they were used for specific research projects rather than market or policy analysis. Today, significant volumes of financial market data are purchased. Most of these data sets are time series cross-sectional data, meaning that they cover data for a large number of institutions over many periods. The periods can range from daily (for example, the value of derivative contracts) to quarterly (for example, the financial statements of all U.S. banks) or periodic observations. In 2006, the change in business practices at the Board became apparent, and the need for better information about purchased data was addressed by creating a new collection-level metadata repository. As of 2009, the Board has three main metadata repositories for firm-level data: one for collection-level data called the Data and News CataloguE (DANCE), one for data collected by the Federal Reserve System called the MicroData Reference Manual (MDRM), and one for data purchased from vendors (Vendor Metadata). Each of these repositories was developed to meet a 87

2009 Proc. Int l Conf. on Dublin Core and Metadata Applications specific need. At the time the repositories were developed, the metadata requirements were based on a subset of users needs and requirements. Over the past several years, these repositories have been evaluated to determine if they meet the needs of the user community and if and how they can be used collectively. In summary, we found that the repositories should not be combined but should be revamped to provide better information and enable working collaboratively. To revise these repositories, we looked to the body of international standards for metadata and chose Dublin Core as the primary standard for revamping our metadata repositories. 3. Existing Metadata Repositories for Non Time Series Data Metadata have been used at the Board since before the World Wide Web came into existence. The metadata for data collected by the Federal Reserve System have been maintained and well organized for more than 30 years in the MDRM. The MDRM provides both collection- level and variable-level metadata. The presentation has improved and the scope of explanatory data in the MDRM has increased over the years. Although the MDRM is an established and useful tool, it could benefit from adherence to industry standards and consistency with other metadata tools. The DANCE repository was first developed in 2004 to collect and disseminate basic information about data purchased by the Board. The original purpose was to facilitate finding various types of data. The need for additional information became apparent as soon as the application reached production. In addition to the need for more information about the data sets, it was discovered that users of the data sets needed a centralized location to record information about their experiences using the data. The implementation of an international standard would provide a strong foundation for determining the scope of the expanded variables. To meet the need for experiential information, it was determined that a wiki page for each data collection would provide users with the opportunity to share information. The Vendor Metadata repository development began in 2007 to facilitate the programmatic creation of internal SAS data sets from purchased financial data. It was quickly expanded to document the data structure and variables for use by end users. In addition to variable-level metadata, this repository contains system or technical metadata detailing file layouts and formats. Because this repository was originally designed to store system-level metadata, using an industry standard to expand on the metadata provided guidance on the scope of the variables and consistency among the repositories. 4. Review of Metadata Standards Although the DANCE application met the requirements of its initial purpose, as soon as the application hit production, it became obvious that this application could be expanded, and additional demands surfaced. Once the decision was made to expand the application, a review of metadata standards was performed. While examining the various industry standards, it was determined that all three of the Board s metadata repositories could benefit from review and the application of industry standards. In addition, there was an effort to achieve consistency among the repositories. Although additional standards were reviewed, five candidates for the Board s data warranted in-depth evaluation: SDMX (Statistical Data and Metadata Exchange), XBRL (Extensible Business Reporting Language), MARC, DDI (Data Documentation Initiative), and Dublin Core. The review began with the standards that Board staff were familiar with: SDMX, XBRL, and MARC. SDMX appeared to be geared toward aggregate time series data rather than crosssectional data (SDMX Standards). XBRL provided guidance for the variable-level metadata but did not provide the scope of collection-level metadata (which was the primary focus of the review) that other standards provided (XBRL Specifications). And MARC was familiar to the Board s library staff but was more bibliographic than necessary for this purpose (MARC 21 Introduction). 88

DDI provided good collection-level metadata as well as extensive variable-level metadata. However, at the time DDI was reviewed, it did not allow for data collections that persisted over time. In addition, the variable-level metadata were geared toward social science survey data, whereas the majority of data stored in these repositories are financial in nature. (DDI 2.1 Specifications.) Although none of the standards were a perfect fit for the Board s needs, we found that Dublin Core had an excellent breadth of metadata s without being overwhelming. Dublin Core was a particularly good match for the collection-level metadata. Dublin Core also had the advantage of being publicly well documented and relatively uncomplicated, thus enabling immediate application of the standard. (DCMI Element Set.) Although Dublin Core was found to be a well-fitting standard for collection-level metadata, we found that we needed additional variable-level metadata; as such, we turned to XBRL. XBRL is a good fit for the variable-level data because it focuses on financial statement data, which comprise the majority of data that are stored in the metadata repository and the MDRM. 5. APPLICATION PROFILE The DANCE application profile consists of the variables listed in Appendix 1. Because some variables can change over time (such as update accrual method) and other variables (such as relation) can have multiple entries, the data were divided into separate tables in the database. The application profile was designed to support the discovery of metadata and provide information to end users. Most of the variables in the application profile are from Dublin Core; however, some variables were added to supplement the Dublin Core s and provide agency-specific information such as which department within the organization purchased the data. The metadata variables were chosen by a group of end users consisting of data managers, programmers, librarians, and financial analysts. The user group reviewed each of the Dublin Core s and modeled example data sets to determine the applicability of each. The Vendor Metadata and MDRM application profiles expand on the DANCE profile and incorporate XBRL s specific to financial data. 6. Open Source Software Once the decision to use the Dublin Core standard for collection-level metadata was made, we explored the possibility of capitalizing on one of the open source repositories. We preferred to use a PostgreSQL relational database but were open to MySQL as a backend storage facility. The systems chosen for evaluation were EPrints, FEDORA, and DSpace. After our initial review, DSpace appeared to be the best candidate for the new DANCE. However, in-depth review and initial testing showed that to meet the requirements for the new DANCE application, the open source system would need significant customization. In general, we wanted to be able to integrate our data security information and Vendor Metadata repository with DANCE to allow end users to drill down through the application for related information. In the end, we decided to build our own collection-level and variable-level repositories in-house using a PostgreSQL server on Linux. 7. Conclusions and Future Work The greatest advantage of using the Dublin Core standard was that the collective experience and knowledge of hundreds of professionals went into creating the standard. Some administrative data that were specific to the Board s needs were added to enhance the standard s variables. A quick review of the original variables versus the complete list of variables for the DANCE repository in Appendix 1 demonstrates the greater breadth of information suggested by the Dublin Core standard than was originally identified. The Vendor Metadata repository, Appendix 2, and the variable-level metadata repository in the MDRM, Appendix 3, both benefited from the inclusion of Dublin Core and XBRL variable-level metadata. 89

2009 Proc. Int l Conf. on Dublin Core and Metadata Applications A second advantage of using the Dublin Core standard is the interoperability among systems that Dublin Core facilitates (that is, using a common standard for all of the Board s metadata repositories will assist in automating data retrieval from all three repositories). These repositories were designed for use within the Board, but the Federal Reserve System has 12 regional banks that have expressed interest in the data stored in the repositories. Using an industry standard should facilitate easier interoperability among the various locations. Finally, the Board currently publishes aggregate economic macrodata using SDMX. Because SDMX supports Dublin Core as a metadata format, there may be opportunities to capitalize on the interoperability of SDMX and Dublin Core when bridging firm-level microdata to aggregate-level macrodata. Acknowledgments The author wishes to thank her colleagues Wei-na Chow and Andrew Boettcher for their research contributions to this paper. The views expressed in this paper are those of the author and do not necessarily represent those of the Board of Governors of the Federal Reserve System. References CPIT. (2007). Technical Evaluation of Selected Open Source Repository Solutions, Version 1.3. Retrieved February 27, 2007 from https://www.eduforge.org/docman/view.php/131/1062/repository%20evaluation%20document.pdf. Data Documentation Initiative. (2006-2009). Retrieved August 3, 2009 from http://www.ddialliance.org. DCMI. (2006-2009). DCMI Element Set. Retrieved February 27, 2007 from http://www.dublincore.org/index.shtml. DCMI. (2006-2009). DCMI Metadata Terms. Retrieved February 27, 2007 from http://www.dublincore.org/index.shtml. DCMI. (2006-2009). User guide. Retrieved February 27, 2007 from http://www.dublincore.org/index.shtml. INNODATA ISOGEN. Content Metadata Standards: Libraries, Publishers, and More. White paper. Retrieved October 30, 2008 from http://www.innodata-isogen.com/knowledge_center/white_papers/content_metadata_standards_wp. MARC. (April 2008). MARC 21 Introduction. Retrieved January 4, 2009 from http://www.loc.gov/marc/bibliographic/lite/genintro.html. SDMX. (2006-2009). Retrieved February 27, 2007 from http://www.sdmx.org. SDMX. (2006-2009). SDMX Standards. Retrieved January 4, 2009 from http://sdmx.org/?page_id=10. Snijder, Ronald. (2001). Metadata standards and information analysis: A survey of current metadata standards and the underlying models. Retrieved September 10, 2008, from http://www.geocities.com/ronaldsnijder. XBRL. (2005-2009). Retrieved January 4, 2009 from http://www.xbrl.org/home/. XBRL. (2005-2009). XBRL Specifications. Retrieved August 3, 2009 from http://www.xbrl.org/specrequirements/. Appendix A: DANCE Variables Category of information Original variables Internal name of retained/ new variables Dublin Core Dublin Core definition/guideline or comment Descriptive Category This item was not found useful and was removed in the redesign. Descriptive Database Title Title The name given to the resource. Descriptive Alternative Alternative An alternative name for the resource. Descriptive Database URL Product URL 90

Descriptive Description Description Description Description may include but is not limited to: an abstract, table of contents, reference to a graphical representation of content, or a free-text account of the content. Descriptive Keyword(s) Keyword Subject The topic of the content of the resource. Typically, a subject will be expressed as keywords or key phrases or classification codes that describe the topic of the resource. Descriptive Vendor Publisher Publisher Recommends against using same name as in "creator." Suggests "publisher" for organizations and "creator" for individuals. If ambiguous, use "contributor." Descriptive Vendor Vendor URL URL Descriptive Creator Creator Entity responsible for the content. Descriptive Record ID Identifier An unambiguous reference to the resource within a given context. Coverage Date created Date Coverage Date range available Available Available should be used in the case of a resource for which the date of availability may be distinct from the date of creation, and the date of availability is relevant to the use of the resource. The spatial or temporal topic of the resource. Coverage Geographical coverage Coverage Coverage Status Data set status Accrual policy The policy governing the addition of items to a collection. Storage Data origination Source In general, include in this area information about a resource that is related intellectually but does not fit easily into a relation. Storage Related resources Relation Reference to a related resource. Storage Form of Data location access Storage Form of access Format Format Physical or digital manifestation of the resource. Storage Type Type Nature or genre of the content. Storage Physical medium Medium The material or physical carrier of the resource. Storage Update method Accrual method The method by which items are added to a collection. Storage Software name Software A computer program in the source or compiled form. 91

2009 Proc. Int l Conf. on Dublin Core and Metadata Applications Storage Update schedule Accrual periodicity The frequency with which items are added to a collection. Contact Data contact Data contact Contact Division Purchasing division Contact Data requestor Contact Contributing section Contributor Use for entities with lesser responsibility or more ambiguous contributions than the entity in "creator." Contact Purchasing section Contact Vendor contacts License License contact License owner Rights holder Defined as person or organization holding rights over the resource. License License information Data confidentiality Rights Information about rights held in and over the resource. License Bibliographic notation Bibliographic citation A bibliographic reference for the resource. License License agreement License A legal document giving official permission to do something with the resource. Other Help files Instructional method A process used to engender knowledge, attitudes, and skills that the resource is designed to support. Other Data documentation Other Document path or URL Text A resource consisting primarily of words for reading. Other Additional information This item links to a wiki that is updatable by all users. Criticality Cost Restricted access. Payment Restricted access. schedule Contract renewal Restricted access. date Purchase Restricted access. justification Purchase order Restricted access. Reason needed Restricted access. Contract length Restricted access. Record last updated Record updated by 92

Appendix B: Vendor Metadata Repository Variables Internal name of retained/new variables Variable storage name Variable name short Mnemonic Dublin Core Title Alternative XBRL Dublin Core definition/guideline or comment The name given to the resource. Typically, a Title will be a name by which the resource is formally known. An alternative name for the resource. To link to the MDRM database. Description Description Description may include but is not limited to: an abstract, table of contents, reference to a graphical representation of content or a free-text account of the content. MDRM number To link to MDRM database. ID variable Identifier An unambiguous reference to the resource within a given context. Date start Date end Ordinal position Display order. Flow-stock Flow-stock Flow periodicity Flow periodicity Debit/credit indicator Debit/ credit indicator Data type Data type Monetary, boolean, text, rate, percent. Multiplier This was an in the initial XBRL requirements but was removed in subsequent versions (for example, thousands of dollars). Numeric flag Derived flag Formula Formula Formula used if the variable is derived. Variable length Variable precision Confidentiality Format Hasformat A related resource that is substantially the same as the pre-existing described resource, but in another format. Record last updated Record updated by Appendix C: MDRM Variables Internal name Dublin Core XBRL Dublin Core definition/guideline or comment Current variables Item name Title The name given to the resource. Typically, a Title will be a name by which the resource is 93

2009 Proc. Int l Conf. on Dublin Core and Metadata Applications formally known. Item short caption Alternative An alternative name for the resource. Mnemonic A four character mnemonic that identifies the data set or survey. When combined with the MDRM number, it creates a unique identifier. Description Description Description may include but is not limited to: an abstract, table of contents, reference to a graphical representation of content, or a freetext account of the content. MDRM number A suffix that identifies the accounting concept. When combined with the mnemonic, it creates a unique identifier. ID variable Identifier The combination of the mnemonic and MDRM number. Date start Date end Confidential Item type This variable will be broken into data type and derived flag. Record last updated Record updated by Future variables Ordinal position A combination of the reporting form schedule and line number. Flow-stock Flow-stock Flow periodicity Flow periodicity Debit/credit indicator Debit/ credit indicator Data type Data type Monetary, boolean, text, rate, percent. Multiplier This was an in the initial XBRL requirements but was removed in subsequent versions. Derived flag Formula Formula Formula used if the variable is derived. Collection-level metadata Title Title The name given to the resource. Mnemonic Sub series Segmented mnemonics Reporting form number Other database ID Access method Status Date start Identifier Alternative Accrual policy Reference to the survey form number. 94

Date end Record last updated Record updated by Future collectionlevel metadata convert from text to a database Description Frequency Reporting panel Data availability Confidentiality Major series changes Additional information Public release Creator Data location Format Data contact Description Accrual periodicity Source Accrual method Event Text Creator Format A nonpersistent, time-based occurrence. Link to additional information for complex series. 95