Digital Collections as Big Data. Leslie Johnston, Library of Congress Digital Preservation 2012



Similar documents
Functional Requirements for Digital Asset Management Project version /30/2006

Update on the Twitter Archive At the Library of Congress

Using Data Analytics to Detect Fraud

Web Archiving and Scholarly Use of Web Archives

WHY DIGITAL ASSET MANAGEMENT? WHY ISLANDORA?

The A-Z of Building a Digital Newspaper Archive: A Case Study of the Upper Hutt City Leader

Overview of NDNP Technical Specifications

Viewpoint ediscovery Services

2015 NASCIO STATE IT RECOGNITION AWARD SUBMISSION

A full spectrum of analytics you can get yourself

Teleconference information: Call-in toll-free number: (US) Conference Code: Webinar call-in number:

The Intelligence Engine.

Symantec ediscovery Platform, powered by Clearwell

Scholarly Use of Web Archives

1 st Symposium on Colossal Data and Networking (CDAN-2016) March 18-19, 2016 Medicaps Group of Institutions, Indore, India

Long Term Preservation of Earth Observation Space Data. Preservation Workflow

Considering Third Generation ediscovery? Two Approaches for Evaluating ediscovery Offerings

Navigating to Success: Finding Your Way Through the Challenges of Map Digitization

Computer-Based Text- and Data Analysis Technologies and Applications. Mark Cieliebak

Chapter 6 - Enhancing Business Intelligence Using Information Systems

OVERVIEW OF NTU LIBRARIES 南 洋 理 工 大 学 图 书 馆 简 介

etools for Online Communication : Analytics and

Digital Preservation Strategy,

Online Media Kit 2014-FCC_OnlineMediaKit 12/4/2014 8:56 AM Page 1 nline Odvertising A

Cataloging Efficiencies in Special Collections

Use Excel to analyze Twitter data

Get results with modern, personalized digital experiences

Capitalize on Big Data for Competitive Advantage with Bedrock TM, an integrated Management Platform for Hadoop Data Lakes

Quick and Easy Web Maps with Google Fusion Tables. SCO Technical Paper

Digital Heritage Preservation - Economic Realities and Options

Veritas ediscovery Platform

How To Understand The Benefits Of Big Data

THE AMERICAN CIVIL WAR & INDIANA RESEARCHER S GUIDE TO CIVIL WAR MATERIALS AT THE INDIANA HISTORICAL SOCIETY

!!!!! BIG DATA IN A DAY!

Computer Programming for the Social Sciences

Digital Asset Manager, Digital Curator. Cultural Informatics, Cultural/ Art ICT Manager

Pinterest has to be one of my favourite Social Media platforms and I m not alone!

Quick Guide to Getting Started: Twitter for Small Businesses and Nonprofits

SQL Server 2012 Business Intelligence Boot Camp

Newspaper Digitization Brief Background

UNIT I OVERVIEW OF E- COMMERCE

THE EUROPEAN DATA PORTAL

Introduction to Data Mining

Making Good Use of Data at Hand: Government Data Projects. Mark C. Cooke, Ph.D. Tax Management Associates, Inc.

Data Driven Discovery In the Social, Behavioral, and Economic Sciences

Adlib Internet Server

Connecting library content using data mining and text analytics on structured and unstructured data

Reasons to use Mobile Marketing. The statistics behind Mobile Marketing are even more convincing

Keystone Image Management System

REFLECTIVE : Innovation ecosystems of digital cultural assets

Visualizing Big Data. Activity 1: Volume, Variety, Velocity

OCLC CONTENTdm and the WorldCat Digital Collection Gateway Overview

B.Sc. in Computer Information Systems Study Plan

Marketing Solutions Built with People in Mind

OCLC CONTENTdm. Geri Ingram Community Manager. Overview. Spring 2015 CONTENTdm User Conference Goucher College Baltimore MD May 27, 2015

OpenAIRE Research Data Management Briefing paper

Assignment 5: Visualization

Session 805 -End-to-End SAP Lumira: Desktop to On-Premise, Cloud, and Mobile

OpenChorus: Building a Tool-Chest for Big Data Science

Best Practices for Structural Metadata Version 1 Yale University Library June 1, 2008

#mstrworld. No Data Left behind: 20+ new data sources with new data preparation in MicroStrategy 10

GRAPHICAL USER INTERFACE, ACCESS, SEARCH AND REPORTING

Visualizing Data: Scalable Interactivity

THE ICDD & SOCIAL MEDIA. By Betsy Potter, Director of Operations

Better Business Analytics with Powerful Business Intelligence Tools

Survey of Canadian and International Data Management Initiatives. By Diego Argáez and Kathleen Shearer

Blazent IT Data Intelligence Technology:

Assessing a Scientific Data Center as a Trustworthy Digital Repository

What s New in Analytics: Fall 2015

CS Matters in Maryland CS Principles Course

How to Use Boards for Competitive Intelligence

QUICK FACTS. Implementing a Big Data Solution on Behalf of a Media House TEKSYSTEMS GLOBAL SERVICES CUSTOMER SUCCESS STORIES

Grow your online business with Google AdSense

BC Geographic Warehouse. A Guide for Data Custodians & Data Managers

Transcription:

Digital Collections as Big Data Leslie Johnston, Library of Congress Digital Preservation 2012

Data is not just generated by satellites, identified during experiments, or collected during surveys. Datasets are not just scientific and business tables and spreadsheets. I do not need to convince this audience that we have Big Data in our Libraries, Archives and Museums.

More and more researchers want to use collections as a whole, mining and organizing the information in novel ways. Researchers use algorithms to mine the rich information and tools to create pictures that translate that information into knowledge. Researchers may want to interact with a collection of artifacts, or they may want to work with a data corpus.

We still have collections. But what we also have is Big Data, which requires us to rethink the infrastructure that is needed to support Big Data services. Our community used to expect researchers to come to us, ask us questions about our collections, and use our digital collections in our environment. Now our collections are, more often than not, self-serve.

Case Study: Web Archives Web Archives, such as the one at the Library of Congress, may be comprised of billions of files. When we began archiving election web sites, we imagined users browsing through the web pages, studying the graphics or use of phrases or links. But when our first researchers came to the Library, they wanted to know about all those topics, but they used scripts to query for them and sort them into categories. They were not very much interested in reading web pages. The Library is testing tools for full-text indexing of the entire archive and collection subsets http://www.loc.gov/webarchiving/

Case Study: Historic Newspapers The Chronicling America collection has 5 million page images from historic newspapers with OCR from organizations in 25 states. The site gets approximately 4 million hits per day. Some researchers want to search for stories in historic newspapers. Some researchers want to mine newspaper OCR for trends across time periods and geographic areas. Requests have come in to analyze all 5 million pages. http://chroniclingamerica.loc.gov/

Case Study: Twitter The Twitter archive has 10s of billions of tweets in it. Research requests have included users looking for their own Twitter history, the study of the geographic spread of news, the study of the spread of epidemics, and the study of the transmission of new uses of language. visualization social science social media status events personal commercial privacy

Are our institutions ready? We are building large digital collections and must consider new ways in which they should be managed and used.

The Library of Congress is proceeding on multiple fronts

The development of a variety of repository services that will be used to ingest and inventory Big Data collections. The ingest and inventory of such collections, other than scale, is basically understood.

How much ingest processing should be done with data collections, or collections that can be treated as data? Do we process collections to create a variety of derivatives that might be used in various forms of analysis before ingesting them? Do we have sufficient infrastructure to create full-test indexes for billions of files to support full discovery? Do we load collections into analytical tools? These products are still in early days for the scale of billions of files.

LC will benchmark ingest and indexing processes in multiple hardware environments.

And what are the service models? If we decide that we will simply provide access to data, do we limit it to the native format or provide preprocessed or on-the-fly format transformation services for downloads? Can we handle the download traffic? Can our staff develop the expertise to provide guidance to researchers in using analytical tools? Or do we leave researchers to fend for themselves?

The Library is increasingly looking towards selfservice researchers need not ask to download or tell us that they have. We may never know. BUT, we do have collections that are limited to on-site only access due to licenses or gift agreements. In that case, we may have to provide high-powered workstations with analytical tools for researchers to work with these collections and take analysis outputs away with them. Both have policy implications and implications for public service staffing.

And now we will discuss Leslie Johnston lesliej@loc.gov