Digital Collections as Big Data. Leslie Johnston, Library of Congress Digital Preservation 2012
Transcription
1 Digital Collections as Big Data Leslie Johnston, Library of Congress Digital Preservation 2012
2 Data is not just generated by satellites, identified during experiments, or collected during surveys. Datasets are not just scientific and business tables and spreadsheets. I do not need to convince this audience that we have Big Data in our Libraries, Archives and Museums.
3 More and more researchers want to use collections as a whole, mining and organizing the information in novel ways. Researchers use algorithms to mine the rich information and tools to create pictures that translate that information into knowledge. Researchers may want to interact with a collection of artifacts, or they may want to work with a data corpus.
4 We still have collections. But what we also have is Big Data, which requires us to rethink the infrastructure that is needed to support Big Data services. Our community used to expect researchers to come to us, ask us questions about our collections, and use our digital collections in our environment. Now our collections are, more often than not, self-serve.
5 Case Study: Web Archives Web archives, such as the one at the Library of Congress, may comprise billions of files. When we began archiving election web sites, we imagined users browsing through the web pages, studying the graphics or the use of phrases or links. When our first researchers came to the Library, they did want to know about all those topics, but they used scripts to query for them and sort them into categories. They were not much interested in reading web pages. The Library is testing tools for full-text indexing of the entire archive and of collection subsets.
6 Case Study: Historic Newspapers The Chronicling America collection has 5 million page images from historic newspapers with OCR from organizations in 25 states. The site gets approximately 4 million hits per day. Some researchers want to search for stories in historic newspapers. Some researchers want to mine newspaper OCR for trends across time periods and geographic areas. Requests have come in to analyze all 5 million pages.
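As an illustration of what mining newspaper OCR for trends across time periods might look like, here is a minimal Python sketch. It assumes the OCR text is already available as (year, text) pairs; that input shape, the function name `term_counts_by_year`, and the sample pages are all hypothetical and are not the Chronicling America data format or API.

```python
import re
from collections import defaultdict


def term_counts_by_year(pages, term):
    """Count whole-word, case-insensitive occurrences of `term` per year.

    `pages` is an iterable of (year, ocr_text) pairs. This input shape is
    an assumption made for illustration only.
    """
    pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
    counts = defaultdict(int)
    for year, text in pages:
        counts[year] += len(pattern.findall(text))
    # Return a plain dict ordered by year so trends read chronologically.
    return dict(sorted(counts.items()))


# Hypothetical sample data standing in for OCR'd newspaper pages.
sample_pages = [
    (1898, "War declared. The war news reached the city today."),
    (1898, "Local harvest reports."),
    (1905, "Talk of war in the east."),
]

trend = term_counts_by_year(sample_pages, "war")
print(trend)
```

A real analysis over millions of pages would also need to cope with OCR errors and would likely normalize counts by the number of pages per year; neither is attempted here.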
7 Case Study: Twitter The Twitter archive holds tens of billions of tweets. Research requests have included users looking for their own Twitter history, the study of the geographic spread of news, the study of the spread of epidemics, and the study of the transmission of new uses of language. [Slide graphic keywords: visualization, social science, social media, status, events, personal, commercial, privacy]
8 Are our institutions ready? We are building large digital collections and must consider new ways in which they should be managed and used.
9 The Library of Congress is proceeding on multiple fronts
10 The development of a variety of repository services that will be used to ingest and inventory Big Data collections. Other than the question of scale, the ingest and inventory of such collections is basically understood.
11 How much ingest processing should be done with data collections, or collections that can be treated as data? Do we process collections to create a variety of derivatives that might be used in various forms of analysis before ingesting them? Do we have sufficient infrastructure to create full-text indexes for billions of files to support full discovery? Do we load collections into analytical tools? These products are still in early days for the scale of billions of files.
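To make the idea of a full-text index concrete, the following is a toy Python sketch of an inverted index, which maps each word to the set of documents containing it. It is not the Library's tooling; the function names `build_index` and `search`, the document identifiers, and the sample texts are hypothetical, and an index over billions of files would require a dedicated, distributed search system rather than an in-memory dictionary.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Build an inverted index: token -> set of document ids."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents that contain every token in the query."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result


# Hypothetical documents standing in for archived files.
docs = {
    "page-1": "Election results announced in the county",
    "page-2": "County fair opens next week",
    "page-3": "Election campaign visits the county fair",
}

index = build_index(docs)
hits = search(index, "county election")
print(sorted(hits))
```

The sketch shows why indexing is an ingest-time cost: every file must be read and tokenized once before any query can be answered.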
12 LC will benchmark ingest and indexing processes in multiple hardware environments.
13 And what are the service models? If we decide that we will simply provide access to data, do we limit it to the native format or provide preprocessed or on-the-fly format transformation services for downloads? Can we handle the download traffic? Can our staff develop the expertise to provide guidance to researchers in using analytical tools? Or do we leave researchers to fend for themselves?
14 The Library is increasingly looking towards self-service: researchers need not ask to download or tell us that they have. We may never know. BUT, we do have collections that are limited to on-site-only access due to licenses or gift agreements. In that case, we may have to provide high-powered workstations with analytical tools so that researchers can work with these collections and take analysis outputs away with them. Both approaches have policy implications and implications for public service staffing.
15 And now we will discuss. Leslie Johnston
