Digitization of Print materials, Audio and Video




Workshop on Digital Libraries: Theory and Practice
March, 2003, DRTC, Bangalore

Digitization of Print materials, Audio and Video

Bibhuti Bhusan Sahoo
Documentation and Research Training Centre
Indian Statistical Institute, Bangalore 560059
E-mail: bibhutisahoo@hotmail.com

Abstract

This paper discusses the digitization of print material, audio and video, including slides, microfilms etc. It covers the scanning process, scanners, OCR technology, image editing software and standards for information storage.

1. INTRODUCTION

In a library, information is basically available in the following forms:

   o Image
   o Text
   o Audio
   o Video

Let us now examine how to capture each of these types of data.

2. CAPTURING OF TEXT AND IMAGES

There are two main types of capture devices: scanners and digital cameras (2).

2.1. Scanners

Different types of scanners are available in the market.

2.1.1. Flatbed scanners

Flatbed scanners are often used by libraries and archives to capture images of bound volumes (books), manuscripts, journals and so on. They require the user to lift, rotate and place the document face down. Flatbed scanners can scan at 600 dpi (dots per inch), and many offer a resolution of 1200 dpi or more.

2.1.2. Faceup Scanners

Planetary or faceup scanners are more expensive than flatbed scanners. A planetary scanner does not touch the source document, and a bound volume can be opened gently without pressing down on the spine.

2.1.3. Feed-through scanners

These are well suited to loose sheets. They roll the page over a stationary scanning head. Automatic Document Feeders (ADF) are supplied at an additional cost.

2.1.4. Hand scanners

Handheld scanners are best for scanning selected sections of data. However, they force the user to make multiple passes over an A4 size page, and the user needs a steady hand while moving the scanner over the material.

Different types of scanners have been recommended for different types of documents. These are listed below (5).

Scanner Suggestions for Various Material Types

Single leaf, regular size, flat materials:
   o Flatbed scanner
   o Sheetfed scanner (if nonbrittle)
   o Digital camera

Single leaf, oversized, flat materials:
   o Oversize flatbed scanner
   o Sheetfed scanner (if nonbrittle)
   o Digital camera

Bound materials:
   o Digital camera with book cradle
   o Right angle, prism, or overhead flatbed scanner

Transparent media:
   o Slide scanner
   o Film scanner
   o Multi-format flatbed scanner
   o Digital camera

3. DIGITAL CAMERAS

Digital cameras are used primarily for digitizing color originals. A digital camera does not come into contact with the source document. Digital cameras are used primarily by professional photographers and graphic artists rather than by libraries and archives. A library that does not have a planetary scanner may wish to consider purchasing a digital camera and camera stand for capturing images from particularly fragile source documents. The ideal capture choice for most libraries and archives is 24 bit color and 1200 dpi, because it offers a high quality image at moderate cost. However, one must be strict about keeping the resolution at a reasonable level so as to restrict file sizes.

4. SCANNING

Before discussing the scanning process, let us look at some definitions related to images and image quality.

4.1. Pixels

"Pixel" is short for picture element. Pixels make up an image, similar to grains in a photograph or dots in a half tone. A pixel is often referred to as a dot, as in "dots per inch". Each pixel can represent a number of different shades or colors, depending on how much storage space is allocated for it. Pixels per inch (ppi) is sometimes the preferred term, as it more accurately describes the digital image.

4.2. Resolution And Dpi (Dots Per Inch)

   o Resolution is the number of dots per inch
   o High resolution has many dots per inch
   o Low resolution has few dots per inch

4.3. Bit Depth

Bit depth is the number of bits used to define one pixel in the image. It has an effect on image quality. A bit can have two values, 0 or 1. If an object is scanned at a bit depth of 8, each pixel can take one of 256 possible colors. A bit depth of 24 produces over 16 million colors. Bit depth also has an effect on file size: as bit depth increases, the uncompressed file size increases arithmetically, in proportion to the bit depth. Scanners generally sample at a higher bit depth, and the final image output is then sampled down to a lower bit depth.
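The relationships described above between dpi, bit depth and file size can be checked with a little arithmetic. The following is only an illustrative sketch: the function names color_count and uncompressed_bytes and the 8.5 x 11 inch page are assumptions made for the example, and the size estimate ignores file headers and any compression.

```python
def color_count(bit_depth: int) -> int:
    """Number of distinct values one pixel can take at the given bit depth."""
    return 2 ** bit_depth


def uncompressed_bytes(width_in: float, height_in: float, dpi: int, bit_depth: int) -> int:
    """Approximate size of an uncompressed scan, ignoring headers and compression."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    total_bits = width_px * height_px * bit_depth
    return (total_bits + 7) // 8  # round up to whole bytes


if __name__ == "__main__":
    for depth in (1, 8, 24):
        print(f"{depth:>2}-bit: {color_count(depth):,} possible values per pixel")
    # Hypothetical 8.5 x 11 inch page scanned in 24-bit color
    for dpi in (300, 600, 1200):
        size_mb = uncompressed_bytes(8.5, 11, dpi, 24) / 1_000_000
        print(f"8.5 x 11 in at {dpi} dpi, 24-bit: about {size_mb:.1f} MB")
```

Doubling the dpi quadruples the number of pixels, whereas moving from 8 bit to 24 bit only triples the size, so the resolution setting usually dominates storage requirements.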
The higher the bit depth, the greater the number of grayscale and color tones.

4.4. Resolution

The resolution determines the quality of the images. Resolution is often expressed as an array, i.e. the number of pixels across both dimensions of an image (or more simply as 3000 pixels across the long side), as dpi (dots per inch), or as ppi (pixels per inch). Higher dpi settings give better image quality because they place more pixels (and therefore more information) in an inch than the lower dpi settings do. However, the higher the dpi, the larger the file size. You must take into account your server or computer storage capacity when determining resolution settings. Scanning at a high resolution is recommended if you are planning to convert an important collection into digital form to increase access and to build a virtual archive, generate "archival" images, or make prints of the digital image on a good printer.

5. SCANNER SOFTWARE

Two types of software are needed for most digital imaging projects. The first is the scanning software that comes with the scanner. The second is the image editing software, normally applied to the image after it has been scanned. Some software, such as Adobe Photoshop, can serve as both the scanning software and the image editing software. Scanning software is usually limited in its functionality. You should choose scanning software that is at least capable of saving image files in standard formats such as TIFF, JPG, GIF, etc. Software that converts image files from one format to another is also useful.

6. HOW TO STORE YOUR FILES?

Generally the following media are used to save files:

   o CD
   o DVD
   o Tape
   o Hard disk

7. EDITING OF IMAGES

Collections for a digital library often include cartographic materials, pictures, sketches etc. Such collections are common in museums and special libraries. The images can be captured using a scanner. Few source documents are in perfect condition: they may be discolored, stained, torn or otherwise difficult to read.

There are three levels of image editing tools: basic, intermediate and professional. Basic tools are usually inexpensive and easy to use, but they lack the ability to remove tears and scratches from images. Libraries and archives should avoid basic level tools. Intermediate tools such as Jasc Paint Shop Pro 5 also lack the ability to remove scratches and tears from images. Adobe Photoshop and Corel Photo-Paint are two widely used professional image editing tools. Both products have more features than a library or archive is likely to need.

7.1. Image Formats

In the Internet era, the two most widely used file formats for images are GIF (Graphics Interchange Format) and JPEG (Joint Photographic Experts Group). These two formats have been picked because they are small and fast, and between them they are capable of displaying any type of picture. PNG (Portable Network Graphics) is a newer format invented specifically for the web, and it is hoped that it will take over as the one dominant format. The advantages and disadvantages of the different file formats are given below (2):

JPEG:
   o Good for use with:
        - Photographs
        - Game screenshots
        - Movie stills
        - Desktop backgrounds
        - File size is smaller
   o Bad for use with:
        - Windows application screenshots
        - Line art and text
        - Anywhere where fine lines or sharp color contrast is needed

PNG:

   o Good for use with:
        - Text, line art, comic-style drawings, general web graphics
        - Windows application screenshots
        - When absolutely 100% quality is required (24 bit)
        - When alpha channel support is required
        - As a general replacement for anything that is a non-animated GIF
   o Bad for use with:
        - Photos, in-game screenshots (only when quality is not important and you're looking for small files)
        - Disappointing browser support from Microsoft and others

GIF:
   o Good for use with:
        - Where animations are absolutely required
        - Widespread browser support
   o Bad for use with:
        - Patented, legal technicalities
        - Large file sizes compared to PNG for the same quality
        - Obsolete

8. OCR TECHNOLOGY AND EDITING OF TEXT

OCR (Optical Character Recognition) technology converts printed characters into electronic ones that can be processed by a computer. Image scanners translate graphics (e.g., line art, photographs) into digitized images for computer processing. Both OCR and image scanners use a similar technique to convert their data. Software within the computer determines how the data is to be manipulated and eventually stored (1).

Hard Copy -> Scanner -> Scanned Image -> OCR System -> Digitized file

Figure: OCR system

OCR comprises two steps in translating characters on a page into a digitized form:

   i. Optical Scanning
   ii. Recognition System

With text, scanning is only the first step. The OCR software converts the analog signal pattern from each pixel and digitizes it into a matrix of binary data. This data table, stored in RAM, is then checked against a table of characters stored in programmable read-only memory (PROM). The OCR software compares the data against its set of characters and converts successful character matches into ASCII format.

A few years ago, most OCR systems in the market could not read dot matrix print, and few, if any, could read typeset documents.
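The recognition step described above (a binary matrix checked against a stored table of characters, with successful matches emitted as ASCII codes) can be illustrated with a toy example. This is a minimal sketch under stated assumptions, not how any particular OCR product works: the 3x5 bitmaps in TEMPLATES, the function names and the max_mismatches threshold are all hypothetical, and plain pixel-by-pixel comparison stands in for the far more elaborate matching used in real systems.

```python
from typing import Optional

# Hypothetical 3x5 reference bitmaps ('1' = dark pixel); real tables are far larger.
TEMPLATES = {
    "A": ("010", "101", "111", "101", "101"),
    "B": ("110", "101", "110", "101", "110"),
    "C": ("011", "100", "100", "100", "011"),
}


def mismatches(glyph, template) -> int:
    """Count pixel positions where the scanned glyph differs from a template."""
    return sum(
        g != t
        for glyph_row, template_row in zip(glyph, template)
        for g, t in zip(glyph_row, template_row)
    )


def recognize(glyph, max_mismatches: int = 2) -> Optional[int]:
    """Return the ASCII code of the closest template, or None if nothing is close enough."""
    best_char = min(TEMPLATES, key=lambda ch: mismatches(glyph, TEMPLATES[ch]))
    if mismatches(glyph, TEMPLATES[best_char]) > max_mismatches:
        return None
    return ord(best_char)


if __name__ == "__main__":
    noisy_a = ("110", "101", "111", "101", "101")  # 'A' with one stray dark pixel
    code = recognize(noisy_a)
    print(code, chr(code), format(code, "08b"))
```

The mismatch threshold lets a slightly damaged glyph still be recognized, while a shape that resembles no stored character is rejected rather than forced into a wrong match.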
Today, many OCR systems are able to read dot matrix print, and several vendors offer scanners that can read almost any font or format (omnifont), including typeset material. The trend toward omnifont readability vastly broadens the range of document sources, from typewriters and daisy wheel printers to laser printers, offset presses, typesetters, photocopiers, and letter quality dot matrix printers. OCR software today also has the option to train the recognition unit to recognize the strokes in manuscripts.

Most scanning software generates TIFF (Tagged Image File Format) files by default. Scanned textual images (TIFF) are not searchable, nor can they be manipulated like a text (ASCII) document. In most computers every alphanumeric character is assigned an ASCII (American Standard Code for Information Interchange) value. For example, the ASCII value for the character 'A' is 65 (decimal) or 01000001 (binary), that of 'B' is 66 (decimal) or 01000010 (binary), and so on. When a page is scanned, the scanner generates an image or picture of the character 'A' in the same size as it appears on the page. The task of the recognition system is to identify the character 'A' and assign the respective ASCII value. Scanned TIFF images are converted to text by the process of Optical Character Recognition (OCR). Some popular OCR software packages are Caere's OmniPage, Xerox's TextBridge and ABBYY FineReader.

The recognition system of OCR technology tries to analyze each character and translate it into machine language. OCR software offers the option of maintaining text and graphics in their original layout, as well as plain ASCII and word processing formats. Through OCR software we can save the file in HTML, DOC and other formats. OmniPage Pro 11.0 can convert the image file directly into the PDF file format, as well as into other formats such as .html, .doc, .xls etc. Acrobat Capture generates the text in a PDF file. The PDF file retains the page layout and fonts, and fits the text into the exact space it occupied in the original image, as in the original document.

9. AUDIO

The songs or speeches we listen to from a tape recorder or radio are in analogue form. To be kept as part of a digital library system, they must first be digitized. Digitization can be done by attaching an audio player to a computer through an audio capture card and then recording the sound to the system. The format in which the sound is stored, for example .wav, .mp3, .midi etc., depends on the choice of the library. The recently developed MP3 format is very compact and takes far less space than uncompressed formats such as WAV, at some cost in fidelity because its compression is lossy.

9.1. Conversion/Capture Analog Vs. Digital

Analog recording converts the audio wave in the air to an ANALOGous electrical signal on tape, record, etc. Digital recording takes measurements (digital snapshots) of the audio wave in the air.

9.2. Simplified Steps To Digitizing And Distributing Digital Audio

According to the Colorado Digitization Project, the following steps are followed to digitize audio (4). Examples are given for each step.

   Step 1. Analog source material: audio cassettes, soundtrack from video recording, reel to reel tape, etc.
   Step 2. Analog to digital conversion: computer with audio card, standalone digital audio
   Step 3. Digital conversion: WAV, MP3, Streaming Real Audio
   Step 4. Digital storage: computer hard disk, CD-R, digital audio tape, other digital media
   Step 5. Delivery/Distribution: on-line (Internet or Intranet), portable media
   Step 6. Playback: computer, audio CD player, portable digital audio player, MP3 player

Once the sound has been recorded in digital format, any noise should be removed. Multimedia and noise reduction software available in the market can be used for this audio processing. The idea of noise reduction is to keep the frequencies that are required for hearing and filter out the rest. Recently Dolby Labs has come up with a new technique called AC3, in which the signal is first shifted to the frequency range most suitable for human hearing and then filtered, so that only the part with frequencies near the desired audio remains.

10. VIDEO

Like audio, video capture requires recording through a video capture card with input from a video cassette player or movie camera. An example is the ATI All-in-Wonder video capture card, which takes input from a TV antenna, cable, VCR or movie camera. Nowadays there are also digital video cameras, which record any live event directly in digital form. A wide variety of digital video cameras is on the market, from companies like Panasonic, Kodak, etc. The digitized file can be saved in the .mov, .avi, DivX or MPEG file format. Achieving 35 mm film quality is a problem because it takes enormous space. For video compression a CODEC (compressor/decompressor, or coder/decoder) is generally used. It is a piece of software which compresses the data and then decompresses it on playback. The problem with the decompressed image is that it has lost some data and is degraded from the original, which affects the sharpness of the image.

11. CONCLUSION

Digitization is the first step in building digital libraries. Digitization of documents also serves the purpose of preservation for posterity. However, it should be noted that digitization is time consuming and costly, as it involves the acquisition of hardware and software as well as manpower. On the other hand, replicating digital documents is much cheaper than replicating printed documents. Digital documents facilitate search and retrieval and are easily accessible worldwide once they are made available on the Internet.

12. REFERENCES

1. Madalli, D. P. (2002, January 24-31). Optical Character Recognition Technology: An Overview. Paper presented at the Digitization of Print Materials, DLIS, Osmania University, Hyderabad.
2. Sahoo, B. B. (2002, 12 September). Digital Libraries: challenges for librarians, DRTC, Bangalore.
3. When and how to use internet image formats. from http://www.r1ch.net/img-formats/
4. Introduction to Digital Audio. from http://www.cdpheritage.org/resource/audio/documents/digaudio_ppt.pdf
5. Introduction to Digitization and Scanning Technology. from http://www.cdpheritage.org/resource/scanning/std_scanning.htm
6. Image formats. from http://www.yourhtmlsource.com/images/fileformats.html