The Development of Multimedia-Multilingual Document Storage, Retrieval and Delivery System for E-Organization (STREDEO PROJECT)




Asanee Kawtrakul, Kajornsak Julavittayanukool, Mukda Suktarachan, Patcharee Varasrai, Nathavit Buranapraphanont, Chaiwat Ketsuwan, Duangpen Jetpipattanapong, Prakorn Santiwatt, Nattakan Pengphon

Natural Language Processing and Intelligent Information Technology Research Laboratory
Department of Computer Engineering, Faculty of Engineering, Kasetsart University, Bangkok, Thailand 10900
Email: ak@ku.ac.th

Abstract

This paper introduces a new project called STREDEO: The Development of Multimedia-Multilingual Document Storage, Retrieval and Delivery System for E-Organization. STREDEO aims to provide a system for multimedia, multilingual document management that covers storage, retrieval and delivery. The project is divided into seven subprojects: The Development of Multimedia and Multilingual Storage (MUU-DOC); The Development of Processing for Indexing (DIM); The Development of Web-based Intelligent Information Retrieval (WIRE); The Development of Automatic Clustering and Delivery (CLUD); The Development of Multimedia Query Processing: Speech, Text and Handwriting Text (MUL-Q); The Development of Linguistic Knowledge Acquisition and Natural Language Processing Techniques (KANAL); and A Very Large Scale Multimedia Database Management Design and Integrating (INTEGRATE).

Keywords: E-Organization, Natural Language Processing, Processing for Indexing, Automatic Indexing, Automatic Clustering, Very Large Scale Hypermedia Storage and Delivery, Web-based Intelligent Search Engine, Linguistic Knowledge Base, Knowledge Acquisition

1. Introduction

Information technology is expanding very rapidly today, particularly in the fields of communication and networking, and the volume of information is likely to continue growing exponentially.
This growth creates the need to collect very large amounts of information in different languages and media. Table 1 shows estimates of the volume of information produced for different media, together with their growth rates.

Table 1: Worldwide production of original content, stored digitally using standard compression methods, in terabytes, circa 1999 [9]. For each storage medium (paper, film, optical, magnetic) and type of content, the table gives an upper and a lower estimate in terabytes per year and a growth rate in percent. Content types covered: books, newspapers, periodicals and office documents; pictures, movies and X-rays; song CDs, data CDs and DVDs; camcorder tape, PC disk drives, departmental servers and enterprise servers.

The information above can be useful to a wide range of users, from organizations to individuals. However, with so much information available, problems such as excessive search times or system instability can easily arise. Consequently, there is a need to organize and manage this information, which includes organizing and managing the storage system, the retrieval system and the delivery system.

Information technology is already applied to storage and retrieval systems [11]. Examples include large-scale multimedia document storage [1], [6], [12], automatic indexing systems [2], [4] and automatic document clustering systems [3], [8], [10]. However, these systems were created for English and are not applicable to Thai, because Thai has unique characteristics such as the absence of spaces between words and ambiguity between nouns and noun phrases [7], [12].

The STREDEO project aims to develop this technology and apply it to Thai storage, retrieval and delivery systems. The project will help support offices that use only electronic documents and eliminate the use of paper, which benefits the environment. In addition, it can facilitate services and the exchange of information both within and outside Thailand.

2. STREDEO Overview

Figure 1 shows an overview of STREDEO. There are two types of input: text and image. Text can also be collected by a web robot.
A document image is converted into text (not necessarily of high quality) before indexing, and the image itself is kept in the data warehouse. Once there is information in the document warehouse, the system performs document clustering and delivery. If a user enters a natural-language query as text, handwriting or speech, the system retrieves the relevant information or documents for the user. The STREDEO project is divided into seven subprojects, which are described in the following subsections.

Figure 1: The seven subprojects of STREDEO, including system integration
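The overview flow described in this section (accept text or a document image, index the text, keep the original in the warehouse, answer queries) can be sketched roughly as follows. This is only an illustrative sketch, not the STREDEO implementation: the `Warehouse` class, its method names and the `ocr` callable are hypothetical, and whitespace tokenization is an assumption that would have to be replaced by a word segmenter for Thai, which has no spaces between words.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Warehouse:
    """Toy document warehouse: stores originals plus a word-level inverted index."""
    originals: dict = field(default_factory=dict)  # doc_id -> original text or image bytes
    index: dict = field(default_factory=lambda: defaultdict(set))  # term -> set of doc_ids

    def _index_text(self, doc_id, text):
        # Assumption: whitespace tokenization (not suitable for Thai as-is).
        for term in text.lower().split():
            self.index[term].add(doc_id)

    def add_text(self, doc_id, text):
        self.originals[doc_id] = text
        self._index_text(doc_id, text)

    def add_image(self, doc_id, image, ocr):
        # The image is converted to (possibly rough) text only for indexing;
        # the image itself is what is kept in the warehouse.
        self.originals[doc_id] = image
        self._index_text(doc_id, ocr(image))

    def search(self, query):
        # Return the ids of documents that contain every query term.
        terms = query.lower().split()
        if not terms:
            return set()
        hits = [self.index.get(term, set()) for term in terms]
        return set.intersection(*hits)
```

A caller would pass any function that maps an image to text as `ocr`; the sketch deliberately leaves clustering, delivery and non-text queries out.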

2.1. The Development of Multimedia and Multilingual Storage (MUU-DOC)

MUU-DOC is an important subsystem of the project. Its main function is to analyze the information in a document or document image for indexing and storing. Figure 2 shows the scope of MUU-DOC.

Figure 2: The Development of Multimedia and Multilingual Storage (MUU-DOC)

2.2. The Development of Processing for Indexing (DIM)

Text from books or papers that is to be indexed and stored in the corpus must otherwise be typed manually, which is time-consuming and tedious. DIM is the part of MUU-DOC that roughly analyzes and recognizes the text in scanned document images, making it more convenient to index a large number of documents. It greatly reduces the time and human effort spent typing text, which speeds up feeding data into the Multimedia and Multilingual Storage. DIM has four main tasks:

- Improving the scanned image to correct scanning problems.
- Layout analysis, to distinguish text regions from picture regions.
- Character segmentation, to separate characters that are connected because of scanning or because of the font.
- Character recognition, to convert text images into text characters.
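To make the segmentation task listed above more concrete, the sketch below uses projection profiles, a common baseline for splitting a binarized page into text lines and a line into ink runs. The paper does not state which segmentation method DIM uses, so this technique is an assumption chosen for illustration, and the function names are hypothetical. It only finds blank gaps; separating characters that actually touch, which is what DIM targets, would need more than this.

```python
def ink_runs(profile, threshold=0):
    """Return (start, end) pairs, end exclusive, of consecutive entries above threshold."""
    runs, start = [], None
    for i, value in enumerate(profile):
        if value > threshold and start is None:
            start = i                      # a run of ink begins
        elif value <= threshold and start is not None:
            runs.append((start, i))        # a gap closes the current run
            start = None
    if start is not None:
        runs.append((start, len(profile)))
    return runs


def segment_lines(image):
    """image: list of rows, each a list of 0 (background) or 1 (ink)."""
    row_profile = [sum(row) for row in image]
    return ink_runs(row_profile)


def segment_columns(image, top, bottom):
    """Split the band of rows top..bottom-1 into horizontal ink runs."""
    band = image[top:bottom]
    col_profile = [sum(col) for col in zip(*band)]
    return ink_runs(col_profile)
```

Each pair returned by `segment_lines` can be passed as `top` and `bottom` to `segment_columns` to obtain candidate character regions for a recognizer.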

Converting images of typed Thai characters into a text document uses syntactic methods, fuzzy logic and feature extraction. To make the system more practical, this subproject is designed to focus not only on character recognition but also on image processing and character segmentation. Figure 3 shows an example of such a system.

Figure 3: Image processing system (stages shown: improving, layout analysis, line segmentation, character segmentation, character recognition, text document)

2.3. The Development of Web-based Intelligent Information Retrieval (WIRE)

The growth of information technology and of Internet use is causing the number of electronic documents to increase exponentially. Consequently, searching for information is a nontrivial problem, and a web-based intelligent information retrieval system, called WIRE, is needed. WIRE is a prototype system capable of searching bilingual (Thai-English) text. It is divided into two parts: a query processing system and a searching system.

The query processing system transforms the user's query words into a multilevel query, with levels such as the word level, the phrase level and the sentence level. For example, if the query is "What is an internet address?", the query processing system generates a multilevel query with "internet address" at the phrase level and "networking" at the conceptual level. In addition, the query processing system lets a user phrase the query in different ways, for example "address of internet" or "address on internet", and still yields the same result.

Since the query processing system produces multilevel queries, the searching system must also be capable of searching at multiple levels. It does so by searching at the phrase level first, then at the word level, and finally at the conceptual level.
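The multilevel query and the phrase, word, concept search order described above can be sketched as follows. This is a minimal illustration under several assumptions that are not in the paper: the stopword list and the one-entry thesaurus `CONCEPTS` are hypothetical, a phrase is treated as an order-free window of adjacent content words, and the search falls back to the next level only when the current level finds nothing.

```python
import re

# Hypothetical stopword list and thesaurus, for illustration only.
STOPWORDS = {"what", "is", "an", "a", "the", "of", "on", "in"}
CONCEPTS = {frozenset({"internet", "address"}): "networking"}


def tokens(text):
    """Lowercase content words of an English text."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]


def build_multilevel_query(query):
    words = tokens(query)
    phrase = frozenset(words)  # order-free, so "address of internet" matches too
    return {"phrase": phrase, "words": words, "concept": CONCEPTS.get(phrase)}


def has_phrase(doc_tokens, phrase):
    """True if some window of adjacent content words equals the phrase."""
    n = len(phrase)
    return n > 0 and any(
        frozenset(doc_tokens[i:i + n]) == phrase
        for i in range(len(doc_tokens) - n + 1)
    )


def search(docs, query):
    """Try the phrase level, then the word level, then the concept level."""
    q = build_multilevel_query(query)
    doc_tokens = {doc_id: tokens(text) for doc_id, text in docs.items()}
    levels = [
        ("phrase", lambda t: has_phrase(t, q["phrase"])),
        ("word", lambda t: any(w in t for w in q["words"])),
        ("concept", lambda t: q["concept"] is not None and q["concept"] in t),
    ]
    for name, matches in levels:
        hits = sorted(d for d, t in doc_tokens.items() if matches(t))
        if hits:
            return name, hits
    return None, []
```

With this sketch, "What is an internet address?" and "address of internet" produce the same multilevel query, which mirrors the behaviour the section describes for differently phrased queries.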

2.4. The Development of Automatic Clustering and Delivery (CLUD)

As mentioned above, the growth of information technology and of Internet use is causing the number of electronic documents to increase exponentially, so electronic documents need to be arranged into groups. Done by hand, this task is time-consuming, ineffective and very tedious. Therefore, a system is needed that can cluster electronic documents automatically, effectively and accurately [10]. In addition to document clustering, the system can forward each document to the right users.

2.5. The Development of Multimedia Query Processing: Speech, Text and Handwriting Text (MUL-Q)

Today, all queries are entered using a keyboard. To make the system friendlier, MUL-Q is proposed as a multimedia query processing system that lets users query STREDEO by speech and handwriting. This project is limited to recognizing discontinuous speech with domain-based vocabularies. Queries can also be handwritten. Handwritten character recognition (HCR) is more difficult than OCR, so this project is limited to processing neat handwriting only.

2.6. The Development of Linguistic Knowledge Acquisition and Natural Language Processing Techniques (KANAL)

Research in natural language processing is important to the development of document processing because it leads to a better understanding of human language. This subproject aims to develop a linguistic knowledge acquisition system and natural language processing techniques to support document processing in indexing, clustering and querying.

2.7. A Very Large Scale Multimedia Database Management Design and Integrating (INTEGRATE)

Developing software and databases for very large-scale multimedia systems involves many problems, for example connecting the modules together and controlling the schedule and quality of each module.
Since the STREDEO project has seven subprojects, such problems will occur without good planning. The objective of this subproject is therefore to design and develop the software architecture, plan the development direction, provide plug-in modules, and provide testing and maintenance services via the network, by applying software engineering techniques.

3. Conclusion

Information technology today has shown the need to store, query, search, retrieve and deliver large amounts of electronic information efficiently and accurately. This paper introduces the STREDEO project, which addresses the growing number of electronic documents. The STREDEO project consists of seven subprojects. The first, MUU-DOC, focuses on multimedia and multilingual document storage. The second, DIM, focuses on a document image processing system for indexing. The third, WIRE, focuses on web-based intelligent information retrieval. The fourth, CLUD, focuses on automatic document clustering and delivery. The fifth, MUL-Q, focuses on multimedia query processing such as speech, text and handwritten text. The sixth, KANAL, focuses on linguistic knowledge acquisition and natural language processing techniques. The last, INTEGRATE, focuses on very large scale multimedia database management design and on integrating STREDEO.

4. References

[1] Andres, F. 2000, Active Hypermedia Delivery and PHASME Information Engine, in Proceedings of AdInfo2000, First International Symposium on Advanced Informatics 1: pp7-44.
[2] Chengxing, Z. 1995, Evaluation of syntactic phrase indexing: CLARIT NLP Track Report, Text Retrieval Conference 4, New York, p5.
[3] Cohen, W. W. 1996, Learning rules that classify e-mail, in Proceedings of the 1996 AAAI Spring Symposium on Machine Learning in Information Access, pp. 18-25.
[4] Dik, L. 1997, Information storage and retrieval, 2nd ed., Prentice Hall Publishing Company, New York. 40 p.
[5] Kawtrakul, A. et al. 2000, Multi-Feature Extraction for Printed Thai Character, SNLP 2000 Symposium of Natural Language Processing.
[6] Kawtrakul, A. et al. 2000, Toward on Enhancement of Textual Database Retrieval by Using NLP Technique, NECTEC Technical Journal, Vol.11 No.7, March-June 2000.
[7] Kawtrakul, A. and Thumkanon, C. 1997, A statistical Approach for Thai Morphological International Conference, China.
[8] Lang, K. 1995, NewsWeeder: learning to filter netnews, in Proceedings of ICML-95, 12th International Conference on Machine Learning, pp. 331-339.
[9] Lyman, P. and Varian, H. R. 1999, How much Information? [online] http://www.sims.berkeley.edu/how-much-info
[10] Sebastiani, F. 1999, A Tutorial on Automated Text Categorisation, in Analia Amandi and Alejandro Zunino (eds.), Proceedings of ASAI-99, 1st Argentinian Symposium on Artificial Intelligence, Buenos Aires, AR, pp. 7-35.
[11] Frakes, W. B. and Baeza-Yates, R. 1992, Information Retrieval: Data Structures & Algorithms, Prentice Hall, Englewood Cliffs, New Jersey. p504.
[12] Kawtrakul, A., Andres, F., Ono, K. et al. 2000, The Implementation of VLSHDS Project for Thai Retrieval, in Proc. First International Symposium on Advanced Informatics, Tokyo, Japan.