Cloud Storage-based Intelligent Document Archiving for the Management of Big Data




Keedong Yoo
Dept. of Management Information Systems, Dankook University, Cheonan, Republic of Korea

Abstract: Cloud storage for the centralized management of organizational big data is attracting considerable interest because of its benefits in managing and securing information resources. However, a cloud storage-based centralized repository also poses usability problems: it is difficult to determine the proper category under which to store a working document, and retrieving a document is complex. This paper proposes a methodology that resolves these problems by automating two processes: identifying the topic of a working document, and storing the document under the matching topic-based category of the cloud storage-based central repository. The document is stored under the identified category automatically, without the user having to define its title directly. To demonstrate the validity of the proposed concepts, a prototype system that performs automatic topic identification, automatic category searching, and automatic archiving is implemented.

Keywords: Document centralization; Intelligent archiving; Automatic topic identification; Cloud storage

I. INTRODUCTION

Centralized management of documents, or document centralization, is emerging as an indispensable means of strategically securing and utilizing organizational intellectual assets. Enterprise Content Management (ECM), the strategies, methods, and tools used to capture, manage, store, preserve, and deliver content and documents related to organizational processes, is a typical example of centralizing resources and efforts to manage organizational documents and content. An organization's intellectual assets include not only business-related documents and content, but also business processes and information technologies.
Therefore, through document centralization, organizations can expect to apply information resources efficiently by allocating them to the proper processes and tasks. Centralization also enables tightly secured management of internal resources, so many companies are now trying to establish robust and scalable systems for it. Organizational documents can be centralized over the network infrastructure, and the most commonly applied network technology is cloud computing. Among cloud computing technologies, cloud storage forms the repository that stores transmitted documents. Cloud storage, one of the most widely known cloud computing technologies, provides Internet-based data storage as a service. One of its biggest merits is that users can access data in a cloud anytime and anywhere, using any device [5]. Typical examples of cloud storage services are Amazon S3 (http://aws.amazon.com/s3), Mosso (http://www.rackspacecloud.com), Wuala (http://www.wuala.com), and ucloud (http://www.ucloud.com). All of these services offer users clean and simple storage interfaces, hiding the details of the actual location and management of resources [8]. Once a document is stored in cloud storage, users can access and download it anytime and anywhere, provided that access rights have been granted. Because of this advantage in utilizing organizational information resources, more companies and organizations are implementing online storage in the cloud computing environment.

While cloud storage delivers various benefits, it also has a number of technical limitations in network security and privacy [9]. From the viewpoint of usability, many users also point out a serious problem: the difficulty of storing and retrieving documents. To store a working document under one of the categories provided by the cloud storage, a user has to determine the category that exactly matches the document's contents. Because categories are numerous and their overall structure is complicated, determining the proper category is not easy. When retrieving a document of interest, the user has to spend considerable time locating the file because too many categories exist. The user can be assisted in choosing the category by analyzing the contents of the document with respect to the categories defined in the cloud storage. Since the keywords or topics extracted from the document indicate the name of the category under which it should be stored, users can easily achieve their goal.

and Applied Computing ( ICIEACS 2013 ), Bangkok, Thailand on April 6-7, 2013 Page..133
In retrieval, searching becomes faster and more accurate because each document has been archived in a topic-based category. This research tries to enhance the usability of a cloud storage-based central repository by automatically archiving working documents according to their automatically identified topic or keyword. To do so, it proposes a methodology that extracts the predefined, category-specific keywords (or topics) of a working document by applying a text mining algorithm. Based on the extracted keywords, documents can be stored automatically in the categories of the cloud storage. To demonstrate the validity of the proposed concepts, a prototype system that performs automatic topic identification, automatic category searching, and automatic archiving is implemented.

II. PROPOSED METHODOLOGY

As Fig. 1 illustrates, automating the whole process of cloud storage-based archiving additionally requires a process that automatically identifies the topics (keywords) of the working document. The tasks in the dotted ellipse are required for automated topic identification, and they must be processed sequentially. Once the topic of the given working document is identified, the document can be automatically stored in cloud storage with the topic. Automatic document archiving is completed by program code that searches for the directory corresponding to the topic and saves the document with the topic. When the destination directory has been determined, the system may send a message asking the user to confirm that the directory is valid. The user may change the directory as he/she intends; in that case the system also automatically stores the document in the location the user designated, by simply applying agent programming. The specific role of each task is as follows.

Figure 1. Conceptual Framework of the Proposed Methodology

A. File Converter

A file converter changes the format of a working document (one of .doc, .xls, or .html) to an analyzable one (.txt) so that the following module can read the contents. It acts as a file format filter that brings input documents into a unified format (.txt in this research).

B. Word Stemming Module

To standardize the words in the document, unnecessary or redundant parts of each word must be eliminated. A stem, in linguistics, is the combination of the basic form of a word (called the root) plus any derivational morphemes, excluding inflectional elements. Put differently, the stem is the form of the word to which inflectional morphemes can be added, if applicable. For example, the root of the English verb form "destabilized" is stabil- (an alternate form of stable); the stem is de-stabil-ize, which includes the derivational affixes de- and -ize, but not the inflectional past tense suffix -(e)d.

C. Word Vector Tool

Based on the word stems from the word stemming module, the word vector tool transforms the word stems into a vector. To extract the vector, TF/IDF (Term Frequency/Inverse Document Frequency) is used. TF/IDF is a statistical technique used to evaluate how important a word is to a document.
The importance increases proportionally to the number of times a word appears in the document, but is offset by how common the word is across all documents in the collection, or corpus. A high TF/IDF weight is reached by a high term frequency in the given document and a low document frequency of the term in the whole collection; the weights hence tend to filter out common terms. The word with the highest TF/IDF value can be regarded as a keyword. To determine the keyword of a given document, however, the TF/IDF value of every meaningful term (stem) usually has to be calculated.
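As a minimal sketch of the weighting just described, the following Python snippet computes one common TF/IDF variant, tf(t, d) * log(N / df(t)), over a toy corpus of already stemmed tokens. The function names, the toy corpus, and the choice of formula variant are assumptions made for illustration; the prototype relies on Yale's vector creating tool, whose exact weighting is not stated in this paper.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document.

    docs is a list of token lists (already stemmed). The weight is
    tf(term, doc) * log(N / df(term)), one common TF/IDF variant.
    """
    n_docs = len(docs)
    # Document frequency: the number of documents each term occurs in.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


def top_keyword(doc_weights):
    """Term with the highest TF/IDF weight in one document."""
    return max(doc_weights, key=doc_weights.get)


# Hypothetical stemmed documents, invented for this example.
corpus = [
    ["confer", "paper", "submit", "confer", "travel"],
    ["lectur", "student", "grade", "paper"],
    ["research", "grant", "paper", "budget"],
]
scores = tf_idf(corpus)
```

In this toy corpus the stem "paper" occurs in every document and therefore receives weight 0, while "confer" receives the highest weight in the first document, which matches the filtering behavior described above.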

D. Classifier

A classifier extracts the resultant keywords by projecting the word vectors of the target document onto the vector space obtained by training on a corpus. A corpus is a set of predefined directories, each of which holds many related example documents. To train the classifier on the constructed corpus, sample documents can be collected by browsing conventional Web pages. Because conventional Web pages are already labeled with corresponding keywords as titles, the title of each document can, in a sense, be deemed to be already formalized [4]. This research deploys an SVM-based classifier, because the SVM has been shown to outperform similar text mining algorithms applicable to topic identification [1, 6]. The SVM determines the keyword of a document by placing the word vectors in the vector space R^n (n: number of dimensions) and comparing the kernel functions of each document. The accuracy of the SVM has been verified to be very satisfactory: if the prediction model has been trained sufficiently, the SVM outputs highly accurate results. Measured against manual classification, the accuracy of SVM-based classification was reported to be over 90% [3].

III. PROTOTYPE IMPLEMENTATION

A. Overview

To check whether the proposed methodology yields correct results, a prototype system is implemented that intelligently stores working documents in the cloud storage-based repository under the automatically identified topic-specific category. The prototype analyzes and extracts the topic of a working document in real time, when the user finishes writing and tries to store the document. It indexes the document by tagging the identified topic with the user's ID and the time, and then transmits and stores the document in the cloud storage. A dialogue between the user and the prototype confirms the correctness of the identified topic.
If the user accepts the recommended topic, the prototype transmits the file to the cloud storage with the tag information, which completes automatic archiving. Fig. 2 shows the sequence of functions provided by the prototype.

Figure 2. Sequence Diagram of the Prototype System

The prototype system has been implemented using JDK v1.5.0_06 under the Java 2 runtime environment. The Stemmer and WV Tool submodules have been implemented using, respectively, the word stemming tool and the vector creating tool of Yale, an open source environment for KDD (Knowledge Discovery and Data mining) and machine learning [7]. The SVM module, the classifier, deploys LibSVM v2.81 [2]. The prototype processes documents related to activities within a university context; therefore, for convenience of implementation and testing, the categories of the cloud storage have been obtained by simplifying the University Ontology defined by the Department of Computer Science of the University of Maryland (http://www.cs.umd.edu/projects/plus/shoe/onts/univ1.0.html).

B. Example: A Document on Conference Participation

To explain in detail how the prototype system works, a document concerning "Conference Participation" is used as an example. The prototype begins by converting the .doc, .xls, or .html document into the analyzable .txt format. Fig. 3 shows the example document in .txt format.

Figure 3. Example Document based on .txt format

To extract the topic of the given document, the words in the document must be refined so that only the meaningful part of each word is used as input. Meaningless words, or stop words, must be eliminated in advance, and the stem of each meaningful word must be separated. Fig. 4 shows the stems produced by the stemming module.
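The refinement step can be illustrated with a short Python sketch: tokens are lower-cased, stop words are dropped, and a suffix is stripped from each remaining word. The stop-word list, the suffix list, and the function names are all assumptions invented for this example, and the crude suffix rule is only a stand-in for a real stemmer; it is not the Yale word stemming tool used by the prototype.

```python
import re

# Tiny illustrative stop-word list; a real system would use a fuller one.
STOP_WORDS = {"a", "an", "and", "at", "be", "for", "in", "is",
              "of", "on", "the", "to", "will"}

# Suffixes are tried in this order; longer ones come first.
SUFFIXES = ("ations", "ation", "ings", "ing", "ed", "es", "s")


def crude_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Lower-case, tokenize, drop stop words, and stem what remains."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [crude_stem(token) for token in tokens if token not in STOP_WORDS]


# Hypothetical sentence, not taken from the paper's example document.
stems = preprocess("The papers will be presented at the conference in Bangkok")
```

Here `stems` evaluates to `["paper", "present", "conference", "bangkok"]`. A linguistically informed stemmer would go further, for instance separating derivational from inflectional affixes as described in Section II.B.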

Figure 4. Word Stems in Example Document

Based on the meaningful stems and the predefined categories, the vector of the example document can be calculated. Because only 9 categories are selected and considered in this research, the resultant vector is relatively simple compared with those of previous studies. This simplicity results from the simple structure of the predefined categories; however, it poses no problem for demonstrating the performance of topic identification, because not a few previous studies also reduced the number of predefined categories for ease of training and prediction. In this research, the categories in the University Ontology have been modified so that they are consistent and compatible overall. Fig. 5 shows the specific categories used in this research and the resultant vector.

Figure 5. Predefined Category & Resultant Vector of Example Document

The SVM module, the classifier, projects the vector of the document onto a 9-dimensional vector space and determines which category topic best represents the contents of the document. Before performing actual categorization, the SVM module must be trained using a sufficient number of sample documents already assigned to each category. In this research, 80 to 90 sample documents per category were used to train the SVM module, and the accuracy of category prediction, estimated automatically by LibSVM, is 92.5% (MSE=1.025, SCC=0.825273), which means that 37 out of 40 example documents are correctly classified. Fig. 6 shows the resultant category number provided by the SVM module; the number 2.0 denotes the third category, "Conference".
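The two numeric facts above, the accuracy of 37 out of 40 and the reading of label 2.0 as the third category, can be checked with a small Python sketch. Only the name "Conference" and its zero-based position come from the text; the other eight category names are placeholders invented here, and the label lists are constructed solely to reproduce the reported counts.

```python
# Only "Conference" (index 2) is named in the text; the other eight
# category names are placeholders invented for this sketch.
CATEGORIES = (["Category0", "Category1", "Conference"]
              + [f"Category{i}" for i in range(3, 9)])


def decode_label(svm_output):
    """Map a numeric class label such as 2.0 to a zero-based category name."""
    return CATEGORIES[int(round(svm_output))]


def accuracy(predicted, actual):
    """Fraction of predictions that equal the true labels."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


# Constructed labels: 40 test documents, 37 of them classified correctly.
actual = [2.0] * 40
predicted = [2.0] * 37 + [0.0] * 3
reported = accuracy(predicted, actual)
```

Under these assumptions `reported` is 0.925, i.e. 92.5%, and `decode_label(2.0)` returns "Conference" because the labels start at 0.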

Figure 6. Resultant Category (Category 2.0 Conference)

Since the identified topic denotes the target category under which the document is to be archived, it must be tagged onto the document. To avoid duplicated saving, in which different documents are tagged with the same topic, the user's ID and the time at which the document was completed need to be tagged together with the topic, using simple programming as follows:

[Code listing not preserved.]

After indexing with the tag information, the target category in which to store the document must be determined. The University Ontology originally consists of 30 entities; although only 9 categories were selected to determine the topic of the document, the cloud storage of the central repository is set to have 30 categories. The document is therefore stored under one of the 30 categories. A hash function is suitable for this job, and the corresponding program code is as follows:

[Code listing not preserved.]
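The paper's own listings for these two steps are not reproduced in this text, so the following Python sketch is only an illustration of the idea, not the author's Java code. It builds a "topic-id-date" tag and looks up the target directory in a hash table (a Python `dict`). The function names, the timestamp format, the repository root, the user ID, and the category names other than "Conference" are all hypothetical.

```python
from datetime import datetime


def make_tag(topic, user_id, finished_at):
    """Build a 'topic-id-date' tag so that equal topics do not collide."""
    return f"{topic}-{user_id}-{finished_at:%Y%m%d%H%M%S}"


def build_category_index(category_names, root="/repository"):
    """Hash table from lower-cased category name to its directory path."""
    return {name.lower(): f"{root}/{name}" for name in category_names}


def find_category_dir(index, topic):
    """Average constant-time lookup of the directory for a topic."""
    try:
        return index[topic.lower()]
    except KeyError:
        raise LookupError(f"no category directory for topic {topic!r}") from None


# Hypothetical category names; the paper's repository has 30 of them.
index = build_category_index(["Conference", "Course", "Publication"])
tag = make_tag("Conference", "user01", datetime(2013, 4, 6, 9, 30, 0))
target_dir = find_category_dir(index, "Conference")
```

With these inputs the tag is "Conference-user01-20130406093000" and the target directory is "/repository/Conference". Keying the table on the category name keeps the lookup cost independent of whether the repository holds 9 or 30 categories.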

Once the target category has been determined, the document must be saved with the file name "topic-id-date" under that category, as the following code shows:

[Code listing not preserved.]

During automatic archiving, the automatically identified topic of the document must be confirmed by the user so that the document is not saved under a wrong category. After automatic archiving is complete, a message informs the user of the category in which the document has been stored. These dialogues between the system and the user can be processed with the following code:

[Code listing not preserved.]
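Because the original listings are not available in this text, the saving and confirmation steps are sketched below in Python under stated assumptions; this is not the prototype's Java code. The `archive` function, its `confirm` callback, the console prompt, and the file names are hypothetical, and a local directory stands in for the cloud storage category.

```python
import shutil
import tempfile
from pathlib import Path


def archive(document, category_dir, tag, confirm):
    """Copy document into category_dir as '<tag><suffix>' if the user agrees.

    confirm is a callable that receives the proposed directory and returns
    True to accept it or False to cancel. Returns the new path, or None
    when the user declines.
    """
    category_dir = Path(category_dir)
    if not confirm(category_dir):
        return None
    category_dir.mkdir(parents=True, exist_ok=True)
    target = category_dir / f"{tag}{Path(document).suffix}"
    shutil.copy2(document, target)
    return target


def console_confirm(category_dir):
    """Ask on the console whether the proposed category is acceptable."""
    answer = input(f"Store under '{category_dir.name}'? [y/n] ")
    return answer.strip().lower().startswith("y")


# Demonstration in a throwaway directory, auto-accepting the proposal.
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "draft.txt"
    source.write_text("Call for papers ...")
    stored = archive(source, Path(tmp) / "Conference",
                     "Conference-user01-20130406093000", lambda d: True)
    stored_name = stored.name
    stored_ok = stored.exists()
```

Passing the confirmation step in as a callable keeps the archiving logic independent of the user interface: `console_confirm` can be swapped for a dialog box, and the returned path can be shown to the user as the closing message about where the document was stored.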

IV. CONCLUDING REMARKS

The benefits of using cloud storage in companies and organizations are considerable, because it promotes effective and efficient sharing of organizational information and knowledge regardless of time and place. If usability issues around cloud computing are not resolved realistically, however, both the benefits and the interest may dissipate. This research tries to resolve one such usability issue by suggesting practical guidance that relieves the user's burden of selecting directories in cloud storage. The proposed methodology, which identifies the topics of working documents and stores the documents according to the identified topics in an automated manner, can contribute to higher productivity and convenience of work. Companies can also expect more concentrated management of organizational information and knowledge through the proposed concepts, because they support more accurate and secure processing of the organizational document archive.

This research must be extended, however, so that the proposed methodology can be applied to mobile devices such as smartphones and smartpads, which are essential items for current users. To meet this requirement, wireless-communication-oriented networking protocols must additionally be considered. In addition, a formal corpus needs to be developed to improve the performance of topic identification, because the accuracy of text mining depends mainly on the result of training on the corpus. Since the corpus may have the same structure as the directory of the cloud storage, this adjustment can also reinforce the practical application of automatic document archiving.

ACKNOWLEDGEMENT

REFERENCES

[1] Basu, A., Watters, C., & Shepherd, M., Support Vector Machines for Text Categorization, Proceedings of the 36th Hawaii International Conference on System Sciences, Vol.4, 2003.
[2] Chang, C. & Lin, C., LIBSVM: a library for support vector machines, ACM Transactions on Intelligent Systems and Technology, Vol.2, No.3, 1-27, 2011.
[3] Hsu, C.W., Chang, C.C., & Lin, C.J., A Practical Guide to Support Vector Classification: LibSVM Tutorial, available at http://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf, 2001.
[4] Kim, S., Suh, E., & Yoo, K., A study of context inference for Web-based information systems, Electronic Commerce Research and Applications, Vol.6, 146-158, 2007.
[5] Liu, Q., Wang, G., & Wu, J., Secure and privacy preserving keyword searching for cloud storage services, Journal of Network and Computer Applications, Vol.35, No.3, 927-933, 2012.
[6] Meyer, D., Leisch, F., & Hornik, K., The support vector machine under test, Neurocomputing, Vol.55, 169-186, 2003.
[7] Mierswa, I., Wurst, M., Klinkenberg, R., Scholz, M., & Euler, T., YALE: Rapid Prototyping for Complex Data Mining Tasks, Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-06), 2006.
[8] Pamies-Juarez, L., García-López, P., Sánchez-Artigas, M., & Herrera, B., Towards the design of optimal data redundancy schemes for heterogeneous cloud storage infrastructures, Computer Networks, Vol.55, 1100-1113, 2011.
[9] Svantesson, D. & Clarke, R., Privacy and consumer risks in cloud computing, Computer Law & Security Review, Vol.26, 391-397, 2010.