Cloud Storage-based Intelligent Document Archiving for the Management of Big Data

Keedong Yoo
Dept. of Management Information Systems
Dankook University
Cheonan, Republic of Korea

Abstract: Cloud storage for the centralized management of organizational big data is gaining much interest because of its benefits in managing and securing information resources. However, a cloud storage-based centralized repository also has usability problems: it is difficult to determine the proper category under which to store a working document, and retrieving a document is complex. This paper proposes a methodology that resolves these problems by automating the processes of identifying the topic of a working document and storing it under the identified topic-based category of the cloud storage-based central repository. Without the user directly defining the title of a working document, the document can be automatically stored under the identified topic-based category in the central repository. To demonstrate the validity of the proposed concepts, a prototype system that performs automatic topic identification, automatic category searching, and automatic archiving is implemented.

Keywords: Document centralization; Intelligent archiving; Automatic topic identification; Cloud storage

I. INTRODUCTION

Centralized management of documents, or document centralization, is emerging as an indispensable choice for strategically securing and utilizing organizational intellectual assets. The up-to-date concept of Enterprise Content Management (ECM), which covers the strategies, methods, and tools used to capture, manage, store, preserve, and deliver content and documents related to organizational processes, is a typical example of centralizing resources and efforts to manage organizational documents and contents. An organization's intellectual assets include not only business-related documents and contents, but also business processes and information technologies.
Therefore, through document centralization, organizations can expect efficient application of information resources by effectively allocating them to the proper processes and tasks. Highly secured management of internal resources can also be initiated; therefore, many companies are now trying to establish robust and scalable systems for document centralization.

Organizational documents can be centralized using the network infrastructure, and the most commonly applied network technology is cloud computing. Among cloud computing technologies, cloud storage forms the repository that stores transmitted documents. Cloud storage, one of the widely known cloud computing technologies, provides Internet-based data storage as a service. One of the biggest merits of cloud storage is that users can access data in a cloud anytime and anywhere, using any device [5]. Typical examples of cloud storage services are Amazon S3 (http://aws.amazon.com/s3), Mosso (http://www.rackspacecloud.com), Wuala (http://www.wuala.com), and ucloud (http://www.ucloud.com); all of these services offer users clean and simple storage interfaces, hiding the details of the actual location and management of resources [8]. Once a document to be archived is stored in cloud storage, users can access and download it anytime and anywhere, provided the right to access has been granted. Because of this advantage in utilizing organizational information resources, more companies and organizations are implementing online storage in the cloud computing environment.

and Applied Computing (ICIEACS 2013), Bangkok, Thailand on April 6-7, 2013

While cloud storage can deliver various benefits to users, it also has quite a few technical limitations in network security as well as privacy [9]. From the viewpoint of usability, many users also point out a very serious problem in using cloud storage: the difficulty of storing and retrieving documents. To store a working document under one of the categories provided by the cloud storage, a user has to determine the category that exactly coincides with the contents of the document. Since the categories are naturally diverse and their overall structure is complicated, determining the proper category is not easy. When retrieving a document of interest, the user has to spend considerable time locating the file because too many categories exist.

Assistance in deciding the category under which to store a document can be provided by analyzing the contents of the document with respect to the categories defined in the cloud storage. Since the keywords or topics extracted from the document stand for the possible name of the category under which the document must be stored, users can easily achieve their goal.
In retrieving a document from the storage, searching can be more accurate and faster because each document was archived into its topic-based category.

This research tries to enhance the usability of the cloud storage-based central repository by automatically archiving working documents according to their automatically identified topics or keywords. To do so, this research proposes a methodology to automatically extract the predefined category-specific keywords (or topics) of a working document by applying a text mining algorithm. Based on the extracted keywords, documents can be automatically stored into categories in cloud storage. To demonstrate the validity of the proposed concepts, a prototype system that performs automatic topic identification, automatic category searching, and automatic archiving is implemented.

II. PROPOSED METHODOLOGY

As Fig. 1 illustrates, a process that automatically identifies the topics (keywords) of the working document is additionally needed to automate the whole process of cloud storage-based archiving. Tasks in the dotted ellipse are required to perform automated topic identification, and they must be processed sequentially. Once the topic of the given working document is identified, the document can be automatically stored in cloud storage with the topic. By writing program code that searches for the directory corresponding to the topic and saves the document with the topic, automatic document archiving can be completed. When the destination directory is determined, the system may send a message to the user to confirm whether the directory is valid. If the user changes the directory, the system automatically stores the document in the location the user designated by simply applying agent programming. The specific roles of each task are as follows.

Figure 1. Conceptual Framework of the Proposed Methodology

A. File Converter

A file converter changes the format (one of .doc, .xls, or .html) of a working document to an analyzable one (.txt) so that the following module can read the contents. It plays the role of a file format filter that brings input documents into a unified format (.txt in this research).

B. Word Stemming Module

To standardize the words in the document, unnecessary or redundant parts of each word must be eliminated. A stem, in linguistics, is the combination of the basic form of a word (called the root) plus any derivational morphemes, but excluding inflectional elements. In other words, the stem is the form of the word to which inflectional morphemes can be added, if applicable. For example, the root of the English verb form "destabilized" is stabil- (an alternate form of stable); the stem is de-stabil-ize, which includes the derivational affixes de- and -ize, but not the inflectional past tense suffix -(e)d.

C. Word Vector Tool

Based on the word stems from the word stemming module, the word vector tool transforms each word stem into a vector. To extract the vector, TF/IDF (Term Frequency/Inverse Document Frequency) is used. TF/IDF is a statistical technique used to evaluate how important a word is to a document.
The importance increases proportionally to the number of times a word appears in the document, but is offset by how common the word is across all of the documents in the collection or corpus. A high TF/IDF weight is reached by a high term frequency (in the given document) and a low document frequency of the term in the whole collection of documents; the weight hence tends to filter out common terms. The word with the highest TF/IDF can be regarded as a keyword. However, to determine the keyword of a given document, the TF/IDF value of every meaningful term (stem) must usually be calculated individually.
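The weighting described above can be made concrete with a small sketch. It assumes the common formulation weight(t, d) = tf(t, d) * ln(N / df(t)), where tf is the raw count of stem t in document d, N is the number of documents in the corpus, and df(t) is the number of documents containing t. The text does not state which TF/IDF variant the word vector tool applies, so this variant, the class name TfIdfSketch, and the sample stems are assumptions for illustration only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

// Illustrative TF/IDF weighting over already-stemmed documents.
// Assumed variant: weight = tf * ln(N / df). This is not the Yale tool's code.
final class TfIdfSketch {

    // Returns the TF/IDF weight of every stem in corpus.get(docIndex).
    static Map<String, Double> weights(List<List<String>> corpus, int docIndex) {
        // Document frequency: in how many documents each stem occurs.
        Map<String, Integer> df = new HashMap<String, Integer>();
        for (List<String> doc : corpus) {
            for (String stem : new HashSet<String>(doc)) {
                df.merge(stem, 1, Integer::sum);
            }
        }
        // Term frequency within the target document.
        Map<String, Integer> tf = new HashMap<String, Integer>();
        for (String stem : corpus.get(docIndex)) {
            tf.merge(stem, 1, Integer::sum);
        }
        Map<String, Double> result = new HashMap<String, Double>();
        for (Map.Entry<String, Integer> e : tf.entrySet()) {
            double idf = Math.log((double) corpus.size() / df.get(e.getKey()));
            result.put(e.getKey(), e.getValue() * idf);
        }
        return result;
    }

    // The stem with the highest weight is taken as the keyword candidate.
    static String topStem(Map<String, Double> weights) {
        String best = null;
        for (Map.Entry<String, Double> e : weights.entrySet()) {
            if (best == null || e.getValue() > weights.get(best)) {
                best = e.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Hypothetical stems; "paper" occurs in every document, so its weight is 0.
        List<List<String>> corpus = new ArrayList<List<String>>();
        corpus.add(Arrays.asList("confer", "paper", "submit", "confer"));
        corpus.add(Arrays.asList("paper", "lectur", "student"));
        corpus.add(Arrays.asList("paper", "budget", "report"));
        Map<String, Double> w = weights(corpus, 0);
        System.out.println(topStem(w) + " " + w);
    }
}
```

In this toy corpus the stem that is frequent in the first document but absent from the others receives the highest weight, while a stem shared by all documents is filtered out, which is the behavior the paragraph above describes.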
D. Classifier

A classifier extracts the resultant keywords by projecting the word vectors of the target document onto the vector spaces obtained by training on a corpus. A corpus is a set of predefined directories, each of which holds many related example documents. To train the classifiers based on the constructed corpus, sample documents can be collected by browsing conventional Web pages. Because conventional Web pages have already been labeled with corresponding keywords as titles, the title of each document can, in a sense, be deemed to be already formalized [4]. This research deploys an SVM (support vector machine)-based classifier, because the SVM has been demonstrated to outperform other similar text mining algorithms applicable to topic identification [1, 6]. The SVM determines the keyword of a document by depicting the word vectors in the vector space R^n (n: number of dimensions) and comparing the kernel functions of each document. The accuracy of the SVM has been reported to be very satisfactory: if the prediction model has been trained sufficiently, the SVM outputs accurate results. Compared with manual classification, the accuracy of SVM-based classification was reported to be over 90% [3].

III. PROTOTYPE IMPLEMENTATION

A. Overview

To check whether the proposed methodology can yield correct results, a prototype system is implemented that intelligently stores working documents into the cloud storage-based repository under the automatically identified topic-specific category. The prototype system analyzes and extracts the topic of a working document in real time when a user finishes writing and tries to store the document. After indexing the document by tagging the identified topic with the user's ID and the time, the prototype transmits and stores the document in the cloud storage. A dialogue between the user and the prototype is provided to confirm the correctness of the identified topic.
If the user accepts the recommended topic, the prototype transmits the file to the cloud storage with the tag information, and automatic archiving is completed. Fig. 2 shows the sequence of functions provided by the prototype.

Figure 2. Sequence Diagram of the Prototype System
The prototype system has been implemented using JDK v1.5.0_06 under the Java 2 runtime environment. The Stemmer and WV Tool submodules have been implemented using the word stemming tool and the vector creating tool, respectively, of Yale, an open source environment for KDD (Knowledge Discovery and Data mining) and machine learning [7]. The SVM module, the classifier, uses LibSVM v2.81 [2]. The prototype processes documents related to activities within the university context; therefore, for the convenience of implementation and testing, the categories of the cloud storage have been obtained by simplifying the University Ontology defined by the Department of Computer Science of the University of Maryland (http://www.cs.umd.edu/projects/plus/shoe/onts/univ1.0.html).

B. Example: A Document on Conference Participation

To explain in detail how the prototype system works, a document concerning Conference Participation is taken as an example. The prototype begins by converting a document in .doc, .xls, or .html format into the analyzable .txt format. Fig. 3 shows the example document formatted in .txt.

Figure 3. Example Document in .txt Format

To extract the topic of the given document, the words in the document must be refined so that only the meaningful part of each word is used as input. Meaningless words, or stop words, must be eliminated in advance, and the stem of each meaningful word must be separated. Fig. 4 shows the stems produced by the stemming module.
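The refinement step can be illustrated with a deliberately small sketch. The prototype relied on Yale's word stemming tool, whose algorithm is not described in the text, so the stop-word list, the suffix rules, and the class name StemSketch below are hypothetical stand-ins; a real system would use an established stemmer rather than this naive suffix stripping.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Naive stop-word removal and suffix stripping, for illustration only.
final class StemSketch {

    // Hypothetical, very short stop-word list.
    static final Set<String> STOP_WORDS = new HashSet<String>(Arrays.asList(
            "a", "an", "the", "of", "to", "in", "on", "and", "or",
            "is", "are", "be", "for", "will"));

    // Hypothetical inflectional suffixes, checked in this order.
    static final String[] SUFFIXES = {"ing", "ed", "es", "s"};

    // Strips the first matching suffix if at least three letters remain.
    static String stem(String word) {
        for (String suffix : SUFFIXES) {
            if (word.endsWith(suffix) && word.length() - suffix.length() >= 3) {
                return word.substring(0, word.length() - suffix.length());
            }
        }
        return word;
    }

    // Lower-cases the text, drops stop words, and stems the remaining words.
    static List<String> stems(String text) {
        List<String> result = new ArrayList<String>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (token.isEmpty() || STOP_WORDS.contains(token)) {
                continue;
            }
            result.add(stem(token));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(stems("The papers are submitted to the conference"));
    }
}
```

The output of such a step is the list of stems that the word vector tool then turns into a weighted vector.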
Figure 4. Word Stems in Example Document

Based on the meaningful stems and predefined categories, the vector of the example document can be calculated. Because only 9 categories are selected and considered in this research, the resultant vector is relatively simple compared with the results of previous research. This simplicity results from applying a simple structure of predefined categories; however, it poses no problem in demonstrating the performance of topic identification, because quite a few previous studies also reduced the number of predefined categories for ease of training and prediction. In this research, the categories in the University Ontology have been modified so that the overall categories are consistent and compatible. Fig. 5 shows the specific categories used in this research and the resultant vector.

Figure 5. Predefined Category & Resultant Vector of Example Document

The SVM module, the classifier, projects the vector of the document onto a 9-dimensional vector space and determines which category topic best represents the contents of the document. Before performing actual categorization, the SVM module must be trained using a sufficient number of sample documents already assigned to each category. In this research, 80 to 90 sample documents per category were used to train the SVM module, and the accuracy of category prediction, which is automatically estimated by LibSVM, is 92.5% (MSE=1.025, SCC=0.825273), which means that 37 out of 40 example documents are correctly classified. Fig. 6 shows the resultant category number provided by the SVM module; the number 2.0 denotes the third category, Conference.
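The two numbers reported above can be related with a short sketch. It assumes that the predicted label is a zero-based index into the ordered category list (so 2.0 selects the third entry) and that accuracy is the share of correctly predicted test documents. Apart from Conference, the category names are placeholders, because the list shown in Fig. 5 is not reproduced in the text; the class name PredictionSketch is likewise hypothetical.

```java
// Relates the predicted label and the reported accuracy, for illustration only.
final class PredictionSketch {

    // Placeholder list: only "Conference" at index 2 comes from the paper.
    static final String[] CATEGORIES = {
        "Category0", "Category1", "Conference", "Category3", "Category4",
        "Category5", "Category6", "Category7", "Category8"
    };

    // Maps a numeric label such as 2.0 to its category name (assumed zero-based).
    static String categoryOf(double label) {
        int index = (int) Math.round(label);
        if (index < 0 || index >= CATEGORIES.length) {
            throw new IllegalArgumentException("Unknown category label: " + label);
        }
        return CATEGORIES[index];
    }

    // Accuracy as a percentage of correctly classified documents.
    static double accuracyPercent(int correct, int total) {
        return 100.0 * correct / total;
    }

    public static void main(String[] args) {
        System.out.println(categoryOf(2.0));          // the third category
        System.out.println(accuracyPercent(37, 40));  // 37 of 40 test documents
    }
}
```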
Figure 6. Resultant Category (Category 2.0 Conference)

Since the identified topic denotes the target category under which the document can be archived, it must be tagged onto the document. To prevent different documents tagged with the same topic from being saved under duplicate names, the user's ID and the time at which the document was completed need to be tagged together with the topic; this can be done with simple programming.

Once indexing with the tag information is finished, the target category under which to store the document must be determined. Since the University Ontology is originally composed of 30 entities, the cloud storage of the central repository is set to have 30 categories, although only 9 categories were selected to determine the topic of the document. Therefore, the document is to be stored under one of the 30 categories. A hash function is suitable for this job and can be implemented with a few lines of code.
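As an illustration only (this is not the prototype's original listing), the tagging and category lookup could look like the following. The sketch assumes that the hash-based lookup is a hash table mapping each topic to a category directory and that the tag has the form topic-id-time. The timestamp pattern, the directory paths, the "Course" entry, and the class name TagSketch are hypothetical.

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Tag construction and hash-based category lookup, for illustration only.
final class TagSketch {

    // Hash table from topic to category directory. A full repository
    // would hold all 30 categories; the paths here are hypothetical.
    static final Map<String, String> CATEGORY_DIRS = new HashMap<String, String>();
    static {
        CATEGORY_DIRS.put("Conference", "/repository/Conference");
        CATEGORY_DIRS.put("Course", "/repository/Course"); // hypothetical entry
    }

    // Builds a tag of the form topic-id-time so that two documents
    // with the same topic do not collide.
    static String buildTag(String topic, String userId, Date completedAt) {
        String time = new SimpleDateFormat("yyyyMMddHHmmss", Locale.ROOT).format(completedAt);
        return topic + "-" + userId + "-" + time;
    }

    // Looks up the directory of the category that matches the topic.
    static String directoryFor(String topic) {
        String dir = CATEGORY_DIRS.get(topic);
        if (dir == null) {
            throw new IllegalArgumentException("No category for topic: " + topic);
        }
        return dir;
    }

    public static void main(String[] args) {
        System.out.println(buildTag("Conference", "user01", new Date()));
        System.out.println(directoryFor("Conference"));
    }
}
```

A hash table keeps the lookup cost independent of the number of categories, which matters more as the category structure grows beyond the 30 entries used here.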
Once the target category under which to store the document has been determined, the document must be saved with the file name topic-id-date under that category.

During automatic archiving, the automatically identified topic of the document must be confirmed by the user so that the document is not saved under a wrong category. After automatic archiving is completed, a message informs the user of the category in which the document has been stored. These dialogues between the system and the user can also be handled with a few lines of code.
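Again as an illustration rather than the prototype's original listing, the confirmation dialogue and the final save might be sketched as below. A local directory stands in for the cloud storage, the Confirmer interface stands in for a real user dialogue, and the sketch assumes that a topic changed by the user is used for both the directory and the file name. All names are hypothetical.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Confirmation dialogue and save step, for illustration only.
final class ArchiveSketch {

    // Stands in for the user dialogue: receives the recommended topic
    // and returns the topic the user accepts (possibly a different one).
    interface Confirmer {
        String confirm(String recommendedTopic);
    }

    // Path of the form <root>/<topic>/<topic>-<id>-<date>.txt
    static Path targetPath(Path root, String topic, String userId, String date) {
        return root.resolve(topic).resolve(topic + "-" + userId + "-" + date + ".txt");
    }

    // Asks for confirmation, saves the text, and reports where it was stored.
    static Path archive(Path root, String topic, String userId, String date,
                        String text, Confirmer confirmer) {
        String confirmedTopic = confirmer.confirm(topic);
        Path target = targetPath(root, confirmedTopic, userId, date);
        try {
            Files.createDirectories(target.getParent());
            Files.write(target, text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // Completion message telling the user where the document went.
        System.out.println("Stored under category '" + confirmedTopic + "': " + target);
        return target;
    }

    public static void main(String[] args) {
        Path root = Paths.get(System.getProperty("java.io.tmpdir"), "archive-sketch");
        // This confirmer simply accepts the recommended topic.
        archive(root, "Conference", "user01", "20130406", "example text", topic -> topic);
    }
}
```

Passing the dialogue in as an interface keeps the save logic independent of how the confirmation is presented, so the same code could sit behind a desktop dialog or a Web form.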
IV. CONCLUDING REMARKS

The benefits of utilizing cloud storage in companies and organizations can be considerable, because it promotes effective and efficient sharing of organizational information and knowledge regardless of time and place. If some usability issues around cloud computing are not realistically resolved, however, these benefits, as well as interest in the technology, may fade away. This research tries to resolve one such usability issue around cloud storage by suggesting practical guidance that relieves the user's burden of selecting directories in cloud storage. The proposed methodology, which identifies the topics of working documents and stores the documents according to the identified topics in an automated manner, can contribute to higher productivity and convenience of work. Companies can also expect more concentrated management of organizational information and knowledge through the proposed concepts, because more accurate and secure processing of the organizational document archive is guaranteed.

This research, however, must be extended so that the proposed methodology can be applied to various mobile devices, such as smartphones and smartpads, which are essential items for current users. To cope with this requirement, wireless-communication-oriented networking protocols must be additionally considered. In addition, a formal corpus needs to be developed to improve the performance of topic identification, because the accuracy of text mining mainly depends on the result of training based on the corpus. Since the corpus may have the same structure as the directory of the cloud storage, this adjustment can also reinforce the realistic application of automatic document archiving.

REFERENCES

[1] Basu, A., Watters, C., & Shepherd, M., Support Vector Machines for Text Categorization, Proceedings of the 36th Hawaii International Conference on System Sciences, Vol.4, 2003.
[2] Chang, C. & Lin, C., LIBSVM: a library for support vector machines, ACM Transactions on Intelligent Systems and Technology, Vol.2, No.3, 1-27, 2011.
[3] Hsu, C.W., Chang, C.C., & Lin, C.J., A Practical Guide to Support Vector Classification: LibSVM Tutorial, available at http://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf, 2001.
[4] Kim, S., Suh, E., & Yoo, K., A study of context inference for Web-based information systems, Electronic Commerce Research and Applications, Vol.6, 146-158, 2007.
[5] Liu, Q., Wang, G., & Wu, J., Secure and privacy preserving keyword searching for cloud storage services, Journal of Network and Computer Applications, Vol.35, No.3, 927-933, 2012.
[6] Meyer, D., Leisch, F., & Hornik, K., The support vector machine under test, Neurocomputing, Vol.55, 169-186, 2003.
[7] Mierswa, I., Wurst, M., Klinkenberg, R., Scholz, M., & Euler, T., YALE: Rapid Prototyping for Complex Data Mining Tasks, Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-06), 2006.
[8] Pamies-Juarez, L., García-López, P., Sánchez-Artigas, M., & Herrera, B., Towards the design of optimal data redundancy schemes for heterogeneous cloud storage infrastructures, Computer Networks, Vol.55, 1100-1113, 2011.
[9] Svantesson, D. & Clarke, R., Privacy and consumer risks in cloud computing, Computer Law & Security Review, Vol.26, 391-397, 2010.