Big Text Data Clustering using Class Labels and Semantic Feature Based on Hadoop of Cloud Computing
Vol.8, No.4 (2014), pp

Yong-Il Kim 1, Yoo-Kang Ji 2 and Sun Park 3 *
1 Honam University, South Korea, 2 DongShin University, South Korea, 3 GIST, South Korea
1 [email protected], 2 [email protected], 3 [email protected]
* Corresponding author

ISSN: IJSEIA Copyright © 2014 SERSC

Abstract
Class labels for clustering can be generated automatically, but automatically generated labels are of much lower quality than labels specified by humans. If class labels are provided, clustering is more effective. In classic document clustering based on the vector model, documents are represented by term frequencies without considering the semantic information of each document. Because of this property, the vector model may incorrectly place documents of the same cluster into different clusters when they lack shared terms. Knowledge-based approaches have been applied to overcome this problem. However, these approaches are influenced by the inherent structure of the documents and incur the cost of constructing an ontology. In addition, they are limited in their ability to cluster the rapidly growing volume of big text data in a cloud environment. In this paper, we propose a big text document clustering method that uses class label terms and semantic features based on Hadoop. Class label terms obtained by non-negative matrix factorization (NMF) on Hadoop represent the inherent structure of document clusters well. The proposed method improves the quality of document clustering by using the class label terms together with term weights based on term mutual information (TMI) with WordNet, at little cost. It can also cluster big document data using distributed parallel processing on Hadoop. The experimental results demonstrate that the proposed method achieves better performance than other document clustering methods.

Keywords: document clustering, NMF, semantic features, term weight, WordNet, term mutual information, big text data, Hadoop

1. Introduction
Class labels for clustering can be generated automatically; however, they differ from labels specified by humans, and an automatic class label is of much lower quality than a manual one. If suitable class labels can be provided without human intervention, clustering is more effective [1-3].

Traditional document clustering methods are based on the bag-of-words (BOW) model, which represents documents with features such as weighted term frequencies (i.e., the vector model). However, these methods ignore the semantic relationships between the terms within a document set. The clustering performance of the BOW model depends on a distance measure between document pairs, but such a measure cannot reflect the real distance between two documents, because documents are composed of high-dimensional term vectors that relate to complicated document topics. In addition, the results of clustering are influenced by the properties of the documents or by the cluster forms desired by the user [1].

Recently, internal and external knowledge approaches have been applied to overcome the problems of vector model-based document clustering. Internal knowledge-based document clustering uses the inherent structure of the document set by means of a factorization technique. The factorization techniques proposed for document clustering include non-negative matrix factorization (NMF) [4-6], concept factorization (CF) [7], adaptive subspace iteration (ASI) [8], and clustering with local and global regularization (CLGR) [9]; they can accurately identify the topics of a document set from its semantic features. These methods have been studied intensively and have many advantages, but the successful construction of semantic features from the original document set remains limited when the set is composed of very different documents or of very similar documents [1].

External knowledge-based document clustering exploits a term ontology constructed from an external knowledge base. Recently proposed term ontology techniques for document clustering include term mutual information with conceptual knowledge from WordNet [11], concept mapping schemes from Wikipedia [12], concept weighting from a domain ontology [13], and fuzzy associations with condensed cluster terms from WordNet [14]. These techniques can improve the BOW term representation used in document clustering. However, it is often difficult to locate a comprehensive ontology that covers all concepts mentioned in the document collection, which causes a loss of information [1, 12]. Moreover, ontology-based methods incur a high cost, because the ontology must be constructed manually by knowledge engineers and domain experts.
To resolve the limitations of the knowledge-based approaches, this paper proposes a document clustering method that uses class label terms obtained from the semantic features of NMF on Hadoop, together with term weights obtained by term mutual information (TMI) with WordNet. The proposed method combines the advantages of the internal and external knowledge-based methods. First, meaningful class label terms that describe the cluster topics of the documents are extracted using NMF on Hadoop. The extracted terms represent the class labels of the document clusters well, because the semantic features (i.e., internal knowledge) capture the inherent structure of the documents. Second, the term weights of the documents are calculated using TMI based on the WordNet synonyms (i.e., external knowledge) of the document terms. The term weights make it easier to classify documents into an appropriate class label, because they extend the coverage of each document with respect to the class labels. The method can also cluster big document data using distributed parallel processing on Hadoop.

The rest of the paper is organized as follows. Section 2 introduces the Hadoop framework as related work. Section 3 describes the NMF algorithm in detail. Section 4 explains the proposed method. Section 5 presents the performance evaluation and experimental results of the proposed method. Finally, we conclude in Section 6.

2. Hadoop Framework
The Apache Hadoop project develops open-source software for reliable, scalable, distributed computing. The project includes the Apache Hadoop software library, a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than relying on hardware to deliver high availability, the library itself is designed to detect and handle failures at the application layer, thereby delivering a highly available service on top of a cluster of computers, each of which may be prone to failures [15].
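To make the "simple programming model" concrete, the following is a minimal sketch of the map, shuffle, and reduce steps, applied to counting term frequencies per document. It is an in-process simulation in plain Python and does not use the Hadoop API; the function names (`map_phase`, `shuffle`, `reduce_phase`, `term_document_counts`) and the two sample documents are illustrative assumptions, and no stop-word removal or stemming is performed.

```python
from collections import defaultdict

def map_phase(doc_id, text):
    """Mapper: emit ((term, doc_id), 1) for every token of one document."""
    for token in text.lower().split():
        yield (token, doc_id), 1

def shuffle(pairs):
    """Group emitted values by key, as a MapReduce framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reducer: sum the counts collected for one (term, doc_id) key."""
    return key, sum(values)

def term_document_counts(docs):
    """Run the three steps locally; docs maps a document id to its text."""
    pairs = (pair for doc_id, text in docs.items()
             for pair in map_phase(doc_id, text))
    return dict(reduce_phase(key, values)
                for key, values in shuffle(pairs).items())

# Hypothetical two-document collection used only for illustration.
docs = {
    "d1": "a course on integral equations",
    "d2": "integral equations and evolution equations",
}
counts = term_document_counts(docs)
```

On a real cluster the mapper and reducer would run on different machines and the grouping step would be handled by the framework; the sketch only shows how the work is split into independent per-key operations.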
3. Non-negative Matrix Factorization
This section reviews the theory and algorithm of NMF and, in Example 1, describes the advantage of its semantic features by comparing NMF with SVD (singular value decomposition). In this paper, we define the matrix notation as follows: let $X_{*j}$ be the j-th column vector of matrix $X$, $X_{i*}$ the i-th row vector, and $X_{ij}$ the element in the i-th row and j-th column.

NMF decomposes a given $m \times n$ matrix $A$ into a non-negative semantic feature matrix $W$ and a non-negative semantic variable matrix $H$, as shown in Equation (1) [10].

$$A \approx WH \tag{1}$$

where $W$ is an $m \times r$ non-negative matrix and $H$ is an $r \times n$ non-negative matrix. Usually $r$ is chosen to be smaller than $m$ or $n$, so that the total size of $W$ and $H$ is smaller than that of the original matrix $A$.

The objective function minimizes the Euclidean distance between each column of $A$ and its approximation $\tilde{A} = WH$, as proposed by Lee and Seung [10]. The Frobenius norm is used as the objective function:

$$\Theta_E(W, H) = \lVert A - WH \rVert_F^2 = \sum_{i=1}^{m} \sum_{j=1}^{n} \Bigl( A_{ij} - \sum_{l=1}^{r} W_{il} H_{lj} \Bigr)^2 \tag{2}$$

$W$ and $H$ are updated repeatedly until $\Theta_E(W, H)$ falls below a predefined threshold or the number of iterations exceeds a predefined limit. The update rules are as follows:

$$H_{\alpha\mu} \leftarrow H_{\alpha\mu} \frac{(W^T A)_{\alpha\mu}}{(W^T W H)_{\alpha\mu}}, \qquad W_{i\alpha} \leftarrow W_{i\alpha} \frac{(A H^T)_{i\alpha}}{(W H H^T)_{i\alpha}} \tag{3}$$

Example 1) We illustrate an example of NMF and SVD decomposition [2, 10]. The non-negative matrix $A$ is decomposed by the nnmf() function of Matlab 7.8 into two non-negative matrices, $W$ and $H$, as shown in Figure 1(a).

[Figure 1. Example of NMF and SVD. (a) Result of NMF decomposition, A ≈ WH. (b) Example of the representation of column vector A*2 using semantic features and semantic variables. (c) Result of SVD decomposition into U, S and V.]

Figure 1(b) shows an example in which a column vector corresponding to a document is represented as a linear combination of semantic features and semantic variables. Figure 1(c) shows the result of SVD obtained with the svd() function of Matlab 7.8. The semantic feature matrices U and V in Figure 1(c) contain no zero values and include negative values. Unlike SVD, the semantic feature matrices $W$ and $H$ produced by NMF in Figure 1(a) are sparse. Intuitively, NMF obtains semantic features with a narrower semantic range than SVD. In other words, because the semantic features of NMF are sparse, each of them is associated with only a few document terms, and these terms can cover a class label. Thus, the semantic features make it easy to identify the class label terms that signify a document cluster. They can also help to distinguish the multiple meanings of the same term [10].

4. Proposed Document Clustering Method
This paper proposes a document clustering method that uses class label terms obtained by NMF on Hadoop and term weights based on TMI with WordNet. The proposed method consists of three phases: preprocessing, extracting class label terms, and clustering documents, as shown in Figure 2. Each phase is explained in the subsections below.

[Figure 2. Document Clustering Method using Class Label Terms and Term Weights. Preprocessing of the document set: remove stop words, perform word stemming, construct the term-document frequency matrix. Extracting class label terms: (a) NMF based on Hadoop, yielding the semantic feature matrix; (b) extracting terms of class labels; (c) computing term weights by TMI with WordNet. Clustering documents, for each cluster: (d) calculating the similarity between class labels and term weights; (e) classifying documents into class labels. Output: document clusters.]

4.1. Preprocessing
In the preprocessing phase, Rijsbergen's stop word list is used to remove all stop words, and word stemming is performed using Porter's stemming algorithm [16]. Then, the term-document frequency matrix $A$ is constructed from the document set.

4.2. Extracting Class Label Terms
This section describes how class label terms that reflect the properties of the document clusters are extracted using NMF, as in Figure 2(b). The class label terms that explain the topic of a document cluster well are derived from the semantic features of NMF on Hadoop. The semantic features of big text documents are extracted by a modified version of Liu's distributed NMF (dnmf) method [17], which relies on distributed parallel processing with Hadoop MapReduce programming. The extraction method is as follows. First, the term-document frequency matrix $A$ is constructed in the preprocessing phase. Second, the number of clusters (i.e., the number of semantic features $r$) is set, and dnmf is performed on $A$ to decompose it into the semantic feature matrix $W$ and the semantic variable matrix $H$. Finally, matrix $W$ and Equation (4) are used to extract the class label terms. Each column vector of $W$ corresponds to the class label of a cluster and each row vector of $W$ corresponds to a document term; an element of $W$ (i.e., a semantic feature value) indicates how strongly the term reflects the class label of the cluster. The class label terms are extracted as follows.

$$CL_p = \bigl\{ A_{ij} \;\big|\; p = \arg\max_{1 \le j \le r} W_{ij} \ \text{and} \ W_{ij} \ge as_j \bigr\} \tag{4}$$

where $CL_p$ is the term set of the p-th class label, and $A_{ij}$ is the term corresponding to the semantic feature in the i-th row and j-th column of matrix $W$. The average semantic feature value of the j-th column vector, $as_j$, is as follows.

$$as_j = \frac{\sum_{i=1}^{m} W_{ij}}{nz} \tag{5}$$

where $m$ is the number of rows (i.e., the number of terms) and $nz$ is the number of non-zero (i.e., positive) elements of the j-th column vector.

4.3. Computing Term Weights by TMI based on WordNet
In this section, the term weights are calculated by TMI (term mutual information) based on the synonyms of WordNet, as in Figure 2(c). WordNet is a lexical database for the English language in which words (i.e., terms) are grouped into synsets consisting of synonyms, each representing a specific meaning of a given term [18]. Class label terms may be restricted by the properties of the document clusters and the composition of the documents.
To resolve this problem, this paper uses term weights of documents obtained by applying TMI to the synonyms of WordNet. The term weights of a document are calculated by Jing's TMI [11], as in Equation (6).

$$\tilde{A}_{ij} = A_{ij} + \sum_{\substack{l=1 \\ l \ne i}}^{m} \delta_{il} A_{lj} \tag{6}$$

In Equation (6), $\delta_{il}$ indicates the semantic relation between two terms. If term $A_{lj}$ appears among the WordNet synonyms of $A_{ij}$, $\delta_{il}$ is set to the same level for the different terms $A_{ij}$ and $A_{lj}$; otherwise, $\delta_{il}$ is set to zero.

Example 2) Table 1 shows the six documents taken from a part of Figure 4.1 of [2]. Table 2 shows the term-document frequency matrix obtained by applying the preprocessing phase to Table 1. Table 3 shows the semantic feature matrix $W$ obtained by applying NMF to Table 2, together with the average of the non-zero elements of each semantic feature vector, $as$, computed with Equation (5). Table 4 shows the class label terms extracted from Table 3, that is, the terms whose semantic feature values are at least the average semantic feature value $as_j$.
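The update rules of Equation (3), the class label extraction of Equations (4) and (5), and the term weighting of Equation (6) can be sketched together as follows. This is a single-machine illustration in plain Python, not the distributed Hadoop implementation used in the paper. Several points are assumptions: the loop runs for a fixed number of iterations instead of testing a threshold, `eps` is added only to avoid division by zero, `delta` is a constant because the paper does not state the value of $\delta_{il}$, and `synonyms` is a hypothetical dictionary standing in for a WordNet lookup.

```python
import random

def transpose(X):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*X)]

def matmul(X, Y):
    """Plain matrix product of two lists of rows."""
    Yt = transpose(Y)
    return [[sum(a * b for a, b in zip(row, col)) for col in Yt] for row in X]

def frobenius_error(A, W, H):
    """Objective of Equation (2): squared Frobenius norm of A - WH."""
    WH = matmul(W, H)
    return sum((a - b) ** 2 for ra, rb in zip(A, WH) for a, b in zip(ra, rb))

def nmf(A, r, iters=200, eps=1e-9, seed=0):
    """Multiplicative updates of Equation (3) from a random positive start."""
    rng = random.Random(seed)
    m, n = len(A), len(A[0])
    W = [[rng.random() + 0.1 for _ in range(r)] for _ in range(m)]
    H = [[rng.random() + 0.1 for _ in range(n)] for _ in range(r)]
    for _ in range(iters):
        Wt = transpose(W)
        num, den = matmul(Wt, A), matmul(matmul(Wt, W), H)
        H = [[H[a][u] * num[a][u] / (den[a][u] + eps) for u in range(n)]
             for a in range(r)]
        Ht = transpose(H)
        num, den = matmul(A, Ht), matmul(W, matmul(H, Ht))
        W = [[W[i][a] * num[i][a] / (den[i][a] + eps) for a in range(r)]
             for i in range(m)]
    return W, H

def extract_class_labels(W, terms):
    """Equations (4) and (5): keep a term for its strongest feature if it reaches the column average."""
    r = len(W[0])
    averages = []
    for j in range(r):
        column = [row[j] for row in W]
        nz = sum(1 for v in column if v > 0)      # non-zero entries of column j
        averages.append(sum(column) / nz if nz else 0.0)
    labels = [[] for _ in range(r)]
    for term, row in zip(terms, W):
        p = max(range(r), key=lambda j: row[j])   # arg max over the features
        if row[p] > 0 and row[p] >= averages[p]:
            labels[p].append(term)
    return labels

def tmi_weights(A, terms, synonyms, delta=1.0):
    """Equation (6): add delta * A[l][j] for every other term l that is a synonym of term i."""
    weighted = [list(row) for row in A]
    for i, term in enumerate(terms):
        for l, other in enumerate(terms):
            if l != i and other in synonyms.get(term, ()):
                for j in range(len(A[0])):
                    weighted[i][j] += delta * A[l][j]
    return weighted
```

With this weighting, a document that contains only a synonym of a class label term still receives a non-zero weight for that term, which is the coverage extension the section describes.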
Table 1. Document Set of Composition of 6 Documents
documents    document contents
d1           A course on integral equations
d2           A tractors for semigroups and evolution equations
d3           Automatic differentiation of algorithms: theory, implementation, and application
d4           Geometrical aspects of partial differential equations

Table 2. Term Document Frequency Matrix from Table 1
[Table residue: columns are the documents d1, d2, d3, d4; rows are the terms course, integral, equations, tractors, semigroups, evolution, automatic, computational, algebra, commutative, oscillation, neutral, delay. Surviving entries: course 1, integral 1, tractors 1, semigroups 1, evolution 1, automatic 1.]

Table 3. The as and Semantic Feature Matrix W by NMF from Table 2
[Table residue: columns are r1 (cluster1), r2 (cluster2), r3 (cluster3); rows are the terms course, integral, equations, tractors, semigroups, evolution, automatic, computational, algebra, commutative, oscillation, neutral, delay, followed by the row as. Surviving entries: course 0.321, integral 0.312, tractors 0.434, semigroups 0.434, evolution 0.434, computational 0.999, commutative 0.999.]

Table 4. The Extracted Terms of Class Labels
r1    algorithms, geometric, ideals, varieties, introduction, computational
r2    equations, different
r3    automatic, different, algorithms, theory, implementation

4.4. Clustering Document using Similarity
This section presents document clustering using the cosine similarity between the class label terms and the term weights of the documents. The steps in Figure 2(d) and 2(e) are as follows. First, the cosine similarity between the class label terms and the term weights is calculated. Then, each document is assigned to the cluster whose class label has the highest similarity value with the document [3, 16].

5. Experiments and Evaluation
This paper uses the 20 Newsgroups data set for performance evaluation [19-22]. To evaluate the proposed method, mixed documents were randomly chosen from the 20 Newsgroups documents. The normalized mutual information metric was used to measure document clustering performance [2-4, 7-9]. The number of clusters in the evaluation ranges from 2 to 10, as shown in Table 5. For each given cluster number K, 5 experiments were performed on different randomly chosen clusters, and the final performance value is the average of the values obtained from these runs.

In this paper, eight different document clustering methods are implemented, as listed in Table 5. KM is a document clustering method using K-means, a traditional partitioning clustering technique [2]. The NMF, ASI, CLGR, RNMF, and FPCA methods are document clustering methods based on internal knowledge. The FADN and TMINMF methods are clustering methods that combine internal and external knowledge. TMINMF denotes the method proposed in this paper. FADN denotes the previously proposed method using WordNet and fuzzy theory [14].
FPCA is the previously proposed method using PCA (principal component analysis) and fuzzy relationship [6], and RNMF is the previously proposed method using NMF and cluster refinement [5]. NMF denotes Xu's method using non-negative matrix factorization [4]. ASI is Li's method using adaptive subspace iteration [8]. Lastly, CLGR denotes Wang's method using local and global regularization [9]. As seen in Table 5, the average normalized mutual information of TMINMF is 16.7% higher than that of KM, 13.5% higher than that of NMF, 12.9% higher than that of ASI, 7.68% higher than that of CLGR, 5.12% higher than that of RNMF, 3.27% higher than that of FPCA, and 2.44% higher than that of FADN.

Table 5. Evaluation Results of Performance Comparison
[Table residue: columns are K, KM, NMF, ASI, CLGR, RNMF, FPCA, FADN, TMINMF; the last row is the average.]
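The assignment step of Section 4.4 and the evaluation metric can be sketched as follows. Two points are assumptions rather than details given in the paper: each class label is represented as a vector over the same vocabulary as the weighted documents, and the mutual information is normalized by the larger of the two entropies, the normalization used in Xu's NMF clustering work [4]. The function names are illustrative.

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def assign_clusters(doc_vectors, label_vectors):
    """Assign each document to the class label with the highest cosine similarity."""
    return [max(range(len(label_vectors)),
                key=lambda p: cosine(doc, label_vectors[p]))
            for doc in doc_vectors]

def entropy(counts, n):
    """Shannon entropy of a partition given its cluster sizes."""
    return -sum(c / n * math.log(c / n) for c in counts.values())

def nmi(true_labels, pred_labels):
    """Mutual information divided by max(H(true), H(pred)) (assumed normalization)."""
    n = len(true_labels)
    true_counts = Counter(true_labels)
    pred_counts = Counter(pred_labels)
    joint = Counter(zip(true_labels, pred_labels))
    mi = sum(c / n * math.log(c * n / (true_counts[t] * pred_counts[p]))
             for (t, p), c in joint.items())
    denom = max(entropy(true_counts, n), entropy(pred_counts, n))
    # Convention chosen here: two single-cluster partitions count as identical.
    return mi / denom if denom > 0 else 1.0
```

The metric is invariant to the numbering of the clusters, so a clustering that reproduces the true partition under different cluster ids still scores 1.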
6. Conclusion
This paper proposes an enhanced document clustering method that uses class label terms and term weights. The proposed method uses the semantic features obtained by NMF (internal knowledge) to extract class label terms that represent the important class labels of the document clusters well. To overcome the limitation that semantic features are influenced by the internal structure of the documents, the method uses TMI (term mutual information) to calculate term weights of documents based on the external knowledge of WordNet. In addition, it uses the similarity between the class label terms and the term weights to improve the quality of document clustering. It can also cluster big document data using distributed parallel processing on Hadoop. Experiments on the 20 Newsgroups test collection demonstrated that the proposed method achieves higher normalized mutual information than the other document clustering methods.

References
[1] J. Hu, L. Fang, Y. Cao, H. J. Zeng, H. Li, Q. Yang and Z. Chen, "Enhancing Text Clustering by Leveraging Wikipedia Semantics", Proceeding of SIGIR'08, Singapore, (2008) July, pp
[2] S. Chakrabarti, "Mining the Web: Discovering Knowledge from Hypertext Data", Morgan Kaufmann Publishers, (2003).
[3] B. Y. Ricardo and R. N. Berthier, "Modern Information Retrieval: the concepts and technology behind search", Second edition, ACM Press, (2011).
[4] W. Xu, X. Liu and Y. Gong, "Document Clustering Based On Non-negative Matrix Factorization", Proceeding of SIGIR'03, Toronto, Canada, (2003) August, pp
[5] S. Park, D. U. An, B. R. Cha and C. W. Kim, "Document Clustering with Cluster Refinement and Non-negative Matrix Factorization", Proceeding of the 16th ICONIP'09, Bangkok, Thailand, (2009) December.
[6] S. Park and K. J. Kim, "Document Clustering using Non-negative Matrix Factorization and Fuzzy Relationship", The Journal of Korea Navigation Institute, vol. 14, no. 2, (2010) April, pp
[7] W. Xu and Y. Gong, "Document Clustering by Concept Factorization", Proceeding of SIGIR'04, UK, (2004) July, pp
[8] T. Li, S. Ma and M. Ogihara, "Document Clustering via Adaptive Subspace Iteration", Proceeding of SIGIR'04, UK, (2004) July, pp
[9] F. Wang and C. Zhang, "Regularized Clustering for Documents", Proceeding of SIGIR'07, Amsterdam, (2007) July, pp
[10] D. D. Lee and H. S. Seung, "Learning the parts of objects by non-negative matrix factorization", Nature, vol. 401, (1999) October, pp
[11] L. Jing, L. Zhou, M. K. Ng and J. Z. Huang, "Ontology-based Distance Measure for Text Clustering", Proceeding of SIAM International Conference on Text Data Mining, Bethesda, MD, (2006).
[12] X. Hu, X. Zhang, C. Lu, E. K. Park and X. Zhou, "Exploiting Wikipedia as External Knowledge for Document Clustering", Proceeding of KDD'09, Paris, France, (2009) June, pp
[13] H. H. Tar and T. T. S. Nyaunt, "Ontology-based Concept Weighting for Text Documents", World Academy of Science, Engineering and Technology, vol. 81, (2011), pp
[14] S. Park and S. R. Lee, "Enhancing Document Clustering Using Condensing Cluster Terms and Fuzzy Association", IEICE Transactions on Information and Systems, vol. E94-D, no. 6, (2011) June, pp
[15] The Apache Hadoop project, (2013).
[16] W. B. Frakes and B. Y. Ricardo, "Information Retrieval: Data Structure & Algorithms", Prentice-Hall, (1992).
[17] C. Liu, H. C. Yang, J. Fan, L. W. He and Y. M. Wang, "Distributed Nonnegative Matrix Factorization for Web-scale Dyadic Data Analysis on MapReduce", Proceeding of the International World Wide Web Conference Committee, USA, (2010), pp
[18] G. Miller, "WordNet: A lexical database for English", CACM, vol. 38, no. 11, (1995), pp
[19] The 20 newsgroups data set, (2012).
[20] A. Ngoumou and M. F. Ndjodo, "A Rewriting System Based Operational Semantics for the Feature Oriented Reuse Method", International Journal of Software Engineering and Its Applications, vol. 7, no. 6, (2013), pp
[21] S. Mal and K. Rajnish, "New Quality Inheritance Metrics for Object-Oriented Design", International Journal of Software Engineering and Its Applications, vol. 7, no. 6, (2013), pp
[22] S. H. Lee, J. G. Lee and K. I. Moon, "A preprocessing of Rough Sets Based on Attribute Variation Minimization", International Journal of Software Engineering and Its Applications, vol. 7, no. 6, (2013), pp

Authors
Yong-Il Kim received a B.S. in computer science from Chonnam University in 1984 and an M.S. degree in computer science from Korea Advanced Institute of Science and Technology (KAIST), Korea. Since March 2002, he has been an Associate Professor at Honam University, Gwangju, Korea. His research interests are data mining, big data, and intelligent agents. He is a member of IEEE.

Yoo-Kang Ji is a Visiting Professor at the Dept. of Information & Communication Eng., DongShin Univ., South Korea. He received the B.S., M.S., and Ph.D. degrees from the Dept. of Information & Communication Eng. of DongShin Univ., Korea, in 2000, 2002 and 2006, respectively. He worked as a professor in the Dept. of Information & Communication Eng., DongShin Univ., from Mar. 2006 to Aug. 2009. His research interests are mobile S/W, networked video, and embedded systems.

Sun Park is a research professor at the School of Information Communication Engineering at Gwangju Institute of Science and Technology (GIST), South Korea. He received the Ph.D. degree in Computer & Information Engineering from Inha University, South Korea, in 2007, the M.S. degree in Information & Communication Engineering from Hannam University, Korea, in 2001, and the B.S. degree in Computer Engineering from Jeonju University, Korea. Prior to becoming a researcher at GIST, he worked as a research professor at Mokpo National University, a postdoctoral researcher at Chonbuk National University, and a professor in the Dept. of Computer Engineering, Honam University, South Korea. His research interests include Data Mining, Information Retrieval, Information Summarization, Convergence IT and Marine, IoT, and Cloud.
10 Vol.8, No.4 (214) 1 Copyright c 214 SERSC
Big Data Summarization Using Semantic. Feture for IoT on Cloud
Contemporary Engineering Sciences, Vol. 7, 2014, no. 22, 1095-1103 HIKARI Ltd, www.m-hikari.com http://dx.doi.org/10.12988/ces.2014.49137 Big Data Summarization Using Semantic Feture for IoT on Cloud Yoo-Kang
FUZZY CLUSTERING ANALYSIS OF DATA MINING: APPLICATION TO AN ACCIDENT MINING SYSTEM
International Journal of Innovative Computing, Information and Control ICIC International c 0 ISSN 34-48 Volume 8, Number 8, August 0 pp. 4 FUZZY CLUSTERING ANALYSIS OF DATA MINING: APPLICATION TO AN ACCIDENT
Clustering Technique in Data Mining for Text Documents
Clustering Technique in Data Mining for Text Documents Ms.J.Sathya Priya Assistant Professor Dept Of Information Technology. Velammal Engineering College. Chennai. Ms.S.Priyadharshini Assistant Professor
Text Mining Approach for Big Data Analysis Using Clustering and Classification Methodologies
Text Mining Approach for Big Data Analysis Using Clustering and Classification Methodologies Somesh S Chavadi 1, Dr. Asha T 2 1 PG Student, 2 Professor, Department of Computer Science and Engineering,
Method of Fault Detection in Cloud Computing Systems
, pp.205-212 http://dx.doi.org/10.14257/ijgdc.2014.7.3.21 Method of Fault Detection in Cloud Computing Systems Ying Jiang, Jie Huang, Jiaman Ding and Yingli Liu Yunnan Key Lab of Computer Technology Application,
Patent Big Data Analysis by R Data Language for Technology Management
, pp. 69-78 http://dx.doi.org/10.14257/ijseia.2016.10.1.08 Patent Big Data Analysis by R Data Language for Technology Management Sunghae Jun * Department of Statistics, Cheongju University, 360-764, Korea
Non-negative Matrix Factorization (NMF) in Semi-supervised Learning Reducing Dimension and Maintaining Meaning
Non-negative Matrix Factorization (NMF) in Semi-supervised Learning Reducing Dimension and Maintaining Meaning SAMSI 10 May 2013 Outline Introduction to NMF Applications Motivations NMF as a middle step
THE Walsh Hadamard transform (WHT) and discrete
IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS I: REGULAR PAPERS, VOL. 54, NO. 12, DECEMBER 2007 2741 Fast Block Center Weighted Hadamard Transform Moon Ho Lee, Senior Member, IEEE, Xiao-Dong Zhang Abstract
An Information Retrieval using weighted Index Terms in Natural Language document collections
Internet and Information Technology in Modern Organizations: Challenges & Answers 635 An Information Retrieval using weighted Index Terms in Natural Language document collections Ahmed A. A. Radwan, Minia
SEMANTIC WEB BASED INFERENCE MODEL FOR LARGE SCALE ONTOLOGIES FROM BIG DATA
SEMANTIC WEB BASED INFERENCE MODEL FOR LARGE SCALE ONTOLOGIES FROM BIG DATA J.RAVI RAJESH PG Scholar Rajalakshmi engineering college Thandalam, Chennai. [email protected] Mrs.
Linear Algebra Methods for Data Mining
Linear Algebra Methods for Data Mining Saara Hyvönen, [email protected] Spring 2007 Text mining & Information Retrieval Linear Algebra Methods for Data Mining, Spring 2007, University of Helsinki
The Combination Forecasting Model of Auto Sales Based on Seasonal Index and RBF Neural Network
, pp.67-76 http://dx.doi.org/10.14257/ijdta.2016.9.1.06 The Combination Forecasting Model of Auto Sales Based on Seasonal Index and RBF Neural Network Lihua Yang and Baolin Li* School of Economics and
Oracle8i Spatial: Experiences with Extensible Databases
Oracle8i Spatial: Experiences with Extensible Databases Siva Ravada and Jayant Sharma Spatial Products Division Oracle Corporation One Oracle Drive Nashua NH-03062 {sravada,jsharma}@us.oracle.com 1 Introduction
Data Mining Project Report. Document Clustering. Meryem Uzun-Per
Data Mining Project Report Document Clustering Meryem Uzun-Per 504112506 Table of Content Table of Content... 2 1. Project Definition... 3 2. Literature Survey... 3 3. Methods... 4 3.1. K-means algorithm...
TECHNOLOGY ANALYSIS FOR INTERNET OF THINGS USING BIG DATA LEARNING
TECHNOLOGY ANALYSIS FOR INTERNET OF THINGS USING BIG DATA LEARNING Sunghae Jun 1 1 Professor, Department of Statistics, Cheongju University, Chungbuk, Korea Abstract The internet of things (IoT) is an
Science Navigation Map: An Interactive Data Mining Tool for Literature Analysis
Science Navigation Map: An Interactive Data Mining Tool for Literature Analysis Yu Liu School of Software [email protected] Zhen Huang School of Software [email protected] Yufeng Chen School of Computer
A Virtual Machine Searching Method in Networks using a Vector Space Model and Routing Table Tree Architecture
A Virtual Machine Searching Method in Networks using a Vector Space Model and Routing Table Tree Architecture Hyeon seok O, Namgi Kim1, Byoung-Dai Lee dept. of Computer Science. Kyonggi University, Suwon,
A Divided Regression Analysis for Big Data
Vol., No. (0), pp. - http://dx.doi.org/0./ijseia.0...0 A Divided Regression Analysis for Big Data Sunghae Jun, Seung-Joo Lee and Jea-Bok Ryu Department of Statistics, Cheongju University, 0-, Korea [email protected],
Fast Fuzzy Control of Warranty Claims System
Journal of Information Processing Systems, Vol.6, No.2, June 2010 DOI : 10.3745/JIPS.2010.6.2.209 Fast Fuzzy Control of Warranty Claims System Sang-Hyun Lee*, Sung Eui Cho* and Kyung-li Moon** Abstract
DATA MINING TECHNOLOGY. Keywords: data mining, data warehouse, knowledge discovery, OLAP, OLAM.
DATA MINING TECHNOLOGY Georgiana Marin 1 Abstract In terms of data processing, classical statistical models are restrictive; it requires hypotheses, the knowledge and experience of specialists, equations,
Using Data Mining for Mobile Communication Clustering and Characterization
Using Data Mining for Mobile Communication Clustering and Characterization A. Bascacov *, C. Cernazanu ** and M. Marcu ** * Lasting Software, Timisoara, Romania ** Politehnica University of Timisoara/Computer
Efficient Data Replication Scheme based on Hadoop Distributed File System
, pp. 177-186 http://dx.doi.org/10.14257/ijseia.2015.9.12.16 Efficient Data Replication Scheme based on Hadoop Distributed File System Jungha Lee 1, Jaehwa Chung 2 and Daewon Lee 3* 1 Division of Supercomputing,
Analytics on Big Data
Analytics on Big Data Riccardo Torlone Università Roma Tre Credits: Mohamed Eltabakh (WPI) Analytics The discovery and communication of meaningful patterns in data (Wikipedia) It relies on data analysis
Distributed Framework for Data Mining As a Service on Private Cloud
RESEARCH ARTICLE OPEN ACCESS Distributed Framework for Data Mining As a Service on Private Cloud Shraddha Masih *, Sanjay Tanwani** *Research Scholar & Associate Professor, School of Computer Science &
An analysis of suitable parameters for efficiently applying K-means clustering to large TCPdump data set using Hadoop framework
An analysis of suitable parameters for efficiently applying K-means clustering to large TCPdump data set using Hadoop framework Jakrarin Therdphapiyanak Dept. of Computer Engineering Chulalongkorn University
Semantic Concept Based Retrieval of Software Bug Report with Feedback
Semantic Concept Based Retrieval of Software Bug Report with Feedback Tao Zhang, Byungjeong Lee, Hanjoon Kim, Jaeho Lee, Sooyong Kang, and Ilhoon Shin Abstract Mining software bugs provides a way to develop
Classification On The Clouds Using MapReduce
Classification On The Clouds Using MapReduce Simão Martins Instituto Superior Técnico Lisbon, Portugal [email protected] Cláudia Antunes Instituto Superior Técnico Lisbon, Portugal [email protected]
Experiments in Web Page Classification for Semantic Web
Experiments in Web Page Classification for Semantic Web Asad Satti, Nick Cercone, Vlado Kešelj Faculty of Computer Science, Dalhousie University E-mail: {rashid,nick,vlado}@cs.dal.ca Abstract We address
UPS battery remote monitoring system in cloud computing
, pp.11-15 http://dx.doi.org/10.14257/astl.2014.53.03 UPS battery remote monitoring system in cloud computing Shiwei Li, Haiying Wang, Qi Fan School of Automation, Harbin University of Science and Technology
Large-Scale Data Sets Clustering Based on MapReduce and Hadoop
Journal of Computational Information Systems 7: 16 (2011) 5956-5963 Available at http://www.jofcis.com Large-Scale Data Sets Clustering Based on MapReduce and Hadoop Ping ZHOU, Jingsheng LEI, Wenjun YE
SEARCH ENGINE WITH PARALLEL PROCESSING AND INCREMENTAL K-MEANS FOR FAST SEARCH AND RETRIEVAL
SEARCH ENGINE WITH PARALLEL PROCESSING AND INCREMENTAL K-MEANS FOR FAST SEARCH AND RETRIEVAL Krishna Kiran Kattamuri 1 and Rupa Chiramdasu 2 Department of Computer Science Engineering, VVIT, Guntur, India
A Load Balancing Algorithm based on the Variation Trend of Entropy in Homogeneous Cluster
pp.11-20 http://dx.doi.org/10.14257/ijgdc.2014.7.2.02 A Load Balancing Algorithm based on the Variation Trend of Entropy in Homogeneous Cluster Kehe Wu 1, Long Chen 2, Shichao Ye 2 and Yi Li 2 1 Beijing
General Framework for an Iterative Solution of Ax = b. Jacobi's Method
2.6 Iterative Solutions of Linear Systems Consistent linear systems in real life are solved in one of two ways: by direct calculation (using a matrix factorization,
IMPROVING BUSINESS PROCESS MODELING USING RECOMMENDATION METHOD
Journal homepage: www.mjret.in ISSN:2348-6953 IMPROVING BUSINESS PROCESS MODELING USING RECOMMENDATION METHOD Deepak Ramchandara Lad 1, Soumitra S. Das 2 Computer Dept. 12 Dr. D. Y. Patil School of Engineering,(Affiliated
Categorical Data Visualization and Clustering Using Subjective Factors
Categorical Data Visualization and Clustering Using Subjective Factors Chia-Hui Chang and Zhi-Kai Ding Department of Computer Science and Information Engineering, National Central University, Chung-Li,
A Statistical Text Mining Method for Patent Analysis
A Statistical Text Mining Method for Patent Analysis Department of Statistics Cheongju University, [email protected] Abstract Most text data from diverse document databases are unsuitable for analytical
DYNAMIC QUERY FORMS WITH NoSQL
IMPACT: International Journal of Research in Engineering & Technology (IMPACT: IJRET) ISSN(E): 2321-8843; ISSN(P): 2347-4599 Vol. 2, Issue 7, Jul 2014, 157-162 Impact Journals DYNAMIC QUERY FORMS WITH
Search Result Optimization using Annotators
Search Result Optimization using Annotators Vishal A. Kamble 1, Amit B. Chougule 2 1 Department of Computer Science and Engineering, D Y Patil College of engineering, Kolhapur, Maharashtra, India 2 Professor,
dm106 TEXT MINING FOR CUSTOMER RELATIONSHIP MANAGEMENT: AN APPROACH BASED ON LATENT SEMANTIC ANALYSIS AND FUZZY CLUSTERING
dm106 TEXT MINING FOR CUSTOMER RELATIONSHIP MANAGEMENT: AN APPROACH BASED ON LATENT SEMANTIC ANALYSIS AND FUZZY CLUSTERING ABSTRACT In most CRM (Customer Relationship Management) systems, information on
Neovision2 Performance Evaluation Protocol
Neovision2 Performance Evaluation Protocol Version 3.0 4/16/2012 Public Release Prepared by Rajmadhan Ekambaram [email protected] Dmitry Goldgof, Ph.D. [email protected] Rangachar Kasturi, Ph.D.
SYMMETRIC EIGENFACES MILI I. SHAH
SYMMETRIC EIGENFACES MILI I. SHAH Abstract. Over the years, mathematicians and computer scientists have produced an extensive body of work in the area of facial analysis. Several facial analysis algorithms
An Efficient Way of Denial of Service Attack Detection Based on Triangle Map Generation
An Efficient Way of Denial of Service Attack Detection Based on Triangle Map Generation Shanofer. S Master of Engineering, Department of Computer Science and Engineering, Veerammal Engineering College,
α = u v. In other words, Orthogonal Projection
Orthogonal Projection Given any nonzero vector v, it is possible to decompose an arbitrary vector u into a component that points in the direction of v and one that points in a direction orthogonal to v
Medical Information Management & Mining. You Chen Jan,15, 2013 [email protected]
Medical Information Management & Mining You Chen Jan,15, 2013 [email protected] 1 Trees Building Materials Trees cannot be used to build a house directly. How can we transform trees to building materials?
Clustering Very Large Data Sets with Principal Direction Divisive Partitioning
Clustering Very Large Data Sets with Principal Direction Divisive Partitioning David Littau 1 and Daniel Boley 2 1 University of Minnesota, Minneapolis MN 55455 [email protected] 2 University of Minnesota,
A FUZZY BASED APPROACH TO TEXT MINING AND DOCUMENT CLUSTERING
A FUZZY BASED APPROACH TO TEXT MINING AND DOCUMENT CLUSTERING Sumit Goswami 1 and Mayank Singh Shishodia 2 1 Indian Institute of Technology-Kharagpur, Kharagpur, India [email protected] 2 School of Computer
An Efficiency Keyword Search Scheme to improve user experience for Encrypted Data in Cloud
pp.246-252 http://dx.doi.org/10.14257/astl.2014.49.45 An Efficiency Keyword Search Scheme to improve user experience for Encrypted Data in Cloud Jiangang Shu ab Xingming Sun ab Lu Zhou ab Jin Wang ab
Torgerson's Classical MDS derivation: 1: Determining Coordinates from Euclidean Distances
Torgerson's Classical MDS derivation: 1: Determining Coordinates from Euclidean Distances It is possible to construct a matrix X of Cartesian coordinates of points in Euclidean space when we know the Euclidean
Research on the Performance Optimization of Hadoop in Big Data Environment
Vol.8, No.5 (2015), pp.293-304 http://dx.doi.org/10.14257/ijdta.2015.8.5.26 Research on the Performance Optimization of Hadoop in Big Data Environment Jia Min-Zheng Department of Information Engineering, Beijing
Integration of Hadoop Cluster Prototype and Analysis Software for SMB
Vol.58 (Cloud and Super Computing 2014), pp.1-5 http://dx.doi.org/10.14257/astl.2014.58.01 Integration of Hadoop Cluster Prototype and Analysis Software for SMB Byung-Rae Cha 1, Yoo-Kang Ji 2, Jong-Won
Machine Learning using MapReduce
Machine Learning using MapReduce What is Machine Learning Machine learning is a subfield of artificial intelligence concerned with techniques that allow computers to improve their outputs based on previous
Log Mining Based on Hadoop's Map and Reduce Technique
Log Mining Based on Hadoop's Map and Reduce Technique ABSTRACT: Anuja Pandit Department of Computer Science, [email protected] Amruta Deshpande Department of Computer Science, [email protected]
Exchange of Data for Big Data in Hybrid Cloud Environment
pp. 67-72 http://dx.doi.org/10.14257/ijseia.2015.9.4.08 Exchange of Data for Big Data in Hybrid Cloud Environment Chi-gon Hwang 1, Chang-Pyo Yoon 2 and Daesung Lee 3 1 Dept of Internet Information, Kyungmin
Design of Micro-payment to Strengthen Security by 2 Factor Authentication with Mobile & Wearable Devices
pp.28-32 http://dx.doi.org/10.14257/astl.2015.109.07 Design of Micro-payment to Strengthen Security by 2 Factor Authentication with Mobile & Wearable Devices Byung-Rae Cha 1, Sang-Hun Lee 2, Soo-Bong
Movie Classification Using k-means and Hierarchical Clustering
Movie Classification Using k-means and Hierarchical Clustering An analysis of clustering algorithms on movie scripts Dharak Shah DA-IICT, Gandhinagar Gujarat, India [email protected] Saheb Motiani
Methodology for Emulating Self Organizing Maps for Visualization of Large Datasets
Methodology for Emulating Self Organizing Maps for Visualization of Large Datasets Macario O. Cordel II and Arnulfo P. Azcarraga College of Computer Studies *Corresponding Author: [email protected]
Volume 2, Issue 11, November 2014 International Journal of Advance Research in Computer Science and Management Studies
Volume 2, Issue 11, November 2014 International Journal of Advance Research in Computer Science and Management Studies Research Article / Survey Paper / Case Study Available online at: www.ijarcsms.com
Advances in Natural and Applied Sciences
AENSI Journals Advances in Natural and Applied Sciences ISSN:1995-0772 EISSN: 1998-1090 Journal home page: www.aensiweb.com/anas Clustering Algorithm Based On Hadoop for Big Data 1 Jayalatchumy D. and
Pattern Recognition Using Feature Based Die-Map Clustering in the Semiconductor Manufacturing Process
Pattern Recognition Using Feature Based Die-Map Clustering in the Semiconductor Manufacturing Process Seung Hwan Park, Cheng-Sool Park, Jun Seok Kim, Youngji Yoo, Daewoong An, Jun-Geol Baek Abstract Depending
MapReduce Approach to Collective Classification for Networks
MapReduce Approach to Collective Classification for Networks Wojciech Indyk 1, Tomasz Kajdanowicz 1, Przemyslaw Kazienko 1, and Slawomir Plamowski 1 Wroclaw University of Technology, Wroclaw, Poland Faculty
Fault Analysis in Software with the Data Interaction of Classes
pp.189-196 http://dx.doi.org/10.14257/ijsia.2015.9.9.17 Fault Analysis in Software with the Data Interaction of Classes Yan Xiaobo 1 and Wang Yichen 2 1 Science & Technology on Reliability & Environmental
Spam Filtering Based on Latent Semantic Indexing
Spam Filtering Based on Latent Semantic Indexing Wilfried N. Gansterer Andreas G. K. Janecek Robert Neumayer Abstract In this paper, a study on the classification performance of a vector space model (VSM)
DATA ANALYSIS II. Matrix Algorithms
DATA ANALYSIS II Matrix Algorithms Similarity Matrix Given a dataset D = {x_i}, i = 1, ..., n, consisting of n points in R^d, let A denote the n × n symmetric similarity matrix between the points, given as where
Density Map Visualization for Overlapping Bicycle Trajectories
pp.327-332 http://dx.doi.org/10.14257/ijca.2014.7.3.31 Density Map Visualization for Overlapping Bicycle Trajectories Dongwook Lee 1, Jinsul Kim 2 and Minsoo Hahn 1 1 Digital Media Lab., Korea Advanced
Facilitating Business Process Discovery using Email Analysis
Facilitating Business Process Discovery using Email Analysis Matin Mavaddat [email protected] Stewart Green Stewart.Green Ian Beeson Ian.Beeson Jin Sa Jin.Sa Abstract Extracting business process
Search Engine Based Intelligent Help Desk System: iassist
Search Engine Based Intelligent Help Desk System: iassist Sahil K. Shah, Prof. Sheetal A. Takale Information Technology Department VPCOE, Baramati, Maharashtra, India [email protected], [email protected]
Design of Electric Energy Acquisition System on Hadoop
pp.47-54 http://dx.doi.org/10.14257/ijgdc.2015.8.5.04 Design of Electric Energy Acquisition System on Hadoop Yi Wu 1 and Jianjun Zhou 2 1 School of Information Science and Technology, Heilongjiang University
W. Heath Rushing Adsurgo LLC. Harness the Power of Text Analytics: Unstructured Data Analysis for Healthcare. Session H-1 JTCC: October 23, 2015
W. Heath Rushing Adsurgo LLC Harness the Power of Text Analytics: Unstructured Data Analysis for Healthcare Session H-1 JTCC: October 23, 2015 Outline Demonstration: Recent article on cnn.com Introduction
Effective Use of Android Sensors Based on Visualization of Sensor Information
pp.299-308 http://dx.doi.org/10.14257/ijmue.2015.10.9.31 Effective Use of Android Sensors Based on Visualization of Sensor Information Young Jae Lee Faculty of Smartmedia, Jeonju University, 303 Cheonjam-ro,
Map-Reduce for Machine Learning on Multicore
Map-Reduce for Machine Learning on Multicore Chu, et al. Problem The world is going multicore New computers - dual core to 12+-core Shift to more concurrent programming paradigms and languages Erlang,
Research on Clustering Analysis of Big Data Yuan Yuanming 1, 2, a, Wu Chanle 1, 2
Advanced Engineering Forum Vols. 6-7 (2012) pp 82-87 Online: 2012-09-26 (2012) Trans Tech Publications, Switzerland doi:10.4028/www.scientific.net/aef.6-7.82 Research on Clustering Analysis of Big Data
Impact of Boolean factorization as preprocessing methods for classification of Boolean data
Impact of Boolean factorization as preprocessing methods for classification of Boolean data Radim Belohlavek, Jan Outrata, Martin Trnecka Data Analysis and Modeling Lab (DAMOL) Dept. Computer Science,
Understanding Web personalization with Web Usage Mining and its Application: Recommender System
Understanding Web personalization with Web Usage Mining and its Application: Recommender System Manoj Swami 1, Prof. Manasi Kulkarni 2 1 M.Tech (Computer-NIMS), VJTI, Mumbai. 2 Department of Computer Technology,
International Journal of Computer Science Trends and Technology (IJCST) Volume 2 Issue 3, May-Jun 2014
RESEARCH ARTICLE OPEN ACCESS A Survey of Data Mining: Concepts with Applications and its Future Scope Dr. Zubair Khan 1, Ashish Kumar 2, Sunny Kumar 3 M.Tech Research Scholar 2. Department of Computer
Nimble Algorithms for Cloud Computing. Ravi Kannan, Santosh Vempala and David Woodruff
Nimble Algorithms for Cloud Computing Ravi Kannan, Santosh Vempala and David Woodruff Cloud computing Data is distributed arbitrarily on many servers Parallel algorithms: time Streaming algorithms: sublinear
Big Data Analytics of Multi-Relationship Online Social Network Based on Multi-Subnet Composited Complex Network
pp.273-284 http://dx.doi.org/10.14257/ijdta.2015.8.5.24 Big Data Analytics of Multi-Relationship Online Social Network Based on Multi-Subnet Composited Complex Network Gengxin Sun 1, Sheng Bin 2 and
A Comparative Study on Sentiment Classification and Ranking on Product Reviews
A Comparative Study on Sentiment Classification and Ranking on Product Reviews C.EMELDA Research Scholar, PG and Research Department of Computer Science, Nehru Memorial College, Putthanampatti, Bharathidasan
International Journal of Engineering Research ISSN: 2348-4039 & Management Technology November-2015 Volume 2, Issue-6
International Journal of Engineering Research ISSN: 2348-4039 & Management Technology Email: [email protected] November-2015 Volume 2, Issue-6 www.ijermt.org Modeling Big Data Characteristics for Discovering
Mehtap Ergüven Abstract of Ph.D. Dissertation for the degree of PhD of Engineering in Informatics
INTERNATIONAL BLACK SEA UNIVERSITY COMPUTER TECHNOLOGIES AND ENGINEERING FACULTY ELABORATION OF AN ALGORITHM OF DETECTING TESTS DIMENSIONALITY Mehtap Ergüven Abstract of Ph.D. Dissertation for the degree
A Solution for Data Inconsistency in Data Integration *
JOURNAL OF INFORMATION SCIENCE AND ENGINEERING 27, 681-695 (2011) A Solution for Data Inconsistency in Data Integration * Department of Computer Science and Engineering Shanghai Jiao Tong University Shanghai,
SERG. Reconstructing Requirements Traceability in Design and Test Using Latent Semantic Indexing
Delft University of Technology Software Engineering Research Group Technical Report Series Reconstructing Requirements Traceability in Design and Test Using Latent Semantic Indexing Marco Lormans and Arie
Using Big Data and GIS to Model Aviation Fuel Burn
Using Big Data and GIS to Model Aviation Fuel Burn Gary M. Baker USDOT Volpe Center 2015 Transportation DataPalooza June 17, 2015 The National Transportation Systems Center Advancing transportation innovation
15.062 Data Mining: Algorithms and Applications Matrix Math Review
15.062 Data Mining: Algorithms and Applications Matrix Math Review The purpose of this document is to give a brief review of selected linear algebra concepts that will be useful for the course and to develop
A Dynamic Resource Management with Energy Saving Mechanism for Supporting Cloud Computing
A Dynamic Resource Management with Energy Saving Mechanism for Supporting Cloud Computing Liang-Teh Lee, Kang-Yuan Liu, Hui-Yang Huang and Chia-Ying Tseng Department of Computer Science and Engineering,
Client Based Power Iteration Clustering Algorithm to Reduce Dimensionality in Big Data
Client Based Power Iteration Clustering Algorithm to Reduce Dimensionality in Big Data Jayalatchumy. D 1, Thambidurai. P 1, Department of CSE, PKIET, Karaikal, India Abstract - Clustering is a group of objects
AUTO CLAIM FRAUD DETECTION USING MULTI CLASSIFIER SYSTEM
AUTO CLAIM FRAUD DETECTION USING MULTI CLASSIFIER SYSTEM ABSTRACT Luis Alexandre Rodrigues and Nizam Omar Department of Electrical Engineering, Mackenzie Presbiterian University, Brazil, São Paulo [email protected],[email protected]
GOAL-BASED INTELLIGENT AGENTS
International Journal of Information Technology, Vol. 9 No. 1 GOAL-BASED INTELLIGENT AGENTS Zhiqi Shen, Robert Gay and Xuehong Tao ICIS, School of EEE, Nanyang Technological University, Singapore 639798
Subspace Analysis and Optimization for AAM Based Face Alignment
Subspace Analysis and Optimization for AAM Based Face Alignment Ming Zhao Chun Chen College of Computer Science Zhejiang University Hangzhou, 310027, P.R.China [email protected] Stan Z. Li Microsoft
Big Data: Study in Structured and Unstructured Data
Big Data: Study in Structured and Unstructured Data Motashim Rasool 1, Wasim Khan 2 [email protected], [email protected] Abstract With the overlay of digital world, Information is available
Big Data with Rough Set Using Map- Reduce
Big Data with Rough Set Using Map- Reduce Mr.G.Lenin 1, Mr. A. Raj Ganesh 2, Mr. S. Vanarasan 3 Assistant Professor, Department of CSE, Podhigai College of Engineering & Technology, Tirupattur, Tamilnadu,
Data Outsourcing based on Secure Association Rule Mining Processes
pp. 41-48 http://dx.doi.org/10.14257/ijsia.2015.9.3.05 Data Outsourcing based on Secure Association Rule Mining Processes V. Sujatha 1, Debnath Bhattacharyya 2, P. Silpa Chaitanya 3 and Tai-hoon Kim
Research on Capability Assessment of IS Outsourcing Service Providers
pp.291-298 http://dx.doi.org/10.14257/ijunesst.2015.8.6.28 Research on Capability Assessment of IS Outsourcing Service Providers Xinjun Li and Junyan Wang School of Economic and Management, Yantai University,
Journal of Chemical and Pharmaceutical Research, 2015, 7(3):1388-1392. Research Article. E-commerce recommendation system on cloud computing
Available online www.jocpr.com Journal of Chemical and Pharmaceutical Research, 2015, 7(3):1388-1392 Research Article ISSN : 0975-7384 CODEN(USA) : JCPRC5 E-commerce recommendation system on cloud computing
A STUDY REGARDING INTER DOMAIN LINKED DOCUMENTS SIMILARITY AND THEIR CONSEQUENT BOUNCE RATE
STUDIA UNIV. BABEŞ BOLYAI, INFORMATICA, Volume LIX, Number 1, 2014 A STUDY REGARDING INTER DOMAIN LINKED DOCUMENTS SIMILARITY AND THEIR CONSEQUENT BOUNCE RATE DIANA HALIŢĂ AND DARIUS BUFNEA Abstract. Then
Detection of Distributed Denial of Service Attack with Hadoop on Live Network
Detection of Distributed Denial of Service Attack with Hadoop on Live Network Suchita Korad 1, Shubhada Kadam 2, Prajakta Deore 3, Madhuri Jadhav 4, Prof.Rahul Patil 5 Students, Dept. of Computer, PCCOE,
