Document Clustering Through Non-Negative Matrix Factorization: A Case Study of Hadoop for Computational Time Reduction of Large Scale Documents
Bishnu Prasad Gautam, Dipesh Shrestha, Members IAENG

Abstract - In this paper we discuss a new model for document clustering which has been adapted using the non-negative matrix factorization method. The key idea is to cluster the documents after measuring the proximity of the documents to the extracted features. The extracted features are treated as the final cluster labels, and clustering is done using cosine similarity, which is equivalent to k-means with a single turn. This model was implemented using the Apache Lucene project for indexing documents and the MapReduce framework of the Apache Hadoop project for a parallel implementation of the k-means algorithm. Although experiments were carried out on only one Hadoop cluster, a significant reduction in time was obtained with the MapReduce implementation once the number of clusters exceeded 9, i.e. 40 documents averaging 1.5 kilobytes. Thus it is concluded that the features extracted using NMF can be used to cluster documents by treating them as final cluster labels, as in k-means, and that for large-scale document collections a parallel implementation using MapReduce can reduce computational time. We have termed this model KNMF (k-means with NMF algorithm).

Index Terms - Document Clustering, KNMF, MapReduce

1. INTRODUCTION

The need to organise data is a must for quick and efficient retrieval of information. A robust means of organising data in any organisation has been the use of databases. Databases, whether relational, object-oriented or object-relational, all keep data in a well-structured format. But not all the information that an organisation generates is, or can be, kept in databases; a huge amount of information is stored in the form of unstructured or semi-structured documents.
Organising these documents into meaningful groups is a typical sub-problem of Information Retrieval, in which there is a need to learn about the general content of the data (Cutting et al. [1]).

1.1 Document clustering

1 Bishnu Prasad Gautam is affiliated with Wakkanai Hokusei Gakuen University, Wakabadai 1-chome, Wakkanai, Hokkaido, Japan. He is currently doing research in cloud computing, networking and distributed computing (contact phone: , fax: , [email protected]). Dipesh Shrestha was affiliated with Wakkanai Hokusei Gakuen University, Wakkanai, Hokkaido, Japan. His research interests are data mining, cloud computing and system engineering. He is now working as a System Engineer with DynaSystem Co., Ltd., 1-14, North 6, West 6, Kita-ku, Sapporo, Hokkaido, Japan (contact phone: , fax: , [email protected]).

Document clustering can loosely be defined as the clustering of documents. Clustering is the process of recognizing similarity and/or dissimilarity between given objects and thereby dividing them into meaningful subgroups sharing common characteristics. Good clusters are those whose members share a great deal of similar characteristics. Since clustering falls under unsupervised learning, no prediction of the class or group a document falls into is carried out. Methods for document clustering can be categorized into two groups:

a. Document partitioning (flat clustering). This approach divides the documents into disjoint clusters. Methods in this category include k-means clustering, probabilistic clustering using the Naive Bayes or Gaussian model, latent semantic indexing (LSI), spectral clustering, and non-negative matrix factorization (NMF).

b. Hierarchical clustering. This approach finds successive clusters of documents from previously obtained clusters, using either a bottom-up (agglomerative) or top-down (divisive) approach.
1.2 Feature extraction

Traditional methods in document clustering use words as the measure of similarity between documents. These words are assumed to be mutually independent, which in real applications may not be the case. The traditional Vector Space Model (VSM) of Information Retrieval uses words to describe documents, but in reality it is concepts, semantics, features, and topics that describe documents. The extraction of these features from documents is called feature extraction. The extracted features hold the most important ideas and concepts pertaining to the documents. Feature extraction has been successfully used in text mining with unsupervised algorithms such as Principal Component Analysis (PCA), Singular Value Decomposition (SVD), and Non-Negative Matrix Factorization (NMF), all of which involve factoring the document-word matrix [5].

1.3 Latent Semantic Indexing (LSI)

Latent Semantic Indexing (LSI) is an Information Retrieval technique designed to address the deficiencies of the classic VSM model. To overcome these shortcomings, LSI estimates the structure in word usage through truncated Singular Value
Decomposition (SVD). Retrieval is then performed using a database of the singular values and vectors obtained from the truncated SVD. Applications of LSI, with results, can be found in Berry et al. [14] and Landauer et al. [15].

1.4 Non-negative matrix factorization (NMF)

Non-negative matrix factorization is a special type of matrix factorization in which a non-negativity constraint is imposed on the lower-rank matrices. It decomposes a matrix V (m x n) into the product of two lower-rank matrices W (m x k) and H (k x n), such that

    V ≈ W H    (1)

where k << min(m, n); the optimum value of k depends on the application and is also influenced by the nature of the collection itself [13]. In document clustering, k is the number of features to be extracted, or equivalently the number of clusters required. V contains documents as columns and terms as rows; the components of a document vector represent the relationship between the document and the terms. The columns of W are the feature vectors, or basis vectors, which may not always be orthogonal (for example, when the features are not independent and overlap). Each column of H contains the weights associated with the basis vectors in W. Thus each document vector from the term-document matrix can be approximately composed as a linear combination of the basis vectors from W, weighted by the corresponding column of H. Let v_i be any document vector from matrix V, let the column vectors of W be {W_1, W_2, ..., W_k}, and let the corresponding components from the column of matrix H be {h_i1, h_i2, ..., h_ik}; then

    v_i ≈ W_1 h_i1 + W_2 h_i2 + ... + W_k h_ik    (2)

NMF uses an iterative procedure to modify the initial values of W and H so that their product approaches V. The procedure terminates when the approximation error converges or the specified number of iterations is reached.
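To make the decomposition concrete, the following pure-Python sketch uses a hypothetical 4-term x 3-document example (the matrices W and H below are made-up values, not the output of any NMF run) and reconstructs a document vector as the linear combination in (2):

```python
# Hypothetical factors for a tiny term-document matrix: m=4 terms, n=3 docs, k=2 features.
# W holds the feature (basis) vectors as columns; H holds per-document weights as columns.
W = [[1.0, 0.0],
     [0.5, 0.5],
     [0.0, 1.0],
     [0.2, 0.8]]           # m x k
H = [[0.9, 0.1, 0.5],
     [0.1, 0.9, 0.5]]      # k x n

m, k, n = 4, 2, 3

def reconstruct_doc(j):
    # Column j of V ~= sum over features c of W[:, c] * H[c][j], i.e. equation (2).
    return [sum(W[i][c] * H[c][j] for c in range(k)) for i in range(m)]

V_approx = [reconstruct_doc(j) for j in range(n)]
print(V_approx[0])  # document 1 is approximately 0.9*W_1 + 0.1*W_2
```

The point of the sketch is only the shape of the factorization: each document is expressed in the k-dimensional feature space spanned by the columns of W.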
The NMF decomposition is non-unique; the matrices W and H depend on the NMF algorithm employed and on the error measure used to check convergence. Some NMF algorithm types are the multiplicative update algorithm of Lee and Seung [2], sparse encoding by Hoyer [10], gradient descent with constrained least squares by Pauca [11], and alternating least squares algorithms [12]. They differ in the cost function used to measure the divergence between V and WH, or in the regularization of the W and/or H matrices. Two simple cost functions studied by Lee and Seung are the squared error (Frobenius norm) and an extension of the Kullback-Leibler divergence to positive matrices. Each cost function leads to a different NMF algorithm, usually minimizing the divergence with iterative update rules.

Using the Frobenius norm for matrices, the objective function, or minimization problem, can be stated as

    min_{W,H} || V - W H ||_F^2    (3)

where W and H are non-negative. The method proposed by Lee and Seung [2], based on multiplicative update rules using the Frobenius norm and popularly called the multiplicative method (MM), can be described as follows.

MM Algorithm
i. Initialize W and H with non-negative values.
ii. Iterate for each c, j, and i until the approximation error converges, or for at most l iterations:

    (a) H_cj ← H_cj (W^T V)_cj / ((W^T W H)_cj + e)    (4)
    (b) W_ic ← W_ic (V H^T)_ic / ((W H H^T)_ic + e)    (5)

In steps ii(a) and ii(b), e, a small positive parameter equal to 10^-9, is added to avoid division by zero. As can be observed from the MM algorithm, W and H remain non-negative during the updates. Lee and Seung [2] proved that under these update rules the objective function (3) is non-increasing and converges monotonically, becoming constant if and only if W and H are at a stationary point. The solution to the objective function is not unique.

1.5 Document clustering with NMF

Ding C et al.
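A minimal pure-Python sketch of the MM algorithm follows. The toy matrix V, the choice of k = 2, and the iteration count are illustrative assumptions; a real implementation would operate on a tf-idf term-document matrix:

```python
import random

# Row-major matrix helpers.
def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def nmf(V, k, iters=100, e=1e-9):
    """Lee-Seung multiplicative updates for min ||V - WH||_F^2, updates (4) and (5)."""
    random.seed(0)
    m, n = len(V), len(V[0])
    W = [[random.random() for _ in range(k)] for _ in range(m)]
    H = [[random.random() for _ in range(n)] for _ in range(k)]
    for _ in range(iters):
        # (4): H_cj <- H_cj * (W^T V)_cj / ((W^T W H)_cj + e)
        WtV = matmul(transpose(W), V)
        WtWH = matmul(matmul(transpose(W), W), H)
        H = [[H[c][j] * WtV[c][j] / (WtWH[c][j] + e) for j in range(n)] for c in range(k)]
        # (5): W_ic <- W_ic * (V H^T)_ic / ((W H H^T)_ic + e)
        VHt = matmul(V, transpose(H))
        WHHt = matmul(matmul(W, H), transpose(H))
        W = [[W[i][c] * VHt[i][c] / (WHHt[i][c] + e) for c in range(k)] for i in range(m)]
    return W, H

V = [[1.0, 0.9, 0.0],
     [0.8, 1.0, 0.1],
     [0.0, 0.1, 1.0]]      # toy term-document matrix
W, H = nmf(V, k=2)
WH = matmul(W, H)
err = sum((V[i][j] - WH[i][j]) ** 2 for i in range(len(V)) for j in range(len(V[0])))
print(round(err, 4))
```

Because the updates only multiply non-negative quantities, W and H stay non-negative throughout, and the Frobenius error is non-increasing, exactly as stated above.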
[8] show that when the Frobenius norm is used as the divergence and an orthogonality constraint H H^T = I is added, NMF is equivalent to a relaxed form of k-means clustering. Wei Xu et al. were the first to use NMF for document clustering [6], adding a unit Euclidean-distance constraint to the column vectors of W. This work was extended by Yang et al. [7] with sparsity constraints, since sparseness is an important characteristic of large data in semantic space. In both works the clustering is based on the interpretation of the elements of the matrices. There is an analogy with the SVD in interpreting the meaning of the two non-negative matrices U and V: each element u_ij of matrix U represents the degree to which term f_i of W belongs to cluster j, while each element v_ij of matrix V indicates the degree to which document i is associated with cluster j. If document i belongs solely to cluster x, then v_ix takes a large value while the rest of the elements in the i-th row vector of V take small values close to zero [6]. From the above statement, we can summarize the correspondence between the factorizations as W ↔ U and H ↔ V^T. From the work of Kanjani K [9] it can be seen that the accuracy of the algorithm of Lee and Seung [2] is higher than that of its derivatives [9], [10]. In our model, the original multiplicative update proposed by Lee and Seung [2] is adopted.

2. METHODOLOGY

The following section describes the proposed model in detail. It includes the clustering method of the proposed model and the parallel implementation strategy for k-means. The latter parts explain the underlying architecture of the Hadoop Distributed File System (HDFS), over which the k-means algorithm is implemented.

2.1 The Proposed Model

Hereinafter our model is termed KNMF (k-means with NMF algorithm). In KNMF the document clustering is
done on the basis of the similarity between the extracted features and each document. Let the extracted feature vectors computed by NMF be F = {f_1, f_2, f_3, ..., f_k}, and let the documents in the term-document matrix be V = {d_1, d_2, d_3, ..., d_n}. Then document d_i is said to be a member of cluster f_x if the angle between d_i and f_x is minimal.

The methodology adopted:
i. Construct the term-document matrix V from the files of a given folder using term frequency-inverse document frequency (tf-idf).
ii. Normalize the columns of V to unit Euclidean length.
iii. Perform NMF based on Lee and Seung [2] on V to obtain W and H as in (1).
iv. Apply cosine similarity to measure the distance between each document d_i and the extracted feature vectors of W, and assign d_i to w_x if the angle between d_i and w_x is smallest. This is equivalent to the k-means algorithm with a single turn.

To run the parallel version of the k-means algorithm, Hadoop is started in local reference mode and in pseudo-distributed mode, and the k-means job is submitted to the JobClient. The time taken by steps i through iii and the total time taken were noted separately.

Steps in indexing the documents in a folder:
i. Determine whether the document is new or whether its index entry is out of date.
ii. If it is up to date, do nothing. If the document is new, create a Lucene Document; if its entry is out of date, delete the old document and create a new Lucene Document.
iii. Extract the words from the document.
iv. Remove the stop-words.
v. Apply stemming.
vi. Store the created Lucene Document in the index.
vii. Remove stray files.

The Lucene Document contains three fields, namely path, contents and modified. These fields store, respectively, the full path of the document, its terms, and its modification date (to the second). The field path is used to uniquely identify documents in the index; the field modified is used to avoid re-indexing a document that has not been modified.
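Steps i and ii above can be sketched as follows. The corpus is a toy, already-stemmed stand-in, and the tf x log(N/df) weighting shown is the textbook variant, which may differ in detail from Lucene's scoring:

```python
import math
from collections import Counter

# Toy corpus standing in for the files of a folder (stop-words removed, stemmed).
docs = {"d1": ["hadoop", "cluster", "hadoop"],
        "d2": ["matrix", "factor", "matrix"],
        "d3": ["hadoop", "matrix"]}

N = len(docs)
vocab = sorted({t for words in docs.values() for t in words})
df = Counter(t for words in docs.values() for t in set(words))  # document frequency

def tfidf_column(words):
    """One column of V: tf-idf weights (step i), scaled to unit length (step ii)."""
    tf = Counter(words)
    col = [tf[t] * math.log(N / df[t]) for t in vocab]
    norm = math.sqrt(sum(x * x for x in col))
    return [x / norm for x in col] if norm else col

V = {d: tfidf_column(words) for d, words in docs.items()}  # columns of term-document matrix
print(vocab)
```

Terms appearing in every document get idf = log(1) = 0, so they contribute nothing to similarity, which is the intended effect of the inverse-document-frequency factor.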
In step vii, documents which have been removed from the folder but still have entries in the index are removed from the index. This step is followed to keep the word-dictionary size optimal. In our implementation, stop-words are filtered using the word list from the Key Phrase Extraction Algorithm project [4], which defines some 499 stop-words. This word list can be modified by users if further stop-words need to be maintained. After the removal of stop-words, the document is stemmed by the Porter algorithm [3].

2.2 Parallel implementation of k-means

The parallel implementation strategy of the k-means algorithm on multi-core machines is described in [20]: in k-means, the operation of computing the Euclidean distance between the sample vectors and the centroids can be parallelized by splitting the data into individual subgroups and clustering the samples in each subgroup separately (by a mapper). In recalculating new centroid vectors, the sample vectors are divided into subgroups, the sum of the vectors in each subgroup is computed in parallel, and finally the reducer adds up the partial sums and computes the new centroids.

In the same paper it was noted that the performance of the k-means algorithm with map-reduce was, on average, a multiple of its serial implementation without map-reduce, with speedups reported on data sets from the Synthetic Time Series (features = 10) to KDD Cup 1999 (features = 41). It also adds that it was possible to achieve a 54-times speedup on 64 cores; indeed, the performance improvement with an increasing number of cores was almost linear. This paper was the source and foundation of the Mahout project, which includes an implementation strategy for the k-means algorithm in map-reduce over Hadoop. An implementation of k-means in map-reduce is also presented in lectures from [17]. Drost, I [16] describes the k-means of the Mahout project. In [18] Gillick et al.
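The partial-sum strategy from [20] described above can be simulated in a few lines of Python. The two-dimensional points and the two mapper splits are hypothetical; in Hadoop the mappers would run on separate nodes:

```python
def nearest(point, centroids):
    # Cluster id of the centroid closest (squared Euclidean distance) to the point.
    return min(centroids, key=lambda c: sum((p - q) ** 2 for p, q in zip(point, centroids[c])))

def map_split(split, centroids):
    """Mapper: for one data split, emit per-cluster (partial_sum, count)."""
    partial = {}
    for point in split:
        c = nearest(point, centroids)
        s, n = partial.get(c, ([0.0] * len(point), 0))
        partial[c] = ([a + b for a, b in zip(s, point)], n + 1)
    return partial

def reduce_partials(partials):
    """Reducer: add up partial sums and divide by counts to get new centroids."""
    totals = {}
    for partial in partials:
        for c, (s, n) in partial.items():
            S, N = totals.get(c, ([0.0] * len(s), 0))
            totals[c] = ([a + b for a, b in zip(S, s)], N + n)
    return {c: [v / n for v in s] for c, (s, n) in totals.items()}

centroids = {"c1": [0.0, 0.0], "c2": [5.0, 5.0]}
splits = [[[0.0, 1.0], [1.0, 0.0]],            # data for mapper 1
          [[4.0, 5.0], [6.0, 5.0]]]            # data for mapper 2
print(reduce_partials([map_split(s, centroids) for s in splits]))
# {'c1': [0.5, 0.5], 'c2': [5.0, 5.0]}
```

Each mapper only ever sees its own split, and the reducer only sees the small per-cluster (sum, count) pairs, which is what makes the recomputation parallelizable.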
studied the performance of Hadoop's implementation of map-reduce and suggested performance-enhancing guidance as well.

MapReduce in KNMF

In the method proposed in this work, since the final clusters are the features extracted by the NMF algorithm, the map-reduce parallelization strategy can be applied to compute the distance between the data vectors and the feature vectors. Since only one iteration is required, this can be considered a single map-reduce operation; furthermore, since no cluster-centroid computation is needed, a single map operation is sufficient. The map operation takes the list of feature vectors and an individual data vector, and outputs the closest feature vector for that data vector. For instance, given the list of data vectors V = {v1, v2, ..., vn} and the list of feature vectors W = {w1, w2, w3} computed by NMF:

    <v_i, W> --map--> <v_i, w_x>

where w_x is the feature vector closest (by cosine similarity) to the data vector v_i.

2.3 Hadoop

Hadoop is a distributed file system written in Java, with an additional implementation of Google's MapReduce framework [19] that enables applications based on the map-reduce paradigm to run over the file system. It provides high-throughput access to data, which makes it suitable for working with large-scale data (the typical block size is 64 MB).

Hadoop Distributed File System (HDFS)

HDFS is a distributed file system specifically designed for fault tolerance that can be deployed on low-cost hardware. It is based on a master/slave architecture. The master nodes are called namenodes, and every cluster has exactly one namenode. It manages the filesystem namespace and client access to files (opening, closing, renaming files).
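The single map operation can be sketched as follows. The feature vectors in W and the document vectors are made-up stand-ins for NMF output and tf-idf columns:

```python
import math

# Hypothetical NMF feature vectors (columns of W), broadcast to every mapper.
W = {"w1": [1.0, 0.1, 0.0],
     "w2": [0.0, 0.2, 1.0]}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / den if den else 0.0

def knmf_map(record):
    """<v_i, W> -> <v_i, w_x>: emit the id of the closest feature vector."""
    doc_id, v = record
    w_x = max(W, key=lambda w: cosine(v, W[w]))
    return (doc_id, w_x)

records = [("d1", [0.9, 0.0, 0.1]),
           ("d2", [0.1, 0.1, 0.8])]
print([knmf_map(r) for r in records])  # [('d1', 'w1'), ('d2', 'w2')]
```

Since each record is mapped independently and no reduce is needed, this single map phase is the whole clustering step of KNMF.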
The task of a datanode is to manage the data stored in its node (each file is stored in one or more blocks). It is responsible for read/write requests from clients (creation, deletion, and replication of blocks). All HDFS communication protocols are layered on top of the TCP/IP protocol. Files in HDFS are write-once (read-many) and have strictly one writer at any time. The blocks of a file are replicated for fault tolerance; the block size and replication factor are configurable per file. Hadoop can be run in local (stand-alone) mode, pseudo-distributed mode, or fully-distributed mode.

MapReduce framework in Hadoop

The input and output of a map-reduce application can be shown as follows:

    (input) <k1,v1> --map--> <k2,v2> --reduce--> <k3,v3> (output)

The input data is divided and processed in parallel across different machines in the map phase, and the reduce phase combines the data according to the key to give the final output. For this sort of task the framework should be based on a master/slave architecture; since HDFS is itself based on a master/slave architecture, the MapReduce framework fits well in Hadoop. Moreover, the compute nodes and the storage nodes are usually the same, that is, the map-reduce framework and the distributed filesystem run on the same set of nodes. This configuration allows the framework to effectively schedule tasks on the nodes where the data is already present, resulting in very high aggregate bandwidth across the cluster. To implement the map-reduce framework in Hadoop there is a single master, called the JobTracker, per job. A job is the list of tasks submitted to the MapReduce framework in Hadoop. The master is responsible for scheduling the jobs' component tasks on the slaves, monitoring them, and re-executing failed tasks. There can be one slave, or TaskTracker, per cluster node; the slaves execute the tasks as directed by the master.

Figure 1: Clustering with our model.
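The <k1,v1> to <k3,v3> flow can be illustrated with the canonical word-count example, simulated here in plain Python (no Hadoop involved; the loop merely mirrors the map, shuffle and reduce phases):

```python
from collections import defaultdict

# map: <line_no, line> -> list of <word, 1>
def map_fn(k1, v1):
    return [(word, 1) for word in v1.split()]

# reduce: <word, [counts]> -> <word, total>
def reduce_fn(k2, values):
    return (k2, sum(values))

lines = {1: "hadoop map reduce", 2: "map reduce map"}

# shuffle phase: group the intermediate <k2, v2> pairs by key
grouped = defaultdict(list)
for k1, v1 in lines.items():
    for k2, v2 in map_fn(k1, v1):
        grouped[k2].append(v2)

output = dict(reduce_fn(k2, vs) for k2, vs in grouped.items())
print(output)  # {'hadoop': 1, 'map': 3, 'reduce': 2}
```

The framework's job is exactly the part this simulation trivializes: distributing map_fn over splits, shuffling the intermediate pairs by key across nodes, and re-running failed tasks.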
3. IMPLEMENTATION

Since Hadoop, Lucene and Mahout are natively built with Java, interoperability between components developed in Java is straightforward. Considering this fact, Java was chosen as the programming language for the implementation of our proposed model. (This application, named Swami, is distributed under the GNU GPL v3.0. It is currently tested only on GNU/Linux.)

Before the documents can be clustered, they need to be indexed. For the purpose of indexing, the Lucene APIs have been utilized. Each document is checked to determine whether it is up to date in the index: if it is, nothing further is done; if the document is new, a Lucene Document is created; if its entry is out of date, the old document is deleted and a new Lucene Document is created. The stop-words are removed using the key-word list from [4], and Porter stemming [3] is applied. Indexing is described in detail in the preceding section.

Clustering involves more complex steps than indexing: it requires the creation of the document-term matrix followed by the application of KNMF, as described above. Figure 1 shows the files to be clustered in the left frame; the resulting clusters and folders, along with their corresponding files, are shown in the lower-middle frame. The upper-middle frame shows the extracted themes. The settings for the number of clusters, the convergence delta for k-means and NMF, and the stop-word list are made in the right frame.

4. EXPERIMENTS AND RESULTS

Table I: List of topics of 20 Newsgroups

  comp.graphics             rec.autos              sci.crypt         talk.politics.misc
  comp.os.ms-windows.misc   rec.motorcycles        sci.electronics   talk.politics.guns
  comp.sys.ibm.pc.hardware  rec.sport.baseball     sci.med           talk.politics.mideast
  comp.sys.mac.hardware     rec.sport.hockey       sci.space         talk.religion.misc
  comp.windows.x            misc.forsale           alt.atheism       soc.religion.christian
4.1 Data set description

20 Newsgroups is a popular data set for text clustering and classification. It is a collection of about 20,000 documents across 20 different newsgroups from Usenet. Each newsgroup is stored in a subdirectory, with each article stored as a separate file. Some of the newsgroups are closely related to each other, while others are highly unrelated. Table I shows the topics of the newsgroups as arranged by Jason Rennie.

4.2 Experiment

For the purpose of experimentation, clustering was done using up to 10 groups. Five documents from each group were taken at random and added to a folder. The folder was indexed after removing stop-words using the KEA stop-word list [4] and applying Porter stemming [3]; the clustering was then performed and the results noted. Another five documents were then taken at random from another group, added to the folder, indexed, and clustered accordingly. In this way a total of 10 groups with 50 documents were clustered. Clustering results were noted for three cases: without using Hadoop, using Hadoop in local mode, and using Hadoop in pseudo-distributed mode. In addition, the following parameters were used for KNMF:
i. NMF: convergence parameter = , maximum iterations = 10.
ii. K-means: k = number of newsgroups in the folder, convergence parameter = 0.001, maximum iterations = 1, distance measure = cosine.

Since the length of W was not normalized as suggested by Xu et al. [6], there was no unique solution; for this reason, of the three cases mentioned above, the experiment with the highest value of AC was taken. The performance of the clustering algorithm was evaluated by calculating the accuracy defined in [6], as follows: given a document d_i, let l_i and α_i be the cluster label and the label provided by the document corpus, respectively.
The AC is defined as follows:

    AC = ( Σ_{i=1}^{n} δ(α_i, map(l_i)) ) / n    (6)

where n denotes the total number of documents, δ(x, y) is the delta function that equals one if x = y and zero otherwise, and map(l_i) is the mapping function that maps each cluster label l_i to the equivalent label from the document corpus.

4.3 Results

Table II shows the time taken by the KNMF algorithm on the 20 Newsgroups collection on a Linux Ubuntu 8.04 machine (1.66 GHz Intel Pentium Dual-Core, 1 GB RAM), with and without MapReduce (maps = 2). The number of clusters is denoted by k and the accuracy by AC. The columns Without Hadoop and Local Reference mode show the time taken by NMF and by KNMF (the value in the numerator is the time taken by NMF and the value in the denominator is the total time taken by KNMF). The column Pseudo-Distributed mode shows the time taken by the map phases (the numerator is the time taken by map phase 1 and the denominator the time taken by map phase 2).

Table II: Results (columns: k, AC, Without Hadoop, Local Reference mode, Pseudo-Distributed mode) [numeric entries not recoverable]

Figure 2 shows the time taken during the clustering phase, calculated from Table II as the total time taken minus the time taken by NMF. For the pseudo-distributed mode of Hadoop, the time taken by the map phase is considered the time taken for clustering. It can be seen that the time taken by the map phase in pseudo-distributed mode is quite steady and rises only when the number of clusters increases to 8. The time taken by the local reference mode and by the serial implementation without Hadoop exceeds the time taken by the pseudo-distributed mode of Hadoop for a cluster size of 10.

Figure 2: Time taken by the clustering phase (k-means with 1 turn).
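The accuracy measure (6) can be computed as sketched below. The map() function is realized here by brute-force search over one-to-one label mappings, which is feasible for the small k used in these experiments; the label sequences are hypothetical:

```python
from itertools import permutations

def accuracy(cluster_labels, true_labels):
    """AC from (6): fraction of documents whose mapped cluster label matches the corpus label,
    under the best one-to-one mapping of cluster labels to corpus labels."""
    n = len(true_labels)
    clusters = sorted(set(cluster_labels))
    classes = sorted(set(true_labels))
    best = 0
    for perm in permutations(classes):
        mapping = dict(zip(clusters, perm))  # one candidate map(l_i)
        hits = sum(1 for l, a in zip(cluster_labels, true_labels) if mapping[l] == a)
        best = max(best, hits)
    return best / n

# Five documents, two clusters: the best mapping is 0 -> "a", 1 -> "b" (4 of 5 correct).
print(accuracy([0, 0, 1, 1, 1], ["a", "a", "b", "b", "a"]))  # 0.8
```

Brute force over permutations is O(k!), so for larger k the optimal mapping would instead be found with the Hungarian algorithm; for k up to 10, as here, enumeration is still workable.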
5. CONCLUSIONS

In this work a new working model for document clustering was given, along with the development of an application based on this model. The application can be used to organise documents into sub-folders without any knowledge of the contents of the documents, which improves the performance of information retrieval. The accuracy of the model was tested and found to be 80% for 2 clusters of documents and 75% for 3 clusters, and the results average 65% over 2 through 10 clusters. NMF has been shown to be a good basis for clustering documents, and this work has shown similar results when the extracted features are used as the final cluster labels for the k-means algorithm. To scale document clustering, the proposed model uses the map-reduce implementation of k-means from the Apache Hadoop project, and it has been shown to scale even on a single-cluster computer when the number of clusters exceeded 9, i.e. 40 documents averaging 1.5 kilobytes.

REFERENCES

[1] Cutting, D, Karger, D, Pedersen, J & Tukey, J (1992). Scatter/Gather: A cluster-based approach to browsing large document collections. In Proceedings of ACM SIGIR.
[2] Lee, D & Seung, H (2001). Algorithms for non-negative matrix factorization. In T. G. Dietterich and V. Tresp, editors, Advances in Neural Information Processing Systems, volume 13, Proceedings of the 2000 Conference. The MIT Press.
[3] Porter, MF (1980). An algorithm for suffix stripping. Program, Vol. 14, No. 3.
[4] Key Phrase Extraction Algorithm (KEA).
[5] Guduru, N (2006). Text mining with support vector machines and non-negative matrix factorization algorithm. Master's Thesis, University of Rhode Island, CS Dept.
[6] Xu, W, Liu, X & Gong, Y (2003). Document clustering based on non-negative matrix factorization. In Proceedings of ACM SIGIR.
[7] Yang, CF, Ye, M & Zhao, J (2005). Document clustering based on non-negative sparse matrix factorization. International Conference on Advances in Natural Computation.
[8] Ding, C, He, X & Simon, HD (2005). On the equivalence of nonnegative matrix factorization and spectral clustering. In Proceedings of the SIAM International Conference on Data Mining.
[9] Kanjani, K (2007). Parallel Non-Negative Matrix Factorization for Document Clustering.
[10] Hoyer, P (2002). Non-negative sparse coding. In Proceedings of the IEEE Workshop on Neural Networks for Signal Processing, Martigny, Switzerland.
[11] Pauca, V, Shahnaz, F, Berry, MW & Plemmons, R (April 22-24, 2004). Text mining using non-negative matrix factorizations. In Proceedings of the Fourth SIAM International Conference on Data Mining, Lake Buena Vista, FL.
[12] Langville, A & Meyer, C (2006). ALS algorithms for nonnegative matrix factorization in text mining.
[13] Guillamet, D & Vitria, J (2002). Determining a suitable metric when using non-negative matrix factorization. In Sixteenth International Conference on Pattern Recognition (ICPR 02), Vol. 2, Quebec City, QC, Canada.
[14] Berry, M, Dumais, S & O'Brien, GW (1995). Using linear algebra for intelligent information retrieval. Illustration of the application of LSA to document retrieval.
[15] Landauer, T, Foltz, PW & Laham, D (1998). Introduction to latent semantic analysis. Discourse Processes 25.
[16] Drost, I (November 2008). Apache Mahout: Bringing machine learning to industrial strength. In Proceedings of ApacheCon 2008, New Orleans.
[17] Michels, S (July 5, 2007). Problem solving on large-scale clusters, Lecture 4.
[18] Gillick, D, Faria, A & DeNero, J (December 18, 2006). MapReduce: Distributed computing for machine learning.
[19] Dean, J & Ghemawat, S (December 2004). MapReduce: Simplified data processing on large clusters. In Proceedings of the 6th Symposium on Operating Systems Design and Implementation.
[20] Chu, C, Kim, SK, Lin, YA, Yu, YY, Bradski, G, Ng, A & Olukotun, K (2006). Map-Reduce for machine learning on multicore. NIPS.
NoSQL and Hadoop Technologies On Oracle Cloud
NoSQL and Hadoop Technologies On Oracle Cloud Vatika Sharma 1, Meenu Dave 2 1 M.Tech. Scholar, Department of CSE, Jagan Nath University, Jaipur, India 2 Assistant Professor, Department of CSE, Jagan Nath
Hadoop IST 734 SS CHUNG
Hadoop IST 734 SS CHUNG Introduction What is Big Data?? Bulk Amount Unstructured Lots of Applications which need to handle huge amount of data (in terms of 500+ TB per day) If a regular machine need to
Machine Learning using MapReduce
Machine Learning using MapReduce What is Machine Learning Machine learning is a subfield of artificial intelligence concerned with techniques that allow computers to improve their outputs based on previous
Distributed File System. MCSN N. Tonellotto Complements of Distributed Enabling Platforms
Distributed File System 1 How do we get data to the workers? NAS Compute Nodes SAN 2 Distributed File System Don t move data to workers move workers to the data! Store data on the local disks of nodes
Apache Hadoop new way for the company to store and analyze big data
Apache Hadoop new way for the company to store and analyze big data Reyna Ulaque Software Engineer Agenda What is Big Data? What is Hadoop? Who uses Hadoop? Hadoop Architecture Hadoop Distributed File
http://www.paper.edu.cn
5 10 15 20 25 30 35 A platform for massive railway information data storage # SHAN Xu 1, WANG Genying 1, LIU Lin 2** (1. Key Laboratory of Communication and Information Systems, Beijing Municipal Commission
Welcome to the unit of Hadoop Fundamentals on Hadoop architecture. I will begin with a terminology review and then cover the major components
Welcome to the unit of Hadoop Fundamentals on Hadoop architecture. I will begin with a terminology review and then cover the major components of Hadoop. We will see what types of nodes can exist in a Hadoop
Prepared By : Manoj Kumar Joshi & Vikas Sawhney
Prepared By : Manoj Kumar Joshi & Vikas Sawhney General Agenda Introduction to Hadoop Architecture Acknowledgement Thanks to all the authors who left their selfexplanatory images on the internet. Thanks
Large-scale Data Mining: MapReduce and Beyond Part 2: Algorithms. Spiros Papadimitriou, IBM Research Jimeng Sun, IBM Research Rong Yan, Facebook
Large-scale Data Mining: MapReduce and Beyond Part 2: Algorithms Spiros Papadimitriou, IBM Research Jimeng Sun, IBM Research Rong Yan, Facebook Part 2:Mining using MapReduce Mining algorithms using MapReduce
Hadoop SNS. renren.com. Saturday, December 3, 11
Hadoop SNS renren.com Saturday, December 3, 11 2.2 190 40 Saturday, December 3, 11 Saturday, December 3, 11 Saturday, December 3, 11 Saturday, December 3, 11 Saturday, December 3, 11 Saturday, December
Sparse Nonnegative Matrix Factorization for Clustering
Sparse Nonnegative Matrix Factorization for Clustering Jingu Kim and Haesun Park College of Computing Georgia Institute of Technology 266 Ferst Drive, Atlanta, GA 30332, USA {jingu, hpark}@cc.gatech.edu
Clustering Technique in Data Mining for Text Documents
Clustering Technique in Data Mining for Text Documents Ms.J.Sathya Priya Assistant Professor Dept Of Information Technology. Velammal Engineering College. Chennai. Ms.S.Priyadharshini Assistant Professor
MapReduce and Hadoop Distributed File System
MapReduce and Hadoop Distributed File System 1 B. RAMAMURTHY Contact: Dr. Bina Ramamurthy CSE Department University at Buffalo (SUNY) [email protected] http://www.cse.buffalo.edu/faculty/bina Partially
Overview. Big Data in Apache Hadoop. - HDFS - MapReduce in Hadoop - YARN. https://hadoop.apache.org. Big Data Management and Analytics
Overview Big Data in Apache Hadoop - HDFS - MapReduce in Hadoop - YARN https://hadoop.apache.org 138 Apache Hadoop - Historical Background - 2003: Google publishes its cluster architecture & DFS (GFS)
Machine Learning Big Data using Map Reduce
Machine Learning Big Data using Map Reduce By Michael Bowles, PhD Where Does Big Data Come From? -Web data (web logs, click histories) -e-commerce applications (purchase histories) -Retail purchase histories
Data Mining Project Report. Document Clustering. Meryem Uzun-Per
Data Mining Project Report Document Clustering Meryem Uzun-Per 504112506 Table of Content Table of Content... 2 1. Project Definition... 3 2. Literature Survey... 3 3. Methods... 4 3.1. K-means algorithm...
Volume 3, Issue 6, June 2015 International Journal of Advance Research in Computer Science and Management Studies
Volume 3, Issue 6, June 2015 International Journal of Advance Research in Computer Science and Management Studies Research Article / Survey Paper / Case Study Available online at: www.ijarcsms.com Image
Map/Reduce Affinity Propagation Clustering Algorithm
Map/Reduce Affinity Propagation Clustering Algorithm Wei-Chih Hung, Chun-Yen Chu, and Yi-Leh Wu Department of Computer Science and Information Engineering, National Taiwan University of Science and Technology,
Radoop: Analyzing Big Data with RapidMiner and Hadoop
Radoop: Analyzing Big Data with RapidMiner and Hadoop Zoltán Prekopcsák, Gábor Makrai, Tamás Henk, Csaba Gáspár-Papanek Budapest University of Technology and Economics, Hungary Abstract Working with large
GraySort and MinuteSort at Yahoo on Hadoop 0.23
GraySort and at Yahoo on Hadoop.23 Thomas Graves Yahoo! May, 213 The Apache Hadoop[1] software library is an open source framework that allows for the distributed processing of large data sets across clusters
Hadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh
1 Hadoop: A Framework for Data- Intensive Distributed Computing CS561-Spring 2012 WPI, Mohamed Y. Eltabakh 2 What is Hadoop? Hadoop is a software framework for distributed processing of large datasets
Hadoop Operations Management for Big Data Clusters in Telecommunication Industry
Hadoop Operations Management for Big Data Clusters in Telecommunication Industry N. Kamalraj Asst. Prof., Department of Computer Technology Dr. SNS Rajalakshmi College of Arts and Science Coimbatore-49
Text Mining Approach for Big Data Analysis Using Clustering and Classification Methodologies
Text Mining Approach for Big Data Analysis Using Clustering and Classification Methodologies Somesh S Chavadi 1, Dr. Asha T 2 1 PG Student, 2 Professor, Department of Computer Science and Engineering,
Text Mining in JMP with R Andrew T. Karl, Senior Management Consultant, Adsurgo LLC Heath Rushing, Principal Consultant and Co-Founder, Adsurgo LLC
Text Mining in JMP with R Andrew T. Karl, Senior Management Consultant, Adsurgo LLC Heath Rushing, Principal Consultant and Co-Founder, Adsurgo LLC 1. Introduction A popular rule of thumb suggests that
Open source Google-style large scale data analysis with Hadoop
Open source Google-style large scale data analysis with Hadoop Ioannis Konstantinou Email: [email protected] Web: http://www.cslab.ntua.gr/~ikons Computing Systems Laboratory School of Electrical
Client Based Power Iteration Clustering Algorithm to Reduce Dimensionality in Big Data
Client Based Power Iteration Clustering Algorithm to Reduce Dimensionalit in Big Data Jaalatchum. D 1, Thambidurai. P 1, Department of CSE, PKIET, Karaikal, India Abstract - Clustering is a group of objects
Spam Filtering Based on Latent Semantic Indexing
Spam Filtering Based on Latent Semantic Indexing Wilfried N. Gansterer Andreas G. K. Janecek Robert Neumayer Abstract In this paper, a study on the classification performance of a vector space model (VSM)
MapReduce and Hadoop. Aaron Birkland Cornell Center for Advanced Computing. January 2012
MapReduce and Hadoop Aaron Birkland Cornell Center for Advanced Computing January 2012 Motivation Simple programming model for Big Data Distributed, parallel but hides this Established success at petabyte
International Journal of Innovative Research in Computer and Communication Engineering
FP Tree Algorithm and Approaches in Big Data T.Rathika 1, J.Senthil Murugan 2 Assistant Professor, Department of CSE, SRM University, Ramapuram Campus, Chennai, Tamil Nadu,India 1 Assistant Professor,
Open source large scale distributed data management with Google s MapReduce and Bigtable
Open source large scale distributed data management with Google s MapReduce and Bigtable Ioannis Konstantinou Email: [email protected] Web: http://www.cslab.ntua.gr/~ikons Computing Systems Laboratory
MapReduce Approach to Collective Classification for Networks
MapReduce Approach to Collective Classification for Networks Wojciech Indyk 1, Tomasz Kajdanowicz 1, Przemyslaw Kazienko 1, and Slawomir Plamowski 1 Wroclaw University of Technology, Wroclaw, Poland Faculty
Efficient Data Replication Scheme based on Hadoop Distributed File System
, pp. 177-186 http://dx.doi.org/10.14257/ijseia.2015.9.12.16 Efficient Data Replication Scheme based on Hadoop Distributed File System Jungha Lee 1, Jaehwa Chung 2 and Daewon Lee 3* 1 Division of Supercomputing,
dm106 TEXT MINING FOR CUSTOMER RELATIONSHIP MANAGEMENT: AN APPROACH BASED ON LATENT SEMANTIC ANALYSIS AND FUZZY CLUSTERING
dm106 TEXT MINING FOR CUSTOMER RELATIONSHIP MANAGEMENT: AN APPROACH BASED ON LATENT SEMANTIC ANALYSIS AND FUZZY CLUSTERING ABSTRACT In most CRM (Customer Relationship Management) systems, information on
Reduction of Data at Namenode in HDFS using harballing Technique
Reduction of Data at Namenode in HDFS using harballing Technique Vaibhav Gopal Korat, Kumar Swamy Pamu [email protected] [email protected] Abstract HDFS stands for the Hadoop Distributed File System.
Hadoop MapReduce and Spark. Giorgio Pedrazzi, CINECA-SCAI School of Data Analytics and Visualisation Milan, 10/06/2015
Hadoop MapReduce and Spark Giorgio Pedrazzi, CINECA-SCAI School of Data Analytics and Visualisation Milan, 10/06/2015 Outline Hadoop Hadoop Import data on Hadoop Spark Spark features Scala MLlib MLlib
Analysing Large Web Log Files in a Hadoop Distributed Cluster Environment
Analysing Large Files in a Hadoop Distributed Cluster Environment S Saravanan, B Uma Maheswari Department of Computer Science and Engineering, Amrita School of Engineering, Amrita Vishwa Vidyapeetham,
Parallel Data Mining and Assurance Service Model Using Hadoop in Cloud
Parallel Data Mining and Assurance Service Model Using Hadoop in Cloud Aditya Jadhav, Mahesh Kukreja E-mail: [email protected] & [email protected] Abstract : In the information industry,
Cloud Computing based on the Hadoop Platform
Cloud Computing based on the Hadoop Platform Harshita Pandey 1 UG, Department of Information Technology RKGITW, Ghaziabad ABSTRACT In the recent years,cloud computing has come forth as the new IT paradigm.
Clustering Very Large Data Sets with Principal Direction Divisive Partitioning
Clustering Very Large Data Sets with Principal Direction Divisive Partitioning David Littau 1 and Daniel Boley 2 1 University of Minnesota, Minneapolis MN 55455 [email protected] 2 University of Minnesota,
Mining Interesting Medical Knowledge from Big Data
IOSR Journal of Computer Engineering (IOSR-JCE) e-issn: 2278-0661,p-ISSN: 2278-8727, Volume 18, Issue 1, Ver. II (Jan Feb. 2016), PP 06-10 www.iosrjournals.org Mining Interesting Medical Knowledge from
Generic Log Analyzer Using Hadoop Mapreduce Framework
Generic Log Analyzer Using Hadoop Mapreduce Framework Milind Bhandare 1, Prof. Kuntal Barua 2, Vikas Nagare 3, Dynaneshwar Ekhande 4, Rahul Pawar 5 1 M.Tech(Appeare), 2 Asst. Prof., LNCT, Indore 3 ME,
Index Terms : Load rebalance, distributed file systems, clouds, movement cost, load imbalance, chunk.
Load Rebalancing for Distributed File Systems in Clouds. Smita Salunkhe, S. S. Sannakki Department of Computer Science and Engineering KLS Gogte Institute of Technology, Belgaum, Karnataka, India Affiliated
Comparision of k-means and k-medoids Clustering Algorithms for Big Data Using MapReduce Techniques
Comparision of k-means and k-medoids Clustering Algorithms for Big Data Using MapReduce Techniques Subhashree K 1, Prakash P S 2 1 Student, Kongu Engineering College, Perundurai, Erode 2 Assistant Professor,
Introduction to Hadoop
1 What is Hadoop? Introduction to Hadoop We are living in an era where large volumes of data are available and the problem is to extract meaning from the data avalanche. The goal of the software tools
The Performance Characteristics of MapReduce Applications on Scalable Clusters
The Performance Characteristics of MapReduce Applications on Scalable Clusters Kenneth Wottrich Denison University Granville, OH 43023 [email protected] ABSTRACT Many cluster owners and operators have
Design of Electric Energy Acquisition System on Hadoop
, pp.47-54 http://dx.doi.org/10.14257/ijgdc.2015.8.5.04 Design of Electric Energy Acquisition System on Hadoop Yi Wu 1 and Jianjun Zhou 2 1 School of Information Science and Technology, Heilongjiang University
Introduction to Hadoop
Introduction to Hadoop 1 What is Hadoop? the big data revolution extracting value from data cloud computing 2 Understanding MapReduce the word count problem more examples MCS 572 Lecture 24 Introduction
Big Data and Scripting map/reduce in Hadoop
Big Data and Scripting map/reduce in Hadoop 1, 2, parts of a Hadoop map/reduce implementation core framework provides customization via indivudual map and reduce functions e.g. implementation in mongodb
Lecture 2 (08/31, 09/02, 09/09): Hadoop. Decisions, Operations & Information Technologies Robert H. Smith School of Business Fall, 2015
Lecture 2 (08/31, 09/02, 09/09): Hadoop Decisions, Operations & Information Technologies Robert H. Smith School of Business Fall, 2015 K. Zhang BUDT 758 What we ll cover Overview Architecture o Hadoop
INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY
INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY A PATH FOR HORIZING YOUR INNOVATIVE WORK A COMPREHENSIVE VIEW OF HADOOP ER. AMRINDER KAUR Assistant Professor, Department
SEARCH ENGINE WITH PARALLEL PROCESSING AND INCREMENTAL K-MEANS FOR FAST SEARCH AND RETRIEVAL
SEARCH ENGINE WITH PARALLEL PROCESSING AND INCREMENTAL K-MEANS FOR FAST SEARCH AND RETRIEVAL Krishna Kiran Kattamuri 1 and Rupa Chiramdasu 2 Department of Computer Science Engineering, VVIT, Guntur, India
Lecture 5: GFS & HDFS! Claudia Hauff (Web Information Systems)! [email protected]
Big Data Processing, 2014/15 Lecture 5: GFS & HDFS!! Claudia Hauff (Web Information Systems)! [email protected] 1 Course content Introduction Data streams 1 & 2 The MapReduce paradigm Looking behind
Classification of Documents using Text Mining Package tm
Classification of Documents using Text Mining Package tm Pavel Brazdil LIAAD - INESC Porto LA FEP, Univ. of Porto http://www.liaad.up.pt Escola de verão Aspectos de processamento da LN F. Letras, UP, 4th
Comparative analysis of mapreduce job by keeping data constant and varying cluster size technique
Comparative analysis of mapreduce job by keeping data constant and varying cluster size technique Mahesh Maurya a, Sunita Mahajan b * a Research Scholar, JJT University, MPSTME, Mumbai, India,[email protected]
HDFS Architecture Guide
by Dhruba Borthakur Table of contents 1 Introduction... 3 2 Assumptions and Goals... 3 2.1 Hardware Failure... 3 2.2 Streaming Data Access...3 2.3 Large Data Sets... 3 2.4 Simple Coherency Model...3 2.5
Processing of Hadoop using Highly Available NameNode
Processing of Hadoop using Highly Available NameNode 1 Akash Deshpande, 2 Shrikant Badwaik, 3 Sailee Nalawade, 4 Anjali Bote, 5 Prof. S. P. Kosbatwar Department of computer Engineering Smt. Kashibai Navale
Introduction to Hadoop. New York Oracle User Group Vikas Sawhney
Introduction to Hadoop New York Oracle User Group Vikas Sawhney GENERAL AGENDA Driving Factors behind BIG-DATA NOSQL Database 2014 Database Landscape Hadoop Architecture Map/Reduce Hadoop Eco-system Hadoop
OPINION MINING IN PRODUCT REVIEW SYSTEM USING BIG DATA TECHNOLOGY HADOOP
OPINION MINING IN PRODUCT REVIEW SYSTEM USING BIG DATA TECHNOLOGY HADOOP 1 KALYANKUMAR B WADDAR, 2 K SRINIVASA 1 P G Student, S.I.T Tumkur, 2 Assistant Professor S.I.T Tumkur Abstract- Product Review System
The Data Mining Process
Sequence for Determining Necessary Data. Wrong: Catalog everything you have, and decide what data is important. Right: Work backward from the solution, define the problem explicitly, and map out the data
Finding Insights & Hadoop Cluster Performance Analysis over Census Dataset Using Big-Data Analytics
Finding Insights & Hadoop Cluster Performance Analysis over Census Dataset Using Big-Data Analytics Dharmendra Agawane 1, Rohit Pawar 2, Pavankumar Purohit 3, Gangadhar Agre 4 Guide: Prof. P B Jawade 2
Fast Analytics on Big Data with H20
Fast Analytics on Big Data with H20 0xdata.com, h2o.ai Tomas Nykodym, Petr Maj Team About H2O and 0xdata H2O is a platform for distributed in memory predictive analytics and machine learning Pure Java,
Distributed File Systems
Distributed File Systems Paul Krzyzanowski Rutgers University October 28, 2012 1 Introduction The classic network file systems we examined, NFS, CIFS, AFS, Coda, were designed as client-server applications.
FUZZY CLUSTERING ANALYSIS OF DATA MINING: APPLICATION TO AN ACCIDENT MINING SYSTEM
International Journal of Innovative Computing, Information and Control ICIC International c 0 ISSN 34-48 Volume 8, Number 8, August 0 pp. 4 FUZZY CLUSTERING ANALYSIS OF DATA MINING: APPLICATION TO AN ACCIDENT
CS2510 Computer Operating Systems
CS2510 Computer Operating Systems HADOOP Distributed File System Dr. Taieb Znati Computer Science Department University of Pittsburgh Outline HDF Design Issues HDFS Application Profile Block Abstraction
CS2510 Computer Operating Systems
CS2510 Computer Operating Systems HADOOP Distributed File System Dr. Taieb Znati Computer Science Department University of Pittsburgh Outline HDF Design Issues HDFS Application Profile Block Abstraction
A Performance Analysis of Distributed Indexing using Terrier
A Performance Analysis of Distributed Indexing using Terrier Amaury Couste Jakub Kozłowski William Martin Indexing Indexing Used by search
!"#$%&' ( )%#*'+,'-#.//"0( !"#$"%&'()*$+()',!-+.'/', 4(5,67,!-+!"89,:*$;'0+$.<.,&0$'09,&)"/=+,!()<>'0, 3, Processing LARGE data sets
!"#$%&' ( Processing LARGE data sets )%#*'+,'-#.//"0( Framework for o! reliable o! scalable o! distributed computation of large data sets 4(5,67,!-+!"89,:*$;'0+$.
International Journal of Advancements in Research & Technology, Volume 3, Issue 2, February-2014 10 ISSN 2278-7763
International Journal of Advancements in Research & Technology, Volume 3, Issue 2, February-2014 10 A Discussion on Testing Hadoop Applications Sevuga Perumal Chidambaram ABSTRACT The purpose of analysing
Apache Hadoop. Alexandru Costan
1 Apache Hadoop Alexandru Costan Big Data Landscape No one-size-fits-all solution: SQL, NoSQL, MapReduce, No standard, except Hadoop 2 Outline What is Hadoop? Who uses it? Architecture HDFS MapReduce Open
Hadoop Distributed File System. T-111.5550 Seminar On Multimedia 2009-11-11 Eero Kurkela
Hadoop Distributed File System T-111.5550 Seminar On Multimedia 2009-11-11 Eero Kurkela Agenda Introduction Flesh and bones of HDFS Architecture Accessing data Data replication strategy Fault tolerance
Overview. Introduction. Recommender Systems & Slope One Recommender. Distributed Slope One on Mahout and Hadoop. Experimental Setup and Analyses
Slope One Recommender on Hadoop YONG ZHENG Center for Web Intelligence DePaul University Nov 15, 2012 Overview Introduction Recommender Systems & Slope One Recommender Distributed Slope One on Mahout and
Problem Solving Hands-on Labware for Teaching Big Data Cybersecurity Analysis
, 22-24 October, 2014, San Francisco, USA Problem Solving Hands-on Labware for Teaching Big Data Cybersecurity Analysis Teng Zhao, Kai Qian, Dan Lo, Minzhe Guo, Prabir Bhattacharya, Wei Chen, and Ying
Mammoth Scale Machine Learning!
Mammoth Scale Machine Learning! Speaker: Robin Anil, Apache Mahout PMC Member! OSCON"10! Portland, OR! July 2010! Quick Show of Hands!# Are you fascinated about ML?!# Have you used ML?!# Do you have Gigabytes
