Document Similarity Measurement Using Ferret Algorithm and Map Reduce Programming Model
Condro Wibawa, Irwan Bastian, Metty Mustikasari
Department of Information Systems, Faculty of Computer Science and Technology, Gunadarma University, Indonesia

Abstract
Plagiarism of digital documents happens frequently and should be prevented and reduced. To measure the degree of similarity between documents, the Ferret algorithm is used on the MapReduce programming model and the Hadoop framework. The application is tested on ten pairs of documents in English and ten pairs of documents in Indonesian to obtain the degree of similarity for each pair. The experiment also tests the accuracy of the system with several kinds of n-grams: 1-gram, 2-gram, 3-gram, 4-gram, 5-gram, and 6-gram. In addition, the experiment is conducted both on a stand-alone computer and on a Hadoop multi-node cluster; to compare the running times of the two systems, nine pairs of documents of various sizes are used.

Keywords: Similarity, Ferret Algorithm, MapReduce, Hadoop, Jaccard.

I. INTRODUCTION

The rapid growth of information technology has both positive and negative impacts. One of the negative impacts is plagiarism of digital documents, since plagiarism can be done easily. There are many definitions of plagiarism. According to plagiarism.org, plagiarism includes any of the following activities: turning in someone else's work as your own; copying words or ideas of others without giving credit; failing to put a quotation in quotation marks; giving incorrect information about the source of a quotation; changing words but copying the sentence structure of a source without giving credit; and copying so many words or ideas from a source that they make up the majority of your work, whether you give credit or not [5]. Plagiarism can occur in text, music, images, maps, technical drawings, paintings, etc. Although the academic sector is the most affected, other sectors such as journalism and information retrieval also suffer seriously; for example, the Internet is full of redundant information, and duplicate or near-duplicate documents decrease the effectiveness of search engines [6].

To support rapid, large-scale data processing, the MapReduce framework was introduced by Google. MapReduce, a scalable data-processing framework, and Hadoop, its open-source realization, offer an effective programming model for managing large document collections. This research proposes document similarity detection using the Ferret algorithm on MapReduce and the Hadoop framework. It also uses the Hadoop Distributed File System, through which MapReduce manages large amounts of data by splitting datasets across multiple servers, processing each part in parallel, and combining the partial results into the final answer. With the Ferret model as the document similarity index, it can be determined whether a document is a product of plagiarism or not. The program is run both on a stand-alone computer and on a Hadoop multi-node cluster to determine which is better in terms of effectiveness and efficiency. The algorithm is complemented with an n-gram model and the Jaccard index. The programming language used in this research is Python with the mrjob library.

II. METHOD

A. MapReduce Framework

Jairam Chandar [2] defines MapReduce as a programming model used to process big data.
According to Anik Momtaz [3], data processing in MapReduce is performed in a distributed and parallel fashion on a cluster. A cluster can consist of one master computer and several slave computers (workers), or of a single computer that acts as both master and slave. The number of slave computers is unlimited, and their hardware specifications may differ. In general, the MapReduce process is divided into two phases: map processing and reduce processing. Both phases are distributed to all slave computers connected to the cluster and run in parallel. Map processing divides the problem into sub-problems and distributes them to the slave computers; the results from the slave computers are then collected by the master computer, which is called reduce processing. The result of reduce processing is delivered to the user as the final output [4].

Fig. 1 Scheme of MapReduce processing [3]
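To make the two phases concrete, the following is a minimal sketch of the classic word-count job (cf. Fig. 2 below), written with the mrjob library used in this research; the class name and the job itself are illustrative, not the authors' code:

    from mrjob.job import MRJob

    class MRWordCount(MRJob):
        # Map phase: read one line of text, emit a (word, 1) pair per word.
        def mapper(self, _, line):
            for word in line.split():
                yield word, 1

        # Reduce phase: values sharing the same key (word) arrive grouped
        # together; summing them gives the word's total frequency.
        def reducer(self, word, counts):
            yield word, sum(counts)

    if __name__ == '__main__':
        MRWordCount.run()

Saved as, say, wordcount.py, the job runs on a stand-alone computer with "python wordcount.py input.txt" and on a cluster by appending -r hadoop, which is exactly the difference described later in Section III.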
The map function reads its input as key/value pairs and produces key/value pairs as output:

map: (key1, value1) → [(key2, value2)]

The reduce function then reads the output of the map function, grouped by key; values that share the same key are combined into one group. The reduce function also produces key/value pairs as output:

reduce: (key2, [value2]) → [(key3, value3)]

Figure 2 depicts the MapReduce scheme for counting the number of words in a document.

Fig. 2 Word count, an example of the MapReduce process

Hadoop is a Java-based software framework used to process big data with a clustering method [7]. Hadoop consists of four main parts: Hadoop Common, the Hadoop Distributed File System, Hadoop YARN, and Hadoop MapReduce. MapReduce programs written in Java can be run directly on the Hadoop system; programs written in other languages must first pass through Hadoop Streaming, a translator that turns them into MapReduce functions that Hadoop can read [7].

Fig. 3 Hadoop processing scheme

B. Ferret Model

Ferret is very effective at finding plagiarised or copied passages in English and other languages, but it is not useful for finding semantically similar texts that were written independently. It has been applied, for example, to copy detection in Chinese as well as English documents. The method is based on successive token processing; in English the tokens are words, whereas in Chinese they are a character or a string of characters [1, 3].

The Ferret algorithm divides a document into trigrams. The set of trigrams is then compared with the trigram set of the other document, and the trigram overlap of the two documents is used to calculate a similarity index, the Jaccard index. The Jaccard index, also known as the Jaccard similarity coefficient or Jaccard coefficient, is a statistic created by Paul Jaccard and used to compare the degree of similarity and dissimilarity of sample sets. It is denoted J(A,B) and defined as the size of the intersection divided by the size of the union of the samples [3]:

J(A,B) = |A ∩ B| / |A ∪ B| = i / (a + b + i)

Fig. 4 Sets A and B

where i is the number of objects that occur in both A and B, a the number of objects in A but not in B, and b the number of objects in B but not in A [1, 3].
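The Ferret/Jaccard computation above can be summarised in a few lines. This is a set-based sketch with illustrative function names; the authors' MapReduce implementation, described in Section III, works on n-gram counts instead:

    def ngrams(text, n=3):
        # Ferret uses word trigrams (n = 3); other n-gram sizes are
        # compared experimentally in Section III.
        words = text.lower().split()
        return set(tuple(words[i:i + n]) for i in range(len(words) - n + 1))

    def jaccard(doc_a, doc_b, n=3):
        # J(A,B) = |A intersect B| / |A union B|
        a, b = ngrams(doc_a, n), ngrams(doc_b, n)
        union = a | b
        return float(len(a & b)) / len(union) if union else 0.0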
III. EXPERIMENTAL RESULT

The Hadoop-based MapReduce programming model requires more than one computer; these can be dedicated or virtual machines. In these experiments, each computer meets a minimum specification of 384 MB of memory, a network interface card, 5 GB of free hard-disk space, and the Ubuntu Server operating system. The required software is the Java Development Kit, Hadoop 1.0.3, Python 2.7.7, and mrjob 0.4 (a Python library).

A. Global Program Design

In general, the system flowchart depicts executing the system and selecting the two documents to be compared, after which the map and reduce processes run. If the system is executed on a stand-alone computer, the documents are processed only on that computer; if it is executed on the Hadoop system, processing continues in the Hadoop cluster. The result is returned to the user as a percentage that indicates the degree of document similarity.

Fig. 5 Global program design scheme

B. Hadoop System Design

The Hadoop system consists of several computers that communicate with each other, so the first step is to set up the IP addresses and SSH; these settings are shown in TABLE I.

TABLE I IP ADDRESS SETTINGS OF MASTER NODE AND SLAVE NODE

The next step is the installation of Hadoop and the configuration of several Hadoop files: hadoop-env.sh, core-site.xml, mapred-site.xml, hdfs-site.xml, conf/masters, and conf/slaves. The installation and configuration steps are presented in TABLE II, and a configuration sketch follows below.

Fig. 6 Flowchart of Hadoop system configuration
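As a sketch of this configuration step, assuming a Hadoop 1.x cluster whose master node resolves to the hostname "master": the property names below (fs.default.name, mapred.job.tracker, dfs.replication) are the standard Hadoop 1.0.3 ones, while the hostname, ports, and replication factor are illustrative; conf/masters and conf/slaves would simply list the master and slave hostnames, one per line.

    import os

    # Illustrative values only; fs.default.name and mapred.job.tracker
    # must point at the actual master node of the cluster.
    SITE_FILES = {
        'core-site.xml':   {'fs.default.name': 'hdfs://master:9000'},
        'mapred-site.xml': {'mapred.job.tracker': 'master:9001'},
        'hdfs-site.xml':   {'dfs.replication': '2'},
    }

    def write_site_files(conf_dir):
        # Write a minimal <configuration> block into each site file.
        if not os.path.isdir(conf_dir):
            os.makedirs(conf_dir)
        for fname, props in SITE_FILES.items():
            entries = ''.join(
                '  <property><name>%s</name><value>%s</value></property>\n'
                % (k, v) for k, v in props.items())
            with open(os.path.join(conf_dir, fname), 'w') as f:
                f.write('<configuration>\n%s</configuration>\n' % entries)

    write_site_files('conf')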
C. Programming Code Design

The programming code consists of three MapReduce modules (sub-programs) and one main program. The sub-programs are described in TABLE II.

TABLE II MAPREDUCE SUB-PROGRAMS

The main program is responsible for the whole process. The first step is to obtain the two documents to be compared and check that they exist, then store both documents in a temporary folder. After that, the main program executes code1 (input1.py), code2 (input2.py), and code3 (proses.py) as the final step, and the result is displayed. Technically, there is only a small difference between the program for a stand-alone computer and the one for the Hadoop system: on the Hadoop system, the statement that executes the MapReduce function must be extended with the arguments

-r hadoop

The process flow of each program can be seen in the flowcharts and code samples below.

Fig. 7 Flowchart of the main program
Fig. 8 Sample program code on a stand-alone computer
Fig. 9 Sample program code on the Hadoop system
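The following is a hypothetical sketch of the main program flow just described; run_similarity and the temporary-folder handling are assumptions, the sub-program names come from TABLE II, and the plumbing of intermediate outputs into proses.py is omitted:

    import os
    import shutil
    import subprocess
    import sys
    import tempfile

    def run_similarity(doc1, doc2, on_hadoop=False):
        # Step 1: check that both documents exist.
        for doc in (doc1, doc2):
            if not os.path.isfile(doc):
                sys.exit('document not found: %s' % doc)
        # Step 2: store both documents in a temporary folder.
        tmp = tempfile.mkdtemp()
        a = os.path.join(tmp, os.path.basename(doc1))
        b = os.path.join(tmp, os.path.basename(doc2))
        shutil.copy(doc1, a)
        shutil.copy(doc2, b)
        # Step 3: run the sub-programs; on the Hadoop system the extra
        # arguments -r hadoop select the mrjob Hadoop runner.
        runner = ['-r', 'hadoop'] if on_hadoop else []
        subprocess.check_call(['python', 'input1.py'] + runner + [a])
        subprocess.check_call(['python', 'input2.py'] + runner + [b])
        # Step 4: code3 joins the tagged n-grams of both documents and
        # prints the similarity index, displayed as a percentage.
        subprocess.check_call(['python', 'proses.py'] + runner + [a, b])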
Fig. 10 Flowchart of code1 (input1.py)

Code1 is the sub-program that processes the first document. It has two main steps. The first is parsing: the text is read word by word and filtered against a stopword list. The second is to build an n-gram at each word position and assign it the value 1, so the output takes the form: 1 0 n-gram. The value 1 in the first position marks the n-gram as coming from the first document and is later used to count occurrences of that n-gram. Code1 and code2 have the same process flow; the only difference is the output. In code1 the value 1 is in the first position, whereas in code2 it is in the second position, so the output takes the form: 0 1 n-gram.

Fig. 11 Flowchart of code3 (proses.py)

The first step in code3 is to group the combined output of code1 and code2 by n-gram. Each record is then written as: sum1 sum2 x n-gram, where x indicates whether the n-gram exists in both documents. If the n-gram exists in both documents, x takes the value of sum1 or sum2, whichever is smaller; otherwise x is 0. With this information, the Jaccard computation can be done.
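A plain-Python sketch of this grouping step follows; in the actual system it is a MapReduce job, the record layout mirrors the "1 0 n-gram" / "0 1 n-gram" format above, and treating the union as the sum of per-n-gram maxima is an assumption about the final Jaccard ratio:

    from collections import defaultdict

    def combine(tagged_records):
        # tagged_records: lines such as "1 0 the quick fox" from code1
        # and "0 1 the quick fox" from code2; group them by n-gram.
        totals = defaultdict(lambda: [0, 0])
        for record in tagged_records:
            m1, m2, ngram = record.split(' ', 2)
            totals[ngram][0] += int(m1)
            totals[ngram][1] += int(m2)
        # x = min(sum1, sum2) when the n-gram occurs in both documents, else 0.
        return [(s1, s2, min(s1, s2) if s1 and s2 else 0, ng)
                for ng, (s1, s2) in totals.items()]

    def jaccard_index(rows):
        shared = sum(x for _, _, x, _ in rows)             # |A intersect B|
        union = sum(max(s1, s2) for s1, s2, _, _ in rows)  # |A union B|
        return float(shared) / union if union else 0.0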
To execute the program, type: python [main program name].

Fig. 12 Running the program
Fig. 13 Program output

D. Result Evaluation

The experiment also tested the accuracy of the system with several kinds of n-grams: 1-gram, 2-gram, 3-gram, 4-gram, 5-gram, and 6-gram, using 10 pairs of documents (articles) in English and 10 pairs in Indonesian. The results are shown below.

TABLE III N-GRAM COMPARISON

TABLE IV TIME TAKEN COMPARISON

From the experimental results, the document pairs fall into two groups. In the first group the similarity index tends to increase and mostly reaches its highest value in the 3-gram column, which indicates that 3-grams give the best performance. In the second group the similarity index tends to decrease: as can be seen in TABLE III, for this group the similarity index is mostly smallest in the 6-gram column and largest in the 1-gram column, which indicates that those document pairs are neither similar nor identical.

The experiment compared not only the accuracy of the similarity measurement but also the time taken for documents of various sizes, in order to evaluate which is faster: the system running on a stand-alone computer or on the Hadoop system. TABLE IV shows the time needed to run the system on documents of various sizes. As can be seen in TABLE IV and Fig. 14, for files of around 1 MB the stand-alone computer is much faster than the Hadoop system, whereas for files of around 200 MB the Hadoop system is faster. This indicates that on big files the Hadoop system is more stable and even faster than a stand-alone computer (the exact times depend on the hardware specification).

Fig. 14 Graph of the time taken comparison

IV. CONCLUSION

This research proposed an application that reports the degree of similarity between two documents using the Ferret algorithm on MapReduce and the Hadoop framework. Based on the experimental results, using 3-grams is the most accurate. The system running on a stand-alone computer is faster than the system running on Hadoop for documents smaller than 128 MB; for documents of 200 MB or more, the Hadoop system is faster. The Hadoop system can only be run on the Linux operating system, whereas the stand-alone system can also be run on Windows. In future work, the proposed system should be evaluated on larger document collections and with various similarity measures.
REFERENCES

[1] J. P. Bao, C. M. Lyon, P. C. R. Lane, W. Ji, and J. A. Malcolm, "Copy detection in Chinese documents using the Ferret: A report on experiments," Technical Report 456, Science and Technology Research Institute, University of Hertfordshire.
[2] J. Chandar, "Join Algorithms using MapReduce," University of Edinburgh, Edinburgh.
[3] A. Momtaz, "Detecting Document Similarity in Large Document Collection using MapReduce and the Hadoop Framework," Brac University, Dhaka.
[4] G. N. Nayakam, "Study of Similarity Coefficients using MapReduce," North Dakota State University, North Dakota.
[5] (2014) "What is Plagiarism?" [Online].
[6] H. Maurer, F. Kappe, and B. Zaka, "Plagiarism - A Survey," Journal of Universal Computer Science, vol. 12, no. 8, 2006.
[7] (2014) "What Is Apache Hadoop?" [Online].
[8] (2013) "Concepts," [Online].