# Scalable Sentiment Classification for Big Data Analysis Using Naïve Bayes Classifier

Save this PDF as:

Size: px
Start display at page:

Download "Scalable Sentiment Classification for Big Data Analysis Using Naïve Bayes Classifier"

## Transcription

4 Algorithm 1 Training Job Input: The training reviews. Output: Model. 1: function MAPPER(key, value) 2: s = parsesentiment(value) 3: for all w d do 4: emit(w, 1 : s) 5: end for 6: end function 7: function REDUCER(key, values) 8: possum = 0; negsum = 0; 9: for all value values do 10: [s, count] = parse(value) 11: if s == positive then 12: possum+ = count; 13: else 14: negsum+ = count; 15: end if 16: end for 17: emit(key, possum, negsum) 18: end function Algorithm 2 Combining Job Input: The test reviews and the model. Output: Intermediate results for classify job input. 1: function MAPPER(key, value) 2: if value model then 3: [word, pos s um, neg s um] = parse(value) 4: emit(word, pos s um, neg s um) 5: else 6: [docid, sentiment] = parse(value) 7: for all word value do 8: emit(word,1,docid,sentiment) 9: end for 10: end if 11: end function 12: function REDUCER(key, values) 13: for all value values do 14: if value model then 15: outstr.append(value) 16: else 17: [count, docid] = parse(value) 18: docs[docid].add(count) 19: end if 20: end for 21: for all docid docs do 22: outstr.append(count,docid) 23: end for 24: emit(word, outstr) 25: end function B. Algorithms After the simplification of the problem, the task can be divided into three sequential jobs as follows. 1) Training job (Algorithm 1). All training reviews are fed into this job to produce a model for all unique words with their their frequency in positive and negative review documents respectively. 2) Combining job (Algorithm 2). In this job, the model and the test reviews are combined to a intermediate table with all necessary information for the final classification. 3) Classify job (Algorithm 3). This job classifies all reviews simultaneously and writes the classification results to HDFS. Algorithm 1-3 are the pseudo-codes for the three jobs. These jobs are executed in sequence because the dependencies between them, as shown in Fig. 2. The training job produces a model that is used to compute the probability of each word in the two classes. The combining job then associates the test data with the model, excluding the words that appear in test data but not in the model. The intermediate table produced by the combining job is then fed to the classify job. By the end of classify job, all reviews are classified into positive or negative classes. The actual Hadoop code we develop is less than five hundred programming lines, including comments and package import statements. With such a simple program, the results are surprisingly good, considering the difficulties of mining sentiment using computer programs. Algorithm 3 Classify Job Input: The intermediate results. Output: Classification results and accuracy. 1: function MAPPER(key, value) 2: for all item value do 3: [docid, count, pos s um, neg s um] = parse(value) 4: emit(docid, count, pos s um, neg s um) 5: end for 6: end function 7: function REDUCER(key, values) 8: [docid, truesentiment]=parse(key) 9: for all value values do 10: calculate pos prob 11: calculate neg prob 12: end for 13: predict = (pos prob > neg prob)? pos : neg 14: if predict = truesentiment then 15: correct = true 16: correct count : else 18: correct = false 19: end if 20: emit(docid, predict, correct) 21: end function A. Cloud Infrastructure V. EXPERIMENTAL STUDY Virtual Hadoop cluster is a fast and easy way to test a Hadoop program in the Cloud, although the performance might be weaker compared to a physical Hadoop cluster. Our cloud infrastructure is built on a Dell server with 12 Intel Xeon E GHz cores and 32G memory. We use Xen Cloud Platform (XCP) 1.6 as the hypervisor. On top of XCP 1.6, we built a virtual Hadoop cluster of seven nodes. There is one special node, the manager, that we use to manage the Hadoop

6 details for converting a classifier to Hadoop program, of which any machine learning algorithm could be replaced for the NBC. Our results show that NB classifier can scale up easily, even without a database. Because of our simplified setup, the average accuracy stays below 82% in all cases. An intelligent filter might be helpful to increase the accuracy. We believe that our work is just a beginning of employing machine learning technologies in large-scale datasets. Future work will include using our framework for information fusion over imagery and text, distributed robotics applications, and cyber analysis using cloud computing. [17] P. Domingos and M. Pazzani, On the optimality of the simple bayesian classifier under zero-one loss, Machine learning, vol. 29, no. 2-3, pp , [18] H. Karloff, S. Suri, and S. Vassilvitskii, A model of computation for mapreduce, in Proceedings of the Twenty-First Annual ACM-SIAM Symposium on Discrete Algorithms. Society for Industrial and Applied Mathematics, 2010, pp REFERENCES [1] J. Dean and S. Ghemawat, Mapreduce: simplified data processing on large clusters, in Proceedings of the 6th conference on Symposium on Opearting Systems Design & Implementation - Volume 6, ser. OSDI 04. Berkeley, CA, USA: USENIX Association, [2] U. Kang, D. H. Chau, and C. Faloutsos, Pegasus: Mining billionscale graphs in the cloud, in Acoustics, Speech and Signal Processing (ICASSP), 2012 IEEE International Conference on, 2012, pp [3] S. Suri and S. Vassilvitskii, Counting triangles and the curse of the last reducer, in Proceedings of the 20th international conference on World wide web. ACM, 2011, pp [4] U. Kang, D. H. Chau, and C. Faloutsos, Mining large graphs: Algorithms, inference, and discoveries, in Engineering (ICDE), 2011 IEEE 27th International Conference on, 2011, pp [5] B. Pang, L. Lee, and S. Vaithyanathan, Thumbs up?: sentiment classification using machine learning techniques, in Proceedings of the ACL-02 conference on Empirical methods in natural language processing-volume 10, 2002, pp [6] B. Pang and L. Lee, A sentimental education: Sentiment analysis using subjectivity summarization based on minimum cuts, in Proceedings of the 42nd annual meeting on Association for Computational Linguistics, 2004, p [7] P. Chesley, B. Vincent, L. Xu, and R. K. Srihari, Using verbs and adjectives to automatically classify blog sentiment, Training, vol. 580, no. 263, p. 233, [8] M. Gamon, A. Aue, S. Corston-Oliver, and E. Ringger, Pulse: Mining customer opinions from free text, in Advances in Intelligent Analysis VI. Springer, 2005, pp [9] A. Kennedy and D. Inkpen, Sentiment classification of movie reviews using contextual valence shifters, Computational Intelligence, vol. 22, no. 2, pp , [10] M. Thomas, B. Pang, and L. Lee, Get out the vote: Determining support or opposition from congressional floor-debate transcripts, in Proceedings of the 2006 conference on empirical methods in natural language processing. Association for Computational Linguistics, 2006, pp [11] E. Blasch, A. Steinberg, S. Das, J. Llinas, C. Chong, O. Kessler, E. Waltz, and F. White, Revisiting the JDL model for information exploitation, in Int l Conf. on Info Fusion, [12] E. Blasch, Y. Chen, G. Chen, D. Shen, and R. Kohler, Information fusion in a cloud-enabled environment, High Performance Semantic Cloud Auditing, Springer Publishing, [13] B. Liu, Y. Chen, E. Blasch, K. Pham, D. Shen, and G. Chen, A holistic cloud-enabled robotics system for real-time video tracking application, in Intl Workshop on Enhanced Cloud Fusion, in conjunction with Future Tech, Korea, Sept. 2013, Sept. [14] B. Liu, E. Blasch, Y. Chen, A. J. Aved, A. Hadiks, D. Shen, and G. Chen, Information fusion in a cloud computing era: A systemslevel perspective, Submitted to IEEE AES Magazine, [15] J. Dean and S. Ghemawat, Mapreduce: simplified data processing on large clusters, Communications of the ACM, vol. 51, no. 1, pp , [16] D. Lewis, Naïve (bayes) at forty: The independence assumption in information retrieval, Machine Learning: ECML-98, pp. 4 15, 1998.

### Jeffrey D. Ullman slides. MapReduce for data intensive computing

Jeffrey D. Ullman slides MapReduce for data intensive computing Single-node architecture CPU Machine Learning, Statistics Memory Classical Data Mining Disk Commodity Clusters Web data sets can be very

### Large-Scale Data Sets Clustering Based on MapReduce and Hadoop

Journal of Computational Information Systems 7: 16 (2011) 5956-5963 Available at http://www.jofcis.com Large-Scale Data Sets Clustering Based on MapReduce and Hadoop Ping ZHOU, Jingsheng LEI, Wenjun YE

### Log Mining Based on Hadoop s Map and Reduce Technique

Log Mining Based on Hadoop s Map and Reduce Technique ABSTRACT: Anuja Pandit Department of Computer Science, anujapandit25@gmail.com Amruta Deshpande Department of Computer Science, amrutadeshpande1991@gmail.com

### CSE-E5430 Scalable Cloud Computing Lecture 2

CSE-E5430 Scalable Cloud Computing Lecture 2 Keijo Heljanko Department of Computer Science School of Science Aalto University keijo.heljanko@aalto.fi 14.9-2015 1/36 Google MapReduce A scalable batch processing

### Department of Computer Science University of Cyprus EPL646 Advanced Topics in Databases. Lecture 15

Department of Computer Science University of Cyprus EPL646 Advanced Topics in Databases Lecture 15 Big Data Management V (Big-data Analytics / Map-Reduce) Chapter 16 and 19: Abideboul et. Al. Demetris

### Hadoop implementation of MapReduce computational model. Ján Vaňo

Hadoop implementation of MapReduce computational model Ján Vaňo What is MapReduce? A computational model published in a paper by Google in 2004 Based on distributed computation Complements Google s distributed

### A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM

A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM Sneha D.Borkar 1, Prof.Chaitali S.Surtakar 2 Student of B.E., Information Technology, J.D.I.E.T, sborkar95@gmail.com Assistant Professor, Information

### INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY

INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY A PATH FOR HORIZING YOUR INNOVATIVE WORK A REVIEW ON HIGH PERFORMANCE DATA STORAGE ARCHITECTURE OF BIGDATA USING HDFS MS.

### Distributed Framework for Data Mining As a Service on Private Cloud

RESEARCH ARTICLE OPEN ACCESS Distributed Framework for Data Mining As a Service on Private Cloud Shraddha Masih *, Sanjay Tanwani** *Research Scholar & Associate Professor, School of Computer Science &

### BIG DATA What it is and how to use?

BIG DATA What it is and how to use? Lauri Ilison, PhD Data Scientist 21.11.2014 Big Data definition? There is no clear definition for BIG DATA BIG DATA is more of a concept than precise term 1 21.11.14

### A Study on Workload Imbalance Issues in Data Intensive Distributed Computing

A Study on Workload Imbalance Issues in Data Intensive Distributed Computing Sven Groot 1, Kazuo Goda 1, and Masaru Kitsuregawa 1 University of Tokyo, 4-6-1 Komaba, Meguro-ku, Tokyo 153-8505, Japan Abstract.

### Department of Computer Science University of Cyprus EPL646 Advanced Topics in Databases. Lecture 14

Department of Computer Science University of Cyprus EPL646 Advanced Topics in Databases Lecture 14 Big Data Management IV: Big-data Infrastructures (Background, IO, From NFS to HFDS) Chapter 14-15: Abideboul

### Large scale processing using Hadoop. Ján Vaňo

Large scale processing using Hadoop Ján Vaňo What is Hadoop? Software platform that lets one easily write and run applications that process vast amounts of data Includes: MapReduce offline computing engine

### Constructing a Data Lake: Hadoop and Oracle Database United!

Constructing a Data Lake: Hadoop and Oracle Database United! Sharon Sophia Stephen Big Data PreSales Consultant February 21, 2015 Safe Harbor The following is intended to outline our general product direction.

### Big Data and Apache Hadoop s MapReduce

Big Data and Apache Hadoop s MapReduce Michael Hahsler Computer Science and Engineering Southern Methodist University January 23, 2012 Michael Hahsler (SMU/CSE) Hadoop/MapReduce January 23, 2012 1 / 23

### Application Development. A Paradigm Shift

Application Development for the Cloud: A Paradigm Shift Ramesh Rangachar Intelsat t 2012 by Intelsat. t Published by The Aerospace Corporation with permission. New 2007 Template - 1 Motivation for the

### CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop)

CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop) Rezaul A. Chowdhury Department of Computer Science SUNY Stony Brook Spring 2016 MapReduce MapReduce is a programming model

### Outline. High Performance Computing (HPC) Big Data meets HPC. Case Studies: Some facts about Big Data Technologies HPC and Big Data converging

Outline High Performance Computing (HPC) Towards exascale computing: a brief history Challenges in the exascale era Big Data meets HPC Some facts about Big Data Technologies HPC and Big Data converging

Open source Google-style large scale data analysis with Hadoop Ioannis Konstantinou Email: ikons@cslab.ece.ntua.gr Web: http://www.cslab.ntua.gr/~ikons Computing Systems Laboratory School of Electrical

### Data Mining with Hadoop at TACC

Data Mining with Hadoop at TACC Weijia Xu Data Mining & Statistics Data Mining & Statistics Group Main activities Research and Development Developing new data mining and analysis solutions for practical

### Lecture 32 Big Data. 1. Big Data problem 2. Why the excitement about big data 3. What is MapReduce 4. What is Hadoop 5. Get started with Hadoop

Lecture 32 Big Data 1. Big Data problem 2. Why the excitement about big data 3. What is MapReduce 4. What is Hadoop 5. Get started with Hadoop 1 2 Big Data Problems Data explosion Data from users on social

### Comparision of k-means and k-medoids Clustering Algorithms for Big Data Using MapReduce Techniques

Comparision of k-means and k-medoids Clustering Algorithms for Big Data Using MapReduce Techniques Subhashree K 1, Prakash P S 2 1 Student, Kongu Engineering College, Perundurai, Erode 2 Assistant Professor,

### Analysing Large Web Log Files in a Hadoop Distributed Cluster Environment

Analysing Large Files in a Hadoop Distributed Cluster Environment S Saravanan, B Uma Maheswari Department of Computer Science and Engineering, Amrita School of Engineering, Amrita Vishwa Vidyapeetham,

MapReduce and Implementation Hadoop Parallel Data Processing Kai Shen A programming interface (two stage Map and Reduce) and system support such that: the interface is easy to program, and suitable for

### Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data

Introduction to Hadoop HDFS and Ecosystems ANSHUL MITTAL Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data Topics The goal of this presentation is to give

### International Journal of Innovative Research in Computer and Communication Engineering

FP Tree Algorithm and Approaches in Big Data T.Rathika 1, J.Senthil Murugan 2 Assistant Professor, Department of CSE, SRM University, Ramapuram Campus, Chennai, Tamil Nadu,India 1 Assistant Professor,

### A STUDY ON HADOOP ARCHITECTURE FOR BIG DATA ANALYTICS

A STUDY ON HADOOP ARCHITECTURE FOR BIG DATA ANALYTICS Dr. Ananthi Sheshasayee 1, J V N Lakshmi 2 1 Head Department of Computer Science & Research, Quaid-E-Millath Govt College for Women, Chennai, (India)

Task Scheduling in Hadoop Sagar Mamdapure Munira Ginwala Neha Papat SAE,Kondhwa SAE,Kondhwa SAE,Kondhwa Abstract Hadoop is widely used for storing large datasets and processing them efficiently under distributed

Hadoop http://hadoop.apache.org/ What Is Apache Hadoop? The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using

### Keywords Big Data; OODBMS; RDBMS; hadoop; EDM; learning analytics, data abundance.

Volume 4, Issue 11, November 2014 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Analytics

### http://www.wordle.net/

Hadoop & MapReduce http://www.wordle.net/ http://www.wordle.net/ Hadoop is an open-source software framework (or platform) for Reliable + Scalable + Distributed Storage/Computational unit Failures completely

### INTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE

INTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE AGENDA Introduction to Big Data Introduction to Hadoop HDFS file system Map/Reduce framework Hadoop utilities Summary BIG DATA FACTS In what timeframe

### Data Mining Yelp Data - Predicting rating stars from review text

Data Mining Yelp Data - Predicting rating stars from review text Rakesh Chada Stony Brook University rchada@cs.stonybrook.edu Chetan Naik Stony Brook University cnaik@cs.stonybrook.edu ABSTRACT The majority

### Big Data with Rough Set Using Map- Reduce

Big Data with Rough Set Using Map- Reduce Mr.G.Lenin 1, Mr. A. Raj Ganesh 2, Mr. S. Vanarasan 3 Assistant Professor, Department of CSE, Podhigai College of Engineering & Technology, Tirupattur, Tamilnadu,

### A Study on Data Analysis Process Management System in MapReduce using BPM

A Study on Data Analysis Process Management System in MapReduce using BPM Yoon-Sik Yoo 1, Jaehak Yu 1, Hyo-Chan Bang 1, Cheong Hee Park 1 Electronics and Telecommunications Research Institute, 138 Gajeongno,

### Application and practice of parallel cloud computing in ISP. Guangzhou Institute of China Telecom Zhilan Huang 2011-10

Application and practice of parallel cloud computing in ISP Guangzhou Institute of China Telecom Zhilan Huang 2011-10 Outline Mass data management problem Applications of parallel cloud computing in ISPs

### Hadoop Ecosystem B Y R A H I M A.

Hadoop Ecosystem B Y R A H I M A. History of Hadoop Hadoop was created by Doug Cutting, the creator of Apache Lucene, the widely used text search library. Hadoop has its origins in Apache Nutch, an open

### Chase Wu New Jersey Ins0tute of Technology

CS 698: Special Topics in Big Data Chapter 4. Big Data Analytics Platforms Chase Wu New Jersey Ins0tute of Technology Some of the slides have been provided through the courtesy of Dr. Ching-Yung Lin at

### R.K.Uskenbayeva 1, А.А. Kuandykov 2, Zh.B.Kalpeyeva 3, D.K.Kozhamzharova 4, N.K.Mukhazhanov 5

Distributed data processing in heterogeneous cloud environments R.K.Uskenbayeva 1, А.А. Kuandykov 2, Zh.B.Kalpeyeva 3, D.K.Kozhamzharova 4, N.K.Mukhazhanov 5 1 uskenbaevar@gmail.com, 2 abu.kuandykov@gmail.com,

### Research Article Hadoop-Based Distributed Sensor Node Management System

Distributed Networks, Article ID 61868, 7 pages http://dx.doi.org/1.1155/214/61868 Research Article Hadoop-Based Distributed Node Management System In-Yong Jung, Ki-Hyun Kim, Byong-John Han, and Chang-Sung

### E6893 Big Data Analytics Lecture 2: Big Data Analytics Platforms

E6893 Big Data Analytics Lecture 2: Big Data Analytics Platforms Ching-Yung Lin, Ph.D. Adjunct Professor, Dept. of Electrical Engineering and Computer Science Mgr., Dept. of Network Science and Big Data

### Introduction to Hadoop and MapReduce

Introduction to Hadoop and MapReduce THE CONTRACTOR IS ACTING UNDER A FRAMEWORK CONTRACT CONCLUDED WITH THE COMMISSION Large-scale Computation Traditional solutions for computing large quantities of data

### Associate Professor, Department of CSE, Shri Vishnu Engineering College for Women, Andhra Pradesh, India 2

Volume 6, Issue 3, March 2016 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Special Issue

### Snapshots in Hadoop Distributed File System

Snapshots in Hadoop Distributed File System Sameer Agarwal UC Berkeley Dhruba Borthakur Facebook Inc. Ion Stoica UC Berkeley Abstract The ability to take snapshots is an essential functionality of any

### Online Failure Prediction in Cloud Datacenters

Online Failure Prediction in Cloud Datacenters Yukihiro Watanabe Yasuhide Matsumoto Once failures occur in a cloud datacenter accommodating a large number of virtual resources, they tend to spread rapidly

### A Performance Analysis of Distributed Indexing using Terrier

A Performance Analysis of Distributed Indexing using Terrier Amaury Couste Jakub Kozłowski William Martin Indexing Indexing Used by search

### Journal of Chemical and Pharmaceutical Research, 2015, 7(3):1388-1392. Research Article. E-commerce recommendation system on cloud computing

Available online www.jocpr.com Journal of Chemical and Pharmaceutical Research, 2015, 7(3):1388-1392 Research Article ISSN : 0975-7384 CODEN(USA) : JCPRC5 E-commerce recommendation system on cloud computing

### SEMANTIC WEB BASED INFERENCE MODEL FOR LARGE SCALE ONTOLOGIES FROM BIG DATA

SEMANTIC WEB BASED INFERENCE MODEL FOR LARGE SCALE ONTOLOGIES FROM BIG DATA J.RAVI RAJESH PG Scholar Rajalakshmi engineering college Thandalam, Chennai. ravirajesh.j.2013.mecse@rajalakshmi.edu.in Mrs.

### Hadoop Technology for Flow Analysis of the Internet Traffic

Hadoop Technology for Flow Analysis of the Internet Traffic Rakshitha Kiran P PG Scholar, Dept. of C.S, Shree Devi Institute of Technology, Mangalore, Karnataka, India ABSTRACT: Flow analysis of the internet

### DATA MINING WITH HADOOP AND HIVE Introduction to Architecture

DATA MINING WITH HADOOP AND HIVE Introduction to Architecture Dr. Wlodek Zadrozny (Most slides come from Prof. Akella s class in 2014) 2015-2025. Reproduction or usage prohibited without permission of

Introduction to Hadoop 1 What is Hadoop? the big data revolution extracting value from data cloud computing 2 Understanding MapReduce the word count problem more examples MCS 572 Lecture 24 Introduction

### Firebird meets NoSQL (Apache HBase) Case Study

Firebird meets NoSQL (Apache HBase) Case Study Firebird Conference 2011 Luxembourg 25.11.2011 26.11.2011 Thomas Steinmaurer DI +43 7236 3343 896 thomas.steinmaurer@scch.at www.scch.at Michael Zwick DI

### Chapter 7. Using Hadoop Cluster and MapReduce

Chapter 7 Using Hadoop Cluster and MapReduce Modeling and Prototyping of RMS for QoS Oriented Grid Page 152 7. Using Hadoop Cluster and MapReduce for Big Data Problems The size of the databases used in

### Recognization of Satellite Images of Large Scale Data Based On Map- Reduce Framework

Recognization of Satellite Images of Large Scale Data Based On Map- Reduce Framework Vidya Dhondiba Jadhav, Harshada Jayant Nazirkar, Sneha Manik Idekar Dept. of Information Technology, JSPM s BSIOTR (W),

### Data-Intensive Computing with Map-Reduce and Hadoop

Data-Intensive Computing with Map-Reduce and Hadoop Shamil Humbetov Department of Computer Engineering Qafqaz University Baku, Azerbaijan humbetov@gmail.com Abstract Every day, we create 2.5 quintillion

### Enhancing Dataset Processing in Hadoop YARN Performance for Big Data Applications

Enhancing Dataset Processing in Hadoop YARN Performance for Big Data Applications Ahmed Abdulhakim Al-Absi, Dae-Ki Kang and Myong-Jong Kim Abstract In Hadoop MapReduce distributed file system, as the input

### Mobile Storage and Search Engine of Information Oriented to Food Cloud

Advance Journal of Food Science and Technology 5(10): 1331-1336, 2013 ISSN: 2042-4868; e-issn: 2042-4876 Maxwell Scientific Organization, 2013 Submitted: May 29, 2013 Accepted: July 04, 2013 Published:

### Redundant Data Removal Technique for Efficient Big Data Search Processing

Redundant Data Removal Technique for Efficient Big Data Search Processing Seungwoo Jeon 1, Bonghee Hong 1, Joonho Kwon 2, Yoon-sik Kwak 3 and Seok-il Song 3 1 Dept. of Computer Engineering, Pusan National

### Problem Solving Hands-on Labware for Teaching Big Data Cybersecurity Analysis

, 22-24 October, 2014, San Francisco, USA Problem Solving Hands-on Labware for Teaching Big Data Cybersecurity Analysis Teng Zhao, Kai Qian, Dan Lo, Minzhe Guo, Prabir Bhattacharya, Wei Chen, and Ying

### Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware

Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware Created by Doug Cutting and Mike Carafella in 2005. Cutting named the program after

### Role of Cloud Computing in Big Data Analytics Using MapReduce Component of Hadoop

Role of Cloud Computing in Big Data Analytics Using MapReduce Component of Hadoop Kanchan A. Khedikar Department of Computer Science & Engineering Walchand Institute of Technoloy, Solapur, Maharashtra,

### International Journal of Engineering Research ISSN: 2348-4039 & Management Technology November-2015 Volume 2, Issue-6

International Journal of Engineering Research ISSN: 2348-4039 & Management Technology Email: editor@ijermt.org November-2015 Volume 2, Issue-6 www.ijermt.org Modeling Big Data Characteristics for Discovering

### Large-Scale Test Mining

Large-Scale Test Mining SIAM Conference on Data Mining Text Mining 2010 Alan Ratner Northrop Grumman Information Systems NORTHROP GRUMMAN PRIVATE / PROPRIETARY LEVEL I Aim Identify topic and language/script/coding

### Comparison of Different Implementation of Inverted Indexes in Hadoop

Comparison of Different Implementation of Inverted Indexes in Hadoop Hediyeh Baban, S. Kami Makki, and Stefan Andrei Department of Computer Science Lamar University Beaumont, Texas (hbaban, kami.makki,

### Cyber Forensic for Hadoop based Cloud System

Cyber Forensic for Hadoop based Cloud System ChaeHo Cho 1, SungHo Chin 2 and * Kwang Sik Chung 3 1 Korea National Open University graduate school Dept. of Computer Science 2 LG Electronics CTO Division

### UPS battery remote monitoring system in cloud computing

, pp.11-15 http://dx.doi.org/10.14257/astl.2014.53.03 UPS battery remote monitoring system in cloud computing Shiwei Li, Haiying Wang, Qi Fan School of Automation, Harbin University of Science and Technology

### Research on IT Architecture of Heterogeneous Big Data

Journal of Applied Science and Engineering, Vol. 18, No. 2, pp. 135 142 (2015) DOI: 10.6180/jase.2015.18.2.05 Research on IT Architecture of Heterogeneous Big Data Yun Liu*, Qi Wang and Hai-Qiang Chen

### Benchmark Hadoop and Mars: MapReduce on cluster versus on GPU

Benchmark Hadoop and Mars: MapReduce on cluster versus on GPU Heshan Li, Shaopeng Wang The Johns Hopkins University 3400 N. Charles Street Baltimore, Maryland 21218 {heshanli, shaopeng}@cs.jhu.edu 1 Overview

### Hadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh

1 Hadoop: A Framework for Data- Intensive Distributed Computing CS561-Spring 2012 WPI, Mohamed Y. Eltabakh 2 What is Hadoop? Hadoop is a software framework for distributed processing of large datasets

### Volume 3, Issue 6, June 2015 International Journal of Advance Research in Computer Science and Management Studies

Volume 3, Issue 6, June 2015 International Journal of Advance Research in Computer Science and Management Studies Research Article / Survey Paper / Case Study Available online at: www.ijarcsms.com Image

### A Brief Introduction to Apache Tez

A Brief Introduction to Apache Tez Introduction It is a fact that data is basically the new currency of the modern business world. Companies that effectively maximize the value of their data (extract value

Policy-based Pre-Processing in Hadoop Yi Cheng, Christian Schaefer Ericsson Research Stockholm, Sweden yi.cheng@ericsson.com, christian.schaefer@ericsson.com Abstract While big data analytics provides

### Apriori-Map/Reduce Algorithm

Apriori-Map/Reduce Algorithm Jongwook Woo Computer Information Systems Department California State University Los Angeles, CA Abstract Map/Reduce algorithm has received highlights as cloud computing services

### Fault Tolerance in Hadoop for Work Migration

1 Fault Tolerance in Hadoop for Work Migration Shivaraman Janakiraman Indiana University Bloomington ABSTRACT Hadoop is a framework that runs applications on large clusters which are built on numerous

### CIS 4930/6930 Spring 2014 Introduction to Data Science /Data Intensive Computing. University of Florida, CISE Department Prof.

CIS 4930/6930 Spring 2014 Introduction to Data Science /Data Intensie Computing Uniersity of Florida, CISE Department Prof. Daisy Zhe Wang Map/Reduce: Simplified Data Processing on Large Clusters Parallel/Distributed

### MapReduce and Hadoop Distributed File System

MapReduce and Hadoop Distributed File System 1 B. RAMAMURTHY Contact: Dr. Bina Ramamurthy CSE Department University at Buffalo (SUNY) bina@buffalo.edu http://www.cse.buffalo.edu/faculty/bina Partially

### CLOUDDMSS: CLOUD-BASED DISTRIBUTED MULTIMEDIA STREAMING SERVICE SYSTEM FOR HETEROGENEOUS DEVICES

CLOUDDMSS: CLOUD-BASED DISTRIBUTED MULTIMEDIA STREAMING SERVICE SYSTEM FOR HETEROGENEOUS DEVICES 1 MYOUNGJIN KIM, 2 CUI YUN, 3 SEUNGHO HAN, 4 HANKU LEE 1,2,3,4 Department of Internet & Multimedia Engineering,

### Comparative analysis of mapreduce job by keeping data constant and varying cluster size technique

Comparative analysis of mapreduce job by keeping data constant and varying cluster size technique Mahesh Maurya a, Sunita Mahajan b * a Research Scholar, JJT University, MPSTME, Mumbai, India,maheshkmaurya@yahoo.co.in

### Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related

Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related Summary Xiangzhe Li Nowadays, there are more and more data everyday about everything. For instance, here are some of the astonishing

### NetFlow Analysis with MapReduce

NetFlow Analysis with MapReduce Wonchul Kang, Yeonhee Lee, Youngseok Lee Chungnam National University {teshi85, yhlee06, lee}@cnu.ac.kr 2010.04.24(Sat) based on "An Internet Traffic Analysis Method with

### Scalable Cloud Computing Solutions for Next Generation Sequencing Data

Scalable Cloud Computing Solutions for Next Generation Sequencing Data Matti Niemenmaa 1, Aleksi Kallio 2, André Schumacher 1, Petri Klemelä 2, Eija Korpelainen 2, and Keijo Heljanko 1 1 Department of

### What We Can Do in the Cloud (2) -Tutorial for Cloud Computing Course- Mikael Fernandus Simalango WISE Research Lab Ajou University, South Korea

What We Can Do in the Cloud (2) -Tutorial for Cloud Computing Course- Mikael Fernandus Simalango WISE Research Lab Ajou University, South Korea Overview Riding Google App Engine Taming Hadoop Summary Riding

1 What is Hadoop? Introduction to Hadoop We are living in an era where large volumes of data are available and the problem is to extract meaning from the data avalanche. The goal of the software tools

### Optimization and analysis of large scale data sorting algorithm based on Hadoop

Optimization and analysis of large scale sorting algorithm based on Hadoop Zhuo Wang, Longlong Tian, Dianjie Guo, Xiaoming Jiang Institute of Information Engineering, Chinese Academy of Sciences {wangzhuo,

### Research on Clustering Analysis of Big Data Yuan Yuanming 1, 2, a, Wu Chanle 1, 2

Advanced Engineering Forum Vols. 6-7 (2012) pp 82-87 Online: 2012-09-26 (2012) Trans Tech Publications, Switzerland doi:10.4028/www.scientific.net/aef.6-7.82 Research on Clustering Analysis of Big Data

### Final Project Proposal. CSCI.6500 Distributed Computing over the Internet

Final Project Proposal CSCI.6500 Distributed Computing over the Internet Qingling Wang 660795696 1. Purpose Implement an application layer on Hybrid Grid Cloud Infrastructure to automatically or at least

Hadoop and Map-reduce computing 1 Introduction This activity contains a great deal of background information and detailed instructions so that you can refer to it later for further activities and homework.

### Implementing Graph Pattern Mining for Big Data in the Cloud

Implementing Graph Pattern Mining for Big Data in the Cloud Chandana Ojah M.Tech in Computer Science & Engineering Department of Computer Science & Engineering, PES College of Engineering, Mandya Ojah.chandana@gmail.com

### Image Search by MapReduce

Image Search by MapReduce COEN 241 Cloud Computing Term Project Final Report Team #5 Submitted by: Lu Yu Zhe Xu Chengcheng Huang Submitted to: Prof. Ming Hwa Wang 09/01/2015 Preface Currently, there s

### Oracle s Big Data solutions. Roger Wullschleger.

s Big Data solutions Roger Wullschleger DBTA Workshop on Big Data, Cloud Data Management and NoSQL 10. October 2012, Stade de Suisse, Berne 1 The following is intended to outline

### Oracle Big Data SQL Technical Update

Oracle Big Data SQL Technical Update Jean-Pierre Dijcks Oracle Redwood City, CA, USA Keywords: Big Data, Hadoop, NoSQL Databases, Relational Databases, SQL, Security, Performance Introduction This technical

VCU-TSA at Semeval-2016 Task 4: Sentiment Analysis in Twitter Gerard Briones and Kasun Amarasinghe and Bridget T. McInnes, PhD. Department of Computer Science Virginia Commonwealth University Richmond,

### IMAV: An Intelligent Multi-Agent Model Based on Cloud Computing for Resource Virtualization

2011 International Conference on Information and Electronics Engineering IPCSIT vol.6 (2011) (2011) IACSIT Press, Singapore IMAV: An Intelligent Multi-Agent Model Based on Cloud Computing for Resource

Radoop: Analyzing Big Data with RapidMiner and Hadoop Zoltán Prekopcsák, Gábor Makrai, Tamás Henk, Csaba Gáspár-Papanek Budapest University of Technology and Economics, Hungary Abstract Working with large

### BIG DATA TRENDS AND TECHNOLOGIES

BIG DATA TRENDS AND TECHNOLOGIES THE WORLD OF DATA IS CHANGING Cloud WHAT IS BIG DATA? Big data are datasets that grow so large that they become awkward to work with using onhand database management tools.

### Analytics in the Cloud. Peter Sirota, GM Elastic MapReduce

Analytics in the Cloud Peter Sirota, GM Elastic MapReduce Data-Driven Decision Making Data is the new raw material for any business on par with capital, people, and labor. What is Big Data? Terabytes of

### Alexander Nikov. 5. Database Systems and Managing Data Resources. Learning Objectives. RR Donnelley Tries to Master Its Data

INFO 1500 Introduction to IT Fundamentals 5. Database Systems and Managing Data Resources Learning Objectives 1. Describe how the problems of managing data resources in a traditional file environment are