Novel Framework for Distributed Data Stream Mining in Big data Analytics Using Time Sensitive Sliding Window


JOURNAL OF COMPUTER SCIENCE AND SOFTWARE APPLICATION (In Press)

V. Valli Mayil*
Director/MCA, Vivekanandha Institute of Information and Management Studies, Tiruchengode
*Corresponding author:

Abstract: In the digital world, large volumes of data are continuously generated by sensors, social media sites, videos, purchase transaction records, and cell phone GPS signals. Such data are called big data. The term describes collections of data sets so large and complex, both structured and unstructured, that they are difficult to process with on-hand data management tools or traditional data processing applications. Because of this complexity and volume, static databases and traditional mining procedures are unsuitable for big data analytics, and predicting patterns in such a dynamic environment is a challenging task. This paper proposes a novel framework for mining frequent patterns in a real-time dynamic environment based on a time-sensitive sliding window. The framework performs distributed mining that predicts frequent patterns from continuous data streams over tilted time windows: a distributed file system stores the continuous stream, and the tilted time window model holds parts of it. A data distribution model assigns data windows to different commodity processing nodes, so that the frequent pattern mining procedure runs on separate nodes simultaneously. The proposed framework uses the power of Hadoop to mine frequent itemsets in a distributed environment.

Keywords: Data Stream; Tilted Time Window; Big Data; Frequent Pattern; Hadoop; MapReduce
1. INTRODUCTION

Nowadays, tremendous volumes of data streams are generated by real-time surveillance systems, communication networks, Internet traffic, online transactions, electric power grids, and so on. Different types of structured and unstructured data flow continuously at high speed with varying rates. Every big data source has different characteristics, including the frequency, volume, velocity, type, and veracity of the data. Choosing an architecture and building an appropriate big data solution is challenging because it must account for the volume, variety, and velocity of the data. Volume refers to the amount of data generated; variety refers to the mix of structured and unstructured types. Structured data can be stored in spreadsheets and databases, whereas unstructured data takes the form of messages, images, videos, PDFs, and audio files. This variety of unstructured data creates problems in storing, mining, and analyzing data. Big data is also characterized by another property called

velocity, which concerns the rate at which data flows from sources such as business processes, machines, networks, and human interaction with social media sites, mobile devices, and similar channels.

The uniqueness of big data can be described as follows:
1. Massive: volumes range from terabytes to yottabytes
2. Temporally ordered
3. Fast changing and continuous: data arrive continuously at a high rate
4. Expiration: data can be read only once
5. Potentially infinite: the total amount of data is unbounded
6. Mixed types: unstructured, structured, and semi-structured data

These factors lead to the following requirements in our proposed distributed framework.

Single scan: Big data volumes start at terabytes or even petabytes. This massive volume must be monitored continuously, at every fraction of time, by the sensing or processing system. Repeated or multiple scans are not possible at the increasing rate of big data, so our proposed system must operate with a single scan of the big data storage.

Streaming data model: This is an analytic computing platform focused on speed. Our proposed system requires it because big data applications demand that a continuous stream of unstructured data be processed: data are continuously analyzed and transformed in memory before being stored on disk. Stream processing is achieved by splitting the streams into time windows, which can then be processed across a cluster of servers.

Sliding window model: In this model, part of the big data stream is analyzed to produce an approximate answer; only the recent data within the sliding window are evaluated.

Distributed file system: Because the amount of data is unlimited while system resources such as memory space and CPU power are limited, a mining algorithm distributes the data into multiple splits and assigns the splits to multiple processing nodes for easier manipulation.
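The time-window splitting described above can be sketched in a few lines. The following is a minimal single-pass illustration, not the paper's implementation: the function name `split_into_time_windows` and the toy stream of (timestamp, transaction) pairs are assumptions made for the example.

```python
def split_into_time_windows(stream, slot_seconds):
    """Group a time-ordered stream of (timestamp, transaction) pairs
    into consecutive windows w1, w2, ... of slot_seconds each.
    The stream is read exactly once (single scan)."""
    windows = []
    current, slot_end = [], None
    for ts, txn in stream:
        if slot_end is None:
            slot_end = ts + slot_seconds
        while ts >= slot_end:          # close any finished slots
            windows.append(current)
            current, slot_end = [], slot_end + slot_seconds
        current.append(txn)
    if current:                        # close the last, partial slot
        windows.append(current)
    return windows

stream = [(0, ["a", "b"]), (1, ["b"]), (5, ["a", "c"]), (9, ["c"]), (11, ["a"])]
print(split_into_time_windows(stream, 5))
# → [[['a', 'b'], ['b']], [['a', 'c'], ['c']], [['a']]]
```

Each returned window corresponds to one time slot ti and can then be shipped off for independent processing, which is the point of the sliding window requirement above.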
Hadoop and MapReduce: Groups of streaming data can be assigned to clusters of commodity systems, and parallel processing can be adopted to mine patterns in each frame. MapReduce then shuffles and merges the partial results into the predicted patterns.

Parallel processing: The allotted slots of massive data are to be processed simultaneously.

Approximation: A method for providing approximate answers with accuracy guarantees is required.

Adjustability: Because the amount of data is unlimited, a mechanism that adapts to the available resources is needed.

Unlike a traditional database system, a big data system produces data streams continuously, from different sources, at high speed. The data are time-varying, multiple, unbounded, and rapidly generated in a dynamic environment. These characteristics create inherent challenges for retrieving, storing, and manipulating the data. First, because the data arrive at high speed, there is not enough time to rescan the whole database or perform multiple scans. Second, there is not enough space to store all the stream data for online processing, so the mining method must adapt to the changing data distribution. Third, the mining algorithm should be faster than the data arrival rate; otherwise an approximation algorithm must be engaged, which reduces the accuracy of the results. Fourth, because the data distribution changes, the analysis results need proper updating: the mining algorithm should be an incremental process to keep up with the high update rate.

2. RELATED WORK

2.1 Stream Data Processing Models

Given the characteristics and challenges of big data described in the previous section, it is impractical to scan an entire data stream more than once. Effective stream processing requires new data structures, algorithms, and techniques. In this section we discuss some common data structures and techniques for retrieving the most recent portion of a data stream for analysis.

Zhu and Shasha [1] discussed three data processing models: landmark, damped, and sliding window. The landmark model mines frequent itemsets over the entire history of the stream, from a specified time point called the landmark to the present. This model does not reflect the most recent updates in the stream, so it is not applicable to users, such as stock monitors, who need current, real-time information. The damped model (also called the time-fading model) mines frequent itemsets in the stream using weights assigned to transactions: older transactions carry lower weight than recent ones. It suits applications in which old transactions still influence the mining procedure, but less strongly than new ones.

Sliding window: The sliding window model mines patterns within a window of the data; only the data inside the window are stored and processed. In [2] the authors proposed an algorithm to mine the frequent itemsets of a data stream within the current sliding window. The window size may be chosen according to the application and the system resources. Imposing sliding windows on data streams is a natural method of approximation with several attractive properties.
It is well defined and easily understood: the semantics of the approximation are clear, so users of the system can be confident that they understand what is given up in producing the approximate answer. It is deterministic, so there is no danger that unfortunate random choices will produce a bad approximation. Most importantly, it emphasizes recent data, which in the majority of real-world applications is more important and relevant than old data: if one is trying to make real-time sense of network traffic patterns, phone call or transaction records, or scientific sensor data, then insights based on the recent past will generally be more informative and useful than insights based on stale data. Yang and Mao [3] discussed a self-adaptive sliding window model that learns the sliding window control parameter needed when patterns are mined from data streams.

Sampling, load shedding, and synopsis data models: These models choose a subset of the stream to process and provide approximate results. Sampling refers to the probabilistic choice of the data items to process; its drawback is that the data set size is unknown. Load shedding [4] refers to dropping a sequence of data stream elements. Creating a synopsis [5] refers to applying summarization techniques capable of condensing the incoming stream for future analysis.

Histogram: A histogram is a data structure that approximates the frequency distribution of element values in a data stream by partitioning the data into a set of contiguous buckets. Hash-based methods have been proposed to mine frequent items, and they have been extended with the lossy counting model.

Multi-resolution methods: A common way to deal with a large amount of data is to employ divide-and-conquer strategies such as multi-resolution data structures.
These methods offer the ability to understand a data stream at multiple levels of detail. Clustering is employed to store the multiple levels of stream data; a hierarchical clustering structure such as the CF tree in BIRCH forms the hierarchy of clusters. All the above approaches provide approximate answers over long-term data and adjust their storage requirements to the available space, thereby satisfying the two requirements of approximation and adjustability.
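The lossy counting model mentioned above is easy to illustrate. The following is a minimal sketch of the classic one-pass algorithm, not code from the surveyed papers: it keeps at most on the order of 1/epsilon counters and undercounts any item by at most epsilon * N, where N is the number of items seen.

```python
def lossy_count(stream, epsilon):
    """One-pass approximate frequency counting (lossy counting).
    counts[item] underestimates the true count by at most epsilon * N."""
    width = int(1 / epsilon)          # bucket width
    counts, deltas = {}, {}           # running count and max possible error
    bucket = 1
    for n, item in enumerate(stream, start=1):
        if item in counts:
            counts[item] += 1
        else:
            counts[item], deltas[item] = 1, bucket - 1
        if n % width == 0:            # end of bucket: prune rare items
            rare = [i for i in counts if counts[i] + deltas[i] <= bucket]
            for i in rare:
                del counts[i], deltas[i]
            bucket += 1
    return counts

stream = ["a"] * 6 + ["b"] * 3 + ["c"]
print(lossy_count(stream, 0.2))
# → {'a': 6, 'b': 3}   ('c' was pruned as rare)
```

This satisfies the approximation and adjustability requirements above: memory use is bounded by the choice of epsilon rather than by the stream length.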

2.2 Mining Techniques in Distributed Environments

Mining data streams is concerned with extracting knowledge, represented as models and patterns, from unending streams of information; it requires special architectures and methods. The authors of [6] discussed SAMOA, a platform for online mining in a cluster/cloud environment. It features a pluggable architecture that allows it to run on several distributed stream processing engines, and it includes algorithms for the most common machine learning tasks, such as classification and clustering. The FP-stream approach [7] mines frequent itemsets under the time-fading model: frequent items are stored in a compact tree representation called the FP-tree, the continuous stream is divided under a time window model, and each window maintains its frequent patterns and their counts, with an FP-tree built per window. It is suitable for mining only the recent data. The authors of [5] discussed methods and issues in data stream management, covering mining techniques for data streams such as association, clustering, and classification. The author of [8] proposed a general method for scaling up machine learning algorithms, called very fast machine learning. Mining frequent itemsets in dynamic databases [9, 10] has been studied over the last decade: the frequent items and support counts are derived from the original database and updated as new transactions are added. Recent research addresses mining frequent items over data streams using the sliding window model [10, 11]. In [12] the authors proposed a distributed algorithm that imposes low communication overhead when mining distributed datasets: a minimum support threshold and frequency count are supplied interactively for all window transactions, and the results are merged to find the frequent patterns of the dataset in the window W.
To mine frequent items from a data stream, an appropriate single-scan frequent mining algorithm is applied. The authors of [13] proposed new methods for mining frequent itemsets in parallel on the MapReduce framework; the Eclat method distributes the search space as evenly as possible among mappers. In [14] the authors presented a MapReduce-based algorithm that addresses frequent itemset mining with dynamic workload management through block-based partitioning. In [15] the authors build on the distributed file system HDFS and describe NIM (Network Intelligence Miner), a scalable and elastic streaming solution that analyzes traffic patterns in real time and provides information for real-time decision making.

3. PROBLEM DEFINITION AND ANALYSIS

The objective of our work is to propose a distributed framework for analyzing big data streams and predicting the complete set of frequent patterns they contain. Our model combines a time-sensitive window with distributed, parallel processing. Distributed and parallel processing relies heavily on data partitioning: a large data set is broken down into multiple pieces that independent processors can handle. Consider a continuous stream built on a time-sensitive sliding window W, divided into splits W = (w1, w2, w3, ..., wn) over recent time slots (t1, t2, t3, ..., tn). The split streams (w1, w2, ..., wn) are handed to the Hadoop Distributed File System (HDFS) and the MapReduce paradigm to obtain frequent patterns. Hadoop is an open-source distributed framework designed around Google's MapReduce programming model. It is capable of analyzing large amounts of data and supports a write-once, read-many access model. Hadoop has its own file system, the Hadoop Distributed File System (HDFS), which runs on commodity hardware with high fault tolerance.
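The partitioning step described above, assigning the window splits w1..wn to processing nodes, can be sketched as follows. This is a deliberately simplified round-robin assignment in a single process; the function and node names are hypothetical, and real HDFS placement additionally handles block sizing, rack awareness, and replication.

```python
def distribute_windows(windows, num_nodes):
    """Assign window splits w1..wn to processing nodes round-robin,
    a toy stand-in for HDFS block placement across commodity nodes."""
    nodes = {f"node{i}": [] for i in range(num_nodes)}
    for idx, window in enumerate(windows):
        nodes[f"node{idx % num_nodes}"].append(window)
    return nodes

# four window splits distributed over two nodes
windows = [["t1", "t2"], ["t3"], ["t4", "t5"], ["t6"]]
print(distribute_windows(windows, 2))
# → {'node0': [['t1', 't2'], ['t4', 't5']], 'node1': [['t3'], ['t6']]}
```

Each node can then run the frequent pattern mining procedure on its local splits independently, which is the basis of the parallelism claimed for the framework.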
Data replication is one of the important features of HDFS,

which ensures data availability and automatic re-execution in the face of multiple node failures. In this paper we propose an algorithm that uses the power of Hadoop to mine patterns in distributed environments.

Consider n windows W = {w1, w2, ..., wn}, each holding m transactions and each created in a time slot t0, t1, t2, ..., tn. Frequent patterns are to be obtained over all window transactions. We use block partitioning to distribute the datasets among the processing nodes: the Hadoop HDFS system allocates the window transactions to M clusters of nodes in a server. Let the size of partition Ti be Di. Each partition Ti is divided into bi blocks {t1, t2, ..., tbi}. The size of a block ti defaults to 64 MB, or is set according to the available memory on processing node Ni, the number of items, the average transaction width, and the support threshold of the dataset. To minimize computation cost, the window transactions are distributed to multiple commodity nodes. Our framework adopts a distributed, parallel, and incrementally updated environment.

4. PROPOSED SYSTEM COMPONENTS

We propose a framework based on distributed MapReduce technology for a dynamic environment such as big data analytics, built on a tilted time frame model with a distributed file system. The following components are needed.

4.1 Time Sensitive Sliding Window

Sources of stream data generate unstructured or semi-structured data rapidly. We divide the data stream using a time-sensitive sliding window model. The sliding window model of computation is motivated by the assumption that recent data are more useful and pertinent than older data.
The entire stream in a window W of size m is divided into n splits (w1, w2, w3, ..., wn), each generated in a time slot (t1, t2, ..., tn).

4.2 Hadoop Distributed File System (HDFS)

HDFS allocates the splits (subsets) of the data stream to data nodes in a commodity cluster of servers. Data in a Hadoop cluster are broken down into smaller pieces, called blocks, and distributed throughout the cluster. In this way the map and reduce functions can execute on small subsets of the larger data set, which provides the scalability needed for big data processing. HDFS enables the distributed processing of large data sets across clusters of commodity servers and is designed to scale from a single server to thousands of machines with a very high degree of fault tolerance. Rather than relying on high-end hardware, the resiliency of these clusters comes from the software's ability to detect and handle failures at the application layer. Hadoop enables a computing solution that is:

Scalable: New nodes can be added as needed, without changing data formats, how data are loaded, how jobs are written, or the applications on top.

Cost effective: Hadoop brings massively parallel computing to commodity servers, yielding a sizeable decrease in the cost per terabyte of storage and making it affordable to model all your data.

Flexible: Hadoop is schema-less and can absorb any type of data, structured or not, from any number of

sources. Data from multiple sources can be joined and aggregated in arbitrary ways, enabling deeper analyses than any single system can provide.

Fault tolerant: When you lose a node, the system redirects work to another location of the data and continues processing without missing a beat.

4.3 Hadoop MapReduce Framework

MapReduce is a framework for processing parallelizable problems across huge datasets using a large number of computers (nodes), collectively referred to as a cluster. Processing can be done on data stored either in a filesystem (unstructured) or in a database (structured). MapReduce can take advantage of data locality, processing data on or near the storage assets in order to reduce the distance over which they must be transmitted.

Map step: Each worker node applies the map() function to its local data and writes the output to temporary storage. A master node ensures that only one of the redundant copies of the input data is processed.

Shuffle step: Worker nodes redistribute data based on the output keys produced by the map() function, such that all data belonging to one key end up on the same worker node.

Reduce step: Worker nodes process each group of output data, per key, in parallel. A set of reducers can perform the reduction phase, provided that all outputs of the map operation that share the same key are presented to the same reducer at the same time, or that the reduction function is associative. The parallelism also offers some possibility of recovering from partial failure of servers or storage during the operation: if one mapper or reducer fails, its work can be rescheduled.

4.4 Pattern Tree Structure

Because memory is bounded while large amounts of stream data arrive continuously, an efficient and compact data structure is needed to store and update the collected information. For each window, a pattern tree is maintained to store the frequent patterns and their counts.
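The map, shuffle, and reduce steps described above can be simulated in a single process to show how they combine for frequent item-pair mining. This is a minimal sketch under assumed names (`map_phase`, `shuffle`, `reduce_phase`), not the framework's actual Hadoop job; in a real cluster each phase would run on separate nodes.

```python
from collections import defaultdict
from itertools import combinations

def map_phase(transactions):
    """Map: emit (item_pair, 1) for every pair in each transaction."""
    for txn in transactions:
        for pair in combinations(sorted(txn), 2):
            yield pair, 1

def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups, minsup):
    """Reduce: sum the counts per pair and keep pairs meeting minsup."""
    return {k: sum(v) for k, v in groups.items() if sum(v) >= minsup}

txns = [["a", "b", "c"], ["a", "b"], ["b", "c"], ["a", "b", "d"]]
print(reduce_phase(shuffle(map_phase(txns)), 2))
# → {('a', 'b'): 3, ('b', 'c'): 2}
```

Sorting each transaction before pairing keeps keys canonical, so ("a", "b") and ("b", "a") land on the same reducer, which is exactly the grouping guarantee the shuffle step provides.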
Each node in the tree represents a pattern, and its frequency is recorded at the node. Figure 1 shows the structure of the pattern tree.

5. SYSTEM FRAMEWORK

The proposed framework employs distributed and parallel processing on the Hadoop architecture in order to obtain frequent patterns from fixed-size data streams. Figure 2 shows the proposed layout of the system. HDFS allocates the data stream splits into default-size blocks on the commodity cluster, and this allocation feeds the map function. Transactions and block replicas are tracked as metadata on a name node, while the data nodes in the server perform frequent pattern mining on their blocks of stream data. A MapReduce computation has two phases, map and reduce. The input to the computation is a data set of key/value pairs: the input pairs are partitioned into chunks and map() tasks run in parallel; after all map() tasks complete, the emitted values are consolidated for each unique emitted key; the space of output map keys is partitioned and reduce() runs in parallel; finally, the reducers generate the output and store it in HDFS. Tasks in each phase execute in a fault-tolerant manner, and having many map and reduce tasks enables good load

Figure 1. Pattern Tree.

balancing. The framework provides an interface that is independent of the backend technology. All nodes in the server cluster perform mining operations, producing results as key/value pairs. The output of the map phase is passed to the shuffling and sorting phase, where the results are clustered: each reducer receives an item pair as key and, as value, the list of all transactions in which that item pair occurs. For each such key/value pair the reducer sums the transactions containing the item pair, compares the sum with the minimum support, and emits the item pair as key with a null value. In this way we obtain frequent item pairs with fewer database scans, at the cost of more in-memory computation to generate the combinations.

Input:
  W - data stream in a specified window
  F - set of frequent patterns
  minsup - minimum support threshold

Steps:
1. Split the window W into sliding window frames (w1, w2, w3, ..., wn)
2. Assign block size = 64 MB in the Hadoop framework
3. Initiate the name service and metadata information META for the stream W
4. Load HDFS to allocate the streams {w1, w2, w3, ..., wn} to the data nodes (d1, d2, d3, ..., dn) of the server
5. Ensure replication of the streams
6. Implement frequent pattern mining using the pattern tree with the minsup parameter
7. Map the function Freqmining() to all data nodes (d1, d2, d3, ..., dn)
8. Shuffle and sort the resulting (key, value) pairs
9. Group by key: (k1, v1), (k2, v2), ..., (kx, vx)

Figure 2. Proposed System Framework.

Pseudo code: distributed frequent pattern mining from a data stream (continued).
10. Reduce functions r1, r2, ..., rx // as many reduce functions as there are keys
11. Frequent patterns F = (f1, f2, ...)

6. CONCLUSION

A big data stream is continuous data generated from a real-time environment that must be processed and analyzed immediately for decision making. Traditional systems, which rely on repeated scans and limited processing power, are not sufficient for such operations. In this paper we suggested adopting a data distribution technology such as Hadoop to manage the continuous flow of data and to process it simultaneously. The processing finds frequent patterns in the data splits, which are maintained in a tilted time frame window, and a frequent pattern mining algorithm on Hadoop is employed to find the frequent itemsets.

References

[1] Y. Zhu and D. Shasha, "StatStream: Statistical monitoring of thousands of data streams in real time," in Proceedings of the 28th International Conference on Very Large Data Bases, VLDB Endowment.
[2] C. Jin, W. Qian, C. Sha, J. X. Yu, and A. Zhou, "Dynamically maintaining frequent items over a data stream," in Proceedings of the Twelfth International Conference on Information and Knowledge Management, ACM.
[3] Y. Yang and G. Mao, "A self-adaptive sliding window technique for mining data streams," in Intelligence Computation and Evolutionary Computation, Springer.
[4] N. Tatbul, U. Çetintemel, S. Zdonik, M. Cherniack, and M. Stonebraker, "Load shedding on data streams," in Proceedings of the Workshop on Management and Processing of Data Streams (MPDS 03), San Diego, CA, USA.
[5] M. M. Gaber, A. Zaslavsky, and S. Krishnaswamy, "Mining data streams: a review," ACM SIGMOD Record, vol. 34, no. 2.
[6] G. De Francisci Morales, "SAMOA: A platform for mining big data streams," in Proceedings of the 22nd International Conference on World Wide Web Companion, International World Wide Web Conferences Steering Committee.
[7] C. Giannella, J. Han, J. Pei, X. Yan, and P. S. Yu, "Mining frequent patterns in data streams at multiple time granularities," Next Generation Data Mining, vol. 212.
[8] P. Domingos and G. Hulten, "Mining high-speed data streams," in Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM.
[9] D. W. Cheung, J. Han, V. T. Ng, and C. Wong, "Maintenance of discovered association rules in large databases: An incremental updating technique," in Proceedings of the Twelfth International Conference on Data Engineering, IEEE.
[10] C.-H. Lee, C.-R. Lin, and M.-S. Chen, "Sliding-window filtering: an efficient algorithm for incremental mining," in Proceedings of the Tenth International Conference on Information and Knowledge Management, ACM.
[11] D. J. Abadi, D. Carney, U. Çetintemel, M. Cherniack, C. Convey, S. Lee, M. Stonebraker, N. Tatbul, and S. Zdonik, "Aurora: a new model and architecture for data stream management," The VLDB Journal, vol. 12, no. 2.
[12] L. Zeng, L. Li, L. Duan, K. Lu, Z. Shi, M. Wang, W. Wu, and P. Luo, "Distributed data mining: a survey," Information Technology and Management, vol. 13, no. 4.
[13] L. Zhou, Z. Zhong, J. Chang, J. Li, J. Huang, and S. Feng, "Balanced parallel FP-growth with MapReduce," in 2010 IEEE Youth Conference on Information Computing and Telecommunications (YC-ICT), IEEE.
[14] L. Dhamdhere Jyoti and B. Deshpande Kiran, "A novel methodology of frequent itemset mining on Hadoop."
[15] L. Pan, J. Qian, C. He, W. Fan, C. He, and F. Yang, "NIM: Scalable distributed stream process system on mobile network data," in 2013 IEEE 13th International Conference on Data Mining Workshops (ICDMW), IEEE.


More information

Analysing Large Web Log Files in a Hadoop Distributed Cluster Environment

Analysing Large Web Log Files in a Hadoop Distributed Cluster Environment Analysing Large Files in a Hadoop Distributed Cluster Environment S Saravanan, B Uma Maheswari Department of Computer Science and Engineering, Amrita School of Engineering, Amrita Vishwa Vidyapeetham,

More information

Cloud Computing at Google. Architecture

Cloud Computing at Google. Architecture Cloud Computing at Google Google File System Web Systems and Algorithms Google Chris Brooks Department of Computer Science University of San Francisco Google has developed a layered system to handle webscale

More information

Introduction to Hadoop. New York Oracle User Group Vikas Sawhney

Introduction to Hadoop. New York Oracle User Group Vikas Sawhney Introduction to Hadoop New York Oracle User Group Vikas Sawhney GENERAL AGENDA Driving Factors behind BIG-DATA NOSQL Database 2014 Database Landscape Hadoop Architecture Map/Reduce Hadoop Eco-system Hadoop

More information

Are You Ready for Big Data?

Are You Ready for Big Data? Are You Ready for Big Data? Jim Gallo National Director, Business Analytics April 10, 2013 Agenda What is Big Data? How do you leverage Big Data in your company? How do you prepare for a Big Data initiative?

More information

Advances in Natural and Applied Sciences

Advances in Natural and Applied Sciences AENSI Journals Advances in Natural and Applied Sciences ISSN:1995-0772 EISSN: 1998-1090 Journal home page: www.aensiweb.com/anas Clustering Algorithm Based On Hadoop for Big Data 1 Jayalatchumy D. and

More information

R.K.Uskenbayeva 1, А.А. Kuandykov 2, Zh.B.Kalpeyeva 3, D.K.Kozhamzharova 4, N.K.Mukhazhanov 5

R.K.Uskenbayeva 1, А.А. Kuandykov 2, Zh.B.Kalpeyeva 3, D.K.Kozhamzharova 4, N.K.Mukhazhanov 5 Distributed data processing in heterogeneous cloud environments R.K.Uskenbayeva 1, А.А. Kuandykov 2, Zh.B.Kalpeyeva 3, D.K.Kozhamzharova 4, N.K.Mukhazhanov 5 1 uskenbaevar@gmail.com, 2 abu.kuandykov@gmail.com,

More information

Efficient Analysis of Big Data Using Map Reduce Framework

Efficient Analysis of Big Data Using Map Reduce Framework Efficient Analysis of Big Data Using Map Reduce Framework Dr. Siddaraju 1, Sowmya C L 2, Rashmi K 3, Rahul M 4 1 Professor & Head of Department of Computer Science & Engineering, 2,3,4 Assistant Professor,

More information

Hadoop Architecture. Part 1

Hadoop Architecture. Part 1 Hadoop Architecture Part 1 Node, Rack and Cluster: A node is simply a computer, typically non-enterprise, commodity hardware for nodes that contain data. Consider we have Node 1.Then we can add more nodes,

More information

A SURVEY ON MAPREDUCE IN CLOUD COMPUTING

A SURVEY ON MAPREDUCE IN CLOUD COMPUTING A SURVEY ON MAPREDUCE IN CLOUD COMPUTING Dr.M.Newlin Rajkumar 1, S.Balachandar 2, Dr.V.Venkatesakumar 3, T.Mahadevan 4 1 Asst. Prof, Dept. of CSE,Anna University Regional Centre, Coimbatore, newlin_rajkumar@yahoo.co.in

More information

International Journal of Advancements in Research & Technology, Volume 3, Issue 2, February-2014 10 ISSN 2278-7763

International Journal of Advancements in Research & Technology, Volume 3, Issue 2, February-2014 10 ISSN 2278-7763 International Journal of Advancements in Research & Technology, Volume 3, Issue 2, February-2014 10 A Discussion on Testing Hadoop Applications Sevuga Perumal Chidambaram ABSTRACT The purpose of analysing

More information

Big Data with Rough Set Using Map- Reduce

Big Data with Rough Set Using Map- Reduce Big Data with Rough Set Using Map- Reduce Mr.G.Lenin 1, Mr. A. Raj Ganesh 2, Mr. S. Vanarasan 3 Assistant Professor, Department of CSE, Podhigai College of Engineering & Technology, Tirupattur, Tamilnadu,

More information

Energy Efficient MapReduce

Energy Efficient MapReduce Energy Efficient MapReduce Motivation: Energy consumption is an important aspect of datacenters efficiency, the total power consumption in the united states has doubled from 2000 to 2005, representing

More information

Associate Professor, Department of CSE, Shri Vishnu Engineering College for Women, Andhra Pradesh, India 2

Associate Professor, Department of CSE, Shri Vishnu Engineering College for Women, Andhra Pradesh, India 2 Volume 6, Issue 3, March 2016 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Special Issue

More information

Big Data and Scripting map/reduce in Hadoop

Big Data and Scripting map/reduce in Hadoop Big Data and Scripting map/reduce in Hadoop 1, 2, parts of a Hadoop map/reduce implementation core framework provides customization via indivudual map and reduce functions e.g. implementation in mongodb

More information

Mining Interesting Medical Knowledge from Big Data

Mining Interesting Medical Knowledge from Big Data IOSR Journal of Computer Engineering (IOSR-JCE) e-issn: 2278-0661,p-ISSN: 2278-8727, Volume 18, Issue 1, Ver. II (Jan Feb. 2016), PP 06-10 www.iosrjournals.org Mining Interesting Medical Knowledge from

More information

Map-Parallel Scheduling (mps) using Hadoop environment for job scheduler and time span for Multicore Processors

Map-Parallel Scheduling (mps) using Hadoop environment for job scheduler and time span for Multicore Processors Map-Parallel Scheduling (mps) using Hadoop environment for job scheduler and time span for Sudarsanam P Abstract G. Singaravel Parallel computing is an base mechanism for data process with scheduling task,

More information

ImprovedApproachestoHandleBigdatathroughHadoop

ImprovedApproachestoHandleBigdatathroughHadoop Global Journal of Computer Science and Technology: C Software & Data Engineering Volume 14 Issue 9 Version 1.0 Type: Double Blind Peer Reviewed International Research Journal Publisher: Global Journals

More information

CSE-E5430 Scalable Cloud Computing Lecture 2

CSE-E5430 Scalable Cloud Computing Lecture 2 CSE-E5430 Scalable Cloud Computing Lecture 2 Keijo Heljanko Department of Computer Science School of Science Aalto University keijo.heljanko@aalto.fi 14.9-2015 1/36 Google MapReduce A scalable batch processing

More information

Fault Tolerance in Hadoop for Work Migration

Fault Tolerance in Hadoop for Work Migration 1 Fault Tolerance in Hadoop for Work Migration Shivaraman Janakiraman Indiana University Bloomington ABSTRACT Hadoop is a framework that runs applications on large clusters which are built on numerous

More information

Hadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh

Hadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh 1 Hadoop: A Framework for Data- Intensive Distributed Computing CS561-Spring 2012 WPI, Mohamed Y. Eltabakh 2 What is Hadoop? Hadoop is a software framework for distributed processing of large datasets

More information

Accelerating Hadoop MapReduce Using an In-Memory Data Grid

Accelerating Hadoop MapReduce Using an In-Memory Data Grid Accelerating Hadoop MapReduce Using an In-Memory Data Grid By David L. Brinker and William L. Bain, ScaleOut Software, Inc. 2013 ScaleOut Software, Inc. 12/27/2012 H adoop has been widely embraced for

More information

marlabs driving digital agility WHITEPAPER Big Data and Hadoop

marlabs driving digital agility WHITEPAPER Big Data and Hadoop marlabs driving digital agility WHITEPAPER Big Data and Hadoop Abstract This paper explains the significance of Hadoop, an emerging yet rapidly growing technology. The prime goal of this paper is to unveil

More information

Improving Apriori Algorithm to get better performance with Cloud Computing

Improving Apriori Algorithm to get better performance with Cloud Computing Improving Apriori Algorithm to get better performance with Cloud Computing Zeba Qureshi 1 ; Sanjay Bansal 2 Affiliation: A.I.T.R, RGPV, India 1, A.I.T.R, RGPV, India 2 ABSTRACT Cloud computing has become

More information

Data Refinery with Big Data Aspects

Data Refinery with Big Data Aspects International Journal of Information and Computation Technology. ISSN 0974-2239 Volume 3, Number 7 (2013), pp. 655-662 International Research Publications House http://www. irphouse.com /ijict.htm Data

More information

FP-Hadoop: Efficient Execution of Parallel Jobs Over Skewed Data

FP-Hadoop: Efficient Execution of Parallel Jobs Over Skewed Data FP-Hadoop: Efficient Execution of Parallel Jobs Over Skewed Data Miguel Liroz-Gistau, Reza Akbarinia, Patrick Valduriez To cite this version: Miguel Liroz-Gistau, Reza Akbarinia, Patrick Valduriez. FP-Hadoop:

More information

Transforming the Telecoms Business using Big Data and Analytics

Transforming the Telecoms Business using Big Data and Analytics Transforming the Telecoms Business using Big Data and Analytics Event: ICT Forum for HR Professionals Venue: Meikles Hotel, Harare, Zimbabwe Date: 19 th 21 st August 2015 AFRALTI 1 Objectives Describe

More information

ISSN: 2320-1363 CONTEXTUAL ADVERTISEMENT MINING BASED ON BIG DATA ANALYTICS

ISSN: 2320-1363 CONTEXTUAL ADVERTISEMENT MINING BASED ON BIG DATA ANALYTICS CONTEXTUAL ADVERTISEMENT MINING BASED ON BIG DATA ANALYTICS A.Divya *1, A.M.Saravanan *2, I. Anette Regina *3 MPhil, Research Scholar, Muthurangam Govt. Arts College, Vellore, Tamilnadu, India Assistant

More information

A Novel Cloud Based Elastic Framework for Big Data Preprocessing

A Novel Cloud Based Elastic Framework for Big Data Preprocessing School of Systems Engineering A Novel Cloud Based Elastic Framework for Big Data Preprocessing Omer Dawelbeit and Rachel McCrindle October 21, 2014 University of Reading 2008 www.reading.ac.uk Overview

More information

Keywords Big Data; OODBMS; RDBMS; hadoop; EDM; learning analytics, data abundance.

Keywords Big Data; OODBMS; RDBMS; hadoop; EDM; learning analytics, data abundance. Volume 4, Issue 11, November 2014 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Analytics

More information

MapReduce and Hadoop. Aaron Birkland Cornell Center for Advanced Computing. January 2012

MapReduce and Hadoop. Aaron Birkland Cornell Center for Advanced Computing. January 2012 MapReduce and Hadoop Aaron Birkland Cornell Center for Advanced Computing January 2012 Motivation Simple programming model for Big Data Distributed, parallel but hides this Established success at petabyte

More information

ISSN:2320-0790. Keywords: HDFS, Replication, Map-Reduce I Introduction:

ISSN:2320-0790. Keywords: HDFS, Replication, Map-Reduce I Introduction: ISSN:2320-0790 Dynamic Data Replication for HPC Analytics Applications in Hadoop Ragupathi T 1, Sujaudeen N 2 1 PG Scholar, Department of CSE, SSN College of Engineering, Chennai, India 2 Assistant Professor,

More information

Comparision of k-means and k-medoids Clustering Algorithms for Big Data Using MapReduce Techniques

Comparision of k-means and k-medoids Clustering Algorithms for Big Data Using MapReduce Techniques Comparision of k-means and k-medoids Clustering Algorithms for Big Data Using MapReduce Techniques Subhashree K 1, Prakash P S 2 1 Student, Kongu Engineering College, Perundurai, Erode 2 Assistant Professor,

More information

INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY

INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY A PATH FOR HORIZING YOUR INNOVATIVE WORK A REVIEW ON BIG DATA MANAGEMENT AND ITS SECURITY PRUTHVIKA S. KADU 1, DR. H. R.

More information

Hadoop and Map-Reduce. Swati Gore

Hadoop and Map-Reduce. Swati Gore Hadoop and Map-Reduce Swati Gore Contents Why Hadoop? Hadoop Overview Hadoop Architecture Working Description Fault Tolerance Limitations Why Map-Reduce not MPI Distributed sort Why Hadoop? Existing Data

More information

International Journal of Engineering Research ISSN: 2348-4039 & Management Technology November-2015 Volume 2, Issue-6

International Journal of Engineering Research ISSN: 2348-4039 & Management Technology November-2015 Volume 2, Issue-6 International Journal of Engineering Research ISSN: 2348-4039 & Management Technology Email: editor@ijermt.org November-2015 Volume 2, Issue-6 www.ijermt.org Modeling Big Data Characteristics for Discovering

More information

A B S T R A C T. Index Terms : Apache s Hadoop, Map/Reduce, HDFS, Hashing Algorithm. I. INTRODUCTION

A B S T R A C T. Index Terms : Apache s Hadoop, Map/Reduce, HDFS, Hashing Algorithm. I. INTRODUCTION Speed- Up Extension To Hadoop System- A Survey Of HDFS Data Placement Sayali Ashok Shivarkar, Prof.Deepali Gatade Computer Network, Sinhgad College of Engineering, Pune, India 1sayalishivarkar20@gmail.com

More information

Big Application Execution on Cloud using Hadoop Distributed File System

Big Application Execution on Cloud using Hadoop Distributed File System Big Application Execution on Cloud using Hadoop Distributed File System Ashkan Vates*, Upendra, Muwafaq Rahi Ali RPIIT Campus, Bastara Karnal, Haryana, India ---------------------------------------------------------------------***---------------------------------------------------------------------

More information

Task Scheduling in Hadoop

Task Scheduling in Hadoop Task Scheduling in Hadoop Sagar Mamdapure Munira Ginwala Neha Papat SAE,Kondhwa SAE,Kondhwa SAE,Kondhwa Abstract Hadoop is widely used for storing large datasets and processing them efficiently under distributed

More information

A USE CASE OF BIG DATA EXPLORATION & ANALYSIS WITH HADOOP: STATISTICS REPORT GENERATION

A USE CASE OF BIG DATA EXPLORATION & ANALYSIS WITH HADOOP: STATISTICS REPORT GENERATION A USE CASE OF BIG DATA EXPLORATION & ANALYSIS WITH HADOOP: STATISTICS REPORT GENERATION Sumitha VS 1, Shilpa V 2 1 M.E. Final Year, Department of Computer Science Engineering (IT), UVCE, Bangalore, gvsumitha@gmail.com

More information

A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM

A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM Sneha D.Borkar 1, Prof.Chaitali S.Surtakar 2 Student of B.E., Information Technology, J.D.I.E.T, sborkar95@gmail.com Assistant Professor, Information

More information

A Study on Workload Imbalance Issues in Data Intensive Distributed Computing

A Study on Workload Imbalance Issues in Data Intensive Distributed Computing A Study on Workload Imbalance Issues in Data Intensive Distributed Computing Sven Groot 1, Kazuo Goda 1, and Masaru Kitsuregawa 1 University of Tokyo, 4-6-1 Komaba, Meguro-ku, Tokyo 153-8505, Japan Abstract.

More information

International Journal of Advance Research in Computer Science and Management Studies

International Journal of Advance Research in Computer Science and Management Studies Volume 2, Issue 8, August 2014 ISSN: 2321 7782 (Online) International Journal of Advance Research in Computer Science and Management Studies Research Article / Survey Paper / Case Study Available online

More information

How To Handle Big Data With A Data Scientist

How To Handle Big Data With A Data Scientist III Big Data Technologies Today, new technologies make it possible to realize value from Big Data. Big data technologies can replace highly customized, expensive legacy systems with a standard solution

More information

Parallel Processing of cluster by Map Reduce

Parallel Processing of cluster by Map Reduce Parallel Processing of cluster by Map Reduce Abstract Madhavi Vaidya, Department of Computer Science Vivekanand College, Chembur, Mumbai vamadhavi04@yahoo.co.in MapReduce is a parallel programming model

More information

Tutorial: Big Data Algorithms and Applications Under Hadoop KUNPENG ZHANG SIDDHARTHA BHATTACHARYYA

Tutorial: Big Data Algorithms and Applications Under Hadoop KUNPENG ZHANG SIDDHARTHA BHATTACHARYYA Tutorial: Big Data Algorithms and Applications Under Hadoop KUNPENG ZHANG SIDDHARTHA BHATTACHARYYA http://kzhang6.people.uic.edu/tutorial/amcis2014.html August 7, 2014 Schedule I. Introduction to big data

More information

Hadoop Cluster Applications

Hadoop Cluster Applications Hadoop Overview Data analytics has become a key element of the business decision process over the last decade. Classic reporting on a dataset stored in a database was sufficient until recently, but yesterday

More information

White Paper. Big Data and Hadoop. Abhishek S, Java COE. Cloud Computing Mobile DW-BI-Analytics Microsoft Oracle ERP Java SAP ERP

White Paper. Big Data and Hadoop. Abhishek S, Java COE. Cloud Computing Mobile DW-BI-Analytics Microsoft Oracle ERP Java SAP ERP White Paper Big Data and Hadoop Abhishek S, Java COE www.marlabs.com Cloud Computing Mobile DW-BI-Analytics Microsoft Oracle ERP Java SAP ERP Table of contents Abstract.. 1 Introduction. 2 What is Big

More information

Welcome to the unit of Hadoop Fundamentals on Hadoop architecture. I will begin with a terminology review and then cover the major components

Welcome to the unit of Hadoop Fundamentals on Hadoop architecture. I will begin with a terminology review and then cover the major components Welcome to the unit of Hadoop Fundamentals on Hadoop architecture. I will begin with a terminology review and then cover the major components of Hadoop. We will see what types of nodes can exist in a Hadoop

More information

Big Data Processing with Google s MapReduce. Alexandru Costan

Big Data Processing with Google s MapReduce. Alexandru Costan 1 Big Data Processing with Google s MapReduce Alexandru Costan Outline Motivation MapReduce programming model Examples MapReduce system architecture Limitations Extensions 2 Motivation Big Data @Google:

More information

Intro to Map/Reduce a.k.a. Hadoop

Intro to Map/Reduce a.k.a. Hadoop Intro to Map/Reduce a.k.a. Hadoop Based on: Mining of Massive Datasets by Ra jaraman and Ullman, Cambridge University Press, 2011 Data Mining for the masses by North, Global Text Project, 2012 Slides by

More information

MapReduce and Hadoop Distributed File System

MapReduce and Hadoop Distributed File System MapReduce and Hadoop Distributed File System 1 B. RAMAMURTHY Contact: Dr. Bina Ramamurthy CSE Department University at Buffalo (SUNY) bina@buffalo.edu http://www.cse.buffalo.edu/faculty/bina Partially

More information

Role of Cloud Computing in Big Data Analytics Using MapReduce Component of Hadoop

Role of Cloud Computing in Big Data Analytics Using MapReduce Component of Hadoop Role of Cloud Computing in Big Data Analytics Using MapReduce Component of Hadoop Kanchan A. Khedikar Department of Computer Science & Engineering Walchand Institute of Technoloy, Solapur, Maharashtra,

More information

Networking in the Hadoop Cluster

Networking in the Hadoop Cluster Hadoop and other distributed systems are increasingly the solution of choice for next generation data volumes. A high capacity, any to any, easily manageable networking layer is critical for peak Hadoop

More information

Big Systems, Big Data

Big Systems, Big Data Big Systems, Big Data When considering Big Distributed Systems, it can be noted that a major concern is dealing with data, and in particular, Big Data Have general data issues (such as latency, availability,

More information

Big Data: Study in Structured and Unstructured Data

Big Data: Study in Structured and Unstructured Data Big Data: Study in Structured and Unstructured Data Motashim Rasool 1, Wasim Khan 2 mail2motashim@gmail.com, khanwasim051@gmail.com Abstract With the overlay of digital world, Information is available

More information

Finding Insights & Hadoop Cluster Performance Analysis over Census Dataset Using Big-Data Analytics

Finding Insights & Hadoop Cluster Performance Analysis over Census Dataset Using Big-Data Analytics Finding Insights & Hadoop Cluster Performance Analysis over Census Dataset Using Big-Data Analytics Dharmendra Agawane 1, Rohit Pawar 2, Pavankumar Purohit 3, Gangadhar Agre 4 Guide: Prof. P B Jawade 2

More information

From GWS to MapReduce: Google s Cloud Technology in the Early Days

From GWS to MapReduce: Google s Cloud Technology in the Early Days Large-Scale Distributed Systems From GWS to MapReduce: Google s Cloud Technology in the Early Days Part II: MapReduce in a Datacenter COMP6511A Spring 2014 HKUST Lin Gu lingu@ieee.org MapReduce/Hadoop

More information

Classification On The Clouds Using MapReduce

Classification On The Clouds Using MapReduce Classification On The Clouds Using MapReduce Simão Martins Instituto Superior Técnico Lisbon, Portugal simao.martins@tecnico.ulisboa.pt Cláudia Antunes Instituto Superior Técnico Lisbon, Portugal claudia.antunes@tecnico.ulisboa.pt

More information

How In-Memory Data Grids Can Analyze Fast-Changing Data in Real Time

How In-Memory Data Grids Can Analyze Fast-Changing Data in Real Time SCALEOUT SOFTWARE How In-Memory Data Grids Can Analyze Fast-Changing Data in Real Time by Dr. William Bain and Dr. Mikhail Sobolev, ScaleOut Software, Inc. 2012 ScaleOut Software, Inc. 12/27/2012 T wenty-first

More information

Lecture 32 Big Data. 1. Big Data problem 2. Why the excitement about big data 3. What is MapReduce 4. What is Hadoop 5. Get started with Hadoop

Lecture 32 Big Data. 1. Big Data problem 2. Why the excitement about big data 3. What is MapReduce 4. What is Hadoop 5. Get started with Hadoop Lecture 32 Big Data 1. Big Data problem 2. Why the excitement about big data 3. What is MapReduce 4. What is Hadoop 5. Get started with Hadoop 1 2 Big Data Problems Data explosion Data from users on social

More information

How To Scale Out Of A Nosql Database

How To Scale Out Of A Nosql Database Firebird meets NoSQL (Apache HBase) Case Study Firebird Conference 2011 Luxembourg 25.11.2011 26.11.2011 Thomas Steinmaurer DI +43 7236 3343 896 thomas.steinmaurer@scch.at www.scch.at Michael Zwick DI

More information

PLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS

PLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS PLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS By HAI JIN, SHADI IBRAHIM, LI QI, HAIJUN CAO, SONG WU and XUANHUA SHI Prepared by: Dr. Faramarz Safi Islamic Azad

More information

Introduction to Hadoop

Introduction to Hadoop Introduction to Hadoop 1 What is Hadoop? the big data revolution extracting value from data cloud computing 2 Understanding MapReduce the word count problem more examples MCS 572 Lecture 24 Introduction

More information

Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related

Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related Summary Xiangzhe Li Nowadays, there are more and more data everyday about everything. For instance, here are some of the astonishing

More information

Data Mining in the Swamp

Data Mining in the Swamp WHITE PAPER Page 1 of 8 Data Mining in the Swamp Taming Unruly Data with Cloud Computing By John Brothers Business Intelligence is all about making better decisions from the data you have. However, all

More information

REVIEW PAPER ON BIG DATA USING HADOOP

REVIEW PAPER ON BIG DATA USING HADOOP International Journal of Computer Engineering & Technology (IJCET) Volume 6, Issue 12, Dec 2015, pp. 65-71, Article ID: IJCET_06_12_008 Available online at http://www.iaeme.com/ijcet/issues.asp?jtype=ijcet&vtype=6&itype=12

More information

Enhancing Dataset Processing in Hadoop YARN Performance for Big Data Applications

Enhancing Dataset Processing in Hadoop YARN Performance for Big Data Applications Enhancing Dataset Processing in Hadoop YARN Performance for Big Data Applications Ahmed Abdulhakim Al-Absi, Dae-Ki Kang and Myong-Jong Kim Abstract In Hadoop MapReduce distributed file system, as the input

More information

Static Data Mining Algorithm with Progressive Approach for Mining Knowledge

Static Data Mining Algorithm with Progressive Approach for Mining Knowledge Global Journal of Business Management and Information Technology. Volume 1, Number 2 (2011), pp. 85-93 Research India Publications http://www.ripublication.com Static Data Mining Algorithm with Progressive

More information

Developing Scalable Smart Grid Infrastructure to Enable Secure Transmission System Control

Developing Scalable Smart Grid Infrastructure to Enable Secure Transmission System Control Developing Scalable Smart Grid Infrastructure to Enable Secure Transmission System Control EP/K006487/1 UK PI: Prof Gareth Taylor (BU) China PI: Prof Yong-Hua Song (THU) Consortium UK Members: Brunel University

More information

GraySort and MinuteSort at Yahoo on Hadoop 0.23

GraySort and MinuteSort at Yahoo on Hadoop 0.23 GraySort and at Yahoo on Hadoop.23 Thomas Graves Yahoo! May, 213 The Apache Hadoop[1] software library is an open source framework that allows for the distributed processing of large data sets across clusters

More information

Mining Large Datasets: Case of Mining Graph Data in the Cloud

Mining Large Datasets: Case of Mining Graph Data in the Cloud Mining Large Datasets: Case of Mining Graph Data in the Cloud Sabeur Aridhi PhD in Computer Science with Laurent d Orazio, Mondher Maddouri and Engelbert Mephu Nguifo 16/05/2014 Sabeur Aridhi Mining Large

More information

CLOUDDMSS: CLOUD-BASED DISTRIBUTED MULTIMEDIA STREAMING SERVICE SYSTEM FOR HETEROGENEOUS DEVICES

CLOUDDMSS: CLOUD-BASED DISTRIBUTED MULTIMEDIA STREAMING SERVICE SYSTEM FOR HETEROGENEOUS DEVICES CLOUDDMSS: CLOUD-BASED DISTRIBUTED MULTIMEDIA STREAMING SERVICE SYSTEM FOR HETEROGENEOUS DEVICES 1 MYOUNGJIN KIM, 2 CUI YUN, 3 SEUNGHO HAN, 4 HANKU LEE 1,2,3,4 Department of Internet & Multimedia Engineering,

More information

Manifest for Big Data Pig, Hive & Jaql

Manifest for Big Data Pig, Hive & Jaql Manifest for Big Data Pig, Hive & Jaql Ajay Chotrani, Priyanka Punjabi, Prachi Ratnani, Rupali Hande Final Year Student, Dept. of Computer Engineering, V.E.S.I.T, Mumbai, India Faculty, Computer Engineering,

More information

Analysis of MapReduce Algorithms

Analysis of MapReduce Algorithms Analysis of MapReduce Algorithms Harini Padmanaban Computer Science Department San Jose State University San Jose, CA 95192 408-924-1000 harini.gomadam@gmail.com ABSTRACT MapReduce is a programming model

More information

StreamStorage: High-throughput and Scalable Storage Technology for Streaming Data

StreamStorage: High-throughput and Scalable Storage Technology for Streaming Data : High-throughput and Scalable Storage Technology for Streaming Data Munenori Maeda Toshihiro Ozawa Real-time analytical processing (RTAP) of vast amounts of time-series data from sensors, server logs,

More information

Large-Scale Data Sets Clustering Based on MapReduce and Hadoop

Large-Scale Data Sets Clustering Based on MapReduce and Hadoop Journal of Computational Information Systems 7: 16 (2011) 5956-5963 Available at http://www.jofcis.com Large-Scale Data Sets Clustering Based on MapReduce and Hadoop Ping ZHOU, Jingsheng LEI, Wenjun YE

More information

A survey of big data architectures for handling massive data

A survey of big data architectures for handling massive data CSIT 6910 Independent Project A survey of big data architectures for handling massive data Jordy Domingos - jordydomingos@gmail.com Supervisor : Dr David Rossiter Content Table 1 - Introduction a - Context

More information

Distributed File System. MCSN N. Tonellotto Complements of Distributed Enabling Platforms

Distributed File System. MCSN N. Tonellotto Complements of Distributed Enabling Platforms Distributed File System 1 How do we get data to the workers? NAS Compute Nodes SAN 2 Distributed File System Don t move data to workers move workers to the data! Store data on the local disks of nodes

More information

Hadoop/MapReduce. Object-oriented framework presentation CSCI 5448 Casey McTaggart

Hadoop/MapReduce. Object-oriented framework presentation CSCI 5448 Casey McTaggart Hadoop/MapReduce Object-oriented framework presentation CSCI 5448 Casey McTaggart What is Apache Hadoop? Large scale, open source software framework Yahoo! has been the largest contributor to date Dedicated

More information