Massive Data Query Optimization on Large Clusters


Journal of Computational Information Systems 8: 8 (2012)

Guigang ZHANG, Chao LI, Yong ZHANG, Chunxiao XING

Research Institute of Information Technology, Tsinghua National Laboratory for Information Science and Technology, Department of Computer Science and Technology, Tsinghua University, Beijing, China

Abstract

The growing demand for massive data processing and analysis applications has led both academia and industry to design many new types of highly scalable, data-intensive computing platforms based on large clusters in the cloud environment. Achieving a fast query response time, especially for ad hoc queries, is becoming very important in large-cluster environments. In this paper, we design a series of algorithms for query optimization, together with an efficient massive data query and optimization mechanism, SemanQuery. SemanQuery has two characteristics. First, it has better semantics; that is, it shows some intelligence when processing massive data queries, by means of a semantic matching algorithm. Second, to reduce the query cost, we construct a very large query network in SemanQuery and optimize it. Simulation experiments show that SemanQuery improves query efficiency on large clusters.

Keywords: Query Processing; Query Optimization; Query Network; SemanQuery; Optimization Algorithm

1 Introduction

With the development of cloud computing technologies, the internet, the mobile internet and the internet of things, all kinds of terminals and information collectors are increasing rapidly. Every thing and every person will produce massive data. IDC predicts that the data volume will reach 8 ZB in 2015. All of this big data will enter information systems, where it needs to be stored, analyzed and used. Processing this big data is a major difficulty in the cloud environment. Massive data is also known as big data.
It has become a hot research topic in academia and industry. Querying, from the massive data held in files, relational databases and cloud databases, the data that satisfies the requirements of millions of users is a very big challenge. The challenge can be summarized as: how can we quickly find the data that satisfies the requirements of millions of users?

To address this challenge, this paper proposes techniques that combine semantic technologies with query optimization technologies. We first propose a very big query network. The big query network includes all SQL query plans that satisfy the requirements of millions of users. We assume that all file query plans in the HDFS and GFS file systems and all data query plans in databases can be converted into SQL query plans. We then propose a semantic matching algorithm between the users' query requirements and the big query network. The semantic matching algorithm mainly helps us find a good query path in the big query network.

Project supported by the National Basic Research Program of China (973 Program) No. 2011CB302302, the National Natural Science Foundation of China under Grant No. [...], and the Research Foundation of the Ministry of Railways and Tsinghua University under Grant No. J2010Z057, J2010Z059. Corresponding author. Email address: [email protected] (Guigang ZHANG). Copyright 2012 Binary Information Press, April 2012.

2 Related Work

Data query and optimization is very important in the database and data processing area, and many researchers have studied it. The work can be divided into RDBMS query and optimization, OODBMS query and optimization, traditional database optimization, and distributed database query and optimization. Paper [1] analyzes an object-oriented implementation for extensible database query optimization. In the early stage, most research on query and optimization was aimed at traditional databases such as MySQL, Oracle, Sybase and DB2. Later, more and more data had to be stored in distributed environments, and many research theories and methods focused on them. For example, paper [2] proposed efficient parallel skyline processing using hyperplane projections. With the development of cloud computing and cloud storage technologies, big data [3] query and optimization has become more and more popular in recent research. Big data processing includes structured, semi-structured and unstructured data processing, and especially uncertain [4] data query and optimization processing.
Massive unstructured data processing mainly uses the MapReduce computing model [5]. Building on MapReduce, and in order to improve the efficiency of querying and processing big data, many other internet-scale massive data computing frameworks such as Twister, HaLoop, Hadoop++, Spark, CrowdDB [6] and Yale University's HadoopDB have been proposed in recent years. The objective of all these new computing frameworks is to improve the query efficiency for big data. Query and optimization methods can greatly improve the efficiency of massive data processing. Researchers have designed many such methods, including Top-K [7], processing on joins [8] and two-way selection [9] for data processing, and query workload balance [10].

3 SemanQuery Architecture

The SemanQuery architecture is shown in Fig. 1. It can be described as follows:

(1) All files are stored in the local file system. The file system may be a Windows file system, a Linux file system or another file system.

(2) An RDBMS (Relational Database Management System) may run on top of the local file system or of a DFS (Distributed File System) such as the Google File System or the Hadoop Distributed File System. Cloud databases mainly store the index and metadata information of the massive files and manage all these files. All these files (text files, video files, picture files and so on) and the final files of the cloud databases reside on top of the distributed file system.

(3) All RDBMS databases and cloud databases are made up of many tables, and all our queries are aimed at these tables. The big query network obtains data from these tables, and the data is processed by the query plans.

(4) Users can input query requirements and get query results through the user web interface.

(5) When SemanQuery receives the users' query requirements, it performs a semantic matching against the big query network. If the big query network contains these query plans, SemanQuery obtains their query paths in the big query network and executes the query plans along those paths. If SemanQuery cannot find the users' query plans in the big query network, it adds them to the big query network, creating a new big query network. Once the new big query network has been created, the users' query requirements are submitted again and the query plans are executed.

(6) After the query plans have been executed, the results are presented in the user web interface.

Fig. 1: SemanQuery architecture

4 SemanQuery Implementation Method

4.1 Big query network

In the cloud environment, millions of query plans are submitted by users every day. All these query plans can be assembled into a very big query network. We first look at Example 1.

[Example 1] There are four selection plans, as follows:

S1: SELECT T1.A FROM T1 WHERE T1.A > x
S2: SELECT T2.B, T2.C FROM T2 WHERE T2.D > y
S3: SELECT T2.B FROM T2 WHERE T2.D > z
S4: SELECT T1.A FROM T1, T2 WHERE T1.A = T2.A AND T1.A > x

Fig. 2: Graphic representation of S1, S2, S3 and S4 query plans

Fig. 2 shows a graphic representation of the S1, S2, S3 and S4 query plans. The big query network generation algorithm is as follows:

[Algorithm 1] Big Query Network Generation Algorithm
Input: Tables, Query Conditions
Output: Big Query Network
1. For (i = 0; i < QueryPlan.numbers; i++) {
2.   Parse(QueryPlan[i]);
3.   Get the Tables from QueryPlan[i];
4.   Construct a [...] for every table;
5.   Get the Conditions from QueryPlan[i];
6.   Construct a [...] for every table;
7.   Get the Query Out from QueryPlan[i];
8.   Construct a [...] for every table;
9.   Generate a QueryPlan[i] Tree.
10. }
11. Find the same Tables in all QueryPlan[i] Trees;
12. Combine all the same Tables;
13. End
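The generation steps of Algorithm 1 can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the `QueryPlan` class, the `build_query_network` function, the tuple encoding of nodes and the substring test used to decide which table a condition reads are all hypothetical choices made for this example. Plans are given already parsed, and only identical table nodes are combined, as in steps 11 and 12.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QueryPlan:
    """One parsed SELECT plan (hypothetical structure for this sketch)."""
    name: str
    tables: tuple      # table names in the FROM clause
    conditions: tuple  # WHERE conditions, one string each
    outputs: tuple     # selected columns


def build_query_network(plans):
    """Merge all plan trees into one graph; identical tables share one node."""
    nodes, edges = set(), set()
    for plan in plans:
        result = ("R", plan.name, plan.outputs)   # result node of this plan
        nodes.add(result)
        for table in plan.tables:
            nodes.add(("T", table))               # same table -> same node
        for cond in plan.conditions:
            c_node = ("C", plan.name, cond)       # computing node, kept per plan
            nodes.add(c_node)
            edges.add((c_node, result))
            for table in plan.tables:
                # Assumption: a condition reads a table if it names "<table>."
                if f"{table}." in cond:
                    edges.add((("T", table), c_node))
    return nodes, edges


# The four plans of Example 1.
EXAMPLE_1 = [
    QueryPlan("S1", ("T1",), ("T1.A > x",), ("T1.A",)),
    QueryPlan("S2", ("T2",), ("T2.D > y",), ("T2.B", "T2.C")),
    QueryPlan("S3", ("T2",), ("T2.D > z",), ("T2.B",)),
    QueryPlan("S4", ("T1", "T2"), ("T1.A = T2.A", "T1.A > x"), ("T1.A",)),
]

nodes, edges = build_query_network(EXAMPLE_1)
table_nodes = sorted(n[1] for n in nodes if n[0] == "T")
print(table_nodes)  # ['T1', 'T2']
```

Because nodes are stored in a set and table nodes are keyed by table name alone, the "find and combine the same tables" steps happen implicitly: the four plans reference tables six times in total but the network holds only two table nodes.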

4.2 Semantic matching

We assume that T is a table node, C is a computing node (such as Selection, Union, etc.), R is a result node and U is a user node. Fig. 3 shows part of a big query network after optimization. Fig. 4 shows a semantic matching result at time [1] and Fig. 5 shows a semantic matching result at time [2]. From Fig. 4 and Fig. 5, we can see that the query plan requirements at time [1] and time [2] are both subsets of the big query network. The requirements at time [1] are the yellow part of the big query network shown in Fig. 4; those at time [2] are the green part shown in Fig. 5.

Fig. 3: Part of a big query network after optimization

Fig. 4: A semantic matching result at time [1]

Fig. 5: A semantic matching result at time [2]

Fig. 6 differs from Fig. 4 and Fig. 5. Assume that at time [3] the users' whole query plan cannot be found completely in the big query network; that is, the set of query plans is not a subset of the big query network. As shown in the right part of Fig. 6, only the light red part is a subset of the big query network, and the red part (C14, C9 and U9) is not. We therefore add the red part to the big query network, as shown in the left part of Fig. 6.

Fig. 6: A semantic matching result at time [3]

Section 4.2 describes the optimization methods of the big query network, using Fig. 4, Fig. 5 and Fig. 6 to describe the state of the query network at time [1], time [2] and time [3]. Algorithm 2 below is the semantic matching algorithm. It handles the state of the query network at any time, including time [1], time [2] and time [3].

[Algorithm 2] Semantic Matching Algorithm
Input: A Series of Query Plans at time [i], Big Query Network
Output: A Query Sub Network; or A New Big Query Network and A Query Sub Network.
1. Start
2. For (time[i = 1]; i < ∞; time[i++]) {
3.   For (QueryPlan[j = 1]; j <= TotalNumbers; QueryPlan[j++]) {
4.     Get the Tables from QueryPlan[j];
5.     If (QueryPlan[j].Tables ⊆ BigQueryNetwork.Tables)
6.       Flag all QueryPlan[j].Tables in BigQueryNetwork.Tables
7.     Else
8.       Add all those tables, {tables | tables ∈ QueryPlan[j].Tables, tables ∉ BigQueryNetwork.Tables}.
9.     If (QueryPlan[j].Conditions ⊆ BigQueryNetwork.Conditions)
10.      Flag all QueryPlan[j].Conditions in BigQueryNetwork.Conditions
11.    Else
12.      Add all those Conditions, {Conditions | Conditions ∈ QueryPlan[j].Conditions, Conditions ∉ BigQueryNetwork.Conditions}.
13.    Add all the Users' Output Nodes
14.    Connect all these Flagged Nodes using [...]
15.    Get the Query Sub Network in the old Big Query Network OR in the New Big Query Network.
16.    Optimize the Query Sub Network Using Equivalent Substitution Methods.
17. }}
18. END.

5 Simulation Experiment

In this simulation experiment, we compute and compare the costs of the two query networks in Fig. 7. Part (a) of Fig. 7 is not optimized and part (b) of Fig. 7 has been optimized.

Fig. 7: Simulation experiment

From the big query network above, the cost of part (a) is given by formula (1) and the cost of part (b) by formula (2):

\[
\mathrm{Cost}(\mathrm{NoOptimization}) = \sum_{i=1}^{n} \mathrm{Cost}(C[i]) + \sum_{i=1}^{n} \mathrm{Cost}(U[i].T_1(\,)) = n \left( \mathrm{Cost}(C_1) + \mathrm{Cost}(U_1.T_1(\,)) \right) \tag{1}
\]

\[
\mathrm{Cost}(\mathrm{Optimization}) = \mathrm{Cost}(C_1) + \sum_{i=1}^{n} \mathrm{Cost}(U[i].T_1(\,)) = \mathrm{Cost}(C_1) + n \, \mathrm{Cost}(U_1.T_1(\,)) \tag{2}
\]

Assume that table T1 has K records and that C1 = C2 = ... = Cn (their conditions are the same: T1.A > x), so f1 = f2 = f3 = K. The filtering rate of C1, C2, ..., Cn is θ, so f4 = f5 = f6 = f1 θ. The unit transmission time of a record of table T1 is π; the unit traverse time of C1, C2, ..., Cn is [...] and their unit computing time is l. Assume π = t, [...] = 0.25t, l = 0.25t and θ = 0.5. The simulation result is shown in Fig. 8.
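Formulas (1) and (2) can be evaluated numerically with a short Python sketch. The two formulas themselves come from the text; how Cost(C1) and Cost(U1.T1( )) depend on the unit times is not spelled out there, so the `unit_costs` function below is an assumption made only for illustration: the selection node is charged one traverse and one compute step per record of T1, and each user is charged the transmission of the θ·K records that pass the filter. All function names are hypothetical.

```python
def unit_costs(k, t=1.0, theta=0.5):
    """Return (Cost(C1), Cost(U1.T1())) for a table with k records.

    ASSUMPTION (not stated in the paper): the selection node traverses and
    evaluates every record, and each user receives theta * k records.
    """
    traverse = 0.25 * t   # unit traverse time, 0.25t in the text
    compute = 0.25 * t    # unit computing time l = 0.25t
    transmit = t          # unit transmission time pi = t
    cost_c1 = k * (traverse + compute)
    cost_transfer = theta * k * transmit
    return cost_c1, cost_transfer


def cost_without_sharing(n, cost_c1, cost_transfer):
    """Formula (1): each of the n users runs its own copy of the selection."""
    return n * (cost_c1 + cost_transfer)


def cost_with_sharing(n, cost_c1, cost_transfer):
    """Formula (2): one shared selection node, n result transfers."""
    return cost_c1 + n * cost_transfer


c1, transfer = unit_costs(k=1000)
for n in (1, 10, 100):
    plain = cost_without_sharing(n, c1, transfer)
    shared = cost_with_sharing(n, c1, transfer)
    print(n, plain, shared, plain - shared)
```

Whatever the unit costs are, subtracting formula (2) from formula (1) gives a saving of (n - 1) · Cost(C1), so under these formulas the benefit of sharing the selection node grows linearly with the number of users issuing the same condition.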

Fig. 8: Simulation experiment result

6 Conclusions and Future Work

In the future, we will develop a better query and optimization method for the metadata index and the management of the original massive data.

References

[1] Navin Kabra, David J. DeWitt. OPT++: An Object-Oriented Implementation for Extensible Database Query Optimization [J]. VLDB J. 8 (1): (1999).
[2] Henning Köhler, Jing Yang, Xiaofang Zhou. Efficient parallel skyline processing using hyperplane projections. Proceedings of the SIGMOD pages:
[3] Yuan LIN, Hongfei LIN, Li HE. A Cluster-based Resource Correlative Query Expansion in Distributed Information Retrieval. Journal of Computational Information Systems, 1 (2012),
[4] Li YE, Zhiguang QIN. Uncertain Range Queries for Revised Bead Model. Journal of Computational Information Systems, 1 (2012),
[5] Eaman Jahani, Michael J. Cafarella, Christopher Ré. Automatic Optimization for MapReduce Programs. Proceedings of the VLDB2011. pages:
[6] Michael J. Franklin, Donald Kossmann, Tim Kraska, Sukriti Ramesh, Reynold Xin. CrowdDB: answering queries with crowdsourcing. Proceedings of the SIGMOD pages:
[7] Minji Wu, Laure Berti-Equille, Amélie Marian, Cecilia M. Procopiuc, Divesh Srivastava. Processing Top-k Join Queries. Proceedings of the VLDB pages:
[8] Akrivi Vlachou, Christos Doulkeridis, Neoklis Polyzotis. Skyline query processing over joins. Proceedings of the SIGMOD pages:
[9] Xavier Martinez-Palau, David Dominguez-Sal, Josep-Lluis Larriba-Pey. Two-way Replacement Selection. (VLDB 2010). pages:
[10] Eric Lo, Nick Cheng, Wing-Kai Hon. Generating Databases for Query Workloads. Proceedings of the VLDB pages:


Hadoop: Distributed Data Processing. Amr Awadallah Founder/CTO, Cloudera, Inc. ACM Data Mining SIG Thursday, January 25 th, 2010 Hadoop: Distributed Data Processing Amr Awadallah Founder/CTO, Cloudera, Inc. ACM Data Mining SIG Thursday, January 25 th, 2010 Outline Scaling for Large Data Processing What is Hadoop? HDFS and MapReduce

More information

Analysis and Optimization of Massive Data Processing on High Performance Computing Architecture

Analysis and Optimization of Massive Data Processing on High Performance Computing Architecture Analysis and Optimization of Massive Data Processing on High Performance Computing Architecture He Huang, Shanshan Li, Xiaodong Yi, Feng Zhang, Xiangke Liao and Pan Dong School of Computer Science National

More information

Report on the Train Ticketing System

Report on the Train Ticketing System Report on the Train Ticketing System Author: Zaobo He, Bing Jiang, Zhuojun Duan 1.Introduction... 2 1.1 Intentions... 2 1.2 Background... 2 2. Overview of the Tasks... 3 2.1 Modules of the system... 3

More information

Research on Job Scheduling Algorithm in Hadoop

Research on Job Scheduling Algorithm in Hadoop Journal of Computational Information Systems 7: 6 () 5769-5775 Available at http://www.jofcis.com Research on Job Scheduling Algorithm in Hadoop Yang XIA, Lei WANG, Qiang ZHAO, Gongxuan ZHANG School of

More information

Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data

Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data Introduction to Hadoop HDFS and Ecosystems ANSHUL MITTAL Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data Topics The goal of this presentation is to give

More information

International Journal of Innovative Research in Information Security (IJIRIS) ISSN: 2349-7017(O) Volume 1 Issue 3 (September 2014)

International Journal of Innovative Research in Information Security (IJIRIS) ISSN: 2349-7017(O) Volume 1 Issue 3 (September 2014) SURVEY ON BIG DATA PROCESSING USING HADOOP, MAP REDUCE N.Alamelu Menaka * Department of Computer Applications Dr.Jabasheela Department of Computer Applications Abstract-We are in the age of big data which

More information

Surfing the Data Tsunami: A New Paradigm for Big Data Processing and Analytics

Surfing the Data Tsunami: A New Paradigm for Big Data Processing and Analytics Surfing the Data Tsunami: A New Paradigm for Big Data Processing and Analytics Dr. Liangxiu Han Future Networks and Distributed Systems Group (FUNDS) School of Computing, Mathematics and Digital Technology,

More information

Analysis and Research of Cloud Computing System to Comparison of Several Cloud Computing Platforms

Analysis and Research of Cloud Computing System to Comparison of Several Cloud Computing Platforms Volume 1, Issue 1 ISSN: 2320-5288 International Journal of Engineering Technology & Management Research Journal homepage: www.ijetmr.org Analysis and Research of Cloud Computing System to Comparison of

More information

A computational model for MapReduce job flow

A computational model for MapReduce job flow A computational model for MapReduce job flow Tommaso Di Noia, Marina Mongiello, Eugenio Di Sciascio Dipartimento di Ingegneria Elettrica e Dell informazione Politecnico di Bari Via E. Orabona, 4 70125

More information

CONCEPTUAL MODEL OF MULTI-AGENT BUSINESS COLLABORATION BASED ON CLOUD WORKFLOW

CONCEPTUAL MODEL OF MULTI-AGENT BUSINESS COLLABORATION BASED ON CLOUD WORKFLOW CONCEPTUAL MODEL OF MULTI-AGENT BUSINESS COLLABORATION BASED ON CLOUD WORKFLOW 1 XINQIN GAO, 2 MINGSHUN YANG, 3 YONG LIU, 4 XIAOLI HOU School of Mechanical and Precision Instrument Engineering, Xi'an University

More information

Parallel Data Warehouse

Parallel Data Warehouse MICROSOFT S ANALYTICS SOLUTIONS WITH PARALLEL DATA WAREHOUSE Parallel Data Warehouse Stefan Cronjaeger Microsoft May 2013 AGENDA PDW overview Columnstore and Big Data Business Intellignece Project Ability

More information

Discovering Business Insights in Big Data Using SQL-MapReduce

Discovering Business Insights in Big Data Using SQL-MapReduce Discovering Business Insights in Big Data Using SQL-MapReduce A Technical Whitepaper Rick F. van der Lans Independent Business Intelligence Analyst R20/Consultancy July 2013 Sponsored by Copyright 2013

More information

Integrating Big Data into the Computing Curricula

Integrating Big Data into the Computing Curricula Integrating Big Data into the Computing Curricula Yasin Silva, Suzanne Dietrich, Jason Reed, Lisa Tsosie Arizona State University http://www.public.asu.edu/~ynsilva/ibigdata/ 1 Overview Motivation Big

More information

Research on the UHF RFID Channel Coding Technology based on Simulink

Research on the UHF RFID Channel Coding Technology based on Simulink Vol. 6, No. 7, 015 Research on the UHF RFID Channel Coding Technology based on Simulink Changzhi Wang Shanghai 0160, China Zhicai Shi* Shanghai 0160, China Dai Jian Shanghai 0160, China Li Meng Shanghai

More information

How To Use Hadoop For Gis

How To Use Hadoop For Gis 2013 Esri International User Conference July 8 12, 2013 San Diego, California Technical Workshop Big Data: Using ArcGIS with Apache Hadoop David Kaiser Erik Hoel Offering 1330 Esri UC2013. Technical Workshop.

More information

Next-Generation Cloud Analytics with Amazon Redshift

Next-Generation Cloud Analytics with Amazon Redshift Next-Generation Cloud Analytics with Amazon Redshift What s inside Introduction Why Amazon Redshift is Great for Analytics Cloud Data Warehousing Strategies for Relational Databases Analyzing Fast, Transactional

More information

How To Handle Big Data With A Data Scientist

How To Handle Big Data With A Data Scientist III Big Data Technologies Today, new technologies make it possible to realize value from Big Data. Big data technologies can replace highly customized, expensive legacy systems with a standard solution

More information

NoSQL: Robust and Efficient Data Management on Deduplication Process by using a mobile application

NoSQL: Robust and Efficient Data Management on Deduplication Process by using a mobile application NoSQL: Robust and Efficient Data Management on Deduplication Process by using a mobile application Hemn B.Abdalla, Jinzhao Lin, Guoquan Li Abstract Present Information technology has an enormous responsibility

More information

A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM

A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM Sneha D.Borkar 1, Prof.Chaitali S.Surtakar 2 Student of B.E., Information Technology, J.D.I.E.T, [email protected] Assistant Professor, Information

More information

Data Refinery with Big Data Aspects

Data Refinery with Big Data Aspects International Journal of Information and Computation Technology. ISSN 0974-2239 Volume 3, Number 7 (2013), pp. 655-662 International Research Publications House http://www. irphouse.com /ijict.htm Data

More information

CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop)

CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop) CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop) Rezaul A. Chowdhury Department of Computer Science SUNY Stony Brook Spring 2016 MapReduce MapReduce is a programming model

More information

CLOUD BASED PEER TO PEER NETWORK FOR ENTERPRISE DATAWAREHOUSE SHARING

CLOUD BASED PEER TO PEER NETWORK FOR ENTERPRISE DATAWAREHOUSE SHARING CLOUD BASED PEER TO PEER NETWORK FOR ENTERPRISE DATAWAREHOUSE SHARING Basangouda V.K 1,Aruna M.G 2 1 PG Student, Dept of CSE, M.S Engineering College, Bangalore,[email protected] 2 Associate Professor.,

More information

GeoSquare: A cloud-enabled geospatial information resources (GIRs) interoperate infrastructure for cooperation and sharing

GeoSquare: A cloud-enabled geospatial information resources (GIRs) interoperate infrastructure for cooperation and sharing GeoSquare: A cloud-enabled geospatial information resources (GIRs) interoperate infrastructure for cooperation and sharing Kai Hu 1, Huayi Wu 1, Zhipeng Gui 2, Lan You 1, Ping Shen 1, Shuang Gao 1, Jie

More information

Design call center management system of e-commerce based on BP neural network and multifractal

Design call center management system of e-commerce based on BP neural network and multifractal Available online www.jocpr.com Journal of Chemical and Pharmaceutical Research, 2014, 6(6):951-956 Research Article ISSN : 0975-7384 CODEN(USA) : JCPRC5 Design call center management system of e-commerce

More information

Keywords Big Data; OODBMS; RDBMS; hadoop; EDM; learning analytics, data abundance.

Keywords Big Data; OODBMS; RDBMS; hadoop; EDM; learning analytics, data abundance. Volume 4, Issue 11, November 2014 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Analytics

More information

Study on Redundant Strategies in Peer to Peer Cloud Storage Systems

Study on Redundant Strategies in Peer to Peer Cloud Storage Systems Applied Mathematics & Information Sciences An International Journal 2011 NSP 5 (2) (2011), 235S-242S Study on Redundant Strategies in Peer to Peer Cloud Storage Systems Wu Ji-yi 1, Zhang Jian-lin 1, Wang

More information

COMP9321 Web Application Engineering

COMP9321 Web Application Engineering COMP9321 Web Application Engineering Semester 2, 2015 Dr. Amin Beheshti Service Oriented Computing Group, CSE, UNSW Australia Week 11 (Part II) http://webapps.cse.unsw.edu.au/webcms2/course/index.php?cid=2411

More information

OPTIMIZATION STRATEGY OF CLOUD COMPUTING SERVICE COMPOSITION RESEARCH BASED ON ANP

OPTIMIZATION STRATEGY OF CLOUD COMPUTING SERVICE COMPOSITION RESEARCH BASED ON ANP OPTIMIZATION STRATEGY OF CLOUD COMPUTING SERVICE COMPOSITION RESEARCH BASED ON ANP Xing Xu School of Automation Huazhong University of Science and Technology Wuhan 430074, P.R.China E-mail: [email protected]

More information

Policy-based Pre-Processing in Hadoop

Policy-based Pre-Processing in Hadoop Policy-based Pre-Processing in Hadoop Yi Cheng, Christian Schaefer Ericsson Research Stockholm, Sweden [email protected], [email protected] Abstract While big data analytics provides

More information

Chase Wu New Jersey Ins0tute of Technology

Chase Wu New Jersey Ins0tute of Technology CS 698: Special Topics in Big Data Chapter 4. Big Data Analytics Platforms Chase Wu New Jersey Ins0tute of Technology Some of the slides have been provided through the courtesy of Dr. Ching-Yung Lin at

More information

ESS event: Big Data in Official Statistics. Antonino Virgillito, Istat

ESS event: Big Data in Official Statistics. Antonino Virgillito, Istat ESS event: Big Data in Official Statistics Antonino Virgillito, Istat v erbi v is 1 About me Head of Unit Web and BI Technologies, IT Directorate of Istat Project manager and technical coordinator of Web

More information

A Small-time Scale Netflow-based Anomaly Traffic Detecting Method Using MapReduce

A Small-time Scale Netflow-based Anomaly Traffic Detecting Method Using MapReduce , pp.231-242 http://dx.doi.org/10.14257/ijsia.2014.8.2.24 A Small-time Scale Netflow-based Anomaly Traffic Detecting Method Using MapReduce Wang Jin-Song, Zhang Long, Shi Kai and Zhang Hong-hao School

More information

FP-Hadoop: Efficient Execution of Parallel Jobs Over Skewed Data

FP-Hadoop: Efficient Execution of Parallel Jobs Over Skewed Data FP-Hadoop: Efficient Execution of Parallel Jobs Over Skewed Data Miguel Liroz-Gistau, Reza Akbarinia, Patrick Valduriez To cite this version: Miguel Liroz-Gistau, Reza Akbarinia, Patrick Valduriez. FP-Hadoop:

More information

PLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS

PLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS PLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS By HAI JIN, SHADI IBRAHIM, LI QI, HAIJUN CAO, SONG WU and XUANHUA SHI Prepared by: Dr. Faramarz Safi Islamic Azad

More information

ISSN:2320-0790. Keywords: HDFS, Replication, Map-Reduce I Introduction:

ISSN:2320-0790. Keywords: HDFS, Replication, Map-Reduce I Introduction: ISSN:2320-0790 Dynamic Data Replication for HPC Analytics Applications in Hadoop Ragupathi T 1, Sujaudeen N 2 1 PG Scholar, Department of CSE, SSN College of Engineering, Chennai, India 2 Assistant Professor,

More information

E6893 Big Data Analytics Lecture 2: Big Data Analytics Platforms

E6893 Big Data Analytics Lecture 2: Big Data Analytics Platforms E6893 Big Data Analytics Lecture 2: Big Data Analytics Platforms Ching-Yung Lin, Ph.D. Adjunct Professor, Dept. of Electrical Engineering and Computer Science Mgr., Dept. of Network Science and Big Data

More information