Massive Data Query Optimization on Large Clusters
Journal of Computational Information Systems 8: 8 (2012)

Guigang ZHANG, Chao LI, Yong ZHANG, Chunxiao XING
Research Institute of Information Technology, Tsinghua National Laboratory for Information Science and Technology, Department of Computer Science and Technology, Tsinghua University, Beijing, China

Abstract

The growing demand for massive data processing and analysis applications has led both academia and industry to design many new types of highly scalable, data-intensive computing platforms based on large clusters in the cloud environment. Achieving a fast query response time, especially for ad hoc queries, is becoming very important in large-cluster environments. In this paper, we design a series of algorithms for query optimization and an efficient massive data query and optimization mechanism, SemanQuery. SemanQuery has two characteristics. First, it has better semantics: it applies some intelligence when processing massive data queries through a semantic matching algorithm. Second, to reduce the query cost, we construct a very large query network in SemanQuery and optimize it. Simulation experiments show that SemanQuery improves query efficiency on large clusters.

Keywords: Query Processing; Query Optimization; Query Network; SemanQuery; Optimization Algorithm

1 Introduction

With the development of cloud computing technologies, the internet, the mobile internet and the internet of things, all kinds of terminals and information collectors are increasing rapidly. Every thing and every person produces massive data. IDC predicts that the volume of data will reach 8 ZB in 2015. All this big data will enter information systems, where it needs to be stored, analyzed and used. Processing this big data is a major difficulty in the cloud environment. Massive data is also known as big data.
It has become a hot research topic in academia and industry. Querying, from the massive data held in files, relational databases and cloud databases, the data that satisfies the requirements of millions of users is a very big challenge. The challenge can be summarized as: how can we quickly find the data that satisfies the requirements of millions of users?

Project supported by National Basic Research Program of China (973 Program) No. 2011CB302302, the National Natural Science Foundation of China under Grant No , the Research Foundation of the Ministry of Railways and Tsinghua University under Grant No. J2010Z057, J2010Z059. Corresponding author. Email address: [email protected] (Guigang ZHANG). Copyright 2012 Binary Information Press, April 2012.

To address this challenge, this paper proposes techniques that combine semantic technologies and query optimization technologies. We first propose a very big query network, which includes all the SQL query plans that satisfy the requirements of millions of users. We assume that all file query plans in the HDFS and GFS file systems, and all data query plans in databases, can be converted into SQL query plans. We then propose a semantic matching algorithm between the users' query requirements and the big query network. The semantic matching algorithm mainly helps us find a good query path in the big query network.

2 Related Work

Data query and optimization is very important in the database and data processing area, and many researchers have studied it. The work can be divided into RDMS query and optimization, OODMS query and optimization, traditional database optimization, and distributed database query and optimization. Paper [1] analyzes an object-oriented implementation for extensible database query optimization. In the early stage, most research on query and optimization was aimed at traditional databases such as MySQL, Oracle, Sybase and DB2. Later, more and more data had to be stored in distributed environments, and many research theories and methods focused on them. For example, paper [2] proposed efficient parallel skyline processing using hyperplane projections. With the development of cloud computing and cloud storage technologies, big data [3] query and optimization has become increasingly popular in recent research. Big data processing includes structured, semi-structured and unstructured data processing, and in particular uncertain [4] data query and optimization.
Massive unstructured data processing mainly uses the MapReduce computing model [5]. Building on MapReduce, many other internet-scale massive data computing frameworks, such as Twister, HaLoop, Hadoop++, Spark, CrowdDB [6] and Yale University's HadoopDB, have been proposed in recent years to improve query and processing efficiency for big data. The objective of all these new computing frameworks is to improve the efficiency of big data queries. Query and optimization methods greatly improve the efficiency of massive data processing. Researchers have designed many such methods, including Top-K [7], processing over joins [8] and two-way selection [9] for data processing, and query workload balance [10].

3 SemanQuery Architecture

The SemanQuery architecture is shown in Fig. 1. It can be described as follows:

(1) All files are stored in the local file system, which may be a Windows file system, a Linux file system or another file system.

(2) The RDMS (Relational Database Management System) may run on top of the local file system or a DFS (Distributed File System) such as the Google File System or the Hadoop Distributed File System. Cloud databases mainly store the index information and metadata of the massive files and manage all these files. All these files (text files, video files, picture files and so on) and the cloud databases' final files run on top of the distributed file system.
(3) All RDMS databases and cloud databases are made up of many tables, and all our queries target these tables. The big query network obtains data from these tables, and the data is processed by query plans.

(4) Users enter query requirements and receive query results through the user web interface.

(5) When SemanQuery receives the users' query requirements, it performs semantic matching against the big query network. If the big query network contains the corresponding query plans, SemanQuery obtains their query paths in the big query network and executes the query plans along those paths. If SemanQuery cannot find the users' query plans in the big query network, it adds them to the big query network, creating a new big query network. Once the new big query network has been created, the users' query requirements are submitted again and the query plans are executed.

(6) After the query plans are executed, the results are presented in the user web interface.

Fig. 1: SemanQuery architecture

4 SemanQuery Implementation Method

4.1 Big query network

In the cloud environment, millions of query plans are submitted by users every day. All these query plans can be assembled into a very big query network. We start with Example 1.

[Example 1] There are four selection plans, as follows:

S1: Select T1.A from T1 where T1.A > x
S2: Select T2.B, T2.C from T2 where T2.D > y
S3: Select T2.B from T2 where T2.D > z
S4: Select T1.A from T1, T2 where T1.A = T2.A and T1.A > x

Fig. 2: Graphic representation of S1, S2, S3 and S4 query plans

Fig. 2 shows a graphic representation of the S1, S2, S3 and S4 query plans. The big query network generation algorithm is as follows:

[Algorithm 1] Big Query Network Generation Algorithm
Input: Tables, Query Conditions
Output: Big Query Network
1. For (i = 0; i < QueryPlan.Numbers; i++) {
2.     Parser(QueryPlan[i]);
3.     Get the Tables from QueryPlan[i];
4.     Construct a node for every table;
5.     Get the Conditions from QueryPlan[i];
6.     Construct a node for every table;
7.     Get the Query Out from QueryPlan[i];
8.     Construct a node for every table;
9.     Generate a QueryPlan[i] Tree;
10. }
11. Find the same Tables in all QueryPlan[i] Trees;
12. Combine all the same Tables;
13. End
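The generation steps of Algorithm 1 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the `QueryPlan` and `QueryNetwork` classes and the `build_network` function are hypothetical names, the plans are supplied already parsed (no SQL parser is included), and each plan is given a single computing node that carries all of its conditions. Table nodes are keyed by table name, so the "find and combine the same tables" steps happen implicitly when a plan is added.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class QueryPlan:
    """A pre-parsed query plan (hypothetical structure; SQL parsing is out of scope)."""
    name: str
    tables: tuple
    conditions: tuple
    outputs: tuple


@dataclass
class QueryNetwork:
    """A directed graph whose nodes are (kind, label) pairs: T = table, C = computing, R = result."""
    nodes: set = field(default_factory=set)
    edges: set = field(default_factory=set)

    def add_plan(self, plan):
        # Table nodes are keyed by table name only, so a table used by several
        # plans collapses into one shared node (steps 11-12 of Algorithm 1).
        table_nodes = [("T", t) for t in plan.tables]
        # ASSUMPTION: one computing node and one result node per plan, keyed by
        # the plan name, so they are never merged across plans in this sketch.
        cond_node = ("C", f"{plan.name}: " + " and ".join(plan.conditions))
        result_node = ("R", f"{plan.name}: " + ", ".join(plan.outputs))
        self.nodes.update(table_nodes)
        self.nodes.update([cond_node, result_node])
        for t in table_nodes:
            self.edges.add((t, cond_node))
        self.edges.add((cond_node, result_node))


def build_network(plans):
    """Build one network from many plans (the loop of steps 1-10)."""
    network = QueryNetwork()
    for plan in plans:
        network.add_plan(plan)
    return network


# The four selection plans of Example 1.
plans = [
    QueryPlan("S1", ("T1",), ("T1.A > x",), ("T1.A",)),
    QueryPlan("S2", ("T2",), ("T2.D > y",), ("T2.B", "T2.C")),
    QueryPlan("S3", ("T2",), ("T2.D > z",), ("T2.B",)),
    QueryPlan("S4", ("T1", "T2"), ("T1.A = T2.A", "T1.A > x"), ("T1.A",)),
]
network = build_network(plans)
table_nodes = sorted(n for n in network.nodes if n[0] == "T")
print(table_nodes)
```

With the four plans of Example 1, only two table nodes remain (T1 and T2), each feeding the computing nodes of every plan that reads it. Merging identical computing nodes as well would be a further optimization step that this sketch does not attempt.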
Fig. 3: Part of a big query network after optimization

Fig. 4: A semantic matching result at the time [1]

4.2 Semantic matching

We denote a table node by T, a computing node (such as Selection or Union) by C, a result node by R and a user node by U. Fig. 3 shows part of a big query network after optimization. Fig. 4 shows a semantic matching result at time [1], and Fig. 5 shows a semantic matching result at time [2]. From Fig. 4 and Fig. 5, we can see that the query plan requirements at time [1] and time [2] are both subsets of the big query network. The query plan requirements at time [1] are the yellow part of the big query network shown in Fig. 4; those at time [2] are the green part shown in Fig. 5.

Fig. 5: A semantic matching result at the time [2]

Fig. 6 differs from Fig. 4 and Fig. 5. Assume that at time [3] the users' whole query plan cannot be found completely in the big query network; that is, the query subset is not a subset of the big query network. As shown in the right part of Fig. 6, only the light red part is a subset of the big query network, and the red part (C14, C9 and U9) is not. We therefore add the red part to the big query network, as shown in the left part of Fig. 6.

Fig. 6: A semantic matching result at the time [3]

In this section, we have described the optimization of the big query network and used Fig. 4, Fig. 5 and Fig. 6 to illustrate the state of the query network at time [1], time [2] and time [3]. Algorithm 2 below is the semantic matching algorithm. It handles the state of the query network at any time, including time [1], time [2] and time [3].

[Algorithm 2] Semantic Matching Algorithm
Input: A series of Query Plans at time [i], Big Query Network
Output: A Query Sub Network, or A New Big Query Network and A Query Sub Network
1. Start
2. For (time[i=1]; i < ; time[i++]) {
3.     For (QueryPlan[j=1]; j <= QueryPlan[TotalNumbers]; QueryPlan[j++]) {
4.         Get the Tables from QueryPlan[j];
5.         If (QueryPlan[j].Tables ⊆ BigQueryNetwork.Tables)
6.             Flag all QueryPlan[j].Tables in BigQueryNetwork.Tables
7.         Else
8.             Add all those tables, {tables ∈ QueryPlan[j].Tables, tables ∉ BigQueryNetwork.Tables}
9.         If (QueryPlan[j].Conditions ⊆ BigQueryNetwork.Conditions)
10.             Flag all QueryPlan[j].Conditions in BigQueryNetwork.Conditions
11.         Else
12.             Add all those Conditions, {Conditions ∈ QueryPlan[j].Conditions, Conditions ∉ BigQueryNetwork.Conditions}
13.         Add all the Users' Output Nodes
14.         Connect all these flagged nodes
15.         Get the Query Sub Network in the old Big Query Network OR in the New Big Query Network
16.         Optimize the Query Sub Network using Equivalent Substitution Methods
17. }}
18. END

5 Simulation Experiment

In this simulation experiment, we compute and compare the costs of the two query networks in Fig. 7. Part (a) of Fig. 7 is not optimized, and part (b) has been optimized. We compute the cost of part (a) and of part (b).

Fig. 7: Simulation experiment

According to the big query network described above, the cost of part (a) is given by formula (1) and the cost of part (b) by formula (2):

\[ \mathrm{Cost}(\text{NoOptimization}) = \sum_{i=1}^{n} \mathrm{Cost}(C[i]) + \sum_{i=1}^{n} \mathrm{Cost}(U[i].T1()) = n \, \bigl( \mathrm{Cost}(C1) + \mathrm{Cost}(U1.T1()) \bigr) \qquad (1) \]

\[ \mathrm{Cost}(\text{Optimization}) = \mathrm{Cost}(C1) + \sum_{i=1}^{n} \mathrm{Cost}(U[i].T1()) = \mathrm{Cost}(C1) + n \, \mathrm{Cost}(U1.T1()) \qquad (2) \]

Assume that table T1 has K records and that C1 = C2 = ... = Cn (their conditions are the same: T1.A > x), so f1 = f2 = f3 = K. The filtering rate of C1, C2 and Cn is θ, so f4 = f5 = f6 = f1 · θ. The unit transmission time of a record of table T1 is π, and the unit computing time of C1, C2 and Cn is l. Assume π = t, a unit traverse time of 0.25t for C1, C2 and Cn, l = 0.25t and θ = 0.5. The simulation result is shown in Fig. 8.
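Formulas (1) and (2) can be evaluated numerically with a short Python sketch. The two functions follow the formulas directly. Everything else is an assumption made only for illustration: the text does not state how Cost(C1) and Cost(U1.T1()) are composed from the unit times, so the sketch assumes that a computing node pays transmission, traverse and computing time for each of the K input records, and that a user node pays transmission time for each of the θ · K filtered records. The values K = 1000 and t = 1 and the user counts are hypothetical.

```python
def cost_no_optimization(n, cost_c, cost_u):
    """Formula (1): each of the n users has its own computing node and result transfer."""
    return n * (cost_c + cost_u)


def cost_optimization(n, cost_c, cost_u):
    """Formula (2): one shared computing node, plus one result transfer per user."""
    return cost_c + n * cost_u


# Parameter values stated in the text, expressed in units of t.
t = 1.0                 # hypothetical time unit
pi = t                  # unit transmission time of a T1 record
traverse = 0.25 * t     # unit traverse time of a computing node
compute = 0.25 * t      # unit computing time l of a computing node
theta = 0.5             # filtering rate of the computing nodes

K = 1000                # hypothetical number of records in T1

# ASSUMPTION: this per-node cost breakdown is not given in the text.
cost_c = K * (pi + traverse + compute)   # cost of one computing node
cost_u = theta * K * pi                  # cost of delivering filtered rows to one user

for n in (1, 10, 100):  # hypothetical numbers of users issuing the same query
    print(n, cost_no_optimization(n, cost_c, cost_u), cost_optimization(n, cost_c, cost_u))
```

Subtracting formula (2) from formula (1) gives a saving of (n - 1) · Cost(C1), so the two costs coincide for a single user and the gap grows linearly with the number of users who share the same condition. This holds regardless of the assumed cost breakdown.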
Fig. 8: Simulation experiment result

6 Conclusions and Future Work

In the future, we will develop a better query and optimization method for the management of the metadata index and the original massive data.

References

[1] Navin Kabra, David J. DeWitt. OPT++: An Object-Oriented Implementation for Extensible Database Query Optimization [J]. VLDB J. 8 (1) (1999).
[2] Henning Köhler, Jing Yang, Xiaofang Zhou. Efficient parallel skyline processing using hyperplane projections. Proceedings of the SIGMOD.
[3] Yuan LIN, Hongfei LIN, Li HE. A Cluster-based Resource Correlative Query Expansion in Distributed Information Retrieval. Journal of Computational Information Systems, 1 (2012).
[4] Li YE, Zhiguang QIN. Uncertain Range Queries for Revised Bead Model. Journal of Computational Information Systems, 1 (2012).
[5] Eaman Jahani, Michael J. Cafarella, Christopher Ré. Automatic Optimization for MapReduce Programs. Proceedings of the VLDB 2011.
[6] Michael J. Franklin, Donald Kossmann, Tim Kraska, Sukriti Ramesh, Reynold Xin. CrowdDB: answering queries with crowdsourcing. Proceedings of the SIGMOD.
[7] Minji Wu, Laure Berti-Equille, Amélie Marian, Cecilia M. Procopiuc, Divesh Srivastava. Processing Top-k Join Queries. Proceedings of the VLDB.
[8] Akrivi Vlachou, Christos Doulkeridis, Neoklis Polyzotis. Skyline query processing over joins. Proceedings of the SIGMOD.
[9] Xavier Martinez-Palau, David Dominguez-Sal, Josep-Lluis Larriba-Pey. Two-way Replacement Selection. VLDB 2010.
[10] Eric Lo, Nick Cheng, Wing-Kai Hon. Generating Databases for Query Workloads. Proceedings of the VLDB.
Hadoop. MPDL-Frühstück 9. Dezember 2013 MPDL INTERN
Hadoop MPDL-Frühstück 9. Dezember 2013 MPDL INTERN Understanding Hadoop Understanding Hadoop What's Hadoop about? Apache Hadoop project (started 2008) downloadable open-source software library (current
BIG DATA CHALLENGES AND PERSPECTIVES
BIG DATA CHALLENGES AND PERSPECTIVES Meenakshi Sharma 1, Keshav Kishore 2 1 Student of Master of Technology, 2 Head of Department, Department of Computer Science and Engineering, A P Goyal Shimla University,
http://www.paper.edu.cn
5 10 15 20 25 30 35 A platform for massive railway information data storage # SHAN Xu 1, WANG Genying 1, LIU Lin 2** (1. Key Laboratory of Communication and Information Systems, Beijing Municipal Commission
Research of Railway Wagon Flow Forecast System Based on Hadoop-Hazelcast
International Conference on Civil, Transportation and Environment (ICCTE 2016) Research of Railway Wagon Flow Forecast System Based on Hadoop-Hazelcast Xiaodong Zhang1, a, Baotian Dong1, b, Weijia Zhang2,
UPS battery remote monitoring system in cloud computing
, pp.11-15 http://dx.doi.org/10.14257/astl.2014.53.03 UPS battery remote monitoring system in cloud computing Shiwei Li, Haiying Wang, Qi Fan School of Automation, Harbin University of Science and Technology
KEYWORD SEARCH IN RELATIONAL DATABASES
KEYWORD SEARCH IN RELATIONAL DATABASES N.Divya Bharathi 1 1 PG Scholar, Department of Computer Science and Engineering, ABSTRACT Adhiyamaan College of Engineering, Hosur, (India). Data mining refers to
An efficient Join-Engine to the SQL query based on Hive with Hbase Zhao zhi-cheng & Jiang Yi
International Conference on Applied Science and Engineering Innovation (ASEI 2015) An efficient Join-Engine to the SQL query based on Hive with Hbase Zhao zhi-cheng & Jiang Yi Institute of Computer Forensics,
Applied research on data mining platform for weather forecast based on cloud storage
Applied research on data mining platform for weather forecast based on cloud storage Haiyan Song¹, Leixiao Li 2* and Yuhong Fan 3* 1 Department of Software Engineering t, Inner Mongolia Electronic Information
Journal of Chemical and Pharmaceutical Research, 2015, 7(3):1388-1392. Research Article. E-commerce recommendation system on cloud computing
Available online www.jocpr.com Journal of Chemical and Pharmaceutical Research, 2015, 7(3):1388-1392 Research Article ISSN : 0975-7384 CODEN(USA) : JCPRC5 E-commerce recommendation system on cloud computing
Log Mining Based on Hadoop s Map and Reduce Technique
Log Mining Based on Hadoop s Map and Reduce Technique ABSTRACT: Anuja Pandit Department of Computer Science, [email protected] Amruta Deshpande Department of Computer Science, [email protected]
On a Hadoop-based Analytics Service System
Int. J. Advance Soft Compu. Appl, Vol. 7, No. 1, March 2015 ISSN 2074-8523 On a Hadoop-based Analytics Service System Mikyoung Lee, Hanmin Jung, and Minhee Cho Korea Institute of Science and Technology
Computing Issues for Big Data Theory, Systems, and Applications
Computing Issues for Big Data Theory, Systems, and Applications Beihang University Chunming Hu ([email protected]) Big Data Summit, with CyberC 2013 October 10, 2013. Beijing, China. Bio of Myself Chunming
Data-intensive HPC: opportunities and challenges. Patrick Valduriez
Data-intensive HPC: opportunities and challenges Patrick Valduriez Big Data Landscape Multi-$billion market! Big data = Hadoop = MapReduce? No one-size-fits-all solution: SQL, NoSQL, MapReduce, No standard,
Large-Scale Data Sets Clustering Based on MapReduce and Hadoop
Journal of Computational Information Systems 7: 16 (2011) 5956-5963 Available at http://www.jofcis.com Large-Scale Data Sets Clustering Based on MapReduce and Hadoop Ping ZHOU, Jingsheng LEI, Wenjun YE
Systems Engineering II. Pramod Bhatotia TU Dresden pramod.bhatotia@tu- dresden.de
Systems Engineering II Pramod Bhatotia TU Dresden pramod.bhatotia@tu- dresden.de About me! Since May 2015 2015 2012 Research Group Leader cfaed, TU Dresden PhD Student MPI- SWS Research Intern Microsoft
Design of Electronic Medical Record System Based on Cloud Computing Technology
TELKOMNIKA Indonesian Journal of Electrical Engineering Vol.12, No.5, May 2014, pp. 4010 ~ 4017 DOI: http://dx.doi.org/10.11591/telkomnika.v12i5.4392 4010 Design of Electronic Medical Record System Based
CLOUDDMSS: CLOUD-BASED DISTRIBUTED MULTIMEDIA STREAMING SERVICE SYSTEM FOR HETEROGENEOUS DEVICES
CLOUDDMSS: CLOUD-BASED DISTRIBUTED MULTIMEDIA STREAMING SERVICE SYSTEM FOR HETEROGENEOUS DEVICES 1 MYOUNGJIN KIM, 2 CUI YUN, 3 SEUNGHO HAN, 4 HANKU LEE 1,2,3,4 Department of Internet & Multimedia Engineering,
Data processing goes big
Test report: Integration Big Data Edition Data processing goes big Dr. Götz Güttich Integration is a powerful set of tools to access, transform, move and synchronize data. With more than 450 connectors,
An Approach to Implement Map Reduce with NoSQL Databases
www.ijecs.in International Journal Of Engineering And Computer Science ISSN: 2319-7242 Volume 4 Issue 8 Aug 2015, Page No. 13635-13639 An Approach to Implement Map Reduce with NoSQL Databases Ashutosh
Scalable Multiple NameNodes Hadoop Cloud Storage System
Vol.8, No.1 (2015), pp.105-110 http://dx.doi.org/10.14257/ijdta.2015.8.1.12 Scalable Multiple NameNodes Hadoop Cloud Storage System Kun Bi 1 and Dezhi Han 1,2 1 College of Information Engineering, Shanghai
Big Data and Analytics: Challenges and Opportunities
Big Data and Analytics: Challenges and Opportunities Dr. Amin Beheshti Lecturer and Senior Research Associate University of New South Wales, Australia (Service Oriented Computing Group, CSE) Talk: Sharif
Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related
Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related Summary Xiangzhe Li Nowadays, there are more and more data everyday about everything. For instance, here are some of the astonishing
Open Access Research on Database Massive Data Processing and Mining Method based on Hadoop Cloud Platform
Send Orders for Reprints to [email protected] The Open Automation and Control Systems Journal, 2014, 6, 1463-1467 1463 Open Access Research on Database Massive Data Processing and Mining Method
Exploring the Efficiency of Big Data Processing with Hadoop MapReduce
Exploring the Efficiency of Big Data Processing with Hadoop MapReduce Brian Ye, Anders Ye School of Computer Science and Communication (CSC), Royal Institute of Technology KTH, Stockholm, Sweden Abstract.
Developing Scalable Smart Grid Infrastructure to Enable Secure Transmission System Control
Developing Scalable Smart Grid Infrastructure to Enable Secure Transmission System Control EP/K006487/1 UK PI: Prof Gareth Taylor (BU) China PI: Prof Yong-Hua Song (THU) Consortium UK Members: Brunel University
Big Data Storage Architecture Design in Cloud Computing
Big Data Storage Architecture Design in Cloud Computing Xuebin Chen 1, Shi Wang 1( ), Yanyan Dong 1, and Xu Wang 2 1 College of Science, North China University of Science and Technology, Tangshan, Hebei,
A STUDY ON HADOOP ARCHITECTURE FOR BIG DATA ANALYTICS
A STUDY ON HADOOP ARCHITECTURE FOR BIG DATA ANALYTICS Dr. Ananthi Sheshasayee 1, J V N Lakshmi 2 1 Head Department of Computer Science & Research, Quaid-E-Millath Govt College for Women, Chennai, (India)
INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY
INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY A PATH FOR HORIZING YOUR INNOVATIVE WORK A REVIEW ON HIGH PERFORMANCE DATA STORAGE ARCHITECTURE OF BIGDATA USING HDFS MS.
Big Data Research in the AMPLab: BDAS and Beyond
Big Data Research in the AMPLab: BDAS and Beyond Michael Franklin UC Berkeley 1 st Spark Summit December 2, 2013 UC BERKELEY AMPLab: Collaborative Big Data Research Launched: January 2011, 6 year planned
Manifest for Big Data Pig, Hive & Jaql
Manifest for Big Data Pig, Hive & Jaql Ajay Chotrani, Priyanka Punjabi, Prachi Ratnani, Rupali Hande Final Year Student, Dept. of Computer Engineering, V.E.S.I.T, Mumbai, India Faculty, Computer Engineering,
Design of Electric Energy Acquisition System on Hadoop
, pp.47-54 http://dx.doi.org/10.14257/ijgdc.2015.8.5.04 Design of Electric Energy Acquisition System on Hadoop Yi Wu 1 and Jianjun Zhou 2 1 School of Information Science and Technology, Heilongjiang University
International Journal of Innovative Research in Computer and Communication Engineering
FP Tree Algorithm and Approaches in Big Data T.Rathika 1, J.Senthil Murugan 2 Assistant Professor, Department of CSE, SRM University, Ramapuram Campus, Chennai, Tamil Nadu,India 1 Assistant Professor,
A Load Balancing Algorithm based on the Variation Trend of Entropy in Homogeneous Cluster
, pp.11-20 http://dx.doi.org/10.14257/ ijgdc.2014.7.2.02 A Load Balancing Algorithm based on the Variation Trend of Entropy in Homogeneous Cluster Kehe Wu 1, Long Chen 2, Shichao Ye 2 and Yi Li 2 1 Beijing
Enhancing MapReduce Functionality for Optimizing Workloads on Data Centers
Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 2, Issue. 10, October 2013,
HadoopSPARQL : A Hadoop-based Engine for Multiple SPARQL Query Answering
HadoopSPARQL : A Hadoop-based Engine for Multiple SPARQL Query Answering Chang Liu 1 Jun Qu 1 Guilin Qi 2 Haofen Wang 1 Yong Yu 1 1 Shanghai Jiaotong University, China {liuchang,qujun51319, whfcarter,yyu}@apex.sjtu.edu.cn
A STUDY OF ADOPTING BIG DATA TO CLOUD COMPUTING
A STUDY OF ADOPTING BIG DATA TO CLOUD COMPUTING ASMAA IBRAHIM Technology Innovation and Entrepreneurship Center, Egypt [email protected] MOHAMED EL NAWAWY Technology Innovation and Entrepreneurship
Search and Information Retrieval
Search and Information Retrieval Search on the Web 1 is a daily activity for many people throughout the world Search and communication are most popular uses of the computer Applications involving search
Improving Data Processing Speed in Big Data Analytics Using. HDFS Method
Improving Data Processing Speed in Big Data Analytics Using HDFS Method M.R.Sundarakumar Assistant Professor, Department Of Computer Science and Engineering, R.V College of Engineering, Bangalore, India
A Novel Switch Mechanism for Load Balancing in Public Cloud
International OPEN ACCESS Journal Of Modern Engineering Research (IJMER) A Novel Switch Mechanism for Load Balancing in Public Cloud Kalathoti Rambabu 1, M. Chandra Sekhar 2 1 M. Tech (CSE), MVR College
An Industrial Perspective on the Hadoop Ecosystem. Eldar Khalilov Pavel Valov
An Industrial Perspective on the Hadoop Ecosystem Eldar Khalilov Pavel Valov agenda 03.12.2015 2 agenda Introduction 03.12.2015 2 agenda Introduction Research goals 03.12.2015 2 agenda Introduction Research
Suresh Lakavath csir urdip Pune, India [email protected].
A Big Data Hadoop Architecture for Online Analysis. Suresh Lakavath csir urdip Pune, India [email protected]. Ramlal Naik L Acme Tele Power LTD Haryana, India [email protected]. Abstract Big Data
Session 1: IT Infrastructure Security Vertica / Hadoop Integration and Analytic Capabilities for Federal Big Data Challenges
Session 1: IT Infrastructure Security Vertica / Hadoop Integration and Analytic Capabilities for Federal Big Data Challenges James Campbell Corporate Systems Engineer HP Vertica [email protected] Big
BIG DATA IN THE CLOUD : CHALLENGES AND OPPORTUNITIES MARY- JANE SULE & PROF. MAOZHEN LI BRUNEL UNIVERSITY, LONDON
BIG DATA IN THE CLOUD : CHALLENGES AND OPPORTUNITIES MARY- JANE SULE & PROF. MAOZHEN LI BRUNEL UNIVERSITY, LONDON Overview * Introduction * Multiple faces of Big Data * Challenges of Big Data * Cloud Computing
Oracle s Big Data solutions. Roger Wullschleger. <Insert Picture Here>
s Big Data solutions Roger Wullschleger DBTA Workshop on Big Data, Cloud Data Management and NoSQL 10. October 2012, Stade de Suisse, Berne 1 The following is intended to outline
From GWS to MapReduce: Google s Cloud Technology in the Early Days
Large-Scale Distributed Systems From GWS to MapReduce: Google s Cloud Technology in the Early Days Part II: MapReduce in a Datacenter COMP6511A Spring 2014 HKUST Lin Gu [email protected] MapReduce/Hadoop
SWISSBOX REVISITING THE DATA PROCESSING SOFTWARE STACK
3/2/2011 SWISSBOX REVISITING THE DATA PROCESSING SOFTWARE STACK Systems Group Dept. of Computer Science ETH Zürich, Switzerland SwissBox Humboldt University Dec. 2010 Systems Group = www.systems.ethz.ch
Dynamic Adaptive Feedback of Load Balancing Strategy
Journal of Information & Computational Science 8: 10 (2011) 1901 1908 Available at http://www.joics.com Dynamic Adaptive Feedback of Load Balancing Strategy Hongbin Wang a,b, Zhiyi Fang a,, Shuang Cui
Telecom Data processing and analysis based on Hadoop
COMPUTER MODELLING & NEW TECHNOLOGIES 214 18(12B) 658-664 Abstract Telecom Data processing and analysis based on Hadoop Guofan Lu, Qingnian Zhang *, Zhao Chen Wuhan University of Technology, Wuhan 4363,China
NoSQL and Hadoop Technologies On Oracle Cloud
NoSQL and Hadoop Technologies On Oracle Cloud Vatika Sharma 1, Meenu Dave 2 1 M.Tech. Scholar, Department of CSE, Jagan Nath University, Jaipur, India 2 Assistant Professor, Department of CSE, Jagan Nath
The Comprehensive Performance Rating for Hadoop Clusters on Cloud Computing Platform
The Comprehensive Performance Rating for Hadoop Clusters on Cloud Computing Platform Fong-Hao Liu, Ya-Ruei Liou, Hsiang-Fu Lo, Ko-Chin Chang, and Wei-Tsong Lee Abstract Virtualization platform solutions
Topics in basic DBMS course
Topics in basic DBMS course Database design Transaction processing Relational query languages (SQL), calculus, and algebra DBMS APIs Database tuning (physical database design) Basic query processing (ch
Crowdsourcing Entity Resolution: a Short Overview and Open Issues
Crowdsourcing Entity Resolution: a Short Overview and Open Issues Xiao Chen Otto-von-Gueriecke University Magdeburg [email protected] ABSTRACT Entity resolution (ER) is a process to identify records that
R.K.Uskenbayeva 1, А.А. Kuandykov 2, Zh.B.Kalpeyeva 3, D.K.Kozhamzharova 4, N.K.Mukhazhanov 5
Distributed data processing in heterogeneous cloud environments R.K.Uskenbayeva 1, А.А. Kuandykov 2, Zh.B.Kalpeyeva 3, D.K.Kozhamzharova 4, N.K.Mukhazhanov 5 1 [email protected], 2 [email protected],
Associate Professor, Department of CSE, Shri Vishnu Engineering College for Women, Andhra Pradesh, India 2
Volume 6, Issue 3, March 2016 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Special Issue
Hadoop: Distributed Data Processing. Amr Awadallah Founder/CTO, Cloudera, Inc. ACM Data Mining SIG Thursday, January 25 th, 2010
Hadoop: Distributed Data Processing Amr Awadallah Founder/CTO, Cloudera, Inc. ACM Data Mining SIG Thursday, January 25 th, 2010 Outline Scaling for Large Data Processing What is Hadoop? HDFS and MapReduce
Analysis and Optimization of Massive Data Processing on High Performance Computing Architecture
Analysis and Optimization of Massive Data Processing on High Performance Computing Architecture He Huang, Shanshan Li, Xiaodong Yi, Feng Zhang, Xiangke Liao and Pan Dong School of Computer Science National
Report on the Train Ticketing System
Report on the Train Ticketing System Author: Zaobo He, Bing Jiang, Zhuojun Duan 1.Introduction... 2 1.1 Intentions... 2 1.2 Background... 2 2. Overview of the Tasks... 3 2.1 Modules of the system... 3
Research on Job Scheduling Algorithm in Hadoop
Journal of Computational Information Systems 7: 6 () 5769-5775 Available at http://www.jofcis.com Research on Job Scheduling Algorithm in Hadoop Yang XIA, Lei WANG, Qiang ZHAO, Gongxuan ZHANG School of
Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data
Introduction to Hadoop HDFS and Ecosystems ANSHUL MITTAL Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data Topics The goal of this presentation is to give
International Journal of Innovative Research in Information Security (IJIRIS) ISSN: 2349-7017(O) Volume 1 Issue 3 (September 2014)
SURVEY ON BIG DATA PROCESSING USING HADOOP, MAP REDUCE N.Alamelu Menaka * Department of Computer Applications Dr.Jabasheela Department of Computer Applications Abstract-We are in the age of big data which
Surfing the Data Tsunami: A New Paradigm for Big Data Processing and Analytics
Surfing the Data Tsunami: A New Paradigm for Big Data Processing and Analytics Dr. Liangxiu Han Future Networks and Distributed Systems Group (FUNDS) School of Computing, Mathematics and Digital Technology,
Analysis and Research of Cloud Computing System to Comparison of Several Cloud Computing Platforms
Volume 1, Issue 1 ISSN: 2320-5288 International Journal of Engineering Technology & Management Research Journal homepage: www.ijetmr.org Analysis and Research of Cloud Computing System to Comparison of
A computational model for MapReduce job flow
A computational model for MapReduce job flow Tommaso Di Noia, Marina Mongiello, Eugenio Di Sciascio Dipartimento di Ingegneria Elettrica e Dell informazione Politecnico di Bari Via E. Orabona, 4 70125
CONCEPTUAL MODEL OF MULTI-AGENT BUSINESS COLLABORATION BASED ON CLOUD WORKFLOW
CONCEPTUAL MODEL OF MULTI-AGENT BUSINESS COLLABORATION BASED ON CLOUD WORKFLOW 1 XINQIN GAO, 2 MINGSHUN YANG, 3 YONG LIU, 4 XIAOLI HOU School of Mechanical and Precision Instrument Engineering, Xi'an University
Parallel Data Warehouse
MICROSOFT'S ANALYTICS SOLUTIONS WITH PARALLEL DATA WAREHOUSE Parallel Data Warehouse Stefan Cronjaeger Microsoft May 2013 AGENDA PDW overview Columnstore and Big Data Business Intelligence Project Ability
Discovering Business Insights in Big Data Using SQL-MapReduce
Discovering Business Insights in Big Data Using SQL-MapReduce A Technical Whitepaper Rick F. van der Lans Independent Business Intelligence Analyst R20/Consultancy July 2013 Sponsored by Copyright 2013
Integrating Big Data into the Computing Curricula
Integrating Big Data into the Computing Curricula Yasin Silva, Suzanne Dietrich, Jason Reed, Lisa Tsosie Arizona State University http://www.public.asu.edu/~ynsilva/ibigdata/ 1 Overview Motivation Big
Research on the UHF RFID Channel Coding Technology based on Simulink
Vol. 6, No. 7, 015 Research on the UHF RFID Channel Coding Technology based on Simulink Changzhi Wang Shanghai 0160, China Zhicai Shi* Shanghai 0160, China Dai Jian Shanghai 0160, China Li Meng Shanghai
How To Use Hadoop For Gis
2013 Esri International User Conference July 8 12, 2013 San Diego, California Technical Workshop Big Data: Using ArcGIS with Apache Hadoop David Kaiser Erik Hoel Offering 1330 Esri UC2013. Technical Workshop.
Next-Generation Cloud Analytics with Amazon Redshift
Next-Generation Cloud Analytics with Amazon Redshift What's inside Introduction Why Amazon Redshift is Great for Analytics Cloud Data Warehousing Strategies for Relational Databases Analyzing Fast, Transactional
How To Handle Big Data With A Data Scientist
III Big Data Technologies Today, new technologies make it possible to realize value from Big Data. Big data technologies can replace highly customized, expensive legacy systems with a standard solution
NoSQL: Robust and Efficient Data Management on Deduplication Process by using a mobile application
NoSQL: Robust and Efficient Data Management on Deduplication Process by using a mobile application Hemn B.Abdalla, Jinzhao Lin, Guoquan Li Abstract Present Information technology has an enormous responsibility
A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM
A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM Sneha D.Borkar 1, Prof.Chaitali S.Surtakar 2 Student of B.E., Information Technology, J.D.I.E.T, [email protected] Assistant Professor, Information
Data Refinery with Big Data Aspects
International Journal of Information and Computation Technology. ISSN 0974-2239 Volume 3, Number 7 (2013), pp. 655-662 International Research Publications House http://www.irphouse.com/ijict.htm Data
CSE 590: Special Topics Course (Supercomputing) Lecture 10 (MapReduce & Hadoop)
CSE 590: Special Topics Course (Supercomputing) Lecture 10 (MapReduce & Hadoop) Rezaul A. Chowdhury Department of Computer Science SUNY Stony Brook Spring 2016 MapReduce MapReduce is a programming model
CLOUD BASED PEER TO PEER NETWORK FOR ENTERPRISE DATAWAREHOUSE SHARING
CLOUD BASED PEER TO PEER NETWORK FOR ENTERPRISE DATAWAREHOUSE SHARING Basangouda V.K 1,Aruna M.G 2 1 PG Student, Dept of CSE, M.S Engineering College, Bangalore,[email protected] 2 Associate Professor.,
GeoSquare: A cloud-enabled geospatial information resources (GIRs) interoperate infrastructure for cooperation and sharing
GeoSquare: A cloud-enabled geospatial information resources (GIRs) interoperate infrastructure for cooperation and sharing Kai Hu 1, Huayi Wu 1, Zhipeng Gui 2, Lan You 1, Ping Shen 1, Shuang Gao 1, Jie
Design call center management system of e-commerce based on BP neural network and multifractal
Available online www.jocpr.com Journal of Chemical and Pharmaceutical Research, 2014, 6(6):951-956 Research Article ISSN : 0975-7384 CODEN(USA) : JCPRC5 Design call center management system of e-commerce
Keywords Big Data; OODBMS; RDBMS; hadoop; EDM; learning analytics, data abundance.
Volume 4, Issue 11, November 2014 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Analytics
Study on Redundant Strategies in Peer to Peer Cloud Storage Systems
Applied Mathematics & Information Sciences An International Journal 2011 NSP 5 (2) (2011), 235S-242S Study on Redundant Strategies in Peer to Peer Cloud Storage Systems Wu Ji-yi 1, Zhang Jian-lin 1, Wang
COMP9321 Web Application Engineering
COMP9321 Web Application Engineering Semester 2, 2015 Dr. Amin Beheshti Service Oriented Computing Group, CSE, UNSW Australia Week 11 (Part II) http://webapps.cse.unsw.edu.au/webcms2/course/index.php?cid=2411
OPTIMIZATION STRATEGY OF CLOUD COMPUTING SERVICE COMPOSITION RESEARCH BASED ON ANP
OPTIMIZATION STRATEGY OF CLOUD COMPUTING SERVICE COMPOSITION RESEARCH BASED ON ANP Xing Xu School of Automation Huazhong University of Science and Technology Wuhan 430074, P.R.China E-mail: [email protected]
Policy-based Pre-Processing in Hadoop
Policy-based Pre-Processing in Hadoop Yi Cheng, Christian Schaefer Ericsson Research Stockholm, Sweden [email protected], [email protected] Abstract While big data analytics provides
Chase Wu New Jersey Institute of Technology
CS 698: Special Topics in Big Data Chapter 4. Big Data Analytics Platforms Chase Wu New Jersey Institute of Technology Some of the slides have been provided through the courtesy of Dr. Ching-Yung Lin at
ESS event: Big Data in Official Statistics. Antonino Virgillito, Istat
ESS event: Big Data in Official Statistics Antonino Virgillito, Istat About me Head of Unit Web and BI Technologies, IT Directorate of Istat Project manager and technical coordinator of Web
A Small-time Scale Netflow-based Anomaly Traffic Detecting Method Using MapReduce
, pp.231-242 http://dx.doi.org/10.14257/ijsia.2014.8.2.24 A Small-time Scale Netflow-based Anomaly Traffic Detecting Method Using MapReduce Wang Jin-Song, Zhang Long, Shi Kai and Zhang Hong-hao School
FP-Hadoop: Efficient Execution of Parallel Jobs Over Skewed Data
FP-Hadoop: Efficient Execution of Parallel Jobs Over Skewed Data Miguel Liroz-Gistau, Reza Akbarinia, Patrick Valduriez To cite this version: Miguel Liroz-Gistau, Reza Akbarinia, Patrick Valduriez. FP-Hadoop:
PLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS
PLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS By HAI JIN, SHADI IBRAHIM, LI QI, HAIJUN CAO, SONG WU and XUANHUA SHI Prepared by: Dr. Faramarz Safi Islamic Azad
ISSN:2320-0790. Keywords: HDFS, Replication, Map-Reduce I Introduction:
ISSN:2320-0790 Dynamic Data Replication for HPC Analytics Applications in Hadoop Ragupathi T 1, Sujaudeen N 2 1 PG Scholar, Department of CSE, SSN College of Engineering, Chennai, India 2 Assistant Professor,
E6893 Big Data Analytics Lecture 2: Big Data Analytics Platforms
E6893 Big Data Analytics Lecture 2: Big Data Analytics Platforms Ching-Yung Lin, Ph.D. Adjunct Professor, Dept. of Electrical Engineering and Computer Science Mgr., Dept. of Network Science and Big Data
