Advanced SQL Query To Flink Translator
Yasien Ghallab Gouda
Full Professor, Mathematics and Computer Science Department, Aswan University, Aswan, Egypt

Hager Saleh Mohammed
Researcher, Computer Science Department, Aswan University, Aswan, Egypt

Mohamed Helmy Khafagy
Assistant Professor, Computer Science Department, Fayoum University, Egypt

ABSTRACT
In the digital world, data plays a central role in most computer engineering applications. The continuing growth of data has made it increasingly difficult to store and analyze data with traditional database systems. Apache Flink is a framework for Big Data analytics on large clusters. SQL-like query languages provide an interface between the user and large databases, so there is a strong need for an SQL-to-Flink translator that lets users run advanced SQL queries on top of Flink without writing Java code, especially since Flink's support for complex SQL queries is limited. In this paper, a system is developed that runs on top of Flink without any change to the Flink framework. This system, called the Advanced SQL Query To Flink Translator, receives an advanced SQL query from the user, generates the Flink code that executes the query, and returns the query results to the user.

General Terms: SQL Query, Apache Flink

Keywords: Big Data, Flink, SQL Translator, Hadoop, Hive, Advanced SQL Query

1. INTRODUCTION
The amount of data in the world has been exploding, and the analysis of such large data sets is known as Big Data. Big Data refers to huge, complex collections of structured and unstructured data that are difficult to store and analyze using traditional database techniques [8]. Big Data requires frameworks to analyze and process these datasets, such as Hadoop, MapReduce, and Flink. Apache Hadoop is open-source software for reliable, scalable, distributed computing on clusters, modeled on Google's MapReduce framework [2].
Hadoop consists of HDFS and MapReduce, which provide effective load-balancing techniques [13, 9]. MapReduce is a programming model for processing large data sets on distributed clusters, introduced by Google in 2004, that offers an efficient solution to the data analysis challenge. The MapReduce framework requires users to implement their applications by hand-coding map and reduce functions. While this low-level coding offers great flexibility in programming applications, it also makes programs harder to debug [3, 12]. Apache Flink is an open-source framework for distributed stream and batch data processing that runs on distributed clusters. Flink's core is a streaming dataflow engine that provides data distribution, communication, and fault tolerance for distributed computations over data streams. Flink also builds batch processing on top of the streaming engine, adding native iteration support, managed memory, and program optimization [1]. Apache Flink can be faster than Hadoop, supports Hadoop input and output formats, and can run Hadoop programs. SQL-like query languages provide an interface between the user and the database and help the user manage and retrieve data from large databases. Several translators provide SQL query support on top of Hadoop, such as Hive [16], YSmart [12], S2MART [7], and QMapper [17]. The translator presented here is instead built to run on top of Flink and execute advanced SQL queries, since Flink's support for such queries is limited. The proposed system runs above Flink without any change to the Flink structure: it translates an advanced SQL query into Flink code and executes that code on Flink. The system handles queries containing WHERE clauses with BETWEEN, AND, and OR operators, subqueries in WHERE clauses using IN, JOIN types, the ORDER BY operation, the TOP operation, COUNT aggregation, and nested queries.
The proposed technique also makes it easier to run many existing algorithms and techniques on top of Flink [15, 5, 11, 6, 10]. The rest of the paper is organized as follows: Section 2 introduces related work. Section 3 describes the proposed system architecture and methodology. Section 4 presents the experimental results and a comparison between the proposed system and Hive. Finally, Section 5 concludes and briefly outlines future work.

2. RELATED WORK
This section gives an overview of related work presented so far.

2.1 Hive
Hive is an open-source data warehousing solution built on top of Hadoop. Hive supports queries expressed in an SQL-like language called HiveQL, which transforms SQL queries into MapReduce jobs executed on Hadoop. HiveQL allows users to plug custom MapReduce scripts into queries and offers many of the same features as SQL [16].

2.2 S2MART
S2MART (Smart SQL to Map-Reduce Translators) transforms SQL queries into MapReduce jobs while exploiting intra-query correlation: it builds an SQL relationship tree to minimize redundant operations and computations, and a spiral-modeled database to store and retrieve recently used query results, reducing data transfer and network cost. S2MART applies the database concept of views to make parallelization of big data easy and streamlined [7].

2.3 QMapper
QMapper is a tool that applies query-rewriting rules and a cost-based plan evaluator to choose the optimal equivalent MapReduce flow, significantly enhancing the performance of Hive [17].
2.4 SQL To Flink Translator
SQL To Flink Translator is a tool built on top of Apache Flink, without changing the Flink structure, to support simple SQL queries. It receives an SQL query from the user and generates the equivalent code that can be run on Flink. This translator has some limitations: it cannot translate advanced queries, and it does not improve the performance of SQL query execution [14].

3. ADVANCED SQL QUERY TO FLINK TRANSLATOR
3.1 System Architecture
The central feature of the proposed system is executing advanced queries on Flink without the user writing any Java code. The system architecture, illustrated in Figure 1, is divided into five phases. In the first phase, the system receives the SQL query from the user, and the query parser checks that the query is correct. In the second phase, the system extracts the table and column names from the input query and instantiates a Java dataset class for each table containing only the extracted columns. In the third phase, the system extracts keywords from the query, such as WHERE clauses containing BETWEEN, AND, or OR, subqueries in WHERE clauses using IN, JOIN types, the ORDER BY operation, the TOP operation, and COUNT aggregation, and detects nested queries. In the fourth phase, the system generates the Flink code that executes the input query. In the last phase, the system executes the Flink code and returns the result to the user.

Fig. 1. System Architecture

3.2 Methodology
The proposed system translates a user query if the query has WHERE clauses containing BETWEEN, AND, or OR operators, a subquery in a WHERE clause using IN, JOIN types, ORDER BY, a TOP clause, COUNT aggregation, or a nested query. Each case is explained below to show how the proposed system handles it.

WHERE Clauses Containing the BETWEEN Operator. The BETWEEN operator filters values within a range. When the query parser finds a WHERE clause containing a BETWEEN operator in the input query (see Figure 2), the system generates Flink code that calls the Filter function to execute the query and returns the result (see Figure 3).

Fig. 2. SQL Query Contains BETWEEN Operator
Fig. 3. BETWEEN Operator Flink Code

WHERE Clauses Containing AND and OR Operators. The AND operator keeps a row only if all conditions are true; the OR operator keeps a row if at least one condition is true. When the query parser finds a WHERE clause containing AND or OR operators in the input query (see Figure 4), the system generates Flink code that calls the Filter function to execute the query and returns the result (see Figure 5).

Fig. 4. Query Contains OR Operator
Fig. 5. AND & OR Operators Flink Code

Subquery in WHERE Clause Containing the IN Keyword. The IN operator allows the user to match multiple values in a WHERE clause. When the query parser finds a subquery with IN in the WHERE clause of the input query (see Figure 6), the system generates Flink code that calls the CoGroup function and a custom IN-operator function to execute the query and returns the result (see Figure 7).

Fig. 6. Sub Query in Where Clauses Contains IN Keyword
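The figures with the generated code are not reproduced here, so the filter-based translations above (BETWEEN, AND, OR) can only be sketched. The following is a hedged, plain-Java illustration of the code-generation step; the class, method, field names, and the emitted template are hypothetical, not taken from the paper, but they show the kind of Flink-style filter call the translator produces for a BETWEEN predicate.

```java
// Hypothetical sketch of the BETWEEN-to-filter code-generation step.
// Dataset/field names and the emitted template are illustrative only.
public class BetweenCodegen {

    // Emit a Flink-style filter expression for:
    //   WHERE <field> BETWEEN <lo> AND <hi>
    static String emitFilter(String dataset, String field, long lo, long hi) {
        return dataset + ".filter(row -> row." + field + " >= " + lo
             + " && row." + field + " <= " + hi + ")";
    }

    public static void main(String[] args) {
        // Example: SELECT * FROM lineitem WHERE quantity BETWEEN 10 AND 20
        String code = emitFilter("lineitem", "quantity", 10, 20);
        // prints: lineitem.filter(row -> row.quantity >= 10 && row.quantity <= 20)
        System.out.println(code);
    }
}
```

An AND/OR predicate would be handled the same way, with the generated lambda combining the conditions with `&&` or `||` before being wrapped in the single filter call.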
Fig. 7. IN Keyword Flink Code

JOIN Types. SQL JOIN is used to combine rows from multiple tables. The proposed system handles the following JOIN types.

LEFT OUTER JOIN. LEFT OUTER JOIN returns all rows from the left table together with the matching rows from the right table, and null values for right-table columns when no match exists. When the query parser finds LEFT OUTER JOIN in the input query (see Figure 8), the system generates Flink code that calls the CoGroup function and the custom JoinType() function to execute the query and returns the result (see Figure 9).

Fig. 8. LEFT OUTER JOIN Query
Fig. 9. LEFT OUTER JOIN Flink Code

RIGHT OUTER JOIN. RIGHT OUTER JOIN returns all rows from the right table together with the matching rows from the left table, and null values for left-table columns when no match exists. When the query parser finds RIGHT OUTER JOIN in the input query (see Figure 10), the system generates Flink code that calls the CoGroup function and the custom JoinType() function to execute the query and returns the result (see Figure 11).

Fig. 10. RIGHT OUTER JOIN Query
Fig. 11. RIGHT OUTER JOIN Flink Code

ORDER BY Keyword. ORDER BY is used to sort the result by one or more columns, in ascending or descending order. When the query parser finds ORDER BY in the input query (see Figure 12), the system generates Flink code that calls sortPartition(field number, order type) to execute the query and returns the result (see Figure 13).

Fig. 12. Query Contains ORDER BY Keyword
Fig. 13. ORDER BY Keyword Flink Code

TOP Clause. The TOP clause is used to return a specified number of rows. When the query parser finds a TOP clause in the input query (see Figure 14), the system generates Flink code that calls the first() function to execute the query and returns the result (see Figure 15).

Fig. 14. Query Contains TOP Clause
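Since the CoGroup-based join code in the figures is not reproduced here, the semantics the generated function must implement can be sketched in plain Java. This is an illustrative simplification (one row per key, string-valued rows, hypothetical names), not the paper's actual CoGroup function: every left row is emitted, paired with its matching right row or with null when no match exists.

```java
import java.util.*;

// Plain-Java sketch of LEFT OUTER JOIN semantics: emit every left row,
// joined with the matching right row, or with "null" if there is no match.
// One row per key and string rows are simplifications for brevity.
public class LeftOuterJoinSketch {

    static List<String> leftOuterJoin(Map<Integer, String> left,
                                      Map<Integer, String> right) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<Integer, String> l : left.entrySet()) {
            String r = right.get(l.getKey());       // matching right row, or null
            out.add(l.getValue() + "," + (r == null ? "null" : r));
        }
        return out;
    }

    public static void main(String[] args) {
        Map<Integer, String> customers = new LinkedHashMap<>();
        customers.put(1, "alice");
        customers.put(2, "bob");
        Map<Integer, String> orders = new LinkedHashMap<>();
        orders.put(1, "order-7");
        // prints: [alice,order-7, bob,null]
        System.out.println(leftOuterJoin(customers, orders));
    }
}
```

A RIGHT OUTER JOIN simply swaps the roles of the two inputs; in Flink's CoGroup, both groups for a key are available at once, so the same custom function can cover both directions via a join-type parameter, as the paper's JoinType() function suggests.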
Fig. 15. Top Clause Flink Code

COUNT Aggregation. COUNT aggregation is used to return the number of rows in the result. When the query parser finds COUNT aggregation in the input query (see Figure 16), the system generates Flink code that calls the count() function to execute the query and returns the result (see Figure 17).

Fig. 16. Query Contains COUNT Aggregation
Fig. 17. COUNT Aggregation Flink Code

Nested Query. When the query parser finds a sub-select in the input query (see Figure 18), the system first generates Flink code that executes the sub-select (see Figure 19) and then generates Flink code that executes the top select based on the values returned by the sub-select (see Figure 20).

Fig. 18. Query Contains Sub Query
Fig. 19. Sub-select Flink Code
Fig. 20. Top-select Flink Code

4. EXPERIMENTAL RESULTS
4.1 Dataset and Queries
The datasets and queries are taken from the TPC-H benchmark. This benchmark models decision support systems that handle large volumes of data, execute complex queries, and answer critical business questions [4]. Each dataset is split into different sizes for executing the TPC-H queries.

4.2 Environment Setup
The experiments use Ubuntu virtual machines, each running the Java(TM) SE Runtime Environment with the NetBeans IDE. Hadoop version 1.2.1 is installed and configured with one Namenode and two Datanodes; the Namenode and Datanodes each have 20 GB of RAM, seven cores, and a 100 GB disk. Hive is installed on the Hadoop Namenode and Datanodes. Flink 0.9 is used, with a Flink cluster configured with one master node and two worker nodes; the master and worker nodes each have 20 GB of RAM, seven cores, and a 100 GB disk.

4.3 Results
This section compares the Advanced SQL Query To Flink Translator with HiveQL when running TPC-H Query 4 and TPC-H Query 13 on different data sizes.

TPC-H Query 4. TPC-H Query 4 (see Figure 21) is used because it contains cases handled by the proposed system.

Fig. 21. TPC-H Query 4

Figure 22 shows a comparison between the Advanced SQL Query To Flink Translator and HiveQL when running TPC-H Query 4; the translator improves performance by 21% on average.
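The two-pass nested-query strategy described above (run the sub-select first, then filter the top select on its result) can be sketched in plain Java. The tables, field values, and method names below are hypothetical, chosen only to make the IN-style dependency between the two passes concrete.

```java
import java.util.*;
import java.util.stream.*;

// Sketch of the two-pass nested-query evaluation: the sub-select is
// executed first, and the top select then filters on the returned
// values (IN semantics). All data and names are illustrative.
public class NestedQuerySketch {

    // Top select: keep only outer rows whose key appears in the sub-select result.
    static List<Integer> topSelect(List<Integer> outerKeys, Set<Integer> subResult) {
        return outerKeys.stream()
                .filter(subResult::contains)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Pass 1 (sub-select): SELECT orderkey FROM orders WHERE priority = 'URGENT'
        Map<Integer, String> orders = Map.of(1, "URGENT", 2, "LOW", 3, "URGENT");
        Set<Integer> subResult = orders.entrySet().stream()
                .filter(e -> e.getValue().equals("URGENT"))
                .map(Map.Entry::getKey)
                .collect(Collectors.toSet());

        // Pass 2 (top select): SELECT orderkey FROM lineitem WHERE orderkey IN (...)
        List<Integer> lineitemKeys = List.of(1, 2, 2, 3, 4);
        // prints: [1, 3]
        System.out.println(topSelect(lineitemKeys, subResult));
    }
}
```

In the generated Flink code the same dependency holds: the sub-select dataset is materialized first, and the top-select operators consume its output, which is why the translator emits the two code fragments in that order.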
Fig. 22. Comparison between Advanced SQL Query To Flink Translator and HiveQL when running TPC-H Query 4 on different data sizes

TPC-H Query 13. TPC-H Query 13 (see Figure 23) is used because it contains cases handled by the proposed system.

Fig. 23. TPC-H Query 13

Figure 24 shows a comparison between the Advanced SQL Query To Flink Translator and HiveQL when running TPC-H Query 13; the translator improves performance by 29% on average.

Fig. 24. Comparison between Advanced SQL Query To Flink Translator and HiveQL when running TPC-H Query 13 on different data sizes

5. CONCLUSIONS AND FUTURE RESEARCH
Executing advanced SQL queries on Flink without writing Java code is highly desirable, so a system was developed that executes advanced SQL queries on Flink in three main stages: first, it receives the advanced SQL query from the user; second, it generates the Flink code that executes the query; third, the correctness of the system was verified by performing experiments with different queries, and the system proved efficient in all experimental results. In future work, the system will be extended to execute advanced SQL queries on Flink in less time.

6. REFERENCES
[1] Apache Flink.
[2] Apache Hadoop.
[3] MapReduce.
[4] TPC-H.
[5] Marwah N. Abdullah, Mohamed H. Khafagy, and Fatma A. Omara. HOME: HiveQL optimization in multi-session environment. In 5th European Conference of Computer Science (ECCS 14), volume 80, page 89.
[6] Hussien SH. Abdel Azez, Mohamed H. Khafagy, and Fatma A. Omara. JOUM: An indexing methodology for improving join in Hive star schema. International Journal of Scientific & Engineering Research, 6.
[7] Narayan Gowraj, Prasanna Venkatesh Ravi, V. Mouniga, and M. R. Sumalatha. S2MART: Smart SQL to Map-Reduce Translators. In Web Technologies and Applications. Springer.
[8] Katarina Grolinger, Michael Hayes, Wilson Akio Higashino, Alexandra L'Heureux, David S. Allison, Miriam Capretz, et al. Challenges for MapReduce in Big Data. In Services (SERVICES), 2014 IEEE World Congress on. IEEE.
[9] Hesham A. Hefny, Mohamed Helmy Khafagy, and M. Wahdan Ahmed. Comparative study of load balance algorithms for MapReduce environment. International Journal of Computer Applications, 106(18):41-50.
[10] Mohamed Helmy Khafagy. Index to index two-way join algorithm. International Journal of Digital Content Technology and its Applications, 9(4):25.
[11] Mohamed Helmy Khafagy. Indexed map-reduce join algorithm. International Journal of Scientific & Engineering Research, 6(5).
[12] Rubao Lee, Tian Luo, Yin Huai, Fusheng Wang, Yongqiang He, and Xiaodong Zhang. YSmart: Yet another SQL-to-MapReduce translator. In Distributed Computing Systems (ICDCS), International Conference on. IEEE.
[13] Ebada Sarhan, Atif Ghalwash, and Mohamed Khafagy. Queue weighting load-balancing technique for database replication in dynamic content web sites. In Proceedings of the 9th WSEAS International Conference on Applied Computer Science, pages 50-55.
[14] Fawzya Ramadan Sayed and Mohamed Helmy Khafagy. SQL to Flink translator. IJCSI International Journal of Computer Science Issues, 12(1):169.
[15] Mina Samir Shanoda, Samah Ahmed Senbel, and Mohamed Helmy Khafagy. JOMR: Multi-join optimizer technique to enhance map-reduce job. In Informatics and Systems (INFOS), International Conference on. IEEE.
[16] Ashish Thusoo, Joydeep Sen Sarma, Namit Jain, Zheng Shao, Prasad Chakka, Suresh Anthony, Hao Liu, Pete Wyckoff, and Raghotham Murthy. Hive: a warehousing solution over a map-reduce framework. Proceedings of the VLDB Endowment, 2(2).
[17] Yingzhong Xu and Songlin Hu. QMapper: a tool for SQL optimization on Hive using query rewriting. In Proceedings of the 22nd International Conference on World Wide Web Companion. International World Wide Web Conferences Steering Committee.
A Performance Analysis of Distributed Indexing using Terrier
A Performance Analysis of Distributed Indexing using Terrier Amaury Couste Jakub Kozłowski William Martin Indexing Indexing Used by search
NETWORK TRAFFIC ANALYSIS: HADOOP PIG VS TYPICAL MAPREDUCE
NETWORK TRAFFIC ANALYSIS: HADOOP PIG VS TYPICAL MAPREDUCE Anjali P P 1 and Binu A 2 1 Department of Information Technology, Rajagiri School of Engineering and Technology, Kochi. M G University, Kerala
CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop)
CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop) Rezaul A. Chowdhury Department of Computer Science SUNY Stony Brook Spring 2016 MapReduce MapReduce is a programming model
Survey Paper on Big Data Processing and Hadoop Components
Survey Paper on Big Data Processing and Hadoop Components Poonam S. Patil 1, Rajesh. N. Phursule 2 Department of Computer Engineering JSPM s Imperial College of Engineering and Research, Pune, India Abstract:
Indian Journal of Science The International Journal for Science ISSN 2319 7730 EISSN 2319 7749 2016 Discovery Publication. All Rights Reserved
Indian Journal of Science The International Journal for Science ISSN 2319 7730 EISSN 2319 7749 2016 Discovery Publication. All Rights Reserved Perspective Big Data Framework for Healthcare using Hadoop
BIG DATA What it is and how to use?
BIG DATA What it is and how to use? Lauri Ilison, PhD Data Scientist 21.11.2014 Big Data definition? There is no clear definition for BIG DATA BIG DATA is more of a concept than precise term 1 21.11.14
Design of Electric Energy Acquisition System on Hadoop
, pp.47-54 http://dx.doi.org/10.14257/ijgdc.2015.8.5.04 Design of Electric Energy Acquisition System on Hadoop Yi Wu 1 and Jianjun Zhou 2 1 School of Information Science and Technology, Heilongjiang University
INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY
INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY A PATH FOR HORIZING YOUR INNOVATIVE WORK A COMPREHENSIVE VIEW OF HADOOP ER. AMRINDER KAUR Assistant Professor, Department
Brave New World: Hadoop vs. Spark
Brave New World: Hadoop vs. Spark Dr. Kurt Stockinger Associate Professor of Computer Science Director of Studies in Data Science Zurich University of Applied Sciences Datalab Seminar, Zurich, Oct. 7,
Big Data Course Highlights
Big Data Course Highlights The Big Data course will start with the basics of Linux which are required to get started with Big Data and then slowly progress from some of the basics of Hadoop/Big Data (like
Big Data and Hadoop with components like Flume, Pig, Hive and Jaql
Abstract- Today data is increasing in volume, variety and velocity. To manage this data, we have to use databases with massively parallel software running on tens, hundreds, or more than thousands of servers.
Testing 3Vs (Volume, Variety and Velocity) of Big Data
Testing 3Vs (Volume, Variety and Velocity) of Big Data 1 A lot happens in the Digital World in 60 seconds 2 What is Big Data Big Data refers to data sets whose size is beyond the ability of commonly used
Lambda Architecture. Near Real-Time Big Data Analytics Using Hadoop. January 2015. Email: [email protected] Website: www.qburst.com
Lambda Architecture Near Real-Time Big Data Analytics Using Hadoop January 2015 Contents Overview... 3 Lambda Architecture: A Quick Introduction... 4 Batch Layer... 4 Serving Layer... 4 Speed Layer...
BIG DATA HANDS-ON WORKSHOP Data Manipulation with Hive and Pig
BIG DATA HANDS-ON WORKSHOP Data Manipulation with Hive and Pig Contents Acknowledgements... 1 Introduction to Hive and Pig... 2 Setup... 2 Exercise 1 Load Avro data into HDFS... 2 Exercise 2 Define an
Big Data Time Series Analysis
Big Data Time Series Analysis Integrating MapReduce and R Visakh C R Department of Information and Computer Science Aalto University Espoo, Finland [email protected] Abstract Explosion of data, especially
ISSN: 2321-7782 (Online) Volume 3, Issue 4, April 2015 International Journal of Advance Research in Computer Science and Management Studies
ISSN: 2321-7782 (Online) Volume 3, Issue 4, April 2015 International Journal of Advance Research in Computer Science and Management Studies Research Article / Survey Paper / Case Study Available online
Scaling Out With Apache Spark. DTL Meeting 17-04-2015 Slides based on https://www.sics.se/~amir/files/download/dic/spark.pdf
Scaling Out With Apache Spark DTL Meeting 17-04-2015 Slides based on https://www.sics.se/~amir/files/download/dic/spark.pdf Your hosts Mathijs Kattenberg Technical consultant Jeroen Schot Technical consultant
How To Scale Out Of A Nosql Database
Firebird meets NoSQL (Apache HBase) Case Study Firebird Conference 2011 Luxembourg 25.11.2011 26.11.2011 Thomas Steinmaurer DI +43 7236 3343 896 [email protected] www.scch.at Michael Zwick DI
Enhancing Dataset Processing in Hadoop YARN Performance for Big Data Applications
Enhancing Dataset Processing in Hadoop YARN Performance for Big Data Applications Ahmed Abdulhakim Al-Absi, Dae-Ki Kang and Myong-Jong Kim Abstract In Hadoop MapReduce distributed file system, as the input
APACHE HADOOP JERRIN JOSEPH CSU ID#2578741
APACHE HADOOP JERRIN JOSEPH CSU ID#2578741 CONTENTS Hadoop Hadoop Distributed File System (HDFS) Hadoop MapReduce Introduction Architecture Operations Conclusion References ABSTRACT Hadoop is an efficient
Architectural patterns for building real time applications with Apache HBase. Andrew Purtell Committer and PMC, Apache HBase
Architectural patterns for building real time applications with Apache HBase Andrew Purtell Committer and PMC, Apache HBase Who am I? Distributed systems engineer Principal Architect in the Big Data Platform
An Approach to Implement Map Reduce with NoSQL Databases
www.ijecs.in International Journal Of Engineering And Computer Science ISSN: 2319-7242 Volume 4 Issue 8 Aug 2015, Page No. 13635-13639 An Approach to Implement Map Reduce with NoSQL Databases Ashutosh
Outline. High Performance Computing (HPC) Big Data meets HPC. Case Studies: Some facts about Big Data Technologies HPC and Big Data converging
Outline High Performance Computing (HPC) Towards exascale computing: a brief history Challenges in the exascale era Big Data meets HPC Some facts about Big Data Technologies HPC and Big Data converging
JOURNAL OF COMPUTER SCIENCE AND ENGINEERING
Exploration on Service Matching Methodology Based On Description Logic using Similarity Performance Parameters K.Jayasri Final Year Student IFET College of engineering [email protected] R.Rajmohan
Apache Flink Next-gen data analysis. Kostas Tzoumas [email protected] @kostas_tzoumas
Apache Flink Next-gen data analysis Kostas Tzoumas [email protected] @kostas_tzoumas What is Flink Project undergoing incubation in the Apache Software Foundation Originating from the Stratosphere research
Big Data and Hadoop with Components like Flume, Pig, Hive and Jaql
Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 3, Issue. 7, July 2014, pg.759
Data Warehousing and Analytics Infrastructure at Facebook. Ashish Thusoo & Dhruba Borthakur athusoo,[email protected]
Data Warehousing and Analytics Infrastructure at Facebook Ashish Thusoo & Dhruba Borthakur athusoo,[email protected] Overview Challenges in a Fast Growing & Dynamic Environment Data Flow Architecture,
Reduction of Data at Namenode in HDFS using harballing Technique
Reduction of Data at Namenode in HDFS using harballing Technique Vaibhav Gopal Korat, Kumar Swamy Pamu [email protected] [email protected] Abstract HDFS stands for the Hadoop Distributed File System.
The Performance Characteristics of MapReduce Applications on Scalable Clusters
The Performance Characteristics of MapReduce Applications on Scalable Clusters Kenneth Wottrich Denison University Granville, OH 43023 [email protected] ABSTRACT Many cluster owners and operators have
