Second Credit Seminar Presentation on Big Data Analytics Platforms: A Survey



Transcription:

Second Credit Seminar Presentation on
Big Data Analytics Platforms: A Survey
By: Mr. Brijesh B. Mehta (Admission No.: D14CO002)
Supervised by: Dr. Udai Pratap Rao
Computer Engineering Department, S. V. National Institute of Technology (SVNIT), Surat
m.brijesh@coed.svnit.ac.in
16/04/2015

Outline of Presentation
1 Introduction
    Categorization of Big Data Analytics Platforms
    Advantages and Drawbacks of both Horizontal and Vertical Scaling Platforms
2 Horizontal Scaling Platforms
    Peer-to-peer Networks
    Apache Hadoop
    Spark
3 Vertical Scaling Platforms
    HPC Cluster
    Multicore CPU
    GPU
    FPGA
4 Comparison of Big Data Analytics Platforms
5 Conclusion and Future Scope

1 Introduction

Categorization of Big Data Analytics Platforms
Fig-1: Categorization of big data analytics platforms [1]

Advantages and Drawbacks of both Horizontal and Vertical Scaling Platforms
Table-1: Advantages and drawbacks of both horizontal and vertical scaling [1]

Horizontal scaling
  Advantages:
  - Performance can be increased in small steps as per requirements
  - Upgrade cost is relatively low
  - System can be scaled as much as needed
  Drawbacks:
  - Software is required to handle all the complexity of data distribution and parallel processing
  - Software that can take advantage of horizontal scaling is limited

Vertical scaling
  Advantages:
  - Most software supports it
  - Management and installation of hardware is relatively easy
  Drawbacks:
  - Cost is relatively high
  - System can only be scaled up to a certain limit

2 Horizontal Scaling Platforms

Peer-to-peer Networks
Fig-2: Architecture of peer-to-peer networks
- MPI is used to communicate and exchange data between nodes
- The major drawback of peer-to-peer networks is their lack of fault tolerance

Apache Hadoop: Hadoop Stack
Fig-3: Hadoop stack with different components [1]

Apache Hadoop: Hadoop Stack (HDFS)
Fig-4: Working of HDFS [2]

Apache Hadoop: Hadoop Stack (YARN)
Fig-5: YARN architecture [3]

Apache Hadoop: Hadoop Stack (MapReduce)
Fig-6: Execution overview of MapReduce [4]
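The execution flow named in Fig-6 can be illustrated with a small word-count sketch. This is a single-process toy in plain Python, not Hadoop's API: the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `run_word_count`) are hypothetical and chosen only to mirror the map, shuffle, and reduce stages.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(document: str) -> Iterator[Tuple[str, int]]:
    """Map: emit an intermediate (word, 1) pair for every word in one input split."""
    for word in document.lower().split():
        yield word, 1


def shuffle_phase(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all intermediate values by key before reduction."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: combine the values collected for one key into a single result."""
    return key, sum(values)


def run_word_count(documents: List[str]) -> Dict[str, int]:
    """Run map, shuffle, and reduce in sequence (in a real cluster these run on many nodes)."""
    intermediate = [pair for doc in documents for pair in map_phase(doc)]
    grouped = shuffle_phase(intermediate)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = run_word_count(["big data platforms", "big data analytics"])
print(counts)
```

In a real deployment each input split would be mapped on a different node and the shuffle would move data across the network; the sketch only shows the data flow between the three stages.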

Apache Hadoop: Hadoop Stack (MapReduce Wrappers)
- Apache Pig [5] is a SQL-like environment developed at Yahoo.
- Hive [6] is a SQL-like data-warehousing solution developed at Facebook.
- DryadLINQ [7] is a C#-like environment developed at Microsoft Research to give .NET framework users more flexibility.
- Mahout [8] is a scalable machine learning library built on the MapReduce paradigm.

Apache Hadoop: Limitations of Hadoop (MapReduce)
- The major drawback of MapReduce is its poor fit for iterative tasks
- In every iteration, MapReduce reads its input from disk and writes its output back to disk, which results in I/O overhead
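The per-iteration disk round trip can be made concrete with a toy comparison. The sketch below is not Hadoop code: it assumes a made-up iterative job (`step`, which moves each value halfway toward the mean) and counts file reads and writes for a disk-backed loop versus a loop that keeps its working set in memory. All names are hypothetical.

```python
import json
import os
import tempfile
from typing import List, Tuple


def step(values: List[float]) -> List[float]:
    """One iteration of a toy iterative job: move each value halfway to the mean."""
    mean = sum(values) / len(values)
    return [(v + mean) / 2 for v in values]


def run_disk_based(values: List[float], iterations: int, workdir: str) -> Tuple[List[float], int]:
    """MapReduce-style loop: every iteration reads its input from disk and writes its output to disk."""
    path = os.path.join(workdir, "iter_0.json")
    with open(path, "w") as f:
        json.dump(values, f)  # initial input placed on disk (not counted)
    disk_ops = 0
    current = list(values)
    for i in range(iterations):
        with open(path) as f:
            current = json.load(f)  # read the previous iteration's output
        disk_ops += 1
        current = step(current)
        path = os.path.join(workdir, f"iter_{i + 1}.json")
        with open(path, "w") as f:
            json.dump(current, f)  # write this iteration's output
        disk_ops += 1
    return current, disk_ops


def run_in_memory(values: List[float], iterations: int) -> Tuple[List[float], int]:
    """In-memory loop: the working set stays in RAM between iterations."""
    current = list(values)
    for _ in range(iterations):
        current = step(current)
    return current, 0


data = [0.0, 4.0, 8.0]
with tempfile.TemporaryDirectory() as tmp:
    disk_result, disk_ops = run_disk_based(data, iterations=5, workdir=tmp)
memory_result, memory_ops = run_in_memory(data, iterations=5)
print(disk_ops, memory_ops)
```

Both loops compute the same result; the disk-backed one simply performs two file operations per iteration, which is the overhead the slide refers to.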

Spark: Berkeley Data Analytics Stack I
Fig-7: Berkeley Data Analytics Stack [9]

Spark: Berkeley Data Analytics Stack II
- Tachyon [10]: similar to HDFS, but makes more aggressive use of memory and supports caching of frequently used files
- Mesos [11]: similar to YARN
- Spark [12]: similar to MapReduce, but supports in-memory processing
- Spark wrappers, similar to MapReduce wrappers:
    - Spark Streaming (large-scale real-time stream processing) [12]
    - BlinkDB (queries with bounded errors and bounded response times on very large data) [13]
    - GraphX (resilient distributed graph system on Spark) [14]
    - MLbase (distributed machine learning library based on Spark) [15]
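The idea behind "queries with bounded errors" can be sketched with a sampling-based approximate aggregate. This is not BlinkDB's API or its actual algorithm; it is a minimal illustration, assuming a simple uniform random sample and a normal-approximation 95% error margin. The function name `approximate_mean` and its parameters are hypothetical.

```python
import math
import random
import statistics
from typing import List, Tuple


def approximate_mean(values: List[float], sample_size: int, seed: int = 0) -> Tuple[float, float]:
    """Estimate the mean from a uniform random sample.

    Returns (estimate, margin), where margin is an approximate 95% error bound
    computed as 1.96 * sample standard deviation / sqrt(sample size).
    """
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    sample = rng.sample(values, sample_size)
    estimate = statistics.mean(sample)
    margin = 1.96 * statistics.stdev(sample) / math.sqrt(sample_size)
    return estimate, margin


population = [float(v) for v in range(100_000)]
estimate, margin = approximate_mean(population, sample_size=1_000)
true_mean = statistics.mean(population)
print(f"estimate = {estimate:.1f} +/- {margin:.1f}, exact = {true_mean:.1f}")
```

The query touches 1% of the data and reports an error margin alongside the answer; a larger sample narrows the margin at the cost of a longer response time.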

Spark: Limitations of Spark (BDAS)
- Most of the components are still under development
- Comparatively less support is available

3 Vertical Scaling Platforms

HPC Clusters
- Also known as blades or supercomputers
- Have thousands of cores, with varying disk organization, cache, communication mechanisms, etc.
- MPI is generally used as the communication scheme
- The major drawback is limited scalability

Multicore CPU
Fig-8: General architecture of multicore CPU
- Multithreading [16] is used to parallelize tasks on the CPU
- The major drawbacks are the limited number of cores and the dependence on system memory for data access, which is limited to a few gigabytes in size
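Splitting a task across threads can be sketched as follows. This is a minimal illustration using Python's standard `concurrent.futures.ThreadPoolExecutor`; the helper names (`chunk`, `sum_of_squares`, `parallel_sum_of_squares`) are hypothetical. Note that in CPython the global interpreter lock limits the speedup for pure-Python CPU-bound work, so the sketch shows the decomposition pattern rather than a performance claim.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List


def chunk(data: List[int], parts: int) -> List[List[int]]:
    """Split data into at most `parts` contiguous chunks of near-equal size."""
    size = max(1, (len(data) + parts - 1) // parts)
    return [data[i:i + size] for i in range(0, len(data), size)]


def sum_of_squares(values: List[int]) -> int:
    """The per-thread work: a partial result over one chunk."""
    return sum(v * v for v in values)


def parallel_sum_of_squares(data: List[int], workers: int = 4) -> int:
    """Compute partial sums on a pool of threads, then combine them."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(sum_of_squares, chunk(data, workers)))
    return sum(partials)


data = list(range(1000))
total = parallel_sum_of_squares(data)
print(total)
```

All threads read from the same in-process list, which reflects the slide's point that a multicore CPU depends on shared system memory for data access.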

GPU
Fig-9: GPGPU architecture [17]
- The major drawback of the GPU is its limited memory, approximately 12 GB per GPU

FPGA
- FPGAs are custom-built, highly specialized hardware components for specific applications [18]
- HDL [19] is used to program such components
- Applications of FPGAs include industrial control [20], educational tools [21], and network security [22]
- The major drawback is the very high development cost

4 Comparison of Big Data Analytics Platforms

Comparison of Big Data Analytics Platforms
Table-2: Comparison of Big Data Analytics Platforms [1]

Columns (comparison parameters)
  System/Platform: Scalability, Data I/O performance, Fault tolerance
  Application/Algorithm: Real-time processing, Data size supported, Iterative task support

Rows (platforms, by scaling type)
  Horizontal scaling: Peer-to-peer Networks, Apache Hadoop, Spark
  Vertical scaling: HPC Clusters, Multicore CPU, GPU, FPGA

5 Conclusion and Future Scope

Conclusion and Future Scope
- Various big data analytics platforms are discussed in detail: peer-to-peer networks, Apache Hadoop, Spark, HPC clusters, multicore CPUs, GPUs, and FPGAs
- These platforms are compared with respect to scalability, data I/O performance, fault tolerance, real-time processing, data size supported, and iterative task support
- This qualitative analysis may help in choosing an appropriate platform for big data analytics
- As we are going to work on unstructured big data analytics, this comparison will be very helpful to us
- In future, we are going to survey various privacy-preserving techniques for unstructured data

References
[1] D. Singh and C. K. Reddy, "A survey on platforms for big data analytics," Journal of Big Data, vol. 2, no. 1, p. 8, Oct. 2014. [Online]. Available: http://www.journalofbigdata.com/content/2/1/8 [Accessed: 02-Mar-2015].
[2] K. Shvachko, H. Kuang, S. Radia, and R. Chansler, "The Hadoop distributed file system," in Proceedings of the 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST), ser. MSST '10. Washington, DC, USA: IEEE Computer Society, 2010, pp. 1-10.
[3] V. K. Vavilapalli, A. C. Murthy, C. Douglas, S. Agarwal, M. Konar, R. Evans, T. Graves, J. Lowe, H. Shah, S. Seth, B. Saha, C. Curino, O. O'Malley, S. Radia, B. Reed, and E. Baldeschwieler, "Apache Hadoop YARN: Yet Another Resource Negotiator," in Proceedings of the 4th Annual Symposium on Cloud Computing, ser. SOCC '13. New York, NY, USA: ACM, 2013, pp. 5:1-5:16.
[4] J. Dean and S. Ghemawat, "MapReduce: Simplified data processing on large clusters," Communications of the ACM, vol. 51, no. 1, pp. 107-113, Jan. 2008.
[5] C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins, "Pig Latin: A not-so-foreign language for data processing," in Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, ser. SIGMOD '08. New York, NY, USA: ACM, 2008, pp. 1099-1110.
[6] A. Thusoo, J. S. Sarma, N. Jain, Z. Shao, P. Chakka, S. Anthony, H. Liu, P. Wyckoff, and R. Murthy, "Hive: A warehousing solution over a map-reduce framework," Proc. VLDB Endow., vol. 2, no. 2, pp. 1626-1629, Aug. 2009.
[7] Y. Yu, M. Isard, D. Fetterly, M. Budiu, U. Erlingsson, P. K. Gunda, and J. Currey, "DryadLINQ: A system for general-purpose distributed data-parallel computing using a high-level language," in Proceedings of the 8th USENIX Conference on Operating Systems Design and Implementation, ser. OSDI '08. Berkeley, CA, USA: USENIX Association, 2008, pp. 1-14.
[8] S. Owen, R. Anil, T. Dunning, and E. Friedman, Mahout in Action. Greenwich, CT, USA: Manning Publications Co., 2011.
[9] Berkeley Data Analytics Stack. [Online]. Available: https://amplab.cs.berkeley.edu/software/ [Accessed: 11-Mar-2015].
[10] Tachyon. [Online]. Available: http://tachyon-project.org/ [Accessed: 11-Mar-2015].
[11] Mesos. [Online]. Available: http://mesos.apache.org/ [Accessed: 11-Mar-2015].
[12] Spark: Lightning-fast cluster computing. [Online]. Available: https://spark.apache.org/ [Accessed: 03-Mar-2015].
[13] S. Agarwal, B. Mozafari, A. Panda, H. Milner, S. Madden, and I. Stoica, "BlinkDB: Queries with bounded errors and bounded response times on very large data," in Proceedings of the 8th ACM European Conference on Computer Systems, ser. EuroSys '13. New York, NY, USA: ACM, 2013, pp. 29-42.
[14] R. S. Xin, J. E. Gonzalez, M. J. Franklin, and I. Stoica, "GraphX: A resilient distributed graph system on Spark," in First International Workshop on Graph Data Management Experiences and Systems, ser. GRADES '13. New York, NY, USA: ACM, 2013, pp. 2:1-2:6.
[15] T. Kraska, A. Talwalkar, J. C. Duchi, R. Griffith, M. J. Franklin, and M. I. Jordan, "MLbase: A distributed machine-learning system," in Conference on Innovative Data Systems Research (CIDR), 2013.
[16] D. M. Tullsen, S. J. Eggers, and H. M. Levy, "Simultaneous multithreading: Maximizing on-chip parallelism," in 25 Years of the International Symposia on Computer Architecture (Selected Papers), ser. ISCA '98. New York, NY, USA: ACM, 1998, pp. 533-544.
[17] GPGPU architecture. [Online]. Available: https://www.usenix.org/legacy/event/hotpar09/tech/full papers/kaldeway/kaldeway html/img2.png [Accessed: 11-Mar-2015].
[18] S. Brown, R. Francis, J. Rose, and Z. Vranesic, Field-Programmable Gate Arrays, ser. VLSI, computer architecture and digital signal processing. Springer US, 1992.
[19] D. Thomas and P. Moorby, The Verilog Hardware Description Language. Springer London, Limited, 2008.
[20] E. Monmasson, L. Idkhajine, M. Cirstea, I. Bahri, A. Tisan, and M. Naouar, "FPGAs in industrial control applications," Industrial Informatics, IEEE Transactions on, vol. 7, no. 2, pp. 224-243, May 2011.
[21] D. Bouldin, "Impacting education using FPGAs," in Parallel and Distributed Processing Symposium, 2004. Proceedings. 18th International, April 2004, pp. 142-147.
[22] H. Chen, Y. Chen, and D. Summerville, "A survey on the application of FPGAs for network infrastructure security," Communications Surveys & Tutorials, IEEE, vol. 13, no. 4, pp. 541-561, Fourth Quarter 2011.

Thank You