Second Credit Seminar Presentation on Big Data Analytics Platforms: A Survey By, Mr. Brijesh B. Mehta Admission No.: D14CO002 Supervised By, Dr. Udai Pratap Rao Computer Engineering Department S. V. National Institute of Technology, Surat m.brijesh@coed.svnit.ac.in 16/04/2015 Mr. Brijesh B. Mehta (SVNIT) Big Data Analytics Platforms: A Survey 16/04/2015 1 / 28
Outline of Presentation
1 Introduction
    Categorization of Big Data Analytics Platforms
    Advantages and Drawbacks of both Horizontal and Vertical Scaling Platforms
2 Horizontal Scaling Platforms
    Peer-to-peer Networks
    Apache Hadoop
    Spark
3 Vertical Scaling Platforms
    HPC Cluster
    Multicore CPU
    GPU
    FPGA
4 Comparison of Big Data Analytics Platforms
5 Conclusion and Future Scope
Introduction: Categorization of Big Data Analytics Platforms
Fig-1: Categorization of big data analytics platforms [1]
Introduction: Advantages and Drawbacks of both Horizontal and Vertical Scaling Platforms
Table-1: Advantages and drawbacks of both horizontal and vertical scaling [1]

Horizontal scaling
    Advantages:
    - Performance can be increased in small steps, as requirements grow
    - Upgrade cost is relatively low
    - The system can be scaled out as much as needed
    Drawbacks:
    - Software must handle all the complexity of data distribution and parallel processing
    - Only a limited amount of software can take advantage of horizontal scaling

Vertical scaling
    Advantages:
    - Most software supports it
    - Hardware is relatively easy to install and manage
    Drawbacks:
    - Cost is relatively high
    - The system can only be scaled up to a certain limit
Horizontal Scaling Platforms: Peer-to-peer Networks
Fig-2: Architecture of peer-to-peer networks
- The Message Passing Interface (MPI) is used to communicate and exchange data between nodes
- The major drawback of peer-to-peer networks is their lack of fault tolerance
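The message-passing style can be illustrated without an MPI installation. The following is a minimal simulation, not MPI itself: it assumes three peers running as threads in one process, each holding a partial list of numbers and exchanging its local sum with every other peer through in-process queues. The names `all_to_all_sum`, `peer` and `inboxes` are hypothetical and chosen only for this sketch.

```python
import queue
import threading


def all_to_all_sum(local_values):
    """Each peer sends its local sum to every other peer, then adds up what it received."""
    n = len(local_values)
    # One inbox per peer; a stand-in for a network link (assumption of this sketch).
    inboxes = [queue.Queue() for _ in range(n)]
    results = [None] * n

    def peer(rank):
        local = sum(local_values[rank])
        # Send phase: push the local partial sum to every other peer.
        for other in range(n):
            if other != rank:
                inboxes[other].put(local)
        # Receive phase: wait for one message from each of the other peers.
        total = local
        for _ in range(n - 1):
            total += inboxes[rank].get()
        results[rank] = total

    threads = [threading.Thread(target=peer, args=(r,)) for r in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


totals = all_to_all_sum([[1, 2], [3, 4], [5, 6]])
```

Every peer ends with the same global total. If one peer stopped before sending, the others would block indefinitely in `get()`, which mirrors the fault-tolerance weakness noted on this slide.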
Apache Hadoop: Hadoop Stack
Fig-3: Hadoop stack with different components [1]
Apache Hadoop: Hadoop Stack (HDFS)
Fig-4: Working of HDFS [2]
Apache Hadoop: Hadoop Stack (YARN)
Fig-5: YARN architecture [3]
Apache Hadoop: Hadoop Stack (MapReduce)
Fig-6: Execution overview of MapReduce [4]
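The map, shuffle and reduce stages of the programming model in [4] can be sketched in a few lines. This is a single-process illustration in plain Python, assuming a word-count job over two tiny input splits; it does not use the Hadoop API, and the function names `map_phase`, `shuffle_phase` and `reduce_phase` are hypothetical.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: combine the values of each key into a single count."""
    return {key: sum(values) for key, values in grouped.items()}


splits = ["big data analytics", "big data platforms"]
intermediate = [pair for split in splits for pair in map_phase(split)]
word_counts = reduce_phase(shuffle_phase(intermediate))
```

In a real cluster the map calls run on different nodes and the shuffle moves data across the network; here all three stages run in one process purely to show the data flow.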
Apache Hadoop: Hadoop Stack (MapReduce Wrappers)
- Apache Pig [5] is a SQL-like environment developed at Yahoo.
- Hive [6] is a SQL-like data-warehousing solution developed at Facebook.
- DryadLINQ [7] is a C#-like environment developed at Microsoft Research to give .NET Framework users more flexibility.
- Mahout [8] is a scalable machine learning library developed using the MapReduce paradigm.
Apache Hadoop: Limitations of Hadoop (MapReduce)
- The major drawback of MapReduce is its handling of iterative tasks
- In every iteration, data is read from and written to disk, which results in I/O overhead
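The per-iteration disk traffic can be made concrete with a toy loop. The sketch below is not Hadoop code: it assumes a trivial iterative task (halving every value) and uses a local temporary JSON file as a stand-in for the distributed file system, counting one read and one write per iteration. An in-memory loop is included only as a baseline for comparison; all function names are hypothetical.

```python
import json
import os
import tempfile


def iterate_via_disk(values, iterations):
    """Halve every value per iteration, writing and re-reading the state each time."""
    disk_ops = 0
    fd, path = tempfile.mkstemp(suffix=".json")
    os.close(fd)
    try:
        with open(path, "w") as f:
            json.dump(values, f)  # initial input placed on disk (not counted)
        for _ in range(iterations):
            with open(path) as f:
                state = json.load(f)  # read the previous iteration's output
            disk_ops += 1
            state = [v / 2 for v in state]
            with open(path, "w") as f:
                json.dump(state, f)  # write this iteration's output
            disk_ops += 1
        with open(path) as f:
            return json.load(f), disk_ops
    finally:
        os.remove(path)


def iterate_in_memory(values, iterations):
    """Same computation, but the state never leaves memory."""
    state = list(values)
    for _ in range(iterations):
        state = [v / 2 for v in state]
    return state


disk_result, disk_ops = iterate_via_disk([8.0, 4.0], iterations=3)
memory_result = iterate_in_memory([8.0, 4.0], iterations=3)
```

Both versions produce the same values, but the disk-backed loop performs two file operations per iteration, so its I/O cost grows linearly with the number of iterations.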
Spark: Berkeley Data Analysis Stack I
Fig-7: Berkeley data analysis stack [9]
Spark: Berkeley Data Analysis Stack II
- Tachyon [10]: similar to HDFS, but makes more aggressive use of memory and supports caching of frequently used files
- Mesos [11]: similar to YARN
- Spark [12]: similar to MapReduce, but supports in-memory processing
- Spark wrappers: similar to MapReduce wrappers
    - Spark Streaming (large-scale real-time stream processing) [12]
    - BlinkDB (queries with bounded errors and bounded response times on very large data) [13]
    - GraphX (resilient distributed graph system on Spark) [14]
    - MLbase (distributed machine learning library based on Spark) [15]
Spark: Limitations of Spark (BDAS)
- Most of the components are still under development
- Comparatively less support is available
Vertical Scaling Platforms: HPC Clusters
- Also known as blades or supercomputers
- Contain thousands of cores, with varying disk organization, cache, communication mechanisms, etc.
- MPI is generally used as the communication scheme
- The major drawback is limited scalability
Vertical Scaling Platforms: Multicore CPU
Fig-8: General architecture of multicore CPU
- Multithreading [16] is used to parallelize tasks on the CPU
- The major drawbacks are the limited number of cores and the dependency on system memory for data access, which is limited to a few gigabytes in size
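The decomposition behind multithreaded processing can be sketched as follows. This is a minimal illustration, assuming a sum-of-squares task split into one chunk per worker thread with Python's standard `concurrent.futures.ThreadPoolExecutor`; the helper names `chunked` and `parallel_sum_of_squares` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def chunked(data, n_chunks):
    """Split data into at most n_chunks contiguous slices of near-equal size."""
    size = max(1, -(-len(data) // n_chunks))  # ceiling division, at least 1
    return [data[i:i + size] for i in range(0, len(data), size)]


def sum_of_squares(chunk):
    """Work done by a single thread on its own chunk."""
    return sum(x * x for x in chunk)


def parallel_sum_of_squares(data, n_workers=4):
    """Give each worker thread one chunk and combine the partial results."""
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(sum_of_squares, chunked(data, n_workers)))
    return sum(partials)


total = parallel_sum_of_squares(list(range(1, 101)))
```

In standard CPython builds the global interpreter lock prevents threads from speeding up pure-Python arithmetic like this, so the sketch shows the split-and-combine pattern only; the whole input must also fit in system memory, which is the limitation named on this slide.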
Vertical Scaling Platforms: GPU
Fig-9: GPGPU architecture [17]
- The major drawback of a GPU is its limited memory, approximately 12 GB per GPU
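A quick calculation shows what the memory limit means for a large dataset. The sketch below is back-of-the-envelope arithmetic only: it assumes 4-byte (float32) values, the 12 GB figure from this slide, and a hypothetical `usable_fraction` of 0.8 to leave room for intermediate buffers. The function name `batches_needed` is also hypothetical.

```python
import math


def batches_needed(n_rows, n_cols, bytes_per_value=4, gpu_memory_gb=12, usable_fraction=0.8):
    """Return how many batches are needed so that each batch fits in GPU memory."""
    dataset_bytes = n_rows * n_cols * bytes_per_value
    # usable_fraction is an assumption of this sketch, not a hardware constant.
    usable_bytes = gpu_memory_gb * 1024**3 * usable_fraction
    return math.ceil(dataset_bytes / usable_bytes)


# 1 billion rows x 10 float32 columns = 4 * 10**10 bytes of raw values
n_batches = batches_needed(n_rows=1_000_000_000, n_cols=10)
```

Under these assumptions the dataset has to be streamed through the GPU in four batches, each of which must be transferred from host memory before it can be processed.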
Vertical Scaling Platforms: FPGA
- FPGAs are custom-built, highly specialized hardware components for specific applications [18]
- A hardware description language (HDL) [19] is used to program such components
- Applications of FPGAs include industrial control [20], educational tools [21], and network security [22]
- The major drawback is the very high development cost
Comparison of Big Data Analytics Platforms
Table-2: Comparison of big data analytics platforms [1]
Columns:
    Scaling type
    Platform
    System/Platform: Scalability, Data I/O performance, Fault tolerance
    Application/Algorithm: Real-time processing, Data size supported, Iterative task support
Rows:
    Horizontal scaling: Peer-to-peer Networks, Apache Hadoop, Spark
    Vertical scaling: HPC Clusters, Multicore CPU, GPU, FPGA
Conclusion and Future Scope
- Various big data analytics platforms have been discussed in detail: peer-to-peer networks, Apache Hadoop, Spark, HPC clusters, multicore CPUs, GPUs, and FPGAs
- These platforms have been compared with respect to scalability, data I/O performance, fault tolerance, real-time processing, data size supported, and iterative task support
- This qualitative analysis may help in choosing an appropriate platform for big data analytics
- As we are going to work on unstructured big data analytics, this comparison will be very helpful to us
- In future, we are going to survey various privacy-preserving techniques for unstructured data
References
[1] D. Singh and C. K. Reddy, "A survey on platforms for big data analytics," Journal of Big Data, vol. 2, no. 1, p. 8, Oct. 2014. [Online]. Available: http://www.journalofbigdata.com/content/2/1/8 [Accessed: 02-Mar-2015].
[2] K. Shvachko, H. Kuang, S. Radia, and R. Chansler, "The Hadoop distributed file system," in Proceedings of the 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST), ser. MSST '10. Washington, DC, USA: IEEE Computer Society, 2010, pp. 1-10.
[3] V. K. Vavilapalli, A. C. Murthy, C. Douglas, S. Agarwal, M. Konar, R. Evans, T. Graves, J. Lowe, H. Shah, S. Seth, B. Saha, C. Curino, O. O'Malley, S. Radia, B. Reed, and E. Baldeschwieler, "Apache Hadoop YARN: Yet Another Resource Negotiator," in Proceedings of the 4th Annual Symposium on Cloud Computing, ser. SOCC '13. New York, NY, USA: ACM, 2013, pp. 5:1-5:16.
[4] J. Dean and S. Ghemawat, "MapReduce: Simplified data processing on large clusters," Communications of the ACM, vol. 51, no. 1, pp. 107-113, Jan. 2008.
[5] C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins, "Pig Latin: A not-so-foreign language for data processing," in Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, ser. SIGMOD '08. New York, NY, USA: ACM, 2008, pp. 1099-1110.
[6] A. Thusoo, J. S. Sarma, N. Jain, Z. Shao, P. Chakka, S. Anthony, H. Liu, P. Wyckoff, and R. Murthy, "Hive: A warehousing solution over a map-reduce framework," Proc. VLDB Endow., vol. 2, no. 2, pp. 1626-1629, Aug. 2009.
[7] Y. Yu, M. Isard, D. Fetterly, M. Budiu, U. Erlingsson, P. K. Gunda, and J. Currey, "DryadLINQ: A system for general-purpose distributed data-parallel computing using a high-level language," in Proceedings of the 8th USENIX Conference on Operating Systems Design and Implementation, ser. OSDI '08. Berkeley, CA, USA: USENIX Association, 2008, pp. 1-14.
[8] S. Owen, R. Anil, T. Dunning, and E. Friedman, Mahout in Action. Greenwich, CT, USA: Manning Publications Co., 2011.
[9] "Berkeley data analysis stack." [Online]. Available: https://amplab.cs.berkeley.edu/software/ [Accessed: 11-Mar-2015].
[10] "Tachyon." [Online]. Available: http://tachyon-project.org/ [Accessed: 11-Mar-2015].
[11] "Mesos." [Online]. Available: http://mesos.apache.org/ [Accessed: 11-Mar-2015].
[12] "Spark: Lightning-fast cluster computing." [Online]. Available: https://spark.apache.org/ [Accessed: 03-Mar-2015].
[13] S. Agarwal, B. Mozafari, A. Panda, H. Milner, S. Madden, and I. Stoica, "BlinkDB: Queries with bounded errors and bounded response times on very large data," in Proceedings of the 8th ACM European Conference on Computer Systems, ser. EuroSys '13. New York, NY, USA: ACM, 2013, pp. 29-42.
[14] R. S. Xin, J. E. Gonzalez, M. J. Franklin, and I. Stoica, "GraphX: A resilient distributed graph system on Spark," in First International Workshop on Graph Data Management Experiences and Systems, ser. GRADES '13. New York, NY, USA: ACM, 2013, pp. 2:1-2:6.
[15] T. Kraska, A. Talwalkar, J. C. Duchi, R. Griffith, M. J. Franklin, and M. I. Jordan, "MLbase: A distributed machine-learning system," in Conference on Innovative Data Systems Research (CIDR), 2013.
[16] D. M. Tullsen, S. J. Eggers, and H. M. Levy, "Simultaneous multithreading: Maximizing on-chip parallelism," in 25 Years of the International Symposia on Computer Architecture (Selected Papers), ser. ISCA '98. New York, NY, USA: ACM, 1998, pp. 533-544.
[17] "GPGPU architecture." [Online]. Available: https://www.usenix.org/legacy/event/hotpar09/tech/full_papers/kaldeway/kaldeway_html/img2.png [Accessed: 11-Mar-2015].
[18] S. Brown, R. Francis, J. Rose, and Z. Vranesic, Field-Programmable Gate Arrays, ser. VLSI, computer architecture and digital signal processing. Springer US, 1992.
[19] D. Thomas and P. Moorby, The Verilog® Hardware Description Language. Springer London, Limited, 2008.
[20] E. Monmasson, L. Idkhajine, M. Cirstea, I. Bahri, A. Tisan, and M. Naouar, "FPGAs in industrial control applications," Industrial Informatics, IEEE Transactions on, vol. 7, no. 2, pp. 224-243, May 2011.
[21] D. Bouldin, "Impacting education using FPGAs," in Parallel and Distributed Processing Symposium, 2004. Proceedings. 18th International, April 2004, pp. 142-147.
[22] H. Chen, Y. Chen, and D. Summerville, "A survey on the application of FPGAs for network infrastructure security," Communications Surveys & Tutorials, IEEE, vol. 13, no. 4, pp. 541-561, Fourth Quarter 2011.
Thank You