Problems of Storing, Transferring, and Processing Big Data
Course: Computing Clusters, Grids, and Clouds
Lecturer: Andrey Shevel
ITMO University, Saint Petersburg
6/2/2016
GIANG TRAN - TTTGIANG2510@GMAIL.COM

Outline
1. Introduction
2. Why big data
3. Big data characteristics
4. Big data problems
   - Storing
   - Transferring
   - Processing
5. Conclusion

1. Introduction
- Big data is defined as data sets so large and complex that traditional database management concepts and tools are inadequate.
- Big data is generated by multiple sources, such as social media, systems, sensors, and mobile devices, at an alarming velocity, volume, and variety.
- Big data combines structured, semi-structured, and unstructured data, both homogeneous and heterogeneous.

2. Why big data
- A massive amount of data is generated from various sources every day. For example, Facebook processes 500 TB of data daily.
- 80% of the world's data is unstructured.
- Companies use data analytics for competitive advantage:
1. Faster and better decision making
2. Understand customers
3. Optimize business processes
4. Prevent threats and fraud
5. Capitalize on new sources of revenue

3. Big data characteristics

4. Problems - Storage
- Current data management technologies cannot satisfy the needs of big data, and storage capacity is growing much more slowly than the volume of data.

Data set/domain                                  Description
Large Hadron Collider/Particle Physics (CERN)    13-15 petabytes in 2010
Internet Communications (Cisco)                  667 exabytes in 2013
Social Media                                     12+ terabytes of tweets every day and growing; 144 retweets per tweet on average
Human Digital Universe                           7.9 zettabytes in 2015
Others                                           RFIDs, smart electric meters, 4.6 billion camera phones with GPS

4. Problems - Storage
- Big data is heterogeneous.
- Existing algorithms cannot store big data effectively.
- How should the data be reorganized?

4. Problems - Storage
The crucial requirements of big data storage:
1. It can handle very large amounts of data and keep scaling to keep up with data growth.
2. It can provide the input/output operations per second (IOPS) necessary to deliver data to analytics tools.
- Hyperscale computing environments for big data storage
- Hadoop and Cassandra as analytics engines

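
To make the IOPS requirement concrete, the sketch below relates IOPS to delivered throughput under the simplifying assumption that throughput is roughly IOPS times block size. The function names and the sample figures (10,000 IOPS, 64 KiB blocks, a 1 TB data set) are illustrative assumptions, not values from the slides.

```python
def throughput_bytes_per_s(iops: float, block_size_bytes: int) -> float:
    """Approximate sustained throughput as IOPS x block size (an idealization)."""
    return iops * block_size_bytes


def scan_time_s(dataset_bytes: float, iops: float, block_size_bytes: int) -> float:
    """Time to read the whole data set once at that approximate throughput."""
    return dataset_bytes / throughput_bytes_per_s(iops, block_size_bytes)


if __name__ == "__main__":
    KIB = 1024
    TB = 10**12
    # Hypothetical storage system: 10,000 IOPS with 64 KiB blocks.
    seconds = scan_time_s(1 * TB, iops=10_000, block_size_bytes=64 * KIB)
    print(f"Full scan of 1 TB: {seconds / 60:.1f} minutes")
```

Real systems fall short of this bound because of caching effects, queue depth, and mixed read/write patterns, so the result is best read as a lower limit on scan time.
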
4. Problems - Transfer
Conventional methods of transferring data:
- Transfer over the network using TCP-based methods (FTP, HTTP)
- Use a physical storage medium
Current communication networks are unsuitable for such massive volumes of data.

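
A back-of-the-envelope calculation shows why network transfer becomes a bottleneck. The sketch below estimates how long a transfer takes on an ideal link. The 500 TB figure is the daily Facebook volume quoted earlier in the slides. The 10 Gbit/s link speed and the `efficiency` parameter are assumptions chosen for illustration.

```python
def transfer_time_s(data_bytes: float, link_bits_per_s: float,
                    efficiency: float = 1.0) -> float:
    """Seconds needed to move data_bytes over a link.

    efficiency is the assumed fraction of the nominal bandwidth that is
    actually usable (1.0 means a perfect link with no protocol overhead).
    """
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    return data_bytes * 8 / (link_bits_per_s * efficiency)


if __name__ == "__main__":
    TB = 10**12
    GBIT = 10**9
    seconds = transfer_time_s(500 * TB, 10 * GBIT)
    print(f"500 TB over 10 Gbit/s: {seconds / 86_400:.2f} days")
```

Under these assumptions, one day of data takes more than four days to move, so the backlog grows faster than the link can drain it.
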
4. Problems - Transfer
Solutions:
1. Process the data in place and transmit only the resulting information.
2. Perform triage on the data and transmit only the data that is critical to downstream analysis.
3. Use the parallel transmission techniques used on the internet.
4. Use the NICE model for big data transfers.

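
The first two solutions can be sketched as follows. The example generates synthetic sensor readings, then compares the serialized size of the raw data with the size of a locally computed summary (processing in place) and with the size of a filtered subset (triage). The record layout, the threshold, and all function names are hypothetical and serve only to illustrate the idea.

```python
import json
import random
import statistics


def make_readings(n: int, seed: int = 0) -> list:
    """Generate synthetic sensor readings (stand-in for data held at the source)."""
    rng = random.Random(seed)
    return [{"sensor": i % 4, "value": rng.gauss(20.0, 2.0)} for i in range(n)]


def summarize(readings: list) -> dict:
    """Process in place: reduce the raw readings to a few aggregate numbers."""
    values = [r["value"] for r in readings]
    return {
        "count": len(values),
        "mean": statistics.fmean(values),
        "min": min(values),
        "max": max(values),
    }


def triage(readings: list, threshold: float) -> list:
    """Keep only readings above a threshold, assumed critical downstream."""
    return [r for r in readings if r["value"] > threshold]


def wire_size(obj) -> int:
    """Bytes needed to send obj as UTF-8 encoded JSON."""
    return len(json.dumps(obj).encode("utf-8"))


if __name__ == "__main__":
    readings = make_readings(100_000)
    print("raw bytes:    ", wire_size(readings))
    print("summary bytes:", wire_size(summarize(readings)))
    print("triage bytes: ", wire_size(triage(readings, threshold=26.0)))
```

Both approaches trade completeness for bandwidth: the receiver can no longer answer questions that need the discarded raw records.
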
4. Problems - Processing
- Extracting real-time information from a large stream of data remains difficult.
- Traditional serial algorithms are inefficient for big data.
- Processing big data requires extensive parallel processing and new analytics algorithms to provide timely and actionable information:
  - Application parallelization
  - Divide-and-conquer approach

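
The divide-and-conquer approach can be illustrated with a small word count: split the input into chunks, count each chunk independently, then merge the partial results. This is a minimal sketch, not a production design. The function names are hypothetical, and a thread pool is used only to keep the example self-contained. For CPU-bound work in Python, a process pool or a framework such as Hadoop would be the more realistic choice.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def split(items: list, n_chunks: int) -> list:
    """Divide: cut the input into at most n_chunks contiguous pieces."""
    size = max(1, -(-len(items) // n_chunks))  # ceiling division
    return [items[i:i + size] for i in range(0, len(items), size)]


def count_words(lines: list) -> Counter:
    """Conquer: count the words in one chunk, independently of the others."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def parallel_word_count(lines: list, workers: int = 4) -> Counter:
    """Run count_words on every chunk in a pool, then merge the partial counts."""
    chunks = split(lines, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(count_words, chunks))
    total = Counter()
    for partial in partials:
        total.update(partial)  # Combine: add the partial counts together
    return total


if __name__ == "__main__":
    lines = ["big data big", "data stream", "big stream"]
    print(parallel_word_count(lines, workers=2))
```

The merge step works because word counts are additive, so the chunks can be processed in any order and on any machine.
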
5. Conclusion
- Big data is not a new concept, but it is very challenging.
- The problems seem solvable in the near term, but they present long-term challenges that require a lot of research.
- Big data calls for a scalable storage index and a distributed approach to retrieve the required results in near real time.
