Big Data, Why All the Buzz? (Abridged) Anita Luthra, February 20, 2014
Defining Big Data - Not Just Massive Data "Big data refers to data sets whose size is beyond the ability of typical database software tools to capture, store, manage and analyze." - The McKinsey Global Institute, 2011. This data is more than just large; it is also non-traditional and needs to be handled differently. Big Data is about adopting new technologies that enable the storage, processing, and analysis of data that was previously ignored. [12, pg. 19]
Dark Data & Big Data Gartner defines dark data as the information assets that organizations collect, process, and store in the course of their regular business activity but generally fail to use for other purposes. Hadoop clusters and NoSQL databases can process large volumes of data, which makes it feasible to incorporate long-neglected information into big data analytics applications and unlock its business value. Edmunds.com put a Hadoop-based data warehouse into production in February, which has accelerated the process of mining dark data and opened up new views of data that are helping the company reduce operating costs, said Paddy Hannon, VP of architecture at Edmunds in Santa Monica, California.
Characteristics of Big Data
Defining Data - Volume The size of the data. Big data comes in one size: large, or rather, massive. In 1986, the world's technological capacity to receive information through one-way broadcast networks was 0.432 zettabytes. In 2016, Internet traffic is expected to reach 1.3 zettabytes. (From Wikipedia)
Defining Data - Velocity How fast data is being generated. Big data must be used as it streams into the enterprise to maximize its value to the business. Velocity typically considers how quickly the data arrives, is stored, and can be retrieved. Think of this as data in motion, or the speed at which the data is flowing. Examples: 1. Number of tweets per hour worldwide. 2. Traffic sensors in Los Angeles during rush hour, or international airplane sensors/signals while planes are in flight. 3. Twitter processes 400,000,000 tweets per day, or over 4,500 tweets per second.
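The per-second figure in example 3 follows directly from the daily count; a quick back-of-the-envelope check:

```python
# Back-of-the-envelope check of the velocity figures above.
tweets_per_day = 400_000_000
seconds_per_day = 24 * 60 * 60  # 86,400

tweets_per_second = tweets_per_day / seconds_per_day
print(f"{tweets_per_second:,.0f} tweets/second")  # roughly 4,630
```

which is indeed "over 4,500 tweets per second."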
Describing Big Data - Variety The variation of data types, including source, format, and structure. Big data extends beyond structured data to unstructured data of all varieties: text, audio, video, click streams, and log files. Example: banking draws on the many types of transactions occurring around the world every minute: iPhone, phone, in person, computer, terminal, teller, and more.
Defining Data - Veracity The trustworthiness of the data. Big data arrives with varying levels of quality, accuracy, and uncertainty, so its reliability must be assessed before it can be acted on.
SQL Databases & NoSQL Traditional OLAP/OLTP limitations: 1. A SQL database needs to know in advance what will be stored. 2. The Agile development approach doesn't work well: each time new features are added, the database schema requires changes. 3. If the database is large, that schema-change process is slow. 4. Rapid iterations and frequent data changes result in frequent downtime.
NoSQL Advantages 1. NoSQL databases allow insertion of data without a predefined schema. 2. Real-time application changes are easier, resulting in faster development. 3. Code integration is more reliable, and less database administration is needed. 4. NoSQL encompasses a variety of database technologies, developed in response to the volume of data, the frequency with which it is accessed, and performance and processing needs.
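The schema-on-read idea behind points 1 and 2 can be sketched in plain Python (a toy document store for illustration, not any particular NoSQL product): documents with different fields coexist in one collection, and the "schema" is whatever each document carries.

```python
# Toy document store: a collection is just a list of dicts, so no
# predefined schema is needed before inserting data.
collection = []

def insert(doc):
    collection.append(doc)

def find(**criteria):
    # Return every document whose fields match all the criteria.
    return [d for d in collection
            if all(d.get(k) == v for k, v in criteria.items())]

# Two documents with different shapes share one collection; adding
# the new "segment" field later requires no ALTER TABLE step.
insert({"make": "Honda", "model": "Civic", "year": 2013})
insert({"make": "Tesla", "model": "Model S", "segment": "luxury"})

print(find(make="Tesla"))  # [{'make': 'Tesla', 'model': 'Model S', 'segment': 'luxury'}]
```

In a SQL database, the second document would have forced a schema migration before the insert could succeed.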
Sample NoSQL Databases by DB Type
Hadoop & Big Data The term Hadoop is often considered synonymous with the term Big Data. So, what is Hadoop? Hadoop is open-source software from the Apache Software Foundation for storing and processing large non-relational data sets via a large, reliable, scalable distributed computing model. Commercial Hadoop distributions are available from companies such as Hortonworks and Cloudera. [4]
Key Hadoop Components
Elements of Hadoop Hadoop is a framework made of a variety of components that allows for the distributed processing of large data sets across a fault-tolerant cluster of servers. 1. Hadoop Common: part of the core Hadoop project; includes the utilities that support the other Hadoop modules. 2. Hadoop Distributed File System (HDFS): a distributed file system that provides high-throughput access to application data. 3. Hadoop YARN: a framework for job scheduling and cluster resource management. 4. Hadoop MapReduce: a YARN-based interface for parallel processing of large data sets. See more at: http://www.cioinsight.com/it-news-trends/slideshows/hadoopadoption-proves-slow-but-steady-08/
Chief Advantages of Hadoop and MapReduce 1. Potentially lower costs than analytical databases, with more scalability, reduced processing time, and higher performance. 2. It's open source. Although this implies free, it's not entirely free, because you might want to pay for support; still, it's a lower-cost alternative. 3. There is no database license. Hadoop and other open-source big data implementations offer a less expensive alternative to traditional, proprietary data warehouses.
Chief Advantages of Hadoop and MapReduce - II Improved scalability over analytic databases: 1. Hadoop can handle very large amounts of data because you can use 10, 50, or 100 machines to do the processing; the infrastructure around it handles the parallelism. 2. Relatively simple routines can be written for mapping and reduction. The infrastructure takes responsibility for scheduling the jobs on each of the 100 machines and making sure that all 100 complete successfully; if one fails, its work is redistributed to the other machines.
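The division of labor described above, where the user writes only simple map and reduce routines while the framework handles distribution, can be imitated on a single machine with Python's multiprocessing module (a local sketch of the idea, not Hadoop itself):

```python
from collections import Counter
from functools import reduce
from multiprocessing import Pool

def map_words(line):
    # The user-written "map" routine: one input record -> partial counts.
    return Counter(line.lower().split())

def reduce_counts(a, b):
    # The user-written "reduce" routine: merge two partial results.
    return a + b

if __name__ == "__main__":
    lines = ["big data big buzz", "data in motion", "big clusters"]
    # The Pool plays the role of the cluster scheduler, farming the map
    # calls out to worker processes and collecting their results.
    with Pool(2) as pool:
        partials = pool.map(map_words, lines)
    totals = reduce(reduce_counts, partials)
    print(totals["big"])  # 3
```

Swapping `Pool(2)` for 100 machines changes the infrastructure, not the two user-written routines; that separation is what makes the scaling described above possible.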
When Not To Use Hadoop
When to Use Big Data Tooling Users want to interact with their data in terms of totality, exploration, and frequency. Totality refers to the increased desire to process and analyze all available data, rather than analyzing a sample and extrapolating the results. However: Apache Hadoop does not replace the data warehouse, and NoSQL databases do not replace transactional relational databases. Neither do MapReduce, streaming analytics, or Hive (Apache's data warehousing application, used to query Hadoop data stores).
Gartner Prediction for Big Data By 2015, Big Data demand will reach 4.4 million jobs globally, but only one-third of those jobs will be filled. Gartner says the demand for Big Data is growing, and enterprises will need to reassess their competencies and skills to respond to this opportunity. Jobs that are filled will result in real financial and competitive benefits for organizations. An important aspect of the challenge in filling these jobs is that enterprises need people with new skills (data management, analytics, and business expertise) and the non-traditional skills necessary for extracting the value of Big Data, as well as artists and designers for data visualization. [3]
Gartner Predictions for Big Data - II By 2016, wearable smart electronics in shoes, tattoos, and accessories will emerge as a $10 billion industry. Gartner claims the majority of revenue from wearable smart electronics over the next few years will come from athletic shoes and fitness tracking, communications devices for the ear, and automatic insulin delivery for diabetics. By 2017, 40 percent of enterprise contact information will have leaked into Facebook via employees' increased use of mobile device collaboration applications, even though many organizations have been legitimately concerned about the coexistence of consumer and enterprise applications on devices that interact with IT infrastructure. [3]
The Hadoop Project & Components Hadoop delivers a highly available service on top of a cluster of computers, each of which may be prone to failures. The project includes the following modules: 1. Hadoop Common: common utilities that support the other Hadoop modules. 2. Hadoop Distributed File System (HDFS): a distributed file system that provides high-throughput access to application data. 3. Hadoop YARN: a framework for job scheduling and cluster resource management. 4. Hadoop MapReduce: a core Hadoop analytics component using a YARN-based system for parallel processing of large data sets. Very complex analytics that are hard to express in SQL can be easy to do in MapReduce.
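The MapReduce programming model mentioned above boils down to two small user-supplied functions plus a grouping step between them. A minimal word-count in the MapReduce style might look like this (a sketch of the model only; in real Hadoop the two functions would run on separate cluster nodes):

```python
from itertools import groupby

def mapper(lines):
    # Map phase: emit a (word, 1) pair for every word in the input.
    for line in lines:
        for word in line.lower().split():
            yield word, 1

def reducer(pairs):
    # Reduce phase: pairs arrive grouped by key; sum each group's counts.
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)

# Sorting by key stands in for Hadoop's shuffle-and-sort step, which
# routes all pairs with the same key to a single reducer.
lines = ["hadoop stores data", "hadoop processes data"]
shuffled = sorted(mapper(lines))
print(dict(reducer(shuffled)))  # {'data': 2, 'hadoop': 2, 'processes': 1, 'stores': 1}
```

The framework owns everything between the two functions: partitioning the input, shuffling intermediate pairs, and retrying failed tasks.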
Hadoop 1.0 vs. 2.0
Overview of Apache Hadoop-Related Projects 1. Ambari: a web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters. It includes support for Hadoop HDFS, Hadoop MapReduce, Hive, HCatalog, HBase, ZooKeeper, Oozie, Pig, and Sqoop. Ambari also provides a dashboard for viewing cluster health, such as heat maps, and can display MapReduce, Pig, and Hive applications visually, with a user interface for diagnosing performance characteristics. 2. Avro: a data serialization system. http://avro.apache.org 3. Cassandra: a scalable multi-master database with no single point of failure. http://cassandra.apache.org 4. Chukwa: a data collection system for managing large distributed systems.
Overview of Apache Hadoop-Related Projects - II 6. HBase: a scalable, distributed database that supports structured data storage for large tables. 7. Hive: a data warehouse infrastructure that provides data summarization and ad hoc querying. Runs on the MapReduce framework of Platform Symphony. 8. Mahout: a scalable machine learning and data mining library. 9. Pig: a high-level data-flow language and execution framework for parallel computation. Runs on the MapReduce framework of Platform Symphony.
Overview of Apache Hadoop-Related Projects - III 11. Spark: a fast, general compute engine for Hadoop data. Spark provides a simple and expressive programming model that supports a wide range of applications, including ETL, machine learning, stream processing, and graph computation. 12. Oozie: the scheduler used to run and manage jobs. 13. Fair Scheduler: used for basic management of job submission. 14. Flume: a distributed, reliable, and highly available service for efficiently moving large amounts of data around a cluster. http://flume.apache.org 15. HCatalog: a table and storage management service for Hadoop.
Tooling for Big Data - Top 16 Platforms Source: Information Week Jan. 30, 2014
References
1. Zikopoulos, Paul C., Eaton, Chris, et al. Understanding Big Data: Analytics for Enterprise Class Hadoop and Streaming Data. McGraw-Hill, 2012.
2. Kobielus, James G. The Forrester Wave: Enterprise Hadoop Solutions, Q1 2012.
3. Rijmenam, Mark van. 7 Big Data Trends for 2014, December 27, 2013. http://smartdatacollective.com/bigdatastartups/174741/seven-big-data-trends-2014
9. Fowler, Martin. Introduction to NoSQL. http://www.youtube.com/watch?v=qi_g07c_q5i
12. Zikopoulos, Paul, et al. Harness the Power of Big Data: The IBM Big Data Platform. McGraw-Hill, 2013.
13. IBM whitepaper: Wrangling Big Data: Fundamentals of Data Lifecycle Management.
15. McDonald, Keith. Hadoop Architecture. http://www.youtube.com/watch?v=yewlbxj3rv8
16. MapR Academy. Intro to MapReduce. http://www.youtube.com/watch?v=hfplubebhcm
17. How Big Is a Petabyte, Exabyte, Zettabyte, or a Yottabyte? http://highscalability.com/blog/2012/9/11/how-big-is-a-petabyte-exabyte-zettabyte-or-a-yottabyte.html
Other Reading 1. Hadoop -- http://hadoop.apache.org 2. Avro -- http://avro.apache.org 3. Flume -- http://flume.apache.org 4. HBase -- http://hbase.apache.org 5. Hive -- http://hive.apache.org 6. Lucene -- http://lucene.apache.org 7. Oozie -- http://oozie.apache.org 8. Pig -- http://pig.apache.org 9. ZooKeeper -- http://zookeeper.apache.org