WA2192 Introduction to Big Data and NoSQL
Web Age Solutions Inc.
USA: 1-877-517-6540
Canada: 1-866-206-4644
Web: http://www.webagesolutions.com
The following terms are trademarks of other companies: Java and all Java-based trademarks and logos are trademarks or registered trademarks of Sun Microsystems, Inc. in the United States, other countries, or both. Microsoft, Windows, Windows NT, and the Windows logo are trademarks of Microsoft Corporation in the United States, other countries, or both. UNIX is a registered trademark of The Open Group in the United States and other countries. Linux is a trademark of Linus Torvalds in the United States, other countries, or both. IBM, WebSphere, DB2 and Tivoli are trademarks of the International Business Machines Corporation in the United States, other countries, or both. Other company, product, and service names may be trademarks or service marks of others.

For customizations of this book or other sales inquiries, please contact us at:
USA: 1-877-517-6540, email: getinfousa@webagesolutions.com
Canada: 1-866-206-4644 toll free, email: getinfo@webagesolutions.com

Copyright 2013 Web Age Solutions Inc. This publication is protected by the copyright laws of Canada, the United States and any other country where this book is sold. Unauthorized use of this material, including but not limited to, reproduction of the whole or part of the content, re-sale or transmission through fax, photocopy or e-mail is prohibited. To obtain authorization for any such activities, please write to:
Web Age Solutions Inc.
439 University Ave Suite 820
Toronto Ontario, M5G 1Y8
Table of Contents

Chapter 1 - Defining Big Data...7
1.1 Transforming Data into Business Information...7
1.2 Gartner's Definition of Big Data...8
1.3 More Definitions of Big Data...9
1.4 Challenges Posed by Big Data...9
1.5 The Cloud and Big Data...11
1.6 The Business Value of Big Data...12
1.7 Big Data: Hype or Reality?...12
1.8 Big Data Quiz...12
1.9 Big Data Quiz Answers...13
1.10 Summary...13

Chapter 2 - NoSQL and Big Data Systems Overview...15
2.1 Limitations of Relational Databases...15
2.2 What are NoSQL (Not Only SQL) Databases?...16
2.3 NoSQL Past and Present...17
2.4 NoSQL Database Properties...17
2.5 NoSQL Benefits...18
2.6 NoSQL Database Storage Types...19
2.7 The CAP Theorem...20
2.8 Limitations of NoSQL Databases...21
2.9 Big Data Sharding...22
2.10 Sharding Example...22
2.11 Amazon S3...23
2.12 Amazon Storage SLAs...24
2.13 Amazon Glacier...24
2.14 Amazon S3 Security...25
2.15 Data Lifecycle Management with Amazon S3...26
2.16 Amazon S3 Cost Monitoring...26
2.17 OpenStack...27
2.18 Object Store (Swift)...27
2.19 Components of Swift...28
2.20 Google BigTable...29
2.21 BigTable-based Applications...30
2.22 BigTable Design...30
2.23 Google App Engine...32
2.24 Google App Engine Billing...32
2.25 Google Cloud Storage...33
2.26 Hadoop...33
2.27 Hadoop's Core Components...34
2.28 Hadoop Distributed File System...35
2.29 Accessing HDFS...37
2.30 HBase...37
2.31 HBase Design...38
2.32 MemcacheDB...38
2.33 MongoDB...39
2.34 MongoDB Operational Intelligence...41
2.35 MongoDB Use Cases...41
2.36 Quiz...42
2.37 Quiz Answers...42
2.38 Summary...42

Chapter 3 - Big Data Business Intelligence and Analytics...43
3.1 Comparison with Other Systems...43
3.2 NoSQL Data Querying and Processing...44
3.3 MapReduce Programming Model...45
3.4 Example of Map & Reduce Operations using JavaScript...46
3.5 Analyzing Big Data with Hadoop...47
3.6 Hadoop's MapReduce...47
3.7 Hadoop Streaming...48
3.8 Making Things Simpler with Hadoop Pig Latin...49
3.9 Example of a Pig Script in Batch Mode...50
3.10 Amazon Elastic MapReduce...50
3.11 Big Data in Google App Engine...51
3.12 Example of Google App Engine Java Datastore API...53
3.13 MongoDB Data Model...53
3.14 MongoDB Query Language (QL)...54
3.15 The find and findOne Methods...55
3.16 A MongoDB QL Example...56
3.17 What is Hive...56
3.18 Interfacing with Hive...57
3.19 Business Analytics with Hive...58
3.20 The UnQL Specification...58
3.21 Quiz...59
3.22 Quiz Answers...59
3.23 Summary...60

Chapter 4 - Big Data Real World Case Studies...61
4.1 Hadoop @ Yahoo...61
4.2 Yahoo for Hadoop...62
4.3 Yahoo!!...63
4.4 Big Data @ Facebook...63
4.5 Hive @ Facebook...64
4.6 Mailtrust (Rackspace's Mail Division)...65
4.7 Summary...65

Chapter 5 - Adopting NoSQL...67
5.1 Hype Cycle and Technology Adoption Model...67
5.2 Barriers to Adoption...68
5.3 Dismantling Barriers to Adoption...68
5.4 Use Cases for NoSQL Database Systems...70
5.5 Example Applications...70
5.6 Industry Trends...71
5.7 Enterprise Big Data / NoSQL Offerings...72
5.8 NoSQL Technology Adoption Action Plan...73
5.9 Summary...74
Chapter 1 - Defining Big Data

Objectives

In this chapter, participants will learn about:
- Big Data definitions
- Challenges posed by Big Data
- How businesses can benefit from Big Data

1.1 Transforming Data into Business Information

- The success of an organization is predicated on its ability to convert raw data from various sources into useful business information
- As a rule, the amount of information that can be harvested is in direct proportion to the volume of the raw data: increasing the size of the input data sets yields a larger amount of useful information
- Nowadays, data can be easily acquired, but it normally comes in unstructured forms
- In many instances, the [useful information]/[information noise] ratio in data sets is very low
- The quality of the information harvested from the data depends on the sophistication of the data processing algorithm
- In many respects, extracting business information is similar to extracting gold from ore
- OLAP and data mining systems (deployed in data warehouses) are the traditional tools used by organizations for extracting business intelligence from data

Notes

A person of average lifespan, literacy and cultural exposure processes about 650 million words. Ian Pearson, of British Telecom, estimated that over an 80-year lifespan we process 10 terabytes of data. Source: The World As Information: Overload and Personal Design, by Robert D. Abbott
1.2 Gartner's Definition of Big Data

- Gartner analyst Doug Laney defined three dimensions to data growth challenges: increasing volume (amount of data), velocity (speed of data in and out), and variety (range of data types and sources)
- In 2012, Gartner updated its definition as follows: "Big data are high volume, high velocity, and/or high variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization."
- Volume: data sizes accumulated in many organizations come to hundreds of terabytes, approaching petabyte levels
- Variety: Big Data comes in different formats, as well as unformatted (unstructured), and in various types such as text, audio, voice, VoIP, images, video, e-mails, web traffic log file entries, sensor byte streams, etc.
- Velocity: a high-traffic on-line banking web site can generate hundreds of TPS (transactions per second), each of which may need to be subjected to fraud detection analysis in real or near-real time

Figure source: http://www.amazon.com/understanding-big-data-analytics-enterprise/dp/0071790535
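The Variety dimension above can be made concrete with a small sketch. The records below are hypothetical (field names and values are illustrative only): a web log entry, a sensor reading and a social-media post share almost no common structure, which is exactly why a single rigid schema fits Big Data poorly.

```python
# Three hypothetical records of different "varieties" -- no shared schema.
records = [
    {"type": "weblog", "ip": "10.0.0.1", "url": "/home", "status": 200},
    {"type": "sensor", "device": "t-17", "celsius": 21.5},
    {"type": "post", "user": "alice", "text": "hello", "likes": 3},
]

# The union of all attribute names across the data set is much larger
# than any single record's attribute set -- the "sparse matrix" view.
all_fields = set().union(*(r.keys() for r in records))
print("distinct attributes across the set:", len(all_fields))
print("attributes in any one record (max):", max(len(r) for r in records))
```

A relational table would need a column for every attribute that appears anywhere in the set, leaving most cells empty; schema-less storage keeps only the attributes each record actually has.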
1.3 More Definitions of Big Data

- There are different definitions of what Big Data is; however, one attribute of Big Data seems to be more representative than others: data gets morphed into the Big Data category when traditional systems and tools (e.g. databases, OLAP and data-mining systems used in data marts or warehouses) either become prohibitively expensive for handling the exponential growth of data volumes or are found unsuitable for the job
- Big Data is stored electronically and lends itself to machine-oriented processing
- Processing of Big Data requires new approaches and tooling support
- NoSQL (Not Only SQL) databases have appeared, in part, to address the challenges posed by Big Data

Notes

In some instances, Big Data sets may be seen as sparsely populated matrices or N-dimensional cubes with no rigid schema. A key-value (KV) data set is an example of schema-less data. KV data sets consist of an array of key-value pairs where each key is the name of an attribute (akin to a column name in relational databases) pointing to the actual data. This kind of data does not always lend itself to processing using conventional database systems.

1.4 Challenges Posed by Big Data

- Traditional relational database technologies are not very well suited to accommodate the volume, variety and velocity characteristics of Big Data, in part, due to:
  - The underlying rigid data model
  - The database server, for the most part, being deployed on a single node, with a limited number of options for both vertical and horizontal scalability to accommodate over-capacity volumes
  - Databases being a poor choice for the elastic provisioning of computing power required for handling rapid spikes in data volumes and throughput without increasing response time
- There is a growing number of use cases for real-time data processing (lightweight analytics is often sufficient)
- It is no longer enough to just capture, store and process Big Data using batch-oriented analytics in an offline environment (the "data-at-rest" processing paradigm)
- Applications are required to provide real-time, in-place data analysis without moving the data to a warehouse (the "data-in-motion" processing paradigm)
- Many organizations are faced with a pile-up of unprocessed data that has the potential to aid their business in making informed tactical and strategic decisions

Notes

In response to the introduction of the XML data type, many database vendors introduced a special XML column type for storing XML documents in their databases. Things keep changing, and now there is a new lightweight data-interchange format called JSON (JavaScript Object Notation), very popular with Web 2.0-style dynamic web sites. Are vendors now going to introduce a new column type to support the JSON format? The jury is still out on this one.

A database schema must be defined using a DDL (Data Definition Language) during the database logical design phase; changes in the schema require recreating tables with the new structure.

An example of a system that provides real-time in-place data analysis without moving the data to a warehouse is the IBM InfoSphere Streams computing framework, which enables "continuous and extremely fast analysis of massive volumes of information-in-motion to help improve business insights
and decision making".

1.5 The Cloud and Big Data

- Gone are the days when only large corporations could afford storing massive data sets
- Physical storage capacity is increasing while the cost of data storage goes down
- Commodity hard drives now have capacities of over 1 TB (a million million, or 10^12, bytes)
- Still, on-premise physical storage constitutes a significant factor in the Total Cost of Ownership (TCO) for organizations
- Cloud vendors offer services for storing Big Data sets (Swift from Rackspace and OpenStack, S3 from Amazon, HRD from Google App Engine, etc.). If required, in-place processing capabilities are also available
- In cases where data security / confidentiality is a concern, data to be stored and processed in the Cloud first needs to be sanitized or encrypted before uploading

Notes

Cloud storage refers to any type of data storage that resides in the Cloud, including: services that provide database-like functionality; unstructured data services (file storage of digital media, for example); data synchronization services; or Network Attached Storage (NAS) services. Data services are often consumed in a pay-as-you-go model or, in this case, a pay-per-GB model (including both stored and transferred data). Cloud storage offers a number of benefits, such as the ability to store and retrieve large amounts of data in any location at any time. Data storage services are fast, inexpensive, and almost infinitely scalable; however, reliability can be an issue, as even the best services do sometimes fail. Transaction support is also an issue with cloud-based storage systems, a significant problem that needs to be addressed for storage services to be widely used in the enterprise. Source: http://msdn.microsoft.com
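The pay-per-GB model mentioned in the notes above lends itself to simple back-of-the-envelope estimates. The sketch below uses entirely hypothetical rates (they are not any vendor's actual pricing) just to show how both stored and transferred data enter the bill:

```python
# Hypothetical pay-as-you-go rates -- illustrative placeholders only,
# not real pricing from Amazon, Rackspace, Google or anyone else.
STORAGE_RATE_PER_GB = 0.10   # per GB-month stored
TRANSFER_RATE_PER_GB = 0.12  # per GB transferred out

def monthly_cost(stored_gb, transferred_gb):
    """Bill covers both data at rest and data transferred out."""
    return stored_gb * STORAGE_RATE_PER_GB + transferred_gb * TRANSFER_RATE_PER_GB

# Example: 5 TB at rest plus 200 GB of outbound transfer in one month.
print(round(monthly_cost(5 * 1024, 200), 2))
```

Even a rough model like this helps compare a pay-per-GB cloud bill against the fixed TCO of on-premise storage.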
1.6 The Business Value of Big Data

- Most organizations use just a fraction of the data available to them, as it is either too expensive to process or the business lacks the expertise to extract the relevant information
- Businesses that effectively leverage Big Data (data that was originally discarded or left unprocessed due to technology limitations) gain a competitive advantage over their competitors
- Insights from Big Data help improve services and products, develop deeper customer relationships in a more agile and predictive manner, and uncover new monetization opportunities
- Since the storage cost of Big Data is in many cases not an issue, businesses may ask their IT departments to extend the retention period of some data feeds and come up with usage ideas later on
- Specialized Big Data solutions can offer real or near-real-time analytics
- Overall, with Big Data, business agility is achieved: new features can be incorporated into applications quickly and easily

1.7 Big Data: Hype or Reality?

In its report "Hype Cycle for Cloud Computing, 2012", Gartner predicts that "Big Data will deliver transformational benefits to enterprises within two to five years, and by 2015 will enable enterprises adopting this technology to outperform competitors by 20% in every available financial metric."

In the same report, Gartner places Big Data near the Peak of Inflated Expectations in the hype cycle, which can be defined as a phase that generates high amounts of enthusiasm and unrealistic expectations (i.e. what most people would call a buzzword).

Source: http://www.rackspace.com

1.8 Big Data Quiz

1. What are the three main characteristics of Big Data?
2. Name any one limitation of relational databases

3. What is the difference between "data-at-rest" and "data-in-motion" processing?

1.9 Big Data Quiz Answers

1. Volume, Variety and Velocity (V3)

2. A rigid data model (as prescribed by a DDL)

3. "Data-at-rest" processing is a batch-oriented process running in offline settings, while "data-in-motion" refers to real-time, in-place data processing and analysis

1.10 Summary

- Nowadays, information can be easily acquired, but making effective use of it beyond what can be achieved with traditional technologies requires the introduction of new concepts, a re-thinking of the usefulness of data, and new tooling support
- Organizations are faced with a growing amount of unprocessed data that can and should be used more intelligently
- Businesses that have found ways to take advantage of Big Data are ahead of the competition
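The data-at-rest versus data-in-motion distinction from quiz answer 3 can be sketched in a few lines. This is a toy illustration, not a streaming framework: the point is only that a batch job computes its answer after the data has been collected, while in-motion processing keeps the answer current as each record arrives.

```python
# Data-at-rest: batch analytics over an already-stored data set.
transactions = [120.0, 80.0, 300.0, 45.0]  # hypothetical amounts
batch_total = sum(transactions)  # computed offline, after collection

# Data-in-motion: the same aggregate maintained in-place as each
# record arrives, so the result is always up to date.
running_total = 0.0
for amount in [120.0, 80.0, 300.0, 45.0]:  # records arriving one by one
    running_total += amount  # analysis happens at arrival time

# Same answer, different timing: batch answers late, streaming answers now.
print(batch_total, running_total)
```

Real data-in-motion systems (e.g. the InfoSphere Streams framework quoted earlier) apply this idea at massive scale, running continuous analytics without first loading the data into a warehouse.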