WA2192 Introduction to Big Data and NoSQL


Web Age Solutions Inc.
USA: 1-877-517-6540
Canada: 1-866-206-4644
Web: http://www.webagesolutions.com

The following terms are trademarks of other companies:

Java and all Java-based trademarks and logos are trademarks or registered trademarks of Sun Microsystems, Inc. in the United States, other countries, or both.
Microsoft, Windows, Windows NT, and the Windows logo are trademarks of Microsoft Corporation in the United States, other countries, or both.
UNIX is a registered trademark of The Open Group in the United States and other countries.
Linux is a trademark of Linus Torvalds in the United States, other countries, or both.
IBM, WebSphere, DB2 and Tivoli are trademarks of the International Business Machines Corporation in the United States, other countries, or both.
Other company, product, and service names may be trademarks or service marks of others.

For customizations of this book or other sales inquiries, please contact us at:
USA: 1-877-517-6540, email: getinfousa@webagesolutions.com
Canada: 1-866-206-4644 toll free, email: getinfo@webagesolutions.com

Copyright 2013 Web Age Solutions Inc.

This publication is protected by the copyright laws of Canada, United States and any other country where this book is sold. Unauthorized use of this material, including but not limited to, reproduction of the whole or part of the content, re-sale or transmission through fax, photocopy or e-mail is prohibited. To obtain authorization for any such activities, please write to:

Web Age Solutions Inc.
439 University Ave
Suite 820
Toronto
Ontario, M5G 1Y8

Table of Contents

Chapter 1 - Defining Big Data
1.1 Transforming Data into Business Information
1.2 Gartner's Definition of Big Data
1.3 More Definitions of Big Data
1.4 Challenges Posed by Big Data
1.5 The Cloud and Big Data
1.6 The Business Value of Big Data
1.7 Big Data: Hype or Reality?
1.8 Big Data Quiz
1.9 Big Data Quiz Answers
1.10 Summary

Chapter 2 - NoSQL and Big Data Systems Overview
2.1 Limitations of Relational Databases
2.2 What are NoSQL (Not Only SQL) Databases?
2.3 NoSQL Past and Present
2.4 NoSQL Database Properties
2.5 NoSQL Benefits
2.6 NoSQL Database Storage Types
2.7 The CAP Theorem
2.8 Limitations of NoSQL Databases
2.9 Big Data Sharding
2.10 Sharding Example
2.11 Amazon S3
2.12 Amazon Storage SLAs
2.13 Amazon Glacier
2.14 Amazon S3 Security
2.15 Data Lifecycle Management with Amazon S3
2.16 Amazon S3 Cost Monitoring
2.17 OpenStack
2.18 Object Store (Swift)
2.19 Components of Swift
2.20 Google BigTable
2.21 BigTable-based Applications
2.22 BigTable Design
2.23 Google App Engine
2.24 Google App Engine Billing
2.25 Google Cloud Storage
2.26 Hadoop
2.27 Hadoop's Core Components
2.28 Hadoop Distributed File System
2.29 Accessing HDFS
2.30 HBase
2.31 HBase design
2.32 MemcacheDB
2.33 MongoDB
2.34 MongoDB Operational Intelligence
2.35 MongoDB Use Cases
2.36 Quiz
2.37 Quiz Answers
2.38 Summary

Chapter 3 - Big Data Business Intelligence and Analytics
3.1 Comparison with other systems
3.2 NoSQL Data Querying and Processing
3.3 MapReduce programming model
3.4 Example of Map & Reduce Operations using JavaScript
3.5 Analyzing Big Data with Hadoop
3.6 Hadoop's MapReduce
3.7 Hadoop Streaming
3.8 Making things simpler with Hadoop Pig Latin
3.9 Example of a Pig Script in Batch Mode
3.10 Amazon Elastic MapReduce
3.11 Big Data in Google App Engine
3.12 Example of Google AppEngine Java Datastore API
3.13 MongoDB Data Model
3.14 MongoDB Query Language (QL)
3.15 The find and findOne methods
3.16 A MongoDB QL Example
3.17 What is Hive
3.18 Interfacing with Hive
3.19 Business analytics with Hive
3.20 The UnQL Specification
3.21 Quiz
3.22 Quiz Answers
3.23 Summary

Chapter 4 - Big Data Real World Case Studies
4.1 Hadoop @ Yahoo
4.2 Yahoo for Hadoop
4.3 Yahoo!!
4.4 Big Data @ Facebook
4.5 Hive @ Facebook
4.6 Mailtrust (Rackspace's mail division)
4.7 Summary

Chapter 5 - Adopting NoSQL
5.1 Hype Cycle and Technology Adoption Model
5.2 Barriers to Adoption
5.3 Dismantling Barriers to Adoption
5.4 Use Cases for NoSQL Database Systems
5.5 Example Applications
5.6 Industry trends
5.7 Enterprise Big Data / NoSQL Offerings
5.8 NoSQL Technology Adoption Action Plan
5.9 Summary

Chapter 1 - Defining Big Data

Objectives

In this chapter, participants will learn about:
- Big Data definitions
- Challenges posed by Big Data
- How businesses can benefit from Big Data

1.1 Transforming Data into Business Information

- The success of an organization is predicated on its ability to convert raw data from various sources into useful business information
- As a rule, the more data is available, the more information can be harvested from it
- The amount of information that can be obtained from raw data generally grows with the volume of the raw data (larger input data sets tend to yield more useful information)
- Nowadays, data can be easily acquired, but it normally comes in unstructured forms
- In many instances, the [useful information]/[information noise] ratio in data sets is very low
- The quality of information harvested from the data depends on the sophistication of the data processing algorithm
- In many respects, getting business information is similar to extracting gold from ore
- OLAP and Data Mining systems (deployed in data warehouses) are the traditional tools used by organizations for extracting business intelligence from data

Notes

A person of average lifespan, literacy and cultural exposure processes about 650 million words. Ian Pearson, of British Telecom, estimated that over an 80-year lifespan we process 10 terabytes of data.

Source: The World As Information: Overload and Personal Design, by Robert D. Abbott

1.2 Gartner's Definition of Big Data

- Gartner analyst Doug Laney defined three dimensions of data growth challenges: increasing volume (amount of data), velocity (speed of data in and out), and variety (range of data types and sources)
- In 2012, Gartner updated its definition as follows: "Big data are high volume, high velocity, and/or high variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization."
- Volume: Data sizes accumulated in many organizations come to hundreds of terabytes, approaching the petabyte level
- Variety: Big Data comes in different formats, as well as unformatted (unstructured), and in various types such as text, audio, voice, VoIP, images, video, e-mails, web traffic log file entries, sensor byte streams, etc.
- Velocity: A high-traffic online banking web site can generate hundreds of TPS (transactions per second), each of which may need to be subjected to fraud detection analysis in real or near-real time

Figure source: http://www.amazon.com/understanding-big-data-analytics-enterprise/dp/0071790535
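To give the Volume and Velocity figures above a sense of scale, the following minimal Python sketch converts a transaction rate into a daily data volume. The rate of 500 TPS and the record size of 2,000 bytes are assumptions chosen for illustration (the text only says "hundreds of TPS"); the function names are hypothetical.

```python
SECONDS_PER_DAY = 24 * 60 * 60
GB = 10**9   # decimal gigabyte, in bytes
PB = 10**15  # decimal petabyte, in bytes


def daily_volume_bytes(tps, bytes_per_txn):
    """Bytes accumulated per day at a steady transaction rate."""
    return tps * SECONDS_PER_DAY * bytes_per_txn


def days_to_reach(target_bytes, tps, bytes_per_txn):
    """Days needed to accumulate target_bytes at that steady rate."""
    return target_bytes / daily_volume_bytes(tps, bytes_per_txn)


tps = 500              # assumed value for "hundreds of TPS"
bytes_per_txn = 2_000  # assumed average size of one stored transaction record

per_day = daily_volume_bytes(tps, bytes_per_txn)
print(per_day / GB, "GB per day")
print(round(days_to_reach(PB, tps, bytes_per_txn)), "days to reach 1 PB")
```

Changing the two assumed inputs shows how strongly the accumulated volume depends on record size and arrival rate.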

1.3 More Definitions of Big Data

- There are different definitions of what Big Data is; however, one attribute of Big Data seems to be more representative than others:
  - The data gets mystically morphed into the Big Data category when traditional systems and tools (e.g. databases, OLAP and data-mining systems used in data marts or warehouses) either become prohibitively expensive for handling the exponential growth of data volumes or are found unsuitable for the job
- Big Data is stored electronically and lends itself to machine-oriented processing
- Processing of Big Data requires new approaches and tooling support
- NoSQL (Not Only SQL) databases have appeared, in part, to address the challenges posed by Big Data

Notes

In some instances, Big Data sets may be seen as sparsely populated matrices or N-dimensional cubes with no rigid schema. A key-value (KV) data set is an example of schema-less data. KV data sets consist of an array of key-value pairs where each key is the name of an attribute (similar to a column name in relational databases) pointing to the actual data. This kind of data does not always lend itself to processing using conventional database systems.

1.4 Challenges Posed by Big Data

- Traditional relational database technologies are not very well suited to accommodate the volume, variety and velocity characteristics of Big Data, in part, due to:
  - An underlying rigid data model
  - The database server being deployed, for the most part, on a single node, with a limited number of options for vertical and horizontal scalability to accommodate over-capacity volumes
  - Databases being a poor choice for the elastic provisioning of computing power required to handle rapid spikes in data volumes and throughput without increasing response time
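The rigid data model named above can be contrasted with the schema-less key-value data sets described in the notes to section 1.3. The following is a minimal Python sketch, not taken from the course material: the column names, the `fits_schema` helper and the sample records are all hypothetical.

```python
# Hypothetical fixed schema: every row must supply exactly these columns.
CUSTOMER_COLUMNS = ("id", "name", "email")


def fits_schema(record, columns=CUSTOMER_COLUMNS):
    """Return True only if the record has exactly the declared columns."""
    return set(record) == set(columns)


# Schema-less key-value records: each record carries its own attribute names.
records = [
    {"id": 1, "name": "Ann", "email": "ann@example.com"},
    {"id": 2, "name": "Bob", "twitter": "@bob", "last_login_ip": "10.0.0.7"},
    {"id": 3, "clicks": [17, 42]},
]

# Records that would load into the rigid table as-is, and those that would not.
accepted = [r["id"] for r in records if fits_schema(r)]
rejected = [r["id"] for r in records if not fits_schema(r)]

# Viewed as one table over all attribute names, the data is a sparse matrix:
# many cells are empty because each record uses only some of the attributes.
all_keys = sorted({key for r in records for key in r})
filled_cells = sum(len(r) for r in records)
sparsity = 1 - filled_cells / (len(all_keys) * len(records))

print("accepted:", accepted, "rejected:", rejected)
print("attributes seen:", all_keys)
print("share of empty cells:", sparsity)
```

The key-value form stores only the attributes each record actually has, whereas forcing the same records into one fixed table would require either rejecting them or leaving a large share of cells empty.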

- There is a growing number of use cases for real-time data processing (lightweight analytics is often sufficient)
- It is no longer enough to just capture, store and process Big Data using batch-oriented analytics in an offline environment (the "data-at-rest" processing paradigm)
- Applications are required to provide real-time, in-place data analysis without moving the data to a warehouse (the "data-in-motion" processing paradigm)
- Many organizations are faced with a pile-up of unprocessed data that has the potential to aid their business in making informed tactical and strategic decisions

Notes

In response to the introduction of XML, many database vendors introduced a special XML column type for storing XML documents in their databases. Things are always changing, and now there is a new lightweight data-interchange format called JSON (JavaScript Object Notation) that is very popular with Web 2.0-style dynamic web sites. Are vendors now going to introduce a new column type to support the JSON format? The jury is still out on this one.

A database schema must be defined using a DDL (Data Description Language) during the database logical design phase; changes in the schema may require recreating tables with the new structure.

An example of a system that provides real-time, in-place data analysis without moving the data to a warehouse is the IBM InfoSphere Streams computing framework, which enables "continuous and extremely fast analysis of massive volumes of information-in-motion to help improve business insights and decision making".

1.5 The Cloud and Big Data

- Gone are the days when only large corporations could afford storing massive data sets
- Physical storage capacity is increasing while the cost of data storage goes down
  - Commodity hard drives now have capacities of over 1 TB (a million million [10^12] bytes)
- Still, on-premise physical storage constitutes a significant factor in the Total Cost of Ownership (TCO) for organizations
- Cloud vendors offer services for storing Big Data sets (Swift from Rackspace and OpenStack, S3 from Amazon, HRD from Google App Engine, etc.). If required, in-place processing capabilities are also available
- In cases when data security / confidentiality is a concern, the data to be stored and processed in the Cloud first needs to be sanitized or encrypted before uploading

Notes

Cloud storage refers to any type of data storage that resides in the Cloud, including: services that provide database-like functionality; unstructured data services (file storage of digital media, for example); data synchronization services; or Network Attached Storage (NAS) services. Data services are often consumed in a pay-as-you-go model or, in this case, a pay-per-GB model (including both stored and transferred data).

Cloud storage offers a number of benefits, such as the ability to store and retrieve large amounts of data in any location at any time. Data storage services are fast, inexpensive, and almost infinitely scalable; however, reliability can be an issue, as even the best services do sometimes fail. Transaction support is also an issue with cloud-based storage systems, a significant problem that needs to be addressed for storage services to be widely used in the enterprise.

Source: http://msdn.microsoft.com
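The pay-per-GB model mentioned in the notes above, which bills for both stored and transferred data, can be sketched as a simple calculation. This is an illustrative Python sketch only: the `monthly_cost` function and both rates are hypothetical and do not reflect any vendor's price list.

```python
GB = 10**9   # decimal gigabyte, in bytes
TB = 10**12  # "a million million" bytes, as in the text


def monthly_cost(stored_bytes, transferred_bytes,
                 storage_rate_per_gb, transfer_rate_per_gb):
    """Pay-per-GB bill: stored data plus transferred data, each at its own rate."""
    stored_gb = stored_bytes / GB
    transferred_gb = transferred_bytes / GB
    return stored_gb * storage_rate_per_gb + transferred_gb * transfer_rate_per_gb


# Hypothetical rates, chosen only for illustration.
STORAGE_RATE = 0.10   # currency units per GB stored per month
TRANSFER_RATE = 0.12  # currency units per GB transferred

# Example workload: 5 TB kept in storage, 500 GB transferred in the month.
bill = monthly_cost(5 * TB, 500 * GB, STORAGE_RATE, TRANSFER_RATE)
print(round(bill, 2))
```

Because the bill scales linearly with both quantities, an estimate like this can be compared against the on-premise storage share of TCO discussed above.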

1.6 The Business Value of Big Data

- Most organizations use just a fraction of the data available to them, as it is either too expensive to process or the business lacks the expertise to extract the relevant information
- Businesses that effectively leverage Big Data (that was originally discarded or not processed due to technology limitations) gain an advantage over their competitors
- Insights from Big Data help improve services and products, develop deeper customer relationships in a more agile and predictive manner, and uncover new monetization opportunities
- Since the storage cost of Big Data is in many cases not an issue, businesses may ask their IT departments to extend the retention period of some data feeds and come up with usage ideas later on
- Specialized Big Data solutions can offer real or near real-time analytics
- Overall, with Big Data, business agility is achieved
  - New features can be incorporated into applications quickly and easily

1.7 Big Data: Hype or Reality?

In its report "Hype Cycle for Cloud Computing, 2012", Gartner predicts that "Big Data will deliver transformational benefits to enterprises within two to five years, and by 2015 will enable enterprises adopting this technology to outperform competitors by 20% in every available financial metric."

In the same report, Gartner places Big Data near the Peak of Inflated Expectations in the hype cycle, which can be defined as a phase that generates high amounts of enthusiasm and unrealistic expectations (i.e. what most people would call a buzzword).

Source: http://www.rackspace.com

1.8 Big Data Quiz

1. What are the three main characteristics of Big Data?
2. Name any one limitation of relational databases.
3. What is the difference between "data-at-rest" and "data-in-motion" processing?

1.9 Big Data Quiz Answers

1. Volume, Variety and Velocity (V^3)
2. Rigid data model (as prescribed by a DDL)
3. "Data-at-rest" is a batch-oriented process running in offline settings, while "data-in-motion" refers to real-time, in-place data processing and analysis

1.10 Summary

- Nowadays, information can be easily acquired, but making effective use of it beyond what can be achieved with traditional technologies requires introducing new concepts, re-thinking the usefulness of data and getting new tooling support
- Organizations are faced with a growing amount of unprocessed data that can and should be used more intelligently
- Businesses that have found ways to take advantage of Big Data are ahead of the competition
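The difference between "data-at-rest" and "data-in-motion" processing from quiz answer 3 can be made concrete with a small sketch. The Python code below is an illustration under stated assumptions, not a description of any product: the record layout, the `sample_feed` generator and the fixed amount threshold (standing in for a real fraud-detection rule) are all hypothetical.

```python
from typing import Iterable, Iterator, List, Tuple

Txn = Tuple[str, float]  # (account, amount); a hypothetical record layout


def batch_flag(stored: List[Txn], limit: float) -> List[Txn]:
    """Data-at-rest: the whole data set is stored first, then scanned offline."""
    return [txn for txn in stored if txn[1] > limit]


def stream_flag(feed: Iterable[Txn], limit: float) -> Iterator[Txn]:
    """Data-in-motion: each record is checked as it arrives; none are stored."""
    for txn in feed:
        if txn[1] > limit:
            yield txn  # the alert is available before later records arrive


def sample_feed() -> Iterator[Txn]:
    """Stand-in for a live transaction feed."""
    yield ("acct-1", 40.0)
    yield ("acct-2", 9500.0)
    yield ("acct-1", 12.5)
    yield ("acct-3", 7200.0)


LIMIT = 5000.0  # toy threshold standing in for a real fraud-detection rule

# Batch: collect everything first, analyze afterwards.
batch_result = batch_flag(list(sample_feed()), LIMIT)

# Streaming: analyze while consuming; the first alert needs only two records.
stream_result = list(stream_flag(sample_feed(), LIMIT))
first_alert = next(stream_flag(sample_feed(), LIMIT))

print(batch_result == stream_result, first_alert)
```

Both approaches flag the same records; the difference is when the result becomes available and whether the full data set has to be stored before analysis can start.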