Beginner's Guide to Big Data Analytics

Introduction

Big Data: what do these two words really mean? Yes, everyone is talking about it, but frankly, not many really understand what the hype is all about. This book by Jigsaw Academy aims to give you an understanding of Big Data and what makes data big, while also explaining in simple language the challenges of Big Data, the emerging technologies and the Big Data landscape. Finally, we talk about careers in Big Data and what the future could hold in store for the industry. This book is also a useful companion for those of you enrolled in Jigsaw's course Big Data Analytics Using Hadoop and R. You can use this book to complement your learning and better understand Big Data. Please note the blue boxes in every chapter, which link the content in the chapter to the modules covered in the course. Enjoy the book. Big Data Team, Jigsaw (led by Team Lead Kiran P.V.)

Outline

1. What is Big Data
2. What Makes Data Big
3. Challenges of Big Data Processing
4. Big Data Technologies
5. Big Data and Analytics
6. Unstructured Data and Text Analytics
7. Big Data in Action
8. Big Data Landscape
9. Big Data Career Paths
10. Big Data in the Future
11. Learn more about Big Data

CHAPTER 01
What is Big Data?

"I don't know how the cost of hard disks has decreased so rapidly. These days one can buy a terabyte hard drive for just $100," a friend told me a couple of years ago. It's hard not to agree with him, and a quick review of historical facts validated his opinion. In the 1990s, the cost per gigabyte of hard disk was around $10,000; now a gigabyte can be purchased for only $0.10. The price has dropped 100,000 times over a span of 20 years. Currently we are even seeing a few gigabytes of storage being offered free of cost by email service providers and file hosting services. For personal accounts, Gmail offers about 15 gigabytes of free storage, whereas the file hosting service Dropbox offers up to 3.5 gigabytes. These limits are higher for business accounts.

One would wonder how enterprises are influenced by the lower cost of storage space. For one, it definitely provides them with more opportunities for storing data around their product and service offerings. Virtually every industry is seeing a tremendous explosion in new data sources and is dependent on advanced storage technologies. Increased adoption of the internet and smartphones has enabled individuals across the globe to leave a huge digital footprint of online data, which many enterprises want. In the past, for example, banks used to store customer data mostly around demographic information tracked from application forms and transaction information tracked from passbooks. These days, the customer data being stored is enormous and varies widely across mobile usage, online transactions, ATM withdrawals, customer feedback, social media comments and credit bureau information. All these new sources of data, which did not exist in the past, can be categorized under the new term Big Data. Big Data can loosely be described as data which is huge, but more importantly Big Data is data that comes from multiple sources rather than just one.

Big Data is definitely one of the more fascinating evolutions of the 21st century in the world of IT. The truth is that Big Data has opened up tremendous opportunities and has provided us with endless solutions to deal with social, economic and business problems. For enterprises, it is a huge untapped source of profit which, if used appropriately, will be the key to staying ahead of their competition. In order to deal with Big Data effectively, they need to depend on advanced database technologies and faster processing capabilities. Just having Big Data is not a sufficient criterion for success; enterprises also need to implement analytics effectively, in order to be able to garner insights that help improve profitability. They should actively pursue the art and science of capturing, analysing and creating value out of Big Data.

The Big Data and Hadoop Overview Module provides pre-class videos and lots of reading material on the importance of Big Data and how it is transforming the way enterprises implement data-based strategies to become more competitive.

CHAPTER 02
What makes data Big?

We live in the era of Big Data, and it is not leaving any industry untouched, be it financial services, consumer goods, e-commerce, transportation, manufacturing or social media. Across all industries, enterprises now have access to an increasing number of both internal and external Big Data sources. Internal sources typically track information around demographics, online or offline transactions, sensors, server logs, website traffic and emails. This list is not exhaustive and varies from industry to industry. External sources, on the other hand, are mostly related to social media information from online discussions, comments and feedback shared about a particular product or service. Another major source of Big Data is machine data, which consists of real-time information from sensors and web logs that monitor customers' online activity. In the coming years, as we continue to develop new ways of generating data either online or offline by leveraging technological advancements, the one safe prediction we can make is this: the growth of Big Data is not going to stop.

Although Big Data is broadly about data being captured from multiple sources and at a larger scale, there are many technical definitions which provide more clarity. The O'Reilly Strata group states that Big Data is "data that exceeds the processing capacity of conventional database systems. The data is too big, moves too fast, or doesn't fit the structures of your database architectures." In simple terms, Big Data needs multiple systems to efficiently handle and process data rather than a single system. Say an online e-commerce enterprise operating in a single region generates about 1,000 gigabytes of data on a daily basis, which can be handled and processed using traditional database systems. On expanding operations to a global level, its daily data generation increases 10,000 times and reaches 10 petabytes (1 petabyte = 1,000,000 gigabytes). To handle this kind of data, traditional database systems do not have the required capabilities, and enterprises need to depend on Big Data technologies such as Hadoop, which uses a distributed computing framework. We will learn more about these technical topics in subsequent chapters.

To further simplify our understanding of Big Data, we can rely on three major characteristics, i.e. volume, variety and velocity, which are more commonly referred to as the 3 V's of Big Data. Occasionally, some resources also talk about a less common characteristic, i.e. veracity, which is referred to as the 4th V of Big Data. Together, these four characteristics provide more detail about the nature, scope and structure of Big Data.

Commonly Big Data is characterized by the 3 V's, and these provide context for a new class of possibilities. You will learn more about how these characteristics help extract more information from massive amounts of data in the Big Data and Hadoop Overview Module.

Volume

Volume deals with the size aspect of Big Data. With technical advancements in global connectivity and social media, the amount of data generated on a daily basis is growing exponentially. Every day, about 30 billion pieces of information are shared globally. An IDC Digital Universe study estimated that global data volume was about 1.8 zettabytes as of 2011 and would grow about 5 times by 2015. A zettabyte is a quantity of information or information storage capacity equal to one billion terabytes, i.e. 1 followed by 21 zeroes of bytes. Across many enterprises, these increasing volumes pose an immediate challenge to traditional database systems with regard to the storing and processing of data.

Variety

Big Data comes from sources such as conversations on social media, media files shared on social networks, online transactions, smartphone usage, climate sensor data, financial market data and many more. The underlying formats of data coming out of these sources vary across Excel sheets, text documents, audio files and server log files, and can broadly be classified as either structured or unstructured data types. Structured data formats typically refer to a defined way of storing data, i.e. clearly marked out rows and columns, whereas unstructured data formats do not have any such order and mostly refer to text, audio and video data. Unstructured formats of data are a more recent phenomenon, and traditional database systems do not possess the required capabilities to process this kind of information.

Velocity

Increased volumes of data have put a lot of stress on the processing abilities of traditional database systems. Enterprises should be able to quickly process incoming data from various sources and then share it with the business to ensure the smooth functioning of day-to-day operations. This quick flow of data within an enterprise is the velocity characteristic of Big Data. Another important aspect is the ability to provide relevant services to the end user on a real-time basis. For example, Amazon provides instant recommendation services depending on the user's search and location. Based on the entered keyword, these services need to search through the entire history of transactions and share relevant results which, hopefully, convert into a potential purchase. The effects of velocity are very similar to those of volume, and enterprises need to rely on advanced processing technologies to efficiently handle Big Data.

Veracity

Though enterprises have access to a lot of Big Data, some aspects of it may be missing or unreliable. Over the years, data quality issues have usually arisen from human entry errors or from individuals withholding information. In the Big Data era, where most data capturing processes are automated, the same issues can occur due to technical lapses arising from system or sensor malfunction. Whatever the reasons, one should be careful to deal with inconsistency in Big Data before using it for any kind of analysis.
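
To make the veracity point slightly more concrete, here is a tiny illustrative check of the kind one might run before any analysis. The sensor readings, field names and valid temperature range below are purely hypothetical, and Python is used only for illustration.

    # Flag records whose readings are missing or clearly inconsistent
    # before they are passed on to any analysis.
    readings = [{"sensor": "S1", "temp": 21.4},
                {"sensor": "S2", "temp": None},     # sensor malfunction
                {"sensor": "S3", "temp": -999.0}]   # obviously invalid value

    suspect = [r for r in readings
               if r["temp"] is None or not (-50.0 <= r["temp"] <= 60.0)]
    print(suspect)   # the records for S2 and S3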

CHAPTER 03
Challenges of Big Data Processing

Just having a big source of data is not enough to become successful; enterprises need to implement relevant processes and systems which will help them extract value out of it. An important aspect here is what one should do with it. In the absence of a business context, data in itself is meaningless and would just occupy space on the storage servers. Also, many Big Data sources tend to have missing or low-content information, as described by the veracity characteristic earlier. The actual power of Big Data surfaces only by applying analytics on top of it, when it is used to generate useful insights to guide future decision making. Irrespective of the size of the data, whether big or small, analytics methodologies need to be implemented to reap benefits. This typically involves cleaning, analysing, interpreting and finally visualizing the hidden patterns that emerge from the data.

Due to the sheer volume, variety and velocity of Big Data, the processing capacity of traditional database systems is strained beyond its limits. So enterprises need to look out for advanced processing technologies and capabilities to effectively manage Big Data and further implement analytics on it. One of the major requirements of any Big Data processing framework is to successfully handle huge amounts of information without compromising on query time. Traditional database systems lack the required infrastructure and internal designs to process Big Data at a scale of petabytes or exabytes. Since these systems tend to operate out of a single machine with a huge hard disk and processing capabilities, there is a set of limitations which comes with them. The first is scalability: with a continuous increase in data volumes, the storage capacity of these systems needs to be continuously increased, and this can be an expensive option. The second is slow querying time, because the storage load is already operating at maximum levels and enterprises cannot wait for days to get their daily reports. These limitations call for alternate approaches based on scalable storage and a distributed approach to scaling.

NoSQL technologies are helping enterprises achieve more than what was possible previously. In the Big Data and Hadoop Overview Module, the evolution and benefits of new technologies such as Hadoop and MongoDB are discussed, along with how these help in overcoming the limitations of traditional IT systems for solving Big Data problems.

Big Data sources are diverse, and inputs for these systems can be in structured or unstructured formats. Since the origins of these data formats are spread across the globe, most of the time they won't have a predefined order and require pre-processing before being used for any analysis. A common use of Big Data processing is to make use of unstructured data, specifically comments on social media, by tracking customer sentiment towards various product and service offerings. Due to their inherently static design, many traditional database systems can handle only structured data and as such do not provide any alternatives for unstructured data. For example, SQL-based database systems depend on schema designs which clearly define the nature of the data being loaded, and are used to process transactional data. Since unstructured data for the most part does not have a proper structure, it would be impossible to handle it in SQL-based systems. Luckily for us, there exist alternatives in the form of NoSQL databases, which can handle both structured and unstructured data formats.

The majority of client applications run by enterprises operate in real time, and instant support on services has become a priority. This requires processing Big Data by the minute in order to provide relevant services to customers. For example, based on a user's search keyword, Google instantly processes information stored in its millions of databases and returns relevant links within a matter of seconds. Similarly, banks need to track global online transactions at any time of the day and update their databases so that the transaction is reflected in a customer's online account immediately. These services require enterprises to have a system which can ensure the fast movement of data without any potential failures. In order to handle this velocity of Big Data, coupled with volume and variety, enterprises need to depend on sophisticated databases which form part of the NoSQL category. These databases relax the limits of the schema-based design of SQL systems and store data in key-value pairs, which makes them capable of handling all 3 V's of Big Data.
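
As a toy illustration of this key-value, schema-less style of storage, consider the snippet below. The customer records are hypothetical; the point is simply that each record is a value stored against a key, and different records are free to carry different fields, with no table schema declared up front.

    # A dictionary standing in for a key-value store: lookups are by key,
    # and each stored value can have its own shape.
    store = {}
    store["cust:1001"] = {"name": "A. Rao", "city": "Bangalore",
                          "last_txn": 2450.0}                       # structured fields
    store["cust:1002"] = {"name": "J. Smith",
                          "tweet": "Loving the new mobile app!"}    # free-form text

    # No schema change was needed to mix the two kinds of record.
    print(store["cust:1002"]["tweet"])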

CHAPTER 04
Big Data Technologies - Overview of Components

With the growing challenges of Big Data and the limitations of traditional data management solutions, enterprises need to leverage new technologies and data warehousing architectures, which have significant IT implications. These technologies vary in functionality, ranging from storing and processing massive volumes of data to performing various analyses on the data at the lowest level of granularity. For example, by integrating unstructured data such as text fields, social media chatter and email documents, enterprises can leverage new sources of data which can reveal new insights about their customers. According to a market forecast report by ABI Research, worldwide IT spending on Big Data technologies exceeded $31 billion in 2013 and is projected to reach $114 billion by 2018.

Most existing Big Data technologies fall under the open source paradigm. These are free to use and can be experimented with by anyone. In the current Big Data technology landscape, there are many open-source tools which can potentially solve any problem, but one should have the right knowledge and niche expertise in order to work efficiently with these technologies. One of the most popular and widely adopted open source Big Data technologies is Apache Hadoop. It is more formally defined as an open-source software framework that supports distributed processing of vast amounts of data across clusters of computers using a simple programming model. Apache Hadoop is considered a cost-effective solution which provides the capability to scale up from single servers to thousands of machines, each offering local computation and storage. In simple terms, it is a cluster of machines interconnected by a network, processing chunks of data at the same time, rather than depending on one single machine, which is time consuming and inefficient, especially in the case of Big Data. A Hadoop cluster can be made up of a single machine or thousands of machines, which are commonly termed nodes.

Let us try to understand this concept using a simple example. Say an apartment complex housing 50 families has a single washing machine catering to its laundry needs. Assuming that on average each family washes 10 clothes per day and the washing machine takes one hour to clean about 50 clothes, the total time taken by the washing machine to meet the entire apartment's needs per day would be 10 hours. Now the apartment manager is considering increasing the capacity to 100 houses, which would definitely put tremendous stress on the washing machine's daily load. In order to deal with this situation, the manager should probably consider buying 4 more washing machines. With a total of 5 machines working together, the entire apartment complex's laundry needs can be managed within 4 hours on any given day. The new solution also allows the families to be more flexible about when they use the washing machines. This example briefly captures the essence of implementing distributed processing solutions using a cluster of machines, rather than depending on one single machine, to meet growing Big Data needs.

Created and named by Doug Cutting after his son's toy elephant, the Hadoop Ecosystem comprises multiple projects which together provide the complete data management solutions needed by an enterprise. Some of the projects of the Hadoop Ecosystem include HDFS, MapReduce, Hive, HBase, Pig and others. Though the evolution of Hadoop dates back to the early 2000s, its mainstream usage picked up momentum only a couple of years ago. Its major advantage is its ability to efficiently manage and process unstructured data. Since about 80% of Big Data consists of unstructured data, it has become a strategic choice for many enterprises to implement Hadoop-based solutions. Let's briefly review some of the key components of the Hadoop Ecosystem.

HDFS (Hadoop Distributed File System)

The two primary components of Apache Hadoop are HDFS, which provides distributed data storage capabilities, and MapReduce, which is a parallel programming framework. To better understand how Hadoop allows scalability, one should understand how HDFS functions. HDFS breaks the data down into smaller pieces called blocks and stores them across the various nodes of a Hadoop cluster. This mechanism enables one to handle Big Data more efficiently and in a cost-effective way by employing low-cost commodity hardware on the nodes of the Hadoop cluster. Unlike relational databases, which depend on defining schemas to store structured data, HDFS puts no restrictions on the type of data and can easily handle unstructured data too. Based on the NoSQL principle, HDFS allows for schema-less storage of data, which makes it popular when it comes to Big Data management.

Apache Hadoop is the most popular IT solution for effectively dealing with Big Data. With the help of the Big Data and Hadoop Overview, Hadoop Data Management and Processing Complex Data using Hadoop modules, you will learn the technical aspects of setting up a Hadoop cluster, its architecture, HDFS, the MapReduce framework and other components using hands-on examples.
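
As a rough, back-of-the-envelope illustration of the block mechanism described above, the short calculation below shows how a single large file would be carved into blocks spread over a cluster. The 128 MB block size and 3x replication factor are common defaults in recent Hadoop versions, but both are configurable and differ across versions; the 10 GB file size is hypothetical.

    import math

    file_size_gb  = 10      # hypothetical input file
    block_size_mb = 128     # assumed HDFS block size
    replication   = 3       # assumed replication factor

    blocks = math.ceil(file_size_gb * 1024 / block_size_mb)
    copies = blocks * replication
    print(f"{blocks} blocks of {block_size_mb} MB, "
          f"{copies} block copies spread across the cluster nodes")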

MapReduce

MapReduce forms the heart of Hadoop. It is a programming model which processes the data stored on the nodes of a Hadoop cluster in a parallel and distributed manner. Typically a MapReduce program consists of two components: the Map() and Reduce() procedures. Both of these phases work on key/value pairs. A key/value pair is a set of two linked data items: a key, which is a unique identifier for some item of data, and the value, which is either the data that is identified or a pointer to the location of that data. These key/value pairs can be, for example, customer identifiers paired with location details, or URLs paired with the number of visits. What goes into the key/value pairs is subjective and depends on the type of problem being solved. The Map() procedure, or job, performs operations such as filtering and sorting, taking individual elements of data and breaking them down into key/value pairs. After execution of the Map() job, Reduce() implements summary functions, and its output is in an aggregated form. Always remember that any MapReduce program involves the execution of the Map() job followed by the Reduce() job, and that the output of the Map() job acts as the input to the Reduce() job.

MapReduce Example

Let's look at a simple example. Assume you have three text documents, and each file contains a specific number of words. Say the first document contains the sentence "I like Hadoop", and all the documents are stored in HDFS. The end objective is to find out the frequency of the words present in all the text documents. For this we need to write Map and Reduce jobs to process the text data and summarize the word distribution. As the Map job executes, the documents are first sent to the mappers, which count each unique word in each document: a list of (key/value) pairs is thus created with the word as key and its count as value. For example, the results produced by one mapper task for the first text document would look like this:

(I,1) (Like,1) (Hadoop,1)

The lists of (key/value) pairs generated by all the mapper tasks are then processed by the reducer, which aggregates the (key/value) pairs from each mapper to finally produce a list of all the words and their summed counts from the three mappers, producing a final result set as follows:

(I,1) (Like,1) (Hadoop,3) (Is,2) (Fun,1) (So,1) (Great,1)
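
To make the word count above concrete, here is a minimal, self-contained sketch of the Map and Reduce steps written in Python. On a real cluster the two functions would run as separate tasks on different nodes (for example through the Hadoop Streaming utility described later in this chapter); here the shuffle step is imitated with a simple sort so the example runs on a single machine. The contents of the second and third documents are assumed so that the counts match the result set shown above.

    from itertools import groupby

    documents = ["I like Hadoop",        # the first document, as above
                 "Hadoop is fun",        # assumed contents of document two
                 "So Hadoop is great"]   # assumed contents of document three

    def map_phase(doc):
        # Map(): emit one (key, value) pair per word, with a count of 1.
        return [(word, 1) for word in doc.split()]

    # Run the mapper over every document and collect the intermediate pairs.
    intermediate = [pair for doc in documents for pair in map_phase(doc)]

    # Shuffle/sort: bring all pairs with the same key (word) together.
    intermediate.sort(key=lambda pair: pair[0])

    def reduce_phase(word, counts):
        # Reduce(): sum the counts emitted for one word.
        return (word, sum(counts))

    results = [reduce_phase(word, [c for _, c in pairs])
               for word, pairs in groupby(intermediate, key=lambda pair: pair[0])]
    print(results)
    # [('Hadoop', 3), ('I', 1), ('So', 1), ('fun', 1), ('great', 1), ('is', 2), ('like', 1)]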

This word count is a simple and straightforward example. Even though a real application would be quite complex and often involves processing millions or billions of rows, the key principle behind a MapReduce execution remains the same.

Java APIs

In order to deal with Hadoop programming at the MapReduce level, one needs to work with the Java APIs. Since the Hadoop framework is developed on the Java platform, MapReduce programming in Java is more native by design. Hadoop developers or analysts should have a fair knowledge of Java concepts to process queries on data stored on the various cluster nodes. Running MapReduce jobs involves installing an Eclipse environment for Hadoop, writing Map and Reduce job scripts, compiling them into a jar file and then executing these jobs on the data stored in HDFS. For those who are averse to Java programming or who do not have a developer background, alternatives exist for Hadoop programming in the form of the Pig, Hive and Hadoop Streaming components. Using the Hadoop Streaming component, it is easier to create and run MapReduce jobs with general programming languages such as Ruby, Python, Perl, C++, R etc.

Pig

Pig comes to the rescue of non-technical professionals and makes it more approachable to work with Big Data on Hadoop clusters. It is a highly interactive, script-based environment for executing MapReduce jobs on the data stored in HDFS. It consists of a data flow language, called Pig Latin, which supports writing MapReduce programs with more ease and less code than the Java APIs. In many ways, the functionality of Pig is very similar to how SQL operates in relational database management systems. It also supports many user-defined functions, which can be embedded and executed along with a Java program.

Hive

Hive enables the connection between the worlds of Hadoop and SQL, and is very beneficial for people with strong SQL skills. Hive is a data warehouse infrastructure built on top of Hadoop that provides data summarization, querying, and analysis capabilities using an SQL-like language called HiveQL. Similar to Pig, Hive functions as an abstraction on top of MapReduce, and queries are converted to a series of MapReduce jobs at the time of execution. Since its internal architecture is very similar to that of relational databases, Hive is used to handle structured data and enables easy integration between Hadoop and other business intelligence tools.

Analyzing Big Data is a key component of any enterprise's IT strategy related to Hadoop. In the Processing Complex Data using Hadoop Module, you will gain a strong command of components such as Hive, Pig and Impala, which enable faster querying and aggregation of data from a Hadoop cluster.

Impala

Impala, similar to Hive, provides an interactive SQL-based query engine for data sitting on Hadoop servers. It is an open-source program for handling and ensuring the availability of large quantities of data. The engine was developed by the Hadoop distribution vendor Cloudera and can currently be accessed under the open source Apache license. As is the case with Hive, Impala supports a widely known SQL-style query language, meaning that users can put their SQL expertise directly to use on Big Data. Based on comparison results published by Cloudera, Impala offers 6 to 69 times faster querying than Hive, making it a first choice among enterprises when it comes to performing Big Data analyses on Hadoop.

Hadoop Streaming

The Hadoop Streaming component is a utility which allows us to write Map and Reduce programs in languages other than Java. It uses UNIX standard streams as the interface between Hadoop and the MapReduce job, and thus any language that supports reading standard input and writing standard output can be used. It supports most programming languages, such as Ruby, Python, Perl and .NET. So when you come across a MapReduce job written in any of these languages, its execution will be handled by the Hadoop Streaming component.

Hadoop Streaming is an essential utility and quite helpful for programmers who prefer programming with Python or R over Java. In the Performing Analytics on Hadoop Module, you will learn about running R scripts for MapReduce jobs through the Hadoop Streaming utility.

HBase

HBase is a column-oriented database within the Hadoop Ecosystem and runs on top of HDFS. Hadoop is a batch-oriented system which involves loading data into HDFS, processing it and then retrieving the results; this kind of operating mechanism is not ideal for tasks involving regular reading and writing of data, which is where HBase fits in. MapReduce programs can read input data from, and write outputs directly to, HBase. Apart from the Java API, Hive and Pig can be used to write MapReduce programs that operate on data sitting in HBase.

Sqoop and Flume

These components enable connectivity between Hadoop and the rest of the data world. Sqoop is a tool which allows the transfer of data between a Hadoop cluster and SQL-based relational databases. Using Sqoop, one can easily import data from external enterprise data warehouses or relational databases and efficiently store it in HDFS or Hive databases. While Sqoop is used to connect with traditional databases, Flume is used to collect and aggregate application data into HDFS. Typically, it is used to collect large amounts of log data from distributed servers. Flume's architecture is based on streaming data flows and can be easily integrated with Hadoop or Hive for analysis of data. Some of the common applications of the Flume component are to transport massive quantities of event data such as web logs, network traffic data, social media data like Twitter feeds, and email messages.

The Hadoop Data Management Module provides a detailed introduction, with hands-on exposure, to the database components of Hadoop such as HBase, Hive, Sqoop and Flume. You will also be able to develop more in-depth knowledge of how to load and query data using these components.

Zookeeper and Oozie

While Hadoop offers a great model for scaling data processing applications through its distributed file system (HDFS), MapReduce and numerous add-ons including Hive and Pig, all this power and distributed processing requires coordination and a smooth workflow, which can be achieved with the Zookeeper and Oozie components. ZooKeeper is a component of the Hadoop ecosystem which enables highly reliable distributed coordination. Within a Hadoop cluster, Zookeeper looks after the synchronization and configuration of nodes and stores information about how these nodes can access the different services relating to MapReduce implementations. Oozie is an open source workflow scheduler system to manage data processing jobs across a Hadoop cluster. It provides mechanisms to schedule the execution of MapReduce jobs based either on time-based criteria or on data availability. It allows for repetitive execution of multi-stage workflows that describe a complete end-to-end process, thus reducing the need for custom coding at each stage.

CHAPTER 05
Big Data and Analytics

So far, we have learned about the various technological and database architectural components that support Big Data management. The real imperative of Big Data lies in the enterprise's ability to derive actionable insights and to create business value. Building capabilities for analysing Big Data provides unique opportunities for enterprises and also puts them ahead of their competition. Also, these analyses can be performed on more detailed and complete data, as compared to traditional analysis, which would be limited only to samples. However, performing analytics on Big Data is a challenging task considering the volumes and complex structures involved. To deal with this, enterprises need to be able to find the right mix of tools, expertise and analytics techniques. Many early adopters of Big Data such as Google, Yahoo, Amazon and eBay are considered to be pioneers in analysing Big Data. For example, eBay launches successful products and services by employing analytics on demographic and behavioural data from their millions of customers. Data used for analysis can come in various forms - user behaviour data, transactional data, and customer service data. Amazon, on the other hand, offers recommendation services on its home page. It leverages Big Data analytics on data relating to a customer's buying history and demographics to identify hidden patterns and provide accurate recommendations for potential new purchases.

The real value of Big Data lies in the insights it can generate. The Processing Complex Data using Hadoop Module provides hands-on techniques and knowledge to analyze Big Data with the help of Hadoop components. In the Performing Analytics on Hadoop Module, you will learn about how analytics tools can be used to run some advanced analyses on data residing on Hadoop.

How eBay leverages Big Data

Online auction giant eBay regularly monitors and analyzes huge amounts of information from its 100 million customers' interactions. It uses this data to conduct experiments on its users in order to maximise selling prices and customer satisfaction. On average, eBay runs about 200 experimental tests at the same time, which range from barely noticeable alterations to the dimensions of product images, right up to complete overhauls of the way content on users' personal eBay home pages is displayed. Their huge customer base creates 12 TB of data per day, from every button they click to every product they buy, which is continually added to an existing pile of Big Data. As the data is queried by automatic monitoring systems and by employees looking to find more meaning in it, data throughput reaches 100 petabytes (102,400 TB) per day. One of the business problems eBay handles is to achieve the highest possible buying price for all items users place for sale, as profits come from a cut of each sale. Its data scientists perform advanced analytics by looking at all the variables in the way items are presented and sold. As one of the solutions to this problem, they began to study the impact of the quality of the picture in a listing on the selling price. They used the Hadoop framework to process petabytes of pictures, due to its capability for handling unstructured data. These pictures were analyzed and re-analyzed, and the data scientists managed to extract some structured information, such as how much the items sold for and how many people viewed them. In the end, they managed to establish a possible relationship and concluded that better image quality actually does fetch a better price.

Analytics Project Framework

Before doing a deep dive into Big Data, the first and most important aspect of any analysis is to identify the business problem. This is a fundamental step even with traditional data analytics projects. Once the business problem is defined, Big Data can be leveraged to search for hidden patterns and obtain valuable insights. Typically, the analytics problems being solved are of the following nature:

- Predicting customer churn behaviour to design reach-out campaigns
- Understanding online and offline marketing impacts on sales
- Identifying whether a transaction is fraudulent or not
- Using customer purchasing patterns to recommend new products
- Forecasting sales for better inventory management

Irrespective of the problem or vertical, the methodology involved in implementing data analytics projects remains the same. The major difference between Big Data analytics projects and traditional data projects is the scale of the data being handled and the combination of required tools; the business problems, analytics techniques and project methodology remain the same and are independent of the data being handled. As part of any analytics project cycle, the processes typically involved are problem definition, data gathering, selecting the right technique, performing the analysis and visualizing the final results. Let us get some more perspective on the various stages involved in implementing an analytics project using a used car price prediction example.

Problem Identification

The first question that needs to be asked in any data analytics project is: what is the problem we are trying to solve? In today's Big Data world, enterprises are performing data analytics on various kinds of business problems. It becomes essential to figure out which problem would create the higher business impact and then focus on it to maximize ROI. In the case of the used car price prediction example, determining the value of a used car based on a variety of characteristics such as make, model and engine would benefit retailers. Such information would help them to better manage the supply and demand flow in a highly price-volatile market. Also, with robust knowledge of price variations by each model type, retailers can target buyers with relevant promotions and targeted discounts.

Gathering Required Data

After identification of the business problem, data needs to be gathered from the various sources which will be useful for the analysis. Based on the problem definition, the data attributes of these datasets can be defined.

For prediction of used car prices, we will require sales data across years which captures information on the type of car sold, the number of years it was in use and the final amount paid by the buyer. Additional data can be captured on the condition and performance of the car related to mileage and internal characteristics such as type, trim, doors and cruise control. These days, with the rapid growth of social media and other data sources, more data can be captured around the brand perception of used cars and insurance claims related to the car, which provides more insight into price variations.

Choosing the Right Analytics Technique

Picking the right technique for any given problem is as critical as finding the right kind of data to begin with. In analytics projects, we often depend on various tools and algorithms to work on different data problems. For example, R is known for its statistical offerings while Python is popular for text data processing. Statistical techniques rely on business context and have specific use cases: clustering algorithms are used for solving customer segmentation problems, time series algorithms are used for forecasting problems, and recommendation algorithms are used to provide insights on more relevant products or services. Before applying any technique, the gathered data needs to undergo a set of data operations, such as data cleansing, data aggregation, data augmentation, data sorting, and data formatting, which are collectively referred to as pre-processing steps. These steps translate the raw data into a fixed data format which is then shared as input to the various algorithms. Since the problem in the used car example is forecasting of price values, regression techniques can be used. At a broad level these deal with predictions of continuous variables like price, income, age, sales etc. Many algorithms can implement regression techniques, such as linear regression, random forests, neural networks and support vector machines, which vary in terms of complexity of implementation and scope for business interpretation. At this point this might sound rather technical, but getting a general idea is what matters here. You will be able to appreciate the underlying concepts more while using these techniques in real projects.

Implementing Analytics Techniques

As discussed in the section above, analytics problems can be solved with the help of a variety of statistical techniques. When it comes to implementation of these techniques, there are a lot of options available in terms of analytics tools, such as SAS, R and Python. SAS is more popular amongst enterprises because of its ease of use, while R and Python are open source tools which have a lot of takers amongst academia and programmers. On average, almost 80% of the time of any analytics project goes into problem identification, data gathering and pre-processing steps, while the remaining 20% is used for implementing the chosen techniques and visualizing the final results. In the case of Big Data, the same algorithms can be translated to MapReduce algorithms for running on Hadoop clusters, which often requires more effort and specialized expertise. In the Hadoop Ecosystem, the Mahout component uses the Java programming language to implement statistical techniques such as classification, recommendation algorithms and others. A minimal sketch of the regression step is shown below.
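
Purely as an illustration of the regression step on a small dataset, the sketch below fits a linear regression for the used car problem in Python, assuming the scikit-learn and pandas libraries are available. The file name and column names are hypothetical placeholders for the data gathered earlier; an equivalent model could just as easily be built in R or SAS, or translated to a distributed setting for genuinely big data.

    import pandas as pd
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import train_test_split

    # Hypothetical file produced by the data-gathering step.
    cars = pd.read_csv("used_car_sales.csv")

    # Pre-processing: keep a few numeric attributes and drop incomplete rows.
    features = ["age_years", "mileage", "doors", "engine_size"]
    cars = cars.dropna(subset=features + ["price"])

    # Hold out 20% of the sales records to check the model on unseen data.
    X_train, X_test, y_train, y_test = train_test_split(
        cars[features], cars["price"], test_size=0.2, random_state=42)

    model = LinearRegression().fit(X_train, y_train)
    print("R^2 on held-out sales:", model.score(X_test, y_test))
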
Depending on the data volumes being gathered for the used car prediction problem, say to implement a linear regression technique, SAS or R can be used for smaller datasets, and Hadoop integrated with R or SAS can be used for Big Data. Another alternative for Big Data would be to make use of the Mahout component, which requires Java expertise.

Visualizing End Results

Data visualization is used for displaying the output of analytics projects. Typically this is the last step of any analytics project, where visualization techniques are used either to validate the technique outcomes or to present end results to a non-technical management team. This can be done with various data visualization software packages such as Tableau and Spotfire, and also with the in-built capabilities of SAS and R. In comparison with SAS, R offers a variety of packages, namely ggplot2 and lattice, for visualization of datasets. After building the linear regression model for used car price prediction, visualization techniques are used to validate the statistical results and to check whether these results satisfy the technique's assumptions. Some of the standard validation checks for a linear regression model are for heteroskedasticity, autocorrelation, and multicollinearity. Plots of this kind are used to visually perform these validation checks on the final model results of the used car price prediction example.

Different kinds of Analytics

By looking at the used car price prediction example, one can understand that irrespective of the domain, the framework for implementing analytics projects remains the same. However, as different enterprises work on solving various business problems, there are different kinds of analytics with domain-specific applications. Some of the common ones are marketing analytics, customer analytics, risk analytics, fraud analytics, human resource analytics and web analytics, which are classified based on different business functions. Marketing analytics in any enterprise revolves around increasing efficiency and maximizing marketing performance through analyses such as marketing mix optimization, marketing mix modeling, price analysis and promotional analysis, to name a few. On the other hand, customer analytics deals with understanding customer behaviour and increasing loyalty using analyses like customer segmentation, attrition analysis and lifetime value analysis. Another common classification is based on the complexity level of the analytics techniques being implemented and is independent of the domain. These kinds are broadly classified under the basic analytics and advanced analytics categories.

Basic Analytics

Basic analytics techniques are generally used to explore your data and include simple visualizations or simple statistics. Some of the common examples are:

Slicing and dicing refers to breaking down data into smaller sets that are easier to explore. This is mostly employed as a preliminary step to gain more understanding of the data attributes, of how different techniques can be used, and of how much computational power is required to implement a full-scale analysis.

Anomaly identification is the process of detecting outliers, such as an event where the actual observation differs from the expected value. This might involve computing summary statistics like mean, median, and range values, and sometimes also involves visualization techniques such as scatter plots and box plots to identify outliers through visual means.

Advanced Analytics

Advanced analytics involves the application of statistical algorithms for complex analysis of either structured or unstructured data. Among its many use cases, these techniques can be deployed to find patterns in data, for prediction and forecasting, and for complex event processing. With the growth of Big Data and enterprises' need to stay ahead of the competition, advanced analytics implementations have become mainstream as an integral part of the decision making process. Some examples of advanced analytics are:

Text Analytics is the process of analyzing unstructured text, extracting relevant information, and transforming it into structured information on which statistical techniques can be applied. Since much of Big Data comprises unstructured data, text analytics has become one of the mainstream applications of Big Data analytics.

Predictive Modeling consists of statistical or data-mining solutions, including algorithms and techniques, to determine future outcomes. A predictive model is made up of a number of predictors, which are variable factors that are likely to influence future behavior or results. In marketing, for example, a customer's gender, age, and purchase history might predict the likelihood of a future sale. Some of the other common applications include churn prediction, fraud detection, customer segmentation, marketing spend optimization and many more.
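
As a small illustration of the anomaly identification idea mentioned above, the snippet below flags observations that sit far from the rest using nothing more than a mean and a standard deviation. The daily sales figures and the two-standard-deviation cut-off are hypothetical choices made for the example.

    import statistics

    daily_sales = [210, 198, 225, 205, 1890, 215, 202]   # one suspicious spike

    mean  = statistics.mean(daily_sales)
    stdev = statistics.stdev(daily_sales)

    # Flag anything more than two standard deviations away from the mean.
    outliers = [x for x in daily_sales if abs(x - mean) > 2 * stdev]
    print(outliers)   # [1890]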

CHAPTER 06
Unstructured Data and Text Analytics

Unstructured data usually takes up a lot of storage capacity and is more difficult to analyze when compared with structured data, which is relatively easy to handle and process. It is basically information which is text-heavy, in most cases does not have a predefined data model, and does not fit well into traditional database management systems. At an enterprise level, only 20% of the Big Data being handled is structured and the remaining 80% is unstructured. Most unstructured data these days is machine generated from various sources such as satellite images, video surveillance, scientific sensors, weather monitoring, social media, mobile and other web content. Data coming from these sources is in the form of text, images, videos, web logs, and other customary machine formats like sensor output. A few key facts related to unstructured data are:

- Most new data is unstructured and represents almost 95 percent of all data generated
- Unstructured data tends to grow exponentially, and is estimated to be doubling every year
- Unstructured data is vastly underutilized due to the limitations of traditional IT technologies

With the evolution of Big Data technologies, enterprises can effectively process unstructured data and derive business value from it. Most firms currently implement NoSQL-based technologies, mainly Hadoop, whose capabilities extend beyond traditional databases. Regardless of the native formats, Hadoop can store different types of data from multiple databases with no prior need of a schema. Within the Hadoop Ecosystem, HDFS is used for storage, since it handles non-predefined data models, and the MapReduce framework is used for quick processing of large volumes of unstructured data. Later, using the data sitting in Hadoop, enterprises can tap into traditionally unexplored information and can start making more decisions based on hard data.

We saw in an example in an earlier chapter how eBay, a giant online marketplace, tries to achieve the highest buying price for items by understanding the impact of the quality of the picture shared in the listing. To find a possible solution, the data science teams at eBay performed extensive image analysis and successfully found a relationship between listing views and items sold. This is a classic real-world example of unstructured data processing with the help of Hadoop. Generally, in order to create value out of unstructured data, some of the most common analytics methodologies are text analytics and image and audio analysis. Of these, text analytics has been adopted as a mainstream activity across many enterprises with the increased usage of Hadoop and other Big Data technologies.

The majority of enterprise Hadoop applications are implemented to deal with unstructured data. In the Processing Complex Data using Hadoop and Performing Analytics on Hadoop Modules, you will learn more about how to handle and analyze text data with the help of real examples leveraging Twitter and email data.

Text analytics is commonly referred to as the process of analyzing unstructured text, extracting relevant information, and transforming it into structured information that can then be leveraged in various ways. The analysis and extraction processes used in text analytics take advantage of techniques that originated in computational linguistics, statistics, and other computer science disciplines. In a Big Data scenario, the applications of text analytics are widespread, from social media analysis, brand perception and sentiment analysis, to areas such as churn and fraud prediction. Increasingly, enterprises across all verticals are looking for ways to combine both structured and unstructured data to get a complete view of their customers' perceptions of various product and service offerings. In the context of Big Data analytics, text analytics implementations can be done with the help of Hadoop components such as Pig and Hive, and MapReduce programming using Java, Python and other languages. These components are equipped with built-in and custom functions suited to the processing of unstructured data formats like text, images and videos. The key to successfully handling unstructured data is to bring structure to the native format and then apply analytics or statistical techniques on top of it; a small sketch of this idea follows below. Apart from Hadoop solutions, other commercial text analytics tools are offered in the Big Data space by vendors like Attensity, Clarabridge, IBM and SAS.
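
As a toy sketch of that idea - structure the raw text first, then apply a simple statistic on the structured result - consider the snippet below. The messages and the positive/negative word lists are hypothetical, and a production implementation would of course use far richer language processing, typically running as MapReduce jobs over data stored in Hadoop.

    from collections import Counter

    messages = ["loving the new data plan",
                "worst service ever",
                "great speed, great price"]

    positive = {"loving", "great", "good"}
    negative = {"worst", "bad", "slow"}

    # Step 1: bring structure to the raw text - word counts per message.
    structured = [Counter(m.lower().split()) for m in messages]

    # Step 2: apply a simple statistic - a crude sentiment score per message.
    scores = [sum(c[w] for w in positive) - sum(c[w] for w in negative)
              for c in structured]
    print(scores)   # [1, -1, 2]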

CHAPTER 07
Big Data in Action

Enterprises are spending millions of dollars on scaling up their existing IT infrastructure with Big Data technologies to meet the end goal of extracting more business value from data and staying ahead of their competition. Unlike traditional data warehousing and BI opportunities, Big Data and analytics opportunities are more business-hypothesis driven and often revolve around exploratory activities. This scenario is consistent across all verticals, since Big Data is being generated by every function of a business, from manufacturing to sales. The key to success in dealing with Big Data lies in an enterprise's ability to define relevant business problems, combine structured and unstructured data sources, and identify hidden trends. The business problem being handled varies by task, as some are computationally intensive while others are more data-analysis intensive. Understanding the nature of the problem is essential for picking the correct approach. In order to exploit Big Data analytics, enterprises often develop a compelling business use case clearly outlining what business outcomes are to be achieved. One such use case in the IT domain is the implementation of the Aadhar Big Data project by the government of India. Aadhar is a 12-digit unique number issued to all residents of India, with the goal of creating the world's largest biometric database. The objective of this project is to deliver more efficient public services and facilitate direct cash transfers to the people.

The applications and use cases of Big Data are many, spanning business domains. The case studies taught in the course help you appreciate both the IT and business aspects of Big Data. They cover domains such as Finance, E-Commerce, Airline and Social Media, which provide hands-on exposure in terms of processing Big Data and then analyzing it to solve various business problems.

Likewise, many organizations are basing their business cases on the following benefits that can be derived from Big Data and analytics:

- Smarter decisions: enable decision making beyond traditional practices
- Faster decisions: reduce the dependency on bureaucracy within an organization
- Impactful decisions: focus on value-generating efforts and capabilities

In pursuit of these benefits, many business verticals such as Telecom, Banking, Insurance, Healthcare, Retail, IT and Manufacturing are riding the wave of Big Data analytics. Let us now review how some of these industries are leveraging Big Data to solve their business problems.

Retail & E-Commerce
Retail is one of the highest-potential areas for Big Data. A survey conducted by the research firm IDC revealed that retailers are increasingly looking to Big Data and analytics for business benefits. Companies can bring together online and offline data along with transaction information to better understand the factors that drive shopper behaviour. Beyond purchase data, retailers are looking at a whole array of new data sources, such as web browsing data, social data and geo-tracking data, which allows a more thorough segmentation of customers. Combining this new information with traditional data, they have started performing high-end analytics such as market-basket analysis, seasonal sales analysis, inventory optimization and pricing optimization; a small market-basket sketch is shown below. In e-mail targeting, the traditional approach has been to scan the entire customer base, build a list of customers, and send out mass mailers to everyone. The Big Data advantage is personalization: by understanding a consumer's browsing history, retailers can send messages specific to the items searched for and then offer that shopper a targeted promotion. Similarly, using location data from mobile devices, a customer who is present in a store can be offered specific coupons to motivate a purchase.
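To make the idea of market-basket analysis concrete, here is a tiny, self-contained Python sketch that counts how often pairs of items appear together in the same basket. The sample transactions are invented for illustration; a production version would run over real sales records, for example as a MapReduce or Hive job.

    # Toy market-basket sketch: count how often item pairs co-occur in a basket.
    # The transactions below are illustrative; real input would come from sales records.
    from itertools import combinations
    from collections import Counter

    transactions = [
        {"bread", "milk", "eggs"},
        {"bread", "butter"},
        {"milk", "eggs", "butter"},
        {"bread", "milk", "butter"},
    ]

    pair_counts = Counter()
    for basket in transactions:
        for pair in combinations(sorted(basket), 2):
            pair_counts[pair] += 1

    # Pairs bought together most often are candidates for joint promotions.
    for pair, count in pair_counts.most_common(3):
        print(pair, count)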

Telecom
Similar to other sectors, communications service providers across the globe are witnessing significant data growth due to the increased adoption of smartphones, the rise of social media and the growth of mobile networks. Many of these firms are tackling Big Data challenges in order to gain market share and increase profit margins. Big Data can help service providers achieve several key business objectives: providing better customer service with the help of internal and external data, implementing innovative product services using segmentation techniques, and developing strategies to generate new sources of revenue. Over the last few years telecom operators have moved from a traditional data warehousing model towards a centralized data repository with integrated reporting solutions. With the exponential growth of data in this sector, operators are now looking to newer technologies as a cost-effective way to process growing data volumes. One common application most telecom operators implement is integrating network performance data with subscriber usage patterns, in order to understand what is happening at the complex intersection of network and services (voice, data, and content). This generally helps them detect network performance issues in real time and deliver the quality of service needed to maximize customer satisfaction.

Financial Services
Historically, banking and financial management firms have been rife with transaction data, with hundreds of millions of records generated every day. With digitalisation, a variety of data sources, such as social media, information portals and customized web applications, are adding even more information to the industry's existing ocean of data. Implementing Big Data solutions enables these enterprises to collect and organize a host of additional data, such as customer preferences, behaviours, interaction history, events and location-specific details, in a cost-effective manner. Using this information, many financial services firms run sophisticated analytics to determine the best action to recommend to a customer, such as a targeted promotion, a cross-sell or up-sell offer, a discounted product or fee, or a retention offer. In addition, Big Data technologies add value through real-time insight generation and help in faster decision making. One major development is the integration of external data sources such as social media with the internal IT infrastructure, which provides a broader, enterprise-level view of customers, products and services. Customer segmentation is a key tool for sales, promotion, and marketing campaigns across financial services firms. Using the available data, customers are grouped into segments based on their expectations, needs and other characteristics (a minimal clustering sketch follows below). The advantages of such an implementation are multi-fold: increasing customer loyalty, selling more products and services, and cutting costs through better management of resources.
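As a rough illustration of how such segmentation might be performed, the sketch below clusters customers into groups with k-means. It assumes scikit-learn and pandas are available and that a small feature extract exists as a CSV file; the file name and column names are purely hypothetical.

    # Minimal customer-segmentation sketch using k-means clustering.
    # The CSV file and its columns (age, avg_balance, txn_per_month) are illustrative.
    import pandas as pd
    from sklearn.preprocessing import StandardScaler
    from sklearn.cluster import KMeans

    customers = pd.read_csv("customer_features.csv")           # hypothetical extract
    features = customers[["age", "avg_balance", "txn_per_month"]]

    scaled = StandardScaler().fit_transform(features)           # put features on one scale
    model = KMeans(n_clusters=4, random_state=42).fit(scaled)   # four segments, for example

    customers["segment"] = model.labels_
    print(customers.groupby("segment")[["age", "avg_balance", "txn_per_month"]].mean())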

HealthCare
Big Data has many implications for patients, providers, researchers, payers, and other healthcare constituents. Today's patients are demanding more information about their healthcare options so that they understand their choices and can participate in decisions about their care. In a healthcare scenario, the key data sources have traditionally been patient demographics and medical history, diagnostics, clinical trials data and drug effectiveness indices. If these traditional sources are combined with data provided by patients themselves through social media channels and telematics, they become a valuable source of information for researchers and others who are constantly searching for ways to reduce costs, boost outcomes, and improve treatment. One of the major applications of Big Data has been DNA analysis. With the latest tools and technologies, an entire individual human DNA sequence can be analyzed and compared against those of other individuals and groups within a much smaller timeframe. The relatively low cost of individual DNA analysis today (thousands of dollars) has made this tool accessible to far more people than was possible at the cost of millions of dollars in the years after the first full human genome was sequenced.

Smart Cities
The growth of Big Data and digitalization has resulted in the availability of a wide range of information about cities: their physical infrastructure, their services, and the interactions between people. Smarter cities are leveraging this Big Data to improve infrastructure, planning and management, and human services as a system of systems, with the goal of making cities more desirable, liveable, sustainable, and green. Applications include mass transit, utilities, environment, emergency response, big-event planning, public safety, and social programs. IBM has been a pioneer in providing Big Data solutions in this area under its flagship Smarter Planet programme. In one such project, IBM used Big Data to improve traffic flow in Dublin, Ireland. By utilizing GPS information from buses, IBM was able to measure arrival and departure times more accurately and pass that data on to travellers via the transport system's notification boards. This information enables people to make better use of their time, which further increases confidence in the public transport system.

CHAPTER 08
Big Data Landscape

Hadoop Distribution Offerings
Although Hadoop and its projects are completely open source, a large number of companies have developed their own Hadoop distributions, which are more ready to use. These distributions are packaged and guaranteed to include the HDFS and MapReduce components along with supporting tools. Several distributions are available, such as those provided by EMC and Intel, as well as offerings from hardware vendors like IBM that are typically all-in-one solutions including hardware. The three biggest and most prevalent Hadoop distributions today are Cloudera, MapR and Hortonworks. If you are looking for a quick plug-and-play option, each of these vendors offers VM images with Linux and Hadoop already installed.

Apache Hadoop, the original release of Hadoop, comes from apache.org and is backed by the community support of the Apache Software Foundation. The original Hadoop releases come from this group, the latest being Hadoop 2.0. Other companies and organizations that release products containing modified or extended versions of Apache Hadoop are generally termed Hadoop distributions. One important point to note is that these distributions are continuously upgraded to keep up with the latest Apache Hadoop releases from the Apache Software Foundation.

There are many Hadoop distributions available, and Cloudera CDH4 is one of the most widely used at the enterprise level. In the Big Data and Hadoop Overview module, you will learn about installing and working with the CDH4 Hadoop distribution. The Cloudera CDH4 installation includes Apache Hadoop along with other components such as Pig, Hive and Impala for Big Data processing.

Some of the companies that package Apache Hadoop and provide additional capabilities in terms of commercial support and other Hadoop-related utilities are described below.

Cloudera's Hadoop Distribution, CDH4, includes HDFS, YARN, HBase, MapReduce, Hive, Pig, ZooKeeper, Oozie, Mahout, Hue, and other open source tools (including the real-time query engine Impala). Cloudera Manager Free Edition includes all of CDH plus a basic manager supporting up to 50 cluster nodes. Cloudera Enterprise combines CDH with a more sophisticated manager supporting an unlimited number of cluster nodes, proactive monitoring, and additional data analysis tools.

Hortonworks Hadoop Distribution, HDP version 2.0, includes HDFS, YARN, HBase, MapReduce, Hive, Pig, HCatalog, ZooKeeper, Oozie, Mahout, Hue, Ambari, Tez, a real-time version of Hive (Stinger) and other open source tools. It also provides high-availability support, a high-performance Hive ODBC driver, and Talend Open Studio for Big Data.

MapR Hadoop Distribution, M7 version, includes HDFS, HBase, MapReduce, Hive, Mahout, Oozie, Pig, ZooKeeper, Hue, and other open source tools. It also includes direct NFS access, snapshots and mirroring for high availability, a proprietary HBase implementation that is fully compatible with the Apache APIs, and a MapR management console.

IBM InfoSphere BigInsights is available in two editions. The Basic Edition includes HDFS, HBase, MapReduce, Hive, Mahout, Oozie, Pig, ZooKeeper, Hue, and several other open source tools, as well as a basic version of the IBM installer and data access tools. The Enterprise Edition adds sophisticated job management tools, a data access layer that integrates with major data sources, and BigSheets (a spreadsheet-like interface for manipulating data in the cluster).

Intel Distribution for Apache Hadoop is a product based on Apache Hadoop, containing optimizations for Intel's latest CPUs and chipsets. It includes the Intel Manager for Apache Hadoop for managing a cluster.

Handling Big Data in the cloud is one of the growing practices among enterprises, owing to its low cost and better processing capabilities. You will be provided an AWS instance with a Hadoop installation to work on the assignments and other case study problems as part of the virtual lab offering.

Amazon Elastic MapReduce is a cloud service that enables users to process vast amounts of data easily and at a lower cost. It utilizes a hosted Hadoop framework running on the web-scale infrastructure of Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Simple Storage Service (Amazon S3). It includes HDFS (with S3 support), HBase (with proprietary backup and recovery), MapReduce, Hive (with added support for DynamoDB), Pig, and ZooKeeper.

Windows Azure HDInsight is a Hadoop solution for the Azure cloud. It is integrated with the Microsoft management console for easy deployment and integration with System Center, and it can be integrated with Excel through a Hive Excel plug-in. It also offers connectivity to Microsoft SQL Server Analysis Services (SSAS), PowerPivot, and Power View through the Hive Open Database Connectivity (ODBC) driver.

For Big Data analysis, apart from understanding the analytics project cycle and the kinds of analysis that can be done, enterprises should also leverage the right analytics tools to deal with Big Data efficiently. Broadly, Big Data analysis tools can be classified by their statistical technique offerings and their business intelligence integration capabilities. Although Hadoop components can be used to achieve each of these independently, Hadoop is not a specialized analytics tool and is popular mainly for its distributed framework. Let's explore some of the tools that offer extensive visualizations, drag-and-drop options, and easy-to-install scripts.

Analytics Implementations
With the explosion of Big Data, there has been rapid growth in tools providing statistical capabilities at larger scale. Since statistics is critical for identifying and quantifying relationships between attributes in the data, it is a key component of many analytics tools catering to Big Data. Some of the more interesting tools include:

R is an open source programming and statistical language that is rapidly gaining popularity in the Big Data space. It has been widely used in universities and startups alike for many years, but much of the recent interest can be attributed to its open source nature and its flexibility of integration with open source Big Data technologies such as Hadoop. In terms of statistical capabilities, R is very versatile, with more than 4,000 packages that can address almost any Big Data analysis problem. With the introduction of the RHadoop packages by Revolution Analytics, anyone can work with a Hadoop cluster, interact with data in HDFS, and run MapReduce jobs written in R syntax.

R is the most popular open source analytics tool. In the Performing Analytics on Hadoop module, you will learn about R syntax, handling data and running statistical tests with R, and also about techniques to integrate R with Hadoop for implementing MapReduce programs from the R console.

SAS has been a pioneer in business analytics software and services over the last decade and is the largest independent vendor in the business intelligence market. To deal with the data deluge, SAS has recently extended its services with Big Data handling capabilities, which help users perform data manipulation and exploratory analysis on Hadoop. Unlike working directly with Hadoop, which requires specialized expertise, enterprises can leverage their existing SAS skills to work easily with Big Data. SAS also offers text analytics capabilities as part of its overall analytics platform, treating text data as simply another source of data.

Apache Mahout, the machine learning component of the Hadoop ecosystem, provides scalable machine learning algorithms on top of the Hadoop platform. Mahout offers algorithms for clustering, classification, and collaborative filtering implemented on top of Apache Hadoop using MapReduce. However, it requires Java programming expertise to implement MapReduce jobs with Mahout successfully.

MADlib, one of the more recent developments, is an open-source library that supports in-database analytics. It provides data-parallel implementations of mathematical, statistical, and machine-learning methods for structured, semi-structured, and unstructured data.

Text Analytics Implementations
Though text analytics can be grouped under Big Data analytics, it is worth treating separately because of its applications specific to unstructured data. Here is an overview of some of the players in the text analytics segment of the Big Data market.

Attensity is one of the original text analytics companies, having begun developing and selling products more than ten years ago. It offers several engines for text analytics around auto-classification, entity extraction, and exhaustive extraction. Attensity's text analytics tools use the Hadoop framework to store data and are focused on social and multichannel analytics, analyzing text from both internal and external sources for reporting.

Clarabridge is another pure-play text analytics vendor that deals extensively with unstructured data processing. It offers its solution as Software as a Service (SaaS).

Software giant IBM offers IBM Content Analytics in the text analytics space. This tool transforms content into analyzed information, which is then made available for detailed analysis in much the same way structured data would be analyzed in a BI toolset.

Business Intelligence Integration
Generally, enterprise-level business intelligence needs revolve around regular reporting, dashboards and visualizations. The Hive component of Hadoop provides traditional database features and business intelligence integration capabilities to meet an enterprise's reporting and analysis needs on structured data. Though it uses a SQL-like query language for Big Data analysis, additional ready-to-use features such as drag-and-drop and automated reporting are not supported (a minimal query sketch appears at the end of this section). Many alternative tools provide advanced business intelligence reporting on Big Data sitting in a Hadoop cluster, with a rich, user-friendly environment to slice and dice data. We will review some of the widely used ones.

Tableau is one of the leading data visualization tools at the enterprise level. In the Performing Analytics on Hadoop module, you will learn how to work with Tableau and will be able to build web dashboards and complex visualizations using data residing in a Hadoop cluster.

Tableau has been gaining popularity across enterprises as the go-to BI tool for analysing and visualizing Big Data. It offers direct connections to many high-performance databases, cubes, Hadoop, and cloud data sources such as Salesforce.com and Google Analytics. Tableau has a fast, in-memory analytical engine that can work directly with Big Data to create reports and dashboards, and it provides features for publishing web dashboards to a server, enabling easy sharing across the enterprise. With more than 30 database plug-ins available, ranging from Big Data stores to traditional SQL databases, Tableau is attaining the status of a must-have BI tool across many enterprises and industry verticals.

Datameer Analytics Solution (DAS) is a business integration platform for Hadoop that provides comprehensive capabilities to analyze both structured and unstructured data. Its major specialization is enabling analysis of large volumes of data stored in a Hadoop cluster. It has a spreadsheet interface with over 200 analytic functions and visualization options including reports, charts and dashboards. DAS supports all major Hadoop distributions, including Apache, Cloudera, EMC Greenplum HD, IBM BigInsights, MapR, Yahoo! and Amazon.

Pivotal, an EMC spin-off, offers Big Data storage and analytics capabilities.

Pivotal's Big Data solutions offer a wide set of enterprise data products: MPP and column-store databases, in-memory data processing, and Hadoop. It also provides in-database integration with SAS analytics and is one of the fastest-growing BI vendors in the Big Data analytics space.

Pentaho Big Data Solutions supports the entire Big Data analytics process, ranging from discovering and preparing data sources to integration, visualization, analysis and prediction. For IT and developers, Pentaho provides a complete, visual design environment to simplify and accelerate data preparation and modeling. For business users, Pentaho provides visualization and exploration of data. And for data analysts and scientists, Pentaho provides full data discovery, exploration and predictive analytics.

Another commercial offering from IBM combines InfoSphere BigInsights and Cognos software, giving organizations a powerful solution for translating large amounts of data into valuable, actionable insights. InfoSphere BigInsights provides the Big Data processing capabilities, while Cognos provides enterprise-level BI capabilities.
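To illustrate the SQL-like Hive interface mentioned in the Business Intelligence Integration discussion above, here is a minimal sketch that runs a HiveQL aggregation from Python using the PyHive library. PyHive, the HiveServer2 connection details and the page_views table with its columns are all assumptions made for illustration; the guide itself does not prescribe any particular client.

    # Minimal sketch: running a HiveQL aggregation from Python via PyHive.
    # PyHive and the page_views table (columns: country, url) are illustrative
    # assumptions; any Hive client or the hive CLI would work similarly.
    from pyhive import hive

    conn = hive.connect(host="localhost", port=10000)   # HiveServer2 defaults assumed
    cursor = conn.cursor()
    cursor.execute(
        "SELECT country, COUNT(*) AS views "
        "FROM page_views "
        "GROUP BY country "
        "ORDER BY views DESC "
        "LIMIT 10"
    )
    for country, views in cursor.fetchall():
        print(country, views)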

CHAPTER 09
Big Data Career Roles

As the field of Big Data booms, many enterprises are actively looking for the right talent with relevant IT expertise and deep analytical skills. According to the information technology research and advisory firm Gartner, Big Data will create more than 4.4 million jobs by 2015, opening up plenty of opportunities for analysts, computer scientists, mathematicians and other data-savvy job seekers. In spite of this explosion in business demand, enterprises are currently short of experts who can work with the new tools and technologies and make sense of unstructured data flowing from mobile devices, sensors, social media and other sources.

Earlier, Big Data skills were most prominent in the defence and technology sectors. As Big Data technologies became cheaper and more accessible, more sectors joined the movement and the competition for Big Data talent became fiercer. Currently, e-commerce companies and social media services are leading the demand. Other sectors on the lookout for Big Data skills include food manufacturers, retailers, consulting companies, gaming, online travel, consumer finance, telecommunications and insurance, according to a report published by the career site Dice.com. Big Data talent should consist of a combination of skills: good knowledge of statistical and analytical techniques, an understanding of how Big Data can be used to make better business decisions, and computer programming expertise for Big Data analysis.

Amongst Big Data analytics jobs, the Data Scientist role is a top requirement for many firms. Jigsaw Academy has an extensive industry network to facilitate placements for its students as and when suitable positions are available. All participants get one-on-one support for resume and interview preparation.

Let us review some of the common roles across Big Data talent.

IT-Focused

Hadoop Developer
Amongst the existing Big Data technologies, Hadoop is the preferred choice of many enterprises because of its flexibility and low cost. The major responsibilities of a Hadoop Developer include the design, development and implementation of data processing applications using Apache Hadoop in a real-world project setup. These roles require a thorough understanding of the Hadoop framework, including HDFS, MapReduce and other components, with a major emphasis on IT applications. Hadoop Developers should also have a strong working knowledge of the Java programming language and, preferably, exposure to other languages such as C, C++, Python and PHP. Typically, this role is most sought after by software professionals, ETL developers, and data engineers who have a solid foundation in Hadoop architecture.

Hadoop Administrator
The Hadoop Administrator role is similar to that of a traditional DBA (Database Administrator), but in the context of configuring, deploying and maintaining a Hadoop cluster.

As enterprise spending on Hadoop technologies rises, so does the need for specialists who can operate such a framework for storage and large-scale processing of data sets. A Hadoop administrator should have a very good understanding of Apache Hadoop and its ecosystem, as well as expertise in UNIX and the cloud frameworks commonly used to set up Hadoop clusters. This role is a natural fit for traditional data warehousing specialists and database administrators who are willing to scale up their expertise for Big Data.

Analytics-Focused

Big Data Analyst
The Big Data talent needed by many enterprises includes the ability to access, manipulate, and analyze massive data sets in a Hadoop cluster using SQL and familiar scripting languages. The major focus of this role is analysing large volumes of data using analytics tools such as SAS and R integrated with the Hadoop cluster. Exposure to business intelligence tools like Tableau is also required, as they are widely used across enterprises for generating interactive reports and dashboards for various business needs. Knowledge of tools like Impala, Hive, and Pig is also needed, especially to implement real-time analytics functions. Professionals working as data analysts, BI specialists, and business analysts who work extensively with data and are willing to take up Big Data challenges are ideal candidates for the Big Data Analyst role.

Data Scientist
Harvard Business Review has termed data scientist the sexiest job of the 21st century. It is certainly one of the roles in highest demand, chiefly for its combination of statistical knowledge, computing expertise and the ability to work with Big Data technologies. One common misconception is that the data scientist role is always tied to Big Data projects; it has more to do with the increased breadth and depth of data being examined compared to traditional roles. The demand for data scientists is increasing tremendously as more and more industry verticals rely on data-based decision making to become more profitable. Apart from skills in the Hadoop framework, statistical tools such as R and Python, and analytics techniques, a data scientist should also possess good domain expertise and the communication skills to present technical findings to the management team as final recommendations. Professionals in traditional roles such as statistician, predictive analyst, business analyst and business intelligence analyst, coupled with Big Data skills, are well placed for a data scientist career.

CHAPTER 10
Big Data in the Future

As our dependence on mobile and online technologies increases over the years, so will the growth of Big Data. It is essential for individuals and enterprises of all sizes, types, and industries to embrace this data deluge. So far, Big Data technologies and tools such as Hadoop have helped us deal with data challenges around volume, variety and velocity.

Since Apache Hadoop technology is rapidly evolving, keeping up with the latest developments is definitely a big challenge. Our industry experts will update the course content in line with the latest developments and provide additional videos and references so that you won't miss a thing.

With Big Data applications still at an early stage but progressing rapidly year on year, we will surely see many more technological solutions packaged into distributions, appliances and on-demand cloud services for implementing analytics solutions. The existing cohort of solutions includes Hadoop connectors for analytical tools such as Tableau and R, but these can be improved further with better interactivity and with abstract programming in new tools. Abstract programming means running a data processing task with far less code than a general-purpose language like Java would require, and it aims to help non-programmers handle Big Data without getting into technical nuances; a small sketch at the end of this chapter illustrates the idea.

In terms of enterprise-level Big Data spending, most of the share currently goes to IT implementations for setting up infrastructure and meeting data processing needs. Though business intelligence components are being deployed, the full benefits of Big Data analytics have not yet been achieved. One major trend to expect in the future is the increased impact of data science teams on business decision making. As they become more integrated with an enterprise's business operations, we will see more repeatable day-to-day processes that start with raw data and finish with data products, whether reports, dashboards or other applications.

Another future Big Data trend is visualization, which typically serves both exploration and explanation needs in a data workflow. While many know visualization as the end result of a data science project, it can also be used as an exploration tool to identify anomalies and patterns in the data. Traditionally, visualization has played a vital role in delivering data manipulation abilities to those without direct programming or statistical skills. With the increasing need among managers and business analysts to make sense of Big Data, it will become essential for enterprises to build analytics capabilities around visualization tools to support professionals who are not data-savvy.
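To make the idea of abstract programming concrete, here is a minimal word-count sketch written with mrjob, a Python library for expressing Hadoop MapReduce jobs in a few lines. mrjob is an illustrative assumption rather than a tool covered in this guide; Pig and Hive play a similar abstraction role. The equivalent job written directly against the Java MapReduce API typically requires several classes and much more boilerplate.

    # Minimal word-count sketch using the mrjob library (an illustrative choice;
    # Pig and Hive are the abstraction layers discussed in this guide).
    from mrjob.job import MRJob

    class WordCount(MRJob):
        def mapper(self, _, line):
            # Emit (word, 1) for every word in the input line.
            for word in line.split():
                yield word.lower(), 1

        def reducer(self, word, counts):
            # Sum the per-word counts produced by the mappers.
            yield word, sum(counts)

    if __name__ == "__main__":
        WordCount.run()

The same script can be run locally for testing (python wordcount.py input.txt) or submitted to a Hadoop cluster with the -r hadoop runner.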

CHAPTER 11
Learn More About Big Data

So far this guide has provided a comprehensive introduction to various topics in the field of Big Data and analytics. Like the volumes of Big Data themselves, the demand for skilled individuals in these areas, and the salaries offered, are growing quickly. Fortunately, you can start building the required expertise by exploring free resources available online and reading some best-selling books. These resources differ in how much they emphasize technical versus business application aspects. Whether you purchase a new book or refer to an online resource, get ready to learn more about the fascinating world of Big Data, which is set to transform the way businesses are run and help them achieve competitive success.

Books on Big Data
Big Data: A Revolution That Will Transform How We Live, Work, and Think by Viktor Mayer-Schonberger and Kenneth Neil Cukier
Big Data at Work: Dispelling the Myths, Uncovering the Opportunities by Thomas H. Davenport
Taming The Big Data Tidal Wave by Bill Franks

Books on Hadoop
Hadoop: The Definitive Guide, 3rd Edition by Tom White
Hadoop in Practice by Alex Holmes
Hadoop in Action by Chuck Lam

Books on Analytics
Keeping Up with the Quants: Your Guide to Understanding and Using Analytics by Thomas H. Davenport and Jinho Kim
Competing on Analytics: The New Science of Winning by Thomas H. Davenport and Jeanne G. Harris
Predictive Analytics: The Power to Predict Who Will Click, Buy, Lie, or Die by Eric Siegel
Super Crunchers: Why Thinking-By-Numbers is the New Way To Be Smart by Ian Ayres
Data Science for Business: What You Need to Know About Data Mining and Data-Analytic Thinking by Foster Provost and Tom Fawcett

Popular Blogs on Big Data and Hadoop
Smarter Computing Blog - maintained by IBM; includes articles on Big Data and cloud computing
Planet Big Data - an aggregator of worldwide blogs about Big Data, Hadoop, and related topics
Big Data Forrester Blogs - an aggregation of blogs and articles from enterprise experts focusing on Big Data topics
Hadoop Wizard - a website dedicated to helping people learn how to use Hadoop for Big Data analytics
Yahoo! Hadoop Tutorial - free materials that cover in detail how to use the Hadoop distributed data processing environment
Hadoop 360 - a Hadoop-focused site maintained by the Data Science Central community
Cloudera Developer Blog - Big Data best practices, how-tos, and internals from Cloudera Engineering and the community
The Hortonworks Blog - articles on Hadoop covering the latest releases, trends and updates from the Hortonworks team

Video Resources for Hadoop
Big Data University - an IBM initiative offering free online courses taught by leading experts in the field
MapR Academy - free training resources to help individuals and teams learn and use Hadoop
Hadoop Screencast - a collection of good-quality screencasts on installing and working with Apache Hadoop and the various components of its ecosystem
Hadoop Essentials - a six-part recorded webinar series offered free by Cloudera as an introduction and motivation for Hadoop
Hortonworks Sandbox - a self-contained virtual machine provided by Hortonworks with hands-on video tutorials and a pre-installed single-node Hadoop cluster

References
1. https://www.abiresearch.com/press/big-data-spending-to-reach-114-billion-in-2018-loo
2. http://www.v3.co.uk/v3-uk/news/2302017/ebay-using-big-data-analytics-to-drive-up-price-listings

Did you enjoy this book? We would love to hear your feedback or suggestions. Do write to us at info@jigsawacademy.com