Introduction

Big Data. What do these two words really mean? Everyone is talking about it, but frankly, not many really understand what the hype is all about. This book by Jigsaw Academy aims to give one an understanding of Big Data and what makes data big, while also explaining in simple language the challenges of Big Data, the emerging technologies and the Big Data landscape. Finally, we talk about careers in Big Data and what the future could hold in store for the industry. This book is also a useful companion for those of you enrolled in Jigsaw's course Big Data Analytics Using Hadoop and R. You can use this book to complement your learning and better understand Big Data. Please note the blue boxes in every chapter, which link the content in the chapter to the modules covered in the course. Enjoy the book.

Big Data Team, Jigsaw (Led by Team Lead Kiran P.V.)
Outline

1. What is Big Data
2. What Makes Data Big
3. Challenges of Big Data Processing
4. Big Data Technologies
5. Big Data and Analytics
6. Unstructured Data and Text Analytics
7. Big Data in Action
8. Big Data Landscape
9. Big Data Career Paths
10. Big Data in the Future
Learn more about Big Data
CHAPTER 01
What is Big Data?

"I don't know how the cost of hard disks has decreased so rapidly. These days one can buy a terabyte hard drive for just $100," a friend told me a couple of years ago. It's hard not to agree with him, and a quick review of historical facts validated his opinion. In the 1990s, the cost of 1 gigabyte of hard disk was around $10,000, and now it can be purchased for only $0.10. The price has dropped 100,000 times over a span of 20 years. Currently we are even seeing a few gigabytes of hard disk space being offered free of cost by service providers and file hosting services. For personal accounts, Gmail offers about 15 gigabytes of free storage space, whereas the file hosting service Dropbox offers up to 3.5 gigabytes. These values are higher for business accounts. One would wonder how enterprises are influenced by the lower costs of storage space. For one, it definitely provides them with more opportunities to store data around their product and service offerings. Virtually every industry is seeing a tremendous explosion in terms of new data sources and is dependent on advanced storage technologies. Increased adoption of the internet and smartphones has enabled individuals across the globe to leave a huge digital footprint of online data which is wanted by many enterprises. In the past, for example, banks used to store customer data mostly around demographic information tracked from application forms and transaction information tracked from passbooks. These days, the customer data being stored is enormous and varies widely across mobile usage, online transactions, ATM withdrawals, customer feedback, social media comments and credit bureau information. All these new sources of data which did not exist in the past can be categorized under the new term Big Data. Big Data can easily be described as data which is huge, but more importantly, Big Data is data that comes from multiple sources rather than just one.
Big Data is definitely one of the more fascinating evolutions of the 21st century in the world of IT. The truth is that Big Data has opened up tremendous opportunities and has provided us with endless solutions to deal with social, economic and business problems. For enterprises, it is a huge untapped source of profit which, if used appropriately, will be the key to staying ahead of their competition. In order to deal with Big Data effectively, they need to depend on advanced database technologies and faster processing capabilities. Just having Big Data is not a sufficient criterion for success; enterprises also need to implement analytics effectively, in order to be able to garner insights that help improve profitability. They should actively pursue the art and science of capturing, analysing and creating value out of Big Data.

[Course link: The Big Data and Hadoop Overview Module provides pre-class videos and plenty of reading material on the importance of Big Data and how it is transforming the way enterprises are implementing data-based strategies to become more competitive.]
CHAPTER 02
What makes data Big?

We live in the era of Big Data, and it is not leaving any industry untouched, be it financial services, consumer goods, e-commerce, transportation, manufacturing or social media. Across all industries, enterprises now have access to an increasing number of both internal and external Big Data sources. Internal sources typically track information around demographics, online or offline transactions, sensors, server logs, website traffic and so on. This list is not exhaustive and varies from industry to industry. External sources, on the other hand, are mostly related to social media information from online discussions, comments and feedback shared about a particular product or service. Another major source of Big Data is machine data, which consists of real-time information from sensors and web logs that monitor customers' online activity. In the coming years, as we continue to develop new ways of generating data either online or offline by leveraging technological advancements, the one prediction we can safely make is this: the growth of Big Data is not going to stop. Although at a high level Big Data is about data being captured from multiple sources and its size, there are many technical definitions which provide more clarity. The O'Reilly Strata group states that Big Data is data that exceeds the processing capacity of conventional database systems: the data is too big, moves too fast, or doesn't fit the structures of your database architectures. In simple terms, Big Data needs multiple systems to efficiently handle and process data rather than a single system. Say an online e-commerce enterprise in a single region is generating about 1,000 gigabytes of data on a daily basis, which can be handled and processed using traditional database systems. On expanding operations to a global level, their daily data generation increases 10,000 times to 10 petabytes (1 petabyte = 1 million gigabytes).
To handle this kind of data, traditional database systems do not have the required capabilities, and enterprises need to depend on Big Data technologies such as Hadoop, which uses a distributed computing framework. We will learn more about these technical topics in subsequent chapters. To further simplify our understanding of Big Data, we can rely on its three major characteristics: volume, variety and velocity, more commonly referred to as the 3 Vs of Big Data. Occasionally, some resources also talk about a less common characteristic, veracity, which is referred to as the 4th V of Big Data. Together, these four characteristics provide more detail around the nature, scope and structure of Big Data.

[Course link: Commonly Big Data is characterized by the 3 Vs, and these provide context for a new class of possibilities. You will learn more about how these characteristics help extract more information from massive amounts of data in the Big Data and Hadoop Overview Module.]
Volume

Volume deals with the size aspect of Big Data. With technical advancements in global connectivity and social media, the amount of data generated on a daily basis is growing exponentially. Every day, about 30 billion pieces of information are shared globally. An IDC Digital Universe study estimated that global data volume was about 1.8 zettabytes as of 2011 and projected it to grow about 5 times. A zettabyte is a quantity of information or information storage capacity equal to one billion terabytes, i.e. 10^21 bytes. Across many enterprises, these increasing volumes pose an immediate challenge to traditional database systems with regard to the storing and processing of data.

Variety

Big Data comes from sources such as conversations on social media, media files shared on social networks, online transactions, smartphone usage, climate sensor data, financial market data and many more. The underlying formats of data coming out of these sources vary across spreadsheets, text documents, audio files and server log files, which can broadly be classified as either structured or unstructured data types. Structured data formats typically refer to a defined way of storing data, i.e. clearly marked out rows and columns, whereas unstructured data formats do not have any order and mostly refer to text, audio and video data. Unstructured formats of data are a more recent phenomenon, and traditional database systems do not possess the required capabilities to process this kind of information.
Velocity

Increased volumes of data have put a lot of stress on the processing abilities of traditional database systems. Enterprises should be able to quickly process incoming data from various sources and then share it with the business to ensure the smooth functioning of day-to-day operations. This quick flow of data within an enterprise is the velocity characteristic of Big Data. Another important aspect is the ability to provide relevant services to the end user on a real-time basis. For example, Amazon provides instant recommendation services depending on the user's search and location. Based on the entered keyword, these services need to search through the entire transaction history and share relevant results, which will hopefully convert into a purchase. The effects of velocity are very similar to those of volume, and enterprises need to rely on advanced processing technologies to efficiently handle Big Data.

Veracity

Though enterprises have access to a lot of Big Data, some aspects of it will be missing. Over the years, we have seen that data quality issues usually arise from human entry error or from individuals withholding information. In the Big Data era, where most data capture processes are automated, the same issues can occur due to technical lapses arising from system or sensor malfunction. Whatever the reasons may be, one should take care to deal with inconsistency in Big Data before using it for any kind of analysis.
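As a minimal illustration of the veracity point, here is a hypothetical Python sketch (not from the course material) that screens automated sensor readings for missing values and malfunction spikes before any analysis is run. The field name `temp_c` and the thresholds are assumptions made up for this example.

```python
# Screen sensor readings for the kinds of gaps and glitches described
# under Veracity. Field names and thresholds are hypothetical.

def clean_readings(readings, low=-40.0, high=60.0):
    """Keep readings that have a value and fall in a plausible range."""
    cleaned, rejected = [], []
    for r in readings:
        value = r.get("temp_c")
        if value is None or not (low <= value <= high):
            rejected.append(r)   # missing value or sensor-malfunction spike
        else:
            cleaned.append(r)
    return cleaned, rejected

readings = [
    {"sensor": "s1", "temp_c": 21.5},
    {"sensor": "s2", "temp_c": None},    # missing value
    {"sensor": "s3", "temp_c": 999.0},   # malfunction spike
]
good, bad = clean_readings(readings)
print(len(good), len(bad))   # 1 2
```

Even a simple screening pass like this, run before analysis, keeps inconsistent records from silently skewing results.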
CHAPTER 03
Challenges of Big Data Processing

Just having a big source of data is not enough to become successful; enterprises need to implement relevant processes and systems which will help them extract value out of it. An important aspect here is what one should do with the data. In the absence of a business context, data in itself is meaningless and would just occupy space on storage servers. Also, many Big Data sources tend to have missing or low-quality information, as described by the veracity characteristic earlier. The actual power of Big Data surfaces only by applying analytics on top of it, when it is used to generate useful insights to guide future decision making. Irrespective of the size of the data, whether big or small, analytics methodologies need to be implemented to reap benefits. This typically involves cleaning and analysing the data, then interpreting and visualizing the hidden patterns that emerge from it.

Due to the sheer volume, variety and velocity of Big Data, the processing capacity of traditional database systems is strained beyond its limits. So enterprises need to look out for advanced processing technologies and capabilities to effectively manage Big Data and further implement analytics on it. One of the major requirements of any Big Data processing framework is to successfully handle huge amounts of information without compromising on query time. Traditional database systems lack the required infrastructure and internal designs to process Big Data at a scale of petabytes or exabytes. Since these systems tend to operate out of a single machine with a huge hard disk and processing capabilities, they come with a set of limitations. The first is scalability: with a continuous increase in data volumes, the storage capacity of these systems needs to be continuously increased, and this can be an expensive option. The second is slow querying time, because the storage load is already operating at maximum levels, and enterprises cannot wait for days to get their daily reports. These limitations call for alternate approaches based on scalable storage and a distributed approach to scaling.

[Course link: NoSQL technologies are helping enterprises to achieve more than what was possible previously. In the Big Data and Hadoop Overview Module, the evolution and benefits of new technologies such as Hadoop and MongoDB are discussed, along with how these help in overcoming the limitations of traditional IT systems for solving Big Data problems.]

Big Data sources are diverse, and inputs for these systems can be in structured or unstructured formats. Since the origin of these data formats is spread across the globe, most of the time they won't have a predefined order and require pre-processing before being used for any analysis. A common use of Big Data processing is to make use of unstructured data, specifically comments on social media, by tracking customer sentiment towards various product and service offerings. Due to their inherent static design, many traditional database systems can handle only structured data and as such do not provide any alternatives for unstructured data. For example, SQL-based database systems depend on schema designs which clearly define the nature of the data being loaded, and are used to process transactional data. Since unstructured data for the most part does not have a proper structure, it would be impossible to handle it in SQL-based systems. Luckily for us, there exist alternatives in the form of NoSQL databases, which can handle both structured and unstructured data formats. The majority of client applications run by enterprises are real-time, and instant support for services has become a priority. This requires processing Big Data by the minute in order to provide relevant services to customers. For example, based on a user's search keyword, Google instantly processes information stored across its many databases and returns relevant links within a matter of seconds. Similarly, banks need to track global online transactions at any time of the day and update their databases so that a transaction is reflected in a customer's online account immediately. These services require enterprises to have a system which can ensure the fast movement of data without any potential failures.
In order to handle this velocity of Big Data, coupled with volume and variety, enterprises need to depend on sophisticated databases which form part of the NoSQL category. These databases relax the schema-based design constraints of SQL systems and store data as key-value pairs, making them capable of handling all 3 Vs of Big Data.
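To make the key-value idea concrete, here is a toy Python sketch (an illustration only, not how any particular NoSQL product is implemented): unlike a fixed SQL schema, each value under a key can have a different shape, so structured and unstructured records live side by side. The keys and record contents are made up.

```python
# A toy key-value store: each value can have a different shape,
# unlike rows constrained by a fixed SQL schema.

store = {}

def put(key, value):
    store[key] = value

def get(key, default=None):
    return store.get(key, default)

# A structured record and an unstructured one, side by side.
put("customer:42", {"name": "Asha", "city": "Bangalore"})
put("tweet:9001", "Loving the new phone! #happy")

print(get("customer:42")["city"])   # Bangalore
print(get("tweet:9001"))            # Loving the new phone! #happy
```

The absence of a declared schema is exactly what lets such stores absorb social media text, logs and transactional records through the same interface.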
CHAPTER 04
Big Data Technologies - Overview of Components

With the growing challenges of Big Data and the limitations of traditional data management solutions, enterprises need to leverage new technologies and data warehousing architectures, which have significant IT implications. These technologies vary in functionality, ranging from storing and processing massive volumes of data to performing various analyses on the data at the lowest level of granularity. For example, by integrating unstructured data such as text fields, social media chatter and documents, enterprises can leverage new sources of data which can reveal new insights about their customers. According to a market forecast report by ABI Research, worldwide IT spending on Big Data technologies exceeded $31 billion in 2013 and was projected to reach $114 billion. Most existing Big Data technologies fall under the open source paradigm: they are free to use and can be experimented with by anyone. In the current Big Data technology landscape, there are many open-source tools which can potentially solve any problem, but one should have the right knowledge and niche expertise in order to work efficiently with these technologies. One of the most popular and widely adopted open source Big Data technologies is Apache Hadoop. It is formally defined as an open-source software framework that supports distributed processing of vast amounts of data across clusters of computers using a simple programming model. Apache Hadoop is considered a cost-effective solution which provides capabilities to scale up from single servers to thousands of machines, each offering local computation and storage. In simple terms, it is a cluster of machines interconnected by a network, processing chunks of data at the same time, rather than one single machine, which is time consuming and inefficient, especially in the case of Big Data.
A Hadoop cluster can be made up of anywhere from a single machine to thousands of machines, commonly termed nodes.
Let us try to understand this concept using a simple example. Say an apartment complex housing 50 families has a single washing machine catering to their laundry needs. Assuming on average each family washes 10 clothes per day, and the washing machine takes one hour to clean about 50 clothes, the total time taken by the washing machine to meet the entire apartment's needs per day would be 10 hours. Now the apartment manager is considering increasing the capacity to 100 houses, and this would definitely put tremendous stress on the washing machine's daily load capacity. In order to deal with this situation, the manager should probably consider buying 4 more washing machines. With a total of 5 machines working together, the entire apartment complex's laundry needs can be managed within 4 hours on any given day. The new solution also allows the families to be more flexible about when they use the washing machines. This example briefly captures the essence of implementing distributed processing solutions using a cluster of machines rather than depending on one single machine to meet growing Big Data needs.

Named by Doug Cutting after his son's toy elephant, the Hadoop Ecosystem comprises multiple projects which together provide the complete data management solutions needed by an enterprise. Some of the projects of the Hadoop Ecosystem include HDFS, MapReduce, Hive, HBase, Pig and others. Though the evolution of Hadoop dates back to the early 2000s, its mainstream usage picked up momentum only a couple of years ago. A major advantage is its ability to efficiently manage and process unstructured data. Since about 80% of Big Data consists of unstructured data, implementing Hadoop-based solutions has become a strategic choice for many enterprises. Let's briefly review some of the key components of the Hadoop Ecosystem.
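The washing-machine arithmetic above can be captured in a couple of lines of Python (a sketch for illustration; the helper name and default rate are ours, with the rate of 50 clothes per hour taken from the example):

```python
import math

# Hours needed to process a day's load across a pool of identical machines.
# rate_per_hour is clothes one machine cleans per hour (50 in the example).
def hours_needed(total_clothes, machines, rate_per_hour=50):
    return math.ceil(total_clothes / (machines * rate_per_hour))

# 50 families x 10 clothes per day, one machine:
print(hours_needed(50 * 10, machines=1))    # 10 hours
# 100 families x 10 clothes per day, five machines:
print(hours_needed(100 * 10, machines=5))   # 4 hours
```

Doubling the load while adding machines cuts the elapsed time from 10 hours to 4, which is precisely the throughput argument for distributing work across a cluster of nodes.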
HDFS (Hadoop Distributed File System)

The two primary components of Apache Hadoop are HDFS, which provides distributed data storage capabilities, and MapReduce, which is a parallel programming framework. To better understand how Hadoop allows scalability, one should understand how HDFS functions. HDFS breaks the data to be processed into smaller pieces called blocks and stores them across the various nodes of a Hadoop cluster. This mechanism enables one to handle Big Data more efficiently and cost-effectively by employing low-cost commodity hardware on the nodes of the Hadoop cluster. Unlike relational databases, which depend on defined schemas to store structured data, HDFS puts no restrictions on the type of data and can easily handle unstructured data too. Following the NoSQL principle, HDFS allows for schema-less storage of data, which makes it popular for Big Data management.

[Course link: Apache Hadoop is the most popular IT solution for effectively dealing with Big Data. With the help of the Big Data and Hadoop Overview, Hadoop Data Management and Processing Complex Data using Hadoop modules, you will learn the technical aspects of setting up a Hadoop Cluster, its Architecture, HDFS and the MapReduce Framework and other components using hands-on examples.]
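The block-placement idea behind HDFS can be sketched in plain Python. This is a deliberately simplified illustration, not how HDFS is actually implemented: real HDFS blocks are typically tens or hundreds of megabytes and are replicated across nodes, whereas here we use a 30-byte "file", tiny blocks and round-robin placement just to show the mechanism. All names (`node1`, etc.) are made up.

```python
# Simplified sketch: cut a "file" into fixed-size blocks and spread
# the blocks across cluster nodes, recording where each block went
# (the role a namenode plays in real HDFS).

def split_into_blocks(data: bytes, block_size: int):
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_blocks(blocks, nodes):
    """Assign blocks to nodes round-robin and return the placement map."""
    placement = {node: [] for node in nodes}
    for idx, block in enumerate(blocks):
        node = nodes[idx % len(nodes)]
        placement[node].append((idx, block))
    return placement

data = b"0123456789" * 3                       # a 30-byte "file"
blocks = split_into_blocks(data, block_size=8) # 4 blocks: 8+8+8+6 bytes
layout = place_blocks(blocks, ["node1", "node2", "node3"])
print(len(blocks))   # 4
```

Because each node holds only some blocks, reads and computation can proceed on many nodes at once, which is what makes the scalability described above possible.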
MapReduce

MapReduce forms the heart of Hadoop and is a programming model which processes data stored on the nodes of a Hadoop cluster in a parallel and distributed manner. Typically a MapReduce program consists of two components: the Map() and Reduce() procedures. Both of these phases work on key-value pairs. A key-value pair is a set of two linked data items: a key, which is a unique identifier for some item of data, and the value, which is either the data that is identified or a pointer to the location of that data. Key-value pairs can be, for example, customer unique identifiers paired with location details, or URLs paired with numbers of visits. What goes into the key-value pairs is subjective and depends on the type of problem being solved. The Map() procedure, or job, performs operations such as filtering and sorting: it takes individual elements of data and breaks them down into key-value pairs. After execution of the Map() job, Reduce() implements summary functions, producing output in an aggregated form. Always remember that any MapReduce program executes the Map() job first, followed by the Reduce() job, and the output of the Map() job acts as the input to the Reduce() job.

MapReduce Example

Let's look at a simple example. Assume you have three text documents, and each file contains a specific number of words. Say the first document contains the sentence "I like Hadoop", and all the documents are stored in HDFS. The end objective is to find the frequency of the words present across all the text documents. For this we need to write Map and Reduce jobs to process the text data and summarize the word distribution. As the Map job executes, the documents are first sent to the mapper, which counts each unique word in each document: a list of key-value pairs is thus created with the word as key and its count as value.
For example, the results produced by one mapper task for the first text document would look like this:

(I,1) (Like,1) (Hadoop,1)

The list of key-value pairs generated by all mapper tasks is then processed by the reducer, which aggregates the key-value pairs from each mapper to produce a list of all the words with their counts summed across the three mappers, giving a final result set as follows:
(I,1) (Like,1) (Hadoop,3) (Is,2) (Fun,1) (So,1) (Great,1)

This is a simple and straightforward example. Even though a real-world application would be quite complex and often involves processing millions or billions of rows, the key principle behind a MapReduce execution remains the same.

Java APIs

In order to deal with Hadoop programming at the MapReduce level, one would need to work with Java APIs. Since the Hadoop framework is developed on the Java platform, MapReduce programming in the Java language is more native by design. Hadoop developers and analysts should have fair knowledge of Java concepts to process queries on data stored in the various cluster nodes. Running MapReduce jobs involves installing an Eclipse environment for Hadoop, writing Map and Reduce job scripts, compiling them into a jar file and then executing these jobs on the data stored in HDFS. For those who are averse to Java programming or do not have a developer background, alternatives for Hadoop programming exist in the form of the Pig, Hive and Hadoop Streaming components. Using the Hadoop Streaming component, it is easy to create and run MapReduce jobs with general-purpose programming languages such as Ruby, Python, Perl, C++ and R.

Pig

Pig comes to the rescue of non-technical professionals and makes it more approachable to work with Big Data on Hadoop clusters. It is a highly interactive, script-based environment for executing MapReduce jobs on the data stored in HDFS. It consists of a data flow language, called Pig Latin, which supports writing MapReduce programs with more ease and less code than the Java APIs. In many ways, the functionality of Pig is very similar to how SQL operates in relational database management systems. It also supports many user-defined functions, which can be embedded and executed along with a Java program.

[Course link: Analyzing Big Data is a key component of any enterprise's IT strategies related to Hadoop. In the Processing Complex Data using Hadoop Module, you will gain a strong command of components such as Hive, Pig and Impala, which enable faster querying and aggregation of data from a Hadoop cluster.]

Hive

Hive enables the connection between the worlds of Hadoop and SQL. It is very beneficial for people with strong SQL skills. Hive is a data warehouse infrastructure built on top of Hadoop that provides data summarization, querying and analysis capabilities using an SQL-like language called HiveQL. Similar to Pig, Hive functions as an abstraction on top of MapReduce, and queries are converted into a series of MapReduce jobs at execution time. Since its internal architecture is very similar to that of relational databases, Hive is used to handle structured data and enables easy integration between Hadoop and other business intelligence tools.
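The word-count flow from the MapReduce example earlier in this chapter can be simulated in a few lines of plain Python. This is a single-machine sketch of the map, shuffle and reduce steps, not a real Hadoop job; two of the three sample sentences are made up to match the book's final counts.

```python
from collections import defaultdict
from itertools import chain

# Map: turn each document into (word, 1) pairs.
def map_phase(document):
    return [(word.lower(), 1) for word in document.split()]

# Shuffle: group all pairs by key (what Hadoop does between Map and Reduce).
def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

# Reduce: sum the grouped values for each key.
def reduce_phase(groups):
    return {key: sum(values) for key, values in groups.items()}

documents = ["I like Hadoop", "Hadoop is fun", "so Hadoop is great"]
pairs = chain.from_iterable(map_phase(d) for d in documents)
counts = reduce_phase(shuffle(pairs))
print(counts["hadoop"])   # 3
```

On a real cluster, the map and reduce calls run in parallel across nodes and the shuffle moves pairs between machines, but the logic of the three phases is exactly this.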
Impala

Impala, similar to Hive, provides an interactive SQL-based query engine for data sitting on Hadoop servers. It is an open-source program for handling and ensuring the availability of large quantities of data. The engine was developed by the Hadoop distribution vendor Cloudera and can currently be accessed under the open source Apache license. As with Hive, Impala supports a widely known SQL-style query language, meaning that users can put their SQL expertise directly to use on Big Data. Based on comparison results published by Cloudera, Impala offers 6 to 69 times faster query times than Hive, making it a first choice among enterprises when it comes to performing Big Data analyses on Hadoop.

Hadoop Streaming

The Hadoop Streaming component is a utility which allows us to write map and reduce programs in languages other than Java. It uses UNIX standard streams as the interface between Hadoop and the MapReduce job, and thus any language that supports reading from standard input and writing to standard output can be used. It supports most programming languages, such as Ruby, Python, Perl and .NET. So when you come across a MapReduce job written in any of these languages, its execution will be handled by the Hadoop Streaming component.

[Course link: Hadoop Streaming is an essential utility and quite helpful for programmers who prefer programming with Python or R over Java. In the Performing Analytics on Hadoop Module, you will learn about running R scripts for MapReduce jobs through the Hadoop Streaming utility.]

HBase

HBase is a column-oriented database within the Hadoop Ecosystem and runs on top of HDFS. Hadoop is a batch-oriented system built around loading data into HDFS, processing it and then retrieving it; that operating mechanism is not ideal for tasks involving frequent reading and writing of data, which is where HBase comes in. MapReduce programs can read input data from and write outputs directly to HBase.
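To show the shape of a Streaming job, here is a sketch of the conventional tab-separated "key<TAB>value" line protocol that Hadoop Streaming passes between phases. In a real job, the mapper and reducer would be two separate scripts reading `sys.stdin`, with Hadoop sorting the mapper's output by key in between; here they are functions over lists of lines (an illustration only), with `sorted()` standing in for Hadoop's shuffle.

```python
# Streaming-style word count: the mapper emits "word<TAB>1" lines and the
# reducer consumes them sorted by key, summing runs of the same word.

def mapper(lines):
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"

def reducer(sorted_pairs):
    current, total = None, 0
    for pair in sorted_pairs:
        word, count = pair.split("\t")
        if word != current:
            if current is not None:
                yield f"{current}\t{total}"   # flush the finished word
            current, total = word, 0
        total += int(count)
    if current is not None:
        yield f"{current}\t{total}"           # flush the last word

lines = ["I like Hadoop", "Hadoop is fun"]
result = list(reducer(sorted(mapper(lines))))
```

Because the interface is just lines of text on standard input and output, the same structure works unchanged in Ruby, Perl, R or any other language that can read and write streams.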
Apart from the Java API, Hive and Pig can be used to write MapReduce programs to run on data sitting in HBase.

Sqoop and Flume

These components enable connectivity between Hadoop and the rest of the data world. Sqoop is a tool which allows the transfer of data between a Hadoop cluster and SQL-based relational databases. Using Sqoop, one can easily import data from external enterprise data warehouses or relational databases and efficiently store it in HDFS or Hive databases. While Sqoop is used to connect with traditional databases, Flume is used to collect and aggregate application data into HDFS. Typically, it is used to collect large amounts of log data from distributed servers. Flume's architecture is based on streaming data flows, and it can easily be integrated with Hadoop or Hive for data analysis. Common applications of the Flume component are to transport massive quantities of event data such as web logs, network traffic data, and social media data like Twitter feeds and messages.

[Course link: The Hadoop Data Management Module provides a detailed introduction, with hands-on exposure, to the database components of Hadoop such as HBase, Hive, Sqoop and Flume. You will also develop more in-depth knowledge of how to load and query data using these components.]

Zookeeper and Oozie

While Hadoop offers a great model for scaling data processing applications through its distributed file system (HDFS), MapReduce and numerous add-ons including Hive and Pig, all this power and distributed processing requires coordination and smooth workflow, which can be achieved with the Zookeeper and Oozie components. ZooKeeper is a component of the Hadoop ecosystem which enables highly reliable distributed coordination. Within a Hadoop cluster, Zookeeper looks after the synchronization and configuration of nodes and stores information about how these nodes can access different services relating to MapReduce implementations. Oozie is an open source workflow scheduler system to manage data processing jobs across a Hadoop cluster. It provides mechanisms to schedule the execution of MapReduce jobs based either on time-based criteria or on data availability. It allows for repetitive execution of multi-stage workflows that can describe a complete end-to-end process, thus reducing the need for custom coding at each stage.
CHAPTER 05
Big Data and Analytics

So far, we have learned about the various technological and database architectural components that support Big Data management. The real imperative of Big Data lies in an enterprise's ability to derive actionable insights and create business value. Building capabilities for analysing Big Data provides unique opportunities for enterprises and puts them ahead of their competition. These analyses can also be performed on more detailed and complete data, compared to traditional analysis, which is often limited to samples. However, performing analytics on Big Data is a challenging task considering the volumes and complex structures involved. To deal with this, enterprises need to be able to find the right mix of tools, expertise and analytics techniques. Many early adopters of Big Data, such as Google, Yahoo, Amazon and eBay, are considered pioneers in analysing Big Data. For example, eBay launches successful products and services by employing analytics on demographic and behavioural data from its millions of customers. Data used for analysis can come in various forms: user behaviour data, transactional data and customer service data. Amazon, on the other hand, offers a recommendation engine on its home page. It leverages Big Data analytics on data relating to a customer's buying history and demographics to identify hidden patterns and provide accurate recommendations for potential new purchases.

[Course link: The real value of Big Data lies in the insights it can generate. The Processing Complex Data using Hadoop Module provides hands-on techniques and knowledge to analyze Big Data with the help of Hadoop components. In the Performing Analytics on Hadoop Module, you will learn how analytics tools can be used to run some advanced analyses on data residing on Hadoop.]

How eBay leverages Big Data

The online auction giant eBay regularly monitors and analyzes huge amounts of information from its 100 million customers' interactions.
They use this data to conduct experiments on their users in order to maximise selling prices and customer satisfaction. On average, they run about 200 experimental tests at the same time, ranging from barely noticeable alterations to the dimensions of product images, right up to complete overhauls of the way content on users' personal eBay home pages is displayed. Their huge customer base creates 12TB of data per day, from every button users click to every product they buy, which is continually added to an existing pile of Big Data. As the data is queried by automatic monitoring systems and by employees looking to find more meaning in it, data throughput reaches 100 petabytes (102,400TB) per day.

One of the business problems eBay tackles is achieving the highest possible selling price for the items users place for sale, as its profits come from a cut of each sale. Its data scientists perform advanced analytics by looking at all the variables in the way items are presented and sold. As one approach to this problem, they began to study the impact of the quality of a listing's picture on its selling price. They used the Hadoop framework, with its capability of handling unstructured data, to process petabytes of pictures. These pictures were analyzed and re-analyzed, and the data scientists managed to extract structured information such as how much each item sold for and how many people viewed it. Towards the end, they managed to establish a possible relationship and concluded that better image quality does indeed fetch a better price.

Analytics Project Framework

Before doing a deep dive into Big Data, the first and most important aspect of any analysis is to identify the business problem. This is a fundamental step even in traditional data analytics projects. Once the business problem is defined, Big Data can be leveraged to search for hidden patterns and obtain valuable insights. Typically, the analytics problems being solved are of the following nature:

- Predicting customer churn behaviour to design reach-out campaigns
- Understanding online and offline marketing impacts on sales
- Identifying whether a transaction is fraudulent or not
- Using customer purchasing patterns to recommend new products
- Forecasting sales for better inventory management

Irrespective of the problem or vertical, the methodology involved in implementing a data analytics project remains the same. The major difference between Big Data analytics projects and traditional data projects is the scale of data being handled and the combination of tools required. The business problems, analytics techniques and project methodology, on the other hand, remain the same, independent of the data being handled. The processes typically involved in any analytics project cycle are problem definition, data gathering, selecting the right technique, performing the analysis and visualizing the final results. Let us get more perspective on the various stages involved by implementing an analytics project using a used car price prediction example.

Problem Identification

The first question that needs to be asked in any data analytics project is: what is the problem we are trying to solve? In today's Big Data world, enterprises perform data analytics on many kinds of business problems. It becomes essential to figure out which problem would create the highest business impact and to focus on it to maximize ROI.
In the case of the used car price prediction example, the problem is to determine the value of a used car from a variety of characteristics such as make, model and engine. Such information would help retailers better manage the flow of supply and demand in a highly price-volatile market. Also, with robust knowledge of the price variation of each model type, retailers can target buyers with relevant promotions and targeted discounts.

Gathering Required Data

After the business problem has been identified, the data useful for the analysis needs to be gathered from various sources. Based on the problem definition, the data attributes of these datasets can be defined.
For the prediction of used car prices, we will require sales data across years that captures the type of car sold, the number of years it was in use and the final amount paid by the buyer. Additional data can be captured on the condition and performance of the car, such as mileage, and on internal characteristics such as type, trim, doors and cruise control. These days, with the rapid growth of social media and other data sources, further data can be captured on the brand perception of used cars and on insurance claims related to the car, providing more insight into price variations.

Choosing the Right Analytics Technique

Picking the right technique for a given problem is as critical as finding the right kind of data to begin with. In analytics projects we often depend on a variety of tools and algorithms to work on different data problems; R, for example, is known for its statistical offerings, while Python is popular for text data processing. Statistical techniques rely on the business context and have specific use cases: clustering algorithms are used for customer segmentation problems, time series algorithms for forecasting problems, and recommendation algorithms to suggest more relevant products or services.

Before applying any technique, the gathered data needs to undergo a set of data operations, such as data cleansing, data aggregation, data augmentation, data sorting and data formatting, which are collectively referred to as pre-processing steps. These steps translate the raw data into a fixed format which is then fed as input to the various algorithms.

Since the problem in the used car example is the prediction of price values, regression techniques can be used. At a broad level, these deal with the prediction of continuous variables like price, income, age and sales.
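As a minimal sketch of these pre-processing steps followed by a regression fit, here is how the used car example might look in Python with pandas and NumPy. The dataset, column names and values below are invented purely for illustration; they are not taken from any real source.

```python
import numpy as np
import pandas as pd

# Hypothetical used-car sales records; columns and values are
# assumptions for illustration, not a real dataset.
raw = pd.DataFrame({
    "price":   [13500.0, 9800.0, None, 21000.0, 7600.0, 15250.0],
    "mileage": [42000, 78500, 31000, 12000, 99000, 56000],
    "age_yrs": [3, 6, 2, 1, 8, 4],
    "trim":    ["LX", "EX", "LX", "Sport", "EX", "LX"],
})

# Data cleansing: drop records where the target value is missing
clean = raw.dropna(subset=["price"])

# Data formatting: one-hot encode the categorical trim attribute
prepared = pd.get_dummies(clean, columns=["trim"], drop_first=True)

# Fit an ordinary least squares regression of price on the rest
X = prepared.drop(columns="price").to_numpy(dtype=float)
y = prepared["price"].to_numpy(dtype=float)
X_mat = np.column_stack([np.ones(len(X)), X])  # add intercept term
coefs, *_ = np.linalg.lstsq(X_mat, y, rcond=None)
print(coefs.round(2))
```

On larger datasets the same cleansing and encoding steps apply; only the execution environment changes, as discussed in the next section.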
Regression can be implemented with many algorithms, such as linear regression, random forests, neural networks and support vector machines, which vary in the complexity of their implementation and the scope for business interpretation. At this point this might sound rather technical, but a general idea is all that matters here; you will be able to appreciate the underlying concepts more while using these techniques in real projects.

Implementing Analytics Techniques

As discussed in the section above, analytics problems can be solved with the help of a variety of statistical techniques. When it comes to implementing these techniques, there are many options in terms of analytics tools, such as SAS, R and Python. SAS is more popular among enterprises because of its ease of use, while R and Python are open source tools with a lot of takers in academia and among programmers. On average, almost 80% of the time in any analytics project goes into problem identification, data gathering and pre-processing, while the remaining 20% is used to implement the chosen techniques and visualize the final results.

In the case of Big Data, the same algorithms can be translated into MapReduce algorithms to run on Hadoop clusters, which often requires more effort and specialized expertise. In the Hadoop ecosystem, the Mahout component uses the Java programming language to implement statistical techniques such as classification and recommendation algorithms. Depending on the data volumes gathered for the used car prediction problem, a linear regression technique can be implemented in SAS or R for smaller datasets, while Hadoop integrated with R or SAS can be used for Big Data. Another alternative for Big Data is the Mahout component, which requires Java expertise.

Visualizing End Results

Data visualization is used to display the output of analytics projects. Typically this is the last step of any analytics project, where visualization techniques are applied either to validate the outcomes of a technique or to present the end results to a non-technical management team. This can be done with data visualization software such as Tableau and Spotfire, or with the in-built capabilities of SAS and R. Compared with SAS, R offers a variety of packages for visualizing datasets, notably ggplot2 and lattice.

After building the linear regression model for used car price prediction, visualization techniques are applied to validate the statistical results and to check whether they satisfy the technique's assumptions. Standard validation checks for a linear regression model include tests for heteroskedasticity, autocorrelation and multicollinearity. The plots above showcase examples of these validation techniques applied to the final model results of the used car price prediction example.

Different Kinds of Analytics

From the used car price prediction example, one can see that the framework for implementing analytics projects remains the same irrespective of domain. However, as different enterprises solve different business problems, there are different kinds of analytics with domain-specific applications. Some of the common ones are marketing analytics, customer analytics, risk analytics, fraud analytics, human resource analytics and web analytics, which are classified by business function.

Marketing analytics in any enterprise revolves around increasing efficiency and maximizing marketing performance through analyses such as marketing mix optimization, marketing mix modeling, price analysis and promotional analysis, to name a few. Customer analytics, on the other hand, deals with understanding customer behaviour and increasing loyalty through analyses like customer segmentation, attrition analysis and lifetime value analysis.

Another common classification is based on the complexity of the analytics techniques being implemented and is independent of the domain; these kinds are broadly classified into basic analytics and advanced analytics.

Basic Analytics

Basic analytics techniques are generally used to explore your data and include simple visualizations or simple statistics. Some common examples are:

- Slicing and dicing refers to breaking data down into smaller sets that are easier to explore. It is often employed as a preliminary step to gain more understanding of the data attributes, of which techniques can be applied, and of how much computational power a full-scale analysis would require.
- Anomaly identification is the process of detecting outliers: events where the actual observation differs from the expected value. This might involve computing summary statistics like the mean, median and range, and sometimes visualization techniques such as scatter plots and box plots to identify outliers by visual means.

Advanced Analytics

Advanced analytics involves the application of statistical algorithms for complex analysis of either structured or unstructured data.
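Returning briefly to the anomaly identification example under basic analytics: a minimal sketch needs nothing more than Python's standard library. The transaction amounts below are invented for illustration, and the fence rule shown is the same 1.5x interquartile-range convention a box plot uses to mark outliers.

```python
import statistics

# Hypothetical daily transaction amounts; values are illustrative only
amounts = [120, 135, 128, 131, 119, 142, 125, 980, 133, 127]

mean = statistics.mean(amounts)      # summary statistics the text mentions
median = statistics.median(amounts)

# Flag observations beyond 1.5x the interquartile range
q1, _, q3 = statistics.quantiles(amounts, n=4)
iqr = q3 - q1
outliers = [a for a in amounts
            if a < q1 - 1.5 * iqr or a > q3 + 1.5 * iqr]
print(outliers)  # -> [980]
```

The single flagged value is exactly the kind of observation that "differs from the expected value" and would stand out in a box plot.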
Among its many use cases, these techniques can be deployed to find patterns in data, make predictions, produce forecasts and process complex events. With the growth of Big Data and enterprises' need to stay ahead of the competition, advanced analytics implementations have become mainstream as an integral part of the decision-making process. Some examples of advanced analytics are:

- Text analytics is the process of analyzing unstructured text, extracting relevant information and transforming it into structured information on which statistical techniques can be applied. Since much of Big Data comprises unstructured data, text analytics has become one of the mainstream applications of Big Data analytics.
- Predictive modeling consists of statistical or data-mining solutions, including algorithms and techniques to determine future outcomes. A predictive model is made up of a number of predictors: variable factors that are likely to influence future behavior or results. In marketing, for example, a customer's gender, age and purchase history might predict the likelihood of a future sale. Other common applications include churn prediction, fraud detection, customer segmentation, marketing spend optimization and many more.
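A minimal sketch of such a predictive model: fitting a logistic regression by plain gradient descent to predict the likelihood of a future sale from a customer's age and purchase history. All records and values below are invented for illustration, not real customer data.

```python
import numpy as np

# Hypothetical customer records: [age, past purchases], with a 0/1
# label for whether a repeat sale followed; all values illustrative.
X = np.array([[22, 1], [35, 4], [48, 9], [29, 2],
              [52, 11], [41, 6], [25, 0], [38, 7]], dtype=float)
y = np.array([0, 0, 1, 0, 1, 1, 0, 1], dtype=float)

# Standardize predictors so one learning rate suits both columns
X = (X - X.mean(axis=0)) / X.std(axis=0)
X = np.column_stack([np.ones(len(X)), X])  # add intercept term

# Fit logistic regression by gradient descent on the log-loss
w = np.zeros(X.shape[1])
for _ in range(5000):
    p = 1.0 / (1.0 + np.exp(-X @ w))   # predicted sale probability
    w -= 0.1 * X.T @ (p - y) / len(y)  # gradient step

preds = (1.0 / (1.0 + np.exp(-X @ w)) > 0.5).astype(int)
print(preds.tolist())
```

In practice one would evaluate such a model on held-out data rather than the training records shown here; the sketch only illustrates how a handful of predictors can drive a likelihood estimate.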