
Beginner's Guide to Big Data Analytics

Introduction

Big Data. What do these two words really mean? Everyone is talking about it, but frankly, not many really understand what the hype is all about. This book by Jigsaw Academy aims to give you an understanding of Big Data and what makes data big, while also explaining in simple language the challenges of Big Data, the emerging technologies and the Big Data landscape. Finally, we talk about careers in Big Data and what the future could hold in store for the industry.

This book is also a useful companion for those of you enrolled in Jigsaw's course Big Data Analytics Using Hadoop and R. You can use it to complement your learning and better understand Big Data. Please note the [Course link] boxes in every chapter, which tie the content of the chapter to the modules covered in the course.

Enjoy the book.

Big Data Team, Jigsaw (Led by Team Lead Kiran P.V.)

Outline

1. What is Big Data
2. What Makes Data Big
3. Challenges of Big Data Processing
4. Big Data Technologies
5. Big Data and Analytics
6. Unstructured Data and Text Analytics
7. Big Data in Action
8. Big Data Landscape
9. Big Data Career Paths
10. Big Data in the Future
Learn more about Big Data

CHAPTER 01
What is Big Data?

"I don't know how the cost of hard disks has decreased so rapidly. These days one can buy a terabyte hard drive for just $100," a friend told me a couple of years ago. It's hard not to agree with him, and a quick review of historical facts validated his opinion. In the 1990s, the cost of 1 gigabyte of hard disk space was around $10,000; now it can be purchased for only $0.10. The price has dropped 100,000 times over a span of 20 years. Currently, a few gigabytes of storage are even being offered free of cost by service providers and file hosting services. For personal accounts, Gmail offers about 15 gigabytes of free storage, whereas the file hosting service Dropbox offers up to 3.5 gigabytes. These limits are higher for business accounts.

One would wonder how enterprises are influenced by the lower costs of storage space. For one, it definitely provides them with more opportunities to store data around their product and service offerings. Virtually every industry is seeing a tremendous explosion in new data sources and is dependent on advanced storage technologies. Increased adoption of the internet and smartphones has enabled individuals across the globe to leave a huge digital footprint of online data which many enterprises want. In the past, for example, banks used to store customer data mostly around demographic information tracked from application forms and transaction information tracked from passbooks. These days, the customer data being stored is enormous and varies widely across mobile usage, online transactions, ATM withdrawals, customer feedback, social media comments and credit bureau information. All these new sources of data which did not exist in the past can be categorized under the new term Big Data. Big Data can loosely be described as data which is huge, but more importantly, Big Data is data that comes from multiple sources rather than just one.

Big Data is definitely one of the more fascinating evolutions of the 21st century in the world of IT. The truth is that Big Data has opened up tremendous opportunities and has provided us with endless solutions to deal with social, economic and business problems. For enterprises, it is a huge untapped source of profit which, if used appropriately, will be the key to staying ahead of their competition. In order to deal with Big Data effectively, they need to depend on advanced database technologies and faster processing capabilities. Just having Big Data is not a sufficient criterion for success; enterprises also need to implement analytics effectively in order to garner insights that help improve profitability. They should actively pursue the art and science of capturing, analysing and creating value out of Big Data.

[Course link: The Big Data and Hadoop Overview Module provides pre-class videos and lots of reading material on the importance of Big Data and how it is transforming the way enterprises are implementing data-based strategies to become more competitive.]

CHAPTER 02
What makes data Big?

We live in the era of Big Data, and it is not leaving any industry untouched, be it financial services, consumer goods, e-commerce, transportation, manufacturing or social media. Across all industries, enterprises now have access to an increasing number of both internal and external Big Data sources. Internal sources typically track information around demographics, online or offline transactions, sensors, server logs, website traffic and e-mails. This list is not exhaustive and varies from industry to industry. External sources, on the other hand, are mostly related to social media information from online discussions, comments and feedback shared about a particular product or service. Another major source of Big Data is machine data, which consists of real-time information from sensors and web logs that monitor customers' online activity. In the coming years, as we continue to develop new ways of data generation, either online or offline, by leveraging technological advancements, the one correct prediction we can make is this: the growth of Big Data is not going to stop.

Although Big Data is broadly about data being captured from multiple sources and at greater sizes, there are many technical definitions which provide more clarity. The O'Reilly Strata group states that Big Data is "data that exceeds the processing capacity of conventional database systems. The data is too big, moves too fast, or doesn't fit the structures of your database architectures." In simple terms, Big Data needs multiple systems to efficiently handle and process data rather than a single system. Say an online e-commerce enterprise in a single region is generating about 1,000 gigabytes of data on a daily basis, which can be handled and processed using traditional database systems. On expanding operations to a global level, its daily data generation has increased 10,000 times and is currently at 10 petabytes (1 petabyte = 1,000,000 gigabytes). To handle this kind of data, traditional database systems do not have the required capabilities, and enterprises need to depend on Big Data technologies such as Hadoop, which uses a distributed computing framework. We will learn more about these technical topics in subsequent chapters.

To further simplify our understanding of Big Data, we can rely on three major characteristics, i.e. volume, variety and velocity, more commonly referred to as the 3 V's of Big Data. Occasionally, some resources also talk about a less common characteristic, veracity, which is referred to as the 4th V of Big Data. Together, these four characteristics provide more detail around the nature, scope and structure of Big Data.

[Course link: Commonly, Big Data is characterized by the 3 V's, and these provide context for a new class of possibilities. You will learn more about how these characteristics help extract more information from massive amounts of data in the Big Data and Hadoop Overview Module.]

Volume

Volume deals with the size aspect of Big Data. With technical advancements in global connectivity and social media, the amount of data generated on a daily basis is growing exponentially. Every day, about 30 billion pieces of information are shared globally. An IDC Digital Universe study estimated that the global data volume was about 1.8 zettabytes as of 2011 and would grow about five times by 2015. A zettabyte is a quantity of information or information storage capacity equal to one billion terabytes, i.e. a 1 followed by 21 zeros, in bytes. Across many enterprises, these increasing volumes pose an immediate challenge to traditional database systems with regard to the storing and processing of data.

Variety

Big Data comes from sources such as conversations on social media, media files shared on social networks, online transactions, smartphone usage, climate sensor data, financial market data and many more. The underlying formats of the data coming out of these sources vary, from Excel sheets and text documents to audio files and server log files, and can broadly be classified as either structured or unstructured data types. Structured data formats typically refer to a defined way of storing data, i.e. clearly marked out rows and columns, whereas unstructured data formats do not have any order and mostly refer to text, audio and video data. Unstructured formats of data are a more recent phenomenon, and traditional database systems do not possess the required capabilities to process this kind of information.

Velocity

Increased volumes of data have put a lot of stress on the processing abilities of traditional database systems. Enterprises should be able to quickly process incoming data from various sources and then share it with the business to ensure the smooth functioning of day-to-day operations. This quick flow of data within an enterprise is the velocity characteristic of Big Data. Another important aspect is the ability to provide relevant services to the end user on a real-time basis. For example, Amazon provides instant recommendation services depending on the user's search and location. Based on the entered keyword, these services need to search through the entire history of transactions and share relevant results, which hopefully convert into a potential purchase. The effects of velocity are very similar to those of volume, and enterprises need to rely on advanced processing technologies to efficiently handle Big Data.

Veracity

Though enterprises have access to a lot of Big Data, some aspects of it will be missing. Over the years, we have seen that data quality issues usually arise from human entry errors or from individuals withholding information. In the Big Data era, where most data capturing processes are automated, the same issues can occur due to technical lapses arising from system or sensor malfunction. Whatever the reasons, one should be careful to deal with inconsistency in Big Data before using it for any kind of analysis.

CHAPTER 03
Challenges of Big Data Processing

Just having a big source of data is not enough to become successful; enterprises need to implement relevant processes and systems which will help them extract value out of it. An important aspect here is what one should do with it. In the absence of a business context, data in itself is meaningless and would just occupy space on the storage servers. Also, many Big Data sources tend to have missing or low-content information, as described by the veracity characteristic earlier. The actual power of Big Data will surface only by applying analytics on top of it, when it is used to generate useful insights to guide future decision making. Irrespective of the size of the data, whether big or small, analytics methodologies need to be implemented to reap benefits. This typically involves cleaning, analysing, interpreting and finally visualizing the hidden patterns that emerge from the data.

Due to the sheer volume, variety and velocity of Big Data, the processing capacity of traditional database systems is strained beyond its limits. So enterprises need to look out for advanced processing technologies and capabilities to effectively manage Big Data and further implement analytics on it. One of the major requirements of any Big Data processing framework is to successfully handle huge amounts of information without compromising on query time. Traditional database systems lack the required infrastructure and internal designs to process Big Data at a scale of petabytes or exabytes. Since these systems tend to operate out of a single machine with a huge hard disk and processing capabilities, a set of limitations comes with them. The first is scalability: with a continuous increase in data volumes, the storage capacity of these systems needs to be continuously increased, and this can be an expensive option. The second is slow querying time, because the storage load is already operating at maximum levels, and enterprises cannot wait for days to

get their daily reports. These limitations call for alternate approaches based on scalable storage and a distributed approach to scaling.

Big Data sources are diverse, and inputs to these systems can come in structured or unstructured formats. Since the origin of these data formats is spread across the globe, most of the time the data won't have a predefined order and requires pre-processing before being used for any analysis. A common use of Big Data processing is to make use of unstructured data, specifically comments on social media, to track customer sentiment towards various product and service offerings. Due to their inherently static design, many traditional database systems can handle only structured data and do not provide any alternatives for unstructured data. For example, SQL-based database systems depend on schema designs which clearly define the nature of the data being loaded, and are used to process transactional data. Since unstructured data for the most part does not have a proper structure, it would be impossible to handle it in SQL-based systems. Luckily for us, there exist alternatives in the form of NoSQL databases, which can handle both structured and unstructured data formats.

The majority of client applications run by enterprises are real-time, and instant support for services has become a priority. This requires processing Big Data by the minute in order to provide relevant service to customers. For example, based on a user's search keyword, Google instantly processes information stored across its millions of databases and returns relevant links within a matter of seconds. Similarly, banks need to track global online transactions at any time of the day and update their databases so that the changes are reflected in a customer's online account immediately. These services require enterprises to have a system which can ensure the fast movement of data without any potential failures. In order to handle this velocity of Big Data, coupled with volume and variety, enterprises need to depend on sophisticated databases which form part of the NoSQL category. These databases relax the limits of the schema-based design of SQL systems and store data in key-value pairs (a small sketch follows below), and are capable of handling all 3 V's of Big Data.

[Course link: NoSQL technologies are helping enterprises to achieve more than what was possible previously. In the Big Data and Hadoop Overview Module, the evolution and benefits of new technologies such as Hadoop and MongoDB are discussed, along with how these help in overcoming the limitations of traditional IT systems for solving Big Data problems.]
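To make the schema-relaxation idea concrete, here is a minimal illustrative sketch in plain Python (not tied to any particular NoSQL product) of how key-value records can hold structured and unstructured data side by side; all the names and fields are invented for illustration:

# Illustrative only: records keyed by a unique ID, with no fixed schema.
customers = {}

# A structured, table-like record:
customers["cust-1001"] = {
    "name": "A. Rao",
    "city": "Bangalore",
    "monthly_spend": 4200.50,
}

# An unstructured record in the same store: free text plus a media link.
customers["cust-1002"] = {
    "feedback_text": "Loved the fast delivery, but the packaging was damaged.",
    "attachment": "photo_2014_03.jpg",
}

# Neither insert required a predefined schema; each value simply maps
# keys to whatever data the source produced.
for key, value in customers.items():
    print(key, "->", sorted(value.keys()))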

CHAPTER 04
Big Data Technologies - Overview of Components

With the growing challenges of Big Data and the limitations of traditional data management solutions, enterprises need to leverage new technologies and data warehousing architectures, which have significant IT implications. These technologies vary in functionality, ranging from storing and processing massive volumes of data to performing various analyses on the data at the lowest level of granularity. For example, by integrating unstructured data such as text fields, social media chatter and documents, enterprises can leverage new sources of data which can reveal new insights about their customers. According to a market forecast report by ABI Research, worldwide IT spending on Big Data technologies exceeded $31 billion in 2013 and is projected to reach $114 billion by 2018.

Most existing Big Data technologies fall under the open source paradigm. These are free to use and can be experimented with by anyone. In the current Big Data technology landscape, there are many open-source tools which can potentially solve any problem, but one should have the right knowledge and niche expertise in order to work efficiently with these technologies. One of the most popular and widely adopted open source Big Data technologies is Apache Hadoop. It is formally defined as an open-source software framework that supports distributed processing of vast amounts of data across clusters of computers using a simple programming model. Apache Hadoop is considered a cost-effective solution which provides the capability to scale up from single servers to thousands of machines, each offering local computation and storage. In simple terms, it is a cluster of machines interconnected by a network, processing chunks of data at the same time, rather than depending on one single machine, which is time-consuming and inefficient, especially in the case of Big Data. A Hadoop cluster can be made up of a single machine or thousands of machines, commonly termed nodes.

Let us try to understand this concept using a simple example. Say an apartment complex housing 50 families has a single washing machine catering to their laundry needs. Assuming each family washes on average 10 clothes per day, and the washing machine takes one hour to clean about 50 clothes, the total time taken by the washing machine to meet the entire apartment's needs per day would be 10 hours. Now the apartment manager is considering increasing the capacity to 100 houses, which would definitely put tremendous stress on the washing machine's daily load capacity. To deal with this situation, the manager should probably consider buying 4 more washing machines. With a total of 5 machines working together, the entire apartment complex's laundry needs can be met within 4 hours on any given day. The new solution also gives the families more flexibility about when they use a washing machine. This example briefly captures the essence of implementing distributed processing solutions using a cluster of machines rather than depending on one single machine to meet growing Big Data needs.

Invented and named by Doug Cutting after his son's toy elephant, the Hadoop Ecosystem comprises multiple projects which together provide the complete data management solution needed by an enterprise. Some of the projects in the Hadoop Ecosystem include HDFS, MapReduce, Hive, HBase, Pig and others. Though the evolution of Hadoop dates back to the early 2000s, its mainstream usage picked up momentum only a couple of years ago. Its major advantage is its ability to efficiently manage and process unstructured data. Since about 80% of Big Data consists of unstructured data, implementing Hadoop-based solutions has become a strategic choice for many enterprises. Let's briefly review some of the key components of the Hadoop Ecosystem.

HDFS (Hadoop Distributed File System)

The two primary components of Apache Hadoop are HDFS, which provides distributed data storage capabilities, and MapReduce, which is a parallel programming framework. To better understand how Hadoop allows scalability, one should understand how HDFS functions. HDFS breaks the data to be processed into smaller pieces called blocks and stores them across the various nodes of a Hadoop cluster. This mechanism enables one to handle Big Data more efficiently and cost-effectively by employing low-cost commodity hardware for the nodes of the Hadoop cluster. Unlike relational databases, which depend on defined schemas to store structured data, HDFS puts no restrictions on the type of data and can easily handle unstructured data too. Based on the NoSQL principle, HDFS allows for schema-less storage of data, which makes it popular when it comes to Big Data management.

[Course link: Apache Hadoop is the most popular IT solution for effectively dealing with Big Data. With the help of the Big Data and Hadoop Overview, Hadoop Data Management and Processing Complex Data using Hadoop modules, you will learn the technical aspects of setting up a Hadoop cluster, its architecture, HDFS, the MapReduce framework and other components through hands-on examples.]
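To get a feel for how block-based storage spreads a file across nodes, here is a small back-of-the-envelope sketch in Python. The 128 MB block size is a common HDFS default, and the file size, node count and replication factor are made-up numbers for illustration:

# Rough sketch: how a large file breaks into HDFS-style blocks.
import math

file_size_gb = 512      # hypothetical input file (512 GB)
block_size_mb = 128     # a common HDFS default block size
nodes = 40              # hypothetical cluster size
replication = 3         # HDFS keeps copies of each block (default is 3)

num_blocks = math.ceil(file_size_gb * 1024 / block_size_mb)
total_stored_mb = num_blocks * block_size_mb * replication

print("blocks:", num_blocks)                          # 4096 blocks
print("blocks per node (roughly):", num_blocks // nodes)
print("raw storage with replication (GB):", total_stored_mb / 1024)

No single machine needs to hold the whole file: each node stores only its share of blocks, and losing a node is survivable because every block exists on multiple nodes.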

MapReduce

MapReduce forms the heart of Hadoop and is a programming model which processes the data stored on the nodes of a Hadoop cluster in a parallel and distributed manner. Typically, a MapReduce program consists of two components: the Map() and Reduce() procedures. Both of these phases work on key/value pairs. A key/value pair is a set of two linked data items: a key, which is a unique identifier for some item of data, and the value, which is either the data that is identified or a pointer to the location of that data. These key/value pairs could be, for example, unique customer identifiers paired with location details, or URLs paired with numbers of visits. What goes into the key/value pairs is subjective and depends on the type of problem being solved. The Map() procedure, or job, performs operations such as filtering and sorting: it takes individual elements of data and breaks them down into key/value pairs. After the Map() job executes, the Reduce() job implements summary functions, so that the output is in an aggregated form. Always remember that any MapReduce program executes the Map() job first, followed by the Reduce() job, and the output of the Map() job acts as the input to the Reduce() job.

MapReduce Example

Let's look at a simple example. Assume you have three text documents, each containing a specific number of words, and all the documents are stored in HDFS. Say the first document contains the sentence "I like Hadoop". The end objective is to find the frequency of the words present across all the text documents. For this we need to write Map and Reduce jobs to process the text data and summarize the word distribution. As the Map job executes, the documents are first sent to the mappers, which count each unique word in each document: a list of key/value pairs is thus created with the word as key and its count as value. For example, the results produced by one mapper task for the first text document would look like this:

(I,1) (Like,1) (Hadoop,1)

The lists of key/value pairs generated by all the mapper tasks are then processed by the reducer, which aggregates the key/value pairs from each mapper to produce a list of all the words with their counts summed across the three mappers, giving a final result set as follows:

(I,1) (Like,1) (Hadoop,3) (Is,2) (Fun,1) (So,1) (Great,1)

This is a simple and straightforward example. Even though a real application would be quite complex, often involving the processing of millions or billions of rows, the key principle behind a MapReduce execution remains the same.
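The flow above can be mimicked in a few lines of plain Python. This is only a local simulation of the map, shuffle/sort and reduce phases (no Hadoop involved), using three hypothetical documents consistent with the counts in the example:

# Simulating MapReduce word count locally: map -> shuffle/sort -> reduce.
from collections import defaultdict

documents = ["I like Hadoop", "Hadoop is fun", "So Hadoop is great"]

# Map phase: each document independently emits (word, 1) pairs.
mapped = []
for doc in documents:
    for word in doc.split():
        mapped.append((word, 1))  # capitalization kept as-is for simplicity

# Shuffle/sort phase: group all values belonging to the same key.
grouped = defaultdict(list)
for key, value in mapped:
    grouped[key].append(value)

# Reduce phase: aggregate (here, sum) the values for each key.
counts = {key: sum(values) for key, values in grouped.items()}
print(counts)  # e.g. 'Hadoop' maps to 3, 'is' maps to 2

On a real cluster, the map and reduce phases run as separate tasks on many nodes, and the framework performs the grouping step between them; the logic, however, is exactly this.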

Java APIs

To work with Hadoop programming at the MapReduce level, one needs to work with the Java APIs. Since the Hadoop framework is developed on the Java platform, writing MapReduce programs in Java is the most native approach. Hadoop developers or analysts should have a fair knowledge of Java concepts to process queries on data stored across the cluster nodes. Running MapReduce jobs involves setting up an Eclipse environment for Hadoop, writing the Map and Reduce job scripts, compiling them into a jar file and then executing these jobs on the data stored in HDFS. For those who are averse to Java programming or do not have a developer background, alternatives exist for Hadoop programming in the form of the Pig, Hive and Hadoop Streaming components. Using the Hadoop Streaming component, it is easy to create and run MapReduce jobs with general-purpose programming languages such as Ruby, Python, Perl, C++ and R.

Pig

Pig comes to the rescue of non-technical professionals and makes it more approachable to work with Big Data on Hadoop clusters. It is a highly interactive, script-based environment for executing MapReduce jobs on the data stored in HDFS. It consists of a data flow language, called Pig Latin, which supports writing MapReduce programs with more ease and less code compared with the Java APIs. In many ways, the functionality of Pig is very similar to how SQL operates in relational database management systems. It also supports user-defined functions, which can be embedded and executed along with a Java program.

Hive

Hive enables the connection between the worlds of Hadoop and SQL, and is very beneficial for people with strong SQL skills. Hive is a data warehouse infrastructure built on top of Hadoop that provides data summarization, querying and analysis capabilities using an SQL-like language called HiveQL. Similar to Pig, Hive functions as an abstraction on top of MapReduce, and queries are converted to a series of MapReduce jobs at execution time. Since its internal architecture is very similar to that of relational databases, Hive is used to handle structured data and enables easy integration between Hadoop and other business intelligence tools.

[Course link: Analyzing Big Data is a key component of any enterprise's IT strategy related to Hadoop. In the Processing Complex Data using Hadoop Module, you will gain a strong command of components such as Hive, Pig and Impala, which enable faster querying and aggregation of data from a Hadoop cluster.]

Impala

Impala, similar to Hive, provides an interactive SQL-based query engine for data sitting on Hadoop servers. It is an open-source program for handling and ensuring the availability of large quantities of data. The engine was developed by the Hadoop distribution vendor Cloudera and can currently be used under the open source Apache license. As with Hive, Impala supports a widely known SQL-style query language, meaning that users can put their SQL expertise directly to use on Big Data. Based on comparison results published by Cloudera, Impala offers 6 to 69 times faster query times than Hive, making it a first choice among enterprises when it comes to performing Big Data analyses on Hadoop.

Hadoop Streaming

The Hadoop Streaming component is a utility which allows us to write Map and Reduce programs in languages other than Java. It uses UNIX standard streams as the interface between Hadoop and the MapReduce job, so any language that supports reading standard input and writing standard output can be used. It supports most programming languages, such as Ruby, Python, Perl and .NET. So when you come across a MapReduce job written in any of these languages, its execution is handled by the Hadoop Streaming component (a small Python sketch appears below, after the Sqoop and Flume section).

[Course link: Hadoop Streaming is an essential utility and quite helpful for programmers who prefer programming in Python or R over Java. In the Performing Analytics on Hadoop Module, you will learn about running R scripts as MapReduce jobs through the Hadoop Streaming utility.]

HBase

HBase is a column-oriented database within the Hadoop Ecosystem and runs on top of HDFS. Hadoop is a batch-oriented system, which involves loading data into HDFS, processing it and then retrieving it; that operating mechanism is not ideal for tasks involving regular reading and writing of data, which is where HBase comes in. MapReduce programs can read input data from, and write outputs directly to, HBase. Apart from the Java API, Hive and Pig can be used to write MapReduce programs that work on data sitting in HBase.

Sqoop and Flume

These components enable connectivity between Hadoop and the rest of the data world. Sqoop is a tool which allows the transfer of data between a Hadoop cluster and SQL-based relational databases. Using Sqoop, one can easily import data from external enterprise data warehouses or relational databases and efficiently store it in HDFS or Hive databases. While Sqoop is used to connect with traditional databases, Flume is used to collect and aggregate application data into HDFS. Typically, it is used to collect large amounts of log data from distributed servers. Flume's architecture is based on streaming data flows and can be easily integrated with Hadoop or Hive for analysis of the data. Some common applications of the Flume component are transporting massive quantities of event data such as web logs, network traffic data, and social-media-generated data like Twitter feeds and messages.

[Course link: The Hadoop Data Management Module provides a detailed introduction, with hands-on exposure, to the database components of Hadoop such as HBase, Hive, Sqoop and Flume. You will also develop more in-depth knowledge of how to load and query data using these components.]
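As promised in the Hadoop Streaming section above, here is a minimal sketch of what word-count Map and Reduce scripts for Hadoop Streaming might look like in Python. The file names are illustrative assumptions, and the logic mirrors the example from the MapReduce section; each script reads standard input and writes standard output, which is all Hadoop Streaming requires:

# --- mapper.py: emit a "word<TAB>1" pair for every word read from stdin ---
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(word + "\t1")

# --- reducer.py: sum the counts. Hadoop Streaming sorts mapper output by
# --- key, so all counts for one word arrive on consecutive lines.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.strip().split("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(current_word + "\t" + str(current_count))
        current_word, current_count = word, int(count)

if current_word is not None:
    print(current_word + "\t" + str(current_count))

An illustrative launch command (the streaming jar's path and name vary by distribution) would look like: hadoop jar hadoop-streaming.jar -input /docs -output /wordcount -mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py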

Zookeeper and Oozie

While Hadoop offers a great model for scaling data processing applications through its distributed file system (HDFS), MapReduce and numerous add-ons including Hive and Pig, all this power and distributed processing requires coordination and smooth workflow, which can be achieved with the ZooKeeper and Oozie components. ZooKeeper is a component of the Hadoop ecosystem which enables highly reliable distributed coordination. Within a Hadoop cluster, ZooKeeper looks after the synchronization and configuration of nodes, and stores information about how these nodes can access the different services relating to MapReduce implementations. Oozie is an open source workflow scheduler system for managing data processing jobs across a Hadoop cluster. It provides mechanisms to schedule the execution of MapReduce jobs based either on time-based criteria or on data availability. It allows for the repetitive execution of multi-stage workflows that describe a complete end-to-end process, reducing the need for custom coding at each stage.

CHAPTER 05
Big Data and Analytics

So far, we have learned about the various technological and database architecture components that support Big Data management. The real imperative of Big Data lies in an enterprise's ability to derive actionable insights and create business value. Building the capability to analyse Big Data provides unique opportunities for enterprises and puts them ahead of their competition. Moreover, these analyses can be performed on more detailed and complete data, compared with traditional analysis, which is often limited to samples. However, performing analytics on Big Data is a challenging task considering the volumes and complex structures involved. To deal with this, enterprises need to be able to find the right mix of tools, expertise and analytics techniques.

Many early adopters of Big Data, such as Google, Yahoo, Amazon and eBay, are considered pioneers in analysing Big Data. For example, eBay launches successful products and services by employing analytics on demographic and behavioural data from its millions of customers. The data used for analysis can come in various forms: user behaviour data, transactional data and customer service data. Amazon, on the other hand, offers a recommendation engine on its home page. It applies Big Data analytics to data on customers' buying histories and demographics to identify hidden patterns and provide accurate recommendations for potential new purchases.

[Course link: The real value of Big Data lies in the insights it can generate. The Processing Complex Data using Hadoop Module provides hands-on techniques and knowledge to analyze Big Data with the help of Hadoop components. In the Performing Analytics on Hadoop Module, you will learn how analytics tools can be used to run advanced analyses on data residing on Hadoop.]

How eBay leverages Big Data

Online auction giant eBay regularly monitors and analyzes huge amounts of information from its 100 million customers' interactions. It uses this data to conduct experiments on its users in order to maximise selling prices and customer satisfaction. On average, eBay runs about 200 experimental tests at the same time, ranging from barely-noticeable alterations to the dimensions of product images, right up to complete overhauls of the way content on users' personal eBay home pages is displayed. Its huge customer base creates 12 TB of data per day, from every button clicked to every product bought, which is continually added to an existing pile of Big Data. As the data is queried by automatic monitoring systems and by employees looking to find more meaning in it, data throughput reaches 100 petabytes (102,400 TB) per day.

One of the business problems eBay tackles is achieving the highest possible selling price for every item users place for sale, as its profits come from a cut of each sale. Its data scientists perform advanced analytics by looking at all the variables in the way items are presented and sold. As one solution to this problem, they began to study the impact of the quality of the picture in a listing on the selling price. They used the Hadoop framework to process petabytes of pictures, owing to its capability to handle unstructured data. These pictures were then analyzed and re-analyzed, and the data scientists managed to extract structured information such as how much the items sold for and how many people viewed them. In the end, they managed to establish a possible relation and concluded that better

image quality actually does provide a better price.

Analytics Project Framework

Before doing a deep dive into Big Data, the first and most important aspect of any analysis is to identify the business problem. This is a fundamental step even in traditional data analytics projects. Once the business problem is defined, Big Data can be leveraged to search for hidden patterns and extract valuable insights. Typically, the analytics problems being solved are of the following nature:

- Predicting customer churn behaviour to design reach-out campaigns
- Understanding online and offline marketing impacts on sales
- Identifying whether a transaction is fraudulent or not
- Using customer purchasing patterns to recommend new products
- Forecasting sales for better inventory management

Irrespective of the problem or the vertical, the methodology involved in implementing data analytics projects remains the same. The major difference between Big Data analytics projects and traditional data projects is the scale of the data being handled and the combination of tools required; the business problems, analytics techniques and project methodology remain the same and are independent of the data being handled. As part of any analytics project cycle, the processes typically involved are problem definition, data gathering, selecting the right technique, performing the analysis and visualizing the final results. Let us get some more perspective on the various stages involved in implementing an analytics project, using a used car price prediction example.

Problem Identification

The first question that needs to be asked in any data analytics project is: what is the problem we are trying to solve? In today's Big Data world, enterprises perform data analytics on many kinds of business problems. It becomes essential to figure out which problem would create the higher business impact and then focus on it to maximize ROI. In our example, the problem is determining the value of a used car based on a variety of characteristics such as make, model and engine. Such information would help retailers better manage the supply and demand flow in a highly price-volatile market. Also, with robust knowledge of the price variations of each model type, retailers can target buyers with relevant promotions and targeted discounts.

Gathering required Data

After identification of the business problem, data that will be useful for the analysis needs to be gathered from various sources. Based on the problem definition, the data attributes of these datasets can be defined.

To predict used car prices, we will require sales data across years which captures information on the type of car sold, the number of years it was in use and the final amount paid by the buyer. Additional data can be captured on the condition and performance of the car, such as mileage, and on internal characteristics such as type, trim, doors and cruise control. These days, with the rapid growth of social media and other data sources, more data can be captured around the brand perception of used cars and insurance claims related to the car, which provides more insight into price variations.

Choosing the Right Analytics Technique

Picking the right technique for any given problem is as critical as finding the right kind of data to begin with. In analytics projects, we often depend on various tools and algorithms to work on various data problems. For example, R is known for its statistical offerings, while Python is popular for text data processing. Statistical techniques rely on business context and have specific use cases: clustering algorithms are used for solving customer segmentation problems, time series algorithms for forecasting problems, and recommendation algorithms to provide insights on more relevant products or services. Before applying any technique, the gathered data needs to undergo a set of data operations, such as data cleansing, data aggregation, data augmentation, data sorting and data formatting, which are collectively referred to as pre-processing steps. These steps translate the raw data into a fixed format which is then fed as input to the various algorithms.

Since the problem in the used car example is forecasting price values, regression techniques can be used. At a broad level, these deal with predictions of continuous variables like price, income, age, sales, etc. Many algorithms can implement regression techniques, such as linear regression, random forests, neural networks and support vector machines, which vary in complexity of implementation and in scope for business interpretation. At this point, this might sound rather technical, but getting a general idea is what matters here. You will be able to appreciate the underlying concepts more while using these techniques in real projects.

Implementing Analytics Techniques

As discussed in the above section, analytics problems can be solved with the help of a variety of statistical techniques. When it comes to implementing these techniques, there are a lot of options available in terms of analytics tools, such as SAS, R and Python. SAS is more popular among enterprises because of its ease of use, while R and Python are open source tools which have a lot of takers among academia and programmers. On average, almost 80% of the time in any analytics project goes into problem identification, data gathering and pre-processing, while the remaining 20% is used for implementing the chosen techniques and visualizing the final results. In the case of Big Data, the same algorithms can be translated to MapReduce algorithms to run them on Hadoop clusters, which often requires more effort and specialized expertise. In the Hadoop Ecosystem, the Mahout component uses the Java programming language to implement statistical techniques such as classification, recommendation algorithms and others.
Depending on the data volumes gathered for the used car prediction problem, to implement a linear regression technique, SAS or R can be used for smaller datasets, while Hadoop integrated with R or SAS can be used for Big Data. Another alternative for Big Data would be to make use of the Mahout component, which requires Java expertise. A small illustrative sketch of the smaller-dataset route follows below.
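As an illustration of the smaller-dataset route, here is a minimal sketch in Python (pandas and scikit-learn) of pre-processing the data and fitting a linear regression for used car prices. The file name and column names are invented for this example, and a real project would add far more careful cleaning and validation:

# Minimal sketch: predicting used car prices with linear regression.
# Assumes a hypothetical CSV with columns: make, model, age_years,
# mileage_km, doors, cruise_control (0/1), price.
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

cars = pd.read_csv("used_cars.csv")  # hypothetical dataset

# Pre-processing: drop rows with missing prices, encode categorical fields.
cars = cars.dropna(subset=["price"])
features = pd.get_dummies(
    cars[["make", "model", "age_years", "mileage_km", "doors", "cruise_control"]],
    columns=["make", "model"],
)

# Hold out 20% of the rows to check how well the model generalizes.
X_train, X_test, y_train, y_test = train_test_split(
    features, cars["price"], test_size=0.2, random_state=42
)

model = LinearRegression().fit(X_train, y_train)
print("R^2 on held-out data:", model.score(X_test, y_test))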

Visualizing End Results

Data visualization is used to display the output of analytics projects. Typically, this is the last step of any analytics project, where visualization techniques are implemented either to validate the technique's outcomes or to present the end results to a non-technical management team. This can be done with various data visualization software packages such as Tableau and Spotfire, and also with the built-in capabilities of SAS and R. In comparison with SAS, R offers a variety of packages, notably ggplot2 and lattice, for visualizing datasets. After building the linear regression model for used car price prediction, visualization techniques are applied to validate the statistical results and to check whether these results satisfy the technique's assumptions. Some of the standard checks on a linear regression model are tests for heteroskedasticity, autocorrelation and multicollinearity.

[Figure: validation plots for the used car price prediction model, illustrating these checks (not reproduced here).]
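To make the validation step concrete, here is a small sketch of one such visual check in Python (matplotlib), plotting residuals against fitted values to eyeball heteroskedasticity; it assumes the model, X_test and y_test objects from the regression sketch above:

# Residuals vs. fitted values: a fan or funnel shape in this plot
# suggests heteroskedasticity (non-constant error variance).
import matplotlib.pyplot as plt

fitted = model.predict(X_test)   # from the earlier regression sketch
residuals = y_test - fitted

plt.scatter(fitted, residuals, alpha=0.5)
plt.axhline(0, color="red", linestyle="--")
plt.xlabel("Fitted price")
plt.ylabel("Residual")
plt.title("Residuals vs. fitted values (heteroskedasticity check)")
plt.show()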

Different kinds of Analytics

By looking at the used car price prediction example, one can see that, irrespective of the domain, the framework for implementing analytics projects remains the same. However, as different enterprises work on solving different business problems, there are different kinds of analytics with domain-specific applications. Some of the common ones are marketing analytics, customer analytics, risk analytics, fraud analytics, human resource analytics and web analytics, which are classified by business function. Marketing analytics in any enterprise revolves around increasing efficiency and maximizing marketing performance through analyses such as marketing mix optimization, marketing mix modeling, price analysis and promotional analysis, to name a few. Customer analytics, on the other hand, deals with understanding customer behaviour and increasing loyalty, using analyses like customer segmentation, attrition analysis and lifetime value analysis. Another common classification is based on the complexity of the analytics techniques being implemented, and is independent of the domain. These kinds are broadly classified into the basic analytics and advanced analytics categories.

Basic Analytics

Basic analytics techniques are generally used to explore your data and include simple visualizations and simple statistics. Some common examples are:

Slicing and dicing refers to breaking down data into smaller sets that are easier to explore. This is employed as a preliminary step to gain more understanding of the data attributes, of how different techniques can be used, and of how much computational power is required to implement a full-scale analysis.

Anomaly identification is the process of detecting outliers, i.e. events where the actual observation differs from what is expected. This might involve computing summary statistics like mean, median and range values, and sometimes involves visualization techniques such as scatter plots and box plots to identify outliers by visual means. (A small sketch follows at the end of this chapter.)

Advanced Analytics

Advanced analytics involves the application of statistical algorithms for complex analysis of either structured or unstructured data. Among its many use cases, these techniques can be deployed to find patterns in data and for prediction, forecasting and complex event processing. With the growth of Big Data and enterprises' need to stay ahead of the competition, advanced analytics implementations have become a mainstream, integral part of their decision-making processes. Some examples of advanced analytics are:

Text Analytics is the process of analyzing unstructured text, extracting relevant information, and transforming it into structured information on which statistical techniques can be applied. Since much of Big Data comprises unstructured data, text analytics has become one of the mainstream applications within Big Data analytics.

Predictive Modeling consists of statistical or data-mining solutions, including algorithms and techniques, to determine future outcomes. A predictive model is made up of a number of predictors: variable factors that are likely to influence future behaviour or results. In marketing, for example, a customer's gender, age and purchase history might predict the likelihood of a future sale. Other common applications include churn prediction, fraud detection, customer segmentation, marketing spend optimization and many more.
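As a concrete illustration of the anomaly identification idea mentioned above, here is a minimal Python sketch that flags outliers using a simple mean-and-standard-deviation rule; the data values are invented, and real projects often prefer more robust methods (medians or box-plot fences), since extreme values inflate the standard deviation itself:

# Flagging outliers: points more than 2 standard deviations from the mean.
import statistics

daily_sales = [102, 98, 110, 95, 104, 99, 410, 101, 97, 105]  # 410 looks odd

mean = statistics.mean(daily_sales)
stdev = statistics.stdev(daily_sales)

outliers = [x for x in daily_sales if abs(x - mean) > 2 * stdev]
print("mean:", round(mean, 1), "stdev:", round(stdev, 1))
print("outliers:", outliers)  # flags the 410 observation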


More information

Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related

Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related Summary Xiangzhe Li Nowadays, there are more and more data everyday about everything. For instance, here are some of the astonishing

More information

Big Data & QlikView. Democratizing Big Data Analytics. David Freriks Principal Solution Architect

Big Data & QlikView. Democratizing Big Data Analytics. David Freriks Principal Solution Architect Big Data & QlikView Democratizing Big Data Analytics David Freriks Principal Solution Architect TDWI Vancouver Agenda What really is Big Data? How do we separate hype from reality? How does that relate

More information

AGENDA. What is BIG DATA? What is Hadoop? Why Microsoft? The Microsoft BIG DATA story. Our BIG DATA Roadmap. Hadoop PDW

AGENDA. What is BIG DATA? What is Hadoop? Why Microsoft? The Microsoft BIG DATA story. Our BIG DATA Roadmap. Hadoop PDW AGENDA What is BIG DATA? What is Hadoop? Why Microsoft? The Microsoft BIG DATA story Hadoop PDW Our BIG DATA Roadmap BIG DATA? Volume 59% growth in annual WW information 1.2M Zetabytes (10 21 bytes) this

More information

The 4 Pillars of Technosoft s Big Data Practice

The 4 Pillars of Technosoft s Big Data Practice beyond possible Big Use End-user applications Big Analytics Visualisation tools Big Analytical tools Big management systems The 4 Pillars of Technosoft s Big Practice Overview Businesses have long managed

More information

The Next Wave of Data Management. Is Big Data The New Normal?

The Next Wave of Data Management. Is Big Data The New Normal? The Next Wave of Data Management Is Big Data The New Normal? Table of Contents Introduction 3 Separating Reality and Hype 3 Why Are Firms Making IT Investments In Big Data? 4 Trends In Data Management

More information

Cloudera Certified Developer for Apache Hadoop

Cloudera Certified Developer for Apache Hadoop Cloudera CCD-333 Cloudera Certified Developer for Apache Hadoop Version: 5.6 QUESTION NO: 1 Cloudera CCD-333 Exam What is a SequenceFile? A. A SequenceFile contains a binary encoding of an arbitrary number

More information

Introduction to Big data. Why Big data? Case Studies. Introduction to Hadoop. Understanding Features of Hadoop. Hadoop Architecture.

Introduction to Big data. Why Big data? Case Studies. Introduction to Hadoop. Understanding Features of Hadoop. Hadoop Architecture. Big Data Hadoop Administration and Developer Course This course is designed to understand and implement the concepts of Big data and Hadoop. This will cover right from setting up Hadoop environment in

More information

Apache Hadoop: The Big Data Refinery

Apache Hadoop: The Big Data Refinery Architecting the Future of Big Data Whitepaper Apache Hadoop: The Big Data Refinery Introduction Big data has become an extremely popular term, due to the well-documented explosion in the amount of data

More information

Hexaware E-book on Predictive Analytics

Hexaware E-book on Predictive Analytics Hexaware E-book on Predictive Analytics Business Intelligence & Analytics Actionable Intelligence Enabled Published on : Feb 7, 2012 Hexaware E-book on Predictive Analytics What is Data mining? Data mining,

More information

BIG DATA CHALLENGES AND PERSPECTIVES

BIG DATA CHALLENGES AND PERSPECTIVES BIG DATA CHALLENGES AND PERSPECTIVES Meenakshi Sharma 1, Keshav Kishore 2 1 Student of Master of Technology, 2 Head of Department, Department of Computer Science and Engineering, A P Goyal Shimla University,

More information

Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware

Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware Created by Doug Cutting and Mike Carafella in 2005. Cutting named the program after

More information

Databricks. A Primer

Databricks. A Primer Databricks A Primer Who is Databricks? Databricks vision is to empower anyone to easily build and deploy advanced analytics solutions. The company was founded by the team who created Apache Spark, a powerful

More information

BIG DATA CAN DRIVE THE BUSINESS AND IT TO EVOLVE AND ADAPT RALPH KIMBALL BUSSUM 2014

BIG DATA CAN DRIVE THE BUSINESS AND IT TO EVOLVE AND ADAPT RALPH KIMBALL BUSSUM 2014 BIG DATA CAN DRIVE THE BUSINESS AND IT TO EVOLVE AND ADAPT RALPH KIMBALL BUSSUM 2014 Ralph Kimball Associates 2014 The Data Warehouse Mission Identify all possible enterprise data assets Select those assets

More information

Big Data. Fast Forward. Putting data to productive use

Big Data. Fast Forward. Putting data to productive use Big Data Putting data to productive use Fast Forward What is big data, and why should you care? Get familiar with big data terminology, technologies, and techniques. Getting started with big data to realize

More information

Hadoop and Map-Reduce. Swati Gore

Hadoop and Map-Reduce. Swati Gore Hadoop and Map-Reduce Swati Gore Contents Why Hadoop? Hadoop Overview Hadoop Architecture Working Description Fault Tolerance Limitations Why Map-Reduce not MPI Distributed sort Why Hadoop? Existing Data

More information

In-Memory Analytics for Big Data

In-Memory Analytics for Big Data In-Memory Analytics for Big Data Game-changing technology for faster, better insights WHITE PAPER SAS White Paper Table of Contents Introduction: A New Breed of Analytics... 1 SAS In-Memory Overview...

More information

Associate Professor, Department of CSE, Shri Vishnu Engineering College for Women, Andhra Pradesh, India 2

Associate Professor, Department of CSE, Shri Vishnu Engineering College for Women, Andhra Pradesh, India 2 Volume 6, Issue 3, March 2016 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Special Issue

More information

SOLVING REAL AND BIG (DATA) PROBLEMS USING HADOOP. Eva Andreasson Cloudera

SOLVING REAL AND BIG (DATA) PROBLEMS USING HADOOP. Eva Andreasson Cloudera SOLVING REAL AND BIG (DATA) PROBLEMS USING HADOOP Eva Andreasson Cloudera Most FAQ: Super-Quick Overview! The Apache Hadoop Ecosystem a Zoo! Oozie ZooKeeper Hue Impala Solr Hive Pig Mahout HBase MapReduce

More information

Large scale processing using Hadoop. Ján Vaňo

Large scale processing using Hadoop. Ján Vaňo Large scale processing using Hadoop Ján Vaňo What is Hadoop? Software platform that lets one easily write and run applications that process vast amounts of data Includes: MapReduce offline computing engine

More information

Alexander Nikov. 5. Database Systems and Managing Data Resources. Learning Objectives. RR Donnelley Tries to Master Its Data

Alexander Nikov. 5. Database Systems and Managing Data Resources. Learning Objectives. RR Donnelley Tries to Master Its Data INFO 1500 Introduction to IT Fundamentals 5. Database Systems and Managing Data Resources Learning Objectives 1. Describe how the problems of managing data resources in a traditional file environment are

More information

Interactive data analytics drive insights

Interactive data analytics drive insights Big data Interactive data analytics drive insights Daniel Davis/Invodo/S&P. Screen images courtesy of Landmark Software and Services By Armando Acosta and Joey Jablonski The Apache Hadoop Big data has

More information

Sentimental Analysis using Hadoop Phase 2: Week 2

Sentimental Analysis using Hadoop Phase 2: Week 2 Sentimental Analysis using Hadoop Phase 2: Week 2 MARKET / INDUSTRY, FUTURE SCOPE BY ANKUR UPRIT The key value type basically, uses a hash table in which there exists a unique key and a pointer to a particular

More information

TAMING THE BIG CHALLENGE OF BIG DATA MICROSOFT HADOOP

TAMING THE BIG CHALLENGE OF BIG DATA MICROSOFT HADOOP Pythian White Paper TAMING THE BIG CHALLENGE OF BIG DATA MICROSOFT HADOOP ABSTRACT As companies increasingly rely on big data to steer decisions, they also find themselves looking for ways to simplify

More information

Manifest for Big Data Pig, Hive & Jaql

Manifest for Big Data Pig, Hive & Jaql Manifest for Big Data Pig, Hive & Jaql Ajay Chotrani, Priyanka Punjabi, Prachi Ratnani, Rupali Hande Final Year Student, Dept. of Computer Engineering, V.E.S.I.T, Mumbai, India Faculty, Computer Engineering,

More information

Hadoop Beyond Hype: Complex Adaptive Systems Conference Nov 16, 2012. Viswa Sharma Solutions Architect Tata Consultancy Services

Hadoop Beyond Hype: Complex Adaptive Systems Conference Nov 16, 2012. Viswa Sharma Solutions Architect Tata Consultancy Services Hadoop Beyond Hype: Complex Adaptive Systems Conference Nov 16, 2012 Viswa Sharma Solutions Architect Tata Consultancy Services 1 Agenda What is Hadoop Why Hadoop? The Net Generation is here Sizing the

More information

Big Data and Apache Hadoop Adoption:

Big Data and Apache Hadoop Adoption: Expert Reference Series of White Papers Big Data and Apache Hadoop Adoption: Key Challenges and Rewards 1-800-COURSES www.globalknowledge.com Big Data and Apache Hadoop Adoption: Key Challenges and Rewards

More information

Chapter 7. Using Hadoop Cluster and MapReduce

Chapter 7. Using Hadoop Cluster and MapReduce Chapter 7 Using Hadoop Cluster and MapReduce Modeling and Prototyping of RMS for QoS Oriented Grid Page 152 7. Using Hadoop Cluster and MapReduce for Big Data Problems The size of the databases used in

More information

QLIKVIEW DEPLOYMENT FOR BIG DATA ANALYTICS AT KING.COM

QLIKVIEW DEPLOYMENT FOR BIG DATA ANALYTICS AT KING.COM QLIKVIEW DEPLOYMENT FOR BIG DATA ANALYTICS AT KING.COM QlikView Technical Case Study Series Big Data June 2012 qlikview.com Introduction This QlikView technical case study focuses on the QlikView deployment

More information

Cloudera Enterprise Data Hub in Telecom:

Cloudera Enterprise Data Hub in Telecom: Cloudera Enterprise Data Hub in Telecom: Three Customer Case Studies Version: 103 Table of Contents Introduction 3 Cloudera Enterprise Data Hub for Telcos 4 Cloudera Enterprise Data Hub in Telecom: Customer

More information

Hadoop implementation of MapReduce computational model. Ján Vaňo

Hadoop implementation of MapReduce computational model. Ján Vaňo Hadoop implementation of MapReduce computational model Ján Vaňo What is MapReduce? A computational model published in a paper by Google in 2004 Based on distributed computation Complements Google s distributed

More information

Keywords Big Data; OODBMS; RDBMS; hadoop; EDM; learning analytics, data abundance.

Keywords Big Data; OODBMS; RDBMS; hadoop; EDM; learning analytics, data abundance. Volume 4, Issue 11, November 2014 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Analytics

More information

Analytics in the Cloud. Peter Sirota, GM Elastic MapReduce

Analytics in the Cloud. Peter Sirota, GM Elastic MapReduce Analytics in the Cloud Peter Sirota, GM Elastic MapReduce Data-Driven Decision Making Data is the new raw material for any business on par with capital, people, and labor. What is Big Data? Terabytes of

More information

The Future of Data Management with Hadoop and the Enterprise Data Hub

The Future of Data Management with Hadoop and the Enterprise Data Hub The Future of Data Management with Hadoop and the Enterprise Data Hub Amr Awadallah Cofounder & CTO, Cloudera, Inc. Twitter: @awadallah 1 2 Cloudera Snapshot Founded 2008, by former employees of Employees

More information

BIG DATA-AS-A-SERVICE

BIG DATA-AS-A-SERVICE White Paper BIG DATA-AS-A-SERVICE What Big Data is about What service providers can do with Big Data What EMC can do to help EMC Solutions Group Abstract This white paper looks at what service providers

More information

White Paper: Evaluating Big Data Analytical Capabilities For Government Use

White Paper: Evaluating Big Data Analytical Capabilities For Government Use CTOlabs.com White Paper: Evaluating Big Data Analytical Capabilities For Government Use March 2012 A White Paper providing context and guidance you can use Inside: The Big Data Tool Landscape Big Data

More information

Chapter 6 8/12/2015. Foundations of Business Intelligence: Databases and Information Management. Problem:

Chapter 6 8/12/2015. Foundations of Business Intelligence: Databases and Information Management. Problem: Foundations of Business Intelligence: Databases and Information Management VIDEO CASES Chapter 6 Case 1a: City of Dubuque Uses Cloud Computing and Sensors to Build a Smarter, Sustainable City Case 1b:

More information

Hortonworks & SAS. Analytics everywhere. Page 1. Hortonworks Inc. 2011 2014. All Rights Reserved

Hortonworks & SAS. Analytics everywhere. Page 1. Hortonworks Inc. 2011 2014. All Rights Reserved Hortonworks & SAS Analytics everywhere. Page 1 A change in focus. A shift in Advertising From mass branding A shift in Financial Services From Educated Investing A shift in Healthcare From mass treatment

More information

Databricks. A Primer

Databricks. A Primer Databricks A Primer Who is Databricks? Databricks was founded by the team behind Apache Spark, the most active open source project in the big data ecosystem today. Our mission at Databricks is to dramatically

More information

Tap into Hadoop and Other No SQL Sources

Tap into Hadoop and Other No SQL Sources Tap into Hadoop and Other No SQL Sources Presented by: Trishla Maru What is Big Data really? The Three Vs of Big Data According to Gartner Volume Volume Orders of magnitude bigger than conventional data

More information

Hadoop Evolution In Organizations. Mark Vervuurt Cluster Data Science & Analytics

Hadoop Evolution In Organizations. Mark Vervuurt Cluster Data Science & Analytics In Organizations Mark Vervuurt Cluster Data Science & Analytics AGENDA 1. Yellow Elephant 2. Data Ingestion & Complex Event Processing 3. SQL on Hadoop 4. NoSQL 5. InMemory 6. Data Science & Machine Learning

More information

Big Data Zurich, November 23. September 2011

Big Data Zurich, November 23. September 2011 Institute of Technology Management Big Data Projektskizze «Competence Center Automotive Intelligence» Zurich, November 11th 23. September 2011 Felix Wortmann Assistant Professor Technology Management,

More information

BIG DATA SERIES: HADOOP DEVELOPER TRAINING PROGRAM. An Overview

BIG DATA SERIES: HADOOP DEVELOPER TRAINING PROGRAM. An Overview BIG DATA SERIES: HADOOP DEVELOPER TRAINING PROGRAM An Overview Contents Contents... 1 BIG DATA SERIES: HADOOP DEVELOPER TRAINING PROGRAM... 1 Program Overview... 4 Curriculum... 5 Module 1: Big Data: Hadoop

More information

Hadoop Big Data for Processing Data and Performing Workload

Hadoop Big Data for Processing Data and Performing Workload Hadoop Big Data for Processing Data and Performing Workload Girish T B 1, Shadik Mohammed Ghouse 2, Dr. B. R. Prasad Babu 3 1 M Tech Student, 2 Assosiate professor, 3 Professor & Head (PG), of Computer

More information

WHITE PAPER ON. Operational Analytics. HTC Global Services Inc. Do not copy or distribute. www.htcinc.com

WHITE PAPER ON. Operational Analytics. HTC Global Services Inc. Do not copy or distribute. www.htcinc.com WHITE PAPER ON Operational Analytics www.htcinc.com Contents Introduction... 2 Industry 4.0 Standard... 3 Data Streams... 3 Big Data Age... 4 Analytics... 5 Operational Analytics... 6 IT Operations Analytics...

More information

Oracle Big Data SQL Technical Update

Oracle Big Data SQL Technical Update Oracle Big Data SQL Technical Update Jean-Pierre Dijcks Oracle Redwood City, CA, USA Keywords: Big Data, Hadoop, NoSQL Databases, Relational Databases, SQL, Security, Performance Introduction This technical

More information

International Journal of Advanced Engineering Research and Applications (IJAERA) ISSN: 2454-2377 Vol. 1, Issue 6, October 2015. Big Data and Hadoop

International Journal of Advanced Engineering Research and Applications (IJAERA) ISSN: 2454-2377 Vol. 1, Issue 6, October 2015. Big Data and Hadoop ISSN: 2454-2377, October 2015 Big Data and Hadoop Simmi Bagga 1 Satinder Kaur 2 1 Assistant Professor, Sant Hira Dass Kanya MahaVidyalaya, Kala Sanghian, Distt Kpt. INDIA E-mail: simmibagga12@gmail.com

More information

A Tour of the Zoo the Hadoop Ecosystem Prafulla Wani

A Tour of the Zoo the Hadoop Ecosystem Prafulla Wani A Tour of the Zoo the Hadoop Ecosystem Prafulla Wani Technical Architect - Big Data Syntel Agenda Welcome to the Zoo! Evolution Timeline Traditional BI/DW Architecture Where Hadoop Fits In 2 Welcome to

More information

A financial software company

A financial software company A financial software company Projecting USD10 million revenue lift with the IBM Netezza data warehouse appliance Overview The need A financial software company sought to analyze customer engagements to

More information

Chapter 6. Foundations of Business Intelligence: Databases and Information Management

Chapter 6. Foundations of Business Intelligence: Databases and Information Management Chapter 6 Foundations of Business Intelligence: Databases and Information Management VIDEO CASES Case 1a: City of Dubuque Uses Cloud Computing and Sensors to Build a Smarter, Sustainable City Case 1b:

More information

Sunnie Chung. Cleveland State University

Sunnie Chung. Cleveland State University Sunnie Chung Cleveland State University Data Scientist Big Data Processing Data Mining 2 INTERSECT of Computer Scientists and Statisticians with Knowledge of Data Mining AND Big data Processing Skills:

More information

Architecting for Big Data Analytics and Beyond: A New Framework for Business Intelligence and Data Warehousing

Architecting for Big Data Analytics and Beyond: A New Framework for Business Intelligence and Data Warehousing Architecting for Big Data Analytics and Beyond: A New Framework for Business Intelligence and Data Warehousing Wayne W. Eckerson Director of Research, TechTarget Founder, BI Leadership Forum Business Analytics

More information

Application Development. A Paradigm Shift

Application Development. A Paradigm Shift Application Development for the Cloud: A Paradigm Shift Ramesh Rangachar Intelsat t 2012 by Intelsat. t Published by The Aerospace Corporation with permission. New 2007 Template - 1 Motivation for the

More information

Qsoft Inc www.qsoft-inc.com

Qsoft Inc www.qsoft-inc.com Big Data & Hadoop Qsoft Inc www.qsoft-inc.com Course Topics 1 2 3 4 5 6 Week 1: Introduction to Big Data, Hadoop Architecture and HDFS Week 2: Setting up Hadoop Cluster Week 3: MapReduce Part 1 Week 4:

More information