Exploring the Concept of Big Data

Exploring the Concept of Big Data
Technologies, Solutions and Best Practices

Ilkka Ritola

Bachelor's thesis of the Degree Programme in Business Information Technology
Bachelor of Business Administration

TORNIO 2014

ABSTRACT

LAPLAND UNIVERSITY OF APPLIED SCIENCES, Business and Culture

Degree programme: Bachelor of Business Administration
Writer(s): Ritola, Ilkka
Thesis title: Exploring the Concept of Big Data - Technologies, Solutions and Best Practices
Pages (of which appendices): 58 (0)
Date: May 28, 2014
Thesis instructor(s): Vladimir Ryabov

This research work aims to explore the concept of Big Data in detail. More specifically, the objective is to define and clarify the concept by examining the exact meaning of Big Data and understanding its underlying technologies. Moreover, this research discusses Big Data analytics solutions offered in the market today, along with best practices for implementing them at a strategic level.

Big Data is one of the most talked about topics of recent years in both business and information technology media. The importance of Big Data cannot be overstated and the velocity of data creation continues to skyrocket. Many have taken full advantage of the new possibilities provided by recent developments in the field, but many more are lagging behind or just starting to consider the possibilities. Therefore, this research work attempts to provide a good starting point for anyone interested in Big Data.

The main research approach employed in this research work is exploratory research using the most relevant and up-to-date literature as well as research works. This research work is theoretical in nature but explores the subject of Big Data in its applicable context.

Big Data is more than just data that is available in substantial amounts. It encompasses a variety of different data types and data sources. The amount of data varies greatly and so does the time sensitivity of the data. Moreover, the aim of the organization using the data changes the Big Data problem drastically. Therefore, understanding both the technological foundation of the various tools and the business aims of the Big Data project becomes crucial. Thus, industry-leading solutions as well as best practices in implementing Big Data projects are also discussed in this research work.

Keywords: Big Data, Big Data Analytics

CONTENTS

ABSTRACT
FIGURES
TABLES
1 INTRODUCTION
Background and motivation
Objectives
Structure
Research scope, questions and methodology
Research scope
Research questions
Methodology
Limitations
Big Data concept
Big Data definition
Big Data characteristics
Big Data applications and limitations
Big Data technological challenges
Big Data technologies
Massively Parallel Processing (MPP)
MapReduce
NoSQL databases
Column oriented database
Key/value database
Document oriented database
In-memory data management
Big Data and cloud computing
BIG DATA SOFTWARE SOLUTIONS
Apache Hadoop
Hadoop Distributed File System (HDFS)
Hadoop YARN and Hadoop MapReduce

Apache Hbase
Apache Hive
Cloudera Enterprise
CDH
Cloudera Manager and Cloudera Navigator
MongoDB
Oracle Big Data Appliance
Oracle Big Data Appliance
IBM Big Data platform
IBM InfoSphere Data Explorer
IBM InfoSphere BigInsights
IBM InfoSphere Streams
IBM PureData System for Analytics powered by Netezza
Analytic accelerators
Information integration and governance
IMPLEMENTING A BIG DATA PLATFORM
Business alignment
Big Data workforce
Organizational considerations
Integration of Big Data with traditional data warehouse and BI systems
Conclusions
REFERENCES

FIGURES

Figure 1. Four Vs of Big Data (Walker 2012)
Figure 2. Massively Parallel Processing Data Storage System (Franks 2013, 94)
Figure 3. MapReduce Data Flow
Figure 4. Data Access in Row Storage and Column Storage (Plattner 2009, 2)
Figure 5. A typical Hadoop configuration (Schmarzo 2013, 178)
Figure 6. HDFS Architecture (The Apache Software Foundation 2013)
Figure 7. MapReduce, YARN and HDFS (Hortonworks 2014)
Figure 8. Cloudera's Enterprise Data Hub (Cloudera 2014, a)
Figure 9. CDH Components (Cloudera 2014, b)
Figure 10. Oracle Big Data solutions (Oracle 2013, 8)
Figure 11. The Oracle Big Data Appliance (Oracle 2013, 9)
Figure 12. Services offered by the IBM Big Data platform (Zikopoulos et al. 2013, 56)
Figure 14. Integrated information platforms (Devlin 2012, 7)

TABLES

Table 1. Data Measures (Franks 2012, 89)
Table 2. Important Characteristics of SQL and NoSQL Databases (Hurwitz et al. 2013, 55)
Table 3. IBM BigInsights Hadoop components (Zikopoulos et al. 2013, 91)

1 INTRODUCTION

This chapter discusses the background of the thesis topic and motivates its importance. Furthermore, the objectives of the research work are described in detail and the structure of the thesis is outlined.

1.1 Background and motivation

Business Intelligence (hereinafter BI) has been a significant part of the strategic approach of a vast majority of companies for a long time. Understanding data has become critically important in the business environment. The amount of data created is increasing due to social media, mobile devices, cheap sensors and advances in computing technology, leading to a situation where the requirements for data management and analytics are changing beyond what traditional BI solutions are capable of.

According to Gantz and Reinsel, the digital data created, replicated and consumed in the world is doubling every two years. In addition, Gantz and Reinsel have estimated that in the year 2020 alone there will be approximately exabytes of digital data that will be created, replicated and consumed. This data comes from sources such as social media, mobile devices, music and movies, digital pictures, transactions, and science projects. All this data provides a plethora of analytical benefits. Yet, only a fraction of this data has been explored for its analytical value. By 2020 approximately 33% of all this data will be valuable, but only if it is tagged and analyzed. (Gantz & Reinsel 2012, 1-9.) This data is often referred to as Big Data.

During recent years, Big Data has become an area of intense interest for businesses, scientists and governments. Harvard Business Review, for example, called data scientist "the sexiest job of the 21st century" (Davenport & Patil 2012). Google Trends indicates that interest in the search term Big Data has increased sharply since 2011 (Google 2014).
Even though this indicates that Big Data surged into the popular consciousness only recently, it has existed in one form or another far longer than that.

According to Gartner, "Big Data is high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making." (Gartner 2014.) At its most fundamental level, Big Data is the challenge organizations need to solve in order to access the analytical value provided by the increasing volumes of data that are constantly being created. Organizations have always tried to use data for competitive advantage, but recent developments in information technology have brought both unprecedented challenges and unprecedented opportunities to the data analytics field.

Minelli, Michael and Dhiraj divide the reasons for the rise of Big Data analytics into three main parts: the advances in computing technology, the increased amount of data available, and the convergence of traditional data management and analytics software with open-source software and commodity hardware. (Minelli & Michael & Dhiraj 2013, 1-2.)

There have been many data-driven success stories during the last decades, including high-profile companies such as Amazon, Netflix, Google and Facebook. These companies have fully embraced Big Data. Google and Facebook have also played a part in developing the necessary tools to store and analyze large volumes of data. Moreover, many of the early technologies developed by these companies were created with open-source principles in mind and have since been adopted by numerous developers such as The Apache Software Foundation and software vendors such as Cloudera, IBM, and Oracle.

It is important to point out that while many of the technological developments have been created by these big businesses, Big Data is not beneficial only for social media companies and retailers. It can be utilized in a wide variety of fields, from education to science projects to business purposes. Interest in Big Data continues to rise and it is clear that organizations are considering implementing Big Data analytics programs (IDG Enterprise 2014, 4-10).

Yet, Big Data is often misunderstood and underutilized. For many organizations Big Data is still a confusing concept that holds vague promises and ambiguous opportunities. Therefore, this research attempts to clarify the Big Data concept. To fully understand the Big Data concept it is critical to understand its main characteristics, challenges, and underlying technologies. Therefore, emphasis should be put on exploring the Big Data concept holistically in order to make informed decisions regarding the implementation of Big Data management platforms.

1.2 Objectives

The main objectives of this research are to explore the concept of Big Data, its benefits and limitations, and the implementation of Big Data analytics projects. To achieve this, it is critical to understand what Big Data is. This research does not attempt to propose a final definition, for that would be an impractical endeavor with a concept as dynamic as Big Data. It does, however, explore the concept and its underlying technologies in depth in an attempt to clarify the concept in the minds of business leaders and academics alike.

Furthermore, this research conducts an up-to-date examination of the most relevant Big Data analytics software solutions offered by current industry leaders such as IBM and Oracle. This is accompanied by a discussion of the implementation process of Big Data projects, including individual components, human resources and best practices of such endeavors.

Ultimately, this research paper is not designed to be a definitive guide to making concrete decisions concerning Big Data. The aim is to provide a starting point for anyone interested in exploring the benefits and limitations of Big Data.

1.3 Structure

This thesis is divided into seven chapters. Research scope, methodology, research questions and limitations are discussed in the second chapter. The third chapter presents a discussion regarding the Big Data concept. The fourth chapter discusses the underlying technologies of Big Data analytics. The most important Big Data software solutions are explored in the fifth chapter. The sixth chapter explores the implementation process and best practices of Big Data projects. The seventh chapter concludes the research by discussing the research work and by providing suggestions for further research.

2 RESEARCH SCOPE, QUESTIONS AND METHODOLOGY

The scope of this research is outlined in this chapter. The research questions are outlined along with the methodology that was used in conducting this research.

2.1 Research scope

This research examines the phenomenon known as Big Data. An exploration of the real-world value and benefits of Big Data is conducted to separate the facts from the marketing hype that surrounds Big Data. Benefits and limitations of Big Data are discussed. The research also explores the Big Data concept by looking at existing literature and current definitions provided by industry leaders such as IBM and SAS in an attempt to clarify the meaning of the concept. Even though this research does not introduce any new terminology to the current vernacular, a careful look is taken at the current definitions to understand this ever-evolving subject in depth.

In addition, the research explores the technological developments that have made Big Data analytics possible. Specifically, the research focuses on the technologies that are essential for the feasible Big Data analytics programs that organizations are undertaking. Furthermore, an analysis of the leading Big Data software solutions is conducted. A critical comparison of the different services offered is formulated. Emphasis is put on understanding the approach each software vendor is taking with its Big Data solutions.

2.2 Research questions

The following research questions are defined in order to achieve the objectives of the research work.

1. What is Big Data and why is it critical to understand Big Data benefits and limitations? What are the underlying technologies of Big Data?

Big Data is discussed extensively in the media and industry literature, but the exact meaning of Big Data remains unclear. Numerous different, even contradictory, definitions are used by the media and Big Data companies. In the interest of further, cohesive research and of understanding the benefits of Big Data, it is important that the Big Data concept is understood clearly. This can be achieved by exploring the characteristics and underlying technologies that are associated with Big Data. The focus is put on exploring the most important technologies of Big Data analytics.

2. What are the features of the leading Big Data applications offered in the market today? What are their main differences and similarities?

This research question examines the different Big Data analytics solutions provided by the current industry leaders. A comparison of these Big Data applications is conducted to better understand the alignment of a data-driven approach and business value.

3. How are Big Data management platforms implemented in organizations? What are the critical components of successfully implemented Big Data management programs?

This research question explores the importance of Big Data management programs as a critical component of an organization's decision-making process. Relevant literature and industry best practices are utilized in understanding the process of implementing a Big Data management project.

2.3 Methodology

This research draws on exploratory research because Big Data is a relatively new topic. A qualitative approach is utilized, supplemented by numerical data. The main research technique of this research work is literature review, and only the most relevant and up-to-date works are used because of the high velocity at which the field is evolving.

This research is exploratory in nature because the concept of Big Data is relatively new. In addition, the Big Data analytics field is evolving at a high velocity, which has made understanding Big Data challenging. As pointed out by McDaniel Jr. and Gates, exploratory research is used to obtain greater understanding of a concept or to help crystallize the definition of a problem. It is also used to identify important variables to be studied. Exploratory research is preliminary research, not the definitive research used to determine a course of action. (McDaniel Jr. & Gates 2010, 43.) This research attempts to clarify the concept of Big Data by exploring the relevant literature; therefore, an exploratory research approach was chosen.

In addition to academic literature, definitions and information from current industry leaders are utilized. Because this research explores the importance of Big Data analytics through literature, it deals mainly with qualitative data. According to Smith and Albaum, using secondary data sources can help to clarify a problem and is considerably more cost and time efficient than collecting primary data. (Smith & Albaum 2005, ) Therefore, numerical data from secondary and tertiary sources are also used to further explore the benefits, limitations and weaknesses of Big Data. The criterion for choosing the literature and materials was to utilize only the most relevant and up-to-date literature from reputable authors and publishers.

2.4 Limitations

This research has two main limitations. Firstly, the concept of Big Data is evolving at a high velocity, and the information provided by this research remains relevant only until new developments are made in the field. Secondly, this research utilizes information and white papers provided by companies that sell Big Data solutions. While these can be excellent sources of information, it is important to recognize the promotional aspect of white papers and research works sponsored by software vendors.

3 BIG DATA CONCEPT

The concept of Big Data is explored in this chapter by looking at existing definitions and common characteristics that are used to describe Big Data. Furthermore, various use cases for Big Data are examined to further increase the clarity of the concept.

3.1 Big Data definition

Big Data as a concept has been in general use for just a few years. Nevertheless, Big Data has become a buzzword during this short time period, yet the exact meaning of the concept remains unclear and the term is often used synonymously with other related concepts such as BI and data mining. Furthermore, various stakeholders have created new definitions or additions to existing definitions that best suit their interests. Nevertheless, to capture the essence of Big Data, consistent themes can be found by examining various definitions provided by the industry leaders and relevant literature.

IBM offers a very direct approach in describing Big Data, stating that "Every day, we create 2.5 quintillion bytes of data - so much that 90% of the data in the world today has been created in the last two years alone. This data comes from everywhere: sensors used to gather climate information, posts to social media sites, digital pictures and videos, purchase transaction records, and cell phone GPS signals to name a few. This data is big data." (IBM 2014, original emphasis.) IBM makes it clear that Big Data involves notable amounts of data that come from a wide variety of sources. Additionally, IBM emphasizes the increasing speed of data generation.

SAP, on its website, offers a more promotion-oriented view of Big Data: "Big Data is an opportunity to change how you work, play, and live by tracking new signals within digital noise. From major league sports and personalized shopping to carbon reduction and cancer treatment, organizations are using Big Data to re-imagine achieving what is possible." (SAP 2014.) Clearly, SAP is focusing on the benefits of Big Data rather than trying to provide a clear-cut definition of the concept. Moreover, SAP is emphasizing the fact that Big Data can provide value in a wide variety of fields.

McKinsey approaches the Big Data concept from a different point of view: "Big data refers to datasets whose size is beyond the ability of typical database software tools to capture, store, manage, and analyze. This definition is intentionally subjective and incorporates a moving definition of how big a dataset needs to be in order to be considered big data - i.e., we don't define big data in terms of being larger than a certain number of terabytes (thousands of gigabytes). We assume that, as technology advances over time, the size of datasets that qualify as big data will also increase. Also note that the definition can vary by sector, depending on what kinds of software tools are commonly available and what sizes of datasets are common in a particular industry. With those caveats, big data in many sectors today will range from a few dozen terabytes to multiple petabytes (thousands of terabytes)." (McKinsey 2011, 1. Original emphasis.)

Big Data does not refer to a specific type of data, but rather to the challenge created by the substantial amounts of data available to organizations. It is apparent that there is a lot of variation in the definitions. Thus, rather than delving into the dozens of formal definitions, Big Data can perhaps be best understood by examining the characteristics associated with it.

One of the earliest and most widely cited descriptions of the characteristics of Big Data is included in a 2001 META Group report by Doug Laney (Laney 2001). Even though the actual term Big Data is not explicitly mentioned in the report, the characteristics mentioned in it, or the Three Vs as they are often called, have become the standard way of describing Big Data. These are Volume, Velocity and Variety. While more Vs and other characteristics have been added to describe Big Data by various entities, the three original Vs remain the widely accepted basis for describing Big Data.

SAS, for example, has added two additional dimensions to complement the original three Vs. These are variability and complexity (SAS 2014). IBM includes a fourth V: veracity. Furthermore, a fifth V, value, is commonly associated with Big Data. In other words, there is no consensus on the exact characteristics of Big Data. Nevertheless, there are multiple characteristics which most of the software vendors in the field agree upon. Figure 1, adopted from Walker (2012), depicts four of these Vs.

Figure 1. Four Vs of Big Data (Walker 2012)

Volume refers to the sheer size of the data and the challenges it brings with it. Velocity refers to the challenge created by data that requires a very fast response. Another challenge, variety, is the varying formats and structures in which data is stored. Moreover, to truly take advantage of Big Data it is important to manage the integrity of the data. This is becoming an increasingly challenging and increasingly important part of Big Data analytics.

3.2 Big Data characteristics

The volume dimension is perhaps the biggest differentiating factor of Big Data analytics compared to traditional data analytics. As its name implies, Big Data involves considerable amounts of data. However, an important question is what exactly big means in this context. A recent survey by Intel indicates that large companies generate approximately 200 TB of data each week (Intel 2012, 24). Such substantial volumes of data make it challenging, if not impossible, to derive meaningful and valuable knowledge in a timely manner using traditional BI tools. Nonetheless, what is considered big today will seem positively diminutive in the future because the amount of data generated is growing so quickly. Therefore, it is important to keep in mind that big is always a relative expression. Table 1, adopted from Franks, illustrates the scale of data amounts involved in Big Data analytics (Franks 2012, 89).

Table 1. Data Measures (Franks 2012, 89)

Table 1 lists the most commonly used data measures and puts the data volumes into perspective by using concrete examples. While not a precise measure, it illustrates the sheer volume of data that is created and managed by organizations in today's world.

Traditionally, much smaller data sets known as samples were used for data analysis. The main reason for this has been data storage and processing constraints. Even though this method is effective in gaining predictive insight, the results are suboptimal. In today's Big Data world these constraints are becoming less relevant. By taking advantage of the ability to analyze the entire data set, organizations can discover subtle patterns which remain undetected when analyzing merely a small sample of the whole. The results will also be more reliable and accurate, and by having data from extensive periods of time it is possible to form complex predictive models that take into account all the inter-related components, which would not be possible otherwise. (Minelli et al. 2013, 37.)

Franks points out that because the hardware limitations on how much data can be stored have largely been removed, it has become possible to store all the data in order to utilize it for analytics. Nevertheless, much of the data that is being stored holds little or no value. Therefore, the key point to make about Big Data is that it is not the stored data itself that is valuable but rather what is done with the data and what decisions are made based on this data-driven approach. (Franks 2012, 6-8.)

Big Data is not merely big. The main difference between Big Data analytics and BI is the type of data that is being analyzed. Traditional BI tools are utilized for analysis of structured data but cannot effectively extract information from semi-structured and unstructured data. Big Data solutions are specifically designed to store and process semi-structured and unstructured data. While there are also hybrid solutions that can handle all types of data, it is unlikely that they will replace traditional BI tools in the near future. (Simon 2013, )

Structured data resides firmly in a relational row-column database and is designed to be clean, effective and easy to analyze. This includes set data types and predefined rules on where the data is stored. Unstructured data, on the contrary, does not have a predefined data model, nor does it fit into a relational database. In addition, where structured data tends to grow in a largely linear fashion, unstructured data grows exponentially. While usually text heavy, unstructured data contains numbers, dates and facts as well. Semi-structured data is a term used to describe data that has structure and the ability to enforce hierarchies within the data but does not fit well into formal data models. (Minelli et al. 2013, 11.)

According to Simon, the internal data generated by organizations has stayed at roughly the same level for the past decade. This internal data is usually highly structured and has been used by BI solutions for a long time. Most of the external data, which is growing much faster than the internal data, is unstructured and comes from multiple different sources. (Simon 2013, 39.)
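The difference between the three data types can be made concrete with a small sketch. The records below are invented purely for illustration and do not come from any of the cited sources. The sketch shows that structured data can be read straight into fixed columns, that semi-structured data carries its own nested structure, and that facts in unstructured text have to be extracted before they can be analyzed.

```python
import csv
import io
import json
import re

# Structured: fixed columns and types, fits a relational table directly.
structured_csv = "customer_id,purchase_date,amount\n1001,2014-05-28,49.90\n"
rows = list(csv.DictReader(io.StringIO(structured_csv)))

# Semi-structured: self-describing and hierarchical, but fields may vary
# from record to record, so it does not map cleanly onto fixed columns.
semi_structured = json.loads(
    '{"customer_id": 1001, "devices": [{"type": "phone", "os": "android"}]}'
)

# Unstructured: free text with no predefined data model; any facts in it
# (here, a date) have to be extracted, for example with a pattern match.
unstructured = "Ordered on 2014-05-28 and the delivery was late again."
dates_found = re.findall(r"\d{4}-\d{2}-\d{2}", unstructured)

print(rows[0]["amount"], semi_structured["devices"][0]["os"], dates_found)
```

In practice the extraction step for unstructured data is far more involved than a single pattern match, which is one reason traditional BI tools struggle with it.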

Velocity refers to the time sensitivity of Big Data. This means that to maximize the value of Big Data, it must often be analyzed inside a very narrow time window or, in some cases, in real time as the data keeps streaming in. According to Hurwitz, Nugent, Halper and Kaufman, the real-time aspect of data is most relevant when the aim of the analytics process is time sensitive and business critical. Examples of such instances could be monitoring social media streams, or giving customers coupons at the point of sale depending on what they have bought. (Hurwitz & Nugent & Halper & Kaufman 2013, 32.)

Considering the sheer size, velocity and variety of Big Data, there is a certain amount of uncertainty inherently involved in analyzing it. It is impractical, if not impossible, to clean up the data for precise analysis. This problem is often referred to as data veracity and is a central problem to be solved when utilizing Big Data. The uncertainty and imprecision vary depending on the situation, but must be factored in when performing analysis. (Walker 2012.)

Regardless of how many different Vs one counts among the critical characteristics that describe Big Data, the fact is that Big Data comes in many different shapes and forms. Big Data is a concept that is used to describe data that is challenging to manage using traditional tools. Furthermore, not all Big Data has all of the defining characteristics and not all Big Data exhibits the characteristics to an equal degree. A relatively small data set can be considered Big Data if, for example, its variety makes it complicated to work with. Mike Olson, co-founder of Cloudera, a pioneering Big Data company, sums up his personal view of Big Data, after giving the standard answer of the three Vs, thusly: "If there is data you want to work with and it doesn't fit where you want to put it, it is Big Data." (Olson & Li 2013.)

3.3 Big Data applications and limitations

Another way of understanding Big Data is by looking at the benefits Big Data analytics can bring about when utilized in various fields. Big Data can provide great benefits in almost every field. Data Science Central lists ten practical examples as follows: dialoguing with consumers, re-developing products, conducting risk analysis, keeping data safe, creating new revenue streams, real-time customization of websites, reducing maintenance costs, organizing tailored healthcare, providing enterprise-wide insights, and making cities smarter. (Data Science Central 2012.) Instead of going into detail on these high-level examples, a more illuminating approach is to explore actual real-life examples of Big Data being utilized to its full potential.

A relevant example is Netflix, the video streaming giant, which controls the whole customer experience, from the moment customers sign up all the way to the actual experience of watching a movie or a TV series. This allows Netflix to collect large amounts of data about its users, including viewing preferences such as preferred genres. This helps Netflix to understand its customers in ways that were never before possible. Essentially, a clever use of Big Data has allowed Netflix to identify opportunities for new TV shows that are likely to have a strong fan base. (Sweney 2014.) Furthermore, by utilizing Big Data and data visualization technologies, Netflix is able to analyze the covers of its shows and create covers which are likely to appeal to the targeted customer group. (Simon 2014.)

Big Data analytics is not beneficial for big businesses only. Data analytics was also utilized in the 2012 United States presidential election. An MIT Technology Review article from 2012 explains how Big Data was utilized during the campaign: "The Obama 2012 campaign used data analytics and the experimental method to assemble a winning coalition vote by vote. In doing so, it overturned the long dominance of TV advertising in U.S. politics and created something new in the world: a national campaign run like a local ward election, where the interests of individual voters were known and addressed." (Issenberg 2012.) Furthermore, Obama's data scientists were able to identify potential voters and ways to get their attention, as well as to raise funds through a metric-driven e-campaign in which optimal subject lines, senders and messages were identified (Scherer 2012).

Nevertheless, it is important to maintain a critical approach when discussing Big Data. While Big Data can be a powerful source of knowledge, it also has its limitations. Simon points out that crowdsourcing everything is not always the appropriate approach to decision making. The vision of an individual might prove to be the correct course to take in certain situations. Moreover, Simon emphasizes the need for human judgment when using Big Data for decision making because ultimately it is a human who makes the important decisions. (Simon 2013, )

It would be impossible to discuss Big Data without mentioning privacy concerns. Ever since Glenn Greenwald reported on the widespread data collection practices of the National Security Agency in 2013 (Greenwald 2013), Big Data privacy concerns have been discussed extensively in the media. These revelations might have brought the privacy conversation into the limelight, but these worries have been relevant for a much longer time. Simon points out that companies such as Amazon, Apple, Facebook and Google have long been scrutinized for privacy issues, and many people have a hard time trusting these companies with their personal data, some avoiding them altogether. (Simon 2013, ) Perhaps the most significant issue is that while data collection, storage and analytics technologies have been evolving at a high pace, laws and social norms have not dealt with such issues before.

3.4 Big Data technological challenges

Big Data introduces many technological challenges to organizations looking to leverage its benefits. These include storage challenges and data processing challenges. Each of the Big Data Vs (volume, variety, velocity, and veracity) poses different challenges that require new technological approaches. It is likely that a large organization has to overcome one or more of these challenges in order to realize the most important V: value.

The high volumes of data introduce a storage challenge. Because the amounts of data being generated by organizations are growing rapidly, a centralized database system is an impractical data storage solution. Ohlhorst suggests that a Big Data storage system should be highly scalable in order to meet the changing requirements for data capacity. These storage systems need to be able to increase their capacity and processing power in modules. (Ohlhorst 2013, )

The variety of data (structured, semi-structured, and unstructured) requires a database which can efficiently store and process all of these data types. Relational databases are optimized for structured data. NoSQL is a category of databases that are designed specifically to overcome the limitations of relational databases.

To fully leverage Big Data, it is often important to process and analyze it in real time. The challenge is to increase performance while the data amounts keep increasing. This has led to the adoption of in-memory data management technologies for real-time analytics. These technologies and their utility in Big Data analytics are discussed in more detail in chapter 4.

4. BIG DATA TECHNOLOGIES

This chapter discusses the most critical underlying technologies of Big Data. As discussed in the previous chapter, Big Data is not one specific type of data but encompasses a wide variety of data analytics situations which traditional tools are not equipped to store and process. Therefore, most of the technologies discussed in this chapter are designed to address a specific aspect of the Big Data analytics technology stack.

4.1 Massively Parallel Processing (MPP)

Massively Parallel Processing (hereinafter MPP) databases provide a cost-effective, scalable, and flexible data warehouse environment for large amounts of data. An MPP platform consists of multiple commodity servers which are all capable of processing data in parallel. MPP databases can be utilized as an extension or as a replacement for traditional data warehouse systems. (Schmarzo 2013, ) MPP database systems distribute the data among multiple servers. These servers each have their own storage and central processing units (hereinafter CPU). This is in contrast to the centralized model, where all data is stored and processed using one powerful server. (Franks 2013, ) Figure 2, adapted from Franks, illustrates these approaches (Franks 2013, 94).

Figure 2. Massively Parallel Processing Data Storage System (Franks 2013, 94)
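The distribution of data among servers described above can be sketched in Python. This is only an illustration under stated assumptions, not how any particular MPP product works: the servers are simulated with threads, rows are assigned to a server by hashing a key, and the names `node_for`, `distribute`, `local_total`, `parallel_total` and the `SALES` data are hypothetical.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor


def node_for(key: str, num_nodes: int) -> int:
    """Pick a node for a key with a deterministic hash (assumed scheme)."""
    return zlib.crc32(key.encode("utf-8")) % num_nodes


def distribute(rows, num_nodes):
    """Split (key, amount) rows into one shard per simulated node."""
    shards = [[] for _ in range(num_nodes)]
    for key, amount in rows:
        shards[node_for(key, num_nodes)].append((key, amount))
    return shards


def local_total(shard):
    """Work done by a single node on its own portion of the data."""
    return sum(amount for _, amount in shard)


def parallel_total(rows, num_nodes=3):
    """Let every node process its shard, then combine the partial results."""
    shards = distribute(rows, num_nodes)
    with ThreadPoolExecutor(max_workers=num_nodes) as pool:
        partials = list(pool.map(local_total, shards))
    return sum(partials)


# Hypothetical sample data: (store, sales amount)
SALES = [("store-a", 120), ("store-b", 80), ("store-c", 200), ("store-a", 50)]

if __name__ == "__main__":
    print(parallel_total(SALES))
```

Each simulated node only ever sees its own shard, which mirrors the idea that every server has its own storage and CPU; a centralized system would correspond to calling `local_total` on the full data set.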

Figure 2 illustrates the main strength of the MPP database system. A centralized server is appropriate as long as the data sets are relatively small, but with large data sets it becomes challenging for a centralized server to process them in a timely manner. By dividing a large set of data into manageable pieces and distributing it among multiple servers, it can be processed more efficiently. In addition, MPP systems store multiple copies of data in multiple locations. This redundancy makes recovering data easy in the case of a hardware failure. (Franks 2013, )

4.2 MapReduce

MapReduce is a technology that helps to overcome both the volume and variety challenges of Big Data analytics. Like MPP, MapReduce utilizes a distributed approach to data processing, but unlike MPP, MapReduce is not a database system. According to Hurwitz et al., MapReduce is a software framework which facilitates the processing of large amounts of unstructured data in parallel, using a distributed set of processors. (Hurwitz et al. 2013, 101.) Franks points out that although it has been suggested that MapReduce will replace databases entirely, MapReduce is neither a database nor a direct competitor for them. Rather, it is a complementary technology for current databases and is best used by focusing on its strengths. (Franks 2012, 110.) MapReduce consists of two main programming processes: map and reduce. The programming model is summarized by Dean and Ghemawat as follows: The computation takes a set of input key/value pairs, and produces a set of output key/value pairs. The user of the MapReduce library expresses the computation as two functions: Map and Reduce. Map, written by the user, takes an input pair and produces a set of intermediate key/value pairs. The MapReduce library groups together all intermediate values associated with the same intermediate key I and passes them to the Reduce function.
The Reduce function, also written by the user, accepts an intermediate key I and a set of values for that key. It merges together these values to form a possibly smaller set of values. Typically just zero or one output value is produced per Reduce invocation. The intermediate values are supplied to the user's reduce function via an iterator. This allows us to handle lists of values that are too large to fit in memory. (Dean & Ghemawat 2004, 2, original emphasis.)
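The programming model quoted above can be sketched as a small single-process Python example. It is a word count written for illustration only: the helper `map_reduce` stands in for the MapReduce library and runs everything in one process, whereas a real framework would distribute the map and reduce calls across worker nodes. The names `map_fn`, `reduce_fn`, `map_reduce` and `DOCS` are hypothetical.

```python
from collections import defaultdict


def map_fn(doc_name, text):
    """Map: take an input pair and emit intermediate (word, 1) pairs."""
    for word in text.lower().split():
        yield word, 1


def reduce_fn(word, counts):
    """Reduce: merge all values for one intermediate key into one value."""
    yield sum(counts)  # counts arrives as an iterator


def map_reduce(inputs, map_fn, reduce_fn):
    """Stand-in for the library: group intermediate values by key."""
    intermediate = defaultdict(list)
    for key, value in inputs:
        for ikey, ivalue in map_fn(key, value):
            intermediate[ikey].append(ivalue)

    output = {}
    for ikey in sorted(intermediate):
        # Pass the grouped values to the user's reduce function as an iterator.
        output[ikey] = list(reduce_fn(ikey, iter(intermediate[ikey])))
    return output


# Hypothetical input key/value pairs: (document name, document contents)
DOCS = [("doc1", "big data is big"), ("doc2", "data is everywhere")]

if __name__ == "__main__":
    print(map_reduce(DOCS, map_fn, reduce_fn))
```

The user supplies only `map_fn` and `reduce_fn`; grouping by intermediate key is the part a MapReduce library would take care of.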

These steps are utilized in the MapReduce framework, which distributes them among worker nodes to be processed. Commodity hardware can be used to perform these processes: the worker nodes operate without any interaction between them, and each of them runs the same code on the portion of data distributed to it by MapReduce. (Franks 2012, ) MapReduce has proven to be effective in processing unstructured data. Unlike structured data, which is already organized into carefully constructed relationships in a relational database, raw data streams are not easily analyzed by traditional tools. Unstructured or semi-structured data include a variety of data types, such as machine-generated web logs, sensor data, images and text, and often it is imperative for organizations to derive meaningful insights from that data in a timely and efficient manner. This is where MapReduce shows its strength. (Franks 2013, ) Figure 3, adapted from Hurwitz et al., illustrates how the MapReduce process works. (Hurwitz et al. 2013, 106.)

Figure 3. MapReduce Data Flow

As Figure 3 illustrates, all operations are independent. This allows MapReduce to break a substantial data set, involving up to terabytes of data, into smaller segments and process the segments independently. (Hurwitz et al. 2013, 106.)

4.3 NoSQL databases

Relational database management systems (hereinafter RDBMS) have been relied upon to solve data storage challenges. A wide ecosystem of applications and platforms has been built around the relational database management system paradigm. The arrival of Big Data has created entirely new challenges that cannot be addressed by the traditional RDBMS approach. Due to the large amounts of unstructured data available and the fact that traditional relational databases are impractical for storing unstructured data, new types of data storage systems are needed to fully utilize this data. NoSQL databases are designed to address these challenges. NoSQL comes from the words no and structured query language (hereinafter SQL), or not only and SQL. NoSQL is now used to describe database systems that do not follow the RDBMS principles. These databases are specifically designed to handle large amounts of data, including unstructured data, in a time-sensitive manner. (Vaish 2013, 8.) Cattell identifies NoSQL databases using six key features: the ability to scale horizontally, the ability to distribute data over many servers, a simple call level interface or protocol, a weaker concurrency model than the ACID transactions used by most relational databases, efficient use of distributed indexes and RAM, and the ability to dynamically add new attributes to records. (Cattell 2011, 1.) Traditional RDBMS solutions usually utilize the so-called ACID model. ACID is an acronym of the words atomicity, consistency, isolation and durability. Atomicity means that unless everything in a given transaction succeeds, it is cancelled and rolled back. Consistency ensures that a transaction does not leave a database in an inconsistent state.
Isolation refers to the fact that transactions cannot interfere with each other and, lastly, durability means that transactions persist even in the case of an application restart. These qualities, while important for a database to function as intended, are not entirely compatible with very large amounts of data. Therefore, a new database model

known as BASE has been adopted to make up for these shortcomings of ACID. BASE consists of three principles. Basic availability: Each request is guaranteed a response, be it successful or failed. Soft state: The state of the system may change over time, at times without any input. Eventual consistency: The database may be inconsistent at times but will eventually be consistent. It is important to realize that NoSQL and the BASE model are not ideal solutions in cases where transactions are of critical importance, such as banking. The BASE model, just like the ACID model, is not the answer to all database problems. (Vaish 2013, 9-12.) Vaish points out that NoSQL does not actually refer to any specific database or even a database type but is rather a concept that is used, at its highest level, to filter out a specific set of databases: the RDBMS. (Vaish 2013, 9.) Tiwari identifies five different types of data stores which fall under the NoSQL umbrella: column oriented databases, key/value pair databases, document databases, graph databases and XML databases. (Tiwari 2011, 1-19.) Table 2 below illustrates the main differences between SQL and various NoSQL databases.

Table 2. Important Characteristics of SQL and NoSQL Databases (Hurwitz et al. 2013, 55)

Table 2 lists the most important characteristics most frequently associated with various database types. As it is with most rapidly evolving technologies, these characteristics

are not set in stone and exceptions exist. Nevertheless, the table provides a good reference point for further examination. Column oriented database, document oriented database, and key/value database types are discussed briefly below.

Column oriented database

The main defining factor of column oriented databases is that, unlike relational databases which store data as rows, they store data as columns. Vaish explains the main advantages of the column oriented approach. Firstly, most column oriented databases are flexible because columns can be added without having to worry about filling default values. Furthermore, they perform very well when having to process large datasets. Column oriented databases also allow partial data access without having to process unrelated columns, leading to increased performance. (Vaish 2013, 27.) While it is possible to simulate a column oriented database in a row oriented database, research suggests that such simulations perform poorly in comparison to purpose-built column oriented databases. (Abadi & Madden & Hachem 2008, 14.) According to Plattner, most applications in enterprises are based on set processing rather than single row access. This means that having data stored as columns is more efficient. (Plattner 2009, 2.) Figure 4, adapted from Plattner, illustrates the fundamental difference between row operations and column operations. (Plattner 2009, 2.)

Figure 4. Data Access in Row Storage and Column Storage (Plattner 2009, 2)
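The difference between row storage and column storage described above can be sketched in Python. This is a toy illustration, not the storage format of any real database: a row store is modelled as a list of records and a column store as one list per column, and the sketch assumes every record has the same fields. The names `ROWS`, `to_columns`, `total_from_rows` and `total_from_columns` are hypothetical.

```python
# Hypothetical sample table, held row by row (one dict per record).
ROWS = [
    {"id": 1, "customer": "Ann", "country": "FI", "amount": 120.0},
    {"id": 2, "customer": "Bob", "country": "SE", "amount": 80.0},
    {"id": 3, "customer": "Cai", "country": "FI", "amount": 200.0},
]


def to_columns(rows):
    """Re-arrange the same table so that each column is stored contiguously."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def total_from_rows(rows):
    """Set operation on a row store: every full record has to be visited."""
    return sum(row["amount"] for row in rows)


def total_from_columns(columns):
    """Set operation on a column store: only the requested column is read."""
    return sum(columns["amount"])


if __name__ == "__main__":
    columns = to_columns(ROWS)
    print(total_from_rows(ROWS), total_from_columns(columns))
```

Both functions return the same total, but the column version never touches the `customer` or `country` values, which is the partial data access that the column oriented approach relies on. Fetching one complete record is the opposite case: it is a single lookup in `ROWS` but requires reading from every list in the column layout.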

In terms of set processing, the more data is stored in a database, the more apparent the effectiveness of a column database becomes. With a column store it is possible to process only a specific column, whereas a row store needs to process each row to access the column value requested. While this is not significant when processing relatively small datasets, the difference becomes significant with large amounts of data. For the same reasons, performing row operations in a column oriented database is inefficient compared to performing those operations in a row database. Therefore, column stores should be used when large amounts of data have to be processed using column operations.

Key/value database

A key/value database is a simple database type where a specific value is stored into the database and accessed through a specific key. The key/value database does not have a rigid schema and is very flexible. A key/value database can be stored either in an in-memory system or in a persistent store. Because values are stored against a key, key/value databases serve as very effective in-memory caches. An in-memory key/value database works very much like associative arrays and hash tables. (Vaish 2013, 41.)

Document oriented database

Document oriented databases are designed to store semi-structured data. Data is stored as a set of records, sometimes referred to as documents, which are similar to rows in an RDBMS, with the difference that a document oriented database can store semi-structured data. While document databases do not have a strict schema, indexes can be created and queried. Records in a document oriented database may vary to a great degree, and they may have entirely different sets of rows and columns. (Vaish 2013, ) The main advantage of document oriented databases is their flexible schema. This makes them effective in web-based applications where the data that needs to be stored might be changing in structure constantly.
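The key/value and document models described above can be sketched together in Python. This is a minimal in-memory illustration under stated assumptions, not the API of any real NoSQL product: documents are plain dictionaries stored against generated keys, and the class `DocumentStore` with its methods `insert`, `get` and `find` is hypothetical.

```python
class DocumentStore:
    """Toy in-memory store: documents are kept as values against a key."""

    def __init__(self):
        self._docs = {}      # key/value part: document id -> document
        self._next_id = 1

    def insert(self, doc):
        """Store a document of any shape and return its generated key."""
        doc_id = self._next_id
        self._next_id += 1
        self._docs[doc_id] = dict(doc)  # copy so later caller edits do not leak in
        return doc_id

    def get(self, doc_id):
        """Key/value access: fetch a document directly by its key."""
        return self._docs.get(doc_id)

    def find(self, field, value):
        """Query by field; documents that lack the field are simply skipped."""
        return [d for d in self._docs.values() if d.get(field) == value]


if __name__ == "__main__":
    store = DocumentStore()
    # Two records with different fields: no schema has to be declared first.
    store.insert({"type": "user", "name": "Ann", "email": "ann@example.com"})
    store.insert({"type": "order", "items": ["book", "pen"], "total": 23.5})
    print(store.find("type", "order"))
```

The two inserted records share only the `type` field, which illustrates the flexible schema; `get` corresponds to the plain key/value lookup, and `find` scans all documents where a real system would normally use an index.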
In addition, many document oriented databases allow partial record updates and retrievals, which increase the processing efficiency of the system, especially with large amounts of data. Moreover, searching across the entire database is achieved effectively. This is due to the fact that unlike relational

databases and column oriented databases which have tables, the document oriented database has no strict schema. (Vaish 2013, 31.)

4.4 In-memory data management

As discussed previously, real-time analytics is becoming an increasingly important part of data analytics. This, combined with the increasing amounts of data organizations are managing, means that faster processing of data is required. This can be achieved by MPP, NoSQL databases and the increasing power of hardware, but most of the time the slowest component of processing data is the mechanical, disk-based data storage. In-memory data management, sometimes called main-memory data management, is an attempt to overcome this challenge. In the past, the main limiting factors of in-memory storage have been the relatively small amounts of memory available and high costs. According to Ohlhorst, prices of silicon-based memory have fallen to a point where the capacity-to-cost ratio has made in-memory data management a viable alternative to disk-based storage. With in-memory technologies, organizations now have the ability to analyze large data sets in real time, thus allowing managers to base their decisions on real-time data. (Ohlhorst 2013, ) However, this does not mean that all data should be stored in in-memory storage. For example, SAP HANA, an in-memory real-time analytics platform which will be discussed in chapter 5, typically manages data amounts of approximately 100 terabytes (SAP 2013, 23). These storage amounts alone, while impressive, do not meet the data management needs of large enterprises. Thus it is critical to utilize in-memory data technologies for data that must be processed quickly and analyzed in real time, while utilizing mechanical, disk-based storage for extremely large sets of data that do not share the same need for real-time analysis.

4.5 Big Data and cloud computing

Increasing attention has been paid to the benefits cloud computing technologies have for Big Data analytics.
As is suggested by Schmarzo, cloud computing provides benefits such as flexibility, scalable storage capabilities, and access to substantial processing

power (Schmarzo 2013, 190). According to Mell and Grance, cloud computing is a model that enables convenient, on-demand access to a shared pool of configurable computing resources. These computing resources include networks, servers, storage, applications and services, and they need to be provisioned rapidly and with a minimal need for management and service provider interaction. Mell and Grance identify five essential characteristics of cloud computing. (Mell & Grance 2011, 2.)

1. On-demand self-service: The user has the capability to access the computing resources as needed, without any human interaction with the service provider.
2. Broad network access: Computing resources can be accessed through a network using standard mechanisms that encourage the use of a variety of thin or thick clients, e.g. mobile phones, tablets, laptops, and workstations.
3. Resource pooling: Computing resources are pooled in order to serve multiple consumers using a multi-tenant model. The system dynamically assigns physical and virtual resources according to customer demand. The customer generally has no knowledge of or control over the exact location of the resources.
4. Rapid elasticity: Resources can be provisioned and released in a flexible manner in order to react rapidly to changes in customer demand.
5. Measured service: Cloud computing systems automatically control and optimize resources by utilizing a metering system, which gives the user adequate information regarding the service while maintaining a level of abstraction appropriate for the service.

In addition, Mell and Grance list three cloud computing service models: Software as a Service (hereinafter SaaS), Platform as a Service (hereinafter PaaS), and Infrastructure as a Service (hereinafter IaaS). The SaaS model provides the consumer access to the service provider's application running on a cloud infrastructure. The consumer has a very limited set of configuration options.
The PaaS model provides the consumer the capability to use the service provider's cloud infrastructure to implement consumer-created or acquired applications. The consumer does not control the cloud infrastructure but has control over the applications. The IaaS model provides the consumer computing resources. The consumer does not control the cloud infrastructure but is able to deploy a wide variety of software, including operating systems and applications. Moreover, there are four cloud deployment models: private cloud, community cloud, public cloud, and hybrid cloud. (Mell & Grance 2011, 2-3.)