Exploring the Concept of Big Data


Exploring the Concept of Big Data - Technologies, Solutions and Best Practices

Ilkka Ritola

Bachelor's thesis of the Degree Programme in Business Information Technology
Bachelor of Business Administration

TORNIO 2014

ABSTRACT

LAPLAND UNIVERSITY OF APPLIED SCIENCES, Business and Culture

Degree programme: Bachelor of Business Administration
Writer(s): Ritola, Ilkka
Thesis title: Exploring the Concept of Big Data - Technologies, Solutions and Best Practices
Pages (of which appendices): 58 (0)
Date: May 28, 2014
Thesis instructor(s): Vladimir Ryabov

This research work aims to explore the concept of Big Data in detail. More specifically, the objective is to define and clarify the concept by examining the exact meaning of Big Data and understanding its underlying technologies. Moreover, this research discusses the Big Data analytics solutions offered in the market today, along with best practices for implementing them at a strategic level.

Big Data has been one of the most talked about topics of recent years in both business and information technology media. The importance of Big Data cannot be overstated, and the velocity of data creation continues to skyrocket. Many have taken full advantage of the new possibilities provided by recent developments in the field, but many more are lagging behind or just starting to consider the possibilities. Therefore, this research work attempts to provide a good starting point for anyone interested in Big Data.

The main research approach employed in this research work is exploratory research using the most relevant and up-to-date literature as well as research works. This research work is theoretical in nature but explores the subject of Big Data in its applicable context.

Big Data is more than just data that is available in substantial amounts. It encompasses a variety of different data types and data sources. The amount of data varies greatly and so does the time sensitivity of the data. Moreover, the aim of the organization using the data changes the Big Data problem drastically. Therefore, understanding both the technological foundation of the various tools and the business aims of the Big Data project becomes crucial. Thus, industry-leading solutions as well as best practices in implementing Big Data projects are also discussed in this research work.

Keywords: Big Data, Big Data Analytics

CONTENTS

ABSTRACT
FIGURES
TABLES
1 INTRODUCTION
Background and motivation
Objectives
Structure
Research scope, questions and methodology
Research scope
Research questions
Methodology
Limitations
Big Data concept
Big Data definition
Big Data characteristics
Big Data applications and limitations
Big Data technological challenges
Big Data technologies
Massively Parallel Processing (MPP)
MapReduce
NoSQL databases
Column oriented database
Key/value database
Document oriented database
In-memory data management
Big Data and cloud computing
BIG DATA SOFTWARE SOLUTIONS
Apache Hadoop
Hadoop Distributed File System (HDFS)
Hadoop YARN and Hadoop MapReduce
Apache Hbase
Apache Hive
Cloudera Enterprise
CDH
Cloudera Manager and Cloudera Navigator
MongoDB
Oracle Big Data Appliance
Oracle Big Data Appliance
IBM Big Data platform
IBM InfoSphere Data Explorer
IBM InfoSphere BigInsights
IBM InfoSphere Streams
IBM PureData System for Analytics powered by Netezza
Analytic accelerators
Information integration and governance
IMPLEMENTING A BIG DATA PLATFORM
Business alignment
Big Data workforce
Organizational considerations
Integration of Big Data with traditional data warehouse and BI systems
Conclusions
REFERENCES

FIGURES

Figure 1. Four Vs of Big Data (Walker 2012)
Figure 2. Massively Parallel Processing Data Storage System (Franks 2013, 94)
Figure 3. MapReduce Data Flow
Figure 4. Data Access in Row Storage and Column Storage (Plattner 2009, 2)
Figure 5. A typical Hadoop configuration (Schmarzo 2013, 178)
Figure 6. HDFS Architecture (The Apache Software Foundation 2013)
Figure 7. MapReduce, YARN and HDFS (Hortonworks 2014.)
Figure 8. Cloudera's Enterprise Data Hub (Cloudera 2014, a)
Figure 9. CDH Components (Cloudera 2014, b)
Figure 10. Oracle Big Data solutions (Oracle 2013, 8.)
Figure 11. The Oracle Big Data Appliance (Oracle 2013, 9.)
Figure 12. Services offered by the IBM Big Data platform (Zikopoulos et al. 2013, 56.)
Figure 14. Integrated information platforms (Devlin 2012, 7.)

TABLES

Table 1. Data Measures (Franks 2012, 89)
Table 2. Important Characteristics of SQL and NoSQL Databases (Hurwitz et al. 2013, 55)
Table 3. IBM BigInsights Hadoop components (Zikopoulos et al. 2013, 91)

1 INTRODUCTION

The background of the thesis topic is discussed in this chapter and the importance of the topic is motivated. Furthermore, the objectives of the research work are described in detail and the structure of this thesis is outlined.

1.1 Background and motivation

Business Intelligence (hereinafter BI) has been a significant part of the strategic approach of a vast majority of companies for a long time. Understanding data has become critically important in the business environment. The amount of data created is increasing due to social media, mobile devices, cheap sensors and advances in computing technology, leading to a situation where the requirements for data management and analytics are changing beyond what traditional BI solutions are capable of.

According to Gantz and Reinsel, the digital data created, replicated and consumed in the world is doubling every two years. In addition, Gantz and Reinsel have estimated that in the year 2020 alone there will be approximately exabytes of digital data that will be created, replicated and consumed. This data comes from sources such as social media, mobile devices, music and movies, digital pictures, transactions, and science projects. All this data provides a plethora of analytical benefits. Yet, only a fraction of this data has been explored for its analytical value. By 2020 approximately 33% of all this data will be valuable, but only if tagged and analyzed. (Gantz & Reinsel 2012, 1-9.) This data is often referred to as Big Data.

During recent years, Big Data has become an area of intense interest for businesses, scientists and governments. Harvard Business Review, for example, called data scientist the sexiest job of the 21st century (Davenport & Patil 2012). Google Trends indicates that the search term Big Data shows a sharp increase in interest starting from 2011 (Google 2014). Even though this indicates that Big Data surged into the popular consciousness only recently, it has existed in one way or another far longer than that.

According to Gartner, "Big Data is high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making." (Gartner 2014.) At its most fundamental level, Big Data is the challenge organizations need to solve in order to access the analytical value provided by the increasing data volumes that are constantly being created. Organizations have always tried to use data for competitive advantage, but recent developments in information technology have brought both unprecedented challenges and opportunities to the data analytics field.

Minelli, Michael and Dhiraj divide the reasons for the rise of Big Data analytics into three main parts: the advances in computing technology, the increased amount of data available, and the convergence of traditional data management and analytics software with open-source software and commodity hardware. (Minelli & Michael & Dhiraj 2013, 1-2.)

There have been many data-driven success stories during the last decades, including high-profile companies such as Amazon, Netflix, Google and Facebook. These companies have fully embraced Big Data. Google and Facebook have also played a part in developing the necessary tools to store and analyze high data volumes. Moreover, many of the early technologies developed by these companies were built with open-source principles in mind and have since been adopted by numerous developers such as The Apache Software Foundation and software vendors such as Cloudera, IBM, and Oracle.

It is important to point out that while many of the technological developments have been created by these big businesses, Big Data is not beneficial only for social media companies and retailers. It can be utilized in a wide variety of fields, from education to science projects to business purposes. The interest in Big Data continues to rise and it is clear that organizations are considering implementing Big Data analytics programs (IDG Enterprise 2014, 4-10). Yet, Big Data is often misunderstood and underutilized. For many organizations Big Data is still a confusing concept that holds vague promises and ambiguous opportunities. Therefore, this research attempts to clarify the Big Data concept.

To fully understand the Big Data concept it is critical to understand its main characteristics, challenges, and underlying technologies. Therefore, emphasis should be put on exploring the Big Data concept holistically in order to make informed decisions regarding the implementation of Big Data management platforms.

1.2 Objectives

The main objectives of researching Big Data are to explore the concept of Big Data, its benefits and limitations, and the implementation of Big Data analytics projects. To achieve this it is critical to understand what Big Data is. This research does not attempt to propose a final definition, for that would be an impractical endeavor with a concept as dynamic as Big Data. Instead, it explores the concept and its underlying technologies in depth in an attempt to clarify the concept in the minds of business leaders and academics alike.

Furthermore, this research conducts an up-to-date examination of the most relevant Big Data analytics software solutions offered by current industry leaders such as IBM and Oracle. This is accompanied by a discussion of the implementation process of Big Data projects, including individual components, human resources and best practices of such endeavors.

Ultimately, this research paper is not designed to be a definitive guide for making concrete decisions concerning Big Data. The aim is to provide a starting place for anyone interested in exploring the benefits and limitations of Big Data.

1.3 Structure

This thesis is divided into seven chapters. Research scope, methodology, research questions and limitations are discussed in the second chapter. The third chapter presents a discussion regarding the Big Data concept. The fourth chapter discusses the underlying technologies of Big Data analytics. The most important Big Data software solutions are explored in the fifth chapter. The sixth chapter explores the implementation process and best practices of Big Data projects. The seventh chapter concludes the research by discussing the research work and by providing suggestions for further research.

2 RESEARCH SCOPE, QUESTIONS AND METHODOLOGY

The scope of this research is outlined in this chapter. The research questions are outlined along with the methodology that was used in conducting this research.

2.1 Research scope

This research examines the phenomenon known as Big Data. An exploration of the real-world value and benefits of Big Data is conducted to separate the facts from the marketing hype that surrounds Big Data. Benefits and limitations of Big Data are discussed. The research also explores the Big Data concept by looking at existing literature and current definitions provided by industry leaders such as IBM and SAS in an attempt to clarify the meaning of the concept. Even though this research does not introduce any new terminology to the current vernacular, a careful look is taken at the current definitions to understand this ever-evolving subject in depth.

In addition, the research explores the technological developments that have made Big Data analytics possible. Specifically, the research focuses on the technologies that are essential for the feasible Big Data analytics programs that organizations are undertaking.

Furthermore, an analysis of the leading Big Data software solutions is conducted. A critical comparison is formulated concerning the different services offered. Emphasis is put on understanding the approach each software vendor is taking with its Big Data solutions.

2.2 Research questions

The following research questions are defined in order to achieve the objectives of the research work.

1. What is Big Data and why is it critical to understand Big Data benefits and limitations? What are the underlying technologies of Big Data?

Big Data is discussed extensively in the media and industry literature, but the exact meaning of Big Data remains unclear. Numerous different, even contradictory, definitions are often used by the media and Big Data companies. In the interest of further, cohesive research and of understanding the benefits of Big Data, it is important that the Big Data concept is understood clearly. This can be achieved by exploring the characteristics and underlying technologies that are associated with Big Data. The focus is put on exploring the most important technologies of Big Data analytics.

2. What are the features of the leading Big Data applications offered in the market today? What are their main differences and similarities?

This research question examines the different Big Data analytics solutions provided by the current industry leaders. A comparison of these Big Data applications is conducted to better understand the alignment of a data-driven approach and business value.

3. How are Big Data management platforms implemented in organizations? What are the critical components of successfully implemented Big Data management programs?

This research question explores the importance of Big Data management programs as a critical component of an organization's decision making process. Relevant literature and industry best practices are utilized in understanding the process of implementing a Big Data management project.

2.3 Methodology

This research is exploratory because Big Data is a relatively new topic. A qualitative approach is utilized, and numerical data is also used. The main research technique of this research work is literature review, and only the most relevant and up-to-date works are used because of the high velocity at which the field is evolving.

This research is exploratory in nature because the concept of Big Data is relatively new. In addition, the Big Data analytics field is evolving at a high velocity, which has made understanding Big Data challenging. As pointed out by McDaniel Jr. and Gates, exploratory research is used to obtain greater understanding of a concept or to help crystallize the definition of a problem. It is also used to identify important variables to be studied. Exploratory research is preliminary research, not the definitive research used to determine a course of action. (McDaniel Jr. & Gates 2010, 43.) This research attempts to clarify the concept of Big Data by exploring the relevant literature; therefore, an exploratory research approach was chosen. In addition to academic literature, definitions and information from current industry leaders are utilized.

Because this research explores the importance of Big Data analytics through literature, it deals mainly with qualitative data. According to Smith and Albaum, using secondary data sources can help to clarify a problem and is considerably more cost and time efficient than collecting primary data. (Smith & Albaum 2005, ) Therefore, numerical data from secondary and tertiary sources are also used to further explore the benefits, limitations and weaknesses of Big Data.

The criterion for choosing the literature and materials was to utilize only the most relevant and up-to-date literature from reputable authors and publishers.

2.4 Limitations

This research has two main limitations. Firstly, the concept of Big Data is evolving at a high velocity, and the information provided by this research remains relevant only until new developments are made in the field. Secondly, this research utilizes information and white papers provided by companies that sell Big Data solutions. While these can be excellent sources of information, it is important to recognize the promotional aspect of white papers and research works sponsored by software vendors.

3. BIG DATA CONCEPT

The concept of Big Data is explored in this chapter by looking at existing definitions and common characteristics that are used to describe Big Data. Furthermore, various use cases for Big Data are examined to further increase the clarity of the concept.

3.1 Big Data definition

Big Data as a concept has been in general use for just a few years. Big Data has become a buzzword during this short time period, yet the exact meaning of the concept remains unclear and the term is often used synonymously with other related concepts such as BI and data mining. Furthermore, various stakeholders have created new definitions or additions to existing definitions that best suit their interests. Nevertheless, to capture the essence of Big Data, consistent themes can be found by examining various definitions provided by the industry leaders and relevant literature.

IBM offers a very direct approach in describing Big Data, stating that "Every day, we create 2.5 quintillion bytes of data - so much that 90% of the data in the world today has been created in the last two years alone. This data comes from everywhere: sensors used to gather climate information, posts to social media sites, digital pictures and videos, purchase transaction records, and cell phone GPS signals to name a few. This data is big data." (IBM 2014, original emphasis.) IBM makes it clear that Big Data involves notable amounts of data that come from a wide variety of sources. Additionally, IBM emphasizes the increasing speed of data generation.

SAP, on its website, offers a more promotion-oriented view of Big Data. "Big Data is an opportunity to change how you work, play, and live by tracking new signals within digital noise. From major league sports and personalized shopping to carbon reduction and cancer treatment, organizations are using Big Data to re-imagine achieving what is possible." (SAP 2014.) Clearly, SAP is focusing on the benefits of Big Data rather than trying to provide a clear-cut definition of the concept. Moreover, SAP is emphasizing the fact that Big Data can provide value in a wide variety of fields.

McKinsey approaches the Big Data concept from a different point of view. "Big data refers to datasets whose size is beyond the ability of typical database software tools to capture, store, manage, and analyze. This definition is intentionally subjective and incorporates a moving definition of how big a dataset needs to be in order to be considered big data - i.e., we don't define big data in terms of being larger than a certain number of terabytes (thousands of gigabytes). We assume that, as technology advances over time, the size of datasets that qualify as big data will also increase. Also note that the definition can vary by sector, depending on what kinds of software tools are commonly available and what sizes of datasets are common in a particular industry. With those caveats, big data in many sectors today will range from a few dozen terabytes to multiple petabytes (thousands of terabytes)." (McKinsey 2011, 1. Original emphasis.)

Big Data does not refer to a specific type of data, but rather to the challenge created by the substantial data amounts available to organizations. It is apparent that there is a lot of variation in the definitions. Thus, rather than delving into the dozens of formal definitions, Big Data can perhaps be best understood by examining the characteristics associated with it.

One of the earliest and most widely cited sources on the characteristics of Big Data is a 2001 META Group report by Doug Laney (Laney 2001). Even though the actual concept of Big Data is not explicitly mentioned in the report, the characteristics mentioned in it, or Three Vs as they are often called, have become the standard way of describing Big Data. These are Volume, Velocity and Variety.

While more Vs and other characteristics have been added to describe Big Data by various entities, the three original Vs remain the widely accepted basis for describing Big Data. SAS, for example, has added two additional dimensions to complement the original three Vs. These are variability and complexity (SAS 2014). IBM includes a fourth V: veracity. Furthermore, a fifth V, value, is commonly associated with Big Data. In other words, there is no consensus on the exact characteristics of Big Data. Nevertheless, there are multiple characteristics which most of the software vendors in the field agree upon. Figure 1, adapted from Walker (2012), depicts four of these Vs.

Figure 1. Four Vs of Big Data (Walker 2012)

Volume refers to the sheer amount of data and the challenges it brings with it. Velocity refers to the challenge created by data that requires a very fast response. Variety refers to the challenge of the varying formats and structures that data is stored in. Moreover, to truly take advantage of Big Data it is important to manage the integrity of the data, that is, its veracity. This is becoming an increasingly challenging and increasingly important part of Big Data analytics.

3.2 Big Data characteristics

The volume dimension is perhaps the biggest differentiating factor of Big Data analytics compared to traditional data analytics. As its name implies, Big Data involves considerable amounts of data. However, an important question to answer is what exactly big means in this context.

A recent survey by Intel indicates that large companies generate approximately 200 TB of data each week (Intel 2012, 24). Such substantial volumes of data make it challenging, if not impossible, to derive meaningful and valuable knowledge in a timely manner using traditional BI tools. Nonetheless, what is considered big today will seem positively diminutive in the future due to the rapid increase in the amount of data generated. Therefore, it is important to keep in mind that big is always a relative expression. Table 1, adapted from Franks, illustrates the scale of data amounts involved in Big Data analytics (Franks 2012, 89).

Table 1. Data Measures (Franks 2012, 89)

Table 1 lists the most commonly used data measures and puts the data volumes into perspective by using concrete examples. While not a precise measure, it illustrates the sheer volume of data that is created and managed by organizations in today's world.

Traditionally, much smaller data sets known as samples were used for data analysis. The main reason for this has been data storage and processing constraints. Even though this method is effective in gaining predictive insight, the results are suboptimal. In today's Big Data world these constraints are becoming less relevant. By taking advantage of the ability to analyze the entire data set, organizations can discover subtle patterns which remain undetected when analyzing merely a small sample of the whole. The results will also be more reliable and accurate, and by having data from extensive periods of time it is possible to form complex predictive models that take into account all the inter-related components, which would not be possible otherwise. (Minelli et al. 2013, 37.)

Franks points out that because the hardware limitations on how much data is stored are largely removed, it has become possible to store all the data in order to utilize it for analytics. Nevertheless, much of the data that is being stored holds no or very little value. Therefore, the key point to make about Big Data is that it is not the stored data itself that is valuable but rather what is done with the data and what decisions are made based on this data-driven approach. (Franks 2012, 6-8)

Big Data is not merely big. The main difference between Big Data analytics and BI is the type of data that is being analyzed. Traditional BI tools are utilized for analysis of structured data but cannot effectively extract information from semi-structured and unstructured data. Big Data solutions are specifically designed to store and process semi-structured and unstructured data. While there are also hybrid solutions that can handle all types of data, it is unlikely that they will replace traditional BI tools in the near future. (Simon 2013, )

Structured data resides firmly in a relational row-column database and is designed to be clean, effective and easy to analyze. This includes set data types and predefined rules on where the data is stored. Unstructured data, on the contrary, does not have a predefined data model, nor does it fit into a relational database. In addition, where structured data tends to grow in a largely linear fashion, unstructured data grows exponentially. While usually text heavy, unstructured data contains numbers, dates and facts as well. Semi-structured data is used to describe data that is structured and has the ability to enforce hierarchies within the data but does not fit well into formal data models. (Minelli et al. 2013, 11.)

According to Simon, the internal data generated by organizations has stayed roughly at the same level for the past decade. This internal data is usually highly structured and has been used by BI solutions for a long time. Most of the external data, which is growing much faster than the internal data, is unstructured and comes from multiple different sources. (Simon 2013, 39.)
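The distinction between structured, semi-structured and unstructured data can be made concrete with a short sketch. The example below is only an illustration under stated assumptions: the sample records, field names and helper functions are hypothetical and are not taken from the cited sources. It shows why a fixed set of columns is easy to aggregate, why a hierarchical record with optional fields needs more care, and why free text offers only crude handles such as word counts.

```python
import csv
import io
import json

# Structured: fixed columns and types, fits a relational table (hypothetical sample data).
structured_csv = "customer_id,amount,currency\n101,19.99,EUR\n102,5.00,EUR\n"
rows = list(csv.DictReader(io.StringIO(structured_csv)))

# Semi-structured: self-describing and hierarchical, but fields may vary per record.
semi_structured = json.loads(
    '{"customer_id": 101, "orders": [{"sku": "A1", "tags": ["gift"]}, {"sku": "B7"}]}'
)

# Unstructured: free text with no predefined data model.
unstructured = "Ordered on 12 May, arrived late. Would still buy again!"


def total_amount(records):
    """Aggregating structured data only needs the known column name."""
    return sum(float(record["amount"]) for record in records)


def skus(document):
    """Semi-structured data must be navigated; optional fields need defaults."""
    return [order["sku"] for order in document.get("orders", [])]


def word_count(text):
    """Unstructured text has no schema; splitting on whitespace is a crude first step."""
    return len(text.split())


print(total_amount(rows), skus(semi_structured), word_count(unstructured))
```

The same three records could of course be stored in many other ways; the point of the sketch is only that the amount of predefined structure determines how directly the data can be queried.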

Velocity refers to the time sensitivity of Big Data. This means that to maximize the value of Big Data, it must often be analyzed inside a very narrow time window or, in some cases, in real time as the data keeps streaming in. According to Hurwitz & Nugent & Halper & Kaufman, the real-time aspect of data is most relevant when the aim of the analytics process is time sensitive and business critical. Examples of such instances could be monitoring social media streams, or giving customers a coupon at the point of sale depending on what they have bought. (Hurwitz & Nugent & Halper & Kaufman 2013, 32.)

Considering the sheer size, velocity and variety of Big Data, there is a certain amount of uncertainty inherently involved in analyzing it. It is impractical, if not impossible, to clean up the data for precise analysis. This problem is often referred to as data veracity and is a central problem to be solved when utilizing Big Data. The uncertainty and imprecision vary depending on the situation, but must be factored in when performing analysis. (Walker 2012)

Regardless of how many different Vs one counts among the critical characteristics that describe Big Data, the fact is that Big Data comes in many different shapes and forms. Big Data is a concept that is used to describe data that is challenging to manage using traditional tools. Furthermore, not all Big Data has all of the defining characteristics and not all Big Data exhibits the characteristics in equal degree. A relatively small data set can be considered to be Big Data if, for example, its variety makes it complicated to work with. Mike Olson, co-founder of Cloudera, a pioneering Big Data company, sums up his personal view of Big Data, after giving the standard answer of 3 Vs, thusly: "If there is data you want to work with and it doesn't fit where you want to put it, it is Big Data." (Olson & Li 2013.)

3.3 Big Data applications and limitations

Another way of understanding Big Data is by looking at the benefits Big Data analytics can bring about when utilized in various fields. Big Data can provide great benefits in almost every field. Data Science Central lists ten practical examples as follows: dialoguing with consumers, re-developing products, conducting risk analysis, keeping data safe, creating new revenue streams, real-time customization of websites, reducing maintenance costs, organizing tailored healthcare, providing enterprise-wide insights, and making cities smarter. (Data Science Central 2012.) Instead of going into details on these high-level examples, a more illuminating approach is to explore actual real-life examples of Big Data being utilized to its full potential.

Netflix, the video streaming giant, is a relevant example. It controls the whole customer experience, from the moment customers sign up all the way to the actual experience of watching a movie or a TV series. This allows Netflix to collect large amounts of data about its users, including viewing preferences such as preferred genres. This helps Netflix to understand its customers in ways that were never before possible. Essentially, a clever use of Big Data has allowed Netflix to identify opportunities for new TV shows that are likely to have a strong fan base. (Sweney 2014.) Furthermore, by utilizing Big Data and data visualization technologies, Netflix is able to analyze the covers of its shows and create covers which are likely to appeal to the targeted customer group. (Simon 2014.)

Big Data analytics is not beneficial for big businesses only. Data analytics was also utilized in the 2012 United States presidential election. An MIT Technology Review article from 2012 explains how Big Data was utilized during the campaign. "The Obama 2012 campaign used data analytics and the experimental method to assemble a winning coalition vote by vote. In doing so, it overturned the long dominance of TV advertising in U.S. politics and created something new in the world: a national campaign run like a local ward election, where the interests of individual voters were known and addressed." (Issenberg, 2012.) Furthermore, Obama's data scientists were able to identify potential voters and ways to get their attention, as well as to raise funds through a metric-driven e-campaign in which optimal subject lines, senders and messages were identified (Scherer 2012).

Nevertheless, it is important to maintain a critical approach in discussing Big Data. While Big Data can be a powerful source of knowledge, it also has its limitations. Simon points out that crowdsourcing everything is not always the appropriate approach to decision making. The vision of an individual might prove to be the correct course to take in certain situations. Moreover, Simon emphasizes the need for human judgment when using Big Data for decision making, because ultimately it is the human who makes the important decisions. (Simon 2013, )

It would be impossible to discuss Big Data without mentioning privacy concerns. Ever since Glenn Greenwald reported on the widespread data collection practices of the National Security Agency in 2013 (Greenwald 2013), Big Data privacy concerns have been discussed extensively in the media. These revelations might have brought the privacy conversation into the limelight, but these worries have been relevant for a much longer time. Simon points out that companies such as Amazon, Apple, Facebook and Google have long been scrutinized for privacy issues, and many people have a hard time trusting these companies with their personal data, some avoiding them altogether. (Simon 2013, ) Perhaps the most significant issue is that while data collection, storage and analytics technologies have been evolving at a high pace, laws and social norms have not dealt with such issues before.

3.4 Big Data technological challenges

Big Data introduces many technological challenges to organizations looking to leverage its benefits. These include storage challenges and data processing challenges. Each of the Big Data Vs (volume, variety, velocity, and veracity) provides different challenges that require new technological approaches. It is likely that a large organization has to overcome one or more of these challenges in order to realize the most important V: value.

The high volumes of data introduce a storage challenge. Because the data amounts being generated by organizations are growing rapidly, a centralized database system is an impractical data storage solution. Ohlhorst suggests that a Big Data storage system should be highly scalable in order to meet the changing requirements for data capacity. These storage systems need to be able to increase their capacity and processing power in modules. (Ohlhorst 2013, )

The variety of data (structured, semi-structured, and unstructured) requires a database which can efficiently store and process these data types. Relational databases are optimized for structured data. NoSQL is a category of databases that are designed specifically to overcome the limitations of relational databases.

To fully leverage Big Data, it is often important to process and analyze it in real time. The challenge is to increase performance while the data amounts keep growing. This has led to the adoption of in-memory data management technologies for real-time analytics. These technologies and their utility in Big Data analytics are discussed in more detail in chapter 4.

4. BIG DATA TECHNOLOGIES

This chapter discusses the most critical underlying technologies of Big Data. As discussed in the previous chapter, Big Data is not one specific type of data but encompasses a wide variety of data analytics situations which traditional tools are not equipped to store and process. Therefore, most of the technologies discussed in this chapter are designed to address a specific aspect of the Big Data analytics technology stack.

4.1 Massively Parallel Processing (MPP)

Massively Parallel Processing (hereinafter MPP) databases provide a cost-effective, scalable, and flexible data warehouse environment for large data amounts. An MPP platform consists of multiple commodity servers which are all capable of processing data in parallel. MPP databases can be utilized as an extension of or as a replacement for traditional data warehouse systems. (Schmarzo 2013, ) MPP database systems distribute the data among multiple servers. These servers each have their own storage and central processing units (hereinafter CPU). This is in contrast to the centralized model, where all data is stored and processed using a single powerful server. (Franks 2013, ) Figure 2, adapted from Franks, illustrates these approaches (Franks 2013, 94).

Figure 2. Massively Parallel Processing Data Storage System (Franks 2013, 94)
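The distribution of data among independent servers described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how any particular MPP product works: the `partition` and `local_total` helpers, the `NUM_NODES` value, and the sample sales rows are hypothetical, and threads stand in for separate servers that would each have their own storage and CPU.

```python
from concurrent.futures import ThreadPoolExecutor

NUM_NODES = 4  # hypothetical number of servers in the cluster


def partition(rows, num_nodes):
    """Distribute rows by customer id so that each node owns a disjoint slice."""
    nodes = [[] for _ in range(num_nodes)]
    for customer_id, amount in rows:
        nodes[customer_id % num_nodes].append((customer_id, amount))
    return nodes


def local_total(node_rows):
    """Work done on one node, using only the rows stored on that node."""
    return sum(amount for _, amount in node_rows)


# Hypothetical sales data: (customer_id, amount) pairs.
sales = [(customer_id, customer_id * 10) for customer_id in range(1, 101)]
nodes = partition(sales, NUM_NODES)

# Each "node" computes its partial result in parallel (threads stand in for servers).
with ThreadPoolExecutor(max_workers=NUM_NODES) as pool:
    partial_totals = list(pool.map(local_total, nodes))

# A coordinator combines the small partial results into the final answer.
grand_total = sum(partial_totals)
```

The point of the sketch is that only the small partial totals travel to the coordinator, while the bulk of the data stays where it is stored.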

Figure 2 illustrates the main strength of the MPP database system. A centralized server is appropriate as long as the data sets are relatively small, but with large data sets it becomes challenging for a centralized server to process them in a timely manner. By dividing a large set of data into manageable pieces and distributing the pieces among multiple servers, the data can be processed more efficiently. In addition, MPP systems store multiple copies of the data in multiple locations. This redundancy makes recovering data easy in the case of a hardware failure. (Franks 2013, )

4.2 MapReduce

MapReduce is a technology that helps to overcome both the volume and variety challenges of Big Data analytics. Like MPP, MapReduce utilizes a distributed approach to data processing, but unlike MPP, MapReduce is not a database system. According to Hurwitz et al., MapReduce is a software framework which facilitates the processing of large amounts of unstructured data in parallel, using a distributed set of processors. (Hurwitz et al. 2013, 101.) Franks points out that although it has been suggested that MapReduce will replace databases entirely, MapReduce is neither a database nor a direct competitor to them. Rather, it is a complementary technology for current databases and is best used by focusing on its strengths. (Franks 2012, 110.) MapReduce consists of two main programming processes: map and reduce. The programming model is summarized by Dean and Ghemawat as follows:

The computation takes a set of input key/value pairs, and produces a set of output key/value pairs. The user of the MapReduce library expresses the computation as two functions: Map and Reduce. Map, written by the user, takes an input pair and produces a set of intermediate key/value pairs. The MapReduce library groups together all intermediate values associated with the same intermediate key I and passes them to the Reduce function. The Reduce function, also written by the user, accepts an intermediate key I and a set of values for that key. It merges together these values to form a possibly smaller set of values. Typically just zero or one output value is produced per Reduce invocation. The intermediate values are supplied to the user's reduce function via an iterator. This allows us to handle lists of values that are too large to fit in memory. (Dean & Ghemawat 2004, 2, original emphasis.)
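The programming model quoted above can be illustrated with a single-process word count, a minimal sketch that assumes everything fits on one machine. The names `map_fn`, `reduce_fn`, and `run_mapreduce` and the sample documents are hypothetical and do not belong to any MapReduce library; a real framework would run the map and reduce calls on many worker nodes.

```python
from collections import defaultdict


def map_fn(doc_id, text):
    """Map: take one input pair and emit intermediate (word, 1) pairs."""
    for word in text.lower().split():
        yield word, 1


def reduce_fn(word, counts):
    """Reduce: merge all values for one intermediate key into one output value."""
    return sum(counts)


def run_mapreduce(inputs):
    # Map phase: apply map_fn to every input key/value pair.
    # Grouping step: collect all intermediate values under their key.
    intermediate = defaultdict(list)
    for doc_id, text in inputs.items():
        for key, value in map_fn(doc_id, text):
            intermediate[key].append(value)

    # Reduce phase: values are handed to reduce_fn through an iterator,
    # mirroring the description in the quoted passage.
    return {key: reduce_fn(key, iter(values)) for key, values in intermediate.items()}


documents = {"d1": "big data is big", "d2": "data is everywhere"}
word_counts = run_mapreduce(documents)
```

Because each `map_fn` call only looks at its own input pair and each `reduce_fn` call only looks at one key, the calls could be spread across machines without changing the result.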

These steps are utilized in the MapReduce framework, which distributes them among worker nodes to be processed. Commodity hardware can be used to perform these processes. The worker nodes operate without any interaction between them, and each of them runs the same code on the portion of data distributed to it by MapReduce. (Franks 2012, ) MapReduce has proven to be effective in processing unstructured data. Unlike structured data, which is already organized into carefully constructed relationships in a relational database, raw data streams are not easily analyzed by traditional tools. Unstructured or semi-structured data includes a variety of data types, such as machine-generated web logs, sensor data, images and text, and it is often imperative for organizations to derive meaningful insights from that data in a timely and efficient manner. This is where MapReduce shows its strength. (Franks 2013, ) Figure 3, adapted from Hurwitz et al., illustrates how the MapReduce process works (Hurwitz et al. 2013, 106).

Figure 3. MapReduce Data Flow

As Figure 3 illustrates, all operations are independent. This allows MapReduce to break a substantial data set, involving up to terabytes of data, into smaller segments and to process the segments independently. (Hurwitz et al. 2013, 106.)

4.3 NoSQL databases

Relational database management systems (hereinafter RDBMS) have long been relied upon to solve data storage challenges. A wide ecosystem of applications and platforms has been built around the relational database management system paradigm. The arrival of Big Data has created entirely new challenges that cannot be addressed by the traditional RDBMS approach. Due to the large amounts of unstructured data available and the fact that traditional relational databases are impractical for storing unstructured data, new types of data storage systems are needed to fully utilize this data. NoSQL databases are designed to address these challenges. The name NoSQL comes from the words no and structured query language (hereinafter SQL), or alternatively not only and SQL. NoSQL is now used to describe database systems that do not follow the RDBMS principles. These databases are specifically designed to handle large amounts of data, including unstructured data, in a time-sensitive manner. (Vaish 2013, 8.) Cattell identifies NoSQL databases using six key features: the ability to scale horizontally, the ability to distribute data over many servers, a simple call level interface or protocol, a weaker concurrency model than the ACID transactions utilized by most relational databases, efficient use of distributed indexes and RAM, and the ability to dynamically add new attributes to records. (Cattell 2011, 1.) Traditional RDBMS solutions usually utilize the so-called ACID model. ACID is an acronym of the words atomicity, consistency, isolation and durability. Atomicity means that unless everything in a given transaction succeeds, the transaction is cancelled and rolled back. Consistency ensures that a transaction does not leave a database in an inconsistent state.
Isolation refers to the fact that transactions cannot interfere with each other and, lastly, durability means that transactions persist even in the case of an application restart. These qualities, while important for a database to function as intended, are not entirely compatible with very large data amounts. Therefore, a new database model known as BASE has been adopted to make up for these shortcomings of ACID. BASE consists of three principles. Basic availability: Each request is guaranteed a response, be it successful or failed. Soft state: The state of the system may change over time, at times without any input. Eventual consistency: The database may be inconsistent at times but will eventually be consistent. It is important to realize that NoSQL and the BASE model are not ideal solutions in cases where transactions are of critical importance, such as banking. The BASE model, just like the ACID model, is not the answer to all database problems. (Vaish 2013, 9-12.) Vaish points out that NoSQL does not actually refer to any specific database or even a database type but is rather a concept that is used, at its highest level, to filter out a specific set of databases: the RDBMS. (Vaish 2013, 9.) Tiwari identifies five different types of data stores which fall under the NoSQL umbrella: column oriented databases, key/value pair databases, document databases, graph databases and XML databases. (Tiwari 2011, 1-19.) Table 2 below illustrates the main differences between SQL and various NoSQL databases.

Table 2. Important Characteristics of SQL and NoSQL Databases (Hurwitz et al. 2013, 55)

Table 2 lists the characteristics most frequently associated with various database types. As with most rapidly evolving technologies, these characteristics are not set in stone and exceptions exist. Nevertheless, the table provides a good reference point for further examination. Column oriented, document oriented, and key/value database types are discussed briefly below.

Column oriented database

The main defining factor of column oriented databases is that, unlike relational databases, which store data as rows, they store data as columns. Vaish explains the main advantages of the column oriented approach. Firstly, most column oriented databases are flexible because columns can be added without having to worry about filling in default values. Furthermore, they perform very well when processing large datasets. Column oriented databases also allow partial data access without having to process unrelated columns, leading to increased performance. (Vaish 2013, 27.) While it is possible to simulate a column oriented database in a row oriented database, research suggests that such simulations perform poorly in comparison to purpose-built column oriented databases. (Abadi, Madden & Hachem 2008, 14.) According to Plattner, most applications in enterprises are based on set processing rather than single row access. This means that having data stored as columns is more efficient. (Plattner 2009, 2.) Figure 4, adapted from Plattner, illustrates the fundamental difference between row operations and column operations (Plattner 2009, 2).

Figure 4. Data Access in Row Storage and Column Storage (Plattner 2009, 2)
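The difference between row operations and column operations can be made concrete with plain Python data structures. The following is a minimal sketch, assuming a tiny in-memory table; the field names and values are hypothetical and the lists only imitate how a real storage engine lays out data on disk or in memory.

```python
# Row layout: each record is stored together.
rows = [
    {"id": 1, "region": "north", "revenue": 120},
    {"id": 2, "region": "south", "revenue": 80},
    {"id": 3, "region": "north", "revenue": 200},
]

# Column layout: all values of one attribute are stored together.
columns = {name: [row[name] for row in rows] for name in rows[0]}

# Set processing (a column operation): total revenue.
total_from_rows = sum(row["revenue"] for row in rows)  # visits every full record
total_from_columns = sum(columns["revenue"])           # reads a single column

# Single row access (a row operation): fetch the second record.
record_from_rows = rows[1]                             # one lookup
record_from_columns = {name: values[1] for name, values in columns.items()}  # one lookup per column
```

Both layouts return the same answers; what changes is how much unrelated data has to be touched, which is why the column layout favours aggregates over one attribute and the row layout favours fetching whole records.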

In terms of set processing, the more data is stored in a database, the more apparent the effectiveness of a column database becomes. With a column store it is possible to process only a specific column, whereas a row store needs to process each row to access the column value requested. While this is not significant when processing relatively small datasets, the difference becomes significant with large data amounts. For the same reasons, performing row operations in a column oriented database is inefficient compared to performing them in a row database. Therefore, column stores should be used when large data amounts have to be processed using column operations.

Key/value database

A key/value database is a simple database type where a specific value is stored in the database and accessed through a specific key. The key/value database does not have a rigid schema and is very flexible. A key/value database can be stored either in an in-memory system or in a persistent store. Because values are stored against a key, key/value databases serve as very effective in-memory caches. An in-memory key/value database works very much like associative arrays and hash tables. (Vaish 2013, 41.)

Document oriented database

Document oriented databases are designed to store semi-structured data. Data is stored as a set of records, sometimes referred to as documents, which are similar to rows in an RDBMS, with the difference that a document oriented database can store semi-structured data. While document databases do not have a strict schema, indexes can be created and queried. Records in a document oriented database may vary to a great degree, and they may have entirely different sets of rows and columns. (Vaish 2013, ) The main advantage of document oriented databases is their flexible schema. This makes them effective in web-based applications where the data that needs to be stored might be changing in structure constantly.
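The key/value and document models described above can be approximated with ordinary Python dictionaries. This is only a sketch under stated assumptions: the `find` and `build_index` helpers, the `id` field, and the sample records are hypothetical and do not correspond to the API of any real NoSQL product.

```python
# Key/value model: an opaque value stored against a key, like a hash table.
cache = {}
cache["session:42"] = b"opaque session payload"
value = cache.get("session:42")

# Document model: records in one collection need not share the same fields.
documents = [
    {"id": 1, "name": "Anna", "email": "anna@example.com"},
    {"id": 2, "name": "Ben", "tags": ["admin", "beta"]},
    {"id": 3, "name": "Cara", "address": {"city": "Tornio"}},
]


def find(collection, field, expected):
    """Return every document whose field equals the expected value."""
    return [doc for doc in collection if doc.get(field) == expected]


def build_index(collection, field):
    """Map each value of a field to the ids of the documents containing it."""
    index = {}
    for doc in collection:
        if field in doc:
            index.setdefault(doc[field], []).append(doc["id"])
    return index


name_index = build_index(documents, "name")
matches = find(documents, "name", "Ben")
```

Adding a new attribute to one document requires no change to the others, which is the schema flexibility the text refers to, while the index shows that lookups by field remain possible without a fixed schema.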
In addition, many document oriented databases allow partial record updates and retrievals, which increase the processing efficiency of the system, especially with large data amounts. Moreover, searching across the entire database is achieved effectively. This is due to the fact that, unlike relational databases and column oriented databases, which have tables, the document oriented database has no strict schema. (Vaish 2013, 31.)

4.4 In-memory data management

As discussed previously, real-time analytics is becoming an increasingly important part of data analytics. This, combined with the increasing data amounts organizations are managing, means that faster processing of data is required. This can be achieved by MPP, NoSQL databases and the increasing power of hardware, but most of the time the slowest component of processing data is the mechanical, disk-based data storage. In-memory data management, sometimes called main-memory data management, is an attempt to overcome this challenge. In the past, the main limiting factors of in-memory storage have been the relatively small amounts of memory available and high costs. According to Ohlhorst, prices of silicon-based memory have fallen to a point where the capacity-to-cost ratio has made in-memory data management a viable alternative to disk-based storage. With in-memory technologies, organizations now have the ability to analyze large data sets in real time, thus allowing managers to base their decisions on real-time data. (Ohlhorst 2013, ) However, this does not mean that all data should be stored in in-memory data storage. For example, SAP HANA, an in-memory real-time analytics platform which will be discussed in chapter 5, typically manages data amounts of approximately 100 terabytes (SAP 2013, 23). These storage amounts alone, while impressive, do not meet the data management needs of large enterprises. Thus it is critical to utilize in-memory data technologies for data that must be processed quickly and analyzed in real time, while utilizing mechanical, disk-based storage for extremely large sets of data that do not share the same need for real-time analysis.

4.5 Big Data and cloud computing

Increasing attention has been paid to the benefits cloud computing technologies have for Big Data analytics.
As suggested by Schmarzo, cloud computing provides benefits such as flexibility, scalable storage capabilities, and access to substantial processing power (Schmarzo 2013, 190). According to Mell and Grance, cloud computing is a model that enables convenient, on-demand access to a shared pool of configurable computing resources. These computing resources include networks, servers, storage, applications and services, and they need to be provisioned rapidly and with minimal need for management and service provider interaction. Mell and Grance identify five essential characteristics of cloud computing. (Mell & Grance 2011, 2.)

1. On-demand self-service: The user has the capability to access the computing resources as needed, without any human interaction with the service provider.
2. Broad network access: Computing resources can be accessed through a network using standard mechanisms that encourage the use of a variety of thin or thick clients, e.g. mobile phones, tablets, laptops, and workstations.
3. Resource pooling: Computing resources are pooled in order to serve multiple consumers using a multi-tenant model. The system dynamically assigns physical and virtual resources according to customer demand. The customer generally has no knowledge of or control over the exact location of the resources.
4. Rapid elasticity: Resources can be provisioned and released in a flexible manner in order to react rapidly to changes in customer demand.
5. Measured service: Cloud computing systems automatically control and optimize resources by utilizing a metering system, which gives the user adequate information regarding the service while maintaining a level of abstraction appropriate for the service.

In addition, Mell and Grance list three cloud computing service models: Software as a Service (hereinafter SaaS), Platform as a Service (hereinafter PaaS), and Infrastructure as a Service (hereinafter IaaS). The SaaS model provides the consumer with access to the service provider's application running on a cloud infrastructure. The consumer has a very limited set of configuration options.
The PaaS model provides the consumer with the capability to use the service provider's cloud infrastructure to deploy consumer-created or acquired applications. The consumer does not control the cloud infrastructure but has control over the applications. The IaaS model provides the consumer with computing resources. The consumer does not control the cloud infrastructure but is able to deploy a wide variety of software, including operating systems and applications. Moreover, there are four cloud deployment models: private cloud, community cloud, public cloud, and hybrid cloud. (Mell & Grance 2011, 2-3.)


Converged, Real-time Analytics Enabling Faster Decision Making and New Business Opportunities

Converged, Real-time Analytics Enabling Faster Decision Making and New Business Opportunities Technology Insight Paper Converged, Real-time Analytics Enabling Faster Decision Making and New Business Opportunities By John Webster February 2015 Enabling you to make the best technology decisions Enabling

More information

Big Data, Why All the Buzz? (Abridged) Anita Luthra, February 20, 2014

Big Data, Why All the Buzz? (Abridged) Anita Luthra, February 20, 2014 Big Data, Why All the Buzz? (Abridged) Anita Luthra, February 20, 2014 Defining Big Not Just Massive Data Big data refers to data sets whose size is beyond the ability of typical database software tools

More information

Ubuntu and Hadoop: the perfect match

Ubuntu and Hadoop: the perfect match WHITE PAPER Ubuntu and Hadoop: the perfect match February 2012 Copyright Canonical 2012 www.canonical.com Executive introduction In many fields of IT, there are always stand-out technologies. This is definitely

More information

Analytics March 2015 White paper. Why NoSQL? Your database options in the new non-relational world

Analytics March 2015 White paper. Why NoSQL? Your database options in the new non-relational world Analytics March 2015 White paper Why NoSQL? Your database options in the new non-relational world 2 Why NoSQL? Contents 2 New types of apps are generating new types of data 2 A brief history of NoSQL 3

More information

Tap into Hadoop and Other No SQL Sources

Tap into Hadoop and Other No SQL Sources Tap into Hadoop and Other No SQL Sources Presented by: Trishla Maru What is Big Data really? The Three Vs of Big Data According to Gartner Volume Volume Orders of magnitude bigger than conventional data

More information

BIG DATA TRENDS AND TECHNOLOGIES

BIG DATA TRENDS AND TECHNOLOGIES BIG DATA TRENDS AND TECHNOLOGIES THE WORLD OF DATA IS CHANGING Cloud WHAT IS BIG DATA? Big data are datasets that grow so large that they become awkward to work with using onhand database management tools.

More information

Big Data Management and Security

Big Data Management and Security Big Data Management and Security Audit Concerns and Business Risks Tami Frankenfield Sr. Director, Analytics and Enterprise Data Mercury Insurance What is Big Data? Velocity + Volume + Variety = Value

More information

Gain Contextual Awareness for a Smarter Digital Enterprise with SAP HANA Vora

Gain Contextual Awareness for a Smarter Digital Enterprise with SAP HANA Vora SAP Brief SAP Technology SAP HANA Vora Objectives Gain Contextual Awareness for a Smarter Digital Enterprise with SAP HANA Vora Bridge the divide between enterprise data and Big Data Bridge the divide

More information

BIG DATA MARKETING: THE NEXUS OF MARKETING, ANALYSTS, AND IT

BIG DATA MARKETING: THE NEXUS OF MARKETING, ANALYSTS, AND IT BIG DATA MARKETING: THE NEXUS OF MARKETING, ANALYSTS, AND IT The term Big Data is definitely a leading contender for the marketing buzz-phrase of 2012. On November 11, 2011, a Google search on the phrase

More information

Hadoop. http://hadoop.apache.org/ Sunday, November 25, 12

Hadoop. http://hadoop.apache.org/ Sunday, November 25, 12 Hadoop http://hadoop.apache.org/ What Is Apache Hadoop? The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using

More information

The emergence of big data technology and analytics

The emergence of big data technology and analytics ABSTRACT The emergence of big data technology and analytics Bernice Purcell Holy Family University The Internet has made new sources of vast amount of data available to business executives. Big data is

More information

Why NoSQL? Your database options in the new non- relational world. 2015 IBM Cloudant 1

Why NoSQL? Your database options in the new non- relational world. 2015 IBM Cloudant 1 Why NoSQL? Your database options in the new non- relational world 2015 IBM Cloudant 1 Table of Contents New types of apps are generating new types of data... 3 A brief history on NoSQL... 3 NoSQL s roots

More information

BIG DATA IN BUSINESS ENVIRONMENT

BIG DATA IN BUSINESS ENVIRONMENT Scientific Bulletin Economic Sciences, Volume 14/ Issue 1 BIG DATA IN BUSINESS ENVIRONMENT Logica BANICA 1, Alina HAGIU 2 1 Faculty of Economics, University of Pitesti, Romania olga.banica@upit.ro 2 Faculty

More information

<Insert Picture Here> Big Data

<Insert Picture Here> Big Data Big Data Kevin Kalmbach Principal Sales Consultant, Public Sector Engineered Systems Program Agenda What is Big Data and why it is important? What is your Big

More information

Big Data. White Paper. Big Data Executive Overview WP-BD-10312014-01. Jafar Shunnar & Dan Raver. Page 1 Last Updated 11-10-2014

Big Data. White Paper. Big Data Executive Overview WP-BD-10312014-01. Jafar Shunnar & Dan Raver. Page 1 Last Updated 11-10-2014 White Paper Big Data Executive Overview WP-BD-10312014-01 By Jafar Shunnar & Dan Raver Page 1 Last Updated 11-10-2014 Table of Contents Section 01 Big Data Facts Page 3-4 Section 02 What is Big Data? Page

More information

Big Data a threat or a chance?

Big Data a threat or a chance? Big Data a threat or a chance? Helwig Hauser University of Bergen, Dept. of Informatics Big Data What is Big Data? well, lots of data, right? we come back to this in a moment. certainly, a buzz-word but

More information

Big data: Unlocking strategic dimensions

Big data: Unlocking strategic dimensions Big data: Unlocking strategic dimensions By Teresa de Onis and Lisa Waddell Dell Inc. New technologies help decision makers gain insights from all types of data from traditional databases to high-visibility

More information

Big Data Explained. An introduction to Big Data Science.

Big Data Explained. An introduction to Big Data Science. Big Data Explained An introduction to Big Data Science. 1 Presentation Agenda What is Big Data Why learn Big Data Who is it for How to start learning Big Data When to learn it Objective and Benefits of

More information

Surfing the Data Tsunami: A New Paradigm for Big Data Processing and Analytics

Surfing the Data Tsunami: A New Paradigm for Big Data Processing and Analytics Surfing the Data Tsunami: A New Paradigm for Big Data Processing and Analytics Dr. Liangxiu Han Future Networks and Distributed Systems Group (FUNDS) School of Computing, Mathematics and Digital Technology,

More information

Big Data Buzzwords From A to Z. By Rick Whiting, CRN 4:00 PM ET Wed. Nov. 28, 2012

Big Data Buzzwords From A to Z. By Rick Whiting, CRN 4:00 PM ET Wed. Nov. 28, 2012 Big Data Buzzwords From A to Z By Rick Whiting, CRN 4:00 PM ET Wed. Nov. 28, 2012 Big Data Buzzwords Big data is one of the, well, biggest trends in IT today, and it has spawned a whole new generation

More information

Big Data & QlikView. Democratizing Big Data Analytics. David Freriks Principal Solution Architect

Big Data & QlikView. Democratizing Big Data Analytics. David Freriks Principal Solution Architect Big Data & QlikView Democratizing Big Data Analytics David Freriks Principal Solution Architect TDWI Vancouver Agenda What really is Big Data? How do we separate hype from reality? How does that relate

More information

DATA VISUALIZATION: When Data Speaks Business PRODUCT ANALYSIS REPORT IBM COGNOS BUSINESS INTELLIGENCE. Technology Evaluation Centers

DATA VISUALIZATION: When Data Speaks Business PRODUCT ANALYSIS REPORT IBM COGNOS BUSINESS INTELLIGENCE. Technology Evaluation Centers PRODUCT ANALYSIS REPORT IBM COGNOS BUSINESS INTELLIGENCE DATA VISUALIZATION: When Data Speaks Business Jorge García, TEC Senior BI and Data Management Analyst Technology Evaluation Centers Contents About

More information

BIG DATA-AS-A-SERVICE

BIG DATA-AS-A-SERVICE White Paper BIG DATA-AS-A-SERVICE What Big Data is about What service providers can do with Big Data What EMC can do to help EMC Solutions Group Abstract This white paper looks at what service providers

More information

Big Data: Tools and Technologies in Big Data

Big Data: Tools and Technologies in Big Data Big Data: Tools and Technologies in Big Data Jaskaran Singh Student Lovely Professional University, Punjab Varun Singla Assistant Professor Lovely Professional University, Punjab ABSTRACT Big data can

More information

Data Refinery with Big Data Aspects

Data Refinery with Big Data Aspects International Journal of Information and Computation Technology. ISSN 0974-2239 Volume 3, Number 7 (2013), pp. 655-662 International Research Publications House http://www. irphouse.com /ijict.htm Data

More information

Big Data. Lyle Ungar, University of Pennsylvania

Big Data. Lyle Ungar, University of Pennsylvania Big Data Big data will become a key basis of competition, underpinning new waves of productivity growth, innovation, and consumer surplus. McKinsey Data Scientist: The Sexiest Job of the 21st Century -

More information

Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related

Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related Summary Xiangzhe Li Nowadays, there are more and more data everyday about everything. For instance, here are some of the astonishing

More information

W H I T E P A P E R. Deriving Intelligence from Large Data Using Hadoop and Applying Analytics. Abstract

W H I T E P A P E R. Deriving Intelligence from Large Data Using Hadoop and Applying Analytics. Abstract W H I T E P A P E R Deriving Intelligence from Large Data Using Hadoop and Applying Analytics Abstract This white paper is focused on discussing the challenges facing large scale data processing and the

More information

Performance Testing of Big Data Applications

Performance Testing of Big Data Applications Paper submitted for STC 2013 Performance Testing of Big Data Applications Author: Mustafa Batterywala: Performance Architect Impetus Technologies mbatterywala@impetus.co.in Shirish Bhale: Director of Engineering

More information

Large scale processing using Hadoop. Ján Vaňo

Large scale processing using Hadoop. Ján Vaňo Large scale processing using Hadoop Ján Vaňo What is Hadoop? Software platform that lets one easily write and run applications that process vast amounts of data Includes: MapReduce offline computing engine

More information

Understanding the Value of In-Memory in the IT Landscape

Understanding the Value of In-Memory in the IT Landscape February 2012 Understing the Value of In-Memory in Sponsored by QlikView Contents The Many Faces of In-Memory 1 The Meaning of In-Memory 2 The Data Analysis Value Chain Your Goals 3 Mapping Vendors to

More information

A CLOUD-BASED FRAMEWORK FOR ONLINE MANAGEMENT OF MASSIVE BIMS USING HADOOP AND WEBGL

A CLOUD-BASED FRAMEWORK FOR ONLINE MANAGEMENT OF MASSIVE BIMS USING HADOOP AND WEBGL A CLOUD-BASED FRAMEWORK FOR ONLINE MANAGEMENT OF MASSIVE BIMS USING HADOOP AND WEBGL *Hung-Ming Chen, Chuan-Chien Hou, and Tsung-Hsi Lin Department of Construction Engineering National Taiwan University

More information

How To Turn Big Data Into An Insight

How To Turn Big Data Into An Insight mwd a d v i s o r s Turning Big Data into Big Insights Helena Schwenk A special report prepared for Actuate May 2013 This report is the fourth in a series and focuses principally on explaining what s needed

More information

Lecture Data Warehouse Systems

Lecture Data Warehouse Systems Lecture Data Warehouse Systems Eva Zangerle SS 2013 PART C: Novel Approaches in DW NoSQL and MapReduce Stonebraker on Data Warehouses Star and snowflake schemas are a good idea in the DW world C-Stores

More information

NoSQL. Thomas Neumann 1 / 22

NoSQL. Thomas Neumann 1 / 22 NoSQL Thomas Neumann 1 / 22 What are NoSQL databases? hard to say more a theme than a well defined thing Usually some or all of the following: no SQL interface no relational model / no schema no joins,

More information

Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware

Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware Created by Doug Cutting and Mike Carafella in 2005. Cutting named the program after

More information

Forecast of Big Data Trends. Assoc. Prof. Dr. Thanachart Numnonda Executive Director IMC Institute 3 September 2014

Forecast of Big Data Trends. Assoc. Prof. Dr. Thanachart Numnonda Executive Director IMC Institute 3 September 2014 Forecast of Big Data Trends Assoc. Prof. Dr. Thanachart Numnonda Executive Director IMC Institute 3 September 2014 Big Data transforms Business 2 Data created every minute Source http://mashable.com/2012/06/22/data-created-every-minute/

More information

Big Data, Cloud Computing, Spatial Databases Steven Hagan Vice President Server Technologies

Big Data, Cloud Computing, Spatial Databases Steven Hagan Vice President Server Technologies Big Data, Cloud Computing, Spatial Databases Steven Hagan Vice President Server Technologies Big Data: Global Digital Data Growth Growing leaps and bounds by 40+% Year over Year! 2009 =.8 Zetabytes =.08

More information

In-Memory Analytics for Big Data

In-Memory Analytics for Big Data In-Memory Analytics for Big Data Game-changing technology for faster, better insights WHITE PAPER SAS White Paper Table of Contents Introduction: A New Breed of Analytics... 1 SAS In-Memory Overview...

More information

How To Store Data On An Ocora Nosql Database On A Flash Memory Device On A Microsoft Flash Memory 2 (Iomemory)

How To Store Data On An Ocora Nosql Database On A Flash Memory Device On A Microsoft Flash Memory 2 (Iomemory) WHITE PAPER Oracle NoSQL Database and SanDisk Offer Cost-Effective Extreme Performance for Big Data 951 SanDisk Drive, Milpitas, CA 95035 www.sandisk.com Table of Contents Abstract... 3 What Is Big Data?...

More information

BIG DATA TECHNOLOGY. Hadoop Ecosystem

BIG DATA TECHNOLOGY. Hadoop Ecosystem BIG DATA TECHNOLOGY Hadoop Ecosystem Agenda Background What is Big Data Solution Objective Introduction to Hadoop Hadoop Ecosystem Hybrid EDW Model Predictive Analysis using Hadoop Conclusion What is Big

More information

Big Data Challenges. Alexandru Adrian TOLE Romanian American University, Bucharest, Romania adrian.tole@yahoo.com

Big Data Challenges. Alexandru Adrian TOLE Romanian American University, Bucharest, Romania adrian.tole@yahoo.com Database Systems Journal vol. IV, no. 3/2013 31 Big Data Challenges Alexandru Adrian TOLE Romanian American University, Bucharest, Romania adrian.tole@yahoo.com The amount of data that is traveling across

More information

BIG DATA & ANALYTICS. Transforming the business and driving revenue through big data and analytics

BIG DATA & ANALYTICS. Transforming the business and driving revenue through big data and analytics BIG DATA & ANALYTICS Transforming the business and driving revenue through big data and analytics Collection, storage and extraction of business value from data generated from a variety of sources are

More information

NOSQL, BIG DATA AND GRAPHS. Technology Choices for Today s Mission- Critical Applications

NOSQL, BIG DATA AND GRAPHS. Technology Choices for Today s Mission- Critical Applications NOSQL, BIG DATA AND GRAPHS Technology Choices for Today s Mission- Critical Applications 2 NOSQL, BIG DATA AND GRAPHS NOSQL, BIG DATA AND GRAPHS TECHNOLOGY CHOICES FOR TODAY S MISSION- CRITICAL APPLICATIONS

More information

BIG DATA What it is and how to use?

BIG DATA What it is and how to use? BIG DATA What it is and how to use? Lauri Ilison, PhD Data Scientist 21.11.2014 Big Data definition? There is no clear definition for BIG DATA BIG DATA is more of a concept than precise term 1 21.11.14

More information

BIG DATA CHALLENGES AND PERSPECTIVES

BIG DATA CHALLENGES AND PERSPECTIVES BIG DATA CHALLENGES AND PERSPECTIVES Meenakshi Sharma 1, Keshav Kishore 2 1 Student of Master of Technology, 2 Head of Department, Department of Computer Science and Engineering, A P Goyal Shimla University,

More information

Tapping Into Hadoop and NoSQL Data Sources with MicroStrategy. Presented by: Jeffrey Zhang and Trishla Maru

Tapping Into Hadoop and NoSQL Data Sources with MicroStrategy. Presented by: Jeffrey Zhang and Trishla Maru Tapping Into Hadoop and NoSQL Data Sources with MicroStrategy Presented by: Jeffrey Zhang and Trishla Maru Agenda Big Data Overview All About Hadoop What is Hadoop? How does MicroStrategy connects to Hadoop?

More information

Luncheon Webinar Series May 13, 2013

Luncheon Webinar Series May 13, 2013 Luncheon Webinar Series May 13, 2013 InfoSphere DataStage is Big Data Integration Sponsored By: Presented by : Tony Curcio, InfoSphere Product Management 0 InfoSphere DataStage is Big Data Integration

More information

The Next Wave of Data Management. Is Big Data The New Normal?

The Next Wave of Data Management. Is Big Data The New Normal? The Next Wave of Data Management Is Big Data The New Normal? Table of Contents Introduction 3 Separating Reality and Hype 3 Why Are Firms Making IT Investments In Big Data? 4 Trends In Data Management

More information

Why Big Data in the Cloud?

Why Big Data in the Cloud? Have 40 Why Big Data in the Cloud? Colin White, BI Research January 2014 Sponsored by Treasure Data TABLE OF CONTENTS Introduction The Importance of Big Data The Role of Cloud Computing Using Big Data

More information

NoSQL Database Options

NoSQL Database Options NoSQL Database Options Introduction For this report, I chose to look at MongoDB, Cassandra, and Riak. I chose MongoDB because it is quite commonly used in the industry. I chose Cassandra because it has

More information

International Journal of Innovative Research in Computer and Communication Engineering

International Journal of Innovative Research in Computer and Communication Engineering FP Tree Algorithm and Approaches in Big Data T.Rathika 1, J.Senthil Murugan 2 Assistant Professor, Department of CSE, SRM University, Ramapuram Campus, Chennai, Tamil Nadu,India 1 Assistant Professor,

More information

Cray: Enabling Real-Time Discovery in Big Data

Cray: Enabling Real-Time Discovery in Big Data Cray: Enabling Real-Time Discovery in Big Data Discovery is the process of gaining valuable insights into the world around us by recognizing previously unknown relationships between occurrences, objects

More information

ANALYTICS BUILT FOR INTERNET OF THINGS

ANALYTICS BUILT FOR INTERNET OF THINGS ANALYTICS BUILT FOR INTERNET OF THINGS Big Data Reporting is Out, Actionable Insights are In In recent years, it has become clear that data in itself has little relevance, it is the analysis of it that

More information