
MWD Advisors

Navigating the Big Data infrastructure layer
Helena Schwenk

A special report prepared for Actuate, May 2013

This report is the second in a series of four and focuses principally on explaining what's needed in the infrastructure layer of a Big Data platform. For more information about how this layer relates to other parts of a Big Data platform, please refer to the corresponding papers in this series: Navigating Big Data business analytics and Turning Big Data into Big Insights. Finally, for more information about the opportunities and challenges posed by Big Data for organisations today, please refer to the first paper in the series, Unlocking the potential of Big Data.

This is a special report prepared independently for Actuate. For further information about MWD Advisors' research and advisory services, please visit www.mwdadvisors.com.

MWD Advisors is a specialist advisory firm which provides practical, independent industry insights to business analytics, process improvement and digital collaboration professionals working to drive change with the help of technology. Our approach combines flexible, pragmatic mentoring and advisory services, built on a deep foundation of industry best practice and technology research. www.mwdadvisors.com

Summary

Start with the business need
Any Big Data initiative needs to start with a clear understanding of the business need and the problem you're trying to solve. From there you can work out how those problems or opportunities map to data both inside and outside the organisation, and how different technology choices govern how you manage and exploit that data.

The promise of a Big Data platform is that it takes data in its rawest form and converts it into consumable, actionable information
The concept of a Big Data platform provides a technology framework for taking data in its rawest form, transforming it, and putting it in a format where it can be consumed and acted upon by decision makers. Three core layers are required to support these capabilities: the lowest layer is responsible for the storage, organisation and retrieval of data; the middle layer is where the analysis of that data occurs; and the upper layer is where data insights are discovered and consumed. This report focuses on the first: the infrastructure layer.

Big Data infrastructure support goes beyond Hadoop
Although Hadoop has often been closely associated with Big Data, it's not the only option for organisations wanting to store and organise their data. In fact, the Big Data infrastructure layer is likely to encompass a variety of technologies, components and architectures, each designed to serve a particular purpose. These may well include technologies such as Hadoop, MapReduce and distributed NoSQL databases, but they could also include others such as in-memory databases, columnar databases and massively parallel processing architectures.

A range of technical factors will dictate your Big Data infrastructure choice
Various technical considerations are likely to dictate what technology you implement in your Big Data infrastructure layer. These include, but are not limited to, the type of data under management (for instance, whether it's structured or multi-structured data); its usage scenario (for example, whether real-time data is a core requirement); and whether the system managing Big Data needs to be enterprise class and production ready, that is, whether features such as workload management, availability, performance and standard SQL support are key requirements. Choosing your infrastructure technology will require careful consideration of all of these factors.

Technology cost and sophistication driving the Big Data train

As outlined in the first report in this series, Unlocking the potential of Big Data, in spite of all the headlines and vendor rhetoric, the ability to manage growing volumes of data is not a new phenomenon for organisations today. In fact, many early adopters of Business Intelligence (BI) and data warehousing technology (especially in the retail, telecoms and financial services industries) have long been accustomed to capturing and managing large volumes of data. Yet in spite of this, we still see the rise and rise of Big Data as a seemingly new concept; so what has changed?

Through their own technology innovations, web and social data-driven businesses such as Google and LinkedIn have shown us how to process Big Data sets (in their case, web searches) on massively scalable storage and computing platforms built from commodity hardware. Their technology expertise and success is the inspiration behind open source Big Data technologies such as Apache Hadoop and its ecosystem of tools (which we introduce in more detail below). The challenge of processing certain kinds of Big Data has also driven other technology innovations related to massively parallel processing architectures, in-memory analytics, columnar databases and complex event processing platforms. All of these pieces bring more choices to organisations that want to advance their use and management of Big Data. Similarly, enhancements in predictive analytics, text mining and advanced data visualisation tools make the exploitation of Big Data more straightforward, by making it easier to discover hidden or interesting patterns and insights that, in turn, can be used to enhance productivity, drive efficiencies and growth, and create a sustainable competitive advantage.

Figure 1: Drivers of broader Big Data adoption (Source: MWD Advisors)

But it's not only technology developments spurring the advancement of Big Data; as figure 1 shows, the deployment economics of these technologies are equally important. In particular, the decreasing cost of storage and memory, alongside the scalability of cloud computing platforms and appliances, together with the growing influence of open source tools, brings the promise of lower-cost, more affordable Big Data platforms. The opportunities of Big Data are opening up to a wider audience as it becomes more economically feasible to exploit, manage and leverage Big Data, especially for those organisations that may previously have been priced out of this activity.

Given all this potential, it's worth reminding ourselves that Big Data on its own cannot unlock business value. Instead, it's the application of Big Data to real-world business scenarios that provides the real value and scope for competitive advantage. So for most organisations the challenge becomes not just how to process, explore and mine Big Data, but how to align those insights with the business and act on them in a timely and effective manner.

A Big Data platform has three layers

Most of the commentary around Big Data has focused on the type of data under management: whether it is structured or multi-structured (defined as data stored and organised in a multitude of formats, including text, video, documents, web pages, email messages, audio, social media posts, and so on), or real-time "data in motion". However, when considering the technical implications of Big Data, you need to think in terms of how data is transformed from its raw state to a point where it can be consumed and acted on. This requires a set of three supporting capabilities:

- Capturing, processing and storing data
- Exploring and applying advanced analytics techniques
- Discovering and consuming insights.

Today these capabilities are supported by a multitude of technology components; some of them are relatively new, while others are based on existing technologies and architectures. In figure 2 we bring these concepts together as part of an overall Big Data platform with three layers. The lowest layer is concerned with organising and storing data; the middle layer is where the analysis of that data occurs; and the upper layer is where data insights are delivered and consumed.

Figure 2: Three layers of a Big Data platform (Source: MWD Advisors)

Although these capabilities aren't necessarily new to BI and data warehousing practitioners, it's become apparent that the old models for storing and analysing data don't necessarily apply to all Big Data assets. Not only is the amount of data vast and potentially more time-sensitive in nature, but the variety of data to be managed can be far greater, and this is markedly changing the requirements of the technology needed. This report focuses principally on explaining what's needed in the infrastructure layer of a Big Data platform.

Getting to grips with Big Data infrastructure

Regardless of the hype surrounding Big Data, it remains an irrefutable fact that growing volumes of data of all types are a fundamental part of doing business today. A natural consequence of this data explosion is that many of you can expect to find your existing data warehousing, BI and analytical capabilities hitting a wall, or struggling to cope with the volume, speed, variety and workload demands of Big Data. This in turn will require you to look more seriously at how you plan your Big Data architecture, in order to support your current as well as your future data management and information needs.

Introducing Hadoop

Handling Big Data brings new disciplines, and data processing and storage requirements that aren't always supported within traditional data warehousing and relational database environments. Whereas purpose-built data warehouses are great at handling structured data and at performing certain types of queries, there's often a high cost, both in processing time and in the hardware needed to scale the system out when volumes grow.

One of the most prominent technologies associated with the need to scale to Big Data heights is Hadoop. Hadoop is a top-level Apache project (part of the Apache Software Foundation) that's implemented in Java. The best way to think of Hadoop is as a computing environment built on top of a distributed, clustered file system, designed specifically for very large-scale data operations. It's an ecosystem of projects targeted at simplifying, managing, coordinating and analysing large and varied data sets. Hadoop is generally seen as having two parts: a file system (the Hadoop Distributed File System, or HDFS) that stores and replicates large files across multiple nodes, and a programming model (MapReduce) for distributing the processing of large data files across large clusters. (A minimal word-count sketch of the MapReduce half appears after the HDFS entry in Table 1 below.)

Looking beyond Hadoop: rounding out the infrastructure layer

While the popularity of, and interest in, Hadoop has grown substantially in the last few years, it shouldn't be seen as the only technology synonymous with Big Data. As illustrated in Table 1 below, a number of other choices that support the storage, processing and retrieval of Big Data are available.

Table 1: Big Data technology options. For each technology, the table lists the type of data it handles, together with key facts and usage considerations.

Hadoop HDFS
Data type: multi-structured data, such as web and application logs and social media data.
Key facts and usage considerations:
- HDFS is a distributed file system designed to run on clusters of commodity hardware. It is particularly suitable for a small number of very large, multi-structured datasets (including semi-structured data). The draw of Hadoop is that it is designed to store huge volumes of data without the overhead of a relational database, to deploy economically on commodity hardware, and to provide the elasticity to support flexible scale-out.
- HDFS can only store and retrieve data; it is unable to index it. It is therefore best suited to read-only rather than update queries, and is not suitable for real-time or ad hoc analysis.
- Data availability is made possible through data replication. Redundancy is built in for fault tolerance, giving a Hadoop cluster the capability to heal itself: if a node fails during processing, the data is retrieved from another node.
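To make the MapReduce programming model concrete, here is a minimal word-count sketch written for Hadoop Streaming, which lets the map and reduce steps be expressed as ordinary scripts (Python, in this case). The file names and paths are illustrative only; this is a sketch, not a production job.

    # ---- mapper.py ----
    # Reads raw text from standard input and emits "word<TAB>1" per word.
    import sys

    for line in sys.stdin:
        for word in line.strip().split():
            print("%s\t%d" % (word, 1))

    # ---- reducer.py ----
    # Hadoop sorts map output by key, so all counts for a word arrive together.
    import sys

    current_word, current_count = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t", 1)
        if word != current_word:
            if current_word is not None:
                print("%s\t%d" % (current_word, current_count))
            current_word, current_count = word, 0
        current_count += int(count)
    if current_word is not None:
        print("%s\t%d" % (current_word, current_count))

The job would be submitted with the streaming jar that ships with Hadoop, along the lines of: hadoop jar hadoop-streaming.jar -files mapper.py,reducer.py -mapper mapper.py -reducer reducer.py -input /logs -output /wordcounts (the jar location and the HDFS paths vary by installation).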

Navigating the Big Data infrastructure layer 6 MapReduce Works well with multistructured data MapReduce is not a storage technology per se but a programming paradigm used to extract and distil value from Big Data. MapReduce programs use low level APIs and therefore require a high degree of programming skills to master. That said, some database vendors run MapReduce in-database allowing developers to take advantage of it within a standard SQL interface. MapReduce programs run in parallel on large clusters of commodity hardware and execute on very large data sets such as HDFS files, although it can work with other file systems and database management system. Hadoop is the dominant open source MapReduce implementation, and although it is implemented in Java, MapReduce programs can also be written in a number of different languages including C++, Python and R. Hive and other query languages Works principally with multi-structured data Hive is a data warehouse system for Hadoop that facilitates data aggregation, ad-hoc queries, and the analysis of large datasets stored in Hadoop compatible file systems such as HBase. Hive provides a mechanism to project structure onto data and query it using a SQL-like language called HiveQL and it can also be used to develop MapReduce applications. Other languages such as Pig and JAQL can similarly be used to deploy MapReduce applications. NoSQL databases Works primarily with web- scale, multistructured data covering a variety of data stores including those that are document-orientated, XML-based, or graph and key value stores NoSQL is commonly understood to stand for Not Only SQL This term covers a broad spectrum of databases. HBase as part of the Hadoop project is one of them, but others include database systems such as Cassandra and MongoDb. All are designed to scale to extremely large data sets containing highly varied data since they simply capture the data without categorising and parsing it. NoSQL databases are often termed schemaless In most NoSQL databases there is no fixed schema. This provides flexibility around how data is stored on disk. This makes them very flexible as they have very few data model restrictions thereby allowing a database to store virtually any structure it wants in a data element. Many NoSQL databases support ad hoc queries and often have their own query language NoSQL databases are especially good at supporting the multiple types of Big Data, accepting data from multiple sources in multiple formats, and then permitting program code to sift through, filter, and organise the data. That said, commonly used BI tools (especially those that are dependent on SQL generation) do not easily provide connectivity to NoSQL systems. Columnstore databases Primarily structured data Process data by columns, as opposed to rows This allows data to be stored in a way similar to how people ask business questions and can offer significant performance gains over traditional roworientated systems. Lots of columnar databases use data compression The ability to store data in highly compressed columns can reduce the database size and disk input-output, which can result in quicker access times. Columnar database are not suited to all analytic queries Columnar databases typically lend themselves to scan-based and more predictable analytic queries, but are less well suited to analytic environments where queries involve calculations or are open ended.

In-memory databases
Data type: works primarily with structured data, although some providers are introducing support for multi-structured data.
Key facts and usage considerations:
- Data is stored in main memory rather than on disk. Storing and processing data in memory removes most of the I/O burden and the CPU cycles spent waiting on disk, making data access an order of magnitude faster than in an on-disk database system.
- In-memory is more viable as a processing model today. The viability of in-memory analytics is closely linked to advances in hardware technology, such as 64-bit computing, multi-core processors and improvements in processor speed.
- In-memory processing can offer an alternative to conventional storage and query methods. The technology facilitates fast querying, slicing and dicing of large data sets, and on-the-fly calculations, without the need to resort to more traditional methods such as pre-aggregating data, building cubes, or database and query tuning.

SQL-based MPP databases
Data type: primarily structured data.
Key facts and usage considerations:
- A massively parallel processing (MPP) database splits analytic operations across a number of parallel processing nodes, where each node works on its own subset of the data; this is commonly referred to as a shared-nothing architecture. It boosts the performance of SQL-based analytic queries because operations run in parallel (a minimal sketch of the pattern follows the table). MPP SQL databases are especially good at handling large volumes of data with a consistent, known structure, enabling regular reporting, data mining and repeated analysis of such data.
- MPP requires a different administration approach. Typically, the setup for MPP is more complicated, requiring thought about how to partition the database among the nodes and how to assign work across them. However, data warehousing appliances and engineered systems, where the database, processing components and storage components are pre-optimised for data warehousing, do circumvent some of these challenges.

Navigating the Big Data infrastructure layer

While each of these technology components can be used in isolation to serve a particular need, there are also valid reasons for using them in combination. For example, today we commonly see the pairing of columnar databases with in-memory processing architectures to boost the performance and/or scalability of the storage layer. Likewise, we are seeing the introduction of Big Data appliances that combine pre-configured hardware and software components, such as a NoSQL database and an Apache Hadoop distribution together with the open source R language, to target unstructured Big Data workloads.

That said, given the early stage of the Big Data market's evolution, the choices for organising, storing and retrieving Big Data are more likely to expand in the short term before we see signs of convergence and deeper integration across the Big Data technology stack. Hence it's not surprising that today's Big Data challenges cannot be solved by a single platform or engine. In fact, our research suggests that at this stage in the market's development, infrastructure technology choices are likely to be governed by a number of factors, including the type of data under management (i.e. whether it's structured or multi-structured); its usage scenario (for example, whether real-time data is required); and whether the system managing Big Data needs to be enterprise class and production ready.

As such, many organisations implementing Big Data architectures tend to operate a best-of-breed approach, where an enterprise data warehouse is used to store structured data for production-status analysis and reporting, while NoSQL and Hadoop are used to store multi-structured data on which complex analytics and mining can be performed. These different data platforms are then brought together when smaller, more focused slices of that data (the insights derived) are pushed into the warehouse to enrich existing data. In the longer term, however, all of these technologies need to come together and co-exist as part of a new Big Data landscape, and we are starting to see signs of this happening from some of the major Big Data vendors, who are acquiring, integrating and developing connectors that help bridge the divide between these disparate technologies.
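In practice, that "push insights into the warehouse" step is often just a small batch load of already-aggregated output. Here is a minimal sketch that assumes an upstream Hadoop job has written its results to a local file, hadoop_output.csv, with a page,visits header; SQLite stands in for the warehouse purely for illustration, and every name here is invented. A real deployment would use a vendor connector or bulk loader instead.

    # Load a Hadoop job's aggregated output into a warehouse table.
    # File name, table name and columns are invented for this example.
    import csv
    import sqlite3

    conn = sqlite3.connect("warehouse.db")  # stand-in for the warehouse
    conn.execute(
        """CREATE TABLE IF NOT EXISTS weblog_insights (
               page   TEXT PRIMARY KEY,
               visits INTEGER
           )"""
    )

    # hadoop_output.csv holds "page,visits" rows from the upstream job.
    with open("hadoop_output.csv", newline="") as f:
        rows = [(r["page"], int(r["visits"])) for r in csv.DictReader(f)]

    # Upsert the focused slice of insights alongside existing warehouse data.
    conn.executemany(
        "INSERT OR REPLACE INTO weblog_insights (page, visits) VALUES (?, ?)",
        rows,
    )
    conn.commit()
    conn.close()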

Key considerations when planning your Big Data infrastructure investment

- Now is the time to consider storing and utilising a wider range of Big Data sources, such as social, sensor and web data. Lower deployment costs, brought about by commodity hardware, open source software and the cloud, make Big Data organisation and storage a more realistic and appealing prospect for organisations today.

- Hadoop HDFS and schemaless NoSQL offerings work well when all or most of the data needs to be analysed and a traditional RDBMS is proving to be a bottleneck. Using a complete set of data opens up the opportunity for deeper and richer insights than basing them on a smaller sample or snapshot.

- Ignore the hype that says NoSQL databases will replace all traditional RDBMSs in the near future; both can play a role in Big Data platforms. RDBMS offerings are suited to environments where data integrity is important, and they play a vital role in processing and querying structured data and serving it up to a wide variety of BI tools and applications.

- Big Data infrastructure support goes beyond just Hadoop. For example, large-scale in-memory computing, columnar databases and MPP databases also provide the ability to store and process large amounts of (primarily structured) data, especially where speed-of-thought analysis or interactive and ad hoc querying is a core requirement.

- Expect to find sourcing skills in newer technologies a challenge. Although there is a rich supply of DBAs and developers versed in RDBMS concepts and programming, the same cannot be said for NoSQL systems. Currently it will be far easier to find an experienced RDBMS programmer or administrator than a NoSQL expert.

- Press your vendor about the levels of integration they provide with Big Data technologies such as Hadoop. Understand how deep this support goes, for instance in terms of the connectors available and the support for SQL, as this will be key to determining how easily you can knit together the respective parts of your Big Data analytic environment.