Unleashing the Power of Hadoop for Big Data Analytics


THOUGHT LEADERSHIP SERIES AUGUST

Unleashing the Power of Hadoop for Big Data Analytics

Data analytics, long the obscure pursuit of analysts and quants toiling in the depths of enterprises, has emerged as the must-have strategy of organizations across the globe. Competitive edge comes not only from deciphering the whims of customers and markets but also from being able to predict shifts before they happen. Fueling the move of data analytics out of back offices and into the forefront of corporate strategy sessions is big data, now made enterprise-ready through technology platforms such as Hadoop and MapReduce. The Hadoop framework is seen as the most efficient file system and solution set to store and package big datasets for consumption by the enterprise, and MapReduce is the construct used to perform analysis over Hadoop files.

Hadoop was first conceived as a web search engine for Yahoo!, whose developers were inspired by Google's now-well-known MapReduce paper. It has become the cornerstone of a thriving big data marketplace. Estimates from Wikibon, the open source IT research community, put the worldwide big data market currently at approximately $18 billion, destined to reach roughly $50 billion in the years ahead.

For years, and into the present day, enterprises have been applying data analytics against structured, relational datasets derived from transactional systems, using a wide range of tools from a variety of vendors, from data warehousing platforms to front-end desktop-based analysis software. Now, with the universe of unstructured data rapidly expanding, a new frontier is opening up for analysis, enabling potentially far-reaching insights. Hadoop handles data that traditional relational databases, data warehouses, and other analytics platforms have been unable to effectively manage, including user-generated data from social media and machine-generated data from sensors, appliances, and applications.
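A common use of such machine-generated data is aggregating events by key. As a purely illustrative miniature (the log format and field layout below are invented for the sketch, not taken from any particular system), counting web events by HTTP status code looks like this on a single machine; Hadoop distributes exactly this kind of work across a cluster:

```python
from collections import Counter

# Hypothetical web-server log lines: "timestamp client status bytes"
LOG_LINES = [
    "2013-08-01T10:00:01 10.0.0.1 200 512",
    "2013-08-01T10:00:02 10.0.0.2 404 128",
    "2013-08-01T10:00:03 10.0.0.1 200 2048",
    "2013-08-01T10:00:04 10.0.0.3 500 64",
]

def status_counts(lines):
    """Aggregate request counts by HTTP status code."""
    counts = Counter()
    for line in lines:
        _, _, status, _ = line.split()
        counts[status] += 1
    return dict(counts)

print(status_counts(LOG_LINES))  # -> {'200': 2, '404': 1, '500': 1}
```

At big data scale the input would be files in HDFS rather than an in-memory list, but the aggregation logic is the same shape.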
Hadoop accomplishes this by applying more efficient formats and file systems to large datasets that would normally have been out of the reach of standard analytics solutions. Currently, the most prevalent application seen among Hadoop sites is log and event data analysis, particularly against the machine-generated data coming from web activity and devices. This may include the gathering and analysis of network traffic, applications, capacity requirements, security events, and web interactions. As adoption grows, Hadoop-based data may increasingly play a role in more strategic business information, such as sales analysis and workforce allocation.

WHY HADOOP?

Hadoop offers a range of advantages to data analytics efforts. First and foremost, it enables the processing and analysis of all forms of data, regardless of whether they are highly structured or unstructured. Hadoop is also more cost-effective than traditional analytics platforms such as data warehouses. With data warehouses, for example, investment needs to be made in the platform itself, along with investment in extract, transform, and load (ETL), data cleansing, and modeling technologies. As a result, data has to be deemed important enough to justify the data warehouse investment, limiting its use and any ability to experiment with or pilot new forms of analysis. In Hadoop environments, which also can accommodate data warehouse data, big data stores can be brought in and processed cost-effectively.

At Hadoop's core is the principle of moving analytics closer to where the data resides. The framework is based on clusters that distribute the computing jobs required for big data analysis across various nodes. Hadoop is also

cloud-friendly. While many enterprises choose to implement the framework within their data centers, Hadoop clusters can also be run from the cloud, either via cloud vendors or through hosting services.

There is also a robust ecosystem of tools and technologies that has developed around Hadoop. Not only is the framework supported by a range of commercial software vendors, but a number of open source tools are available as well, enabling enterprises to derive value from big data. Many advanced analytic tools on the market also now support Hadoop, enabling visualization, data mining, predictive analytics, and text analytics against big datasets.

There is also greater accuracy and flexibility possible in big data via Hadoop. First, analysis can be run against entire datasets, versus the smaller samples used in the past. In addition, the Hadoop Distributed File System packages datasets into files that can be easily absorbed by existing applications, without the need to upgrade them to a massively parallel version that can absorb big datasets.

CHALLENGES

While enterprise adoption of Hadoop is expanding, it brings new types of challenges, ranging from manual coding demands and skills requirements to a lack of native real-time capabilities. For example, the Hadoop Distributed File System does not offer the native resiliency or the real-time capabilities that enterprises have come to expect from enterprise-grade software packages. Hadoop is natively batch-oriented, and thus real-time analysis may not be available without additional tools. Plus, if the Hadoop system goes down, it may take some time to recover and restore the framework. In addition, the technology, first developed and released in 2006, is still relatively new on the scene, and implementations are still relatively immature. It should be noted, for example, that the operational framework is on version 1.0, as offered through the Apache Software Foundation.
To date, many implementations are seen either among the large web properties or within the depths of data management or IT departments, as pilot projects or as part of efforts to optimize operations within those departments. Hadoop also requires a high degree of skill and understanding to install and implement. Hadoop implementation and management skills, as well as MapReduce development skills, are in high demand and difficult to find. As a result, enterprises seeking to employ Hadoop-based data analysis environments will either require highly trained IT and data management departments, or will need to rely on third-party consultants.

STEPS TO SUCCESS

The following are steps to success for adopting Hadoop-based big data analytics in enterprises:

Learn the technology. The Hadoop framework and ecosystem introduce new sets of solutions, such as the Hadoop Distributed File System and the MapReduce engine, along with a range of add-on applications such as HBase, Hive, Pig, and Sqoop. There are numerous online training programs available, as well as online tutorials, webinars, books, and white papers to further acquaint enterprise teams with the features and technical details of Hadoop.

Develop a test environment to pilot Hadoop projects. Reference architectures are now available across the industry for various scenarios of Hadoop implementations. A test environment will also help in the selection of tools that will benefit users accessing the information coming from the Hadoop framework.

Work with the business. Hadoop may be more commonly used for optimizing internal IT or data management operations, but it is rapidly gaining ground as a strategic big data analytics platform as well. Hadoop isn't operated in isolation, since it involves pulling in data from different systems and then publishing that data out to different systems.
It's also important to develop use cases for the big data analysis project, to map out data flows and determine what data is required. Very importantly, there has to be a demonstrated return on investment: if Hadoop won't bring in additional revenue or cut costs for the business, it may not be worth implementing the technology. Solving the business problem is the ultimate return.

Provide and encourage training and skills development. The Hadoop platform requires specialized skill sets that are not readily available on the job market, across several different disciplines in the data analytics space: systems administration, application development, data analysis and stewardship, and networking expertise. Many of these skills are already resident within organizations, and many individuals in data management or IT departments will be able to grow into these roles.

MarkLogic: The Best Database for Apache Hadoop*

"The Best Database for Hadoop" is a very audacious claim. This essay will lay out the reasons why we believe it to be true, and why MarkLogic is the best and only choice for an enterprise-class database that integrates with Hadoop at both the storage and compute layers. The integration between MarkLogic and Hadoop provides a single platform that allows you to mix and match both real-time and analytics workloads, without having to duplicate data or create a one-off infrastructure.

WHAT IS HADOOP AND WHY IS IT IMPORTANT?

Hadoop is a framework for distributed processing over large groups of commodity machines. It is typically used on data that is too large or unpredictable for traditional databases or data warehouses, or for problems that are so computationally expensive that dividing and conquering in parallel is the only way to perform an analysis in a reasonable amount of time. Google popularized the MapReduce programming model with a paper in 2004 that described its distributed search indexing infrastructure. Hadoop is an open-source implementation of these concepts. It evolved out of search indexing work at Yahoo!* by Doug Cutting, the creator of Apache Nutch* and Apache Lucene*.

Hadoop was a response to not being able to handle what we would now call Big Data in legacy RDBMSs or data warehouses. For Google, there was no database at the time that could scale to handle a crawl of the entire web. And, even if a database could handle computation over that amount of data, the storage would have required a complex and expensive SAN or NAS infrastructure. HDFS, the Hadoop Distributed File System, addresses that cost issue by allowing you to scale out versus up on commodity hardware. However, even with the virtually infinite pool of storage and compute resources that Hadoop promises, organizations still need to get data to users securely and in real time.
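The divide-and-conquer model Google popularized can be illustrated with the canonical word-count example, sketched here as a single-process toy (the function names are illustrative; a real Hadoop job would run the map and reduce phases in parallel across cluster nodes, with the framework handling the shuffle):

```python
from collections import defaultdict

def map_phase(chunk):
    """Map: emit (word, 1) pairs for one independent chunk of input."""
    return [(word, 1) for line in chunk for word in line.split()]

def shuffle(pairs):
    """Shuffle: group intermediate values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reduce: aggregate the values collected for each key."""
    return {key: sum(values) for key, values in grouped.items()}

# Each chunk could be mapped on a different node; results are then merged.
chunks = [["the quick brown fox"], ["the lazy dog", "the end"]]
intermediate = [pair for chunk in chunks for pair in map_phase(chunk)]
result = reduce_phase(shuffle(intermediate))
print(result["the"])  # -> 3
```

Because each map call sees only its own chunk, scaling means adding machines and chunks, not a bigger machine, which is exactly the economics described above.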
MapReduce is fine for batch analysis, but what if you need to provide users with the ability to quickly find specific pieces of data and make granular updates to the data in real time? If you need to do near-instantaneous analysis and alerting for fraud detection, emergency crisis management, or risk mitigation or assessment, can you afford the time it would take for a MapReduce job to complete?

MARKLOGIC WITH HADOOP AS COMPUTE INFRASTRUCTURE

Low-latency queries and granular updates, of course, require a database. Hadoop alone is not equipped for this type of workload. The popular tech press would have you think it's a stark trade-off between legacy relational databases, which provide indexes, transactions, security, and enterprise operations, and the popular open-source NoSQL databases, which have a flexible data model and commodity scale-out while being distributed and fault-tolerant. What if you could have the best of both of these worlds: the flexibility and scalability of NoSQL along with the reliability and security organizations have come to trust in mature relational databases? MarkLogic is unique in the marketplace in providing the best of NoSQL while also being a hardened and proven enterprise-class database technology.
Created in 2001 to fill the need within enterprise organizations and government entities to store, manage, query, and search data, no matter the format or structure, MarkLogic has these NoSQL characteristics:

- Flexible, with a schema-free document data model (JSON, XML, text, binary)
- Fast, implemented in C++ and optimized for today's I/O systems
- Scalable, leveraging a shared-nothing distributed architecture and lock-free reads
- Highly available, with transactional consistency, automatic failover, and replication

As an Enterprise NoSQL database, MarkLogic was designed from the start to support enterprise-class and enterprise-scale application requirements, including:

- ACID (atomic, consistent, isolated, and durable) transactions
- Government-grade security features, including fine-grained privileges,

role-based security, document-level permissions, and HTTPS access
- Real-time indexing, full-text search, geospatial search, and alerting
- Proven reliability and uptime, with more than 500 deployed mission-critical and enterprise projects in government, media, financial services, energy, and other industries

MarkLogic provides all of the scalability on commodity hardware that's come to define the NoSQL space. Yet it doesn't force you to give up the enterprise capabilities that are required in a mission-critical application. Some will argue that it is possible to maintain enterprise capabilities with other databases, whether RDBMS or NoSQL, by moving data to the environment where it will be used. Whether you're setting up a data mart with a data warehouse or a search index, moving data around is costly. ETL is error-prone and brittle. Once the data is in two places, you not only have to pay for the extra storage but also deal with governance and security across multiple systems, often in different parts of an organization. Hadoop introduces yet another environment.

MarkLogic, however, has taken a different approach. In October of 2012 we introduced the ability to deploy MarkLogic directly into an existing Hadoop environment. This allows you to store all of your data in Hadoop's default file storage, which has several important benefits:

- You can build real-time enterprise applications for Hadoop-based data
- You can leverage existing (or upcoming) infrastructure investments to save time and money
- You will require less data movement and/or duplication over the data's life cycle
- You can support mixed workloads: index once, then serve real-time or batch
- You will save money by using cost-effective long-term and long-tail storage

The new release is not a fork of MarkLogic or Hadoop. MarkLogic can be deployed against any of the leading commercial Hadoop distributions, allowing administrators to leverage existing infrastructure.
Because HDFS is one of several file systems on which you can store MarkLogic data, administrators can also easily and consistently move data between SSD, local disk, SAN, NAS, S3, and HDFS to support specific SLAs and cost objectives without modifying downstream application code. And, because MarkLogic has been engineered from the beginning to be distributed and to minimize file-system I/O, this wasn't a complete re-engineering of MarkLogic. It's the same scalable, reliable database that you run on your POSIX file system; it can now simply live on top of the Hadoop file system without modifying existing application code. There is no other enterprise NoSQL database that can do that.

MarkLogic and Hadoop work together in a complementary manner in a Big Data ecosystem. Hadoop excels at offline analytics; MarkLogic excels at online applications. Hadoop has been deployed for model-building, while hundreds of decision-making applications rely and run on MarkLogic. Long-haul batch analysis is best done on Hadoop, while real-time queries and alerts require MarkLogic. Finally, the file system in Hadoop is distributed, while MarkLogic distributes indexes.

In October 2011, MarkLogic released the MarkLogic Connector for Apache Hadoop*. The connector is a drop-in extension to Hadoop that allows you to efficiently move data between MarkLogic and Hadoop using MapReduce. There are three target use cases for the connector; the first and most common is ETL. Hadoop is a common environment in which to stage raw data in order to move it into other downstream systems. The Hadoop Connector allows you to tap into the large ecosystem of existing libraries to transform and aggregate data before loading it into MarkLogic. This is, in fact, the fastest way to load data into MarkLogic: MarkLogic's bulk loading tool, mlcp, takes advantage of this by scheduling Hadoop jobs under the covers to load terabytes, or even petabytes, in parallel. MapReduce is only the first chapter in the MarkLogic + Hadoop story, though.

MapReduce Defined

When someone says Hadoop, they typically mean the entire ecosystem of projects. At the core, however, are the principal compute and storage infrastructure components mentioned above: MapReduce for distributed computation, with a divide-and-conquer methodology, and HDFS, the distributed file system. These are the most mature parts of the ecosystem and the foundation for all of the other pieces. MapReduce allows you to break large or complex processing into small, independent pieces. Map processes or filters a chunk of the total input data, and Reduce aggregates and collates intermediate results. The Map and Reduce processes work in parallel. You can scale by adding more commodity hardware, not by upgrading to bigger/faster hardware. The system is centrally coordinated, so if a node goes down, its work is rescheduled to another.

MARKLOGIC WITH HADOOP AS STORAGE INFRASTRUCTURE

Beyond its MapReduce processing capabilities, Hadoop, via its core Hadoop Distributed File System (HDFS), is also a cheap and reliable way to store Big Data. It is the default data storage for Hadoop, and scales to hundreds of petabytes on commodity hardware. HDFS sits right on top of raw local storage, so users do not need a SAN. Its other characteristics include:

- Designed only for reading large, opaque files from start to finish
- Optimized for aggregate throughput, not latency
- Write-once, read-many
- Automatic 3x replication for availability and performance on commodity hardware
- File-level security designed to prevent accidental corruption

HDFS is a cost-effective means to keep data that might otherwise have been discarded or archived to tape. It is not designed for the real-time data access needed by user-facing applications. A file system doesn't replace a database. There are no indexes in HDFS, so finding an individual record typically involves scanning through every record in a large file. That's great for large-scale analytics, where the computation might need to read every record. However, it doesn't work for queries that need to be interactive to support end-user applications.

Hadoop is designed for aggregate throughput. The primary consumer is, of course, MapReduce, which is all about reading large files end to end. Coincidentally, this also happens to be MarkLogic's I/O pattern. MarkLogic buffers writes to RAM, with an on-disk journal for durability, and periodically spills those buffers, or stands, to disk. As you get more stands, it becomes more expensive to run queries, so in the background MarkLogic consolidates small stands into larger stands, similar in principle to compaction in HBase. Individual reads are aggressively cached in their compressed on-disk format at the MarkLogic data nodes, as well as in their uncompressed form at the query evaluator nodes, to avoid disk seeks.

To describe how MarkLogic is integrated into Hadoop at the storage layer, it is important to first understand how MarkLogic stores data. MarkLogic stores its indexes and data in independent partitions, or forests (collections of JSON or XML trees). Hosts attach forests and provide CPU and RAM. Applications scale up by adding forests, and scale out by adding hosts.
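The index-versus-scan distinction drawn above can be sketched in a few lines. This is a toy in-memory analogy, not actual HDFS or MarkLogic code: without an index, finding one record means touching every record, while an index trades extra storage and write-time work for direct lookup.

```python
# A toy "file" of records; HDFS stores data as opaque files like this.
records = [{"id": i, "value": i * i} for i in range(100_000)]

def scan_lookup(records, target_id):
    """HDFS-style access: no index, so read records until a match is found."""
    for rec in records:
        if rec["id"] == target_id:
            return rec
    return None

# Database-style access: build an index once, then look records up directly.
index = {rec["id"]: rec for rec in records}

# Both return the same record, but the scan may touch all 100,000 records
# while the indexed lookup touches one.
assert scan_lookup(records, 99_999) == index[99_999]
```

Full scans are exactly what large-scale analytics wants (every record gets read anyway); interactive queries want the indexed path, which is the gap MarkLogic fills on top of HDFS.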
Using a shared file system as storage provides shared-disk failover, centralized management, economies of scale, and advanced features like flash backup, de-duplication, and compression. These features come at a cost, both in dollars and in complexity, and a truly shared architecture with pools of storage, as in a SAN or NAS, can be difficult to scale. MarkLogic can leverage Hadoop as a shared file system, using it to hold data and indexes, journals, offline archives, backups, and storage for binaries.

MarkLogic can also leverage Hadoop as a storage tier, allowing you to optimize among cost, performance, and availability. For example, you can benefit from less expensive Hadoop storage for archive data, with high density for efficiency and shared-disk failover, while using another tier of more expensive storage for active data, with low density for ingest performance and replication for high availability. A tiered storage infrastructure with MarkLogic lets you fluidly and consistently switch between active, historical, and archive data without expensive ETL or dedicated infrastructure. You can perform mixed batch and real-time workloads with Hadoop MapReduce and the MarkLogic Enterprise NoSQL Database.

DISTRIBUTION STRATEGY AND INTEL* PARTNERSHIP: ENTERPRISE HADOOP MEETS ENTERPRISE NOSQL

MarkLogic believes that Hadoop is an emerging part of mainstream enterprise infrastructure and is investing heavily in this future. Intel is investing in making Hadoop easier to deploy and manage, making it faster, and making it more secure. Intel* Distribution for Apache Hadoop* (IDH) allows organizations to be successful with Hadoop without having to be project committers, letting organizations focus on their applications, not their infrastructure. MarkLogic announced a partnership earlier this year with Intel that will provide integrated support for applications built with MarkLogic and IDH.
This is in addition to the Hadoop distributions that MarkLogic already supports, such as those from Hortonworks* and Cloudera*.

SUMMARY

Hadoop is still early in terms of mainstream adoption, but MarkLogic is committed to supporting the technology as an emerging component of mainstream enterprise infrastructure that is changing the economics of Big Data, and we are investing heavily in this future today. MarkLogic is the best database for Hadoop: organizations can deploy MarkLogic into an existing Hadoop stack to benefit from:

- Real-time enterprise applications for Hadoop
- Less data movement and duplication over the data's life cycle
- Mixed workloads for best value and efficiency: index once, then serve real-time or batch
- Cost-effective long-term and long-tail storage
- Leverage of existing (or upcoming) infrastructure investments
- MarkLogic security, high availability, disaster recovery, transactional consistency, search, and query

MarkLogic and Hadoop are complementary technologies that work well together for today's Big Data challenges. By partnering with Intel, MarkLogic becomes your one-stop shop for a fully supported platform that ensures low risk for your Big Data deployments. If you are currently implementing, or plan to implement, Big Data solutions, use MarkLogic as the foundation of your stack to get the best of batch processing and real-time interactivity.

MARKLOGIC

For more information, please visit

MarkLogic is a registered trademark of MarkLogic Corporation in the United States and/or other countries. All other trademarks mentioned are the property of their respective owners.


More information

Big Data at Cloud Scale

Big Data at Cloud Scale Big Data at Cloud Scale Pushing the limits of flexible & powerful analytics Copyright 2015 Pentaho Corporation. Redistribution permitted. All trademarks are the property of their respective owners. For

More information

Lambda Architecture. Near Real-Time Big Data Analytics Using Hadoop. January 2015. Email: bdg@qburst.com Website: www.qburst.com

Lambda Architecture. Near Real-Time Big Data Analytics Using Hadoop. January 2015. Email: bdg@qburst.com Website: www.qburst.com Lambda Architecture Near Real-Time Big Data Analytics Using Hadoop January 2015 Contents Overview... 3 Lambda Architecture: A Quick Introduction... 4 Batch Layer... 4 Serving Layer... 4 Speed Layer...

More information

CSE-E5430 Scalable Cloud Computing Lecture 2

CSE-E5430 Scalable Cloud Computing Lecture 2 CSE-E5430 Scalable Cloud Computing Lecture 2 Keijo Heljanko Department of Computer Science School of Science Aalto University keijo.heljanko@aalto.fi 14.9-2015 1/36 Google MapReduce A scalable batch processing

More information

An Oracle White Paper November 2010. Leveraging Massively Parallel Processing in an Oracle Environment for Big Data Analytics

An Oracle White Paper November 2010. Leveraging Massively Parallel Processing in an Oracle Environment for Big Data Analytics An Oracle White Paper November 2010 Leveraging Massively Parallel Processing in an Oracle Environment for Big Data Analytics 1 Introduction New applications such as web searches, recommendation engines,

More information

BIG DATA TECHNOLOGY. Hadoop Ecosystem

BIG DATA TECHNOLOGY. Hadoop Ecosystem BIG DATA TECHNOLOGY Hadoop Ecosystem Agenda Background What is Big Data Solution Objective Introduction to Hadoop Hadoop Ecosystem Hybrid EDW Model Predictive Analysis using Hadoop Conclusion What is Big

More information

Chukwa, Hadoop subproject, 37, 131 Cloud enabled big data, 4 Codd s 12 rules, 1 Column-oriented databases, 18, 52 Compression pattern, 83 84

Chukwa, Hadoop subproject, 37, 131 Cloud enabled big data, 4 Codd s 12 rules, 1 Column-oriented databases, 18, 52 Compression pattern, 83 84 Index A Amazon Web Services (AWS), 50, 58 Analytics engine, 21 22 Apache Kafka, 38, 131 Apache S4, 38, 131 Apache Sqoop, 37, 131 Appliance pattern, 104 105 Application architecture, big data analytics

More information

Well packaged sets of preinstalled, integrated, and optimized software on select hardware in the form of engineered systems and appliances

Well packaged sets of preinstalled, integrated, and optimized software on select hardware in the form of engineered systems and appliances INSIGHT Oracle's All- Out Assault on the Big Data Market: Offering Hadoop, R, Cubes, and Scalable IMDB in Familiar Packages Carl W. Olofson IDC OPINION Global Headquarters: 5 Speen Street Framingham, MA

More information

How To Handle Big Data With A Data Scientist

How To Handle Big Data With A Data Scientist III Big Data Technologies Today, new technologies make it possible to realize value from Big Data. Big data technologies can replace highly customized, expensive legacy systems with a standard solution

More information

How To Scale Out Of A Nosql Database

How To Scale Out Of A Nosql Database Firebird meets NoSQL (Apache HBase) Case Study Firebird Conference 2011 Luxembourg 25.11.2011 26.11.2011 Thomas Steinmaurer DI +43 7236 3343 896 thomas.steinmaurer@scch.at www.scch.at Michael Zwick DI

More information

Hadoop Evolution In Organizations. Mark Vervuurt Cluster Data Science & Analytics

Hadoop Evolution In Organizations. Mark Vervuurt Cluster Data Science & Analytics In Organizations Mark Vervuurt Cluster Data Science & Analytics AGENDA 1. Yellow Elephant 2. Data Ingestion & Complex Event Processing 3. SQL on Hadoop 4. NoSQL 5. InMemory 6. Data Science & Machine Learning

More information

Big Data Analytics. with EMC Greenplum and Hadoop. Big Data Analytics. Ofir Manor Pre Sales Technical Architect EMC Greenplum

Big Data Analytics. with EMC Greenplum and Hadoop. Big Data Analytics. Ofir Manor Pre Sales Technical Architect EMC Greenplum Big Data Analytics with EMC Greenplum and Hadoop Big Data Analytics with EMC Greenplum and Hadoop Ofir Manor Pre Sales Technical Architect EMC Greenplum 1 Big Data and the Data Warehouse Potential All

More information

In-Memory Analytics for Big Data

In-Memory Analytics for Big Data In-Memory Analytics for Big Data Game-changing technology for faster, better insights WHITE PAPER SAS White Paper Table of Contents Introduction: A New Breed of Analytics... 1 SAS In-Memory Overview...

More information

Hadoop. http://hadoop.apache.org/ Sunday, November 25, 12

Hadoop. http://hadoop.apache.org/ Sunday, November 25, 12 Hadoop http://hadoop.apache.org/ What Is Apache Hadoop? The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using

More information

Real-Time Big Data Analytics SAP HANA with the Intel Distribution for Apache Hadoop software

Real-Time Big Data Analytics SAP HANA with the Intel Distribution for Apache Hadoop software Real-Time Big Data Analytics with the Intel Distribution for Apache Hadoop software Executive Summary is already helping businesses extract value out of Big Data by enabling real-time analysis of diverse

More information

Testing Big data is one of the biggest

Testing Big data is one of the biggest Infosys Labs Briefings VOL 11 NO 1 2013 Big Data: Testing Approach to Overcome Quality Challenges By Mahesh Gudipati, Shanthi Rao, Naju D. Mohan and Naveen Kumar Gajja Validate data quality by employing

More information

Extending the Enterprise Data Warehouse with Hadoop Robert Lancaster. Nov 7, 2012

Extending the Enterprise Data Warehouse with Hadoop Robert Lancaster. Nov 7, 2012 Extending the Enterprise Data Warehouse with Hadoop Robert Lancaster Nov 7, 2012 Who I Am Robert Lancaster Solutions Architect, Hotel Supply Team rlancaster@orbitz.com @rob1lancaster Organizer of Chicago

More information

Constructing a Data Lake: Hadoop and Oracle Database United!

Constructing a Data Lake: Hadoop and Oracle Database United! Constructing a Data Lake: Hadoop and Oracle Database United! Sharon Sophia Stephen Big Data PreSales Consultant February 21, 2015 Safe Harbor The following is intended to outline our general product direction.

More information

AGENDA. What is BIG DATA? What is Hadoop? Why Microsoft? The Microsoft BIG DATA story. Our BIG DATA Roadmap. Hadoop PDW

AGENDA. What is BIG DATA? What is Hadoop? Why Microsoft? The Microsoft BIG DATA story. Our BIG DATA Roadmap. Hadoop PDW AGENDA What is BIG DATA? What is Hadoop? Why Microsoft? The Microsoft BIG DATA story Hadoop PDW Our BIG DATA Roadmap BIG DATA? Volume 59% growth in annual WW information 1.2M Zetabytes (10 21 bytes) this

More information

The Big Data Ecosystem at LinkedIn Roshan Sumbaly, Jay Kreps, and Sam Shah LinkedIn

The Big Data Ecosystem at LinkedIn Roshan Sumbaly, Jay Kreps, and Sam Shah LinkedIn The Big Data Ecosystem at LinkedIn Roshan Sumbaly, Jay Kreps, and Sam Shah LinkedIn Presented by :- Ishank Kumar Aakash Patel Vishnu Dev Yadav CONTENT Abstract Introduction Related work The Ecosystem Ingress

More information

Big Data Technology ดร.ช ชาต หฤไชยะศ กด. Choochart Haruechaiyasak, Ph.D.

Big Data Technology ดร.ช ชาต หฤไชยะศ กด. Choochart Haruechaiyasak, Ph.D. Big Data Technology ดร.ช ชาต หฤไชยะศ กด Choochart Haruechaiyasak, Ph.D. Speech and Audio Technology Laboratory (SPT) National Electronics and Computer Technology Center (NECTEC) National Science and Technology

More information

Open source Google-style large scale data analysis with Hadoop

Open source Google-style large scale data analysis with Hadoop Open source Google-style large scale data analysis with Hadoop Ioannis Konstantinou Email: ikons@cslab.ece.ntua.gr Web: http://www.cslab.ntua.gr/~ikons Computing Systems Laboratory School of Electrical

More information

Powerful Duo: MapR Big Data Analytics with Cisco ACI Network Switches

Powerful Duo: MapR Big Data Analytics with Cisco ACI Network Switches Powerful Duo: MapR Big Data Analytics with Cisco ACI Network Switches Introduction For companies that want to quickly gain insights into or opportunities from big data - the dramatic volume growth in corporate

More information

Successfully Deploying Alternative Storage Architectures for Hadoop Gus Horn Iyer Venkatesan NetApp

Successfully Deploying Alternative Storage Architectures for Hadoop Gus Horn Iyer Venkatesan NetApp Successfully Deploying Alternative Storage Architectures for Hadoop Gus Horn Iyer Venkatesan NetApp Agenda Hadoop and storage Alternative storage architecture for Hadoop Use cases and customer examples

More information

How To Use Hp Vertica Ondemand

How To Use Hp Vertica Ondemand Data sheet HP Vertica OnDemand Enterprise-class Big Data analytics in the cloud Enterprise-class Big Data analytics for any size organization Vertica OnDemand Organizations today are experiencing a greater

More information

Architectural patterns for building real time applications with Apache HBase. Andrew Purtell Committer and PMC, Apache HBase

Architectural patterns for building real time applications with Apache HBase. Andrew Purtell Committer and PMC, Apache HBase Architectural patterns for building real time applications with Apache HBase Andrew Purtell Committer and PMC, Apache HBase Who am I? Distributed systems engineer Principal Architect in the Big Data Platform

More information

Luncheon Webinar Series May 13, 2013

Luncheon Webinar Series May 13, 2013 Luncheon Webinar Series May 13, 2013 InfoSphere DataStage is Big Data Integration Sponsored By: Presented by : Tony Curcio, InfoSphere Product Management 0 InfoSphere DataStage is Big Data Integration

More information

W H I T E P A P E R. Deriving Intelligence from Large Data Using Hadoop and Applying Analytics. Abstract

W H I T E P A P E R. Deriving Intelligence from Large Data Using Hadoop and Applying Analytics. Abstract W H I T E P A P E R Deriving Intelligence from Large Data Using Hadoop and Applying Analytics Abstract This white paper is focused on discussing the challenges facing large scale data processing and the

More information

TAMING THE BIG CHALLENGE OF BIG DATA MICROSOFT HADOOP

TAMING THE BIG CHALLENGE OF BIG DATA MICROSOFT HADOOP Pythian White Paper TAMING THE BIG CHALLENGE OF BIG DATA MICROSOFT HADOOP ABSTRACT As companies increasingly rely on big data to steer decisions, they also find themselves looking for ways to simplify

More information

The Next Wave of Data Management. Is Big Data The New Normal?

The Next Wave of Data Management. Is Big Data The New Normal? The Next Wave of Data Management Is Big Data The New Normal? Table of Contents Introduction 3 Separating Reality and Hype 3 Why Are Firms Making IT Investments In Big Data? 4 Trends In Data Management

More information

NoSQL for SQL Professionals William McKnight

NoSQL for SQL Professionals William McKnight NoSQL for SQL Professionals William McKnight Session Code BD03 About your Speaker, William McKnight President, McKnight Consulting Group Frequent keynote speaker and trainer internationally Consulted to

More information

Getting Started & Successful with Big Data

Getting Started & Successful with Big Data Getting Started & Successful with Big Data @Pentaho #BigDataWebSeries 2013, Pentaho. All Rights Reserved. pentaho.com. Worldwide +1 (866) 660-7555 Your Hosts Today Davy Nys VP EMEA & APAC Pentaho Paul

More information

Understanding How Sensage Compares/Contrasts with Hadoop

Understanding How Sensage Compares/Contrasts with Hadoop Frequently Asked Questions Understanding How Sensage Compares/Contrasts with Hadoop 1. How does Sensage s approach to managing large, distributed data systems compare/contrast with Hadoop in terms of storage,

More information

WHITE PAPER LOWER COSTS, INCREASE PRODUCTIVITY, AND ACCELERATE VALUE, WITH ENTERPRISE- READY HADOOP

WHITE PAPER LOWER COSTS, INCREASE PRODUCTIVITY, AND ACCELERATE VALUE, WITH ENTERPRISE- READY HADOOP WHITE PAPER LOWER COSTS, INCREASE PRODUCTIVITY, AND ACCELERATE VALUE, WITH ENTERPRISE- READY HADOOP CLOUDERA WHITE PAPER 2 Table of Contents Introduction 3 Hadoop's Role in the Big Data Challenge 3 Cloudera:

More information

Apache Hadoop: The Big Data Refinery

Apache Hadoop: The Big Data Refinery Architecting the Future of Big Data Whitepaper Apache Hadoop: The Big Data Refinery Introduction Big data has become an extremely popular term, due to the well-documented explosion in the amount of data

More information

Oracle Database 12c Plug In. Switch On. Get SMART.

Oracle Database 12c Plug In. Switch On. Get SMART. Oracle Database 12c Plug In. Switch On. Get SMART. Duncan Harvey Head of Core Technology, Oracle EMEA March 2015 Safe Harbor Statement The following is intended to outline our general product direction.

More information

Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data

Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data Introduction to Hadoop HDFS and Ecosystems ANSHUL MITTAL Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data Topics The goal of this presentation is to give

More information

You Have Your Data, Now What?

You Have Your Data, Now What? You Have Your Data, Now What? Kevin Shelly, GVP, Global Public Sector Data is a Resource SLIDE: 2 Time to Value SLIDE: 3 Big Data: Volume, VARIETY, and Velocity Simple Structured Complex Structured Textual/Unstructured

More information

Making Sense of Big Data in Insurance

Making Sense of Big Data in Insurance Making Sense of Big Data in Insurance Amir Halfon, CTO, Financial Services, MarkLogic Corporation BIG DATA?.. SLIDE: 2 The Evolution of Data Management For your application data! Application- and hardware-specific

More information

Microsoft Big Data Solutions. Anar Taghiyev P-TSP E-mail: b-anarta@microsoft.com;

Microsoft Big Data Solutions. Anar Taghiyev P-TSP E-mail: b-anarta@microsoft.com; Microsoft Big Data Solutions Anar Taghiyev P-TSP E-mail: b-anarta@microsoft.com; Why/What is Big Data and Why Microsoft? Options of storage and big data processing in Microsoft Azure. Real Impact of Big

More information

Interactive data analytics drive insights

Interactive data analytics drive insights Big data Interactive data analytics drive insights Daniel Davis/Invodo/S&P. Screen images courtesy of Landmark Software and Services By Armando Acosta and Joey Jablonski The Apache Hadoop Big data has

More information

WHITEPAPER. A Technical Perspective on the Talena Data Availability Management Solution

WHITEPAPER. A Technical Perspective on the Talena Data Availability Management Solution WHITEPAPER A Technical Perspective on the Talena Data Availability Management Solution BIG DATA TECHNOLOGY LANDSCAPE Over the past decade, the emergence of social media, mobile, and cloud technologies

More information

WHITE PAPER USING CLOUDERA TO IMPROVE DATA PROCESSING

WHITE PAPER USING CLOUDERA TO IMPROVE DATA PROCESSING WHITE PAPER USING CLOUDERA TO IMPROVE DATA PROCESSING Using Cloudera to Improve Data Processing CLOUDERA WHITE PAPER 2 Table of Contents What is Data Processing? 3 Challenges 4 Flexibility and Data Quality

More information

NextGen Infrastructure for Big DATA Analytics.

NextGen Infrastructure for Big DATA Analytics. NextGen Infrastructure for Big DATA Analytics. So What is Big Data? Data that exceeds the processing capacity of conven4onal database systems. The data is too big, moves too fast, or doesn t fit the structures

More information

Integrating Hadoop. Into Business Intelligence & Data Warehousing. Philip Russom TDWI Research Director for Data Management, April 9 2013

Integrating Hadoop. Into Business Intelligence & Data Warehousing. Philip Russom TDWI Research Director for Data Management, April 9 2013 Integrating Hadoop Into Business Intelligence & Data Warehousing Philip Russom TDWI Research Director for Data Management, April 9 2013 TDWI would like to thank the following companies for sponsoring the

More information

PLATFORA INTERACTIVE, IN-MEMORY BUSINESS INTELLIGENCE FOR HADOOP

PLATFORA INTERACTIVE, IN-MEMORY BUSINESS INTELLIGENCE FOR HADOOP PLATFORA INTERACTIVE, IN-MEMORY BUSINESS INTELLIGENCE FOR HADOOP Your business is swimming in data, and your business analysts want to use it to answer the questions of today and tomorrow. YOU LOOK TO

More information

Big Data and Hadoop for the Executive A Reference Guide

Big Data and Hadoop for the Executive A Reference Guide Big Data and Hadoop for the Executive A Reference Guide Overview The amount of information being collected by companies today is incredible. Wal- Mart has 460 terabytes of data, which, according to the

More information

Big Data must become a first class citizen in the enterprise

Big Data must become a first class citizen in the enterprise Big Data must become a first class citizen in the enterprise An Ovum white paper for Cloudera Publication Date: 14 January 2014 Author: Tony Baer SUMMARY Catalyst Ovum view Big Data analytics have caught

More information

Hadoop IST 734 SS CHUNG

Hadoop IST 734 SS CHUNG Hadoop IST 734 SS CHUNG Introduction What is Big Data?? Bulk Amount Unstructured Lots of Applications which need to handle huge amount of data (in terms of 500+ TB per day) If a regular machine need to

More information

How Transactional Analytics is Changing the Future of Business A look at the options, use cases, and anti-patterns

How Transactional Analytics is Changing the Future of Business A look at the options, use cases, and anti-patterns How Transactional Analytics is Changing the Future of Business A look at the options, use cases, and anti-patterns Table of Contents Abstract... 3 Introduction... 3 Definition... 3 The Expanding Digitization

More information

HDP Hadoop From concept to deployment.

HDP Hadoop From concept to deployment. HDP Hadoop From concept to deployment. Ankur Gupta Senior Solutions Engineer Rackspace: Page 41 27 th Jan 2015 Where are you in your Hadoop Journey? A. Researching our options B. Currently evaluating some

More information

Elasticsearch on Cisco Unified Computing System: Optimizing your UCS infrastructure for Elasticsearch s analytics software stack

Elasticsearch on Cisco Unified Computing System: Optimizing your UCS infrastructure for Elasticsearch s analytics software stack Elasticsearch on Cisco Unified Computing System: Optimizing your UCS infrastructure for Elasticsearch s analytics software stack HIGHLIGHTS Real-Time Results Elasticsearch on Cisco UCS enables a deeper

More information

Keywords Big Data, NoSQL, Relational Databases, Decision Making using Big Data, Hadoop

Keywords Big Data, NoSQL, Relational Databases, Decision Making using Big Data, Hadoop Volume 4, Issue 1, January 2014 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Transitioning

More information

A Next-Generation Analytics Ecosystem for Big Data. Colin White, BI Research September 2012 Sponsored by ParAccel

A Next-Generation Analytics Ecosystem for Big Data. Colin White, BI Research September 2012 Sponsored by ParAccel A Next-Generation Analytics Ecosystem for Big Data Colin White, BI Research September 2012 Sponsored by ParAccel BIG DATA IS BIG NEWS The value of big data lies in the business analytics that can be generated

More information

How to Enhance Traditional BI Architecture to Leverage Big Data

How to Enhance Traditional BI Architecture to Leverage Big Data B I G D ATA How to Enhance Traditional BI Architecture to Leverage Big Data Contents Executive Summary... 1 Traditional BI - DataStack 2.0 Architecture... 2 Benefits of Traditional BI - DataStack 2.0...

More information

Using Big Data for Smarter Decision Making. Colin White, BI Research July 2011 Sponsored by IBM

Using Big Data for Smarter Decision Making. Colin White, BI Research July 2011 Sponsored by IBM Using Big Data for Smarter Decision Making Colin White, BI Research July 2011 Sponsored by IBM USING BIG DATA FOR SMARTER DECISION MAKING To increase competitiveness, 83% of CIOs have visionary plans that

More information

NoSQL Data Base Basics

NoSQL Data Base Basics NoSQL Data Base Basics Course Notes in Transparency Format Cloud Computing MIRI (CLC-MIRI) UPC Master in Innovation & Research in Informatics Spring- 2013 Jordi Torres, UPC - BSC www.jorditorres.eu HDFS

More information

Enabling High performance Big Data platform with RDMA

Enabling High performance Big Data platform with RDMA Enabling High performance Big Data platform with RDMA Tong Liu HPC Advisory Council Oct 7 th, 2014 Shortcomings of Hadoop Administration tooling Performance Reliability SQL support Backup and recovery

More information

Next-Generation Cloud Analytics with Amazon Redshift

Next-Generation Cloud Analytics with Amazon Redshift Next-Generation Cloud Analytics with Amazon Redshift What s inside Introduction Why Amazon Redshift is Great for Analytics Cloud Data Warehousing Strategies for Relational Databases Analyzing Fast, Transactional

More information

Cloudera Enterprise Reference Architecture for Google Cloud Platform Deployments

Cloudera Enterprise Reference Architecture for Google Cloud Platform Deployments Cloudera Enterprise Reference Architecture for Google Cloud Platform Deployments Important Notice 2010-2016 Cloudera, Inc. All rights reserved. Cloudera, the Cloudera logo, Cloudera Impala, Impala, and

More information

Ubuntu and Hadoop: the perfect match

Ubuntu and Hadoop: the perfect match WHITE PAPER Ubuntu and Hadoop: the perfect match February 2012 Copyright Canonical 2012 www.canonical.com Executive introduction In many fields of IT, there are always stand-out technologies. This is definitely

More information

IBM InfoSphere Guardium Data Activity Monitor for Hadoop-based systems

IBM InfoSphere Guardium Data Activity Monitor for Hadoop-based systems IBM InfoSphere Guardium Data Activity Monitor for Hadoop-based systems Proactively address regulatory compliance requirements and protect sensitive data in real time Highlights Monitor and audit data activity

More information

Big Data: Beyond the Hype

Big Data: Beyond the Hype Big Data: Beyond the Hype Why Big Data Matters to You WHITE PAPER By DataStax Corporation March 2012 Contents Introduction... 3 Big Data and You... 5 Big Data Is More Prevalent Than You Think... 5 Big

More information

Microsoft Azure Data Technologies: An Overview

Microsoft Azure Data Technologies: An Overview David Chappell Microsoft Azure Data Technologies: An Overview Sponsored by Microsoft Corporation Copyright 2014 Chappell & Associates Contents Blobs... 3 Running a DBMS in a Virtual Machine... 4 SQL Database...

More information

THE ATLAS DISTRIBUTED DATA MANAGEMENT SYSTEM & DATABASES

THE ATLAS DISTRIBUTED DATA MANAGEMENT SYSTEM & DATABASES THE ATLAS DISTRIBUTED DATA MANAGEMENT SYSTEM & DATABASES Vincent Garonne, Mario Lassnig, Martin Barisits, Thomas Beermann, Ralph Vigne, Cedric Serfon Vincent.Garonne@cern.ch ph-adp-ddm-lab@cern.ch XLDB

More information

Big Data Zurich, November 23. September 2011

Big Data Zurich, November 23. September 2011 Institute of Technology Management Big Data Projektskizze «Competence Center Automotive Intelligence» Zurich, November 11th 23. September 2011 Felix Wortmann Assistant Professor Technology Management,

More information

Big Data and Advanced Analytics Applications and Capabilities Steven Hagan, Vice President, Server Technologies

Big Data and Advanced Analytics Applications and Capabilities Steven Hagan, Vice President, Server Technologies Big Data and Advanced Analytics Applications and Capabilities Steven Hagan, Vice President, Server Technologies 1 Copyright 2011, Oracle and/or its affiliates. All rights Big Data, Advanced Analytics:

More information

Hadoop Submitted in partial fulfillment of the requirement for the award of degree of Bachelor of Technology in Computer Science

Hadoop Submitted in partial fulfillment of the requirement for the award of degree of Bachelor of Technology in Computer Science A Seminar report On Hadoop Submitted in partial fulfillment of the requirement for the award of degree of Bachelor of Technology in Computer Science SUBMITTED TO: www.studymafia.org SUBMITTED BY: www.studymafia.org

More information

Actian SQL in Hadoop Buyer s Guide

Actian SQL in Hadoop Buyer s Guide Actian SQL in Hadoop Buyer s Guide Contents Introduction: Big Data and Hadoop... 3 SQL on Hadoop Benefits... 4 Approaches to SQL on Hadoop... 4 The Top 10 SQL in Hadoop Capabilities... 5 SQL in Hadoop

More information