CLOUDERA WHITE PAPER

Table of Contents
Introduction
Data in Crisis
The Data Brain
Anatomy of the Platform
Essentials of Success
A Data Platform
The Road Ahead
About Cloudera
Introduction

The modern era of Big Data threatens to upend the status quo across organizations, from the data center to the boardroom. Companies are turning to Apache Hadoop (Hadoop) as the foundation for systems and tools capable of tackling the challenges of massive data growth. Some may doubt that Hadoop is the engine of the new era of data management, yet with the latest advances like Cloudera Impala, Hadoop enables organizations to deploy a central platform to solve end-to-end data problems, from batch processing to real-time applications, and to ask bigger questions of their data.

DATA IN CRISIS

A brief history of the data growth challenges facing organizations today illustrates why Hadoop has become the central platform for Big Data. During the mid-2000s, the stress of considerable data growth at innovative consumer web companies like Facebook, Google, and Yahoo! exposed flaws in existing data management technologies. The assumed operational model, which dictated massive storage arrays connected to massive computing arrays via a small network pipe, was showing its age. Network capabilities were failing to keep pace with computing demands as data sets grew in size and flow, and the lack of affordable, predictable scale-out architectures muted any potential benefit wrung from the growing volumes of data.

The data itself was changing as well. Non-traditional types and formats became valuable in reporting and analysis. Business teams were trying to collect new data to combine with existing customer and transaction data. To get a more refined picture of consumer activity, businesses wanted data in much larger volumes and from unstructured sources, including web server logs, images, blogs, and social media streams.
The sheer scale of this new data overwhelmed existing systems and demanded significant effort for even simple changes to data structures or reporting metrics. When orchestrating operations such as adding a single dimension or value, organizations were lucky to enact changes within two weeks; more often, changes required six months to implement. This latency meant the questions the business was asking had often changed by the time IT had adjusted the infrastructure to answer them.

Even more troubling was the emergence of new functional limits in the existing systems. To borrow from former US Secretary of Defense Donald Rumsfeld: these systems had been designed to examine the "known unknowns," the questions that a business knows to ask but does not yet have the answers to. The teams at Facebook, Google, and Yahoo! were encountering a different set of questions, the "unknown unknowns," the questions that a business has yet to ask but is actively seeking to discover.

"We knew that the business needed more than just queries," explained Jeff Hammerbacher, former data management leader at Facebook and current chief scientist at Cloudera. "Business now cared about processing as well. Yet our existing systems were optimized for queries, not processing." Business needed answers from data sets that required significantly more processing, and it needed a way to explore the questions that these new data sources brought to light.

The data exploration challenge stemmed from a fundamental shift in the way organizations consumed data. Data went from being simply a source of information for reactive decisions (the data in the report on the CFO's desk) to the driver of proactive decisions. Data powered the content targeting campaigns and the recommendation engines of the social era.
Organizations realized that through the discipline of data science and the breadth and permanence of their raw data sources and refined results, they could produce new revenue streams and cost avoidance strategies. Data had new intrinsic value. Data was now a financial asset, not a byproduct.
These events at Facebook, Google, and Yahoo! foreshadowed the challenges that now confront all industries. Back then, these companies were increasingly desperate to find a solution to their data management needs.

THE DATA BRAIN

The search for a solution ranged beyond the traditional database and data mart products to include high performance computing (HPC), innovative scale-out storage, and virtualization solutions. Each of these systems had components that solved elements of the broader Big Data challenge, but none provided a comprehensive and cohesive structure for addressing the issues in their entirety. HPC suffered the same network bottlenecks as legacy systems. New storage systems optimized the cost per byte but had no compute capacity. Virtualization excelled at making efficient use of individual machines but had no mechanism for combining multiple machines to act as one.

Figure: The Data Brain lifecycle (ingest, store, explore, process, analyze, serve).

Back then, Hammerbacher referred to the ideal solution as the Data Brain, which he described as "a place to put all our data, no matter what it is, extract value, and be intelligent about it." At this time Apache Hadoop, a nascent technology based on work pioneered by Google and created by former Yahoo! engineer and current Cloudera chief architect Doug Cutting, entered the IT marketplace. The initial focus of Hadoop was to improve and scale storage and processing during search indexing. Early adopters quickly realized that Hadoop, at its core, was more than just a system for building search indexes. The platform addressed the data needs of Facebook and Yahoo! as well as many others in the web and online advertising space. Hadoop offered deep, stable foundations for growth and opportunity. It was the answer to the coming wave of Big Data.
For these reasons, Cloudera was formed to focus on growing the Hadoop-based technology platform to meet the Big Data challenges the rest of the world would soon face. Cloudera has propelled Apache Hadoop to become the premier technology for real-time and batch-oriented processing workloads on extremely large and hybrid data sets. With Cloudera Enterprise, Hadoop becomes the central system in which organizations can solve end-to-end data problems that involve any combination of data ingestion, storage, exploration, processing, analytics, and serving.
ANATOMY OF THE PLATFORM

Apache Hadoop is open source software that couples elastic and versatile distributed storage with parallel processing of varied, multi-structured data on industry standard servers. Hadoop also has a rich and diverse ecosystem of supporting tools and applications.

> Core Architecture: The core of Hadoop is an architecture that marries self-healing, high-bandwidth clustered storage (the Hadoop Distributed File System, or HDFS) with fault-tolerant distributed processing (MapReduce). These core components of processing power and storage capacity scale linearly as additional servers are added to a Hadoop cluster. Data housed in HDFS is divided into smaller parts, called splits, which are distributed across storage partitions, called blocks, within the nodes of the cluster; the replicated blocks ensure data reliability and access. The MapReduce framework operates in a similar fashion: it exploits the block distribution during code execution to minimize data movement and ensure optimal data availability. Deploying Hadoop means no practical limit to data volume, and computing that is both immediate and usable.

Figure: Storage and compute in Hadoop. An input file is distributed as HDFS blocks across nodes A through E; MapReduce computation runs on those same nodes to produce the output file.

The underlying storage in HDFS is a flexible file system that accepts any data format and stores information permanently. Hadoop supports pluggable serialization that avoids normalization or restructuring, enabling efficient and reliable storage in the data's original format. As a result, if an application needs to reprocess a data set or read data in a different format, the original data is both local and in its high fidelity, native state.
Hadoop reads the format at query time, a process known as schema on read, or late binding, which offers a significant advantage over traditional systems that require data to be formatted first (schema on write, or early binding) before storage and processing. The latter approach often loses relevant but latent details and requires that an organization re-run the time-consuming full data lifecycle processing to regain any lost information.
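The two ideas above, the map/shuffle/reduce flow and schema on read, can be sketched together in a few lines. The following is a minimal simulation in plain Python, in the spirit of a Hadoop Streaming job; the log format and field names are illustrative assumptions, not a real data set.

```python
# Minimal sketch of the MapReduce flow over raw text, applying the
# schema at read time rather than at load time. The log lines and
# field layout here are hypothetical.
from itertools import groupby
from operator import itemgetter

RAW_LOGS = [
    "2012-10-24 GET /index.html 200",
    "2012-10-24 GET /missing 404",
    "2012-10-25 POST /login 200",
]

def map_phase(line):
    # Schema on read: the raw line is parsed only when queried, so a
    # later job can reinterpret the same bytes with a different schema.
    date, method, path, status = line.split()
    yield status, 1

def reduce_phase(key, values):
    yield key, sum(values)

# Simulate the shuffle: sort map output by key, then group by key.
mapped = [kv for line in RAW_LOGS for kv in map_phase(line)]
mapped.sort(key=itemgetter(0))
results = {k: v for key, group in groupby(mapped, key=itemgetter(0))
           for k, v in reduce_phase(key, [v for _, v in group])}
print(results)  # → {'200': 2, '404': 1}
```

In a real cluster the map and reduce functions run in parallel on the nodes holding the blocks, and the shuffle moves only the intermediate key-value pairs; the contract between the phases, however, is exactly this simple.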
Hadoop runs on industry standard hardware, and the cost per terabyte of Hadoop-based storage is typically 10x cheaper than traditional relational technology. Hadoop uses servers with local storage, thereby optimizing for high I/O workloads. Servers are connected using standard gigabit Ethernet, which lowers overall system cost yet still allows near limitless storage and processing thanks to Hadoop's scale-out features. In addition to its use of local storage and standard networking, Hadoop can reduce total hardware requirements, since a single cluster provides both storage and processing.

> Extending the Core: Over time, the Apache Hadoop ecosystem has matured to make these foundational elements easier to use:

> Higher-level languages, like Apache Pig for procedural programming and Apache Hive for SQL-like manipulation and query (HiveQL), streamline integration and ease adoption for non-developers.

> Data acquisition tools, like Apache Flume for log file and stream ingestion and Apache Sqoop for bi-directional data movement to and from relational databases, broaden the range of data available within a Hadoop cluster.

> End user access tools, like Cloudera Hue for efficient, user-level interaction and Apache Oozie for workflow and scheduling, give IT operations and users alike direct and manageable means to maximize their efforts.

> For high-end, real-time serving and delivery, the ecosystem includes Apache HBase, a distributed, column-family database.

> Cloudera has further extended the distributed processing architecture beyond batch analysis with the introduction of Impala. Cloudera Impala is the next step in real-time query engines, allowing users to query data stored in HDFS and HBase in seconds via a SQL interface. It leverages the metadata, SQL syntax, ODBC driver, and Hue user interface from Hive.
Rather than using MapReduce, Impala uses its own processing framework to execute queries. The result is an order-of-magnitude performance improvement over Hive, enabling interactive data exploration.

Figure: Improved response times with Cloudera Impala over Hive/MapReduce for typical fraud analysis queries (average seconds, at two data set sizes).
ESSENTIALS OF SUCCESS

What makes a successful Big Data platform? Based on years of experience as the leading vendor and solution provider for Hadoop, Cloudera defines four requirements for a successful platform: volume, velocity, variety, and value. While competing systems and technologies satisfy some of these demands, all have shortcomings and ultimately are inadequate platforms for Big Data.

> Volume: Big Data is just that: data sets so massive that typical software systems are incapable of economically storing, let alone managing and computing, the information. A Big Data platform must capture and readily provide such quantities in a comprehensive and uniform storage framework to enable straightforward management and development. While scalable data volume is a common refrain from vendors, and many systems claim to handle petabyte- and exabyte-scale data stores, these statements can be misleading. The only commercially available system proven to reach 100PB is built on Apache Hadoop. Other systems that do approach these volumes typically split, or shard, the data into infrastructure silos to overcome performance and storage issues. Still others tie together multiple systems via federation and other virtual means and are typically subject to network latency, capability mismatch, and security constraints.

> Velocity: As organizations continue to seek new questions, patterns, and metrics within their data sets, they demand rapid and agile modeling and query capabilities. A Big Data platform should maintain the original format and precision of all ingested data to ensure full latitude for future analysis and processing cycles. The platform should deliver this raw, unfettered data at any time during these cycles. This requirement is a true litmus test for systems claiming the title of Big Data platform.
If data import requires a schema, then most likely the system has static schemas and proprietary serialization formats that are incapable of easy and rapid changes. Such models make answering the "unknown unknowns" challenge extremely difficult. This is a key differentiator between legacy relational technology systems and most Big Data solutions.

> Variety: One of the tenets of Big Data is the exponential growth of unstructured data. The vast majority of data now originates from sources with either limited or variable structure, such as social media and telemetry. A Big Data platform must accommodate the full spectrum of data types and forms. Some solutions highlight their flexibility with both unstructured and structured data, but in reality most employ opaque binary large object (BLOB) storage to dump unstructured data wholesale into columns within rigid relational schemas. In essence, the database becomes a file system, and while this technique appears to meet the goal of data flexibility, the system overhead inflates the economics and degrades performance. Relational technologies are simply not the right tool to handle a wide variety of formats, especially variable ones. Some systems support native XML; however, this is a single data format that suffers the same disadvantages as its relational counterparts.
> Value: Driving relevant value, whether as revenue or cost savings, from data is the primary motivator for many organizations. The popularity of long tail business models has forced companies to examine their data in detail to find the patterns, affiliations, and connections that drive these new opportunities. Data scientists and developers need the full fidelity of their data, not a clustered sampling, to seek these opportunities, or they face the omission of a potential match that could prove wildly successful or downright catastrophic. A Big Data platform should offer organizations a range of languages, frameworks, and entry points to explore, process, analyze, and serve their data in pursuit of these goals. Some practitioners state they have been providing this faculty for many years, yet most have been using only SQL, which as a query language is not ideal for data processing. While user-defined functions (UDFs), code embedded within a query to extend its capabilities, do enhance SQL, organizations can only exploit the full power of data processing through a truly Turing-complete language, such as Java, Python, or Ruby, within a MapReduce job.

Hadoop meets and exceeds all requirements of a Big Data platform:

> Hadoop houses all data together under a single namespace and metadata model, on a single set of nodes, with a single security and governance framework, on a linearly scalable, industry standard hardware infrastructure.

> Hadoop is format agnostic due to its open and extensible data serialization framework, and employs the schema on read approach, which allows the ingestion and retrieval of any and all data formats in their native fidelity.

> Hadoop, through its schema on read and format-free approach, provides complete control over changes in data formats at any time and at any point during query and processing.
> Hadoop, through its MapReduce framework and the broader ecosystem, including Apache Hive, Apache Pig, and Cloudera Impala, grants developers and analysts a diverse yet inclusive set of both low-level and high-level tools for manipulating and querying data. With Hadoop, organizations can support multiple, simultaneous formats and analytic approaches.

Only Apache Hadoop offers all these features, and with Cloudera Enterprise, organizations benefit from a single, centralized management console with a single set of dependencies from one vendor, while still enjoying the advantages of open source software like code transparency and no vendor lock-in. The end result is: streamlined management for operators; batch, iterative, and real-time analysis for the data consumer; and faster return on investment for the forward-thinking IT leader.
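The point about Turing-complete processing is easiest to see with a concrete computation that pure SQL handles awkwardly. Sessionization, grouping a user's events into visits separated by idle gaps, is a classic case; inside a MapReduce job it is just ordinary code. The event timestamps and the 30-minute gap threshold below are illustrative assumptions.

```python
# Sketch of sessionization, a computation that is awkward in pure SQL
# but direct in a general-purpose language inside a MapReduce job.
SESSION_GAP = 30 * 60  # maximum idle seconds within one session (assumed)

def sessionize(timestamps, gap=SESSION_GAP):
    """Split one user's event timestamps into sessions wherever the gap
    between consecutive events exceeds the threshold."""
    sessions, current = [], []
    for t in sorted(timestamps):
        if current and t - current[-1] > gap:
            sessions.append(current)  # close the previous session
            current = []
        current.append(t)
    if current:
        sessions.append(current)
    return sessions

# One user's events: three close together, then two about two hours later.
events = [0, 100, 200, 7200, 7300]
sessions = sessionize(events)
print(sessions)  # → [[0, 100, 200], [7200, 7300]]
```

In the MapReduce framing, the shuffle delivers each user's events to one reducer, and a function like `sessionize` runs as the reduce step; expressing the same stateful scan in a SQL query requires contortions that this imperative loop avoids.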
A DATA PLATFORM

Apache Hadoop and Cloudera offer organizations immediate opportunities to maximize their investment in Big Data and help establish foundations for future growth and discovery. Cloudera sees three near-term activities for Hadoop in the modern enterprise: optimized infrastructure, predictive modeling, and data exploration.

> Optimized Infrastructure: Hadoop can improve and accelerate many existing IT workloads, including archiving, general processing, and, most notably, extract-transform-load (ETL) processes. Current ETL technologies tightly couple schema mappings within the processing pipeline. If an upstream structure breaks or inadvertently changes, the error cascades through the rest of the pipeline. Changes typically require an all-or-nothing approach, and this modeling approach results in longer cycles to make adjustments or fixes.

With MapReduce, all stages in the data pipeline persist their output to local disk, which offers a high degree of fault tolerance, and stage transitions gain flexibility via the schema on read capability. These two features enable iterative and incremental updates to processing flows, because transitional states in MapReduce are related but not dependent on each other. Thus, as developers encounter errors, updates may be applied and processing restarted at the point of failure rather than at the start of the pipeline.

Current ETL practices also involve considerable network traffic, as data is moved into and out of the ETL grid. This movement translates into either high latency or high provisioning costs. With Hadoop and the MapReduce framework, computing is performed locally, confining the expense of moving large volumes of data to the initial ingestion stage.
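The restart-at-the-point-of-failure property comes from persisting each stage's output. The following minimal sketch models that behavior with an in-memory dictionary standing in for intermediate files on HDFS; the stage names and data are illustrative, not a real ETL flow.

```python
# Minimal sketch of a staged pipeline whose intermediate results are
# persisted after each stage, so a rerun resumes at the failed stage
# rather than recomputing the whole flow. Stage names are illustrative.
PERSISTED = {}  # stands in for intermediate output written to HDFS

def run_pipeline(stages, data):
    for name, fn in stages:
        if name in PERSISTED:        # stage already completed: reuse output
            data = PERSISTED[name]
            continue
        data = fn(data)
        PERSISTED[name] = data       # checkpoint this stage's output
    return data

calls = []
def extract(d):   calls.append("extract");   return [x.strip() for x in d]
def transform(d): calls.append("transform"); return [x.upper() for x in d]

stages = [("extract", extract), ("transform", transform)]
out = run_pipeline(stages, [" a ", " b "])

# Simulate a rerun after a downstream failure: completed stages are
# skipped because their output was persisted.
out2 = run_pipeline(stages, [" a ", " b "])
print(out2, calls)  # → ['A', 'B'] ['extract', 'transform']
```

Because `calls` records each stage only once across both runs, the sketch shows the key economics: a fix to a downstream stage never forces the upstream stages to rerun.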
While Hadoop offers many advantages, it is not a wholesale replacement for traditional relational systems and other storage and analysis solutions. Rather, Hadoop is a strong complement to many existing systems. The combination of these technologies offers enterprises tremendous opportunities to maximize IT investments and expand business capabilities by aligning IT workloads to the strengths of each system.

Figure: Hadoop in the enterprise. Cloudera Manager and Cloudera Hadoop sit alongside the enterprise data warehouse and online serving systems, ingesting system logs, web logs, files, and RDBMS data; engineers, data scientists, analysts, business users, data architects, and system operators work through developer tools, data modeling, BI/analytics, enterprise reporting, and metadata/ETL tools, while web/mobile applications serve customers and end users.
For example, many data warehouses run workloads that are poorly aligned to their strengths because organizations had no other alternatives. Now, organizations can shift many of these tasks, such as large-scale data processing and the exploration of historical data, to Hadoop. The now unburdened data warehouse is free to focus on its specialized workloads, like current operational analytics and interactive online analytical processing (OLAP) reporting, while still benefiting from the processing and output of the Hadoop cluster. This architectural pattern has several benefits, including a lower cost to store massive data sets, faster transformation of large data sets, and a reduced data load into the data warehouse, which results in faster overall ETL processing and greater data warehouse capacity and agility. In short, each system, the data warehouse and Hadoop, focuses on its strengths to achieve business goals.

> Predictive Modeling: Hadoop is an ideal system for gathering and organizing large volumes of varied data, and its processing frameworks provide data scientists and developers a rich toolset for extracting signals and patterns from bodies of disparate knowledge. Organizations can exploit Hadoop's collection tools, like Flume and Sqoop, to import a sufficient corpus, and use tools like Pig, Hive, Apache Crunch, DataFu, and Oozie to execute profiling, quality checks, enrichment, and other necessary steps during data preparation. Model fitting efforts can employ common implementations, such as recommendation engines and Bayes classifiers, using Apache Mahout, which is built upon MapReduce, or construct models directly in MapReduce itself. Organizations can use the same collection of data preparation tools for validation steps, too. Commonly, the resulting cleansed data set is exported to a specialized statistical system for final computation and service.
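To make the Bayes-classifier mention concrete, here is a minimal multinomial naive Bayes sketch in pure Python. It is not Mahout's implementation, and the tiny training set is illustrative; at cluster scale the counting step would itself be a MapReduce job.

```python
import math
from collections import Counter, defaultdict

# Minimal multinomial naive Bayes, the kind of model the text says can
# be fit with Apache Mahout. Training data here is illustrative.
def train(examples):
    """examples: list of (label, tokens). Returns a model for classify()."""
    class_counts = Counter(label for label, _ in examples)
    word_counts = defaultdict(Counter)
    vocab = set()
    for label, tokens in examples:
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab

def classify(model, tokens):
    class_counts, word_counts, vocab = model
    total = sum(class_counts.values())
    best, best_score = None, float("-inf")
    for label, count in class_counts.items():
        # log P(label) + sum of log P(token | label), Laplace-smoothed
        score = math.log(count / total)
        denom = sum(word_counts[label].values()) + len(vocab)
        for tok in tokens:
            score += math.log((word_counts[label][tok] + 1) / denom)
        if score > best_score:
            best, best_score = label, score
    return best

model = train([
    ("spam", ["buy", "cheap", "pills"]),
    ("spam", ["cheap", "offer"]),
    ("ham",  ["meeting", "agenda", "notes"]),
])
print(classify(model, ["cheap", "pills"]))  # → spam
```

The per-class word counting is exactly the kind of embarrassingly parallel aggregation MapReduce excels at, which is why Mahout builds its classifiers on that framework.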
> Data Exploration: While Hadoop is a natural platform for large and dynamic data set analytics, its batch processing framework, MapReduce, has not always fit within an organization's interactivity and usability requirements. The design of MapReduce emphasized processing capabilities rather than rapid exploration and ease of use. The introduction of HBase was the first step toward low-latency data delivery, while Hive offered a SQL-based experience on top of MapReduce. Despite these advancements, developers and data scientists still lacked an interactive data exploration tool native to Hadoop, so they often shifted these workloads to traditional, purpose-built relational systems. With the addition of Cloudera Impala, Hadoop-based systems have entered the world of real-time interactivity. By allowing users to query data stored in HDFS and HBase in seconds, Impala makes Hadoop usable for iterative analytical processes. Now, developers and data scientists can interact with data at interactive speeds without migrating from Hadoop.
THE ROAD AHEAD

The evolution of Hadoop as an enterprise system is accelerating, as clearly demonstrated by innovations like HBase, Hive, and now Impala. The road ahead is one of convergence. When powered by a unified, fully audited, centrally managed solution like Cloudera Enterprise, the immediate opportunities (optimized infrastructure, predictive modeling, and data exploration) become stepping stones to achieving Hammerbacher's vision of "a place to put all our data, no matter what it is, extract value, and be intelligent about it." This goal is within sight; Hadoop is now the scalable, flexible, and interactive data refinery for modern enterprises and organizations.

Cloudera Enterprise is the platform for solving demanding, end-to-end data problems. Cloudera Enterprise empowers people and business with:

> Speed-to-Insight through iterative, real-time queries and serving;

> Usability and Ecosystem Innovation with low-latency query engines, powerful SQL-based interfaces, and ODBC/JDBC connectors;

> Discovery and Governance through common metadata and security frameworks;

> Data Fidelity and Optimization resulting from data and compute proximity, which brings analysis to the data, with schema on read where needed;

> Cost Savings from lower costs per terabyte, reduced lineage tracking across systems, and agile data modeling.

With Cloudera, people now have access to responsive and comprehensive high-performance storage and analysis from a single platform. People are free to explore the unknowns as well as the knowns. People get answers as fast as they ask questions. It is time to ask bigger questions.
About Cloudera

Cloudera, the leader in Apache Hadoop-based software and services, enables data driven enterprises to easily derive business value from all their structured and unstructured data. As the top contributor to the Apache open source community, with tens of thousands of nodes under management across customers in financial services, government, telecommunications, media, web, advertising, retail, energy, bioinformatics, pharma/healthcare, university research, oil and gas, and gaming, Cloudera's depth of experience and commitment to sharing expertise are unrivaled.

Cloudera provides no representations or warranties regarding the accuracy, reliability, or serviceability of any information or recommendations provided in this publication, or with respect to any results that may be obtained by the use of the information or observance of any recommendations provided herein. The information in this document is distributed AS IS, and the use of this information or the implementation of any recommendations or techniques herein is a customer's responsibility and depends on the customer's ability to evaluate and integrate them into the customer's operational environment.

Cloudera, Inc. 220 Portage Avenue, Palo Alto, CA 94306 USA cloudera.com

© Cloudera, Inc. All rights reserved. Cloudera and the Cloudera logo are trademarks or registered trademarks of Cloudera Inc. in the USA and other countries. All other trademarks are the property of their respective companies. Information is subject to change without notice.
WHITEPAPER OPEN MODERN DATA ARCHITECTURE FOR FINANCIAL SERVICES RISK MANAGEMENT A top-tier global bank s end-of-day risk analysis jobs didn t complete in time for the next start of trading day. To solve
IBM InfoSphere BigInsights Enterprise Edition Efficiently manage and mine big data for valuable insights Highlights Advanced analytics for structured, semi-structured and unstructured data Professional-grade
Integrating Hadoop Into Business Intelligence & Data Warehousing Philip Russom TDWI Research Director for Data Management, April 9 2013 TDWI would like to thank the following companies for sponsoring the
Getting Started with Hadoop Raanan Dagan Paul Tibaldi What is Apache Hadoop? Hadoop is a platform for data storage and processing that is Scalable Fault tolerant Open source CORE HADOOP COMPONENTS Hadoop
1 Data Lake In Action: Real-time, Closed Looped Analytics On Hadoop 2 Pivotal s Full Approach It s More Than Just Hadoop Pivotal Data Labs 3 Why Pivotal Exists First Movers Solve the Big Data Utility Gap
White Paper IT Workload Automation: Control Big Data Management Costs with Cisco Tidal Enterprise Scheduler What You Will Learn Big data environments are pushing the performance limits of business processing
Executive Summary... 2 Introduction... 3 Defining Big Data... 3 The Importance of Big Data... 4 Building a Big Data Platform... 5 Infrastructure Requirements... 5 Solution Spectrum... 6 Oracle s Big Data
Expert Reference Series of White Papers Big Data and Apache Hadoop Adoption: Key Challenges and Rewards 1-800-COURSES www.globalknowledge.com Big Data and Apache Hadoop Adoption: Key Challenges and Rewards
Hadoop implementation of MapReduce computational model Ján Vaňo What is MapReduce? A computational model published in a paper by Google in 2004 Based on distributed computation Complements Google s distributed
An Oracle White Paper November 2010 Leveraging Massively Parallel Processing in an Oracle Environment for Big Data Analytics 1 Introduction New applications such as web searches, recommendation engines,
Introduction to Hadoop HDFS and Ecosystems ANSHUL MITTAL Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data Topics The goal of this presentation is to give
BIG DATA IS MESSY PARTNER WITH SCALABLE SCALABLE SYSTEMS HADOOP SOLUTION WHAT IS BIG DATA? Each day human beings create 2.5 quintillion bytes of data. In the last two years alone over 90% of the data on
REAL-TIME OPERATIONAL INTELLIGENCE Competitive advantage from unstructured, high-velocity log and machine Big Data 2 SQLstream: Our s-streaming products unlock the value of high-velocity unstructured log
Modern Data Architecture with Apache Hadoop Talend Big Data Presented by Hortonworks and Talend Executive Summary Apache Hadoop didn t disrupt the datacenter, the data did. Shortly after Corporate IT functions
Storage for Next Generation Data Management Version: Q414-102 Table of Content Storage for the Modern Enterprise 3 The Challenges of Big Data 5 Data at the Center of the Enterprise 6 The Internals of HDFS
White Paper: Enhancing Functionality and Security of Enterprise Data Holdings Examining New Mission- Enabling Design Patterns Made Possible by the Cloudera- Intel Partnership Inside: Improving Return on
White Paper Make the Most of Big Data to Drive Innovation Through Reseach Bob Burwell, NetApp November 2012 WP-7172 Abstract Monumental data growth is a fact of life in research universities. The ability
Hadoop Ecosystem B Y R A H I M A. History of Hadoop Hadoop was created by Doug Cutting, the creator of Apache Lucene, the widely used text search library. Hadoop has its origins in Apache Nutch, an open
White Paper EMC s Enterprise Hadoop Solution Isilon Scale-out NAS and Greenplum HD By Julie Lockner, Senior Analyst, and Terri McClure, Senior Analyst February 2012 This ESG White Paper was commissioned
The Five Most Common Big Data Integration Mistakes To Avoid O R A C L E W H I T E P A P E R A P R I L 2 0 1 5 Executive Summary Big Data projects have fascinated business executives with the promise of
CTOlabs.com White Paper: What You Need To Know About Hadoop June 2011 A White Paper providing succinct information for the enterprise technologist. Inside: What is Hadoop, really? Issues the Hadoop stack
SAP Technical Brief SAP s for Enterprise Information Management SAP Data Services Objectives Integrate and Deliver Trusted Data and Enable Deep Insights Provide a wide-ranging view of enterprise information
CTOlabs.com White Paper: Hadoop for Intelligence Analysis July 2011 A White Paper providing context, tips and use cases on the topic of analysis over large quantities of data. Inside: Apache Hadoop and
The Internet of Things and Big Data: Intro John Berns, Solutions Architect, APAC - MapR Technologies April 22 nd, 2014 1 What This Is; What This Is Not It s not specific to IoT It s not about any specific
Tapping Into Hadoop and NoSQL Data Sources with MicroStrategy Presented by: Jeffrey Zhang and Trishla Maru Agenda Big Data Overview All About Hadoop What is Hadoop? How does MicroStrategy connects to Hadoop?
Big Data Technology ดร.ช ชาต หฤไชยะศ กด Choochart Haruechaiyasak, Ph.D. Speech and Audio Technology Laboratory (SPT) National Electronics and Computer Technology Center (NECTEC) National Science and Technology
Traditional BI vs. Business Data Lake A comparison The need for new thinking around data storage and analysis Traditional Business Intelligence (BI) systems provide various levels and kinds of analyses
INDUS / AXIOMINE Adopting Hadoop In the Enterprise Typical Enterprise Use Cases. Contents Executive Overview... 2 Introduction... 2 Traditional Data Processing Pipeline... 3 ETL is prevalent Large Scale
Data Warehouse Optimization Embedding Hadoop in Data Warehouse Environments A Whitepaper Rick F. van der Lans Independent Business Intelligence Analyst R20/Consultancy September 2013 Sponsored by Copyright
Best Practices for Hadoop Data Analysis with Tableau September 2013 2013 Hortonworks Inc. http:// Tableau 6.1.4 introduced the ability to visualize large, complex data stored in Apache Hadoop with Hortonworks
Introduction to Big Data! with Apache Spark" UC#BERKELEY# So What is Data Science?" Doing Data Science" Data Preparation" Roles" This Lecture" What is Data Science?" Data Science aims to derive knowledge!
Big Data must become a first class citizen in the enterprise An Ovum white paper for Cloudera Publication Date: 14 January 2014 Author: Tony Baer SUMMARY Catalyst Ovum view Big Data analytics have caught
A Next-Generation Analytics Ecosystem for Big Data Colin White, BI Research September 2012 Sponsored by ParAccel BIG DATA IS BIG NEWS The value of big data lies in the business analytics that can be generated
Using Tableau Software with Hortonworks Data Platform September 2013 2013 Hortonworks Inc. http:// Modern businesses need to manage vast amounts of data, and in many cases they have accumulated this data
[ Consumer goods, Data Services ] TEKSYSTEMS GLOBAL SERVICES CUSTOMER SUCCESS STORIES QUICK FACTS Objectives Develop a unified data architecture for capturing Sony Computer Entertainment America s (SCEA)
Hadoop http://hadoop.apache.org/ What Is Apache Hadoop? The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using
Cloudera Enterprise Reference Architecture for Google Cloud Platform Deployments Important Notice 2010-2015 Cloudera, Inc. All rights reserved. Cloudera, the Cloudera logo, Cloudera Impala, Impala, and
MAKING BIG DATA COME ALIVE Big Data Architecture & Analytics A comprehensive approach to harness big data architecture and analytics for growth Steve Gonzales, Principal Manager email@example.com
Big Data Open Source Stack vs. Traditional Stack for BI and Analytics Part I By Sam Poozhikala, Vice President Customer Solutions at StratApps Inc. 4/4/2014 You may contact Sam Poozhikala at firstname.lastname@example.org.
INSIGHT Oracle's All- Out Assault on the Big Data Market: Offering Hadoop, R, Cubes, and Scalable IMDB in Familiar Packages Carl W. Olofson IDC OPINION Global Headquarters: 5 Speen Street Framingham, MA
Tapping into Hadoop and NoSQL Data Sources in MicroStrategy Presented by: Trishla Maru Agenda Big Data Overview All About Hadoop What is Hadoop? How does MicroStrategy connects to Hadoop? Customer Case