Hadoop: Strengths and Limitations in National Security Missions
How SAP Solutions Complement Hadoop to Enable Mission Success

Bob Palmer, Senior Director, SAP National Security Services (SAP NS2)
June 2012

About SAP National Security Services (SAP NS2)

SAP National Security Services (SAP NS2) offers a full suite of enterprise applications, analytics, database, cloud, and mobility software solutions from SAP and Sybase, with specialized levels of security and support to meet the unique mission requirements of US national security and critical infrastructure customers. SAP National Security Services and SAP NS2 are trademarks owned by SAP Government Support and Services, Inc. (SAP GSS).

Hadoop, a distributed file system for managing large data sets, is a top-of-mind subject in the U.S. Intelligence Community and Department of Defense, as well as in large commercial operations such as Yahoo!, eBay, and Amazon. This paper seeks to empower our clients and partners with a fundamental understanding of:

- The basic architecture of Hadoop;
- Which features and benefits of Hadoop are driving its adoption;
- Misconceptions that may lead to the idea that Hadoop is the hammer for every nail; and
- How SAP solutions, delivered by SAP NS2, can extend and complement Hadoop to achieve optimal mission or business outcomes for your organization.
What Is Hadoop?

Hadoop is a distributed file system, not a database. The Hadoop Distributed File System (HDFS) manages the splitting up and storage of large files of data across many inexpensive commodity servers, which are known as worker nodes. When Hadoop splits up the files, it puts redundant copies of the chunks of the file on more than one disc drive, providing self-healing redundancy if a low-cost commodity server fails.

Hadoop also manages the distribution of scripts that perform business logic on the data files that are split up across those many server nodes. This splitting up of the business logic across the CPUs and RAM of many inexpensive worker nodes is what makes Hadoop work well on very large Big Data files. Analysis logic is performed in parallel on all of the server nodes at once, on each of the 64 MB or 128 MB chunks of the file.

Hadoop software is written in Java and is licensed for free; it was developed as an open-source initiative of the Apache Foundation. The name Hadoop is not an acronym; it was the name of a toy elephant belonging to the son of Doug Cutting, the original author of the Hadoop framework. Companies such as Cloudera and Hortonworks sell support for Hadoop; they maintain their own versions and perform support and maintenance for an annual fee based on the number of nodes. They also sell licensed software utilities to manage and enhance the freeware Hadoop software. Thus, we can say that Cloudera and Hortonworks are to Hadoop as Red Hat and SUSE are to Linux.

[Figure: Hadoop Architecture]
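To make the file-system behavior above concrete, here is a minimal sketch using Hadoop's Java FileSystem API to write a file into HDFS with an explicit block size and replication factor. The namenode address, file path, and parameter values are illustrative assumptions, not details taken from this paper.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical namenode address; on a real cluster this comes from core-site.xml
        conf.set("fs.defaultFS", "hdfs://namenode:8020");
        FileSystem fs = FileSystem.get(conf);

        // Write a file that HDFS splits into 128 MB blocks, each copied to 3 worker nodes
        Path file = new Path("/data/sensor/readings.log");
        short replication = 3;
        long blockSize = 128L * 1024 * 1024;
        try (FSDataOutputStream out = fs.create(file, true,
                conf.getInt("io.file.buffer.size", 4096), replication, blockSize)) {
            out.writeBytes("example sensor record\n");
        }

        // HDFS is write-once, read-many: files are rewritten or appended, never updated in place
        System.out.println("Block size:  " + fs.getFileStatus(file).getBlockSize());
        System.out.println("Replication: " + fs.getFileStatus(file).getReplication());
    }
}
```

If a disc or a whole node holding one of those block copies fails, the remaining copies are used to re-replicate the affected blocks, which is the self-healing behavior described above.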
The Secret Sauce: MapReduce

The exploration or analysis of the data in the Hadoop file system is done through a software process called MapReduce. MapReduce scripts are written in Java by developers on the edge server. The MapReduce engine breaks up jobs into many small tasks, run in parallel and launched on all nodes that have a part of the file; this is how Hadoop addresses the problem of storage and analysis of Big Data. Both the data and the computations on that data are held in the RAM of the node computer.

The map function of the MapReduce engine distributes the logic that the developer wants to execute on a given data file to each of the CPU processors on the nodes that have a small chunk of that file on their discs. The reduce function takes the raw results of the mapping function and performs the desired business logic computations against them: aggregation, sums, word counts, averages, et cetera.

As an example of a MapReduce function, let's say that you want to find out what the average temperature was for every city in America in a given year. The mapper function would organize data from all the chunks of your one-terabyte-sized "temperatures for the 20th century" file using that year as the key. Values returned from each of the nodes would be 78, 58, 32, 64, and so on. The reducer task would then take the mapped data organized by year and average them.

Hadoop is self-healing because HDFS splits large files into 64 MB or 128 MB chunks spread out over the many disc drives of the nodes, and it redundantly puts copies of the chunks on more than one disc drive. This is why one can build Hadoop deployments using hundreds of inexpensive commodity servers with plain old disc drives. If a box fails, you don't even fix it; you just yank it out and trash it, and Hadoop manages the redundant data with no loss. Scalability is simple to achieve: if the CPU or RAM is being taxed by the MapReduce scripts, just add more inexpensive commodity server CPUs and disc drives.
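The temperature example can be sketched as an actual MapReduce job. The code below is a minimal, illustrative version in Hadoop's Java API; it assumes each input record is a comma-separated line of city, year, and temperature, and the class names, paths, and record layout are inventions for this sketch rather than anything prescribed by the paper.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class AverageTemperature {

    // Map: each node scans its local 64/128 MB chunk and emits (city, temperature)
    // for records matching the year of interest.
    public static class TempMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Assumed record layout: city,year,temperature
            String[] fields = value.toString().split(",");
            if (fields.length == 3
                    && fields[1].equals(context.getConfiguration().get("target.year"))) {
                context.write(new Text(fields[0]), new IntWritable(Integer.parseInt(fields[2])));
            }
        }
    }

    // Reduce: all readings for one city arrive together; compute the average.
    public static class AvgReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text city, Iterable<IntWritable> temps, Context context)
                throws IOException, InterruptedException {
            long sum = 0;
            int count = 0;
            for (IntWritable t : temps) {
                sum += t.get();
                count++;
            }
            context.write(city, new IntWritable((int) (sum / count)));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("target.year", args[2]);          // e.g. "1999"
        Job job = Job.getInstance(conf, "average temperature");
        job.setJarByClass(AverageTemperature.class);
        job.setMapperClass(TempMapper.class);
        job.setReducerClass(AvgReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. /data/temperatures
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // e.g. /results/avg-temp
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

The framework guarantees that all values for a given key arrive at the same reducer, which is what allows the average to be computed in a single pass over the mapped data.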
Hadoop as a Threat to Hardware Manufacturers

Since it was conceived as a way to slash the per-terabyte cost of storing very large amounts of data, because one gets the software for free and uses cheap commodity servers and disc drives to store it, Hadoop is certainly a threat to high-end hardware manufacturers. One response of the hardware vendors has been to create Hadoop appliances. For example, enterprise hardware manufacturers will deliver a refrigerator-sized box to your loading dock that has thousands of nodes of Cloudera- or Hortonworks-brand Hadoop already loaded on it, with redundant cooling and network interfaces. This approach appeals to a customer who has a long-standing relationship with a hardware vendor and wants a turn-key way to store large amounts of semi-structured data. But it runs against the original concept of Hadoop, namely the use of cheap commodity hardware with self-healing software to store data at a minimum cost per terabyte. Hadoop appliances also appeal to customers who do not have the existing data-center real estate and infrastructure to power and cool hundreds of commodity servers.

Utilities that Interface with Hadoop

MapReduce scripts can also be generated automatically by utilities such as HIVE and PIG. HIVE is a utility for developers who know SQL (Structured Query Language); it takes a subset of ANSI SQL and generates MapReduce scripts from it. Note that not all SQL commands are appropriate for a distributed file system like Hadoop. HIVE could be used to bridge from a SQL-based business intelligence (BI) reporting front-end to Hadoop, but only a part of the full SQL functionality could be executed by Hadoop (a brief example of this bridge appears at the end of this section). The HIVE user documentation publishes which SQL functions are available. Compared to highly optimized, handwritten MapReduce scripts, HIVE-generated scripts typically run about 1.7 times slower.

PIG is a similar utility that creates MapReduce jobs, except it is intended for use by developers who are conversant in Perl or Python scripting. It uses a scripting language called Pig Latin.

Sqoop is a bulk loader for Hadoop, which takes data from file sources and loads it into Hadoop. Mahout is a utility that performs data mining, text analysis, and complex statistical analysis on Hadoop files. It is a highly configurable, robust, and complex development tool.
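The HIVE bridge mentioned above is typically reached through a standard JDBC driver. The sketch below is illustrative only: it submits one HiveQL query to a hypothetical HiveServer2 endpoint (hive-server:10000) against an invented web_logs table, and HIVE would compile the statement into MapReduce jobs over the underlying HDFS files.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
    public static void main(String[] args) throws Exception {
        // Hive JDBC driver for HiveServer2; host, port, user, and table are illustrative
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:hive2://hive-server:10000/default", "analyst", "");
             Statement stmt = conn.createStatement()) {

            // HIVE translates this subset-of-SQL statement into MapReduce jobs
            // that scan the underlying HDFS files in parallel.
            ResultSet rs = stmt.executeQuery(
                "SELECT page, COUNT(*) AS hits " +
                "FROM web_logs GROUP BY page ORDER BY hits DESC LIMIT 10");

            while (rs.next()) {
                System.out.println(rs.getString("page") + "\t" + rs.getLong("hits"));
            }
        }
    }
}
```

A BI front-end would issue similar statements on the user's behalf, limited to the SQL subset that HIVE supports.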
The Perceived Business Benefits of Hadoop

The attractions of Hadoop are similar to the business drivers behind other open-source software. The procurement process is simpler, without the need for capital expenditure, because there are no initial software licensing costs. Paid support from vendors like Cloudera and Hortonworks can be an operations and maintenance (O&M) budget expenditure. The skilled labor needed for MapReduce scripting (and for updating those scripts to reflect changing end-user demand) may be seen as a sunk cost if a large employee base of developers already exists in a client's shop or amongst the system integrator's staff. Aside from the initial cost, the freedom and flexibility of deployment unconstrained by licensing considerations makes Hadoop attractive to customers.

What Hadoop Is Not So Good At

Since Hadoop is a file system and not a database, there is no way to change the data in the files.[1] There is no such thing as an update or delete function in Hadoop as there is in a database-management system, and no concept of committing or rolling back data as in a transaction-processing database system. So we can say that Hadoop is a write-once, read-many affair. Therefore, Hadoop is best used for storing large amounts of unstructured or semi-structured data from streaming data sources like log files, internet search results (click-stream data), signal intelligence, or sensor data. It is not as well suited for storing and managing changes to discrete business or mission data elements that have precise meta-data definitions.

In order to do analysis or exploration of a file in Hadoop, the whole file must be read for every computational process, because by its nature there is no predefined data schema or index in Hadoop.[2]

Since Hadoop is architected to accommodate very large data files by splitting the file up into small chunks over many worker nodes, it does not perform well if many small files are stored in the distributed file system. If only one or two nodes are needed for the file size, the overhead of managing the distribution has no economy of scale. Therefore we can generalize that Hadoop is not a good way to organize and analyze large numbers of smaller files.

[1] There are schemes that append log files to Hadoop files to emulate the deletion or changing of data elements. On query, the log file is consulted to indicate that the result of the MapReduce should not include a given element.
[2] Hadoop is schema-on-read.
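Note [1] describes emulating deletes with a side log. Purely as an illustration of that idea (nothing in Hadoop itself provides this), the mapper fragment below loads a hypothetical tombstone file of deleted record keys on each node and filters matching records out of the result at read time; the file name and record layout are invented for the sketch.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Emulated "delete": records whose key appears in a tombstone log are filtered
// out at read time, since the underlying HDFS file itself is never modified.
public class TombstoneFilterMapper extends Mapper<LongWritable, Text, Text, NullWritable> {

    private final Set<String> deletedKeys = new HashSet<>();

    @Override
    protected void setup(Context context) throws IOException {
        // Hypothetical tombstone file shipped to each node (e.g. via the distributed cache)
        try (BufferedReader reader = new BufferedReader(new FileReader("tombstones.txt"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                deletedKeys.add(line.trim());
            }
        }
    }

    @Override
    protected void map(LongWritable offset, Text record, Context context)
            throws IOException, InterruptedException {
        // Assumed layout: the record key is the first comma-separated field
        String key = record.toString().split(",", 2)[0];
        if (!deletedKeys.contains(key)) {
            context.write(new Text(key), NullWritable.get());
        }
    }
}
```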
Hadoop is good for asking novel questions of raw data, questions you never even knew you would need to ask when you collected the data, when you expect that it will take a lot of batch-processing time to get the results you are looking for.[3] It is less useful when you know the types of questions that will be asked of well-structured data sets. And Hadoop will not return query results in the sub-second response times that business intelligence users are accustomed to for report queries based on well-structured and indexed data warehouses. For example, Hadoop is not the best solution when a CEO asks the question: Who were our 10 biggest customers in North America in Q3, by product line? Conversely, if you tried to read and sort every single row of Big Data on a relational database management system without any indexes, it could take hours, instead of the half hour that Hadoop might need for the same-sized file. It's really apples and oranges, i.e., different use cases for Hadoop versus data warehouse business intelligence.

In other words, Hadoop falls short in performance when a user wants to ask a specific business question to find just three records in a large data set.[4] A well-indexed relational database or a columnar database like SAP's Sybase IQ can deliver those three records out of billions with sub-second response times. Hadoop doesn't know how to look at just a few records to get an answer returned in milliseconds; it would have to go through the whole file to generate the result set that the script was looking for. Therefore Hadoop would be a poor choice for a master recruiting database, but it would be a very good repository to track click-stream responses to the recruiting website.

[3] For example, an anecdotal data fable is that a large national retailer captures every single check-out in Hadoop. By analyzing the data to understand how products should be placed on shelves, the retailer found that beer and diapers often appeared in the same checkout and should be conveniently shelved near each other, presumably because male buyers who are sent to the store for diapers often also buy beer.
[4] Note that this is the very use case that is right in the wheelhouse of SAP Sybase IQ and of SAP's in-memory data warehouse, HANA.
Hadoop as an ETL Tool to Load Databases

Hadoop is essentially batch oriented; a developer builds a MapReduce script to look at a whole one- or two-terabyte file and expects it to run for 20 minutes to an hour or longer. The idea of Hadoop is to be cheap enough per terabyte that you can store all of the raw data,[5] even data that currently has no anticipated use, which is certainly not a well-constructed data warehouse philosophy. Because of this, an important use case for Hadoop is as an ETL (Extract, Transform, and Load)[6] tool: to cull specific, important data elements that have business or mission value out of the huge data files and load them into a relational database management system or a columnar database, for transactional application use or business intelligence analysis (a rough sketch of such a load appears at the end of this section).

Optimizing Mission Performance Using a Hybrid Solution: SAP's Sybase and BusinessObjects Solutions with Hadoop

This is where SAP's Sybase IQ columnar database management system, combined with SAP BusinessObjects reporting, can add important functionality to the total Big Data environment. The result of the reduce function may be loaded into an analysis repository, where it can then be exposed to end-users through a self-service business intelligence tool like SAP BusinessObjects Web Intelligence and Explorer. The SAP Sybase columnar RDBMS is uniquely appropriate as a repository for the results of MapReduce jobs from the Hadoop system. It is optimized in such a way that there is no bottleneck in reporting performance, even for extremely large data sets coming from a reduce job. Sybase IQ by its nature is self-indexing. In addition, columnar rather than row-based storage of data means that seek-time response is much shorter, even when analyzing the very large data sets that may be the result of a Hadoop MapReduce script.

[5] For example, anecdotal rumor has it that a large telecommunications vendor couldn't figure out what caused a system-wide outage for several days, because the log files were on too many different servers and could not be analyzed together. Now they are collecting all the logs in one large Hadoop system.
[6] Or, more precisely in most cases, Extract, Load, and Transform.
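As a rough illustration of that load step, the sketch below reads the text output of a reduce job from HDFS and batch-inserts it into an analysis repository over plain JDBC. The HDFS path, JDBC URL, table name, and credentials are placeholders rather than anything specified in this paper, and a production load would more likely use a vendor bulk-load utility than row-by-row inserts.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class LoadReduceOutput {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Placeholder locations: reduce output on HDFS, analysis repository via JDBC
        Path reduceOutput = new Path("/results/avg-temp/part-r-00000");
        String jdbcUrl = "jdbc:example://analytics-host:5000/analysisdb"; // use your repository's real URL and driver

        try (Connection conn = DriverManager.getConnection(jdbcUrl, "loader", "secret");
             PreparedStatement insert = conn.prepareStatement(
                 "INSERT INTO avg_city_temperature (city, avg_temp) VALUES (?, ?)");
             BufferedReader reader = new BufferedReader(
                 new InputStreamReader(fs.open(reduceOutput)))) {

            conn.setAutoCommit(false);
            String line;
            while ((line = reader.readLine()) != null) {
                // Default MapReduce text output is key <TAB> value
                String[] parts = line.split("\t");
                insert.setString(1, parts[0]);
                insert.setInt(2, Integer.parseInt(parts[1]));
                insert.addBatch();
            }
            insert.executeBatch();
            conn.commit();
        }
    }
}
```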
The technical reason that Sybase IQ's columnar database is much faster for analysis than a traditional row-oriented database is that row-oriented databases serialize all of the attributes of an entity together in one place on the disc drive. For example, let's say you had the following data:

John Doe     E8   $65,000   12/01/1975
Mary Smith   E4   $22,000   05/02/1989
James Jones  O2   $23,000   06/14/1990
etc.         etc. etc.      etc.

On the disc, the row-based storage would look like this:

1: John Doe,E8,$65,000,12/01/1975,etc.
2: Mary Smith,E4,$22,000,05/02/1989,etc.
3: James Jones,O2,$23,000,06/14/1990,etc.

So if you were doing an analysis of salary level, you would have to query every row, which may be in different segments of the disc, and look for the attributes in which you have analytical interest. This process can be accelerated by building an indexing strategy to reduce the number of rows that need to be queried to get to an analytical result, but that presupposes that the queries to be asked are planned and known ahead of time; such tuning is against the paradigm of true ad hoc query and analysis.

In contrast, a columnar data store makes all the attributes of the entities quickly accessible on the disc in one place, e.g.:

1: John Doe,Mary Smith,James Jones,etc.
2: E8,E4,O2,etc.
3: $65,000,$22,000,$23,000,etc.
4: 12/01/1975,05/02/1989,06/14/1990,etc.

So all the attributes that are needed for analysis are found together in the same area of the disc; this allows the system to use far fewer disc hits when accessing data for analysis. It also allows powerful data compression, because bit maps can be used to represent the data elements instead of the full depiction of the data. This can result in significant savings in data center infrastructure costs. Typical compression rates are 3.5 times compression versus raw data, and 10 times compression versus row-based database storage of that data.[7]

[7] For example, one large U.S. federal government user of Sybase IQ realized a significant reduction in rack space, from 45 racks to 13 racks, which included database, extract-transform-and-load, and storage components.
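As a toy illustration of the access pattern described above (invented for this sketch, not taken from the paper), the code below contrasts computing an average salary from row-oriented records, where every attribute is dragged past the CPU, with reading a single contiguous salary column.

```java
import java.util.Arrays;

// Toy illustration of row-oriented vs. column-oriented layouts.
// A real database adds pages, compression, and indexes, but the access pattern
// is the same: the row layout visits every attribute of every record, while the
// column layout reads only the attribute being analyzed.
public class RowVsColumnScan {

    public static void main(String[] args) {
        // Row-oriented: all attributes of each entity stored together
        String[][] rows = {
            {"John Doe",    "E8", "65000", "12/01/1975"},
            {"Mary Smith",  "E4", "22000", "05/02/1989"},
            {"James Jones", "O2", "23000", "06/14/1990"}
        };

        long rowSum = 0;
        for (String[] row : rows) {           // every row (and every attribute) is visited
            rowSum += Long.parseLong(row[2]); // only the salary field is actually needed
        }

        // Column-oriented: each attribute stored contiguously; the query reads one column
        long[] salaryColumn = {65000, 22000, 23000};
        long columnSum = Arrays.stream(salaryColumn).sum();

        System.out.println("Average salary (row scan):    " + rowSum / rows.length);
        System.out.println("Average salary (column scan): " + columnSum / salaryColumn.length);
    }
}
```

A real columnar engine adds compression, bit maps, and its own storage format, but the reason it touches less data for this kind of query is the same.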
The current state of the practice for delivering the output of Hadoop MapReduce jobs is often handwritten Java reports or delimited files, which by their nature are not flexible or ad hoc. In addition, if the requirements for analysis are not static, there is an ongoing need to refine and rewrite MapReduce jobs to execute analysis. This typically has been done as a collaboration between the data scientist or MapReduce developer and the end-user, who is a subject matter expert in the mission.

[Figure: Iterative Process of Analysis Using MapReduce]

As shown above, this could produce a bottleneck in analysis that has nothing to do with data node scalability; there may be limited man-hours available for data scientists to react to the ever-changing analysis needs of the end-user analysts. Instead, we propose that Hadoop could be considered a catch-all collection point for continuously generated data. Mission-useful data could then be extracted for further analysis using business logic in Hadoop MapReduce jobs. The output of the MapReduce jobs could be used to load SAP Sybase IQ, and SAP BusinessObjects could then be used to analyze and visualize that result to make it actionable for the mission, in a graphical and flexible user interface (see diagram below). This would empower analysts to get different analytical views and juxtapositions of the data elements with sub-second response, without the need to go back to the IT shop for a revision to a reduce job or a new Java report.
Total-cost-of-ownership discussions about Hadoop must include the labor it takes to handwrite and constantly update the MapReduce scripts as driven by ever-changing end-user demands. Here the strengths of SAP BusinessObjects ad hoc reporting capabilities can be important. To the extent that end-user requirements to analyze data can be pushed to a more self-service front-end using SAP BusinessObjects, the need for IT involvement to create new MapReduce scripts can be constrained. This will shorten the time between end-user desire and fulfillment.

Thus one could re-frame the use of MapReduce. Instead of having end-user analysis logic distributed to the nodes, use Hadoop to do what it does best, i.e., the heavy lifting against the raw data files with MapReduce; then load the pertinent data that is likely to be needed to answer a given mission need into a Sybase IQ analysis repository, which can then be analyzed with high performance and ad hoc flexibility, without rewriting scripts, using SAP BusinessObjects Explorer and Web Intelligence. The SAP BusinessObjects Web Intelligence and Explorer user interfaces are designed to empower non-IT subject matter experts to navigate the data domain and slice and dice the data set without any expertise in query language or the underlying data structure.

[Figure: Hadoop with Ad-Hoc Business Intelligence Analysis Capabilities]

The inherent architecture of Sybase IQ as an analytical repository with sub-second response times, linked to the SAP BusinessObjects self-service graphical user interface, means that the organization can be more agile and responsive to user-driven needs for new analysis. In addition, Sybase IQ is well-suited to be the landing place for the results files of a MapReduce job, because the self-indexing nature of its columnar data store means that it is not necessary to plan a well-designed star-schema data structure to receive the data, as would be necessary in a row-based database.
If desired, SAP's Sybase mobility solutions can even enable the delivery of the analysis product to mobile or hand-held devices in the field. The analysis repository may also be the nexus between Hadoop data and other systems in the enterprise; reports can thus seamlessly present analysis of data from heterogeneous systems, data warehouses, and even transactional application data. In the recruiting example cited earlier, this methodology could relate the well-structured data elements about a prospective recruit to the semi-structured data coming into Hadoop files from a recruiting website, and then present that information to the staff in a highly graphical and flexible self-service user interface.

[Figure: Sybase IQ as Nexus for Hadoop Data and Structured Data: Data Transparency to the Analyst]

Summary

With this paradigm, we can help clients to optimize mission performance by leveraging the best of both worlds: the ability to store absolutely all incoming data, without regard to size or structure, in a cost-effective Hadoop file system; and then the use of MapReduce to create a data domain (or data domains) of data elements that can be delivered to non-technical end-users for analysis with ad hoc flexibility and sub-second response times, using SAP's industry-leading solutions for reporting and analysis.
Author: Bob Palmer, Senior Director, SAP National Security Services TM (SAP NS2 TM), [email protected]

Copyright 2012 by SAP Government Support and Services, Inc. All rights reserved. May not be copied or redistributed without permission.

SAP, R/3, xApps, xApp, SAP NetWeaver, Duet, SAP Business ByDesign, ByDesign, PartnerEdge and other SAP products and services mentioned herein as well as their respective logos are trademarks or registered trademarks of SAP AG in Germany and in several other countries all over the world. Business Objects and the Business Objects logo, BusinessObjects, Crystal Reports, Crystal Decisions, Web Intelligence, Xcelsius and other Business Objects products and services mentioned herein as well as their respective logos are trademarks or registered trademarks of Business Objects S.A. in the United States and in several other countries. Business Objects is an SAP company. All other product and service names mentioned and associated logos displayed are the trademarks of their respective companies. Data contained in this document serves informational purposes only. National product specifications may vary.

The information in this document is proprietary to SAP. This document is a preliminary version and not subject to your license agreement or any other agreement with SAP. This document contains only intended strategies, developments, and functionalities of the SAP product and is not intended to be binding upon SAP to any particular course of business, product strategy, and/or development. Please note that this document is subject to change and may be changed by SAP at any time without notice. SAP assumes no responsibility for errors or omissions in this document. SAP does not warrant the accuracy or completeness of the information, text, graphics, links, or other items contained within this material. This document is provided without a warranty of any kind, either express or implied, including but not limited to the implied warranties of merchantability, fitness for a particular purpose, or non-infringement. SAP shall have no liability for damages of any kind including without limitation direct, special, indirect, or consequential damages that may result from the use of these materials. This limitation shall not apply in cases of intent or gross negligence.