White Paper: What You Need To Know About Hadoop

CTOlabs.com

White Paper: What You Need To Know About Hadoop

June 2011

A White Paper providing succinct information for the enterprise technologist.

Inside:
What is Hadoop, really?
Issues the Hadoop stack can address
Common Terms

Executive Summary

Apache Hadoop is a project operating under the auspices of the Apache Software Foundation (ASF). The Hadoop project develops open source software for reliable, scalable, distributed computing. Hadoop holds great promise in helping enterprises deal with the need to conduct analysis over large data sets.

What is Hadoop?

Everyone in federal IT seems to be talking about Apache Hadoop. But why? What is so special about this word? What do people mean when they say it? This paper provides information designed to put those questions and their answers into context relevant for your mission.

The term Hadoop is used two different ways in the federal IT community. Technologists consider it a framework of tools that enables distributed analysis over large quantities of data using commodity hardware. Hadoop is commonly distributed with a selection of related capabilities that make data analysis and feature access easier (like Cloudera's Distribution including Apache Hadoop, or CDH), and this distribution of Hadoop plus capabilities is also often referred to as Hadoop. The rest of this paper uses the term Hadoop in this context, since for most of us the bundle of capabilities in CDH is Hadoop.

Why Might You Need Hadoop?

Organizations today are collecting and generating more data than ever before, and the formats of this data vary widely. Older methods of dealing with data (such as relational databases) are not keeping up with the size and diversity of data and do not enable fast analysis over large data sets.

What Issues Can Hadoop Address?

Hadoop capabilities come in two key categories: storage and analysis. Data is stored in the Hadoop Distributed File System (HDFS). HDFS is well suited to storing very large data sets, including data stores in the trillions of bytes.
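To make the storage side concrete, the minimal sketch below uses the standard Hadoop FileSystem Java API to write a small file into HDFS and read it back. The NameNode address (namenode.example.com:8020) and the file path are illustrative placeholders, not values from this paper.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical cluster address; normally supplied by core-site.xml.
        conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020");

        FileSystem fs = FileSystem.get(conf);

        // Write a file; HDFS transparently replicates its blocks across DataNodes.
        Path path = new Path("/data/example/records.txt");
        try (FSDataOutputStream out = fs.create(path, true)) {
            out.writeBytes("first record\nsecond record\n");
        }

        // Read the same file back.
        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(fs.open(path)))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}
```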

Analysis is performed using a programming model known as MapReduce. MapReduce is a divide-and-conquer approach that uses the power of distributed computers to analyze data and then brings the results back together for correlation. MapReduce lets you run computations on the nodes where the data already resides. Google, Twitter, Facebook and other data-intensive organizations use this method, running MapReduce over distributed storage on thousands of compute nodes. Twitter is a case of special note, where a Hadoop infrastructure is used to monitor and analyze tweets.

Because CDH uses the MapReduce method it is very fast, which is important when working with large data sets. Yahoo used it to sort a terabyte in 62 seconds, for example. This is a significant benchmark in the IT world and just one of many examples showing that if you need to perform fast analysis over large data, you should be using Hadoop.

CDH also helps address IT cost issues, since the computing power is delivered on commodity hardware. It runs on computers available from any vendor and does not require high-end machines. CDH itself is available for free, which helps with cost as well.

With these economical solutions to storage and analysis, new challenges can be addressed. Consider, for example, the need to search for and discover indicators of fraud in visa applications. Huge quantities of information must be searched and correlated from multiple sources to find indications of fraud, and the quicker the better. Hadoop is well suited to this sort of fast analysis. Other users are leveraging Hadoop to rapidly search and correlate across vast stores of SIGINT, ELINT and unstructured text on adversaries to seek battlefield advantage for our forces. More information on these use cases is available separately.

Or consider challenges where no solution has ever been attempted. For example, consider the government data on weather, climate, the environment, pollution, health, quality of life, the economy, natural resources, energy and transportation. Data on these topics exist in many stores across the federal enterprise. The government also has years of information from research conducted at academic institutions across the nation. Imagine the conclusions that could be drawn from analysis over data stores like this. Imagine the benefits to our citizens' health, commodity prices, education and employment from better analysis over these data stores.
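To make the MapReduce model concrete, here is the classic word-count job expressed with the standard Hadoop Java API: the map function runs next to each block of input and emits (word, 1) pairs, and the reduce function sums the counts for each word after the shuffle. This is a generic illustration of the programming model, not code from any of the deployments mentioned above.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: runs on the nodes holding the input blocks and emits (word, 1).
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: sums the counts for each word after the shuffle.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

A job like this is packaged as a JAR and submitted to the cluster with a command such as hadoop jar wordcount.jar WordCount /input /output; the framework handles scheduling the map tasks near the data and collecting the reduced results.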

What is a Hadoop Distribution?

Hadoop itself provides a great framework of tools for storing and analyzing data, but enterprises make use of other tools that enable IT staff and users to write complex queries faster, provide better security, and facilitate very complex analysis and special-purpose computation on large datasets in a scalable, cost-effective manner. Cloudera's Distribution including Apache Hadoop (CDH) provides a single bundle of Hadoop-related projects in a package that is tested together and maintained the way enterprise CIOs expect software to be supported.

Summary

CDH is an ideal platform for consolidating large-scale data from a variety of new and legacy sources. It complements existing data management solutions with new analysis and processing tools. It delivers immediate value to federal organizations in need of a better understanding of their data.

For more on Hadoop see: http://cloudera.com

For Further Reference

Many other terms are used by technologists to describe the detailed features and functions provided in CDH. The following list may help you decipher the language of Big Data:

CDH: Cloudera's Distribution including Apache Hadoop. It contains HDFS, Hadoop MapReduce, Hive, Pig, HBase, Sqoop, Flume, Oozie, Zookeeper and Hue. When most people say Hadoop they mean CDH.

HDFS: The Hadoop Distributed File System. This is a scalable means of distributing data that takes advantage of commodity hardware. HDFS replicates all data in a location-aware manner to lessen internal datacenter network load, freeing the network for more complicated transactions.

Hadoop MapReduce: The framework that breaks jobs down across the Hadoop DataNodes and then reassembles the results from each into a coherent answer.

Hive: A data warehouse infrastructure that leverages the power of Hadoop. Hive provides tools that make it easy to summarize and query data. Hive puts structure on the data and gives users the ability to query it using familiar methods (like SQL). Hive also allows MapReduce programmers to enhance their queries.

Pig: A high-level data-flow language that enables advanced parallel computation. Pig makes parallel programming much easier.

HBase: A scalable, distributed database that supports structured data storage for large tables. It is used when you need random, realtime read/write access to your Big Data (see the short code sketch following this reference section). It enables hosting of very large tables, billions of rows by millions of columns, atop commodity hardware. It is a column-oriented store modeled after Google's BigTable and is optimized for realtime data. HBase has replaced Cassandra at Facebook.

Sqoop: A tool that enables importing data from SQL databases into Hadoop and exporting it back out.

Flume: A distributed, reliable and available service for efficiently collecting, aggregating and moving large amounts of streaming data.

Oozie: A workflow engine that enhances management of data processing jobs for Hadoop. It manages dependencies between jobs involving HDFS, Pig and MapReduce.

Zookeeper: A very high performance coordination service for distributed applications.

Hue: A browser-based desktop interface for interacting with Hadoop. It includes a file browser, a job tracker interface, a cluster health monitor and many other easy-to-use features.

What Do You Need To Know About Hadoop?

CDH is 100% open source and 100% Apache licensed. It is simplified, with all required component versions and dependencies managed for you. It is integrated, with all components and functions able to interoperate through standard APIs. It is reliable, with predictable release schedules, stability fixes and stress testing. It is an industry standard, so your existing RDBMS, ETL and BI systems work with it. And it is the tool you will need to manage the coming age of Big Data. We are producing more data than we can analyze with traditional methods. Our future requires Hadoop.

CTOlabs.com is a technology research, consulting and services agency associated with Crucial Point LLC. Crucial Point LLC focuses on the national security sector and the technologies required to enhance the security of the nation. Visit Crucial Point LLC online at http://crucialpointllc.com
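As one concrete illustration of the random, realtime read/write access described in the HBase entry above, the following minimal sketch uses the standard HBase Java client to write and read a single cell. The table name (applications), column family (info) and row key are hypothetical, chosen only for illustration; they do not come from this paper.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
    public static void main(String[] args) throws Exception {
        // Reads cluster addresses from hbase-site.xml on the classpath.
        Configuration conf = HBaseConfiguration.create();

        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("applications"))) {

            // Write one cell: row key, column family, qualifier, value.
            Put put = new Put(Bytes.toBytes("case-00042"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("status"),
                          Bytes.toBytes("under_review"));
            table.put(put);

            // Random, realtime read of the same row.
            Get get = new Get(Bytes.toBytes("case-00042"));
            Result result = table.get(get);
            String status = Bytes.toString(
                result.getValue(Bytes.toBytes("info"), Bytes.toBytes("status")));
            System.out.println("status = " + status);
        }
    }
}
```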

About the Author

Bob Gourley

Bob Gourley is the founder of Crucial Point LLC and CTOlabs.com, a provider of technology concepts, vendor evaluations and technology assessments focused on enterprise-grade mission needs. Mr. Gourley's first career was as a naval intelligence officer, which included operational tours afloat and ashore. He was the first J2 at DoD's cyber defense organization, the JTF-CND. Following retirement from the Navy, Mr. Gourley was a senior executive with TRW and Northrop Grumman, and then returned to government service as the Chief Technology Officer of the Defense Intelligence Agency.

Mr. Gourley was named one of the top 25 most influential CTOs in the world by Infoworld in 2007 and selected for AFCEA's award for meritorious service to the intelligence community in 2008. He was named by Washingtonian magazine as one of DC's Tech Titans in 2009, and one of the Top 25 Most Fascinating Communicators in Government IT by the Gov2.0 community GovFresh. He holds three master's degrees: a master of science in scientific and technical intelligence from the Naval Postgraduate School, a master of science in military science from Marine Corps University, and a master of science in computer science from James Madison University.

Mr. Gourley has published more than 40 articles on a wide range of topics and is a contributor to the book Threats in the Age of Obama (2009). He is a founding and current member of the board of directors of the Cyber Conflict Studies Association, and serves on the board of the Naval Intelligence Professionals, on the Intelligence Committee of AFCEA, and on the Cyber Committee of INSA.

For More Information

If you have questions or would like to discuss this report, please contact me. As an advocate for better IT in government, I am committed to keeping the dialogue open on technologies, processes and best practices that will keep us moving forward.

Contact: Bob Gourley
bob@crucialpointllc.com
703-994-0549

All information/data © 2011 CTOlabs.com.