White Paper: Hadoop for Intelligence Analysis


CTOlabs.com

White Paper: Hadoop for Intelligence Analysis
July 2011

A white paper providing context, tips and use cases on the topic of analysis over large quantities of data.

Inside:
- Apache Hadoop and Cloudera
- Intelligence Community Use Cases
- Context You Can Use Today

Hadoop and related capabilities bring new advantages to intelligence analysts.

Executive Summary

Intelligence analysis is all about dealing with Big Data: massive collections of unstructured information. The Intelligence Community already works with far more data than it can process, and it continues to collect more through new and evolving sensors, open-source intelligence, better information sharing, and continued human intelligence gathering. More information is always better, but to make use of it, analysis must keep pace through innovations in data management.

The Apache Hadoop Technology

Apache Hadoop is a project operating under the auspices of the Apache Software Foundation (ASF). The Hadoop project develops open source software for reliable, scalable, distributed computing. Hadoop is an exciting technology that can help analysts and agencies make the most of their data. Hadoop can inexpensively store any type of information from any source on commodity hardware and run fast, distributed analysis in parallel across the servers of a Hadoop cluster. Hadoop is reliable, managing and healing itself; scalable, working as well with one terabyte of data across three nodes as it does with petabytes of data across thousands; affordable, costing much less per terabyte to store and process data than traditional alternatives; and agile, applying schemas to data as it is read into the system, so schemas can evolve with the mission.

Cloudera is the leading provider of Hadoop-based software and services. Its product, Cloudera's Distribution including Apache Hadoop (CDH), is the most popular way to implement Hadoop. CDH is an open system assembled from the most useful projects in the Hadoop ecosystem, bundled together and simplified for use. As CDH is available for free download, it is a great place to start when implementing Hadoop, and Cloudera also offers support, management applications, and training to help users make the most of Hadoop.
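To make the distributed-analysis model concrete, below is a minimal sketch of the canonical MapReduce word count, written in Java against the standard Hadoop MapReduce API. It is an illustration, not part of CDH itself: the mapper runs in parallel on every block of input spread across the cluster, and the reducer reassembles the partial counts into totals. Input and output paths are supplied on the command line.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: runs in parallel on each block of the input files,
  // emitting (word, 1) for every token seen.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: reassembles the per-word partial counts into totals.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // pre-aggregate on each node
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. /data/in
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // e.g. /data/out
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Submitted with hadoop jar wordcount.jar WordCount /data/in /data/out, the same code runs unchanged whether the cluster has three nodes or three thousand, which is the linear scaling described above.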

Intelligence Community Use Cases

Given its work with Big Data, the Intelligence Community is a natural home for Hadoop. Most broadly, using Hadoop to manage data can lead to substantial savings. The Hadoop Distributed File System (HDFS) stores information for several cents per gigabyte per month, as opposed to traditional methods that cost dollars. HDFS achieves this by pooling commodity servers into a single hierarchical namespace, which works well with large files that are written once and read many times. With organizations everywhere on the lookout for efficient ways to modernize IT, this is an attractive feature: Hadoop can manage the same data for a fraction of the cost.

The unpredictable nature of intelligence developments also makes good use of Hadoop's scalability. Nodes can be added to or removed from a Hadoop cluster easily and seamlessly, and performance scales linearly, providing the agility to rapidly mobilize resources. Consider how fast Intelligence Community missions must shift: just a month after bin Laden's death, the community was boresighted on operations in Libya. Who knows where the next focus area will be? Agencies need the ability to shift computational power easily from one mission to another.

Hadoop also provides the agility to deal with whatever form or source of information a project needs. Most of the data critical to intelligence is unstructured, such as text, video, and images, and Hadoop can take data straight from the file store without any special transformation. Because this complex data can be stored and analyzed together, Hadoop can help intelligence agencies overcome information overload. Current technology and intelligence gathering methods produce expansive amounts of data, far more than analysts can reasonably hope to monitor on their own. At the end of 2010, for example, General Cartwright noted that it took 19 analysts to process the information gathered by a Predator drone, and that with the sensor technology now in development (dense data sensors meshing together video feeds that cover a city while simultaneously intercepting cell phone calls and e-mails) it would take 2,000 analysts to manually process the data gathered from a single drone. Algorithms are being developed to pull out and present to analysts what really matters from all of those video feeds and intercepts. Sorting through such an expanse of complex data is precisely the sort of challenge Hadoop was designed to tackle.
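The write-once, read-many pattern described above is visible directly in the HDFS Java API. The sketch below is a hedged illustration rather than production code: it writes a small file into the cluster and reads it back. The path and file contents are hypothetical, and the client picks up the NameNode address from the standard Hadoop configuration files.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsRoundTrip {
  public static void main(String[] args) throws Exception {
    // Reads core-site.xml etc.; fs.default.name (fs.defaultFS on newer
    // releases) points the client at the cluster's NameNode.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    Path report = new Path("/collections/osint/report-0001.txt"); // hypothetical path

    // Write once: HDFS splits the file into blocks and replicates each block
    // (three copies by default) across commodity DataNodes, location-aware.
    try (FSDataOutputStream out = fs.create(report)) {
      out.writeBytes("raw collection data\n");
    }

    // Read many: any client can stream the file back from the nearest replica.
    try (FSDataInputStream in = fs.open(report)) {
      byte[] buffer = new byte[4096];
      int read = in.read(buffer);
      System.out.print(new String(buffer, 0, read, "UTF-8"));
    }
  }
}
```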

The data also becomes more valuable when combined with information such as coordinates, times, and the subjects identified in videos. Hadoop can then sort and search through this multi-level, multi-media intelligence almost instantly, regardless of the amount and type of material generated.

Hadoop's low cost and speed open up other intelligence capabilities. Hadoop clusters have often been used as data sandboxes, where new analytics can be tested cheaply and quickly to see whether they yield results and deserve wide implementation. For analysts, this means they can test theories and algorithms, even ones with a low probability of success, without much wasted time or resources, allowing them to be more creative and thorough. This in turn helps prevent the failures of imagination blamed for misreading the intelligence before the September 11 attacks and several subsequent plots.

Hadoop is also well suited to evolving analysis techniques such as Social Network Analysis and textual analysis, both of which are being aggressively developed by intelligence agencies and contractors. Social Network Analysis uses human interactions, such as phone calls, text messages, meetings, and emails, to construct and decipher a social network, identifying leaders and the key nodes relied on for linking members, linking groups, getting exposure, and contacting other important members. Rarely are these members the figureheads in the media or even the stated leadership of terrorist and criminal organizations. Social Network Analysis is helpful for identifying high-value targets to exploit or eliminate, but for sizable organizations it involves a tremendous amount of data, thousands of interactions of varying types among thousands of members and associates, making Hadoop an excellent platform. Some projects, such as Klout, which applies Social Network Analysis to social media to determine user influence, style, and role, already run on Hadoop.

Hadoop has also proven itself as a platform for textual analysis. Large text repositories such as chat rooms, newspapers, or email inboxes are Big Data, expansive and unstructured, and hence well suited to analysis with Hadoop. IBM's Watson, which beat human contestants on Jeopardy!, is the most prominent example of the power of textual analysis, and it ran on Hadoop. Watson was able to look through and interpret libraries of text to form the right question for the answers presented in the game, from history to science to pop culture, faster and more accurately than the human champions it faced. Textual analysis has value beyond game shows, however: it can be applied to forums and correspondence to analyze sentiment and find hidden connections to people, places, and topics.
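Both techniques map naturally onto Hadoop. As a hedged sketch of the Social Network Analysis case, the job below collapses raw interaction records into a weighted edge list, a common starting point for identifying key nodes. The comma-separated "caller,callee" input format and the class names are hypothetical, and the driver (omitted) would be configured like the word-count job shown earlier.

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: one input line per interaction, e.g. "alice,bob" (hypothetical format).
// Emits an order-independent edge key so that A->B and B->A aggregate together.
public class EdgeWeightMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String[] parts = value.toString().split(",");
    if (parts.length != 2) {
      return; // skip malformed records rather than failing the job
    }
    String a = parts[0].trim();
    String b = parts[1].trim();
    String edge = a.compareTo(b) <= 0 ? a + "|" + b : b + "|" + a;
    context.write(new Text(edge), ONE);
  }
}

// Reducer: sums interactions per pair; unusually heavy edges point the analyst
// toward the relationships that hold the network together.
class EdgeWeightReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  @Override
  protected void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable v : values) {
      sum += v.get();
    }
    context.write(key, new IntWritable(sum));
  }
}
```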

For Further Reference

Many terms are used by technologists to describe the detailed features and functions provided in CDH. The following list may help you decipher the language of Big Data:

CDH: Cloudera's Distribution including Apache Hadoop. It contains HDFS, Hadoop MapReduce, Hive, Pig, HBase, Sqoop, Flume, Oozie, ZooKeeper, and Hue. When most people say "Hadoop" they mean CDH.

HDFS: The Hadoop Distributed File System, a scalable means of distributing data that takes advantage of commodity hardware. HDFS replicates all data in a location-aware manner so as to lessen internal datacenter network load, which frees the network for more complicated transactions.

Hadoop MapReduce: The processing framework that breaks jobs down across the Hadoop DataNodes and then reassembles the results from each into a coherent answer.

Hive: A data warehouse infrastructure that leverages the power of Hadoop. Hive provides tools for easy data summarization and ad hoc queries, puts structure on the data, and gives users the ability to query it using familiar methods such as SQL (see the sketch after this list). Hive also allows MapReduce programmers to enhance their queries.

Pig: A high-level data-flow language that enables advanced parallel computation. Pig makes parallel programming much easier.

HBase: A scalable, distributed database that supports structured data storage for large tables; use it when you need random, real-time read/write access to your Big Data. It can host very large tables, billions of rows by millions of columns, atop commodity hardware. It is a column-oriented store modeled after Google's BigTable and is optimized for real-time data. HBase has replaced Cassandra at Facebook.

Sqoop: A tool for importing and exporting data between SQL databases and Hadoop.

Flume: A distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of streaming data.

Oozie: A workflow engine that improves the management of data processing jobs for Hadoop and manages dependencies among HDFS, Pig, and MapReduce jobs.

ZooKeeper: A very high-performance coordination service for distributed applications.

Hue: A browser-based desktop interface for interacting with Hadoop. It includes a file browser, a job tracker interface, a cluster health monitor, and many other easy-to-use features.
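To make the Hive entry above concrete, here is a minimal, hedged sketch of querying Hive from Java over JDBC. The intercepts table, host name, and credentials are hypothetical; the driver class shown is the HiveServer2 driver, while older Hive releases used org.apache.hadoop.hive.jdbc.HiveDriver with a jdbc:hive:// URL instead. Hive compiles the SQL-like statement into MapReduce jobs that run across the cluster.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveTopSources {
  public static void main(String[] args) throws Exception {
    // Register the Hive JDBC driver (HiveServer2 flavor).
    Class.forName("org.apache.hive.jdbc.HiveDriver");

    // hive-host:10000 is a placeholder for your HiveServer2 endpoint.
    try (Connection conn = DriverManager.getConnection(
            "jdbc:hive2://hive-host:10000/default", "analyst", "");
         Statement stmt = conn.createStatement()) {

      // Hypothetical table of intercept metadata: Hive turns this familiar
      // SQL into parallel MapReduce work over files sitting in HDFS.
      ResultSet rs = stmt.executeQuery(
          "SELECT source, COUNT(*) AS msgs "
          + "FROM intercepts GROUP BY source "
          + "ORDER BY msgs DESC LIMIT 10");

      while (rs.next()) {
        System.out.println(rs.getString("source") + "\t" + rs.getLong("msgs"));
      }
    }
  }
}
```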

More Reading

For more use cases for Hadoop in the intelligence community, visit:

CTOvision.com - a blog for enterprise technologists with a special focus on Big Data.
CTOlabs.com - the repository for our research and reporting on all IT issues.
Cloudera.com - enterprise solutions around CDH, plus training, services and support.

About the Author

Alexander Olesker is a technology research analyst at Crucial Point LLC, focusing on disruptive technologies of interest to enterprise technologists. He writes at http://ctovision.com. Alex is a graduate of the Edmund A. Walsh School of Foreign Service at Georgetown University with a degree in Science, Technology, and International Affairs. He researches and writes on developments in technology and government best practices for CTOvision.com and CTOlabs.com, and has written numerous white papers on these subjects. Alex has worked or interned in early childhood education, private intelligence, law enforcement, and academia; contributed to numerous publications on technology, international affairs, and security; and lectured at Georgetown and in the Netherlands. He is also the founder and primary contributor of an international security blog that has been quoted and featured by numerous pundits and by the War Studies blog of King's College London. Alex is a fluent Russian speaker and is proficient in French.

Contact Alex at AOlesker@crucialpointllc.com

For More Information

If you have questions or would like to discuss this report, please contact me. As an advocate for better IT in government, I am committed to keeping the dialogue open on technologies, processes and best practices that will keep us moving forward.

Contact: Bob Gourley
bob@crucialpointllc.com
703-994-0549

All information/data © 2011 CTOlabs.com.