Big Data and Analytics: A Conceptual Overview. Mike Park Erik Hoel

Similar documents

How To Use Hadoop For Gis

Big Data and Analytics: Getting Started with ArcGIS. Mike Park Erik Hoel

Big Data: Using ArcGIS with Apache Hadoop. Erik Hoel and Mike Park

Big Data Spatial Analytics An Introduction

Chapter 7. Using Hadoop Cluster and MapReduce

Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data

Managing Big Data with Hadoop & Vertica. A look at integration between the Cloudera distribution for Hadoop and the Vertica Analytic Database

INTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE

Energy Efficient MapReduce

A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM

BIG DATA What it is and how to use?

BIG DATA TRENDS AND TECHNOLOGIES

A Performance Analysis of Distributed Indexing using Terrier

Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware

Investigating Hadoop for Large Spatiotemporal Processing Tasks

Data processing goes big

Mr. Apichon Witayangkurn Department of Civil Engineering The University of Tokyo

THE ATLAS DISTRIBUTED DATA MANAGEMENT SYSTEM & DATABASES

A Novel Cloud Based Elastic Framework for Big Data Preprocessing

Using Big Data and GIS to Model Aviation Fuel Burn

Trends and Research Opportunities in Spatial Big Data Analytics and Cloud Computing NCSU GeoSpatial Forum

Prepared By : Manoj Kumar Joshi & Vikas Sawhney

Hadoop and Map-Reduce. Swati Gore

Big Data and Market Surveillance. April 28, 2014

Latest Developments in Oceanographic Applications of GIS including!

Manifest for Big Data Pig, Hive & Jaql

Architectures for massive data management

Bringing Big Data Modelling into the Hands of Domain Experts

Big Data Technology Map-Reduce Motivation: Indexing in Search Engines

Hadoop. Sunday, November 25, 12

Scalable Cloud Computing Solutions for Next Generation Sequencing Data

Implement Hadoop jobs to extract business value from large and varied data sets

Performance Comparison of SQL based Big Data Analytics with Lustre and HDFS file systems

Big Data With Hadoop

Hadoop. MPDL-Frühstück 9. Dezember 2013 MPDL INTERN

High Performance Spatial Queries and Analytics for Spatial Big Data. Fusheng Wang. Department of Biomedical Informatics Emory University

Big Data on Microsoft Platform

An Approach to Implement Map Reduce with NoSQL Databases

How To Handle Big Data With A Data Scientist

International Journal of Advancements in Research & Technology, Volume 3, Issue 2, February ISSN

Real Time Big Data Processing

BigData. An Overview of Several Approaches. David Mera 16/12/2013. Masaryk University Brno, Czech Republic

ESS event: Big Data in Official Statistics. Antonino Virgillito, Istat

Monitis Project Proposals for AUA. September 2014, Yerevan, Armenia

Workshop on Hadoop with Big Data

Hadoop & its Usage at Facebook

How To Scale Out Of A Nosql Database

Are You Ready for Big Data?

Hadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh

W H I T E P A P E R. Deriving Intelligence from Large Data Using Hadoop and Applying Analytics. Abstract

BENCHMARKING CLOUD DATABASES CASE STUDY on HBASE, HADOOP and CASSANDRA USING YCSB

Big Data Processing with Google s MapReduce. Alexandru Costan

Hadoop & its Usage at Facebook

How Companies are! Using Spark

Associate Professor, Department of CSE, Shri Vishnu Engineering College for Women, Andhra Pradesh, India 2

Lecture 10: HBase! Claudia Hauff (Web Information Systems)!

NextGen Infrastructure for Big DATA Analytics.

Volume 3, Issue 6, June 2015 International Journal of Advance Research in Computer Science and Management Studies

CSE-E5430 Scalable Cloud Computing Lecture 2

Hadoop and its Usage at Facebook. Dhruba Borthakur June 22 rd, 2009

HDFS. Hadoop Distributed File System

Unified Big Data Processing with Apache Spark. Matei

A Brief Outline on Bigdata Hadoop

Oracle Big Data SQL Technical Update

An Oracle White Paper June High Performance Connectors for Load and Access of Data from Hadoop to Oracle Database

Integrating Big Data into the Computing Curricula

INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY

DATA MINING WITH HADOOP AND HIVE Introduction to Architecture

BIG DATA CAN DRIVE THE BUSINESS AND IT TO EVOLVE AND ADAPT RALPH KIMBALL BUSSUM 2014

Constructing a Data Lake: Hadoop and Oracle Database United!

An Industrial Perspective on the Hadoop Ecosystem. Eldar Khalilov Pavel Valov

Overview. Big Data in Apache Hadoop. - HDFS - MapReduce in Hadoop - YARN. Big Data Management and Analytics

COURSE CONTENT Big Data and Hadoop Training

Big Fast Data Hadoop acceleration with Flash. June 2013

Hadoop Big Data for Processing Data and Performing Workload

Big Data. White Paper. Big Data Executive Overview WP-BD Jafar Shunnar & Dan Raver. Page 1 Last Updated

Recognization of Satellite Images of Large Scale Data Based On Map- Reduce Framework

Map Reduce & Hadoop Recommended Text:

Big Data and Apache Hadoop s MapReduce

Similarity Search in a Very Large Scale Using Hadoop and HBase

Real-time Data Analytics mit Elasticsearch. Bernhard Pflugfelder inovex GmbH

Open source Google-style large scale data analysis with Hadoop

Distributed File System. MCSN N. Tonellotto Complements of Distributed Enabling Platforms

Petabyte Scale Data at Facebook. Dhruba Borthakur, Engineer at Facebook, SIGMOD, New York, June 2013

HiBench Introduction. Carson Wang Software & Services Group

Using distributed technologies to analyze Big Data

Integrating VoltDB with Hadoop

NoSQL and Hadoop Technologies On Oracle Cloud

Hadoop Ecosystem B Y R A H I M A.

Oracle Big Data Spatial and Graph

Data-Intensive Computing with Map-Reduce and Hadoop

Hadoop at Yahoo! Owen O Malley Yahoo!, Grid Team owen@yahoo-inc.com

A very short Intro to Hadoop

How to use Big Data in Industry 4.0 implementations. LAURI ILISON, PhD Head of Big Data and Machine Learning

Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related

Analytics in the Cloud. Peter Sirota, GM Elastic MapReduce

Application Development. A Paradigm Shift

CS2510 Computer Operating Systems

CS2510 Computer Operating Systems

Transcription:

Big Data and Analytics: A Conceptual Overview Mike Park Erik Hoel

In this technical workshop This presentation is for anyone that uses ArcGIS and is interested in analyzing large amounts of data We will cover: - Big Data overview - The Hadoop platform - How Esri s GIS Tools for Hadoop enables developers to process spatial data on Hadoop - Looking ahead

Big Data Within ArcGIS, Geoprocessing was enhanced at 10.1 SP1 to support 64-bit address spaces - This is sufficient to handle traditional large GIS datasets However, this solution may run into problems when confronted with datasets of a size that are colloquially referred to as Big Data - Internet scale datasets

Age of Data Ubiquity Data is now central to our existence both for corporations and individuals Nimble, thin, data-centric apps exploiting massive data sets generated by both enterprises and consumers Hardware era: 20 30 years Software era: 20 30 years Data era:?

Big data What is it? Big Data is a loosely defined term used to describe data sets so large and complex that they become awkward to work with using standard software in a tolerable elapsed time - Big data "size" is a constantly moving target, ranging from a few dozen terabytes to many petabytes of data - In the past three years, 90% of all recorded data has been generated Every 60 seconds: - 100,000 tweets - 2.4 million Google searches - 11 million instant messages - 170 million email messages - 1,800 TB of data

NYC Taxis by Day Manhattan Taxis Friday after 8pm 6

ArcGIS users have big data Smart Sensors - Electrical meters (AMI), SCADA, UAVs GPS Telemetry - Vehicle tracking, smartphone data collectors, workforce tracking, geofencing Internet data - Social media streams, web log files, customer sentiment Sensor data - Weather sensors, stream gauge measurements, heavy equipment monitors, Imagery - Satellites, frame cameras, drones 7

Value when analyzing data at mass scale As observations increase in frequency - Each individual observation is worth less - as the set of all observations becomes more valuable One single metric from the jet aircraft is much less useful than the analysis of that metric against the same metric from every known flight of that aircraft over time Big Data is the accumulation and analytical processes that uses this data for business value

Big challenges Data acquisition - Filtering and compressing - Planned Suare Kilometer Array Telescope one million terabytes per day Information extraction and cleaning - Extracting health info from X-rays and CT scans Data integration, aggregation, and representation - Heterogenious datasets Modeling and analysis - Nonstandard statistical analysis; very noisy, dynamic, and untrustworthy Interpretation - Decision making metadata, assumptions, very complex

Big data What techniques are applied to handle it? Data distribution Big large data datasets is are not split about into smaller the datasets data. and distributed across a collection of machines Gary King Parallel processing using a collection of machines to process the smaller datasets, combining the partial results together Harvard University Director, Inst. For Quantitative Social Science Fault tolerance making copies of the partitioned data to ensure that if a machine fails, the dataset can still be processed Commodity hardware using standard hardware that is not dependent upon exotic architectures, (Making topologies, the or data point storage that (e.g., while RAID) data is plentiful and easy to collect, the real value is in the analytics) Scalability algorithms and frameworks that can be easily scaled to run on larger collections of machines in order to address larger datasets

Legacy system architecture Processors Disks Network Bottleneck

Distributed system architecture Processing Elements

Distributed system architecture Processing Elements

Until Now Google implemented their enterprise on a distributed network of many nodes, fusing storage and processing into each node Hadoop is an open source implementation of the framework that Google has built their business around for many years

Apache Hadoop Overview Hadoop is a scalable open source framework for the distributed processing of extremely large data sets on clusters of commodity hardware - Maintained by the Apache Software Foundation - Assumes that hardware failures are common Hadoop is primarily used for: - Distributed storage - Distributed computation

Apache Hadoop Hadoop Clusters 20 flowdown PCs 5 7 years old Quad-core, 3GHz 16 GB RAM 1 TB fast disk Traditional Hadoop Clusters The Dredd Cluster

Apache Hadoop Distributed Storage The Hadoop Distributed File System (HDFS) is a hierarchical file system where datasets are organized into directories and files These files are accessed like regular files, however they are actually distributed throughout the Hadoop cluster earthquakes 2011.csv 2012.csv 2013.csv

MapReduce A programming model for processing data with a parallel distributed algorithm on a cluster A MapReduce program is comprised of: - A Map() procedure that performs filtering and comparing, and - A Reduce() procedure that performs a summary operation The MapReduce system - Marshals the distributed servers - Runs the various tasks in parallel - Manages all communications and data transfers - Provides for redundancy and failures MapReduce libraries have been written in many languages; Hadoop is a popular open source implementation

MapReduce Walkthrough An instance of a MapReduce program is a job The job accepts arguments for data input and outputs The job combines - Functions from the MapReduce framework - Splitting large inputs into smaller pieces - Reading inputs and writing outputs - Functions that are written by the application developer - Map function, maps input values to keys - Reduce function, reduces many keys to one

High Level MapReduce Walk-Through An instance of a MapReduce program is a job MapReduce Job

High Level MapReduce Walk-Through The job accepts arguments for data input and outputs MapReduce Job data hdfs://path/to/input hdfs://path/to/output

High Level MapReduce Walk-Through The MapReduce framework logically splits the data space MapReduce Job Split data hdfs://path/to/input hdfs://path/to/output

High Level MapReduce Walk-Through Many functions are run in parallel MapReduce Job Split data hdfs://path/to/input hdfs://path/to/output

High Level MapReduce Walk-Through The framework runs combine to join outputs MapReduce Job Split Combine data hdfs://path/to/input hdfs://path/to/output

High Level MapReduce Walk-Through The framework performs a shuffle and sort on all data MapReduce Job Split Combine Shuffle / Sort data hdfs://path/to/input hdfs://path/to/output

High Level MapReduce Walk-Through The reduce() functions work against the sorted data MapReduce Job Split Combine Shuffle / Sort data reduce() reduce() hdfs://path/to/input hdfs://path/to/output

High Level MapReduce Walk-Through Reducers each write a part of the results to a file MapReduce Job Split Combine Shuffle / Sort data reduce() part 1 reduce() part 2 hdfs://path/to/input hdfs://path/to/output

High Level MapReduce Walk-Through When the job has finished, ArcGIS can retrieve the results Split Combine Shuffle / Sort data reduce() part 1 reduce() part 2 hdfs://path/to/input hdfs://path/to/output

Input text file MapReduce Word Count Example Output text file red red green blue red green blue blue green green blue red blue Split Split 1 / Record 1 red red blue red green Split 2 / Record 1 green blue red green Split 3 / Record 1 green blue blue blue Map Map Map red 1 red 1 blue 1 red 1 green 1 green 1 blue 1 red 1 green 1 green 1 blue 1 blue 1 blue 1 Shuffle / Sort green 1 green 1 green 1 green 1 red 1 red 1 red 1 red 1 blue 1 blue 1 blue 1 blue 1 blue 1 Reduce Reduce green 4 red 4 blue 5 Output writer green 4 red 4 blue 5

Apache Hive Tables and Queries in Hadoop - Part of the Hadoop ecosystem - Provides the ability to interact with data in Hadoop as if it were tables with rows and columns - Supports a SQL-like language for querying the data in Hadoop Note: The source files are never modified by Hive. Tables are merely a view of the data. earthquakes 2011.csv 2012.csv 2013.csv HDFS 2011/06/29, 52.0, 172.0, 0.0, 7.6 2011/10/11, 50.71,-179.5, 0.0, 6.9 Comma separated values datetime latitude longitude depth magnitude 2011/06/29 52.0 172.0 0.0 7.6 2011/10/11 50.71-179.5 0.0 6.9 Hive table SELECT * FROM earthquakes WHERE magnitude > 7 Hive query

GIS Tools for Hadoop Esri on GitHub GIS Tools for Hadoop tools samples Spatial Framework for Hadoop hive Show list of samples and tools Tools and samples using the open source resources that solve specific problems Hive user-defined functions for spatial processing JSON helper utilities spatial-sdk-hive.jar json spatial-sdk-json.jar Geoprocessing Tools for Hadoop HadoopTools.pyt Geometry API Java esri-geometry-api.jar Geoprocessing tools that Copy to/from Hadoop Convert to/from JSON Invoke Hadoop Jobs Java geometry library for spatial data processing https://github.com/esri/gis-tools-for-hadoop

ST_Geometry Hive Spatial Type - Hive functions wrapped around the freely available Java Geometry API - Supports simple geometry types - Point, polyline, polygon - Distributed geometric operations - Topological - union, intersection, cut, - Relational - contains, intersects, crosses, - Aggregate - union, convex hull, intersection, - Others - geodesic distance, buffer, - Supports multiple geometry formats - Well-known text/binary - GeoJSON - Esri JSON Select a count of all points that are contained within the given polygon 1 SELECT count(*) FROM faa 2 WHERE ST_Contains('POLYGON ((-124 32, -124 42, -114 42, -114 32))', 3 ST_Point(faa.longitude, faa.latitude)) https://github.com/esri/spatial-framework-for-hadoop/wiki/hive-spatial

Spatial Aggregation Making Big Data Manageable - Reduces the size of the data - Maintains a summary of numeric attributes - Common aggregation patterns - Polygons - Aggregate points into polygons Polygon Aggregation - Bins - Aggregate points into bins defined by a grid Bin Aggregation

Spatial Aggregation Who needs it? I don t need it. I ll just draw everything.

FAA flight data 17.5 million points

Spatial Aggregation Why it s important - Mapping millions to billions of points in a small, dense area just isn t feasible - It s slow - It s difficult to identify patterns FAA flight data 17.5 million points

Spatial Aggregation Why it s important - Mapping millions to billions of points in a small, dense area just isn t feasible - It s slow - It s difficult to identify patterns

FAA flight data 17.5 million points (spatially binned)

Spatial Aggregation With Bins - Spatial bins are cells defined by a regular grid - Each cell contains a set of attributes summarizing all points that fall within the cell boundary

Spatial Aggregation With Bins Points in bin 2 2 speed = 10 0 1 2 3 4 5 speed = 30 Bin ID 6 7 8 Bin Size 2 count = 2 speed sum = 40 speed avg = 20

Spatial Aggregation Binning with Hive and GIS Tools - Hive functions for spatial binning - ST_Bin given a point, return an integer value that uniquely identifies the containing bin - ST_BinEnvelope given a point (or bin id), return the envelope (bounding box) of the bin - With these functions together, we can create a spatially binned table using Hive ST_Bin(0.5, 'POINT (10 10)') -> 4611685954723114540 ST_BinEnvelope(0.5, 'POINT (10 10)') -> POLYGON ((10 9.5, 10.5 9.5, 10.5 10, 10 10, 10 9.5)) ST_BinEnvelope(0.5, 4611685954723114540) -> POLYGON ((10 9.5, 10.5 9.5, 10.5 10, 10 10, 10 9.5))

Spatial Aggregation Spatial Query Example Bin size is 0.5 degrees 1 FROM (SELECT ST_Bin(0.5, ST_Point(longitude, latitude)) bin_id, * FROM faa) bins 2 SELECT ST_BinEnvelope(0.5, bin_id) shape, 3 COUNT(*) count 4 GROUP BY bin_id Source data is 17.5 million FAA flight points

Spatial Aggregation Spatial Query Example Subquery to augment each flight record with the ID of the bin that contains it 1 FROM (SELECT ST_Bin(0.5, ST_Point(longitude, latitude)) bin_id, * FROM faa) bins 2 SELECT ST_BinEnvelope(0.5, bin_id) shape, 3 COUNT(*) count 4 GROUP BY bin_id latitude longitude 41.733-90.65 34.966-119.05 42.016-76.75 40.966-120.73 37.616-122.03 48.466-122.51 32.766-116.58 33.383-97.633 34.75-119.41 39.466-104.93 FAA latitude longitude bin_id 41.733-90.65 46116857633 34.966-119.05 46116858028 42.016-76.75 46116857603 40.966-120.73 46116857664 37.616-122.03 46116857876 48.466-122.51 46116857208 32.766-116.58 46116858150 33.383-97.633 46116858119 34.75-119.41 46116858059 39.466-104.93 46116857755 FAA with bin ID

Spatial Aggregation Spatial Query Example One record per bin shape count POLYGON ((144.44 13.85, 144.54 13.85, 144.54 13.95, 144.44 13.95, 144.44 13.85)) 33 POLYGON ((144.34 13.75, 144.44 13.75, 144.44 13.85, 144.34 13.85, 144.34 13.75)) 27 1 FROM (SELECT ST_Bin(0.5, ST_Point(longitude, latitude)) bin_id, * FROM faa) bins 2 SELECT ST_BinEnvelope(0.5, bin_id) shape, 3 COUNT(*) count 4 GROUP BY bin_id POLYGON ((144.54 13.75, 144.64 13.75, 144.64 13.85, 144.54 13.85, 144.54 13.75)) 49 POLYGON ((144.94 13.75, 145.04 13.75, 145.04 13.85, 144.94 13.85, 144.94 13.75)) 39 POLYGON ((144.34 13.65, 144.44 13.65, 144.44 13.75, 144.34 13.75, 144.34 13.65)) 29 POLYGON ((144.94 13.65, 145.04 13.65, 145.04 13.75, 144.94 13.75, 144.94 13.65)) 75 POLYGON ((144.34 13.55, 144.44 13.55, 144.44 13.65, 144.34 13.65, 144.34 13.55)) 25 POLYGON ((144.94 13.55, 145.04 13.55, 145.04 13.65, 144.94 13.65, 144.94 13.55)) 71 POLYGON ((144.64 13.45, 144.74 13.45, 144.74 13.55, 144.64 13.55, 144.64 13.45)) 189 Select the envelope of the bin and a count of each point in that bin.

Spatial Aggregation With Polygons - Aggregate points into polygons instead of bins - Works best with a small set of polygons 1 SELECT counties.shape, count(*) 2 FROM counties JOIN faa 3 WHERE ST_Contains(counties.shape, 4 ST_Point(faa.longitude, faa.latitude)) 5 GROUP BY counties.shape;

GIS Tools for Hadoop Visualizing your data in ArcGIS - A model was used to bring the output of our Hive queries back into ArcGIS as a local feature class - Models let you chain geoprocessing tools together - The output of one tool becomes the input to the next tool

GIS Tools for Hadoop Visualizing your data in ArcGIS Copy From HDFS Copy a file or directory from Hadoop to a local file JSON To Features Create a feature class from a JSON file

GIS Tools for Hadoop Visualizing your data in ArcGIS Hadoop service information Hadoop path to the result of the Hive aggregation query Copy From HDFS Copy a file or directory from Hadoop to a local file Path to temporary file on local hard disk JSON To Features Create a feature class from a JSON fileoutput feature class to create JSON file type (usually unenclosed) Big Data and Analytics: The Fundamentals

Other relevant sessions Big Data and Analytics: Getting Started with ArcGIS - Th 10:15 11:30 (Room 4) Big Data and Analytics: Application Examples - We 1:00 1:30 (Tech Theater 16) Real-Time GIS: Applying Real-Time Analytics - We 8:30 9:45 (Room 14B) Real-Time GIS: The Road Ahead - We 1:30 2:45 (Room 14B) Road Ahead: ArcGIS for Server - We 10:15 11:30 (Room 7/8)

Thanks for coming Be sure to fill out the session survey using your mobile app - Select the session (search works well) - Select Technical Workshop Survey - Answer a couple questions and provide any comments you feel appropriate