Big Data and Analytics: A Conceptual Overview. Mike Park Erik Hoel
Transcription
1 Big Data and Analytics: A Conceptual Overview Mike Park Erik Hoel
2 In this technical workshop. This presentation is for anyone who uses ArcGIS and is interested in analyzing large amounts of data. We will cover: - Big Data overview - The Hadoop platform - How Esri's GIS Tools for Hadoop enables developers to process spatial data on Hadoop - Looking ahead
3 Big Data. Within ArcGIS, geoprocessing was enhanced at 10.1 SP1 to support 64-bit address spaces - This is sufficient to handle traditional large GIS datasets. However, this solution may run into problems when confronted with datasets of a size colloquially referred to as Big Data - Internet-scale datasets
4 Age of Data Ubiquity Data is now central to our existence both for corporations and individuals Nimble, thin, data-centric apps exploiting massive data sets generated by both enterprises and consumers Hardware era: years Software era: years Data era:?
5 Big data What is it? Big Data is a loosely defined term used to describe data sets so large and complex that they become awkward to work with using standard software in a tolerable elapsed time - Big data "size" is a constantly moving target, ranging from a few dozen terabytes to many petabytes of data - In the past three years, 90% of all recorded data has been generated Every 60 seconds: - 100,000 tweets million Google searches - 11 million instant messages million messages - 1,800 TB of data
6 NYC Taxis by Day; Manhattan Taxis, Friday after 8pm
7 ArcGIS users have big data. Smart Sensors - Electrical meters (AMI), SCADA, UAVs. GPS Telemetry - Vehicle tracking, smartphone data collectors, workforce tracking, geofencing. Internet data - Social media streams, web log files, customer sentiment. Sensor data - Weather sensors, stream gauge measurements, heavy equipment monitors. Imagery - Satellites, frame cameras, drones
8 Value when analyzing data at mass scale. As observations increase in frequency, each individual observation is worth less, while the set of all observations becomes more valuable. One single metric from a jet aircraft is much less useful than the analysis of that metric against the same metric from every known flight of that aircraft over time. Big Data is the accumulation of this data and the analytical processes that use it for business value
9 Big challenges. Data acquisition - Filtering and compressing - Planned Square Kilometer Array Telescope: one million terabytes per day. Information extraction and cleaning - Extracting health info from X-rays and CT scans. Data integration, aggregation, and representation - Heterogeneous datasets. Modeling and analysis - Nonstandard statistical analysis; very noisy, dynamic, and untrustworthy. Interpretation - Decision making: metadata, assumptions, very complex
10 Big data What techniques are applied to handle it?
"Big data is not about the data." Gary King, Harvard University, Director, Inst. for Quantitative Social Science (Making the point that while data is plentiful and easy to collect, the real value is in the analytics)
- Data distribution: large datasets are split into smaller datasets and distributed across a collection of machines
- Parallel processing: using a collection of machines to process the smaller datasets, combining the partial results together
- Fault tolerance: making copies of the partitioned data to ensure that if a machine fails, the dataset can still be processed
- Commodity hardware: using standard hardware that is not dependent upon exotic architectures, topologies, or data storage (e.g., RAID)
- Scalability: algorithms and frameworks that can be easily scaled to run on larger collections of machines in order to address larger datasets
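The data distribution and parallel processing ideas on this slide can be illustrated with a small Python sketch. It is only an analogy: worker threads on one machine stand in for the machines of a cluster, and the function names (split_into_chunks, partial_sum, distributed_sum) are made up for this example rather than taken from Hadoop.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(values, chunk_count):
    """Data distribution: split a list into roughly equal contiguous chunks."""
    size = max(1, -(-len(values) // chunk_count))  # ceiling division
    return [values[i:i + size] for i in range(0, len(values), size)]


def partial_sum(chunk):
    """Work done independently on one chunk (stands in for one machine)."""
    return sum(chunk)


def distributed_sum(values, workers=4):
    """Parallel processing: compute partial results, then combine them."""
    chunks = split_into_chunks(values, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(partial_sum, chunks))
    return sum(partials)  # combine the partial results


total = distributed_sum(list(range(1, 101)))
print(total)
```

The sketch leaves out fault tolerance entirely; a real framework would also keep copies of each chunk so that a failed worker's share can be reprocessed elsewhere.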
11 Legacy system architecture Processors Disks Network Bottleneck
12 Distributed system architecture Processing Elements
13 Distributed system architecture Processing Elements
14 Until Now Google implemented their enterprise on a distributed network of many nodes, fusing storage and processing into each node Hadoop is an open source implementation of the framework that Google has built their business around for many years
15 Apache Hadoop Overview Hadoop is a scalable open source framework for the distributed processing of extremely large data sets on clusters of commodity hardware - Maintained by the Apache Software Foundation - Assumes that hardware failures are common Hadoop is primarily used for: - Distributed storage - Distributed computation
16 Apache Hadoop: Hadoop Clusters. 20 flowdown PCs, 5-7 years old, quad-core 3 GHz, 16 GB RAM, 1 TB fast disk. Traditional Hadoop Clusters; The Dredd Cluster
17 Apache Hadoop: Distributed Storage. The Hadoop Distributed File System (HDFS) is a hierarchical file system where datasets are organized into directories and files. These files are accessed like regular files; however, they are actually distributed throughout the Hadoop cluster. Example: an earthquakes directory containing 2011.csv, 2012.csv, 2013.csv
18 MapReduce. A programming model for processing data with a parallel, distributed algorithm on a cluster. A MapReduce program is composed of: - A Map() procedure that performs filtering and sorting, and - A Reduce() procedure that performs a summary operation. The MapReduce system - Marshals the distributed servers - Runs the various tasks in parallel - Manages all communications and data transfers - Provides for redundancy and fault tolerance. MapReduce libraries have been written in many languages; Hadoop is a popular open source implementation
19 MapReduce Walkthrough. An instance of a MapReduce program is a job. The job accepts arguments for data input and outputs. The job combines: - Functions from the MapReduce framework (splitting large inputs into smaller pieces; reading inputs and writing outputs) - Functions that are written by the application developer (map function: maps each input record to key/value pairs; reduce function: reduces the many values for a key to one result)
20 High Level MapReduce Walk-Through An instance of a MapReduce program is a job MapReduce Job
21 High Level MapReduce Walk-Through The job accepts arguments for data input and outputs MapReduce Job data hdfs://path/to/input hdfs://path/to/output
22 High Level MapReduce Walk-Through The MapReduce framework logically splits the data space MapReduce Job Split data hdfs://path/to/input hdfs://path/to/output
23 High Level MapReduce Walk-Through Many functions are run in parallel MapReduce Job Split data hdfs://path/to/input hdfs://path/to/output
24 High Level MapReduce Walk-Through The framework runs a combine step to join the map outputs MapReduce Job Split Combine data hdfs://path/to/input hdfs://path/to/output
25 High Level MapReduce Walk-Through The framework performs a shuffle and sort on all data MapReduce Job Split Combine Shuffle / Sort data hdfs://path/to/input hdfs://path/to/output
26 High Level MapReduce Walk-Through The reduce() functions work against the sorted data MapReduce Job Split Combine Shuffle / Sort data reduce() reduce() hdfs://path/to/input hdfs://path/to/output
27 High Level MapReduce Walk-Through Reducers each write a part of the results to a file MapReduce Job Split Combine Shuffle / Sort data reduce() part 1 reduce() part 2 hdfs://path/to/input hdfs://path/to/output
28 High Level MapReduce Walk-Through When the job has finished, ArcGIS can retrieve the results Split Combine Shuffle / Sort data reduce() part 1 reduce() part 2 hdfs://path/to/input hdfs://path/to/output
29 MapReduce Word Count Example
Input text file, split into three records:
- Split 1 / Record 1: red red blue red green
- Split 2 / Record 1: green blue red green
- Split 3 / Record 1: green blue blue blue
Map (one map per split):
- Split 1: red 1, red 1, blue 1, red 1, green 1
- Split 2: green 1, blue 1, red 1, green 1
- Split 3: green 1, blue 1, blue 1, blue 1
Shuffle / Sort:
- green 1, green 1, green 1, green 1
- red 1, red 1, red 1, red 1
- blue 1, blue 1, blue 1, blue 1, blue 1
Reduce: green 4, red 4, blue 5
Output writer, output text file: green 4, red 4, blue 5
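The word count walkthrough can be reproduced in a few lines of Python. This is a single-process sketch of the map, shuffle/sort, and reduce phases using the three splits from the slide; the function names are illustrative and are not Hadoop APIs.

```python
from collections import defaultdict


def map_words(record):
    """Map: emit a (word, 1) pair for every word in one input record."""
    return [(word, 1) for word in record.split()]


def shuffle_and_sort(pairs):
    """Shuffle / sort: group all emitted values by key, ordered by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(sorted(grouped.items()))


def reduce_counts(word, ones):
    """Reduce: collapse the list of 1s for a word into a single count."""
    return word, sum(ones)


# The three splits shown on the slide.
splits = [
    "red red blue red green",
    "green blue red green",
    "green blue blue blue",
]

mapped = [pair for record in splits for pair in map_words(record)]
grouped = shuffle_and_sort(mapped)
word_counts = dict(reduce_counts(word, ones) for word, ones in grouped.items())
print(word_counts)
```

In a real job each split would be mapped on a different node and the grouped keys would be divided among several reducers; the data flow is the same.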
30 Apache Hive: Tables and Queries in Hadoop
- Part of the Hadoop ecosystem
- Provides the ability to interact with data in Hadoop as if it were tables with rows and columns
- Supports a SQL-like language for querying the data in Hadoop
Note: The source files are never modified by Hive. Tables are merely a view of the data.
HDFS: earthquakes (2011.csv, 2012.csv, 2013.csv)
Comma separated values: 2011/06/29, 52.0, 172.0, 0.0, /10/11, 50.71,-179.5, 0.0, 6.9
Hive table: datetime latitude longitude depth magnitude 2011/06/ /10/
Hive query: SELECT * FROM earthquakes WHERE magnitude > 7
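The idea of a table as a read-only view over CSV text can be sketched in Python. The column names come from the slide; the three sample rows are invented for illustration (they are not the slide's data), and read_table is a hypothetical helper, not part of Hive.

```python
import csv
import io

# Column names as listed on the slide's Hive table.
COLUMNS = ["datetime", "latitude", "longitude", "depth", "magnitude"]

# Made-up sample rows for illustration only.
SAMPLE_CSV = """2011/01/01,10.0,20.0,5.0,7.2
2011/02/01,-5.5,120.25,33.0,6.1
2012/03/01,35.0,-140.0,10.0,7.8
"""


def read_table(text):
    """Interpret CSV text as rows with named, typed columns (text is not modified)."""
    rows = []
    for raw in csv.reader(io.StringIO(text)):
        if not raw:
            continue  # skip blank lines
        row = dict(zip(COLUMNS, (value.strip() for value in raw)))
        for name in COLUMNS[1:]:
            row[name] = float(row[name])
        rows.append(row)
    return rows


# Rough equivalent of: SELECT * FROM earthquakes WHERE magnitude > 7
strong = [row for row in read_table(SAMPLE_CSV) if row["magnitude"] > 7]
for row in strong:
    print(row["datetime"], row["magnitude"])
```

As with Hive, the schema is applied when the data is read; the underlying text is never rewritten.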
31 GIS Tools for Hadoop: Esri on GitHub
- GIS Tools for Hadoop (tools, samples): tools and samples using the open source resources that solve specific problems
- Spatial Framework for Hadoop: hive (spatial-sdk-hive.jar), Hive user-defined functions for spatial processing; json (spatial-sdk-json.jar), JSON helper utilities
- Geoprocessing Tools for Hadoop (HadoopTools.pyt): geoprocessing tools that copy to/from Hadoop, convert to/from JSON, and invoke Hadoop jobs
- Geometry API Java (esri-geometry-api.jar): Java geometry library for spatial data processing
32 ST_Geometry: Hive Spatial Type
- Hive functions wrapped around the freely available Java Geometry API
- Supports simple geometry types: point, polyline, polygon
- Distributed geometric operations: topological (union, intersection, cut, ...); relational (contains, intersects, crosses, ...); aggregate (union, convex hull, intersection, ...); others (geodesic distance, buffer, ...)
- Supports multiple geometry formats: well-known text/binary, GeoJSON, Esri JSON
Select a count of all points that are contained within the given polygon:
  SELECT count(*) FROM faa
  WHERE ST_Contains('POLYGON (( , , , ))',
                    ST_Point(faa.longitude, faa.latitude))
33 Spatial Aggregation Making Big Data Manageable - Reduces the size of the data - Maintains a summary of numeric attributes - Common aggregation patterns - Polygons - Aggregate points into polygons Polygon Aggregation - Bins - Aggregate points into bins defined by a grid Bin Aggregation
34 Spatial Aggregation Who needs it? I don't need it. I'll just draw everything.
35 FAA flight data 17.5 million points
36 Spatial Aggregation Why it's important - Mapping millions to billions of points in a small, dense area just isn't feasible - It's slow - It's difficult to identify patterns FAA flight data 17.5 million points
37 Spatial Aggregation Why it's important - Mapping millions to billions of points in a small, dense area just isn't feasible - It's slow - It's difficult to identify patterns
38 FAA flight data 17.5 million points (spatially binned)
39 Spatial Aggregation With Bins - Spatial bins are cells defined by a regular grid - Each cell contains a set of attributes summarizing all points that fall within the cell boundary
40 Spatial Aggregation With Bins. Points in bin 2: speed = 10, speed = 30. Bin 2 summary: count = 2, speed sum = 40, speed avg = 20. (Diagram labels: Bin ID, Bin Size)
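Bin aggregation of this kind can be sketched in plain Python. The grid-cell key used here, a (column, row) pair from floor division, is an assumption made for the example and is not the numbering scheme of the ST_Bin function; the point coordinates are invented, with two speeds chosen to give the slide's summary of count 2, sum 40, average 20.

```python
import math
from collections import defaultdict


def bin_key(x, y, bin_size):
    """Identify the grid cell containing (x, y) as a (column, row) pair."""
    return (math.floor(x / bin_size), math.floor(y / bin_size))


def aggregate_by_bin(points, bin_size):
    """Summarize (x, y, speed) points into one record per occupied bin."""
    totals = defaultdict(lambda: {"count": 0, "speed_sum": 0.0})
    for x, y, speed in points:
        cell = totals[bin_key(x, y, bin_size)]
        cell["count"] += 1
        cell["speed_sum"] += speed
    return {
        key: {
            "count": cell["count"],
            "speed_sum": cell["speed_sum"],
            "speed_avg": cell["speed_sum"] / cell["count"],
        }
        for key, cell in totals.items()
    }


# Illustrative points: two fall in the same 0.5-degree bin, one elsewhere.
points = [(10.1, 10.2, 10.0), (10.4, 10.3, 30.0), (11.7, 9.9, 55.0)]
summary = aggregate_by_bin(points, 0.5)
print(summary[(20, 20)])
```

Only occupied bins are stored, so the size of the result depends on how many cells contain data rather than on the extent of the grid.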
41 Spatial Aggregation: Binning with Hive and GIS Tools
- Hive functions for spatial binning
- ST_Bin: given a point, return an integer value that uniquely identifies the containing bin
- ST_BinEnvelope: given a point (or bin id), return the envelope (bounding box) of the bin
- With these functions together, we can create a spatially binned table using Hive
Examples:
  ST_Bin(0.5, 'POINT (10 10)') ->
  ST_BinEnvelope(0.5, 'POINT (10 10)') -> POLYGON ((10 9.5, , , 10 10, ))
  ST_BinEnvelope(0.5, ) -> POLYGON ((10 9.5, , , 10 10, ))
42 Spatial Aggregation: Spatial Query Example. Bin size is 0.5 degrees. Source data is 17.5 million FAA flight points.
  FROM (SELECT ST_Bin(0.5, ST_Point(longitude, latitude)) bin_id, * FROM faa) bins
  SELECT ST_BinEnvelope(0.5, bin_id) shape,
         COUNT(*) count
  GROUP BY bin_id
43 Spatial Aggregation: Spatial Query Example. The subquery in the FROM clause augments each flight record with the ID of the bin that contains it: FAA (latitude, longitude) becomes FAA with bin ID (latitude, longitude, bin_id).
44 Spatial Aggregation: Spatial Query Example. One record per bin: select the envelope of the bin and a count of the points in that bin. Result (shape, count):
  POLYGON (( , , , , )) 33
  POLYGON (( , , , , )) 27
  POLYGON (( , , , , )) 49
  POLYGON (( , , , , )) 39
  POLYGON (( , , , , )) 29
  POLYGON (( , , , , )) 75
  POLYGON (( , , , , )) 25
  POLYGON (( , , , , )) 71
  POLYGON (( , , , , )) 189
45 Spatial Aggregation With Polygons - Aggregate points into polygons instead of bins - Works best with a small set of polygons
  SELECT counties.shape, count(*)
  FROM counties JOIN faa
  WHERE ST_Contains(counties.shape,
                    ST_Point(faa.longitude, faa.latitude))
  GROUP BY counties.shape;
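The polygon aggregation pattern can be sketched in Python with a standard ray-casting containment test. This is not how ST_Contains is implemented; it is a simplified stand-in that handles single-ring polygons only, and the two square "counties" and four "flights" are invented for the example.

```python
def point_in_polygon(x, y, ring):
    """Ray casting: count how many polygon edges a ray to the right of (x, y) crosses."""
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # edge spans the horizontal line through y
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def count_points_per_polygon(polygons, points):
    """Compare every point with every polygon, like the join in the query above."""
    counts = {name: 0 for name in polygons}
    for x, y in points:
        for name, ring in polygons.items():
            if point_in_polygon(x, y, ring):
                counts[name] += 1
    return counts


# Hypothetical polygons and points.
counties = {
    "west": [(0, 0), (5, 0), (5, 5), (0, 5)],
    "east": [(5, 0), (10, 0), (10, 5), (5, 5)],
}
flights = [(1, 1), (2, 4), (7, 3), (12, 1)]
counts = count_points_per_polygon(counties, flights)
print(counts)
```

Every point is tested against every polygon, which is why the slide notes that this approach works best with a small set of polygons. Unlike the Hive query, the sketch also reports polygons that contain no points.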
46 GIS Tools for Hadoop Visualizing your data in ArcGIS - A model was used to bring the output of our Hive queries back into ArcGIS as a local feature class - Models let you chain geoprocessing tools together - The output of one tool becomes the input to the next tool
47 GIS Tools for Hadoop Visualizing your data in ArcGIS Copy From HDFS Copy a file or directory from Hadoop to a local file JSON To Features Create a feature class from a JSON file
48 GIS Tools for Hadoop: Visualizing your data in ArcGIS. Copy From HDFS (copy a file or directory from Hadoop to a local file). Parameters: Hadoop service information; Hadoop path to the result of the Hive aggregation query; path to temporary file on local hard disk. JSON To Features (create a feature class from a JSON file). Parameters: output feature class to create; JSON file type (usually unenclosed)
49 Other relevant sessions. Big Data and Analytics: Getting Started with ArcGIS - Th 10:15-11:30 (Room 4). Big Data and Analytics: Application Examples - We 1:00-1:30 (Tech Theater 16). Real-Time GIS: Applying Real-Time Analytics - We 8:30-9:45 (Room 14B). Real-Time GIS: The Road Ahead - We 1:30-2:45 (Room 14B). Road Ahead: ArcGIS for Server - We 10:15-11:30 (Room 7/8)
50 Thanks for coming Be sure to fill out the session survey using your mobile app - Select the session (search works well) - Select Technical Workshop Survey - Answer a couple questions and provide any comments you feel appropriate
How To Use Hadoop For Gis
2013 Esri International User Conference July 8 12, 2013 San Diego, California Technical Workshop Big Data: Using ArcGIS with Apache Hadoop David Kaiser Erik Hoel Offering 1330 Esri UC2013. Technical Workshop.
Big Data and Analytics: Getting Started with ArcGIS. Mike Park Erik Hoel
Big Data and Analytics: Getting Started with ArcGIS Mike Park Erik Hoel Agenda Overview of big data Distributed computation User experience Data management Big data What is it? Big Data is a loosely defined
Big Data: Using ArcGIS with Apache Hadoop. Erik Hoel and Mike Park
Big Data: Using ArcGIS with Apache Hadoop Erik Hoel and Mike Park Outline Overview of Hadoop Adding GIS capabilities to Hadoop Integrating Hadoop with ArcGIS Apache Hadoop What is Hadoop? Hadoop is a scalable
Big Data Spatial Analytics An Introduction
2013 Esri International User Conference July 8 12, 2013 San Diego, California Technical Workshop Big Data Spatial Analytics An Introduction Marwa Mabrouk Mansour Raad Esri iu UC2013. Technical Workshop
Chapter 7. Using Hadoop Cluster and MapReduce
Chapter 7 Using Hadoop Cluster and MapReduce Modeling and Prototyping of RMS for QoS Oriented Grid Page 152 7. Using Hadoop Cluster and MapReduce for Big Data Problems The size of the databases used in
Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data
Introduction to Hadoop HDFS and Ecosystems ANSHUL MITTAL Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data Topics The goal of this presentation is to give
Managing Big Data with Hadoop & Vertica. A look at integration between the Cloudera distribution for Hadoop and the Vertica Analytic Database
Managing Big Data with Hadoop & Vertica A look at integration between the Cloudera distribution for Hadoop and the Vertica Analytic Database Copyright Vertica Systems, Inc. October 2009 Cloudera and Vertica
INTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE
INTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE AGENDA Introduction to Big Data Introduction to Hadoop HDFS file system Map/Reduce framework Hadoop utilities Summary BIG DATA FACTS In what timeframe
Energy Efficient MapReduce
Energy Efficient MapReduce Motivation: Energy consumption is an important aspect of datacenters efficiency, the total power consumption in the united states has doubled from 2000 to 2005, representing
A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM
A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM Sneha D.Borkar 1, Prof.Chaitali S.Surtakar 2 Student of B.E., Information Technology, J.D.I.E.T, [email protected] Assistant Professor, Information
BIG DATA What it is and how to use?
BIG DATA What it is and how to use? Lauri Ilison, PhD Data Scientist 21.11.2014 Big Data definition? There is no clear definition for BIG DATA BIG DATA is more of a concept than precise term 1 21.11.14
BIG DATA TRENDS AND TECHNOLOGIES
BIG DATA TRENDS AND TECHNOLOGIES THE WORLD OF DATA IS CHANGING Cloud WHAT IS BIG DATA? Big data are datasets that grow so large that they become awkward to work with using onhand database management tools.
A Performance Analysis of Distributed Indexing using Terrier
A Performance Analysis of Distributed Indexing using Terrier Amaury Couste Jakub Kozłowski William Martin Indexing Indexing Used by search
Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware
Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware Created by Doug Cutting and Mike Carafella in 2005. Cutting named the program after
Investigating Hadoop for Large Spatiotemporal Processing Tasks
Investigating Hadoop for Large Spatiotemporal Processing Tasks David Strohschein [email protected] Stephen Mcdonald [email protected] Benjamin Lewis [email protected] Weihe
Data processing goes big
Test report: Integration Big Data Edition Data processing goes big Dr. Götz Güttich Integration is a powerful set of tools to access, transform, move and synchronize data. With more than 450 connectors,
Mr. Apichon Witayangkurn [email protected] Department of Civil Engineering The University of Tokyo
Sensor Network Messaging Service Hive/Hadoop Mr. Apichon Witayangkurn [email protected] Department of Civil Engineering The University of Tokyo Contents 1 Introduction 2 What & Why Sensor Network
THE ATLAS DISTRIBUTED DATA MANAGEMENT SYSTEM & DATABASES
THE ATLAS DISTRIBUTED DATA MANAGEMENT SYSTEM & DATABASES Vincent Garonne, Mario Lassnig, Martin Barisits, Thomas Beermann, Ralph Vigne, Cedric Serfon [email protected] [email protected] XLDB
A Novel Cloud Based Elastic Framework for Big Data Preprocessing
School of Systems Engineering A Novel Cloud Based Elastic Framework for Big Data Preprocessing Omer Dawelbeit and Rachel McCrindle October 21, 2014 University of Reading 2008 www.reading.ac.uk Overview
Using Big Data and GIS to Model Aviation Fuel Burn
Using Big Data and GIS to Model Aviation Fuel Burn Gary M. Baker USDOT Volpe Center 2015 Transportation DataPalooza June 17, 2015 The National Transportation Systems Center Advancing transportation innovation
Trends and Research Opportunities in Spatial Big Data Analytics and Cloud Computing NCSU GeoSpatial Forum
Trends and Research Opportunities in Spatial Big Data Analytics and Cloud Computing NCSU GeoSpatial Forum Siva Ravada Senior Director of Development Oracle Spatial and MapViewer 2 Evolving Technology Platforms
Prepared By : Manoj Kumar Joshi & Vikas Sawhney
Prepared By : Manoj Kumar Joshi & Vikas Sawhney General Agenda Introduction to Hadoop Architecture Acknowledgement Thanks to all the authors who left their selfexplanatory images on the internet. Thanks
Hadoop and Map-Reduce. Swati Gore
Hadoop and Map-Reduce Swati Gore Contents Why Hadoop? Hadoop Overview Hadoop Architecture Working Description Fault Tolerance Limitations Why Map-Reduce not MPI Distributed sort Why Hadoop? Existing Data
Big Data and Market Surveillance. April 28, 2014
Big Data and Market Surveillance April 28, 2014 Copyright 2014 Scila AB. All rights reserved. Scila AB reserves the right to make changes to the information contained herein without prior notice. No part
Latest Developments in Oceanographic Applications of GIS including!
Latest Developments in Oceanographic Applications of GIS including! Near-Real-Time Interactive Map Exemplars & Scientific Empowerment Through Storytelling Dawn Wright Esri Chief Scientist Affiliate Professor,
Manifest for Big Data Pig, Hive & Jaql
Manifest for Big Data Pig, Hive & Jaql Ajay Chotrani, Priyanka Punjabi, Prachi Ratnani, Rupali Hande Final Year Student, Dept. of Computer Engineering, V.E.S.I.T, Mumbai, India Faculty, Computer Engineering,
Architectures for massive data management
Architectures for massive data management Apache Kafka, Samza, Storm Albert Bifet [email protected] October 20, 2015 Stream Engine Motivation Digital Universe EMC Digital Universe with
Bringing Big Data Modelling into the Hands of Domain Experts
Bringing Big Data Modelling into the Hands of Domain Experts David Willingham Senior Application Engineer MathWorks [email protected] 2015 The MathWorks, Inc. 1 Data is the sword of the
Big Data Technology Map-Reduce Motivation: Indexing in Search Engines
Big Data Technology Map-Reduce Motivation: Indexing in Search Engines Edward Bortnikov & Ronny Lempel Yahoo Labs, Haifa Indexing in Search Engines Information Retrieval s two main stages: Indexing process
Hadoop. http://hadoop.apache.org/ Sunday, November 25, 12
Hadoop http://hadoop.apache.org/ What Is Apache Hadoop? The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using
Scalable Cloud Computing Solutions for Next Generation Sequencing Data
Scalable Cloud Computing Solutions for Next Generation Sequencing Data Matti Niemenmaa 1, Aleksi Kallio 2, André Schumacher 1, Petri Klemelä 2, Eija Korpelainen 2, and Keijo Heljanko 1 1 Department of
Implement Hadoop jobs to extract business value from large and varied data sets
Hadoop Development for Big Data Solutions: Hands-On You Will Learn How To: Implement Hadoop jobs to extract business value from large and varied data sets Write, customize and deploy MapReduce jobs to
Performance Comparison of SQL based Big Data Analytics with Lustre and HDFS file systems
Performance Comparison of SQL based Big Data Analytics with Lustre and HDFS file systems Rekha Singhal and Gabriele Pacciucci * Other names and brands may be claimed as the property of others. Lustre File
Big Data With Hadoop
With Saurabh Singh [email protected] The Ohio State University February 11, 2016 Overview 1 2 3 Requirements Ecosystem Resilient Distributed Datasets (RDDs) Example Code vs Mapreduce 4 5 Source: [Tutorials
Hadoop. MPDL-Frühstück 9. Dezember 2013 MPDL INTERN
Hadoop MPDL-Frühstück 9. Dezember 2013 MPDL INTERN Understanding Hadoop Understanding Hadoop What's Hadoop about? Apache Hadoop project (started 2008) downloadable open-source software library (current
High Performance Spatial Queries and Analytics for Spatial Big Data. Fusheng Wang. Department of Biomedical Informatics Emory University
High Performance Spatial Queries and Analytics for Spatial Big Data Fusheng Wang Department of Biomedical Informatics Emory University Introduction Spatial Big Data Geo-crowdsourcing:OpenStreetMap Remote
Big Data on Microsoft Platform
Big Data on Microsoft Platform Prepared by GJ Srinivas Corporate TEG - Microsoft Page 1 Contents 1. What is Big Data?...3 2. Characteristics of Big Data...3 3. Enter Hadoop...3 4. Microsoft Big Data Solutions...4
An Approach to Implement Map Reduce with NoSQL Databases
www.ijecs.in International Journal Of Engineering And Computer Science ISSN: 2319-7242 Volume 4 Issue 8 Aug 2015, Page No. 13635-13639 An Approach to Implement Map Reduce with NoSQL Databases Ashutosh
How To Handle Big Data With A Data Scientist
III Big Data Technologies Today, new technologies make it possible to realize value from Big Data. Big data technologies can replace highly customized, expensive legacy systems with a standard solution
International Journal of Advancements in Research & Technology, Volume 3, Issue 2, February-2014 10 ISSN 2278-7763
International Journal of Advancements in Research & Technology, Volume 3, Issue 2, February-2014 10 A Discussion on Testing Hadoop Applications Sevuga Perumal Chidambaram ABSTRACT The purpose of analysing
Real Time Big Data Processing
Real Time Big Data Processing Cloud Expo 2014 Ian Meyers Amazon Web Services Global Infrastructure Deployment & Administration App Services Analytics Compute Storage Database Networking AWS Global Infrastructure
BigData. An Overview of Several Approaches. David Mera 16/12/2013. Masaryk University Brno, Czech Republic
BigData An Overview of Several Approaches David Mera Masaryk University Brno, Czech Republic 16/12/2013 Table of Contents 1 Introduction 2 Terminology 3 Approaches focused on batch data processing MapReduce-Hadoop
ESS event: Big Data in Official Statistics. Antonino Virgillito, Istat
ESS event: Big Data in Official Statistics Antonino Virgillito, Istat v erbi v is 1 About me Head of Unit Web and BI Technologies, IT Directorate of Istat Project manager and technical coordinator of Web
Monitis Project Proposals for AUA. September 2014, Yerevan, Armenia
Monitis Project Proposals for AUA September 2014, Yerevan, Armenia Distributed Log Collecting and Analysing Platform Project Specifications Category: Big Data and NoSQL Software Requirements: Apache Hadoop
Workshop on Hadoop with Big Data
Workshop on Hadoop with Big Data Hadoop? Apache Hadoop is an open source framework for distributed storage and processing of large sets of data on commodity hardware. Hadoop enables businesses to quickly
Hadoop & its Usage at Facebook
Hadoop & its Usage at Facebook Dhruba Borthakur Project Lead, Hadoop Distributed File System [email protected] Presented at the Storage Developer Conference, Santa Clara September 15, 2009 Outline Introduction
How To Scale Out Of A Nosql Database
Firebird meets NoSQL (Apache HBase) Case Study Firebird Conference 2011 Luxembourg 25.11.2011 26.11.2011 Thomas Steinmaurer DI +43 7236 3343 896 [email protected] www.scch.at Michael Zwick DI
Are You Ready for Big Data?
Are You Ready for Big Data? Jim Gallo National Director, Business Analytics February 11, 2013 Agenda What is Big Data? How do you leverage Big Data in your company? How do you prepare for a Big Data initiative?
Hadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh
1 Hadoop: A Framework for Data- Intensive Distributed Computing CS561-Spring 2012 WPI, Mohamed Y. Eltabakh 2 What is Hadoop? Hadoop is a software framework for distributed processing of large datasets
W H I T E P A P E R. Deriving Intelligence from Large Data Using Hadoop and Applying Analytics. Abstract
W H I T E P A P E R Deriving Intelligence from Large Data Using Hadoop and Applying Analytics Abstract This white paper is focused on discussing the challenges facing large scale data processing and the
BENCHMARKING CLOUD DATABASES CASE STUDY on HBASE, HADOOP and CASSANDRA USING YCSB
BENCHMARKING CLOUD DATABASES CASE STUDY on HBASE, HADOOP and CASSANDRA USING YCSB Planet Size Data!? Gartner s 10 key IT trends for 2012 unstructured data will grow some 80% over the course of the next
Big Data Processing with Google s MapReduce. Alexandru Costan
1 Big Data Processing with Google s MapReduce Alexandru Costan Outline Motivation MapReduce programming model Examples MapReduce system architecture Limitations Extensions 2 Motivation Big Data @Google:
Hadoop & its Usage at Facebook
Hadoop & its Usage at Facebook Dhruba Borthakur Project Lead, Hadoop Distributed File System [email protected] Presented at the The Israeli Association of Grid Technologies July 15, 2009 Outline Architecture
How Companies are! Using Spark
How Companies are! Using Spark And where the Edge in Big Data will be Matei Zaharia History Decreasing storage costs have led to an explosion of big data Commodity cluster software, like Hadoop, has made
Associate Professor, Department of CSE, Shri Vishnu Engineering College for Women, Andhra Pradesh, India 2
Volume 6, Issue 3, March 2016 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Special Issue
Lecture 10: HBase! Claudia Hauff (Web Information Systems)! [email protected]
Big Data Processing, 2014/15 Lecture 10: HBase!! Claudia Hauff (Web Information Systems)! [email protected] 1 Course content Introduction Data streams 1 & 2 The MapReduce paradigm Looking behind the
NextGen Infrastructure for Big DATA Analytics.
NextGen Infrastructure for Big DATA Analytics. So What is Big Data? Data that exceeds the processing capacity of conven4onal database systems. The data is too big, moves too fast, or doesn t fit the structures
Volume 3, Issue 6, June 2015 International Journal of Advance Research in Computer Science and Management Studies
Volume 3, Issue 6, June 2015 International Journal of Advance Research in Computer Science and Management Studies Research Article / Survey Paper / Case Study Available online at: www.ijarcsms.com Image
CSE-E5430 Scalable Cloud Computing Lecture 2
CSE-E5430 Scalable Cloud Computing Lecture 2 Keijo Heljanko Department of Computer Science School of Science Aalto University [email protected] 14.9-2015 1/36 Google MapReduce A scalable batch processing
Hadoop and its Usage at Facebook. Dhruba Borthakur [email protected], June 22 rd, 2009
Hadoop and its Usage at Facebook Dhruba Borthakur [email protected], June 22 rd, 2009 Who Am I? Hadoop Developer Core contributor since Hadoop s infancy Focussed on Hadoop Distributed File System Facebook
HDFS. Hadoop Distributed File System
HDFS Kevin Swingler Hadoop Distributed File System File system designed to store VERY large files Streaming data access Running across clusters of commodity hardware Resilient to node failure 1 Large files
Unified Big Data Processing with Apache Spark. Matei Zaharia @matei_zaharia
Unified Big Data Processing with Apache Spark Matei Zaharia @matei_zaharia What is Apache Spark? Fast & general engine for big data processing Generalizes MapReduce model to support more types of processing
A Brief Outline on Bigdata Hadoop
A Brief Outline on Bigdata Hadoop Twinkle Gupta 1, Shruti Dixit 2 RGPV, Department of Computer Science and Engineering, Acropolis Institute of Technology and Research, Indore, India Abstract- Bigdata is
Oracle Big Data SQL Technical Update
Oracle Big Data SQL Technical Update Jean-Pierre Dijcks Oracle Redwood City, CA, USA Keywords: Big Data, Hadoop, NoSQL Databases, Relational Databases, SQL, Security, Performance Introduction This technical
An Oracle White Paper June 2012. High Performance Connectors for Load and Access of Data from Hadoop to Oracle Database
An Oracle White Paper June 2012 High Performance Connectors for Load and Access of Data from Hadoop to Oracle Database Executive Overview... 1 Introduction... 1 Oracle Loader for Hadoop... 2 Oracle Direct
Integrating Big Data into the Computing Curricula
Integrating Big Data into the Computing Curricula Yasin Silva, Suzanne Dietrich, Jason Reed, Lisa Tsosie Arizona State University http://www.public.asu.edu/~ynsilva/ibigdata/ 1 Overview Motivation Big
INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY
INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY A PATH FOR HORIZING YOUR INNOVATIVE WORK A REVIEW ON HIGH PERFORMANCE DATA STORAGE ARCHITECTURE OF BIGDATA USING HDFS MS.
DATA MINING WITH HADOOP AND HIVE Introduction to Architecture
DATA MINING WITH HADOOP AND HIVE Introduction to Architecture Dr. Wlodek Zadrozny (Most slides come from Prof. Akella's class in 2014) 2015-2025. Reproduction or usage prohibited without permission of
BIG DATA CAN DRIVE THE BUSINESS AND IT TO EVOLVE AND ADAPT RALPH KIMBALL BUSSUM 2014
BIG DATA CAN DRIVE THE BUSINESS AND IT TO EVOLVE AND ADAPT RALPH KIMBALL BUSSUM 2014 Ralph Kimball Associates 2014 The Data Warehouse Mission Identify all possible enterprise data assets Select those assets
Constructing a Data Lake: Hadoop and Oracle Database United!
Constructing a Data Lake: Hadoop and Oracle Database United! Sharon Sophia Stephen Big Data PreSales Consultant February 21, 2015 Safe Harbor The following is intended to outline our general product direction.
An Industrial Perspective on the Hadoop Ecosystem. Eldar Khalilov Pavel Valov
An Industrial Perspective on the Hadoop Ecosystem Eldar Khalilov Pavel Valov 03.12.2015 Agenda: Introduction, Research goals
Overview. Big Data in Apache Hadoop. - HDFS - MapReduce in Hadoop - YARN. https://hadoop.apache.org. Big Data Management and Analytics
Overview Big Data in Apache Hadoop - HDFS - MapReduce in Hadoop - YARN https://hadoop.apache.org 138 Apache Hadoop - Historical Background - 2003: Google publishes its cluster architecture & DFS (GFS)
COURSE CONTENT Big Data and Hadoop Training
COURSE CONTENT Big Data and Hadoop Training 1. Meet Hadoop Data! Data Storage and Analysis Comparison with Other Systems RDBMS Grid Computing Volunteer Computing A Brief History of Hadoop Apache Hadoop
Big Fast Data Hadoop acceleration with Flash. June 2013
Big Fast Data Hadoop acceleration with Flash June 2013 Agenda The Big Data Problem What is Hadoop Hadoop and Flash The Nytro Solution Test Results The Big Data Problem Big Data Output Facebook Traditional
Hadoop Big Data for Processing Data and Performing Workload
Hadoop Big Data for Processing Data and Performing Workload Girish T B 1, Shadik Mohammed Ghouse 2, Dr. B. R. Prasad Babu 3 1 M.Tech Student, 2 Associate Professor, 3 Professor & Head (PG), of Computer
Big Data. White Paper. Big Data Executive Overview WP-BD-10312014-01. Jafar Shunnar & Dan Raver. Page 1 Last Updated 11-10-2014
White Paper Big Data Executive Overview WP-BD-10312014-01 By Jafar Shunnar & Dan Raver Page 1 Last Updated 11-10-2014 Table of Contents Section 01 Big Data Facts Page 3-4 Section 02 What is Big Data? Page
Recognization of Satellite Images of Large Scale Data Based On Map-Reduce Framework
Recognization of Satellite Images of Large Scale Data Based On Map-Reduce Framework Vidya Dhondiba Jadhav, Harshada Jayant Nazirkar, Sneha Manik Idekar Dept. of Information Technology, JSPM's BSIOTR (W),
Map Reduce & Hadoop Recommended Text:
Big Data Map Reduce & Hadoop Recommended Text: Large datasets are becoming more common The New York Stock Exchange generates about one terabyte of new trade data per day. Facebook hosts approximately
Big Data and Apache Hadoop's MapReduce
Big Data and Apache Hadoop's MapReduce Michael Hahsler Computer Science and Engineering Southern Methodist University January 23, 2012
Similarity Search in a Very Large Scale Using Hadoop and HBase
Similarity Search in a Very Large Scale Using Hadoop and HBase Stanislav Barton, Vlastislav Dohnal, Philippe Rigaux LAMSADE - Universite Paris Dauphine, France Internet Memory Foundation, Paris, France
Real-time Data Analytics mit Elasticsearch. Bernhard Pflugfelder inovex GmbH
Real-time Data Analytics mit Elasticsearch Bernhard Pflugfelder inovex GmbH Bernhard Pflugfelder Big Data Engineer @ inovex Fields of interest: search analytics big data bi Working with: Lucene Solr Elasticsearch
Open source Google-style large scale data analysis with Hadoop
Open source Google-style large scale data analysis with Hadoop Ioannis Konstantinou Email: [email protected] Web: http://www.cslab.ntua.gr/~ikons Computing Systems Laboratory School of Electrical
Distributed File System. MCSN N. Tonellotto Complements of Distributed Enabling Platforms
Distributed File System How do we get data to the workers? NAS Compute Nodes SAN Distributed File System Don't move data to workers; move workers to the data! Store data on the local disks of nodes
Petabyte Scale Data at Facebook. Dhruba Borthakur, Engineer at Facebook, SIGMOD, New York, June 2013
Petabyte Scale Data at Facebook Dhruba Borthakur, Engineer at Facebook, SIGMOD, New York, June 2013 Agenda 1 Types of Data 2 Data Model and API for Facebook Graph Data 3 SLTP (Semi-OLTP) and Analytics
HiBench Introduction. Carson Wang ([email protected]) Software & Services Group
HiBench Introduction Carson Wang ([email protected]) Agenda Background Workloads Configurations Benchmark Report Tuning Guide Background WHY Why we need big data benchmarking systems? WHAT What is
Using distributed technologies to analyze Big Data
Using distributed technologies to analyze Big Data Abhijit Sharma Innovation Lab BMC Software 1 Data Explosion in Data Center Performance / Time Series Data Incoming data rates ~Millions of data points/
Integrating VoltDB with Hadoop
The NewSQL database you'll never outgrow Integrating VoltDB with Hadoop Hadoop is an open source framework for managing and manipulating massive volumes of data. VoltDB is a database for handling high velocity data.
NoSQL and Hadoop Technologies On Oracle Cloud
NoSQL and Hadoop Technologies On Oracle Cloud Vatika Sharma 1, Meenu Dave 2 1 M.Tech. Scholar, Department of CSE, Jagan Nath University, Jaipur, India 2 Assistant Professor, Department of CSE, Jagan Nath
Hadoop Ecosystem by Rahim A.
Hadoop Ecosystem by Rahim A. History of Hadoop Hadoop was created by Doug Cutting, the creator of Apache Lucene, the widely used text search library. Hadoop has its origins in Apache Nutch, an open
Oracle Big Data Spatial and Graph
Oracle Big Data Spatial and Graph Oracle Big Data Spatial and Graph offers a set of analytic services and data models that support Big Data workloads on Apache Hadoop and NoSQL database technologies. For
Data-Intensive Computing with Map-Reduce and Hadoop
Data-Intensive Computing with Map-Reduce and Hadoop Shamil Humbetov Department of Computer Engineering Qafqaz University Baku, Azerbaijan [email protected] Abstract Every day, we create 2.5 quintillion
Hadoop at Yahoo! Owen O'Malley Yahoo!, Grid Team [email protected]
Hadoop at Yahoo! Owen O'Malley Yahoo!, Grid Team [email protected] Who Am I? Yahoo! Architect on Hadoop Map/Reduce Design, review, and implement features in Hadoop Working on Hadoop full time since Feb
A very short Intro to Hadoop
Overview: A very short Intro to Hadoop How to Crunch a Petabyte? Lots of disks, spinning all the time Redundancy, since disks die Lots of CPU cores, working all the time Retry,
How to use Big Data in Industry 4.0 implementations. LAURI ILISON, PhD Head of Big Data and Machine Learning
How to use Big Data in Industry 4.0 implementations LAURI ILISON, PhD Head of Big Data and Machine Learning Big Data definition? Big Data is about structured vs unstructured data Big Data is about Volume
Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related
Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related Summary Xiangzhe Li Nowadays, there is more and more data every day about everything. For instance, here are some of the astonishing
Analytics in the Cloud. Peter Sirota, GM Elastic MapReduce
Analytics in the Cloud Peter Sirota, GM Elastic MapReduce Data-Driven Decision Making Data is the new raw material for any business on par with capital, people, and labor. What is Big Data? Terabytes of
Application Development. A Paradigm Shift
Application Development for the Cloud: A Paradigm Shift Ramesh Rangachar Intelsat © 2012 by Intelsat. Published by The Aerospace Corporation with permission. Motivation for the
CS2510 Computer Operating Systems
CS2510 Computer Operating Systems HADOOP Distributed File System Dr. Taieb Znati Computer Science Department University of Pittsburgh Outline HDFS Design Issues HDFS Application Profile Block Abstraction
