Extreme Computing. Big Data. Stratis Viglas. School of Informatics University of Edinburgh sviglas@inf.ed.ac.uk. Stratis Viglas Extreme Computing 1



Similar documents
Four Orders of Magnitude: Running Large Scale Accumulo Clusters. Aaron Cordova Accumulo Summit, June 2014

Big Data a threat or a chance?

Large-Scale Data Processing

AGENDA. What is BIG DATA? What is Hadoop? Why Microsoft? The Microsoft BIG DATA story. Our BIG DATA Roadmap. Hadoop PDW

Introduction to Hadoop

Big Data Analytics. Prof. Dr. Lars Schmidt-Thieme

CPS 216: Advanced Database Systems (Data-intensive Computing Systems) Shivnath Babu

Introduction to Big Data! with Apache Spark" UC#BERKELEY#

Big Data Analytics. Lucas Rego Drumond

Petabyte Scale Data at Facebook. Dhruba Borthakur, Engineer at Facebook, SIGMOD, New York, June 2013

NoSQL. Thomas Neumann 1 / 22

BENCHMARKING CLOUD DATABASES CASE STUDY on HBASE, HADOOP and CASSANDRA USING YCSB

Hadoop. Sunday, November 25, 12

The big data revolution

International Journal of Advanced Engineering Research and Applications (IJAERA) ISSN: Vol. 1, Issue 6, October Big Data and Hadoop

Danny Wang, Ph.D. Vice President of Business Strategy and Risk Management Republic Bank

Application and practice of parallel cloud computing in ISP. Guangzhou Institute of China Telecom Zhilan Huang

Big Data & QlikView. Democratizing Big Data Analytics. David Freriks Principal Solution Architect

EXPERIMENTATION. HARRISON CARRANZA School of Computer Science and Mathematics

Hadoop Beyond Hype: Complex Adaptive Systems Conference Nov 16, Viswa Sharma Solutions Architect Tata Consultancy Services

B490 Mining the Big Data. 0 Introduction

Big Graph Processing: Some Background

# Not a part of 1Z0-061 or 1Z0-144 Certification test, but very important technology in BIG DATA Analysis

Data Centric Computing Revisited

A Novel Cloud Based Elastic Framework for Big Data Preprocessing

Big Data Challenges in Bioinformatics

From Internet Data Centers to Data Centers in the Cloud

HPC and Big Data. EPCC The University of Edinburgh. Adrian Jackson Technical Architect

MapReduce Algorithms. Sergei Vassilvitskii. Saturday, August 25, 12

Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware

Hadoop. MPDL-Frühstück 9. Dezember 2013 MPDL INTERN

Big Data Analytics. with EMC Greenplum and Hadoop. Big Data Analytics. Ofir Manor Pre Sales Technical Architect EMC Greenplum

Lecture 10: HBase! Claudia Hauff (Web Information Systems)!

Open source Google-style large scale data analysis with Hadoop

Big Data With Hadoop

Cosmos. Big Data and Big Challenges. Pat Helland July 2011

Collaborations between Official Statistics and Academia in the Era of Big Data

Apache Hadoop FileSystem and its Usage in Facebook

Play with Big Data on the Shoulders of Open Source

MapReduce and Hadoop. Aaron Birkland Cornell Center for Advanced Computing. January 2012

Asking Hard Graph Questions. Paul Burkhardt. February 3, 2014

Cosmos. Big Data and Big Challenges. Ed Harris - Microsoft Online Services Division HTPS 2011

DATA MINING WITH HADOOP AND HIVE Introduction to Architecture

Similarity Search in a Very Large Scale Using Hadoop and HBase

Big Data. Donald Kossmann & Nesime Tatbul Systems Group ETH Zurich

Data Warehousing and Analytics Infrastructure at Facebook. Ashish Thusoo & Dhruba Borthakur athusoo,dhruba@facebook.com

Now, Next and the Future: IT, Big Data and other Implications for RIM. Presented by Michael S. Smith /

A Marketer's Guide. to Facebook Metrics

Maximizing Hadoop Performance with Hardware Compression

Are You Ready for Big Data?

Data Aggregation and Cloud Computing

Introduction to Big Data the four V's

Big Data Hope or Hype?

BITKOM& NIK - Big Data Wo liegen die Chancen für den Mittelstand?

Apache Hadoop FileSystem Internals

Welcome to the unit of Hadoop Fundamentals on Hadoop architecture. I will begin with a terminology review and then cover the major components

A Performance Evaluation of Open Source Graph Databases. Robert McColl David Ediger Jason Poovey Dan Campbell David A. Bader

Big Data: Opportunities & Challenges, Myths & Truths 資 料 來 源 : 台 大 廖 世 偉 教 授 課 程 資 料

Energy Efficient MapReduce

Google File System. Web and scalability

Register on projectbotticelli.com. Introduction to BI & Big Data DAX MDX Data Mining

Turn Big Data to Small Data

Search Engine Optimization (SEO)

Data-Intensive Programming. Timo Aaltonen Department of Pervasive Computing

Hadoop IST 734 SS CHUNG

Parallel Data Warehouse

Lecture 5: GFS & HDFS! Claudia Hauff (Web Information Systems)! ti2736b-ewi@tudelft.nl

Petabyte Scale Data at Facebook. Dhruba Borthakur, Engineer at Facebook, UC Berkeley, Nov 2012

HP Vertica at MIT Sloan Sports Analytics Conference March 1, 2013 Will Cairns, Senior Data Scientist, HP Vertica

Surfing the Data Tsunami: A New Paradigm for Big Data Processing and Analytics

Miguel Ortiz, Sr. Systems Engineer. Globanet

Hortonworks & SAS. Analytics everywhere. Page 1. Hortonworks Inc All Rights Reserved

CSE-E5430 Scalable Cloud Computing Lecture 2

Top 4 Ways Social Media is Helping to Reshape Marketing

Mobile Big Data AnalyEcs

Product Guide. Sawmill Analytics, Swindon SN4 9LZ UK tel:

Big Data Challenges. technology basics for data scientists. Spring Jordi Torres, UPC - BSC

Problems to store, transfer and process the Big Data 6/2/2016 GIANG TRAN - TTTGIANG2510@GMAIL.COM 1

Cloud Computing using MapReduce, Hadoop, Spark

Massive Cloud Auditing using Data Mining on Hadoop

Machine Learning over Big Data

MapReduce, Hadoop and Amazon AWS

Big Data and Analytics: Getting Started with ArcGIS. Mike Park Erik Hoel

Created by: Hector "H.R" Ramos

Overview. Big Data in Apache Hadoop. - HDFS - MapReduce in Hadoop - YARN. Big Data Management and Analytics

The Rise of Industrial Big Data

B669 Sublinear Algorithms for Big Data

Journal of Environmental Science, Computer Science and Engineering & Technology

W H I T E P A P E R. Deriving Intelligence from Large Data Using Hadoop and Applying Analytics. Abstract

The Impact of Big Data on Classic Machine Learning Algorithms. Thomas Jensen, Senior Business Expedia

BIG DATA, MAPREDUCE & HADOOP

Hadoop Architecture. Part 1

Search Engines are #1 Way to be Found

Hadoop for MySQL DBAs. Copyright 2011 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

BIG DATA TECHNOLOGY. Hadoop Ecosystem

Big Data Analytics with MapReduce VL Implementierung von Datenbanksystemen 05-Feb-13

SEO: What is it and Why is it Important?

Hadoop & its Usage at Facebook

Agenda. Some Examples from Yahoo! Hadoop. Some Examples from Yahoo! Crawling. Cloud (data) management Ahmed Ali-Eldin. First part: Second part:

Big Data Processing. Patrick Wendell Databricks

Transcription:

Extreme Computing Big Data Stratis Viglas School of Informatics University of Edinburgh sviglas@inf.ed.ac.uk Stratis Viglas Extreme Computing 1

Petabyte Age Big Data Challenges Stratis Viglas Extreme Computing 2

The Petabyte Age Wired article (June 2008): Sixty years ago, digital computers made information readable. Twenty years ago, the Internet made it reachable. Ten years ago, the first search engine crawlers made it a single database. Now Google and like-minded companies are sifting through the most measured age in history, treating this massive corpus as a laboratory of the human condition. They are the children of the Petabyte Age. Stratis Viglas Extreme Computing 3

The Petabyte Age: It is here now Optimising advertising is a multi-billion dollar business Advert placement on pages, Pricing Fraud detection Analysing query logs, web-page clicks etc and quickly is vital for success Large Hadron Collider: World s largest particle accelerator 15 PB of data generated per year Social Media, Chip designing,... Stratis Viglas Extreme Computing 4

The Petabyte Age: Predicting Flu There is a belief that we are due for a Flu Pandemic People tend to search for flu-related terms when they have it: Google mined query logs between 2003 2007 Simple machine learning techniques to predict Flu levels in the US Results were often 1-2 weeks ahead of traditional monitoring Stratis Viglas Extreme Computing 5

What does a Petabyte look like? A Petabyte is a lot of data: 1PB = 1024 TB; 1TB = 1024GB 1PB: 13 years of HD Video 1.5PB: 10 billion photos on Facebook 20PB: Amount of data processed by Google each day Source: http://mozy.com/blog/misc/how-much-is-a-petabyte/ Stratis Viglas Extreme Computing 6

Big Data Big Data is a relative term If things are breaking, you have Big Data Big Data is not always Petabytes in size Big Data for Informatics is not the same as for Google Big Data is often hard to understand A model explaining it might be as complicated as the data itself This has implications for Science Stratis Viglas Extreme Computing 7

Big Data: Power Laws Big Data typically obeys a power-law: Source: Wikipedia Stratis Viglas Extreme Computing 8

Big Data: Power Laws Implications: Modelling the head is easy, but may not be representative of the full population Dealing with the full population might imply Big Data (eg selling all books, not just block busters) Processing Big Data might reveal power-laws Most items take a small amount of time to process A few items take a lot of time to process Understanding the nature of data is key Stratis Viglas Extreme Computing 9

Problems Dealing with Big Data poses challenges: Storing it is not really a problem Disk space is cheap Efficiently accessing it and deriving results can be hard Reports should be produced now, not decades later Visualising it can be next to impossible How can you comprehend 1 trillion Web pages? Stratis Viglas Extreme Computing 10

Problems: Repeated Observations What makes Big Data big are repeated observations: Mobile phones report their locations every 15 seconds People post on Twitter > 100 million posts a day The Web changes every day Potentially we need unbounded resources Repeated observations motivates Streaming Algorithms Stratis Viglas Extreme Computing 11

Problems: Access Often we want random access to data Find the interests of my friend on Facebook But what if the Data is too big to fit into memory? We can start using disk etc, but poor decisions can make processing too slow Stratis Viglas Extreme Computing 12

Problems: Access Source: Jacobs, The Pathologies of Big Data Stratis Viglas Extreme Computing 13

Problems: Denormalising Arranging our data so we can use sequential access is great But not all decisions can be made locally Finding the interest of my friend on Facebook is easy But what if we want to do this for another person who shares the same friend? Using random access, we would lookup that friend. Using sequential access, we need to localise friend information Localising information means duplicating it Denormalising data can greatly increase the size of it Stratis Viglas Extreme Computing 14

Problems: Non-uniform Allocation Distributed computation is a natural way to tackle Big Data Map-Reduce encourages sequential, disk-based, localised processing of data Map-reduce operates over a cluster of machines One consequence of Power Laws is uneven allocation of data to nodes: The head might go to one or two nodes The tail would spread over all other nodes All workers on the tail would finish quickly. The head workers would be a lot slower Power Laws can turn parallel algorithms into sequential algorithms Stratis Viglas Extreme Computing 15

Problems: Curation Big Data can be the basis of Science: Experiments can happen in silico Discoveries can be made over large, aggregated data sets Data needs to be managed (curated): How can we ensure that experiments are reproducible? Whoever owns the data controls it How can we guarantee that the data will survive? What about access? Growing interest in Open Data Stratis Viglas Extreme Computing 16

Summary Introduced notion of Big Data Looked at various problems Motivated some of the later techniques Stratis Viglas Extreme Computing 17