Challenges for Data Driven Systems



Similar documents
Large-Scale Data Processing

Lecture Data Warehouse Systems

SQL VS. NO-SQL. Adapted Slides from Dr. Jennifer Widom from Stanford

Privacy Preservation in the Context of Big Data Processing

Systems Engineering II. Pramod Bhatotia TU Dresden dresden.de

extensible record stores document stores key-value stores Rick Cattel s clustering from Scalable SQL and NoSQL Data Stores SIGMOD Record, 2010

Cloud Scale Distributed Data Storage. Jürmo Mehine

Open source large scale distributed data management with Google s MapReduce and Bigtable

Integrating Big Data into the Computing Curricula

Conjugating data mood and tenses: Simple past, infinite present, fast continuous, simpler imperative, conditional future perfect

The evolution of database technology (II) Huibert Aalbers Senior Certified Executive IT Architect

BigData. An Overview of Several Approaches. David Mera 16/12/2013. Masaryk University Brno, Czech Republic

Structured Data Storage

NoSQL Data Base Basics

Developing Scalable Smart Grid Infrastructure to Enable Secure Transmission System Control

Can the Elephants Handle the NoSQL Onslaught?

Big Data and Data Science: Behind the Buzz Words

Architectural patterns for building real time applications with Apache HBase. Andrew Purtell Committer and PMC, Apache HBase

Introduction to Hadoop. New York Oracle User Group Vikas Sawhney

Lambda Architecture. Near Real-Time Big Data Analytics Using Hadoop. January Website:

Cloud Computing at Google. Architecture

BIG DATA What it is and how to use?

Introduction to Big Data! with Apache Spark" UC#BERKELEY#

How to use Big Data in Industry 4.0 implementations. LAURI ILISON, PhD Head of Big Data and Machine Learning

Unified Big Data Processing with Apache Spark. Matei

Big Data With Hadoop

Big Data Technology ดร.ช ชาต หฤไชยะศ กด. Choochart Haruechaiyasak, Ph.D.

Advanced Data Management Technologies

Introduction to NoSQL Databases. Tore Risch Information Technology Uppsala University

MongoDB in the NoSQL and SQL world. Horst Rechner Berlin,

Hadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh

Evaluating NoSQL for Enterprise Applications. Dirk Bartels VP Strategy & Marketing

Big Data Management in the Clouds. Alexandru Costan IRISA / INSA Rennes (KerData team)

How To Scale Out Of A Nosql Database

Overview of Databases On MacOS. Karl Kuehn Automation Engineer RethinkDB

Big Systems, Big Data

Monitis Project Proposals for AUA. September 2014, Yerevan, Armenia

Infrastructures for big data

Вовченко Алексей, к.т.н., с.н.с. ВМК МГУ ИПИ РАН

Preparing Your Data For Cloud

Introduction to Hadoop

A survey of big data architectures for handling massive data

Oracle s Big Data solutions. Roger Wullschleger. <Insert Picture Here>

Hadoop MapReduce and Spark. Giorgio Pedrazzi, CINECA-SCAI School of Data Analytics and Visualisation Milan, 10/06/2015

Hadoop IST 734 SS CHUNG

Comparison of the Frontier Distributed Database Caching System with NoSQL Databases

CSE-E5430 Scalable Cloud Computing Lecture 2

Big Data Analytics. Lucas Rego Drumond

A programming model in Cloud: MapReduce

BASHO DATA PLATFORM SIMPLIFIES BIG DATA, IOT, AND HYBRID CLOUD APPS

How To Use Big Data For Telco (For A Telco)

X4-2 Exadata announced (well actually around Jan 1) OEM/Grid control 12c R4 just released

Cloud Computing Summary and Preparation for Examination

Slave. Master. Research Scholar, Bharathiar University

16.1 MAPREDUCE. For personal use only, not for distribution. 333

An Approach to Implement Map Reduce with NoSQL Databases

MapReduce with Apache Hadoop Analysing Big Data

Open source Google-style large scale data analysis with Hadoop

COMP 598 Applied Machine Learning Lecture 21: Parallelization methods for large-scale machine learning! Big Data by the numbers

BIG DATA TOOLS. Top 10 open source technologies for Big Data

NoSQL: Going Beyond Structured Data and RDBMS

Why NoSQL? Your database options in the new non- relational world IBM Cloudant 1

Overview on Graph Datastores and Graph Computing Systems. -- Litao Deng (Cloud Computing Group)

NoSQL for SQL Professionals William McKnight

Sentimental Analysis using Hadoop Phase 2: Week 2

Bringing Big Data Modelling into the Hands of Domain Experts

NoSQL and Hadoop Technologies On Oracle Cloud

Introduction to NOSQL

Department of Computer Science University of Cyprus EPL646 Advanced Topics in Databases. Lecture 14

How To Write A Database Program

Managing large clusters resources

Spark in Action. Fast Big Data Analytics using Scala. Matei Zaharia. project.org. University of California, Berkeley UC BERKELEY

Hadoop Ecosystem Overview. CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook

Viswanath Nandigam Sriram Krishnan Chaitan Baru

Transforming the Telecoms Business using Big Data and Analytics

So What s the Big Deal?

Stateful Distributed Dataflow Graphs: Imperative Big Data Programming for the Masses

CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop)

International Journal of Innovative Research in Information Security (IJIRIS) ISSN: (O) Volume 1 Issue 3 (September 2014)

GigaSpaces Real-Time Analytics for Big Data

Map Reduce / Hadoop / HDFS

NOSQL INTRODUCTION WITH MONGODB AND RUBY GEOFF

Scaling Out With Apache Spark. DTL Meeting Slides based on

Hadoop. Sunday, November 25, 12

Open Source Technologies on Microsoft Azure

Where We Are. References. Cloud Computing. Levels of Service. Cloud Computing History. Introduction to Data Management CSE 344

Realtime Apache Hadoop at Facebook. Jonathan Gray & Dhruba Borthakur June 14, 2011 at SIGMOD, Athens

NoSQL systems: introduction and data models. Riccardo Torlone Università Roma Tre

Databases 2 (VU) ( )

Real Time Fraud Detection With Sequence Mining on Big Data Platform. Pranab Ghosh Big Data Consultant IEEE CNSV meeting, May Santa Clara, CA

Transcription:

Challenges for Data Driven Systems Eiko Yoneki University of Cambridge Computer Laboratory Quick History of Data Management 4000 B C Manual recording From tablets to papyrus to paper A. Payberah 2014 2

1800's - 1940's Punched cards (no fault-tolerance) Binary data 1911: IBM appeared A. Payberah 2014 3 1940's - 1970's Magnetic tapes Batch transaction processing Hierarchical DBMS Network DBMS A. Payberah 2014 4

1980's Relational DBMS (tables) and SQL ACID (Atomicity Consistency Isolation Durability) Client-server computing Parallel processing A. Payberah 2014 5 1990's - 2000's The Internet... A. Payberah 2014 6

2010's NoSQL: BASE instead of ACID Basic Availability, Soft-state, Eventual consistency Big Data is emerging! A. Payberah 2014 7 Emergence of Big Data Increase of Storage Capacity Increase of Processing Capacity Availability of Data Hardware and software technologies can manage ocean of data 8

Challenge to process Big Data Integration of complex data processing with programming, networking and storage A key vision for future computing 9 Big Data: Technologies Distributed infrastructure Cloud (e.g. Infrastructure as a service) cf. Multi-core (parallel computing) Storage Distributed storage (e.g. Amazon S3) Data model/indexing High-performance schema-free database (e.g. NoSQL DB) Programming Model Distributed processing (e.g. MapReduce) Operations on big data Analytics 10

Big Data: Technologies Distributed infrastructure Cloud (e.g. Infrastructure as a service) Storage Distributed storage (e.g. Amazon S3) Data model/indexing High-performance schema-free database (e.g. NoSQL DB) Programming Model Distributed processing (e.g. MapReduce) Operations on big data Analytics Realtime Analytics 11 Distributed Infrastructure Zookeeper, Chubby manage 12

Distributed Infrastructure Computing + Storage transparently Cloud computing, Web 2.0 Scalability and fault tolerance Distributed servers Amazon EC2, Google App Engine, Elastic, Azure System? OS, customisations Sizing? RAM/CPU based on tiered model Storage? Quantity, type Distributed storage Amazon S3 Hadoop Distributed File System (HDFS) Google File System (GFS), BigTable 13 Challenges Distribute and shard parts over machines Still fast traversal and read to keep related data together Scale out instead scale up Avoid naïve hashing for sharding Do not depend on the number of node But difficult add/remove nodes Trade off data locality, consistency, availability, read/write/search speed, latency etc. Analytics requires both real time and post fact analytics and incremental operation 14

Big Data: Technologies Distributed infrastructure Cloud (e.g. Infrastructure as a service) Storage Distributed storage (e.g. Amazon S3) Data model/indexing High-performance schema-free database (e.g. NoSQL DB) Programming Model Distributed processing (e.g. MapReduce) Operations on big data Analytics Realtime Analytics 15 Data Model/Indexing Support large data Fast and flexible access to data Operate on distributed infrastructure Is SQL Database sufficient? 16

NoSQL (Schema Free) Database NoSQL database Operate on distributed infrastructure Based on key-value pairs (no predefined schema) Fast and flexible Pros: Scalable and fast Cons: Fewer consistency/concurrency guarantees and weaker queries support Implementations MongoDB, CouchDB, Cassandra, Redis, BigTable, Hibase 17 Big Data: Technologies Distributed infrastructure Cloud (e.g. Infrastructure as a service) Storage Distributed storage (e.g. Amazon S3) Data model/indexing High-performance schema-free database (e.g. NoSQL DB) Programming Model Distributed processing (e.g. MapReduce) Stream processing Operations on big data Analytics Realtime Analytics 18

Distributed Processing Non standard programming models No traditional parallel programming models (e.g. MPI) e.g. MapReduce Data (flow) parallel programming e.g. MapReduce, Dryad/LINQ, NAIAD, Spark 19 MapReduce Target problem needs to be parallelisable Split into a set of smaller code (map) Next small piece of code executed in parallel Results from map operation get synthesised into a result of the original problem (reduce) 20

CIEL: Dynamic Task Graph Data-dependent control flow CIEL: Execution engine for dynamic task graphs (D. Murray et al. CIEL: a universal execution engine for distributed data-flow computing, NSDI 2011) 21 Stream Data Processing Stream Data Processing Stream: infinite sequence of {tuple, timestamp} pairs Continuous query: result of query in unbounded stream Database systems and Data stream systems Database Mostly static data, ad-hoc one-time queries Store and query Data stream Mostly transient data, continuous queries 22

Real-Time Data Departure from traditional static web pages New time-sensitive data is generated continuously Rich connections between entities Challenges: High rate of updates Continuous data mining - Incremental data processing Data consistency 23 Big Data: Technologies Distributed infrastructure Cloud (e.g. Infrastructure as a service) Storage Distributed storage (e.g. Amazon S3) Data model/indexing High-performance schema-free database (e.g. NoSQL DB) Programming Model Distributed processing (e.g. MapReduce) Operations on big data Analytics 24

Techniques for Analysis Applying these techniques: larger and more diverse datasets can be used to generate more numerous and insightful results than smaller, less diverse ones Classification Cluster analysis Crowd sourcing Data fusion/integration Data mining Ensemble learning Genetic algorithms Machine learning NLP Neural networks Network analysis Optimisation Pattern recognition Predictive modelling Regression Sentiment analysis Signal processing Spatial analysis Statistics Supervised learning Simulation Time series analysis Unsupervised learning Visualisation 25 Typical Operation with Big Data Smart sampling of data Reducing data with maintaining statistical properties Find similar items Efficient multidimensional indexing Incremental updating of models Distributed linear algebra dealing with large sparse matrices Plus usual data mining, machine learning and statistics Supervised (e.g. classification, regression) Non-supervised (e.g. clustering..) 26

Do we need new Algorithms? Can t always store all data Online/streaming algorithms Memory vs. disk becomes critical Algorithms with limited passes N 2 is impossible Approximate algorithms 27 Sorting Easy Cases Google 1 trillion items (1PB) sorted in 6 Hours Searching Hashing and distributed search Random split of data to feed M/R operation BUT Not all algorithms are parallelisable 28

More Complex Case: Stream Data Have we seen x before? Rolling average of previous K items Hot list most frequent items seen so far Probability start tracking new item Querying data streams Continuous Query 29 Big Graph Data Bipartite graph of appearing phrases in documents Social Networks Airline Graph Gene expression data Protein Interactions [genomebiology.com] 30

How to Process Big Graph Data? Data-Parallel (MapReduce, DryadLINQ) Partitioned across several machines and replicated No efficient random access to data Graph algorithms are not fully parallelisable Parallel DB Tabular format providing ACID properties Allow data to be partitioned and processed in parallel Graph does not map well to tabular format Moden NoSQL Allow flexible structure (e.g. graph) Trinity, Neo4J, HyperGraphDB In-memory graph store for improving latency (e.g. Redis, Scalable Hyperlink Store (SHS)) 31 Big Graph Data Processing MapReduce is ill-suited for graph processing Many iterations are needed Intermediate results at every iteration harm performance Graph specific data parallel Vertex-based iterative computation model Iterative algorithms common in ML and graph analysis 32

Big Data Analytics Stack A. Payberah 2014 33 Big Data Analytics Stack A. Payberah 2014 34

Session 1: Introduction Topic Areas Session 2: Programming in Data Centric Environment Session 3: Processing Models of Large-Scale Graph Data Session 4: Map/Reduce Hands-on Tutorial with EC2 Session 5: Optimisation in Graph Data Processing + Guest lecture Session 6: Stream Data Processing + Guest lecture Session 7: Scheduling Irregular Tasks Session 8: Project study presentation 35 Summary R212 course web page: www.cl.cam.ac.uk/~ey204/teaching/acs/r212_2014_2015 Enjoy the course! 36