Big Data and Fast Data combined is it possible?



Similar documents
Building Scalable Big Data Pipelines

Lambda Architecture. Near Real-Time Big Data Analytics Using Hadoop. January Website:

Big Data JAMES WARREN. Principles and best practices of NATHAN MARZ MANNING. scalable real-time data systems. Shelter Island

Big Data. Marriage of RDBMS-DWH and Hadoop & Co. Author: Jan Ott Trivadis AG Trivadis. Big Data - Marriage of RDBMS-DWH and Hadoop & Co.

Lambda Architecture for Batch and Real- Time Processing on AWS with Spark Streaming and Spark SQL. May 2015

SQL + NOSQL + NEWSQL + REALTIME FOR INVESTMENT BANKS

FAQs. This material is built based on. Lambda Architecture. Scaling with a queue. 8/27/2015 Sangmi Pallickara

Apache Storm vs. Spark Streaming Two Stream Processing Platforms compared

BIG DATA. Using the Lambda Architecture on a Big Data Platform to Improve Mobile Campaign Management. Author: Sandesh Deshmane

How To Scale Out Of A Nosql Database

Big Data Course Highlights

Hadoop Ecosystem Overview. CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook

How To Use Big Data For Telco (For A Telco)

Dominik Wagenknecht Accenture

Architectural patterns for building real time applications with Apache HBase. Andrew Purtell Committer and PMC, Apache HBase

HDP Hadoop From concept to deployment.

How To Create A Data Visualization With Apache Spark And Zeppelin

Hadoop Ecosystem B Y R A H I M A.

Hadoop implementation of MapReduce computational model. Ján Vaňo

Dell In-Memory Appliance for Cloudera Enterprise

Managing Big Data with Hadoop & Vertica. A look at integration between the Cloudera distribution for Hadoop and the Vertica Analytic Database

Beyond Lambda - how to get from logical to physical. Artur Borycki, Director International Technology & Innovations

How To Handle Big Data With A Data Scientist

Monitis Project Proposals for AUA. September 2014, Yerevan, Armenia

The evolution of database technology (II) Huibert Aalbers Senior Certified Executive IT Architect

Hadoop. Sunday, November 25, 12

SOLVING REAL AND BIG (DATA) PROBLEMS USING HADOOP. Eva Andreasson Cloudera

Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data

Getting Started with Hadoop. Raanan Dagan Paul Tibaldi

INTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE

Data-Intensive Programming. Timo Aaltonen Department of Pervasive Computing

Real Time Data Processing using Spark Streaming

The Top 10 7 Hadoop Patterns and Anti-patterns. Alex

TRAINING PROGRAM ON BIGDATA/HADOOP

Application Development. A Paradigm Shift

ESS event: Big Data in Official Statistics. Antonino Virgillito, Istat

So What s the Big Deal?

Principles and best practices of scalable real-time data systems. Nathan Marz James Warren 6$03/(&+$37(5 MANNING

Constructing a Data Lake: Hadoop and Oracle Database United!

How Companies are! Using Spark

CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop)

Workshop on Hadoop with Big Data

STREAM PROCESSING AT LINKEDIN: APACHE KAFKA & APACHE SAMZA. Processing billions of events every day

Unified Big Data Processing with Apache Spark. Matei

Qsoft Inc

Openbus Documentation

Hadoop IST 734 SS CHUNG

Information Builders Mission & Value Proposition

Big Data Technology ดร.ช ชาต หฤไชยะศ กด. Choochart Haruechaiyasak, Ph.D.

Chukwa, Hadoop subproject, 37, 131 Cloud enabled big data, 4 Codd s 12 rules, 1 Column-oriented databases, 18, 52 Compression pattern, 83 84

Introduction to Hadoop. New York Oracle User Group Vikas Sawhney

Time series IoT data ingestion into Cassandra using Kaa

HADOOP ADMINISTATION AND DEVELOPMENT TRAINING CURRICULUM

Elephants and Storms - using Big Data techniques for Analysis of Large and Changing Datasets

Oracle s Big Data solutions. Roger Wullschleger. <Insert Picture Here>

Lecture 10: HBase! Claudia Hauff (Web Information Systems)!

Large scale processing using Hadoop. Ján Vaňo

The Future of Data Management

Moving From Hadoop to Spark

#TalendSandbox for Big Data

Putting Apache Kafka to Use!

A Survey on Big Data Concepts and Tools

<Insert Picture Here> Big Data

Comparing SQL and NOSQL databases

Big Data & QlikView. Democratizing Big Data Analytics. David Freriks Principal Solution Architect

Big Data and Industrial Internet

Real Time Analytics for Big Data. NtiSh Nati

Design and Evolution of the Apache Hadoop File System(HDFS)

Chase Wu New Jersey Ins0tute of Technology

Thomas Baumann Swiss Mobiliar Bern, Switzerland

Hadoop Evolution In Organizations. Mark Vervuurt Cluster Data Science & Analytics

Certified Big Data and Apache Hadoop Developer VS-1221

Search and Real-Time Analytics on Big Data

A Brief Introduction to Apache Tez

Analytics in the Cloud. Peter Sirota, GM Elastic MapReduce

The Internet of Things and Big Data: Intro

Apache HBase. Crazy dances on the elephant back

BIG DATA TRENDS AND TECHNOLOGIES

HDP Enabling the Modern Data Architecture

Fast Data in the Era of Big Data: Tiwtter s Real-Time Related Query Suggestion Architecture

IOT & Big Data: The Future Information Processing Architecture

The Future of Data Management with Hadoop and the Enterprise Data Hub

A Performance Analysis of Distributed Indexing using Terrier

Open source Google-style large scale data analysis with Hadoop

Infomatics. Big-Data and Hadoop Developer Training with Oracle WDP

INTRODUCTION TO CASSANDRA

An Approach to Implement Map Reduce with NoSQL Databases

Introduction to Big data. Why Big data? Case Studies. Introduction to Hadoop. Understanding Features of Hadoop. Hadoop Architecture.

Talend Big Data. Delivering instant value from all your data. Talend

Building Scalable Big Data Infrastructure Using Open Source Software. Sam William

Hadoop: Embracing future hardware

Architectures for massive data management

Internals of Hadoop Application Framework and Distributed File System

Big Data. White Paper. Big Data Executive Overview WP-BD Jafar Shunnar & Dan Raver. Page 1 Last Updated

Big Data and Hadoop for the Executive A Reference Guide

Using distributed technologies to analyze Big Data

Cloudera Enterprise Reference Architecture for Google Cloud Platform Deployments

HADOOP SOLUTION USING EMC ISILON AND CLOUDERA ENTERPRISE Efficient, Flexible In-Place Hadoop Analytics

Transcription:

Big Data and Fast Data combined is it possible? Ulises Fasoli DBTA Workshop 2014 - Bern BASEL BERN BRUGG LAUSANNE ZÜRICH DÜSSELDORF FRANKFURT A.M. FREIBURG I.BR. HAMBURG MÜNCHEN STUTTGART WIEN 1

Ulises Fasoli Consultant @ Trivadis Lausanne 7+ years of software development experience Occasional blogger Contact information : Email : ulises.fasoli@trivadis.com Blog: http://ufasoli.blogspot.com Twitter: ufasoli 2

Our company Trivadis is a market leader in IT consulting, system integration, solution engineering and the provision of IT services focusing on and technologies in Switzerland, Germany and Austria. We offer our services in the following strategic business fields: O P E R A T I O N Trivadis Services takes over the interacting operation of your IT systems. 3

AGENDA 1. Big Data and Fast Data, what is it? 2. Architecting (Big) Data Systems 3. The Lambda Architecture 4. Use Case and the Implementation 5. Summary and Outlook 4

Big Data Definition (4 Vs) Characteristics of Big Data: Its Volume, Velocity and Variety in combination Time to action? -> Big Data + Event Processing = Fast Data 5

The world is changing The model of Generating/Consuming Data has changed. Old Model: few companies are generating data, all others are consuming data New Model: all of us are generating data, and all of us are consuming data 6

60 SECONDS 7 DOAG 2014 Big Data und Fast Data - Lambda Architektur und deren Umsetzung 19.11.2014

Internet Of Things Sensors are/will be everywhere There are more devices tapping into the internet than people on earth How do we prepare our systems/architecture for the future? 8 Source: The Economist Source: Cisco

The world is changing new data stores Problem of traditional (R)DBMS approach: Complex object graph Schema evolution Semi-structured data Scaling Order ID: 1001 Order Date: 15.9.2012 Customer First Name: Peter Last Name: Sample Billing Address Street: Somestreet 10 City: Somewhere Postal Code: 55901 Line Items Name Ipod Touch Monster Beat Apple Mouse Quantity 1 2 1 Price 220.95 190.00 69.90 ORDER CUSTOMER ADDRESS ORDER_LINES Polyglot persistence Using multiple data storage technologies (RDMBS + NoSQL + NewSQL + In- Memory) 9

The world is changing New platforms evolving (i.e. Hadoop Ecosystem) 10

Data as an Asset Store everything? Data is just too valuable to delete! We must store everything! It depends Big Data technologies allow to store the raw information from new and existing data sources so that you can later use it to create new data-driven products, which you haven t thought about today! Nonsense! Just store the data you know you need today! 11

AGENDA 1. Big Data and Fast Data, what is it? 2. Architecting (Big) Data Systems 3. The Lambda Architecture 4. Use Case and the Implementation 5. Summary and Outlook 12

What is a data system? A (data) system that manages the storage and querying of data with a lifetime measured in years encompassing every version of the application to ever exist, every hardware failure and every human mistake ever made. A data system answers questions based on information that was acquired in the past 13

What is a data system? - Goal The goal of a data system is to compute arbitrary functions on arbitrary data. Questions are answered by running functions that take data as input query = function (all data) 14

Desired properties of a data system Robust and fault tolerant Low latency read / updates General Extensible Scalable Allow Ad-hoc queries Minimal maintenance Debuggable 15

How do we build (data) systems today Today s Architectures Source of Truth is mutable! Application (Query) Mobile Web RIA Rich Client Mutable Database Source of Truth RDBMS NoSQL NewSQL CRUD pattern What is the problem with this? Lack of Human Fault Tolerance Potential loss of information/data Source of Truth 16

Lack of Human Fault Tolerance Bugs will be deployed to production over the lifetime of a data system Operational mistakes will be made Humans are part of the overall system Just like hard disks, CPUs, memory, software design for human error like you design for any other fault Examples of human error Deploy a bug that increments counters by two instead of by one Accidentally delete data from database Accidental DOS on important internal service Worst two consequences: data loss or data corruption As long as an error doesn t lose or corrupt good data, you can fix what went wrong 17

Lack of Human Fault Tolerance Immutability vs. Mutability An immutable system captures historical records of events Each event happens at a particular time and is always true The U and D in CRUD A mutable system updates the current state of the world Mutable systems inherently lack human fault-tolerance Easy to corrupt or lose data Immutability restricts the range of errors causing data loss/data corruption Vastly more human fault-tolerant Conclusion: Your source of truth should always be immutable 18

A different kind of architecture with immutable source of truth Instead of using our traditional approach why not build data systems like this Immutable data View on Data Application (Query) Source of Truth HDFS NoSQL NewSQL RDBMS View on Data Mobile Web RIA Rich Client Source of Truth 19

How to create the views on the Immutable data? On the fly? Immutable data View Query Materialized, i.e. Pre-computed? Immutable data Pre- Computed Views Query 20

(Big) Data Processing Incoming Data Immutable data? Pre- Computed Views? Query How to compute the materialized views? How to compute queries from the views? 21

Today Big Data Processing means Batch Processing Stream 1 Stream 2 Event HDFS Hadoop Distributed File System Hadoop cluster (Map/Reduce) Data Store optimized for appending large results Queries 22

Big Data Processing - Batch Favorite Product List Changes 01.02.13 Add ipad 64GB 10.03.13 Add Sony RX-100 11. 03.13 Add Canon GX-10 11.03.13 Remove Sony RX-100 12.03.13 Add Nikon S-100 14.04.13 Add BoseQC-15 15.04.13 Add MacBook Pro 15 20.04.13 Remove Canon GX10 Raw information => data compute Current Favorite Product List ipad 64GB Nikon S-100 BoseQC-15 MacBook Pro 15 derive Information => derived Current Product Count 4 23

Big Data Processing Batch But we are not done yet batch-processed data non-processed data now Source of truth results time now time Fully processed data Last full batch period Time for batch job Using only batch processing, leaves you always with a portion of nonprocessed data. 24

Big Data Processing - Adding Real-Time Immutable data Batch Views Incoming Data Query Data Stream? Realtime Views How to compute real-time views How to compute queries from the views? 25

Big Data Processing - Adding Real-Time Immutable data Views Favorite Product List Changes 1.2.13 Add ipad 64GB 10.3.13 Add Sony RX-100 11..3.13 Add Canon GX-10 11.3.13 Remove Sony RX-100 12.3.13 Add Nikon S-100 14.4.13 Add BoseQC-15 15.4.13 Add MacBook Pro 15 20.4.13 Remove Canon GX10 Now Add Canon Scanner compute Current Favorite Product List ipad 64GB Nikon S-100 BoseQC-15 MacBook Pro 15 Data Stream Query Current Product Count incoming Stream of Favorite Product List Changes 5 Add Canon Scanner compute Now Canon Scanner 26

Big Data Processing - Batch & Real Time blended view for end user batch processing worked fine here (e.g. Hadoop) real time processing works here now time Fully processed data Last full batch period Time for batch job Adapted from Ted Dunning (March 2012): http://www.youtube.com/watch?v=7pcmbi5ac20 27

AGENDA 1. Big Data and Fast Data, what is it? 2. Architecting (Big) Data Systems 3. The Lambda Architecture 4. The Use Case and the Implementation 5. Summary and Outlook 28

Lambda Architecture Batch Layer Serving Layer Immutable data B C Batch View D Incoming Data A G Query Speed Layer Data Stream E F Realtime View Lambda => Query = function(all data) 29

Lambda Architecture A. All data is sent to both the batch and speed layer B. Master data set is an immutable, append-only set of data C. Batch layer pre-computes query functions from scratch, result is called Batch Views. Batch layer constantly re-computes the batch views. D. Batch views are indexed and stored in a scalable database to get particular values very quickly. Swaps in new batch views when they are available E. Speed layer compensates for the high latency of updates to the Batch Views F. Uses fast incremental algorithms and read/write databases to produce realtime views G. Queries are resolved by getting results from both batch and real-time views 30

Lambda Architecture Batch Layer Stores the immutable constantly growing dataset Computes arbitrary views from this dataset using BigData technologies (can take hours) Can be always recreated Speed Layer Computes the views from the constant stream of data it receives Needed to compensate for the high latency of the batch layer Incremental model and views are transient Serving Layer Responsible for indexing and exposing the pre-computed batch views so that they can be queried Exposes the incremented real-time views Merges the batch and the real-time views into a consistent result 31

Data Service (Merge) Visualization Lambda Architecture Sensor Layer mobile social IoT Distribution Layer Incoming Data All data Batch Layer Batch recompute Precompute Views Precomputed information Speed Layer Serving Layer batch view batch view Process stream Realtime increment Incremented information real time view real time view Adapted from: Marz, N. & Warren, J. (2013) Big Data. Manning. 32

AGENDA 1. Big Data and Fast Data, what is it? 2. Architecting (Big) Data Systems 3. The Lambda Architecture 4. Use Case and the Implementation 5. Summary and Outlook 33

Project Definition Build a platform for analysing Twitter communications in retrospective and in real-time Scalability and ability for future data fusion with other information is a must Provide a Web-based access to the analytical information Invest into new, innovative and not widely-proven technology PoC environment, a pre-invest for future systems 34

Anatomy of a tweet Time Space Content Social Technical { "created_at":"sun Aug 18 14:29:11 +0000 2013", "id":369103686938546176, "id_str":"369103686938546176", "text":"baloncesto preparaci\u00f3n Eslovenia, Rajoy derrota a Merkel. #quelosepash", "source":"\u003ca href=\"http:\/\/twitter.com\/download\/iphone\" rel=\"nofollow\ \u003etwitter for iphone\u003c\/a\u003e", "truncated":false, "in_reply_to_status_id":null, "in_reply_to_status_id_str":null, "in_reply_to_user_id":null, "in_reply_to_user_id_str":null, "in_reply_to_screen_name":null, "user":{ "id":15032594, "id_str":"15032594", "name":"juan Carlos Romo\u2122", "screen_name":"jcsromo", "location":"sopuerta, Vizcaya", "url":null, "description":"portugalujo, saturado de todo, de baloncesto no. Twitter personal.", "protected":false, "followers_count":1331, "friends_count":1326, "listed_count":31, "created_at":"fri Jun 06 21:21:22 +0000 2008", "favourites_count":255, "utc_offset":7200, "time_zone":"madrid", "geo_enabled":true, "verified":false, "statuses_count":22787, "lang":"es", "contributors_enabled":false, "is_translator":false, "profile_image_url_https":"https:\/\/si0.twimg.com\/profile_images\/2649762203\ be4973d9eb457a45077897879c47c8b7_normal.jpeg", "profile_banner_url":"https:\/\/pbs.twimg.com\/profile_banners\/15032594\/1371570 460", "profile_link_color":"2fc2ef", "profile_sidebar_border_color":"ffffff", "profile_sidebar_fill_color":"252429", "profile_text_color":"666666", "profile_use_background_image":true, "default_profile":false, "default_profile_image":false, "following":null, "follow_request_sent":null, "notifications":null}, "geo":{ "type":"point","coordinates":[43.28261499,-2.96464655]}, "coordinates":{"type":"point","coordinates":[-2.96464655,43.28261499]}, "place":{"id":"cd43ea85d651af92", "url":"https:\/\/api.twitter.com\/1.1\/geo\/id\/cd43ea85d651af92.json", "place_type":"city", "name":"bilbao", "full_name":"bilbao, Vizcaya", "country_code":"es", "country":"espa\u00f1a", "bounding_box":{"type":"polygon","coordinates":[[[-2.9860102,43.2136542], [-2.9860102,43.2901452],[-2.8803248,43.2901452],[-2.8803248,43.2136542]]]}, "attributes":{}}, "contributors": null, "retweet_count":0, "favorite_count":0, "entities":{"hashtags":[{"text":"quelosepash","indices":[58,70]}], "symbols":[], "urls":[], "user_mentions":[]}, "favorited":false, "retweeted":false, "filter_level":"medium", "lang":"es } 35

Views on Tweets in four dimensions Space Time Time Time Time Content Space Content Content Space Social Space Social Social Social Content when where+what+who where when+what+who what when+where+who who when+where+what Time series Timelines Geo maps Density plots Word clouds Topic trends Social network graphs Activity graphs 36

Accessing Twitter Twitter s Search API source Limit Price 3200 / user 5000 / keyword 180 queries/ 15 minute free Twitter s Streaming API 1%-10% of tweets volume free DataSift none 0.15-0.20$ / unit Gnip (acquired by twitter) none By quote 37

Data Service (Merge) Visualization Lambda Architecture Open Source Frameworks for implementing a Lambda Architecture Sensor Layer mobile social IoT Distribution Layer Incoming Data All data Batch Layer Batch recompute Precompute Views Precomputed information Speed Layer Serving Layer batch view batch view Process stream Realtime increment Incremented information real time view real time view 38

Lambda Architecture in Action Twitter Horsebird Client (hbc) Cloudera Distribution Twitter Java API over Streaming API Spring Framework Popular Java Framework used to modularize part of the logic (sensor and serving layer) Apache Kafka Simple messaging framework based on file system to distribute information to both batch and speed layer Apache Avro Serialization system for efficient cross-language RPC and persistent data storage JSON open standard format that uses human-readable text to transmit data objects consisting of attribute value pairs. Distribution of Apache Hadoop: HDFS, MapReduce, Hive, Flume, Pig, Impala Cloudera Impala distributed query execution engine that runs against data stored in HDFS and HBase Apache Zookeeper Distributed, highly available coordination service. Provides primitives such as distributed locks Apache Storm & Trident distributed, fault-tolerant real-time computation system Apache Cassandra distributed database management system designed to handle large amounts of data across many commodity servers, providing high availability with no single point of failure 39

Facts & Figures 14 active twitter feeds ~ 14 million tweets/day ( > 5 billion tweets/year) ~ 8 GB/day raw data, compressed (2 DVDs) 66 GB storage capacity / day (replication & views/results included) Cluster of 10 nodes ~100 processors ~40 TB HD capacity in total; 46% used Currently in total 2.7 TB Raw Data 1.1 TB Pre-Processed data in Impala 1 TB Solr indices for full text search Cloudera 4.7.0 with Hadoop, Pig, Hive, Impala and Solr Kafka 0.7, Storm 0.9, DataStax Enterprise Edition >500 GB RAM 40

AGENDA 1. Big Data and Fast Data, what is it? 2. Architecting (Big) Data Systems 3. The Lambda Architecture 4. Use Case and the Implementation 5. Summary and Outlook 41

Summary The lambda architecture Can discard batch views and real-time views and recreate everything from scratch Mistakes corrected via re-computation Scalability through platform and distribution Data storage layer optimized independently from query resolution layer Still in a early stage. But a very interesting idea! Today a zoo of technologies are needed => Infrastructure group might not like it Better with so-called Hadoop distributions and Hadoop V2 (YARN) Logic has to be implemented twice (speed and batch layer) => Kappa architecture? 42

Data Service Visualization Data Service Kappa Architecture Sensor Layer Distribution Layer Replay Batch Layer Serving Layer mobile social IoT Incoming Data All data Batch Analytical analysis Precomputed analytics Speed Layer analytic view Process stream Realtime increment Incremented information real time view real time view Adapted from: Marz, N. & Warren, J. (2013) Big Data. Manning. 43

Weitere Informationen... http://www.digitallifeplus.com/18913/what-happens-online-in-60-secondsinfographic/ http://radar.oreilly.com/2014/07/questioning-the-lambda-architecture.html http://manning.com/marz/ Manning : Big Data 44

Fragen und Antworten... Ulises Fasoli Consultant Trivadis Lausanne ulises.fasoli@trivadis.com BASEL BERN BRUGG LAUSANNE ZÜRICH DÜSSELDORF FRANKFURT A.M. FREIBURG I.BR. HAMBURG MÜNCHEN STUTTGART WIEN 2014 2013 Trivadis