Big Data and Fast Data combined: is it possible?
- Julian Whitehead
- 10 years ago
Transcription
1 Big Data and Fast Data combined: is it possible? Ulises Fasoli DBTA Workshop Bern BASEL BERN BRUGG LAUSANNE ZÜRICH DÜSSELDORF FRANKFURT A.M. FREIBURG I.BR. HAMBURG MÜNCHEN STUTTGART WIEN 1
2 Ulises Fasoli Trivadis Lausanne 7+ years of software development experience Occasional blogger Contact information : [email protected] Blog: Twitter: ufasoli 2
3 Our company Trivadis is a market leader in IT consulting, system integration, solution engineering and the provision of IT services focusing on and technologies in Switzerland, Germany and Austria. We offer our services in the following strategic business fields: OPERATION. Trivadis Services takes over the interacting operation of your IT systems. 3
4 AGENDA 1. Big Data and Fast Data, what is it? 2. Architecting (Big) Data Systems 3. The Lambda Architecture 4. Use Case and the Implementation 5. Summary and Outlook 4
5 Big Data Definition (4 Vs) Characteristics of Big Data: Its Volume, Velocity and Variety in combination Time to action? -> Big Data + Event Processing = Fast Data 5
6 The world is changing The model of Generating/Consuming Data has changed. Old Model: few companies are generating data, all others are consuming data New Model: all of us are generating data, and all of us are consuming data 6
7 60 SECONDS 7 DOAG 2014 Big Data and Fast Data - Lambda Architecture and Its Implementation
8 Internet Of Things Sensors are/will be everywhere There are more devices tapping into the internet than people on earth How do we prepare our systems/architecture for the future? 8 Source: The Economist Source: Cisco
9 The world is changing: new data stores. Problem of traditional (R)DBMS approach: Complex object graph, Schema evolution, Semi-structured data, Scaling. Order ID: 1001 Order Date: Customer First Name: Peter Last Name: Sample Billing Address Street: Somestreet 10 City: Somewhere Postal Code: Line Items Name iPod Touch Monster Beat Apple Mouse Quantity Price ORDER CUSTOMER ADDRESS ORDER_LINES Polyglot persistence: Using multiple data storage technologies (RDBMS + NoSQL + NewSQL + In-Memory) 9
10 The world is changing: New platforms evolving (e.g. Hadoop Ecosystem) 10
11 Data as an Asset: Store everything? Data is just too valuable to delete! We must store everything! It depends. Big Data technologies allow you to store the raw information from new and existing data sources so that you can later use it to create new data-driven products which you haven't thought about today! Nonsense! Just store the data you know you need today! 11
12 AGENDA 1. Big Data and Fast Data, what is it? 2. Architecting (Big) Data Systems 3. The Lambda Architecture 4. Use Case and the Implementation 5. Summary and Outlook 12
13 What is a data system? A (data) system that manages the storage and querying of data with a lifetime measured in years encompassing every version of the application to ever exist, every hardware failure and every human mistake ever made. A data system answers questions based on information that was acquired in the past 13
14 What is a data system? - Goal The goal of a data system is to compute arbitrary functions on arbitrary data. Questions are answered by running functions that take data as input query = function (all data) 14
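The idea that a query is a function over all data can be sketched in a few lines of Python. This is only an illustration: the record layout, the sample data and the names run_query and count_records_by_user are assumptions made for the example and do not come from the talk.

```python
from typing import Callable, Iterable, TypeVar

Record = dict  # hypothetical record type: one stored fact per dict
T = TypeVar("T")


def run_query(all_data: Iterable[Record], fn: Callable[[Iterable[Record]], T]) -> T:
    """Answer a question by applying a function to the complete dataset."""
    return fn(all_data)


def count_records_by_user(records: Iterable[Record]) -> dict:
    """Example query function: number of records per user."""
    counts: dict = {}
    for record in records:
        counts[record["user"]] = counts.get(record["user"], 0) + 1
    return counts


# Hypothetical sample data, invented for illustration only.
ALL_DATA = [
    {"user": "alice", "text": "hello"},
    {"user": "bob", "text": "hi"},
    {"user": "alice", "text": "again"},
]

records_per_user = run_query(ALL_DATA, count_records_by_user)
```

Running the function over the whole dataset on every query is the simplest possible model; the rest of the talk is about making this fast enough through pre-computed views.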
15 Desired properties of a data system Robust and fault tolerant Low latency read / updates General Extensible Scalable Allow Ad-hoc queries Minimal maintenance Debuggable 15
16 How do we build (data) systems today Today's Architectures Source of Truth is mutable! Application (Query) Mobile Web RIA Rich Client Mutable Database Source of Truth RDBMS NoSQL NewSQL CRUD pattern What is the problem with this? Lack of Human Fault Tolerance Potential loss of information/data Source of Truth 16
17 Lack of Human Fault Tolerance: Bugs will be deployed to production over the lifetime of a data system. Operational mistakes will be made. Humans are part of the overall system, just like hard disks, CPUs, memory and software: design for human error like you design for any other fault. Examples of human error: deploy a bug that increments counters by two instead of by one; accidentally delete data from the database; accidental DOS on an important internal service. Worst two consequences: data loss or data corruption. As long as an error doesn't lose or corrupt good data, you can fix what went wrong 17
18 Lack of Human Fault Tolerance: Immutability vs. Mutability. An immutable system captures historical records of events. Each event happens at a particular time and is always true. A mutable system (the U and D in CRUD) updates the current state of the world. Mutable systems inherently lack human fault-tolerance: it is easy to corrupt or lose data. Immutability restricts the range of errors causing data loss/data corruption and is vastly more human fault-tolerant. Conclusion: Your source of truth should always be immutable 18
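The counter bug mentioned among the examples of human error makes the difference concrete. The following is a minimal sketch under stated assumptions: the event log, the function names and the numbers are invented for illustration.

```python
# Immutable source of truth: an append-only log of events (hypothetical data).
EVENTS = ("page_view", "page_view", "page_view")


def buggy_count(events):
    """Deployed bug: increments the counter by two instead of by one."""
    total = 0
    for _ in events:
        total += 2
    return total


def fixed_count(events):
    """Corrected logic: increments the counter by one per event."""
    total = 0
    for _ in events:
        total += 1
    return total


# Mutable design: only the derived counter is stored. Once the bug has
# written to it, the stored number is wrong and, in general, there is no
# record of which increments were affected.
mutable_counter = buggy_count(EVENTS)

# Immutable design: the raw events are untouched by the bug, so the view is
# recomputed from them with the corrected function.
recomputed_counter = fixed_count(EVENTS)
```

The point of the sketch is that the fix needs no repair logic for the damaged counter; the corrected function is simply run again over the unchanged events.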
19 A different kind of architecture with immutable source of truth Instead of using our traditional approach why not build data systems like this Immutable data View on Data Application (Query) Source of Truth HDFS NoSQL NewSQL RDBMS View on Data Mobile Web RIA Rich Client Source of Truth 19
20 How to create the views on the Immutable data? On the fly? Immutable data View Query Materialized, i.e. Pre-computed? Immutable data Pre-Computed Views Query 20
21 (Big) Data Processing Incoming Data Immutable data? Pre-Computed Views? Query How to compute the materialized views? How to compute queries from the views? 21
22 Today Big Data Processing means Batch Processing Stream 1 Stream 2 Event HDFS Hadoop Distributed File System Hadoop cluster (Map/Reduce) Data Store optimized for appending large results Queries 22
23 Big Data Processing - Batch Favorite Product List Changes Add ipad 64GB Add Sony RX Add Canon GX Remove Sony RX Add Nikon S Add BoseQC Add MacBook Pro Remove Canon GX10 Raw information => data compute Current Favorite Product List ipad 64GB Nikon S-100 BoseQC-15 MacBook Pro 15 derive Information => derived Current Product Count 4 23
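The two steps on this slide, computing the current list from the raw changes and deriving the count from that list, can be sketched as follows. The product names are shortened placeholders and the function name compute_favorites is an assumption for this example; the slide itself shows no code.

```python
# Raw information: the full history of changes (placeholder product names).
CHANGES = [
    ("add", "iPad"),
    ("add", "Sony camera"),
    ("add", "Canon camera"),
    ("remove", "Sony camera"),
    ("add", "Nikon camera"),
    ("add", "Bose headphones"),
    ("add", "MacBook Pro"),
    ("remove", "Canon camera"),
]


def compute_favorites(changes):
    """Compute step: replay every change, in order, to get the current list."""
    favorites = []
    for action, product in changes:
        if action == "add" and product not in favorites:
            favorites.append(product)
        elif action == "remove" and product in favorites:
            favorites.remove(product)
    return favorites


# Data: computed from the raw changes.
current_favorites = compute_favorites(CHANGES)

# Information: derived from the computed list.
product_count = len(current_favorites)
```

Because the raw changes are kept, both results can be thrown away and recomputed at any time, for example after a bug in compute_favorites has been fixed.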
24 Big Data Processing - Batch: But we are not done yet. [Figure: timeline of the source of truth and the results, showing fully processed data up to the last full batch period, the time for the batch job, and the non-processed data up to now.] Using only batch processing always leaves you with a portion of non-processed data. 24
25 Big Data Processing - Adding Real-Time Immutable data Batch Views Incoming Data Query Data Stream? Realtime Views How to compute real-time views How to compute queries from the views? 25
26 Big Data Processing - Adding Real-Time Immutable data Views Favorite Product List Changes Add ipad 64GB Add Sony RX Add Canon GX Remove Sony RX Add Nikon S Add BoseQC Add MacBook Pro Remove Canon GX10 Now Add Canon Scanner compute Current Favorite Product List ipad 64GB Nikon S-100 BoseQC-15 MacBook Pro 15 Data Stream Query Current Product Count incoming Stream of Favorite Product List Changes 5 Add Canon Scanner compute Now Canon Scanner 26
27 Big Data Processing - Batch & Real Time blended view for end user batch processing worked fine here (e.g. Hadoop) real time processing works here now time Fully processed data Last full batch period Time for batch job Adapted from Ted Dunning (March 2012): 27
28 AGENDA 1. Big Data and Fast Data, what is it? 2. Architecting (Big) Data Systems 3. The Lambda Architecture 4. The Use Case and the Implementation 5. Summary and Outlook 28
29 Lambda Architecture Batch Layer Serving Layer Immutable data B C Batch View D Incoming Data A G Query Speed Layer Data Stream E F Realtime View Lambda => Query = function(all data) 29
30 Lambda Architecture
A. All data is sent to both the batch and speed layer.
B. Master data set is an immutable, append-only set of data.
C. Batch layer pre-computes query functions from scratch; the result is called Batch Views. Batch layer constantly re-computes the batch views.
D. Batch views are indexed and stored in a scalable database to get particular values very quickly. Swaps in new batch views when they are available.
E. Speed layer compensates for the high latency of updates to the Batch Views.
F. Uses fast incremental algorithms and read/write databases to produce realtime views.
G. Queries are resolved by getting results from both batch and real-time views.
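Steps A to G can be sketched as a small in-memory simulation. This is a toy model under stated assumptions, not the implementation described in the talk: the hashtag-count query, the event layout and all names (ingest, batch_view, SpeedLayer, query) are invented for illustration, and timing issues between the batch run and incoming events are ignored.

```python
from collections import Counter

master_dataset = []  # B: immutable, append-only master data set


class SpeedLayer:
    """E/F: incrementally updated view of events not yet in a batch view."""

    def __init__(self):
        self.view = Counter()

    def on_event(self, event):
        self.view[event["hashtag"]] += 1

    def reset(self):
        """Drop the transient view once a new batch view covers its events."""
        self.view.clear()


speed_layer = SpeedLayer()


def ingest(event):
    """A: every event goes to both the batch and the speed layer."""
    master_dataset.append(event)
    speed_layer.on_event(event)


def batch_view(dataset):
    """C: recompute the view from scratch over the whole master data set."""
    return Counter(event["hashtag"] for event in dataset)


def query(batch, realtime, hashtag):
    """G: resolve a query by merging the batch and the real-time view."""
    return batch[hashtag] + realtime[hashtag]


for tag in ("lambda", "kafka", "lambda"):
    ingest({"hashtag": tag})

# D: a finished batch run is swapped in and the speed layer starts over.
serving_view = batch_view(master_dataset)
speed_layer.reset()

# Events arriving after the batch run are only visible in the real-time view.
for tag in ("lambda", "storm"):
    ingest({"hashtag": tag})

lambda_count = query(serving_view, speed_layer.view, "lambda")
storm_count = query(serving_view, speed_layer.view, "storm")
```

In this sketch the merged result equals what a fresh batch run over the full master data set would return, which is the property the merge in step G is meant to provide.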
31 Lambda Architecture Batch Layer Stores the immutable constantly growing dataset Computes arbitrary views from this dataset using BigData technologies (can take hours) Can be always recreated Speed Layer Computes the views from the constant stream of data it receives Needed to compensate for the high latency of the batch layer Incremental model and views are transient Serving Layer Responsible for indexing and exposing the pre-computed batch views so that they can be queried Exposes the incremented real-time views Merges the batch and the real-time views into a consistent result 31
32 Data Service (Merge) Visualization Lambda Architecture Sensor Layer mobile social IoT Distribution Layer Incoming Data All data Batch Layer Batch recompute Precompute Views Precomputed information Speed Layer Serving Layer batch view batch view Process stream Realtime increment Incremented information real time view real time view Adapted from: Marz, N. & Warren, J. (2013) Big Data. Manning. 32
33 AGENDA 1. Big Data and Fast Data, what is it? 2. Architecting (Big) Data Systems 3. The Lambda Architecture 4. Use Case and the Implementation 5. Summary and Outlook 33
34 Project Definition: Build a platform for analysing Twitter communications retrospectively and in real-time. Scalability and the ability for future data fusion with other information are a must. Provide Web-based access to the analytical information. Invest in new, innovative and not widely-proven technology. PoC environment, a pre-invest for future systems 34
35 Anatomy of a tweet Time Space Content Social Technical { "created_at":"sun Aug 18 14:29: ", "id": , "id_str":" ", "text":"baloncesto preparaci\u00f3n Eslovenia, Rajoy derrota a Merkel. #quelosepash", "source":"\u003ca href=\" rel=\"nofollow\ \u003etwitter for iphone\u003c\/a\u003e", "truncated":false, "in_reply_to_status_id":null, "in_reply_to_status_id_str":null, "in_reply_to_user_id":null, "in_reply_to_user_id_str":null, "in_reply_to_screen_name":null, "user":{ "id": , "id_str":" ", "name":"juan Carlos Romo\u2122", "screen_name":"jcsromo", "location":"sopuerta, Vizcaya", "url":null, "description":"portugalujo, saturado de todo, de baloncesto no. Twitter personal.", "protected":false, "followers_count":1331, "friends_count":1326, "listed_count":31, "created_at":"fri Jun 06 21:21: ", "favourites_count":255, "utc_offset":7200, "time_zone":"madrid", "geo_enabled":true, "verified":false, "statuses_count":22787, "lang":"es", "contributors_enabled":false, "is_translator":false, "profile_image_url_https":" be4973d9eb457a c47c8b7_normal.jpeg", "profile_banner_url":" 460", "profile_link_color":"2fc2ef", "profile_sidebar_border_color":"ffffff", "profile_sidebar_fill_color":"252429", "profile_text_color":"666666", "profile_use_background_image":true, "default_profile":false, "default_profile_image":false, "following":null, "follow_request_sent":null, "notifications":null}, "geo":{ "type":"point","coordinates":[ , ]}, "coordinates":{"type":"point","coordinates":[ , ]}, "place":{"id":"cd43ea85d651af92", "url":" "place_type":"city", "name":"bilbao", "full_name":"bilbao, Vizcaya", "country_code":"es", "country":"espa\u00f1a", "bounding_box":{"type":"polygon","coordinates":[[[ , ], [ , ],[ , ],[ , ]]]}, "attributes":{}}, "contributors": null, "retweet_count":0, "favorite_count":0, "entities":{"hashtags":[{"text":"quelosepash","indices":[58,70]}], "symbols":[], "urls":[], "user_mentions":[]}, "favorited":false, "retweeted":false, "filter_level":"medium", "lang":"es } 35
36 Views on Tweets in four dimensions
Time (when), by where+what+who: Time series, Timelines
Space (where), by when+what+who: Geo maps, Density plots
Content (what), by when+where+who: Word clouds, Topic trends
Social (who), by when+where+what: Social network graphs, Activity graphs
37 Accessing Twitter
Source | Limit | Price
Twitter's Search API | 3200 / user, 5000 / keyword, 180 queries / 15 minutes | free
Twitter's Streaming API | 1%-10% of tweets volume | free
DataSift | none | $ / unit
Gnip (acquired by Twitter) | none | By quote
38 Data Service (Merge) Visualization Lambda Architecture Open Source Frameworks for implementing a Lambda Architecture Sensor Layer mobile social IoT Distribution Layer Incoming Data All data Batch Layer Batch recompute Precompute Views Precomputed information Speed Layer Serving Layer batch view batch view Process stream Realtime increment Incremented information real time view real time view 38
39 Lambda Architecture in Action
Twitter Hosebird Client (hbc): Twitter Java API over Streaming API
Spring Framework: popular Java framework used to modularize part of the logic (sensor and serving layer)
Apache Kafka: simple messaging framework based on file system to distribute information to both batch and speed layer
Apache Avro: serialization system for efficient cross-language RPC and persistent data storage
JSON: open standard format that uses human-readable text to transmit data objects consisting of attribute-value pairs
Cloudera Distribution: distribution of Apache Hadoop: HDFS, MapReduce, Hive, Flume, Pig, Impala
Cloudera Impala: distributed query execution engine that runs against data stored in HDFS and HBase
Apache ZooKeeper: distributed, highly available coordination service; provides primitives such as distributed locks
Apache Storm & Trident: distributed, fault-tolerant real-time computation system
Apache Cassandra: distributed database management system designed to handle large amounts of data across many commodity servers, providing high availability with no single point of failure
40 Facts & Figures 14 active twitter feeds ~ 14 million tweets/day ( > 5 billion tweets/year) ~ 8 GB/day raw data, compressed (2 DVDs) 66 GB storage capacity / day (replication & views/results included) Cluster of 10 nodes ~100 processors ~40 TB HD capacity in total; 46% used Currently in total 2.7 TB Raw Data 1.1 TB Pre-Processed data in Impala 1 TB Solr indices for full text search Cloudera with Hadoop, Pig, Hive, Impala and Solr Kafka 0.7, Storm 0.9, DataStax Enterprise Edition >500 GB RAM 40
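Two of these figures can be checked with simple arithmetic: the yearly tweet volume, and how long the remaining cluster capacity would last at the quoted daily storage rate. The second calculation is only a rough sketch that assumes 365 days per year, 1 TB = 1000 GB and a constant growth rate; none of these assumptions is stated on the slide.

```python
# Figures quoted on the slide.
TWEETS_PER_DAY = 14_000_000
STORAGE_GB_PER_DAY = 66      # replication and views/results included
CLUSTER_CAPACITY_TB = 40
USED_FRACTION = 0.46

# Assumptions for this sketch: 365 days per year, 1 TB = 1000 GB,
# and a storage growth rate that stays constant.
tweets_per_year = TWEETS_PER_DAY * 365
free_gb = CLUSTER_CAPACITY_TB * 1000 * (1 - USED_FRACTION)
days_until_full = free_gb / STORAGE_GB_PER_DAY

print(f"{tweets_per_year:,} tweets/year")
print(f"{days_until_full:.0f} days of capacity left")
```

Under these assumptions the yearly volume comes out at about 5.1 billion tweets, consistent with the "> 5 billion" on the slide, and the free capacity would last a little under a year.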
41 AGENDA 1. Big Data and Fast Data, what is it? 2. Architecting (Big) Data Systems 3. The Lambda Architecture 4. Use Case and the Implementation 5. Summary and Outlook 41
42 Summary: The lambda architecture. Can discard batch views and real-time views and recreate everything from scratch. Mistakes corrected via re-computation. Scalability through platform and distribution. Data storage layer optimized independently from query resolution layer. Still in an early stage, but a very interesting idea! Today a zoo of technologies is needed => Infrastructure group might not like it. Better with so-called Hadoop distributions and Hadoop V2 (YARN). Logic has to be implemented twice (speed and batch layer) => Kappa architecture? 42
43 Data Service Visualization Data Service Kappa Architecture Sensor Layer Distribution Layer Replay Batch Layer Serving Layer mobile social IoT Incoming Data All data Batch Analytical analysis Precomputed analytics Speed Layer analytic view Process stream Realtime increment Incremented information real time view real time view Adapted from: Marz, N. & Warren, J. (2013) Big Data. Manning. 43
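The "Replay" arrow in this diagram is the key difference from the Lambda Architecture: the logic exists once, as stream processing code, and a view is rebuilt by replaying the retained events through it. The following is a minimal sketch of that idea; the event log, the two versions of the logic and the names replay, process_v1 and process_v2 are assumptions made for illustration.

```python
from collections import Counter

# Hypothetical retained event log; the events are invented for illustration.
EVENT_LOG = [
    {"hashtag": "Lambda"},
    {"hashtag": "kappa"},
    {"hashtag": "lambda"},
]


def process_v1(view, event):
    """Version 1 of the stream logic: count hashtags exactly as written."""
    view[event["hashtag"]] += 1


def process_v2(view, event):
    """Version 2 of the stream logic: count hashtags case-insensitively."""
    view[event["hashtag"].lower()] += 1


def replay(log, process):
    """Rebuild a view by feeding the retained log through one code path."""
    view = Counter()
    for event in log:
        process(view, event)
    return view


# The same code path serves live processing and re-computation: deploying
# new logic means replaying the log with it and switching to the new view.
view_v1 = replay(EVENT_LOG, process_v1)
view_v2 = replay(EVENT_LOG, process_v2)
```

There is no separate batch implementation to keep in sync here, which addresses the "logic has to be implemented twice" concern, at the cost of having to retain the events for as long as replays may be needed.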
44 Further information Manning: Big Data 44
45 Questions and answers... Ulises Fasoli Consultant Trivadis Lausanne
Building Scalable Big Data Pipelines
Building Scalable Big Data Pipelines NOSQL SEARCH ROADSHOW ZURICH Christian Gügi, Solution Architect 19.09.2013 AGENDA Opportunities & Challenges Integrating Hadoop Lambda Architecture Lambda in Practice
Lambda Architecture. Near Real-Time Big Data Analytics Using Hadoop. January 2015. Email: [email protected] Website: www.qburst.com
Lambda Architecture Near Real-Time Big Data Analytics Using Hadoop January 2015 Contents Overview... 3 Lambda Architecture: A Quick Introduction... 4 Batch Layer... 4 Serving Layer... 4 Speed Layer...
Big Data JAMES WARREN. Principles and best practices of NATHAN MARZ MANNING. scalable real-time data systems. Shelter Island
Big Data Principles and best practices of scalable real-time data systems NATHAN MARZ JAMES WARREN II MANNING Shelter Island contents preface xiii acknowledgments xv about this book xviii ~1 Anew paradigm
Big Data. Marriage of RDBMS-DWH and Hadoop & Co. Author: Jan Ott Trivadis AG. 2014 Trivadis. Big Data - Marriage of RDBMS-DWH and Hadoop & Co.
Big Data Marriage of RDBMS-DWH and Hadoop & Co. Author: Jan Ott Trivadis AG BASEL BERN LAUSANNE ZÜRICH DÜSSELDORF FRANKFURT A.M. FREIBURG I.BR. HAMBURG MÜNCHEN STUTTGART WIEN 1 Mit über 600 IT- und Fachexperten
Lambda Architecture for Batch and Real- Time Processing on AWS with Spark Streaming and Spark SQL. May 2015
Lambda Architecture for Batch and Real- Time Processing on AWS with Spark Streaming and Spark SQL May 2015 2015, Amazon Web Services, Inc. or its affiliates. All rights reserved. Notices This document
SQL + NOSQL + NEWSQL + REALTIME FOR INVESTMENT BANKS
Enterprise Data Problems in Investment Banks BigData History and Trend Driven by Google CAP Theorem for Distributed Computer System Open Source Building Blocks: Hadoop, Solr, Storm.. 3548 Hypothetical
FAQs. This material is built based on. Lambda Architecture. Scaling with a queue. 8/27/2015 Sangmi Pallickara
CS535 Big Data - Fall 2015 W1.B.1 CS535 Big Data - Fall 2015 W1.B.2 CS535 BIG DATA FAQs Wait list Term project topics PART 0. INTRODUCTION 2. A PARADIGM FOR BIG DATA Sangmi Lee Pallickara Computer Science,
Apache Storm vs. Spark Streaming Two Stream Processing Platforms compared
Apache Storm vs. Spark Streaming Two Stream Platforms compared DBTA Workshop on Stream Berne, 3.1.014 Guido Schmutz BASEL BERN BRUGG LAUSANNE ZÜRICH DÜSSELDORF FRANKFURT A.M. FREIBURG I.BR. HAMBURG MUNICH
BIG DATA. Using the Lambda Architecture on a Big Data Platform to Improve Mobile Campaign Management. Author: Sandesh Deshmane
BIG DATA Using the Lambda Architecture on a Big Data Platform to Improve Mobile Campaign Management Author: Sandesh Deshmane Executive Summary Growing data volumes and real time decision making requirements
How To Scale Out Of A Nosql Database
Firebird meets NoSQL (Apache HBase) Case Study Firebird Conference 2011 Luxembourg 25.11.2011 26.11.2011 Thomas Steinmaurer DI +43 7236 3343 896 [email protected] www.scch.at Michael Zwick DI
Big Data Course Highlights
Big Data Course Highlights The Big Data course will start with the basics of Linux which are required to get started with Big Data and then slowly progress from some of the basics of Hadoop/Big Data (like
Hadoop Ecosystem Overview. CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook
Hadoop Ecosystem Overview CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook Agenda Introduce Hadoop projects to prepare you for your group work Intimate detail will be provided in future
How To Use Big Data For Telco (For A Telco)
ON-LINE VIDEO ANALYTICS EMBRACING BIG DATA David Vanderfeesten, Bell Labs Belgium ANNO 2012 YOUR DATA IS MONEY BIG MONEY! Your click stream, your activity stream, your electricity consumption, your call
Dominik Wagenknecht Accenture
Dominik Wagenknecht Accenture Improving Mainframe Performance with Hadoop October 17, 2014 Organizers General Partner Top Media Partner Media Partner Supporters About me Dominik Wagenknecht Accenture Vienna
Architectural patterns for building real time applications with Apache HBase. Andrew Purtell Committer and PMC, Apache HBase
Architectural patterns for building real time applications with Apache HBase Andrew Purtell Committer and PMC, Apache HBase Who am I? Distributed systems engineer Principal Architect in the Big Data Platform
HDP Hadoop From concept to deployment.
HDP Hadoop From concept to deployment. Ankur Gupta Senior Solutions Engineer Rackspace: Page 41 27 th Jan 2015 Where are you in your Hadoop Journey? A. Researching our options B. Currently evaluating some
How To Create A Data Visualization With Apache Spark And Zeppelin 2.5.3.5
Big Data Visualization using Apache Spark and Zeppelin Prajod Vettiyattil, Software Architect, Wipro Agenda Big Data and Ecosystem tools Apache Spark Apache Zeppelin Data Visualization Combining Spark
Hadoop Ecosystem B Y R A H I M A.
Hadoop Ecosystem B Y R A H I M A. History of Hadoop Hadoop was created by Doug Cutting, the creator of Apache Lucene, the widely used text search library. Hadoop has its origins in Apache Nutch, an open
Hadoop implementation of MapReduce computational model. Ján Vaňo
Hadoop implementation of MapReduce computational model Ján Vaňo What is MapReduce? A computational model published in a paper by Google in 2004 Based on distributed computation Complements Google s distributed
Dell In-Memory Appliance for Cloudera Enterprise
Dell In-Memory Appliance for Cloudera Enterprise Hadoop Overview, Customer Evolution and Dell In-Memory Product Details Author: Armando Acosta Hadoop Product Manager/Subject Matter Expert [email protected]/
Managing Big Data with Hadoop & Vertica. A look at integration between the Cloudera distribution for Hadoop and the Vertica Analytic Database
Managing Big Data with Hadoop & Vertica A look at integration between the Cloudera distribution for Hadoop and the Vertica Analytic Database Copyright Vertica Systems, Inc. October 2009 Cloudera and Vertica
Beyond Lambda - how to get from logical to physical. Artur Borycki, Director International Technology & Innovations
Beyond Lambda - how to get from logical to physical Artur Borycki, Director International Technology & Innovations Simplification & Efficiency Teradata believe in the principles of self-service, automation
How To Handle Big Data With A Data Scientist
III Big Data Technologies Today, new technologies make it possible to realize value from Big Data. Big data technologies can replace highly customized, expensive legacy systems with a standard solution
Monitis Project Proposals for AUA. September 2014, Yerevan, Armenia
Monitis Project Proposals for AUA September 2014, Yerevan, Armenia Distributed Log Collecting and Analysing Platform Project Specifications Category: Big Data and NoSQL Software Requirements: Apache Hadoop
The evolution of database technology (II) Huibert Aalbers Senior Certified Executive IT Architect
The evolution of database technology (II) Huibert Aalbers Senior Certified Executive IT Architect IT Insight podcast This podcast belongs to the IT Insight series You can subscribe to the podcast through
Hadoop. http://hadoop.apache.org/ Sunday, November 25, 12
Hadoop http://hadoop.apache.org/ What Is Apache Hadoop? The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using
SOLVING REAL AND BIG (DATA) PROBLEMS USING HADOOP. Eva Andreasson Cloudera
SOLVING REAL AND BIG (DATA) PROBLEMS USING HADOOP Eva Andreasson Cloudera Most FAQ: Super-Quick Overview! The Apache Hadoop Ecosystem a Zoo! Oozie ZooKeeper Hue Impala Solr Hive Pig Mahout HBase MapReduce
Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data
Introduction to Hadoop HDFS and Ecosystems ANSHUL MITTAL Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data Topics The goal of this presentation is to give
Getting Started with Hadoop. Raanan Dagan Paul Tibaldi
Getting Started with Hadoop Raanan Dagan Paul Tibaldi What is Apache Hadoop? Hadoop is a platform for data storage and processing that is Scalable Fault tolerant Open source CORE HADOOP COMPONENTS Hadoop
INTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE
INTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE AGENDA Introduction to Big Data Introduction to Hadoop HDFS file system Map/Reduce framework Hadoop utilities Summary BIG DATA FACTS In what timeframe
Data-Intensive Programming. Timo Aaltonen Department of Pervasive Computing
Data-Intensive Programming Timo Aaltonen Department of Pervasive Computing Data-Intensive Programming Lecturer: Timo Aaltonen University Lecturer [email protected] Assistants: Henri Terho and Antti
Real Time Data Processing using Spark Streaming
Real Time Data Processing using Spark Streaming Hari Shreedharan, Software Engineer @ Cloudera Committer/PMC Member, Apache Flume Committer, Apache Sqoop Contributor, Apache Spark Author, Using Flume (O
The Top 10 7 Hadoop Patterns and Anti-patterns. Alex Holmes @
The Top 10 7 Hadoop Patterns and Anti-patterns Alex Holmes @ whoami Alex Holmes Software engineer Working on distributed systems for many years Hadoop since 2008 @grep_alex grepalex.com what s hadoop...
TRAINING PROGRAM ON BIGDATA/HADOOP
Course: Training on Bigdata/Hadoop with Hands-on Course Duration / Dates / Time: 4 Days / 24th - 27th June 2015 / 9:30-17:30 Hrs Venue: Eagle Photonics Pvt Ltd First Floor, Plot No 31, Sector 19C, Vashi,
Application Development. A Paradigm Shift
Application Development for the Cloud: A Paradigm Shift Ramesh Rangachar Intelsat t 2012 by Intelsat. t Published by The Aerospace Corporation with permission. New 2007 Template - 1 Motivation for the
ESS event: Big Data in Official Statistics. Antonino Virgillito, Istat
ESS event: Big Data in Official Statistics Antonino Virgillito, Istat v erbi v is 1 About me Head of Unit Web and BI Technologies, IT Directorate of Istat Project manager and technical coordinator of Web
So What s the Big Deal?
So What s the Big Deal? Presentation Agenda Introduction What is Big Data? So What is the Big Deal? Big Data Technologies Identifying Big Data Opportunities Conducting a Big Data Proof of Concept Big Data
Principles and best practices of scalable real-time data systems. Nathan Marz James Warren 6$03/(&+$37(5 MANNING
Principles and best practices of scalable real-time data systems Nathan Marz James Warren 6$03/(&+$37(5 MANNING Big Data by Nathan Marz and James Warren Chapter 1 Copyright 2015 Manning Publications 1
Constructing a Data Lake: Hadoop and Oracle Database United!
Constructing a Data Lake: Hadoop and Oracle Database United! Sharon Sophia Stephen Big Data PreSales Consultant February 21, 2015 Safe Harbor The following is intended to outline our general product direction.
How Companies are! Using Spark
How Companies are! Using Spark And where the Edge in Big Data will be Matei Zaharia History Decreasing storage costs have led to an explosion of big data Commodity cluster software, like Hadoop, has made
CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop)
CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop) Rezaul A. Chowdhury Department of Computer Science SUNY Stony Brook Spring 2016 MapReduce MapReduce is a programming model
Workshop on Hadoop with Big Data
Workshop on Hadoop with Big Data Hadoop? Apache Hadoop is an open source framework for distributed storage and processing of large sets of data on commodity hardware. Hadoop enables businesses to quickly
STREAM PROCESSING AT LINKEDIN: APACHE KAFKA & APACHE SAMZA. Processing billions of events every day
STREAM PROCESSING AT LINKEDIN: APACHE KAFKA & APACHE SAMZA Processing billions of events every day Neha Narkhede Co-founder and Head of Engineering @ Stealth Startup Prior to this Lead, Streams Infrastructure
Unified Big Data Processing with Apache Spark. Matei Zaharia @matei_zaharia
Unified Big Data Processing with Apache Spark Matei Zaharia @matei_zaharia What is Apache Spark? Fast & general engine for big data processing Generalizes MapReduce model to support more types of processing
Qsoft Inc www.qsoft-inc.com
Big Data & Hadoop Qsoft Inc www.qsoft-inc.com Course Topics 1 2 3 4 5 6 Week 1: Introduction to Big Data, Hadoop Architecture and HDFS Week 2: Setting up Hadoop Cluster Week 3: MapReduce Part 1 Week 4:
Openbus Documentation
Openbus Documentation Release 1 Produban February 17, 2014 Contents i ii An open source architecture able to process the massive amount of events that occur in a banking IT Infraestructure. Contents:
Hadoop IST 734 SS CHUNG
Hadoop IST 734 SS CHUNG Introduction What is Big Data?? Bulk Amount Unstructured Lots of Applications which need to handle huge amount of data (in terms of 500+ TB per day) If a regular machine need to
Information Builders Mission & Value Proposition
Value 10/06/2015 2015 MapR Technologies 2015 MapR Technologies 1 Information Builders Mission & Value Proposition Economies of Scale & Increasing Returns (Note: Not to be confused with diminishing returns
Big Data Technology ดร.ช ชาต หฤไชยะศ กด. Choochart Haruechaiyasak, Ph.D.
Big Data Technology ดร.ช ชาต หฤไชยะศ กด Choochart Haruechaiyasak, Ph.D. Speech and Audio Technology Laboratory (SPT) National Electronics and Computer Technology Center (NECTEC) National Science and Technology
Chukwa, Hadoop subproject, 37, 131 Cloud enabled big data, 4 Codd s 12 rules, 1 Column-oriented databases, 18, 52 Compression pattern, 83 84
Index A Amazon Web Services (AWS), 50, 58 Analytics engine, 21 22 Apache Kafka, 38, 131 Apache S4, 38, 131 Apache Sqoop, 37, 131 Appliance pattern, 104 105 Application architecture, big data analytics
Introduction to Hadoop. New York Oracle User Group Vikas Sawhney
Introduction to Hadoop New York Oracle User Group Vikas Sawhney GENERAL AGENDA Driving Factors behind BIG-DATA NOSQL Database 2014 Database Landscape Hadoop Architecture Map/Reduce Hadoop Eco-system Hadoop
Time series IoT data ingestion into Cassandra using Kaa
Time series IoT data ingestion into Cassandra using Kaa Andrew Shvayka [email protected] Agenda Data ingestion challenges Why Kaa? Why Cassandra? Reference architecture overview Hands-on Sandbox
HADOOP ADMINISTATION AND DEVELOPMENT TRAINING CURRICULUM
HADOOP ADMINISTATION AND DEVELOPMENT TRAINING CURRICULUM 1. Introduction 1.1 Big Data Introduction What is Big Data Data Analytics Bigdata Challenges Technologies supported by big data 1.2 Hadoop Introduction
Elephants and Storms - using Big Data techniques for Analysis of Large and Changing Datasets
Paper DH07 Elephants and Storms - using Big Data techniques for Analysis of Large and Changing Datasets Geoff Low, Medidata Solutions, London, United Kingdom ABSTRACT As an industry we are data-led. We
Oracle s Big Data solutions. Roger Wullschleger. <Insert Picture Here>
s Big Data solutions Roger Wullschleger DBTA Workshop on Big Data, Cloud Data Management and NoSQL 10. October 2012, Stade de Suisse, Berne 1 The following is intended to outline
Lecture 10: HBase! Claudia Hauff (Web Information Systems)! [email protected]
Big Data Processing, 2014/15 Lecture 10: HBase!! Claudia Hauff (Web Information Systems)! [email protected] 1 Course content Introduction Data streams 1 & 2 The MapReduce paradigm Looking behind the
Large scale processing using Hadoop. Ján Vaňo
Large scale processing using Hadoop Ján Vaňo What is Hadoop? Software platform that lets one easily write and run applications that process vast amounts of data Includes: MapReduce offline computing engine
The Future of Data Management
The Future of Data Management with Hadoop and the Enterprise Data Hub Amr Awadallah (@awadallah) Cofounder and CTO Cloudera Snapshot Founded 2008, by former employees of Employees Today ~ 800 World Class
Moving From Hadoop to Spark
+ Moving From Hadoop to Spark Sujee Maniyam Founder / Principal @ www.elephantscale.com [email protected] Bay Area ACM meetup (2015-02-23) + HI, Featured in Hadoop Weekly #109 + About Me : Sujee
#TalendSandbox for Big Data
Evalua&on von Apache Hadoop mit der #TalendSandbox for Big Data Julien Clarysse @whatdoesdatado @talend 2015 Talend Inc. 1 Connecting the Data-Driven Enterprise 2 Talend Overview Founded in 2006 BRAND
Putting Apache Kafka to Use
Putting Apache Kafka to Use: Building a Real-time Data Platform for Event Streams JAY KREPS, CONFLUENT A Couple of Themes Theme 1: Rise of Events Theme 2: Immutability Everywhere Level Example Immutable
A Survey on Big Data Concepts and Tools
A Survey on Big Data Concepts and Tools D. Rajasekar 1, C. Dhanamani 2, S. K. Sandhya 3 1,3 PG Scholar, 2 Assistant Professor, Department of Computer Science and Engineering, Sri Krishna College of Engineering
Big Data
Big Data Kevin Kalmbach Principal Sales Consultant, Public Sector Engineered Systems Program Agenda What is Big Data and why it is important? What is your Big
Comparing SQL and NOSQL databases
COSC 6397 Big Data Analytics Data Formats (II) HBase Edgar Gabriel Spring 2015 Comparing SQL and NOSQL databases Types Development History Data Storage Model SQL One type (SQL database) with minor variations
Big Data & QlikView. Democratizing Big Data Analytics. David Freriks Principal Solution Architect
Big Data & QlikView Democratizing Big Data Analytics David Freriks Principal Solution Architect TDWI Vancouver Agenda What really is Big Data? How do we separate hype from reality? How does that relate
Big Data and Industrial Internet
Big Data and Industrial Internet Keijo Heljanko Department of Computer Science and Helsinki Institute for Information Technology HIIT School of Science, Aalto University [email protected] 16.6-2015
Real Time Analytics for Big Data. Nati Shalom @natishalom
Real Time Analytics for Big Data A Twitter Inspired Case Study Nati Shalom @natishalom Big Data Predictions Over the next few years we'll see the adoption of scalable frameworks and platforms for
Design and Evolution of the Apache Hadoop File System(HDFS)
Design and Evolution of the Apache Hadoop File System(HDFS) Dhruba Borthakur Engineer@Facebook Committer@Apache HDFS SDC, Sept 19 2011 Outline Introduction Yet another file-system, why? Goals of Hadoop
Chase Wu New Jersey Institute of Technology
CS 698: Special Topics in Big Data Chapter 4. Big Data Analytics Platforms Chase Wu New Jersey Institute of Technology Some of the slides have been provided through the courtesy of Dr. Ching-Yung Lin at
Thomas Baumann Swiss Mobiliar Bern, Switzerland
Thomas Baumann Swiss Mobiliar Bern, Switzerland WHAT IS BIG DATA (1/3): 3-5 V 1 + 1 = 11 Volume Velocity Variety Veracity Value From Terabytes to Exabytes to process From milliseconds to minutes to respond
Hadoop Evolution In Organizations. Mark Vervuurt Cluster Data Science & Analytics
Hadoop Evolution In Organizations Mark Vervuurt Cluster Data Science & Analytics AGENDA 1. Yellow Elephant 2. Data Ingestion & Complex Event Processing 3. SQL on Hadoop 4. NoSQL 5. InMemory 6. Data Science & Machine Learning
Certified Big Data and Apache Hadoop Developer VS-1221
Certified Big Data and Apache Hadoop Developer VS-1221 Certified Big Data and Apache Hadoop Developer Certification Code VS-1221 Vskills certification for Big Data and Apache Hadoop Developer Certification
Search and Real-Time Analytics on Big Data
Search and Real-Time Analytics on Big Data Sewook Wee, Ryan Tabora, Jason Rutherglen Accenture & Think Big Analytics Strata New York October, 2012 Big Data: data becomes your core asset. It realizes its
A Brief Introduction to Apache Tez
A Brief Introduction to Apache Tez Introduction It is a fact that data is basically the new currency of the modern business world. Companies that effectively maximize the value of their data (extract value
Analytics in the Cloud. Peter Sirota, GM Elastic MapReduce
Analytics in the Cloud Peter Sirota, GM Elastic MapReduce Data-Driven Decision Making Data is the new raw material for any business on par with capital, people, and labor. What is Big Data? Terabytes of
The Internet of Things and Big Data: Intro
The Internet of Things and Big Data: Intro John Berns, Solutions Architect, APAC - MapR Technologies April 22nd, 2014 1 What This Is; What This Is Not It's not specific to IoT It's not about any specific
Apache HBase. Crazy dances on the elephant back
Apache HBase Crazy dances on the elephant back Roman Nikitchenko, 16.10.2014 YARN 2 FIRST EVER DATA OS 10.000 nodes computer Recent technology changes are focused on higher scale. Better resource usage
BIG DATA TRENDS AND TECHNOLOGIES
BIG DATA TRENDS AND TECHNOLOGIES THE WORLD OF DATA IS CHANGING Cloud WHAT IS BIG DATA? Big data are datasets that grow so large that they become awkward to work with using on-hand database management tools.
HDP Enabling the Modern Data Architecture
HDP Enabling the Modern Data Architecture Herb Cunitz President, Hortonworks Page 1 Hortonworks enables adoption of Apache Hadoop through HDP (Hortonworks Data Platform) Founded in 2011 Original 24 architects,
Fast Data in the Era of Big Data: Twitter's Real-Time Related Query Suggestion Architecture
Fast Data in the Era of Big Data: Twitter's Real-Time Related Query Suggestion Architecture Gilad Mishne, Jeff Dalton, Zhenghua Li, Aneesh Sharma, Jimmy Lin Adeniyi Abdul 2522715 Agenda Abstract Introduction
IOT & Big Data: The Future Information Processing Architecture
IOT & Big Data: The Future Information Processing Architecture Dr. Michael Faden Dirk Weise BASEL BERN BRUGG GENF LAUSANNE ZÜRICH DÜSSELDORF FRANKFURT A.M. FREIBURG I.BR. HAMBURG MÜNCHEN STUTTGART WIEN
The Future of Data Management with Hadoop and the Enterprise Data Hub
The Future of Data Management with Hadoop and the Enterprise Data Hub Amr Awadallah Cofounder & CTO, Cloudera, Inc. Twitter: @awadallah 1 2 Cloudera Snapshot Founded 2008, by former employees of Employees
A Performance Analysis of Distributed Indexing using Terrier
A Performance Analysis of Distributed Indexing using Terrier Amaury Couste Jakub Kozłowski William Martin Indexing Indexing Used by search
Open source Google-style large scale data analysis with Hadoop
Open source Google-style large scale data analysis with Hadoop Ioannis Konstantinou Email: [email protected] Web: http://www.cslab.ntua.gr/~ikons Computing Systems Laboratory School of Electrical
Infomatics. Big-Data and Hadoop Developer Training with Oracle WDP
Big-Data and Hadoop Developer Training with Oracle WDP What is this course about? Big Data is a collection of large and complex data sets that cannot be processed using regular database management tools
INTRODUCTION TO CASSANDRA
INTRODUCTION TO CASSANDRA This ebook provides a high level overview of Cassandra and describes some of its key strengths and applications. WHAT IS CASSANDRA? Apache Cassandra is a high performance, open
An Approach to Implement Map Reduce with NoSQL Databases
www.ijecs.in International Journal Of Engineering And Computer Science ISSN: 2319-7242 Volume 4 Issue 8 Aug 2015, Page No. 13635-13639 An Approach to Implement Map Reduce with NoSQL Databases Ashutosh
Introduction to Big data. Why Big data? Case Studies. Introduction to Hadoop. Understanding Features of Hadoop. Hadoop Architecture.
Big Data Hadoop Administration and Developer Course This course is designed to understand and implement the concepts of Big data and Hadoop. This will cover right from setting up Hadoop environment in
Talend Big Data. Delivering instant value from all your data. Talend 2014 1
Talend Big Data Delivering instant value from all your data Talend 2014 1 I may say that this is the greatest factor: the way in which the expedition is equipped. Roald Amundsen race to the south pole,
Building Scalable Big Data Infrastructure Using Open Source Software. Sam William sampd@stumbleupon.
Building Scalable Big Data Infrastructure Using Open Source Software Sam William sampd@stumbleupon. What is StumbleUpon? Help users find content they did not expect to find The best way to discover new
Hadoop: Embracing future hardware
Hadoop: Embracing future hardware Suresh Srinivas @suresh_m_s Page 1 About Me Architect & Founder at Hortonworks Long time Apache Hadoop committer and PMC member Designed and developed many key Hadoop
Architectures for massive data management
Architectures for massive data management Apache Kafka, Samza, Storm Albert Bifet [email protected] October 20, 2015 Stream Engine Motivation Digital Universe EMC Digital Universe with
Internals of Hadoop Application Framework and Distributed File System
International Journal of Scientific and Research Publications, Volume 5, Issue 7, July 2015 1 Internals of Hadoop Application Framework and Distributed File System Saminath.V, Sangeetha.M.S Abstract- Hadoop
Big Data. White Paper. Big Data Executive Overview WP-BD-10312014-01. Jafar Shunnar & Dan Raver. Page 1 Last Updated 11-10-2014
White Paper Big Data Executive Overview WP-BD-10312014-01 By Jafar Shunnar & Dan Raver Page 1 Last Updated 11-10-2014 Table of Contents Section 01 Big Data Facts Page 3-4 Section 02 What is Big Data? Page
Big Data and Hadoop for the Executive A Reference Guide
Big Data and Hadoop for the Executive A Reference Guide Overview The amount of information being collected by companies today is incredible. Wal-Mart has 460 terabytes of data, which, according to the
Using distributed technologies to analyze Big Data
Using distributed technologies to analyze Big Data Abhijit Sharma Innovation Lab BMC Software 1 Data Explosion in Data Center Performance / Time Series Data Incoming data rates ~Millions of data points/
Cloudera Enterprise Reference Architecture for Google Cloud Platform Deployments
Cloudera Enterprise Reference Architecture for Google Cloud Platform Deployments Important Notice 2010-2015 Cloudera, Inc. All rights reserved. Cloudera, the Cloudera logo, Cloudera Impala, Impala, and
HADOOP SOLUTION USING EMC ISILON AND CLOUDERA ENTERPRISE Efficient, Flexible In-Place Hadoop Analytics
HADOOP SOLUTION USING EMC ISILON AND CLOUDERA ENTERPRISE Efficient, Flexible In-Place Hadoop Analytics ESSENTIALS EMC ISILON Use the industry's first and only scale-out NAS solution with native Hadoop
