Building Scalable Big Data Pipelines

Similar documents
Bringing Big Data to People

Lambda Architecture. Near Real-Time Big Data Analytics Using Hadoop. January Website:

The Future of Data Management with Hadoop and the Enterprise Data Hub

Big Data and Fast Data combined is it possible?

Extending the Enterprise Data Warehouse with Hadoop Robert Lancaster. Nov 7, 2012

Hadoop Ecosystem Overview. CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook

HDP Hadoop From concept to deployment.

A Tour of the Zoo the Hadoop Ecosystem Prafulla Wani

Chukwa, Hadoop subproject, 37, 131 Cloud enabled big data, 4 Codd s 12 rules, 1 Column-oriented databases, 18, 52 Compression pattern, 83 84

Data Analyst Program- 0 to 100

GAIN BETTER INSIGHT FROM BIG DATA USING JBOSS DATA VIRTUALIZATION

Apache Hadoop: Past, Present, and Future

Introduction to Big data. Why Big data? Case Studies. Introduction to Hadoop. Understanding Features of Hadoop. Hadoop Architecture.

ITG Software Engineering

Modernizing Your Data Warehouse for Hadoop

Datenverwaltung im Wandel - Building an Enterprise Data Hub with

The Future of Data Management

Collaborative Big Data Analytics. Copyright 2012 EMC Corporation. All rights reserved.

Roadmap Talend : découvrez les futures fonctionnalités de Talend

How to Hadoop Without the Worry: Protecting Big Data at Scale

The Top 10 7 Hadoop Patterns and Anti-patterns. Alex

Session 0202: Big Data in action with SAP HANA and Hadoop Platforms Prasad Illapani Product Management & Strategy (SAP HANA & Big Data) SAP Labs LLC,

BIG DATA ANALYTICS REFERENCE ARCHITECTURES AND CASE STUDIES

HDP Enabling the Modern Data Architecture

Big Data Analytics. Copyright 2011 EMC Corporation. All rights reserved.

Forecast of Big Data Trends. Assoc. Prof. Dr. Thanachart Numnonda Executive Director IMC Institute 3 September 2014

Comprehensive Analytics on the Hortonworks Data Platform

Self-service BI for big data applications using Apache Drill

IBM Big Data Platform

Tapping Into Hadoop and NoSQL Data Sources with MicroStrategy. Presented by: Jeffrey Zhang and Trishla Maru

SOLVING REAL AND BIG (DATA) PROBLEMS USING HADOOP. Eva Andreasson Cloudera

Big Data, Why All the Buzz? (Abridged) Anita Luthra, February 20, 2014

SQL + NOSQL + NEWSQL + REALTIME FOR INVESTMENT BANKS

Thomas Baumann Swiss Mobiliar Bern, Switzerland

Large Scale/Big Data Federation & Virtualization: A Case Study

FAQs. This material is built based on. Lambda Architecture. Scaling with a queue. 8/27/2015 Sangmi Pallickara

Big Data Analytics Nokia

brief contents PART 1 BACKGROUND AND FUNDAMENTALS...1 PART 2 PART 3 BIG DATA PATTERNS PART 4 BEYOND MAPREDUCE...385

Information Builders Mission & Value Proposition

Dominik Wagenknecht Accenture

Big Data: What You Should Know. Mark Child Research Manager - Software IDC CEMA

Constructing a Data Lake: Hadoop and Oracle Database United!

Talend Big Data. Delivering instant value from all your data. Talend

The Digital Enterprise Demands a Modern Integration Approach. Nada daveiga, Sr. Dir. of Technical Sales Tony LaVasseur, Territory Leader

Hortonworks and ODP: Realizing the Future of Big Data, Now Manila, May 13, 2015

Workshop on Hadoop with Big Data

Hadoop: Distributed Data Processing. Amr Awadallah Founder/CTO, Cloudera, Inc. ACM Data Mining SIG Thursday, January 25 th, 2010

Beyond Lambda - how to get from logical to physical. Artur Borycki, Director International Technology & Innovations

TRAINING PROGRAM ON BIGDATA/HADOOP

The Big Data Ecosystem at LinkedIn. Presented by Zhongfang Zhuang

Has been into training Big Data Hadoop and MongoDB from more than a year now

IOT & Big Data: The Future Information Processing Architecture

Hadoop Evolution In Organizations. Mark Vervuurt Cluster Data Science & Analytics

A Scalable Data Transformation Framework using the Hadoop Ecosystem

Big Data and Industrial Internet

ITG Software Engineering

Aligning Your Strategic Initiatives with a Realistic Big Data Analytics Roadmap

A framework for easy development of Big Data applications

Testing Big data is one of the biggest

Self-service BI for big data applications using Apache Drill

BIG DATA TECHNOLOGY. Hadoop Ecosystem

From Relational to Hadoop Part 1: Introduction to Hadoop. Gwen Shapira, Cloudera and Danil Zburivsky, Pythian

Hadoop. for Oracle database professionals. Alex Gorbachev Calgary, AB September 2013

#mstrworld. Tapping into Hadoop and NoSQL Data Sources in MicroStrategy. Presented by: Trishla Maru. #mstrworld

Cloudera Enterprise Data Hub in Telecom:

Big Data. Marriage of RDBMS-DWH and Hadoop & Co. Author: Jan Ott Trivadis AG Trivadis. Big Data - Marriage of RDBMS-DWH and Hadoop & Co.

White Paper: What You Need To Know About Hadoop

Why Spark on Hadoop Matters

Big Data Course Highlights

Apache Hadoop: The Pla/orm for Big Data. Amr Awadallah CTO, Founder, Cloudera, Inc.

Capitalize on Big Data for Competitive Advantage with Bedrock TM, an integrated Management Platform for Hadoop Data Lakes

Big Data and New Paradigms in Information Management. Vladimir Videnovic Institute for Information Management

Lambda Architecture for Batch and Real- Time Processing on AWS with Spark Streaming and Spark SQL. May 2015

Next presentation starting soon Business Analytics using Big Data to gain competitive advantage

Real Time Big Data Processing

Practical Hadoop by Example

Infomatics. Big-Data and Hadoop Developer Training with Oracle WDP

Tap into Hadoop and Other No SQL Sources

How to Enhance Traditional BI Architecture to Leverage Big Data

Native Connectivity to Big Data Sources in MSTR 10

Upcoming Announcements

Big Data and Advanced Analytics Applications and Capabilities Steven Hagan, Vice President, Server Technologies

Please give me your feedback

Certified Big Data and Apache Hadoop Developer VS-1221

How To Use Big Data For Business

Big Data, Cloud Computing, Spatial Databases Steven Hagan Vice President Server Technologies

Are You Big Data Ready?

Luncheon Webinar Series May 13, 2013

HADOOP ADMINISTATION AND DEVELOPMENT TRAINING CURRICULUM

Hadoop and Data Warehouse Friends, Enemies or Profiteers? What about Real Time?

Big Data and Data Science. The globally recognised training program

SQL Server 2012 PDW. Ryan Simpson Technical Solution Professional PDW Microsoft. Microsoft SQL Server 2012 Parallel Data Warehouse

Big Data Management and Security

Analytics on Spark &

Transcription:

Building Scalable Big Data Pipelines NOSQL SEARCH ROADSHOW ZURICH Christian Gügi, Solution Architect 19.09.2013

AGENDA Opportunities & Challenges Integrating Hadoop Lambda Architecture Lambda in Practice Recommendations

ABOUT ME Solution Architect @ YMC Founder and organizer Swiss Big Data User Group http://www.bigdata-usergroup.ch/ Contact christian.guegi@ymc.ch http://about.me/cguegi @chrisgugi

ABOUT YMC Founded in 2001 Based in Kreuzlingen, Switzerland Big Data Analytics, Web Solutions and Mobile Applications 24 experts Consulting, creation, engineering

OPPORTUNITIES &

BIG DATA WHAT IS THE BIG DEAL? A. New sources and types from inside & outside organisations Internet of things, sensors, RFID, intelligent devices, etc. Unstructured information documents, web logs, email, social media, etc. Trusted 3 rd party sources industry provider & aggregators, governments Open Data, weather, etc. B. Technology innovations to exploit new world of data Low cost storage and process power (cloud, on-premise & hybrid) New software patterns to handle speed & volume, structured and unstructured (In-memory computation, Hadoop, Mapreduce, etc.) Revolution in user experience, analytics, recommendations

BIG DATA CHALLENGES Volume Velocity Variety Veracity Character of data Overwhelming landscape & integration Organisational issues Available talent Align business strategy Data Management Privacy protection Lack of skilled and experienced people

INTEGRATING

TYPICAL RDBMS SZENARIO Apps BI Web Mobile Data Systems DWH ETL RDBMS Data Sources RDBMS NFS Others

BIG DATA SZENARIO Apps BI Web Mobile 1) Recommendations, etc. Data Systems DWH RDBMS 1) Hadoop Data Sources RDBMS NFS Logs Social Media Sensors

HADOOP ECOSYSTEM

LAMBDA

LAMBDA ARCHITECTURE Credits Nathan Marz Former Engineer at Twitter Storm, Cascalog, ElephantDB http://www.manning.com/marz/

DESIGN PRINCIPLES Lambda Architecture Human fault-tolerance Data immutability Re-computation

HUMAN FAULT-TOLERANCE Lambda Architecture Design for human error Bugs in code Accidental data loss Data corruption Protect good data, so you can always fix what went wrong

DATA IMMUTABILIY Lambda Architecture Store data in it s rawest form Create and read but no update No data can be lost To fix the system just delete bad data Can always revert to a true state

DATA IMMUTABILIY Lambda Architecture Capturing change traditionally (mutability) Name Location Name Location Alice Zurich Alice Basel Bob Lucerne Bob Lucerne Tom Bern Tom Bern Capturing change (immutability) Name Location Time Alice Zurich 2009/03/29 Bob Lucerne 2012/04/12 Tom Bern 2010/04/09 Name Location Time Alice Zurich 2009/03/29 Bob Lucerne 2012/04/12 Tom Bern 2010/04/09 Alice Basel 2013/08/20

RE-COMPUTATION Lambda Architecture Always able to re-compute from historical data Basis for all data systems query = function(all data) All Data Pre-computed views Query

LAYERS Lambda Architecture http://www.ymc.ch/en/lambda-architecture-part-1

Lambda in Practice

ONLINE MARKETING Tracking and analytics solution Improve customer targeting and segmentation Various reports Real-time not required

OVERVIEW AdServer Flume log HDFS HDFS Web Campaign Database Sqoop csv HBase Hive Impala Pig Up- & Download FTP fs -put csv Aggregated Data DWH Oozie ZooKeeper Cloudera Manager BI apps

DATA PIPELINE AdServer Flume log HDFS M/R HDFS Avro Tracking Campaign Database Sqoop csv M/R Avro Bulk Importer Profiles FTP fs -put csv M/R Avro DWH Extracting Transformation Loading

ADVANTAGES Extensible easily add speed layer later on Complements existing DWH/BI system ETL phases are decoupled Reliable Infrastructure Each step can be replayed Scalable Storage Processing Highly available Ad-hoc analysis right from the beginning

RECOMMENDATIONS

RECOMMENDATIONS Not a fixed, one-size-fits-all approach Adopt to your needs/requirements Hadoop complements existing systems How real-time do I need to be? Immutability and pre-computation are just good ideas! Store information in rawest format possible Use a serialization framework (Avro, Thrift, Protocol Buffers)

THANK YOU!

CONTACT US christian.guegi@ymc.ch Tel. +41 (0)71 508 24 76 www.ymc.ch @chrisgugi YMC AG Sonnenstrasse 4 CH-8280 Kreuzlingen Switzerland Photo Credits: Slide 05: Success opportunity achieve by Stephen McCulloch Slide 08: Matrix by Gamaliel Espinoza Macedo. Slide 12: Layers by Katelyn Leblanc Slide 20: Mining For Information by JD Hancock Slide 27: Warning Question by longzijun