Big Data is Dead, Long Live Business Intelligence?

Michael Muckel, Head of Data Platform
Markus Schmidberger, Data Platform Architect
Berlin, April 12th, 2016
© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.

Glomex: A ProSiebenSat.1 company Page 2

Glomex: The Global Media Exchange
Glomex Video Value Platform: Media Delivery Platform, Media Exchange Platform
Publishers: non-P7S1 publishers
Content providers: external broadcasters, web-only content owners
Page 3

Glomex Data Platform
Video Value Platform: Media Delivery Platform, Media Exchange Platform, Data Platform
Data Platform: Real-Time Monitoring, Batch Analytics, Machine Learning
Page 4

Key Components of our New Data Platform
Real-Time Monitoring: Enable our development teams to serve our content to our users in the best quality possible.
Analytics: Provide our teams access to the data to enable data-driven development of new features and products.
Content Discovery: Find the most relevant content for our customers and their users.
Page 5

Lambda Architecture
AWS Lambda
Graphic provided by http://lambda-architecture.net
Page 6

Simplify Data Processing
Stages: data, ingest / collect, store, process / analyze, visualize / serve, answers
Trade-offs: Time to Answer (Latency), Throughput, Cost (more concrete numbers at the end)
Page 7

Data Processing in Big Data World
Stages: Collect, Store, ETL, Analyze, Consume
Collect: IoT, Logging (Logstash), Applications (iOS, Android, Web Apps, Mobile Apps)
Data types: Transactional Data, Search Data, File Data, Stream Data
Store: Search, SQL, NoSQL, Cache, File Storage, Stream Storage
Store services: Amazon ElastiCache, Amazon DynamoDB, Amazon RDS, Amazon ES, Amazon S3, Amazon Glacier, Apache Kafka, Amazon Kinesis
Data temperature scale: Hot, Warm, Cold
Analyze: ML, Stream Processing, Batch, Interactive
Analyze services: Amazon ML, Amazon Redshift, Impala, Pig, Streaming, Amazon Kinesis, AWS Lambda, Amazon Elastic MapReduce
Processing speed scale: Fast, Slow
Consume: Analysis & Visualization, Notebooks, IDE, Predictions, Amazon QuickSight, Apps & APIs
Page 8

Our Data Platform Architecture
[Diagram: Data Platform - Micro Layout]
Stages: INGEST, STORE, PROCESS & ANALYSE, VISUALIZE & SERVE
Labels in the diagram: Content API, Content, Content Discovery, other modules, CDN files, CDN Log, AdProxy Log, VAS Log, Player Feedback, External Data, data streams, Data Management, Metadata, Data Lake, Data Layer, Data Quality, KPI & Analytics, Data Science Analytics, Technical Monitoring, Real-Time Dashboards, Data API, Portal, Data Science UI, Data Platform Monitoring, Data Platform Access, Dev / Ops, Analytics, Team
Page 9

Real-Time Player Monitoring
[Diagram: the Data Platform micro layout from Page 9, repeated]
Page 10

Monitoring Video-Streaming Experience
Focus on metrics from the user's perspective
From server uptime to (anonymized) real-user monitoring
Page 11

1. Analyze
2. Take Actions
3. Automate
Page 12

Our Ingest Process Page 13

Kinesis Firehose is doing its job
Next session: Streaming Data: The Opportunity and How to Work With It
Page 14

Data Facts
20 GB: click-stream data per day in Kinesis Firehose
5 billion: records processed per day
~100 ms: data freshness to S3
Page 15
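The slides name Kinesis Firehose for ingest but show no producer code, and the architecture slide lists an instance running Kinesis Agent rather than a custom producer. For an application that writes to Firehose directly, a minimal sketch might batch records before each call, as below. The helper names (`encode_event`, `batch_records`, `send`) and the newline-delimited JSON encoding are assumptions for illustration; the 500-record and 4 MiB limits are the documented `PutRecordBatch` request limits, and `client` is assumed to be a boto3 Firehose client.

```python
import json
from typing import Iterable, Iterator, List

# Documented PutRecordBatch limits: at most 500 records and 4 MiB per call.
MAX_RECORDS_PER_BATCH = 500
MAX_BYTES_PER_BATCH = 4 * 1024 * 1024


def encode_event(event: dict) -> bytes:
    """Serialize one event as a line of JSON (an assumed encoding)."""
    return (json.dumps(event, separators=(",", ":")) + "\n").encode("utf-8")


def batch_records(payloads: Iterable[bytes]) -> Iterator[List[dict]]:
    """Group encoded payloads into batches that respect both limits."""
    batch: List[dict] = []
    size = 0
    for data in payloads:
        full = len(batch) == MAX_RECORDS_PER_BATCH
        too_big = size + len(data) > MAX_BYTES_PER_BATCH
        if batch and (full or too_big):
            yield batch
            batch, size = [], 0
        batch.append({"Data": data})
        size += len(data)
    if batch:
        yield batch


def send(events: Iterable[dict], stream_name: str, client) -> int:
    """Send events through a boto3 Firehose client.

    Returns the number of records the service reported as failed.
    Retrying failed records is left out of this sketch.
    """
    failed = 0
    for batch in batch_records(encode_event(e) for e in events):
        response = client.put_record_batch(
            DeliveryStreamName=stream_name, Records=batch)
        failed += response["FailedPutCount"]
    return failed
```

The batching logic is kept separate from the network call so that it can be exercised without AWS credentials. The sketch does not enforce the per-record size limit.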

Elasticsearch + Grafana for real-time analyses
Not AWS managed!
Page 16

Elasticsearch on Spot Instances Page 17

CDN Batch Processing
[Diagram: the Data Platform micro layout from Page 9, repeated]
Page 18

Processing CDN Logs
25 GB: per day, as zipped log files
300 million: records processed per day
Plus the normal challenges with external data sources: out-of-order delivery, data quality issues, varying file sizes, etc.
Page 19

Requirements for our Data Processing Pipeline
Monitor the complete pipeline
Enable reprocessing of historical datasets
Be ready to scale
Page 20

Our CDN Pipeline Page 21

AWS Lambda Limits
5 min: AWS Lambda timeout
512 MB: AWS Lambda temp disk
How to process an 800 MB gzipped log file?
How to split compressed gzip files?
Splitter using Amazon SQS and Amazon EC2 Spot Instances
Page 22
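A gzip stream cannot be cut at arbitrary byte offsets, because each part of the compressed data depends on what came before it. One way to split such a file is to decompress it sequentially and re-compress whole lines into smaller pieces. The sketch below shows only that core step with Python's standard `gzip` module. The function name `split_gzip` and the size threshold are assumptions, and the SQS queue and Spot Instance orchestration named on the slide are not shown.

```python
import gzip
import io
from typing import BinaryIO, Iterator


def split_gzip(source: BinaryIO, max_chunk_bytes: int) -> Iterator[bytes]:
    """Yield gzip-compressed chunks made of whole lines.

    Each chunk holds at most max_chunk_bytes of uncompressed data,
    unless a single line is longer than that.
    """
    buffer = io.BytesIO()
    writer = gzip.GzipFile(fileobj=buffer, mode="wb")
    written = 0
    with gzip.GzipFile(fileobj=source, mode="rb") as reader:
        for line in reader:  # decompresses incrementally, line by line
            if written and written + len(line) > max_chunk_bytes:
                # Close the current chunk and start a new one.
                writer.close()
                yield buffer.getvalue()
                buffer = io.BytesIO()
                writer = gzip.GzipFile(fileobj=buffer, mode="wb")
                written = 0
            writer.write(line)
            written += len(line)
    writer.close()
    if written:
        yield buffer.getvalue()
```

Because every chunk is an independent gzip file that ends on a line boundary, each one could be handed to a separate worker that fits within the time and disk limits listed above.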

Our Meta Data Store
AWS Big Data Blog: https://blogs.aws.amazon.com/bigdata/post/tx2yrx3y16cvqfz/building-and-Maintaining-an-Amazon-S3-Metadata-Index-without-Servers

Our Meta Data Store Page 24
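The slides do not show how the metadata store is filled. As a rough sketch, assuming S3 event notifications trigger an AWS Lambda function, the handler could extract one index entry per newly created object. The item field names and the `put_item` hook are hypothetical; the event fields (`Records`, `eventName`, `eventTime`, `s3.bucket.name`, `s3.object.key`, `s3.object.size`) follow the S3 event notification format.

```python
from urllib.parse import unquote_plus


def index_items(event: dict) -> list:
    """Turn an S3 event notification into metadata items for an index."""
    items = []
    for record in event.get("Records", []):
        # Only newly created objects are indexed in this sketch.
        if not record.get("eventName", "").startswith("ObjectCreated"):
            continue
        s3 = record["s3"]
        items.append({
            "bucket": s3["bucket"]["name"],
            # Object keys arrive URL-encoded in S3 notifications.
            "key": unquote_plus(s3["object"]["key"]),
            "size_bytes": s3["object"].get("size", 0),
            "created_at": record["eventTime"],
        })
    return items


def handler(event, context, put_item=print):
    """Lambda entry point.

    put_item is a hypothetical hook for writing to the real index
    (for example a database table); by default it just prints.
    """
    items = index_items(event)
    for item in items:
        put_item(item)
    return {"indexed": len(items)}
```

Keeping `index_items` free of side effects makes it easy to replay historical notifications, which fits the reprocessing requirement stated earlier in the deck.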

Be serverless and serve data Amazon Kinesis AWS Lambda AWS Lambda Amazon API Gateway Page 25
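A minimal sketch of the two Lambda roles this slide suggests, one function consuming Amazon Kinesis records and one answering Amazon API Gateway requests, might look as follows. It assumes JSON payloads on the stream and the Lambda proxy integration on the API side; the `video_id` path parameter and the `lookup` accessor are hypothetical stand-ins for the real data store.

```python
import base64
import json


def decode_kinesis_records(event: dict) -> list:
    """Decode the base64 payloads of a Kinesis-triggered Lambda event."""
    return [json.loads(base64.b64decode(r["kinesis"]["data"]))
            for r in event.get("Records", [])]


def api_response(payload, status: int = 200) -> dict:
    """Build a response in the shape the Lambda proxy integration expects."""
    return {
        "statusCode": status,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps(payload),
    }


def api_handler(event, context, lookup=lambda video_id: None):
    """Serve one record by id.

    lookup is a hypothetical accessor for the data store; the default
    finds nothing, so the handler answers 404 until one is supplied.
    """
    video_id = (event.get("pathParameters") or {}).get("video_id")
    result = lookup(video_id)
    if result is None:
        return api_response({"error": "not found"}, status=404)
    return api_response(result)
```

With the proxy integration the response body must be a string, which is why the payload is serialized with `json.dumps` rather than returned as a dictionary.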

CDN Batch Facts
2.3 min: average run-time of AWS Lambda
600 rec/sec: processing time
6: parallel AWS Lambda functions
1 $ / hour: cost for 25 GB/day CDN processing
[Charts: AWS Lambda duration, Redshift CPU]
Page 26

Data Science Environment
[Diagram: the Data Platform micro layout from Page 9, repeated]
Page 27

Data Science Environment Project Jupyter: http://jupyter.org/ Page 28

Data Science Environment - Architecture
Data Sources: Amazon Kinesis, Amazon Redshift, Amazon S3, Elasticsearch
Cluster Technology: Amazon EMR (in development)
Development: GitHub (in development)
Page 29

Our Lambda Architecture on AWS
[Diagram: Data Platform - Lambda Architecture]
Layers: Batch Layer, Speed Layer, Serving Layer
Sources: other player modules, CDN files, data stream, applications
Components shown: Instance with Kinesis Agent, Amazon Kinesis Firehose, AWS Lambda, S3, Amazon Elastic MapReduce + Spark, Amazon Redshift, Amazon API Gateway, Portal, EC2 with Elasticsearch, EC2 with Grafana, EC2 with Caravel, EC2 with Jupyter, Team
Page 30

Key Takeaways: Lambda Architecture
Enrich your traditional, batch-driven BI workflow with real-time analytics.
Use the Lambda Architecture as a guiding principle and adapt it to your needs.
Page 31
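To illustrate the principle, here is a toy sketch of how a Lambda Architecture answers a query by combining a batch view with a real-time view. The event fields (`video_id`, `ts`) and the play-count metric are assumptions chosen for the example and are not taken from the Glomex platform.

```python
from collections import Counter


def batch_view(events, cutoff):
    """Batch layer: recompute counts from all events up to the cutoff."""
    return Counter(e["video_id"] for e in events if e["ts"] <= cutoff)


def speed_view(events, cutoff):
    """Speed layer: count only events that arrived after the last batch run."""
    return Counter(e["video_id"] for e in events if e["ts"] > cutoff)


def serve(batch, speed, video_id):
    """Serving layer: answer a query by combining both views."""
    return batch[video_id] + speed[video_id]


events = [
    {"video_id": "a", "ts": 1},
    {"video_id": "a", "ts": 2},
    {"video_id": "b", "ts": 3},
    {"video_id": "a", "ts": 5},  # arrived after the last batch run
]
last_batch_run = 3
batch = batch_view(events, last_batch_run)
speed = speed_view(events, last_batch_run)
total_a = serve(batch, speed, "a")  # 2 from the batch view + 1 from the speed view
```

When the next batch run completes, the cutoff moves forward and the speed view can be discarded, so errors in the real-time path are eventually corrected by the batch recomputation.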

Key Takeaways
Focus on feature development and robust pipelines, not on infrastructure management.
AWS managed services provide a robust way to run complex big data infrastructures.
Follow best practices provided by AWS and the community.
Page 32

Key Takeaways
Provide an open data environment.
Trust the creativity of your engineering teams to find insights in your datasets.
Structure your data so that it can be accessed in processed and raw form.
Notebooks provide easy access to even large distributed datasets.
Page 33

Michael Muckel, Head of Data Platform
Markus Schmidberger, Data Platform Architect
We are hiring: Data Scientists, Data Engineers, Project Managers