Big Data is Dead, Long Live Business Intelligence?
Michael Muckel, Head of Data Platform
Markus Schmidberger, Data Platform Architect
Berlin, April 12th, 2016
© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Glomex: A ProSiebenSat.1 company Page 2
Glomex: The Global Media Exchange
Glomex Video Value Platform
Publishers: Non-P7S1 publishers
Content providers: External broadcasters, Web-only content owners
Platforms: Media Delivery Platform, Media Exchange Platform
Page 3
Glomex Data Platform
Video Value Platform: Media Delivery Platform, Media Exchange Platform, Data Platform
Data Platform: Real-Time Monitoring, Batch Analytics, Machine Learning
Page 4
Key Components of our New Data Platform
Real-Time Monitoring: Enable our development teams to serve our content to our users in the best quality possible.
Analytics: Provide our teams access to the data to enable data-driven development of new features and products.
Content Discovery: Find the most relevant content for our customers and their users.
Page 5
Lambda Architecture (a data-processing pattern, not the same thing as the AWS Lambda service)
Graphic provided by http://lambda-architecture.net
Page 6
Simplify Data Processing
Pipeline: data -> ingest / collect -> store -> process / analyze -> visualize / serve -> answers
Dimensions: Time to Answer (Latency), Throughput, Cost (more concrete numbers at the end)
Page 7
Data Processing in Big Data World
Stages: Collect -> Store -> ETL -> Analyze -> Consume
Collect: IoT, Logging (Logstash), Applications (Web Apps, Mobile Apps: iOS, Android); data types: Transactional Data, Search Data, File Data, Stream Data
Store: Search (Amazon ES), SQL (Amazon RDS), NoSQL (Amazon DynamoDB), Cache (Amazon ElastiCache), File Storage (Amazon S3, Amazon Glacier), Stream Storage (Apache Kafka, Amazon Kinesis, Amazon DynamoDB); temperature labels: Hot, Warm, Cold
Analyze: ML (Amazon ML), Stream Processing (Streaming, Amazon Kinesis, AWS Lambda), Batch (Amazon Elastic MapReduce, Pig), Interactive (Amazon Redshift, Impala); speed labels: Fast, Slow
Consume: Analysis & Visualization (Amazon QuickSight), Notebooks, IDE, Predictions, Apps & APIs
Page 8
Our Data Platform Architecture
[Diagram: Data Platform - Micro Layout]
Stages: INGEST, STORE, PROCESS & ANALYSE, VISUALIZE & SERVE
Labels: Content API, Content, Content Discovery, other modules, CDN files, CDN Log, Data Management, Metadata, KPI & Analytics, Data API, Portal, data stream, AdProxy Log, VAS Log, Data Lake, Data Layer, Technical Monitoring, Real-Time Dashboards, Player Feedback, Data Quality, Dev / Ops, Analytics, Data Platform Access, Team, External Data, Data Science, Data Science UI, Data Platform Monitoring
Page 9
Real-Time Player Monitoring
[Diagram: Data Platform - Micro Layout, repeated from Page 9]
Page 10
Monitoring Video-Streaming Experience
Focus on Metrics from the User's Perspective
From Server-Uptime To (anonymized) Real-User Monitoring
Page 11
1. Analyze
2. Take Actions
3. Automate
Page 12
Our Ingest Process Page 13
Kinesis Firehose is doing its job
Next session: "Streaming Data: The Opportunity and How to Work With It"
Page 14
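The ingest step could be driven from application code roughly as follows. This is a minimal sketch, not the pipeline shown in the talk: the helper names, the stream name used in the usage note, and the JSON-lines encoding are assumptions. It relies only on the documented `PutRecordBatch` limits (up to 500 records and 4 MiB per call) and takes the client as a parameter, so a `boto3.client("firehose")` can be passed in.

```python
import json
from typing import Any, Dict, Iterable, Iterator, List

MAX_RECORDS_PER_BATCH = 500              # documented PutRecordBatch limit
MAX_BYTES_PER_BATCH = 4 * 1024 * 1024    # documented PutRecordBatch limit (4 MiB)


def encode_event(event: Dict[str, Any]) -> bytes:
    """Serialize one event as a newline-terminated JSON line (assumed format)."""
    return (json.dumps(event, separators=(",", ":")) + "\n").encode("utf-8")


def batch_records(events: Iterable[Dict[str, Any]]) -> Iterator[List[Dict[str, bytes]]]:
    """Group events into batches that respect the count and size limits."""
    batch: List[Dict[str, bytes]] = []
    batch_bytes = 0
    for event in events:
        data = encode_event(event)
        too_many = len(batch) >= MAX_RECORDS_PER_BATCH
        too_big = batch_bytes + len(data) > MAX_BYTES_PER_BATCH
        if batch and (too_many or too_big):
            yield batch
            batch, batch_bytes = [], 0
        batch.append({"Data": data})
        batch_bytes += len(data)
    if batch:
        yield batch


def send_events(client: Any, stream_name: str, events: Iterable[Dict[str, Any]]) -> int:
    """Send events with put_record_batch; return the number of failed records."""
    failed = 0
    for batch in batch_records(events):
        response = client.put_record_batch(
            DeliveryStreamName=stream_name, Records=batch
        )
        failed += response.get("FailedPutCount", 0)
    return failed
```

A call such as `send_events(boto3.client("firehose"), "player-events", events)` would use it, where `player-events` is a hypothetical delivery stream name. A production version would also retry the records reported as failed.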
Data Facts
20 GB: click-stream data per day in Kinesis Firehose
5 Billion: records processed per day
~100 ms: data freshness to S3
Page 15
Elasticsearch + Grafana for real-time analyses
Not AWS managed!
Page 16
Elasticsearch on Spot Instances
Page 17
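Loading events into a self-managed Elasticsearch cluster is usually done through the `_bulk` endpoint, which expects newline-delimited JSON: one action line followed by one document line. The sketch below only builds that request body; the index name and document fields are hypothetical, and older Elasticsearch versions additionally expect a document type in the action line.

```python
import json
from typing import Any, Dict, Iterable


def build_bulk_body(index: str, docs: Iterable[Dict[str, Any]]) -> str:
    """Build a newline-delimited JSON body for the Elasticsearch _bulk endpoint."""
    lines = []
    for doc in docs:
        # Action line: index the following document into `index`.
        lines.append(json.dumps({"index": {"_index": index}}))
        # Source line: the document itself.
        lines.append(json.dumps(doc, sort_keys=True))
    # The bulk format requires a newline after the last line.
    return ("\n".join(lines) + "\n") if lines else ""
```

The resulting string would be sent with the `application/x-ndjson` content type.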
CDN Batch Processing
[Diagram: Data Platform - Micro Layout, repeated from Page 9]
Page 18
Processing CDN Logs
25 GB: per day as zipped log files
300 Million: records processed per day
+ Normal challenges with external data sources: out-of-order delivery / data quality issues / varying file sizes / etc.
Page 19
Requirements for our Data Processing Pipeline
Monitor Complete Pipeline
Enable Reprocessing of Historical Datasets
Be Ready to Scale
Page 20
Our CDN Pipeline Page 21
AWS Lambda Limits
5 min: AWS Lambda timeout
512 MB: AWS Lambda temp disk
How to process an 800 MB gzipped log file? How to split compressed gzip files?
Splitter using Amazon SQS and Amazon EC2 Spot Instances
Page 22
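The slide does not describe the splitter in detail. A minimal sketch of the core idea, assuming line-oriented log files, is to stream the gzip archive and re-compress it into smaller parts that each fit within the Lambda limits. The function name, part naming scheme, and `lines_per_part` parameter are hypothetical, and the SQS and Spot Instance orchestration is left out.

```python
import gzip
import os
from typing import List


def split_gzip_by_lines(src_path: str, out_dir: str, lines_per_part: int) -> List[str]:
    """Stream a gzipped text file and re-compress it into smaller gzip parts.

    Only one line is held in memory at a time, so the input never has to
    be fully decompressed in memory or on disk.
    """
    if lines_per_part <= 0:
        raise ValueError("lines_per_part must be positive")
    os.makedirs(out_dir, exist_ok=True)
    part_paths: List[str] = []
    out = None
    count = 0
    try:
        with gzip.open(src_path, "rb") as src:
            for line in src:
                # Start a new part for the first line and whenever one is full.
                if out is None or count >= lines_per_part:
                    if out is not None:
                        out.close()
                    path = os.path.join(out_dir, f"part-{len(part_paths):05d}.log.gz")
                    out = gzip.open(path, "wb")
                    part_paths.append(path)
                    count = 0
                out.write(line)
                count += 1
    finally:
        if out is not None:
            out.close()
    return part_paths
```

Splitting on line boundaries keeps every record intact, so each part can be processed independently and in parallel.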
Our Meta Data Store
AWS Big Data Blog: https://blogs.aws.amazon.com/bigdata/post/tx2yrx3y16cvqfz/building-and-Maintaining-an-Amazon-S3-Metadata-Index-without-Servers
Our Meta Data Store Page 24
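The referenced blog post is about keeping an index of S3 object metadata up to date with event-driven functions. As a rough illustration only, the sketch below turns an S3 event notification into index entries. The input layout follows the S3 event notification structure; the output field names are hypothetical, and writing the entries to a store such as DynamoDB is left out.

```python
from typing import Any, Dict, List
from urllib.parse import unquote_plus


def index_items_from_s3_event(event: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Extract one metadata entry per object from an S3 event notification."""
    items = []
    for record in event.get("Records", []):
        s3 = record["s3"]
        items.append({
            "bucket": s3["bucket"]["name"],
            # Object keys arrive URL-encoded in S3 event notifications.
            "key": unquote_plus(s3["object"]["key"]),
            "size_bytes": s3["object"].get("size", 0),
            "event_time": record.get("eventTime"),
            "event_name": record.get("eventName"),
        })
    return items
```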
Be serverless and serve data
[Diagram: Amazon Kinesis, AWS Lambda, AWS Lambda, Amazon API Gateway]
Page 25
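One way to read this slide is as two functions: one consuming a Kinesis stream and one answering API requests. The sketch below illustrates that reading under several assumptions: the `video_id` field and the view-count metric are hypothetical, the in-memory counter stands in for a real shared store, and the response uses the shape of API Gateway's Lambda proxy integration. Kinesis delivers record payloads base64-encoded under `Records[].kinesis.data`.

```python
import base64
import json
from collections import Counter
from typing import Any, Dict

# Stand-in for a shared store; separate Lambda functions would not
# share process memory in a real deployment.
_VIEW_COUNTS: Counter = Counter()

_JSON_HEADERS = {"Content-Type": "application/json"}


def handle_kinesis(event: Dict[str, Any], context: Any = None) -> int:
    """Consume a Kinesis batch: decode each record and count views per video."""
    processed = 0
    for record in event.get("Records", []):
        payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
        _VIEW_COUNTS[payload["video_id"]] += 1  # hypothetical field name
        processed += 1
    return processed


def handle_api(event: Dict[str, Any], context: Any = None) -> Dict[str, Any]:
    """Serve a view count in the proxy-integration response shape."""
    params = event.get("pathParameters") or {}
    video_id = params.get("video_id")
    if video_id is None:
        body = {"error": "video_id required"}
        return {"statusCode": 400, "headers": _JSON_HEADERS, "body": json.dumps(body)}
    body = {"video_id": video_id, "views": _VIEW_COUNTS.get(video_id, 0)}
    return {"statusCode": 200, "headers": _JSON_HEADERS, "body": json.dumps(body)}
```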
CDN Batch Facts
2.3 min: average run-time of AWS Lambda
600 rec/sec: processing time
6: parallel AWS Lambda functions
1 $ / hour: cost for 25 GB/day CDN processing
[Charts: AWS Lambda duration, Redshift CPU]
Page 26
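A quick consistency check on these figures, using the 300 million records per day from the CDN log slide. The slide does not say whether 600 rec/sec is per function or in total; the sketch assumes it is per function and that all 6 functions run in parallel.

```python
SECONDS_PER_DAY = 24 * 60 * 60

records_per_day = 300_000_000   # from the CDN log slide
rate_per_function = 600         # rec/sec; assumed to be per Lambda function
parallel_functions = 6          # from this slide

# Average rate needed to keep up with one day of logs.
required_rate = records_per_day / SECONDS_PER_DAY
# Aggregate rate if all functions run in parallel (assumption).
available_rate = rate_per_function * parallel_functions

print(f"required:  {required_rate:.0f} rec/sec")
print(f"available: {available_rate} rec/sec")
```

Under these assumptions the aggregate rate of 3,600 rec/sec sits slightly above the average required rate of about 3,472 rec/sec.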
Data Science Environment
[Diagram: Data Platform - Micro Layout, repeated from Page 9]
Page 27
Data Science Environment Project Jupyter: http://jupyter.org/ Page 28
Data Science Environment - Architecture
Data Sources: Amazon Kinesis, Amazon Redshift, Amazon S3, Elasticsearch
Cluster Technology: Amazon EMR (in development)
Development: GitHub (in development)
Page 29
Our Lambda Architecture on AWS
[Diagram: Data Platform - Lambda Architecture, with Batch Layer, Serving Layer, and Speed Layer]
Labels: other player modules, CDN files, Amazon Redshift, AWS Lambda, Amazon API Gateway, Portal, AWS Lambda, Amazon Elastic MapReduce + Spark, EC2 with Caravel, S3, EC2 with Jupyter, Team, data stream, Instance with Kinesis Agent, Amazon Kinesis Firehose, AWS Lambda, EC2 with Elasticsearch, EC2 with Grafana, Applications
Page 30
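In a Lambda Architecture, the serving layer answers a query by combining a precomputed batch view with a real-time view that covers the data the batch layer has not yet absorbed. A minimal sketch of that merge step, assuming an additive metric such as view counts and plain dictionaries as views (both are illustrative choices, not taken from the talk):

```python
from typing import Dict


def merge_views(batch_view: Dict[str, int], realtime_view: Dict[str, int]) -> Dict[str, int]:
    """Combine batch-layer and speed-layer counts into one queryable view."""
    merged = dict(batch_view)
    for key, count in realtime_view.items():
        # Counts are additive, so recent events simply top up the batch result.
        merged[key] = merged.get(key, 0) + count
    return merged
```

Non-additive metrics, such as unique users, need a different merge strategy than simple addition.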
Key Takeaways
Lambda Architecture
Enrich your traditional, batch-driven BI workflow with real-time analytics.
Use the Lambda Architecture as a guiding principle and adapt it to your needs.
Page 31
Key Takeaways
Focus on feature development and robust pipelines, not on infrastructure management
AWS managed services provide a robust way to run complex big data infrastructures.
Follow best practices provided by AWS and the community.
Page 32
Key Takeaways
Provide an open data environment
Trust the creativity of your engineering teams to find insights in your datasets.
Structure your data so that it can be accessed in processed and raw form.
Notebooks provide easy access to even large distributed datasets.
Page 33
Michael Muckel, Head of Data Platform
Markus Schmidberger, Data Platform Architect
We are hiring: Data Scientists, Data Engineers, Project Managers