Building a data analytics platform with Hadoop, Python and R


Agenda: Me, Sanoma, Past, Present, Future (18 November 2013)

/me - @skieft
Software architect for Sanoma, managing the data and search team, with a focus on the digital operation.
Work: centralized services, data platform, search
Like: work, water(sports), whiskey
Tinkering: Arduino, Raspberry Pi, soldering stuff

Sanoma: market leader in chosen businesses and markets
- One of the leading media and learning companies in Europe
- #1 media company in the Netherlands and Finland
- Among the top 2 educational players in all its 6 markets of operation
- Head office in Helsinki, Finland; focus on consumer media and learning
- 2012 financials: net sales EUR 2,376 million, EBIT EUR 231 million, personnel 10,381 (FTE)

Over 300 magazines


Past

History < 2008 2009 2010 2011 2012 2013

Self service Photo credits: misternaxal - http://www.flickr.com/photos/misternaxal/2888791930/

Self service levels:
- Personal (full self service): information is created by end users with little or no oversight; users are empowered to integrate different data sources and make their own calculations.
- Departmental (support with publishing dashboards and data loading): information has been created by end users and is worth sharing, but has not been validated.
- Corporate (full service and support on dashboards): information that has gone through a rigorous validation process can be disseminated as official data.
The spectrum runs from information workers to information consumers, from full agility to centralized development, and from Excel via B.I.S.S.S.S. to static reports.

History < 2008 2009 2010 2011 2012 2013

Glue: ETL (Extract, Transform, Load)


#fail

Learnings:
- Traditional ETL tools don't scale and are not effective for Big Data sources
- Big Data projects are not BI projects
- Doing full end-to-end integrations and dashboard development doesn't scale
- QlikView was not good enough as the front-end to the cluster
- Hadoop requires developers, not BI consultants

History < 2008 2009 2010 2011 2012 2013

Russell Jurney - Agile Data

New glue Photo credits: Sheng Hunglin - http://www.flickr.com/photos/shenghunglin/304959443/

ETL tool features: processing, scheduling, data quality, data lineage, versioning, annotating

Photo credits: atomicbartbeans - http://www.flickr.com/photos/atomicbartbeans/71575328/


Processing - Jython
- No JVM startup overhead for Hadoop API usage
- Relatively concise syntax (Python)
- Mix the Python standard library with any Java libs
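
As a sketch of what such a Jython job might look like: the Hadoop classes are the real Java API, but the paths and job shape are illustrative, and the Java import is guarded so that under plain CPython only the pure-Python transform runs.

```python
# Jython ETL sketch: Python syntax plus direct access to the Hadoop Java API.
import csv

try:
    # Available when running under Jython with hadoop-core on the classpath
    from org.apache.hadoop.fs import FileSystem, Path
    from org.apache.hadoop.conf import Configuration
    HAVE_HADOOP = True
except ImportError:
    HAVE_HADOOP = False

def clean_rows(raw_lines):
    """Pure-Python transform: parse CSV lines and drop empty rows."""
    reader = csv.reader(raw_lines)
    return [row for row in reader if any(field.strip() for field in row)]

def run(src="staging/clicks.csv", dst="/data/clicks/"):
    """Illustrative job body; src/dst are hypothetical HDFS paths."""
    if HAVE_HADOOP:
        fs = FileSystem.get(Configuration())
        stream = fs.open(Path(src))
        # ... read lines, transform with clean_rows(), write results to dst ...
```

The same script can mix `csv` and `datetime` from the Python standard library with any Java library on the classpath, which is the point of the slide.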

Scheduling - Jenkins
- Flexible scheduling with dependencies
- Saves output
- E-mails on errors
- Scales to multiple nodes
- REST API
- Status monitor
- Integrates with version control
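
Because Jenkins exposes its job controls over HTTP, a downstream ETL job can be chained or re-triggered from a script. A minimal sketch, assuming a hypothetical Jenkins host and job name; Jenkins queues a build on a POST to /job/<name>/build:

```python
# Sketch: triggering an ETL job through the Jenkins REST API.
import urllib.request

def build_trigger_url(base, job_name):
    """Build the URL of the Jenkins build-trigger endpoint."""
    return "{}/job/{}/build".format(base.rstrip("/"), job_name)

def trigger(base, job_name):
    """POST to the trigger endpoint; raises on HTTP errors."""
    url = build_trigger_url(base, job_name)
    req = urllib.request.Request(url, data=b"", method="POST")
    return urllib.request.urlopen(req)

# Example (not executed here; host and job name are hypothetical):
# trigger("http://jenkins.example.com", "etl-clicks-daily")
```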

ETL tool features:
- Processing: Bash & Jython
- Scheduling: Jenkins
- Data quality
- Data lineage
- Versioning: Mercurial (hg)
- Annotating: commenting the code

Processes Photo credits: Paul McGreevy - http://www.flickr.com/photos/48379763@n03/6271909867/

Independent jobs: Source (external) -> [HDFS upload + move in place] -> Staging (HDFS) -> [MapReduce + HDFS move] -> Hive-staging (HDFS) -> [Hive map external table + SELECT INTO] -> Hive
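
The "upload + move in place" step can be sketched with local filesystem calls standing in for their HDFS equivalents (paths are illustrative); the point is that files become visible in staging only through an atomic rename, so downstream jobs never see half-written data.

```python
# Sketch of the "move in place" pattern: write to a temporary name first,
# then atomically rename into the staging directory.
import os
import shutil
import tempfile

def move_in_place(src_file, staging_dir):
    """Copy to a temp name inside the target dir, then rename atomically."""
    os.makedirs(staging_dir, exist_ok=True)
    fd, tmp_path = tempfile.mkstemp(dir=staging_dir, suffix=".tmp")
    os.close(fd)
    shutil.copyfile(src_file, tmp_path)
    final_path = os.path.join(staging_dir, os.path.basename(src_file))
    os.replace(tmp_path, final_path)  # atomic on the same filesystem
    return final_path
```

On HDFS the equivalent is uploading to a temporary path and then issuing an HDFS rename into the staging directory.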

Typical data flow - Extract:
1. Jenkins pulls the job code from hg (Mercurial)
2. Get the data from S3, FTP or an API
3. Store the raw source data on HDFS

Typical data flow - Transform:
1. Jenkins pulls the job code from hg (Mercurial)
2. Execute M/R or a Hive query
3. Transform the data to the intermediate structure

Typical data flow - Load (models):
1. Jenkins pulls the job code from hg (Mercurial)
2. Execute a Hive query
3. Execute the model in R / Python
4. Move the results to HDFS or to the real-time serving system

Typical data flow - Load (reporting):
1. Jenkins pulls the job code from hg (Mercurial)
2. Execute a Hive query
3. Load the data into QlikView

Out-of-order jobs:
- At any point, you don't really know what made it to Hive
- It will happen anyway, because some days the data delivery is going to be three hours late
- Or you get half in the morning and the other half later in the day
- It really depends on what you do with the data
- This is where metrics + a fixable data store help...

Fixable data store:
- Uses Hive partitions: jobs that move data from staging create partitions
- When new data / new insight about the data arrives, drop the partition and reinsert
- Be careful to reset any metrics in this case
- Basically: instead of trying to make everything transactional, repair afterwards
- Use metrics to determine whether data is fit for purpose
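
The drop-and-reinsert repair can be sketched as a generator of HiveQL statements. The table and staging-table names are hypothetical, and nothing here talks to a live Hive; it only builds the statements a repair job would submit.

```python
# Sketch of the "drop the partition and reinsert" repair for one day.

def drop_partition_stmt(table, day):
    return "ALTER TABLE {} DROP IF EXISTS PARTITION (day='{}')".format(table, day)

def reinsert_stmt(table, staging_table, day, cols="*"):
    # NB: in real Hive the select list must not repeat the partition column
    return (
        "INSERT OVERWRITE TABLE {} PARTITION (day='{}') "
        "SELECT {} FROM {} WHERE day='{}'"
    ).format(table, day, cols, staging_table, day)

def repair_day(table, staging_table, day):
    """Repair one day: drop the partition, then re-insert it from staging."""
    return [drop_partition_stmt(table, day), reinsert_stmt(table, staging_table, day)]
```

Because each day lives in its own partition, re-running the repair is idempotent, which is what makes "repair afterwards" cheaper than full transactionality.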

Photo credits: DL76 - http://www.flickr.com/photos/dl76/3643359247/

HUE

Hadoop job scheduling:
- Schedulers spread the load on your cluster
- CapacityScheduler vs FairScheduler (the default since CDH 4.1)
- Use scheduling pools to separate workloads: ETL vs user-based vs consuming applications
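
Scheduling pools of this kind are configured, for the MR1-era FairScheduler, in an allocations file. A sketch with illustrative pool names and numbers, not Sanoma's actual settings:

```xml
<?xml version="1.0"?>
<!-- fair-scheduler.xml sketch: three pools separating ETL, ad-hoc users
     and consuming applications. All values are illustrative. -->
<allocations>
  <pool name="etl">
    <minMaps>20</minMaps>
    <minReduces>10</minReduces>
    <weight>2.0</weight>
  </pool>
  <pool name="users">
    <maxRunningJobs>10</maxRunningJobs>
  </pool>
  <pool name="apps">
    <minMaps>10</minMaps>
    <minReduces>5</minReduces>
  </pool>
</allocations>
```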

Quotas:
- Since users can break stuff, easily... and SQL skills get rusty:
  SELECT * FROM views LEFT JOIN clicks WHERE day = '2013-09-26'
- The LEFT JOIN has no ON clause, so this degenerates into a cross product
- The views table is 10TB and 36 x 10^9 rows; the clicks table is 100GB and 36 x 10^6 rows
- Or use strict mode in Hive (hive.mapred.mode=strict)
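
To see why the query above hurts, take the row counts from the slide: with no join condition, every views row pairs with every clicks row. A small sketch; the join key view_id is hypothetical:

```python
# Why a join without an ON clause is dangerous, using the slide's row counts.
views_rows = 36 * 10**9    # views: 10TB
clicks_rows = 36 * 10**6   # clicks: 100GB

# Worst case without a join condition: a full cross product
cross_product_rows = views_rows * clicks_rows

# The fixed query joins on an explicit (hypothetical) key and keeps the filter
fixed_query = (
    "SELECT * FROM views v "
    "LEFT JOIN clicks c ON v.view_id = c.view_id "
    "WHERE v.day = '2013-09-26'"
)

# Strict mode makes Hive reject cartesian joins and unfiltered
# scans of partitioned tables outright
strict_mode = "set hive.mapred.mode=strict;"
```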

QlikView

R Studio Server

Architecture

High level architecture:
- Sources (scheduled exports): AdServing (incl. mobile, video, CPC, yield, etc.), meta data, back office analytics, market, content, subscription
- Jenkins & Jython: ETL, scheduled loads
- Hadoop: Hive & HUE, ad-hoc queries
- QlikView: dashboards / reporting
- R Studio: advanced analytics

Sanoma Media The Netherlands infra:
- Colocation (Big Data platform; dev, test and acceptance VMs): no SLA (yet), limited support (BD 9-5), no full system backups, managed by us with help from the systems department
- Managed hosting (production VMs in DC1 and DC2): 99.98% availability, 24/7 support, multi-DC, full system backups, high-performance EMC SAN storage, managed by a dedicated systems department

The same infra with the real-time flow overlaid:
- Colocation: the Big Data platform handles collecting, processing, storage and batch; the collecting tier can queue data for 3 days
- Managed hosting: production VMs in DC1 and DC2 serve the real-time recommendations, which can run on stale data for 3 days
- Support and SLA characteristics as above

Present

Current state - Use cases:
- A/B testing + deeper analyses
- Ad auction price optimization
- Recommendations
- Search optimizations

Current state - Usage:
- Main use case for reporting and analytics
- Sanoma standard data platform, used by other Sanoma countries too: Finland, Hungary, ...
- ~100 users (analysts & developers), 25 daily users
- 43 source systems, with 125 different sources
- 300 tables in Hive

Current state - Tech and team
Team: 1 product manager, 2 developers, 2 data scientists, 1 QlikView application manager, ½ architect
Platform: 30-50 nodes, >300TB storage, ~2000 jobs/day
Typical data node / task tracker: 2 system disks (RAID 1), 4 data disks (2TB, 3TB or 4TB), 24GB RAM

Current state - Real time:
- Extending our own collecting infrastructure
- Using this to drive recommendations, user segmentation and targeting in real time
- Moving from Flume to Kafka
- First production project with Storm

High level architecture with the real-time pipeline:
- Sources (scheduled exports): AdServing (incl. mobile, video, CPC, yield, etc.), meta data, back office analytics, market, content, subscription
- Jenkins & Jython: ETL, scheduled loads
- Hadoop: Hive & HUE, ad-hoc queries
- QlikView: dashboards / reporting
- R Studio: advanced analytics
- CLOE (real time): collecting with Flume, stream processing with Storm, recommendations & online learning, recommendations / learn to rank

Future

What's next:
- Cloudera Search (SolrCloud on Hadoop): index creation, external ranker, easier scaling and maintenance
- Moving some NLP (natural language processing) and image recognition workloads to Hadoop
- Optimizing job scheduling (fair scheduler pools)
- Automated query optimization tips for analysts
- Full roll-out of the R integration, with rmr2
- More support for: Cascading, Scalding, Pig, etc.

Thank you! Questions?