Building a real-time, self-service data analytics ecosystem Greg Arnold, Sr. Director Engineering
Self Service at scale
Relational? MPP? Hadoop?
LinkedIn data
350M Members
25B Quarterly page views
3.5M Active company profiles
4.8B Endorsements
2M Jobs
Translate data into insights
Analytics Infrastructure → Business Insights, Member Insights
The Good Old Days
Data Flow @ 10,000 ft
Scale Challenges
1. Human intervention
2. Long latencies to obtain insights from data
3. Complexity of integration as data sources multiply
What does it take to build a self-service, real-time, democratic analytics platform?
Analytics Infra (top to bottom)
Self Serve Applications [reporting, lineage, perf tuning, etc.] (WhereHows, Dr. Elephant, …)
Core Data Warehouse [views, metrics, dimensions, datasets, core flows]
Data Management Systems [ingest, export, access, workflows] (Gobblin, …)
Storage and Compute (Hadoop, Pinot, Cubert)
Storage and Compute Platforms
Hadoop: HDFS and YARN, running Pig, Hive, Cubert, Scalding, Map-Reduce, Spark, and Tez
Pinot
Hadoop @ LinkedIn
x clusters (~x000 nodes); xx+ PB of data; xxx k jobs / week; xm compute hrs / month
Deployment spans ingest, ETL, R&D, and export clusters (PROD), alongside online data serving
Supporting > 1,000 Hadoop users
Development process: do { code, [review], deploy } while (!good);
Hadoop is complex: lots of knobs, and tuning helps
Performance symptoms are not easily identifiable: the evidence is scattered
Performance implications of changes are hard to assess
Dr. Elephant: diagnosis
What about real-time analytics?
Slow Queries
Solution
Avoid joins at query time when possible. Denormalize the data in Hadoop and load it into a fast engine for slice-n-dice.
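The denormalize-then-slice pattern can be sketched in a few lines. The tables, field names, and values below are hypothetical; in practice the pre-join runs as a batch job in Hadoop and the group-by runs in a serving engine like Pinot:

```python
from collections import Counter

# Hypothetical fact table (profile views) and dimension table (viewer attributes).
profile_views = [
    {"viewer_id": 1, "viewed_id": 9},
    {"viewer_id": 2, "viewed_id": 9},
    {"viewer_id": 1, "viewed_id": 7},
]
viewers = {
    1: {"region": "US", "industry": "Software"},
    2: {"region": "IN", "industry": "Finance"},
}

# Offline (batch) step: denormalize by folding dimension attributes into
# each fact row, so no join is needed at query time.
denormalized = [
    {**view, **viewers[view["viewer_id"]]} for view in profile_views
]

# Query-time slice-n-dice: "top regions for member 9's profile views"
# becomes a simple filter + group-by over one flat table.
top_regions = Counter(
    row["region"] for row in denormalized if row["viewed_id"] == 9
).most_common()
print(top_regions)  # → [('US', 1), ('IN', 1)]
```

The trade-off is storage for speed: each fact row carries redundant dimension data, but every query touches exactly one table.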
Real-time analytics: a challenge for Hadoop
Slice and dice billions of records across hundreds of dimensions
End-to-end freshness of minutes, not hours
Sub-second query response times
e.g., Which regions contribute most to my profile views? Which industries within those regions?
Pinot for real-time analytics
Distributed, fault-tolerant
Compressed columnar indexes
Data ingestion from Kafka and Hadoop
No joins, yet.
Who viewed my profile
Pinot: Data Flow
ProfileViewEvent → Kafka → Pinot (freshness in minutes), serving the member-facing Who Viewed My Profile and the internal Profile Analytics Dashboard
ProfileViewEvent → Hadoop → segment building → Pinot (freshness in hours / days)
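The two freshness paths above can be pictured with a toy sketch. The function names and segment handling are illustrative assumptions, not Pinot's actual design: recent events become queryable within minutes via Kafka, while Hadoop periodically rebuilds the older data as batch segments:

```python
# Illustrative model of the two ingestion paths.
REALTIME_SEGMENTS = []   # fed continuously from Kafka (minutes of latency)
OFFLINE_SEGMENTS = []    # built in batch on Hadoop (hours/days of latency)

def ingest_from_kafka(event):
    """Real-time path: the event is queryable almost immediately."""
    REALTIME_SEGMENTS.append(event)

def build_offline_segment(events, boundary_ts):
    """Batch path: data at or before the boundary timestamp moves to
    offline segments, and the real-time copy of it is dropped."""
    OFFLINE_SEGMENTS.extend(e for e in events if e["ts"] <= boundary_ts)
    REALTIME_SEGMENTS[:] = [e for e in REALTIME_SEGMENTS if e["ts"] > boundary_ts]

def query_all():
    """A query spans both offline and real-time data."""
    return OFFLINE_SEGMENTS + REALTIME_SEGMENTS

ingest_from_kafka({"ts": 1, "viewer": "a"})
ingest_from_kafka({"ts": 2, "viewer": "b"})
build_offline_segment([{"ts": 1, "viewer": "a"}], boundary_ts=1)
print(len(query_all()))  # → 2
```

The point of the split is that dashboards never wait for the batch pipeline: they read whatever mix of fresh and rebuilt segments exists at query time.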
Pig and Hive are great, but...
They operate on individual records
Scheduled batch ETL jobs are re-computed with full scans
We can do better by reorganizing the data and processing it in blocks
Cubert: Accelerating Batch Computation
[Bar chart: Pig/Hive vs. Cubert runtimes, 0–40 hours, for the XLNT (Statistical), SPI (Graph), and Plato (OLAP Cube) workloads]
Cubert Internals
Organizes data in blocks
Blocks are created and transformed with operators
Cubert provides a scripting language and a runtime that executes the operators as Map-Reduce jobs
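The block idea can be sketched minimally as follows. The record shapes and function names here are hypothetical, and Cubert's actual scripting language and block semantics differ; the sketch only shows why blocks help: once rows are partitioned into blocks by key, an operator can run on each block independently, instead of every job re-shuffling individual records:

```python
from collections import defaultdict

# Hypothetical records.
records = [
    {"member": 1, "views": 3},
    {"member": 2, "views": 5},
    {"member": 1, "views": 2},
]

def blockgen(rows, key, num_blocks=2):
    """Organize rows into blocks by hash-partitioning on a key. Done once,
    this layout lets later operators work block-by-block rather than
    re-scanning and re-shuffling the full record stream."""
    blocks = defaultdict(list)
    for row in rows:
        blocks[row[key] % num_blocks].append(row)
    return blocks

def aggregate_block(block, key, value):
    """An operator applied independently to each block: all rows for a
    given key live in the same block, so no cross-block merge is needed."""
    totals = defaultdict(int)
    for row in block:
        totals[row[key]] += row[value]
    return dict(totals)

blocks = blockgen(records, key="member")
results = [aggregate_block(b, "member", "views") for b in blocks.values()]
print(results)
```

In this toy, member 1's two rows land in the same block, so the per-block aggregate is already the final answer for that member.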
Technology Stack (top to bottom)
Self Serve Applications [reporting, lineage, perf tuning, etc.] (WhereHows, Dr. Elephant, …)
Core Data Warehouse [views, metrics, dimensions, datasets, core flows]
Data Management Systems [ingest, export, access, workflows] (Gobblin, …)
Storage and Compute (Hadoop, Pinot, Cubert)
Perception
Reality
Unifying Ingress into Hadoop
Ingest operator chain
Gobblin: roadmap
Open source in 2014
Current work:
Continuous and batch ingest
Data profiling, summarization
Flexible deployment
Resource utilization and sharing
Workflow Management
Workflow Mgmt Apps → Scheduling Backends (Azkaban, EasyData, Oozie)
Technology Stack (top to bottom)
Self Serve Applications [reporting, lineage, perf tuning, etc.] (WhereHows, Dr. Elephant, …)
Core Data Warehouse [views, metrics, dimensions, datasets, core flows]
Data Management Systems [ingest, export, access, workflows] (Gobblin, …)
Storage and Compute (Hadoop, Pinot, Cubert)
WhereHows: Data Exploration
Discover datasets: spread across storage systems (HDFS, TD, Kafka, …); murky semantics for data and columns; lineage to traverse relationships
Discover processes: spread across execution engines (Azkaban, ad hoc, Appworx, EasyData); see the code and logic
Correlate data and processes
WhereHows
Lineage in action
Reporting and Visualization
1. Dashboards
2. Curated Exploration
3. Ad-hoc
Summary
Reporting: dashboards, curated exploration, ad-hoc
WhereHows: explore data, lineage
Workflow Mgmt: Azkaban, EasyData, Oozie
Gobblin*: data ingest
Dr. Elephant*: tuning Hadoop
Cubert*: batch M/R
Pinot*: real-time querying
Hadoop storage & compute: HDFS, YARN; Pig, Hive, Cubert, Scalding, Map-Reduce, Spark, Tez
Thanks! Greg Arnold, Sr. Director Engineering