Building a real-time, self-service data analytics ecosystem Greg Arnold, Sr. Director Engineering
Self Service at scale
Relational? MPP? Hadoop?
LinkedIn data
350M Members
25B Quarterly page views
3.5M Active company profiles
4.8B Endorsements
2M Jobs
Translate data into insights
Analytics Infrastructure → Business Insights, Member Insights
The Good Old Days
Data Flow @ 10,000 ft
Scale Challenges
1. Human intervention
2. Long latencies to obtain insights from data
3. Complexity of integration as data sources multiply
What does it take to build a self-service, real-time, democratic analytics platform?
Analytics Infra (top to bottom)
Self Serve Applications [reporting, lineage, perf tuning, etc.] (WhereHows, Dr. Elephant, …)
Core Data Warehouse [views, metrics, dimensions, datasets, core flows]
Data Management Systems [ingest, export, access, workflows] (Gobblin, …)
Storage and Compute (Hadoop, Pinot, Cubert)
Storage and Compute Platforms
Hadoop: HDFS and YARN, running Pig, Hive, Cubert, Scalding, Map-Reduce, Spark, and Tez
Pinot
Hadoop @ LinkedIn
x clusters (~x000 nodes); xx+ PB of data; xxx k jobs / week; xm compute hrs / month
Deployment spans ingest, ETL, R&D, and export clusters (PROD), alongside online data serving
Supporting > 1,000 Hadoop users
Development process: do { code, [review], deploy } while (!good);
Hadoop is complex: lots of knobs, and tuning helps
Performance symptoms are not easily identifiable: the evidence is scattered
Performance implications of changes are hard to assess
Dr. Elephant: diagnosis
What about real-time analytics?
Slow Queries
Solution
Avoid joins at query time when possible. Denormalize the data in Hadoop and load it into a fast engine for slice-n-dice.
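The denormalize-then-slice pattern can be sketched in a few lines. The tables, field names, and values below are hypothetical; in practice the pre-join runs as a batch job in Hadoop and the group-by runs in a serving engine like Pinot:

```python
from collections import Counter

# Hypothetical fact table (profile views) and dimension table (viewer attributes).
profile_views = [
    {"viewer_id": 1, "viewed_id": 9},
    {"viewer_id": 2, "viewed_id": 9},
    {"viewer_id": 1, "viewed_id": 7},
]
viewers = {
    1: {"region": "US", "industry": "Software"},
    2: {"region": "IN", "industry": "Finance"},
}

# Offline (batch) step: denormalize by folding dimension attributes into
# each fact row, so no join is needed at query time.
denormalized = [
    {**view, **viewers[view["viewer_id"]]} for view in profile_views
]

# Query-time slice-n-dice: "top regions for member 9's profile views"
# becomes a simple filter + group-by over one flat table.
top_regions = Counter(
    row["region"] for row in denormalized if row["viewed_id"] == 9
).most_common()
print(top_regions)  # → [('US', 1), ('IN', 1)]
```

The trade-off is storage for speed: each fact row carries redundant dimension data, but every query touches exactly one table.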
Real-time analytics: a challenge for Hadoop
Slice and dice billions of records across hundreds of dimensions
End-to-end freshness of minutes, not hours
Sub-second query response times
e.g., Which regions contribute most to my profile views? Which industries within those regions?
Pinot for real-time analytics
Distributed, fault-tolerant
Compressed columnar indexes
Data ingestion from Kafka and Hadoop
No joins, yet.
Who viewed my profile
Pinot: Data Flow
ProfileViewEvent → Kafka → Pinot (freshness in minutes), serving the member-facing Who Viewed My Profile and the internal Profile Analytics Dashboard
ProfileViewEvent → Hadoop → segment building → Pinot (freshness in hours / days)
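The two freshness paths above can be pictured with a toy sketch. The function names and segment handling are illustrative assumptions, not Pinot's actual design: recent events become queryable within minutes via Kafka, while Hadoop periodically rebuilds the older data as batch segments:

```python
# Illustrative model of the two ingestion paths.
REALTIME_SEGMENTS = []   # fed continuously from Kafka (minutes of latency)
OFFLINE_SEGMENTS = []    # built in batch on Hadoop (hours/days of latency)

def ingest_from_kafka(event):
    """Real-time path: the event is queryable almost immediately."""
    REALTIME_SEGMENTS.append(event)

def build_offline_segment(events, boundary_ts):
    """Batch path: data at or before the boundary timestamp moves to
    offline segments, and the real-time copy of it is dropped."""
    OFFLINE_SEGMENTS.extend(e for e in events if e["ts"] <= boundary_ts)
    REALTIME_SEGMENTS[:] = [e for e in REALTIME_SEGMENTS if e["ts"] > boundary_ts]

def query_all():
    """A query spans both offline and real-time data."""
    return OFFLINE_SEGMENTS + REALTIME_SEGMENTS

ingest_from_kafka({"ts": 1, "viewer": "a"})
ingest_from_kafka({"ts": 2, "viewer": "b"})
build_offline_segment([{"ts": 1, "viewer": "a"}], boundary_ts=1)
print(len(query_all()))  # → 2
```

The point of the split is that dashboards never wait for the batch pipeline: they read whatever mix of fresh and rebuilt segments exists at query time.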
Pig and Hive are great, but...
They operate on individual records
Scheduled batch ETL jobs are re-computed with full scans
We can do better by reorganizing the data and processing it in blocks
Cubert: Accelerating Batch Computation
[Bar chart: Pig/Hive vs. Cubert runtimes, 0–40 hours, for the XLNT (Statistical), SPI (Graph), and Plato (OLAP Cube) workloads]
Cubert Internals
Organizes data in blocks
Blocks are created and transformed with operators
Cubert provides a scripting language and a runtime that executes the operators as Map-Reduce jobs
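The block idea can be sketched minimally as follows. The record shapes and function names here are hypothetical, and Cubert's actual scripting language and block semantics differ; the sketch only shows why blocks help: once rows are partitioned into blocks by key, an operator can run on each block independently, instead of every job re-shuffling individual records:

```python
from collections import defaultdict

# Hypothetical records.
records = [
    {"member": 1, "views": 3},
    {"member": 2, "views": 5},
    {"member": 1, "views": 2},
]

def blockgen(rows, key, num_blocks=2):
    """Organize rows into blocks by hash-partitioning on a key. Done once,
    this layout lets later operators work block-by-block rather than
    re-scanning and re-shuffling the full record stream."""
    blocks = defaultdict(list)
    for row in rows:
        blocks[row[key] % num_blocks].append(row)
    return blocks

def aggregate_block(block, key, value):
    """An operator applied independently to each block: all rows for a
    given key live in the same block, so no cross-block merge is needed."""
    totals = defaultdict(int)
    for row in block:
        totals[row[key]] += row[value]
    return dict(totals)

blocks = blockgen(records, key="member")
results = [aggregate_block(b, "member", "views") for b in blocks.values()]
print(results)
```

In this toy, member 1's two rows land in the same block, so the per-block aggregate is already the final answer for that member.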
Technology Stack (top to bottom)
Self Serve Applications [reporting, lineage, perf tuning, etc.] (WhereHows, Dr. Elephant, …)
Core Data Warehouse [views, metrics, dimensions, datasets, core flows]
Data Management Systems [ingest, export, access, workflows] (Gobblin, …)
Storage and Compute (Hadoop, Pinot, Cubert)
Perception
Reality
Unifying Ingress into Hadoop
Ingest operator chain
Gobblin: roadmap
Open source in 2014
Current work:
Continuous and batch ingest
Data profiling, summarization
Flexible deployment
Resource utilization and sharing
Workflow Management
Workflow Mgmt Apps → Scheduling Backends (Azkaban, EasyData, Oozie)
Technology Stack (top to bottom)
Self Serve Applications [reporting, lineage, perf tuning, etc.] (WhereHows, Dr. Elephant, …)
Core Data Warehouse [views, metrics, dimensions, datasets, core flows]
Data Management Systems [ingest, export, access, workflows] (Gobblin, …)
Storage and Compute (Hadoop, Pinot, Cubert)
WhereHows: Data Exploration
Discover datasets: spread across storage systems (HDFS, TD, Kafka, …); murky semantics for data and columns; lineage to traverse relationships
Discover processes: spread across execution engines (Azkaban, ad hoc, Appworx, EasyData); see the code and logic
Correlate data and processes
WhereHows
Lineage in action
Reporting and Visualization
1. Dashboards
2. Curated Exploration
3. Ad-hoc
Summary
Reporting: dashboards, curated exploration, ad-hoc
WhereHows: explore data, lineage
Workflow Mgmt: Azkaban, EasyData, Oozie
Gobblin*: data ingest
Dr. Elephant*: tuning Hadoop
Cubert*: batch M/R
Pinot*: real-time querying
Hadoop storage & compute: HDFS, YARN; Pig, Hive, Cubert, Scalding, Map-Reduce, Spark, Tez
Thanks! Greg Arnold, Sr. Director Engineering