ON-LINE VIDEO ANALYTICS EMBRACING BIG DATA
David Vanderfeesten, Bell Labs Belgium

ANNO 2012: YOUR DATA IS MONEY, BIG MONEY!
Your click stream, your activity stream, your electricity consumption, your call detail records, your pictures, your video, your tweets, your holiday experiences, your relationships, your favorite playlist, your agenda, your job activities.

SETTING THE STAGE
- Big data technologies applied to a telco use case
- Smart analytics applied to your daily interests and activities

AGENDA
- ALCATEL-LUCENT'S ONLINE VIDEO ANALYTICS SOLUTION
- BIG data challenges
- BIG data technologies
- BIG data architecture
- BIGGER and faster tomorrow

APPGLIDE: ALCATEL-LUCENT'S VIDEO ANALYTICS SOLUTION
[Diagram labels: End-User Quality of Experience; Content Delivery Network Performance; Cross-Correlation Engine; Content Usage & Viewer Engagement]
A unique online video analytics solution: data from multiple sources feeds an analytics collection and correlation engine, which provides unparalleled insights into end-user experience and behavior. The results are viewed through a rich visual customer portal or sent to a third-party system.

BIG DATA CHALLENGES: APPLIED GARTNER'S DEFINITION
Important big data aspects:
- Volume
- Variety
- Velocity

LIMITATIONS OF THE EXISTING DATA INFRASTRUCTURE: ORACLE, MYSQL
Scaling write throughput: limited options
1. Switch to more powerful hardware
2. Buy more expensive database solutions
3. Horizontal scaling through sharding/partitioning
Availability concerns when the data set grows
1. Schema upgrades require downtime for table locking (or expensive copy operations)
2. Failover from master to slave causes downtime

QUIZ QUESTION: WHAT ARE NOSQL SOLUTIONS?

COMMON CHARACTERISTICS OF NOSQL DATA INFRASTRUCTURES
- Non-relational data models
- Distributed
- Horizontal scalability
- Schema-less/schema-free
- Trade-off between consistency and high availability
- Specialization of the data infrastructure for specific use cases and data models:
  - Document-oriented data stores
  - Key-value stores
  - Graph databases
  - Table-oriented data stores

MAKING A DISTRIBUTED DATA INFRASTRUCTURE: SYSTEM DESIGN CHOICES
Consistency or availability
[Diagram: nodes A, B and C. Fail the request or not?]
- Favoring consistency in case of failure
- Favoring availability in case of failure

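The choice between failing a request and serving it anyway can be sketched with simple replica counting. The example below is a minimal illustration, not the design used in the solution: it assumes three replicas (matching nodes A, B and C in the diagram), and the `Policy` class, `handle_request` and `reads_see_latest_write` are hypothetical names introduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    """Hypothetical per-request policy: how many replicas must answer."""
    name: str
    required_acks: int


def handle_request(alive_replicas: int, policy: Policy) -> str:
    """Serve the request only if enough replicas are reachable."""
    return "served" if alive_replicas >= policy.required_acks else "failed"


def reads_see_latest_write(n: int, w: int, r: int) -> bool:
    """Read and write replica sets are guaranteed to overlap when r + w > n."""
    return r + w > n


N = 3  # assumed replica count: nodes A, B and C
favor_consistency = Policy("quorum", required_acks=N // 2 + 1)  # 2 of 3
favor_availability = Policy("one", required_acks=1)

# Scenario: only one of the three replicas is still reachable.
print(handle_request(1, favor_consistency))   # failed
print(handle_request(1, favor_availability))  # served

# Quorum writes plus quorum reads overlap; single-replica writes and reads may not.
print(reads_see_latest_write(N, 2, 2))  # True
print(reads_see_latest_write(N, 1, 1))  # False
```

With the stricter policy the request fails when replicas are unreachable, which keeps answers consistent; with the looser policy the request is served, possibly from a replica that has not yet seen the latest write.
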
OUR VIDEO ANALYTICS USE CASE: OUR BIG DATA INFRASTRUCTURE SELECTION
Next slides:
- What is Cassandra?
- What is Cassandra not?
- Why did we select Cassandra?
- A couple of mechanisms we applied

SELECTION FOR OUR SOLUTION: APACHE CASSANDRA
- Free and Open Source Software (FOSS)
- Horizontally scalable data infrastructure, up to 100s of nodes
- Supports schema-less structure
- Highly optimized for write performance, with very good read performance characteristics
- Fault-tolerant: advanced replication strategies
- Ad-hoc querying support with a Hadoop map-reduce overlay
Good middle ground! (Volume, Variety, Velocity)

APACHE CASSANDRA IN A NUTSHELL
- Distributed data infrastructure
- Peer-to-peer architecture
- All nodes identical
- Consistent hash ring
- Multiple data partitioning strategies:
  - Random hash partitioner
  - Order preserved partitioner
- Up to 2 billion sorted columns
[Figure: Cassandra ring with nodes A to F. Each row is located by the hash of its key, and the same rows appear on several nodes. Example rows: Hash(key1) -> cola: a1, colb: b1, colc: c1; Hash(key2) -> colb: b1, cole: e1, colf: e2; Hash(key3) -> cole: E1. Further keys shown on the ring: Hash(key7), Hash(key10), Hash(key13).]

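How a key finds its nodes on a consistent hash ring can be sketched in a few lines. This is a simplified illustration, not Cassandra's implementation: it assumes one token per node, derives tokens by hashing the node name with MD5, and places replicas on the next nodes clockwise. The `Ring` class and the `token` helper are hypothetical names.

```python
import bisect
import hashlib


def token(value: str) -> int:
    """Map a string to a ring position using MD5 (an illustrative choice)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class Ring:
    """Toy hash ring: one token per node, replicas on the next nodes clockwise."""

    def __init__(self, nodes, replication_factor=3):
        self.replication_factor = min(replication_factor, len(nodes))
        self._entries = sorted((token(node), node) for node in nodes)
        self._tokens = [t for t, _ in self._entries]

    def replicas(self, key: str):
        """Return the nodes responsible for a key, primary first."""
        start = bisect.bisect_left(self._tokens, token(key))
        count = len(self._entries)
        return [
            self._entries[(start + i) % count][1]
            for i in range(self.replication_factor)
        ]


ring = Ring(["A", "B", "C", "D", "E", "F"])
for key in ("key1", "key2", "key3"):
    print(key, "->", ring.replicas(key))
```

Because every node can compute the same placement from the key alone, any node can accept a request and route it, which fits the peer-to-peer design in which all nodes are identical.
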
CASSANDRA VERSUS RDBMS
- No relational data model
- No joins
- Limited support for transactional properties of RDBMS
- Only a simple native indexing mechanism, allowing some grouping of data
- No transaction support
- No rollback mechanism
- Ad-hoc queries: only non-realtime, through the Hadoop map-reduce overlay

DATA MODELLING PATTERNS: DENORMALISATION
- Data for slicing and dicing through graphs is stored in fully denormalized format
- Each data view corresponds to one row in Cassandra

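One way to picture the denormalisation pattern: every incoming event is written once per data view, so that each view can later be read from a single row. The sketch below only derives the row keys. The dimension names (`cdn`, `device`) and the key format are assumptions made for illustration, not the solution's actual schema.

```python
from itertools import combinations

# Hypothetical filter dimensions a portal user could slice and dice on.
DIMENSIONS = ("cdn", "device")


def view_keys(event):
    """Return one row key per combination of dimensions, including the 'all' view."""
    keys = []
    for size in range(len(DIMENSIONS) + 1):
        for dims in combinations(DIMENSIONS, size):
            key = "|".join(f"{dim}={event[dim]}" for dim in dims)
            keys.append(key or "all")
    return keys


event = {"cdn": "cdn-1", "device": "tablet"}
print(view_keys(event))
# ['all', 'cdn=cdn-1', 'device=tablet', 'cdn=cdn-1|device=tablet']
```

The number of views grows as 2^d with the number of dimensions d, so this pattern trades extra writes and storage for reads that need no joins or aggregation at query time.
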
DATA MODELLING PATTERNS: EXPLOIT CASSANDRA WIDE ROW SUPPORT & COUNTER COLUMN FAMILIES
- All time series graphs for all metrics (per data view) in one row

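The wide-row idea can be modelled in memory as one dictionary per data view, with one counter column per metric and time bucket. This is a plain-Python stand-in rather than Cassandra client code; `CounterStore`, the one-minute bucket size and the row key format are assumptions made for illustration.

```python
from collections import defaultdict

BUCKET_SECONDS = 60  # hypothetical graph resolution: one point per minute


class CounterStore:
    """In-memory stand-in: row key -> {(metric, bucket_start): count}."""

    def __init__(self):
        self.rows = defaultdict(lambda: defaultdict(int))

    def increment(self, row_key, metric, timestamp, amount=1):
        """Add to the counter column for this metric and time bucket."""
        bucket = int(timestamp) - int(timestamp) % BUCKET_SECONDS
        self.rows[row_key][(metric, bucket)] += amount

    def series(self, row_key, metric, start, end):
        """Return sorted (bucket, count) pairs with start <= bucket < end."""
        row = self.rows.get(row_key, {})
        return sorted(
            (bucket, count)
            for (name, bucket), count in row.items()
            if name == metric and start <= bucket < end
        )


store = CounterStore()
store.increment("cdn=cdn-1", "plays", 1000)
store.increment("cdn=cdn-1", "plays", 1010)
store.increment("cdn=cdn-1", "plays", 1025)
store.increment("cdn=cdn-1", "rebuffers", 1025)

print(store.series("cdn=cdn-1", "plays", 0, 2000))
# [(960, 2), (1020, 1)]
```

Reading one graph then amounts to reading a contiguous slice of columns from a single row, which is the access pattern that sorted wide rows are suited to.
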
BIG DATA ARCHITECTURE
[Architecture diagram; labels as extracted]
- Multiple Data Sources: CDN Log Files; End-User Clients; HLS; Dynamic Streaming; Smooth Streaming; Other Sources; QoE Agents; DPI; EMS/NMS; Static Probes
- Near-Real Time Data Collection & Analysis: Analytics Engine; linear scalable big data infrastructure; OLAP cubes
- Analysis and Scoring: Industry-Leading Video Analytics Algorithms; Unique Video Scoring Model; QoE Scoring; CDN Performance; Content Trends
- Rich Portal with Dynamic Filters: Analytics Portal; Web Services-Based API For Raw and Processed Data

Near-realtime is not realtime enough?

BIG DATA ARCHITECTURE: TELL ME WHAT HAPPENS NOW!
[Architecture diagram; labels as extracted]
- Multiple Data Sources: CDN Log Files; End-User Clients; HLS; Dynamic Streaming; Smooth Streaming; Other Sources; QoE Agents; DPI; EMS/NMS; Static Probes
- Near-Real Time Data Collection & Analysis: Analytics Engine; Streaming Analytics Engine; linear scalable big data infrastructure
- Analysis and Scoring: Industry-Leading Video Analytics Algorithms; Unique Video Scoring Model; QoE Scoring; CDN Performance; Content Trends
- Rich Portal with Dynamic Filters: Analytics Portal; Web Services-Based API For Raw and Processed Data
- Network Operations Center: ALERTS In Real Time

BIG DATA INFRASTRUCTURE: TELL ME WHAT HAPPENS NOW!
- Streaming analytics, complex event processing engines
- Future work: event stream distribution frameworks, e.g. the open source project Storm:
  - Horizontally scalable
  - Fault tolerant
  - Hadoop Map-Reduce for realtime cases
  - Millions of messages/second

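The kind of rule a streaming analytics engine evaluates can be illustrated with a sliding-window counter that raises an alert as soon as too many events arrive within a short period. This is a single-process sketch in plain Python, not Storm code. The `SlidingWindowAlert` class, the window length and the threshold are hypothetical, and timestamps are assumed to arrive in non-decreasing order.

```python
from collections import deque


class SlidingWindowAlert:
    """Signal when at least `threshold` events fall within the last `window_seconds`."""

    def __init__(self, window_seconds, threshold):
        self.window_seconds = window_seconds
        self.threshold = threshold
        self._events = deque()

    def observe(self, timestamp):
        """Record one event and return True if the alert condition holds."""
        self._events.append(timestamp)
        # Drop events that have slid out of the window.
        while self._events and self._events[0] <= timestamp - self.window_seconds:
            self._events.popleft()
        return len(self._events) >= self.threshold


# Example: alert on 3 or more rebuffering events within 10 seconds.
alert = SlidingWindowAlert(window_seconds=10, threshold=3)
print([alert.observe(t) for t in (0, 4, 8, 30, 31, 32)])
# [False, False, True, False, False, True]
```

Each event is evaluated as it arrives, so the alert fires immediately rather than after the next batch run. A distributed framework would partition the event stream across many workers running similar logic.
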
CONCLUSION: ADVANTAGES OF THE NEW INFRASTRUCTURE
1. Horizontally scalable OLTP and near-realtime analytics infrastructure
2. TCO: the solution can run on commodity hardware and is FOSS based
3. High availability: no downtime for upgrades
4. New data sources can be integrated without database schema changes
5. Very good fit for cloud environments

CONCLUSION: LESSONS LEARNED
1. Realtime and ad-hoc queries conflict
2. RDBMS with its surrounding tooling is pretty well understood by the development team
3. The learning curve for best practices is longer when using a distributed data infrastructure like Cassandra

BIG DATA STORY CONTINUES: LATEST EVOLUTIONS IN THE BIG DATA LANDSCAPE
- Cloudera released a near-realtime query engine (Impala) on batch-oriented Hadoop HDFS + HBase, adding near-realtime SQL query facilities
- New relational database infrastructures and enhancements with improved scalability properties, closing the gap with NoSQL solutions:
  - Completely new databases
  - New MySQL storage engines
  - Application-transparent clustering and sharding

Questions?
Feel free to stop by our booth: Big Data Empowered Online Video Analytics